Parallel Algorithm with Parameters Based on Alternating Direction for Solving Banded Linear Systems

An efficient parallel iterative method with parameters on a distributed-memory multicomputer is investigated for solving banded linear systems. At each iterative step, the parallel algorithm is executed in alternating directions by splitting the coefficient matrix and choosing the parameters properly. The algorithm requires communication between adjacent processors only twice, so the method has high parallel efficiency. Convergence theorems are given for different coefficient matrices, such as a Hermite positive definite matrix or an M-matrix. Numerical experiments on an HP rx2600 cluster show that our algorithm has higher efficiency and lower memory requirements than the multisplitting method, and a considerable advantage in CPU time over the BSOR method. For Example 1, the efficiency is significantly better than that of the BSOR method. For Example 2, the speedup and efficiency of our algorithm are better than those of the PEk inner iterative method.


Introduction
In recent years, high-performance parallel computing technology has developed rapidly. Large sparse banded linear systems are frequently encountered when finite difference or finite element methods are used to discretize partial differential equations in many practical scientific and engineering computing problems, especially in computational fluid dynamics (CFD). Many problems that can be solved efficiently on sequential computers are difficult to solve on parallel computers, because communication takes a significant part of the total execution time. More effort is therefore needed to develop more efficient parallel algorithms.
Parallel algorithms for large sparse linear systems have been widely investigated in [1][2][3][4][5][6][7][8]. Specifically, the multisplitting algorithm in [1] is a popular method at present. In [3], the authors provide a method for solving block-tridiagonal linear systems in which local lower and upper triangular incomplete factors are combined into an effective approximation of the global incomplete lower and upper triangular factors of the coefficient matrix, based on two-dimensional domain decomposition with small overlapping. The algorithm is applicable to any preconditioner of incomplete type. Duan et al. presented a parallel strategy based on the Galerkin principle for solving block-tridiagonal linear systems in [4]. In [5], a parallel direct algorithm based on the divide-and-conquer principle and the decomposition of the coefficient matrix is investigated for solving block-tridiagonal linear systems on distributed-memory multicomputers. That algorithm communicates between adjacent processors only twice. In [7], a direct method for solving circular tridiagonal block linear systems is presented. Other parallel algorithms for solving linear systems can be found in [9][10][11][12][13][14]. The algorithm in this paper builds on the advantages of the one in [2].
The goal of this paper is to develop an efficient, stable parallel iterative method on a distributed-memory multicomputer and to give some theoretical analysis. We appropriately choose the splitting matrices W and V to establish the iterative scheme. Two examples have been run on the HP rx2600 cluster; the experimental results indicate that the parallel algorithm achieves higher parallel speedup and efficiency than the multisplitting method.
The content of this paper is as follows. In Section 2, the parallel iterative algorithm is described. In Section 3, the parallel iterative process is discussed. The analysis of convergence is done in Section 4. The numerical results are shown in Section 5. In Section 6, the conclusion is presented.

Parallel Algorithm
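Let a banded linear system $AX = b$ be represented in block form as

$$C_i x_{i-1} + A_i x_i + B_i x_{i+1} = b_i, \quad i = 1, 2, \ldots, n, \tag{1}$$

(the $C_i$ term is absent in the first block row and the $B_i$ term in the last), where $A_i$ is an $n_i \times n_i$ matrix, $B_i$ and $C_i$ are $n_i \times n_{i+1}$ and $n_i \times n_{i-1}$ matrices, respectively, and $x_i$ and $b_i$ are $n_i$-dimensional real column vectors. In general, assuming that there are $p$ processors available and $n = 2ph$ ($h \ge 1$, $h \in \mathbb{Z}^+$), we denote the $i$th processor by $P_i$ (for $i = 1, 2, \ldots, p$) and split the coefficient matrix $A$ accordingly. Then, we use the alternating direction iterative scheme in [2] and obtain the new iterative scheme

$$x^{(k+1)} = x^{(k)} - \omega (I + V)^{-1} (I + W)^{-1} \left( A x^{(k)} - b \right), \tag{2}$$

where $I + W$ and $I + V$ are nonsingular matrices and $\omega = 2$. Hence (2) becomes

$$x^{(k+1)} = B_\omega x^{(k)} + g, \tag{3}$$

where $B_\omega = I - 2(I + V)^{-1}(I + W)^{-1}A$ is the so-called iterative matrix and $g = 2(I + V)^{-1}(I + W)^{-1}b$.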
Obviously, the matrices $I + W$ and $I + V$ must be nonsingular, and the choice of $W$ and $V$ is the key to solving the linear system by (3) in this paper. If $W$ and $V$ are suitable, the algorithm has good parallelism and low CPU time. We therefore choose $W$ and $V$ from the decomposition of the coefficient matrix. From (3), let $y = (I + W)^{-1}(A x^{(k)} - b)$; we obtain $(I + W) y = A x^{(k)} - b$, which is solved for $y = (y_1, y_2, \ldots, y_i, \ldots, y_p)^T$, where $y_i$ is a $2h$-dimensional row vector. Let $z = (I + V)^{-1} y$; then we have $(I + V) z = y$, where $z = (z_1, z_2, \ldots, z_i, \ldots, z_p)^T$ and $z_i$ is a $2h$-dimensional row vector. Then, according to the aforementioned formulas, we get $x^{(k+1)} = x^{(k)} - 2z$.
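The three steps of one iteration (solve with $I + W$, solve with $I + V$, update $x$) can be sketched serially as follows. This is only an illustration under stated assumptions: the test matrix tridiag(-1, 4, -1) and the bidiagonal choices of $I + W$ and $I + V$ are hypothetical stand-ins, not the block matrices chosen in this paper, and no communication between processors is modeled.

```python
def tridiag_matvec(lower, diag, upper, x):
    """Return A @ x for a tridiagonal A with constant bands."""
    n = len(x)
    out = []
    for i in range(n):
        s = diag * x[i]
        if i > 0:
            s += lower * x[i - 1]
        if i < n - 1:
            s += upper * x[i + 1]
        out.append(s)
    return out


def solve_lower_bidiag(d, sub, r):
    """Solve (I + W) y = r, assuming I + W is lower bidiagonal."""
    y = []
    for i, ri in enumerate(r):
        prev = y[i - 1] if i > 0 else 0.0
        y.append((ri - sub * prev) / d)
    return y


def solve_upper_bidiag(d, sup, y):
    """Solve (I + V) z = y, assuming I + V is upper bidiagonal."""
    n = len(y)
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        nxt = z[i + 1] if i < n - 1 else 0.0
        z[i] = (y[i] - sup * nxt) / d
    return z


def alternating_direction_solve(b, tol=1e-10, max_iter=500):
    """Iterate x <- x - 2 (I + V)^{-1} (I + W)^{-1} (A x - b).

    Hypothetical setup: A = tridiag(-1, 4, -1),
    I + W = lower bidiagonal (diag 3, subdiag -1), I + V = its transpose.
    Returns (x, iterations); iterations == max_iter means no convergence.
    """
    x = [0.0] * len(b)
    for k in range(max_iter):
        ax = tridiag_matvec(-1.0, 4.0, -1.0, x)
        r = [axi - bi for axi, bi in zip(ax, b)]      # residual A x - b
        if max(abs(v) for v in r) < tol:
            return x, k
        y = solve_lower_bidiag(3.0, -1.0, r)          # y = (I + W)^{-1} r
        z = solve_upper_bidiag(3.0, -1.0, y)          # z = (I + V)^{-1} y
        x = [xi - 2.0 * zi for xi, zi in zip(x, z)]   # x <- x - 2 z
    return x, max_iter
```

With these assumed factors both solves are simple substitutions, which is the reason for keeping $I + W$ and $I + V$ cheap to invert; the actual block structure that makes the solves parallel is the one defined by the authors.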

Process of Parallel Iterative Algorithm
Here, we show the storage method and computational procedure of the parallel algorithm. Assign $n/p$ rows to each processor, where $n$ is the number of rows and $p$ is the number of processors. Each processor stores the corresponding blocks of the vectors $b$ and $x$. The storage layout is determined by the upper and lower bandwidths of $A$. This saves much memory space, although it makes programming more difficult. Note that if $n$ is not divisible by $p$, some processors store $[n/p] + 1$ row blocks of $A$, sequentially, and the others store $[n/p]$ row blocks; meanwhile, each processor stores the corresponding vectors of $x^{(0)}$ and $b$. This keeps the load of the processors close to balanced and shortens the waiting time.
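The row distribution described above can be sketched as follows. This is a minimal illustration, assuming that the processors receiving the extra row are the first ones in rank order; the function name `partition_rows` and the half-open range convention are choices made here, not taken from the paper.

```python
def partition_rows(n, p):
    """Split n rows over p processors as evenly as possible.

    Assumption: the first n % p processors get n // p + 1 rows and the
    remaining ones get n // p rows. Returns one half-open (start, stop)
    row range per processor.
    """
    base, extra = divmod(n, p)
    ranges = []
    start = 0
    for rank in range(p):
        size = base + 1 if rank < extra else base
        ranges.append((start, start + size))
        start += size
    return ranges
```

For example, `partition_rows(10, 3)` gives the ranges `(0, 4)`, `(4, 7)`, and `(7, 10)`, so no processor holds more than one row beyond any other.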

Analysis of Convergence
To perform the theoretical analysis of the convergence of the parallel algorithm, we introduce some definitions and several lemmas.
Symbol and Definition
(i) $\mathbb{R}^{n \times n}$ represents the space of $n \times n$ real matrices.
(ii) $I_n$ represents the unit matrix of order $n$.
(iii) $W^H$, $V^H$ represent the conjugate transpose matrices of $W$, $V$, respectively.
(iv) $W^{-1}$ represents the inverse matrix of $W$.
Definition 1 (see [15]). Suppose $A \in \mathbb{R}^{n \times n}$ and $A = Q - S$, where $Q^{-1} \ge 0$ and $S \ge 0$; then $A = Q - S$ is called a normal splitting of the matrix $A$.
Definition 2 (see [15]). Suppose $A \in \mathbb{R}^{n \times n}$ and $A = Q - S$, where $Q^{-1} S \ge 0$; then $A = Q - S$ is called a weak normal splitting of the matrix $A$.
Definition 3 (see [15]). Suppose $A \in \mathbb{R}^{n \times n}$ and $A = Q - S$, where $Q^H + S$ is a Hermite positive definite matrix; then $A = Q - S$ is called a $P$-normal splitting of the matrix $A$.
Here, we give some theoretical analysis for convergence of the parallel iterative algorithm.
Lemma 5 (see [9]). Let $A \in \mathbb{R}^{n \times n}$. If the splitting $A = M - N$ is a weak normal splitting or a normal splitting of the coefficient matrix $A$, then $\rho(M^{-1}N) < 1$ if and only if $A^{-1} \ge 0$.
Lemma 6 (see [10]). Let $A$ be an $M$-matrix. If any element of $A$ increases while the off-diagonal elements remain nonpositive, then the resulting matrix $B$ is also an $M$-matrix and $B^{-1} \le A^{-1}$.
Lemma 7 (see [15]). Let $A \in \mathbb{R}^{n \times n}$ be a nonsingular Hermite matrix. If $A = M - N$ is a $P$-normal splitting of the matrix $A$, then $\rho(M^{-1}N) < 1$ if and only if $A$ is a positive definite matrix.
Obviously, $WV + V^H W^H - 2UU^H$ is a positive semidefinite or positive definite matrix. Hence the matrix $M^H + N$ is a Hermite positive definite matrix. Therefore, $A = M - N$ is a $P$-normal splitting of the matrix $A$, and then $\rho(M^{-1}N) < 1$ by Lemma 7; thus the iterative scheme of our algorithm is convergent.
By the theorem, we know that the parallel algorithm is convergent if A is a Hermite positive definite matrix.
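The condition in Definition 3 and Lemma 7 can be checked numerically on a small real example, where the conjugate transpose is the ordinary transpose. The sketch below is an assumption-laden illustration: it takes $M = (I + W)(I + V)/2$, which matches the iterative matrix $I - 2(I + V)^{-1}(I + W)^{-1}A = I - M^{-1}A$, and uses hypothetical bidiagonal factors and the test matrix tridiag(-1, 4, -1) rather than the matrices of this paper. Positive definiteness is tested by attempting a Cholesky factorization.

```python
import math


def is_positive_definite(S):
    """Return True if the real symmetric matrix S admits a Cholesky factor."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False          # pivot not positive: not PD
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True


def p_normal_check(n=6):
    """Check that M^T + N is positive definite for a hypothetical splitting."""
    # Hypothetical I + W: lower bidiagonal; I + V is taken as its transpose.
    lw = [[3.0 if i == j else (-1.0 if j == i - 1 else 0.0)
           for j in range(n)] for i in range(n)]
    # Hypothetical test matrix A = tridiag(-1, 4, -1).
    a = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
          for j in range(n)] for i in range(n)]
    # M = (I + W)(I + V) / 2 with (I + V)[k][j] = lw[j][k].
    m = [[0.5 * sum(lw[i][k] * lw[j][k] for k in range(n))
          for j in range(n)] for i in range(n)]
    # N = M - A, so that A = M - N.
    nmat = [[m[i][j] - a[i][j] for j in range(n)] for i in range(n)]
    # S = M^T + N must be positive definite for a P-normal splitting.
    s = [[m[j][i] + nmat[i][j] for j in range(n)] for i in range(n)]
    return is_positive_definite(s)
```

In this real symmetric case $M^T + N = 2M - A$, so the check amounts to verifying that $(I + W)(I + V) - A$ is positive definite.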

Numerical Examples
We performed two numerical experiments on the HP rx2600 cluster. The results are shown as follows.
Example 1. Consider a banded linear system $AX = b$. Let the initial value be $x^{(0)} = (0, 0, \ldots, 0)^T$ and the problem size be 80000. We apply this algorithm with the optimal relaxation factor, the multisplitting method, and the BSOR method to the system on the HP rx2600 cluster. The tables report the number of processors, the run time (seconds), the speedup (run time on one processor divided by run time on all processors), the number of iterations, and the efficiency (speedup divided by the number of processors); the error tolerance is $1 \times 10^{-10}$. See Tables 1, 2, and 3 and Figures 1 and 2.
Example 2. The coefficients of the model problem are fixed: three are set to 1 and four are set to 0. Using the finite difference method, we obtain two block-tridiagonal linear systems with step size $h = 1/100$. Then, we apply this algorithm with the optimal relaxation factor, the BSOR method, the PEk method, and the multisplitting algorithm to the systems on the HP rx2600 cluster. The numerical results are shown in Tables 4, 5, 6, and 7 and Figures 3 and 4.
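The speedup and efficiency reported in the tables follow the usual definitions: speedup is the one-processor run time divided by the parallel run time, and efficiency is the speedup divided by the number of processors. A minimal sketch follows; the function names and the timings in the usage note are hypothetical and are not measurements from this paper.

```python
def speedup(t_one, t_parallel):
    """Speedup: run time on one processor over run time on all processors."""
    return t_one / t_parallel


def efficiency(t_one, t_parallel, processors):
    """Efficiency: speedup divided by the number of processors."""
    return speedup(t_one, t_parallel) / processors
```

For instance, with made-up timings of 120 s on one processor and 20 s on 8 processors, the speedup is 6 and the efficiency is 0.75.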

Results Analysis
From Tables 1 to 7, we draw the following conclusions.
(i) The results of the parallel algorithm agree with the theoretical analysis. The conditions in the theorems are only sufficient conditions.
(ii) The numerical results show that the parallel algorithm has good parallelism.
(iii) For Examples 1 and 2, the results show that the efficiency of the algorithm is better than that of the multisplitting method. Our algorithm achieves parallel speedup as good as that of the BSOR method on the examples. For Example 2, the efficiency of the algorithm is also better than that of the PEk method.
(iv) The parallel algorithm is easily implemented on a parallel computer and is more flexible and simpler in practice than the method in [1].

Conclusions
An efficient parallel iterative method on a distributed-memory multicomputer has been presented for solving large banded linear systems. We make full use of the decomposition of the coefficient matrix to choose W and V so as to save computational cost. The storage strategy saves memory space. The algorithm requires communication between adjacent processors only twice. Theoretical analysis and experiments show that the algorithm in this paper has good parallelism and high efficiency. The results also confirm the correctness of the convergence theorems. When the coefficient matrix is a Hermite positive definite matrix or an M-matrix, the parallel algorithm is convergent provided that the given conditions hold. Our algorithm has higher efficiency than the multisplitting method.

Table 2: The results for model 1 (the multisplitting method).

Table 5: The results for model 2 (the multisplitting method).