Mathematical Problems in Engineering, Hindawi Publishing Corporation, Volume 2014, Article ID 752651, doi:10.1155/2014/752651. ISSN 1024-123X (print), 1563-5147 (online).

Research Article

Parallel Algorithm with Parameters Based on Alternating Direction for Solving Banded Linear Systems

Xinrong Ma,1,2 Sanyang Liu,1 Manyu Xiao,3 and Gongnan Xie4

1 Department of Applied Mathematics, Xidian University, Xi'an 710071, China
2 Department of Applied Mathematics, Xianyang Normal University, Xianyang 712000, China
3 Department of Applied Mathematics, Northwestern Polytechnical University, Xi'an 710072, China
4 School of Mechanical Engineering, Northwestern Polytechnical University, Xi'an 710072, China

Academic Editor: Massimo Scalia

Received 3 December 2013; Accepted 30 January 2014; Published 7 April 2014

Copyright © 2014 Xinrong Ma et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

An efficient parallel iterative method with parameters for solving banded linear systems on distributed-memory multicomputers is investigated in this work. At each iterative step the algorithm proceeds in alternating directions, obtained by splitting the coefficient matrix and choosing the parameters appropriately. The algorithm requires communication between adjacent processors only twice per iteration, so it attains high parallel efficiency. Convergence theorems are given for several classes of coefficient matrices, such as Hermitian positive definite matrices and M-matrices. Numerical experiments on an HP rx2600 cluster verify that our algorithm has advantages over the multisplitting one in efficiency and memory usage and has a considerable advantage in CPU time over the BSOR one. For Example 1 the efficiency is significantly better than that of BSOR; for Example 2 the speedup and efficiency of our algorithm are better than those of the PEk inner iterative one.

1. Introduction

In recent years, high-performance parallel computing technology has developed rapidly. Large sparse banded linear systems arise frequently when finite difference or finite element methods are used to discretize partial differential equations in practical scientific and engineering computing, especially in computational fluid dynamics (CFD). Many problems that can be solved efficiently on sequential computers are difficult to solve on parallel computers because communication takes a significant part of the total execution time. More effort is therefore needed to develop more efficient parallel algorithms and improve the experimental results.

Parallel algorithms for large sparse linear systems have been widely investigated in . In particular, the multisplitting algorithm in  is a popular method at present. In , the authors provide a method for solving block-tridiagonal linear systems in which local lower and upper triangular incomplete factors are combined into an effective approximation of the global incomplete lower and upper triangular factors of the coefficient matrix, based on two-dimensional domain decomposition with small overlap; the algorithm applies to any preconditioner of incomplete type. Duan et al. presented a parallel strategy based on the Galerkin principle for solving block-tridiagonal linear systems in . In , a parallel direct algorithm based on the divide-and-conquer principle and a decomposition of the coefficient matrix is investigated for solving block-tridiagonal linear systems on distributed-memory multicomputers; that algorithm communicates only twice between adjacent processors. In , a direct method for solving circular-tridiagonal block linear systems is presented. Further parallel algorithms for solving linear systems can be found in . The algorithm in this paper builds on the advantages of the one in .

The goal of this paper is to develop an efficient, stable parallel iterative method on distributed-memory multicomputers and to give some theoretical analysis. We choose the splitting matrices W and V appropriately to establish the iterative scheme. Two examples were run on the HP rx2600 cluster; the experimental results indicate that the parallel algorithm outperforms the multisplitting one in parallel speedup and efficiency.

The content of this paper is as follows. In Section 2, the parallel iterative algorithm is described. In Section 3, the parallel iterative process is discussed. The analysis of convergence is done in Section 4. The numerical results are shown in Section 5 and analyzed in Section 6. In Section 7, the conclusions are presented.

2. Parallel Algorithm

Let a banded linear system \(AX = b\) be represented as
\[
\begin{pmatrix}
A_1 & B_1 & & & \\
C_2 & A_2 & B_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & C_{n-1} & A_{n-1} & B_{n-1} \\
 & & & C_n & A_n
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_{n-1} \\ b_n \end{pmatrix},
\tag{1}
\]
where \(A_i\) is a \(d_i \times d_i\) matrix, \(B_i\) and \(C_i\) are \(d_i \times d_{i+1}\) and \(d_i \times d_{i-1}\) matrices, respectively, and \(x_i\) and \(b_i\) are \(d_i\)-dimensional real column vectors. In general, assuming that \(p\) processors are available and \(n = 2hp\) (\(h \ge 1\), \(h \in \mathbb{Z}^+\)), we denote the \(i\)th processor by \(P_i\) (for \(i = 1, 2, \ldots, p\)) and split the coefficient matrix \(A\) into \(A = W + V\).

Then, following the alternating direction iterative scheme in , we obtain the new iterative scheme
\[
(I + \tau W)(I + \tau V)\bigl(x^{(k+1)} - x^{(k)}\bigr) = -\alpha \tau \bigl(A x^{(k)} - b\bigr);
\tag{2}
\]
here \(I + \tau W\) and \(I + \tau V\) are nonsingular matrices and \(\alpha = 2\). Hence (2) becomes
\[
x^{(k+1)} = x^{(k)} - (I + \tau V)^{-1}(I + \tau W)^{-1}\bigl[2\tau(Ax^{(k)} - b)\bigr]
          = B_\omega x^{(k)} + g;
\tag{3}
\]
here \(B_\omega = I - 2\tau (I + \tau V)^{-1}(I + \tau W)^{-1} A\) is the so-called iteration matrix and \(g = 2\tau (I + \tau V)^{-1}(I + \tau W)^{-1} b\).

Obviously, the matrices \(I + \tau W\) and \(I + \tau V\) must be nonsingular, and the definition of \(W\) and \(V\) is the key to solving the linear system by (3) in this paper. If \(W\) and \(V\) are chosen well, the algorithm has good parallelism and low CPU-time cost. We choose \(W\) and \(V\) as follows:
\[
W = \begin{pmatrix}
\alpha A_1 & B_1 & & & & \\
 & \beta A_2 & & & & \\
 & C_3 & \alpha A_3 & B_3 & & \\
 & & & \beta A_4 & & \\
 & & & & \ddots & \\
 & & & C_{2hp-1} & \alpha A_{2hp-1} & B_{2hp-1} \\
 & & & & & \beta A_{2hp}
\end{pmatrix},
\]
\[
V = \begin{pmatrix}
(1-\alpha) A_1 & & & & & \\
C_2 & (1-\beta) A_2 & B_2 & & & \\
 & & (1-\alpha) A_3 & & & \\
 & & C_4 & (1-\beta) A_4 & B_4 & \\
 & & & & \ddots & \\
 & & & & & (1-\alpha) A_{2hp-1} \\
 & & & & C_{2hp} & (1-\beta) A_{2hp}
\end{pmatrix}.
\tag{4}
\]
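The splitting (4) interleaves odd and even block rows between W and V. As an illustration, the following NumPy sketch (assuming square blocks of equal size for simplicity; `split_WV` and the block lists are hypothetical names, not from the paper) assembles W and V from the blocks of A and checks that W + V = A:

```python
import numpy as np

def split_WV(A_blocks, B_blocks, C_blocks, alpha, beta):
    """Assemble W and V of (4): odd (1-based) block rows of W carry
    alpha*A_i together with the off-diagonal blocks C_i and B_i; even
    block rows of W carry only beta*A_i.  V holds the complement, so
    that W + V reproduces the block-tridiagonal matrix A."""
    nb, d = len(A_blocks), A_blocks[0].shape[0]   # nb = 2hp block rows
    W = np.zeros((nb * d, nb * d))
    V = np.zeros((nb * d, nb * d))
    for i in range(nb):                  # 0-based; 1-based row is i + 1
        r = slice(i * d, (i + 1) * d)
        odd = (i % 2 == 0)               # row i + 1 is odd
        W[r, r] = (alpha if odd else beta) * A_blocks[i]
        V[r, r] = ((1 - alpha) if odd else (1 - beta)) * A_blocks[i]
        if i + 1 < nb:                   # superdiagonal block B_{i+1}
            (W if odd else V)[r, (i + 1) * d:(i + 2) * d] = B_blocks[i]
        if i > 0:                        # subdiagonal block C_{i+1}
            (W if odd else V)[r, (i - 1) * d:i * d] = C_blocks[i - 1]
    return W, V

# small random check that the splitting sums back to A
rng = np.random.default_rng(0)
d, nb = 2, 4
A_blocks = [rng.standard_normal((d, d)) for _ in range(nb)]
B_blocks = [rng.standard_normal((d, d)) for _ in range(nb - 1)]
C_blocks = [rng.standard_normal((d, d)) for _ in range(nb - 1)]
W, V = split_WV(A_blocks, B_blocks, C_blocks, alpha=0.3, beta=0.7)
A = np.zeros((nb * d, nb * d))
for i in range(nb):
    A[i*d:(i+1)*d, i*d:(i+1)*d] = A_blocks[i]
    if i + 1 < nb:
        A[i*d:(i+1)*d, (i+1)*d:(i+2)*d] = B_blocks[i]
        A[(i+1)*d:(i+2)*d, i*d:(i+1)*d] = C_blocks[i]
```

With α = β = 1/2 this reduces to the symmetric choice used in the experiments of Section 5.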

From (3), let \(y = (I + \tau W)^{-1}(A x^{(k)} - b)\); we obtain
\[
(I + \tau W)\, y = A x^{(k)} - b.
\tag{5}
\]
Writing \(s = (i-1)2h + j\) for the global block index on processor \(P_i\), the detailed calculation procedure is as follows:
\[
\bigl(I + \tau \beta A_s\bigr) y_s = C_s x_{s-1}^{(k)} + A_s x_s^{(k)} + B_s x_{s+1}^{(k)} - b_s
\quad (j = 2, 4, \ldots, 2h),
\]
\[
\bigl(I + \tau \alpha A_s\bigr) y_s = C_s \bigl(x_{s-1}^{(k)} - \tau y_{s-1}\bigr) + A_s x_s^{(k)} + B_s \bigl(x_{s+1}^{(k)} - \tau y_{s+1}\bigr) - b_s
\quad (j = 1, 3, \ldots, 2h-1),
\tag{6}
\]
where the even-indexed blocks are computed first; here \(y = (y_1, y_2, \ldots, y_i, \ldots, y_p)^T\) and \(y_i\) collects the \(2h\) block components stored on \(P_i\).

Let \(z = (I + \tau V)^{-1} y\); then \((I + \tau V) z = y\), and, with \(s = (i-1)2h + j\) as before,
\[
\bigl[I + \tau(1-\alpha) A_s\bigr] z_s = y_s \quad (j = 1, 3, \ldots, 2h-1),
\]
\[
\bigl[I + \tau(1-\beta) A_s\bigr] z_s = y_s - \tau C_s z_{s-1} - \tau B_s z_{s+1} \quad (j = 2, 4, \ldots, 2h),
\tag{7}
\]
where \(z = (z_1, z_2, \ldots, z_i, \ldots, z_p)^T\) and \(z_i\) collects the \(2h\) block components stored on \(P_i\). According to the formulas above, \(x^{(k+1)} = x^{(k)} - 2\tau z\).
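For scalar blocks (\(d_i = 1\)) the two sweeps (6) and (7) can be written out componentwise. The sketch below (a NumPy illustration with assumed test data; dense solves are used only for cross-checking) verifies that the recurrences reproduce the closed-form scheme (3):

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau, alpha, beta = 6, 0.3, 0.5, 0.5     # n = 2hp must be even

# tridiagonal A with scalar blocks: a = diagonal, b = super-, c = subdiagonal
a = rng.uniform(2, 3, n)
b = rng.uniform(-1, 0, n - 1)
c = rng.uniform(-1, 0, n - 1)
A = np.diag(a) + np.diag(b, 1) + np.diag(c, -1)

# splitting (4): odd (1-based) rows -> W gets alpha*a, c, b; even rows -> beta*a
W = np.zeros((n, n)); V = np.zeros((n, n))
for s in range(n):
    odd = (s % 2 == 0)                      # 1-based row s + 1 is odd
    W[s, s] = (alpha if odd else beta) * a[s]
    V[s, s] = ((1 - alpha) if odd else (1 - beta)) * a[s]
    if s + 1 < n: (W if odd else V)[s, s + 1] = b[s]
    if s > 0:     (W if odd else V)[s, s - 1] = c[s - 1]

x = rng.standard_normal(n); rhs = rng.standard_normal(n)
r = A @ x - rhs

# sweep (6): even (1-based) rows first, then odd rows using the neighbouring y's
y = np.empty(n)
for s in range(1, n, 2):
    y[s] = r[s] / (1 + tau * beta * a[s])
for s in range(0, n, 2):
    t = r[s]
    if s > 0:     t -= tau * c[s - 1] * y[s - 1]
    if s + 1 < n: t -= tau * b[s] * y[s + 1]
    y[s] = t / (1 + tau * alpha * a[s])

# sweep (7): odd rows first, then even rows using the neighbouring z's
z = np.empty(n)
for s in range(0, n, 2):
    z[s] = y[s] / (1 + tau * (1 - alpha) * a[s])
for s in range(1, n, 2):
    t = y[s] - tau * c[s - 1] * z[s - 1]
    if s + 1 < n: t -= tau * b[s] * z[s + 1]
    z[s] = t / (1 + tau * (1 - beta) * a[s])
x_sweep = x - 2 * tau * z

# cross-check against the closed form (3)
I = np.eye(n)
x_direct = x - 2 * tau * np.linalg.solve(I + tau * V,
                                         np.linalg.solve(I + tau * W, r))
```

The even-first, odd-second ordering is exactly what makes each sweep solvable with only local (neighbouring) data.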

3. Process of Parallel Iterative Algorithm

Here we describe the storage scheme and the computational procedure of the parallel algorithm.

3.1. Storage Method

The coefficient matrix is stored by diagonals, as vectors \(A_1, A_2, \ldots, A_{k_l+1}, \ldots, A_{k_l+k_u+1}\) ordered from the lowest subdiagonal to the highest superdiagonal:
\[
A_1 = (0, \ldots, 0, a_{k_l+1,1}, a_{k_l+2,2}, \ldots, a_{n,n-k_l})^T, \quad \ldots, \quad
A_{k_l+1} = (a_{11}, a_{22}, \ldots, a_{nn})^T, \quad \ldots, \quad
A_{k_l+k_u+1} = (a_{1,k_u+1}, a_{2,k_u+2}, \ldots, a_{n-k_u,n}, 0, \ldots, 0)^T.
\]

The corresponding relationship is
\[
A = \begin{pmatrix}
a_{11} & \cdots & a_{1,k_u+1} & & 0 \\
\vdots & \ddots & & \ddots & \\
a_{k_l+1,1} & & a_{k_l+1,k_l+1} & & a_{k_l+1,k_l+k_u+1} \\
 & \ddots & & \ddots & \vdots \\
0 & & a_{n,n-k_l} & \cdots & a_{nn}
\end{pmatrix}
\;\longleftrightarrow\;
\begin{pmatrix}
0 & \cdots & a_{11} & \cdots & a_{1,k_u+1} \\
\vdots & & a_{22} & & \vdots \\
a_{k_l+1,1} & & \vdots & & a_{n-k_u,n} \\
\vdots & & \vdots & & 0 \\
a_{n,n-k_l} & \cdots & a_{nn} & \cdots & 0
\end{pmatrix},
\tag{8}
\]
where the columns of the right-hand array are \(A_1, \ldots, A_{k_l+1}, \ldots, A_{k_l+k_u+1}\).

Then \(m\) (\(m = n/p\)) rows are assigned to each processor, and each processor stores the corresponding vectors \(b_i, x_i\) (\(i = 1, 2, \ldots, p\)). Here \(k_u\) and \(k_l\) are the upper and lower bandwidths, respectively. This scheme saves much memory space, although the programming is more difficult. Note that if \(n\) is not divisible by \(p\), some processors store \(\lfloor n/p \rfloor + 1\) consecutive rows of \(A\) and the others store \(\lfloor n/p \rfloor\) rows; meanwhile, each processor stores the corresponding components of \(x^{(0)}\) and \(b\). This balances the load across processors and shortens the waiting time.
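The diagonal-wise storage above can be sketched as follows (a NumPy illustration; `to_band_storage` is a hypothetical helper name, and the distribution of rows over processors is omitted):

```python
import numpy as np

def to_band_storage(A, kl, ku):
    """Store the kl + ku + 1 band diagonals of A column by column,
    zero-padded as in Section 3.1: column 0 holds the lowest
    subdiagonal, column kl the main diagonal, column kl + ku the
    highest superdiagonal; row i holds a[i, i-kl], ..., a[i, i+ku]."""
    n = A.shape[0]
    S = np.zeros((n, kl + ku + 1))
    for off in range(-kl, ku + 1):
        for i in range(n):
            if 0 <= i + off < n:
                S[i, off + kl] = A[i, i + off]
    return S

# banded test matrix with kl = 2, ku = 1
rng = np.random.default_rng(2)
n, kl, ku = 7, 2, 1
A = rng.standard_normal((n, n))
for i in range(n):
    for j in range(n):
        if j - i > ku or i - j > kl:
            A[i, j] = 0.0

S = to_band_storage(A, kl, ku)

# reconstruct A from the compact storage to confirm no information is lost
R = np.zeros_like(A)
for off in range(-kl, ku + 1):
    for i in range(n):
        if 0 <= i + off < n:
            R[i, i + off] = S[i, off + kl]
```

Each processor would then keep only its m = n/p consecutive rows of the compact array, together with the matching pieces of x^(0) and b.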

3.2. Cycle Process

(1) \(P_i\) performs a parallel communication to obtain \(x^{(k)}_{(i-1)2m}\) and \(x^{(k)}_{i \cdot 2m + 1}\) and then computes
\[
y_i = -2\tau (I + \tau W_i)^{-1}\bigl(A_i x_i^{(k)} - b_i\bigr)
\tag{9}
\]
by one LU solve, where \(W_i\), \(A_i\), \(b_i\), and \(x_i\) are the \(i\)th (for \(i = 1, 2, \ldots, p\)) blocks of \(W\), \(A\), \(b\), and \(x\), respectively.

(2) \(P_i\) performs one parallel communication to obtain \(y_{(i-1)2m}\) and then computes
\[
x_i^{(k+1)} = x_i^{(k)} + (I + \tau V_i)^{-1} y_i
\tag{10}
\]
by one LU solve; here \(V_i\) is the \(i\)th (for \(i = 1, 2, \ldots, p\)) block of \(V\).

(3) On processor \(P_i\), test whether \(\|x_i^{(k+1)} - x_i^{(k)}\| < \varepsilon\) (\(\varepsilon\) is the error bound, \(i = 1, 2, \ldots, p\)). Stop if this inequality holds on every processor; otherwise return to (1) and continue cycling until all inequalities are satisfied.
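A sequential simulation of this cycle (a NumPy sketch: the neighbour communication is replaced by direct array access and the per-block LU solves by dense solves, so it illustrates only the arithmetic, not the parallelism) can be run on the model M-matrix tridiag(-1, 4, -1):

```python
import numpy as np

def cycle_sketch(A, b, tau, alpha, beta, eps=1e-10, max_iter=5000):
    """Steps (1)-(3) of Section 3.2 on a single process:
    y-solve with I + tau*W, update via I + tau*V, stop when the
    update falls below eps (tested globally instead of per block)."""
    n = A.shape[0]; I = np.eye(n)
    W = np.zeros_like(A); V = np.zeros_like(A)
    for s in range(n):                       # scalar-block splitting (4)
        odd = (s % 2 == 0)
        W[s, s] = (alpha if odd else beta) * A[s, s]
        V[s, s] = ((1 - alpha) if odd else (1 - beta)) * A[s, s]
        if s + 1 < n: (W if odd else V)[s, s + 1] = A[s, s + 1]
        if s > 0:     (W if odd else V)[s, s - 1] = A[s, s - 1]
    x = np.zeros(n)
    for k in range(max_iter):
        y = np.linalg.solve(I + tau * W, A @ x - b)      # step (1)
        dx = -2 * tau * np.linalg.solve(I + tau * V, y)  # step (2)
        x = x + dx
        if np.linalg.norm(dx, np.inf) < eps:             # step (3)
            return x, k + 1
    return x, max_iter

n = 8                                        # n = 2hp must be even
A = 4 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x, iters = cycle_sketch(A, b, tau=0.4, alpha=0.5, beta=0.5)
```

This test matrix is symmetric positive definite, so Theorem 8 (with β = 1/2) guarantees convergence; τ = 0.4 also satisfies the bound of Theorem 9.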

4. Analysis of Convergence

To analyze the convergence of the parallel algorithm, we introduce some definitions and lemmas.

Symbol and Definition

\(\mathbb{R}^{n \times n}\) denotes the space of \(n \times n\) real matrices.

\(I_r\) denotes the identity matrix of order \(r\).

\(W^H, V^H\) denote the conjugate transposes of \(W, V\), respectively.

\(W^{-1}\) denotes the inverse of \(W\).

Definition 1 (see [<xref ref-type="bibr" rid="B15">15</xref>]).

Suppose \(A \in \mathbb{R}^{n \times n}\) and \(A = Q - S\), where \(Q^{-1} \ge 0\) and \(S \ge 0\); then \(A = Q - S\) is called a normal splitting of the matrix \(A\).

Definition 2 (see [<xref ref-type="bibr" rid="B15">15</xref>]).

Suppose \(A \in \mathbb{R}^{n \times n}\) and \(A = Q - S\), where \(Q^{-1} S \ge 0\); then \(A = Q - S\) is called a weak normal splitting of the matrix \(A\).

Definition 3 (see [<xref ref-type="bibr" rid="B15">15</xref>]).

Suppose \(A \in \mathbb{R}^{n \times n}\) and \(A = Q - S\), where \(Q^H + S\) is a Hermitian positive definite matrix; then \(A = Q - S\) is called a \(P\)-normal splitting of the matrix \(A\).

Definition 4 (see [<xref ref-type="bibr" rid="B15">15</xref>]).

Let \(A = (a_{ij}) \in \mathbb{R}^{n \times n}\). If \(a_{ij} \le 0\) (\(i \ne j\)) and \(A^{-1} \ge 0\), then the matrix \(A\) is an \(M\)-matrix.

Here, we give some theoretical analysis for convergence of the parallel iterative algorithm.

Lemma 5 (see [<xref ref-type="bibr" rid="B9">9</xref>]).

Let \(A \in \mathbb{R}^{n \times n}\). If \(A = M - N\) is a weak normal splitting or a normal splitting of the coefficient matrix \(A\), then \(\rho(M^{-1}N) < 1\) if and only if \(A^{-1} \ge 0\).

Lemma 6 (see [<xref ref-type="bibr" rid="B10">10</xref>]).

Let \(A\) be an \(M\)-matrix. If any element of \(A\) is increased while the off-diagonal elements remain nonpositive, then the resulting matrix \(B\) is also an \(M\)-matrix and \(B^{-1} \le A^{-1}\).

Lemma 7 (see [<xref ref-type="bibr" rid="B15">15</xref>]).

Let \(A \in \mathbb{C}^{n \times n}\) be a nonsingular Hermitian matrix. If \(A = M - N\) is a \(P\)-normal splitting of the matrix \(A\), then \(\rho(M^{-1}N) < 1\) if and only if \(A\) is positive definite.

Theorem 8.

Let \(A \in \mathbb{R}^{n \times n}\) be a Hermitian positive definite matrix. If \(\tau > 0\), \(\beta = 1/2\), and \(0 \le \alpha \le 1\), then the iterative scheme (3) converges for every initial vector \(x^{(0)}\).

Proof.

Since \((I + \tau W)(I + \tau V)(x^{(k+1)} - x^{(k)}) = -2\tau(Ax^{(k)} - b)\) and
\[
B_\omega = I - 2\tau (I + \tau V)^{-1}(I + \tau W)^{-1} A
        = (I + \tau V)^{-1}(I + \tau W)^{-1}(I - \tau W)(I - \tau V),
\tag{11}
\]
we have \(2\tau A = M - N\) with \(M = (I + \tau W)(I + \tau V)\) and \(N = (I - \tau W)(I - \tau V)\), so that \(B_\omega = M^{-1}N\); note that scaling \(A\) by \(2\tau > 0\) does not affect positive definiteness. Then
\[
M^H + N = (I + \tau V)^H (I + \tau W)^H + (I - \tau W)(I - \tau V)
        = I + \tau A + \tau^2 V^H W^H + I - \tau A + \tau^2 W V
        = 2I + \tau^2 \bigl(W V + V^H W^H\bigr).
\tag{12}
\]

Now \(WV\) is the block five-diagonal matrix
\[
WV = \begin{pmatrix}
\Lambda_1 & (1-\beta) D_1 & E_1 & & \\
\beta D_1^H & \Lambda_2 & \beta F_2 & & \\
E_1^H & (1-\beta) F_2^H & \Lambda_3 & (1-\beta) D_3 & E_3 \\
 & & \beta D_3^H & \Lambda_4 & \beta F_4 \\
 & & & \ddots & \ddots \\
 & & & (1-\beta) F_{2hp-2}^H & \Lambda_{2hp-1} & (1-\beta) D_{2hp-1} \\
 & & & & \beta D_{2hp-1}^H & \Lambda_{2hp}
\end{pmatrix},
\tag{13}
\]
where
\[
\Lambda_1 = \alpha(1-\alpha) A_1^2 + B_1 B_1^H, \qquad
\Lambda_i = \begin{cases}
\beta(1-\beta) A_i^2 & (i = 2, 4, \ldots, 2hp), \\
\alpha(1-\alpha) A_i^2 + B_{i-1}^H B_{i-1} + B_i B_i^H & (i = 3, 5, \ldots, 2hp-1),
\end{cases}
\]
\[
D_i = B_i A_{i+1}, \qquad E_i = B_i B_{i+1}, \qquad F_i = A_i B_i.
\tag{14}
\]
Let
\[
U = \begin{pmatrix}
0 & B_1 & & & & \\
 & \tfrac{1}{2} A_2 & & & & \\
 & C_3 & 0 & B_3 & & \\
 & & & \ddots & \ddots & \\
 & & & C_{2hp-1} & 0 & B_{2hp-1} \\
 & & & & & \tfrac{1}{2} A_{2hp}
\end{pmatrix};
\tag{15}
\]
then
\[
WV + V^H W^H - 2UU^H = \operatorname{diag}\bigl(2\Lambda_1 - Q_1,\; 2\Lambda_2 - Q_2,\; \ldots,\; 2\Lambda_{2hp} - Q_{2hp}\bigr),
\tag{16}
\]
where
\[
Q_1 = 2 B_1 B_1^H, \qquad
Q_i = \begin{cases}
\tfrac{1}{2} A_i^2 & (i = 2, 4, \ldots, 2hp), \\
2 B_{i-1}^H B_{i-1} + 2 B_i B_i^H & (i = 3, 5, \ldots, 2hp-1),
\end{cases}
\]
\[
2\Lambda_i - Q_i = \begin{cases}
2\alpha(1-\alpha) A_i^2 & (i = 1, 3, \ldots, 2hp-1), \\
\bigl[2\beta(1-\beta) - \tfrac{1}{2}\bigr] A_i^2 & (i = 2, 4, \ldots, 2hp).
\end{cases}
\tag{17}
\]
With \(\beta = 1/2\) and \(0 \le \alpha \le 1\), \(WV + V^H W^H - 2UU^H\) is obviously positive semidefinite or positive definite. Hence the matrix
\[
M^H + N = 2I + \tau^2 (WV + V^H W^H) = 2I + \tau^2 (WV + V^H W^H - 2UU^H) + 2\tau^2 UU^H
\tag{18}
\]
is a Hermitian positive definite matrix.

Therefore, \(2\tau A = M - N\) is a \(P\)-normal splitting, and \(2\tau A\) is Hermitian positive definite together with \(A\); hence \(\rho(M^{-1}N) < 1\) by Lemma 7, and the iterative scheme of our algorithm is convergent.

By this theorem, the parallel algorithm converges whenever \(A\) is a Hermitian positive definite matrix.

Theorem 9.

Let \(A \in \mathbb{R}^{n \times n}\) be an \(M\)-matrix. If \(0 < \tau \le \min\{1/\alpha,\, 1/\beta,\, 1/(1-\alpha),\, 1/(1-\beta)\} \cdot \min_i (1/a_{ii})\) for \(i = 1, 2, \ldots, 2hp\), where \(0 < \alpha, \beta < 1\) and the \(a_{ii}\) are the diagonal elements of \(A\), then the iterative scheme (3) converges for every initial vector \(x^{(0)}\).

Proof.

Since \(M = (I + \tau W)(I + \tau V)\), \(N = (I - \tau W)(I - \tau V)\), and
\[
I + \tau V = \begin{pmatrix}
I + (1-\alpha)\tau A_1 & & & & \\
\tau C_2 & I + (1-\beta)\tau A_2 & \tau B_2 & & \\
 & & I + (1-\alpha)\tau A_3 & & \\
 & & \tau C_4 & I + (1-\beta)\tau A_4 & \tau B_4 \\
 & & & & \ddots \\
 & & & \tau C_{2hp} & I + (1-\beta)\tau A_{2hp}
\end{pmatrix},
\tag{19}
\]
its inverse has the same sparsity pattern:
\[
(I + \tau V)^{-1} = \begin{pmatrix}
\bigl(I + (1-\alpha)\tau A_1\bigr)^{-1} & & & & \\
Q_2 & \bigl(I + (1-\beta)\tau A_2\bigr)^{-1} & F_2 & & \\
 & & \bigl(I + (1-\alpha)\tau A_3\bigr)^{-1} & & \\
 & & Q_4 & \bigl(I + (1-\beta)\tau A_4\bigr)^{-1} & F_4 \\
 & & & & \ddots \\
 & & & Q_{2hp} & \bigl(I + (1-\beta)\tau A_{2hp}\bigr)^{-1}
\end{pmatrix},
\tag{20}
\]
where
\[
Q_i = -\tau \bigl(I + (1-\beta)\tau A_i\bigr)^{-1} C_i \bigl(I + (1-\alpha)\tau A_{i-1}\bigr)^{-1} \quad (i = 2, 4, \ldots, 2hp),
\]
\[
F_i = -\tau \bigl(I + (1-\beta)\tau A_i\bigr)^{-1} B_i \bigl(I + (1-\alpha)\tau A_{i+1}\bigr)^{-1} \quad (i = 2, 4, \ldots, 2hp-2).
\tag{21}
\]
The matrices \(I + (1-\alpha)\tau A_{i-1}\), \(I + (1-\alpha)\tau A_{i+1}\), and \(I + (1-\beta)\tau A_i\) (\(i = 1, 2, \ldots, 2hp\)) are all \(M\)-matrices by Lemma 6, so \(\bigl(I + (1-\alpha)\tau A_{i-1}\bigr)^{-1} \ge 0\), \(\bigl(I + (1-\alpha)\tau A_{i+1}\bigr)^{-1} \ge 0\), and \(\bigl(I + (1-\beta)\tau A_i\bigr)^{-1} \ge 0\). Since the off-diagonal blocks \(B_i\) and \(C_i\) of the \(M\)-matrix \(A\) are nonpositive, \(Q_i \ge 0\) and \(F_i \ge 0\); we obtain \((I + \tau V)^{-1} \ge 0\). Similarly, \((I + \tau W)^{-1} \ge 0\), and therefore \(M^{-1} \ge 0\).

Since \(0 < \tau \le \min\{1/\alpha,\, 1/\beta,\, 1/(1-\alpha),\, 1/(1-\beta)\} \cdot \min_i(1/a_{ii})\) for \(i = 1, 2, \ldots, 2hp\), we have \(I - \tau W \ge 0\) and \(I - \tau V \ge 0\). That is, \(N \ge 0\), and \(2\tau A = M - N\) is a normal splitting. Since \(A\) is an \(M\)-matrix, \(A^{-1} \ge 0\) and hence \((2\tau A)^{-1} \ge 0\); by Lemma 5, \(\rho(M^{-1}N) < 1\), and the iterative scheme (3) is convergent.

By this theorem, the parallel algorithm converges whenever \(A\) is an \(M\)-matrix and \(0 < \tau \le \min\{1/\alpha,\, 1/\beta,\, 1/(1-\alpha),\, 1/(1-\beta)\} \cdot \min_i(1/a_{ii})\) for \(i = 1, 2, \ldots, 2hp\).
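The bound of Theorem 9 is easy to probe numerically. The sketch below (NumPy, scalar blocks, an assumed small test matrix) builds M = (I + tau*W)(I + tau*V) and N = (I - tau*W)(I - tau*V) for an M-matrix at the largest admissible tau, checks the splitting identity M - N = 2*tau*A from the proof, and confirms the spectral radius of the iteration matrix is below 1:

```python
import numpy as np

n, alpha, beta = 8, 0.4, 0.6
# test M-matrix: positive diagonal, nonpositive off-diagonal entries
A = 5 * np.eye(n) - 2 * np.eye(n, k=1) - np.eye(n, k=-1)

# tau bound of Theorem 9 with a_ii = 5
tau = min(1/alpha, 1/beta, 1/(1-alpha), 1/(1-beta)) * (1/5.0)

# scalar-block splitting (4)
W = np.zeros_like(A); V = np.zeros_like(A)
for s in range(n):
    odd = (s % 2 == 0)                     # 1-based row s + 1 is odd
    W[s, s] = (alpha if odd else beta) * A[s, s]
    V[s, s] = ((1 - alpha) if odd else (1 - beta)) * A[s, s]
    if s + 1 < n: (W if odd else V)[s, s + 1] = A[s, s + 1]
    if s > 0:     (W if odd else V)[s, s - 1] = A[s, s - 1]

I = np.eye(n)
M = (I + tau * W) @ (I + tau * V)
N = (I - tau * W) @ (I - tau * V)
rho = max(abs(np.linalg.eigvals(np.linalg.solve(M, N))))
```

Here M - N = 2*tau*(W + V) = 2*tau*A holds exactly because the quadratic terms tau^2*W*V cancel, which is the identity underlying both proofs.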

5. Numerical Examples

We performed two numerical experiments on the HP rx2600 cluster. The results are shown as follows.

Example 1.

Consider a banded linear system \(AX = b\) with
\[
A = \begin{pmatrix}
A_1 & B_1 & & \\
C_2 & A_2 & B_2 & \\
 & \ddots & \ddots & \ddots \\
 & C_{m-1} & A_{m-1} & B_{m-1} \\
 & & C_m & A_m
\end{pmatrix}, \qquad
A_i = \begin{pmatrix}
15.1 & -3.5 & -6.9 \\
-2.7 & 20.1 & -4.8 \\
-15.7 & -5.3 & 25.1
\end{pmatrix},
\tag{22}
\]
\[
B_i = C_i = \operatorname{diag}(-3, -2, -4), \qquad b_i = (1, 1, 1)^T.
\]
Let the initial value be \(x_i^{(0)} = (0, 0, 0)^T\) and \(m = 80000\). We apply our algorithm with the optimal relaxation factor, the multisplitting method, and the BSOR method to this system on the HP rx2600 cluster. Here \(P\) is the number of processors, \(T\) is the run time (seconds), \(S\) is the speedup (\(T\) on one processor divided by \(T\) on \(P\) processors), \(L\) is the number of iterations, \(E = S/P\) is the efficiency, and the error bound is \(\varepsilon = 1 \times 10^{-10}\). See Tables 1, 2, and 3 and Figures 1 and 2.

Table 1: The results for Example 1 (the algorithm in this paper, \(\tau = 0.9\), \(\alpha = \beta = 1/2\)).

P    1        2        4        8
T    2.9921   1.5179   0.8028   0.6492
S    -        1.9712   3.7271   4.6089
E    -        0.9856   0.9318   0.5761
L    39       39       39       39

Table 2: The results for Example 1 (the multisplitting method).

P    1        2        4        8
T    7.9544   7.1352   3.3874   2.2426
S    -        1.1148   2.3482   3.5470
E    -        0.5745   0.5871   0.4434
L    160      224      224      224

Table 3: The results for Example 1 (the BSOR method, \(\omega = 1.85\)).

P    1         2        4        8
T    17.0168   8.9928   4.6551   3.8031
S    -         1.8923   3.6555   4.4745
E    -         0.9461   0.9139   0.5593
L    495       532      532      532

Figure 1: The parallel speedup for Example 1.

Figure 2: The parallel efficiency for Example 1.

Example 2.

Consider an elliptic partial differential equation
\[
C_x \frac{\partial^2 u}{\partial x^2} + C_y \frac{\partial^2 u}{\partial y^2}
+ (C_1 \sin 2\pi x + C_2) \frac{\partial u}{\partial x}
+ (D_1 \sin 2\pi x + D_2) \frac{\partial u}{\partial y}
+ E u = 0, \qquad 0 \le x, y \le 1,
\tag{23}
\]
equipped with the boundary conditions \(u|_{x=0} = u|_{x=1} = 10 + \cos \pi y\), \(u|_{y=0} = u|_{y=1} = 10 + \cos \pi x\); here \(C_x, C_y, C_1, C_2, D_1, D_2\), and \(E\) are all constants.

We set \(C_x = C_y = E = 1\) and \(C_1 = C_2 = D_1 = D_2 = 0\). Using the finite difference method with step size \(h = 1/100\), we obtain two block-tridiagonal linear systems. We then apply our algorithm with the optimal relaxation factor, the BSOR method, the PEk method, and the multisplitting algorithm to these systems on the HP rx2600 cluster. The numerical results are shown in Tables 4, 5, 6, and 7 and Figures 3 and 4.
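With these constants, (23) reduces to u_xx + u_yy + u = 0, and the standard five-point scheme yields a block-tridiagonal system of the form (1) with d_i = N, the number of interior grid points per direction. A NumPy assembly sketch on a coarser grid (h = 1/10 instead of 1/100; the grid size and assembly details are illustrative only):

```python
import numpy as np

h = 1.0 / 10                       # the paper uses h = 1/100
N = round(1 / h) - 1               # interior grid points per direction

def g(x, y):                       # boundary data of Example 2
    if abs(x) < 1e-12 or abs(x - 1) < 1e-12:
        return 10 + np.cos(np.pi * y)
    return 10 + np.cos(np.pi * x)

# five-point scheme for u_xx + u_yy + u = 0, multiplied through by -h^2:
# (4 - h^2) u_ij - u_(i-1)j - u_(i+1)j - u_i(j-1) - u_i(j+1) = boundary terms
n = N * N
A = np.zeros((n, n)); b = np.zeros(n)
for j in range(N):                 # y index -> block row
    for i in range(N):             # x index within the block
        k = j * N + i
        A[k, k] = 4 - h * h
        for ii, jj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ii < N and 0 <= jj < N:
                A[k, jj * N + ii] = -1.0
            else:                  # neighbour lies on the boundary
                b[k] += g((ii + 1) * h, (jj + 1) * h)

u = np.linalg.solve(A, b)          # reference solution for checking
```

The assembled A is symmetric with bandwidth N: the diagonal blocks are tridiag(-1, 4 - h^2, -1) and the off-diagonal blocks are -I, matching the block-tridiagonal form assumed by the algorithm.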

Table 4: The results for Example 2 (the algorithm in this paper, \(\tau = 8.0\), \(\alpha = \beta = 1/2\)).

P    1            2            4            8            16
T    11.3091      6.3632       5.5117       4.0755       3.3842
S    -            1.7773       2.0152       2.7749       3.3417
E    -            0.8886       0.5130       0.3469       0.2089
L    1177         1177         1177         1177         1186
Δ    0.8163e-10   0.8163e-10   0.8163e-10   0.8163e-10   0.8140e-10

Table 5: The results for Example 2 (the multisplitting method).

P    1         2         4         8        16
T    15.3559   17.3404   10.5411   7.8602   5.9567
S    -         0.8856    1.4568    1.9536   2.5779
E    -         0.4428    0.3642    0.2442   0.1611
L    310       824       975       1335     1556

Table 6: The results for Example 2 (the PEk method, \(k = 2.7\)).

P    1         2         4        8        16
T    14.6964   20.0765   9.9533   6.3488   4.8215
S    -         0.7320    1.4765   2.3148   3.0481
E    -         0.3660    0.3691   0.2894   0.1905
L    159       444       444      444      444

Table 7: The results for Example 2 (the BSOR method).

P    1         2         4         8         16
T    27.7668   21.6576   14.3278   10.1420   8.5949
S    -         1.2821    1.9380    2.7378    3.2306
E    -         0.6410    0.4845    0.3422    0.2019
L    660       1039      1337      2101      2175

Figure 3: The parallel speedup for Example 2.

Figure 4: The parallel efficiency for Example 2.

6. Results Analysis

From Tables 1 to 7, we draw the following conclusions.

The numerical results agree with the theoretical analysis; note that the conditions in the theorems are only sufficient.

The numerical results show that the parallel algorithm has good parallelism.

For Examples 1 and 2, the efficiency of our algorithm is better than that of the multisplitting method, and its parallel speedup matches that of the BSOR method. For Example 2, the efficiency of our algorithm is also better than that of the PEk method.

The parallel algorithm is easily implemented on a parallel computer and is more flexible and simpler in practice than the method of .

7. Conclusions

An efficient parallel iterative method for solving large banded linear systems on a distributed-memory multicomputer has been presented. We make full use of the decomposition of the coefficient matrix to choose \(W\) and \(V\) so as to save computational cost, and the storage strategy saves memory space. The algorithm requires communication between adjacent processors only twice per iteration. Theoretical analysis and experiments show that the algorithm has good parallelism and high efficiency. The results also confirm the convergence theorems: when the coefficient matrix is a Hermitian positive definite matrix or an \(M\)-matrix, the parallel algorithm converges under the stated conditions. Our algorithm has an advantage in efficiency over the multisplitting one.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the National Natural Science Foundation of China under Grant nos. 11002117 and 11302173 and Xianyang Normal University Research Foundation under Grant nos. 09XSYK209 and 09XSYK204.

References

1. B. Zhang, T. Gu, and Z. Mo, Principles and Methods of Numerical Parallel Computation, National Defence Industry Press, 1999.
2. Q. T. Ye, "An improved parallel algorithm for solving linear equations involving block tridiagonal coefficient matrix," Journal of Northwestern Polytechnical University, vol. 4, no. 2, pp. 314-317, 1996.
3. J. Wu, J. Song, W. Zhang, and X. Li, "Parallel incomplete factorization preconditioning of block tridiagonal linear systems with 2-D domain decomposition," Chinese Journal of Computational Physics, vol. 26, no. 2, pp. 191-199, 2009.
4. Z. Duan, Y. Yang, Q. Lv, and X. Ma, "Parallel strategy for solving block-tridiagonal linear systems," Computer Engineering and Applications, vol. 47, no. 13, pp. 46-49, 2011.
5. Y. Fan, The Parallel Algorithms for Solving the Large Scale Linear Systems with Typical Structure, Northwestern Polytechnical University Press, Xi'an, China, 2009.
6. Z.-G. Luo and X.-M. Li, "Parallel algorithm for block-tridiagonal linear systems on distributed-memory multicomputers," Chinese Journal of Computers, vol. 23, no. 10, pp. 1028-1034, 2000.
7. S. M. El-Sayed, "A direct method for solving circulant tridiagonal block systems of linear equations," Applied Mathematics and Computation, vol. 165, no. 1, pp. 23-30, 2005.
8. X. Q. Cui, "A parallel algorithm for block-tridiagonal linear systems," Applied Mathematics and Computation, vol. 173, no. 2, pp. 1107-1114, 2006.
9. R. S. Varga, Matrix Iterative Analysis, Prentice-Hall, Englewood Cliffs, NJ, USA, 1962.
10. J. Hu, Iterative Method of Linear Algebraic Equations, Science Press, Beijing, China, 1999.
11. A. Frommer and D. B. Szyld, "Weighted max norms, splittings, and overlapping additive Schwarz iterations," Numerische Mathematik, vol. 83, no. 2, pp. 259-278, 1999.
12. J. Feng, G. Che, and Y. Nie, Principle of Numerical Analysis, Science Press, Beijing, China, 2002.
13. P. Bjørstad and M. Luskin, Parallel Solution of Partial Differential Equations, Springer, New York, NY, USA, 2000.
14. W. H. Reed and T. R. Hill, "Triangle mesh methods for the neutron transport equation," Report LA-UR-73-479, Los Alamos Scientific Laboratory, 1973.
15. Y. P. Cheng, Matrix Theory, Northwestern Polytechnical University Press, 2002.