This work investigates an efficient parameterized parallel iterative method for solving banded linear systems on distributed-memory multicomputers. At each iteration, the algorithm applies an alternating direction scheme obtained by splitting the coefficient matrix and choosing the parameters appropriately. Each iteration requires only two communications between adjacent processors, so the method has high parallel efficiency. Convergence theorems are given for different classes of coefficient matrices, namely Hermitian positive definite matrices and M-matrices. Numerical experiments on an HP rx2600 cluster show that the algorithm achieves higher efficiency and uses less memory than the multisplitting method, and that it requires considerably less CPU time than the BSOR method. For Example 1, its efficiency is significantly better than that of BSOR. For Example 2, its speedup and efficiency are better than those of the PEk inner iterative method.
1. Introduction
High-performance parallel computing technology has developed rapidly in recent years. Large sparse banded linear systems arise frequently when finite difference or finite element methods are used to discretize partial differential equations in practical scientific and engineering computing problems, especially in computational fluid dynamics (CFD). Many problems that can be solved efficiently on sequential computers are difficult to solve on parallel computers, because communication takes a significant part of the total execution time. More effort is therefore needed to develop more efficient parallel algorithms.
Parallel algorithms for large sparse linear systems have been widely investigated [1–8]. In particular, the multisplitting algorithm in [1] is currently a popular method. In [3], the authors provide a method for solving block-tridiagonal linear systems in which local lower and upper triangular incomplete factors are combined into an effective approximation of the global incomplete lower and upper triangular factors of the coefficient matrix, based on two-dimensional domain decomposition with small overlapping. That algorithm is applicable to any preconditioner of incomplete type. Duan et al. presented a parallel strategy based on the Galerkin principle for solving block-tridiagonal linear systems in [4]. In [5], a parallel direct algorithm based on the divide-and-conquer principle and the decomposition of the coefficient matrix is investigated for solving block-tridiagonal linear systems on distributed-memory multicomputers; it requires only two communications between adjacent processors. In [7], a direct method for solving circulant tridiagonal block linear systems is presented. Further parallel algorithms for solving linear systems can be found in [9–14]. The algorithm in this paper builds on the advantages of the one in [2].
The goal of this paper is to develop an efficient, stable parallel iterative method for distributed-memory multicomputers and to give some theoretical analysis. We choose the splitting matrices W and V appropriately to establish the iterative scheme. Two examples were run on the HP rx2600 cluster; the experimental results indicate that the parallel algorithm has higher parallel speedup and efficiency than the multisplitting method.
The rest of this paper is organized as follows. Section 2 describes the parallel iterative algorithm. Section 3 discusses the parallel iterative process. Convergence is analyzed in Section 4. Numerical results are shown in Section 5 and analyzed in Section 6. Conclusions are presented in Section 7.
2. Parallel Algorithm
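Let a banded linear system $AX=b$ be represented as
\[
\begin{pmatrix}
A_1 & B_1 & & & \\
C_2 & A_2 & B_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & C_{n-1} & A_{n-1} & B_{n-1} \\
 & & & C_n & A_n
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_{n-1} \\ b_n \end{pmatrix},
\tag{1}
\]
where $A_i$ is a $d_i\times d_i$ matrix, $B_i$ and $C_i$ are $d_i\times d_{i+1}$ and $d_i\times d_{i-1}$ matrices, respectively, and $x_i$ and $b_i$ are $d_i$-dimensional real column vectors. In general, assuming that there are $p$ processors available and $n=2hp$ ($h\ge 1$, $h\in\mathbb{Z}^+$), we denote the $i$th processor by $P_i$ ($i=1,2,\dots,p$) and split the coefficient matrix $A$ into $A=W+V$.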
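Then, we use the alternating direction iterative scheme in [2] and obtain the new iterative scheme
\[
(I+\tau W)(I+\tau V)\bigl(x^{(k+1)}-x^{(k)}\bigr)=-\alpha\tau\bigl(Ax^{(k)}-b\bigr);
\tag{2}
\]
here $I+\tau W$ and $I+\tau V$ are nonsingular matrices, and the scalar $\alpha$ in (2) is fixed at $\alpha=2$ (it is distinct from the splitting parameter $\alpha$ that appears in (4)). Hence (2) becomes
\[
\begin{aligned}
x^{(k+1)} &= x^{(k)}-(I+\tau V)^{-1}(I+\tau W)^{-1}\bigl[2\tau\bigl(Ax^{(k)}-b\bigr)\bigr]\\
&=\bigl(I-2\tau(I+\tau V)^{-1}(I+\tau W)^{-1}A\bigr)x^{(k)}+2\tau(I+\tau V)^{-1}(I+\tau W)^{-1}b\\
&=B_\omega x^{(k)}+g;
\end{aligned}
\tag{3}
\]
here $B_\omega=I-2\tau(I+\tau V)^{-1}(I+\tau W)^{-1}A$ is the iteration matrix and $g=2\tau(I+\tau V)^{-1}(I+\tau W)^{-1}b$.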
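The matrices $I+\tau W$ and $I+\tau V$ must be nonsingular, and the choice of $W$ and $V$ is the key to solving the linear system by (3). If $W$ and $V$ are chosen suitably, the algorithm has good parallelism and low CPU time. We therefore choose $W$ and $V$ as follows:
\[
\begin{aligned}
W&=\begin{pmatrix}
\alpha A_1 & B_1 & & & & & \\
 & \beta A_2 & & & & & \\
 & C_3 & \alpha A_3 & B_3 & & & \\
 & & & \beta A_4 & & & \\
 & & & & \ddots & & \\
 & & & & C_{2hp-1} & \alpha A_{2hp-1} & B_{2hp-1} \\
 & & & & & & \beta A_{2hp}
\end{pmatrix},\\[4pt]
V&=\begin{pmatrix}
(1-\alpha)A_1 & & & & & & \\
C_2 & (1-\beta)A_2 & B_2 & & & & \\
 & & (1-\alpha)A_3 & & & & \\
 & & C_4 & (1-\beta)A_4 & B_4 & & \\
 & & & & \ddots & & \\
 & & & & & (1-\alpha)A_{2hp-1} & \\
 & & & & & C_{2hp} & (1-\beta)A_{2hp}
\end{pmatrix}.
\end{aligned}
\tag{4}
\]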
From (3), let $y=(I+\tau W)^{-1}(Ax^{(k)}-b)$; we obtain
\[
(I+\tau W)y=Ax^{(k)}-b;
\tag{5}
\]
then the detailed calculation procedure is as follows. Writing $s=(i-1)2h+j$ for the index of the $j$th block row held by processor $P_i$,
\[
\begin{aligned}
(I+\tau\beta A_s)\,y_s &= C_s x_{s-1}^{(k)}+A_s x_s^{(k)}+B_s x_{s+1}^{(k)}-b_s && (j=2,4,\dots,2h),\\
(I+\tau\alpha A_s)\,y_s &= C_s\bigl(x_{s-1}^{(k)}-\tau y_{s-1}\bigr)+A_s x_s^{(k)}+B_s\bigl(x_{s+1}^{(k)}-\tau y_{s+1}\bigr)-b_s && (j=1,3,\dots,2h-1);
\end{aligned}
\tag{6}
\]
here, $y=(y_1,y_2,\dots,y_i,\dots,y_p)^T$ and $y_i$ is a $2h$-dimensional row vector.
Let $z=(I+\tau V)^{-1}y$; then we have $(I+\tau V)z=y$, and, again with $s=(i-1)2h+j$,
\[
\begin{aligned}
\bigl[I+\tau(1-\alpha)A_s\bigr]z_s &= y_s && (j=1,3,\dots,2h-1),\\
\bigl[I+\tau(1-\beta)A_s\bigr]z_s &= y_s-\tau C_s z_{s-1}-\tau B_s z_{s+1} && (j=2,4,\dots,2h),
\end{aligned}
\tag{7}
\]
where $z=(z_1,z_2,\dots,z_i,\dots,z_p)^T$ and $z_i$ is a $2h$-dimensional row vector. Then, according to the aforementioned formulas, we get $x^{(k+1)}=x^{(k)}-2\tau z$.
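The two half-steps (6) and (7) can be sketched serially for the simplest case in which every block is a scalar ($d_i=1$), so that $A$ is an ordinary tridiagonal matrix. This is only an illustration under that assumption: the function names `adi_step` and `solve`, the max-norm stopping test, and the small test system are hypothetical choices, not taken from the paper, and no parallel communication is modelled.

```python
def adi_step(c, a, b_up, rhs, x, tau, alpha, beta):
    """One sweep of scheme (3) for a scalar tridiagonal system (d_i = 1).

    c, a, b_up hold the sub-, main and super-diagonal (c[0] and b_up[-1]
    are ignored). len(a) is assumed even, matching n = 2hp. Row numbers in
    the comments are 1-based, as in the text; list indices are 0-based.
    """
    n = len(a)
    # Residual r = A x - b.
    r = []
    for i in range(n):
        s = a[i] * x[i] - rhs[i]
        if i > 0:
            s += c[i] * x[i - 1]
        if i < n - 1:
            s += b_up[i] * x[i + 1]
        r.append(s)

    # Solve (I + tau W) y = r as in (6).
    y = [0.0] * n
    for i in range(1, n, 2):      # even rows of W are purely diagonal
        y[i] = r[i] / (1.0 + tau * beta * a[i])
    for i in range(0, n, 2):      # odd rows couple to their even neighbours
        s = r[i]
        if i > 0:
            s -= tau * c[i] * y[i - 1]
        if i < n - 1:
            s -= tau * b_up[i] * y[i + 1]
        y[i] = s / (1.0 + tau * alpha * a[i])

    # Solve (I + tau V) z = y as in (7).
    z = [0.0] * n
    for i in range(0, n, 2):      # odd rows of V are purely diagonal
        z[i] = y[i] / (1.0 + tau * (1.0 - alpha) * a[i])
    for i in range(1, n, 2):      # even rows couple to their odd neighbours
        s = y[i] - tau * c[i] * z[i - 1]
        if i < n - 1:
            s -= tau * b_up[i] * z[i + 1]
        z[i] = s / (1.0 + tau * (1.0 - beta) * a[i])

    # Update x^(k+1) = x^(k) - 2 tau z.
    return [x[i] - 2.0 * tau * z[i] for i in range(n)]


def solve(c, a, b_up, rhs, tau, alpha, beta, eps=1e-10, max_iter=1000):
    """Iterate from x = 0 until successive iterates differ by less than eps."""
    x = [0.0] * len(a)
    for k in range(1, max_iter + 1):
        x_new = adi_step(c, a, b_up, rhs, x, tau, alpha, beta)
        if max(abs(u - v) for u, v in zip(x_new, x)) < eps:
            return x_new, k
        x = x_new
    return x, max_iter


# Hypothetical test system: tridiag(-1, 4, -1) with exact solution (1,...,1).
n = 8
a = [4.0] * n
c = [-1.0] * n
b_up = [-1.0] * n
rhs = [3.0] + [2.0] * (n - 2) + [3.0]
x, iters = solve(c, a, b_up, rhs, tau=0.5, alpha=0.5, beta=0.5)
```

Within each half-step, all rows of the same parity are independent of one another, which is what allows the rows to be distributed over processors with only neighbouring values exchanged.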
3. Process of Parallel Iterative Algorithm
Here, we show the storage method and computational procedure of the parallel algorithm as follows.
3.1. Storage Method
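The coefficient matrix is stored by diagonals, taken from left to right across the band, as the vectors $A_1,A_2,\dots,A_{k_l+1},\dots,A_{k_l+k_u+1}$, where
\[
\begin{aligned}
A_1&=(0,\dots,0,a_{k_l+1,1},a_{k_l+2,2},\dots,a_{n,n-k_l})^T,\ \dots,\\
A_{k_l+1}&=(a_{11},a_{22},\dots,a_{nn})^T,\ \dots,\\
A_{k_l+k_u+1}&=(a_{1,k_u+1},a_{2,k_u+2},\dots,a_{n-k_u,n},0,\dots,0)^T.
\end{aligned}
\]
The corresponding relationship is as follows:
\[
A=\begin{pmatrix}
a_{11} & \cdots & a_{1,k_u+1} & & & & \\
\vdots & \ddots & & \ddots & & & \\
a_{k_l+1,1} & \cdots & a_{k_l+1,k_l+1} & \cdots & a_{k_l+1,k_l+k_u+1} & & \\
 & \ddots & & \ddots & & \ddots & \\
 & & a_{n-k_u,n-k_u-k_l} & \cdots & a_{n-k_u,n-k_u} & \cdots & a_{n-k_u,n} \\
 & & & \ddots & & \ddots & \vdots \\
 & & & & a_{n,n-k_l} & \cdots & a_{nn}
\end{pmatrix}
\Longrightarrow
\bigl(A_1\ \cdots\ A_{k_l+1}\ \cdots\ A_{k_l+k_u+1}\bigr)=
\begin{pmatrix}
0 & \cdots & a_{11} & \cdots & a_{1,k_u+1} \\
\vdots & & \vdots & & \vdots \\
0 & & \vdots & & \vdots \\
a_{k_l+1,1} & \cdots & a_{k_l+1,k_l+1} & \cdots & a_{k_l+1,k_l+k_u+1} \\
\vdots & & \vdots & & \vdots \\
\vdots & & \vdots & & a_{n-k_u,n} \\
\vdots & & \vdots & & 0 \\
\vdots & & \vdots & & \vdots \\
a_{n,n-k_l} & \cdots & a_{nn} & \cdots & 0
\end{pmatrix}.
\tag{8}
\]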
Then $m$ rows ($m=n/p$) are assigned to each processor, and processor $P_i$ ($i=1,2,\dots,p$) stores the corresponding vectors $b_i$ and $x_i$. Here $k_u$ and $k_l$ are the upper and lower bandwidths, respectively. This storage scheme saves a large amount of memory, although it makes programming more difficult. If $n$ is not divisible by $p$, some processors store $[n/p]+1$ consecutive rows of $A$ and the others store $[n/p]$ rows; each processor also stores the corresponding parts of $x^{(0)}$ and $b$. This keeps the load on the processors nearly balanced and shortens waiting time.
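A minimal serial sketch of this storage scheme is given below. The names `to_band`, `band_matvec`, and `row_ranges` are hypothetical, and the rule that the first $n \bmod p$ processors receive the extra row is an assumption; the text only states that some processors hold $[n/p]+1$ rows.

```python
def to_band(dense, kl, ku):
    """Store a banded matrix as kl+ku+1 diagonal vectors, aligned by row.

    band[d][i] holds a[i][i + d - kl] (0-based); positions that fall
    outside the matrix are left as 0, as in the vectors A_1, ..., A_{kl+ku+1}.
    """
    n = len(dense)
    band = [[0.0] * n for _ in range(kl + ku + 1)]
    for d in range(kl + ku + 1):
        for i in range(n):
            j = i + d - kl
            if 0 <= j < n:
                band[d][i] = dense[i][j]
    return band


def band_matvec(band, kl, ku, x):
    """Compute y = A x using only the stored diagonals."""
    n = len(x)
    y = [0.0] * n
    for d in range(kl + ku + 1):
        for i in range(n):
            j = i + d - kl
            if 0 <= j < n:
                y[i] += band[d][i] * x[j]
    return y


def row_ranges(n, p):
    """Split n rows over p processors as half-open (start, stop) ranges.

    Assumption: the first n % p processors get one extra row.
    """
    base, extra = divmod(n, p)
    ranges, start = [], 0
    for r in range(p):
        size = base + (1 if r < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# Hypothetical 5 x 5 example with kl = 1 and ku = 2.
dense = [
    [4, 1, 2, 0, 0],
    [3, 4, 1, 2, 0],
    [0, 3, 4, 1, 2],
    [0, 0, 3, 4, 1],
    [0, 0, 0, 3, 4],
]
band = to_band(dense, kl=1, ku=2)
y = band_matvec(band, 1, 2, [1, 2, 3, 4, 5])
parts = row_ranges(10, 4)
```

Because the diagonals are aligned by row, the rows owned by one processor correspond to one contiguous slice of every stored vector.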
3.2. Cycle Process
(1) $P_i$ performs one parallel communication to obtain $x_{(i-1)2m}^{(k)}$ and $x_{(i)2m+1}^{(k)}$ and then computes
\[
y_i=-2\tau(I+\tau W_i)^{-1}\bigl(A_ix_i^{(k)}-b_i\bigr)
\tag{9}
\]
by one step of LU decomposition, where $W_i$, $A_i$, $b_i$, and $x_i$ are the $i$th blocks ($i=1,2,\dots,p$) of $W$, $A$, $b$, and $x$, respectively.
(2) $P_i$ performs one parallel communication to obtain $y_{(i-1)2m}$ and then computes
\[
x_i^{(k+1)}=x_i^{(k)}+(I+\tau V_i)^{-1}y_i
\tag{10}
\]
by one step of LU decomposition; here $V_i$ is the $i$th block ($i=1,2,\dots,p$) of $V$.
(3) On processor $P_i$, check whether the inequality $\|x_i^{(k+1)}-x_i^{(k)}\|<\varepsilon$ holds ($\varepsilon$ is the error bound, $i=1,2,\dots,p$). Stop if the inequality holds on every processor; otherwise return to (1) and continue cycling until all inequalities are satisfied.
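The stopping rule in step (3) can be sketched as follows, with the processors simulated by a list of local vectors. The function names and the use of the max norm are assumptions (the text does not say which norm is used); in a message-passing program the final `all(...)` would typically be realized as a logical-AND reduction over the processors.

```python
def local_ok(x_new, x_old, eps):
    """Local test ||x_new - x_old|| < eps, here in the max norm (assumed)."""
    return max(abs(u - v) for u, v in zip(x_new, x_old)) < eps


def all_converged(new_parts, old_parts, eps):
    """Stop only when the local test passes on every (simulated) processor."""
    return all(local_ok(n, o, eps) for n, o in zip(new_parts, old_parts))


# Two simulated processors: the second one has not yet converged.
old_parts = [[1.0, 2.0], [3.0, 4.0]]
new_parts = [[1.0, 2.0], [3.0, 4.5]]
keep_going = not all_converged(new_parts, old_parts, eps=1e-10)
```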
4. Analysis of Convergence
To perform the theoretical analysis on convergence of the parallel algorithm, we introduce the definition and several lemmata.
Symbol and Definition
$\mathbb{R}^{n\times n}$ represents the space of $n\times n$ real matrices.
$I_r$ represents the identity matrix of order $r$.
$W^H$, $V^H$ represent the conjugate transposes of $W$, $V$, respectively.
$W^{-1}$ represents the inverse of $W$.
Definition 1 (see [15]).
Suppose $A\in\mathbb{R}^{n\times n}$ and $A=Q-S$, where $Q^{-1}\ge 0$ and $S\ge 0$; then $A=Q-S$ is called a normal splitting of the matrix $A$.
Definition 2 (see [15]).
Suppose $A\in\mathbb{R}^{n\times n}$ and $A=Q-S$, where $Q^{-1}S\ge 0$; then $A=Q-S$ is called a weak normal splitting of the matrix $A$.
Definition 3 (see [15]).
Suppose $A\in\mathbb{R}^{n\times n}$ and $A=Q-S$, where $Q^H+S$ is a Hermitian positive definite matrix; then $A=Q-S$ is called a P-normal splitting of the matrix $A$.
Definition 4 (see [15]).
Let $A=(a_{ij})\in\mathbb{R}^{n\times n}$. If $a_{ij}\le 0$ ($i\ne j$) and $A^{-1}\ge 0$, then the matrix $A$ is an M-matrix.
Here, we give some theoretical analysis for convergence of the parallel iterative algorithm.
Lemma 5 (see [9]).
Let $A\in\mathbb{R}^{n\times n}$. If the splitting $A=M-N$ is a weak normal splitting or a normal splitting of the coefficient matrix $A$, then $\rho(M^{-1}N)<1$ if and only if $A^{-1}\ge 0$.
Lemma 6 (see [10]).
Let $A$ be an M-matrix. If some elements of $A$ are increased while the off-diagonal elements remain nonpositive, then the resulting matrix $B$ is also an M-matrix and $B^{-1}\le A^{-1}$.
Lemma 7 (see [15]).
Let $A\in\mathbb{C}^{n\times n}$ be a nonsingular Hermitian matrix. If $A=M-N$ is a P-normal splitting of the matrix $A$, then $\rho(M^{-1}N)<1$ if and only if $A$ is a positive definite matrix.
Theorem 8.
Let $A\in\mathbb{R}^{n\times n}$ be a Hermitian positive definite matrix. If $\tau>0$, $\beta=1/2$, and $0\le\alpha\le 1$, then the iterative scheme (3) is convergent for every initial vector $x^{(0)}$.
Proof.
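Since $(I+\tau W)(I+\tau V)\bigl(x^{(k+1)}-x^{(k)}\bigr)=-2\tau\bigl(Ax^{(k)}-b\bigr)$ and
\[
B_\omega=I-2\tau(I+\tau V)^{-1}(I+\tau W)^{-1}A=(I+\tau V)^{-1}(I+\tau W)^{-1}(I-\tau W)(I-\tau V),
\tag{11}
\]
we have $2\tau A=M-N$ and $B_\omega=M^{-1}N$, where $M=(I+\tau W)(I+\tau V)$ and $N=(I-\tau W)(I-\tau V)$. Since $A=W+V$ is Hermitian,
\[
\begin{aligned}
M^H+N&=(I+\tau V)^H(I+\tau W)^H+(I-\tau W)(I-\tau V)\\
&=I+\tau A+\tau^2V^HW^H+I-\tau A+\tau^2WV\\
&=2I+\tau^2\bigl(WV+V^HW^H\bigr).
\end{aligned}
\tag{12}
\]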
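Here
\[
\begin{aligned}
\Lambda_1&=\alpha(1-\alpha)A_1^2+B_1B_1^H,\\
\Lambda_i&=\begin{cases}
\beta(1-\beta)A_i^2 & (i=2,4,\dots,2hp),\\
\alpha(1-\alpha)A_i^2+B_{i-1}^HB_{i-1}+B_iB_i^H & (i=3,5,\dots,2hp-1),
\end{cases}\\
D_i&=B_iA_{i+1},\quad E_i=B_iB_{i+1}\quad (i=1,2,\dots,2hp),\qquad F_i=A_iB_i\quad (i=2,4,\dots,2hp),
\end{aligned}
\tag{14}
\]
and let
\[
U=\begin{pmatrix}
0 & B_1 & & & & & \\
 & \tfrac12A_2 & & & & & \\
 & C_3 & 0 & B_3 & & & \\
 & & & \ddots & & & \\
 & & & & C_{2hp-1} & 0 & B_{2hp-1} \\
 & & & & & & \tfrac12A_{2hp}
\end{pmatrix};
\tag{15}
\]
then we have
\[
WV+V^HW^H-2UU^H=\begin{pmatrix}
2\Lambda_1-Q_1 & & & & \\
 & 2\Lambda_2-Q_2 & & & \\
 & & \ddots & & \\
 & & & 2\Lambda_{2hp-1}-Q_{2hp-1} & \\
 & & & & 2\Lambda_{2hp}-Q_{2hp}
\end{pmatrix};
\tag{16}
\]
here
\[
\begin{aligned}
Q_1&=2B_1B_1^H,\qquad
Q_i=\begin{cases}
\tfrac12A_i^2 & (i=2,4,\dots,2hp),\\
2B_{i-1}^HB_{i-1}+2B_iB_i^H & (i=3,5,\dots,2hp-1),
\end{cases}\\
2\Lambda_i-Q_i&=\begin{cases}
2\alpha(1-\alpha)A_i^2 & (i=1,3,\dots,2hp-1),\\
\bigl[2\beta(1-\beta)-\tfrac12\bigr]A_i^2 & (i=2,4,\dots,2hp).
\end{cases}
\end{aligned}
\tag{17}
\]
With $\beta=1/2$ and $0\le\alpha\le 1$, every diagonal block in (16) is positive semidefinite, so $WV+V^HW^H-2UU^H$ is a positive semidefinite or positive definite matrix. Hence the matrix
\[
M^H+N=2I+\tau^2\bigl(WV+V^HW^H\bigr)=2I+\tau^2\bigl(WV+V^HW^H-2UU^H\bigr)+2\tau^2UU^H
\tag{18}
\]
is a Hermitian positive definite matrix.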
Therefore, $M-N=2\tau A$ is a P-normal splitting of the Hermitian positive definite matrix $2\tau A$ ($\tau>0$), so that $\rho(M^{-1}N)<1$ by Lemma 7; hence the iterative scheme (3) is convergent.
By the theorem, we know that the parallel algorithm is convergent if $A$ is a Hermitian positive definite matrix.
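The identity (11) used in the proof can be checked by direct expansion. The following short derivation uses only $A=W+V$ and takes $M=(I+\tau W)(I+\tau V)$, the ordering for which $M^{-1}N$ coincides with $B_\omega$:
\[
\begin{aligned}
(I+\tau W)(I+\tau V)&=I+\tau(W+V)+\tau^2WV=I+\tau A+\tau^2WV,\\
(I-\tau W)(I-\tau V)&=I-\tau(W+V)+\tau^2WV=I-\tau A+\tau^2WV,\\
(I+\tau W)(I+\tau V)-2\tau A&=(I-\tau W)(I-\tau V).
\end{aligned}
\]
Multiplying the last line from the left by $\bigl[(I+\tau W)(I+\tau V)\bigr]^{-1}=(I+\tau V)^{-1}(I+\tau W)^{-1}$ gives
\[
I-2\tau(I+\tau V)^{-1}(I+\tau W)^{-1}A=(I+\tau V)^{-1}(I+\tau W)^{-1}(I-\tau W)(I-\tau V),
\]
which is (11). Subtracting the second line from the first also shows that $M-N=2\tau A$, so the splitting considered in the proof is a splitting of $2\tau A$; for $\tau>0$ this matrix is positive definite exactly when $A$ is.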
Theorem 9.
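Let $A\in\mathbb{R}^{n\times n}$ be an M-matrix. If $0<\tau\le\min\{1/\alpha,\,1/\beta,\,1/(1-\alpha),\,1/(1-\beta)\}\cdot\min_i(1/a_{ii})$ for $i=1,2,\dots,2hp$, where $0<\alpha,\beta<1$ and $a_{ii}$ are the diagonal elements of $A$, then the iterative scheme (3) is convergent for every initial vector $x^{(0)}$.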
Proof.
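Since $M=(I+\tau W)(I+\tau V)$, $N=(I-\tau W)(I-\tau V)$, and
\[
I+\tau V=\begin{pmatrix}
I+(1-\alpha)\tau A_1 & & & & & & \\
\tau C_2 & I+(1-\beta)\tau A_2 & \tau B_2 & & & & \\
 & & I+(1-\alpha)\tau A_3 & & & & \\
 & & \tau C_4 & I+(1-\beta)\tau A_4 & \tau B_4 & & \\
 & & & & \ddots & & \\
 & & & & & I+(1-\alpha)\tau A_{2hp-1} & \\
 & & & & & \tau C_{2hp} & I+(1-\beta)\tau A_{2hp}
\end{pmatrix},
\tag{19}
\]
we have
\[
(I+\tau V)^{-1}=\begin{pmatrix}
(I+(1-\alpha)\tau A_1)^{-1} & & & & & & \\
Q_2 & (I+(1-\beta)\tau A_2)^{-1} & F_2 & & & & \\
 & & (I+(1-\alpha)\tau A_3)^{-1} & & & & \\
 & & Q_4 & (I+(1-\beta)\tau A_4)^{-1} & F_4 & & \\
 & & & & \ddots & & \\
 & & & & & (I+(1-\alpha)\tau A_{2hp-1})^{-1} & \\
 & & & & & Q_{2hp} & (I+(1-\beta)\tau A_{2hp})^{-1}
\end{pmatrix}.
\tag{20}
\]
Here
\[
\begin{aligned}
Q_i&=-\tau\bigl(I+(1-\beta)\tau A_i\bigr)^{-1}C_i\bigl(I+(1-\alpha)\tau A_{i-1}\bigr)^{-1} && (i=2,4,\dots,2hp),\\
F_i&=-\tau\bigl(I+(1-\beta)\tau A_i\bigr)^{-1}B_i\bigl(I+(1-\alpha)\tau A_{i+1}\bigr)^{-1} && (i=2,4,\dots,2hp-2).
\end{aligned}
\tag{21}
\]
Hence, we know that $I+(1-\alpha)\tau A_{i-1}$, $I+(1-\alpha)\tau A_{i+1}$, and $I+(1-\beta)\tau A_i$ ($i=1,2,\dots,2hp$) are all M-matrices by Lemma 6. Then $(I+(1-\alpha)\tau A_{i-1})^{-1}\ge 0$, $(I+(1-\alpha)\tau A_{i+1})^{-1}\ge 0$, $(I+(1-\beta)\tau A_i)^{-1}\ge 0$, $Q_i\ge 0$, and $F_i\ge 0$; we obtain $(I+\tau V)^{-1}\ge 0$. Similarly, we can obtain $(I+\tau W)^{-1}\ge 0$, and therefore $M^{-1}\ge 0$.
Since $0<\tau\le\min\{1/\alpha,\,1/\beta,\,1/(1-\alpha),\,1/(1-\beta)\}\cdot\min_i(1/a_{ii})$ for $i=1,2,\dots,2hp$, we have $I-\tau W\ge 0$ and $I-\tau V\ge 0$. That is, $N\ge 0$, and $M-N=2\tau A$ is a normal splitting of $2\tau A$. Since $A$ is an M-matrix, $A^{-1}\ge 0$ and hence $(2\tau A)^{-1}\ge 0$; therefore $\rho(M^{-1}N)<1$ by Lemma 5, and the iterative scheme (3) is convergent.
By the theorem, we know that the parallel algorithm is convergent if $A$ is an M-matrix and $0<\tau\le\min\{1/\alpha,\,1/\beta,\,1/(1-\alpha),\,1/(1-\beta)\}\cdot\min_i(1/a_{ii})$ for $i=1,2,\dots,2hp$.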
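The bound on $\tau$ in Theorem 9 is straightforward to evaluate. The sketch below uses a hypothetical function name `tau_upper_bound` and, purely for illustration, a list of positive diagonal values; it does not check that the matrix is actually an M-matrix.

```python
def tau_upper_bound(diag, alpha, beta):
    """Largest tau allowed by the sufficient condition of Theorem 9.

    diag  : diagonal elements a_ii of A (assumed positive)
    alpha : splitting parameter, 0 < alpha < 1
    beta  : splitting parameter, 0 < beta < 1
    """
    if not (0.0 < alpha < 1.0 and 0.0 < beta < 1.0):
        raise ValueError("alpha and beta must lie strictly between 0 and 1")
    if min(diag) <= 0.0:
        raise ValueError("diagonal elements must be positive")
    param_factor = min(1.0 / alpha, 1.0 / beta,
                       1.0 / (1.0 - alpha), 1.0 / (1.0 - beta))
    diag_factor = min(1.0 / d for d in diag)
    return param_factor * diag_factor


# Illustrative values only: alpha = beta = 1/2 and three diagonal entries.
bound = tau_upper_bound([15.1, 20.1, 25.1], alpha=0.5, beta=0.5)
```

For $\alpha=\beta=1/2$ the first factor equals 2, so the bound reduces to twice the reciprocal of the largest diagonal element. The condition is sufficient only, so convergence may still occur for larger $\tau$.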
5. Numerical Examples
We performed two numerical experiments on the HP rx2600 cluster. The results are shown as follows.
Example 1.
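Consider a banded linear system $AX=b$; here
\[
A=\begin{pmatrix}
A_1 & B_1 & & & \\
C_2 & A_2 & B_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & C_{m-1} & A_{m-1} & B_{m-1} \\
 & & & C_m & A_m
\end{pmatrix},\quad
A_i=\begin{pmatrix}
15.1 & -3.5 & -6.9 \\
-2.7 & 20.1 & -4.8 \\
-15.7 & -5.3 & 25.1
\end{pmatrix},\quad
B_i=C_i=\begin{pmatrix}
-3 & & \\
 & -2 & \\
 & & -4
\end{pmatrix},\quad
b_i=\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}.
\tag{22}
\]
Let the initial value be $x_i^{(0)}=(0,0,0)^T$ and $m=80000$. We apply the present algorithm with the optimal relaxation factor, the multisplitting method, and the BSOR method to the system on the HP rx2600 cluster. Here $P$ is the number of processors, $T$ is the run time (in seconds), $S$ is the speedup (run time on one processor divided by run time on $P$ processors), $L$ is the number of iterations, $E$ is the efficiency ($E=S/P$), and the error tolerance is $\varepsilon=1\times10^{-10}$. See Tables 1, 2, and 3 and Figures 1 and 2.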
Table 1: Results for Example 1 (the algorithm in this paper, τ=0.9, α=β=1/2).

| P | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| T | 2.9921 | 1.5179 | 0.8028 | 0.6492 |
| S | - | 1.9712 | 3.7271 | 4.6089 |
| E | - | 0.9856 | 0.9318 | 0.5761 |
| L | 39 | 39 | 39 | 39 |
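The speedup and efficiency rows follow directly from the run times through $S=T_1/T_P$ and $E=S/P$. As a small illustration, the sketch below recomputes them from the run times reported for the algorithm of this paper in Example 1; the function name `speedup_efficiency` is hypothetical, and the first entry is assumed to be the single-processor run.

```python
def speedup_efficiency(procs, times):
    """Return (P, S, E) tuples with S = T_1 / T_P and E = S / P.

    Assumes procs[0] == 1, so that times[0] is the one-processor run time.
    """
    t1 = times[0]
    rows = []
    for p, t in zip(procs, times):
        s = t1 / t
        rows.append((p, s, s / p))
    return rows


# Run times (seconds) reported for Example 1 with the algorithm in this paper.
procs = [1, 2, 4, 8]
times = [2.9921, 1.5179, 0.8028, 0.6492]
rows = speedup_efficiency(procs, times)
for p, s, e in rows:
    print(f"P={p}: S={s:.4f}, E={e:.4f}")
```

Up to rounding, the printed values for $P=2,4,8$ agree with the S and E entries reported for this algorithm.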
Table 2: Results for Example 1 (the multisplitting method).

| P | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| T | 7.9544 | 7.1352 | 3.3874 | 2.2426 |
| S | - | 1.1148 | 2.3482 | 3.5470 |
| E | - | 0.5745 | 0.5871 | 0.4434 |
| L | 160 | 224 | 224 | 224 |
Table 3: Results for Example 1 (BSOR method, ω=1.85).

| P | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| T | 17.0168 | 8.9928 | 4.6551 | 3.8031 |
| S | - | 1.8923 | 3.6555 | 4.4745 |
| E | - | 0.9461 | 0.9139 | 0.5593 |
| L | 495 | 532 | 532 | 532 |
Figure 1: The parallel speedup for Example 1.
Figure 2: The parallel efficiency for Example 1.
Example 2.
Consider an elliptic partial differential equation
\[
C_x\frac{\partial^2u}{\partial x^2}+C_y\frac{\partial^2u}{\partial y^2}+(C_1\sin 2\pi x+C_2)\frac{\partial u}{\partial x}+(D_1\sin 2\pi x+D_2)\frac{\partial u}{\partial x}+Eu=0,\quad 0\le x,y\le 1,
\tag{23}
\]
equipped with the boundary conditions $u|_{x=0}=u|_{x=1}=10+\cos\pi y$, $u|_{y=0}=u|_{y=1}=10+\cos\pi x$; here $C_x$, $C_y$, $C_1$, $C_2$, $D_1$, $D_2$, and $E$ are all constants.
We set $C_x=C_y=E=1$ and $C_1=C_2=D_1=D_2=0$. Using the finite difference method with step size $h=1/100$, we obtain two block-tridiagonal linear systems. Then, we apply the present algorithm with the optimal relaxation factor, the BSOR method, the PEk method, and the multisplitting algorithm to the systems on the HP rx2600 cluster. The numerical results are shown in Tables 4, 5, 6, and 7 and Figures 3 and 4.
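To show how such a discretization leads to a block-tridiagonal structure, the sketch below builds the blocks for the case $C_1=C_2=D_1=D_2=0$. It assumes standard second-order central differences and an ordering of the unknowns by grid lines, neither of which is stated in the text, and the function name `stencil_blocks` is hypothetical. Boundary values and the right-hand side are not handled.

```python
def stencil_blocks(n_cells, cx=1.0, cy=1.0, e=1.0):
    """Blocks of the 5-point discretization of Cx u_xx + Cy u_yy + E u = 0.

    Assumes central differences with h = 1 / n_cells and unknowns ordered
    grid line by grid line. Returns (a_blk, b_blk): the diagonal block A_i
    (tridiagonal) and the off-diagonal block B_i = C_i (diagonal).
    """
    m = n_cells - 1                      # interior points per grid line
    inv_h2 = float(n_cells * n_cells)    # 1 / h^2
    centre = -2.0 * (cx + cy) * inv_h2 + e
    a_blk = [[0.0] * m for _ in range(m)]
    b_blk = [[0.0] * m for _ in range(m)]
    for r in range(m):
        a_blk[r][r] = centre             # coefficient of u_{i,j}
        if r > 0:
            a_blk[r][r - 1] = cx * inv_h2    # neighbour in x
        if r < m - 1:
            a_blk[r][r + 1] = cx * inv_h2    # neighbour in x
        b_blk[r][r] = cy * inv_h2        # neighbour on the adjacent grid line
    return a_blk, b_blk


# Step size h = 1/100 and Cx = Cy = E = 1, as in Example 2.
a_blk, b_blk = stencil_blocks(100)
```

Under these assumptions each grid line contributes one block row with 99 unknowns, and coupling occurs only between neighbouring grid lines.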
Table 4: Results for Example 2 (the algorithm in this paper, τ=8.0, α=β=1/2).

| P | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| T | 11.3091 | 6.3632 | 5.5117 | 4.0755 | 3.3842 |
| S | - | 1.7773 | 2.0152 | 2.7749 | 3.3417 |
| E | - | 0.8886 | 0.5130 | 0.3469 | 0.2089 |
| L | 1177 | 1177 | 1177 | 1177 | 1186 |
| Δ | 0.8163e-10 | 0.8163e-10 | 0.8163e-10 | 0.8163e-10 | 0.8140e-10 |
Table 5: Results for Example 2 (the multisplitting method).

| P | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| T | 15.3559 | 17.3404 | 10.5411 | 7.8602 | 5.9567 |
| S | - | 0.8856 | 1.4568 | 1.9536 | 2.5779 |
| E | - | 0.4428 | 0.3642 | 0.2442 | 0.1611 |
| L | 310 | 824 | 975 | 1335 | 1556 |
Table 6: Results for Example 2 (PEk method, k=2.7).

| P | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| T | 14.6964 | 20.0765 | 9.9533 | 6.3488 | 4.8215 |
| S | - | 0.7320 | 1.4765 | 2.3148 | 3.0481 |
| E | - | 0.3660 | 0.3691 | 0.2894 | 0.1905 |
| L | 159 | 444 | 444 | 444 | 444 |
Table 7: Results for Example 2 (BSOR method).

| P | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| T | 27.7668 | 21.6576 | 14.3278 | 10.1420 | 8.5949 |
| S | - | 1.2821 | 1.9380 | 2.7378 | 3.2306 |
| E | - | 0.6410 | 0.4845 | 0.3422 | 0.2019 |
| L | 660 | 1039 | 1337 | 2101 | 2175 |
Figure 3: The parallel speedup for Example 2.
Figure 4: The parallel efficiency for Example 2.
6. Results Analysis
From Tables 1 to 7, we can draw the following conclusions.
(1) The numerical results of the parallel algorithm agree with the theoretical analysis. The conditions in the theorems are only sufficient conditions.
(2) The numerical results show that the parallel algorithm has good parallelism.
(3) For Examples 1 and 2, the efficiency of the algorithm is better than that of the multisplitting method, and its parallel speedup is as good as that of the BSOR method. For Example 2, the efficiency of the algorithm is also better than that of the PEk method.
(4) The parallel algorithm is easy to implement on a parallel computer and is more flexible and simpler in practice than the algorithm in [1].
7. Conclusions
An efficient parallel iterative method on a distributed-memory multicomputer has been presented for solving large banded linear systems. The decomposition of the coefficient matrix is exploited in choosing W and V so as to save computational cost, and the storage strategy saves memory. Each iteration requires only two communications between adjacent processors. Theoretical analysis and experiments show that the algorithm has good parallelism and high efficiency, and the numerical results are consistent with the convergence theorems: when the coefficient matrix is a Hermitian positive definite matrix or an M-matrix, the parallel algorithm converges provided that the stated conditions hold. The algorithm is more efficient than the multisplitting method.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research was supported by the National Natural Science Foundation of China under Grant nos. 11002117 and 11302173 and Xianyang Normal University Research Foundation under Grant nos. 09XSYK209 and 09XSYK204.
References
[1] Zhang B., Gu T., Mo Z. 1999. National Defence Industry Press.
[2] Lü Q., Ye T. An improve parallel algorithm for solving linear equations involving block tridiagonal coefficient matrix. 1996; 4(2): 314–317.
[3] Wu J., Song J., Zhang W., Li X. Parallel incomplete factorization preconditioning of block tridiagonal linear systems with 2-D domain decomposition. 2009; 26(2): 191–199.
[4] Duan Z., Yang Y., Lv Q., Ma X. Parallel strategy for solving block-tridiagonal linear systems. 2011; 47(13): 46–49.
[5] Fan Y. 2009. Xi'an, China: Northwestern Polytechnical University Press.
[6] Luo Z.-G., Li X.-M. Parallel algorithm for block-tridiagonal linear systems on distributed-memory multicomputers. 2000; 23(10): 1028–1034.
[7] El-Sayed S. M. A direct method for solving circulant tridiagonal block systems of linear equations. 2005; 165(1): 23–30. doi:10.1016/j.amc.2004.06.041. MR2137022.
[8] Cui X., Lü Q. A parallel algorithm for block-tridiagonal linear systems. 2006; 173(2): 1107–1114. doi:10.1016/j.amc.2005.04.037. MR2207999.
[9] Varga R. S. 1962. Englewood Cliffs, NJ, USA: Prentice-Hall. MR0158502.
[10] Hu J. 1999. Beijing, China: Science Press.
[11] Frommer A., Szyld D. B. Weighted max norms, splittings, and overlapping additive Schwarz iterations. 1999; 83(2): 259–278. doi:10.1007/s002110050449. MR1712686.
[12] Feng J., Che G., Nie Y. 2002. Beijing, China: Science Press.
[13] Bjørstad P., Luskin M. 2000. New York, NY, USA: Springer. doi:10.1007/978-1-4612-1176-1. MR1838275.
[14] Reed W. H., Hill T. R. Triangle mesh methods for the Neutron transport equation. 1973. LA-UR-73-479. Los Alamos Scientific Laboratory.
[15] Cheng Y. P. 2002. Northwestern Polytechnical University Press.