Parallel preconditioned conjugate gradient square method based on normalized approximate inverses

A new class of normalized explicit approximate inverse matrix techniques, based on normalized approximate factorization procedures, for solving sparse linear systems resulting from the finite difference discretization of partial differential equations in three space variables are introduced. A new parallel normalized explicit preconditioned conjugate gradient square method in conjunction with normalized approximate inverse matrix techniques for solving efficiently sparse linear systems on distributed memory systems, using Message Passing Interface (MPI) communication library, is also presented along with theoretical estimates on speedups and efficiency. The implementation and performance on a distributed memory MIMD machine, using Message Passing Interface (MPI) is also investigated. Applications on characteristic initial/boundary value problems in three dimensions are discussed and numerical results are given.


Introduction
The solution of sparse linear systems is of central importance to scientific and engineering computations and is the most time-consuming part, cf.[3,4,13].The solution of sparse linear systems has been obtained by direct or iterative methods, cf.[2][3][4][5][6]9,13].
Let us consider the sparse linear system resulting from Finite Difference (FD) discretization of three dimensional boundary value problems, i.e.
where A is a sparse, diagonally dominant, positive definite, symmetric (n × n) matrix of the following form: ( and m, p are the semi-bandwidths, while u is the FD solution at the nodal points and s is a vector, of which the components result from a combination of source terms and imposed boundary conditions.For symmetric positive definite problems, the rate of convergence of the conjugate gradient method depends on the distribution of the eigenvalues of the coefficient matrix.Hence the preconditioned matrix will have a smaller spectral condition number, and the eigenvalues clustered around one, cf.[2,6,13].The preconditioned form of the linear system Eq.( 2) is where M is a preconditioner.The preconditioner M has to satisfy the following conditions: (i) MA should have a "clustered" spectrum, (ii) M can be efficiently computed in parallel and (iii) finally "M × vector" should be fast to compute in parallel, cf.[1,3,[6][7][8]13].The effectiveness of the explicit approximate inverse preconditioning methods is related to the fact that the approximate inverses exhibit a similar "fuzzy" structure as the coefficient matrix and are close approximant to the coefficient matrix, cf.[6][7][8].
The performance and applicability of the new algorithmic schemes for solving 3D elliptic and parabolic P.D.E's is discussed and numerical results are given.

Normalized optimized approximate inverses
In this section we present a class of normalized approximate inverses based on normalized factorization procedures.Let us now assume the normalized approximate factorization, cf.[8], such that: where r 1 , r 2 are the "fill-in" parameters, D r1,r2 is a diagonal matrix and T r1,r2 is a sparse upper triangular matrix of the same profile as the coefficient matrix A.
The elements of the normalized approximate inverse, by retaining δl elements in the lower and upper part of the inverse, cf.[6][7][8], can be determined by solving recursively the following systems: The Normalized Optimized Approximate Inverse Matrix -3D algorithmic procedure (henceforth called the NOROAIM-3D algorithm) for computing the elements of the approximate inverse, using a "fish-bone" pattern, can be expressed by the following compact form: else if j > npr2 and j nmr1 then call mw (n, δl, i, m call mw (n, δl, i, m call mw (n, δl, i, j call mw (n, δl, i, k call mw (n, δl, i, j call mw (n, δl, i, k call mw (n, δl, i, j call mw (n, δl, i, k call mw (n, δl, i, j call mw (n, δl, i, k The procedure mw(n, δl, s, q, x, y), cf.[6], can then be described as follows: procedure mw(n, δl, s, q, x, y) The computational work of the NOROAIM-3D algorithm is O[nδl(r 1 + r 2 + 1)] multiplicative operations, while the storage of the approximate inverse is only n × (2δl − 1)-vector spaces, using the optimized storage scheme as described by the above given procedure mw(n, δl, s, q, x, y), cf.[6].This optimized form is particularly effective for solving "narrow-banded" sparse systems of very large order, i.e. δe n/2, cf.[6][7][8].The parallel construction of similar approximate inverses has been studied and implemented in [7], and is under further investigation.
It should be noted that this class of normalized approximate inverse includes various families of approximate inverses according to the requirements of accuracy, storage and computational work, as can be seen by the following diagrammatic relation: class I class II where the entries of the class I inverse have been retained after the computation of the exact inverse (r , while the entries of the class II inverse have been computed and retained during the computational procedure of the (approximate) inverse (r . The entries of the class III inverse have been retained after the computation of the approximate inverse (r 1 m − 1, r 2 p − 1).Hence an approximate inverse is derived in which both the sparseness of the coefficient matrix is relatively retained and storage requirements are substantially reduced.The class IV of approximate inverse retains only the diagonal elements, i.e. δl = 1 hence M δ1 r1,r2 ≡ I, resulting in a fast inverse algorithm.It is known that the larger in magnitude elements of the inverse matrices, in almost every case, are clustered around the diagonals at distances ρ 1 m and ρ 2 p (with ρ 1 = 1, 2, . . ., m − 1 and ρ 2 = 1, 2, . . ., p − 1) from the main diagonal in a "recurring wave"-like pattern, cf.[6][7][8].It is reasonable to assume that the value of the retention parameter δl can be chosen as multiples of m and p.
It should be noted that, if w i = 0, cf.Eq. ( 2), then the algorithm reduces to one for solving sparse linear systems, which are encountered in solving 2D boundary value problems by the finite difference method, cf.[5].When w i = 0 and v i = 0, cf.Eq. ( 2), then the algorithm is reduced to one for solving tri-diagonal linear systems, which are encountered in solving two-point boundary value problems.

Parallel normalized explicit preconditioned conjugate gradient methods
In this section we present a class of parallel normalized explicit preconditioned conjugate gradient schemes, based on the derived normalized approximate inverse, for solving sparse linear systems on distributed MIMD parallel systems.
Let us now consider that the coefficient matrix A, the normalized optimized approximate inverse and all the vectors are distributed in a block-row distribution.In this distribution we partition the matrices and vectors into blocks of consecutive rows, and assign a panel of elements to each process.The processors operate with local data, and need synchronization points before computations of inner products and matrix × vector operations.
Let local n be the number of rows allocated to each processor (i.e.local n:=n/no proc).Then, the Parallel form of the Normalized Explicit Preconditioned Conjugate Gradient Square (PNEPCGS) method can be expressed by the following algorithmic procedure: Let u 0 be an arbitrary initial approximation to the solution vector u.Then, for j = (myrank*local n + 1) to (myrank* local n + local n) (myrank is the rank of each process) Gather local p 0 to root process, compute their sum and scatter it to all processes Then, for i = 0, 1, . . ., (until convergence) compute in parallel the vectors u i+1 , r i+1 , σ i+1 and the scalar quantities α i , β i+1 as follows: gather distributed σ i onto each process for j = (myrank*local n + 1) to (myrank* local n + local n) else gather distributed q i onto each process for j = (myrank*local n + 1) to (myrank* local n + local n) for j = (myrank*local n + 1) to (myrank* local n + local n) Gather local p 0 to root process, compute their sum and scatter it to all processes gather distributed σ i onto each process for j = (myrank*local n + 1) to (myrank* local n + local n) else gather distributed q i onto each process for j = (myrank*local n + 1) to (myrank* local n + local n) for j = (myrank*local n + 1) to (myrank* local n + local n) gather local p i+1 to root process, compute their sum and scatter it to all processes The computational complexity of the PNEPCGS method is ≈ O[(4δl + 27) local n mults + 8 local n adds]ν operations, while the total communication cost, using the butterfly technique for collective communications, is , where ν denotes the number of iterations required for the convergence to a certain level of accuracy, t s the message latency, and t w the time necessary for a word to be sent.
Thus, the speedup and efficiency of the PNEPCGS method can be defined as follows: and where t m denotes the computational time of one multiplication.Hence, for δl → ∞, S p →no proc and E p → 1, that is optimum.The total number of arithmetic operations required for the parallel computation of the solution u ν is: while the total communication cost is: The implementation of the two most dominating operations of the PNEPCGS iterative method, i.e. multiplication of the normalized optimized approximate inverse with a vector, cf.Eqs (29)-(30), and the inner product of two vectors, cf.Eq. ( 31), is given using the MPI communication library.For communication operations, the collective communication routines MPI Allreduce and MPI Allgather were used for sending and receiving data among distributed processes, cf.[11,12].
/* perform the multiplication of the normalized optimized approximate inverse with a vector, i.e.
× q */ Notation: dl is the "retention" parameter δl if (dl==1) for (i=myrank*local n + 1; i<=myrank*local n+local n; i++) It should be noted that the elements of the vectors of each process are stored in their original sequential position.Thus, a shift operation is required for the MPI Allgather routine in order to place the elements in the beginning of each vector.

Numerical results
In this section we examine the applicability and effectiveness of the proposed schemes for solving characteristic three dimensional boundary value and initial value problems.

Model Problem I:
Let us consider a 3D-boundary value problem with Dirichlet boundary conditions: where R is the unit cube and ∂R denotes the boundary of R. The right hand side vector of the system Eq.( 1) was computed as the product of the matrix A by the solution vector, with its components equal to unity.The iterative process was terminated when r i ∞ 10 −5 .Numerical results for the new proposed parallel schemes were obtain using a cluster of ten Eq. ( 10) AMD Athlon 1900 processors running at 1.6 GHz, connected in an 100 Mbps Ethernet network, using MPI.
The speedups and the number of iterations of the PNEPCGS method for several values of the "retention" parameter δl with n = 8000, m = 21, p = 401 and r 1 = r 2 = 2 are given in Table 1.In Figs 1, 2 and 3 the speedups and processors allocated for several values of the "retention" parameter δl, the speedups versus the "retention" parameter δl for several numbers of processors and the parallel efficiency for several values of the "retention" parameter δl are presented respectively for the PNEPCGS method with n = 8000, m = 21, p = 401 and r 1 = r 2 = 2.In Fig. 4 the performance evaluation measurements of the PNEPCGS method are given with n = 8000, m = 21, p = 401 and It is observed by the experimental results that the communication cost is responsible for the performance of the PNEPCGS method for small values of parameter δl, in contrast with large values of δl where speedups and efficiency tend to become optimum.

Model Problem II:
Let us consider the following Parabolic P.D.E. in three space variables:

Number of Processors
with initial conditions: and boundary conditions: where R is the unit cube, g(x, y, z) is given sufficiently smooth function and ∆ is the Laplace operator.Let us assume that a network of mesh-sizes h x , h y , h z and ∆t in the X, Y , Z and T directions respectively is superimposed over R.
Then, the time-dependent functions were approximated by certain schemes, cf.[6,7], which can be written in the following parametric form: where the value of the parameter θ denotes the various "time" -schemes.
In the case of θ = 1 we have the "time-implicit" scheme, using backward time differences.This scheme is unconditionally valid, i.e. stable and convergent, and independent of the mesh-ratio.A disadvantage of this scheme

Table 1
Speedups and processors allocated of the PNEPCGS method for several values of δl, with n = 8000, m = 21 and p = 401

Table 2
The convergence behavior of the NEPCGS and NEPBICG-STAB method with r 1 = 2, r 2 = 2 and θ = 1 for various values of the parameter δl and the time step ∆t **The number of outer iterations was 71 iterations.***The number of outer iterations was 94 iterations.