Parallel RFSAI-BFGS Preconditioners for Large Symmetric Eigenproblems

. We propose a parallel preconditioner for the Newton method in the computation of the leftmost eigenpairs of large and sparse symmetric positive definite matrices. A sequence of preconditioners starting from an enhanced approximate inverse RFSAI (Bergamaschi and Mart´ınez, 2012) and enriched by a BFGS-like update formula is proposed to accelerate the preconditioned conjugate gradient solution of the linearized Newton system to solve 𝐴 u = 𝑞( u ) u , 𝑞( u ) being the Rayleigh quotient. In a previous work (Bergamaschi and Mart´ınez, 2013) the sequence of preconditioned Jacobians is proven to remain close to the identity matrix if the initial preconditioned Jacobian is so. Numerical results onto matrices arising from various realistic problems with size up to 1.5 million unknowns account for the efficiency and the scalability of the proposed low rank update of the RFSAI preconditioner. The overall RFSAI-BFGS preconditioned Newton algorithm has shown comparable efficiencies with a well-established eigenvalue solver on all the test problems.


Introduction
The computation of the  ≪  leftmost eigenpairs of a symmetric positive definite (SPD) matrix  is a common task in many scientific applications.Typical examples are offered by the vibrational analysis of mechanical structures [1]: the spectral superposition approach for the solution of large sets of 1st order linear differential equations [2] and the approximation of the generalized inverse of the graph Laplacian [3], to mention a few.
In this paper, we propose to use the Newton method in the unit sphere [4,5] or Newton-Grassman method, which constructs a sequence of vectors {u  } by solving the linear systems ensuring that the correction s be orthogonal to u  .Then the next approximation is set as u +1 = t‖t‖ −1 , where t = u  + s.
Linear system (3) is shown to be better conditioned than the one with  −   .For SPD matrices, this scheme is shown to converge cubically to the wanted eigenvector, provided that the system (3) is solved exactly.The same linear system represents the correction equation in the well-known Jacobi-Davidson method [6], which in its turn can be viewed as an accelerated inexact Newton method [7].When  is SPD and the leftmost eigenpairs are being sought, it has been proved in [8] that the preconditioned conjugate gradient (PCG) method can be employed in the solution of the correction equation.

Journal of Applied Mathematics
Starting from the findings in [9][10][11] the main contribution of this paper is the development of a sequence of preconditioners {  } for the PCG solution of the Newton correction equation (3), based on the BFGS update of a given parallel initial preconditioner for the coefficient matrix .A similar approach has been used in [12] where a rank-two modification of a given preconditioner is used to accelerated MINRES in the framework of the inexact Rayleigh quotient iteration.Updating (cheaply) a given preconditioner to obtain a new preconditioner for a slightly changed linear system is also common in optimization problems.See for example, [13][14][15].
The initial Newton vector is obtained after a small number of iterations of a conjugate gradient procedure for the minimization of the Rayleigh quotient (DACG, [16]).As the initial preconditioner we choose RFSAI [17], which is built by recursively applying the (factorized sparse approximate inverse) FSAI preconditioner developed in [18].We elected a factorized approximate inverse preconditioner (AIP) since it is more naturally parallelizable than preconditioners based on ILU factorizations.Moreover, factorized variants of AIP provide better approximations to  −1 for the same amount of storage than nonfactorized ones, because they can express denser matrices than the total number of nonzeros in their factors [19].The RFSAI formulation enhances this characteristic, since the final preconditioner is implicitly built by a product of four or more triangular factors.
The combined DACG-Newton algorithm is used in the approximation of  = 10 eigenpairs of a number of matrices arising from various realistic applications of size up to 1.5 × 10 6 and number of nonzeros up to 6 × 10 7 .Numerical results show that, in the solution of the correction equation, the PCG method preconditioned by BFGS displays much faster convergence than the same method when the preconditioner is kept fixed during the Newton process, in every test case.Moreover, the proposed approach is shown to outperform the pure DACG method.

BFGS Sequence of Preconditioners
2.1.Choice of the Initial Preconditioner.Following the developments in [17], we propose an implicit enlargement of the sparsity pattern using a banded target matrix ; the lower factor of the FSAI preconditioner is obtained by minimizing ‖ − ‖  over the set of matrices  having a fixed sparsity pattern.Denoting with  out the result of this minimization, we compute explicitly the preconditioned matrix  =  out   out and then evaluate a second FSAI factor  in for .Thus, the final preconditioner can be written as the product   out   in  in  out .This RFSAI-recursive FSAI-procedure can be iterated a number of times to yield a preconditioner written as a product of several triangular factors.This preconditioner has shown its efficiency in the acceleration of the PCG onto realistic geomechanical problems of very large size [17].
We denote FSAIout as the procedure which computes the  out factor by minimizing ‖ − ‖, where  is an arbitrary banded matrix.FSAIout depends on the nband parameter in addition to the usual FSAI parameters:  out prefiltration threshold,  out power of Ã defining the sparsity pattern, and  out postfiltration parameter.
The second preconditioner factor,  in , is the result of the FSAI procedure applied to the whole product matrix  out   out , with parameters  in ,  in , and  in .The steps to obtain the final RFSAI preconditioner are summarized in Algorithm 1.
Following the idea described in [10], we propose a sequence of preconditioners for the Newton systems using the BFGS rank-two update.To precondition the initial Newton system as follows: where We choose to use a projected RFSAI preconditioner,  0 = (− u 0 u ⊤ 0 ) P0 ( − u 0 u ⊤ 0 ) with P0 =    and  =  in  out .

Update of Initial Preconditioner by BFGS-Like Rank-Two
Corrections.A sequence of projected preconditioners for the subsequent linear systems  +1 s +1 = −r +1 may be defined by where s ≡ s  the solution of the previous correction equation and r ≡ r  .In view of the cubic convergence of the Newton process, we used the residual −r  instead of y  = r +1 − r  .Theorem 3 of Section 3 will state that the preconditioner defined in (6) is SPD if P is so.

Theoretical Analysis of the Preconditioner
3.1.Finding the Smallest Eigenpair.We recall in this section the main theoretical results regarding the sequence of preconditioners previously defined.Differently from the classical papers which study BFGS convergence properties, here our Jacobian matrix (u) = ( − uu ⊤ )( − (u))( − uu ⊤ ) is singular whatever u, in particular it is singular when u is equal to the exact eigenvector.The theoretical analysis of the "goodness" of the preconditioner will be, therefore, completely different, though obtaining similar results, than that proposed, for example, in [10,20].In the following developments we will indicate as k 1 the exact eigenvector corresponding to the smallest exact eigenvalue  1 .The error vector at step  is denoted by e  = u  − k 1 , while the error in the eigenvalue approximation is   =   −  1 (>0 Compute the second lower triangular factor:  in = FSAI ( (1) ,  in ,  in ,  in ) algorithm is run within the subspace of vectors orthogonal to u  (in fact also r ⊤ u  = 0).Thus, notion of positive definiteness, eigenvalue distribution, condition number, norms, and so forth, apply as usual but with respect to matrices restricted to this subspace.
The following lemma will bound the extremal eigenvalues of   in the subspace orthogonal to u  .Lemma 2. There is a positive number  such that if ‖e  ‖ <  then is SPD in the subspace orthogonal to u  .Moreover the following bounds hold: for every unit norm vector z orthogonal to u  .
The previous lemma allows us to prove that the preconditioner defined in ( 6) is SPD, as stated in the following theorem.(6) is SPD and hence   is SPD in the subspace orthogonal to u  .Proof.See [21].

Theorem 3. If the correction equation is solved exactly, then any matrix P𝑘 defined by
Let us define the difference between the preconditioned Jacobian and the identity matrix as Since by definition we have   u  = 0 then u  is the eigenvector of   corresponding to the zero eigenvalue.Hence, since also  1/2  u  = 0 the error   can also be defined as The following technical lemma will bound the norm of P in terms of that of   .Being P SPD we can define its norm in the space orthogonal to u  as Lemma 4.There is a positive number Proof.See [21].
The next lemma will relate the norms of the difference s and of the error vector e  .
Before stating Theorem 7 we need to state as a last preliminary result that also the difference between the square root of two consecutive Jacobians is bounded in terms of the norm of the error vector.
for a suitable constant  3 .
The following theorem will state the main theoretical result of this Section: the so called bounded deterioration [22] of the preconditioner at step  + 1 with respect to that of step .In other words it can be proved that the distance of the preconditioned matrix from the identity matrix at step  + 1 is less or equal than that at step  plus a constant that may be small as desired, depending on the closeness of u 0 to the exact eigenvector.We report also the proof of this theorem, which is taken from [21].
where we set w = From (20), we derive a bound for ‖  ‖.

Computing Several Eigenpairs.
When seeking an eigenvalue different from  1 , say   , the Jacobian matrix changes as where is the matrix whose first  columns are the previously computed eigenvectors.Analogously, also the preconditioner must be chosen orthogonal to  as The theoretical analysis developed in Section 3.1 applies with small technical variants also in this case since it is readily proved that  1/2 +1  +1  1/2 +1 =  1/2 +1 P+1  1/2 +1 .The most significant changes regard the definition of   =   −  , e  = u  − k  , and the statement of Lemma 2 (and the proof of Lemma 4 that uses its results); namely, the bound for the smallest eigenvalue of   which in the general case becomes for every unit norm vector z such that  ⊤ z = 0.

Choosing an Initial Eigenvector Guess.
As mentioned in Section 1, another important issue in the efficiency of the Newton approach for eigenvalue computation is represented by the appropriate choice of the initial guess.We propose here to perform some preliminary iterations using the DACG eigensolver [16,23,24], which is based on the preconditioned conjugate gradient (nonlinear) minimization of the Rayleigh quotient.This method has proven very robust, and not particularly sensitive to the initial vector, in the computation of a few eigenpairs of large SPD matrices.Recently in [25], the DACG method has been compared with the Jacobi-Davidson method in parallel environments, showing similar performances when seeking a small number of eigenpairs (1 ÷ 5).

Implementation of the BFGS Preconditioner
Update.At a certain nonlinear iteration level, , we need to compute c =   g  , where g  is the residual of the linear system at iteration .Let us suppose we compute an initial preconditioner  0 .Then, at the initial nonlinear iteration  = 0, we simply have c =  0 z  .At step  + 1 the preconditioner P+1 is defined recursively by ( 6) while  +1 using ( 24) can be written as To compute vector c first we observe that g  is orthogonal to  so there is no need to apply matrix  −  ⊤ on the right of (26).Application of preconditioner P+1 to the vector g  can be performed at the price of 2 dot products and 2 daxpys as described in Algorithm 2. The scalar products   = s ⊤  r  , which appear at the denominator of P+1 , can be computed once and for all before starting the solution of the ( + 1)th linear system.Last, the obtained vector c must be orthogonalized against the columns of  by a classical Gram-Schimdt procedure.

PCG Solution of the Correction Equation.
As a Krylov subspace solver for the correction equation, we chose the preconditioned conjugate gradient (PCG) method since the Jacobian   has been shown to be SPD in the subspace orthogonal to u  .Regarding the implementation of PCG, we mainly refer to the work [8], where the author shows that it is possible to solve the linear system in the subspace orthogonal to u  and hence, the projection step needed in the application of   can be skipped.Moreover, we adopted the exit strategy for the linear system solution described in the previous paper, which allows for stopping the PCG iteration, in addition to the classical exit test based on a tolerance on the relative residual and on the maximum number of iterations, whenever the current solution x  satisfies or when the decrease of ‖r , ‖ is slower than the decrease of ‖g  ‖, because in this case further iterating does not improve the accuracy of the eigenvector.Note that this dynamic exit strategy implicitly defines an inexact Newton method since the correction equation is not solved "exactly", that is, up to machine precision.We have implemented the PCG method as described in Algorithm 5.1 of [8] with the obvious difference in the application of the preconditioner which in this case is accomplished as described in Algorithm 2.

Implementation of the DACG-Newton Method.
The BFGS preconditioner defined in Algorithm 2 suffers from two main drawbacks, namely, increasing costs of memory for storing s and r and the increasing cost of preconditioner application with the iteration index .Note that these drawbacks are common to many iterative schemes, such as sparse (Limited Memory) Broyden implementations [26], GMRES [27], and Arnoldi method for eigenvalue problems [28].If the number of nonlinear iterations is high the application of BFGS update may be too costly as compared with the expected reduction in the iteration number.To this aim we define  max the maximum number of rank-two corrections we allow.When the nonlinear iteration counter  is larger than  max , the vectors s  , r  ,  =  −  max are substituted with the last computed s  , r  .Vectors {s  , r  } are stored in a matrix  with  rows and 2 ×  max columns.
The implementation of our DACG-Newton method for computing the leftmost eigenpairs of large SPD matrices is described in Algorithm 3.

Parallel Implementation.
We have developed a parallel code which implements the construction and application inside parallel PCG, of the RFSAI algorithm along with the BFGS updates.The resulting program is written in Fortran 90  and exploits the MPI library for exchanging data among the processors.We use a block row distribution of all matrices, that is, with complete rows assigned to different processors.All the matrices are stored in static data structures in CSR format.
Parallelization of the FSAI preconditioner, which is the basis of the parallel RFSAI construction, has been performed and tested for example, in [29][30][31] where prefiltration and postfiltration have been implemented together with a priori sparsity pattern based on nonzeros of   with  ≤ 4.
The code makes also use of an optimized parallel matrixvector product which has been developed in [32] showing its effectiveness up to 1024 processors.

Test Problems.
We report the results of our experiments with the RFSAI preconditioner in the solution of a set of problems of large size.
The test matrices are realistic examples arising from 3D FE discretization of fluid flow and geomechanical models.Described in detail as follows: (1) FLOW3D-663 arises from tetrahedral FE discretization of an elliptic PDE describing fluid flow in porous media.
(2) EMILIA-923 arises from the regional geomechanical model of a deep hydrocarbon reservoir.It is obtained discretizing the structural problem with tetrahedral finite elements.Due to the complex geometry of the geological formation, it was not possible to obtain a computational grid characterized by regularly shaped elements.
(3) LONG-1103 is the structural SPD block obtained from a 3D coupled consolidation problem of a geological formation, discretized with tetrahedral finite elements on a higly irregular computational grid.
(4) GEO-1438 arises from a regional geomechanical model of the sedimentary basin underlying the Venice lagoon discretized by a linear FE with randomly heterogeneous properties.The computational domain is a box with an areal extent of 50 × 50 km and 10 km deep consisting of regularly shaped tetrahedral finite elements.
Matrices 2 to 4 are publicly available in the University of Florida Sparse Matrix Collection at http://www.cise.ufl.edu/research/sparse/matrices/. The size and number of nonzero elements for each matrix is provided in Table 1.
All tests are performed on the IBM BlueGene\Q cluster at the CINECA Centre for HPC, equipped with IBM PowerA2 processors at 1.6 GHz with 10240 nodes, 163 840 computing cores, and 164 Tbytes of internal network RAM.The Fortran 90 code is compiled with the native IBM xlf compiler using the -O3 -qarch=qp -qtune=qp options.Parameters for the PCG iteration  PCG = 10 −2 , ITMAX PCG = 50

Results on a Fixed Number of Processors.
We now compare the performance of the DACG-Newton algorithm for various  max values.We tested the proposed algorithm in the computation of the 10 smallest eigenpairs of the afore mentioned large matrices arising from various realistic applications.In all the runs, unless differently specified, we selected the values of the parameters as reported in Table 2.
We will also compute the fill-in  of the initial preconditioner defined as In Table 3, we report the parameters for the RFSAI preconditioner which we selected for the four-test problems.
We first analyze the influence of parameter  max in the convergence of the Newton-DACG method.To this aim we provide, in Table 4 the values of outer iteration number, total matrix-vector products (MVP) and CPU time in evaluating 10 eigenpairs of matrix EMILIA-923 for different values of  max .From the table, we notice how iteration number and CPU time monotonically decreases with increasing  max .Moreover, for  max ≤ 2, the iterative process reached the maximum number of iterations before fulfilling the exit test on the relative residual.Another evidence of the efficiency of the BFGS update is given in Figure 1, where we plot the relative residual norm ‖r , ‖/(x   x  ) versus number of linear iterations in converging to eigenpair number 3 of matrix FLOW3D-663.Note that, with  max = 0, that is, using the constant RFSAI preconditioner, the Newton phase could not reach the 10 −8 accuracy within the prescribed ITMAX = 50 maximum number of outer iterations.
In Table 5, we report the number of matrix vector products and CPU times for Newton-DACG as compared with "pure" DACG, that is, with DACG run with a final tolerance of 10 −8 .In every test problem DACG-Newton reveals superior to DACG.
In particular, for problem FLOW3D-663, the improvement provided by DACG-Newton is impressive.The DACG method, if run until the final tolerance  = 10 −8 , is very slow due to the very small relative separation between consecutive eigenvalues   = ( +1 −   )/  , which drives convergence of  this PCG-like solver [16,25].In fact the ratio   also influences the convergence of Newton's method being related to the condition number of the Jacobian.However, the (RFSAI-BFGS preconditioned) DACG-Newton algorithm seems less sensitive to this ill-conditioning.

Parallel Results and Scalability.
We now analyze the parallel efficiency of the previously described eigensolver preconditioned by RFSAI-BFGS.We will use a strong scaling measure to see how CPU times vary with the number of processors for a fixed total problem size.Denoting with   the total CPU elapsed times expressed in seconds on  processors, we define relative measures of the parallel efficiency and speedup of our code.We define  ()  as the pseudo speedup computed with respect to the smallest number of processors () used to solve a given problem and  ()  the corresponding efficiency: The results obtained with the four test matrices are presented in Tables 6, 7, 8, and 9.As a general comment, we may observe that the overall code scales well for the three largest matrices, up to 256 processors (512 for problem GEO-1438).The parallel pseudo efficiency, as expected, decreases with growing number of processors but it is roughly 50% for the largest number of processors employed.
Regarding the FLOW3D-663 problem, which is characterized by a relatively small dimension and high sparsity (≈15 nonzeros per row), the parallel efficiency starts to worsen with  = 128 processors.

Conclusion
We have proposed a parallel RFSAI-BFGS preconditioner for the acceleration of the PCG method in the approximate solution of the linearized Newton systems in the evaluation of a number of the leftmost eigenpairs of large SPD matrices.We have shown that updating an initial preconditioner (here RFSAI) by a low-rank correction using the BFGS formula, produces significant savings in the total number of iterations of the inner solver (the PCG method).The Newton's algorithm, preconditioner by RFSAI-BFGS, with the aid of a number of initial iterations of DACG to obtain a good initial eigenvector guess, has been completely parallelized and run on the new IBM BlueGene\Q, located at CINECA, Bologna, Italy.The scalability results are very satisfactory, as compared to the size and nonzeros of the problems selected.
In particular, for the largest problem, an efficiency of 72% is obtained with 256 processors.Future research is aimed at investigating the relations between the proposed accelerated Newton method and the well-established Jacobi-Davidson method.

Figure 1 :
Figure 1: Convergence profile of the relative residual norm versus cumulative inner iterations for eigenvalue number 3, matrix FLOW3D-663.
).It is easily proved that there is a constant  independent of  such that At first sight the Jacobian matrix in the correction equation is singular, but this does not matter since the PCG INPUT: nband,  out ,  out ,  out ,  in ,  in ,  in Compute the first lower triangular factor:  out = FSAIout (, nband,  out ,  out ,  out ) Compute the product: (1)=  out   out After defining  =   P+1  1/2 +1 +  1/2 +1 P+1   +   P+1   , we can write there is a positive number  s.t.if ‖e 0 ‖ <  then      +1     ≤           +  √     e      (16)for a suitable constant .Proof.
and using the bound (21) we finally have      −1     ≤           +  4 √     e Matrix ; number of sought eigenpairs  eig ; tolerance and maximum number of its for the outer iteration: , ITMAX; tolerance for the initial eigenvector guess  DACG ; tolerance and maximum number of its for the inner iteration:  PCG .ITMAX PCG ; parameters for the RFSAI preconditioner,  out ,  out ,  out , nband, for the 1st FSAI factor and,  in ,  in ,  in for the 2nd factor; maximum allowed rank-two update in the BFGS preconditioner:  max .Q := [ ].Compute P0 , an RFSAI preconditioner for , for  := 1 to  eig (1) Choose x 0 such that Q⊤ x 0 = 0;(2) Compute u 0 , an approximation to v  by the DACG procedure with initial vector x 0 , preconditioner P0 and tolerance  DACG ; (3)  := 0,   := u ⊤  u  ; (4) while     u  −   u      >   and  < IMAX do (1)  := [ Q u  ]. (2) Solve   s  = −r  for s  ⊥  by the PCG method with preconditioner   and tolerance  PCG .

Table 1 :
Size  and number of nonzeros (nnz) of the test matrices.

Table 2 :
Default values of parameters.

Table 3 :
RFSAI parameters used for the eigensolution of the four test problems.

Table 4 :
Influence of parameter  max in the number of iterations of the Newton phase.Problem EMILIA-923 with  = 64 processors.

Table 5 :
Comparison between DACG and Newton-DACG for the four test problems on  = 64 processors.Results on problem LONG-1103 are obtained using  DACG = 0.5.