ISRN Applied Mathematics, Volume 2012, Article ID 127647, doi:10.5402/2012/127647

Review Article

Preconditioning for Sparse Linear Systems at the Dawn of the 21st Century: History, Current Developments, and Future Perspectives

Massimiliano Ferronato (ORCID 0000-0002-5077-1394), Dipartimento ICEA, Università di Padova, Via Trieste 63, 35121 Padova, Italy

Academic Editors: J. R. Fernandez, Q. Song, S. Sture, and Q. Zhang

Received 1 October 2012; Accepted 22 October 2012; Published 26 December 2012

Copyright © 2012 Massimiliano Ferronato. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Iterative methods are currently the solvers of choice for large sparse linear systems of equations. However, it is well known that the key factor for accelerating, or even allowing for, convergence is the preconditioner. Research on preconditioning techniques has characterized the last two decades of work in this field. Nowadays, there are a number of different options to be considered when choosing the most appropriate preconditioner for the specific problem at hand. The present work provides an overview of the most popular algorithms available today, emphasizing their respective merits and limitations. The overview is restricted to algebraic preconditioners, that is, general-purpose algorithms requiring knowledge of the system matrix only, independently of the specific problem it arises from. Along with the traditional distinction between incomplete factorizations and approximate inverses, the most recent developments are considered, including the scalable multigrid and parallel approaches which represent the current frontier of research. A separate section devoted to saddle-point problems, which arise in many different applications, closes the paper.

1. Historical Background

The term “preconditioning” generally refers to “the art of transforming a problem that appears intractable into another whose solution can be approximated rapidly”, while the “preconditioner” is the mathematical operator responsible for such a transformation. In the context of linear systems of equations,

(1.1) A x = b,

where A ∈ ℝ^{n×n} and x, b ∈ ℝ^n, the preconditioner is a matrix which transforms (1.1) into an equivalent problem for which the convergence of an iterative method is much faster. Preconditioners are generally used when the matrix A is large and sparse, as it typically, but not exclusively, arises from the numerical discretization of Partial Differential Equations (PDEs). It is not rare that the preconditioner itself is the operator which makes it possible to solve system (1.1) numerically, when it would otherwise be intractable from a computational viewpoint.

The development of preconditioning techniques for large sparse linear systems is strictly connected to the history of iterative methods. As mentioned by Benzi, the term “preconditioning” used to identify the acceleration of the iterative solution of a linear system can be found for the first time in a 1968 paper by Evans, but the concept is considerably older, having already been introduced by Cesari back in 1937. However, it is well known that the earliest tricks for solving a linear system numerically were already developed in the 19th century. As reported in the survey by Saad and van der Vorst, and according to the work by Varga, Gauss set the pace for iterative methods in an 1823 letter, where the famous German mathematician describes a number of clever tricks to solve a singular 4×4 linear system arising from a geodetic problem. Most of the strategies suggested by Gauss compose what we currently know as the stationary Gauss-Seidel method, which was later formalized by Seidel in 1874. In his letter, Gauss jokes that the method he suggests is actually a pleasant entertainment, as one can think about many other things while carrying out the repetitive operations required to solve the system. It is rather a surprise that such a method attracted great interest in the era of automatic computing! In 1846 another famous algorithm was introduced by Jacobi to solve a sequence of 7×7 linear systems, enforcing diagonal dominance by plane rotations. Other similar methods were developed later, such as the Richardson and Cimmino iterations, but no significant improvements in the field of the numerical solution of linear systems can be reported until the 40s. This period is referred to as the prehistory of the subject by Benzi.

As has often occurred in the past, a dramatic and tragic historical event like World War II increased research funding for military reasons and helped accelerate development and progress in the field of numerical linear algebra as well. World War II brought a great development of the first automatic computing machines, which later led to the modern digital electronic era. With the availability of such new computing tools, interest in the numerical solution of relatively large linear systems of equations suddenly grew. The most significant advances were obtained in the field of direct methods. Between the 40s and the 70s the algorithms of choice for solving a linear system automatically were based on Gaussian elimination, with the development of several important variants which greatly improved the native method. First, it was recognized that pivoting techniques are of paramount importance to stabilize the numerical computation of the triangular factors of the system matrix and reduce the effects of rounding errors. Second, several reordering techniques were developed in order to limit the fill-in of the triangular factors and the number of operations required for their computation. The most famous bandwidth-reducing algorithms, such as Cuthill-McKee and its reverse variant, nested dissection, and minimum degree, are still very popular today. Third, appropriate scaling strategies were introduced so as to balance the system matrix entries and help stabilize the triangular factor computation, for example, [17, 18].

In the meantime, iterative methods lived in a kind of niche, despite the publication in 1952 of the famous papers by Hestenes and Stiefel and by Lanczos, who independently and almost simultaneously discovered the Conjugate Gradient (CG) method for solving symmetric positive definite (SPD) linear systems. These papers did not go unnoticed, but at that time CG was interpreted as a direct method converging in m steps, with m the number of distinct eigenvalues of the system matrix A. As was soon realized, this property is lost in practice when working in finite arithmetic, so that convergence is actually achieved in a number of iterations possibly much larger than m. This is why CG was considered an attractive alternative to Gaussian elimination in only a few lucky cases and was not rated as much more than an elegant theory. The iterative method of choice of the period was the Successive Overrelaxation (SOR) algorithm, introduced independently by Young, in his Ph.D. thesis, and by Frankel as an acceleration of the old Gauss-Seidel stationary iteration. The possibility of estimating theoretically the optimal overrelaxation parameter for a large class of matrices having the so-called property A earned SOR a remarkable success, especially in problems arising from the discretization of PDEs of elliptic type, for example, in groundwater hydrology or nuclear reactor diffusion.

In the 60s the Finite Element (FE) method was introduced in structural mechanics. The new method had great success and gave rise to a novel set of large matrices which were very sparse but neither diagonally dominant nor characterized by property A. Unfortunately, SOR techniques were not reliable for many of these problems, either converging very slowly or even diverging. Therefore, it is not a surprise that direct methods soon became the reference techniques in this field. The irregular sparsity pattern of FE-discretized structural problems gave a renewed impulse to the development of more appropriate ordering strategies based on graph theory and of more efficient direct algorithms, leading in the 70s and early 80s to the formulation of the modern multifrontal solvers [13, 24, 25]. In the 80s the first pioneering commercial codes for FE structural computations became available, and the solvers of choice were obviously direct methods because of their reliability and robustness. By this time the solution of sparse linear systems by direct techniques appeared to be quite a mature field of research, well summarized in famous reference textbooks such as the ones by Duff et al. and Davis.

In contrast, during the 60s and 70s iterative solvers were living their infancy. CG was rehabilitated as an iterative method by Reid in 1971, who showed that for reasonably well-conditioned sparse SPD matrices convergence could be reached in far fewer than n steps, n being the size of A. This work set the pace for the extension of CG to nonsymmetric and indefinite problems, with the introduction of the earliest Krylov subspace methods such as the Minimal Residual (MINRES) and the Biconjugate Gradient (Bi-CG) algorithms. However, the crucial event for the future success of Krylov subspace methods was the publication in 1977 by Meijerink and van der Vorst of the Incomplete Cholesky CG (ICCG) algorithm. Incomplete factorizations had already been introduced in the Soviet Union, and independently by Varga, in the late 50s, and the seminal ideas for improving the conditioning of the system matrix in the CG iteration can also be found in the original work by Hestenes and Stiefel and in Engeli et al., but Meijerink and van der Vorst were the first to put the pieces together and show the great potential of the preconditioned CG. The idea was later extended by Kershaw to SPD matrices not necessarily of M-type and soon gained great popularity. The key point was that CG, as well as any other Krylov subspace method with a proper preconditioning, could become competitive with the latest direct methods while requiring much less memory, and was therefore attractive for large three-dimensional (3D) simulations.

The 80s and 90s were the years of the great development of Krylov subspace methods. In 1986 Saad and Schultz introduced the Generalized Minimal Residual (GMRES) method, which soon became the algorithm of choice among iterative methods for nonsymmetric linear systems. Some of the drawbacks of Bi-CG were addressed in 1989 by Sonneveld, who developed the Conjugate Gradient Squared (CGS) method, on the basis of which in 1992 van der Vorst presented the Biconjugate Gradient Stabilized (Bi-CGSTAB) algorithm. Along with GMRES, Bi-CGSTAB is the most successful technique for nonsymmetric linear systems, with the advantage of relying on a short-term recurrence. As a consequence of the famous Faber-Manteuffel theorem, Bi-CGSTAB is not optimal, in the sense that it does not theoretically lead to an approximate solution with some minimum property in the current Krylov subspace; nonetheless, its convergence can be faster than that of GMRES. Another family of Krylov subspace methods includes the Quasi-Minimal Residual (QMR) algorithm, with its transpose-free (TFQMR) and symmetric (SQMR) variants [40, 41]. Several other variants of the methods above, including truncated, restarted, and flexible versions, along with nested optimal techniques, for example, [42, 43], were introduced by several researchers until the late 90s. A recent review of Krylov subspace methods for linear systems is available in, while several textbooks have been written on this subject at the beginning of the 21st century, among which the most famous is perhaps the one by Saad.

Initially, Krylov subspace methods were viewed with some suspicion by practitioners, especially those coming from structural mechanics, because of their apparent lack of reliability. Numerical experiments in different fields, such as FE elastostatics, geomechanics, consolidation of porous media, fluid flow, and transport, showed a growing interest in these methods, mainly because they could potentially allow for the solution of much bigger and more detailed problems. Whereas direct methods typically scale poorly with the matrix size, especially in 3D models, Krylov subspace methods appeared to be virtually mandatory for large problems. On the other hand, it soon became clear that the key factor in improving the robustness and the computational efficiency of any iterative solver is preconditioning. Nowadays, it is quite well accepted that preconditioning, rather than the selected Krylov algorithm, is the most important issue to address.

This is why research on the construction of effective preconditioners has significantly grown over the last two decades, while advances on Krylov subspace methods have progressively faded. Currently, preconditioning appears to be a much more active and promising research field than either direct or iterative solution methods, particularly so within the context of the fast evolution of hardware technology. On one hand, this is due to the understanding that there are virtually no limits to the available options for obtaining a good preconditioner. On the other hand, it is also generally recognized that an optimal general-purpose preconditioner is unlikely to exist, so that there is the possibility of improving the solver efficiency in different ways for any specific problem at hand within any specific computing environment. Generally, the knowledge of the governing physical processes, the structure of the resulting system matrix, and the available computer technology are factors that cannot be ignored in the design of an appropriate preconditioner. It is also recognized that theoretical results are few, and frequently “empirical” algorithms may work surprisingly well despite the lack of a rigorous foundation. This is why finding a good preconditioner for solving a sparse linear system can be viewed as “a combination of art and science” rather than a rigorous mathematical exercise.

2. Basic Concepts

Roughly speaking, preconditioning means transforming system (1.1) into an equivalent mathematical problem which is expected to converge faster with an iterative solver. For instance, given a nonsingular matrix M, premultiplying (1.1) by M^{-1} yields

(2.1) M^{-1} A x = M^{-1} b.

If the matrix M^{-1}A is better conditioned than A for a Krylov subspace method, then M^{-1} is the preconditioner and (2.1) denotes the left preconditioned system. Similarly, M^{-1} can be applied on the right:

(2.2) A M^{-1} y = b,  x = M^{-1} y,

thus producing a right preconditioned system. The use of left or right preconditioning depends on a number of factors and can produce quite different behaviors. For instance, right preconditioning has the advantage that the residual of the preconditioned system is the same as that of the native system, while with left preconditioning it is not, and this can be important in the convergence check or when using a residual minimization algorithm such as GMRES. Note that with right preconditioning a back transformation of the auxiliary variable y must be carried out to recover the original unknown x. If the preconditioner can be written in a factorized form,

(2.3) M^{-1} = M_2^{-1} M_1^{-1},

a split preconditioning can also be used:

(2.4) M_1^{-1} A M_2^{-1} y = M_1^{-1} b,  x = M_2^{-1} y,

thus potentially exploiting the advantages of both the left and right formulations.

Writing the preconditioned Krylov subspace algorithms is relatively straightforward: the basic algorithms can be implemented by replacing A with either M^{-1}A, AM^{-1}, or M_1^{-1}AM_2^{-1}, and then back substituting the original variable x where the auxiliary vector y is used. It can be easily observed that it is not necessary to build the preconditioned matrix, which could be much less sparse than A, but only to compute the product of M^{-1} (or of M_1^{-1} and M_2^{-1}, if the factorized form (2.3) is used) by a vector. This operation is called the application of the preconditioner.
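As an illustration of this point, the following Python/SciPy sketch (not part of the original text; the Jacobi preconditioner and the 1D Laplacian test matrix are illustrative choices) supplies the action of M^{-1} to a Krylov solver as an operator acting on vectors, without ever forming the preconditioned matrix:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Illustrative SPD test problem: 1D Laplacian (tridiagonal).
n = 100
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Jacobi preconditioner M = diag(A). The preconditioned matrix
# M^{-1}A is never built: the solver only needs the action of
# M^{-1} on a vector, supplied here as a LinearOperator.
d = A.diagonal()
M_inv = spla.LinearOperator((n, n), matvec=lambda v: v / d)

x, info = spla.cg(A, b, M=M_inv)  # info == 0 signals convergence
```

Only the matrix-vector action of M^{-1} is ever required, matching the observation above that the preconditioned matrix itself is never formed.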

Generally speaking, there are three basic requirements for obtaining a good preconditioner:

the preconditioned matrix should have a clustered eigenspectrum away from 0,

the preconditioner should be as cheap to compute as possible,

its application to a vector should be cost-effective.

The origin of condition (i) lies in the convergence properties of CG. It is quite intuitive that if M^{-1} resembles A^{-1} in some sense, the preconditioned matrix should be close to the identity, thus making the preconditioned system somewhat “easier” to solve and accelerating convergence. If A and M^{-1} are SPD, it has been proved that the iteration count n_iter of the CG algorithm needed to go below the tolerance ε depends on the spectral condition number ξ(G) of the preconditioned matrix G, for example:

(2.5) n_iter ≤ (1/2) √(ξ(G)) ln(2/ε) + 1,

so that clustering the eigenvalues of G away from 0 is of paramount importance to accelerate convergence. If A is not SPD, things can be much more complicated. For example, the performance of GMRES depends also on the conditioning of the matrix of the eigenvectors of G, and the fact that the preconditioned matrix has a clustered eigenspectrum does not guarantee fast convergence. Anyway, it is quite a common experience that a clustered eigenspectrum away from 0 often yields rapid convergence in nonsymmetric problems as well.
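For a concrete feel of bound (2.5), the snippet below (an illustration, not from the paper) evaluates it for two condition numbers, showing how reducing ξ(G) by preconditioning shrinks the guaranteed iteration count roughly by the square root of the reduction factor:

```python
import math

def cg_iteration_bound(xi, eps):
    # Upper bound (2.5) on the CG iteration count, given the
    # spectral condition number xi of the preconditioned matrix
    # and the convergence tolerance eps.
    return 0.5 * math.sqrt(xi) * math.log(2.0 / eps) + 1.0

# Reducing xi from 1e4 to 1e2 (e.g., by preconditioning) cuts the
# bound by a factor of about 10.
print(cg_iteration_bound(1.0e4, 1.0e-6))  # ~726 iterations
print(cg_iteration_bound(1.0e2, 1.0e-6))  # ~74 iterations
```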

The importance of conditions (ii) and (iii) depends on the specific problem at hand and may be highly influenced by the computer architecture. From a practical point of view, they can even be more important than condition (i). For example, if a sequence of linear systems has to be solved with the same matrix A, or with slight changes only, as often occurs when using a Newton method for a set of nonlinear equations, the same preconditioner can be used several times, with the cost of its computation easily amortized. In this case, it can be convenient to spend more time building an effective preconditioner. On the other hand, the computational cost of the preconditioner application basically depends on its density. As M^{-1} has to be applied once or twice per iteration, according to the selected Krylov subspace method, it is of paramount importance that the increased cost of each iteration be counterbalanced by an adequate reduction in their number. Typically, pursuing condition (i) reduces the iteration count but conflicts with the need for a cheap computation and application of the preconditioner, so that in the end preconditioning the linear system (1.1) efficiently is always the result of an appropriate tradeoff.

An easy way to build a preconditioner is based on a splitting of A. Recalling the definition of a stationary iteration,

(2.6) x_{k+1} = x_k + C r_k,

where r_k = b − A x_k is the current residual, the matrix C can be used as a preconditioner. Splitting A as D − E − F, where D, −E, and −F are the diagonal, the strictly lower, and the strictly upper parts of A, respectively, then

(2.7) M_J^{-1} = D^{-1},
(2.8) M_S^{-1} = (D − E)^{-1},
(2.9) M_ω^{-1} = ω (D − ωE)^{-1},
(2.10) M_Ω^{-1} = ω (2 − ω) (D − ωF)^{-1} D (D − ωE)^{-1}

are known as the Jacobi, Gauss-Seidel, SOR, and Symmetric SOR preconditioners, respectively, where ω is the overrelaxation factor. Replacing E with F in (2.8) and (2.9) gives rise to the backward Gauss-Seidel and backward SOR preconditioners, respectively. The preconditioners obtained from the splitting of A require no additional memory and have no computation cost, while their application simply amounts to a forward or backward substitution. With M_J^{-1}, a simple diagonal scaling is all that is required.
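To illustrate, the application of the Symmetric SOR preconditioner (2.10) amounts to a forward triangular solve, a diagonal scaling, and a backward triangular solve; a minimal Python/SciPy sketch follows (the test matrix and the value ω = 1.2 are illustrative choices, not taken from the text):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 50
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")

# Splitting A = D - E - F
D = sp.diags(A.diagonal())
E = -sp.tril(A, k=-1)      # strictly lower part
F = -sp.triu(A, k=1)       # strictly upper part
omega = 1.2                # illustrative overrelaxation factor

def ssor_apply(v):
    # M^{-1} v = omega (2-omega) (D - omega F)^{-1} D (D - omega E)^{-1} v:
    # forward solve, diagonal scaling, backward solve.
    y = spla.spsolve_triangular(sp.csr_matrix(D - omega * E), v, lower=True)
    return omega * (2.0 - omega) * spla.spsolve_triangular(
        sp.csr_matrix(D - omega * F), D @ y, lower=False)

z = ssor_apply(np.ones(n))  # one application of the preconditioner
```

No factorization is computed and no extra memory is needed beyond the parts of A itself, in line with the remark above.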

Because of their simplicity, the Jacobi, Gauss-Seidel, SOR, and Symmetric SOR preconditioners are still used in some applications. Assume, for instance, that matrix A has two clusters of eigenvalues due to the heterogeneity of the underlying physical problem. This is a quite common experience in geotechnical soil/structure or contact simulations, for example, [49, 54–58]. In this case the matrix A can be written as

(2.11) A = [ K  B_1 ; B_2  H ],

where the square diagonal blocks are responsible for the two clusters of eigenvalues and usually arise in a natural way from the original problem. Factorizing A in (2.11) according to Sylvester's theorem of inertia,

(2.12) A = [ I  0 ; B_2 K^{-1}  I ] [ K  0 ; 0  S ] [ I  K^{-1} B_1 ; 0  I ],

where

(2.13) S = H − B_2 K^{-1} B_1

is the Schur complement, a Generalized Jacobi (GJ) preconditioner, based on the seminal idea in, can be obtained as a diagonal approximation of the central block diagonal factor in (2.12):

(2.14) M^{-1} = [ diag(K)  0 ; 0  φ diag(S~) ]^{-1},

where φ is a user-specified parameter and S~ is an approximation of S in (2.13) obtained by using diag(K) in place of K. Quite recently, Bergamaschi and Martínez have indirectly proved that an optimal value for φ is the ratio between the largest eigenvalues of K and S. Another very simple algorithm which has proved effective in similar applications is based on the SOR and Symmetric SOR preconditioners.
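A minimal dense sketch of the GJ construction (2.14) may read as follows in Python/NumPy; the block sizes, the matrices K, B_1, B_2, H, and the value φ = 1 are all hypothetical, chosen only to make the construction concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
nk, nh = 6, 4                      # hypothetical block sizes

# Hypothetical 2x2 block matrix A = [[K, B1], [B2, H]].
K = np.diag(rng.uniform(10.0, 20.0, nk))
B1 = rng.standard_normal((nk, nh))
B2 = B1.T
H = np.diag(rng.uniform(1.0, 2.0, nh))

# Approximate Schur complement: diag(K) replaces K in (2.13).
S_tilde = H - B2 @ np.diag(1.0 / np.diag(K)) @ B1

# GJ preconditioner (2.14): a diagonal matrix assembled from
# diag(K) and phi * diag(S_tilde).
phi = 1.0
M_inv = np.diag(np.concatenate([1.0 / np.diag(K),
                                1.0 / (phi * np.diag(S_tilde))]))
```

The preconditioner remains diagonal, so its application is a single diagonal scaling, at the cost of the extra work needed to form diag(S~).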

Though the algorithms above cannot compete with more sophisticated tools in terms of global performance [58, 65], they are an example of how even somewhat naive ideas can work and be useful to practitioners. Generally speaking, it is always a good idea to develop an effective preconditioner starting from the specific problem at hand. Interesting strategies have been developed by directly approximating the differential operator from which the matrix A arises, using a “nearby” PDE or a less accurate discretization of the problem. For instance, popular preconditioners typically used within the transport community are often physically based.

For the developers of computer codes, however, using physically based preconditioners is not always desirable. In fact, it is usually simpler to introduce the linear solver as a black box which is expected to work independently of the specific problem at hand, especially in less specialized codes such as commercial ones. This is why a great interest has also arisen in the development of purely algebraic preconditioners that aim to be as “general purpose” as possible. Algebraic preconditioners are usually robust algorithms which require knowledge of the matrix A only, independently of the underlying physical problem. This paper will concentrate on such a class of algorithms. Subdividing algebraic preconditioners into categories is not straightforward, as especially in recent years there have been several contaminations from adjacent fields with the development of “hybrid” approaches. Traditionally, and as already done by Benzi, algebraic preconditioners can be roughly divided into two classes, that is, incomplete factorizations and approximate inverses. The main distinction is that with the former approach M is actually computed and M^{-1} is applied but never formed explicitly, whereas with the latter M^{-1} is built directly and applied. Within each class, moreover, different strategies can be pursued according to a number of factors, such as whether A is SPD or not, its sparsity structure and size, the magnitude of its coefficients, and ultimately the computer hardware available to solve the problem, so that a large number of variants have been proposed by researchers. As anticipated before, the unavoidable contamination from adjacent fields, for example, the direct solution methods, has made the boundaries between different groups of algorithms often vague and certainly questionable.

In this work, the traditional distinction between incomplete factorizations and approximate inverses is followed, describing the most successful algorithms belonging to each class and the most significant variants developed to address particular occurrences and increase efficiency. Then, two additional sections are devoted to the most recent results obtained in the fields that currently appear to be the most active frontier of research in the area of algebraic preconditioning, that is, multigrid techniques and parallel algorithms. Finally, a few words are spent on a special class of problems characterized by indefinite saddle-point matrices, which arise quite frequently in applications and have attracted much interest in recent years.

3. Incomplete Factorizations

Given a nonsingular matrix A such that

(3.1) A = LU,

the basic idea of incomplete factorizations is to approximate the triangular factors L and U in a cost-effective way. If L and U are computed so as to satisfy (3.1) exactly, for example, by a Gaussian elimination procedure, fill-in typically takes place, with L and U much less sparse than A. This is what limits the use of direct methods in large 3D problems. The approximations L~ and U~ of L and U, respectively, can be obtained by simply discarding a number of fill-in entries according to some rules. The resulting preconditioner,

(3.2) M^{-1} = U~^{-1} L~^{-1},

is generally denoted as Incomplete LU (ILU). Quite obviously, M^{-1} in (3.2) is never built explicitly, as it is much denser than both L~ and U~. Its application to a vector is therefore performed by forward and backward substitutions. Moreover, the ILU factorized form (3.2) is well suited for split preconditioning. In case A is SPD, the upper incomplete factor U~ is replaced by L~^T and the related preconditioner is known as Incomplete Cholesky (IC).

The native ILU algorithm runs as follows. Define a set 𝒮 of positions (i,j), with 1 ≤ i,j ≤ n, where either L~ (if i > j) or U~ (if j > i) has a nonzero entry. The set 𝒮 is also denoted as the nonzero pattern of the incomplete factors. To avoid breakdowns in the ILU computation, it is generally recommended to include all the diagonal entries in 𝒮. Then, make a copy of A into an auxiliary matrix, which will contain L~ and U~ in its lower and upper parts, respectively, at the end of the computation, and set to zero all entries a_ij such that (i,j) ∉ 𝒮. If the main diagonal belongs to L~, then it is implicitly assumed that the diagonal of U~ is unitary, and vice versa. For every position (i,j) ∈ 𝒮, the following update is computed:

(3.3) a_ij ← a_ij − a_ik a_kk^{-1} a_kj,  k = 1,…,i−1.

At the end of the loop, the ILU factors are stored in the copy of A.

The procedure described above is clearly an incomplete Gauss elimination. Hence, no surprise that several implementation tricks developed to improve the efficiency of direct solution methods can be borrowed in order to reduce the computational cost of the ILU computation. For example, the algorithm can follow either the KIJ, or the IKJ, or the IJK elimination variants. To give an idea, Algorithm 1 provides a sketch of the sequence of operations required by the IKJ implementation, which is generally preferred whenever the sparse matrix A along with both L~ and U~ is stored in a compact form by rows. It can be easily observed that Algorithm 1 proceeds by rows, computing first the row of L~ and then that of U~ and accessing the previous rows only to gather the required information (Figure 1). Many other efficient variants have been developed, for example, storing A in a sparse skyline format or computing L~ by columns [73, 74].

Algorithm 1: IKJ variant of ILU factorization with a static pattern 𝒮.

Input: Matrix A, the nonzero pattern 𝒮
Output: Matrix A containing L~ and U~

for each (i,j) ∉ 𝒮 do
    a_ij = 0
end
for i = 2,…,n do
    for k = 1,…,i−1 and (i,k) ∈ 𝒮 do
        a_ik ← a_ik / a_kk
        for j = k+1,…,n and (i,j) ∈ 𝒮 do
            a_ij ← a_ij − a_ik a_kj
        end
    end
end

Figure 1: IKJ variant of ILU factorization.
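For concreteness, Algorithm 1 can be sketched in Python using dense storage, with the static pattern taken as the nonzero pattern of A itself; this is an illustration only, as practical implementations work on compressed sparse rows:

```python
import numpy as np

def ilu_static(A, pattern):
    # IKJ-variant incomplete factorization with a static pattern
    # (Algorithm 1), in dense storage for clarity. Returns a matrix
    # holding L~ (unit diagonal implicit) below and U~ above.
    A = A.astype(float).copy()
    A[~pattern] = 0.0            # entries outside the pattern S are discarded
    n = A.shape[0]
    for i in range(1, n):
        for k in range(i):
            if pattern[i, k]:
                A[i, k] /= A[k, k]
                for j in range(k + 1, n):
                    if pattern[i, j]:
                        A[i, j] -= A[i, k] * A[k, j]
    return A

# With the pattern of A itself and a tridiagonal matrix, whose exact
# factors suffer no fill-in, the incomplete and the exact LU
# factorizations coincide.
T = np.diag([2.0] * 5) + np.diag([-1.0] * 4, -1) + np.diag([-1.0] * 4, 1)
lu = ilu_static(T, T != 0.0)
L = np.tril(lu, -1) + np.eye(5)
U = np.triu(lu)
assert np.allclose(L @ U, T)
```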

3.1. Fill-In Strategies

The several existing versions of the ILU preconditioner basically differ in the rules followed to select the entries retained in L~ and U~. The simplest idea is to define 𝒮 statically at the beginning of the algorithm as the set of the nonzero positions of A. This idea was originally implemented by Meijerink and van der Vorst in their 1977 seminal work, giving rise to what is traditionally referred to as ILU(0), or IC(0) for SPD matrices. This simple choice is easy to implement and very cost-effective, as the preconditioner construction is quite cheap and its application to a vector costs as much as a matrix-by-vector product with A. Moreover, ILU(0) can be quite efficient, especially for matrices arising from the discretization of elliptic problems.

Unfortunately, there are a number of problems where the ILU(0) preconditioner converges very slowly or even fails. There are several different reasons for this, related to the unstable computation or the unstable application of L~ and U~, as was found experimentally by Chow and Saad on a large number of test matrices. For example, ILU(0) often exhibits a bad behavior with structural matrices and with the highly nonsymmetric and indefinite matrices arising from computational fluid dynamics. In these cases the nonzero pattern of A proves too crude an approximation of the exact structure of L and U, so that a possible remedy is to allow for a larger number of entries in L~ and U~ in order to make them closer to L and U. In the limiting case, the exact factors are computed and the solution to (1.1) is obtained in just one iteration. Obviously, enlarging the ILU pattern reduces the iteration count on one hand but increases the cost of computing and applying the preconditioner on the other, so an appropriate tradeoff must be found.

There are several different ways of enlarging the initial pattern 𝒮 or computing it dynamically. In any specific problem, the winning algorithm is the one that proves able to capture the most significant entries of L~ and U~ at the lowest cost. One of the first attempts at enlarging 𝒮 dynamically is based on the so-called level-of-fill concept [79, 80]. The idea is to assign an integer value lev_ij, denoted as the level-of-fill, to each entry in position (i,j) computed during the ILU construction process. The lower the level, the more important the entry. The initial level is set to

(3.4) lev_ij = 0 if a_ij ≠ 0,  lev_ij = ∞ if a_ij = 0.

During the ILU construction, with reference for instance to Algorithm 1, the level-of-fill of each entry is updated according to the level-of-fill of the entries used for its computation using the following rule:

(3.5) lev_ij ← min { lev_ij, lev_ik + lev_kj + 1 }.

In practice, any position corresponding to a nonzero entry of A keeps a zero level-of-fill. The terms computed using entries with a zero level-of-fill have a level equal to 1, those built using one entry of the first level have level 2, and so on. The new level-of-fill of each entry is computed by adding 1 to the sum of the levels of the entries used for it. This kind of algorithm creates a hierarchy among the potential entries that allows for selecting only the “most important” nonzero terms. The resulting preconditioner is denoted as ILU(p), where the integer p ≥ 0 is the user-specified maximum level-of-fill of the entries retained in each row of L~ and U~. It follows immediately that ILU(0) coincides exactly with the zero fill-in ILU previously defined.
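The level-of-fill bookkeeping of (3.4)-(3.5) can be carried out symbolically, that is, on the sparsity pattern alone; the following Python sketch (illustrative, in dense storage) returns the nonzero pattern retained by ILU(p):

```python
import numpy as np

def ilup_pattern(A, p):
    # Symbolic ILU(p): levels start at 0 for the nonzeros of A and
    # at "infinity" elsewhere (3.4), and are updated during a mock
    # IKJ elimination with rule (3.5):
    #     lev_ij <- min(lev_ij, lev_ik + lev_kj + 1).
    n = A.shape[0]
    INF = n * n                   # plays the role of infinity
    lev = np.where(A != 0, 0, INF)
    for i in range(1, n):
        for k in range(i):
            if lev[i, k] <= p:    # pivot entry is retained
                for j in range(k + 1, n):
                    lev[i, j] = min(lev[i, j], lev[i, k] + lev[k, j] + 1)
    return lev <= p

# p = 0 reproduces the pattern of A; larger p adds fill-in levels.
A = (np.diag([4.0] * 6) + np.diag([-1.0] * 5, -1) + np.diag([-1.0] * 5, 1)
     + np.diag([-1.0] * 3, 3) + np.diag([-1.0] * 3, -3))
assert np.array_equal(ilup_pattern(A, 0), A != 0)
assert ilup_pattern(A, 1).sum() > (A != 0).sum()
```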

In most applications, ILU(1) is already a significant improvement over ILU(0). Experience shows that the fill-in grows quite rapidly with p, so that for p ≥ 2 the cost of building and applying the preconditioner can become too high a price to pay. Moreover, at each level there are also many terms that are very small in magnitude and thus provide a small contribution to the preconditioning of the original matrix A. Hence, a natural way to avoid such an occurrence is to add a dropping rule according to which the terms below a user-specified tolerance τ are neglected. Several criteria have been defined for selecting τ and for the way the ILU entries are discarded. A usual choice is to compute the ith row of both L~ and U~ and drop an entry whenever it is smaller than τ‖a_i‖, where ‖·‖ is an appropriate vector norm and a_i is the ith row of A, but other strategies have also been attempted, such as comparing the entry l_ij of L~ with τ(l_ii l_jj)^{1/2} or with an estimate of the norm of the jth column of L~^{-1} [82, 83].

Similarly to the level-of-fill approach, a drawback of the drop tolerance strategy is that the amount of fill-in of L~ and U~ strongly depends on the selected user-specified parameter and on the specific problem at hand, and is unpredictable a priori. To address this issue a second user-specified parameter ρ can be introduced, denoting the largest number of entries that can be retained per row. In some versions, ρ is defined as the maximum number of entries added per row in excess of the number of entries in the same row of A. The use of this dual threshold strategy, which is quite simple and effective, has been proposed by Saad , and the related preconditioner is referred to as ILUT(ρ,τ). A sketch of the sequence of steps required for its computation is provided in Algorithm 2.

Algorithm 2: ILUT(ρ,τ) computation.

Input: Matrix A, the user-specified parameters ρ and τ

Output: Matrices L~ and U~

for i = 1,…,n do

w=ai

for k = 1,…,i−1 and wk ≠ 0 do

wk ← wk/akk

if |wk| ≥ τ‖ai‖2 then

w ← w − wk·uk

else

wk=0

end

end

Retain the ρ largest entries in  w

for j = 1,…,i−1 do

lij=wj

end

for j = i,…,n do

uij=wj

end

end
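The steps of Algorithm 2 can be condensed into a dense sketch as follows (Python/NumPy; the naming is ours, the pivot is taken from the diagonal of U~, and the diagonal entry is always retained):

```python
import numpy as np

def ilut(A, rho, tau):
    """Illustrative dense sketch of ILUT(rho, tau) along the lines of
    Algorithm 2; not a tuned sparse implementation."""
    n = A.shape[0]
    L = np.eye(n)
    U = np.zeros((n, n))
    for i in range(n):
        w = A[i].astype(float)             # working copy of row i
        thresh = tau * np.linalg.norm(A[i])
        for k in range(i):
            if w[k] != 0.0:
                w[k] /= U[k, k]            # pivot from the diagonal of U
                if abs(w[k]) >= thresh:    # dropping rule of Algorithm 2
                    w[k+1:] -= w[k] * U[k, k+1:]
                else:
                    w[k] = 0.0
        # keep only the rho largest entries (the diagonal is always kept)
        idx = np.argsort(-np.abs(w))
        keep = set(idx[:rho]) | {i}
        for j in range(n):
            if j not in keep:
                w[j] = 0.0
        L[i, :i] = w[:i]
        U[i, i:] = w[i:]
    return L, U
```

Setting ρ = n and τ = 0 reproduces the exact LU factorization, which offers a simple consistency check.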

As the drop tolerance may be quite sensitive to the specific problem, and hence difficult to set in a black box solver, ILU variants that use only the fill-in parameter ρ have been proposed [85, 86]. These versions are generally outperformed by ILUT(ρ,τ), but are much easier to set and generally less sensitive to the user-specified parameter, allowing the user thorough control over the memory occupation. More recent ILU variants, developed with the aim of improving the preconditioner performance and better exploiting the hardware potential, are based on multilevel techniques, dense block identification, and adaptive selection of the setup parameters , borrowing significantly from the methods developed for the latest multifrontal solvers.

3.2. Stabilization Techniques

Both Algorithms 1 and 2 require divisions where the denominator, that is, the pivot, is not guaranteed to be nonzero. It is well known that in finite arithmetic small pivots can also create numerical difficulties. If A is SPD, the pivots must also be positive in order to produce a real incomplete factorization. These issues are very important because they can jeopardize the existence of the preconditioner itself, and with it the robustness of the overall iterative algorithm.

Meijerink and van der Vorst  demonstrated that a zero fill-in IC is guaranteed to exist for an M-matrix. However, if A is SPD but not of M-type, zero or negative pivots may arise, thus theoretically preventing the IC computation. Typically, such an inconvenience may be avoided by properly increasing the fill-in: in this way the incomplete factor tends to the exact factor, which is guaranteed to be SPD. However, this may generate a fill-in degree that is unacceptably high, lowering the performance of the preconditioned solver. To avoid this undesirable occurrence a number of tricks have been proposed which can greatly increase the IC robustness. The first, and maybe simplest, idea was advanced by Kershaw , who suggested replacing zero or negative pivots with an arbitrary positive value, for example, the last valid pivot. The implementation of this trick is straightforward, but it seldom provides good results, as the resulting IC often exhibits a very low quality: recall from Algorithm 1 that arbitrarily modifying a pivot in the kth row influences the computation of all the following rows. Another simple strategy, recently advanced in , relies on avoiding the computation of the square root of the pivot required in the IC algorithm, thus eliminating one source of breakdown. In  a diagonal compensated reduction of the positive off-diagonal entries is suggested in order to transform the native matrix into an M-matrix before the incomplete factorization is performed. A different algorithm is proposed in , with the incomplete factors obtained through an A-orthogonalization instead of a Cholesky elimination.

A famous stabilization technique ensuring the IC existence is the one advanced by Ajiz and Jennings , which has proved quite effective especially in structural applications [50, 90]. Following this approach, we can write (3.6) A = LLᵀ = L~L~ᵀ + R, where R is a residual matrix collecting all the entries discarded during the incomplete procedure. From (3.6), L~L~ᵀ is actually the exact factorization of A − R. As A is SPD, it follows immediately that a sufficient condition for L~ to be real is that R is negative semidefinite. Suppose that the first column of L has been computed and the term lj1 = aj1/a11 has to be dropped along with the corresponding l1j term in Lᵀ. Hence, the incomplete factorization is equivalent to the exact factorization of Ā, where the entries a1j and aj1 are replaced by zero, and the error matrix R contains aj1 in positions (1,j) and (j,1). For example, in a 3×3 matrix with j = 2, we have (3.7) A = [a11 a12 a13; a21 a22 a23; a31 a32 a33] = [a11 0 a13; 0 a22 a23; a31 a32 a33] + [0 a12 0; a21 0 0; 0 0 0] = Ā + R. The residual matrix R can be forced to be negative semidefinite by adding two coefficients α and β in positions (1,1) and (j,j), such that the submatrix (3.8) [α a1j; aj1 β] is negative semidefinite. The matrix Ā must be modified accordingly by subtracting α and β from the corresponding diagonal terms. Though different choices are possible, for example, [50, 93], the option α = β = −|a1j| is particularly attractive because it minimizes the sum |α| + |β| of the absolute values of the arbitrarily introduced entries. This procedure ensures that L~L~ᵀ is the exact factorization of a positive definite matrix, hence its existence is guaranteed, but obviously such a matrix can be quite different from A. The quality of this stabilization strategy basically depends on how well Ā approximates A.
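The diagonal compensation described above is easily verified numerically (an illustrative Python/NumPy helper under our own naming, not the full incomplete factorization):

```python
import numpy as np

def ajiz_jennings_drop(A, i, j):
    """Drop the symmetric pair (i, j), (j, i) from an SPD matrix A with the
    diagonal compensation alpha = beta = -|a_ij| of eq. (3.8)."""
    Abar = A.astype(float).copy()
    a = Abar[i, j]
    Abar[i, j] = Abar[j, i] = 0.0
    Abar[i, i] += abs(a)      # subtract alpha = -|a_ij| from the diagonal
    Abar[j, j] += abs(a)      # subtract beta = -|a_ij| from the diagonal
    return Abar
```

For the 3×3 example of (3.7), dropping a symmetric off-diagonal pair in this way leaves a negative semidefinite residual R = A − Ā, while the compensated matrix Ā remains SPD.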

Unstable factors and negative pivots typically arise if the diagonal terms of A are remarkably smaller than the off-diagonal ones, or if there are significant contrasts in the coefficient magnitude. The latter occurrence is particularly frequent in heterogeneous problems or in models coupling different physical processes, such as multiphase flow in deformable media, coupled flow and transport, thermomechanical models, fluid-structure interactions, or multibody systems, for example, . All these problems naturally give rise to a multilevel matrix: (3.9) A = [K1 B12 B13 ⋯ B1ℓ; B21 K2 B23 ⋯ B2ℓ; B31 B32 K3 ⋯ B3ℓ; ⋮ ⋮ ⋮ ⋱ ⋮; Bℓ1 Bℓ2 Bℓ3 ⋯ Kℓ], where the system unknowns can be subdivided into ℓ contiguous groups. By distinction with multigrid methods, a multilevel approach assumes that the different levels do not necessarily correspond to model subgrids. Several strategies have been proposed to handle the matrix levels properly, for example, . Generally speaking, all such algorithms can be viewed as approximate Schur complement methods that differ in the strategy used to perform the level subdivision and in the restriction and prolongation operators adopted [103, 104]. For a recent review see, for instance, . The basic idea of multilevel ILU proceeds as follows. Consider the partial factorization of A in (3.9), where A1 is the submatrix obtained from A by deleting the first level, and B̂12 and B̂21 are the rectangular submatrices collecting the blocks B1j and Bj1, j = 2,…,ℓ: (3.10) A = [K1 B̂12; B̂21 A1] = [L1 0; B̂21U1⁻¹ I][I 0; 0 S1][U1 L1⁻¹B̂12; 0 I]. In (3.10) K1 = L1U1 and S1 is the Schur complement: (3.11) S1 = A1 − B̂21U1⁻¹L1⁻¹B̂12. Replacing L1 and U1 with the ILU factors L~1 and U~1 of K1 gives rise to a partial incomplete factorization of A that can be used as a preconditioner: (3.12) M⁻¹ = [U~1 H12; 0 I]⁻¹[I 0; 0 S~1⁻¹][L~1 0; H21 I]⁻¹, where (3.13) H12 = L~1⁻¹B̂12, H21 = B̂21U~1⁻¹, and (3.14) S~1 = A1 − H21H12.
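The two-level identity (3.10)-(3.11) can be checked numerically as follows (Python/NumPy on a small synthetic two-level matrix; block sizes and values are arbitrary):

```python
import numpy as np

def lu_doolittle(K):
    """Plain LU without pivoting (illustration only)."""
    n = K.shape[0]
    L, U = np.eye(n), K.astype(float).copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
            U[i, k] = 0.0
    return L, U

# synthetic two-level matrix A = [[K1, B12], [B21, A1]]
rng = np.random.default_rng(0)
K1 = np.array([[4.0, 1.0], [1.0, 5.0]])
B12 = rng.standard_normal((2, 3))
B21 = rng.standard_normal((3, 2))
A1 = np.diag([6.0, 7.0, 8.0]) + 0.1 * rng.standard_normal((3, 3))
A = np.block([[K1, B12], [B21, A1]])

L1, U1 = lu_doolittle(K1)
S1 = A1 - B21 @ np.linalg.inv(K1) @ B12          # Schur complement (3.11)
I2, I3 = np.eye(2), np.eye(3)
Z23, Z32 = np.zeros((2, 3)), np.zeros((3, 2))
F1 = np.block([[L1, Z23], [B21 @ np.linalg.inv(U1), I3]])
F2 = np.block([[I2, Z23], [Z32, S1]])
F3 = np.block([[U1, np.linalg.inv(L1) @ B12], [Z32, I3]])
assert np.allclose(F1 @ F2 @ F3, A)              # identity (3.10)
```

Note that K1⁻¹ = U1⁻¹L1⁻¹, so the Schur complement above coincides with (3.11).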
As H12 and H21 tend to be much denser than B̂12 and B̂21, some degree of dropping can be conveniently enforced, thus replacing these blocks with H~12 and H~21, respectively. Recalling the level structure of A1, the inverse of the approximate Schur complement S~1 can also be replaced by a partial incomplete factorization with the same structure as M⁻¹ in (3.12). Repeating the procedure recursively starting from S~0 = A, at the ith level S~i⁻¹ is replaced by (3.15) S~i⁻¹ ≃ [U~i+1 H~i+1,i+2; 0 I]⁻¹[I 0; 0 S~i+1⁻¹][L~i+1 0; H~i+2,i+1 I]⁻¹, thus giving rise to the (i+1)th level Schur complement S~i+1 = Ai+1 − H~i+2,i+1H~i+1,i+2. The algorithm stops when the size of S~i+1 reduces to zero, that is, when i+1 equals the user-specified number of levels ℓ. Multilevel ILU preconditioners are generally much more robust than standard ILU and can be very effective provided that a suitable level subdivision is defined. If A is SPD, however, it is necessary to enforce the positive definiteness of all approximate Schur complements, which is no longer ensured when incomplete decompositions and dropping procedures are used. Recently, Janna et al.  have developed a stabilization strategy which can guarantee the positive definiteness of each S~i. Basically, they extend to ℓ levels the factorization procedure introduced by Tismenetsky , where the Schur complement is updated by a quadratic formula. The theoretical robustness of Tismenetsky's algorithm has been proved in . Recalling (3.6), the matrix S~i can be additively decomposed as (3.16) S~i = [S~i,11 S~i,12; S~i,12ᵀ S~i,22] = S̄i + Ri = [S̄i,11 S~i,12; S~i,12ᵀ S~i,22] + [R 0; 0 0], where R is the residual matrix obtained from the IC decomposition of S~i,11. Assuming that an appropriate stabilization technique has been used, for example, the one by Ajiz and Jennings , Ri is negative semidefinite, so S̄i is positive definite. Let us collect all the entries dropped in the computation of H~i+1,i+2 in the error matrix EH, thus writing Hi+1,i+2 = H~i+1,i+2 + EH.
The Schur complement of S~i can now be rewritten as (3.17) Si+1 = S~i,22 − H~i+1,i+2ᵀH~i+1,i+2 − EHᵀH~i+1,i+2 − H~i+1,i+2ᵀEH − EHᵀEH. Ignoring the last three terms of (3.17) we obtain (3.18) S~i+1 = S~i,22 − H~i+1,i+2ᵀH~i+1,i+2, that is, the standard Schur complement usually computed in a multilevel algorithm; see (3.14). As EHᵀH~i+1,i+2 is generally indefinite, S~i+1 in (3.18) may be indefinite too. As a remedy, add and subtract H~i+1,i+2ᵀH~i+1,i+2 on the right-hand side of (3.17): (3.19) Si+1 = S~i,22 + H~i+1,i+2ᵀH~i+1,i+2 − (H~i+1,i+2 + EH)ᵀH~i+1,i+2 − H~i+1,i+2ᵀ(H~i+1,i+2 + EH) − EHᵀEH. Recalling that (3.20) H~i+1,i+2 + EH = Hi+1,i+2 = L~i+1⁻¹S~i,12, (3.19) becomes (3.21) Si+1 = S~i,22 + H~i+1,i+2ᵀH~i+1,i+2 − S~i,12ᵀL~i+1⁻ᵀH~i+1,i+2 − H~i+1,i+2ᵀL~i+1⁻¹S~i,12 − EHᵀEH. Neglecting the last term in (3.21) we obtain another expression for the Schur complement: (3.22) S~i+1 = S~i,22 + H~i+1,i+2ᵀH~i+1,i+2 − S~i,12ᵀL~i+1⁻ᵀH~i+1,i+2 − H~i+1,i+2ᵀL~i+1⁻¹S~i,12, which is always SPD. In fact, (3.21) yields (3.23) S~i+1 = Si+1 + EHᵀEH, which is SPD independently of the degree of dropping enforced on H~i+1,i+2. The procedure above is generally more expensive than the standard one, but also more robust.
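The identities (3.22)-(3.23) can also be verified numerically (Python/NumPy on a small synthetic SPD matrix; for clarity only the dropping on H is modeled, the factor of the (1,1) block being kept exact):

```python
import numpy as np

# SPD test matrix partitioned into a 2x2 block form (synthetic example)
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
S = B @ B.T + 6 * np.eye(6)
S11, S12, S22 = S[:3, :3], S[:3, 3:], S[3:, 3:]

L1 = np.linalg.cholesky(S11)          # exact factor of the (1,1) block
H = np.linalg.solve(L1, S12)          # H = L1^{-1} S12, cf. (3.20)
Ht = H * (np.abs(H) > 0.3)            # dropped copy H~; error EH = H - Ht
E = H - Ht

S_exact = S22 - H.T @ H               # exact Schur complement
S_std = S22 - Ht.T @ Ht               # standard update (3.18), may lose SPD
S_stab = S22 + Ht.T @ Ht - H.T @ Ht - Ht.T @ H   # stabilized update (3.22)

# identity (3.23): the stabilized Schur complement equals the exact one
# plus the positive semidefinite term EH^T EH, hence it is SPD
assert np.allclose(S_stab, S_exact + E.T @ E)
assert np.linalg.eigvalsh(S_stab).min() > 0
```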

If A is not SPD, enforcing the positivity of the pivots or the positive definiteness of the Schur complements is not necessary. However, experience shows that in many problems, especially those with a high degree of nonsymmetry and a lack of diagonal dominance, ILU can be much more unstable in both its computation and its application, leading to frequent solver failures . In this case no rigorous methods have been introduced to avoid numerical instabilities. The effect of a preliminary reordering and/or scaling of A has been debated for quite a long time, without reaching a shared conclusion. That the matrix nonzero pattern has an effect on the quality of the resulting ILU preconditioner, similarly to what happens with direct methods, is quite well understood, but not all the orderings performing well with direct techniques appear to be efficient for ILU, for example, [90, 109–112]. A similar result is obtained with a preliminary scaling: as proved in , scaling the native matrix does not change the spectral properties of an ILU preconditioned matrix; however, it can greatly stabilize the ILU computation by significantly reducing the impact of round-off errors. Useful scaling strategies can be taken from the GJ preconditioning of (2.14) or from the Least Square Log algorithm introduced in .

4. Approximate Inverses

Incomplete factorizations can be very powerful preconditioners; however, the issues concerning their existence and numerical stability can undermine their robustness in several examples. One reason for such a weakness can be understood with the following simple argument. Consider the incomplete factorization L~U~ along with the corresponding residual matrix for a general matrix A: (4.1) A = L~U~ + R. Left and right multiplying both sides of (4.1) by L~⁻¹ and U~⁻¹, respectively, yields (4.2) L~⁻¹AU~⁻¹ = I + L~⁻¹RU~⁻¹. The matrix controlling the convergence of a preconditioned Krylov subspace method is actually L~⁻¹AU~⁻¹, which differs from the identity by L~⁻¹RU~⁻¹. This means that a residual matrix close to 0 is not sufficient to guarantee fast convergence: if the norm of either L~⁻¹ or U~⁻¹ is large, L~⁻¹AU~⁻¹ can still be far from the identity and yield slow convergence; that is, even an accurate incomplete factorization can provide a poor outcome. This is why incomplete factorizations often have a bad reputation on problems with strongly nonsymmetric matrices and a lack of diagonal dominance.

To cope with these drawbacks, researchers have developed a second big class of preconditioners with the aim of computing an explicit form for M⁻¹, thus avoiding the solution of the linear system needed to apply the preconditioner. Generally speaking, an approximate inverse is an explicit approximation of A⁻¹. Quite obviously, it has to be sparse in order to maintain a workable cost for its computation and application. It is no surprise that several different methods have been proposed to build an approximate inverse. The most successful ones can be roughly classified according to whether M⁻¹ is computed monolithically or as the product of two matrices. In the first case, the approximate inverse is generally computed by minimizing the Frobenius norm: (4.3) ‖I − AM⁻¹‖F or ‖I − M⁻¹A‖F, thus obtaining either a right or a left approximate inverse, which are generally different. In the second case, the basic idea is to find an explicit approximation of the inverse of the triangular factors of A: (4.4) M1⁻¹ ≃ L⁻¹, M2⁻¹ ≃ U⁻¹, and then use M⁻¹ in the factored form (2.3). Different approaches have been advanced within each class, and also in between the two groups.

Early algorithms for computing an approximate inverse via Frobenius norm minimization appeared as far back as the 70s [114, 115]. The basic idea is to select a nonzero pattern 𝒮 for M⁻¹ including all the positions where nonzero terms may be expected. Denoting by 𝒲𝒮 the set of matrices with nonzero pattern 𝒮, any algorithm belonging to this class looks for a matrix M⁻¹ ∈ 𝒲𝒮 such that ‖I − AM⁻¹‖F is minimum. Recalling the definition of the Frobenius norm: (4.5) ‖A‖F = (Σj=1..n Σi=1..n aij²)^(1/2), it follows that (4.6) ‖I − AM⁻¹‖F² = Σj=1..n ‖ej − Amj‖₂², where ej and mj are the jth columns of I and M⁻¹, respectively. The sum on the right-hand side of (4.6) is minimum only if each addend is minimum. Therefore, the computation of M⁻¹ is equivalent to the solution of n independent least-squares problems with the constraint that mj has nonzero components only in the positions i such that (i,j) ∈ 𝒮. This requirement reduces enormously the computational cost of building M⁻¹. Denote by 𝒥 the set (4.7) 𝒥 = {i : (i,j) ∈ 𝒮} and by mj[𝒥] the subvector of mj made of the components included in 𝒥. In the product Amj only the columns of A with index in 𝒥 provide a nonzero contribution. Moreover, as A is sparse, only a few rows have a nonzero term in at least one of these columns (Figure 2). Denoting by ℛ the set (4.8) ℛ = {i : aij ≠ 0 with j ∈ 𝒥} and by A[ℛ,𝒥] the submatrix of A made of the coefficients aij such that i ∈ ℛ and j ∈ 𝒥, each least-squares problem on the right-hand side of (4.6) reads (4.9) ‖ej[ℛ] − A[ℛ,𝒥]mj[𝒥]‖₂² → min and is typically quite small. The procedure illustrated above allows for the computation of a right approximate inverse. To build a left approximate inverse just observe that (4.10) ‖I − M⁻¹A‖F = ‖I − AᵀM⁻ᵀ‖F. Hence, (4.9) can still be used for any j, replacing A with Aᵀ and obtaining the rows of M⁻¹. It turns out that for strongly nonsymmetric matrices the right and left approximate inverses can be quite different.
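The construction of a right approximate inverse with a fixed pattern can be sketched as follows (dense Python/NumPy, with our own naming; real implementations gather the submatrices from sparse storage):

```python
import numpy as np

def frobenius_inverse(A, pattern):
    """Right approximate inverse with a fixed nonzero pattern, given as a
    set of positions (i, j): each column m_j solves the small least-squares
    problem (4.9). Dense gather/scatter for illustration only."""
    n = A.shape[0]
    M = np.zeros((n, n))
    for j in range(n):
        J = sorted(i for (i, jj) in pattern if jj == j)   # eq. (4.7)
        if not J:
            continue
        R = sorted(set(np.nonzero(A[:, J])[0]))           # eq. (4.8)
        sub = A[np.ix_(R, J)]                             # A[R, J]
        e = np.array([1.0 if i == j else 0.0 for i in R]) # e_j[R]
        mJ, *_ = np.linalg.lstsq(sub, e, rcond=None)
        M[J, j] = mJ
    return M
```

With the full pattern each least-squares problem is solved exactly and M⁻¹ coincides with A⁻¹, a useful sanity check; a diagonal pattern already reduces the Frobenius residual with respect to M⁻¹ = 0.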

Schematic representation of a matrix-vector product subject to sparsity constraints.

The main difference among the algorithms developed within this class lies in how the nonzero pattern 𝒮 is selected, and this is indeed the key factor for the overall quality of the approximate inverse. A more detailed discussion on this point is provided in the next section. Some of the most popular algorithms in this class have been advanced in , each with its pros and cons. Probably the most successful approach is the one proposed by Grote and Huckle , known as SPAI (SParse Approximate Inverse). Basically, the SPAI algorithm follows the lines described above, dynamically enlarging 𝒥 for each column of M⁻¹ until the least-squares problem (4.9) is solved with a prescribed accuracy. SPAI has proved to be quite a robust method, allowing for a good control of its quality by the user. Unfortunately, its computation can be quite expensive even on a parallel machine. A sketch of the SPAI construction is reported in Algorithm 3.

Algorithm 3: Computation of the SPAI preconditioner.

Input: Matrix A, the initial nonzero pattern 𝒮 and the tolerance ε

Output: Matrix M-1

for j = 1,…,n do

Compute 𝒥 = {i : (i,j) ∈ 𝒮}

Compute ℛ = {i : aik ≠ 0, k ∈ 𝒥}

Gather A[ℛ,𝒥] from A

Solve the least-squares problem ‖ej[ℛ] − A[ℛ,𝒥]mj[𝒥]‖2 → min

rj = ej[ℛ] − A[ℛ,𝒥]mj[𝒥]

while ‖rj‖2 > ε do

Enlarge 𝒥 and update ℛ

Gather A[ℛ,𝒥] from A

Solve the least-squares problem ‖ej[ℛ] − A[ℛ,𝒥]mj[𝒥]‖2 → min

rj = ej[ℛ] − A[ℛ,𝒥]mj[𝒥]

end

end
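A simplified reading of the adaptive enlargement can be sketched for a single column as follows (Python/NumPy; in this simplification all column indices touched by the nonzero residual rows are added, without the ρj screening of (4.21) discussed later):

```python
import numpy as np

def spai_column(A, j, eps, max_steps=6):
    """One column of an adaptive SPAI-like approximate inverse (our own
    simplified reading of Algorithm 3): the set J is enlarged with all the
    column indices appearing in the rows where the residual is nonzero."""
    n = A.shape[0]
    J = {j}                                        # diagonal initial pattern
    for _ in range(max_steps):
        Jl = sorted(J)
        R = sorted(set(np.nonzero(A[:, Jl])[0]))   # rows touched by J
        e = np.array([1.0 if i == j else 0.0 for i in R])
        mJ, *_ = np.linalg.lstsq(A[np.ix_(R, Jl)], e, rcond=None)
        r = np.zeros(n)
        r[R] = e - A[np.ix_(R, Jl)] @ mJ           # residual e_j - A m_j
        if np.linalg.norm(r) <= eps:
            break
        rows = np.nonzero(np.abs(r) > 1e-14)[0]
        J |= set(int(c) for c in np.nonzero(A[rows, :])[1])
    m = np.zeros(n)
    m[Jl] = mJ
    return m
```

On a small matrix the enlargement quickly reaches the full pattern, and the column residual drops to machine precision.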

A different way to build an approximate inverse of A relies on approximating the inverses of its triangular factors. Assuming that all the leading principal minors of A are nonzero, that is, A can be factorized as the product LDU, where L is unit lower triangular, D is diagonal, and U is unit upper triangular, then (4.11) L⁻¹AU⁻¹ = D, implying that the rows of L⁻¹ and the columns of U⁻¹ are two sets of A-biconjugate vectors. In other words, denoting by wi the ith row of L⁻¹ and by zj the jth column of U⁻¹, (4.11) means that (4.12) wiᵀAzj = 0, i,j = 1,…,n, i ≠ j. If W and Z are the matrices collecting all vectors wi and zj, i,j = 1,…,n, satisfying condition (4.12), and D = WᵀAZ, then the inverse of A reads (4.13) A⁻¹ = ZD⁻¹Wᵀ = Σk=1..n zkwkᵀ/dk and can be computed explicitly. There are infinitely many sets of vectors satisfying (4.12), and they can be computed by means of a biconjugation process, for example, via a two-sided Gram-Schmidt algorithm, starting from two arbitrary initial sets. Quite obviously, these vectors are dense and their explicit computation is practically unfeasible. However, if the process is carried out incompletely, for instance dropping the entries of wk and zk, k = 1,…,n, whenever smaller than a prescribed tolerance, the matrices W and Z can be acceptable approximations of L⁻ᵀ and U⁻¹ while preserving a workable sparsity. The resulting preconditioner is denoted as AINV (Approximate INVerse) and was introduced in the late 90s by Benzi and coauthors . In the following years the native algorithm was further developed by other researchers as well, for example, . A sketch of the basic procedure to build the AINV preconditioner is provided in Algorithm 4.

Algorithm 4: Computation of the AINV preconditioner.

Input: Matrix A, the tolerance τ

Output: Matrices W, Z and D such that M-1=ZD-1WT

Set p  =  q  =  0, Z=W=I

for k = 1,…,n do

pk=wkTAzk,qk=wkTAzk

for i = k+1,…,n do

pi=(wiTAzk)/pk,qi=(wkTAzi)/qk

wi ← wi − pi·wk,  zi ← zi − qi·zk

for j = 1,…,i do

Drop the jth component of wi and zi if smaller than τ

end

end

end

for k = 1,…,n do

dk=pk

end
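In the SPD case (W = Z) the biconjugation process of Algorithm 4 reduces to an A-orthogonalization of the columns of the identity, which can be sketched as follows (dense Python/NumPy with our own naming):

```python
import numpy as np

def ainv_spd(A, tau):
    """AINV in the SPD case of Algorithm 4: A-orthogonalization of the
    columns of the identity with drop tolerance tau. Dense sketch;
    the preconditioner is M^{-1} = Z diag(d)^{-1} Z^T."""
    n = A.shape[0]
    Z = np.eye(n)
    d = np.zeros(n)
    for k in range(n):
        d[k] = Z[:, k] @ A @ Z[:, k]               # pivot p_k
        for i in range(k + 1, n):
            q = (Z[:, k] @ A @ Z[:, i]) / d[k]
            Z[:, i] -= q * Z[:, k]
            Z[np.abs(Z[:, i]) < tau, i] = 0.0      # dropping step
    return Z, d
```

With τ = 0 the process is exact: ZᵀAZ = D and ZD⁻¹Zᵀ = A⁻¹, which provides a simple consistency check.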

Quite obviously, if A is SPD, then W = Z, and in Algorithm 4 only the update of zi must be carried out. The resulting AINV preconditioner can be written as M⁻¹ = ZD⁻¹Zᵀ, is SPD, and exists if pk > 0 for all k = 1,…,n. Unfortunately, the same does not hold for nonfactored sparse approximate inverses based on minimizing the Frobenius norm in (4.3): in general, using an algorithm developed along the lines of SPAI, there is no guarantee that M⁻¹ is SPD when A is. This is quite a strong limitation, as it prevents the use of the powerful CG method for solving an SPD linear system. This weak point motivated the development of the Factorized Sparse Approximate Inverse (FSAI) preconditioner by Kolotilina and Yeremin . The FSAI algorithm lies somewhat in between the two classes of approximate inverses previously identified, as it produces a factored preconditioner of the form (4.14) M⁻¹ = GᵀG, where G is a sparse approximation of the inverse of the exact Cholesky factor L of an SPD matrix A, obtained with a Frobenius norm minimization. Select a lower triangular nonzero pattern 𝒮L and find the matrix G ∈ 𝒲𝒮L, the set of matrices with pattern 𝒮L, such that (4.15) ‖I − GL‖F → min. The minimization (4.15) can be performed recalling that (4.16) ‖I − GL‖F² = Tr[(I − GL)(I − GL)ᵀ] = n − 2Tr(GL) + Tr(GAGᵀ). Differentiating the right-hand side of (4.16) with respect to the nonzero entries of G and setting the result to 0 yields (4.17) [GA]ij = lji, (i,j) ∈ 𝒮L, where [·]ij denotes the entry in position (i,j) of the matrix within brackets. As 𝒮L is a lower triangular pattern, the entry lji of L can be nonzero only if i = j, so the off-diagonal terms of L are not required. Scaling (4.17) so that the right-hand side is either 1 or 0 yields (4.18) [ĜA]ij = δij, (i,j) ∈ 𝒮L, where Ĝ is the scaled factor G and δij the Kronecker delta. Solving (4.18) by rows leads to a sequence of dense SPD linear systems with size equal to the number of nonzeroes set for each row of Ĝ.
More precisely, denoting by 𝒥 the set of column indices where the ith row gi of Ĝ has a nonzero entry, the following SPD systems have to be solved: (4.19) A[𝒥,𝒥]gi[𝒥] = im, where im ∈ ℝᵐ has zero components except the mth, which is unitary, and m is the cardinality of 𝒥. The FSAI factor G is finally restored from Ĝ as (4.20) G = [diag(Ĝ)]^(−1/2)Ĝ, such that the preconditioned matrix GAGᵀ has all diagonal entries equal to 1. It has been demonstrated that FSAI is optimal in the sense that it minimizes the Kaporin condition number of GAGᵀ for any G ∈ 𝒲𝒮L [131, 132]. Another remarkable property of FSAI is that the minimization in (4.15) does not require the explicit knowledge of L, which would make the algorithm unfeasible. A sketch of the procedure to build FSAI is reported in Algorithm 5. The FSAI preconditioner was later improved in several ways , with the aim of increasing its efficiency by dropping the smallest entries while preserving its optimal properties. Between the late 90s and early 00s further developments of FSAI were essentially abandoned, as AINV proved generally superior. In recent years, however, interest in FSAI has revived, mainly because of its high intrinsic degree of parallelism, for example, .

Algorithm 5: Computation of the FSAI preconditioner.

Input: Matrix A and the nonzero pattern 𝒮L

Output: Matrix G

for i = 1,…,n do

Compute 𝒥 = {j : (i,j) ∈ 𝒮L}

Set m=|𝒥|

Gather A[𝒥,𝒥] from A

Solve A[𝒥,𝒥]gi[𝒥]=im

gi ← gi/√gi,m

end
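Algorithm 5 can be condensed as follows (dense Python/NumPy sketch with our own data layout; the pattern 𝒮L is passed as a set of positions (i,j), j ≤ i, with the diagonal included):

```python
import numpy as np

def fsai(A, SL):
    """FSAI factor G for SPD A on a lower triangular pattern SL: each row
    solves the small SPD system (4.19), then the scaling (4.20) is applied
    so that diag(G A G^T) = 1. Dense illustration only."""
    n = A.shape[0]
    Ghat = np.zeros((n, n))
    for i in range(n):
        J = sorted(j for (ii, j) in SL if ii == i)   # column indices, i last
        rhs = np.zeros(len(J))
        rhs[-1] = 1.0                                # i_m of eq. (4.19)
        Ghat[i, J] = np.linalg.solve(A[np.ix_(J, J)], rhs)
    # scaling (4.20): G = diag(Ghat)^{-1/2} Ghat
    return np.diag(1.0 / np.sqrt(np.diag(Ghat))) @ Ghat
```

With the full lower triangular pattern, G is (up to scaling) the exact inverse Cholesky factor, so that GAGᵀ = I; with any admissible pattern, diag(GAGᵀ) = 1, consistently with (4.20).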

The native FSAI algorithm was developed for SPD matrices. A generalization to nonsymmetric matrices is possible and quite straightforward [137, 138]; however, the resulting preconditioner is much less robust, because the solvability of all the local linear systems and the nonsingularity of the approximate inverse are no longer theoretically guaranteed. This is why SPAI and AINV are generally preferred for general sparse matrices.

4.1. Nonzero Pattern Selection and Stability Issues

A key factor for the efficiency of approximate inverse preconditioners based on Frobenius norm minimization is the choice of the nonzero pattern 𝒮 or 𝒮L. If the nonzero pattern is not appropriate for the problem at hand, the result is generally a failure of the Krylov solver to converge. This is due to the fact that approximate inverse techniques assume that a sparse matrix M⁻¹ can be a good approximation, in some sense, of A⁻¹, which is not obvious at all, as the inverse of a sparse matrix is structurally dense. If A is diagonally dominant, the entries of A⁻¹ decay exponentially away from the main diagonal , hence a banded pattern for M⁻¹ typically provides good results. Other favorable situations include block tridiagonal or pentadiagonal matrices , but for a general sparse matrix choosing a good nonzero pattern is not trivial at all.

With respect to the nonzero pattern selection, approximate inverses can be classified as static or dynamic, according to whether 𝒮 and 𝒮L are chosen initially and kept unvaried during the computation, or generated by an adaptive algorithm which selects the position of the nonzero entries in an a priori unpredictable way. Typically, dynamic approximate inverses are more powerful than static ones, but their implementation, especially on parallel computers, can be much more complicated. The most popular strategy to define a static nonzero pattern for M⁻¹ is to take the nonzero pattern of a power κ of A, which is justified in terms of the Neumann series expansion of A⁻¹. In real problems, however, it is uncommon to set κ larger than 2 or 3, as 𝒮 can soon become too dense, so it is often preferable to work with sparsified matrices. The sparsification of Aκ can be done by a prefiltration strategy, a postfiltration strategy, or both. For example, Chow  suggests dropping the entries of A smaller than a threshold, thus obtaining the sparsified matrix Â, and using the pattern of Âκ. An a posteriori sparsification strategy is advanced for the FSAI algorithm by Kolotilina and Yeremin , with the aim of reducing the cost of the preconditioner application by dropping its smallest entries. Such a dropping, however, is nontrivial, and is performed so as to preserve the property that the FSAI preconditioned matrix has unitary diagonal entries, at a possibly nonnegligible computational cost. Anyway, pre- and/or postfiltration of a static approximate inverse is generally to be recommended, as the cost of the preconditioner application can be dramatically reduced with a relatively small increase in the iteration count, for example, [135, 136, 142]. Incidentally, this suggests that the pattern of Aκ is often much larger than what would actually be required to obtain a good preconditioner, the difficulty lying in the a priori recognition of the most significant entries.
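A minimal sketch of the prefiltered power pattern strategy (Python/NumPy; the row-relative threshold and the naming are ours):

```python
import numpy as np

def power_pattern(A, kappa, eps):
    """Static approximate inverse pattern in the spirit of Chow's strategy:
    prefilter A by dropping entries below eps times the largest entry of
    each row, then take the nonzero pattern of Ahat^kappa (dense sketch)."""
    Ahat = A * (np.abs(A) >= eps * np.abs(A).max(axis=1, keepdims=True))
    B = (Ahat != 0).astype(float)
    P = np.eye(A.shape[0])
    for _ in range(kappa):
        P = P @ B
    return P != 0
```

For a tridiagonal matrix, κ = 2 with no prefiltration yields a pentadiagonal pattern, while an aggressive prefilter reduces the pattern to the diagonal.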

Dynamic approximate inverses are based on adaptive strategies which start from a simple initial guess, for example, a diagonal pattern, and progressively enlarge it until a certain criterion is satisfied. A typical dynamic algorithm is the one proposed by Grote and Huckle  for the SPAI preconditioner; see Algorithm 3. More specifically, they suggest collecting the indices of the nonzero components of the residual vector and enlarging the current set 𝒥 by adding the new column indices that appear in the corresponding rows. An improvement can be obtained by retaining only a few new column indices, specifically the indices j such that the new residual norm (4.21) ρj = [‖rj‖₂² − (rjᵀAej)²/‖Aej‖₂²]^(1/2) is larger than a user-specified tolerance. Then, the set ℛ is enlarged correspondingly and a new least-squares problem is solved. Other dynamic techniques for generating 𝒮 have been advanced by Huckle for both nonsymmetric  and SPD matrices . More recently, Jia and Zhu  have advanced a novel dynamic approximate inverse denoted as PSAI (Power Sparse Approximate Inverse), and Janna and Ferronato  have developed a new adaptive strategy for Block FSAI that has proved quite efficient also for the native FSAI algorithm and will be presented in more detail in the sequel.

In contrast with approximate inverses based on Frobenius norm minimization, the AINV algorithm does not need the selection of a nonzero pattern, so it can be classified as a purely dynamic preconditioner. However, its computation turns out to be quite similar to that of an incomplete factorization, see Algorithms 2 and 4, thus raising concerns about its numerical stability. In particular, due to incompleteness a zero pivot may result, causing a breakdown, or small pivots can trigger unstable behaviors. To avoid such difficulties a stabilized AINV preconditioner has been developed independently in [124, 145] for SPD problems. For nonsymmetric problems, especially with large off-diagonal entries, the numerical stability of the AINV computation may still be an issue.

Theoretically, for an arbitrary nonzero pattern 𝒮 it is not guaranteed that M⁻¹ is nonsingular. This could also happen for a few combinations of the user-specified tolerances set for generating the approximate inverse pattern dynamically. However, such an occurrence is really unlikely in practice, and approximate inverses typically prove to be very robust in problems arising from a variety of applications. Depending on the specific problem at hand, the choice of the most efficient type of approximate inverse may vary. For a thorough comparative study of the available options as of 1999, see the work by Benzi and Tůma . Generally, SPAI is a robust algorithm able to converge also in difficult problems where incomplete factorizations fail, for example, , allowing the user complete control of the quality of the preconditioner through the parameter ε; see Algorithm 3. However, its computation is quite expensive and can rarely be competitive with AINV if only one system has to be solved, so that the construction time cannot be amortized by recycling the same preconditioner a few times. For SPD problems, SPAI is not an option, while FSAI is always well defined and a very stable preconditioner, even if its convergence rate may be lower than that of AINV. For unsymmetric matrices, however, FSAI is not reliable and may often fail. On the other hand, AINV seldom fails and generally outperforms both SPAI and FSAI.

Numerical experiments in different problems typically show that, whenever a stable ILU-type preconditioner can be computed, it is probably the most efficient preconditioner on a scalar machine [129, 148, 149]. However, approximate inverses play an important role in offering a stable and robust alternative to ILU in ill-conditioned problems. Moreover, they can become the preconditioner of choice for parallel computations, as will be discussed in more detail in the sequel.

5. Algebraic Multigrid Methods

It is well known that the convergence rate of any Krylov subspace method preconditioned by either an incomplete factorization or an approximate inverse tends to slow down as the problem size increases. The convergence deterioration, along with the increased number of operations per iteration, may lead to an unacceptably high computational cost, thus limiting de facto the size of the simulated model even when large memory resources are available. This is why in recent years much work has been devoted to so-called scalable algorithms, that is, solvers where the iteration count is insensitive to the characteristic size of the discretization mesh.

Multigrid methods can provide an answer to such a demand. Pioneering works on multigrid are due to Brandt [150, 151], who promoted these techniques for the solution of regularly discretized elliptic PDEs. The matrices arising from this kind of problem have a regular structure for which all eigenvalues and eigenvectors are known. For example, let us consider the 1D model problem: (5.1) −u″(x) = f(x), x ∈ [0,1], with homogeneous Dirichlet boundary conditions, discretized by centered differences over n+2 equally spaced points xi, i = 0,…,n+1. The grid characteristic size is therefore h = 1/(n+1). This discretization results in the tridiagonal n×n linear system Ahx = b where, as is well known, the eigenvalues of Ah are (5.2) λk = 4 sin²(khπ/2), k = 1,…,n, with the associated eigenvectors (5.3) wk = [sin(khπ), sin(2khπ), …, sin(nkhπ)]ᵀ. In other words, the ith component of wk is (5.4) wk,i = sin(kπxi). Let us solve the discretized model problem by a stationary method, for example, the Jacobi iteration. As the main diagonal of Ah has all entries equal to 2, the Jacobi iteration matrix Sh is simply (5.5) Sh = I − Ah/2, and the component of the error along the kth eigenvector of Sh is damped at each iteration by the reduction coefficient (5.6) ηk = 1 − 2 sin²(khπ/2). The reduction coefficients with k around n/2 are very small, and the related error components are damped very rapidly in a few iterations. This part of the error is often referred to as the oscillatory part, corresponding to high frequency modes. However, if h is small there are also several reduction coefficients close to 1 in modulus, related to both the smallest and the largest values of k, that slow down the stationary iteration significantly. This part of the error is usually referred to as the smooth part, corresponding to low frequency modes. Let us now consider a coarser grid where the characteristic size is doubled, that is, let us simply keep the nodes of the initial discretization with an even index.
Recalling (5.4), it follows that a smooth mode on the fine grid, that is, k<n/4 or k>3n/4, is automatically injected into an oscillatory mode on the coarse grid. Building A2h and S2h, it is now possible to damp in a few iterations the error on the coarse grid and then project the result back onto the fine grid. Similar conclusions can also be drawn using other stationary iterations, such as Gauss-Seidel, though in a much less straightforward way.
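The eigen-relations above are easy to verify numerically. The following sketch (plain numpy with dense matrices, purely for illustration) builds Ah and the Jacobi iteration matrix Sh of (5.5) and checks that each Fourier mode (5.3) is damped exactly by the coefficient ηk of (5.6):

```python
import numpy as np

n = 31                                  # interior grid points
h = 1.0 / (n + 1)
# tridiagonal 1D Laplacian from centered differences of (5.1)
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
S = np.eye(n) - A / 2                   # Jacobi iteration matrix (5.5)

x = np.arange(1, n + 1) * h             # grid nodes x_i
for k in (1, n // 2, n):
    w = np.sin(k * np.pi * x)                        # eigenvector (5.3)
    eta = 1 - 2 * np.sin(k * np.pi * h / 2) ** 2     # coefficient (5.6)
    assert np.allclose(S @ w, eta * w)
    print(f"k = {k:2d}: eta_k = {eta:+.3f}")
# k = 1 and k = n give |eta_k| close to 1 (slow), k near n/2 gives eta_k near 0
```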

Recalling the previous observations, the basic idea of multigrid proceeds as follows. The operator Sh, usually denoted as the smoother in the multigrid terminology, is applied to the fine grid discretization for ν1 iterations. The resulting fine grid residual is computed and injected into the coarse grid through a restriction operator RhH. Recalling the relationship between error and residual, the coarse grid problem is solved using the residual as the right-hand side, thus providing the coarse grid error. Such an error is then projected back to the original fine grid using a prolongation operator PHh, and the fine grid solution, obtained after the initial ν1 smoothing iterations, is corrected. Finally, the fine grid smoother is applied again until convergence. The algorithm described above is known as the two-grid cycle and is the basis of multigrid techniques. Quite naturally, the two-grid cycle can be extended in a recursive way. At the coarse level, another two-grid cycle can be applied moving to a coarser grid, defining appropriate smoothing, restriction, and prolongation operators, and so on, until the coarsest problem is so small that it can be efficiently solved by a direct method. The recursive application of the two-grid cycle gives rise to the so-called V-cycle. Other more complex recursions can provide the so-called W-cycle; see, for instance,  for details.
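The steps above can be sketched for the 1D model problem as follows (dense numpy, weighted-Jacobi smoothing, linear-interpolation prolongation and full-weighting restriction; the function names and parameter defaults are illustrative assumptions, not taken from the text):

```python
import numpy as np

def laplacian(n):
    """1D Laplacian of (5.1) on n interior nodes."""
    return 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def prolongation(nc):
    """Linear interpolation from nc coarse to 2*nc+1 fine interior nodes."""
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        i = 2 * j + 1                   # fine index of coarse node j
        P[i, j] = 1.0
        P[i - 1, j] = 0.5
        P[i + 1, j] = 0.5
    return P

def two_grid(A, b, u, nu1=3, nu2=3, omega=2.0 / 3.0):
    """One two-grid cycle: pre-smooth, coarse-grid correction, post-smooth."""
    P = prolongation((A.shape[0] - 1) // 2)
    R = 0.5 * P.T                       # full-weighting restriction
    d = np.diag(A)
    for _ in range(nu1):                # nu1 weighted-Jacobi sweeps
        u = u + omega * (b - A @ u) / d
    r_c = R @ (b - A @ u)               # restrict the fine-grid residual
    e_c = np.linalg.solve(R @ A @ P, r_c)   # coarse (Galerkin) problem
    u = u + P @ e_c                     # prolongate the error and correct
    for _ in range(nu2):                # nu2 post-smoothing sweeps
        u = u + omega * (b - A @ u) / d
    return u

n = 63
A, b, u = laplacian(n), np.ones(n), np.zeros(n)
for _ in range(5):
    u = two_grid(A, b, u)
print(np.linalg.norm(b - A @ u))        # residual drops by orders of magnitude
```

The coarse operator is built here as the Galerkin product RAP, one common choice; the cycle converges at a rate essentially independent of n, which is the scalability property discussed above.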

Multigrid methods were introduced as a solver for discretized PDEs of elliptic type, and indeed in such problems they soon proved to be largely superior to existing algorithms. The first idea to extend their use to other applications was to look at multigrid as a purely algebraic solver, where one has to define the smoother, restriction, and prolongation operators knowing the system matrix A only, independently of the grid and the discretized PDE it actually comes from. This strategy, known as Algebraic Multigrid (AMG), was first introduced in the mid 80s  and soon became quite popular. The second idea was to regard AMG not as a competitor of Krylov subspace solvers, but rather as a potentially effective preconditioner. In fact, to work as a preconditioner it is simply sufficient to fix the number ν2 of smoothing iterations needed to reconstruct the system solution at the finest level, thus obtaining an inexact solution. The scheme of a V-cycle AMG application as a preconditioner is provided in Algorithm 6.

<bold>Algorithm 6: </bold>Recursive application for μ times of the V-cycle AMG function MGV(A,RhH,PHh,Sh,v,u).

Input: Matrix A(k), the known vector v(k), the restriction and prolongation

operators RhH(k) and PHh(k), the smoothers Sh(k), k, ν1, ν2 and μ

Output: Vector u(k)=M-1v(k)

if  k=μ  then

Solve A(k)u(k)=v(k)

else

Apply ν1 iterations of the smoother Sh(k) to A(k)u(k)=v(k)

r(k)=v(k)-A(k)u(k)

r(k+1)=RhH(k)r(k)

Call MGV(A(k+1),RhH(k+1),PHh(k+1),Sh(k+1),r(k+1),e(k+1))

e(k)=PHh(k)e(k+1)

u(k)←u(k)+e(k)

Apply ν2 iterations of the smoother Sh(k) to A(k)u(k)=v(k)

end

Basically, AMG preconditioners vary according to the choice of both the restriction operator and the smoother, while the prolongation operator is often defined as the transpose of the restrictor. The basic idea for defining a restriction is to coarsen the native pattern of A and prescribe an appropriate interpolation over the coarse nodes. Classical coarsening strategies have been introduced in [155, 156] which separate the variables into coarse (C-) and fine (F-) points according to the strength of each matrix connection. A point j strongly influences the point i if (5.7)-aij≥θ maxk=1,…,n; k≠i(-aik), where θ is the user-defined strength threshold. For each point j that strongly influences an F-point i, j is either a C-point or is strongly influenced by another C-point k. Moreover, two C-points should not be connected to each other. An alternative successful coarsening strategy is based on the so-called smoothed aggregation (SA) technique , where a coarse node is defined by an aggregate of a root point i and all the neighboring nodes j such that (5.8)|aij|>θ(aiiajj)1/2. Another possibility is to use a variant of the independent set ordering (ISO) technique which subdivides a matrix into a 2×2 block structure and projects the native system into the Schur complement space. Using ISO or SA can lead to an elegant relationship between AMG and multilevel ILU, as exploited in [98, 101, 158, 159]. More recently, a number of both parallel and scalar coarsening strategies have also been advanced, for example, . As far as the smoother is concerned, for many years the method of choice has been the standard Gauss-Seidel iteration. With the development of parallel computers other smoothers have been introduced with the aim of increasing the parallel degree of the resulting AMG algorithm. Multicoloring approaches  can be used in connection with standard Gauss-Seidel, or also polynomial and C-F Jacobi smoothers [166, 167].
It is no surprise that approximate inverses, too, can be used efficiently as smoothers, for example, [168, 169].
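As a concrete illustration of criterion (5.7), the following sketch (dense numpy, purely didactic; not taken from any specific AMG package) computes the strong-influence sets on which classical coarsening bases its C/F splitting:

```python
import numpy as np

def strong_sets(A, theta=0.25):
    """S[i] = set of points j that strongly influence i per (5.7):
    -a_ij >= theta * max_{k != i} (-a_ik)."""
    n = A.shape[0]
    S = []
    for i in range(n):
        thresh = theta * max(-A[i, k] for k in range(n) if k != i)
        S.append({j for j in range(n)
                  if j != i and A[i, j] != 0 and -A[i, j] >= thresh})
    return S

A = 2 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)   # 1D Laplacian
print(strong_sets(A))    # each node strongly depends on its grid neighbors
```

A coarsening pass would then greedily pick C-points so that every F-point is strongly influenced by at least one C-point.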

The last decade has witnessed an explosion of research on AMG techniques. The key factor leading to such a great interest is basically their potential scalability with the size of the problem to be solved, in the sense that the iteration count to converge for a given problem does not depend on the number of the mesh nodes. Several theoretical results have been obtained with the aim of generalizing AMG as much as possible to nonelliptic problems, for example, [158, 170–174]. AMG techniques have been used in several different applications with promising results, such as fluid-structure interactions, meshless methods, Maxwell and Helmholtz equations, structural mechanics, sedimentary basin reconstruction, and advection-convection and reactive systems, for example, . The above list of references is just representative and of course largely incomplete. Nonetheless, much work is still to be done in order to make AMG the method of choice. In particular, difficulties can be experienced when the system matrix arises from highly distorted and irregular meshes, or in the presence of strong heterogeneities. In these cases, even the most advanced AMG strategies can fail.

6. Parallel Preconditioners

Parallel computing is widely accepted as the only pathway toward the possibility of managing millions of unknowns . At the same time, hardware companies are improving the available computational resources by increasing the degree of parallelism rather than accelerating each single computing core. This is why the interest in parallel computations has grown continuously in recent years. Krylov subspace methods are in principle almost ideally parallel, as their kernels are matrix-vector and scalar products along with vector updates. Unfortunately, the same is not true for most algebraic preconditioners. There are basically two approaches for developing parallel preconditioners: the first tries to extract as much parallelism as possible from existing algorithms with the aim of transferring them to high-performance platforms, while the second is based on developing novel techniques which would not make sense on a scalar computer. Quite obviously, the former approach is easier to understand from a conceptual point of view, as the native algorithm is not modified and the difficulties are mainly technological. The latter implies an additional effort to develop new explicitly parallel methods.

ILU-based preconditioners are highly sequential in both their computation and application. Nonetheless, some degree of parallelism can be achieved through graph coloring techniques. These strategies have been known to numerical analysts for a long time, for example, , and used in the context of relaxation methods. The aim is to build a partition of the matrix graph such that all vertices in the same group form an independent set, so that all unknowns in the same subset can be solved for in parallel. Other useful orderings are based on domain decomposition strategies minimizing the edges connecting different subdomains, with a local ordering also applied to each unknown subset. Parallel ILU implementations based on these concepts have been developed in ; however, it soon became clear that such algorithms generally could not achieve a promising scalability.

In contrast, approximate inverses are intrinsically much more parallel than ILU factorizations, as they can be applied by a matrix-vector product instead of forward and backward substitutions. The construction of an approximate inverse, however, can be difficult to parallelize. For example, the AINV Algorithm 4 proceeds by columns and is conceptually quite similar to an ILU factorization. A parallel AINV construction can be obtained by means of graph partitioning techniques which decompose the adjacency graph of A into a number of subsets of roughly the same size with a minimal number of edge cuts . In such a way the inner variables of each subset can be managed independently by each processor, while communications occur only for the boundary variables. By contrast, the construction of an approximate inverse based on a Frobenius norm minimization is naturally parallel if a static nonzero pattern is defined a priori. For example, the computation of static SPAI and FSAI consists of a number of independent least-squares problems and dense linear systems, respectively, which can be trivially partitioned among the available processors. Hence, static SPAI and FSAI are almost perfectly parallel algorithms. However, when the nonzero pattern is defined dynamically, parallelization is no longer straightforward, as the operation of gathering the required submatrix of A, see Algorithms 3 and 5, gives rise to unpredictable communications. For example, details of the nontrivial parallel implementation of the dynamic SPAI algorithm are provided in .
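To illustrate why static Frobenius-norm approaches parallelize so well, consider the following SPAI-style sketch of a right approximate inverse over a fixed pattern (dense numpy; the function name and the pattern choice are illustrative assumptions). Each column is an independent least-squares problem that could be assigned to a different processor:

```python
import numpy as np

def static_spai(A, pattern):
    """Minimize ||I - A M||_F columnwise: for each j, solve the independent
    least-squares problem min || A[:, J] m - e_j ||_2 over the allowed
    nonzero set J = pattern[j]. (Illustrative sketch.)"""
    n = A.shape[0]
    M = np.zeros((n, n))
    for j in range(n):                      # embarrassingly parallel loop
        J = sorted(pattern[j])
        e = np.zeros(n)
        e[j] = 1.0
        m, *_ = np.linalg.lstsq(A[:, J], e, rcond=None)
        M[J, j] = m
    return M

n = 6
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
pattern = [set(range(max(0, j - 1), min(n, j + 2))) for j in range(n)]
M = static_spai(A, pattern)                 # tridiagonal pattern of A
print(np.linalg.norm(np.eye(n) - A @ M))    # Frobenius residual of I - AM
```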

The theoretical scalability properties of AMG make it very attractive for parallel computations. This is why in the last few years the research on AMG techniques has concentrated on high-performance massively parallel implementations. The main difficulties may arise in parallelizing the Gauss-Seidel smoother and the coarsening stage. As mentioned before, these problems can be overcome by using naturally parallel smoothers, such as Jacobi, relaxed Jacobi, polynomial or a static approximate inverse smoother, for example, [162, 166], and parallel coarsening strategies, for example, [160, 162, 192, 193]. Currently, efficient parallel implementations of AMG are already available in several software packages, for example, Hypre  and Trilinos .

An alternative strategy for developing parallel preconditioners relies on building new algorithms which should consist of matrix-vector products and local triangular solves only. Perhaps the earliest technique of this kind belongs to the class of polynomial preconditioners, first introduced by Cesari as far back as 1937 , even though obviously not in the context of Krylov subspace methods and supercomputing. Modern polynomial preconditioners have been developed in . A polynomial preconditioner is defined as (6.1)M-1=s(A), where s is a polynomial of degree not exceeding k. The preconditioned system becomes: (6.2)s(A)Ax=s(A)b, where s(A) and A commute, so that right and left preconditioning coincide. Quite obviously, s(A)A is never formed explicitly, and its application to a vector is simply performed by a sequence of matrix-vector products. Therefore, polynomial preconditioners can be ideally implemented in a parallel environment. Different polynomial preconditioners arise according to the choice of s(A). One of the simplest options, advanced in , is using the Neumann expansion I+N+N2+⋯+Nl, where (6.3)N=I-ωA and ω is a scaling parameter. The degree l cannot be too large, and this can limit the effectiveness of this preconditioner. Another option, advanced in , is to select a somewhat “optimal” polynomial, where optimality is evaluated in some sense. A natural choice is prescribing that the eigenspectrum of the preconditioned matrix is as close as possible to that of the identity, that is, finding s such that (6.4)maxλ∈σ(A)|1-λs(λ)|→min, where σ(A) is the eigenspectrum of A. The min-max problem above can be addressed by using Chebyshev polynomials but cannot be solved exactly, as σ(A) is generally not known. An approximate minimization can be done replacing σ(A) with a continuous set which encloses σ(A).
A third option is to compute s as the polynomial that minimizes (6.5)‖1-λs(λ)‖w, where ‖·‖w denotes the norm induced by the inner product on the space of polynomials of degree at most k with respect to some nonnegative weight function w(λ) defined in the interval [α,β]: (6.6)(p,q)=∫αβ p(λ)q(λ)w(λ)dλ. The resulting polynomial is known as the least-squares polynomial and was actually introduced by Stiefel in the solution of eigenproblems . According to Saad , least-squares polynomials tend to perform better than Chebyshev and Neumann ones. However, a major drawback for the use of this kind of preconditioner stems from the need for at least estimates of the smallest and largest eigenvalues of A, and often the performance is quite poor in terms of convergence rate .
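A minimal sketch of the Neumann variant (6.3) (dense numpy; ω is chosen by hand for the model matrix, and the function name is hypothetical) shows both its matrix-free application and its main limitation, the slow decay of the error with the degree l:

```python
import numpy as np

def neumann_apply(A, v, omega, l):
    """M^{-1} v = omega * (I + N + ... + N^l) v with N = I - omega*A,
    accumulated Horner-style with l matrix-vector products only."""
    y = v.copy()
    for _ in range(l):
        y = v + (y - omega * (A @ y))   # y <- v + N y
    return omega * y

n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = np.linalg.solve(A, b)
omega = 0.25                            # ensures rho(I - omega*A) < 1 here
errs = [np.linalg.norm(neumann_apply(A, b, omega, l) - x) / np.linalg.norm(x)
        for l in (2, 8, 64)]
print(errs)                             # decreasing, but slowly
```

The truncation error behaves like the (l+1)th power of the spectral radius of N, which is close to 1 for this matrix: hence a large degree would be needed for an accurate inverse, which is exactly the limitation noted above.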

Another popular parallel preconditioner is based on the Additive Schwarz (AS) methods . The idea is to use a domain decomposition approach for dividing the variables into a number of possibly overlapping subdomains and to address each subdomain separately at a local level (Figure 3). On the overlaps the contribution of each subdomain is added or treated in some special way by averaging techniques. AS preconditioners can be easily parallelized by assigning each subdomain to a single processor. For instance, a sequential ILU factorization can be computed for each subdomain, thus mimicking a kind of parallel ILU. If no overlap is allowed, AS preconditioners with inner ILU factorizations simply reduce to an incomplete Block Jacobi. This class of preconditioners experienced some success in the 90s as it is conceptually perfectly parallel. Unfortunately, the quality of the preconditioner deteriorates as the number of processors, that is, subdomains, grows, because the size of each subset progressively decreases. This causes an increase of the iteration count which can completely offset the advantage of using a larger number of processors.
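The nonoverlapping case, that is, Block Jacobi, is easy to sketch together with the degradation just described (dense numpy, exact inner solves for simplicity; the function name is hypothetical):

```python
import numpy as np

def block_jacobi_prec(A, blocks):
    """Nonoverlapping Additive Schwarz preconditioner: M is the block
    diagonal of A; each block solve is independent (one per processor)."""
    M = np.zeros_like(A)
    for idx in blocks:
        M[np.ix_(idx, idx)] = A[np.ix_(idx, idx)]
    return M

n = 32
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
conds = []
for nb in (2, 8):                      # number of subdomains
    size = n // nb
    blocks = [list(range(i * size, (i + 1) * size)) for i in range(nb)]
    M = block_jacobi_prec(A, blocks)
    lam = np.linalg.eigvals(np.linalg.solve(M, A)).real
    conds.append(lam.max() / lam.min())
    print(f"{nb:2d} subdomains: cond(M^-1 A) ~ {conds[-1]:.1f}")
# more subdomains -> worse conditioning -> higher iteration count
```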

Matrix partitioning into possibly overlapping subdomains.

A novel recent parallel preconditioner which tries to combine the positive features of both approximate inverses based on the Frobenius norm minimization and AS methods is Block FSAI . The basic idea exploits the fact that in the Frobenius norm minimization the preconditioned matrix can actually be forced to resemble any target matrix T, by computing a pseudoapproximate inverse M-1 such that (6.7)‖T-AM-1‖F→min over all matrices having a prescribed nonzero pattern 𝒮. This concept, originally introduced in , can be used to choose a target matrix T that is particularly attractive for parallel computing. The idea of Janna et al.  is to generalize the FSAI approach using as a target a block diagonal matrix. As FSAI is a robust preconditioner for SPD problems, Block FSAI has been originally developed for this class of matrices, although a generalization to nonsymmetric indefinite linear systems has also been attempted . So let A be SPD, with 𝒮L and 𝒮BD a sparse lower triangular and a dense block diagonal nonzero n×n pattern, respectively. Denote by nb the number of diagonal blocks and mib the size of the ibth block, and let D be an arbitrary full-rank matrix with nonzero pattern 𝒮BD. The Block FSAI preconditioner of A is defined as the product FTF, where F is a lower block triangular factor with nonzero pattern 𝒮BL=𝒮BD∪𝒮L such that (6.8)‖D-FL‖F→min, with L the exact lower Cholesky factor of A. As D is arbitrary, F as defined in (6.8) is arbitrary as well. Similarly to the FSAI algorithm, minimization of (6.8) yields a linear relationship for the ith row fi of F: (6.9)A[𝒥,𝒥]fi[𝒥]=v, where 𝒥 is the set of column indices in the ith row of F with nonzero entries and v is the null vector except for the last mib components, which are arbitrary, ib being the index of the block the row i belongs to (Figure 4). For the system (6.9) to have a unique solution, mib components of fi[𝒥] can be set arbitrarily because of the arbitrariness of D.
According to  the most convenient choice is to prescribe that all the components of fi[𝒥] falling within the ibth diagonal block are null, with the exception of the diagonal entry, which is set to 1. This implies that F is actually a unit lower triangular matrix with structure (6.10)F=[I 0 ⋯ 0; F21 I ⋯ 0; ⋮ ⋮ ⋱ ⋮; Fnb1 Fnb2 ⋯ I]. Similarly to the static FSAI algorithm, the factor F can be efficiently built in parallel, as all systems (6.9) are independent. According to (6.8), the preconditioned matrix FAFT should resemble DDT, that is, a block diagonal matrix for any arbitrary D. In other words, FAFT has the block structure (6.11)FAFT=[B1 R12 ⋯ R1nb; R12T B2 ⋯ R2nb; ⋮ ⋮ ⋱ ⋮; R1nbT R2nbT ⋯ Bnb], where the Rij are residual blocks and the Bib tend to the diagonal blocks of DDT. As D is arbitrary, there is no reason for FAFT to be better than A in a CG iteration, so it is necessary to precondition FAFT again. Assuming that the residual off-diagonal blocks in (6.11) are close to 0, an effective choice could be using as “outer” preconditioner a nonoverlapping AS method containing a local preconditioner for each block Bib. Writing the “outer” preconditioner in the factored block diagonal form J-TJ-1, the resulting preconditioned matrix reads (6.12)J-1FAFTJ-T=WAWT, with the final Block FSAI preconditioner: (6.13)M-1=WTW=FTJ-TJ-1F.

Elements belonging to fi[𝒥]  (a) and A[𝒥,𝒥]  (b).

The Block FSAI preconditioner described above can be improved by defining 𝒮BL dynamically. An adaptive algorithm for the 𝒮BL pattern identification can be implemented starting from the theoretical optimal properties of Block FSAI. Janna and Ferronato  have demonstrated that, under the hypothesis that J-TJ-1 contains the exact inverse of each diagonal block Bib, the Kaporin conditioning number β of the preconditioned matrix WAWT: (6.14)β(WAWT)=tr(WAWT)/[n det(WAWT)1/n], satisfies the following bound: (6.15)1≤β(WAWT)≤Cψ(F), where C is a constant scalar independent of W and (6.16)ψ(F)=(∏i=1n[FAFT]ii)1/n.
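The Kaporin number (6.14) is simply the ratio between the arithmetic and the geometric mean of the eigenvalues of the preconditioned matrix, so it equals 1 exactly when that matrix is a multiple of the identity. A quick numerical check (dense numpy, illustrative only):

```python
import numpy as np

def kaporin(B):
    """Kaporin conditioning number (6.14): tr(B) / (n * det(B)^(1/n));
    >= 1 for SPD B by the arithmetic-geometric mean inequality."""
    n = B.shape[0]
    return np.trace(B) / (n * np.linalg.det(B) ** (1.0 / n))

A = 2 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
print(kaporin(np.eye(5)))   # 1.0: perfectly conditioned
print(kaporin(A))           # > 1: room for improvement
```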

Block FSAI has the theoretical property of minimizing ψ(F) for any given pattern 𝒮BL. The basic idea of the adaptive algorithm is to select the off-block diagonal nonzero entries in any row fi of F so as to reduce ψ(F) as much as possible. This is feasible because each factor [FAFT]ii turns out to be a quadratic form depending on fi only. Denoting by f~i the subvector of fi including the off-block diagonal entries only, by A~ib the square submatrix of A built from the first to the mth row/column, with m the sum of the sizes of the first (ib-1) diagonal blocks of 𝒮BD, and by a~i the subrow of A with the first m elements of the ith row (Figure 5), each factor [FAFT]ii in (6.16) reads (6.17)[FAFT]ii=f~iTA~ibf~i+2f~iTa~i+aii. Minimizing every [FAFT]ii, i=1,…,n, is equivalent to minimizing ψ(F). The adaptive pattern search for F can therefore be implemented as follows. Start from an initial guess 𝒮F0 for the nonzero pattern of the off-block diagonal part of F, for example, the empty pattern, and compute the gradient gi of each quadratic form [FAFT]ii: (6.18)gi=2(A~ibf~i+a~i). Then, add to 𝒮F0 the position j in each row i corresponding to the largest component of gi computed in (6.18), so as to obtain an augmented pattern 𝒮F1. After computing the new factor F over 𝒮F1, the procedure above can be iterated in order to build 𝒮F2, 𝒮F3, and so on. The search within each row can be stopped when either a maximum number kmax of entries has been added to the initial pattern or the relative variation of [FAFT]ii in two consecutive steps is smaller than a prescribed tolerance ϵF. The construction of the dynamic Block FSAI is summarized in Algorithm 7. The Block FSAI computed with the dynamic algorithm described above is known as Adaptive Block FSAI (ABF) and has proved very effective even if the outer preconditioner does not contain the exact inverse of the diagonal blocks Bib. As with any dynamic algorithm, the parallel construction of ABF is not straightforward.
However, an almost ideal efficiency can be obtained in massively parallel computations as well. For more details, see .

<bold>Algorithm 7: </bold>Dynamic construction of the Block FSAI preconditioner.

Input: Matrix A, the initial pattern 𝒮F0

Output: Matrix F

for  i=1,,n  do

Extract 𝒥0 from 𝒮F0

if  𝒥0  then

Gather A[𝒥0,𝒥0] and A[𝒥0,i] from A

Solve A[𝒥0,𝒥0]    fi[𝒥0]=-A[𝒥0,i]

end

Set k=0 and ϵi>ϵF

while  (k<kmax) and (ϵi>ϵF)  do

kk+1

Compute gi=2(A~ibf~i+a~i)

Update 𝒥k-1 by adding the index of the largest component of gi

Gather A[𝒥k,𝒥k] and A[𝒥k,i] from A

Solve A[𝒥k,𝒥k]    fi[𝒥k]=-A[𝒥k,i]

Compute the diagonal term [FAFT]iik at the current step

Compute ϵi=([FAFT]iik-1-[FAFT]iik)/[FAFT]iik-1

end

end

Matrix A~ib and vectors f~i and a~i used in the [FAFT]ii and gi computation.

Block FSAI coupled with a block diagonal IC factorization, for example, as implemented in [74, 84, 86, 90], has proved to be a very efficient parallel preconditioner for both SPD linear systems [144, 202] and eigenproblems [204, 205] arising from different areas, such as fluid dynamics, structural mechanics, and contact problems. The combination of FSAI with block diagonal IC joins the parallelism of the former with the effectiveness of the latter. Similarly to standard incomplete Block Jacobi algorithms, the iteration count tends to grow as the number of computing cores increases. However, the convergence deterioration is much mitigated by the Block FSAI effect, which always outperforms the standard FSAI algorithm. To improve the preconditioner scalability for massively parallel simulations the use of graph partitioning and/or domain decomposition strategies can be of great help [206, 207].

7. Saddle-Point Problems
In recent years the interest in the solution of saddle-point problems has grown, with many contributions from different fields. The main reason is the great variety of applications requiring the solution of linear systems with a saddle-point matrix. For example, such problems arise in the discretization of compressible and incompressible Navier-Stokes equations in computational fluid dynamics [208, 209], constrained optimization [210, 211], electromagnetism [212, 213], mixed FE approximations in fluid and solid mechanics [208, 214], and coupled consolidation problems in geomechanics .

A saddle-point problem is defined as a linear system where the matrix A has the structure (7.1)A=[K B1; B2 -C], where K∈ℝn1×n1, B1∈ℝn1×n2, B2∈ℝn2×n1, C∈ℝn2×n2, and K, B1, and B2 are nonzero. Typically, but not necessarily, n1≥n2. Moreover, C is symmetric and positive semidefinite, and often C=0, while in most applications K is SPD and B2=B1T, that is, A is symmetric indefinite. For example, a saddle-point matrix A arises naturally in an equality-constrained quadratic problem: (7.2)J(x)=(1/2)xTKx-fTx→min subject to Bx=g, with C=0 and B2=B1T=B, solved with the aid of Lagrangian multipliers. A nonzero C may arise in some interior point methods [210, 211], compressible Navier-Stokes equations , or coupled consolidation problems . Frequently the norm of C is much smaller than that of K. Because of the great variety of applications leading to saddle-point matrices, it is no surprise that several different solution methods have been developed. For an exhaustive review, see Benzi et al. . In the context of the present paper, the attention will obviously be restricted to preconditioned iterative methods.

A natural way to solve the problem (7.3)[K BT; B -C](x1; x2)=(b1; b2) is to reduce the system (7.3) to the Schur complement form. Solve for x1 in the upper set of equations: (7.4)x1=K-1(b1-BTx2), and substitute (7.4) into the lower set: (7.5)(BK-1BT+C)x2=BK-1b1-b2. The matrix S=BK-1BT+C on the left-hand side of (7.5) is the Schur complement of system (7.3). Hence, the global solution x can be obtained by solving the Schur complement system (7.5) and substituting the result back into (7.4), which implies solving another system with the matrix K. This pair of SPD linear systems can be elegantly solved by partitioned schemes, as those proposed in [217, 218] in the context of coupled diffusion and consolidation problems, resulting in a double CG iteration where two distinct preconditioners for K and S are needed. A usually more efficient alternative is to solve the systems (7.4) and (7.5) inexactly and use the procedure described above as a preconditioner for a Krylov subspace method. This is what is traditionally referred to as constraint preconditioning, because the idea was first introduced in the solution of equality-constrained optimization problems such as (7.2) [219, 220].
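The block elimination (7.4)-(7.5) translates directly into code. The sketch below (dense numpy, names hypothetical) applies it with a generic approximation MK_inv of K^{-1}; with the exact inverse, as checked here, it reproduces the exact solution of (7.3):

```python
import numpy as np

def constraint_prec(MK_inv, B, C, r1, r2):
    """Apply the block elimination (7.4)-(7.5) with K^{-1} replaced by
    MK_inv and the Schur complement system solved directly."""
    S = B @ MK_inv @ B.T + C                      # (approximate) Schur (7.5)
    x2 = np.linalg.solve(S, B @ (MK_inv @ r1) - r2)
    x1 = MK_inv @ (r1 - B.T @ x2)                 # back substitution (7.4)
    return x1, x2

rng = np.random.default_rng(0)
K = np.diag([4.0, 5.0, 6.0, 7.0])                 # SPD (1,1) block
B = rng.standard_normal((2, 4))
C = np.zeros((2, 2))
A_sp = np.block([[K, B.T], [B, -C]])              # saddle-point matrix (7.1)
b = rng.standard_normal(6)

x1, x2 = constraint_prec(np.linalg.inv(K), B, C, b[:4], b[4:])
print(np.allclose(A_sp @ np.concatenate([x1, x2]), b))   # True: exact solve
```

In practice MK_inv would be a cheap sparse approximation, and the Schur system would itself be solved only approximately, which is exactly what distinguishes the variants discussed next.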

Quite obviously, there are several different constraint preconditioners according to how K-1 and S-1 in (7.4) and (7.5) are approximated. The most effective choice turns out to be strongly dependent on the specific problem at hand. A popular strategy consists of replacing K-1 with MK-1 and solving the Schur complement system (7.5) exactly . Such an approach can be convenient if n2 is much smaller than n1, so that the system (7.5) can be efficiently solved by a direct method. Otherwise, an inner CG iteration can be set up, exiting the procedure even at a relatively large residual . In this case the resulting preconditioner reads (7.6)MECP-1=[MK BT; B -C]-1 and is usually denoted as the Exact Constraint Preconditioner (ECP). ECP has several interesting theoretical properties which have been studied in detail by several authors. For instance, an exhaustive eigenanalysis of the preconditioned matrix can be found in . In particular, denoting by αK and βK the smallest and the largest eigenvalue, respectively, of MK-1K, Durazzi and Ruggiero  have shown that the eigenvalues λ of MECP-1A are either one, with multiplicity at least n2, or real positive and bounded as αK≤λ≤βK. Quite interestingly, this result can lead the classical CG algorithm to converge even with the indefinite matrix (7.3), provided that the last n2 components of the initial residual are zero. This can be simply accomplished by choosing MECP-1b as the initial guess.

Despite its remarkable properties, ECP has at least two severe limitations. First, MK-1 must be known explicitly to allow for the computation of S and must be very sparse in order to preserve a workable sparsity for S as well. Second, although an accurate solution of (7.5) is not really needed, solving it at each preconditioner application can represent a significant computational burden for the overall scheme. This is why ECP is often used with MK-1=[diag(K)]-1, which can be quite detrimental for the solver convergence if K is not diagonally dominant or is ill-conditioned, such as in coupled consolidation problems .

To make the preconditioner application cheaper, a surrogate of (7.5) can be solved, replacing S with another approximation MS as easy to invert as possible. Moreover, in some cases it may also be possible to approximate S without computing it explicitly, thus further reducing the constraint preconditioner cost. Obviously, the theoretical properties of ECP no longer hold true. Recalling the factorization (2.12) by Sylvester's theorem of inertia, the new preconditioner is (7.7)MICP-1=[I -MK-1BT; 0 I][MK-1 0; 0 -MS-1][I 0; -BMK-1 I]=[MK BT; B BMK-1BT-MS]-1 and is denoted as the Inexact Constraint Preconditioner (ICP). This class of preconditioners has been introduced in [130, 225, 226] using [diag(K)]-1 or AINV for MK-1 and a zero fill-in IC factorization for MS-1. The choice for MS-1 is justified by the fact that usually S is well conditioned. Moreover, this allows for splitting the central block diagonal factor in (7.7) into the product of two matrices, thus providing a factorized form for MICP-1. A block triangular variant of ICP can be easily deduced from (7.7) by neglecting either the upper or the lower outer factor: (7.8)MTICP-1=[I -MK-1BT; 0 I][MK-1 0; 0 -MS-1]=[MK BT; 0 -MS]-1. The theoretical properties of the block triangular ICP have been investigated by Simoncini  and effectively used in the solution of the discretized Navier-Stokes equations .
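The factored form (7.7) is applied by one forward block sweep, one block diagonal solve, and one backward block sweep, without ever assembling MICP. A sketch (dense numpy; MK_inv and MS_inv stand for any explicit approximations of K^{-1} and S^{-1}, and the consistency check against the assembled matrix is only for illustration):

```python
import numpy as np

def icp_apply(MK_inv, MS_inv, B, r1, r2):
    """Apply M_ICP^{-1} of (7.7) as a product of three block factors."""
    t2 = r2 - B @ (MK_inv @ r1)        # lower factor: [I 0; -B MK^{-1} I]
    x1 = MK_inv @ r1                   # central factor, first block
    x2 = -MS_inv @ t2                  # central factor, second block
    x1 = x1 - MK_inv @ (B.T @ x2)      # upper factor: [I -MK^{-1} B^T; 0 I]
    return x1, x2

# consistency check against the assembled matrix on the right of (7.7)
rng = np.random.default_rng(1)
MK = np.diag([3.0, 4.0, 5.0, 6.0])
MS = 2.0 * np.eye(2)
B = rng.standard_normal((2, 4))
MK_inv, MS_inv = np.linalg.inv(MK), np.linalg.inv(MS)
M_icp = np.block([[MK, B.T], [B, B @ MK_inv @ B.T - MS]])
r = rng.standard_normal(6)
x1, x2 = icp_apply(MK_inv, MS_inv, B, r[:4], r[4:])
print(np.allclose(np.concatenate([x1, x2]), np.linalg.solve(M_icp, r)))
```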

Like ECP, ICP also requires an explicit approximation MK-1 to build the Schur complement. This is still a limitation in some problems, such as coupled consolidation , where even AINV can be a poor approximation of K-1, leading to an unacceptably high iteration count to converge. Actually, this is only partially connected with the quality of AINV itself; rather, it is the need for the explicit construction of the Schur complement matrix that calls for a reduced fill-in of MK-1. In order to remove this inconvenience, an implicit approximation of K-1 can be used, such as a stabilized IC factorization with partial fill-in: (7.9)MK-1=L~K-TL~K-1. The new Schur complement reads (7.10)S=BL~K-TL~K-1BT+C but cannot be computed explicitly. Although the implicit form (7.10) can still be used to perform a product between S and a vector, and consequently one may think of solving the related inner system by an iterative method, a suitable preconditioner for S is not easily available. Alternatively, an explicit approximation of S can be computed, for example, using the AINV of K-1, namely, (7.11)S^=BZZTBT+C, where Z is the upper triangular factor resulting from the AINV Algorithm 4, and then an IC factorization of S^ is performed: (7.12)MS-1=L~S-TL~S-1. In this way the quality of MK-1 is improved, but generally an additional approximation in MS-1 is introduced. However, in several problems, especially those arising from coupled consolidation, where such a constraint preconditioning variant has been introduced [148, 231], the overall algorithm proves beneficial. This ICP variant, blending two different approximations for K-1 in the same scheme, reads (7.13)MMCP-1=[I -L~K-TL~K-1BT; 0 I][L~K-TL~K-1 0; 0 -L~S-TL~S-1][I 0; -BL~K-TL~K-1 I]=[L~K-T -L~K-TGT; 0 L~S-T][L~K-1 0; GL~K-1 -L~S-1], where G=L~S-1BL~K-T, and is denoted as the Mixed Constraint Preconditioner (MCP). Recently an improved MCP version has been advanced by properly relaxing the Schur complement approximation .

It is quite well accepted that constraint preconditioners outperform any other “global” preconditioning approach applied to saddle-point matrices. This remarkable property has attracted the theoretical interest of several researchers, who have investigated the spectral properties of constraint-preconditioned saddle-point matrices [225, 227, 231–233], obtaining quite accurate bounds for the eigenvalues of M-1A. For instance, in the case of MCP in (7.13), Ferronato et al.  have shown that the eigenvalues λ of MMCP-1A satisfy the following inequality: (7.14)|λ-1|≤ek(1+g2)+es, where (7.15)ek=‖I-L~K-1KL~K-T‖, es=‖I-L~S-1SL~S-T‖, g=‖G‖. In other words, ek and es are a measure of the error made when using the approximations MK-1 and MS-1 in place of K-1 and S-1, respectively, while g provides an estimate of the strength of the coupling between the upper and lower sets of equations in (7.3). Inequality (7.14) provides a useful and relatively cheap indication of how the quality of MK-1 and MS-1 affects the overall convergence rate of MCP. In particular, note that the quality of MK-1 is generally more important than that of MS-1, with the difference becoming larger as the coupling becomes stronger.

Quite naturally, there is a distinct interest in implementing constraint preconditioners on parallel computers. The main difficulties in developing efficient parallel variants of constraint preconditioners are twofold. First, the constraint preconditioning concept is inherently sequential, as it involves the preliminary computation of the Schur complement and then the subsequent approximate solution of two inner systems. Thus, a standard parallel approach where the system matrix is distributed among the processors as stripes of contiguous rows is not feasible, because all the processors owning the second block of unknowns are idle while the other processors operate on the first block, and vice versa. Second, efficient parallel approximations of $K^{-1}$ and $S^{-1}$ have to replace incomplete factorizations. For example, in incompressible Navier-Stokes problems parallel AMG techniques have been efficiently used to approximate the $K$ block, which arises from a discretized Laplace operator. In coupled consolidation better results have been obtained by using a double FSAI, parallel multilevel variants, and Block FSAI, which currently appears to be the most promising approach in the context of geomechanical problems.

8. Conclusions and Future Perspectives

The development and implementation of efficient preconditioners are the key factors for improving the performance and robustness of Krylov subspace methods in the solution of large sparse linear systems. Such an issue is a central task in a great number of different applications, not all necessarily related to PDEs, so it is no surprise that much research has been devoted to it. After a period in which most efforts were focused on direct and iterative solvers, the last two decades can be denoted as the “preconditioning era.” The number of papers, projects, and international conferences addressing this topic has largely exceeded those aimed at the development of new solvers, and this trend is not likely to fade in the next few years. This is mainly due to the fact that an optimal general-purpose preconditioner does not exist and that any specific problem can offer the opportunity to develop an ad hoc strategy.

The present paper is intended to offer an overview of the most important algebraic preconditioners available today for the solution of sparse linear systems. Hence, it cannot exhaust the subject of preconditioning as the entire class of problem-specific algorithms has been omitted. Algebraic preconditioners have been chosen for their “universality,” in that they can be effectively used as black-box tools in general-purpose codes requiring the knowledge of the system matrix only. A classification of algebraic preconditioners has been attempted based on the most significant representatives of each group along with their prominent features:

incomplete factorizations have been the earliest algebraic preconditioners massively used to accelerate Krylov subspace solvers. The several weak points related to robustness and numerical stability have stimulated the development of many different variants meant both to improve their performance and to extend their application area. Although some advances are still under way, this class of preconditioners appears to be quite mature, with the related strong and weak points very well known. Incomplete factorizations can be very effective and in many cases are still the best-performing preconditioners, but their intrinsic lack of robustness and especially their very limited degree of parallelism are probably going to restrict their role in the near future;

approximate inverses were originally developed with the aim of resolving the main difficulties of ILU-based preconditioners. Between the late 90s and early 00s they looked like a very promising approach, especially in view of their robustness and potentially high degree of parallelism. Indeed, these preconditioners have proved very robust, and thus well suited for black-box solvers, and in some cases almost perfectly parallel. In a sequential run, however, their performance can be lower than that of an ILU-based preconditioner. At present, the class of approximate inverses appears to be quite a mature field of research, with the most significant breakthroughs advanced more than 10 years ago. Nonetheless, their influence is probably going to remain very important in the next few years;

the potentially high scalability of algebraic multigrid techniques has promoted much research in this area. In contrast with both incomplete factorizations and approximate inverses, AMG can keep the iteration count to convergence stable independently of the characteristic mesh size, that is, of the overall system dimension, thus making AMG a potentially ideally scalable tool. Currently, AMG appears to be one of the most active research areas, with possible cross-fertilization from incomplete factorizations and approximate inverses in the choice of the smoother. In several applications arising from discretized PDEs with severe mesh distortions, however, AMG can still lack robustness, and more research is needed to make it the algorithm of choice in the near future;

the current hardware development is leading to a much higher degree of parallelism in virtually every computer. Besides the parallelization of existing algorithms, this trend is going to foster the development of totally new techniques which make sense only in a parallel computing environment. This is why parallel preconditioners are now emerging as a class of algorithms likely to expand in the near future. These tools can take advantage of important contributions from all the previous classes and can also revive some techniques, such as polynomial preconditioners or FSAI-based approximate inverses, which had been somewhat forgotten in recent years.

Acknowledgments

The author is very thankful to Giuseppe Gambolati and Carlo Janna for their careful reading of the paper and helpful comments.