Preconditioning for Sparse Linear Systems at the Dawn of the 21st Century: History, Current Developments, and Future Perspectives

Iterative methods are currently the solvers of choice for large sparse linear systems of equations. However, it is well known that the key factor for accelerating, or even allowing for, convergence is the preconditioner. The research on preconditioning techniques has characterized the last two decades. Nowadays, there are a number of different options to be considered when choosing the most appropriate preconditioner for the specific problem at hand. The present work provides an overview of the most popular algorithms available today, emphasizing the respective merits and limitations. The overview is restricted to algebraic preconditioners, that is, general-purpose algorithms requiring the knowledge of the systemmatrix only, independently of the specific problem it arises from. Along with the traditional distinction between incomplete factorizations and approximate inverses, the most recent developments are considered, including the scalable multigrid and parallel approaches which represent the current frontier of research. A separate section devoted to saddle-point problems, which arise in many different applications, closes the paper.

The term "preconditioning" generally refers to "the art of transforming a problem that appears intractable into another whose solution can be approximated rapidly" 1 , while the "preconditioner" is the mathematical operator responsible for such a transformation. In the context of linear systems of equations, where A ∈ R n×n and x, b ∈ R n , the preconditioner is a matrix which transforms 1.1 into an equivalent problem for which the convergence of an iterative method is much faster.

ISRN Applied Mathematics
Preconditioners are generally used when the matrix A is large and sparse, as it typically, but not exclusively, arises from the numerical discretization of Partial Differential Equations PDEs . It is not rare that the preconditioner itself is the operator which makes it possible to solve numerically the system 1.1 , that otherwise would be intractable from a computational viewpoint. The development of preconditioning techniques for large sparse linear systems is strictly connected to the history of iterative methods. As mentioned by Benzi 2 , the term "preconditioning" used to identify the acceleration of the iterative solution of a linear system can be found for the first time in a 1968 paper by Evans 3 , but the concept is pretty older and was already introduced by Cesari back in 1937 4 . However, it is well known that the earliest tricks to solve numerically a linear system were already developed in the 19th century. As reported in the survey by Saad and Van der Vorst 5 and according to the work by Varga 6 , Gauss set the pace for iterative methods in an 1823 letter, where the famous German mathematician describes a number of clever tricks to solve a singular 4 × 4 linear system arising from a geodetic problem. Most of the strategies suggested by Gauss compose what we currently know as the stationary Gauss-Seidel method, which was later formalized by Seidel in 1874. In his letter, Gauss jokes saying that the method he suggests is actually a pleasant entertainment, as one can think about many other things while carrying out the repetitive operations required to solve the systems. Actually, it is rather a surprise that such a method attracted a great interest in the era of automatic computing! In 1846 another famous algorithm was introduced by Jacobi to solve a sequence of 7 × 7 linear systems enforcing diagonal dominance by plane rotations. Other similar methods were later developed, such as the Richardson 7 and Cimmino 8 iteration, but no significant improvements in the field of numerical solution of linear systems can be reported until the 40s. This period is referred to as the prehistory of this subject by Benzi 2 . As it often occurred in the past, a dramatic and tragic historical event like World War II increased the research fundings for military reasons and helped accelerate the development and progress also in the field of numerical linear algebra. World War II brought a great development of the first automatic computing machines, which later led to the modern digital electronic era. With the availability of such new computing tools, the interest in the numerical solution of relatively large linear systems of equations suddenly grew. The more significant advances were obtained in the field of direct methods. Between the 40s and 70s the algorithms of choice for solving automatically a linear system are based on the Gaussian elimination, with the development of several important variants which helped greatly improve the native method. First, it was recognized that pivoting techniques are of paramount importance to stabilize the numerical computations of the triangular factors of the system matrix and reduce the effects of rounding errors, for example, 9 . Second, several reordering techniques were developed in order to limit the fill-in of the triangular factors and the number of operations required for their computation. The most famous bandwidth reducing algorithms, such as Cuthill-McKee and its reverse variant 10 , nested dissection 11-13 , and minimum degree 14-16 , are still very popular today. Third, appropriate scaling strategies were introduced so as to balance the system matrix entries and help stabilize the triangular factor computation, for example, 17, 18 . In the meantime, iterative methods lived in a kind of niche, despite the publication in 1952 of the famous papers by Hestenes and Stiefel 19 and Lanczos 20 who discovered independently and almost simultaneously the Conjugate Gradient CG method for solving symmetric positive definite SPD linear systems. These papers did not go unnoticed, but at that time CG was interpreted as a direct method converging in m steps, if m is the number ISRN Applied Mathematics 3 of distinct eigenvalues of the system matrix A. As it was soon realized 21 , this property is lost in practice when working in finite arithmetics, so that convergence is actually achieved in a number of iterations possibly much larger than m. This is why CG was considered an attractive alternative to Gaussian elimination in a few lucky cases only and was not rated much more than just an elegant theory. The iterative method of choice of the period was the Successive Overrelaxation SOR algorithm, which was introduced independently by Young 22 in his Ph.D. thesis and Frankel 23 as an acceleration of the old Gauss-Seidel stationary iteration. The possibility of estimating theoretically the optimal overrelaxation parameter for a large class of matrices having the so-called property A 22 gained a remarkable success for SOR especially in problems arising from the discretization of PDEs of elliptic type, for example, in groundwater hydrology or nuclear reactor diffusion 6 . In the 60s the Finite Element FE method was introduced in structural mechanics. The new method had a great success and gave rise to a novel set of large matrices which were very sparse but neither diagonally dominant nor characterized by property A. Unfortunately, SOR techniques were not reliable in many of these problems, either converging very slowly or even diverging. Therefore, it is not a surprise that direct methods soon became the reference techniques in this field. The irregular sparsity pattern of the FE-discretized structural problems gave a renewed impulse to the development of more appropriate ordering strategies based on the graph theory and more efficient direct algorithms, leading in the 70s and early 80s to the formulation of the modern multifrontal solvers 13, 24, 25 . In the 80s the first pioneering commercial codes for FE structural computations became available and the solvers of choice were obviously direct methods because of their reliability and robustness. At this time the solution of sparse linear systems by direct techniques appears to be quite a mature field of research, well summarized in famous reference textbooks such as the ones by Duff et al. 26 and Davis 27 . In contrast, during the 60s and 70s iterative solvers were living their infancy. CG was rehabilitated as an iterative method by Reid in 1971 28 who showed that for reasonably well-conditioned sparse SPD matrices the convergence could have been reached after far fewer than n steps, being n the size of A. This work set the pace for the extension of CG to nonsymmetric and indefinite problems, with the introduction of the earliest Krylov subspace methods such as the Minimal Residual MINRES 29 and the Biconjugate Gradient Bi-CG 30 algorithms. However, the crucial event for the future success of Krylov subspace methods was the publication in 1977 by Meijerink and van der Vorst of the Incomplete Cholesky CG ICCG algorithm 31 . Incomplete factorizations had been already introduced in the Soviet Union 32 and independently by Varga 33 in the late 50s, and the seminal ideas for improving the conditioning of the system matrix in the CG iteration can be found also in the original work by Hestenes and Stiefel 19 and in Engeli et al. 21 , but Meijerink and van der Vorst were the first who put the things together and showed the great potential of the preconditioned CG. The idea was later extended by Kershaw 34 to an SPD matrix not necessarily of M-type and gained soon a great popularity. The key point was that CG as well as any other Krylov subspace method with a proper preconditioning could become competitive with the latest direct methods, though requiring much less memory and so being attractive for large three-dimensional 3D simulations.
The 80s and 90s are the years of the great development of Krylov subspace methods. In 1986 Saad and Schultz introduced the Generalized Minimal Residual GMRES method 35 which soon became the algorithm of choice among the iterative methods for nonsymmetric linear systems. Some of the drawbacks of Bi-CG were addressed in 1989 by Sonnenveld who developed the Conjugate Gradient Squared CGS method 36 , on the basis of which in 1992 If the matrix M −1 A is better conditioned than A for a Krylov subspace method, then M −1 is the preconditioner and 2.1 denotes the left preconditioned system. Similarly, M −1 can be applied on the right: thus producing a right preconditioned system. The use of left or right preconditioning depends on a number of factors and can produce quite different behaviors; see, for example, 45 . For instance, right preconditioning has the advantage that the residual of the preconditioned system is the same as the native system, while with left preconditioning it is not, and this can be important in the convergence check or using a residual minimization algorithm such as GMRES. By the way, with the right preconditioning a back transformation of the auxiliary variable y must be carried out to resume the original unknown x. If the preconditioner can be written in a factorized form: a split preconditioning can be also used: thus potentially exploiting the advantages of both left and right formulations. Writing the preconditioned Krylov subspace algorithms is relatively straightforward. Simply, the basic algorithms can be implemented replacing A with either M −1 A, or AM −1 , or M −1 1 AM −1 2 , and then back substituting the original variable x where the auxiliary vector y is used. It can be easily observed that it is not necessary to build the preconditioned matrix, which could be much less sparse than A, but just to compute the product of M −1 , or M −1 1 and M −1 2 , if the factorized form 2.3 is used, by a vector. This operation is called application of the preconditioner.
Generally speaking, there are three basic requirements for obtaining a good preconditioner: i the preconditioned matrix should have a clustered eigenspectrum away from 0, ii the preconditioner should be as cheap to compute as possible, iii its application to a vector should be cost-effective.
The origin of condition i relies on the convergence properties of CG. It is quite intuitive that if M −1 resembles in some sense A −1 the preconditioned matrix should be close to the identity, thus making the preconditioned system somewhat "easier" to solve and accelerating the convergence. If A and M −1 are SPD, it has been proved that the iteration count n iter of the CG algorithm to go below the tolerance ε depends on the spectral conditioning number ξ of the preconditioned matrix G, for example 52 : so that clustering the eigenvalues of G away from 0 is of paramount importance to accelerate convergence. If A is not SPD, things can be much more complicated. For example, the performance of GMRES depends also on the conditioning of the matrix of the eigenvectors of G, and the fact that the preconditioned matrix has a clustered eigenspectrum does not guarantee a fast convergence 53 . Anyway, it is quite a common experience that a clustered eigenspectrum away from 0 often yields a rapid convergence also in nonsymmetric problems. The importance of conditions ii and iii depends on the specific problem at hand and may be highly influenced by the computer architecture. From a practical point of view, they can also be more important than condition i . For example, if a sequence of linear systems has to be solved with the same matrix A or with slight changes only, as it often may occur when using a Newton method for a set of nonlinear equations, the same preconditioner can be used several times, with the cost for its computation easily amortized. In this case, it could be convenient spending more time to build an effective preconditioner. On the other hand, the computational cost for the preconditioner application basically depends on its density. As M −1 has to be applied once or twice per iteration, according to the selected Krylov subspace method, it is of paramount importance that the increased cost of each iteration is counterbalanced by an adequate reduction of their number. Typically, pursuing condition i reduces the iteration count but is in conflict with the need for a cheap computation and application of the preconditioner, so that in the end preconditioning efficiently the linear system 1.1 is always the result of an appropriate tradeoff.
An easy way to build a preconditioner is based on the decomposition of A. Recalling the definition of stationary iteration, where r k b − Ax k is the current residual, the matrix C can be used as a preconditioner. Splitting A as D −E −F, where D, −E, and −F are the diagonal, the lower, and the upper parts of A, respectively, then are known as the Jacobi, Gauss-Seidel, SOR, and Symmetric SOR preconditioners, respectively, where ω is the overrelaxation factor. Replacing E with F in 2.8 and 2.9 gives rise to the backward Gauss-Seidel and SOR preconditioners, respectively. The preconditioners obtained from the splitting of A do not require additional memory and have a null computation cost, while their application can be simply done by a forward or backward substitution. With M −1 J a simple diagonal scaling is actually required. Because of their simplicity, Jacobi, Gauss-Seidel, SOR, and Symmetric SOR preconditioners are still used in some applications. Assume, for instance, that matrix A has two clusters of eigenvalues due to the heterogeneity of the underlying physical problem. This is ISRN Applied Mathematics 7 a quite common experience in geotechnical soil/structure or contact simulations, for example, 49, 54-58 . In this case the matrix A can be written as where the square diagonal blocks are responsible of the two clusters of eigenvalues and usually arise in a natural way from the original problem. Factorizing A in 2.11 according to Sylvester's theorem of inertia: is the Schur complement, a Generalized Jacobi GJ preconditioner 59 , based on the seminal idea in 60 , can be obtained as a diagonal approximation of the central block diagonal factor in 2.12 : where ϕ is a user-specified parameter and S is an approximation of S in 2.13 with diag K used instead of K. Quite recently Bergamaschi and Martínez 61 have indirectly proved that an optimal value for ϕ is the ratio between the largest eigenvalues of K and S. Another very simple algorithm which has proved effective in similar applications is based on SOR and Symmetric SOR preconditioners, for example, 62-64 . Though the algorithms above cannot compete with more sophisticated tools in terms of global performance 58, 65 , they are an example of how even somewhat naive ideas can work and be useful for practitioners. Generally speaking, it is always a good idea to develop an effective preconditioner starting from the specific problem at hand. Interesting strategies have been developed approximating directly the differential operator the matrix A arises from, by using a "nearby" PDE or a less accurate problem discretization, for example, 66-69 . For instance, popular preconditioners typically used within the transport community are often physically based, for example, 70-72 . For the developers of computer codes, however, using physically based preconditioners is not always desirable. In fact, it is usually simpler introducing the linear solver as a black box which is expected to work independently of the specific problem at hand, and especially so with less specialized codes such as the commercial ones. This is why a great interest has arisen also in the development of purely algebraic preconditioners that claim to be as much "general purpose" as possible. Algebraic preconditioners are usually robust algorithms which require the knowledge of the matrix A only, independently of the underlying physical problem. This paper will concentrate on such a class of algorithms. Subdividing algebraic 8 ISRN Applied Mathematics preconditioners into categories is not straightforward, as especially in recent years there have been several contaminations from adjacent fields with the development of "hybrid" approaches. Traditionally, and as already done by Benzi 2 , algebraic preconditioners can be roughly divided into two classes, that is, incomplete factorizations and approximate inverses. The main distinction is that with the former approach M is actually computed and M −1 is applied but never formed explicitly, whereas with the latter M −1 is directly built and applied. Within each class, moreover, different strategies can be pursued according to a number of factors, such as whether A is SPD or not, its sparsity structure and size, the magnitude of its coefficients, and ultimately the computer hardware available to solve the problem, so that a large number of variants have been proposed by the researchers. As anticipated before, the unavoidable contamination from adjacent fields, for example, the direct solution methods, has made the boundaries between different groups of algorithms often vague and certainly questionable.
In this work, the traditional distinction between incomplete factorizations and approximate inverses is followed, describing the most successful algorithms belonging to each class and the most significant variants developed to address particular occurrences and increase efficiency. Then, two additional sections will be devoted to the most recent results obtained in the fields that currently appear to be the more active frontier of research in the area of algebraic preconditioning, that is, multigrid techniques and parallel algorithms. Finally, a few words will be spent on a special class of problems characterized by indefinite saddle-point matrices, which arise quite frequently in the applications and have attracted much interest in recent years.

Incomplete Factorizations
Given a nonsingular matrix A such that A LU, 3.1 the basic idea of incomplete factorizations is to approximate the triangular factors L and U in a cost-effective way. If L and U are computed so as to satisfy 3.1 , for example, by a Gaussian elimination procedure, typically fill-in takes place, with L and U much less sparse than A. This is what limits the use of direct methods in large 3D problems. The approximations L and U of L and U, respectively, can be obtained by simply discarding a number of fill-in entries according to some rules. The resulting preconditioner is generally denoted as Incomplete LU ILU . Quite obviously, M −1 in 3.2 is never built explicitly as it is much denser than both L and U. Its application to a vector is therefore performed by forward and backward substitutions. Moreover, the ILU factorized form 3.2 is well suited for split preconditioning. In case A is SPD, the upper incomplete factor U is replaced by L T and the related preconditioner is known as Incomplete Cholesky IC .
The native ILU algorithm runs as follows. Define a set S of position i, j , with 1 ≤ i, j ≤ n, where either L if i > j or U if j > i has a nonzero entry. The set S is also denoted as the nonzero pattern of the incomplete factors. To avoid breakdowns in the ILU computation, it is Input: Matrix A, the nonzero pattern S Output: Matrix A containing L and U for each i, j / ∈ S do a ij 0 end for i 2, ..., n do for k 1, . . . , i − 1 and i, k ∈ S do a ik ← a ik /a kk for j k 1, ..., n and i, j ∈ S do a ij ← a ij − a ik a kj end end end Algorithm 1: IKJ variant of ILU factorization with a static pattern S.
generally recommended to include all the diagonal entries in S. Then, make a copy of A over an auxiliary matrix which will contain L and U in the lower and upper parts, respectively, at the end of the computation, and set to zero all entries a ij such that i, j / ∈ S. If the main diagonal belongs to L, then it is implicitly assumed that the diagonal of U is unitary and vice versa. For every position i, j ∈ S, the following update is computed: with k 1, . . . , i − 1. At the end of the loop, the ILU factors are stored in the copy of A. The procedure described above is clearly an incomplete Gauss elimination. Hence, no surprise that several implementation tricks developed to improve the efficiency of direct solution methods can be borrowed in order to reduce the computational cost of the ILU computation. For example, the algorithm can follow either the KIJ, or the IKJ, or the IJK elimination variants. To give an idea, Algorithm 1 provides a sketch of the sequence of operations required by the IKJ implementation, which is generally preferred whenever the sparse matrix A along with both L and U is stored in a compact form by rows. It can be easily observed that Algorithm 1 proceeds by rows, computing first the row of L and then that of U and accessing the previous rows only to gather the required information Figure 1 . Many other efficient variants have been developed, for example, storing A in a sparse skyline format or computing L by columns 73, 74 .

Fill-In Strategies
The several existing versions of the ILU preconditioner basically differ for the rules followed to select the retained entries in L and U. The simplest idea is to define S statically at the beginning of the algorithm and equal to the set of the nonzero positions in A. This idea was originally implemented by Meijerink and van der Vorst 31 in their 1977 seminal work, giving rise to what is traditionally referred to as ILU 0 , or IC 0 for SPD matrices. This simple choice is easy to implement and very cost-effective, as the preconditioner construction is quite cheap and its application to a vector costs as much as a matrix-by-vector product with A. Moreover, ILU 0 can be quite efficient, especially for matrices arising from the discretization of elliptic problems, for example, 75-77 . Unfortunately, there are a number of problems where the ILU 0 preconditioner converges very slowly or even fails. There are several different reasons for this, depending on both the unstable computation and the unstable application of L and U, as was experimentally found by Chow and Saad on a large number of test matrices 78 . For example, ILU 0 often exhibits a bad behavior with structural matrices and the highly nonsymmetric and indefinite matrices arising from computational fluid dynamics. In these cases the nonzero pattern of A proves a too crude approximation of the exact L and U structure, so that a possible remedy is to allow for a larger number of entries in L and U in order to make them closer to L and U. In the limiting case, the exact factors are computed and the solution to 1.1 is obtained in just one iteration. Obviously, enlarging the ILU pattern on one hand reduces the iteration count but on the other increases the cost of computation and application of the preconditioner, so an appropriate tradeoff must be found.
There are several different ways for enlarging the initial pattern S, or computing it in a dynamic way. In any specific problem, the winning algorithm is the one that proves able to capture the most significant entries in L and U at the lowest cost. One of the first attempts for enlarging dynamically S is based on the so-called level-of-fill concept 79, 80 . The idea is to assign an integer value ij , denoted as level-of-fill, to each entry in position i, j computed during the ILU construction process. The lower the level, the more important the entry. The initial level is set to During the ILU construction, with reference for instance to the Algorithm 1, the level-of-fill of each entry is updated according to the level-of-fill of the entries used for its computation using the following rule: In practice, any position corresponding to a nonzero entry in A remains with a zero level-offill. The terms computed using entries with a zero level-of-fill have a level equal to 1. Those built using one entry of the first level have level 2 and so on. The new level-of-fill of each entry is computed adding 1 to the sum of the levels of the entries used for it. This kind of algorithm creates a hierarchy among the potential entries that allows for selecting the "most important" nonzero terms only. The resulting preconditioner is denoted as ILU p , where p > 0 is the integer user-specified level-of-fill of the entries retained into each row of L and U. It follows immediately that ILU 0 coincides exactly with the zero fill-in ILU previously defined. In most applications, ILU 1 is already a significant improvement to ILU 0 . The experience shows that the fill-in increases quite rapidly with p, so that for p ≥ 2 the cost for building and applying the preconditioner can become a price too high to pay. Moreover it can be observed that at each level there are also many terms which are very small in magnitude, thus providing a small contribution to the preconditioning of the original matrix A. Hence, a natural way to avoid such an occurrence is to add a dropping rule according to which the terms below a user-specified tolerance τ are neglected. Several criteria have been defined for selecting τ and the way the ILU entries are discarded. A usual choice is to compute the ith row of both L and U and drop an entry whenever smaller than τ a i , being · an appropriate vector norm and a i the ith row of A, but other strategies have been also attempted, such as comparing the entry l ij of L with τl ii l jj 81 or with an estimate of the norm of the jth column of L −1 82, 83 .
Similarly to the level-of-fill approach, a drawback of the drop tolerance strategy is that the amount of fill-in of L and U is strongly dependent on the selected user-specified parameter and the specific problem at hand and is unpredictable a priori. To address such an issue a second user-specified parameter ρ can be introduced, denoting the largest number of entries that can be retained per row. In some versions, ρ is also defined as the number of entries added at most per row in excess to the number of entries of the same row of A. The use of this dual threshold strategy, which may be quite simple and effective, has been proposed by Saad 84 , and the related preconditioner is referred to as ILUT ρ, τ . A sketch of the sequence of steps required for its computation is provided in Algorithm 2.
As the drop tolerance may be quite sensitive to the specific problem and so difficult to implement in a black box solver, ILU variants that use only the fill-in parameter ρ have been proposed 85, 86 . These versions are generally outperformed by ILUT ρ, τ but are much easier to set and generally less sensitive to the user-specified parameter, allowing the user for a thorough control of the memory occupation. More recent ILU variants developed with the aim of improving the preconditioner performance and better exploiting the hardware potential are based on multilevel techniques, dense block identification, and adaptive selections of the setup parameters 87-89 , with significant contaminations from the methods developed for the latest multifrontal solvers.

Stabilization Techniques
Both Algorithms 1 and 2 require the execution of divisions where the denominator, that is, the pivot, is not generally guaranteed to be nonzero. It is well known that in finite arithmetics also small pivots can create numerical difficulties. If A is SPD, the pivot has to be also positive in order to produce a real incomplete factorization. These issues are very important because they can jeopardize the existence of the preconditioner itself and so the robustness of the overall iterative algorithm. Input: Matrix A, the user-specified parameters ρ and τ Output: Matrices L and U for i 1, . . . , n do w a i for k 1, . . . , i − 1 and w k / 0 do Meijerink and van der Vorst 31 demonstrated that for an M-matrix a zero fill-in IC is guaranteed to exist. However, if A is SPD but not of M-type zero or negative pivots may arise, thus theoretically preventing from the IC computation. Typically, such an inconvenience may be avoided by increasing properly the fill-in. In this way the incomplete factor tends to the exact factor, which is guaranteed to be SPD. However, this may generate a fill-in degree that is unacceptably high, lowering the performance of the preconditioned solver. To avoid this undesirable occurrence a number of tricks have been proposed which can greatly increase the IC robustness. The first, and maybe simplest, idea was advanced by Kershaw 34 who just suggested to replace zero or negative pivots with an arbitrary positive value, for example, the last valid pivot. The implementation of this trick is straightforward, but it seldom provides good results as the resulting IC often exhibits a very low quality. In fact, recall from Algorithm 1 that modifying arbitrarily a pivot in the kth row means influencing the computation of all the following rows. Another simple strategy recently advanced in 90 relies on avoiding the computation of the square root of the pivot, required in the IC algorithm, thus eliminating a source of breakdown. In 91 a diagonal compensated reduction of the positive off-diagonal entries is suggested in order to transform the native matrix into an M-matrix before the incomplete factorization process is performed. A different algorithm is proposed in 92 with the incomplete factors obtained through an A-orthogonalization instead of a Cholesky elimination.
A famous stabilization technique to ensure the IC existence is the one advanced by Ajiz and Jennings 93 , that has proved quite effective especially in structural applications 50, 90 . Following this approach, we can write where R is a residual matrix collecting with all the entries discarded during the incomplete procedure. From 3.6 , L L T is actually the exact factorization of A − R. As A is SPD, it follows immediately that a sufficient condition for L to be real is that R is negative semidefinite. Suppose that the first column of L has been computed and the term l j1 a j1 /a 11 has to be dropped along with the corresponding l 1j term in L T . Hence, the incomplete factorization is equivalent to the exact factorization of A where the entries a 1j and a j1 are replaced by zero, and the error matrix R contains a j1 in position 1j and j1. For example, in a 3 × 3 matrix with j 2, we have The residual matrix R can be forced to be negative semidefinite by adding two coefficients α and β in position 1, 1 and j, j , such that the submatrix α a 1j a j1 β 3.8 is negative semidefinite. The matrix A must be modified accordingly by subtracting α and β from the corresponding diagonal terms. Though different choices are possible, for example, 50, 93 , the option α β −|a 1j | is particularly attractive because in this way the sum of the absolute values of the arbitrarily introduced entries |α| |β| is minimized. This procedure ensures that L L T is the exact factorization of a positive definite matrix, hence its existence is guaranteed, but obviously such a matrix can be quite different from A. The quality of this stabilization strategy basically depends on the quality of A as an approximation of A. Unstable factors and negative pivots typically arise if the diagonal terms of A are remarkably smaller than the off-diagonal ones, or if there are significant contrasts in the coefficient magnitude. The latter occurrence is particularly frequent in heterogeneous problems or in models coupling different physical processes, such as multiphase flow in deformable media, coupled flow and transport, themomechanical models, fluid-structure interactions, or multibody systems, for example, 94-97 . All these problems give naturally rise to a multilevel matrix: where the system unknowns can be subdivided into contiguous groups. By distinction with multigrid methods, a multilevel approach assumes that the different levels do not necessarily correspond to model subgrids. Several strategies have been proposed to handle properly the matrix levels, for example, 98-102 . Generally speaking, all such algorithms can be viewed as approximate Schur complement methods that differ for the strategy used to perform the level subdivision and the restriction and prolongation operators adopted 103, 104 . For a recent review, see, for instance, 105 . The basic idea of multilevel ILU proceeds as follows. Consider the partial factorization of A in 3.9 , where A 1 is the submatrix obtained from A without the first level, and B 12 and B 21 are the rectangular submatrices collecting the blocks B 1j and B j1 , j 2, . . . , : In 3.10 K 1 L 1 U 1 and S 1 is the Schur complement: Replacing L 1 and U 1 with the ILU factors L 1 and U 1 of K 1 gives rise to a partial incomplete factorization of A that can be used as a preconditioner: where

3.13
3.14 As H 12 and H 21 tend to be much denser than B 12 and B 21 , some degree of dropping can be conveniently enforced, thus replacing these blocks with H 12 and H 21 , respectively. Recalling the level structure of A 1 , the inverse of the approximated Schur complement S 1 can be also replaced by a partial incomplete factorization with the same structure as M −1 in 3.12 .
Repeating recursively the procedure above starting from S 0 A, at the ith level, S −1 i is replaced by thus giving rise to an i 1 th level Schur complement The algorithm stops when the size of S i 1 equals 0, that is, when i 1 equals the user specified number of levels . Multilevel ILU preconditioners are generally much more robust than standard ILU and can be very effective provided that a suitable level subdivision is defined. If A is SPD, however, it is necessary to enforce the positive definiteness of all approximate Schur complements, which is no longer ensured when incomplete decompositions and dropping procedures are used. Recently, Janna et al. 106 have developed a stabilization strategy which can guarantee the positive definiteness of each S i . Basically, they extend to levels of the factorization procedure introduced by Tismenetsky 107 where the Schur complement is updated by a quadratic formula. The theoretical robustness of Tismenetsky's algorithm has been proved in 108 . Recalling 3.6 , the matrix S i can be additively decomposed as where R is the residual matrix obtained from the IC decomposition of S i, 11 . Assuming that an appropriate stabilization technique has been used, for example, the one by Ajiz Ignoring the last three terms of 3.17 we obtain that is the standard Schur complement usually computed in a multilevel algorithm; see 3.14 .
As E T H H i 1,i 2 is generally indefinite, S i 1 in 3.18 may be indefinite too. As a remedy, add and subtract H T i 1,i 2 H i 1,i 2 to the right-hand side of 3.17 : Recalling that

becomes
Neglecting the last term in 3.21 we obtain another expression for the Schur complement: which is always SPD. In fact, 3.21 yields which is SPD independently of the degree of dropping enforced on H i 1,i 2 . The procedure above is generally more expensive than the standard one but is also more robust.
If A is not SPD, enforcing the positivity of pivots or the positive definiteness of the Schur complements is not necessary. However, the experience shows that, in many problems, especially those with a high degree of nonsymmetry and a lack of diagonal dominance, ILU can be much more unstable in both the computation and the application, leading to frequent solver failures 78 . In this case no rigorous methods have been introduced to avoid numerical instabilities. The effect of a preliminary reordering and/or scaling of A has been debated for quite a long time, without achieving a shared conclusion. The fact that the matrix nonzero pattern has an effect on the quality of the resulting ILU preconditioner, similarly to what happens with direct methods, is quite well understood, but not all the methods performing well with direct techniques appear to be efficient for ILU, for example, 90, 109-112 . A similar result is obtained with a preliminary scaling. As proved in 113 , scaling the native matrix does not change the spectral properties of an ILU preconditioned matrix; however it can greatly stabilize the ILU computation by reducing significantly the impact of the round-off errors. Useful scaling strategies can be taken from the GJ preconditioning of 2.14 or the Least Square Log algorithm introduced in 18 .

Approximate Inverses
Incomplete factorizations can be very powerful preconditioners; however the issues concerning their existence and numerical stability can undermine their robustness in several examples. One reason for such a weakness can be understood using the following simple argument. Consider the incomplete factorization L U along with the corresponding residual matrix for a general matrix A: Left and right multiplying both sides of 4.1 by L −1 and U −1 , respectively, yields The matrix controlling the convergence of a preconditioned Krylov subspace method is actually L −1 A U −1 which differs from the identity by L −1 R U −1 . This means that a residual matrix close to 0 is not sufficient to guarantee a fast convergence. In fact, if the norms of either L −1 or U −1 are large, L −1 A U −1 can be anyway far from the identity and yields a slow convergence, that is, even an accurate incomplete factorization can provide a poor outcome. This is why problems with strongly nonsymmetric matrices and a lack of diagonal dominance incomplete factorizations often have a bad reputation.
To cope with these drawbacks the researchers in the past have developed a second big class of preconditioners with the aim of computing an explicit form for M −1 , thus avoiding the solution of the linear system needed to apply the preconditioner. Generally speaking, an approximate inverse is an explicit approximation of A −1 . Quite obviously, it has to be sparse in order to maintain a workable cost for its computation and application. It is no surprise that several different methods have been proposed to build an approximate inverse. The most successful ones can be roughly classified according to whether M −1 is computed monolithically or as the product of two matrices. In the first case, the approximate inverse is generally computed by minimizing the Frobenius norm: thus obtaining either a right or a left approximate inverse, which is generally different. In the second case, the basic idea is to find an explicit approximation of the inverse of the triangular factors of A: and m j J the subvector of m j made of the components included in J. In the product Am j only the columns of A with index in J will provide a nonzero contribution. Moreover, as A is sparse, only a few rows will have a nonzero term in at least one of these columns and is typically quite small. The procedure illustrated above allows for the computation of a right approximate inverse. To build a left approximate inverse just observe that

4.10
Hence, 4.9 can still be used for any j replacing A with A T and obtaining the rows of M −1 . It turns out that for strongly nonsymmetric matrices the right and left approximate inverse can be quite different.
The main difference in the algorithms developed within this class relies on how the nonzero pattern S is selected, and this is indeed the key factor for the overall quality of the approximate inverse. A more detailed discussion on this point will be provided in the next section. Some of the most popular algorithms in this class have been advanced in 116-121 , each one with its pros and cons. Probably the most successful approach is the one proposed by Grote and Huckle 117 which is known as SPAI SParse Approximate Inverse . Basically, the SPAI algorithm follows the lines described above dynamically enlarging J for each column of M −1 until the least square problem 4.9 is solved with a prescribed accuracy. SPAI has proved to be quite a robust method allowing for a good control of its quality by the user. Unfortunately, its computation can be quite expensive also on a parallel machine. A sketch of the SPAI construction is reported in Algorithm 3.
A different way to build an approximate inverse of A relies on approximating the inverse of its triangular factors. Assuming that all the leading principal minors of A are not null, that is, A can be factorized as the product LDU where L is unit lower triangular, D is diagonal, and U is upper unit triangular, then implying that the rows of L −1 and the columns of U −1 are two sets of A-biconjugate vectors. In other words, denoting by w i and z j the ith row of L −1 and the jth column of U −1 , respectively, 4.11 means that w T i Az j 0 ∀i, j 1, . . . , n, i / j.

4.12
If W and Z are the matrices collecting all vectors w i and z j , i, j 1, . . . , n, satisfying the condition 4.12 and D W T AZ, then the inverse of A reads and can be computed explicitly. There are infinite sets of vectors satisfying 4.12 , and they can be computed by means of a biconjugation process, for example, via a two-sided Gram-Schmidt algorithm, starting from two arbitrary initial sets. Quite obviously, these vectors are dense and their explicit computation is practically unfeasible. However, if the process is carried out incompletely, for instance, dropping the entries of w k and z k , k 1, . . . , n, whenever smaller than a prescribed tolerance, the matrices W and Z can be acceptable approximations of L −T and U −1 though preserving a workable sparsity. The resulting preconditioner is denoted as AINV Approximate INVerse and has been introduced in the late 90s by Benzi and coauthors 122-124 . In the following years the native algorithm has been further developed by other researchers as well, for example, [125][126][127][128][129][130] . A sketch of the basic procedure to build the AINV preconditioner is provided in Algorithm 4.  Quite obviously, if A is SPD, then W Z, and in Algorithm 4 only the update of z i must be carried out. The resulting AINV preconditioner can be written as M −1 ZZ T , is SPD, and exists if p k / 0 for all k 1, . . . , n. Unfortunately, the same does not hold true for nonfactored sparse approximate inverses based on minimizing the Frobenius norm in 4.3 . In general, using an algorithm developed along the lines of SPAI there is no guarantee that M −1 is SPD if A is so. This is quite a strong limitation, as it prevents from using the powerful CG method to solve an SPD linear system. Such a weak point motivated the development of the Factorized Sparse Approximate Inverse FSAI preconditioner by Kolotilina and Yeremin 131 . The FSAI algorithm lies somewhat between the two classes of approximate inverses previously identified, as it produces a factored preconditioner in the form where G is a sparse approximation of the inverse of the exact Cholesky factor L of an SPD matrix A obtained with a Frobenius norm minimization procedure. Select a lower triangular nonzero pattern S L and find the matrix G ∈ W SL , where W SL is the set of matrices with pattern S L such that The minimization 4.15 can be performed recalling that Deriving the right-hand side of 4.16 with respect to the nonzero entries of G and setting to 0 yield where the symbol · ij denotes the entry in position i, j of the matrix in the brackets. As S L is a lower triangular pattern, the entry l ji of L is nonzero only if i j and the off-diagonal terms of L are not required. Scaling 4.17 so that the right-hand side is either 1 or 0 yields where G is the scaled factor G and δ ij the Kronecker delta. Solving 4.18 by rows leads to a sequence of dense SPD linear systems with size equal to the number of nonzeroes set for each row of G. More precisely, denoting by J the set of column indices where the ith row g i of G has a nonzero entry, the following SPD systems have to be solved: where i m ∈ R m has zero components except the mth one that is unitary, and m is the cardinality of J. The FSAI factor G is finally restored from G as such that the preconditioned matrix GAG T has all diagonal entries equal to 1. It has been demonstrated that FSAI is optimal in the sense that it minimizes the Kaporin conditioning number of GAG T for any G ∈ W SL 131, 132 . Another remarkable property of FSAI is that the minimization in 4.15 does not require the explicit knowledge of L, which would make the algorithm unfeasible. A sketch of the procedure to build FSAI is reported in Algorithm 5. The FSAI preconditioner was later improved in several ways 133 , with the aim of increasing its efficiency by dropping the smallest entries while preserving its optimal properties. Between the late 90s and early 00s further developments of FSAI were abandoned, as AINV proved generally superior. In recent years, however, a renewed interest for FSAI has started, mainly because of its high intrinsic parallel degree, for example, 134-136 . The native FSAI algorithm has been developed for SPD matrices. A generalization to nonsymmetric matrices is possible and quite straightforward 137, 138 ; however the resulting preconditioner is much less robust because the solvability of all local linear systems and the nonsingularity of the approximate inverse are no longer theoretically guaranteed. This is why with general sparse matrices SPAI and AINV are generally preferred.

Nonzero Pattern Selection and Stability Issues
A key factor for the efficiency of approximate inverse preconditioning based on the Frobenius norm minimization is the choice of the nonzero pattern S or S L . If the nonzero pattern is not good for the problem at hand, the result is generally a failure in the Krylov solver convergence. This is due to the fact that approximate inverse techniques assume that a sparse matrix M −1 can be a good approximation, in some sense, of A −1 . This is not obvious at all, as the inverse of a sparse matrix is structurally dense. If A is diagonally dominant, the entries of A −1 decay exponentially away from the main diagonal 139 , hence a banded pattern for M −1 typically provides good results. Other favorable situations include block tridiagonal or

ISRN Applied Mathematics
Input: Matrix A and the nonzero pattern S L Output: pentadiagonal matrices 140 , but for a general sparse matrix choosing a good nonzero pattern is not trivial at all.
With respect to the nonzero pattern selection, approximate inverses can be defined as static or dynamic, according to whether S and S L are initially chosen and kept unvaried during the computation, or S and S L are generated by an adaptive algorithm which selects the position of the nonzero entries in an a priori unpredictable way. Typically, dynamic approximate inverses are more powerful than static ones, but their implementation especially on parallel computers can be much more complicated. The most popular and common strategy to define a static nonzero pattern for M −1 is to take the nonzero pattern of a power κ of A as is justified in terms of the Neumann series expansion of A −1 . In real problems, however, it is uncommon to set a κ value larger than 2 or 3, as S can soon become too dense, so it is often preferable working with sparsified matrices. The sparsification of A κ can be done by either a prefiltration, or a postfiltration strategy, or both. For example, Chow 141 suggests dropping the entries of A smaller than a threshold, thus obtaining the sparsified matrix A and using the pattern of A κ . An a posteriori sparsification strategy is advanced for the FSAI algorithm by Kolotilina and Yeremin 133 with the aim of reducing the cost for the preconditioner application by dropping its smallest entries. Such a dropping, however, is nontrivial and is performed so as to preserve the property that the FSAI preconditioned matrix has unitary diagonal entries, with a possibly nonnegligible computational cost. Anyway, pre-and/or postfiltration of a static approximate inverse is generally to be recommended, as the cost of the preconditioner application can be dramatically reduced with a relatively small increase of the iteration count, for example, 135, 136, 142 . By the way, this suggests that the pattern of A κ is often much larger than what would be actually required to get a good preconditioner, the difficulty relying on the a priori recognition of the most significant entries.
Dynamic approximate inverses are based on adaptive strategies which start from a simple initial guess, for example, a diagonal pattern, and progressively enlarge it until a certain criterion is satisfied. For instance, a typical dynamic algorithm is the one proposed by Grote and Huckle 117 for the SPAI preconditioner; see Algorithm 3. More specifically, they suggest to collect the indices of the nonzero components of the residual vector in a new set L and to enlarge the current set J by adding the new column indices that appear in the rows with index in L. An improvement can be obtained by retaining only a few new column indices, specifically the indices j such that the new residual norm: is larger than a user-specified tolerance. Then, the set I is enlarged correspondingly, and a new least square problem is solved. Other dynamic techniques for generating S have been advanced by Huckle for both nonsymmetric 143 and SPD matrices 134 . More recently, Jia and Zhu 121 have advanced a novel dynamic approximate inverse denoted as PSAI Power Sparse Approximate Inverse , and Janna and Ferronato 144 have developed a new adaptive strategy for Block FSAI that has proved quite efficient also for the native FSAI algorithm and will be presented in more detail in the sequel.
In contrast with approximate inverses based on the Frobenius norm minimization, the AINV algorithm does not need the selection of the nonzero pattern, so it can be classified as a purely dynamic preconditioner. However, its computation turns out to be pretty similar to that of an incomplete factorization, see Algorithms 2 and 4, thus raising concerns on its numerical stability. In particular, due to incompleteness a zero pivot may result, causing a breakdown, or small pivots can trigger unstable behaviors. To avoid such difficulties a stabilized AINV preconditioner has been developed independently in 124, 145 for SPD problems. For generally nonsymmetric problems, especially with large off-diagonal entries, the numerical stability in the AINV computation may still be an issue.
Theoretically, for an arbitrary nonzero pattern S it is not guaranteed that M −1 is nonsingular. This could also happen for a few combinations of the user-specified tolerances set for generating dynamically the approximate inverse pattern. However, such an occurrence is really unlikely in practice, and typically approximate inverses prove to be very robust in problems arising from a variety of applications. According to the specific problem at hand, the choice of the most efficient type of approximate inverse may vary. For a thorough comparative study of the available options as of 1999 see the work by Benzi and Tůma 146 . Generally, SPAI is a robust algorithm able to converge also in difficult problems where incomplete factorizations fail, for example, 147 , allowing the user to have a complete control of the quality of the preconditioner through the parameter ε, see Algorithm 3. However, its computation is quite expensive and can be rarely competitive with AINV if only one system has to be solved so that the construction time cannot be amortized by recycling a few times the same preconditioner. For SPD problems, SPAI is not an option and FSAI is always well defined and a very stable preconditioner, even if its convergence rate may be lower than AINV. For unsymmetric matrices, however, FSAI is not reliable and may often fail. On the other hand, AINV seldom fails and generally outperforms both SPAI and FSAI.
Numerical experiments in different problems typically show that, whenever a stable ILU-type preconditioner can be computed, that is probably the most efficient preconditioner on a scalar machine 129, 148, 149 . However, approximate inverses play an important role in offering a stable and robust alternative to ILU in ill-conditioned problems. Moreover, they can become the preconditioner of choice for parallel computations, as it will be discussed in more detail in the sequence.

Algebraic Multigrid Methods
It is well known that the convergence rate of any Krylov subspace method preconditioned by either an incomplete factorization or an approximate inverse tends to slow down as the problem size increases. The convergence deterioration along with the increased number of operations per iteration may lead to an unacceptably high computational cost, thus limiting de facto the size of the simulated model even though large memory resources are available. This is why in recent years much work has been devoted to the so-called scalable algorithms, ISRN Applied Mathematics that is, solvers where the iteration count is insensitive to the characteristic size of the discretization mesh.
Multigrid methods can provide an answer to such a demand. Pioneering works on multigrid are due to Brandt 150,151 who promoted these techniques for the solution of regularly discretized elliptic PDEs. The matrices arising from this kind of problem have a regular structure for which all eigenvalues and eigenvectors are known. For example, let us consider the 1D model problem: with homogeneous Dirichlet boundary conditions discretized by centered differences over n 2 equally spaced points x i , i 0, . . . , n 1. The grid characteristic size is therefore h 1/ n 1 . This discretization results in the tridiagonal n × n linear system A h x b In other words, the ith component of w k is actually w k,i sin kπx i .

5.4
Let us solve the discretized model problem by a stationary method, for example, the Jacobi iteration. As the main diagonal of A h has all components equal to 2, the Jacobi iteration matrix S h is just and the kth component of the error along the direction of the kth eigenvector of S h is damped at each iteration by the reduction coefficient η k : The reduction coefficients with k around n/2 are very small, and the related error components are damped very rapidly in a few iterations. This part of the error is often referred to as the oscillatory part corresponding to high frequency modes. However, if h is small, there are also several reduction coefficients close to 1, related to both the smallest and largest values of k, that slow down significantly the stationary iteration. This part of the error is usually referred to as the smooth part corresponding to low frequency modes. Let us now consider a coarser grid where the characteristic size is doubled, that is, we simply keep the nodes of the initial discretization with an even index. Recalling 5.4 it follows that a smooth mode on the fine grid, that is, k < n/4 and k > 3n/4, is now automatically injected into an oscillatory mode on the coarse grid. Building A 2h and S 2h , it is now possible to damp in a few iterations the error on the coarse grid and then project back the result on the fine grid. Similar conclusions can be drawn also using other stationary iterations, such as Gauss-Seidel, though in a much less straightforward way.
Recalling the previous observations, the basic idea of multigrid proceeds as follows. The operator S h , usually denoted as the smoother in the multigrid terminology, is applied to the fine grid discretization for ν 1 iterations. The resulting fine grid residual is computed and injected into the coarse grid through a restriction operator R H h . Recalling the relationship between error and residual, the coarse grid problem is solved using the residual as known term, thus providing the coarse grid error. Such an error is then projected back to the original fine grid using a prolongation operator P h H , and the fine grid solution, obtained after the initial ν 1 smoothing iterations, is corrected. Finally, the fine grid smoother is applied again until convergence. The algorithm described above is known as two-grid cycle and is the basis of multigrid techniques. Quite naturally, the two-grid cycle can be extended in a recursive way. At the coarse level, another two-grid cycle can be applied moving at a coarser grid, defining appropriate smoother, restriction, prolongation operators, and so on, until the coarsest problem is so small that it can be efficiently solved by a direct method. The recursive application of the two-grid cycle gives rise to the so-called V-cycle. Other more complex recursions can provide the so-called W-cycle; see, for instance, 152 for details.
Multigrid methods have been introduced as a solver for discretized PDEs of elliptic type, and indeed in such problems they have soon proved to be largely superior to existing algorithms. The first idea to extend their use to other applications was to look at the multigrid as a purely algebraic solver, where one has to define the smoother, restriction, and prolongation operators knowing the system matrix A only independently of the grid and the discretized PDE it actually comes from. This strategy, known as Algebraic Multigrid AMG , was first introduced in the mid 80s 153-155 and became soon quite popular. The second idea was to regard AMG not as a competitor with Krylov subspace solvers, rather as a potentially effective preconditioner. In fact, to work as a preconditioner it is simply sufficient to fix the number ν 2 of smoothing iterations needed to reconstruct the system solution at the finest level, thus obtaining an inexact solution. The scheme of a V-cycle AMG application as a preconditioner is provided in Algorithm 6.
Basically, AMG preconditioners vary according to the choice of both the restriction operator and the smoother, while the prolongation operator is often defined as the transposed of the restrictor. The basic idea for defining a restriction is to coarsen the native pattern of A and prescribe an appropriate interpolation over the coarse nodes. Classical coarsening strategies have been introduced in 155, 156 which separate the variables into coarse Cand fine F-points according to the strength of each matrix connection. A point j strongly influences the point i if where θ is the user-defined strength threshold. For each point j that strongly influences an Fpoint i, j is either a C-point or is strongly influenced by another C-point k.  is based on the so-called smoothed aggregation SA technique 157 , where a coarse node is defined by an aggregate of a root point i and all the neighboring nodes j such that a ij > θ a ii a jj .

5.8
Another possibility is to use a variant of the independent set ordering ISO technique which subdivides a matrix into a 2 × 2 block structure and projects the native system into the Schur complement space. Using ISO or SA can lead to an elegant relationship between AMG and multilevel ILU as exploited in 98, 101, 158, 159 . More recently, a number of both parallel and scalar coarsening strategies have been also advanced, for example, 160-164 . As far as the smoother is concerned, for many years the method of choice has been the standard Gauss-Seidel iteration. With the development of parallel computers other smoothers have been introduced with the aim of increasing the parallel degree of the resulting AMG algorithm. Multicoloring approaches 165 can be used in connection to standard Gauss-Seidel, or also polynomial and C-F Jacobi smoothers 166, 167 . It is no surprise that also approximate inverses can be efficiently used as a smoother, for example, 168, 169 .
The last decade has witnessed an explosion of research on AMG techniques. The key factor leading to such a great interest is basically their potential scalability with the size of the problem to be solved, in the sense that the iteration count to converge for a given problem does not depend on the number of the mesh nodes. Several theoretical results have been obtained with the aim of generalizing as much as possible AMG to nonelliptic problems, for example, 158, 170-174 . AMG techniques have been used in several different applications with promising results, such as fluid-structure interactions, meshless methods, Maxwell and Helmoltz equations, structural mechanics, sedimentary basin reconstruction, and advectionconvection and reactive systems, for example, 175-185 . The above list of references is just representative and of course largely incomplete. Nonetheless, much work is still to be done in order to make AMG the method of choice. In particular, difficulties can be experienced where the system matrix arises from highly distorted and irregular meshes, or in presence of strong heterogeneities. In these cases, even the most advanced AMG strategies can fail.

Parallel Preconditioners
Parallel computing is widely accepted as the only pathway toward the possibility of managing millions of unknowns 186 . At the same time, hardware companies are improving the available computational resources with an increasing degree of parallelism rather than accelerating each single computing core. This is why the interest in parallel computations is continuously growing in recent years. Krylov subspace methods are in principle almost ideally parallel, as their kernels are matrix-vector and scalar products along with vector updates. Unfortunately, the same is not true for most algebraic preconditioners. There are basically two approaches for developing parallel preconditioners: the first tries to extract as much parallelism as possible from existing algorithms with the aim of transferring them to high-performance platforms, while the second is based on developing novel techniques which would not make sense on a scalar computer. Quite obviously, the former approach is easier to understand from a conceptual point of view, as the native algorithm is not modified and the difficulties are mainly technological. The latter implies an additional effort to develop new explicitly parallel methods.
ILU-based preconditioners are highly sequential in both their computation and application. Anyway, some degree of parallelism can be achieved through graph coloring techniques. These strategies have been known from a long time by numerical analysts, for example, 6 , and used in the context of relaxation methods. The aim is to build a partition of the matrix graph such that all vertices in the same group form an independent set, so that all unknowns in the same subset can be solved in parallel. Other useful orderings are based on domain decomposition strategies minimizing the edges connecting different subdomains, with a local ordering applied also to each unknown subset. Parallel ILU implementations based on these concepts have been developed in 187-190 ; however it was soon made clear that generally such algorithms could not have a promising scalability.
In contrast, approximate inverses are intrinsically much more parallel than ILU factorizations, as they can be applied by a matrix-vector product instead of forward and backward substitutions. The construction of an approximate inverse, however, can be difficult to parallelize. For example, the AINV Algorithm 4 proceeds by columns and is conceptually pretty similar to an ILU factorization. A parallel AINV construction can be obtained by means of graph partitioning techniques which decompose the adjacency graph of A in a number of subsets of roughly the same size with a minimal number of edge cuts 191 . In such a way the inner variables of each subset can be managed independently by each processor, while communications occur only with the boundary variables. By distinction, the construction of an approximate inverse based on a Frobenius norm minimization is naturally parallel if a static nonzero pattern is defined a priori. For example, the computation of both static SPAI and FSAI consists of a number of independent least-square problems and dense linear systems, respectively, which can be trivially partitioned among the available processors. Hence, static SPAI and FSAI are almost perfectly parallel algorithms. However, when the nonzero pattern is defined dynamically parallelization is no longer straightforward, as the operation of gathering the required submatrix of A, see Algorithms 3 and 5, gives rise to unpredictable communications. For example, details of the nontrivial parallel implementation of the dynamic SPAI algorithm are provided in 147 .
The theoretical scalability properties of AMG make it very attractive for parallel computations. This is why in the last few years the research on AMG techniques has concentrated on high-performance massively parallel implementations. The main difficulties may arise in parallelizing the Gauss-Seidel smoother and the coarsening stage. As mentioned before, these problems can be overcome by using naturally parallel smoothers, such as Jacobi, relaxed 28 ISRN Applied Mathematics Jacobi, polynomial or a static approximate inverse smoother, for example, 162, 166 , and parallel coarsening strategies, for example, 160,162,192,193 . Currently, efficient parallel implementations of AMG are already available in several software packages, for example, Hypre 194 and Trilinos 195 . An alternative strategy for developing parallel preconditioners relies on building new algorithms which should consist of matrix-vector products and local triangular solves only. Perhaps the earliest technique of this kind belongs to the class of the polynomial preconditioners which was first introduced by Cesari as back as in 1937 4 , even though obviously not in the context of Krylov subspace methods and supercomputing. Modern polynomial preconditioners have been developed in [196][197][198] . A polynomial preconditioner is defined as where s ∈ P k is a polynomial of degree not exceeding k. The preconditioned system becomes: where s A and A commute, so that right and left preconditionings coincide. Quite obviously, s A A is never formed explicitly and its application to a vector is simply performed by a sequence of matrix-vector products. Therefore, polynomial preconditioners can be ideally implemented in a parallel environment. There are different polynomial preconditioners according to the choice of s A . One of the simplest options, advanced in 196 , is using the Neumann expansion I N N 2 · · · N l , where N I − ωA 6.3 and ω is a scaling parameter. The degree l ≤ k cannot be too large, and this can limit the effectiveness of this preconditioner. Another option, advanced in 197 , is to select a somewhat "optimal" polynomial, where optimality is evaluated in some sense. A natural choice is prescribing that the eigenspectrum of the preconditioned matrix is as close as possible to that of the identity, that is, finding s ∈ P k such that where σ A is the eigenspectrum of A. The min-max problem above can be addressed by using Chebyshev polynomials but cannot be solved exactly, as σ A is generally not known. An approximate minimization can be done replacing σ A with a continuous set which encloses σ A . A third option is to compute s as the polynomial that minimizes 1 − λs λ w , 6.5 where the symbol · w denotes the 2-norm induced by the inner product on the space P k with respect to some nonnegative weight function defined in the interval α, β : The resulting polynomial is known as least-squares and was actually introduced by Stiefel in the solution of eigenproblems 199 . According to Saad 198 , least-squares polynomials tend to perform better than Chebyshev and Neumann. However, a major drawback for the use of this kind of preconditioners stems from the need at least for estimates of the smallest and largest eigenvalues of A and often their performance is quite poor in terms of convergence rate 200 . Another popular parallel preconditioner is based on the Additive Schwarz AS methods 201 . The idea is to use a domain decomposition approach for dividing the variables into a number of possibly overlapping subdomains and to address separately each subdomain at a local level Figure 3 . On the overlaps the contribution of each subdomain is added or treated in some special way by averaging techniques. AS preconditioners can be easily parallelized by assigning each subdomain to a single processor. For instance, a sequential ILU factorization can be computed for each subdomain, thus miming a kind of parallel ILU. If no overlaps are allowed for, AS preconditioners with inner ILU factorizations simply reduce to an incomplete Block Jacobi. This class of preconditioners has experienced some success in the 90s as it is conceptually perfectly parallel. Unfortunately, the quality of the preconditioner deteriorates as the number of processors, that is, subdomains, grows, because the size of each subset progressively decreases. This causes an increase of the iteration count which can also completely offset the advantage of using a larger number of processors.
A novel recent parallel preconditioner which tries to combine the positive features of both approximate inverses based on the Frobenius norm minimization and AS methods is Block FSAI 202 . The basic idea exploits the fact that in the Frobenius norm minimization 30 ISRN Applied Mathematics the preconditioned matrix can be actually forced to resemble any target matrix T , by computing a pseudoapproximate inverse M −1 such that for all matrices having a prescribed nonzero pattern S. This concept, originally introduced in 120 , can be used to choose a target matrix T that is particularly attractive for parallel computing. The idea of Janna et al. 202 is to generalize the FSAI approach using as a target a block diagonal matrix. As FSAI is a robust preconditioner for SPD problems, Block FSAI has been originally developed for this class of matrices, although a generalization to nonsymmetric indefinite linear systems has also been attempted 203 . So let A be SPD with S L and S BD a sparse lower triangular and a dense block diagonal nonzero n × n pattern, respectively. Denote by n b the number of diagonal blocks and m i b the size of the i b th block, and let D be an arbitrary full-rank matrix with nonzero pattern S BD . The Block FSAI preconditioner of A is defined as the product F T F, where F is a lower block triangular factor with nonzero pattern with L the exact lower Cholesky factor of A. As D is arbitrary, it goes without saying that F as defined in 6.8 is also. Similarly to the FSAI algorithm, minimization of 6.8 yields a linear relationship for the ith row f i of F: where J is the set of column indices in the ith row of F with nonzero entries and v is the null vector except for the last m i b components which are arbitrary, being i b the block index the row i belongs to Figure 4 . For the system 6.9 to have a unique solution m i b components of f i J can be set arbitrarily because of the arbitrariness of D. According to 202 the most convenient choice is to prescribe all the components of f i J falling within the i b th diagonal block to be null, with the exception of the diagonal entries set to 1. This implies that F is actually a unit lower triangular matrix with structure

6.10
Similarly to the static FSAI algorithm, the factor F can be efficiently built in parallel as all systems 6.9 are independent. According to 6.8 , the preconditioned matrix FAF T should resemble DD T , that is, a block diagonal matrix for any arbitrary D. In other words, FAF T has the block structure where R ij are residual blocks and B i b tend to the diagonal blocks of DD T . As D is arbitrary, there is no reason for FAF T to be better than A in a CG iteration, so it is necessary to precondition FAF T again. Assuming that the residual off-diagonal blocks in 6.11 are close to 0, an effective choice could be using as "outer" preconditioner a nonoverlapped AS method containing a local preconditioner for each block B i b . Writing the "outer" preconditioner in the factored block diagonal form J −T J −1 , the resulting preconditioned matrix reads with the final Block FSAI preconditioner: The Block FSAI preconditioner described above can be improved by defining S BL dynamically. An adaptive algorithm for the S BL pattern identification can be implemented starting from the theoretical optimal properties of Block FSAI. Janna  Block FSAI has the theoretical property of minimizing ψ F for any given pattern S BL . The basic idea of the adaptive algorithm is to select the off-block diagonal nonzero entries in any row f i of F so as to reduce ψ F as much as possible. This is feasible because each factor FAF T ii turns out to be a quadratic form depending on f i only. Denoting by f i the subvector of f i including the off-block diagonal entries only, A i b the square submatrix of A built from the first to the mth row/column, with m the sum of size of the first i b − 1 diagonal blocks of S BD , and a i the subrow of A with the first m elements of the ith row Figure 5 , each factor FAF T ii in 6.16 reads 6.17 Minimizing every FAF T ii , i 1, . . . , n, is equivalent to minimizing ψ F . The adaptive pattern search for F can be therefore implemented as follows. Start from an initial guess S 0 F for the nonzero pattern of the off-block diagonal part of F, for example, the empty pattern, and compute the gradient g i of each quadratic form FAF T ii : Then, add to S 0 F the position j into each row i corresponding to the largest component of g i computed in 6.18 , so as to obtain an augmented pattern S 1 F . After computing the new factor F over S 1 F , the procedure above can be iterated in order to build S 2 F , S 3 F , and so on. The search into each row can be stopped when either a maximum number k max of entries are added to the initial pattern or the relative variation of FAF T ii in two consecutive steps is smaller than a prescribed tolerance F . The construction of the dynamic Block FSAI is summarized in Algorithm 7. The Block FSAI computed with the dynamic algorithm described above is known as Adaptive Block FSAI ABF and has proved very effective even if the outer preconditioner does not contain the exact inverse of the diagonal blocks B i b . As any dynamic algorithm, the parallel construction of ABF is not straightforward. However, an almost ideal Input: Matrix A, the initial pattern S 0

Saddle-Point Problems
In recent years the interest in the solution of saddle-point problems has grown, with many contributions from different fields. The main reason is the great variety of applications requiring the solution of linear systems with a saddle-point matrix. For example, such problems arise in the discretization of compressible and incompressible Navier-Stokes equations in computational fluid dynamics 208, 209 , constrained optimization 210, 211 , electromagnetism 212, 213 , mixed FE approximations in fluid and solid mechanics 208, 214 , and coupled consolidation problems in geomechanics 58 .
A saddle-point problem is defined as a linear system where the matrix A has the structure where K ∈ R n 1 ×n 1 , B 1 ∈ R n 1 ×n 2 , B 2 ∈ R n 2 ×n 1 , C ∈ R n 2 ×n 2 , and K, B 1 , and B 2 are nonzero. Typically, but not necessarily, n 1 ≥ n 2 . Moreover, C is symmetric and positive semidefinite, and often C 0, while in most applications K is SPD and B 2 B T 1 , that is, A is symmetric indefinite. For example, a saddle-point matrix A arises naturally in an equality-constrained quadratic problem: with C 0 and B 2 B T 1 B, solved with the aid of Lagrangian multipliers. A nonzero C may arise in some interior point methods 210, 211 , compressible Navier-Stokes equations 215 , or coupled consolidation problems 65 . Frequently the norm of C is much smaller than that of K. Because of the great variety of applications leading to saddle-point matrices, it is no surprise that several different solution methods have been developed. For an exhaustive, review, see Benzi et al. 216 . In the context of the present paper, the attention will be obviously restricted to preconditioned iterative methods.
A natural way to solve the problem is to reduce the system 7.3 to the Schur complement form. Solve for x 1 in the upper set of equations: and substitute 7.4 in the lower set: The matrix S BK −1 B T C at the left-hand side of 7.5 is the Schur complement of system 7.3 . Hence, the global solution x can be obtained by solving the Schur complement system 7.5 and substituting the result back in 7.4 which implies solving another system with the matrix K. This pair of SPD linear systems can be elegantly solved by partitioned schemes, as those proposed in 217, 218 in the context of coupled diffusion and consolidation problems, resulting in a double CG iteration where two distinct preconditioners for K and S are needed. A usually more efficient alternative is to solve the systems 7.4 and 7.5 inexactly and use the procedure described above as a preconditioner in a Krylov subspace method. This is what is traditionally referred to as constraint preconditioning because the idea was first introduced in the solution of equality-constrained optimization problems such as 7.2 219, 220 .
Quite obviously, there are several different constraint preconditioners according to how K −1 and S −1 in 7.4 and 7.5 are approximated. The most effective choice turns out to be strongly dependent on the specific problem at hand. A popular strategy consists of replacing K −1 with M −1 K and solving exactly the Schur complement system 7.5 219-222 . Such an approach can be convenient if n 2 is much smaller than n 1 , so that the system 7.5 can be efficiently solved by a direct method. Otherwise, an inner CG iteration can be set up exiting the procedure even at a relatively large residual 130 . In this case the resulting preconditioner reads and is usually denoted as Exact Constraint Preconditioner ECP . ECP has several interesting theoretical properties which have been studied in detail by several authors. For instance, an exhaustive eigenanalysis of the preconditioned matrix can be found in 223 . In particular, denoting with α K and β K the smallest and the largest eigenvalue, respectively, of M −1 K K, Durazzi and Ruggiero 224 have shown that the eigenvalues λ of M −1 ECP A are either one, with multiplicity at least n 2 , or real positive and bounded by α K ≤ λ ≤ β K . Quite interestingly, this result can lead the classical CG algorithm to converge even with the indefinite matrix 7.3 , provided that the last n 2 components of the initial residual are zero. This can be simply accomplished by choosing as initial guess M −1 ECP b. Despite its remarkable properties, ECP has at least two severe limitations. First, M −1 K must be known explicitly to allow for the computation of S and must be very sparse in order to preserve a workable sparsity for S as well. Second, although an accurate solution of 7.5 is not really needed, solving it at each preconditioner application can represent a significant computational burden for the overall scheme. This is why ECP is often used with M −1 K diag K −1 , which can be quite detrimental for the solver convergence in case K is not diagonally dominant or is ill-conditioned, such as in coupled consolidation problems 130 .
To make the preconditioner application cheaper, a substitute for 7.5 can be solved replacing S with another approximation M S as easy to invert as possible. Moreover, in some cases it may also be possible to approximate S without computing it explicitly, thus further reducing the constraint preconditioner cost. Obviously the theoretical properties of ECP no longer hold true. Recalling the factorization 2.12 by Sylvester's theorem of inertia, the new preconditioner is and is denoted as Inexact Constraint Preconditioner ICP . This class of preconditioners has been introduced in 130, 225, 226 using diag K −1 or AINV for M −1 K and a zero fill-in IC factorization for M −1 S . The choice for M −1 S is justified by the fact that usually S is well conditioned. Moreover, this allows for splitting the central block diagonal factor in 7.7 into the product of two matrices, thus providing a factorized form for M −1 ICP . A block triangular variant of ICP can be easily deduced from 7.7 by neglecting either the upper or the lower outer factors: The theoretical properties of block triangular ICP have been investigated by Simoncini 227 and effectively used in the solution of the discretized Navier-Stokes equations 228-230 . Similar to ECP, ICP as well requires an explicit approximation M −1 K to build the Schur complement. This is still a limitation in some problems, such as coupled consolidation 130 , where even AINV can be a poor approximation of K −1 leading to an unacceptably high iteration count to converge. Actually, this is only partially connected with the quality of AINV itself, rather it is the need for the explicit construction of the Schur complement matrix that calls for a reduced fill-in of M −1 K . In order to remove this inconvenience an implicit approximation of K −1 can be used, such as a stabilized IC factorization with partial fill-in: The new Schur complement reads but cannot be computed explicitly. Although the S implicit form 7.10 can still be used to perform a product between S and a vector, and consequently one may think of solving the related inner system by an iterative method, a suitable preconditioner for S is not easily available. Alternatively, an explicit approximation of S can be computed, for example, using the AINV of K −1 , namely where Z is the upper triangular factor resulting from the AINV Algorithm 4, and then an IC factorization of S is performed: In this way the quality of M −1 K is improved, but generally an additional approximation in M −1 S is introduced. However, in several problems, especially those arising from coupled consolidation where such a constraint preconditioning variant has been introduced 148, 231 , the overall algorithm has a beneficial effect. This ICP variant blending two different approximations for K −1 in the same scheme reads

7.13
where G L −1 S B L −T K , and is denoted as Mixed Constraint Preconditioner MCP . Recently an improved MCP version has been advanced by properly relaxing the Schur complement approximation 61 .
It is quite well accepted that constraint preconditioners outperform any other "global" preconditioning approach applied to saddle-point matrices. This remarkable property has attracted the theoretical interest of several researchers who have investigated the spectral properties of constraint-preconditioned saddle-point matrices 225

7.15
In other words, e k and e s are a measure of the error made when using the approximations M −1 K and M −1 S in place of K −1 and S −1 , respectively, while g provides an estimate of the strength of coupling between the upper and lower set of equations in 7.3 . Inequality 7.14 provides a useful and relatively cheap indication on how the quality of M −1 K and M −1 S affects the overall convergence rate of MCP. In particular, note that the quality of M −1 K is generally more important than that of M −1 S , with the difference becoming larger as the coupling becomes stronger.
Quite naturally, there is the distinct interest to implement constraint preconditioners on parallel computers. The main difficulties for developing efficient parallel variants of constraint preconditioners are twofold. First, the constraint preconditioning concept is inherently sequential, as it involves the preliminary computation of the Schur complement and then the subsequent approximate solution to two inner systems. Thus, a standard parallel approach where the system matrix is distributed among the processors as stripes of contiguous rows is not feasible, because all the processors owning the second block of unknowns are idle when the other processors perform the operation on the first block, and viceversa. Second, efficient parallel approximations of K −1 and S −1 have to replace incomplete factorizations. For example, in incompressible Navier-Stokes problems parallel AMG techniques have been efficiently used to approximate the K block which arises from a discretized Laplace operator 234 . In coupled consolidation better results have been obtained by using a double FSAI 235 , parallel multilevel variants 236 , and Block FSAI 237 , which appears to be currently the most promising approach in the context of geomechanical problems.

Conclusions and Future Perspectives
The development and implementation of efficient preconditioners are the key factors for improving the performance and robustness of Krylov subspace methods in the solution of large sparse linear systems. Such as issue is a central task in a great number of different applications, not all necessarily related to PDEs, so it is no surprise that much research has been devoted to it. After a time where most efforts were focused on direct and iterative solvers, the last two decades can be denoted as the "preconditioning era." The number of papers, projects, and international conferences addressing this topic has largely exceeded those aimed at the development of new solvers, and this trend is not likely to fade in the next few years. This is mainly due to the fact that an optimal general-purpose preconditioner does not exist and that any specific problem can afford the opportunity to develop an ad hoc strategy.
The present paper is intended to offer an overview of the most important algebraic preconditioners available today for the solution of sparse linear systems. Hence, it cannot exhaust the subject of preconditioning as the entire class of problem-specific algorithms has been omitted. Algebraic preconditioners have been chosen for their "universality," in that they can be effectively used as black-box tools in general-purpose codes requiring the knowledge of the system matrix only. A classification of algebraic preconditioners has been attempted based on the most significant representatives of each group along with their prominent features: i incomplete factorizations have been the earliest algebraic preconditioners massively used to accelerate Krylov subspace solvers. The several weak points related to robustness and numerical stability have stimulated the development of many different variants meant to both improve their performance and extend the application area. Although some advances are still under way, this class of preconditioners appears to be quite mature with the related strong and weak points very well known. Incomplete factorizations can be very effective and in many cases are still the most performant preconditioners, but their intrinsic lack of robustness and especially the very reduced parallel degree are probably going to bound their role in the next future; ii approximate inverses were originally developed with the aim of resolving the main difficulties of ILU-based preconditioners. Between the late 90s and early 00s they looked as a very promising approach especially in view of their robustness and potentially high parallel degree. Indeed, these preconditioners have proved very robust and so well suited for black-box solvers and in some cases almost perfectly parallel. In a sequential run, however, their performance can be lower than that of an ILU-based preconditioner. At present, the class of approximate inverses appears to be quite a mature field of research, with the most significant breakthroughs advanced more than 10 years ago. Nonetheless, their influence is probably going to be still very important in the next few years; iii the potentially high scalability of algebraic multigrid techniques has promoted much research in this area. In contrast with both incomplete factorizations and approximate inverses, AMG can allow for a stable iteration count to converge independently of the characteristic mesh size, that is, the overall system dimension, thus making AMG a potentially ideally scalable tool. Currently, AMG appears as one of the more active research areas, with possible contaminations from incomplete factorizations and approximate inverses in the choice of the smoother. In several applications arising from discretized PDEs with severe mesh distortions, however, AMG can still lack robustness, and more research is still needed to make it the algorithm of choice in the next future; iv the current hardware development is leading to a much more pronounced parallel degree of any computer. Besides the parallelization of existing algorithms, this trend is going to enforce the development of totally new techniques which can make sense in a parallel computing environment only. This is why parallel preconditioners are now emerging as a class of algorithms which are likely to expand in the next future. These tools can take advantage of important contributions from all the previous classes and can also relaunch some techniques, such as polynomial preconditioners or FSAI-based approximate inverses, which had been somewhat forgotten over the last years.