MRILDU : An Improvement to ILUT Based on Incomplete LDU Factorization and Dropping in Multiple Rows

We provide an improvement MRILDU to ILUT for general sparse linear systems in the paper. The improvement is based on the consideration that relatively large elements should be kept down as much as possible. To do so, two schemes are used. Firstly, incomplete LDU factorization is used instead of incomplete LU. Besides, multiple rows are computed at a time, and then dropping is applied to these rows to extract the relatively large elements in magnitude. Incomplete LDU is not only fairer when there are large differences between the elements of factors L and U, but also more natural for the latter dropping in multiple rows. And the dropping in multiple rows is more profitable, for there may be large differences between elements in different rows in each factor. The provided MRILDU is comparable to ILUT in storage requirement and computational complexity. And the experiments for spare linear systems from UF Sparse Matrix Collection, inertial constrained fusion simulation, numerical weather prediction, and concrete sample simulation show that it is more effective than ILUT in most cases and is not as sensitive as ILUT to the parameter p, the maximum number of nonzeros allowed in each row of a factor.


Introduction
The solving of sparse equations is the core issue of many scientific and engineering calculations, and the time required to solve sparse linear systems generally accounts for a large proportion in the whole numerical simulation.More seriously, with the improvement of the simulation precision, the spatial resolution becomes higher and higher, the grid points become more and more dense, and thus, the size of the linear equations eventually also becomes larger and larger.For the large-scale sparse linear equations, especially for the sparse linear system during the simulation of the three-dimensional problems, the storage requirement and the amount of computation from the traditional direct solution methods are very large; besides, the sparsity of the coefficient matrix cannot be fully utilized [1] and it is difficult to be parallelized efficiently.
The iterative method can be controlled easily and the iterative method can make full use of the sparsity of the matrix; thus the storage requirement and the amount of computation are both very small in a single iteration.In addition, it can solve the corresponding linear equations only if the computational rule of coefficient matrix-vector product is known, not having to know the specific structure or the specific elements of the coefficient matrix, which is impossible for the direct method.Based on these considerations, the iterative method, especially the Krylov subspace method, has drawn more and more attention in recent years.Because the projection scheme is adopted in the Krylov subspace method, the convergence rate is relatively quicker.However, the convergence rate of the Krylov subspace method depends on the eigenvalue distribution of the coefficient matrix; the more concentrated the eigenvalues are, the quicker the convergence will be.To accelerate the convergence rate and reduce the total computation time, efficient preconditioning techniques should be used to convert the original sparse linear equations into another with the same solution, but with narrower eigenvalue distribution region of the coefficient matrix [1,2].
In general, the narrower the region the eigenvalues are located in, the more effectively the number of iterations can be reduced.However, the amount of computation in a single iteration must also be less so as to make the ultimate solution process efficient.On the other hand, when preconditioning is not used, the amount of computation in a single iteration lies mainly in the operation that the coefficient matrix is applied to a vector.Therefore, an efficient preconditioner should ensure that the distribution region of the eigenvalues is improved significantly, and at the same time, the additional amount of computation in a single iteration is almost equal to that of a matrix-vector product.The incomplete factorization is one kind of such preconditioners, which aims to solve the general sparse linear equations.
Many incomplete factorizations have been developed since Meijerink and van der Vorst introduced incomplete Cholesky factorization into the conjugate gradient as a preconditioner [3].The differences among them are mainly in the dropping rules for the elements in the factors.Further, many block forms, diagonal modifications, stabilized versions, and parallel implementations have been developed.
The simplest incomplete factorization is ILU(0).It requires that the sparse structure of the incomplete factor is the same as the original coefficient matrix [3].The implementation of ILU(0) can be very cheap, and when the coefficient matrix is diagonally dominant or a  matrix, it is effective and robust.However, for more complex issues encountered in actual applications, ILU(0) is too rough, and thus it inspires people to design more efficient ones under the way of allowing more fill-ins.ILU() is one of them; it can be seen as an extension of ILU(0) [1,2,4], in which the fill-ins with level number greater than  are set to zero.Generally speaking, ILU(1) is quite effective.When the level threshold  is higher, the improvement is usually very small.However, with the increase of , the amount of computation increases very fast and thus ILU() with  greater than 1 is not considered in general.
When the diagonal dominance of the coefficient matrix is poor, ILU() may drop many relatively large elements,while retaining many elements with small absolute values, making the effectiveness of the method degraded greatly for general sparse matrices.It is due to the fact that ILU() completely starts from the nonzero structure, while the size of element in the factors is not considered.Based on this consideration, the researchers put forward another incomplete factorization in which the dropping is applied based on the magnitude of the nonzero elements in the factors.However, if the dropping rule is based only on the magnitude of elements, the total number of nonzero elements in the derived factors is very difficult to control; thus the storage requirement and the amount of computation are very difficult to control.The double threshold incomplete factorization (ILUT), put forward by Saad, effectively deals with the above-mentioned problem.When the factorization is performed at the th step, the elements with relatively small absolute value in the th row of the factor are dropped according to the threshold value  firstly and then at most  elements with the largest absolute value are retained in the nondiagonal elements in each row of incomplete factors [5].
The LU factorization is not the only way to construct incomplete factorization preconditioner.Many others can also be used.Based on A-orthogonal factorization, Benzi computed a factorization type sparse approximate inverse preconditioner AINV, and the experimental results show that it has similar quality as that of ILU preconditioner but has better potential parallelism [6].Aiming at the sparse linear equations needing to be solved in the numerical simulation of Markov chains, an incomplete WZ factorization is introduced in reference [7] and the experimental results show that this method is quicker than the ILU factorization preconditioner [7,8].For the sparse linear systems with finite element structure, Vannieuwenhoven and Meerbergen provided an incomplete multiple wave-frontal LU factorization preconditioning process, which is obtained through dropping in the process, and the computation performance and the robustness are improved through full use of the dense structure of the elemental stiffness matrix and the selection of local efficient principals [9].
Up to now, various incomplete factorizations are provided.However, ILUT is one of the most widely used incomplete factorizations.Although the effectiveness of the ILUT proposed for the general sparse linear equations is very high, there is still room for improvement.Recently, Maclachlan, Osei-kuffuor, and Saad studied the measures to improve its accuracy and stability through applying compensation [10].This paper can also be seen as an improvement to ILUT.Having realized that the dropping rules in ILUT are not fair, an improvement is proposed in this paper from the following two aspects.First, the incomplete LDU factorization is used to replace the incomplete LU factorization, so as to make the dropping in incomplete factors  and  fairer.Second, when multiple rows computed at a time for incomplete factors  and , the dropping rules are applied, respectively, to  and .The elements with maximum magnitude are retained in the computed rows.

Description of MRILDU
The ILUT preconditioner has been widely and effectively used to solve many difficult sparse linear systems.The specific descriptions of the double threshold incomplete factorization ILUT(, ) are shown in Algorithm 1 [2].
In Algorithm 1, three dropping strategies are used.First, in the fifth row of the described algorithm, if the absolute value of some element is less than the threshold , the element will be dropped and replaced by zero.Second, in the tenth row, another rule is applied to drop the elements whose absolute values are less than   , where   is taken as /  and   is equal to the quotient of the 1-norm of the th row in the original coefficient matrix to the number of the nonzero elements in the row.Third, in the eleventh and twelfth rows, for the th row of , namely, the strictly lower triangular part, at most  largest elements are retained, and at the same time, for the strictly upper triangular part in the row of , at most  largest elements are retained too.In addition, the diagonal elements are always retained.The effectiveness of ILUT preconditioner originates from the following two aspects.First, the number of the nonzero elements in the incomplete factors is reduced through dropping the elements whose relative magnitude is less than  and retaining at most  nonzero elements in each row of the factors.This reduces not only the time used in incomplete factorization, but also the preconditioned iteration time greatly.Second, when the diagonal dominance of the matrix is poor, the quality of the incomplete factors can be improved through increasing the parameter  and decreasing the parameter , especially if  is equal to zero and  is equal to the order of the matrix and the related incomplete factorization is just the LU factorization.
Although ILUT is effective, it still has deficiencies in practical applications.First, it is very difficult to specify the best values for the parameters in ILUT.If  is too small or  is too large, the effectiveness of the preconditioner will be very poor.However, when  is very large and  is very small, although the effectiveness of the obtained preconditioner is good, the cost of the construction of the preconditioner and the overhead in a single iteration are both very large, which is unbearable.Second, when the magnitudes of the elements in each row of the matrix differ greatly, the magnitudes of the elements in each row of the incomplete factors  and  may also differ greatly.However, ILUT almost averagely retains the same number of nonzero elements in each row and this may drop some relatively large elements while retaining many relatively small elements.Although the scheme to retain relatively large elements is not always superior to that to retain relatively small elements, the existing experiments show that it has more advantages to retain relatively large elements in general, especially for diagonally dominant matrices.
Based on the above considerations, here we propose an improvement to ILUT, which is multirow ILDU (MRILDU).The idea can be outlined as follows.Perform incomplete LDU factorization for the matrix, compute multiple rows of factors  and  every time, and then apply dropping strategy to the obtained rows of  and .It has the following two advantages to use the incomplete LDU factorization to replace the incomplete LU factorization.First, when LDU factorization is applied, the elements in  and the elements in  have been, respectively, proportioned to the diagonal elements in the same column or same row and thus it is more equitable to use the same dropping strategy in  and .Second, because the elements in  have already been scaled, it is more equitable for each row when the dropping strategy is used for multiple rows.Furthermore, when multiple rows are computed at a time and then dropping rule is applied, the unified rule can be used to drop nonzero elements in these rows.Compared to the algorithm that one row is computed at a time and simultaneously the dropping strategy is applied, MRILDU is more favorable in retaining the elements with relatively large magnitudes in the incomplete factors.Assuming that every time  rows of the matrix are factorized and then the dropping strategy is used for them, MRILDU(, , ) can be specifically described in Algorithm 2.
During the implementation, the algorithm MRILDU (, , ) needs to efficiently solve the following three problems.The first one is the linear combination of the sparse vectors on step 7.The second one is to select several elements with the largest magnitude from the given vectors on step 19 and step 20, and the third one is that the elements in th row of  must be accessed in ascending order on step 3.It can be found from the comparison of the steps in Algorithms 2 and 1 that the first problem encountered in the algorithm MRILDU(, , ) is the same as that of ILUT(, ).Therefore, the same technology can be used.See literatures [2,4] for specific implementation details.
For the second problem, we adopt the quick sort method which is slightly different from that in ILUT(, ).When the sorting is applied to step 19 and step 20 in MRILDU(, , ), the exchange operation is not applied in deed, while another integer array is used to track the sorting process, and ultimately its first  elements are used to record the positions of the elements with the largest magnitude.After the end of the sorting, the elements on other positions in the column number array are set to zero, and those elements are dropped accordingly.In this way, the remaining elements can be preserved directly without having to spend too much time to determine the row numbers of the nonzero elements.
For the third problem, from its appearance, it should be completely identical in MRILDU and in ILUT.However, it needs to note that, in MRILDU,  rows are taken as a whole at a time to use the dropping strategy.We may as well assume that the dropping strategy is used to the rows from the th to the  +  − 1th altogether, when the th row is computed, similar to ILUT; the elements in the final incomplete factor  are used.However, when the rows from  + 1th to  +  − 1th are computed, the elements in the  − 1th row as well as the rows in front of it in the factor  are the elements in the eventual incomplete factor.But when the elements in the rows from the th to  +  − 2th are used, these elements are only temporary values, and the drooping strategy has not been applied yet.In order to facilitate the calculation, initially,  additional storage units are allocated in advance for each incomplete factor and in the calculation the elements of continuous  rows are stored according to the format of the −1th and the previous rows to facilitate the implementation of the algorithm.When the dropping strategy is applied, the elements in these  rows are processed directly.The dropped elements are set to zero, and after that, these zero elements are dropped completely and their positions are filled with the subsequent nonzero elements.
In the following, we analyzed the differences of the storage requirement and the amount of computation of MRILDU from ILUT.In ILUT, every time one row is computed and  nonzero elements with largest magnitudes are selected and retained from the current row of each triangular factor.There may be only at most  nonzero elements in each row of the factors before applying the dropping rule.Therefore, it can be considered that the temporary space complexity is (), which needs to be allocated additionally.In MRILDU, the nonzero elements in  rows need to be stored before applying dropping rule, and thus, the temporary space complexity which needs to be allocated additionally may be up to ().As previously described, the storage space has been allocated in advance.For the final incomplete factors, due to the fact that the storage units of MRILDU(, , ) and ILUT(, ) are both (2 + 1) at most, therefore, the storage requirements are almost invariant, especially when  is small or  is large.Accordingly in the preconditioned iteration, the computing time in a single iteration will also be almost the same.On the other hand, the quality of MRILDU is higher in general; that is to say, when MRILDU is used, the time spent in iteration will be less than that when ILUT is used.Therefore, if the construction of the preconditioner is not considered, the time elapsed in the iteration itself will be less with MRILDU than that with ILUT.
The time complexity of the construction of MRILDU differs ILUT mainly from two origins.First, in ILUT, the dropping is applied for one row each time.After the dropping is completed, the retained elements of the row are stored in the incomplete factor.When the calculation is applied for the subsequent rows, the elements in front of the current row in the factor  need to be used and these elements are in the final incomplete factor.In MRILDU,  rows of the factors are calculated at a time.It may be assumed that they are the rows from the th to  +  − 1th.When applying dropping to these  rows as a whole, the calculation process is as described previously.Therefore, the additional temporary storage units are greater than that in ILUT in general.At the same time, the amount of computation will be also slightly larger than that with ILUT.
The second origin is the different dimension of the vectors to be split and the different number of elements which need to be extracted from them.For ILUT, if there are  nonzeros in each of the rows from the th to the  +  − 1th before dropping is applied, the time complexity for each row is about (/ log ) when the quick sort is used to extract the required  elements from the factor.Therefore, the time complexity for  rows is (/ log ).For MRILDU, because the dropping strategy is applied to  rows, this operation is equivalent to the extraction of  elements from  elements.Thus, the time complexity should be {/() log()}, namely, {/ log()}.This shows that just from the perspective of extracting the elements with largest magnitudes, it seems that MRILDU will be quicker.However, it needs to note that, when the rows from the +1th to the +−1th are computed, the temporarily storage units in each row are likely more than those in ILUT.Therefore, even if MRILDU will be slightly quicker, it is not very significant, especially when  is large.
Furthermore, it can be proved that similar to other ILU factorizations, when  is a  matrix or diagonally dominant, MRILDU can always continue, and the lower right corner submatrix to be factorized will also be always a  matrix or diagonally dominant.For each of these two issues, after  is split into LDU and  − LDU, the iterative method with LDU as the iterative matrix will be always convergent.Correspondingly, the condition number of the preconditioned coefficient matrix will be better than that of the original coefficient matrix.

Numerical Experiments
In this section, for several sparse linear equations from some scientific engineering applications, the proposed MRILDU is compared to ILUT.The experiments are applied on the high performance server with Intel Xeon CPU E5-2670 0 @ 2.60 GHz CPU, 20480 KB cache, and main memory of 48 G.The operating system is Red Hat 4.4.5-6Linux.The Intel Fortran compiler ifort 11.1.059is used as compiler and -O3 is used for optimization.
In this paper, for convenience, the number of iterations will be denoted by IT, and the time used for the construction of the preconditioner and the iteration process are denoted by CTME and ITME, respectively.The total solution time is denoted by TTME, which is the sum of CTME and ITME.NNZ represents the number of nonzero elements in the preconditioner.All the time results are in seconds.In all iterations, the initial solution vector is selected as all zeros.
3.1.Sparse Linear Systems from Some Test Matrices.The coefficient matrices of the sparse linear equations here come from the UF sparse matrix collection (http://www.cise.ufl.edu/research/sparse/matrices/index.html) website, and the information of these matrices can be briefly described in Table 1.Among them, the last four matrices starting from orsirr 1 can also be downloaded from the website Martrixmarket (http://math.nist.gov/MatrixMarket/browse.html).
The right-hand side vectors are obtained through applying the matrix to the given true solution vector, in which the th component is   = /, where  is the order of the matrix.During the iteration process, BiCGSTAB is used, and the convergence criterion is that the 2-norm of the residual vector is reduced by 10 orders of magnitude.
The results with ILUT and MRILDU are listed in Table 2.In all the tests of this subsection, the threshold value  is taken as 1 − 3 all the time.
As can be seen from Tables 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12, when the same  and  are used, the number of iterations with MRILDU is less than that with ILUT in general; namely, the effectiveness of MRILDU is higher.Besides, when  is small, due to the fact that the number of iterations with MRILDU is less than that with ILUT, the time elapsed for iteration is also less than that with ILUT.However, for some matrices, because the number of the nonzero elements retained by MRILDU is relatively more, while the number of iterations is not reduced significantly, especially when  is large, the time elapsed for iteration is slightly longer than that with ILUT.From the perspective of total execution time including preconditioner construction and iteration process, although it is difficult to say that MRILDU is absolutely better than ILUT, MRILDU is superior in general.Furthermore, when ILUT is used, it is difficult to solve the linear systems for some matrices.However, when MRILDU is used, the solution is more robust with slightly larger .Moreover, the experiments also show that when  is large, MRILDU is not as sensitive as ILUT to the parameter .

Energy Equation in Inertial Constrained Fusion Simulation.
The energy equations in the numerical simulation of inertial confinement fusion can be described as in the literatures [11,12].When the discretization is applied to the equations, the velocity and coordinates are given on grid points, while the temperature, density, pressure, and energy are given at the center of the cells.When uniform quadrilateral mesh is used to discretize the continuous equations, we can derive the discrete nonlinear equations.When Newton iteration is used to solve the nonlinear system and natural ordering is used to grid points, the problem is converted into the solution of block tridiagonal linear equations which corresponds to Jacobi matrices [12].
In this subsection, we perform experiments for two typical sparse matrices (namely, MAT17 and MAT20) extracted from the discrete solution of the two-dimensional three temperature energy equations.The size of the related discrete grid is 21 by 75.Since each grid point corresponds to three temperature variables, namely, electron, ion, and photon, the number of the dimensions of the equations is 4,725.In the experiments, the stop criterion is that the Euclid norm of the residual vector is reduced by 10 orders of magnitude.For matrices MAT17 and MAT20, the results are listed in Tables 13 and 14, respectively.In these experiments, the parameter  is taken as 1 − 3.
As can be seen from Tables 13 and 14, although MRILDU with  = 1 corresponds to ILUT, similar to the previous experiments, the experimental results also show that MRILDU with  = 1 has great advantages and both the number of iterations and the time used in iteration are reduced significantly.Furthermore, when the parameter  is given and the parameter  increases, the convergence rate of the iteration is improved gradually; however, when the parameter  is increased to a certain extent, further improvement is trivial when it continues to increase.Moreover, it can be noted that after the parameter  is increased to a certain extent, the quality of MRILDU will be very close for different parameters .

Helmholtz Equation in Numerical Weather Prediction
Model.When the finite-difference model is used to perform the numerical weather forecast, the sparse linear equations related to the pressure deviation in the three-dimensional space need to be solved on each time step, which are called discrete Helmholtz equations [13].In this paper, we perform experiments for the grid point model GRAPES [14] developed by Chinese Academy of Meteorological Sciences.In this model, the full compressible atmosphere motion system is used as the control equation, the semi-implicit semi-Lagrangian scheme is used as discrete scheme for time derivative, and the discretization adopts Arakawa-C grid in horizontal direction and Charney-Phillips grid in vertical direction.See literature [14] for details.
Although the Helmholtz equations are diagonally dominant, the diagonal dominance is very weak.For this kind of sparse linear equations, the GCR iteration is often used to solve the equations [2], but the solution is very timeconsuming, which occupies a large proportion in the whole numerical simulation time.For example, in GRAPES, when the diagonal scaling GCR is used and the grid size is 144 × 73 × 31, namely, the number of the grid points in the latitude, In the experiments, the grid size used is 144 by 73 by 31, the preconditioned GCR is used to perform the iterations, and the stop criterion is that the 2-norm of the residual vector is less than 1 − 8. Due to the fact that the coefficient matrix of the sparse linear equations remains unchanged in the simulation process, the preconditioner just needs to be constructed in the setting stage and its information can be directly referenced in the solution processes of the sparse linear equations at subsequent time steps.During the simulation, the length of time step is taken as 1,800 seconds and the experiment is integrated for one day, namely, 96 time steps.Thus, there are 96 linear systems to be solved in all.
The experimental results can be described as in Table 15, where the average number of iterations and the average iteration time for the solution of a linear system with preconditioned GCR are provided.The parameter  is taken as 1 − 6 in all tests.
As can be seen from Table 15, MRILDU is significantly superior to ILUT, and this is mainly thanks to the facts that the incomplete ILDU factorization is used to replace the ILU factorization and the dropping strategy for the elements in  in MRILDU is different from that in ILUT.At the same time, it benefits partly from the multirow strategy.

Mesoscale Simulation of Concrete Sample.
In the mesoscale simulation of concrete specimens, the specimens are seen as three-phase composite materials which are composed of aggregate, mortar, and the interface between them and the finite element method is used.In this paper, we perform experiments for a cubic wet-sieved specimen and the size of the specimen is 550 mm by 150 mm by 150 mm.See literature [8] for specific descriptions.There are 71,013 discrete nodes and 78,800 finite elements in all, respectively.The two supporting columns are, respectively, located at the places where  = −0.15m and  = 0 m as well as  = 0.3 m and  = 0m, namely, the places which are, respectively, 0.05 m away from the left and right boundaries at the bottom of the specimen, while the two loading columns are located at the places where  = 0m and  = 0.15 m as well as  = 0.15 m and  = 0.15 m, respectively, namely, the places which are, respectively, 0.2 m away from the left and right boundaries at the top of the specimen.In this subsection, for the linear equations encountered in the static loading test, we compare the efficiency of ILUT and MRILDU.The load is increased step by step, and the increased load for each step is 0.25 kN.When loaded to the 59th step, the damaged elements will appear in the concrete specimen and the specimen is completely damaged at the 94th step.If some damaged element appears at some loading step, it may need to solve multiple sparse linear equations with the same coefficient matrix at this step, to correct the displacements.Therefore, there are 178 sparse linear systems to be solved in all during the whole simulation process.Although the obtained sparse linear systems are symmetric positive definite, due to the fact that ILUT and MRILDU are not symmetric, BiCGSTAB is selected as the iterative method.The stop criterion is that the 2-norm of the residual vector is reduced by six orders of magnitude.
In Table 16, for the sparse linear systems occurring in the static loading simulation, the average number of iterations and the average time elapsed for iteration of each sparse linear system are provided.In the tests, the parameter  is taken as 25 and the parameter  is taken as 1 − 3. Due to the fact that when there are no damaged elements, the global stiffness matrix, namely, the coefficient matrix of sparse linear equations, remains unchanged and even if there are damaged elements, the coefficient matrix is also unchanged during the corrections at each loading step.As long as the coefficient matrix is unchanged, the preconditioner does not need to reconstruct.Therefore, the time consumed by the construction of the preconditioner is relatively trivial and thus the time used to construct the preconditioner is not listed in the table, while only the iteration results are listed.
As can be seen from Table 16, for sparse linear equations which need to be solved in the mesoscale numerical simulation of concrete specimen, MRILDU does not have advantages in the number of iterations compared to ILUT, but the average iteration time used for each linear system is improved slightly.This may be due to the fact that the utilization ratio of Cache is better when MRILDU is used.

Conclusion
In this paper, aiming at the disadvantage that ILUT may drop some relatively large elements and retain relatively small elements, we propose an improved version MRILDU based on the following two techniques.First, the incomplete LDU factorization is used to replace the incomplete LU factorization and the same dropping strategy is used for the elements in  and , so as to make the dropping rules be more equitable for  and .Second, multiple rows are factorized before each dropping, so as to effectively deal with great differences between the diagonal dominance and to retain the relatively large elements, thereby improving the quality of incomplete factorization.The experimental results show that when the same parameters are used, the number of iterations with MRILDU will be significantly smaller than that with ILUT.And in most cases, the iteration time and total solution time with MRILDU can be reduced in general.Furthermore, MRILDU is not as sensitive as ILUT is to the parameter , and thus, in the case that the parameter  cannot effectively be determined, the advantages of MRILDU are more significant.

Table 2 :
Iteration results for matrix af shell10.

Table 15 :
Iteration results for linear systems from GRAPES.

Table 16 :
Iteration results for static loading of concrete sample.