Parallel Multiprojection Preconditioned Methods Based on Subspace Compression

Over recent decades, the continuous expansion of supercomputing infrastructures has necessitated the design of scalable and robust parallel numerical methods for solving large sparse linear systems. A new approach for the additive projection parallel preconditioned iterative method, based on semiaggregation and a subspace compression technique for general sparse linear systems, is presented. The subspace compression technique utilizes a subdomain adjacency matrix and breadth-first search to discover and aggregate subdomains, limiting the average size of the local linear systems and thus reducing memory requirements. The depth of aggregation is controlled by a user-defined parameter. The local coefficient matrices reuse the aggregates computed during the formation of the subdomain adjacency matrix in order to avoid recomputation and improve performance. Moreover, the rows and columns corresponding to the newly formed aggregates are ordered last to further reduce fill-in during the factorization of the local coefficient matrices. Furthermore, the method is based on nonoverlapping domain decomposition in conjunction with algebraic graph partitioning techniques for separating the subdomains. Finally, applicability and implementation issues are discussed, and numerical results along with comparative results are presented.


Introduction
Let us consider a sparse linear system of the following form:

Ax = b, (1)

where A is the coefficient matrix, b is the right-hand-side vector, and x is the solution vector. In order to solve the linear system (1), a direct or an iterative method can be used. However, a direct method is computationally expensive, with excessive memory requirements. Krylov subspace iterative methods, compare [1], have gained the attention of the scientific community during recent decades, due to their efficiency and modest memory requirements for solving large sparse linear systems. Preconditioned iterative methods improve the convergence rate and have been used extensively for solving large sparse linear systems. The left preconditioned form of the linear system (1) is as follows:

M^{-1} A x = M^{-1} b,

where M^{-1} is a suitable preconditioning matrix. Moreover, there are the right and the symmetric preconditioned forms, compare [1]. The preconditioner M^{-1} has to satisfy the following conditions: (i) M^{-1}A should have a "clustered" spectrum, (ii) M^{-1} should be efficiently computable in parallel, and (iii) the product "M^{-1} by vector" should be fast to compute in parallel, compare [1, 2]. The domain decomposition method has been used extensively for solving (very) large sparse linear systems on parallel systems, compare [3-8]. Furthermore, domain decomposition-type methods have been combined with Krylov subspace methods or have been used in hybrid schemes in conjunction with direct solvers, compare [6, 9-11].
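As a toy illustration of the left preconditioned form (the 4×4 matrix and the Jacobi choice M^{-1} = D^{-1} are hypothetical examples, not the paper's MPMSC preconditioner), the following sketch checks that M^{-1}Ax = M^{-1}b has the same solution as (1) while the spectrum of M^{-1}A is clustered around 1:

```python
import numpy as np

# Hypothetical 4x4 diagonally dominant system (illustration only).
A = np.array([[4.0, 1.0, 0.0, 0.0],
              [1.0, 5.0, 1.0, 0.0],
              [0.0, 1.0, 6.0, 1.0],
              [0.0, 0.0, 1.0, 7.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])

# Jacobi (diagonal) preconditioner: M^-1 = D^-1.
M_inv = np.diag(1.0 / np.diag(A))

x_plain = np.linalg.solve(A, b)                  # solve A x = b
x_prec = np.linalg.solve(M_inv @ A, M_inv @ b)   # solve M^-1 A x = M^-1 b

# Left preconditioning does not change the solution ...
assert np.allclose(x_plain, x_prec)

# ... but the eigenvalues of M^-1 A cluster around 1
# (by Gershgorin's theorem, all lie within 0.4 of 1 here).
eigs = np.linalg.eigvals(M_inv @ A)
```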
During the last decade, research efforts were directed towards solvers for general sparse linear systems that have improved convergence behavior and are scalable. Various overlapping domain decomposition methods based on Lagrange multipliers have been proposed. The Finite Element Tearing and Interconnect (FETI) method, compare [12], and its variants, compare [13, 14], are scalable and efficient solvers for second-order and fourth-order partial differential equation problems. Related methods are the Balancing Domain Decomposition (BDD), compare [15], and the Balancing Domain Decomposition by Constraints (BDDC), compare [16], which are multiplicative and additive domain decomposition methods, respectively. The aforementioned methods are suitable for solving linear systems derived from the finite element method. Additionally, there are domain decomposition-type solvers using graph partitioning algorithms. These algebraic domain decomposition methods do not require any geometric property of the problem, as the partitioning is based on the corresponding graph of the coefficient matrix. Parallel hybrid algebraic domain decomposition-type solvers that combine direct and iterative methods have been proposed, compare [17, 18]. Additionally, algebraic domain decomposition methods in conjunction with algebraic multigrid methods, compare [19-22], have satisfactory convergence behavior and are suitable for parallel systems. An algebraic domain decomposition method based on semiaggregation and subspace compression techniques, namely, the multiprojection method with subspace compression (MPMSC), for solving large sparse linear systems is proposed. Several efficient software libraries for graph partitioning, such as Metis, compare [23], or Scotch, compare [24], can be used for algebraic partitioning. Aggregation techniques have been used previously in the context of algebraic multigrid methods and have been shown to be efficient for a wide variety of model problems, compare [25-27]. In the
MPMSC scheme, a direct method is used for solving the local linear systems derived from the semiaggregated domains. The semiaggregation procedure results in multiple grids composed of a fine part, corresponding to the unknowns of a subdomain, and an aggregated part, corresponding to unknowns related to other subdomains. The proposed scheme utilizes approximations for the fine part of each subdomain, obtained with respect to its aggregated components, as opposed to aggregation-based algebraic multigrid methods, which obtain approximations for the solution of the entire domain, through smoothing, based on a hierarchy of coarser grids. Moreover, the proposed scheme does not require the computation of a hierarchy of coarser grids. The subspace compression technique is an algorithmic procedure based on breadth-first search with limited depth on the adjacency matrix of the subdomains. By this technique the neighboring subdomains are discovered and aggregated together into an aggregated (coarse) component. The aggregation process reduces the number of aggregated (coarse) unknowns of the local linear systems, resulting in a substantial reduction in memory requirements. The computation of the rows and columns corresponding to the aggregates, during the formation of the local coefficient matrices as well as the subdomain adjacency matrix, is reused to avoid recomputation. Avoiding the recomputation enhances the performance of the proposed scheme. Moreover, the rows and columns corresponding to the newly formed aggregates have a large number of nonzero elements and thus are ordered last to further reduce fill-in during the factorization of the local coefficient matrices. Further, the number of aggregates of the multiprojection method with subspace compression is controlled by a parameter that defines the level sets of aggregates to be compressed into a single aggregate.
In Section 2, the multiprojection method with subspace compression is presented, implementation details are given, and the aggregation strategy is discussed. In Section 3, numerical results illustrating the applicability and efficiency of the MPMSC scheme are presented. Moreover, comparative results for the performance and convergence behavior are given.

Multiprojection Method with Subspace Compression
The multiprojection method with subspace compression is a domain decomposition-type method for solving large sparse linear systems (see (1)). The graph corresponding to the coefficient matrix A has as vertices the unknowns of the linear system (1) and as edges the nonzero elements of the coefficient matrix A. Since A is expressed as a graph, the algebraic properties of the problem can be utilized in order to partition the graph into n_doms nonoverlapping subgraphs (subdomains). The graph partitioning process is performed using the algorithm provided in the Metis software library, compare [23]. Consider the domain Ω partitioned into n_doms nonoverlapping subdomains Ω_i, with i = 0, ..., n_doms − 1, so that Ω = ∪_i Ω_i and Ω_i ∩ Ω_j = Ø for i ≠ j. The number of vertices in each subdomain Ω_i is denoted by m_i, and I_i is the index set I_i = {i_0, i_1, ..., i_{m_i−1}}, where i_k are the indices corresponding to the m_i vertices of each subdomain.

Subspace Compression.
Let us consider the subspace K of the domain Ω, which contains n_doms aggregated (coarse) components corresponding to the subdomains Ω_i, and the prolongation matrix P_K from K onto Ω, P_K : K → Ω, whose kth column has unit entries at the vertices of subdomain Ω_k and zeros elsewhere. The projection of A to the subspace K is the matrix A_K = P_K^T A P_K of dimension (n_doms × n_doms). The graph G(A_K) = (V, E), corresponding to the matrix A_K, describes the relations between the aggregated (coarse) components. The subspace compression technique is based on the reaggregation of the n_doms (number of subdomains) aggregated (coarse) components into n_aggs (number of aggregates).
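The coarse projection can be sketched as follows (the 4×4 matrix and the vertex-to-subdomain map are hypothetical): column k of the Boolean prolongation flags the vertices of subdomain Ω_k, so a nonzero off-diagonal entry of A_K marks two adjacent subdomains.

```python
import numpy as np

# Hypothetical 1D Laplacian on 4 vertices, split into 2 subdomains.
A = np.array([[ 2.0, -1.0,  0.0,  0.0],
              [-1.0,  2.0, -1.0,  0.0],
              [ 0.0, -1.0,  2.0, -1.0],
              [ 0.0,  0.0, -1.0,  2.0]])
part = [0, 0, 1, 1]                      # vertex -> subdomain index

n, n_doms = len(part), max(part) + 1
P_K = np.zeros((n, n_doms))
for v, k in enumerate(part):
    P_K[v, k] = 1.0                      # column k flags subdomain k

A_K = P_K.T @ A @ P_K                    # subdomain adjacency matrix
```

Here `A_K[0, 1]` is nonzero, reflecting that the two subdomains share an edge of the original graph.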

Mathematical Problems in Engineering 3
The number of aggregates is not known a priori and is thus a result of the subspace compression algorithm. Breadth-first search with limited depth (BFS-LD) is performed on G(A_K) to find the distance-λ neighborhood. The components in a distance-λ neighborhood of G(A_K) are reaggregated together.
Then the neighborhood's vertices are removed from the set V and the BFS-LD is performed again on the new graph. When every vertex has been removed from the set V, the procedure terminates. The order in which the reaggregation process is performed may result in a slight variation of the number of aggregates; however, this does not significantly affect the efficiency or the convergence behavior of the proposed scheme. The order of the vertices used in the reaggregation scheme is lexicographical with respect to the enumeration of the subdomains by the graph partitioning scheme. Other approaches can be used for the reaggregation process, compare [26, 27]; however, the computational work of these schemes would negatively affect the performance of the preprocessing phase, especially for large numbers of subdomains. The subspace compression algorithm is as follows:

(1) V ← V(G(A_K)); Agg ← {Agg_0, Agg_1, ..., Agg_{n_aggs−1}} = {Ø, Ø, ..., Ø}.
(2) For each v_i ∈ V, until V = Ø:
        Agg_j ← N_i^λ ∩ V, where N_i^λ is the λ-distance neighborhood of component v_i and j is the index of the aggregated (coarse) component;
        V ← V \ Agg_j.
End For
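The subspace compression step can be sketched in the following form (the adjacency lists, the path-graph example, and the function name `compress` are hypothetical illustrations):

```python
from collections import deque

def compress(adj, lam=1):
    """BFS with limited depth (BFS-LD) on the subdomain graph G(A_K).

    adj: dict mapping each vertex to its neighbors.
    lam: maximum distance from the seed vertex within an aggregate.
    Returns the list of aggregates (each a sorted list of vertices).
    """
    unvisited = set(adj)
    aggregates = []
    for start in sorted(adj):            # lexicographical subdomain order
        if start not in unvisited:
            continue
        agg, q = {start}, deque([(start, 0)])
        while q:
            v, d = q.popleft()
            if d == lam:                 # do not expand past distance lam
                continue
            for w in adj[v]:
                if w in unvisited and w not in agg:
                    agg.add(w)
                    q.append((w, d + 1))
        unvisited -= agg                 # remove the neighborhood's vertices
        aggregates.append(sorted(agg))
    return aggregates

# Hypothetical subdomain graph: 6 subdomains on a path 0-1-2-3-4-5.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
```

With `lam=1` the path collapses into three aggregates; with `lam=2`, into two, illustrating how a larger λ yields fewer aggregates.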
The parameter λ of the neighborhood N_i^λ specifies the maximum distance between the vertex v_i and any other vertex in the neighborhood. The value of λ is selected with respect to the adjacency matrix G(A_K), considering that the larger the value of λ, the smaller the number of aggregates. The distance one neighborhood is chosen by default; however, for parallel systems with limited memory resources, a distance two or distance three neighborhood should be chosen to reduce the memory requirements of the proposed preconditioning scheme. An increase in the value of λ might negatively affect the convergence behavior of the proposed scheme. The computational complexity of finding the aggregates depends on the number of subdomains; it is a search procedure on the graph G(A_K) with negligible execution time compared to the whole initialization phase of the preconditioning scheme.

Semiaggregation.
Let us consider the semiaggregated subdomains B_i, containing the m_i fine components of subdomain Ω_i and n_aggs aggregated components. Furthermore, the semiaggregated subdomains B_i are associated with their respective prolongation matrices P_i ∈ R^{N×(m_i+n_aggs)} : B_i → Ω, with i = 0, 1, ..., n_doms − 1. The matrices R_i = P_i^T : Ω → B_i are restriction operators mapping an arbitrary full vector of the domain Ω into a vector of the semiaggregated subdomain B_i with size (m_i + n_aggs). The first m_i columns of the matrix P_i are the columns of the identity matrix of dimension (N × N) corresponding to the indices I_i. The remaining n_aggs column vectors of P_i correspond to the aggregates. The components of a column vector that are included in the corresponding aggregate have the value 1/m̂_j and the rest of the components are zero, where m̂_j = Σ_k m_k [Ω_k ⊆ Agg_j], with k ≠ i and j = 0, 1, ..., n_aggs − 1.
Hence, the matrix P_i is defined as P_i = [e_{i_0}, ..., e_{i_{m_i−1}}, w_0, ..., w_{n_aggs−1}], where w_j are the column vectors corresponding to the aggregates as described above. Moreover, additional prolongation matrices are required, namely, P̃_i. The prolongation matrices P̃_i, of dimension (N × (m_i + n_aggs)), are used to map a vector defined on the subdomain B_i to a vector defined on the domain Ω by prolonging only the components corresponding to the subdomain Ω_i and discarding the aggregated (coarse) components; the jth column of the matrix P̃_i is the identity column e_{i_j} for j < m_i and the zero vector otherwise. The linear system corresponding to each subdomain B_i is A_i u_i = R_i r, (10), where A_i = R_i A P_i is a coefficient matrix of order (m_i + n_aggs).
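A minimal sketch of assembling one such prolongation matrix, under assumed toy sizes (three subdomains of two vertices each, and a single aggregate grouping the other two subdomains; all names are illustrative):

```python
import numpy as np

# Hypothetical setup: 6 vertices, 3 subdomains of 2 vertices each, and a
# single aggregate grouping subdomains 1 and 2 (post-compression).
N = 6
subdomains = [[0, 1], [2, 3], [4, 5]]   # index sets I_i
aggregates = [[1, 2]]                   # aggregate of subdomain indices

def prolongation(i):
    """Build P_i for subdomain i: identity columns for its fine vertices,
    then one averaging column per aggregate (excluding subdomain i)."""
    fine = subdomains[i]
    P = np.zeros((N, len(fine) + len(aggregates)))
    for c, v in enumerate(fine):
        P[v, c] = 1.0                   # columns of the identity for I_i
    for a, agg in enumerate(aggregates):
        verts = [v for j in agg if j != i for v in subdomains[j]]
        P[verts, len(fine) + a] = 1.0 / len(verts)   # value 1/m_hat
    return P

P0 = prolongation(0)    # (6 x 3): two identity columns, one averaged column
```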
The proposed method is used as a preconditioner to a Krylov subspace iterative method, such as preconditioned GMRES(m), compare [28, 29]. The product of the preconditioning matrix by a vector r can be described by the following compact algorithmic scheme:

For i = 0 to n_doms − 1:
    solve A_i u_i = R_i r;
    z ← z + P̃_i u_i.
End For

Considering that x* is the exact solution and that r = b − Ax = A(x* − x) = Ae, the error e^{(k+1)} at the (k + 1)th iteration can be written as e^{(k+1)} = (I − Σ_{i=0}^{n_doms−1} P̃_i A_i^{-1} R_i A) e^{(k)}, where P̃_i P̃_i^T is an orthogonal projector with respect to the Euclidean inner product. Considering that, if A is symmetric positive definite, P_i A_i^{-1} P_i^T A is an orthogonal projector with respect to the A-norm inner product, compare [8, 30], each term P̃_i A_i^{-1} R_i A is a product of two orthogonal projections that are orthogonal with respect to different inner products; thus the resulting projection is not orthogonal, compare [31]. Taking the above into consideration, the method falls into the class of oblique projection procedures, compare [1].
The preconditioned matrix using the MPMSC scheme is given by M^{-1}A = Σ_{i=0}^{n_doms−1} P̃_i A_i^{-1} R_i A, which is a sum of projections.
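The projection character of each term can be checked numerically on a toy example (the 4×4 SPD matrix and the single averaged coarse column are hypothetical): the operator Q = P (P^T A P)^{-1} P^T A satisfies Q² = Q.

```python
import numpy as np

# Hypothetical 4x4 SPD matrix; one subdomain keeps vertices {0, 1} fine
# and collapses {2, 3} into a single averaged coarse column.
A = np.array([[ 3.0, -1.0,  0.0,  0.0],
              [-1.0,  3.0, -1.0,  0.0],
              [ 0.0, -1.0,  3.0, -1.0],
              [ 0.0,  0.0, -1.0,  3.0]])
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.5]])

A_i = P.T @ A @ P                       # local semiaggregated matrix
Q = P @ np.linalg.solve(A_i, P.T @ A)   # Q = P A_i^{-1} P^T A

# Q is a projector (Q @ Q == Q) and is A-orthogonal (A @ Q is symmetric).
```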
Let us consider the unit square domain Ω, discretized with mesh size h = 1/7, as depicted in Figure 1. The resulting grid is partitioned into 16 subdomains, containing 4 vertices each. The corresponding matrix A is projected onto the subdomains B_i, as in (10), and the grid corresponding to the projected subdomain B_0 is presented schematically in Figure 2.
In the MPMSC scheme, n_doms linear systems of the form (10), each of order (m_i + n_aggs), have to be solved in every iteration. Equivalently, the linear system (10) can be written in a reordered block form, with a fine block of size (m_i × m_i) and an aggregated block of size (n_aggs × n_aggs). The vector u_i is prolonged to the full solution vector x of domain Ω, and the aggregated (coarse) components are used as auxiliaries. In the MPMSC preprocessing step each matrix A_i is factorized by the sparse LU factorization method, compare [32]: A_i = L_i U_i. Therefore, the preconditioning operation can be described by the forward and backward substitutions L_i y_i = R_i r and U_i u_i = y_i. The solution of the local linear systems is performed inherently in parallel; thus each core computes one backward-forward substitution process. The aggregated (coarse) components are ordered last in the coefficient matrix A_i, because they have many connections with the fine components and consequently many nonzeros in their rows/columns. By this ordering, the fill-in of the L and U matrices is reduced, compare [32]. Direct solvers have memory and computational complexity limitations for large linear systems. In order to keep the size of A_i below these limits, the coefficient matrix A should be partitioned into a suitable number of subdomains. Maintaining the size of A_i almost constant, regardless of the size of A, results in almost constant LU factorization computational work.
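One preconditioning step can be sketched with SciPy's sparse LU (the tridiagonal matrix, the two-subdomain split, and the single coarse column per subdomain are hypothetical): each local matrix A_i = R_i A P_i is factorized once, and every application of the preconditioner performs only forward/backward substitutions.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Hypothetical 6x6 1D Laplacian and residual vector.
A = sp.diags([[-1.0] * 5, [2.0] * 6, [-1.0] * 5], [-1, 0, 1], format="csc")
r = np.ones(6)

# Two subdomains; each P_i keeps its 3 fine vertices as identity columns
# and averages the other subdomain's vertices into one coarse column.
P_list, Pt_list = [], []
for fine, other in (([0, 1, 2], [3, 4, 5]), ([3, 4, 5], [0, 1, 2])):
    P = np.zeros((6, 4))
    P[fine, [0, 1, 2]] = 1.0         # identity columns for the fine part
    P[other, 3] = 1.0 / 3.0          # averaged aggregated (coarse) column
    Pt = np.zeros((6, 4))
    Pt[fine, [0, 1, 2]] = 1.0        # prolong fine part only, drop coarse
    P_list.append(P)
    Pt_list.append(Pt)

# Preprocessing: factorize each local matrix A_i = P_i^T A P_i once.
lu_list = [spla.splu(sp.csc_matrix(P.T @ (A @ P))) for P in P_list]

# One preconditioner application: z = sum_i Ptilde_i A_i^{-1} R_i r.
z = np.zeros(6)
for P, Pt, lu in zip(P_list, Pt_list, lu_list):
    z += Pt @ lu.solve(P.T @ r)      # forward/backward substitution only
```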
It should be stated that the MPMSC scheme is inherently parallel and thus suitable for distributed systems. The MPMSC scheme is used as a preconditioner to the Parallel Preconditioned restarted Generalized Minimum RESidual (PPGMRES(m)) method, which is a parallel variant of the preconditioned GMRES(m) method proposed by Douglas et al., compare [28], and is based on Gram-Schmidt orthogonalization, compare [33]. The size of the coefficient matrix A is denoted by N, E[·] is the expected value operator, and nnz(·) is an operator returning the number of nonzero elements of a matrix. Also, let us denote by m_node the average number of components that correspond to a node, which usually equals (cores per node) × E[m_i]. The restart parameter of PPGMRES(m) is denoted by m. MPMSC requires the following matrices and vectors in the preprocessing phase.

Numerical Results
In this section, the effectiveness and applicability of the MPMSC scheme are examined. The numerical experiments were performed on a BlueGene/P (BG/P) supercomputer with the following specifications: CPU: 1024× quad-core PowerPC 450 at 850 MHz; RAM: 4 GB/node; interconnect: 3D-torus network with bandwidth 5.1 GBps. The methods are designed for distributed systems with multicore nodes, using the MPI and OpenMP environments, compare [34]. Each multicore node executes an MPI process, which undertakes the computations for as many subdomains as the number of its cores. The parallelization in each multicore node is achieved with OpenMP, so that every subdomain is mapped to one core. Having an equal number of cores and subdomains allows the parallel computation of every LU factorization. Moreover, the software libraries BLAS (Basic Linear Algebra Subprograms), compare [35], and Metis, compare [23], have been used. The restart parameter of PPGMRES(m) was set to 20 and the maximum number of iterations was set to 500. The termination criterion was ‖r_k‖_2 < 10^{−8}‖b‖_2. The execution time is given in seconds. The number of nonzero elements in the matrix A is denoted by nnz(A). The convergence behavior of the PPGMRES(m) is given as outer (inner) iterations. The inner iterations are the number of iterations after the last restart of the PPGMRES(m), and the outer iterations are the number of times PPGMRES(m) was restarted. The neighborhood distance parameter λ has been set to one, unless stated otherwise explicitly.
The formation of the prolongation operators and the LU factorization of the local matrices are included in the initialization phase of the preconditioning scheme, which is part of the preprocessing time (Pre-Time). The time required for the graph partitioning, using Metis, compare [23], is not included in the Pre-Time. It should be noted that parallel graph partitioning schemes, such as ParMETIS, compare [36], can be used to increase parallel performance. The communication time of the PPGMRES(m) is given as Comm-Time, and GMRES-Time is the execution time of the PPGMRES(m). It should be mentioned that the Comm-Time is included in the GMRES-Time. Total-Time is the sum of the Pre-Time and the GMRES-Time. The speedup is computed using the execution on 64 cores as a baseline, unless stated otherwise explicitly.

3D Poisson Problem.
The PPGMRES(m) in conjunction with the MPMSC scheme has been used for solving the Poisson equation in three space variables subject to Dirichlet boundary conditions. The Poisson equation was discretized with the 7-point stencil. The right-hand side of the linear systems was computed as the product of the coefficient matrix A and the solution vector set to [0, 1, ..., N − 1]^T, where N denotes the order of the linear system. In Table 1, the convergence behavior and the parallel performance of the PPGMRES(m) method, in conjunction with the MPMSC scheme, for the 3D Poisson problem (N = 1,000,000) are presented for various numbers of cores-subdomains. The communication time is relative to the required iterations for convergence to the prescribed tolerance and the number of cores. It can be easily seen that, as the number of cores and subdomains increases, the number of required iterations for convergence to the prescribed tolerance decreases for the 3D Poisson problem. In Table 2, the convergence behavior and the parallel performance of the PPGMRES(m) method, in conjunction with the Block Jacobi preconditioner, for the 3D Poisson problem (N = 1,000,000) with various numbers of cores-subdomains are given. It should be stated that the MPMSC scheme presents better convergence behavior than the Block Jacobi preconditioning scheme and therefore requires less overall communication time, especially for many cores, where the convergence behavior of the Block Jacobi method is degraded. Therefore, it should be noted that the MPMSC scheme has better scalability than the Block Jacobi preconditioner. In Figure 3, the speedup of the PPGMRES(m) method, in conjunction with the MPMSC scheme, for the 3D Poisson problem (N = 1,000,000) is depicted. In Figure 4, the weak scalability diagram of time versus cores for the MPMSC scheme is presented. The order of the linear system increases as the number of cores increases, since the local linear system should be of constant order. This results in augmenting, by 10,000 more
unknowns, the linear system for every added core. The Total-Time should ideally remain constant. It should be noted that the Pre-Time is almost constant, as the initialization phase of the preconditioner is an inherently parallel process. From Figures 3 and 4 it is evident that the proposed method is scalable up to a large number of cores for the efficient solution of large linear systems. It should be mentioned that partitioning the domain Ω into more subdomains improves the convergence behavior of the method, since the grid corresponding to the aggregated (coarse) components becomes finer, which improves the accuracy of the local approximations to the solution. The improved convergence behavior reduces the communications and hence improves the performance of the proposed scheme. The reduced communications, combined with almost constant local computational work, result in superlinear speedups, especially in the case of a moderate number of subdomains. For a small number of cores the Pre-Time is dominant compared to the GMRES-Time, and, as the preprocessing phase is an inherently parallel process, both methods have satisfactory speedup. When the number of cores is 1024, the Pre-Time is a small percentage of the Total-Time; therefore the speedup decreases, since it relies on the parallelization of the PPGMRES(m). It should be noted that the GMRES-Time increases proportionally to the problem size, as global communications, related to the synchronization of distributed vectors, and sequential redundant operations of PPGMRES(m) depend on the size of the global vectors, compare [28], as can be seen in Figure 4. However, the inherently parallel preprocessing phase scales, with satisfactory speedup, up to a large number of cores-subdomains. Thus, the Total-Time required by the proposed scheme is improved. Furthermore, the convergence behavior of the MPMSC scheme for the 3D Poisson problem with 10,000 unknowns per subdomain is shown in Figure 5. The total number of iterations is computed as
((outer × m) + inner). It should be stated that the outer iterations of the MPMSC scheme are almost constant for a large number of subdomains. Moreover, the convergence behavior of the MPMSC scheme is related to the quality of the graph partitioning provided by Metis, compare [23]. In Figure 6, the number of aggregates (n_aggs) of the MPMSC scheme against the number of subdomains for the 3D Poisson problem with 10,000 unknowns per subdomain is depicted. In Table 3, the performance and convergence behavior of the MPMSC scheme with the distance two neighborhood for the 3D Poisson problem (N = 1,000,000) with various numbers of cores-subdomains are presented. It should be mentioned that using the distance two neighborhood in the subspace compression technique results in fewer aggregates compared to the distance one neighborhood, which decreases the memory requirements. However, since the required number of iterations for convergence increases, the performance of the iterative scheme is degraded.
Furthermore, the parallel performance of the proposed method is compared to sequential preconditioned GMRES(m) in conjunction with the dual threshold incomplete LU factorization (ILUT(p, τ)) preconditioner, compare [1]. The speedup of PPGMRES(m) in conjunction with the MPMSC scheme with respect to the preconditioned GMRES(m) in conjunction with the ILUT(p, τ) preconditioner, with p = 50 and τ = 10^{−3}, for various numbers of cores, for the 3D Poisson problem (N = 1,000,000), is presented in Figure 7. The preconditioned GMRES(m) in conjunction with the ILUT(p, τ) preconditioner required 2 (14) iterations for convergence to the prescribed tolerance. The PPGMRES(m) in conjunction with the MPMSC scheme performed better, in terms of total execution time, compared to the preconditioned GMRES(m) in conjunction with the ILUT(p, τ) preconditioner, especially for large numbers of cores-subdomains.

UFL Matrices.
In order to assess the applicability, scalability, and performance of the MPMSC scheme for solving general large sparse linear systems, seven matrices from the University of Florida sparse matrix collection, compare [37], have been considered. Information concerning the aforementioned matrices is given in Table 4. The right-hand-side vectors for the atmosmodd, atmosmodl, and thermal2 problems were given by the University of Florida sparse matrix collection, compare [37], whereas, for the rest of the problems, they were computed as the product of the coefficient matrix A and the solution vector set to [0, 1, ..., N − 1]^T. The convergence behavior and the parallel performance of the PPGMRES(m) method, in conjunction with the MPMSC scheme for the atmosmodd, atmosmodl, tmt_sym, apache2, and cage14 problems, are presented in Table 5. In Tables 6 and 7, the convergence behavior and the parallel performance of the PPGMRES(m) method, in conjunction with the MPMSC scheme for the ecology2 and thermal2 problems, are given. In Figures 8 and 9, the speedup of the PPGMRES(m) method, in conjunction with the MPMSC scheme for the ecology2 and thermal2 problems, is shown.
For the ecology2 problem, the local coefficient matrices have few nonzero elements, and hence Pre-Time is a small fraction of the Total-Time, not affecting the Total-Speedup curve, since the LU factorization requires a reduced number of floating point operations. In contrast, for the thermal2 problem, the local coefficient matrices have more nonzero elements than for the ecology2 problem, resulting in increased Pre-Time, since the LU factorization requires an increased number of floating point operations, especially for a small number of cores/subdomains. As the number of cores/subdomains increases, the Pre-Time decreases rapidly, because there are fewer nonzero elements in the local coefficient matrices, resulting in improved performance of the LU factorization. Therefore, superlinear speedup is achieved for the thermal2 problem but not for the ecology2 problem.
The speedup of PPGMRES(m) in conjunction with the MPMSC scheme with respect to the preconditioned GMRES(m) in conjunction with the ILUT(p, τ) preconditioner, with p = 100 and τ = 10^{−6}, for various numbers of cores, for the thermal2 problem, is presented in Figure 10. The preconditioned GMRES(m) in conjunction with the ILUT(p, τ) preconditioner required 11 (2) iterations for convergence to the prescribed tolerance. The PPGMRES(m) based on the MPMSC scheme performed better, in terms of total execution time, compared to the preconditioned GMRES(m) in conjunction with the ILUT(p, τ) preconditioner, especially for large numbers of cores-subdomains.
It should be noted that the preconditioned GMRES(m) in conjunction with the ILUT(p, τ) preconditioner for the ecology2 problem either exceeds the available memory resources or does not converge to the prescribed tolerance, for any choice of p or τ that does not exceed the available memory resources.

Conclusion
The proposed scheme, namely, the multiprojection method with subspace compression, has been shown to be effective in solving various problems on distributed memory parallel systems. Further, the PPGMRES(m) in conjunction with the MPMSC scheme scales up to a large number of cores, exhibiting superlinear speedups, especially for problems that require large computational work in the initialization phase of the preconditioning scheme. The proposed scheme has improved convergence behavior as the number of subdomains increases, compared to extant domain decomposition based preconditioners. Future work will concentrate on using asynchronous and neighboring communications between heterogeneous compute nodes, in order to assess the performance of the method in modern cloud environments. Moreover, the choice of the most suitable parallel graph partitioning algorithm, according to performance and partitioning quality, will be studied.

Figure 2 :
Figure 2: The grid corresponding to the projection of the domain Ω onto subdomain B_0.

Figure 6: The number of aggregated (coarse) components (n_aggs) of the MPMSC scheme for the 3D Poisson problem with various numbers of subdomains.
Figure 7: The speedup of PPGMRES(m) in conjunction with the MPMSC scheme with respect to the preconditioned GMRES(m) in conjunction with the ILUT(p, τ) preconditioner for the 3D Poisson problem (N = 1,000,000).

Figure 10 :
Figure 10: The speedup of PPGMRES(m) in conjunction with the MPMSC scheme with respect to the preconditioned GMRES(m) in conjunction with the ILUT(p, τ) preconditioner, with p = 100 and τ = 10^{−6}, for various numbers of cores, for the thermal2 problem.
Figure 1: A square domain Ω discretized with mesh size h = 1/7 partitioned into 16 subdomains Ω_i.

Table 1 :
Convergence behavior and parallel performance of the PPGMRES(m) method, in conjunction with the MPMSC scheme, for the 3D Poisson problem (N = 1,000,000).

Table 2 :
Convergence behavior and parallel performance of the PPGMRES(m) method, in conjunction with the Block Jacobi preconditioner, for the 3D Poisson problem (N = 1,000,000).

Table 3 :
Convergence behavior and parallel performance of the PPGMRES(m) method, in conjunction with the MPMSC scheme with the distance two neighborhood, for the 3D Poisson problem (N = 1,000,000).

Table 4 :
Problems from the University of Florida sparse matrix collection.

Table 5 :
Convergence behavior and parallel performance of the PPGMRES(m) method, in conjunction with the MPMSC scheme, for various problems.
"-": memory limits were reached in the initialization phase.

Table 6 :
Convergence behavior and parallel performance of the PPGMRES(m) method, in conjunction with the MPMSC scheme, for the ecology2 problem.

Table 7 :
Convergence behavior and parallel performance of the PPGMRES(m) method, in conjunction with the MPMSC scheme, for the thermal2 problem.
Figure 8: The speedup of the PPGMRES(m) method, in conjunction with the MPMSC scheme, for the ecology2 problem.
Figure 9: The speedup of the PPGMRES(m) method, in conjunction with the MPMSC scheme, for the thermal2 problem.