^{1}

The present paper describes a parallel preconditioned algorithm for the solution of partial eigenvalue problems for large sparse symmetric matrices. Namely, we consider the Deflation-Accelerated Conjugate Gradient (DACG) algorithm accelerated by factorized-sparse-approximate-inverse (FSAI)-type preconditioners. We present an enhanced parallel implementation of the FSAI preconditioner and make use of the recently developed Block FSAI-IC preconditioner, which combines the FSAI and the Block Jacobi-IC preconditioners. Results on large matrices arising from the finite element discretization of geomechanical models reveal that DACG accelerated by this type of preconditioner is competitive with publicly available parallel eigensolvers.

The computation by iterative methods of a few of the leftmost eigenpairs of large sparse symmetric matrices is a common task in many scientific and engineering applications.

The basic idea of the latter is to minimize the Rayleigh quotient in the subspace orthogonal to the previously computed eigenvectors.

The outline of the paper is as follows: in Section

The DACG algorithm sequentially computes the eigenpairs, starting from the leftmost one.

Choose tolerance and initial vector
DO
    steps (3.1)–(3.12)
UNTIL convergence
END DO
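The loop above can be sketched in a few lines. The following is an illustrative Python reconstruction of a DACG-type iteration, that is, preconditioned nonlinear conjugate gradient minimization of the Rayleigh quotient with deflation against the previously computed eigenvectors. It is a toy dense-matrix sketch, not the authors' FORTRAN 90/MPI code, and all names (`dacg`, `apply_prec`) are ours.

```python
import numpy as np

def dacg(A, apply_prec, U, x0, tol=1e-8, maxit=5000):
    """Toy DACG-type iteration: preconditioned nonlinear CG minimizing
    the Rayleigh quotient q(x) = (x.Ax)/(x.x) in the subspace
    orthogonal to the columns of U (previously computed eigenvectors)."""
    def deflate(v):                       # project out converged eigenvectors
        return v - U @ (U.T @ v)

    x = deflate(x0)
    x /= np.linalg.norm(x)
    q = x @ (A @ x)                       # current Rayleigh quotient
    g = 2.0 * (A @ x - q * x)             # its gradient (for unit-norm x)
    p = np.zeros_like(x)
    gz_old = 0.0
    for _ in range(maxit):
        z = deflate(apply_prec(g))        # preconditioned, deflated gradient
        gz = g @ z
        beta = 0.0 if gz_old == 0.0 else gz / gz_old
        gz_old = gz
        p = -z + beta * p                 # new search direction
        # Exact line search: the stationary points of q(x + t p) solve
        # (ce - bf) t^2 + (cd - af) t + (bd - ae) = 0, with
        # a = x.Ax, b = x.Ap, c = p.Ap, d = x.x, e = x.p, f = p.p.
        Ax, Ap = A @ x, A @ p
        a, b, c = x @ Ax, x @ Ap, p @ Ap
        d, e, f = x @ x, x @ p, p @ p
        ts = np.roots([c * e - b * f, c * d - a * f, b * d - a * e])
        ts = ts[np.abs(ts.imag) < 1e-10].real
        if ts.size == 0:
            break
        qv = [(a + 2 * b * t + c * t * t) / (d + 2 * e * t + f * t * t)
              for t in ts]
        x = deflate(x + ts[int(np.argmin(qv))] * p)
        x /= np.linalg.norm(x)
        q = x @ (A @ x)
        g = 2.0 * (A @ x - q * x)
        if np.linalg.norm(A @ x - q * x) <= tol * abs(q):  # relative residual
            break
    return q, x
```

With `apply_prec` the application of an FSAI-type preconditioner, this computes the leftmost eigenpair; subsequent eigenpairs are obtained by passing the already converged eigenvectors in `U`.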

Schemes relying on Rayleigh quotient optimization are quite attractive for parallel computation; however, preconditioning is essential to ensure practical convergence. When seeking an eigenpair

The FSAI preconditioner, initially proposed in [
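For illustration, a toy dense version of the FSAI construction can be written as follows: for an SPD matrix A and a prescribed lower triangular sparsity pattern (here, that of the lower triangle of A itself), each row of the factor G is obtained from a small local SPD solve, so that G A Gᵀ ≈ I and M⁻¹ = Gᵀ G approximates A⁻¹. This is a sketch for exposition, not the parallel implementation discussed in the paper.

```python
import numpy as np

def fsai_factor(A):
    """Toy FSAI for SPD A with the sparsity pattern of tril(A):
    for each row i, restrict A to J = {j <= i : A[i, j] != 0},
    solve A[J, J] y = e_m (m = local position of i), and scale so
    that diag(G A G^T) = 1.  Rows are mutually independent, which
    is what makes the FSAI setup naturally parallel."""
    n = A.shape[0]
    G = np.zeros_like(A)
    for i in range(n):                    # each row: an independent task
        J = [j for j in range(i + 1) if A[i, j] != 0.0]
        AJJ = A[np.ix_(J, J)]             # small local SPD block
        e = np.zeros(len(J))
        e[-1] = 1.0                       # i is the last index in J
        y = np.linalg.solve(AJJ, e)
        G[i, J] = y / np.sqrt(y[-1])      # y[-1] > 0 since A is SPD
    return G
```

The preconditioned operator G A Gᵀ is SPD with unit diagonal; inside DACG the preconditioner is applied as z = Gᵀ(G g), which only requires two sparse matrix-vector products.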

We have developed a parallel code, written in FORTRAN 90, which exploits the MPI library for exchanging data among the processors. We used a block-row distribution of all matrices (

Regarding the preconditioner computation, we stress that any row of the FSAI factors can be computed independently of the others, so that the setup phase parallelizes naturally.

The DACG iterative solver is essentially based on scalar products and matrix-vector products. We made use of an optimized parallel matrix-vector product developed in [
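The key point of a block-row parallel matrix-vector product can be emulated in serial Python: each process owns a contiguous strip of rows and, before multiplying, only needs the entries of x whose column indices fall outside its own strip. The helper names below are illustrative, and the concatenation stands in for the MPI communication.

```python
import numpy as np

def needed_columns(A_strip, row_range):
    """Indices of x that the owner of rows [lo, hi) must receive from
    other processes: nonzero column indices outside its own strip."""
    lo, hi = row_range
    cols = {int(j) for row in A_strip
                   for j in np.nonzero(row)[0] if not (lo <= j < hi)}
    return sorted(cols)

def block_row_matvec(A, x, nparts):
    """Serial emulation of a block-row parallel product: every 'process'
    multiplies its strip of rows; concatenation plays the role of the
    final MPI exchange."""
    n = A.shape[0]
    bounds = np.linspace(0, n, nparts + 1, dtype=int)
    return np.concatenate([A[bounds[p]:bounds[p + 1]] @ x
                           for p in range(nparts)])
```

For a banded matrix, `needed_columns` shows that only a few boundary entries of x must be communicated per process, which is why the product scales well.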

The Block FSAI-IC preconditioner, BFSAI-IC in the following, is a recent development for the parallel solution of Symmetric Positive Definite (SPD) linear systems. Assume that

Let

Consider the set of lower block triangular matrices

The differentiation of (

It is easy to show that

For its computation BFSAI-IC needs the selection of

Though theoretically not necessary, three additional user-specified parameters are worth introducing in order to better control the memory occupation and the BFSAI-IC density:

An OpenMP implementation of the algorithms above is available in [
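For intuition on the Block Jacobi-IC ingredient, the following simplified sketch applies a block Jacobi preconditioner whose diagonal blocks are factorized by Cholesky (exact here, where BFSAI-IC uses an incomplete factorization); the block size and function names are illustrative, and the actual BFSAI-IC couples this block factorization with an FSAI-type factor.

```python
import numpy as np

def block_jacobi_setup(A, block_size):
    """Cholesky-factorize the diagonal blocks of A (exact Cholesky
    here, standing in for the incomplete Cholesky of Block Jacobi-IC)."""
    n = A.shape[0]
    facts = []
    for lo in range(0, n, block_size):
        hi = min(lo + block_size, n)
        facts.append((lo, hi, np.linalg.cholesky(A[lo:hi, lo:hi])))
    return facts

def block_jacobi_apply(facts, r):
    """Apply M^{-1} r block by block: solve L L^T z_b = r_b."""
    z = np.empty_like(r)
    for lo, hi, L in facts:
        y = np.linalg.solve(L, r[lo:hi])      # forward substitution
        z[lo:hi] = np.linalg.solve(L.T, y)    # backward substitution
    return z
```

Each block solve is independent, so the application parallelizes over blocks; with one block per processor the preconditioner quality degrades as the processor count grows, which is the behavior observed for BFSAI-DACG iteration counts below.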

In this section we examine the performance of the parallel DACG preconditioned by both FSAI and BFSAI in the partial solution of four large-size sparse eigenproblems. The test cases, which we briefly describe below, are taken from different real engineering mechanical applications. In detail, they are as follows.

FAULT-639 is obtained from a structural problem discretizing a faulted gas reservoir with tetrahedral finite elements and triangular interface elements [

PO-878 arises in the simulation of the consolidation of a real gas reservoir of the Po Valley, Italy, used for underground gas storage purposes (for details, see [

GEO-1438 is obtained from a geomechanical problem discretizing a region of the earth crust subject to underground deformation. The computational domain is a box with an areal extent of 50 × 50 km and a depth of 10 km, consisting of regularly shaped tetrahedral finite elements. The problem arises from a 3D discretization with three displacement unknowns associated with each node of the grid [

CUBE-6091 arises from the equilibrium of a concrete cube discretized by a regular unstructured tetrahedral grid.

Matrices FAULT-639 and GEO-1438 are publicly available in the University of Florida Sparse Matrix Collection at

In Table

Size, number of nonzeros, and three representative eigenvalues of the test matrices.

Matrix | Size | Nonzeros
---|---|---
FAULT-639 | 638,802 | 28,614,564
PO-878 | 878,355 | 38,847,915
GEO-1438 | 1,437,960 | 63,156,690
CUBE-6091 | 6,091,008 | 270,800,586

The computational performance of FSAI is compared with that obtained by using BFSAI as implemented in [

To study parallel performance we use a strong scaling measure, observing how the CPU times vary with the number of processors for a fixed total problem size. Denote by T_p the total CPU time on p processors and by p0 the smallest number of processors employed for a given matrix; the relative speedup is then S_p = p0 T_p0 / T_p and the parallel efficiency is E_p = S_p / p.
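Assuming the tables report the relative speedup S_p = p_ref · T_ref / T_p, with p_ref the smallest processor count run for a given matrix, and the efficiency E_p = S_p / p, these measures can be recomputed directly from the tabulated timings as a self-check:

```python
def speedup_efficiency(p_ref, t_ref, p, t):
    """Relative speedup S_p = p_ref * T_ref / T_p and parallel
    efficiency E_p = S_p / p for a strong-scaling experiment."""
    s = p_ref * t_ref / t
    return s, s / p

# FAULT-639 with FSAI-DACG: T_4 = 287.3 s, T_8 = 152.2 s
s, e = speedup_efficiency(4, 287.3, 8, 152.2)   # s ~ 7.55, e ~ 0.94
```

The values reproduce the FAULT-639 row (S_8 = 7.6, E_8 = 0.94) and likewise the CUBE-6091 rows referenced to p_ref = 16.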

In this section we report the results of our FSAI-DACG implementation in the computation of the 10 leftmost eigenpairs of the four test problems. We used the exit test described in the DACG algorithm (see Algorithm

Number of iterations, timings, and scalability indices for FSAI-DACG in the computation of the 10 leftmost eigenpairs of the four test problems.

Matrix | p | iter | T_prec (s) | T_iter (s) | T_tot (s) | S_p | E_p
---|---|---|---|---|---|---|---

FAULT-639 | 4 | 4448 | 25.9 | 261.4 | 287.3 | ||

8 | 4448 | 13.2 | 139.0 | 152.2 | 7.6 | 0.94 | |

16 | 4448 | 6.6 | 69.4 | 76.0 | 15.1 | 0.95 | |

32 | 4448 | 4.0 | 28.2 | 32.2 | 35.7 | 1.11 | |

64 | 4448 | 1.9 | 15.5 | 17.4 | 66.1 | 1.03 | |

128 | 4448 | 1.1 | 9.4 | 10.5 | 109.0 | 0.85 | |

PO-878 | 4 | 5876 | 48.1 | 722.5 | 770.6 | ||

8 | 5876 | 25.2 | 399.8 | 425.0 | 7.3 | 0.91 | |

16 | 5876 | 11.4 | 130.2 | 141.6 | 21.8 | 1.36 | |

32 | 5876 | 6.8 | 65.8 | 72.5 | 42.5 | 1.33 | |

64 | 5876 | 4.1 | 30.1 | 34.1 | 90.3 | 1.41 | |

128 | 5876 | 1.9 | 19.1 | 21.0 | 146.8 | 1.15 | |

GEO-1438 | 4 | 6216 | 90.3 | 901.5 | 991.7 | ||

8 | 6216 | 47.5 | 478.9 | 526.4 | 7.5 | 0.94 | |

16 | 6216 | 24.7 | 239.4 | 264.1 | 15.0 | 0.94 | |

32 | 6216 | 13.6 | 121.0 | 134.6 | 29.5 | 0.92 | |

64 | 6216 | 8.2 | 60.9 | 69.1 | 57.4 | 0.90 | |

128 | 6216 | 4.2 | 29.5 | 33.8 | 117.5 | 0.92 | |

256 | 6216 | 2.3 | 19.1 | 21.4 | 185.4 | 0.72 | |

CUBE-6091 | 16 | 15796 | 121.5 | 2624.8 | 2746.2 | ||

32 | 15796 | 62.2 | 1343.8 | 1406.0 | 31.3 | 0.98 | |

64 | 15796 | 32.5 | 737.0 | 769.5 | 57.1 | 0.89 | |

128 | 15796 | 17.3 | 388.4 | 405.7 | 108.3 | 0.85 | |

256 | 15796 | 9.1 | 183.9 | 192.9 | 227.8 | 0.89 | |

512 | 15796 | 5.7 | 106.0 | 111.7 | 393.5 | 0.77 | |

1024 | 15796 | 3.8 | 76.6 | 80.4 | 546.6 | 0.53 |

We present in this section the results of DACG accelerated by the BFSAI-IC preconditioner for the approximation of the 10 leftmost eigenpairs of the four test problems.

Table

FAULT-639:

PO-878:

GEO-1438:

CUBE-6091:

Performance of BFSAI-DACG for matrix PO-878 with 2 to 8 processors and different parameter values.

 | | | | iter (p = 2) | T (s) | iter (p = 4) | T (s) | iter (p = 8) | T (s)
---|---|---|---|---|---|---|---|---|---

2 | 10 | 0.01 | 10 | 2333 | 385.76 | 2877 | 286.24 | 3753 | 273.77 |

2 | 10 | 0.05 | 10 | 2345 | 415.81 | 2803 | 245.42 | 3815 | |

2 | 10 | 0.05 | 20 | 2186 | 2921 | 276.41 | 3445 | 257.18 | |

2 | 10 | 0.00 | 10 | 2328 | 445.16 | 2880 | 241.23 | 3392 | 269.41 |

2 | 20 | 0.05 | 10 | 2340 | 418.20 | 2918 | 3720 | 253.98 | |

3 | 10 | 0.05 | 10 | 2122 | 375.17 | 2638 | 228.39 | 3366 | 149.59 |

3 | 10 | 0.05 | 20 | 1946 | 433.04 | 2560 | 304.43 | 3254 | 263.51 |

3 | 10 | 0.05 | 30 | 1822 | 411.00 | 2481 | 321.30 | 3176 | 179.67 |

3 | 10 | 0.05 | 40 | 1729 | 439.47 | 2528 | 346.82 | 3019 | 188.13 |

4 | 10 | 0.05 | 10 | 2035 | 499.45 | 2469 | 350.03 | 3057 | 280.31 |

Number of iterations for BFSAI-DACG in the computations of the 10 leftmost eigenpairs.

Matrix | p = 2 | p = 4 | p = 8 | p = 16 | p = 32 | p = 64 | p = 128 | p = 256 | p = 512 | p = 1024
---|---|---|---|---|---|---|---|---|---|---

FAULT-639 | 1357 | 1434 | 1594 | 2002 | 3053 | 3336 | 3553 | |||

PO-878 | 2122 | 2638 | 3366 | 4157 | 4828 | 5154 | 5373 | |||

GEO-1438 | 1458 | 1797 | 2113 | 2778 | 3947 | 4647 | 4850 | 4996 | ||

CUBE-6091 | 5857 | 6557 | 7746 | 8608 | 9443 | 9996 | 10189 | 9965 |

The user-specified parameters for BFSAI-IC given above provide evidence that it is important to build a dense preconditioner based on the lower nonzero pattern of

We recall that, presently, the code BFSAI-IC is implemented in OpenMP, and the results in terms of CPU time are significant only for

The only meaningful comparison between FSAI-DACG and BFSAI-DACG can therefore be carried out in terms of iteration counts, which are smaller for BFSAI-DACG when the number of processors is small. The gap between FSAI and BFSAI iterations narrows as the number of processors increases.

In order to validate the effectiveness of our preconditioning in the proposed DACG algorithm with respect to already available public parallel eigensolvers, the results given in Tables

We first carried out a preliminary set of runs with the aim of assessing the optimal value of the block size

Iterations and CPU time for the iterative solver of LOBPCG-

Matrix | iter | T (s) | iter | T (s) | iter | T (s) | iter | T (s)
---|---|---|---|---|---|---|---|---

FAULT-639 | 156 | 79.5 | 157 | 85.3 | 157 | 96.1 | 160 | 128.1 |

PO-878 | 45 | 117.0 | 41 | 131.6 | 38 | 151.3 | 35 | 192.6 |

GEO-1438 | 23 | 123.7 | 72 | 173.7 | 30 | 152.5 | 121 | 291.1 |

CUBE-6091 | 101 | 1670.5 | 143 | 2414.0 | 38 | 1536.7 | 35 | 1680.9 |

Table

Number of iterations, timings, and scalability of LOBPCG-

Matrix | p | iter | T_prec (s) | T_iter (s) | T_tot (s) | S_p | E_p
---|---|---|---|---|---|---|---

FAULT-639 | 4 | 155 | 2.5 | 331.2 | 333.7 | ||

8 | 156 | 1.3 | 167.6 | 168.9 | 7.9 | 0.99 | |

16 | 156 | 0.8 | 79.5 | 80.3 | 16.6 | 1.04 | |

32 | 150 | 0.5 | 38.8 | 39.3 | 34.0 | 1.06 | |

64 | 145 | 0.3 | 22.2 | 22.5 | 59.4 | 0.93 | |

128 | 157 | 0.1 | 14.8 | 14.9 | 89.7 | 0.70 | |

PO-878 | 4 | 45 | 3.3 | 438.4 | 441.7 | ||

8 | 50 | 1.3 | 232.3 | 234.0 | 7.6 | 0.94 | |

16 | 45 | 1.0 | 117.0 | 118.0 | 15.0 | 0.94 | |

32 | 45 | 0.7 | 63.2 | 63.9 | 27.6 | 0.86 | |

64 | 47 | 0.4 | 34.4 | 34.8 | 50.8 | 0.79 | |

128 | 41 | 0.3 | 19.44 | 19.74 | 89.5 | 0.70 | |

GEO-1438 | 4 | 26 | 7.7 | 478.0 | 485.7 | ||

8 | 22 | 4.0 | 256.8 | 260.8 | 7.5 | 0.93 | |

16 | 23 | 2.1 | 123.7 | 125.8 | 15.4 | 0.96 | |

32 | 28 | 1.2 | 73.1 | 74.3 | 26.2 | 0.82 | |

64 | 23 | 0.8 | 35.5 | 36.3 | 53.5 | 0.84 | |

128 | 25 | 0.5 | 20.3 | 20.8 | 93.2 | 0.73 | |

256 | 26 | 0.3 | 12.9 | 13.2 | 147.2 | 0.57 | |

CUBE-6091 | 16 | 38 | 9.2 | 1536.7 | 1545.9 | ||

32 | 36 | 4.7 | 807.5 | 812.2 | 30.5 | 0.95 | |

64 | 38 | 3.2 | 408.2 | 411.4 | 60.1 | 0.94 | |

128 | 41 | 1.6 | 251.4 | 253.0 | 97.8 | 0.76 | |

256 | 35 | 0.9 | 105.9 | 106.8 | 231.6 | 0.90 | |

512 | 39 | 0.6 | 65.3 | 65.9 | 375.3 | 0.73 | |

1024 | 37 | 0.3 | 37.7 | 38.0 | 650.9 | 0.64 |

All matrices have to be preliminarily scaled by their maximum coefficient in order to allow for convergence. To make the comparison meaningful, the outer iterations of the different methods are stopped when the average relative error measure of the computed leftmost eigenpairs gets smaller than
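One plausible form of such an average error measure, assumed here purely for illustration, is the mean relative eigenresidual over the computed pairs:

```python
import numpy as np

def avg_relative_residual(A, eigvals, eigvecs):
    """Average relative eigenresidual over computed eigenpairs
    (unit-norm eigenvectors assumed): mean of ||A v - lam v|| / |lam|."""
    errs = [np.linalg.norm(A @ v - lam * v) / abs(lam)
            for lam, v in zip(eigvals, eigvecs)]
    return sum(errs) / len(errs)
```

Such a measure is cheap to evaluate at every outer iteration, since the products A v are already available to both solvers.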

To better compare our FSAI-DACG with the LOBPCG method, we depict in Figure

Comparison between FSAI-DACG and LOBPCG-

Finally, we have carried out a comparison of the two eigensolvers in the computation of only the leftmost eigenpair. Unlike LOBPCG, which approximates all the selected eigenpairs simultaneously, DACG computes the selected eigenpairs sequentially. For this reason DACG should, at least in principle, be the better choice when just one eigenpair is sought. We investigate this feature, and the results are summarized in Table

Performance of LOBPCG-

Matrix | LOBPCG: iter | T (s) | LOBPCG: iter | T (s) | FSAI-DACG: iter | T (s)
---|---|---|---|---|---|---

FAULT-639 | 144 | 10.1 | 132 | 17.5 | 1030 | 15.4 |

PO-878 | 99 | 43.2 | 34 | 29.1 | 993 | 20.4 |

GEO-1438 | 55 | 40.2 | 26 | 37.3 | 754 | 27.0

CUBE-6091 | 144 | 5218.8 | 58 | 522.4 | 3257 | 561.1 |

The parameters used to construct the FSAI preconditioner for these experiments are as follows:

FAULT-639.

PO-878.

GEO-1438.

CUBE-6091.

These parameters differ from those employed to compute the FSAI preconditioner in the assessment of the 10 leftmost eigenpairs; they have been selected to produce a preconditioner that is relatively cheap to compute, since otherwise the setup time would prevail over the iteration time. Similarly, to compute just one eigenpair with LOBPCG we need to set a different value of pcgitr, the number of inner iterations. As can be seen from Table

We have presented the parallel DACG algorithm for the partial eigensolution of large sparse SPD matrices. The scalability of DACG, accelerated with FSAI-type preconditioners, has been studied on a set of very large test matrices arising from real engineering mechanical applications. Our FSAI-DACG code has shown performance comparable with that of the LOBPCG eigensolver within the well-known public domain package,

The authors acknowledge the CINECA Iscra Award SCALPREC (2011) for the availability of HPC resources and support.