A PETSc-Based Parallel Implementation of Finite Element Method for Elasticity Problems

Starting a parallel code from scratch is not a good choice for parallel programming finite element analysis of elasticity problems because we cannot make full use of our existing serial code and the programming work is painful for developers. PETSc provides libraries for various numerical methods that can give us more flexibility in migrating our serial application code to a parallel implementation.We present the approach to parallelize the existing finite element codewithin the PETSc framework. Our approach permits users to easily implement the formation and solution of linear system arising from finite element discretization of elasticity problem.The main PETSc subroutines are given for the main parallelization step and the corresponding code fragments are listed. Cantilever examples are used to validate the code and test the performance.


Introduction
Elasticity is a general problem in solid mechanics and it is fundamental for civil, structural, mechanical, and aeronautical engineering and also other fields of engineering and applied science.In the numerical techniques to solve elasticity problems, finite element method (FEM) [1] is one of the important methods.Throughout the whole finite element analysis procedure, the equations formation and solution are the main time-consuming parts.In the case of large-scale finite element analysis, most of the computation time is spent on the equations solution.The performance of the equations solver determines the overall performance of finite element code.So the finite element equations solver is attracting much more interest than other components in parallelization of finite element computation.Various types of parallel solvers for sparse matrices arising from finite element analysis have been developed.They are classified into two categories: the direct and the iterative solvers.Direct solvers have many weak points in the parallel processing of the finite element analysis for very large-scale problems.Generally, direct solvers require much larger storage and more operation counts than iterative solvers.Furthermore, direct solvers need much more communications among processors and are generally more difficult to parallelize than iterative solvers.Because of these difficulties and the disadvantages of direct solvers, most of researches on parallel finite element analysis focus on iterative methods and iterative solvers have been installed into more and more large-scale parallel finite element code [2][3][4].
In this paper, parallel finite element computation for elasticity problems is implemented based on the Portable, Extensible Toolkit for Scientific Computation (PETSc) [5] and the parallel code is developed and tested.The remainder of this paper is organized as follows.Section 2 reviews the aspects related to PETSc, FEM, and iterative solution.Sections 3 and 4 present the detailed parallel implementation of finite element method for elasticity problems with PETSc, including finite element equations formation and solution.In Section 5 the performance of the code is comprehensively measured with different test examples.Finally, Section 6 summarizes the main conclusions.and nonlinear equation solvers and time integrators that can be used in application codes written in many languages, such as Fortran, C, and C++.PETSc provides a variety of libraries each of which manipulates a particular family of objects.These libraries are organized hierarchically as in Figure 1, enabling users to employ the most appropriate level of abstraction for a particular problem.The operation performed on the objects has abstract interface which is simply a set of calling sequences, which makes the use of PETSc easy during the development of large-scale scientific application codes.Thus, PETSc provides a powerful set of tools for efficient modeling scientific applications and building largescale applications on high-performance computers.

Finite Element Method for Elasticity.
The displacementbased finite element method introduces an approximation for the displacement field in terms of shape functions and uses a weak formulation of the equations of equilibrium, straindisplacement relations, and constitutive relation to arrive at the linear system where u is the vector of unknown nodal displacements, F is the vector of nodal forces, and K, the structure stiffness matrix, is given by where selective matrix C  plays a transforming role between the local number and the global number of degrees of freedom (DOF).k  is the element stiffness matrix and f  is the element nodal forces, which are both computed with integration over each element.The structure stiffness matrix in (2) is a sparse and symmetric positive definite (SPD) of dimension  × , where  is the total number of degrees of freedom (DOF).

Krylov Subspace
Methods.Krylov subspace methods [6] are currently the most important iterative techniques for solving large linear systems.These techniques are based on projection processes onto Krylov subspaces.For solving the linear system Ax = b, a general projection method extracts an approximate solution x  from an affine subspace x 0 +   of dimension  by imposing the Petrov-Galerkin condition b − Ax  ⊥   , where   is another subspace of dimension .A Krylov subspace method is a method for which the subspace   is the Krylov subspace   (A, r 0 ) = span(r 0 , Ar 0 , A 2 r 0 , . . ., A −1 r 0 ), where r 0 = b − Ax 0 .The different versions of Krylov subspace methods arise from different choices of the subspace   .The conjugate gradient (CG) algorithm is one of the best known iterative techniques for solving sparse symmetric positive definite linear systems.It is a realization of an orthogonal projection technique onto the Krylov subspace   .
Although the methods are well founded in theory, they are likely to suffer from slow convergence for problems from practical applications such as solid mechanics and fluid dynamics.Preconditioning is an important means for improving Krylov subspace methods in these applications.It transforms the original linear system into one with the same solution, but which is easier to solve.
A mathematically equal preconditioned linear system is expressed as follows: where M is a preconditioner.One simple way to construct preconditioners is to split A into A = M − N. In theory, any splitting with nonsingular M which is close to A in some sense can be used.The Jacobi preconditioner [7] is a commonly used preconditioner with the form of M = diag(A).The SOR or SGS preconditioning matrix [6] is of the form M = LU, where L and U are the lower triangular part and the upper triangular part of A, respectively.Another simple way of defining a preconditioner is incomplete factorization of the matrix A [8].These incomplete LU factorization (ILU) preconditioners perform decomposition of the form A = LU − R, where L and U are the lower and upper parts of A with the same nonzero structure and R is the residual of the factorization.Because classical preconditioners, such as ILU and SSOR, have limited amount of parallelism, a number of alternative techniques have been developed that are specifically targeted at parallel environments, for example, additive Schwarz preconditioners [9] and multigrid preconditioners [10].

Finite Element Equations Assembly
According to the theory of finite element method, the calculation of element stiffness matrix and element nodal load vector only needs the information of the local element.So they can be easily parallelized without any communication.The global stiffness matrix and global nodal load vector are assembled with all element stiffness matrices and element nodal loads according to the relationship between the local number and the global number of DOFs.If nonoverlapping domain decomposition is used, the computation of the entities of the global stiffness matrix and global nodal load relating to the interface DOFs needs data exchange between adjacent subdomains.
To implement finite element equations assembly in parallel, the first step is to partition the domain into subdomains.The domain partitioning can be done by some graph partition libraries, such as Metis, which makes loads over processes balanced.The Metis subroutine is where  and  are numbers of elements and nodes,  is the element node array,  indicates the element type,  indicates the numbering scheme,  is the number of the parts,  stores the number of the cut edges,  stores the element partition vector, and  is the node partition vector.
After domain partition, subdomains are assigned to processes and the element stiffness matrices and load vectors of the subdomains are calculated concurrently.These element stiffness matrices and load vectors are then accumulated into the global stiffness matrix.To contain global stiffness matrix K, we must use PETSc calls to create a matrix object.Because the stiffness matrix is a sparse symmetric matrix, AIJ format (CSR) is used to store it.There are several ways to create a matrix with PETSc.We can call MatCreateMPIAIJ to create a parallel matrix.The command is where where  is the size of block,   and   are the numbers of diagonal and off-diagonal nonzero blocks per block row, and   and   are optional arrays of nonzero blocks per block row in the diagonal and off-diagonal portions of local matrix.
Since dynamic memory allocation and copying between old and new storage are very expensive, it is critical to preallocate the memory needed for the sparse matrix.This preallocation of memory is very important for achieving good performance during matrix assembly of an AIJ matrix or a BAIJ matrix, as this reduces the number of allocations and copies required.For a given finite element mesh, we can loop the neighboring nodes of each node to determine the nonzero structure of each block row.So it is easy to determine the   and   in subroutine MatCreateMPIAIJ or MatCreateM-PIBAIJ before computation.The Fortran code to preallocate memory for MPIBAIJ stiffness matrix is listed in Algoritm 1.
After the matrix has been created, it is time to insert values.When implemented with PETSc, each process loops the elements in its local domain, computes the element stiffness matrices, and assembles them into global matrix without regard to which process eventually stores them.This can be done in two ways with PETSc, by either inserting a single value or inserting an array of values.In order to accumulate element stiffness matrices into global matrix, we can use the below subroutine to insert or add a dense subblock of dimension  ×  into the stiffness matrix: where F and nodal displacement.PETSc currently provides two basic vector types: sequential vector and parallel (MPI based) vector.The created vector is distributed over all processes.Any process can set any components of the vector and PETSc insures that they are automatically stored in the appropriate locations.
The Fortran code to create parallel matrix and vector to store stiffness matrix and nodal vectors is listed in Algorithm 2.
Note that the valuation of the element stiffness and nodal load is ignored in the above code for simplicity.

Solution of Assembled System
After the final assembly of the stiffness matrix and nodal load vector, the system is now ready to be solved.PETSc provides easy and efficient access to all of the package's linear system solvers with the object KSP, that is, the heart of PETSc.We here combine CG methods and preconditioners to solve the linear system (1) from finite element discretization.Because KSP provides a simplified interface to the lower-level KSP and PC modules within the PETSc package, we can easily implement this preconditioned CG solver.
The first step to solve a linear system with KSP is to create a solver context with the command where  is the MPI communicator and  is the new solver context.Before solving a linear system with KSP, we must call the following routine to make the matrices associated with the linear system: where the matrix  defines the linear system and  represents the matrix from which the preconditioner is to be constructed.It can be the same as the matrix that defines the linear system.The argument  indicates information about the structure of preconditioner matrix during successive solutions.
To solve a linear system, we set the right-hand side and solution vectors by calling the routine KSPSolve (KSP , Vec , Vec ) , where  and , respectively, denote the rhs and solution vectors.
When solving by Krylov subspace methods with PETSc, a number of options are needed to set.First of all, we need set the Krylov subspace method to be used by calling the command KSPSetType (KSP , KspType ℎ) .
Due to the slow convergence of Krylov subspace methods for the linear system arising from practical elasticity applications, preconditioning is usually combined to accelerate the convergence rate of the methods.To employ a particular preconditioning method, we can set the method with the subroutine PCSetType (PC , PCType ℎ) . ( Each preconditioner may have a number of options to be set.We can set them with different routines [11].During solution of preconditioned Krylov method, the default convergence test is based on the  2 -norm of the residual.Convergence is decided by three values: the relative decrease of the residual norm to that of the right-hand side, , the absolute value of the residual norm, , and the relative increase of the residual, .These parameters and the maximum number of iterations can be set with the command KSPSetTolerances (KSP , double , double , double , int ) . ( Since the linear system derived from finite element discretization of elasticity problems is sparse and symmetric positive definite (SPD), the conjugate gradient (CG) algorithm is chosen here to solve it.For the conjugate gradient method with complex numbers, there are two slightly different algorithms subject to whether the matrix is Hermitian symmetric or truly symmetric.The default option is Hermitian symmetric.Because the solution of finite element equations uses symmetric version, we need indicate that it is symmetric with the command KSPCGSetType (KSP , The Fortran code to create the KSP context and perform the solution is as in Algorithm 3.
In this portion of code, the KSP method being used is the CG with JACOBI preconditioner.The convergence tolerance is set to 1.0 − 7.

Test Platform and Examples.
Our numerical experiments were conducted on a platform composed of 4 Intel Core i5-2450M CPUs @ 2.50 GHz and 4 GB RAM.The operation system is 64-bit CentOS 6.In this section we use a beam problem in bending to validate the PETSc-based finite element code and measure the performance of the linear system solver with different preconditioners.It is known that the number of iterations for convergent solutions to finite element equations from elasticity problems often differs, especially for the cases with different materials.This is because stiffness matrices are very singular when materials are very different.Figure 2 shows this cantilever problem.The size of the beam is 1 m × 1 m × 5 m.The beam is fixed at its left end and loaded by uniform pressure on its top surface.There are two cases with different material composition to be tested.One is of single uniform material and the other is of bimaterial with a strip of much lower Young's modulus than other portions.The base material's Young's modulus and Poisson ratio are of 2.08 and 0.3.The strip material's Young's modulus and Poisson ratio are of 2.03 and 0.3.Two three-dimensional hexahedral meshes with different numbers of elements have been generated and used in computation.The fine mesh has 10000 elements and 12221 nodes, and the coarse one has 80000 elements and 88641 nodes.Second, the parallel performance of the parallel finite element code has been measured.Figures 3 and 4 show parallel performance of the finite element linear system solution stage on coarse and fine meshes, respectively.As shown, the solution stage scales unsatisfactorily on the multicore computer for the communication bottleneck of the computing platform.The AMG preconditioner does not work better than the Jacobi one.This is because there are many options in AMG to tune for optimal performance [12].

Conclusions
We have integrated PETSc into the parallel finite element method for elasticity problems and developed the parallel code.Because PETSc includes libraries of numerical methods that can be applied directly to applications, the process of porting PETSc into the existing application codes becomes easier than developing with low-level parallel interfaces.In this work, we implement the main steps, formation and solution, of the parallel finite element computation of elasticity problems with PETSc subroutines.In the formation stage, memory preallocation is conducted before computation to enhance the performance by avoiding the memory dynamic allocation and copying.For the solution, preconditioned CG method is used as a solver to the linear system derived from finite element discretization.PETSc provides various preconditioners for this solver.Numerical tests show that the formation stage can achieve good performance for its high parallelism and the solution stage scales not very well on the multicore computer for the communication problem.
In future work, this code will be migrated to distributed memory parallel systems and applied to practical problems.Furthermore, more preconditioners will be implemented with PETSc in the code to provide users with more options.

Figure 2 :
Figure 2: The sketch of the testing cantilever.

Figure 3 :Figure 4 :
Figure 3: Speed-up of PCG solutions on coarse grid.
, , and  specify the number of local rows and number of global rows and columns,  is the number of columns corresponding to a local parallel vector,   and   are the number of diagonal and off-diagonal nonzeros per row, and   and   are optional arrays of nonzeros per row in the diagonal and off-diagonal portions of local matrix.Because each node has multiple degrees of freedom in the finite element discretization of elasticity problems, we also can create a sparse parallel matrix in block AIJ format (block compressed row) by the command MatCreateMPIBAIJ (MPI Comm , PetscInt , PetscInt , PetscInt , PetscInt , PetscInt , PetscInt  , const PetscInt   [] , PetscInt  , const PetscInt   [] , Mat * ) , is a logically two-dimensional array of values,  and  are the number of rows and their global indices,  and  are the number of columns and their global indices, and V is the operation of either ADD VALUES or INSERT VALUES, where ADD VALUES means adding values to any existing entries and INSERT VALUES means replacing existing entries by new values.For stiffness matrix assembly, the contributions from related elements are accumulated into global entities and ADD VALUES is used.Also, there are similar procedures to create vectors and insert values into those vectors to store global nodal load

Table 1 :
Serial performance of the PCG (coarse mesh).

Table 2 :
Serial performance of the PCG (fine mesh).
one, and Jacobi one consumes the shortest running time.This is because the AMG and SOR need more preconditioning operations in each CG iteration.Though the number of CG iterations reduces, the overall running time increases.