Efficient CUDA Polynomial Preconditioned Conjugate Gradient Solver for Finite Element Computation of Elasticity Problems

Graphics processing units (GPUs) have obtained great success in scientific computations for their tremendous computational horsepower and very high memory bandwidth. This paper discusses an efficient way to implement a polynomial preconditioned conjugate gradient solver for the finite element computation of elasticity on NVIDIA GPUs using compute unified device architecture (CUDA). The sliced block ELLPACK (SBELL) format is introduced to store the sparse matrix arising from finite element discretization of elasticity with fewer padding zeros than traditional ELLPACK-based formats. Polynomial preconditioning methods have been investigated both in convergence and running time. From the overall performance, the least-squares (L-S) polynomial method is chosen as the preconditioner in the PCG solver for finite element equations derived from elasticity, for its best results on different example meshes. In the PCG solver, a mixed precision algorithm is used not only to reduce the overall computational, storage, and bandwidth requirements but also to make full use of the capacity of the GPU devices. With the SBELL format and the mixed precision algorithm, the GPU-based L-S preconditioned CG achieves a speedup of about 7–9 over the CPU implementation.


Introduction
Recently, the graphics processing unit (GPU) has evolved from a fixed-function special-purpose processor into a highly parallel, multithreaded, many-core processor with tremendous computational horsepower and very high memory bandwidth. This makes GPUs ideal processors for accelerating data parallel applications, including graphics and nongraphics problems. However, programming GPUs for general computation was a great challenge in the past. Early efforts to exploit the GPU for nongraphical applications used graphics APIs and their shading languages [1,2], such as DirectX and OpenGL, to port various data parallel algorithms to the GPUs, which limited accessibility to the tremendous capability of GPUs for developers from various fields. The advent of compute unified device architecture (CUDA) makes NVIDIA GPUs fully programmable and greatly facilitates general-purpose applications targeted at GPUs. So far, these applications have ranged from fluid dynamics [3] and molecular dynamics [4] to biomechanics [5], surgical simulation [6], and earthquake modeling [7].
Elasticity is a general problem in solid mechanics. It is therefore fundamental to the practice of mechanical, civil, structural, and aeronautical engineering and also directly relevant to other branches of engineering and applied science. The finite element method is one of the important numerical techniques for solving elasticity problems. As the finite element discretization of elasticity results in a sparse, symmetric positive definite linear system of equations, preconditioned conjugate gradient (PCG) has become an important iterative solver. To solve large-scale problems, many parallel PCG algorithms and programs have been developed on multi-CPU parallel computers [8-10] and GPUs [11,12]. However, some commonly used preconditioners for sequential computers, for example, ILU and SSOR [13], have limited parallelism. Therefore, alternative techniques need to be developed that specifically target parallel environments. Polynomial preconditioners [13,14] are simple and effective methods for efficient parallel iterative solvers. Only matrix-vector products are required to carry out this preconditioning.
In this paper, we first focus on the efficient sparse matrix-vector product (SpMV), which is the major component of the CG iteration and polynomial preconditioning. We propose a sparse representation called sliced block ELLPACK (SBELL), which is a GPU-friendly variant of the ELLPACK (ITPACK) storage format [15] and is specially designed for the iterative solution of finite element equations arising from elasticity. Based on this new format, an efficient CUDA SpMV kernel is implemented on NVIDIA GPUs. Then, an extensive performance evaluation of this new approach is carried out on a representative set of test matrices. Secondly, this SpMV kernel is assembled into a polynomial preconditioned CG solver. Several polynomial preconditioners are implemented and evaluated. To increase the performance of the PCG kernel, a mixed precision technique is exploited, that is, low precision operations for the inner preconditioning and high precision for the outer CG iteration.
The remainder of this paper is organized as follows. Section 2 reviews the aspects related to GPU computing, FEM, and PCG. Section 3 introduces the proposed format for computation of SpMV on GPUs. Section 4 presents the mixed precision polynomial preconditioned CG. Section 5 presents the performance measured on an NVIDIA GeForce GT430 with a set of representative sparse matrices belonging to diverse finite element equations derived from elasticity problems. Finally, Section 6 summarizes the main conclusions.

Background
2.1. GPU Computing. GPU computing is a new way of GPU programming. It signifies broader application support, wider programming language support, and a clear separation from the early "GPGPU" model of programming. This technique emerged with the advent of the NVIDIA G80 unified graphics and compute architecture and CUDA [16]. Typical NVIDIA Fermi-architecture GPUs are based on a scalable array of graphics processing clusters (GPCs), streaming multiprocessors (SMs), and memory controllers. Each SM is a highly parallel multiprocessor supporting up to 48 resident warps at any given time. Different products are launched with different configurations of GPCs, SMs, and memory controllers to address different applications.
Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model developed by NVIDIA. It provides a software environment that allows developers to code algorithms for execution on the GPUs using C and other programming languages. The CUDA programming model [17] assumes that the CUDA threads execute on a physically separate device that operates as a coprocessor to the host running the C program. This means the kernels execute on a GPU and the rest of the C program executes on a CPU. The CUDA programming model also assumes that both the host and the device maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively. Programs manage the global, constant, and texture memory spaces visible to kernels through calls to the CUDA runtime, which includes device memory allocation and deallocation as well as data transfer between host and device memory.
The core of the CUDA parallel programming model is three key abstractions, a hierarchy of thread groups, shared memories, and barrier synchronization, that are simply exposed to the programmer as a minimal set of language extensions. These abstractions guide the programmer to partition the problem into coarse subproblems that can be solved independently in parallel by blocks of threads, and each subproblem into finer pieces that can be solved cooperatively in parallel by all threads within the block.
A thread block is a set of concurrently executing threads that can cooperate through barrier synchronization and shared memory. These threads are organized into a one-dimensional, two-dimensional, or three-dimensional block, and each of them has a thread ID within its thread block. Blocks are similarly organized into a one-dimensional, two-dimensional, or three-dimensional grid, and each of them has a block ID within its grid. Thread blocks execute independently in any order, in parallel or in series. This independence allows thread blocks to be scheduled in any order across any number of cores, enabling programmers to write code that scales with the number of cores. Threads within a block can communicate with each other by sharing data through shared memory and by synchronizing their execution to coordinate memory accesses.
CUDA threads may access data from multiple memory spaces during their execution. Each thread has private local memory. Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block. All threads have access to the same global memory. There are also two additional read-only memory spaces accessible by all threads, the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages and are persistent across kernel launches by the same application.

Finite Element Method for Elasticity.
The displacement-based finite element method introduces an approximation for the displacement field in terms of shape functions and uses a weak formulation of the equations of equilibrium, strain-displacement relations, and constitutive relation to arrive at the linear system

$$\mathbf{K}\mathbf{u} = \mathbf{f}, \tag{1}$$

where $\mathbf{u}$ is the vector of unknown nodal displacements, $\mathbf{f}$ is the vector of nodal forces, and $\mathbf{K}$ is the structure stiffness matrix, given by

$$\mathbf{K} = \int_{\Omega} \mathbf{B}^{T}\mathbf{D}\mathbf{B}\, d\Omega, \tag{2}$$

with integration over the problem domain $\Omega$. For simplicity, we consider a 2D plane stress problem discretized with elements having nodes without rotational freedoms and linear elastic material behavior. Then, the components of the right-hand side of (2) are

$$\mathbf{B} = \begin{bmatrix} \dfrac{\partial N_1}{\partial x} & 0 & \cdots & \dfrac{\partial N_n}{\partial x} & 0 \\ 0 & \dfrac{\partial N_1}{\partial y} & \cdots & 0 & \dfrac{\partial N_n}{\partial y} \\ \dfrac{\partial N_1}{\partial y} & \dfrac{\partial N_1}{\partial x} & \cdots & \dfrac{\partial N_n}{\partial y} & \dfrac{\partial N_n}{\partial x} \end{bmatrix},$$

where $N_1, N_2, \ldots, N_n$ are shape functions associated with the $n$ nodes in the mesh, and the elasticity matrix

$$\mathbf{D} = \frac{E}{1-\nu^{2}} \begin{bmatrix} 1 & \nu & 0 \\ \nu & 1 & 0 \\ 0 & 0 & \dfrac{1-\nu}{2} \end{bmatrix},$$

where $E$ is Young's modulus and $\nu$ is Poisson's ratio [18]. The structure stiffness matrix in (2) relates to the whole problem domain and is of dimension $N \times N$, where $N$ is the total number of degrees of freedom (DOF) ($N = 2 \times n$ for our 2D problem). For linear elasticity, the structure stiffness matrix is sparse and symmetric positive definite (SPD). The sparse nature of the finite element matrix is a consequence of the local support of the basis functions. The symmetry is present because the Galerkin procedure by which the weak form is generated is self-adjoint; that is, the basis and test functions are the same. The Galerkin method is equivalent to minimization of potential energy, and the nonnegativity of the strain energy in the domain leads to positive definiteness of the structure stiffness matrix. In addition, since each node has multiple DOFs, the structure stiffness matrix can be organized into a blocked matrix if the DOFs associated with each node are numbered consecutively. For 2D and 3D problems, the block size is 2 and 3, respectively.
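As a small illustration of the plane stress elasticity matrix and the DOF count described above, the following Python sketch builds the matrix for an isotropic material. The function names and the material values are assumptions chosen only for illustration; they do not come from the paper's implementation.

```python
def plane_stress_D(E, nu):
    """Elasticity matrix for 2D plane stress, isotropic linear elastic material."""
    c = E / (1.0 - nu * nu)
    return [
        [c, c * nu, 0.0],
        [c * nu, c, 0.0],
        [0.0, 0.0, c * (1.0 - nu) / 2.0],
    ]


def num_dofs_2d(num_nodes):
    # Two displacement DOFs (x and y) per node, no rotational freedoms.
    return 2 * num_nodes


# Hypothetical, nondimensional material values used only for this example.
D = plane_stress_D(E=1.0, nu=0.25)
# The shear entry of D should equal the shear modulus E / (2 (1 + nu)).
shear_modulus = 1.0 / (2.0 * (1.0 + 0.25))
```

The 2 × 2 nodal blocks mentioned in the text follow directly from `num_dofs_2d`: consecutive numbering places the x and y DOFs of a node in adjacent rows and columns.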

Preconditioned Conjugate Gradient.
The conjugate gradient algorithm is one of the best known iterative techniques for solving sparse symmetric positive definite linear systems. As an iterative solver, lack of robustness is a widely recognized weakness of CG. This drawback can be mitigated by using preconditioning.
Consider a linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$, where $\mathbf{A}$ is symmetric and positive definite. It is assumed that a preconditioner $\mathbf{M}$ is available and is also symmetric positive definite. A mathematically equivalent preconditioned system is obtained as follows:

$$\mathbf{M}^{-1}\mathbf{A}\mathbf{x} = \mathbf{M}^{-1}\mathbf{b}.$$

Replacing the usual Euclidean inner product in the conjugate gradient algorithm by the $\mathbf{M}$-inner product, the PCG algorithm (Algorithm 1) is obtained [13].
In general, the reliability of iterative techniques, when dealing with various applications, depends much more on the quality of the preconditioner than on the particular iterative solver used. Incomplete LU factorization (ILU) based preconditioners are effective in a single-processor computational framework. However, in a parallel framework their use is limited, as they incur high parallel communication overhead similar to that of direct solvers. Jacobi preconditioners are commonly used in parallel formulations, but they are usually not efficient since they require many iterations to converge. To enhance performance, these preconditioners can themselves be accelerated by polynomial iterations, that is, polynomial preconditioning. The main advantages of polynomial preconditioning are its simplicity and flexibility. Only matrix-vector products are required to carry out this preconditioning.
In polynomial preconditioning the matrix $\mathbf{M}$ is defined by

$$\mathbf{M}^{-1} = P_m(\mathbf{A}),$$

where $P_m(\mathbf{A})$ is a polynomial in $\mathbf{A}$ with a degree of no more than $m$. Thus, the polynomial preconditioned system is given by

$$P_m(\mathbf{A})\mathbf{A}\mathbf{x} = P_m(\mathbf{A})\mathbf{b}.$$

The most commonly used polynomial preconditioners for SPD linear systems are constructed using Neumann series, least-squares, and minimax polynomials. Among these, the Neumann preconditioner is the simplest, the cheapest (i.e., it has the least construction cost), and the most stable method. It can be constructed as

$$\mathbf{M}^{-1} = \omega\left(\mathbf{I} + \mathbf{G} + \mathbf{G}^{2} + \cdots + \mathbf{G}^{m}\right),$$

where $\mathbf{G} = \mathbf{I} - \omega\mathbf{A}$ and the scalar $\omega$ is adjusted so that the spectral radius satisfies $\rho(\mathbf{G}) < 1$.
The least-squares and minimax polynomial preconditioners are the two optimal methods, each of which is derived from a specific optimal approximation. See [13] for full details of these polynomial preconditioners.
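The standard PCG iteration from the literature can be sketched in a few lines. The version below is a minimal, hedged illustration in plain Python with dense lists; the function names, the Jacobi preconditioner standing in for the preconditioner, and the small test matrix are all assumptions made for the example rather than the paper's implementation.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, apply_minv, tol=1e-7, max_iter=1000):
    """Preconditioned CG; apply_minv(r) returns an approximation of M^{-1} r."""
    x = [0.0] * len(b)
    r = b[:]                       # r0 = b - A x0 with x0 = 0
    z = apply_minv(r)
    p = z[:]
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for it in range(max_iter):
        # Stop on the relative residual norm.
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        Ap = matvec(A, p)
        alpha = rz / dot(Ap, p)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = apply_minv(r)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Small SPD example (hypothetical data).
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]


def jacobi(r):
    # Diagonal (Jacobi) preconditioner: z_i = r_i / a_ii.
    return [ri / A[i][i] for i, ri in enumerate(r)]


x, iters = pcg(A, b, jacobi)
```

Any other preconditioner, including a polynomial one, can be passed in place of `jacobi` as long as it maps a residual vector to a preconditioned vector.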
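To make concrete why only matrix-vector products are needed, the Neumann-series preconditioner can be applied with a Horner-style recurrence. This is a sketch under stated assumptions: the names `neumann_apply`, `omega`, and `m` are illustrative, and the scaling uses the largest absolute row sum as a simple upper bound on the largest eigenvalue, which is a choice made here and not necessarily the one used in the paper.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def neumann_apply(A, r, omega, m):
    """Return z = omega * (I + G + ... + G^m) r with G = I - omega * A."""
    z = r[:]
    for _ in range(m):
        Az = matvec(A, z)
        # Horner step: z <- r + G z = r + z - omega * (A z)
        z = [ri + zi - omega * azi for ri, zi, azi in zip(r, z, Az)]
    return [omega * zi for zi in z]


# Small SPD example (hypothetical data).
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
r = [1.0, 2.0, 3.0]
# The largest absolute row sum (5 here) bounds the largest eigenvalue,
# so the eigenvalues of G = I - omega * A lie in [0, 1).
omega = 1.0 / 5.0

z_low = neumann_apply(A, r, omega, 2)     # cheap, rough approximation of A^{-1} r
z_high = neumann_apply(A, r, omega, 60)   # many terms, close to A^{-1} r
```

Each application of a degree-`m` polynomial costs `m` matrix-vector products, which is the trade-off between fewer CG iterations and more work per iteration discussed later in the paper.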

SpMV in the Form of Sliced Block ELLPACK
As we can see from Algorithm 1, the main computational components in PCG are the matrix-vector product, the preconditioning operation, and vector operations. Since polynomial preconditioning is actually a series of matrix-vector products, the efficiency of a polynomial PCG solver on NVIDIA GPUs for finite element equations relies greatly on the CUDA SpMV kernel. In this section, we consider the operation $\mathbf{y} = \mathbf{A}\mathbf{x}$, where $\mathbf{A}$ is a sparse matrix and $\mathbf{x}$ and $\mathbf{y}$ are column vectors.
Because the stiffness matrix arising from finite element discretization of elasticity on an unstructured mesh is a general sparse SPD matrix, the CSR format [13] is commonly used to store these matrices. This data structure takes three arrays to represent the sparse matrix and permits a variable number of nonzeros per row. For an $N \times N$ matrix, the column indices and nonzero entries are stored in the arrays JA and A of dimension $N_{nz}$. Array IA of dimension $N + 1$ stores the pointers to the beginning of every row in A and JA, both sorted by row index. The last entry in IA stores $N_{nz}$, the number of nonzeros in the matrix. For the CSR format, a straightforward CUDA implementation is the so-called scalar kernel, which uses one thread per matrix row. The most significant problem of this kernel is that the threads in a warp access A and JA noncontiguously, which leads to noncoalesced global memory access. An alternative to the scalar method, which we call the vector kernel, assigns one warp to each matrix row. The vector kernel accesses indices and data contiguously and therefore overcomes the principal deficiency of the scalar approach. However, efficient execution of the vector kernel demands that matrix rows contain a number of nonzeros greater than the warp size. That means the performance of the vector kernel is sensitive to matrix row size. Another drawback of both scalar and vector kernels is that the locality of access to vector $\mathbf{x}$ is not maintained due to the indirect addressing.
ELLPACK [15] was introduced to suit vector computers. This format consists of two dense arrays: array A to save the entries and array JA to save the column index of every entry. Both arrays are of dimension $N \times K$, where $N$ is the number of rows and $K$ is the maximum number of nonzeros per row in the matrix. Figure 1 (middle) illustrates the ELLPACK format of a matrix, where $N = 12$ and $K = 6$. Note that the size of all rows in these compressed arrays A and JA is the same, because every row with fewer than $K$ nonzeros is padded with zeros. Therefore, ELLPACK can be considered as an approach to fit a sparse matrix in a regular data structure similar to a dense matrix. Consequently, this format is appropriate for computing operations with sparse matrices on vector architectures.
Parallelizing SpMV for the ELLPACK format is straightforward: one thread is assigned to each row of the matrix, and each thread computes the sparse dot product between the corresponding matrix row and the vector $\mathbf{x}$. As shown in Algorithm 2, if element $i$ of vector $\mathbf{y}$ is computed by a thread identified by index $i$ and the arrays store their elements in column-major order, this thread accesses the elements A[$i + k \times N$] and JA[$i + k \times N$] with $0 \le k < K$, where $k$ is the column index into the data structures A and JA. Thanks to the column-major ordering used to store the matrix elements in the data structures, two threads $i$ and $i + 1$ access consecutive memory addresses; thereby the conditions of coalesced global memory access can be fulfilled. We can also see from the ELLPACK kernel that every block of threads can complete its computation without synchronization with other blocks. This is because there are no data dependencies in the computation of different elements of $\mathbf{y}$, as every thread computes one element of the vector $\mathbf{y}$. These two merits, coalesced global memory access and nonsynchronized execution, improve the performance of SpMV kernels based on ELLPACK.
However, this good performance of ELLPACK only applies when the maximum number of nonzeros per row does not differ substantially from the average. In practice, unstructured meshes do not always meet this requirement. In this case, the percentage of padding zeros in the ELLPACK data structure is high. This penalty inevitably results in redundant memory accesses and arithmetic operations on padding zeros. As described in Algorithm 2, when computing every $y[i]$, with $0 \le i < N$, the $k$-loop must iterate until $k = K$ is reached for every row. Clearly, the ELLPACK format alone is an appropriate choice mainly for representing matrices obtained from structured meshes.
To relieve the inherent drawback of ELLPACK, two variants have been proposed. One is the ELLPACK-R format [19], which consists of the two arrays A and JA of dimension $N \times K$ and an additional integer array called rl of dimension $N$ (i.e., the number of rows) that stores the actual length of every row, regardless of the number of padded zero elements. The SpMV based on ELLPACK-R further improves the performance reached by ELLPACK on GPUs, because the array rl reduces the useless computation and the imbalance between the threads in one warp. The problem of ELLPACK-R is that too much extra storage is still required for padding zeros for matrices obtained from unstructured meshes.
Another variant is the sliced ELLPACK (SELL) format [20], which first permutes the rows of $\mathbf{A}$ in ascending or descending order of row length and then partitions this reordered $\mathbf{A}$ into slices of $S$ rows, every slice being stored in ELLPACK format. Thus, this format can largely reduce the padding zeros, as can be seen by comparing Figure 1 (middle) with Figure 1 (right), where $S = 4$. The SpMV kernel in this format is described by Algorithm 3, where $S$ threads are assigned to every slice of rows and collaborate in the computation. We suppose $S$ is a multiple of the warp size to enhance coalesced memory access. The array nnz denotes the beginning index of each slice in A and JA. As we can see in this algorithm, the $k$-loop runs only up to the maximum row length within the slice. The runtime of every slice $s$ is then proportional to the maximum number of nonzeros per row $K[s]$ in that slice, and the loop for the threads of a warp need not reach the global maximum $K$ for all rows. Consequently, the useless iterations are reduced compared with SpMV based on ELLPACK.
For matrices with a natural block structure, blocked formats, for example, blocked CSR (BCSR) [21] and blocked ELLPACK (BELLPACK) [22], have been proposed for SpMV. These proposals compress the sparse matrix into small dense blocks of entries of size $r \times c$. Figure 2 illustrates the storage of a blocked matrix in the BELLPACK format, where the block size is 2 × 2. Since only one column index needs to be stored per block, the column index storage and transfer can be reduced by a factor of up to roughly 2 × 2. Therefore, the blocked approaches in principle reach better performance than the corresponding nonblocked versions. However, BELLPACK still suffers from the same problem as ELLPACK when the number of nonzeros varies across rows.
To fit the finite element stiffness matrix derived from elasticity, with its unstructured and blocked sparsity, we here combine BELLPACK and sliced ELLPACK and propose a sliced block ELLPACK (SBELL) format. Firstly, the sparse blocked matrix is permuted by the number of nonzero blocks of every blocked row in ascending or descending order. Then this permuted matrix is sliced into groups of blocked rows, and BELLPACK is used to store the block entries in each slice. Figure 3 illustrates the SBELL representation of the matrix in Figure 3 (left), where there are 6 blocked rows and the slice size $S$ is set to 2. Thanks to permutation and slicing, zero-padding blocks and the related useless arithmetic operations are greatly reduced.
For a blocked matrix in SBELL with 2 × 2 blocks and $N_b$ blocked rows, the entries in each slice of size $S$, as shown in Figure 4, are arranged in global memory as in Figure 5. When $S = 1$ and $S = N_b$, SBELL collapses into BCSR and BELLPACK, respectively. As the entries in each slice are stored in column-major order, the threads assigned to these slices can access the entries in every block contiguously.
The SpMV kernel is given by Algorithm 4, where $S$ threads are assigned to every slice of $S$ blocked rows. As in the kernel for the SELL format, the array nnzb denotes the beginning block index of each slice in BA and BJA, and $K_b[s]$ gives the maximum number of nonzero blocks per blocked row in slice $s$. In this algorithm, every thread performs the dot product between a blocked row and the vector $\mathbf{x}$ to compute a block of the vector $\mathbf{y}$ without synchronization, since there are no data dependencies in the computation of different blocks. Moreover, the threads in each thread set access consecutive memory addresses of BA and BJA; thus the conditions of coalesced global memory access can also be met.
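The CSR layout and the work done by one thread of the scalar kernel can be illustrated with a sequential Python sketch. The outer loop index plays the role of the thread index; the small matrix is hypothetical example data, and the function name is an assumption.

```python
def csr_spmv(IA, JA, A, x):
    """y = A x for a matrix in CSR form (row pointers IA, columns JA, values A)."""
    n = len(IA) - 1
    y = [0.0] * n
    for i in range(n):                     # scalar kernel: one thread per row i
        acc = 0.0
        for k in range(IA[i], IA[i + 1]):  # entries of row i are contiguous
            acc += A[k] * x[JA[k]]         # indirect access into x
        y[i] = acc
    return y


# 4 x 4 example matrix:
# [4 1 0 0]
# [1 3 0 1]
# [0 0 2 0]
# [0 1 0 5]
IA = [0, 2, 5, 6, 8]                       # last entry equals the number of nonzeros
JA = [0, 1, 0, 1, 3, 2, 1, 3]
A = [4.0, 1.0, 1.0, 3.0, 1.0, 2.0, 1.0, 5.0]
x = [1.0, 2.0, 3.0, 4.0]
y = csr_spmv(IA, JA, A, x)
```

Because neighbouring rows start at unrelated offsets `IA[i]`, neighbouring threads of the scalar kernel read addresses that are far apart, which is the noncoalesced access pattern described above.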
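The column-major ELLPACK indexing described above can be sketched sequentially in Python. The helper names `to_ellpack` and `ell_spmv` and the example matrix are assumptions for illustration; padded slots hold a zero value and column index 0, which is one common convention and not necessarily the paper's.

```python
def to_ellpack(dense):
    """Convert a dense row list to column-major ELLPACK arrays (A, JA, n, K)."""
    n = len(dense)
    rows = [[(j, v) for j, v in enumerate(row) if v != 0.0] for row in dense]
    K = max(len(r) for r in rows)          # maximum nonzeros per row
    A = [0.0] * (n * K)                    # padded slots keep value 0.0
    JA = [0] * (n * K)                     # padded slots point at column 0
    for i, r in enumerate(rows):
        for k, (j, v) in enumerate(r):
            A[i + k * n] = v               # column-major: row index varies fastest
            JA[i + k * n] = j
    return A, JA, n, K


def ell_spmv(A, JA, n, K, x):
    y = [0.0] * n
    for i in range(n):                     # one GPU thread per row i
        acc = 0.0
        for k in range(K):                 # every row runs the full K iterations
            acc += A[i + k * n] * x[JA[i + k * n]]
        y[i] = acc
    return y


dense = [
    [4.0, 1.0, 0.0, 0.0],
    [1.0, 3.0, 0.0, 1.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 5.0],
]
A, JA, n, K = to_ellpack(dense)
x = [1.0, 2.0, 3.0, 4.0]
y = ell_spmv(A, JA, n, K, x)
```

For a fixed `k`, rows `i` and `i + 1` read `A[i + k * n]` and `A[i + 1 + k * n]`, which are adjacent in memory; this is the property that gives coalesced access on the GPU.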
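A sketch of the ELLPACK-R idea follows: the storage is identical to ELLPACK, but the inner loop stops at the true row length held in `rl`. The arrays below are hypothetical example data for a 4 × 4 matrix stored column-major with three slots per row.

```python
def ellr_spmv(A, JA, rl, n, x):
    y = [0.0] * n
    for i in range(n):                     # one GPU thread per row i
        acc = 0.0
        for k in range(rl[i]):             # stop at the true row length
            acc += A[i + k * n] * x[JA[i + k * n]]
        y[i] = acc
    return y


n = 4
# Column-major ELLPACK arrays: slot 0 of all rows, then slot 1, then slot 2.
A = [4.0, 1.0, 2.0, 1.0,   1.0, 3.0, 0.0, 5.0,   0.0, 1.0, 0.0, 0.0]
JA = [0, 0, 2, 1,   1, 1, 0, 3,   0, 3, 0, 0]
rl = [2, 3, 1, 2]                          # actual number of nonzeros in each row
x = [1.0, 2.0, 3.0, 4.0]
y = ellr_spmv(A, JA, rl, n, x)

useful_entries = sum(rl)                   # entries that carry matrix data
stored_entries = len(A)                    # entries held in memory, padding included
```

In this example 8 of the 12 stored entries are useful: the arithmetic on padding is skipped, but the padding still occupies memory, which is the remaining weakness noted in the text.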
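The sliced ELLPACK construction and its SpMV loop can be sketched as follows. This is an illustrative Python version, not a reproduction of Algorithm 3: the names `start` (beginning index of each slice) and `width` (maximum row length in each slice) are assumptions, rows are sorted in descending order of length, and a slice size of 2 is used only to keep the example small.

```python
def to_sell(dense, S):
    """Sort rows by length, cut into slices of S rows, store each slice as ELLPACK."""
    n = len(dense)
    rows = [[(j, v) for j, v in enumerate(r) if v != 0.0] for r in dense]
    perm = sorted(range(n), key=lambda i: -len(rows[i]))   # longest rows first
    A, JA, start, width = [], [], [], []
    for s0 in range(0, n, S):
        members = perm[s0:s0 + S]
        w = max(len(rows[i]) for i in members)
        start.append(len(A))
        width.append(w)
        for k in range(w):                 # column-major inside the slice
            for t in range(S):
                if t < len(members) and k < len(rows[members[t]]):
                    j, v = rows[members[t]][k]
                else:
                    j, v = 0, 0.0          # padding slot
                A.append(v)
                JA.append(j)
    return A, JA, start, width, perm


def sell_spmv(A, JA, start, width, perm, S, x):
    n = len(perm)
    y = [0.0] * n
    for s in range(len(start)):            # one group of S threads per slice
        for t in range(S):                 # thread t of the slice
            r = s * S + t
            if r >= n:
                continue
            acc = 0.0
            for k in range(width[s]):      # loop bound is local to the slice
                idx = start[s] + k * S + t
                acc += A[idx] * x[JA[idx]]
            y[perm[r]] = acc               # undo the row permutation
    return y


dense = [
    [4.0, 1.0, 0.0, 0.0],
    [1.0, 3.0, 0.0, 1.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 5.0],
]
S = 2
A, JA, start, width, perm = to_sell(dense, S)
x = [1.0, 2.0, 3.0, 4.0]
y = sell_spmv(A, JA, start, width, perm, S, x)
```

For this matrix the sliced layout stores 10 entries, compared with 12 for plain ELLPACK, because the second slice only needs a width of 2.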
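The combination of slicing and 2 × 2 blocking can be sketched in Python as below. This is a hedged illustration rather than a transcription of Algorithm 4 or Figure 5: the in-slice memory layout (for each block slot, the four block entries are stored one after another, each as a run of `S` values, one per thread) is an assumption chosen so that consecutive threads read consecutive addresses, and all names are hypothetical.

```python
B = 2  # block size for 2D elasticity (two DOFs per node)


def to_sbell(block_rows, S):
    """block_rows[i] is a list of (block_col, [a00, a01, a10, a11]) for blocked row i."""
    nb = len(block_rows)
    perm = sorted(range(nb), key=lambda i: -len(block_rows[i]))
    BA, BJA, start, width = [], [], [], []
    for s0 in range(0, nb, S):
        members = perm[s0:s0 + S]
        w = max(len(block_rows[i]) for i in members)
        start.append(len(BJA))             # beginning block index of the slice
        width.append(w)                    # max nonzero blocks per row in the slice
        for k in range(w):
            blocks = []
            for t in range(S):
                if t < len(members) and k < len(block_rows[members[t]]):
                    blocks.append(block_rows[members[t]][k])
                else:
                    blocks.append((0, [0.0] * (B * B)))   # padding block
            BJA.extend(j for j, _ in blocks)              # one column index per block
            for e in range(B * B):
                BA.extend(vals[e] for _, vals in blocks)  # entry e for all S threads
    return BA, BJA, start, width, perm


def sbell_spmv(BA, BJA, start, width, perm, S, x):
    nb = len(perm)
    y = [0.0] * (nb * B)
    for s in range(len(start)):
        for t in range(S):                 # thread t handles one blocked row
            r = s * S + t
            if r >= nb:
                continue
            acc = [0.0] * B
            for k in range(width[s]):
                slot = start[s] + k * S    # first block of slot k in this slice
                j = BJA[slot + t]
                for e in range(B * B):
                    a = BA[slot * B * B + e * S + t]
                    acc[e // B] += a * x[j * B + e % B]
            row = perm[r]
            y[row * B:(row + 1) * B] = acc
    return y


Dg = [4.0, 1.0, 1.0, 3.0]                  # hypothetical diagonal block
Id = [1.0, 0.0, 0.0, 1.0]                  # hypothetical coupling block
block_rows = [
    [(0, Dg), (1, Id)],
    [(0, Id), (1, Dg), (2, Id)],
    [(1, Id), (2, Dg), (3, Id)],
    [(2, Id), (3, Dg)],
]
S = 2
BA, BJA, start, width, perm = to_sbell(block_rows, S)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = sbell_spmv(BA, BJA, start, width, perm, S, x)
```

Here 10 blocks and 10 column indices are stored, whereas unsliced block ELLPACK would need 12 blocks for the same matrix and a nonblocked format would need one column index per scalar entry.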

Mixed Precision Polynomial Preconditioned Conjugate Gradient
The purpose of the mixed precision algorithm is to reduce the overall computational and storage requirements by introducing low precision arithmetic. For example, a single precision number takes only half the storage of a double precision one. Thus, storage and bandwidth requirements are halved. The computational demands are somewhat more hardware dependent. Most modern CPU architectures obtain twice the performance for single precision execution compared to double precision. On GPU architectures this relation can be more pronounced. Taking the NVIDIA Tesla C1060 as an example, the single precision peak processing power is up to twelve times that of double precision [23]. However, many scientific applications cannot take advantage of these capabilities, as low precision operations would mean an unacceptable loss of accuracy in the result for many problems.
Actually, for many algorithms we need not perform high precision arithmetic for all intermediate computations to gain highly accurate final results. The knowledge about the error propagation in the algorithm can be used to confine the use of high precision computations to only a few relevant places [24]. This leads to a mixed precision method, which uses low and high precision computations in different parts of the algorithm. The idea behind the mixed precision method is to exploit the large disparity between single and double precision peak performance on current GPUs while still obtaining a result of high accuracy.
The usual mixed precision strategy used with linear equation solvers on GPUs is defect correction, whose original form for solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ is given by Algorithm 5. The core idea of this algorithm is to split the solution process into a computationally intensive but less precise inner iteration and a computationally simple but precise outer correction loop. To solve the defect equation, an arbitrary iterative solver running in low precision can be employed.
Another view of this defect correction scheme is to interpret the low precision inner solver as a preconditioner in a high precision iterative method. Therefore, the polynomial preconditioned CG can be treated as an outer CG loop in which an inner polynomial preconditioning is nested. Furthermore, the polynomial preconditioning is a series of matrix-vector products, which is more computationally intensive than the outer CG loop with its single matrix-vector product. Thus, we can greatly improve the performance of the polynomial PCG solution on GPUs if the mixed precision method is employed. The mixed precision polynomial PCG is described as Algorithm 6, where superscripts distinguish high precision from low precision quantities. In this algorithm, the inner preconditioning is performed in fast and cheap low precision, and the outer CG loop in accurate high precision. The conversion between the two precision formats is implemented by duplicating the values into a new array with a different precision. Compared to the fully high precision version, the distinct deficiency of the mixed algorithm is the extra memory needed to hold the low precision matrix and vectors.
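The defect correction loop can be sketched in Python by emulating single precision with the standard `struct` module. This is a minimal illustration under stated assumptions: the inner solver is a fixed number of Jacobi sweeps whose results are rounded to single precision (the paper leaves the inner solver arbitrary), and the function names, sweep count, and tolerance are hypothetical.

```python
import struct


def f32(v):
    """Round a Python float (double) to the nearest single precision value."""
    return struct.unpack('f', struct.pack('f', v))[0]


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def jacobi_low(A, d, sweeps):
    """Approximately solve A c = d with Jacobi sweeps, rounding results to single precision."""
    n = len(d)
    A32 = [[f32(a) for a in row] for row in A]
    d32 = [f32(v) for v in d]
    c = [0.0] * n
    for _ in range(sweeps):
        c = [
            f32((d32[i] - sum(f32(A32[i][j] * c[j]) for j in range(n) if j != i))
                / A32[i][i])
            for i in range(n)
        ]
    return c


def defect_correction(A, b, tol=1e-12, max_outer=100, sweeps=10):
    x = [0.0] * len(b)
    for k in range(max_outer):
        # Defect (residual) computed in high precision.
        d = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        if max(abs(v) for v in d) <= tol:
            return x, k
        c = jacobi_low(A, d, sweeps)       # defect equation solved in low precision
        x = [xi + ci for xi, ci in zip(x, c)]   # correction applied in high precision
    return x, max_outer


# Small diagonally dominant SPD example (hypothetical data).
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, outer_iters = defect_correction(A, b)
residual = max(abs(bi - axi) for bi, axi in zip(b, matvec(A, x)))
```

The final residual is far below what single precision alone could deliver, because each outer step recomputes the defect in double precision and only the correction is computed cheaply.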

Test Platform and Examples.
Our numerical experiments were conducted on a platform composed of an Intel Core2 Duo CPU E7400 @ 2.8 GHz and an NVIDIA GeForce GT430 GPU running a 64-bit Windows 7 system. The NVIDIA GeForce GT430 card has a compute capability of 2.1 and has 2 multiprocessors, 96 cores, 1 GB of global memory, 64 KB of constant memory, and 48 KB of shared memory per block. The CUDA Toolkit and SDK 4.1 are used for programming.
Table 1 lists the test block SPD matrices from finite element discretization of elasticity problems, including their important characteristics, such as element type, number of unknowns, number of nonzero entries, and so forth. Although all of these matrices are symmetric, they have all been treated as general matrices when computing SpMV. Additionally, Table 2 lists the ratios of the number of entries stored with different formats to that of the CSR format, in which the slice size is set to 32 for SELL and SBELL, and the block size of BELL and SBELL is 2 and 3 for two-dimensional and three-dimensional problems, respectively. From Table 2, we can see that ELLPACK and BELL work well for structured meshes, for example, quadrilateral and hexahedral elements, but store too many padding zeros for unstructured meshes, for example, triangular and tetrahedral elements. However, SELL and SBELL suit general meshes regardless of mesh structure.
During the performance profiling, the running time is recorded for the best case among different thread configurations. For the PCG iterative solver, the stopping criterion for convergence is a relative residual $\le 1.0 \times 10^{-7}$.

SpMV Kernels.
A comparative analysis of the performance of different kernels to compute SpMV on NVIDIA GPUs has been carried out in this work. The following kernels and storage formats have been evaluated: CUSPARSE, CSR (vector), ELLPACK, SELL, BELL, and SBELL. We present results for double precision floating point arithmetic. All kernels have been evaluated using the texture memory. Here, the vector $\mathbf{x}$ has been bound to the texture memory for all kernels evaluated, since in the computation of $\mathbf{y} = \mathbf{A}\mathbf{x}$ only the vector $\mathbf{x}$ is reused throughout the products with the different rows of the matrix. The reported figures represent an arithmetic average over 100 SpMV operations. The number of FLOPs for one sparse matrix-vector multiplication is precisely twice the number of nonzeros in the matrix. Therefore, the floating point rate, which is reported in units of GFLOPS, is simply the number of FLOPs of one single matrix-vector product divided by the average running time. The time for transferring data between host and device is not included in calculating the rate. In the context of iterative solvers, such data transfers can be neglected as they occur only twice, at the beginning and end of the iteration, and thus can be amortized over a large number of SpMV operations. Figures 6 and 7 show the performance results of SELL and SBELL with different slice sizes. All the slice sizes are set to multiples of 32 to obtain coalesced data reads. From these figures, we can see that the performance declines as the slice size increases for triangular, quadrilateral, tetrahedral, and hexahedral meshes. We therefore fix the slice size to 32 in the remaining numerical tests.
Figure 8 reports the SpMV performance results of the CUSPARSE, VECTOR, ELLPACK, BELL, SELL, and SBELL kernels. CUSPARSE denotes the kernel using the CUDA CUSPARSE library with CSR storage, and VECTOR is the vector SpMV kernel using one 32-thread warp per matrix row for the CSR sparse matrix format. As expected, the SBELL kernel offers the best performance in all the cases. It reaches 3.33, 3.52, 2.44, and 3.64 GFLOPS on triangular, quadrilateral, tetrahedral, and hexahedral meshes, improvements of 24.72%, 23.78%, 72.22%, and 24.86% over SELL. This result is attributable to the decrease in column index accesses due to block storage. From Figure 8, we can also see that all the ELLPACK-based kernels outperform the CSR-based kernels because of the regular data structure of ELLPACK. Compared to CUSPARSE, SBELL achieves speedups of 2.98, 5.52, 1.94, and 2.22 on the four meshes. Figure 9 reports the SpMV performance comparison between single precision and double precision arithmetic in the SBELL format. As expected, the single precision kernels offer better performance than the double precision kernels in all cases. Double precision performance in the triangular element mesh example is 54.4% of the single precision result. For the quadrilateral element mesh, double precision performance is 57.3% of the corresponding single precision result. For the hexahedral element mesh, the percentage is 51.1%. The tetrahedral element mesh retains 74.8% of its single precision performance. These results show that mixed precision algorithms can potentially achieve better performance than double precision versions.
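The GFLOPS figure defined above is simple arithmetic, sketched here in Python. The function name and the timing values are hypothetical and serve only to show the calculation; they are not measurements from the paper.

```python
def spmv_gflops(nnz, run_times):
    """GFLOPS of SpMV: 2 * nnz floating point operations per product,
    divided by the arithmetic mean of the measured run times (seconds)."""
    avg_time = sum(run_times) / len(run_times)
    return 2.0 * nnz / avg_time / 1.0e9


# Hypothetical example: 1,000,000 nonzeros, 100 runs of 1 ms each.
times = [1.0e-3] * 100
rate = spmv_gflops(1_000_000, times)
```

Host-device transfer time is deliberately left out of `run_times`, matching the measurement convention stated in the text.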

Polynomial Preconditioners.
The performance of the polynomial preconditioned CG algorithm for the solution of a variety of SPD linear systems from finite element approximation of elasticity problems has been compared and analyzed, using different polynomial preconditioners and different orders of the preconditioning polynomial. The test matrices and their structural characteristics are presented in Table 1.
The least-squares polynomial preconditioner can produce a good approximation of $\mathbf{A}^{-1}$, assuming that tight bounds of the smallest eigenvalue $\lambda_n$ and the largest eigenvalue $\lambda_1$ can be found. As mentioned in [12], the upper bound must be larger than $\lambda_1$, but it should not be too large, as this would otherwise result in slower convergence. This bound can be obtained inexpensively by using a small number of steps of the Lanczos method [25], with a safeguard term [26] added to guarantee that the upper bound is larger than $\lambda_1$. In addition, all systems undergo row and column scaling prior to computation of the solution; therefore, good estimates of the spectra of the scaled systems are also required.
Table 3 lists the number of iterations to convergence for the quadrilateral mesh example "quad" with different polynomial methods and different orders of polynomials. Compared to the 1938 iterations required by the Jacobi preconditioner, all the methods improve the convergence. The L-S polynomial converges fastest, and the Chebyshev polynomial shows oscillatory convergence. The number of iterations may be reduced by increasing the degree of the preconditioning polynomial. However, the number of SpMV operations in each iteration increases accordingly. Table 5 lists the convergence time of the different polynomial methods for all four meshes. Consequently, the L-S polynomial of order 6 is chosen as the polynomial preconditioner in the following PCG tests.

PCG.
In our implementation of the GPU-accelerated polynomial preconditioned CG method, the SpMV and level-1 BLAS operations are performed in parallel on the GPU. The polynomial preconditioning operation consists of SpMV and level-1 BLAS vector operations, both of which can be performed efficiently on GPUs. Typically, a small number of Lanczos iterations is enough to provide a good estimate of the extreme eigenvalues. The Lanczos algorithm can be accelerated by GPUs as well, since the computations it requires are also SpMVs and level-1 BLAS vector operations.
Figure 10 shows the speedups of the GPU-based L-S preconditioned CG over the CPU implementation on the four example meshes. As shown, the double precision version in CSR format reaches a speedup of 1.2-3.8 and the mixed precision version 1.8-4.5. For the SBELL format, the double precision version achieves a speedup of 5.5-6.9 and the mixed precision version 7.2-9.1. The difference between double precision and mixed precision is significant here, and SBELL is more suitable than CSR for GPU computing in finite element computation of elasticity problems.

Conclusions
We have introduced the sliced block ELLPACK (SBELL) format to store the sparse matrix arising from finite element discretization of elasticity. Based on this sparse representation, a CUDA SpMV kernel has been implemented on the GPU. Compared with the CUSPARSE library, vector, ELLPACK, BELL, and SELL kernels, the SBELL SpMV kernel achieves the best performance. To accelerate convergence of the conjugate gradient iterative method for solving the finite element equations on the GPU, polynomial preconditioning methods have been investigated. In the numerical tests on different meshes, polynomial methods are always feasible and generally show good convergence performance. In addition, the number of iterations to convergence may be reduced by increasing the degree of the preconditioning polynomial. Taking into account that the convergence time grows with the number of SpMV operations for high order polynomials, the L-S polynomial method shows the best performance and has been chosen as the preconditioner in the PCG solver for finite element equations derived from elasticity. In the PCG solver, a mixed precision algorithm is used by introducing single precision arithmetic in the computationally intensive inner preconditioning loop and double precision in the outer correction loop. This mixed precision implementation not only reduces the overall computational and storage requirements but also makes full use of the capacity of the GPU devices. With the SBELL format and the mixed precision implementation, the GPU-based L-S preconditioned CG can reach a speedup of over 7 relative to the CPU implementation for different meshes.
In future work, optimizations will be investigated in practical problems. Furthermore, we plan to study hardware-adaptive algorithms to automatically capture optimal computational performance on GPUs with different configurations.

Figure 4: Entries of each slice in the SBELL.

Figure 5: Entries of each slice in global memory.

Algorithm 5: Mixed precision strategy for linear equation solvers.
k = 0
repeat
  d_k = b − A x_k (compute in high precision)
  A c_k = d_k (solve in low precision)
  x_{k+1} = x_k + c_k (correct in high precision)
  k = k + 1
until convergence (checked in high precision)

Figure 6: Performance results of SELL with different slice sizes.

Figure 7: Performance results of SBELL with different slice sizes.

Figure 9: SpMV performance comparison between single and double precision results.

Table 1: Examples used for testing.

Table 2: Factors of entries stored with different formats to CSR.
Table 4 lists the convergence time of the different polynomial methods with different orders in CSR format on the CPU. Compared to the convergence time of 64.92 seconds for the Jacobi preconditioner, the running times of Neumann and Chebyshev greatly increase. The L-S polynomial has the best convergence performance, and its convergence time is almost identical to that of Jacobi when the polynomial order is set to 6.

Table 3: Convergence iterates of different polynomial methods with different orders.

Table 4: Convergence time of different polynomial methods with different orders (unit: sec).

Table 5: Convergence time of different polynomial methods on different meshes (unit: sec).