A Comparative Study on Different Parallel Solvers for Nonlinear Analysis of Complex Structures

. The parallelization of 2D/3D software SAPTIS is discussed for nonlinear analysis of complex structures. A comparative study is madeondifferentparallelsolvers.Thenumericalmodelsarepresented,includinghydrationmodels,watercoolingmodels,modulus models,creepmodel,andautogenousdeformationmodels.Afiniteelementsimulationismadeforthewholeprocessofexcavation andpouringofdamsusingthesemodels.Thenumericalresultsshowagoodagreementwiththemeasuredones.Toachieveabetter computingefficiency,fourparallelsolversutilizingparallelizationtechniquesareemployed:(1)aparallelpreconditionedconjugate gradient(PCG)solverbasedonOpenMP,(2)aparallelpreconditionedKrylovsubspacesolverbasedonMPI,(3)aparallelsparse equationsolverbasedonOpenMP,and(4)aparallelGPUequationsolver.Theparallelsolversruneitherinasharedmemory environmentOpenMPorinadistributedmemoryenvironmentMPI.Acomparativestudyontheseparallelsolversismade,and theresultsshowthattheparallelizationmakesSAPTISmoreefficient,powerful,andadaptable.


Introduction
Complex structures are largely employed in engineering practice in a variety of situations and applications, for example, water resources and hydropower engineering, mining engineering, and traffic engineering.An analysis of such structures is not possible by empirical methods, and moreover in situ experimental studies are costly.Recent advances in numerical techniques have provided the finite element method (FEM) for analysis of much more complex systems in a much more realistic way.The FEM can model the complex behavior of concrete without limitations caused by complexity.
Many authors have studied the nonlinear response of concrete structures using the FEM focusing on the threedimensional elastoplastic problem [1][2][3][4][5][6][7][8][9][10][11][12].Most authors consider the effect of creep [4][5][6][7][8], while others consider cracking [9][10][11][12], with regard to nonlinear analysis of structures.Generally, the models are extremely computation intensive.Many engineers complain about the high computational costs.As a result, some nonlinear analysis codes have to be given up.The situation has changed since the arrival of a variety of high performance computers and the advances in parallel computing techniques, that is, parallel algorithms and parallel platforms.Parallel algorithms on different platforms, that is, algorithms utilizing OpenMP and MPI, are studied by [13][14][15][16].Recently, the GPU high performance computing has been popular.A parallel GPU equation solver is introduced into SAPTIS as well.In these respects, we should be able to model more complex effects like hydration, water cooling, cracking, creep, autogenous deformation, and so forth.
The aim of this work is the prediction of concrete behaviors in different situations as well as the improvement of the runtime in the nonlinear analysis program structure analysis program for temperature and induced stress SAPTIS.A composition of models is presented in terms of the former, and a comparative study on different parallel solvers is made for the sake of the latter.The paper does not address the issue in mathematics itself, but focuses on the application of current algorithms.The goal is to develop practical and efficient parallel strategies for nonlinear analysis of complex structures.
In this study, we will arrange the contents as follows.In the first section, a composition of models and its basic The nonlinear analysis of the model takes into consideration the effects of hydration, water cooling, modulus changes, creep, and autogenous deformation.An appropriate model for these effects as well as the FEM equations is used for building the nonlinear system.

Thermal Simulation
2.1.1.Hydration Models.Hydration of concrete brings heat.Hereby it can be modeled using five models in SAPTIS, namely, exponential model, hyperbolic model, complex model, hydration degree model, and table model [17].We put a superscript "+" as heat increasing, and the first four models can be explicitly formulated as follows.

Water Cooling Models.
There are two water cooling models in SAPTIS: one is a fine model and the other is an equivalent model.Derivations and formulations of the two models are described in detail in [18].The cooling pipes take away heat, which can be written as: where  − () is the adiabatic temperature drop caused by the cooling system,   is the temperature of water at entry,  0 is the initial temperature of concrete, which will be given in (8), and () is the thermal distribution function.
In the fine model, a local grid refinement of the cooling pipes is needed.In every single time step, thermal distribution along the cooling pipes is obtained by calculating the heat exchange between the cooling pipes and the concrete, and the thermal field of concrete is therefore obtained.The fine model can accurately simulate the thermal distribution around the cooling pipes, but it needs a grid refinement, which inevitably enlarge the calculation scale.
The equivalent model does not take into consideration the location of pipes.An average concept is exploited by inputting the pipe spacing, the flux, the temperature, and the time of watering and calculating a mean effect of water cooling.The equivalent model can reduce calculation time without local grid refinements and ensure an average precise, but it does not consider the thermal distribution along the cooling pipes and it cannot obtain the thermal gradient near them.

Basic Thermal Model.
The equilibrium equation of heat transfer can be written as: The initial condition can be written as: The boundary conditions are In ( 7 [19].
Exponential Model: Hyperbolic Model: Complex Model: Aging Degree Model: In practical modeling, choose one model from (10) ∼ (13), and the results show that any of the models can well simulate the process of modulus growth of concrete.[19] proposed a creep formula for loading at different ages after a rigorous derivation:

Creep Model. Zhu
where (, ) represent creep of concrete,  stands for age, (− ) denotes loading time, and  1 ,  2 ,  1 ,  2 , ,  1 ,  2 ,  3 ,  1 , and  2 are regression coefficients.An increment method is employed to calculate the creep influence on deformation and stress.Modulus and creep of concrete in several engineerings are shown in Tables 1 and 2.

Autogenous Deformation and MgO Linear Expansion Models.
Input the ages of concrete and corresponding strains, and autogenous deformations can be obtained from spline interpolation.
Autogenous deformations exist in MgO concrete and can somehow offset the shrinkage due to thermal decrease, which is beneficial for preventing dam cracking.MgO concrete has been applied in a few projects, and good performance has been achieved [20,21].We propose two formulas to calculate autogenous deformations of MgO concrete from practice.

Finite Element Formulation
2.3.1.Element Types.SAPTIS is 2D/3D software for general analysis.In 2D space, elements of triangle with 3-6 nodes and quadrilateral with 4-9 nodes are employed.In 3D space, elements consist of tetrahedra with 4-11 nodes, wedges with 6-16 nodes, and hexahedra with 8-21 nodes.The shape functions of these elements can be found in and elaborated upon by [20,21].Besides, a one-dimensional (1d) element used for links and prestressed anchors is defined in the software.
Two definitions for this 1d element are emphasized: one is an explicit definition, in which the element is a real bar connected by its two nodes, and it has its own stiff matrix (Figures 1(a), 1(b), and 1(c)).The other is an implicit definition and it treats the element as a virtual bar, which does not have a geometric entity in representation, but its stiff matrix is  nominally added to its neighboring element(s) (Figures 1(d), 1(e), and 1(f)).In both definitions, there are three connection types in terms of hexahedral neighbors, that is, edge connection (Figures 1(a) and 1(d)), face diagonal connection (Figures 1(b) and 1(e)), and body diagonal connection (Figures 1(c) and 1(f)).

Finite Element Formulation of Heat Transfer
. By using the variation formulation, the equilibrium equation of heat transfer in (7) can be transformed into the following matrix form: where [] is the matrix for specific heat capacity, [] is the matrix for heat conductivity, and {} is the total heat flux vector for internal hydration heat and heat convection.The matrices of [] and [] and vector {} shown in ( 17) are where [] is the matrix for shape function, Ω is the finite domain, and Γ is the boundary of Ω.

Finite Element Formulation of Stress
Analysis.An incremental method is employed to analyze the deformation and stress of concrete.The sum of elastic and creep strain in concrete is proportional to a sustained load under ordinary service stress level, while the thermal strain is proportional to the temperature rise.The autogenous deformation in MgO concrete can be worked out by the formulas we propose.The incremental method has been elaborated upon in [12,22], and we only present its finite element equation as follows: where {Δ}  is the force increment and []  is the strain matrix.{Δ} is the stress increment caused by elastic, creep, thermal, and autogenous strains, which can be written as where [] is the instant stress matrix, {Δ  } is elastic strain increment, {} relates to creep strain increment, {Δ  } stands for thermal strain increment, and {Δ  } represents autogenous strain increment.Put (20) into (19), and we can get an equilibrium form where []  is the stiff matrix of element, {Δ}  is displacement increment of element, and {Δ  }  , {Δ  }  , and {Δ  }  are nodal load increments caused, respectively, by creep, temperature, and autogenous deformation.The matrix and vectors can be summarized as After the assembly of all element stiffness matrices, a general finite element equation in its global form can be expressed as where

Parallelization Strategy
Since the 1980s, a lot of work has been done in parallel and distributed computing for structural analysis.Parallel processing benefits such analysis a lot by using two different strategies [23].
(1) The analysis problem may be subdivided by geometrically dividing the idealization into a number of subdomains: explicit decomposition approach, also called substructure approach.(2) Alternatively the system of equations for the whole structure may be assembled and solved in parallel without recourse to a physical partitioning of the problem: implicit domain decomposition (IDD) approach, or global approach.
It should be noted that such strategies with domain decomposition techniques are not general solution procedures and are specialized for particular applications.Furthermore, what can be parallelized are [24] (a) input problem characteristics, (b) assembly, (c) boundary conditions, (d) solution of algebraic equations, and (e) postprocessing.The fourth item is the most important to parallelize.Additionally, Gummadi and Palazotto [25] indicated that in a typical linear static FEA, the most consuming operation is the solution of linear equations.We recognize from our own experience that the time used for solving equations can reach 70%∼90%.An optimization in linear equation solvers is required to largely improve computing efficiency.
3.1.Linear Equation Solvers.Linear system of large equations is solved by two kinds of solvers, that is, the direct solvers (e.g., Gauss) and the iterative ones (e.g., PCG).
The direct solvers are applicable with accuracy assurance for any nonsingular linear system with an appropriate scale and density.When solving a large sparse system with many 0's, a direct solver is particularly consuming.Additionally it cannot ensure the accuracy in terms of rounding errors.The parallelization for direct solvers is only suitable for tridiagonal matrices, block tridiagonal matrices and banded matrices.For a general sparse linear system, the sparse equation solvers are popular with multifrontal algorithms and supernode technology.
The iterative solvers overcome the shortcomings of the direct ones and retain the sparsity of coefficient matrix.With small storage and computation, they have more obvious advantages especially for large, sparse, asymmetric, and seriously ill-conditioned matrices.At present, the preconditioned conjugate gradient (PCG) solver on behalf of the Krylov subspace solvers is among the most popular iterative solvers in engineering and sciences.Classical iterative solvers (e.g., Jacobi, GS, SOR, and SSOR) with their improved ones are used for preconditions in most situations.

Mathematical Problems in Engineering
(3) the sparse equation solver, (4) the GPU solver.To implement a parallel procedure, four major problems are faced with.The first to be concerned is what parts of the algorithm can be parallelized.These parts should be decomposed and executed by different processes, and a judgment for decomposition is whether there exists data competition.The second problem is the strategy for decomposition.There are two strategies: one is task strategy, another is data strategy.In task strategy, the procedure is decomposed into different tasks, whose dependence on whom should be noted.Then the tasks are scheduled to avoid mutual interference.Such parallelization strategy belongs to a coarse one and demands high independence of specific problems.In data strategy, the data space of procedure is decomposed into different areas.Each processor takes in charge of its own area.Obviously, this strategy is well suited for parallelization of finite element analysis.The third to be concerned is the programming model.This can be determined by the parallel architecture.The last one is the programming method.In scientific computing, data parallelization strategy is generally employed.As a result, the looping or (Single Program Multiple Data SPMD) programming is adopted.In looping programming, loops without competition are distributed into different processors.In SPMD programming, all processors execute the same procedure, but they use different data, which are transferred and shared among the processors via communication.

Key Technique for Parallel
In SAPTIS, these problems are solved one by one as follows.
(i) The solution of algebraic equations should be urgent to parallelize.
(ii) For FEA parallelization, the data strategy should be exploited.
(iii) The message passing programming model should be employed if based on MPI, and the multithread programming model should be employed if based on OpenMP.
(iv) Data parallelization strategy determines that the looping or SPMD programming method should be used.

Parallel Preconditioned Conjugate Gradient Solver Based on OpenMP.
The parallel preconditioned conjugate gradient (PCG) solver uses symmetric successive over relaxation (SSOR) technique and runs on the OpenMP-based platform.
The bottlenecks of performance for the PCG solver are the matrix and vector operations like W −1 b and W − b, which costs over 80% of the computation time.Therefore, the parallelization strategy for the PCG solver based on OpenMP is to parallelize the loops containing such operations without data competition by controlling statements of OpenMP.After such programming process, parallelization of matrix-vector multiplication and inner product operation is achieved, and the data formats and compile options are optimized simultaneously.
The PCG solver is parallelized in an algorithm level, which requires frequent communication on computers of distributed memory systems.Thus, such parallel algorithm is more suitable for computers of shared memory systems, and an OpenMP-based platform is employed.
The parallelization procedures for the PCG solver can be summarized as follows: (i) first, the matrix and vector operations are parallelized, including AP and A  P, where the vector P is generated by specific algorithm; (ii) then, the inner product operation is parallelized; (iii) furthermore, the vectors are updated; (iv) finally, the preconditions are calculated (if necessary).

Parallel Preconditioned Krylov Subspace Solver Based on MPI.
The Krylov subspace   is defined by means of the square matrix A ∈ R × , the initial vector b ∈ R  , and the positive scalar  and   = Span(b, Ab, . . ., A −1 b).The original system Ax = b is reduced by using the orthogonal transformation Q, obtaining the tridiagonal system (Q  x) = Q  b.By the integration of the solution of the tridiagonal system and the generation of the matrices Q and , let us obtain a complete algorithm to solve the original system, where we obtain a better solution in each step [29].
The parallel preconditioned Krylov subspace (PKS) solver uses iterative techniques, for example, CG, CGS, BiCGSTAB, GMRES, TFQMR, and so forth, and runs on the MPI-based platform.
For the parallelization of the PKS solver, a domain decomposition method is employed.The "divide and conquer" strategy is used, in which the finite analysis domain is divided into several subdomains and each subdomain is in the charge of one or several processors.All the processors solve the whole problem independently and interactively, and synergistically and simultaneously.
Based on the distributed memory environment of parallel computing, the PKS solver is paralleled by using the SPMD programming method and the message passing programming model.Since the "divide and conquer" strategy is used,   (iv) iterative techniques, for example, CG, CGS, BiCGSTAB, GMRES, TFQMR, and so forth, are paralleled and used to solve all the local equations.Algorithms based on iterative methods are not always suitable for specific structural analysis, for such algorithms don not always work since the matrix may not be well conditioned during the iterative process.The direct method has fewer problems in achieving solution convergence.In terms of these reasons, a sparse equation solving method is adopted as well in SAPTIS.This is a numerically stable parallel algorithm for solving ill-conditioned linear systems of equations.
The sparse methods are close to direct methods in essence, but they have also big differences.The rearrangements of matrices, incomplete decompositions, and multifront techniques have made the sparse methods more efficient than the direct ones.
The sparse equation solver is parallelized in CSR matrix format, which supports several types of matrices including real/imaginary matrices and symmetric/asymmetric matrices.
The parallelization procedures for the sparse equation solver are similar to those for the PCG solver.In this case, we will omit the discussions.

Parallel GPU Equation Solver.
Recently, the high performance computing is quite popular, that is, the GPU computing.The graphics processing unit (GPU) has become     metal pipe are performed.The parameters for the simulations are presented in Table 3.
The comparisons between the results of the fine and equivalent models are illustrated in Figure 4.The temperature in the fine model is a real value, while the temperature in the equivalent model is and can only be a mean value of the whole pipe domain.
The figure shows that the temperature of the metal cooling pipe is lower than that of a plastic one, and the temperature difference between them reaches 1.5 ∘ C on day 15.Thus, the metal pipe takes priority over the plastic one in concrete cooling.Meanwhile, the numerical result shows a great agreement with the analytical solution.The temperatures obtained by these two models are nearly the same, which indicates that the equivalent model can ensure its accuracy when simulating a real engineering problem.A simulation for the dam is performed to grasp the real temperature status and thus to develop a measure for preventing cracking.The simulation takes into account the whole process of excavation and pouring of dams and the varieties of influencing factors for dam temperature.The model with its boundary conditions is illustrated in Figure 2.

Simulation of a
The simulation results are shown in Figure 5.The temperature of Elevation 357 m is presented in Figure 5(a) while the temperature of Elevation 408 m is presented in Figure 5(b).The comparisons between the numerical results and the measured ones are made, and we know from the figure that these results show a great agreement with each other.Additionally, the results show the real temperature status.These all indicate that the models presented are reasonable and accurate.

Parallel PCG Solver Test on OpenMP
(1) Serial Codes Test.A serial codes test is performed on a ThinkCentre M Q45t computer.Its parameters are shown in Table 4.This computer can solve a problem with more than 5 million degrees of freedom.
The result of the test is shown in Figure 6.The PCG solver runs on a single PC very efficiently with little memory, and  6 shows the different times solving problems with different DOFs by using the serial SSOR-PCG solver.The result indicates that the sovler takes only 100 seconds when solving a problem with 2 million DOFs and takes 500 seconds when solving a problem with 5 million DOFs.Obviously, it is much faster than many solvers.
(2) Parallel Codes Test.A parallel codes test of the PCG solver runs on a Sun Fire 6800 computer.Table 5 shows the test platform, and Table 6 gives the test results.We can easily know from the results that the speedups increase significantly with the number of threads.The acceleration of speedups drops a little, but the speedups themselves are almost over 80%.

Parallel PKS Solver Test on MPI.
Test of the parallel PKS solver is performed on platform shown in Table 7. Methods are shown in Table 8, and the test results are presented in Table 9.
We can easily obtain that the speedups are significant and the parallel solver is quite efficient.This parallel solver can run either on a cross-node cluster or on a shared memory system such as the multicore CPU.

Parallel SE Solver Test on OpenMP.
The test platform for parallel SE solver is presented in Table 10, and the results are shown in Table 11.For a 930000-order slightly ill-conditioned matrix, the solving time is quite impressive, and it is only 70 seconds when the solver runs on 16 threads.Since a sparse direct method is employed, the time is mainly spent on incomplete LU decomposition.Only about 20G memory is consumed, which is much less than the Gaussian elimination, but more than iterative methods whose memeroy consumption can be only 3-4 G.We can get from the speedups that the parallel sparse equation solver is very efficient and scalable, and it is well suitable for equations of large linear system.7.
The boundary conditions are similar to the ones shown in Figure 2. The results are shown in Figure 8.
In this simulation, we test the speedups, which will be shown in Table 12.

Conclusions
First, numerical models adopted by the software are presented including (i) hydration models, (ii) water cooling models, (iii) modulus models, (iv) creep model, (v) autogenous deformation models taking into account the properties of MgO concrete.
A thermal example for verifying the thermal models is presented.A good agreement is achieved between the numerical results and the analytical ones.A finite element simulation for the whole process of excavation and pouring of dams using these models is made, and the numerical results show a good agreement with the measured ones.
Then, several parallel solvers are introduced with their parallel strategies, consisting of (1) the preconditioned (i.e., SSOR) conjugate gradient solver,

Figure 1 :
Figure 1: Definitions and connection types of 1d element.

Figure 4 :
Figure 4: Comparison between the fine and equivalent models.

Figure 8 :
Figure 8: The simulation results using the GPU parallel solver.

( 8 )
call the function VeksSetDemQ to transfer the thermal loads vector into the GPU solver; (9) call the function VeksSolve to solve the equations; (10) call the functions VeksGetU and VeksGetT to get the displacements and temperatures, respectively.

4. Numerical Examples 4 . 1 .
Thermal Field Verification 4.1.1.Cylinder Water Cooling Pipe.Take a piece of circular area, whose radius  = 1.5 m (see Figure3).The fine and equivalent models of FEA are used for comparison and verification.Comparative simulations between a plastic and a Dam.A hyperbolic arch dam in Southwest China is 285.5 m high.The annual average air temperature in the area is 19.7 ∘ C; the highest monthly average air temperature is 27.1 ∘ C; the lowest monthly average air temperature is 10.6 ∘ C. The annual average ground temperature is 21.4 ∘ C.

Table Model .
Calculate

Table 6 :
380000-order general coefficient matrix test results.

Table 12 :
Speedup of GPU580 versus different CPU machines and methods.