A GPU-Based Parallel Procedure for Nonlinear Analysis of Complex Structures Using a Coupled FEM/DEM Approach

This study reports the GPU parallelization of complex three-dimensional software for nonlinear analysis of concrete structures. It focuses on coupled thermomechanical analysis of complex structures. A coupled FEM/DEM approach (CDEM) is given from a fundamental theoretical viewpoint. As the modeling of a large structure by means of FEM/DEM may lead to prohibitive computation times, a parallelization strategy is required. With the substantial development of computer science, a GPU-based parallel procedure is implemented. A comparative study between the GPU and CPU computation results is presented, and the runtimes and speedups are analyzed. The results show that dramatic performance improvements are gained from GPU parallelization.


Introduction
Many authors have studied the resolution of the nonlinear concrete structure problems, focusing rigorously on the three-dimensional (3D) elastoplastic problem, such as Roca and Marí [1] and Spacone et al. [2].These authors used a finite element model.Other authors (e.g., Rousseau et al. [3], Villard et al. [4]) proposed a coupled FEM/DEM approach with regard to nonlinear material analysis of structures.As is shown, the modeling of a large structure by means of FEM or DEM may lead to prohibitive computation times.Thus, some authors (e.g., Sziveri et al. [5], Romero et al. [6]) developed CPU-based parallel procedures.With significant development of computer science, the implementation of a GPU-based parallel procedure becomes possible.
A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to display [7].GPU has become an integral part of today's mainstream computing systems.The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart [8].Over the past decade, there has been a marked increase in the performance and capabilities of GPUs.As a result, computing on a GPU card has been a hot topic for scientific research in different numerical areas [9][10][11][12][13][14][15][16][17][18][19][20][21].The techniques, which enable GPUs to efficiently undertake large-scale scientific computing tasks, can be summarized as follows [21].
(i) A GPU card consists of a number of multiprocessors (as shown in Figure 1).On each processor, there are several hundred coresident threads that can execute integer, single and double precision calculations simultaneously.
(ii) Fine-grained parallelization with a very large number of threads is the kernel principal of GPU computing.
(iii) The memory access technique is specially designed in a GPU to accelerate the memory access speed of the threads.To achieve near peak memory access performance, memory coalescing is considered in the design of the data structure and the computational arrangement for computations.Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache) [23].
(iv) The release of Compute Unified Device Architecture (CUDA) makes it much more straightforward to implement GPU code.
To model the nonlinear response of complex structures, we will parallelize the coupled FEM/DEM code on a GPU with CUDA.In the first section, the coupled FEM/DEM approach with its basic formulations is introduced.Then, the GPU implementation and optimization techniques are presented.Furthermore, the parallel FEM/DEM code is tested on computers equipped with modern GPU cards.Finally, conclusions regarding the GPU parallelization are drawn.

Coupled Finite/Discrete Element Model
2.1.Geometric Representation.Li et al. [24,25] proposed a Continuum-based Discrete Element Method (CDEM) which in fact is a Coupled Finite/Discrete Element Method and has a variety of applications.A detailed description about CDEM will be introduced in the following context.
CDEM is a combination of Finite Element Method (FEM) and Discrete Element Method (DEM).Figure 3 shows the geometric domains used by the CDEM approach.It contains two kinds of elements, blocks (i.e., A, B, C, and D) and contacts (Figures 3(c) and 3(d)).A discrete block consists of one or more FEM elements, all of which share the same nodes and faces.In this paper, it is assumed that a discrete block consists of only one FEM element.All three element types are shown in Figure 4.A contact contains several normal and tangent springs (Figure 5); each connects nodes of neighboring blocks.Inside the block, the FEM is used, while for the interface, the DEM is adopted.

Governing Equations. Block stress analysis means fig-
uring out stress of every single block and the interaction between different blocks.For every block in the model, it needs to satisfy the following governing equations.
The boundary conditions are In ( 6)∼(8),  represents age;  is thermal conductivity;  stands for density;  is specific heat capacity;  is temperature;  0 is the initial temperature of concrete;   is the fixed boundary temperature;  represents heat flux;   denotes ambient temperature in natural convection conditions and adiabatic temperature of boundary layer in forced convection conditions;  is heat transfer coefficient in surface.
By using the variation formulation, the equilibrium equation of heat transfer in (7) can be transformed into the following matrix form: where [] is the matrix for specific heat capacity, [] is the matrix for heat conductivity, and {} is the heat flux vector.A time forward difference scheme is used for thermal analysis.Thus, the time difference equation can be formulated as where {  } stands for the temperature vector at time step ; Δ is the time step.

Dynamic Relaxation Method
. By using the variation formulation, the equilibrium equation of momentum in (1) can be transformed into the following matrix form in an element: where [] represents the diagonal mass matrix; [] represents the damping matrix; [] represents the stiffness matrix; {} represents the vector of displacement; {} represents the vector of external force.
In our approach, global stiffness matrix is not assembled.Instead, (11) is iterated element by element, using the dynamic relaxation method.In every element, its displacement satisfies (11) strictly.If the solution of every element satisfies (11), then the overall solution made up of element solutions must satisfy the assembled equation.In fact, this is beneficial for the parallel implementation.
In time domain, an explicit iteration technique is applied.In this technique, the acceleration is iterated by the central difference scheme, while the velocity is iterated by the unilateral difference scheme.The schemes can be written as where ü   , u   , and    , respectively, represent the acceleration, the velocity, and the displacement of the th node of the th element, at the th time step.
The explicit iteration technique can be formulated from ( 12): The process of explicit iteration using the dynamic relaxation method is illustrated in Figure 6.During the iteration, convergence is reached when the total magnitude of the kinetic energy is minimized.

GPU Implementation with CUDA.
In Section 2, we refer to the CDEM approach, in which global stiffness matrix is not assembled.An element-by-element strategy is employed to solve the equilibrium equation.In this sense, the CDEM is suitable to run on a GPU since calculations involved in the CDEM are all performed independently for the elements and the nodes.
In GPU parallelization, heterogeneous computing methodology is usually used.This means that the serial part of the code executes in a host (CPU) thread, while the parallel part executes in a large amount of device (GPU) threads.The device can be viewed as a virtual computer that has its own separate memory space.It is ideally suited to perform element and nodal calculations across hundreds of threads.Figure 7 shows a flow chart of the GPU implementation of the CDEM.One principle for the implementation is to replace the original CPU-based calculation functions with CUDA kernels.A kernel is a function that runs on the device.Another principle is to integrate as many CUDA kernels into a large one as possible.This is to minimize the delay of kernel launching.The kernels should be integrated only when they can be parallelized.Serial and dependent kernels cannot be integrated together.
An example is given to demonstrate the GPU implementation with CUDA.In the example, the CPU functions DoElems and DoNodes are first replaced by corresponding CUDA kernels.Then, the CUDA kernels are integrated into one CUDA kernel DoKernel.The additional qualifier " global " is used to define a CUDA kernel.Embellished with <<<nBlocks, nThreads>>>, the DoKernel function will be run in  × ℎ threads in parallel (see Algorithms 1 and 2).As illustrated in Figure 7, stress and thermal fields are analyzed on two GPUs concurrently.Such multi-GPU technique ensures higher efficiency when we perform multifield simulation for complex structures.All of the data in two fields will be first sent to the device.During the calculation, only thermal stress will be read from one GPU and sent to another.The communication is realized by using the CUDA function cudaMemcpy.The data transfer from the device to the host only happens when the simulation results need to be output.Since the data are kept on the device for the whole simulation process, memory transfer between host and device is largely reduced.Data transfer from GPU2 to GPU1 only happens under necessary circumstances, which will not affect the overall efficiency.

GPU Parallel Algorithm and Procedure.
The GPU parallel algorithm is realized by GPU kernels.In Section 3.1, we have discussed the implementation of one GPU kernel.Now all the kernels will be presented to demonstrate the detailed process (see Algorithm 3).
The kernels GenerateStiffness() and GenerateConductivity() generate element stiffness matrix and element conductivity matrix, respectively.They use similar algorithms.To generate element matrix, we perform the Gaussian integrations over each element by using the following formula: where [] is strain matrix; [] is stress-strain tensor; || is the determinant of Jacobian matrix;   ,   , and   are weights in Gaussian integrations.The strain matrix [] is calculated by GPU.For linear problems, [] is calculated and stored in GPU memory before iteration.For memory-consuming problems (i.e., nonlinear or big problems), [] is calculated when iteration is taking place.Since the GPU calculation of [] matrix is fast, it costs little time.For different elements, the parallel algorithms to generate [] matrix are different.The algorithm is shown below.

Algorithm of Generation of [𝐵]
Matrix.(1) For each 4node tetrahedron element, there is one or four Gaussian integration points.Each element is divided into four threads.If there is one Gaussian integration point, the first thread performs the Gaussian integration and the weight is unit.If there are four Gaussian integration points, each of the four threads performs its own Gaussian integration and each of the four weights is 0.25.
(2) For each 6-node wedge element, there are six Gaussian integration points.Every 8 threads are one group.In one group, only the first six threads perform the Gaussian integrations, while the left two threads do nothing.
(3) For each 8-node hexahedron element, there are eight Gaussian integration points.Each element is divided into eight threads.Each thread performs the Gaussian integration of one Gaussian point.
The kernel DoKernel() does nodal and element calculation of stress field.The algorithm in DoKernel() can be summarized as follows.
Steps 1∼3 are calculations in terms of elements, while steps 4∼5 are in terms of nodes.In parallel algorithm, all elements and nodes are distributed into nBlocks × nThreads threads.Thus, the DoKernel() function will be run in nBlocks × nThreads threads in parallel.
It is worth mentioning that a CUDA library function syncthreads() must be called between Steps 1 and 2. The matrix multiplication []  {}  is performed by GPU.It demands that the displacement vector {} be synchronized.Thus, the function syncthreads() is used to keep all the displacement components at the same level.
Boundary conditions need special treatments.For  = 0 boundary, no calculation is done.The displacements are set as zero.For traction boundary, tractions are equivalent to nodal forces.Each thread performs the calculation of one node.
Convergence is reached if the kinetic energy is lower than a threshold.The kinetic energy is defined as.To optimize the parallel procedure, the number of threads per block needs to be carefully chosen.As is recommended by NVIDIA [29], the best results will be achieved when the number of threads per block is a multiple of 64.For ideal performance, this number is recommended to be over 192.In practice, the best performance is achieved when the number of elements per block is 16 or 32.Equations ( 16) [18] and (17) show the relationship between the number of threads per block   and the number of elements per block   .
8-node/6-node elements: 4-node element: The effective bandwidth of each memory space depends significantly on the memory access pattern.As described in [29], global memory bandwidth is used most efficiently when the simultaneous memory accesses by threads in a half-warp (during the execution of a single read or write instruction) can be coalesced into a single memory transaction of 32, 64, or 128 bytes.To achieve maximum global memory bandwidth, the data structures are realigned to thread.Besides, shared memory is used to store nodal variables, for the shared memory space is much faster than the local and global memory spaces, and nodal variables are reused among threads [18].1.

Test Model.
The model of a dam is tested by the following steps.
(i) First, the gravity field is simulated to test the accuracy of the GPU procedure, to get the runtime distribution of different GPU kernels, and to obtain the concerning speedups.
(ii) Then, the thermal field is calculated with a distribution of thermal stress.
(iii) Finally, the fracture field is determined by the superposition of the original gravity and the newly-induced thermal stress.
The model of the dam is presented in Figure 8(a).The boundary conditions for gravity simulation are displacement constraints.The front and back faces ( +/− directions) and the left and right faces ( −/+ directions) are all constrained with zero displacements in normal directions.The bottom face ( − direction) is constrained with zero displacements in all directions.The thermal boundary conditions are illustrated in Figure 8(b), and the temperatures used in boundary conditions are presented in Table 2.

Accuracy.
The results from the CPU and the GPU procedures are not identical though they look the same to each other (see Figure 9).A comparison between the results is illustrated in Figure 10.This comparison shows that the results from different GPUs are the same, but they have a little difference compared with the CPU result.Further, the errors between different procedures are shown in Table 3.The average error is about 2%.The reason why errors happen is that the parallel algorithm has a different accumulation order from the serial one.The accumulation occurs in the nodal force redistribution function.The CPU procedure accumulates the nodal forces one by one in the order of the elements, while the GPU version has a different order, which depends on the generation of nodal force group.Besides, the Arithmetic Logical Units (ALUs) on CPU and GPU have different relative error bounds.11 illustrates the runtimes of different GPU kernels.It shows that the iteration kernel of dynamic relaxation takes possession of over 95% of the runtime, while other kernels take possession of that less than 5%.The iteration kernel determines the overall runtime, which means that the iteration speedup can represent the overall speedup to some extent.The second time-consuming is the data output kernel.An average of about 3% of runtime is consumed in data output.When big problems are calculated, this part of runtime can be large.This encourages us to reduce data transfer between host and device.Other kernels consume little.Their total runtime is less than 1%, which can be neglected in this study.

Runtime. Figure
From another point of view, what makes the runtimes different on different platforms is concerned.Take the runtime of stiffness matrix generation as an example.The runtimes on different platforms are shown in Table 4. On GTX680, the runtime is the least, only 0.828582 ms, nearly the same as the GTX570.The GTX580 is the most time-consuming GPU.The reason why this happens can only be the difference of the numbers of CUDA cores.This indicates that more CUDA cores will reduce the runtime of stiffness matrix generation.However, the number of cores is not the only decisive factor.The processor clocks determine the runtimes of dynamic relaxation iterations, which can be learned from Figure 12.The higher the clock rate is, the less the runtime of iteration is consumed.Figure 13 indicates that newer GPU models are designed for less runtime of output.All the figures presented show that double precision calculations cost more runtimes than single precision ones.The runtime analysis helps us to choose a proper GPU model for simulations in both scientific research and engineering applications.

Speedup.
In parallel computing, speedup refers to how much a parallel algorithm is faster than a corresponding serial algorithm [30].Speedup  is defined by the following formula: where   is the runtime of the serial algorithm;   is the runtime of the parallel algorithm.
In the tests, we obtain different speedups, which are presented in Figure 14, on different GPU computers using different precisions.The figure shows that speedups range from about 100 to 400.The maximum speedup reaches dramaticly 417, while the minimum reaches 102.Following the optimization strategies presented in Section 3.2, we have gained dramatic performance improvements from GPU parallelization.The figure also shows that double precision calculations consume more runtimes than single precision ones.The GTX580 is the fastest GPU among the three GPUs we present.

Other Analysis.
Temperatures in summer and in winter are both simulated, as shown in Figures 15 and 16.We obtain the steady thermal fields in summer (see Figure 15) and in winter (see Figure 16), respectively.The thermal differences between two thermal fields will produce thermal stresses, which may cause cracking in concrete.The crack caused by thermal stresses is a major research topic in future work.

Conclusions
In this study, the CDEM is successfully accelerated by using GPUs.In CDEM, global stiffness matrix is not assembled, and an element-by-element strategy is employed to solve the equilibrium equation.This, as well as the performance optimization, enables us to implement an efficient GPU parallel procedure.Detailed tests on accuracy, runtime, and speedup are performed on different GPUs.The results show that dramatic performance improvements are gained from GPU parallelization.A maximum speedup of 417 has been achieved.The GPU parallelization of CDEM makes it more powerful in multifield analysis of complex structures.2013CB036406, and Grant no.2013CB035904) and National Natural Science Foundation of China (50909105).Acknowledgments also go to the colleagues/students Chun Feng, Dong Zhou, Qingbo Zhang, and so forth, for their kind help.

Figure 1 :
Figure 1: Fermi architecture-Fermi's 16 SMs are positioned around a common L2 cache.Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache) [23].

Figure 2 :
Figure2: CUDA programming model-hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared and per-application global memory spaces[23].

Figure 6 :
Figure 6: Flow chart of iteration using dynamic relaxation method.

Figure 8 :−Figure 9 :
Figure 8: The test model of a dam.

Figure 12 :Figure 13 :
Figure 12: Runtime of iteration versus processor clock.Runtime of iteration decreases with processor clock.

Figure 15 :
Figure 15: Summer temperature simulation to obtain the steady thermal field.The last picture ( = 100 d) shows the steady state of thermal field in summer.

Figure 16 :
Figure 16: Winter temperature simulation to obtain the steady thermal field.The last picture ( = 100 d) shows the steady state of thermal field in winter.
(5)5),  , represents the first-order partial derivative of stress tensor versus coordinate;   stands for the body force;  is density; ü  denotes the acceleration; u  denotes the velocity;  is damping ratio;  , ,  , are both the first-order partial derivatives of displacement versus coordinate;   ,   are strains;   is stress; D  is stress-strain tensor.

Table 1 :
Specifications of the GPUs [26-28] used in this study.
half the number of nodes, compared with the 8-node hexahedron element.They share a proportional (1/2) relationship with each other.Thus, a proportional optimization strategy can be used.In this sense, once the 8-node hexahedron element is optimized, the 4-node tetrahedron element is optimized as well by multiplying a proportion of 1/2.
In all our tests, three computers, one Intel Xeon E5506 at 2.13 GHz with NVIDIA GeForce GTX580, one Intel Xeon E5-26090 at 2.40 GHz with NVIDIA GeForce 4.1.Test Platform.

Table 2 :
Temperatures used in boundary conditions.

Table 4 :
Runtimes of stiffness matrix generation on different GPUs.