Parallel N-Body Simulation Based on the PM and P 3 M Methods Using Multigrid Schemes in conjunction with Generic Approximate Sparse Inverses

During the last decades, Multigrid methods have been extensively used for solving large sparse linear systems. Considering their efficiency and the convergence behavior, Multigrid methods are used in many scientific fields as solvers or preconditioners. Herewith, we propose two hybrid parallel algorithms for N-Body simulations using the Particle Mesh method and the Particle Particle Particle Mesh method, respectively, based on the V-Cycle Multigrid method in conjunction with Generic Approximate Sparse Inverses. The N-Body problem resides in a three-dimensional torus space, and the bodies are subject only to gravitational forces. In each time step of the above methods, a large sparse linear system is solved to compute the gravity potential at each nodal point in order to interpolate the solution to each body. Then the Velocity Verlet method is used to compute the new position and velocity from the acceleration of each respective body. Moreover, a parallel Multigrid algorithm, with a truncated approach in the levels computed in parallel, is proposed for solving large linear systems. Furthermore, parallel results are provided indicating the efficiency of the proposedMultigridN-Body scheme.Theoretical estimates for the complexity of the proposed simulation schemes are provided.


Introduction
The simulation of an -Body system is referred to as a dynamical system of bodies influenced by mainly the gravitational force.The -Body simulation has become a fundamental tool in the study of complex physical systems; compare [1].The physical interactions of bodies (e.g., gravitational, Coulomb) and the evolution of the dynamical system present the phase-space density distribution of the system.These systems rely on a substantial number of bodies, thus making the complexity an important factor of the computational procedure.The computation of the gravitational influence of the bodies in a confined space can be made with different approaches.The direct method of solving an -Body problem, called Particle Particle, is based on the computation of the gravitational force in each pair of bodies within the system (Newtonian gravitation).This direct approach scales the problem with ( 2 ) complexity, where  denotes the number of bodies, rendering the scheme restrictive for large number of bodies.A different approach to the problem, called Particle Mesh (PM) method, neglects the close interactions between bodies and takes into account only the dynamics of superparticles, each consisting of a great number of bodies.The particles contribute to their masses to the mass density of the nodal points on the confined space.A more complicated hybrid method, Particle Particle Particle Mesh method (P3M), which is composed of the above two methods, divides the force into two parts: the short range and the long range force.The long range force is computed with the Particle Mesh method, and the short range force is computed via the Particle Particle method leading to more accurate results; compare [1][2][3][4].
Additionally, there are other approaches based on tree codes, such as Tree code methods (cf.[3]) and Fast Multipole methods (cf.[5]).The Tree code methods have ( log()) complexity, but Fast Multipole methods can reach up to () complexity, though "the exact scaling seems implementation dependent and has been debated in the literature" (cf.[6]).Approximate inverses have been used as preconditioners for various iterative schemes combined with fast approximate matrix-vector products such as FMM (cf.[7,8]).Furthermore, various implementations have been proposed based on Mesh, Tree, and Multipole methods and on hybrid models (cf.[9]).
The -Body simulations cover a wide area of cosmological tests.The proposed schemes can be used, as an alternative method, for solving the -Body problem on mainly spherical or disk galaxies simulations; compare [10].These simulations can be solved using Tree code techniques, with computational complexity of ( log ), where  is the number of bodies.Mesh type methods in conjunction with the FFT algorithm for solving the potential have periodical boundary conditions.These boundary conditions when applied to a galaxy replicate the solution on the boundaries of the domain, introducing significant error; compare [10].The proposed scheme takes into consideration the physical space where Dirichlet boundary conditions can be used, to eliminate such errors.Furthermore, the Multigrid method has computational complexity of (), while the FFT algorithm has computational complexity of ( log ), where  is the number of grid points.
The gravity potential required by the field force, on both PM and P3M methods, is computed by solving a linear system derived from a Poisson PDE; compare [3].The potential is commonly computed with a Fast Fourier approach with an ( log ) complexity, where  denotes the number of grid points.Some examples with this approach can be found in [11][12][13].Due to the nature of the problem given by the Poisson equation, a different approach for solving the linear system is applied using Multigrid methods which possess near optimal () complexity.During the last decades, Multigrid methods have been used in a variety of scientific fields for solving Partial Differential Equations in Computational Fluid Dynamics and Computational Physics due to their near optimal computational complexity and convergence behavior; compare [14][15][16][17][18][19][20][21][22][23].In order to accelerate the convergence of the Multigrid method, the Dynamic Over/Under Relaxation (DOUR) scheme is used (cf.[24]) in conjunction with the Generic Approximate Sparse Inverse (GenAspI) matrix (cf.[25]).
Herewith, we propose a new hybrid parallel algorithm for computing the gravity potential using the Particle Mesh and the P3M methods and the Multigrid V-Cycle method in conjunction with the GenAspI (GenAspI-MGV) method to accelerate the solution of the linear system.The Particle Mesh method computes the mass density in parallel by adding each mass influence according to weights computed from the distance of each node to the grid points surrounding each respective body.The interpolation of the respective weights in each nodal point is performed using the Cloud in Cell (CIC) method; compare [1].Then the gravity potential at the cell centers is computed by solving the linear system using the parallel GenAspI-MGV method with a new approach where the lower order levels are not being parallelized in order to avoid overhead in levels with low computational cost.In the P3M method the short range forces are computed with Newton's gravity formula in parallel for each short range force pair.Additionally, by integrating the equations of motion through the Velocity Verlet integrator (cf.[33]), the new position of each body is computed.The low computational complexity of the Multigrid method utilized in each time step for the solution of the linear system renders the algorithm efficient, especially for increasing numbers of bodies.
The parallelization of the proposed algorithms is utilized on a hybrid parallel system with the parallel tools OpenMP and MPI.The hybrid parallel system is preferred due to the nature of the problem that requires a lot of data to be sent.The implementation sends the -Body data to every cluster node and then the data is further parallelized with OpenMP on each computational node.The linear system is solved on the head node, due to the fact that the order of the linear system is much smaller than the number of bodies and requires less computational effort.
OpenMP is a collection of directives available for a variety of languages including C/C++/Fortran used to parallelize programs for Symmetric Multiprocessing Units.On the other hand, MPI (Message Passing Interface) is a message-passing system used to parallelize the algorithm on a cluster level.
Finally, numerical results on the performance and convergence behavior of the proposed -Body GenAspI-MGV algorithms are presented for solving three-dimensional -Body simulation problems on a hybrid system and on Symmetric Multiprocessing Units for both PM and P3M methods.

Parallel Multigrid Method in conjunction with the Generic Approximate Sparse Inverse
Multigrid is a computational method for solving linear systems of equations which is used extensively in Computational Physics due to its near-optimal complexity and its convergence behavior; compare [14-16, 18, 21-23, 31, 32, 34].Stationary iterative methods are not effective on damping the low-frequency components of the error.On the other hand, they are highly efficient on damping the high frequency components of the error within the first few iterations; compare [14,15,22].Multigrid methods exploit the way stationary iterative schemes handle the error by creating a hierarchy of (finer and coarser) grids, as presented in Figure 1, transferring the linear system to various levels with different mesh size.The Multigrid methods can be analyzed in the distinct components: smoother, transfer operators, and cycle strategy.The smoothers are used to obtain the respective corrections for the solution of the linear system on each level and can be described by the following relation (cf.[15,16,22,34]): where ℓ denotes the level in which the smoother is applied,  ℓ is the preconditioner to Richardson's iterative method, and  is the damping parameter (0 <  < 2).The Generic Approximate Sparse Inverse is used as a preconditioner to the Richardson method.By increasing the fill values the approximate inverse tends to the exact inverse of the matrix .The algorithm for the computation of an approximate inverse sparsity pattern is given in [25,35,36].The GenAspI matrix  fill drptol is then computed, according to this sparsity pattern, by solving recursively the following system (cf.[25,36]): where  and  are the lower and upper triangular factors computed by the (0) factorization.The computation of the elements of the GenAspI matrix is given by and the GenAspI algorithm has been presented in [25].
The computational complexity of the GenAspI algorithm (cf.[25]) in terms of its nonzero elements is as follows: ) , In order for a smoother to be effective the smoothing property must be satisfied (cf.[15,18,[20][21][22]34]).The smoothing property has been proven in [31] for the Optimized Banded Generalized Approximate Inverse (OBGAIM) matrix (cf.[31,37]): where Moreover, sharp estimates for the convergence of Multigrid algorithms, for generalized smoothing, have been proven by Bank and Douglas in [38].
Moreover, Hackbusch has found the optimal values for the damping parameter for various iterative schemes and various PDEs; compare [18,20,24].The proposed scheme requires a damping parameter .The damping parameter cannot be defined optimally for the problem, so an algorithm for the dynamic computation of its value must be used.The value of the damping parameter can be computed by the Dynamic Over/Under Relaxation (DOUR) algorithm; compare [24].The DOUR algorithm predicts dynamically the value of the damping parameter (predictor-corrector) to ensure the convergence of the smoothing scheme.
Given an initial damping parameter , we use (1) to predict a value for  (+1) ℓ : The corrected value is given by where From ( 7), (8), and ( 9) we obtain where   = (1 + ) is the effective relaxation parameter and ( 10) is the proposed two-stage nonstationary smoothing scheme.Further information about the DOUR algorithm and its convergence analysis is given in detail in [24].
The vectors needed by the iterative scheme for the solution process of the linear system must be transferred across different grid levels.Transfer operators are special operators that are used to transfer these vectors from coarser to finer grids and finer to coarser grids.The operators are divided into two categories: the restriction operator and the prolongation operator.The restriction operator  is used to transfer vectors from finer to coarser grids.An effective choice for the restriction operator is the full-weighting, which in three dimensions can be expressed as presented in [22,39].
The prolongation operator  is an interpolation procedure used to transfer vectors from coarser to finer grids.An effective choice for the prolongation operator is the trilinear interpolation.The trilinear interpolation can be expressed as presented in [22,39].
These operators  ℓ+1 and  ℓ are related through the Galerkin condition,  ℓ+1 =  ℓ  so the mapping on the data is simplified; compare [14,30].Transferring vectors in the Multigrid method requires about 25% of the total computational work (cf.[15]) so the selection of higher order interpolation schemes may lead to excessive computational cost with no extra gain in the convergence behavior (cf.[15,22]).A sparse matrix representation of these operators is used and thus the transfer operation between the levels is limited to matrix "times" vector multiplication, and their parallelization is covered by the parallelization of the matrixvector multiplication.Further information about the transfer operators is presented in [14,15,22,34].
The cycle strategy refers to the sequence in which the grids are visited and the respective corrections are obtained (cf.[15,22,34]).The most commonly used cycle strategy is the V-Cycle, where the method descends to coarser levels executing V 1 smoother iterations in each level and then the method ascends executing V 2 iterations in each level.The V-Cycle strategy is depicted in Figure 2.More information about the cycle strategies and the V-Cycle is presented in [29,32,40,41].
The parallelization of the V-Cycle algorithm is performed only in the higher order levels to ensure the efficiency of the scheme; compare [39].Lower order levels possess computational effort comparable to the parallelization overhead produced during the procedure of creating and detaching threads present in the parallel computation.Therefore, only levels containing a substantial number of unknowns are computed in parallel.The other levels are computed sequentially.As the number of levels increases, the sequential part requires less computational effort compared to higher order levels, thus increasing the speedup and efficiency of the proposed scheme.

Parallel Particle Mesh Type Algorithms
The Particle Mesh method (PM), introduced by Hockney in [1], avoids the computation of every force pair in the system via the Newtonian gravitation and reduces the computational cost of the -Body simulation; compare [1][2][3][4].This method introduces the so-called "superparticles" term, which defines a large formation of bodies that react together through their gravitational force as a whole.Hence, PM is more efficient than the direct method due to the fact that the nodal points are less than the number of bodies in the system and so the force pairs that have to be computed are much smaller.The PM method is not as accurate as the Particle Particle method.The method computes rapidly the long range (field force) force but is not taking into account the gravity influence between the short range bodies.This approach is more suitable on large scale systems, where low complexity and large scale results are more important; compare [1][2][3][4].
The gravity potential is computed by the Partial Differential Equation (cf.[1,40]), subject to Dirichlet boundary conditions, where Ω denotes the region, Ω is the boundary of the region,  is the gravitational constant, and  is the mass density in each nodal point computed by the mass of the neighboring bodies.
Applying the Finite Difference method with the sevenpoint stencil on the PDE (cf.( 11)-( 12)) results in solving where  is a large sparse diagonally dominant symmetric matrix,  is the gravity potential at each nodal point, and  is the right hand side vector consisting of the forcing term and respective boundary conditions.Each body contributes to the density of the concentrated mass along the grid points, resulting in the mass density vector in that PDE; compare (11).The simplest interpolation method used for computing the mass density vector is the Nearest Grid Point (NGP).The NGP method suggests that the body contributes to the nearest nodal point; thus bodies within one cell are not interactive with other nodal points increasing the error.In the proposed scheme, the multilinear interpolation is used, known as Cloud in Cell (CIC); compare [1].The Cloud in Cell method interpolates the contribution of mass to every nearby grid point via the so-called "cloud" increasing the accuracy of the computations.The cloud can be a square or a cube with its center at the location of the body.The method is based on the area (two dimensions) or the volume (three dimensions) of every part of the "cloud cell" with respect to the other mesh cells, to create the interpolation weights.In three dimensions, using the Cloud in Cell method on a body with mass   located at (  ,   ,   ) results in affecting the mass density of the eight surrounding grid points (cube).The mass density (cf.[1,4]) in the nearest mesh cell can be computed by where ℎ is the mesh size and Δ  , Δ  , Δ  are the distances between the body and the center of the mesh cell.The remaining nodal point densities are computed analogously.The bodies in the Particle Mesh method are divided among the computational nodes for the parallelization of the algorithm.So every node of the hybrid system computes the mass density of their own bodies in parallel on the processors through the parallel Cloud in Cell method.The computed vectors are sent from every other node to the head node through the MPI parallel environment.The PDE subject to Dirichlet boundary conditions (cf. ( 11)-( 12)) discretized with the Finite Differences method is solved with the parallel GenAspI-MGV method on the head node, and the gravitational potential is the solution of the linear system.
The acceleration  in the center of the cells (cf.[1][2][3][4]) can be computed by the following equation: Central differences are used to approximate the first order partial derivative; that is, The method is computed in parallel using the OpenMP at all nodes.The acceleration on the center of the cells must be interpolated back to the bodies.Applying the parallel CIC method for the inverse operation on a body in the three dimensions results in the following formula for the acceleration (cf.[1,4]): where index  denotes the particle and   are computed weights through the CIC method.For the nearest mesh cell the   is computed by The remaining weights are computed in an analogous way.The acceleration vector is sent back to the other nodes from the head node through the MPI interface.The Cloud in Cell method and the inverse procedure are depicted in Figure 3. Finally, knowing the acceleration for each body the equations of motion have to be integrated.On the proposed algorithms the Velocity Verlet integrator is utilized.The Velocity Verlet integrator is used widely on -Body simulations due to its low computational cost and (ℎ 2 ) accuracy; compare [33].The Verlet integration of the equations of motion leads to the following equations:

On every node
where ⃗ , ⃗ , ⃗  are the three-dimensional vectors of the position, the velocity, and the acceleration, respectively.The new positions are computed via the Velocity Verlet method, which is performed in parallel on each computational node using the OpenMP environment.Further information concerning the Particle Mesh method can be found in [1][2][3][4].
The new hybrid parallel PM algorithm, based on MPI and OpenMP, is given by the compact scheme in Algorithm 1.
The computational complexity of the Particle Mesh algorithm is where  nodes denotes the number of computational nodes and  is the number of mesh grid points per dimension and has been proven in [42].The Particle Particle Particle Mesh is a hybrid method, which consists of the Particle Mesh and the Particle Particle (direct) method; compare [1][2][3][4].The P3M method divides the force into two parts: the short range force and the long range force.The long range force is computed via the PM method by solving the resulting sparse linear system with the Multigrid method in conjunction with the GenAspI algorithm.The short range forces are computed with the direct approach using the Newtonian gravitation.The computation of the short range force requires the definition of the short range distance between bodies.
Let us denote by  near the maximum distance between two bodies that can be considered as neighbors.The distance is three or four times the mesh size.In the proposed algorithm the distance is considered as  near = 3ℎ 1 , where ℎ 1 is the mesh size.The distance of every pair of bodies in the system must be smaller than the  near for the short range force to act.If this condition is applied on every pair of bodies, the computational cost will increase.Therefore, a new mesh, called chaining mesh, is created, with ℎ 2 = 4ℎ 1 , to group the short range pairs; compare [1,2,4].
Each body compares its distance from the bodies that are on the nine (in two dimensions) nearby mesh cells of the chaining mesh, with the  near distance.The neighbor chaining mesh cells are shown in Figure 4.If the distance of the bodies is less than  near , the short range force between the two bodies is computed.This assumption of the nine nearby cells of the chaining mesh is based on the fact that bodies within the  near distance cannot be outside these cells as depicted in Figure 4.
In three dimensions, these are twenty-seven cells, and the region of short range interactions around a body can be described as a sphere.
The bodies of the P3M method are also divided into the computational nodes for the parallelization of the algorithm.The computation of the long range force is parallelized the same way as the Particle Mesh method.For the computation of the short range force, every node assembles a vector containing the distinctive numbers of the associated bodies in the chaining mesh and sends it to the other nodes.
Figure 4: The nine neighbor chaining mesh cells.The bold mesh grid denotes the chaining mesh.The dots represent the position of a body in the center mesh cell (chaining mesh).The circles denote the  near of each body.
After the determination of the close distance bodies, the short distance acceleration can be computed by the following equation which is derived by Fourier analysis of the overall potential (cf.[1,2,4]): where  is the gravitational constant, ⃗  0 is the position of the computed body,   and ⃗  1 are the mass and the position of the close range body, and erfc is the complementary error function (cf.[1]): The erfc function is used, in order to remove the influence that was computed through the mass density on the long range force.The computation of the short range force for a body requires the knowledge of the position of each neighboring body, so the positions of the neighboring bodies should be transferred to its respective computational node.Further details can be found in [1].The long range force is computed through the PM method.The PM method is used to compute the acceleration of each body, so it is preferable to compute the short range acceleration instead of the  near .Thus, for each body Finally, the new positions are computed by integrating the equations of motion with the Velocity Verlet integration in parallel (cf.[33]) as in the parallel PM method.
The P3M method corrects the short range solution in the system, but in large scale clusters with large number of concentrated bodies, the short range part of the algorithm monopolizes the simulation time, and the computational time increases.
The new Hybrid parallel P3M algorithm, based on MPI and OpenMP, is given by the compact scheme in Algorithm 2.
The computational complexity of the P3M algorithm is where  nodes denotes the number of computational nodes,  is the number of mesh grid points, and  is the number of chaining mesh points per dimension and has been proven in [42].

Numerical Results
In this section the applicability as well as performance of the proposed scheme is demonstrated by simulating the threedimensional torus space with various numbers of bodies.
The numerical tests were performed on a cluster of eight computer nodes.The head node was a dual socket quad core Xeon processor (8 cores) with 2.5 GHz and 8 GB RAM.The other nodes had 2 quad sockets dual core Xeon processors (8 cores) with 2.5 GHz and 8 GB RAM.The cluster network was an Ethernet network with 1 Gbps bandwidth.The domain chosen for the simulations was the unit cube.The termination criterion was set to ‖  ‖ < 1 − 10‖ 0 ‖, where   =  −   .The presmoothing and postsmoothing steps were set to ] 1 = 2 and ] 2 = 2.For computing the GenAspI matrix, the drop tolerance and levels of fill were set to drptol = 0.0 and fill = 1, respectively.For these parameter values the method presented the fastest performance results; compare [39].The levels below  = 343 were executed sequentially because the computational overhead exceeds the computational effort required in order to obtain the coarse grid corrections.It should be noted that the preprocessing cost of the simulation algorithms designed was several orders less than the computational time needed for the model problems, rendering the methods efficient by slightly enlarging the computational cost.Moreover, new computational schemes that will enhance performance and accuracy of the simulations are researched.

Particle Mesh Numerical Results
. The proposed method based on the Particle Mesh method and the GenAspI-MGV, as shown in the results, has an advantage regarding the computational speed.
In Table 1, the overall performance of the Particle Mesh simulation algorithm based on the GenAspI-MGV method for various numbers of bodies, various numbers of cluster nodes, various numbers of threads, and various resolutions  is presented.In Table 2, the speedups of the Particle Mesh simulation algorithm, based on the GenAspI-MGV method for various numbers of bodies, cluster nodes, threads, and various resolutions, are presented.
In Figure 5, the theoretical estimates along with the experimental results for the speedup on the parallel Particle Mesh algorithm, for different mesh sizes and different numbers of nodes and for  = 20,000,000 bodies, are presented.In Figure 6, visual results of the Particle Mesh simulation on twelve frames are presented.

P3M Numerical Results
. The proposed P3M method has the same characteristics as the proposed PM method but lacks the computational speed of the PM method due to the direct part of the algorithm, which raises the accuracy of the results, as expected.
The numerical results of the P3M method were derived from a high density system with a large number of concentrated bodies that reveal the flaws of the method regarding high density body clusters.In Table 3, visual results of the P3M simulation with different body densities and their sequential performances are presented.In Table 4, the overall performance of the P3M algorithm, based on the GenAspI-MGV method for various numbers of bodies, various numbers of cluster nodes, various  numbers of threads, and various resolutions, is presented.In Table 5, the speedups of the parallel P3M algorithm based on the GenAspI-MGV method for various numbers of bodies, cluster nodes, threads, and various resolutions are presented.In Figure 7, the theoretical estimates along with the experimental results for the speedup on the parallel P3M algorithm, for different mesh sizes and different numbers of nodes and for  = 2,000,000 bodies, are presented.

Conclusion
The proposed schemes based on the GenAspI-MGV method are proven to be highly effective regarding the computational performance; compare [25,31,37].Moreover, the numerical results indicate the effectiveness of the proposed parallelization scheme especially for the Particle Mesh method.The numerical results reveal the differences of the two -Body simulation methods (Particle Mesh, P3M) regarding the computational performance.The Particle Mesh method is faster than the P3M method.Finally, future work will be concentrated on the performance of P3M method with new approaches utilizing adaptive schemes; compare [43].

Figure 3 :
Figure 3: On the left is the Cloud in Cell method in three dimensions.The cube denotes the "cloud" and the values are the mass densities.On the right is the inverse Cloud in Cell method in three dimensions.The lines denote the weights.

Figure 5 :
Figure 5: The theoretical estimates compared to the speedups of the parallel Particle Mesh algorithm, for different mesh sizes and different number of nodes for  = 20,000,000 bodies.

Figure 6 :
Figure 6: Visual results from the Particle Mesh simulation (twelve frames).

Figure 7 :
Figure 7: The theoretical estimates compared to the speedups of the parallel P3M algorithm, for different mesh sizes and different numbers of nodes and for  = 2,000,000 bodies.

Table 1 :
Compute the mass density./ * equation (14) * / (ii) Compute the chaining mesh place vector.(iii) Send the mass density to the head node, and the chaining mesh place vector to every other node.The overall performance of the Particle Mesh simulation algorithm based on the GenAspI-MGV method for various numbers of particles, various numbers of cluster nodes (), various numbers of threads, and various resolutions (results in seconds.hundreds).

Table 2 :
The speedups of the Particle Mesh simulation algorithm based on the GenAspI-MGV method for various numbers of bodies, various numbers of cluster nodes (), various numbers of threads, and various resolutions.

Table 3 :
Visual results of the P3M simulation with different body densities and their sequential performances are presented.

Table 4 :
The overall performance of the P3M simulation algorithm based on the GenAspI-MGV method for various numbers of particles, various numbers of cluster nodes (), various numbers of threads, and various resolutions (results in seconds.hundreds).

Table 5 :
The speedups of the P3M simulation algorithm based on the GenAspI-MGV method for various number of bodies, various numbers of cluster nodes (), various number of threads, and various resolutions.