Generating Multibillion Element Unstructured Meshes on Distributed Memory Parallel Machines

We present a parallel mesh generator called PMSH that is developed as a wrapper code around the open source sequential Netgen mesh generator. Parallelization of the mesh generator is carried out in five stages: (i) generation of a coarse volume mesh; (ii) partitioning of the coarse mesh; (iii) refinement of coarse surface mesh to produce fine surface submeshes; (iv) remeshing of each fine surface submesh to get a final fine mesh; (v) matching of partition boundary vertices followed by global vertex numbering. A new integer based barycentric coordinate method is developed for matching distributed partition boundary vertices. This method does not have precision related problems of floating point coordinate based vertex matching. Test results obtained on an SGI Altix ICEX systemwith 8192 cores confirm that our approach does indeed enable us to generatemultibillion elementmeshes in a scalable way.


Introduction
We developed a tool called PMSH for facilitating fast generation of unstructured multibillion element tetrahedral meshes (grids) on complex geometries for the OpenFOAM computational fluid dynamics (CFD) package [1].PMSH is developed as a C++ wrapper code around the popular open source sequential Netgen mesh generator [2].OpenFOAM provides a mesh generator called blockMesh for simple geometries.The blockMesh utility is a multiblock mesh generator that generates hexahedral meshes from a text configuration file.For complex geometries OpenFOAM also provides a mesh generation utility called snappyHexMesh which generates hexahedral meshes.The snappyHexMesh utility works more like a mesh sculptor rather than a generator.It takes an existing mesh such as the one produced by blockMesh and chisels out a mesh on a complex geometry that is given in STL format.The snappyHexMesh utility has advanced features like the ability to run in parallel and being able to redistribute the mesh so as to perform automatic load balancing.Both utilities, snappyHexMesh and blockMesh, are not as advanced as other commercial or open source mesh generator packages for producing quality tetrahedral meshes on complex geometries.Therefore, there is a great need in the OpenFOAM community for tools that will enable researchers to generate massive tetrahedral meshes on complex geometries.
Löhner states in [3] that "grid sizes have increased by an order of magnitude every 5 years and that presently, grids in of the order of 10 9 elements are commonly used for leading edge applications in the aerospace, defense, automotive, naval, energy and electromagnetics sectors."Löhner also mentions that "for applications where remeshing is an integral part of simulations, for example, problems with moving bodies or changing topologies, the time required for mesh regeneration can easily consume a significant percentage of the total time required to solve the problem."Geometry and mesh generation related concerns have also been voiced in a recent Applied Mathematics for Exascale Report [4] by the US based Exascale Mathematics Group.Therefore, as we move towards exascale computing, the ability to generate massive multibillion tetrahedral meshes especially on complex geometries will be in more demand in the future.
This paper is organized as follows: Section 2 reviews the previous work done in the area of mesh generation and refinement.Section 3 presents the details of our algorithms and data structures used.Section 4 lists a few programming issues and the modifications that needed to be done in order to fix some bugs in Netgen.Section 5 presents the results of various parallel mesh generation tests we have carried out on an SGI Altix ICE X system with 8192 cores.Finally, conclusions and future work are presented in Section 6.

Previous Work
Parallel mesh generation has been addressed in several reported works: Dawes et al. [5] present bottom up octree based parallel mesh generation routines.Antonopoulos et al. [6] parallelize Delaunay based mesh generation on multicore SMT-based architectures.D3D [7] is an MPI based parallel mesh generator that employs octree based mesh generation.MeshSim by Simmetrix Inc. is a commercial multithreaded and distributed parallel mesh generator [8].ParTGen [9] is a parallel mesh generator that uses a geometry decomposition based approach to decouple the meshing process into parallel mesh generation tasks.Performance tests of PartTgen have demonstrated scalability only up to 32 processors.The approach taken in PMSH is similar to that of ParTGen.The recent work by Löhner [3] involves geometry decomposition and the use of advancing front technique on multiple subdomains to generate meshes in parallel.It is reported that a billion-element sized mesh has been generated in roughly forty minutes on a 512-core SGI ITL machine.
Pamgen [10] is another parallel mesh generation library within the Trilinos Project that produces hexahedral or quadrilateral meshes of simple topologies.It is stated in its FAQ page that the number of elements PAMGEN can generate is limited to just above two billions (because of 32-bit integer limitation).
CGAL is a computational geometry algorithms library developed at INRIA.Pion et al. present 3D parallel Delaunay triangulation work in [11].CGAL also provides a parallel mesh generation capability based on TBB (Threading Building Blocks) library [12].
Figure 1 illustrates refinement based approaches that can be used to generate massive meshes.Uniform mesh refinement without the use of geometric information, as shown in (a), works very fast in parallel, is quite scalable to tens of thousands of processors, and can be used to produce meshes with tens of billions of elements.The works by Houzeaux et al. [13] and Kabelikova et al. [14,15] can be shown as examples of uniform refinement.These approaches, however, cannot be used on problems involving complex geometries with curved boundaries.In such cases, they cannot accurately approximate the boundary of the geometry.An alternative method is to do uniform refinement but also to make use of geometric information as shown in Figure 1(b).In [16], mesh multiplication technique with surface correction and volume node movement is presented.This approach is also taken by Yilmaz et al. [17].In this case, in addition to the mesh entities, one needs to have access to the geometry.Such an implementation can be achieved by utilizing mesh generation libraries as, for example, Netgen which is used in [17].One and a half billion element meshes have been generated in this way in one-to two-minute time on one thousand cores.Yilmaz et al. also provide results for a method involving decomposition of geometry in the flavour of Figure 1(c), where meshing of decomposed geometry is performed.However, geometry decomposition into a large number of subgeometries introduces problems caused by thinly cut subdomains.Using this method, meshes of about a few hundred million elements could be accomplished on up to 64 cores on simple geometries.A preliminary implementation was also reported in [17], which generates a surface mesh from a geometry, refines the surface mesh, and generates a fine mesh from it in the style of Figure 1(c).This implementation had some problems and some preliminary results involving generation of a few million element meshes on few tens of cores were reported.
Table 1 summarizes the mesh generation methods and the degree to which they use the geometric information.In our current PMSH work, the approach shown in Figure 1(c) involving generation of a coarse mesh, extraction of a surface mesh, and remeshing from a refined and projected fine surface mesh has been completely redesigned using new data structures and has been successfully implemented.The resulting implementation is robust and enabled us to generate multibillion meshes on up to eight thousand processors (note that 8K processor is not a limitation of our PMSH but rather it was the maximum number of cores for the maximum allowable memory we could get on the specific supercomputer that we used for our tests).Also, note that both our PMSH and
Finally, we note that when new mesh entities are created in parallel, duplicate entities on the partition boundaries result which need to be matched.A somewhat similar problem occurs in the OpenFOAM reconstructParMesh utility.We quote Piscaglia et al. [18] on this problem: ". ..the point, face and cell processor addressing lists, generated at decomposition time, are no longer valid once topological changes have occurred.As a consequence, the only way to reconstruct a parallel case involving dynamic mesh with topological changes consist of using an algorithm based on geometry instead than topology. ... " This geometric approach, however, operates on floating point numbers and also requires a specified tolerance.To resolve this complicated problem, we contribute a new integer barycentric coordinates based approach for parallel duplicate entity matching.Our approach is actually a simple solution whose simplicity makes it even more valuable, since it drastically reduces the effort required for parallel programming of unstructured mesh applications.This approach is explained in detail in Section 3.

PMSH Parallel Mesh Generation Algorithm
Our mesh generation algorithm proceeds in five main stages: (i) generation of a coarse volume mesh, (ii) partitioning of the coarse mesh to get submeshes, each of which will be processed by a processor, (iii) extraction and refinement of coarse surface submeshes to produce fine surface submeshes, (iv) remeshing of each fine surface submesh to get the final fine volume mesh, and (v) matching of distributed duplicate partition boundary vertices followed by global vertex numbering.
Given the list of symbols and their definitions in Abbreviations, Algorithm 1 presents the detailed steps of our parallel mesh generation algorithm.Steps (2)-( 4) basically correspond to stage (i) that involves generation of an initial coarse mesh.Firstly, the geometry description Ω is loaded.Geometry file can be in STL or Netgen's CSG (.geo) format.A coarse mesh is then generated from this geometry.The coarse mesh basically enables us to subdivide our domain into  subdomains in a load balanced way.
In step (5), corresponding to stage (ii), version 5.1.0 of METIS [19] partitioner is used to partition the coarse mesh in a face-connected way.The coarse volume mesh and its partitioning are available on all the processors and hence allow each processor to compute processor adjacencies independently.In our current implementation, we let each processor generate the coarse volume mesh and its partitioning redundantly.Even though energy and resources can be saved if a dynamic malleable job model is used (i.e., starting with a single processor and increasing the number of allocated processors after step ( 5)), there are some complications involving reconstruction of Netgen's internal data structures on dynamically spawned processors.The malleable job model will be considered in the future.
Steps ( 6)-( 9) correspond to stage (iii), which involves extraction and refinement of the coarse surface submesh to produce a fine surface submesh.Faces which are on the geometric boundary and faces which are shared by elements that are on different processors are extracted and a coarse surface submesh is formed.This surface submesh is then refined uniformly  times to form a fine submesh.Refinement is done by our own code and not by Netgen code.A queue of faces to be refined is maintained.A face to be refined is pushed into this queue.After refinement, if it is to be refined again (i.e., −1 more times), the newly created faces are also pushed into the queue.Using Netgen's library, newly formed vertices are projected onto the actual geometry boundary as part of step (9).Currently projection works only when meshing Netgen's CSG (.geo) geometry files.It does not work properly for STL geometries because sometimes Netgen's projection routine returns incorrect projections for some points.
Since we need to match new surface vertices with their counterparts on the neighbouring processors later on, we need to keep track of the addresses of newly created vertices.Our own refinement code helps us to do this tracking.We know the IDs of the coarse mesh vertices, since the same coarse mesh is present on all processors.The newly created vertices, however, have local IDs on each processor and these are different from their counterparts on other processors.Even though one can always use , ,  coordinates of these vertices to match them, such a method uses floating point comparisons and can be sensitive to precision related errors.The generation of new vertices is done in different orders on different processors.Since multibillion element meshes are to be generated, precision related problems are more likely to happen due to very small element sizes.As a result, we wanted to come up with a new integer based vertex identification system.
The new vertex identification system we came up with makes use of integer barycentric coordinates.All processors know the IDs of the coarse mesh vertices.Hence, given a coarse surface mesh face that is defined with vertices having indices , , and , the address of this face can be identified easily by employing a map that maps the key (, , ) to the corresponding local face address on each processor.However, when new faces and vertices are created as a result of refinement, their orders of creation are different on each processor.There can be different connectivity types as shown in Figure 2. Vertex connectivity is easy to resolve since the vertex connectivities of partitions can be established by the coarse mesh vertices which are known by all processors.The newly created vertices on edges and faces introduce Here, , ,  are either 0 or indices of vertices from T V satisfying  <  <  and , , and  are integer barycentric coordinates with  +  +  = 2  .Figure 3 shows an example face defined by coarse mesh vertices 3, 5, and 8 and the newly created vertices as a result of 2-level refinement.The table in Figure 3 shows the barycentric global IDs of some vertices.Note that even though there are six pieces of information in a barycentric global ID compared with three pieces of information in , ,  coordinates, storage-wise barycentric global IDs are more advantageous., ,  coordinates require three doubles (24 bytes).A barycentric coordinate ( or  or ) can be represented using  + 1 bits.The initially generated coarse mesh is small, one or two million at most.A 32-bit unsigned integer is more than enough to fit both the  + 1 bits as well as the vertex ID in the coarse mesh.Since we have 3 pairs, 3 unsigned integers, that is, a total of 12 bytes, are actually sufficient.
We also note that, in the literature, applications of barycentric coordinates can be found in many areas such as finite element analysis and computer graphics.Recently, in particular, in the area of computer graphics, Boubekeur and Schlick [20,21] made use of barycentric coordinate based interpolation in generic and adaptive mesh refinement kernels for GPUs.Our use of barycentric global IDs for distributed vertex matching in PMSH is yet another novel application of barycentric coordinates.
In order to compute vertex adjacencies of processors, adjacent processor IDs are inserted into a map whose keys are barycentric global IDs.This is done in step (10).After this step, the fine volume submesh is generated from the fine surface submesh in step (11).OpenFOAM mesh format uses global vertex IDs.Therefore, we need to assign global numbers to all vertices.In steps ( 12)-( 13), the global numbers of partition boundary vertices are generated by first assigning an owner to one of the processors holding a copy of the vertex.Given a sorted list  of processors that hold a vertex with barycentric global ID [(, ), (, ), (, )] the th processor in the list is calculated as follows:  = ( +  +  +  +  + ) mod || . ( Here, || is the number of holder processors of the vertex.This method of determining owners is similar to our earlier approach given in [17].Once owner processors are determined, owner processors generate global numbers for each vertex they own.A prescan operation is carried out among processors to compute prefix counts of owned vertices on each processor.From these counts the starting point of global numbering on each processor can be determined.Once global numbers for partition boundary vertices are determined, the owner processors then send these numbers to holder processors in step (14).In this way, all integer global IDs for all vertices (whether they are internal or on partition boundary) are determined.Finally, in step ( 15), the mesh is output in OpenFOAM format using these integer global IDs.As data structures, Standard Template Library (STL) maps are used to store various surface mesh data.The main ones are for the storage of geometry and partition boundary faces, global barycentric IDs of partition boundary vertices, and processor adjacencies.

Programming Issues
In this section, we consider programming issues that arose while developing PMSH.While PMSH runs, each processor meshes a different subgeometry in parallel during fine mesh generation.Since we are on a distributed memory machine, each processor has a different address space.The new fine mesh entities have completely different addresses and different orders of creation.As a result, matching these entities to their correspondents on other processors is nontrivial.This situation does not arise in mesh multiplication approaches, since in these approaches all tetrahedrons are refined uniformly and hence one can predict the addresses of newly created entities.Our barycentric coordinates approach helps us to resolve this programming issue in a simple way.
On a petascale machine, one wants to minimize expensive communication costs.In particular, one wants to reduce the number of messages.Large numbers of messages put on the communication network, besides introducing large latency costs, cause contention on the network (by also interfering with other users packets).As a result, one of our major objectives has been to reduce the number of messages.We do this by performing all levels of refinement before doing any communication.This, however, introduces a complication, because large numbers of new vertices are created on faces and edges and their orders of creation and addresses are completely different on neighboring processors.Again, our simple barycentric coordinates approach enables us to agglomerate multiple communication stages of refinement levels into a single communication stage, hence reducing the number of messages by a factor equal to the number of refinement levels.

Bug Fixes and Modifications Made in Netgen 5.1 Code.
During the development of this project, Netgen 5.1 was the most recent version.Our modifications and fixes have been made on this version.When we generated fine surface meshes after multiple levels of refinement and passed them to Netgen for fine volume generation (step (11) in the Algorithm 1), we encountered Netgen crashes on a few processors.This happened repeatedly and prevented us from obtaining massive meshes.We examined Netgen code and traced the bugs.A number of fixes as well as some other modifications that let us access some functions within Netgen were made to Netgen 5.1 code; these are described in the PMSH website at https://code.google.com/p/pmsh/.
The modifications enabled our tests to run successfully.Afterwards, we were able to generate multibillion element meshes.The bugs we discovered were also reported to the Netgen developer.

Tests and Results
We have meshed Onera M6 wing, sphere, and shaft geometries shown in Figure 4 on NTNU's Vilje system.Sphere is a simple geometry whereas Onera M6 wing and shaft are moderately complex geometries.Vilje is an SGI Altix ICE X system that has 936 nodes available to academic users.Each node has 2 Intel Xeon E5-2670 (Sandy Bridge) processors with a total of 16 cores and 28 GB of memory.Table 2 shows the results of the mesh generation tests we have performed.On Vilje, due to restrictions, a job can be allocated at most 14.3 TB total memory on up to 512 nodes.Therefore, we submitted tests that would satisfy this memory limit.If this limit was higher, we would have been able to run our tests on higher numbers of cores and generate even bigger meshes than what is shown in Table 2.
The columns times taken by coarse mesh T and fine mesh T  show the timings of steps (3)-(4) and steps ( 5)-( 14), respectively.Total timing is the summation of these two timings.Geometry input and mesh output are not included in the timings.Note that coarse mesh T is the initial mesh generated on each processor whereas fine mesh T  is the global mesh over all processors (i.e., union of all T   ).When looking at sequential mesh generation times of the coarse mesh, we see that on average 0.23 million elements/minute rate is achieved by Netgen.In [3], Löhner states that "typical speeds for the complete generation of a mesh (surface, mesh, improvement) on current Intel Xeon chips with 3.2 GHz and sufficient memory are of the order of 0.5-2.0 million elements/minute." Netgen's meshing rate on a single 2.6 GHz core is lower than this, but still it is a reasonable rate.
When looking at the parallel multibillion element mesh generation results, we see that within the 1K-8K core range scalability is achieved.Figure 5 plots mesh generation time curves for (|T|, |T  |) pairs.|T| and |T  | are the sizes of initial coarse and the final fine mesh, respectively.The inputs to our program are the following parameters: (i) maximum element edge length in the mesh, (ii) mesh grading factor between 0.0 and 1.0, and (iii) number of levels of refinement.
Parameters (i)-(ii) dictate the size of the coarse mesh and (i)-(iii) dictate the size of the final fine mesh.We can treat coarse and fine mesh sizes, that is, the pair (|T|, |T  |), as our problem size.Since, when changing the number of cores (partitions), the submesh sizes can change (since a new mesh is generated), we have defined ranges on (|T|, |T  |) and plotted mesh generation times for ranges of (|T|, |T  |).Hence, given a roughly fixed problem size, that is, a given range of the pair (|T|, |T  |), we see from Table 2 that execution times decrease as the number of cores is increased.Of course, as the ratio |T  |/ gets smaller and smaller, any further increase in the number of cores will lead to smaller and smaller reductions in execution time and will approach the time it takes to generate and partition the coarse mesh.The plot in Figure 6 shows the dependence of speedup on the number of processors and coarse mesh size to fine mesh size ratio.

Figure 1 :
Figure 1: Given coarse mesh volume refinements of the mesh without (a) or with (b) geometry information and remeshing after surface refinement with geometry information (c).

3 Figure 3 :
Figure 3: Integer based barycentric coordinates and global barycentric IDs when two levels of refinement are done on coarse mesh face with vertices 3, 5, and 8.

Table 1 :
Comparison of parallel mesh generation approaches.

Table 2 :
Continued.Fine T  Coarse T Fine T  Total