A Performance Study of a Dual Xeon-Phi Cluster for the Forward Modelling of Gravitational Fields

With at least 60 processing cores


Introduction
The Xeon-Phi coprocessor is one of the new architectures designed specifically for high performance computing (HPC).Currently, the Xeon-Phi is part of some supercomputers listed on the TOP 500 and will continue to be a fundamental component of these machines for the foreseeable future.The Xeon-Phi is a X86 multicore architecture with low power consumption and with a theoretical performance of one Teraflop, for which it utilizes 60 real processing cores, a 30 MB cache, and high band-width interconnection [1].
One of the current research activities is to analyse the features and benefits of each new technology that emerges across the field of supercomputers, which is based, today, on GPUs and coprocessors.Each new technology carries with it different programming paradigms.Nevertheless, the effort to migrate scientific applications to CUDA (for GPUs) or OpenCL (i.e., programming that requires programming low level kernels) is often much higher as compared to directivebased programming like OpenMP (for CPU or MIC) [2].Prior experiments which test the functionality of the Xeon-Phi coprocessor show that migrating scientific software is relatively easy, thereby making the MICs a promising tool for HPC applications [3].
One important point to be taken into consideration in order to understand the performance expected from the Xeon-Phi is that the multithreading, implemented in each core, is the key to reduce the inherent latency to a microarchitecture multicore [4].This multithreading should not be confused with hyperthreading which consists in rapidly commuting between two threads that use the same core and can be disabled via BIOS.Since the multithreading is inherent in the Xeon-Phi coprocessor (it cannot be disabled), the behaviour with different numbers of threads per processing core must be tested (from 1 to 4 threads per core).
Additionally, the numerical-based geophysical application that is migrated to the Xeon-Phi serves to extract more detailed information from the masses hidden in the subsoil by measuring the gravitational gradients.The gravity  has three elements that represent the components of gravity in the three orthogonal directions , ,  (  ,   ,   ).The gradient of  represents the gravitational field in tensor form and has nine components: (  ,   ,   ,   ,   ,   ,   ,   , and   ) as shown in Figure 1.
The cartography of the gravitational gradients improves conventional gravitational studies by offering a better resolution of the subsoil.For example, an airplane equipped with gradiometers that is flying over a mountain (excess of mass) or a subterranean saline body (deficit of mass) will show only subtle variations in the global gravitational attraction.In contrast, the measurements of the corresponding gravitational gradients will reveal the geological changes with greater detail (see Figure 2).The erratic movement of the plane creates considerable noise in one gravitational profile, while the measurement of the difference between two sensors to obtain the gradient allows elimination of the error source [5].
The forward modelling of the gravity gradiometry consists in the direct conformation of the gravimetric data, where the initial model of the source body is constructed from the geological and geophysical intuition.We calculate the anomalies of the model and compare them to the observed anomaly, after which, the parameters are adapted in order to improve the adjustment between both anomalies.These three steps for adjusting the properties of the body, namely, calculation of the anomalies, comparison of the anomalies, and adjustments between both anomalies, are repeated until the observed and calculated anomalies are sufficiently similar with some criterion of error [6].
The numerical application basically consists in adapting a set of rectangular prisms that approximate the volume of the mass in the subsoil.If they are chosen sufficiently small, each prism can be considered to be of constant density.Then, by the principle of superposition, the gravitational anomaly of the body in any point can be approximated by adding the effects of all of the prisms on that point.Although this method seems simple, reducing the size of the prisms to adjust the source body causes a considerable increase in computation time.Besides, although there are other types of approximations such as points of mass or tesseroids, often for simplicity sake, researchers prefer to generate the prisms to model a body.This requires parallel computing that can carry out the forward modelling as fast as possible [7].
Thus, the computational challenge consists, in essence, in calculating the gravimetric response produced by a rectangular prismatic body with constant density or constant magnetization with respect to a group of observation points (see Figure 3).The set of prisms is known as an assembly of prisms and it is not necessarily regular.An assembly of prisms can be configured in any orientation with the only prerequisite that they cannot overlap.Since the gravitational field (or magnetic field) meets with the superposition property with respect to the observation, if  is the calculated response in a point (, ), then the observed response in the point (, ) is given as [8]  (, ) = where  is the total number of prisms and  is the density of the prism.This problem was previously studied and it was determined that the best parallelization strategy is to carry out the solution via prisms since, in general, the number of prisms is much bigger than the number of observation points [9,10].The partitioning by observation points is efficient as long as there are not too many threads operating on the observation mesh at once.A false sharing could occur when the number of threads increases because the threads operate on the same memory locations.Another drawback of the observation points resides in the creating and closing of the parallel regions; in other words, for each prism, a function to calculate the anomalies is executed in parallel, but upon completion, the region is closed and it must be reopened for the next prism.This produces an unnecessary overload that affects performance (see Figure 4).
Therefore, if scalability is desired, the best parallelization is obtained via subdivision by prisms; in other words, the threads divide the work by the number of prisms.To avoid the problems of coherence in the cache, a different memory location must be created for each execution thread since it is no longer feasible to create only one memory space for only one observation mesh to be shared by all threads.As shown in Figure 5, an observation mesh for each execution thread must be created, thereby avoiding the problems of consistency in the memory and false sharing in the accessing of the memory, since each thread writes to a different space in memory.If only one mesh was used, there would be problems in accessing shared variables, thereby causing inconsistencies.
One of the characteristics of OpenMP is that the division of computing is performed in an implicit way.Therefore, the partition of  prisms that conform the body in the subsoil is performed automatically by a balance algorithm included in OpenMP.In this case, the decision is left to the compiler, which in 99% of the cases is optimal [11].
The main contributions of this work are (i) presenting a complete evaluation of the Phi coprocessors as integrated in the same node and in multiple nodes for the direct modelling of gravitational fields, thereby obtaining very acceptable performance results, (ii) utilizing OpenMP as the controller and communicator for the integrated boards in the same node and also as a distributor of computing within a nested parallelism scheme.Additionally, we utilize MPI to distribute the computing among nodes, (iii) constructing a geological model based on real data in order to test the computational tool and the resulting components delimit the dimensions of a saline body.

Description of the Forward Modelling Problem
The explorations via the gradiometry of gravity or full tensor gravity (FTG) are an exploration technique which is based on multiple variations of the gravitational field, variations which are measured via the local gradients in different directions.
The main advantage of FTG is an improved resolution of the shallow sources.In this way, it is possible to construct more precise models that can detail objects such as gas, oil, or mineral hydrocarbon reservoirs and deposits.
The FTG consists of 9 components; however, only 5 are registered as independent measures from various perspectives.These measures are related by the fact that they can be recorded from the same geological source.Thus, the cojoint process can provide a data response with improved location of the anomaly source.
Each one of the tensors describes an object in a particular way.The   tensor finds the source,   and   identify the limits of the source,   identifies the boundaries,   identifies the axes, both high and low, which in turn define the fault tendencies, and   shows the anomalies that point toward the center of the source mass.
In geophysics, traditional methods designed to estimate the acceleration of gravity begin from seismic and geological information.This information is the basis for the construction of a subsoil domain with a density distribution.The integration of this information is often parametrized in prisms also called voxels and, depending on the complexity of the geometry, can be approximated by the form of the model with a number and size of determined parameters.Often, this final configuration is the basis for inverse models [12][13][14].
The gradiometric response, or direct modelling, is obtained by way of an algorithm that requires the calculation of a significantly large number of prisms.Thus, the computational cost is considerably high if the purpose is to approximate complex geometries with small prisms.
The gravitational field can be expressed in terms of the gravity potential   (r) as The potential   by volume  has the following approximation [15]: where  is the universal constant of gravity,  is the density in the volume,  is the position of the observation point, and   is comprised of the elements of the volume given as V.
The first spatial derivative of the potential   (3) can be expressed as where  is any of the directions , , ; therefore, (r) in the different directions can be written as In order to calculate the tensors, we apply the second partial derivative to (4) to obtain therefore, the tensors can be approximated as For convenience, we can denote the gravitational tensor in matrix form in the following way: where the trace satisfies the Laplace equation (Trace(Γ) =   +   +   = 0).The response of the tensor component   can be approximated in a discrete way due to the fact that a rectangular prism of constant density () can be approximated as (see Figure 6) where and .The discrete approximation for the remaining components can be reviewed in [9].In Figure 7, we show the result of calculating the components of a prism measuring 200 m 3 at a height of 0.01 Km, with an observation mesh of 1 Km × 1 Km, and discretized every 20 m.

Design of the Parallel Algorithm
In terms of programmability, there are two ways to utilize the Xeon-Phi, (1) in offload mode where the main algorithm is executed in the conventional CPU or host, while the computationally expensive part of the code is transferred to the coprocessor, or (2) in native mode, where the algorithm is executed independently from the CPU or other coprocessors via the system bus.Although the performance is similar, the offload mode allows more flexibility in programming [16].
In order to program application software for the Xeon-Phi, a programming paradigm that allows parallelism is required; this is the case with a x86 SMP architecture.Thus, most of the same multicore CPU tools can be reused; specifically, tools such as Pthreads [17], OpenMP [18], Intel Cilk Plus [19], and OpenCL [20] are available in a transparent way for the architecture.An optimized version of MPI is also available.
Considering different alternatives that are available to develop parallel application software, and specifically for scientific applications, the OpenMP model has been proven to be an excellent tool [21,22].Even OpenMP transparency and its better performance in the MIC have been documented [23].Moreover, as mentioned in the Introduction, the application software was previously migrated to a cluster of conventional CPUs using a hybrid OpenMP-MPI methodology; hence, OpenMP will be the development model for the migration.
Currently, it is possible to integrate more than one Xeon-Phi board per node with the purpose of benefiting from the fast interconnection speeds obtained by integrating various devices on the same PCI bus.By sharing the same mainboard, the devices avoid having to move information through an external network, thereby reducing the latency and overload associated with parallelism.The additional benefit obtained from utilizing OpenMP is that it can be employed as a controller and communicator for the integrated boards at the same node in a nested parallelism scheme [24].
The proposed design is shown in Figure 8; it consists of creating a general observation mesh labeled as  shared , implemented as a bidimensional array in the memory of the CPU controlled by the master thread.Subsequently, two execution threads are created in order to address both Phi boards (one to control each board) with its own private observation mesh, all the while maintaining a shared reference in the global mesh  shared .These private meshes in each thread are labeled as   0 shared and   1 shared , and although they are private for the execution threads in the CPU, they will be shared for the threads which are created in the MIC.Next, the calculation threads in each of the MIC are created; each thread contains its own observation mesh  where the calculated gravimetric anomalies will be stored corresponding to the prisms assigned for processing.Once the calculation is completed, the sum over the   0 shared and   1 shared meshes is computed, followed finally by the reduction calculation over the global mesh  shared .The nested parallelism design mounted on the MIC boards is shown in Figure 8 and in the pseudocode in Algorithm 1. Fragments of the code in FORTRAN to show how is implemented the Algorithm 1 is showed in Listing 1.
Once the algorithm has been designed for one node of the cluster, the implementation must be extended to be used for more than one node.The option used is MPI to carry out the reductions of data between nodes.Since the parallelization in MPI is explicit, we must manually distribute the number of prisms to be processed through a modular function and the distribution is carried out in the following way:  is the number of prisms to be calculated,  is the number of the MPI processes created, and  is the MPI process number ( = 1, 2, 3, . . ., ( − 1)).We define the initial number of prisms to be processed by the MPI process  as  start and the final number as  end and  is the quotient of the number of prisms  between the total number of processes  and  as the remainder; the procedure to determine  start and  end continues as follows: Therefore, If  ̸ = 0 and  < , then we adjust as If  ̸ = 0 and  ≥ , then we adjust as In this way, we can distribute the number of prisms  over  MPI processes in a balanced manner; once  start and  end have been determined for each node (one process to one node), the procedure of the Algorithm 1 is applied to internally process the prisms.The design of the algorithm is shown in Figure 9.

Performance Validation Experiments
This section reports the results obtained from the performance experiments, first on a conventional Xeon CPU in serial and parallel form with the purpose of having a baseline for comparison with regards to quality and performance of the numerical solution obtained by the Xeon-Phi coprocessor.

Modelling a Real Prisms Assembly.
The application of the direct calculation of the FTG components is conducted on a portion of an exploration area in the Gulf of Mexico for which we have the geological information; however, we omit the exact location and the details of the zone due to business confidentiality.The objective is to locate the characteristics related to the distribution and geometry of the saline bodies while integrating the information from the zone (see Figure 10).It shows various horizons including the distribution of saline bodies, and in this way the geological information will determine the precise limits of the density, thereby determining the amplitudes of the anomalies in the area in question.
Basing the model on a Cartesian reference frame and positive depth, the geometry is composed of an assembly of prisms of variable density but with constant dimensions of 500 m in both  and  directions and 50 m in  direction.The domain measures 40.5 × 36.5 × 14.1 Km in the eastwest, north-south, and depth directions, respectively (see Figure 11).The total number of prisms that set up the model is 1,667,466 with an observation mesh of 321 × 289 = 92,769 points.The number of prisms with nonzero density is 1,466,424.
The characteristics of the computing node are the following: (i) Two Intel(R) Xeon(R) processors E5-2680 @ 2.70 Ghz.
All the process local meshes are combined together (reduced) via MPI_reduce to form the final global result in the master node (process 0)

Performance Experiments on One
Node.The node has a total of 16 real cores of the CPU; therefore, the algorithm will spawn from 1 to 16 execution threads to analyse the performance.The problem that is tackled has a total of 1, 466, 424 prisms of variable density against an observation Figure 12 shows the computing times required for 1 thread up to 16 threads to calculate the   component.The serial computing time, in other words, the execution time obtained with only one execution thread, to compute   is approximately 8 h 06 m.With this baseline time, if we calculated the entire tensor, we would need more than 48 h of time to compute all components.The required computing time would be unacceptable in an industrial environment where the delays have direct economic costs.Thus, the use of all available cores is necessary to reduce the computing time.Using all the available cores of the CPUs, we reduce the time by 14.11X (34 m 27 s) and consequently all the components could be calculated experimentally in about 3 h 30 min.Figures 12 and 13 show the behaviours in computing time and corresponding speed-ups.

2.00
Obtained speed-up Perfect speed-up (f = 0.01) Figure 13: The speed-up corresponding to the execution times of the CPU.The perfect speed-up is calculated using the Amdahl's law () = (1 + ( − 1) × 0.01); therefore, the fraction serial is estimated around 1%, thus obtaining a near-perfect speed-up.
Once we have conducted CPU experiments, the same data set is processed on the coprocessors.The first consideration is that the Xeon-Phi coprocessor supports the handling of 4 threads per processing core; nevertheless, it is always necessary to determine the optimal number of threads for the application under development since this optimality depends directly on both the type of algorithm and the memory handling within the application.In general, in the Phi architecture, more than one thread per processing core helps to hide the inherent latencies in the application.For example, while one thread is waiting on the assignment of a memory resource, others can be attended by the core.In the conventional Xeon architecture, various programmers have found that some applications for HPC do not regularly benefit from the hyperthreading technology [25,26] but the Phi technology is different, since the multithreading cannot be deactivated like the hyperthreading can.The most recommended method for a program designed in offload mode is to begin with the creation of various threads from as few as ( − 1) and up to 4 × ( − 1), where  is the number of physical cores in the Phi.Thus, the experiments that must be executed, as recommended directly by Intel, are to create (−1), 2×(−1), 3×(−1), and 4×(−1) threads.This set of experiments will determine if, with an increase in the number of threads handled by the core, there is still an improvement in the performance of the algorithm.Multiples of ( − 1) are employed rather than , because one thread is left available for operating system services.
The experiments that are carried out start from ( − 1) up to 8 × ( − 1).Since in this case the Phi coprocessor employed has 60 cores, the experiments will be conducted with 59, 118, 177, 236, 295, 354, 413, and 472 execution threads, using a balanced affinity.First, we carry out the experiments using only one Phi board, followed by using two boards with both Phi boards integrated on one mainboard.The great advantage of the implementation designed in this research is that it permits a transparent communication between the boards, in addition to being scalable, thereby implying that virtually all the Phi boards integrated on the same mainboard could be used.Table 1 shows the resulting computing times obtained and the corresponding speed-up taking the 59 threads in only one coprocessor as the baseline.Figures 14 and 15 show the behavior of the computing time metric and of the speed-up of the Phi boards.We can observe a near-perfect speedup 2.0X using the two coprocessors.
Table 1: Computation times when using the Xeon-Phi coprocessors.
The best results are obtained when 236 threads are created per coprocessor; it means that we are obtaining a gain using the multithreading (4 threads per core), but when the number of threads is increasing, only an overhead is created.The results indicate, as expected, that the best result occurred with 4 threads per core.Therefore, this application does benefit from utilizing the multithreading since it achieves an improvement of a factor of 2.47X per card if 4 threads per core (optimal number per core) are used, as observed in Figure 15.If the number of threads is taken above 4, then the performance is influenced by the overloading of the threads per core and by the affinity.
Of the obtained results, and taking the best performance results shown in Table 1, we can observe that a Xeon-Phi coprocessor is 13X faster with respect to the serial version and 0.92X faster with respect to the original parallel version on the CPU.The two coprocessors working together are 25.91X faster with respect to the serial version and 1.84X faster with respect to the parallel CPU version.This implies that one Xeon-Phi coprocessor affords a similar performance as the 16 cores offered by the Dual Xeon CPU. Figure 16: Behaviour of the computing time obtained for the estimation of   when using 10 nodes of the cluster (the scale is semilogarithmic base 10 on the time axis).The curves compare the behaviour of the times obtained using 16 threads on the CPU versus 236 threads for each Xeon-Phi (these are the best results for both cases).As expected, the behaviour is not completely linear (but almost) as we can see when the speed-up is calculated (see Figure 17).  2 shows the times obtained from using the Dual Xeon CPU and the Dual Xeon-Phi.Figures 16 and 17 show the behavior of the computing time and the speed-up corresponding to the information in Table 2. Figure 17 shows that the behaviour of the speed-up when using the Dual Xeon CPU in the nodes is quasiperfect meaning that the communication overload is virtually nil.When using the Dual Xeon-Phi, however, the overload is greater and the behaviour deviates from the perfect speed up of 14.16, thereby reducing the serial time by a factor of 199.83X; it is important to clarify that this final factor was Figure 17: Behaviour of the speed-up when using ten nodes of the cluster.The serial fraction using the Xeon CPU is around 1% and for the Xeon-Phi, 4%; this means that the Xeon-Phi nodes introduce more overhead although they are faster; the speed-ups are calculated using their corresponding baseline, that is, the computation time used by one node.calculated using one core of the CPU versus the 20 Xeon-Phi; therefore, it is not very formal but still could be useful.

Validation of the Numerical Code.
The main objective of parallel programming is to decompose the program into appropriate components that can be executed simultaneously, thereby reducing the overall execution time.Although execution time is the primary goal, the implementation must be validated as well, since programming errors, inherent to parallelism, may occur.We use the error of the L2 norm to calculate the numerical differences between the different platforms with respect to the component   taking as baseline the current serial version.We omit the differences for the remainder of the components since they are likely similar.The errors are shown in Table 3.
From the obtained errors, we show that there are no significant numerical differences; therefore, the version of the application that is migrated to the Xeon-Phi has a correct implementation.

Discussion about the Forward Modelling Results
The FTG data differ in many ways from the conventional high-resolution gravity data.One important difference is the high density during data collection.This increase in the sensitivity allows a much greater resolution and is the reason why the data can be incorporated successfully into the interpretation of a gas or oil hydrocarbon reservoir.The tensor results obtained from  are shown in Figure 18; these results offer an important understanding into the definition of the geometry of bodies in order to characterize the structures of economic interest, since the FTG data were able to detect superficial bodies, bodies with measured and exact amplitudes in a range of −10 to 7 Eotvos.It is also possible to identify the principle tendencies of the geological structures in the study area and variations in the densities present in these anomalies.The resolution obtained with this technology is reasonably good, but we expect to improve these methods in the near future via better acquisition techniques and the ability to obtain even better resolution.In this way, we will be able to characterize the exploratory reservoirs in deep waters with better precision and definition.

Conclusions
This work implements and validates a parallel design based on a nested parallelization scheme for the calculation of gravitational components in OpenMP + MPI on a Xeon-Phi cluster architecture.The numerical experiments and the performance metrics validate that the implementation is efficient and leads to results which are very close to perfect speed-ups.
The results show a good design and adequate handling of the memory in order to exploit the benefits of the Xeon-Phi.In this case, the nested design in OpenMP allowed a transparent communication between the coprocessors, while the scheme of utilizing different regions of the memory per thread (observation mesh) was very advantageous with respect to the performance.This design benefits from the multithreading technology found on the Xeon-Phi, thereby improving the performance of this application by 147%.
With respect to the results obtained from the application software, the results provide very important insights into the geometry body definition.These insights help characterize the structures that are of financial importance since the algorithm is able to detect superficial bodies that are less than 300 m wide and with less than 7 Eotvos of anomaly.The results also identified the principle tendencies of the geological structures of the study area and the variations in density which these anomalies present.The resolution achieved via this technology is reasonably good.In the future, however, we expect that the processing and acquisition methods can be further improved and perfected in order to achieve even better resolution, thereby characterizing the exploratory oil reservoirs in deep waters with better precision and definition.Without a doubt, these high performance computing applications will be a requisite part of the solution to deep water exploration.

Figure 1 :
Figure 1: Vector and tensor relationships for the gravitational field in the three coordinate directions.

Figure 2 :Figure 3 :Figure 4 :Figure 5 :
Figure 2: Simplified acquisition for recording the response of a geological signal by airborne gravity gradiometry.

Figure 6 :
Figure 6: Delimitation of the rectangular prism in the Cartesian plane.

Figure 7 :
Figure 7: Gravity gradient response for a prism buried a depth of 100 m.Each side having a length of 200 m and constant density contrast of 100 kg/m 3 .

Figure 8 :
Figure 8: Design of the algorithm for only one node utilizing a nested parallelism model.

Figure 9 :Figure 10 :
Figure 9: Design of the software application for clustering.

3 )Figure 11 :
Figure 11: Synthetic 3D model created with prisms.The model is limited with geological horizons (Figure 10) and constant density layers were adjusted.

Figure 12 :
Figure12: Behaviour of the computation times on the Xeon Dual CPU.One thread is mapped to one core using a compact affinity over the sockets.

Figure 15 :
Figure15: Behavior of the speed-up obtained when calculating   on the Xeon-Phi coprocessors.We can observe a near-perfect speedup 2.0X using the two coprocessors.
Number of nodes Speed-up Dual Xeon-Phi Perfect speed-up (f = 0.04) Speed-up Dual Xeon CPU Perfect speed-up (f = 0.01)

Figure 18 :
Figure 18: Free-air tensor components calculated from the geological horizons corresponding to Figure 11.
Behavior of the computing time used when calculating   on the Xeon-Phi coprocessors, as we can observe the reduction of the time is almost 2X using the two coprocessors.

Table 2 :
Computing times and corresponding speed-up obtained when calculating   on the Xeon Dual CPU and the Dual Xeon-Phi while using ten nodes of the cluster.The second speed-up listed in the Phi column is based on the time of the Dual Xeon CPU as a baseline.

Table 3 :
Errors of the L2 normal for different platforms with respect to the serial version.