The objective of this study is to evaluate the performance of Intel Xeon Phi hardware accelerators for Geant4 simulations, especially for multithreaded applications. We present a complete methodology to guide users through the compilation of their Geant4 applications on Phi processors. We then propose a series of benchmarks comparing the performance of Xeon CPUs and Phi processors for a Geant4 example dedicated to the simulation of electron dose point kernels, the TestEm12 example. First, we compare a distributed execution of the sequential version of the Geant4 example on both architectures before evaluating its multithreaded version. While Phi processors demonstrated their ability to reduce computing time (by up to a factor of 3.83) when distributing sequential Geant4 simulations, we do not reach the same level of speedup with the multithreaded version of the Geant4 example.
Monte Carlo simulations have become an indispensable tool for radiation transport calculations in a great variety of applications. The Geant4 simulation toolkit [
Some authors [
In this paper, we describe in detail the methodology followed for the compilation of the Geant4 software and its dependencies, in order to guide any user willing to exploit the expected computing potential of Xeon Phi for time-consuming simulations. We then propose, as benchmarks, a set of computing tests performed with the Geant4 extended example “TestEm12” in order to assess the suitability of the Xeon Phi architecture for such simulations.
The simulations have been performed on a machine equipped with two Intel Xeon E5-2690v2 CPUs (
The present work was performed with version 10.0p01 of Geant4 [
Dependencies of the Geant4 10.0.p01 software. Libraries in grey are not used for our parallel benchmark.
Each compilation process (native build or cross-compilation) has to be run on a Xeon (x86_64) processor using the Intel C Compiler (ICC) version 14.0.3 (compliant with GCC 4.8.1) so that the resulting binaries can later be launched on the Phi coprocessor architecture. For CMake-based builds, cross-compilation for Xeon Phi accelerators is activated with the CMake flag “
For libraries built with a configure script, cross-compilation for Xeon Phi accelerators is activated with the flag “
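As an illustration only (the exact flag names used in this work are elided above), cross-compilation for the Phi with ICC 14 typically goes through the -mmic option; the sketch below is a hypothetical invocation in which the install prefixes, the source directory, and the --host triplet are assumptions rather than the values used in this study.

```sh
# Hypothetical CMake cross-compilation sketch for the Phi (MIC) target.
# The -mmic option of ICC generates native MIC binaries; paths are placeholders.
cmake ../geant4.10.00.p01 \
  -DCMAKE_C_COMPILER=icc \
  -DCMAKE_CXX_COMPILER=icpc \
  -DCMAKE_C_FLAGS="-mmic" \
  -DCMAKE_CXX_FLAGS="-mmic" \
  -DCMAKE_INSTALL_PREFIX=/opt/geant4-mic

# Equivalent sketch for a library shipping a configure script (e.g. Xerces):
./configure --host=x86_64-k1om-linux --prefix=/opt/xerces-mic \
            CC=icc CXX=icpc CFLAGS="-mmic" CXXFLAGS="-mmic"
```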
In order to compile the Geant4 toolkit and its dependencies, specific environment variables must be set. It is mandatory to append the corresponding library paths to the LD_LIBRARY_PATH variable and the executable binaries to the PATH variable. For our compilation, we created a
All variables whose names start with G4 are used to set the paths to the Geant4 data libraries (photon evaporation data, radioactive decay data, particle cross sections for different energy ranges, cross section data for impact ionization, nuclear shell effects, optical surface reflectance, and nuclide state properties). Note that, since the bash and tcsh Unix shells are not supported on the Phi coprocessor, environment variables have to be set using a basic sh script file; this file is then sourced using the command line “
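A minimal environment file, sketched below, illustrates this setup; the installation prefix, data set directory names, and version numbers are placeholders chosen for the example (the G4* variable names follow the usual Geant4 10.0 conventions), not the exact paths used in this work.

```sh
# geant4-env.sh -- minimal environment sketch (all paths are placeholders).
# On the coprocessor it is sourced with: . ./geant4-env.sh
export G4INSTALL=/opt/geant4-mic
export PATH=$G4INSTALL/bin:$PATH
export LD_LIBRARY_PATH=$G4INSTALL/lib64:$LD_LIBRARY_PATH

# Geant4 data sets (photon evaporation, radioactive decay, cross sections, ...);
# version numbers are indicative of the 10.0 series.
export G4LEVELGAMMADATA=$G4INSTALL/data/PhotonEvaporation3.0
export G4RADIOACTIVEDATA=$G4INSTALL/data/RadioactiveDecay4.0
export G4LEDATA=$G4INSTALL/data/G4EMLOW6.35
export G4PIIDATA=$G4INSTALL/data/G4PII1.3
export G4REALSURFACEDATA=$G4INSTALL/data/RealSurface1.0
export G4ENSDFSTATEDATA=$G4INSTALL/data/G4ENSDFSTATE1.0
```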
The methodology followed for cross-compiling the Geant4 toolkit and its dependencies is inspired by a preliminary work [
Xerces and Expat libraries have been compiled using the configuration instructions specified in Scripts
Script
Finally, Script
It should be noted that we used the “
The Geant4 extended example TestEm12 (accessible at $G4INSTALL/examples/extended) was migrated to enable multithreaded computing. To compile TestEm12 for Xeon Phi coprocessors, the listed CMake instructions have to be used (see Script
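For illustration, a hypothetical out-of-source build of TestEm12 against the MIC cross-compiled Geant4 installation could look as follows; the directory layout and the Geant4_DIR value are assumptions, not the exact script used here.

```sh
# Hypothetical TestEm12 build against the Phi (MIC) installation of Geant4.
mkdir TestEm12-mic && cd TestEm12-mic
cmake $G4INSTALL/share/Geant4-10.0.1/examples/extended/electromagnetic/TestEm12 \
  -DGeant4_DIR=$G4INSTALL/lib64/Geant4-10.0.1 \
  -DCMAKE_C_COMPILER=icc  -DCMAKE_CXX_COMPILER=icpc \
  -DCMAKE_C_FLAGS="-mmic" -DCMAKE_CXX_FLAGS="-mmic"
make -j 16
```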
This example, already validated by authors against other Monte Carlo codes [
In order to verify the correctness of the cross-compilation of Geant4 and its dependencies, we tested the reproducibility of the results of the TestEm12 electromagnetic example on 1 Xeon thread and of its multithreaded version TestEm12MT running on 40 Xeon threads and 240 Phi threads, with and without the optimized compilation flag “
Comparison of energy depositions for monoenergetic electrons with an initial energy of 4 MeV, calculated with the TestEm12 Geant4 example on a single Xeon CPU (black line) and with the multithreaded TestEm12 example on 40 Xeon threads (grey dashed line), on 240 Phi threads (grey circles), and on 240 Phi threads with the optimized compilation process (black circles). Perfect agreement is observed between the reference example on a single CPU and the optimized compilation on the Phi accelerator.
Geant4 simulations, through the TestEm12 extended example, have been tested on Xeon Phi accelerators in distributed (TestEm12) and multithreaded (TestEm12MT) modes. Prior to any computational tests, we profiled the TestEm12 example using the Intel VTune toolkit in order to quantify its memory bandwidth consumption. We concluded that the simulation is highly compute-bound.
In this study, the “distributed” mode means launching several independent simulation instances at the same time, without any communication between instances. In the “multithreaded” mode, simulations are launched using the pthread library. In both cases, the workload is balanced by spreading the number of particles equally among the workers.
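In the multithreaded mode, the number of worker threads is selected at run time; the sketch below assumes the standard /run/numberOfThreads UI command of Geant4 10.0 multithreading and uses placeholder values for the thread count and the number of primaries.

```sh
# Hypothetical multithreaded run: the thread count is set in the Geant4 macro.
cat > run_mt.mac <<'EOF'
# worker threads (Geant4 >= 10.0 in MT mode)
/run/numberOfThreads 240
/run/initialize
# 1e8 primary electrons
/run/beamOn 100000000
EOF
./TestEm12 run_mt.mac
```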
For the distributed mode, the total number of particles is split between the multiple run instances as described by the authors in [
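As a hedged sketch of such a distributed launch (not the exact scripts of this work), the total number of primaries can be divided among independent sequential instances, each receiving its own seed and output file; run_template.mac, with its @SEED@ and @NEVT@ placeholders, is hypothetical, and any remainder of the integer division is ignored here.

```sh
# Hypothetical distributed launch: NTOT primaries split over NPROC instances.
NTOT=100000000   # 1e8 primaries in total
NPROC=60         # independent sequential instances (e.g. one per Phi core)
NEVT=$((NTOT / NPROC))
for i in $(seq 1 $NPROC); do
  # substitute the seed and the per-instance number of events in the macro
  sed -e "s/@SEED@/$i/" -e "s/@NEVT@/$NEVT/" run_template.mac > run_$i.mac
  ./TestEm12 run_$i.mac > out_$i.log 2>&1 &
done
wait
```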
Table
Description of number of partitions (or threads) used for benchmarking on Xeon and Phi coprocessors.
| Number of partitions | Xeon | Phi |
|---|---|---|
| TestEm12 example | 1, 10, 20 | 1, 60, 240, 960 |
| TestEm12MT example | 10, 20, 40 | 10, 20, 40, 60, 120, 240, 960 |
The Xeon Phi hardware was used in “native mode,” meaning that whole simulations were executed directly on the Xeon Phi, started on the card over SSH.
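As an illustration of such a native launch (the host name and paths are placeholders; mic0 is the conventional name of the first coprocessor), the binaries and macros can be copied to the card and started over SSH:

```sh
# Hypothetical native-mode launch on the coprocessor.
scp TestEm12 run_mt.mac geant4-env.sh mic0:/tmp/testem12/
ssh mic0 'cd /tmp/testem12 && . ./geant4-env.sh && ./TestEm12 run_mt.mac > run.log 2>&1'
```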
Speedup was evaluated for distributed TestEm12 simulations for generated source particles going from 10³ to 10⁸. Figure
Computing time in minutes of TestEm12 Geant4 example running on 1 Xeon thread (blue) and 1 Phi thread (green) for generated source particles going from 10³ to 10⁷. Speedup (red) is indicated for each number of particles.
In order to verify the performances claimed by Intel for Xeon and Phi [
Computing time in minutes of TestEm12 Geant4 example distributed on 10 Xeon (blue) cores and 60 Phi (green) cores for generated source particles going from 10³ to 10⁷. Speedup (red) is indicated for each number of particles.
When using the total number of threads available on 1 Xeon and 1 Phi, respectively 20 and 240 threads (see Figure
Computing time in minutes of TestEm12 Geant4 example running on 20 Xeon (blue) threads and 240 Phi (green) threads for generated source particles going from 10³ to 10⁸. Speedup (red) is indicated for each number of particles.
Speedup was evaluated for three different numbers of threads, 10, 20, and 40, corresponding, respectively, to the number of hardware cores of one Xeon CPU, the number of hardware cores of 2 Xeon CPUs, and the number of threads of 2 Xeon CPUs. The goal was to evaluate the potential impact of using the multithreaded version compared to the distributed one on Xeon processors. This case study, presented in Figure
Comparison of computing time in minutes of TestEm12 and TestEm12MT Geant4 examples running on 10 (blue), 20 (green), and 40 (purple) Xeon threads for generated source particles going from 10⁵ to 10⁸. Speedup (red) is indicated for each number of particles and the three test cases.
In order to evaluate whether a high number of threads significantly reduces the execution time of TestEm12MT on the Phi, whatever the number of particles generated, we plotted in Figure
Comparison of computing times in minutes of TestEm12MT Geant4 example running on 10, 20, 40, 60, 120, and 240 Phi threads for generated source particles going from 10⁵ to 10⁸.
We can remark that the higher the number of generated source particles, the higher the number of threads must be to reduce the execution time. For 10⁸ particles, we obtain an almost linear reduction of computing time with the number of threads (up to 60 threads), as also shown in Table
Speedups obtained for 10, 20, 40, 60, 120, and 240 Phi threads for 10⁸ particles, compared to 1 Phi thread.
| Number of threads | 10 | 20 | 40 | 60 | 120 | 240 |
|---|---|---|---|---|---|---|
| Speedup | 10.00 | 19.96 | 39.83 | 60.11 | 94.02 | 114.20 |
The speedup was evaluated for TestEm12MT simulations using the standard EM physics list, for generated source particles going from 10⁵ to 10⁸, running on 40 Xeon threads and compared with the best computing time obtained on Phi threads. It can be observed that, whatever the number of particles generated, the Phi provides longer execution times (see Figure
Computing time in minutes of TestEm12MT Geant4 example running on 40 Xeon threads (blue) and the best computing time obtained with Phi threads (green) for generated source particles going from 10⁵ to 10⁸. Speedup (red) is indicated for each number of particles.
In Table
Computing time obtained for 10⁸ and 10⁹ particles on 40 Xeon threads and 960 Phi threads.
| Number of particles | 40 Xeon threads (mins) | 960 Phi threads (mins) |
|---|---|---|
| 10⁸ | 40.6 | 49.9 |
| 10⁹ | 407.5 | 390.4 |
When using 960 Phi threads, the computing time reaches 49.9 minutes for 10⁸ particles, which is 23% higher than with 40 Xeon threads (40.6 minutes). However, when reaching 10⁹ particles, the computing time on 960 Phi threads finally drops below that of the Xeon; for this last configuration, we obtain a speedup of 1.04.
The objective of this paper was first to detail a clear and understandable methodology to compile and execute any Geant4 application on Xeon Phi accelerators. Special attention should be paid to the use of the optimization compilation flag “
Then, the ambition of the authors was to evaluate the performance of Xeon Phi accelerators for such applications, especially given the availability of the multithreaded version of the Geant4 toolkit. We remind the reader that, as a first step, no tuning of the source code was undertaken in this study. Regarding the different outcomes obtained, we may draw the following conclusions. When distributing sequential Geant4 simulations (40 Xeon threads compared to 240 Phi threads), the Phi (5110P at 1 GHz) is faster than the Xeon CPUs (E5-2690v2 at 3 GHz), almost reaching the maximum expected speedup (3.83x versus 4.2x), even though only limited optimization was applied in order to preserve the precision of the final results. When considering multithreaded Geant4 simulations on Xeon CPUs, this version is unfortunately slightly slower than the classical distribution of sequential Geant4 simulations, whatever the number of threads used. Finally, even if we observe a loss of performance for the multithreaded version of Geant4 on the Phi compared to Xeon CPUs, it should be noted that, with a high number of particles (corresponding to more than 6 hours of computing on 40 Xeon threads for 10⁹ particles), we finally reach a very small speedup of 1.04 using 960 Phi threads.
For the moment, we can state that the multithreaded version of Geant4 is not yet optimized enough to compete with a distributed submission of simulations on a farm of CPU clusters, on a cluster of Phi hardware accelerators, or on a grid infrastructure. Making such applications fully compliant with Xeon Phi architectures would certainly require drastic tuning of the source code and the suppression of any verbose output. One may expect that the next generation of the Geant toolkit (Geant5) will address these issues.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This research was carried out as part of the Laboratory of Excellence ClerVolc project. The authors wish to thank the Geant4 collaboration for its technical support.