An Optimized Parallel FDTD Topology for Challenging Electromagnetic Simulations on Supercomputers

It may not be a challenge to run a Finite-Difference Time-Domain (FDTD) code for electromagnetic simulations on a supercomputer with more than 10,000 CPU cores; however, making the FDTD code work with the highest efficiency is a challenge. In this paper, the performance of parallel FDTD is optimized through the MPI (message passing interface) virtual topology, based on which a communication model is established. The general rules for the optimal topology are presented according to the model. The performance of the method is tested and analyzed on three high-performance computing platforms with different architectures in China. Simulations including an airplane with a 700-wavelength wingspan and a complex microstrip antenna array with nearly 2000 elements are performed very efficiently using a maximum of 10240 CPU cores.


Introduction
The principle of FDTD is that the calculation region is discretized with the Yee grid so that the E and H components are distributed alternately in time and space [1]. Then, there are four H (or E) components around each E (or H) component. This character makes the algorithm parallel in nature, and with it Maxwell's equations can be transformed into a set of difference equations. The electromagnetic fields are solved step by step along the time axis, so the field distribution at each later time step can be obtained from the initial values and the boundary conditions [2].
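For readers unfamiliar with the leapfrog update described above, the following one-dimensional sketch illustrates it; it is a generic textbook-style example, not the code used in this paper, and the grid size, update coefficients, and source are arbitrary assumptions.

```c
#include <stdio.h>
#include <math.h>

#define NX 200          /* number of Yee cells (assumed) */
#define NSTEPS 500      /* number of time steps (assumed) */

int main(void)
{
    static double ez[NX + 1], hy[NX];   /* staggered E and H components, zero-initialized */
    double ce = 0.5, ch = 0.5;          /* Courant-stable normalized update coefficients */

    for (int n = 0; n < NSTEPS; n++) {
        /* update H from the two surrounding E values (half a step later in time) */
        for (int i = 0; i < NX; i++)
            hy[i] += ch * (ez[i + 1] - ez[i]);

        /* update E from the two surrounding H values */
        for (int i = 1; i < NX; i++)
            ez[i] += ce * (hy[i] - hy[i - 1]);

        /* simple hard source in the middle of the grid (assumed excitation) */
        ez[NX / 2] = sin(0.1 * n);
    }
    printf("ez at probe point: %f\n", ez[NX / 4]);
    return 0;
}
```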
Research on MPI-based parallel FDTD for simulating complicated models has been published over the past decade. In 2001, Volakis et al. presented a parallel FDTD algorithm using the MPI library, in which they proposed an MPI Cartesian 2D topology [3]. Andersson developed parallel FDTD with a 3D MPI topology in the same year [4]. In 2005, the authors studied the optimum virtual topology for the MPI-based parallel conformal FDTD algorithm on PC clusters [5][6][7]. In 2008, Yu et al. successfully tested the parallel FDTD [8] on the BlueGene/L supercomputer and reported its parallel efficiency on 4000 cores under balanced loads.
Although there are many publications on parallel FDTD, few of them involve parallel FDTD simulations utilizing more than 10,000 cores. Most of the papers focus on load balancing when parallel efficiency is concerned; in addition, a more precise rule for achieving the best performance still needs to be given, especially for simulations using tens of thousands of CPU cores on supercomputers.
With these concerns in mind, the influence of different virtual topology schemes on the parallel performance of FDTD is studied in this paper through a theoretical model analysis. Tests are then made at the National Supercomputer Center in Tianjin (NSCC-TJ) and the National Supercomputing Center in Shenzhen (NSCC-SZ) to verify the feasibility of the theory. With the proposed model, electrically large problems with parallel scales of up to 10240 cores are presented, and the parallel efficiency is nearly 80% when 10240 cores of SSC are utilized for an array with nearly 2000 elements. To the best of our knowledge, the proposed method achieves one of the best efficiencies ever reached using more than 10,000 CPU cores.

Computation Resources from Supercomputers
The program is tested on different clusters at three supercomputer centers: National Supercomputer Center in Tianjin (NSCC-TJ) [9], National Supercomputing Center in Shenzhen (NSCC-SZ) [10], and Shanghai Supercomputer Center (SSC) [11]. The parameters of the computation resources used in this paper are listed in Table 1.

Communication Model for Parallel FDTD
Communication is the main factor that affects the performance of parallel codes. Therefore, reducing the amount of communication in FDTD by adjusting the virtual topology is selected as the optimization target. Assume that the communication time in one time step is

T = n·t_d + W/v, (1)

where t_d is the communication delay time, n is the communication number, v is the transmission speed, and W is the communication data amount of E or H. The parameters are calculated as

n = 2[(P_x − 1)·P_y·P_z + (P_y − 1)·P_x·P_z + (P_z − 1)·P_x·P_y], (2)

W = (P_x − 1)·N_y·N_z + (P_y − 1)·N_x·N_z + (P_z − 1)·N_x·N_y, (3)

where P_x, P_y, and P_z are the topology values in the three directions and N_x, N_y, and N_z are the grid numbers in the x, y, and z directions.
From (1), it is known that, even when the total communication data amount is the same, different topology schemes may lead to different communication numbers n and hence to different total times T. Take the Dawning 5000A as an example, with the parameters t_d = 1.8 us ∼ 2.5 us and 1/v = 1/(1.6563 Gb/s) [12]. Assume that the total grid is 1000 × 1000 × 1000 and the total number of cores is 1000; then the total communication delay time (9.72 ms) is about an order of magnitude less than the total communication time (121 ms). At this scale of cores, the communication delay time is therefore the secondary factor.
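To make the model concrete, the following sketch evaluates (1)-(3) for a given topology and grid, using the latency and transmission-speed values quoted above; it is not the authors' code, and the number of bits exchanged per interface grid is an assumed placeholder.

```c
#include <stdio.h>

/* Sketch of the communication model (1)-(3).
 * px, py, pz: virtual topology; nx, ny, nz: total FDTD grids.
 * t_d: latency per message [s]; v: transmission speed [bit/s];
 * bits_per_grid: assumed data size per interface grid. */
double comm_time(int px, int py, int pz, long nx, long ny, long nz,
                 double t_d, double v, double bits_per_grid)
{
    /* (2): messages per time step, E and H exchanged across every interface */
    long n = 2L * ((px - 1) * (long)py * pz +
                   (py - 1) * (long)px * pz +
                   (pz - 1) * (long)px * py);

    /* (3): total grids on all subdomain interfaces */
    long w = (px - 1) * ny * nz + (py - 1) * nx * nz + (pz - 1) * nx * ny;

    return n * t_d + w * bits_per_grid / v;   /* equation (1) */
}

int main(void)
{
    /* example from the text: 1000^3 grids on 1000 cores as a 10 x 10 x 10 topology */
    double t = comm_time(10, 10, 10, 1000, 1000, 1000,
                         1.8e-6, 1.6563e9, 32.0 /* assumed bits per grid */);
    printf("estimated communication time per step: %.4f s\n", t);
    return 0;
}
```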
The average communication amount of a single process is

W_ave = W/(P_x·P_y·P_z). (4)

Dividing (4) by the constant N_x·N_y·N_z gives

W_ave/(N_x·N_y·N_z) = [(P_x − 1)/N_x + (P_y − 1)/N_y + (P_z − 1)/N_z]/(P_x·P_y·P_z). (5)

From (5), it is known that if and only if (P_x − 1)/N_x = (P_y − 1)/N_y = (P_z − 1)/N_z, namely, when the topology is conformal to the calculation region, the communication amount of a single process is the least. Generally, this condition cannot be satisfied exactly, so the topology should be divided along the directions in a way that is as conformal to the calculation region as possible, making (5) as small as possible.
Generally speaking, the communication time between processes within one node is less than that between processes belonging to different nodes [12, 13]; that is, the per-byte communication time is different for intra-node and cross-node communication. So, when the factors n and W are the same for two different topologies, the amount of cross-node communication needs to be considered.
For a given number of grids, the total memory requirement (denoted M) is the same for different topologies. The memory of each process (denoted M_ave) is

M_ave = M/(P_x·P_y·P_z). (6)

Equation (6) indicates that the memory of each process is unrelated to the virtual topology. From the analysis above, it is known that the communication surface area varies with the virtual topology scheme for a given grid, so the communication time changes with the virtual topology scheme while the memory of each process remains the same. Thus, the communication amount is the main factor that affects the parallel performance.

In the following, different virtual topology schemes are tested on two supercomputer center platforms, National Supercomputer Center in Tianjin (NSCC-TJ) and National Supercomputing Center in Shenzhen (NSCC-SZ), as listed in Table 1.
Actually, the model itself requires only 200 × 200 × 50 grids. However, to test the influence of different virtual topology schemes on the parallel performance of the FDTD code, the computational space needs to be extended. So, in this test, the total number of grids is set to 1200 × 1200 × 300.
The radiation patterns of the microstrip array are shown in Figure 2 and compared with the results obtained from HFSS. The figure shows that the two results are in good agreement.

Discussion of Parallel Performance.
Here, several groups of virtual topology schemes are selected for testing. The following are the test results on the two supercomputer center platforms.

NSCC-TJ.

Table 2 compares the total computation time (in seconds) for different numbers of CPU cores and different virtual topology schemes; the maximum number of CPU cores used in this test is 120. In Table 2, virtual topology schemes are denoted as (P_x × P_y × P_z) for all three communication patterns. If the value is 1 in some direction, there is no decomposition in that direction. For example, 2 × 1 × 1 means that there is no decomposition in the y and z directions, so the virtual topology is actually one-dimensional. Similarly, 8 × 8 × 1 means that there is no decomposition in the z direction, so the virtual topology is actually two-dimensional. In our work, one process uses one CPU core.
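For reference, this kind of Cartesian virtual topology can be created with the standard MPI topology routines, as in the following generic sketch (not the authors' code); the 8 × 8 × 1 scheme from Table 2 is used as the example, so it must be run with 64 processes.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* e.g., the 8 x 8 x 1 scheme of Table 2: no decomposition along z */
    int dims[3] = {8, 8, 1};
    int periods[3] = {0, 0, 0};          /* the FDTD domain is not periodic */

    if (size != dims[0] * dims[1] * dims[2]) {
        fprintf(stderr, "run with %d processes\n", dims[0] * dims[1] * dims[2]);
        MPI_Finalize();
        return 1;
    }

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    int cart_rank, coords[3];
    MPI_Comm_rank(cart, &cart_rank);
    MPI_Cart_coords(cart, cart_rank, 3, coords);

    /* neighbours for the E/H surface exchange in each direction */
    int xlo, xhi, ylo, yhi, zlo, zhi;
    MPI_Cart_shift(cart, 0, 1, &xlo, &xhi);
    MPI_Cart_shift(cart, 1, 1, &ylo, &yhi);
    MPI_Cart_shift(cart, 2, 1, &zlo, &zhi);

    printf("rank %d -> coords (%d,%d,%d), x-neighbours %d/%d\n",
           cart_rank, coords[0], coords[1], coords[2], xlo, xhi);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```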
The speedup and parallel efficiency of the code are shown in Figure 3. From Figure 3, it can be seen that the parallel efficiency reaches up to 80% on NSCC-TJ.
From Table 2, it is obvious that increasing the number of CPU cores rapidly reduces the computation time. However, different virtual topology schemes cost different computation times even when the code is run with the same number of processes. The parallel performance of the parallel FDTD code is discussed next.
Here, the cases of 96 and 120 cores are taken as examples, and the amount of communication grids of each topology is calculated from (3). From these results, it is known that, for virtual topologies of the same dimensionality, the topology with fewer total grids at the interfaces (i.e., a smaller amount of communication data) saves computation time effectively. Meanwhile, the memory of each process is obviously the same for different topologies with the same number of CPU cores. However, it can also be seen that, even for the same amount of communication grids, the computation times differ to a certain extent. For example, the topologies 6 × 5 × 4 and 5 × 6 × 4 have the same number of communication grids, but their computation times are 1076.41 seconds and 724.90 seconds, respectively. Generally, the communication time between processes within one node is believed to be less than that between processes across nodes [12, 13]. So, here we speculate that the difference between the two cases is caused by the different amounts of communication grids across nodes.
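As a quick check of the statement that the two schemes exchange the same number of interface grids, (3) can be applied to the 1200 × 1200 × 300 grid; the snippet below only illustrates this arithmetic and is not part of the authors' code.

```c
#include <stdio.h>

/* total interface grids of equation (3) for topology (px, py, pz) */
long comm_grids(int px, int py, int pz, long nx, long ny, long nz)
{
    return (px - 1) * ny * nz + (py - 1) * nx * nz + (pz - 1) * nx * ny;
}

int main(void)
{
    long nx = 1200, ny = 1200, nz = 300;
    printf("6 x 5 x 4: %ld interface grids\n", comm_grids(6, 5, 4, nx, ny, nz));
    printf("5 x 6 x 4: %ld interface grids\n", comm_grids(5, 6, 4, nx, ny, nz));
    /* both schemes give the same count, yet the measured times differ (Table 2) */
    return 0;
}
```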
In Figure 4, every two adjacent cells with different colors exchange data. The numbers 1 to 10 denote ten nodes, and adjacent cells with the same color belong to the same node. For (b), the adjacent columns transfer (1 + 2/6) × (1200 × 300) data across their interface planes, and the adjacent rows transfer 5 × (1200 × 300) data across theirs.

NSCC-SZ.

Table 3 compares the total computation time (in seconds) for different numbers of CPU cores and different virtual topology schemes; the maximum number of CPU cores used in this test is 480. In Table 3, virtual topology schemes are denoted as (P_x × P_y × P_z) for all three communication patterns, and the meaning of each number in each topology is the same as described for NSCC-TJ above.
The speedup and parallel efficiency of the code are shown in Figure 5. From Figure 5, it can be seen that the parallel efficiency reaches up to 80% on NSCC-SZ.
From Table 3, it can be seen that, for virtual topologies of the same dimensionality, the topology with fewer total grids at the interfaces (i.e., less communication data) saves computation time effectively. This conclusion coincides with the case on NSCC-TJ. The amount of communication grids of each virtual topology is calculated by (3). However, for certain topologies with the same number of communication grids, it is found, using the same analysis as for NSCC-TJ, that the computation time is smaller for the topology with more cross-node communication. This is contrary to the speculation above. We therefore examine whether it is caused by the communication load of the most heavily loaded processes in the topology. Next, the cases of 10 × 2 × 3 and 3 × 5 × 4 with 60 cores are taken as examples to analyze this speculation (the amount of communication grids is 6480000 in both cases).
For 10 × 2 × 3, the cross-node communication is 4 × (1200 × 300) + 1 × (1200 × 300) = 1800000, while, for 3 × 5 × 4, it is 2 × (1200 × 300) + 16/12 × (1200 × 300) = 1200000. However, the computation time for 10 × 2 × 3 is 1117.17 seconds and that for 3 × 5 × 4 is 1214.09 seconds. This does not agree with the cross-node speculation made for NSCC-TJ. To explore the reason further, the heaviest and lightest communication loads of the related processes in these two virtual topology schemes are listed in Table 4. The heaviest communication load in the virtual topology scheme 10 × 2 × 3 is (1200 × 300)/30 + (1200 × 300) × 2/6 + (1200 × 1200) × 2/20 = 276000, while in the virtual topology scheme 3 × 5 × 4 it is (1200 × 300) × 2/12 + (1200 × 300) × 2/20 + (1200 × 1200) × 2/15 = 288000. The heaviest communication loads belong to the processes located at the center of the process grid. Similarly, the lightest communication loads, which belong to the processes at the corners of the process grid, can be calculated. The difference between the heaviest and the lightest communication load in the virtual topology scheme 3 × 5 × 4 is larger than that in 10 × 2 × 3, which results in the different computation times. This indicates that, when the total amount of communication is the same, a virtual topology scheme with a more balanced communication load may bring better performance, even though it involves more cross-node communication.
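The per-process loads quoted above follow from the subdomain face sizes: an interior process exchanges data across two faces in each decomposed direction (capped by the number of cuts made in that direction), while a corner process exchanges across only one. The following snippet is a sketch of that bookkeeping, not the authors' code.

```c
#include <stdio.h>

static long min_l(long a, long b) { return a < b ? a : b; }

/* communication load of one process of topology (px, py, pz) on an
 * nx x ny x nz grid; 'faces' is the neighbour count per decomposed
 * direction (2 for an interior process, 1 for a corner process),
 * capped by the number of cuts actually made in that direction */
long proc_load(int faces, int px, int py, int pz, long nx, long ny, long nz)
{
    long face_x = (ny / py) * (nz / pz);   /* subdomain face normal to x */
    long face_y = (nx / px) * (nz / pz);   /* subdomain face normal to y */
    long face_z = (nx / px) * (ny / py);   /* subdomain face normal to z */
    return min_l(faces, px - 1) * face_x
         + min_l(faces, py - 1) * face_y
         + min_l(faces, pz - 1) * face_z;
}

int main(void)
{
    long nx = 1200, ny = 1200, nz = 300;
    printf("10 x 2 x 3: heaviest %ld, lightest %ld\n",
           proc_load(2, 10, 2, 3, nx, ny, nz), proc_load(1, 10, 2, 3, nx, ny, nz));
    printf("3 x 5 x 4:  heaviest %ld, lightest %ld\n",
           proc_load(2, 3, 5, 4, nx, ny, nz), proc_load(1, 3, 5, 4, nx, ny, nz));
    return 0;
}
```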

The General Rules of Optimal Virtual Topology

Generally, when the amount of FDTD cells is the same, the MPI virtual topology that transfers less data saves computation time among virtual topologies of the same dimensionality. The best performance of a parallel FDTD code can be obtained by optimizing the MPI virtual topology scheme. The general rules for better performance are as follows.
(a) Select the MPI virtual topology scheme that makes the total communication amount W (equation (3)) the smallest.
(b) When the total communication amounts of different topologies are the same, select the topology with less cross-node communication.
(c) When the amounts of cross-node communication of different topologies are approximately the same, select the topology with a more balanced communication load.
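As an illustration of rule (a), a brute-force search over all factorizations of the process count can pick the scheme with the smallest W of equation (3); rules (b) and (c) would then act as tie-breakers and are not shown. The sketch below is not the authors' code, and the 120-core case is used only as an example.

```c
#include <stdio.h>

/* total interface grids of equation (3) */
static long comm_grids(int px, int py, int pz, long nx, long ny, long nz)
{
    return (px - 1) * ny * nz + (py - 1) * nx * nz + (pz - 1) * nx * ny;
}

/* pick the (px, py, pz) with px*py*pz == nproc that minimizes equation (3) */
void best_topology(int nproc, long nx, long ny, long nz, int best[3])
{
    long best_w = -1;
    for (int px = 1; px <= nproc; px++) {
        if (nproc % px) continue;
        for (int py = 1; py <= nproc / px; py++) {
            if ((nproc / px) % py) continue;
            int pz = nproc / px / py;
            long w = comm_grids(px, py, pz, nx, ny, nz);
            if (best_w < 0 || w < best_w) {
                best_w = w;
                best[0] = px; best[1] = py; best[2] = pz;
            }
        }
    }
}

int main(void)
{
    int best[3];
    best_topology(120, 1200, 1200, 300, best);   /* the 120-core test case */
    printf("suggested topology by rule (a): %d x %d x %d\n",
           best[0], best[1], best[2]);
    return 0;
}
```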

Applications Using More than 10,000 CPU Cores
Based on the optimal virtual topology scheme above, the parallel FDTD code is applied to analyze some complicated EM problems, which are run on the Shanghai Supercomputer Center (SSC) platform.

Figure 3: The speedup and parallel efficiency of the code from 12 CPU cores to 120 CPU cores on NSCC-TJ.

Figure 5: The speedup and parallel efficiency of the code from 48 CPU cores to 480 CPU cores on NSCC-SZ.

Figure 6: The model of the microstrip antenna array.

Figure 7: The radiation patterns of the array.

Figure 8: The model of the microstrip array.

Table 1: Parameters of computation resources.

Table 2: Comparisons of virtual topology, amount of communication, and computation time on NSCC-TJ.

Table 3: Comparisons of virtual topology, amount of communication, and computation time on NSCC-SZ.

Table 4: Comparisons of communication load.