GPU-Accelerated Parallel FDTD on Distributed Heterogeneous Platform

This paper introduces a (finite difference time domain) FDTD code written in Fortran and CUDA for realistic electromagnetic calculations with parallelization methods of Message Passing Interface (MPI) and Open Multiprocessing (OpenMP). Since both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) resources are utilized, a faster execution speed can be reached compared to a traditional pure GPU code. In our experiments, 64 NVIDIA TESLA K20m GPUs and 64 INTEL XEON E5-2670 CPUs are used to carry out the pure CPU, pure GPU, and CPU + GPU tests. Relative to the pure CPU calculations for the same problems, the speedup ratio achieved by CPU + GPU calculations is around 14. Compared to the pure GPU calculations for the same problems, the CPU + GPU calculations have 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very small. However, this code can enlarge the maximum problem size by 25% without reducing the performance of traditional pure GPU code. Finally, using this code, a microstrip antenna array with 16 × 18 elements is calculated and the radiation patterns are compared with the ones of MoM. Results show that there is a well agreement between them.


Introduction
Electromagnetic researches have penetrated into many aspects of our daily lives, such as wireless signal propagation, mobile communications, and radar technology.Since the electromagnetic wave propagation in a real environment is very complex and usually the characteristics of electromagnetic waves cannot be obtained by theoretical analysis, the numerical solutions of electromagnetic waves have gradually been developed, such as method of moments (MoM), uniform geometrical theory of diffraction (UTD), physical optics (PO), and finite difference time domain (FDTD) method.Each method has its own advantages, for instance, high-frequency methods UTD and PO are suitable for electrical large objects, while the general electrical objects can be treated by the low-frequency methods such as MoM.The FDTD method which will be discussed in this paper was first proposed in 1966 by Yee [1].The electromagnetic field is discretized by the Yee grid [1] and the Maxwell's equations are converted into finite difference form.Thus, the numerical FDTD can be used to simulate the processes of electromagnetic scattering, radiation of large scale, complex shapes objects, or nonuniform mediums.
Thanks to the rapid development of computer industry, besides traditional CPU, other accelerations like Graphics Processing Unit (GPU, including integrated GPU [2]) and Many Integrated Core Architecture (MIC) have been developed to improve the calculation efficiency.The FDTD method using the finite difference method is very convenient for GPU transplant [3][4][5][6].At early times, single GPU card can achieve a much faster execution speed than INTEL or AMD CPUs [7,8].Using NVIDIA FX 5900 by Krakiwsky et al. [7], the speedup is about 7 times faster than Intel P4 CPU.CUDA based code has shown good performance compared with C code operated on Xeon CPU in 3D FDTD simulations.Liuge et al. [9] archived 10 times speedup factors compared with INTEL Xeon E5410 CPU in all their cases by using NVIDIA GTX280 video card.Because of the shortcoming of small memory size of single GPU card, multi-GPU card system was developed for large scale computations [10,11].Nagaoka and 2 International Journal of Antennas and Propagation Watanabe [11] used 3 nodes (21 NVIDIA TESLA C2070 GPU cards in total) to carry out the FDTD numerical simulations and the parallel efficiency is about 30%.The speedup ratio is 1.5 relative to NEC SX-8R CPU (8 cores).In the GPU experiments made by Kim and Park [12], a theoretical computing efficiency was obtained by spliting GPU kernel functions, in which these functions were arranged in an ingenious way and the MPI communications and FDTD calculations overlapped each other.Xu et al. [13] used 64 NVIDIA TESLA K20m GPU cards in their three-dimensional (3D) FDTD simulation, and the parallel efficiency is 67.5% and the speedup ratio is 3.1 relative to Intel XEON E5-2670 CPU (8 cores).With the development of GPU clusters, GPU computing has a great potentials, based on which more and more GPU-accelerated applications appear.
The FDTD method investigated in this paper includes some module functions like source excitation, near-to-far transformation, preprocessing, and postprocessing.Therefore, this FDTD code can be applied to solve a application problem directly.The code is mainly written in Fortran which is combined with CUDA to make use of GPU resources.The Message Passing Interface (MPI, for large scale calculation) programming and Open Multi-Processing programming (OpenMP, for regulation of CPU and GPU resources) are also very important in our code.Thus, these functions are introduced as follows: Section 2 discusses the FDTD equations and discretization method; the GPU algorithm are illustrated in Section 3; Section 4 discusses the results of GPU and CPU tests; finally, the conclusion and remarks are presented in Section 5.

Numerical Scheme of FDTD
Maxwell's curl equations are given as follows [14,15]: where E, H are electric field intensity and magnetic field intensity.The other parameters , , , and   are permittivity, permeability, electrical conductivity, and magnetic permeability, respectively.In vacuum,  = 0,   = 0, and  =  0 = 8.85 × 10 −12 F/M and  =  0 = 4 × 10 where where Δ, Δ, Δ are space steps and Δ is time step.() is the value of  at the positon , for example,  = ( + 1/2, , ) in formula (2).Other variables   (), (), and () have the same meaning.From formulae (2) and (3), we know that the electric or magnetic field at any point is regarded with the four nearest magnetic or electric fields around.

FDTD Parallelization Strategy
3.1.Calculation Flow.This section describes the strategy of using both CPU and GPU resources.For the convenience of expressing, the new FDTD code after GPU transplant is referred to as cudaFDTD hereafter.cudaFDTD includes several functional modules as follows (for detailed formulae we refer the readers to the books by Taflove and Hagness [15] and Ge and Yan [16]).
(1) E and H updates.As described above, cudaFDTD uses the Yee cell and leapfrog scheme to update electric field and magnetic field in turn.The UPML layer [17] is placed near the boundaries.In our cudaFDTD code, the thickness of the UPML layer is set to be 5 grid points which is suitable for most situations.Of course, this value can be adjusted for some extreme cases.There is little influence coming from the boundaries and the verification of the correctness of cudaFDTD is presented in Section 4. (2) Source excitation.cudaFDTD has the source excitation, which creates perturbations in the computational domain for scattered or radiant simulations.The form of source excitation is flexible.(3) Near-to-far transformation.According to the Huygens' principle, the scattered or radiant field at an observation point outside the computational domain can be derived from the already existed E and H in the domain.The E and H values are recorded during the E and H updates.After all the updating processes have been accomplished, the far fields can be derived in postprocessing procedure.
The functional modules depicted above can be calculated by CPU and GPU simultaneously.The strategy of distributing the computing tasks for CPU and GPU is presented in Section 3.3.The completed flowchart containing the above functional modules is shown in Figure 1.

The Process and Thread in cudaFDTD.
The number of processes booting by cudaFDTD depends on the number of GPU cards in most of cases (the number of GPU cards greater than that of CPU cores).Each process contains several threads.One thread is distributed to control the CUDA stream; the rest of them are for CPU calculations (using OpenMP).If one node of a supercomputer offers   CPU cores and   GPU cards, the avaliable process number and thread number of each process are given by the following fomulae: where the function "min" in (5)   distribution in one process.Because every subdomain uses CPU and GPU simultaneously, thus the intraprocess partition splits the subdomain to two pieces for CPU and GPU calculations.The boundary data is passed by cudaMemcpy between CPU and GPU.
As we know, the GFLOPS (giga floating-point operations per second) between CPU and GPU is very different.A ratio has to be determined for load balance of CPU and GPU.In cudaFDTD, we use the parameter cuRatio to control the amount of calculation for GPU.However, the cuRatio is not a simple divisional result of theoretical GFLOPS of CPU and GPU devices but the actual GFLOPS which is counted in the code.Figure 3 gives the partition method in intraprocess partition.The data of this subdomain is divided into upper and lower parts for the purpose of spatial locality.The expression of cuRatio is where , , and  are the dimension sizes of , , and  directions.So, the block size calculated by CPU is  ×  ×  and the rest space  ×  ×  is left for GPU.
The interprocess partition and the intraprocess partition are applicative for all function modules like update, source excitation, and near-to-far transformation.Since the data partition has two different types, the data communication also has two types: (1) interprocess communication and (2) intraprocess communication.In interprocess communication the boundary data is passed thought nonblocking MPI functions like MPI ISEND and MPI IRECV, whereas the data transportation in intraprocess communication is transmitted by the CUDA function cud-aMemcpy between CPU and GPU device.For both communication types, the operations are similar (see Figure 3

Optimization Methods Used in cudaFDTD.
To maximize the efficiency of the CPU and the GPU computing, we adopted a number of technical methods to improve the efficiency of our procedures.We have the following points for the optimization of Fortran code.
( We have the following optimization methods for CUDA code. (1) Reduce unnecessary arithmetic operation.This is the same as CPU code.(2) Memory coalescing.Coalesced memory access provides beneficial performance [18].The latest NVIDIA GPU (compute capability 2.0 or later) fetches data from memory in groups of 32 4-byte floats [19].For instance, if a thread-block needs to fetch 16 floats from the GPU memory, at most 16 fetches are needed in the worst case.If the 16 floats are adjacent in memory, then a single fetch operation can grab all required 16 floats.Therefore, we already do our best to use the data in adjacent memory in order to merge memory accesses.(3) Constant memory.GPU constant memory is a smallreadable memory, and it has the following advantages compared to other memories: (1) single read can broadcast its value to other nearby thread and (2) constant memory has cache and the sequential read of the same constant is very fast.Some of the constant coefficients of the material are put into the constant memory in cudaFDTD.
The other optimization techniques, like the usage of shared memory [20], will be developed in the next version.The performances with and without optimizations are shown by Table 1.In this test, the CPU and GPU devices are Intel XEON E5-2670 and NVIDIA TESLA K20m in Shanghai Supercomputer Center (SSC).The problem size is 512 × 256 × 256.

cudaFDTD Tests
Most of our tests are compiled and executed on the GPU cluster in the Center for High Performance Computing of Shanghai Jiaotong University.Each node in the GPU cluster is configured with 2-way INTEL XEON E5-2670 CPUs and 2 NVIDIA TESLA K20m GPUs.A total of 50 nodes were connected by Mellanox FDR InfiniBand.cudaFDTD is a mixed  programming code of Fortran and CUDA using doubleprecision float for calculations for which the Fortran compiler is Intel ifort 13.1.1and CUDA 5.0.Since Intel Fortran compiler cannot directly call GPU kernel functions, our cudaFDTD program is necessarily to call C language functions first and then GPU kernel function in the C language functions.

Verification for the Numerical Accuracy.
Electric dipole source are used to verify the accuracy of cudaFDTD code [16].The vertical electric dipole is located in the center of the computational domain, and we observe the changes of   value with time at an observational point Δ from the dipole.The computational box is  ×  ×  = 64 × 64 × 64 and the Yee cell is Δ × Δ × Δ = 5 cm × 5 cm × 5 cm.The time step is Δ = 83.3ps.The electric field   generated by electric dipole is added directly to   (/2, /2, /2): where  = 2 ns and  0 = 8.8541878×10 −12 F/m.The thickness of UPML is 5Δ.The theoretical expression for   (/2 + , /2, /2) which is Δ from the center (/2, /2, /2) is given by where  0 = ( − / − 3)/,  = Δ. is the speed of light.
We choose  = 10 in this verification and Figure 4 presents the difference between the simulation result and theoretical   for GPU.If the problem size is 100 × 100 × 100 and cuRatio is 0.8, the part for GPU is 100 × 100 × 80.It is also an important key to guarantee the load balance.Figure 5 shows the variation of loop time and cuRatio value over threads number.The test case is the same as described in Section 4.1.We used one node in this case and the problem size is 512 × 256×256.As depicted in Figure 2, there is a thread for CUDA stream and the other ones for CPU calculations.Thus, the values 2, 3, 5, 7, 9, 16 of  axis in Figure 5 correspond to 1, As for the CPU + GPU test, the perfect load balance is difficult to reach which further polls down the parallel efficiency.
Although the parallel efficiency is unsatisfactory, the loop time is very short (see Table 3) which gives a high speedup ratio.
Compared to CPU calculation, the speedup ratios of GPU and CPU + GPU are given in Tables 4 and 5.The meaning of NP is described above.The difference between Tables 4 and 5 is that the MPI communication is involved in 64 processes tests, while the other one has no MPI communication.As shown in these tables, pure GPU calculation has a speedup ratio of 13.2 and the ratio in CPU + GPU test is higher than 13.2 in both tables.Compared to the pure GPU calculations for the same problems, the CPU + GPU calculations have 7.6%-13.6%performance improvement.Since the perfect

Discussion and Summary
This paper introduces a new FDTD code which uses both CPU and GPU resources to deal with the realistic electromagnetic problems.The loads of CPU and GPU are balanced by the parameter cuRatio.We used 64 NVIDIA TESLA K20m GPUs and 64 INTEL XEON E5-2670 CPUs to carry out International Journal of Antennas and Propagation numerical tests.Compared to the pure GPU calculations for the same problems, the CPU + GPU calculations have 7.6%-13.2%performance improvement and the maximum problem size can be enlarged by 25%.Accroding to ( 5) and ( 6), this FDTD code is also available for different CPU and GPU configurations of other supercomputers.Moreover, the cudaFDTD is also compatible with the clusters without GPU devices in which the CUDA functions will be disabled.Several function modules like source excitation, near-tofar transformation, preprocessing, and postprocessing have been implemented in cudaFDTD; therefore this code can be applied directly to calculate the application problem as shown in Section 4.4.However, we already noticed that our code has some shortages which will be overcome in the future.
(1) The MPI parallel efficiency is not good enough (refer to Section 4.3).The complicated communication is one of the reasons, but there are some approaches to overlap this overhead, for instance, the nonblocking MPI communication.
(2) The OpenMP parallel efficiency is bad.For example, in Section 4.2, the cuRatio is 0.80 and 0.95 when using 15 CPU cores and 1 CPU core (note that one CPU core is used for CUDA stream).The speedup from single core to 15 cores is only a factor of 4.

Figure 2 :
Figure 2: An example to show the processes and threads in cudaFDTD: 2 processes are started and each process controls 8 threads.

Figure 4 :
Figure 4: The accuracy verification of cudaFDTD.The green line is relative error using the right  axis.

Figure 6 :
Figure 6: Speedup test of cudaFDTD.The unit of  axis is NP (the number of processes) which has different means for different speedup curve: 1 NP = 1 CPU core for pure CPU curve, 1 NP = 1 GPU card for pure GPU curve, and 1 NP = 7 CPU cores + 1 GPU card for CPU + GPU curve.The problem size is 512 × 256 × 256.

4. 4 .
Radiation of Mircostrip Array.In this section, we present an application test of 16 × 18 mircostrip antenna array.The initial model is illustrated in Figure7.The frequency is 9 GHz; the size of the array is 0.2453 m × 0.3908 m × 0.0013 m (grid size is  =  =  = 0.00033 m).The total computational grid is 800 × 1200 × 100.Using 24 NP (1 NP = 7 CPU cores + 1 GPU card) to iterate 10000 steps, the radiation patterns of this array are shown in Figure8.The result is almost identical with MoM method.
Figure 1: The flowchart of cudaFDTD.The modules written in blue mean the main calculations and that in brown mean data communications.Note that the interprocess is done by only CPU; other function modules are treated by CPU and GPU simultaneously.
(6)ns minimum function, and function "int" in(6)rounds the result to an integer.For instance, if one node offers   = 2 GPU cards and   = 16 CPU cores, then 2 processes are started at this node.Each process forks to   = 8 threads for 8 CPU cores and one of the 8 threads (or say one of the 8 CPU cores) is distributed to control one CUDA stream, as shown by Figure2.CPU and GPU can work simultaneously.
3.3.Data Partition.The data partition consists of two aspects: (1) interprocess partition and (2) intraprocess partition.Interprocess partition distributes the total computational domain to different processes.Every subdomain has its own CPU and GPU devices and the number of subdomain is decided by   .The data exchage is realized by MPI communications.The intraprocess partition means the data 1) Reduce unnecessary arithmetic operation.For example, defining intermediate variables can reduce duplication of calculations; besides, division and modulo operations should be avoided.(2) Specify CPU instruction set when compiling code.

Table 1 :
The performances before and after optimization.

Table 2 :
Statistics of the time consumption of cudaFDTD.In order to obtain the detailed statistics of the time consumption, a larger scale test with the size 2048 × 1024 × 1024 is performed.The time consumption of all function modules is given in Table2.In this test, we used 32 nodes.Figure 5: The variation of loop time and cuRatio value over threads number.The problem size is 512 × 256 × 256.The blue curve means loop time (left  axis) and the green curve presents the value of cuRatio (right  axis).

Table 3 :
Loop time information used in Figure6.CPU cores at most).Compared to one process, the parallel efficiencies ( 1 /( ×   ),  1 means the loop time calculated by 1 NP;   means the loop time calculated by  NP.) of pure CPU, pure GPU, and CPU + GPU are 93.2%,47.2%, and 42.8%, respectively.The small problem size (512 × 256 × 256) and complex communication are responsible for the low parallel efficiencies of pure GPU.The time spent in communication takes the larger proportion in 64 processes.
load balance is difficult to achieve, the speedup of 64 processes is lower than that of single process.