Parallel Numerical Simulations of Three-Dimensional Electromagnetic Radiation with MPI-CUDA Paradigms

Using parallel computation can enhance the performance of numerical simulation of electromagnetic radiation and get great runtime reduction. We simulate the electromagnetic radiation calculation based on the multicore CPU and GPU Parallel Architecture Clusters by usingMPI-OpenMP andMPI-CUDA hybrid parallel algorithm.This is an effective solution comparing to the traditional finite-difference time-domainmethodwhich has a shortage in the calculation of the electromagnetic radiation on the problem of inadequate large data space and time. What is more, we use regional segmentation, subregional data communications, consolidation, and other methods to improve procedures nested parallelism and finally verify the correctness of the calculation results. Studying these two hybrid models of parallel algorithms run on the high-performance cluster computer, we draw the conclusion that both models are suitable for large-scale numerical calculations, and MPI-CUDA hybrid model can achieve higher speedup.


Introduction
Finite-difference time-domain (FDTD) method has become a common method for solving Maxwell's equations [1].It is a full vector method and can be naturally given timedomain and frequency-domain information user need.This is the unique advantage in electromagnetic and photonic application.FDTD algorithm is discrete in terms of time and space.Therefore, the structure of the electromagnetic field must be described on the grid by the Yee cellular composition.Maxwell's equation is discrete in time factor; therefore, time step is closely related to the mesh size.When mesh size tends to zero in the limit case, the discrete model accurately describes Maxwell's equations.
Recently, general-purpose computing on a graphics processing unit (GPGPU) has received considerable attention in many scientific fields [2][3][4] because a GPGPU offers high computational performance at low cost.What is more, Intel Xeon Phi coprocessor, based on the Many Integrated Core (MIC) architecture, packs up to 1 TFLOP of double precision performance in one chip.It runs a Linux operations system and provides x86 compatibility and also supports several popular programming models including MPI, OpenMP, Thread Building Blocks, and others that are used on multicore architectures.High-performance computer architecture tends to hybrid system, and this corresponds to the software program design requirements mixed programming model.GPGPU and MIC accelerated computing components which appeared in recent years provide the opportunity to improve the performance of FDTD parallel algorithm.Therefore, we achieved the parallel three-dimensional FDTD algorithm based MPI-CUDA model.
The FDTD algorithm obtains a wide range of applications in many fields of electromagnetic radiation, such as radiation antenna analysis, scattering calculations, electronic packaging, and radar.With the development of highperformance computing, the MPI has solved a weakness that the computing time of the FDTD parallel algorithm [5,6] is too long.However, increasing amount of computation, the MPI process in a single node increases computational burden.When we use two different hybrid models which are MPI-OpenMP model and MPI-CUDA model to solve this problem, we can use the distributed shared memory features to improve the parallel speedup and scalability [7,8].
The rest of the paper is organized as follow.We present the FDTD algorithm with uniaxial perfectly matched layer 2 Mathematical Problems in Engineering The structure of Yee cell.
(UPML) in Section 2. Then we describe the procedure and present basic steps for the method acceleration by means of MPI-OpenMP paradigms and MPI-CUDA paradigms in Sections 3 and 4. In Section 5, the performance of parallel computing of the two methods was compared and analyzed the factor that affects performance.The conclusions are given in Section 6 finally.

FDTD Algorithm
FDTD algorithm is a numerical method based on Maxwell's equations.The algorithm uses leapfrog calculation method and alternating electric field and magnetic field distribution in space within a half step sampling by Yee cellular composition [9].
From Figure 1, we can see that the Yee cell has the following characteristics: each magnetic field component was surrounded by four electric field components and each electric field component was surrounded by the four components of the magnetic field, and these field components placement relative position in the Yee cell and automatically satisfy the continuity conditions in the interface.This sampling method not only meets Maxwell's equations difference calculation but also meets Faraday's law of electromagnetic induction and the natural Ampere's law [10].Therefore, this method gradually completes recursive entire electromagnetic fields.First, the explicit equations for the   and   are given by End of the calculation Calculate all the H value at the time Calculate all the E value at the time T > T max ?
E value was to initialized at the time where  is the relative permittivity,  is the conductivity of the tissue [S/m],  is the mesh size, and Δ is the time step.
Figure 2 shows the workflow of the FDTD method.
The explicitly iterative process of FDTD algorithm requires initial field values and boundary conditions.The traditional definition of initial field values put all the space field values which are defined as 0, then the field values of two successive time steps are stored as the initial field value of the next step.There are a variety of the boundary conditions, such as Mur absorbing boundary, perfectly matched layer (PML), and uniaxial perfectly matched layer [11] (UMPL).With the improvement of FDTD algorithm these three matching layers have good effect on absorption.

Mathematical Problems in Engineering 3
In our experiments, we use uniaxial perfectly matched layer (UPML) as absorbing boundary condition.UPML inherited the PML absorbing layers good absorption characteristics, but the UPML absorbing layer is directly based on Maxwell's equations which is different from PML absorbing layer.This ensures that the form of FDTD algorithm calculation model has good consistency with Maxwell's equations, which makes it easier to understand and program.

FDTD Algorithm Based on MPI-OpenMP
3.1.MPI-OpenMP Hybrid Model.MPI programming model is a parallel programming interface for developing standards based messaging, acting on a heterogeneous network environment.It provides a reliable transport mechanism, and it uses security channel to achieve the communication between all the tasks within a process group, data exchange and processing.However, MPI is calculated by parallel interprocess communication, which results in a lower efficiency, the parallel memory overhead, programming problems, and other shortcomings.The OpenMP programming model utilizes fork-join execution mode and shared memory model where a process can be divided into several parallel task execution threads.In a single node, when the main thread is running into OpenMP parallel region, it will produce different thread queue to achieve a parallel effect.However, since OpenMP parallel programming model can only be a single node, computing power of the CPU has been greatly restricted.
MPI-OpenMP programming model is the combination of these two models, which is between multiple machines using MPI distributed memory and each MPI process uses multithreaded OpenMP shared memory model to parallel computing [12].This model can reduce the number of MPI processes in parallel, thereby reducing the number of messages passing.What is more, OpenMP parallel on each node can save memory overhead.So this parallel hybrid model has certain advantage over a single model.

The Division of MPI Model.
According to the different regional calculation scale, each division way has different characteristics.Common area includes one-dimensional, two-dimensional, and three-dimensional division.In the cube model, each section of the communication direction is all the sane, which means that the communication traffic is proportionate to the number of communication surfaces.Three-dimensional method produces the minimal communication surface.Therefore, we should select the threedimensional division in the balanced scales calculating.But when the area which you calculated is dominant in one direction or in both directions, the communication traffic is not proportional to the number of communication surfaces.The one-dimensional or two-dimensional division can get better results.For example, long waveguide will be divided one-dimensional along the waveguide direction.In addition wing aircraft is usually divided two-dimensional alone its width direction.
In this paper, the computational model which is similar to the phone box is an extreme rectangular model.We  performed a one-dimensional division by the meshing and each process is assigned to one node on average to obtain high efficiency.This division can be reduced communication time between nodes in the rectangular model (Figure 3).

MPI Parallel Communication Design.
As shown in (1), FDTD field value for each step of the algorithm depends only on the value of magnetic field intensity and the circumference value of the grid, which means that it has significantly localized.Therefore, MPI algorithm just calculates the area into a plurality of subregions, and each subregion is allocated MPI node in the mesh to ensure the exchange data in the each time step.Finally, we synchronize data of each node at end of the step.In order to obtain high efficiency of parallel speedup, we partition the process by  direction and put the processes average assigned to each node.This partitioning can reduce the communication between nodes time in the  direction dominant of the rectangular parallelepiped model.After the calculate area is divided into a number of subregions.The data communication is shown in Figure 4.If we want to calculate   which is in process II, we should require the value of electric fields   and   which is in process I. Then process I transfers the values to process II.  can begin calculating when data exchange is completed.Obviously, the data in the other direction can communicate with each other in this way.

OpenMP Threaded Design and Optimization.
The electric field and magnetic field strength only depends on the electromagnetic field data of the former step in the electromagnetic computing.Therefore, it has the natural parallelism.In the multicore computing, the program uses OpenMP model to increase program parallelism.
For better memory utilization, the load balancing may not be very good; therefore, when considering performance optimization, we should have a compromise between the #pragma omp parallel for schedule(static) for ( = 1;  <  tot; ++) for ( = 0;  <  tot; ++) for ( = 0;  <  tot; ++) ⋅ ⋅ ⋅ In order to increase OpenMP load balance and get better scalability, we merge the nested loops and reduce from 3 nest layers to 2 layers.end =  tot;  end = ( tot) * ( tot);  start =  end + 1; #pragma omp parallel for schedule(static) for ( =  start;  <  end; ++) {  = / end; need to optimize memory utilization and load balancing.In OpenMP model, there are four kinds of commonly used scheduling, such as static, dynamic, guided, and runtime.The commonly used scheduling is static where iterations are divided into chunks of size chunk size while chunks are assigned to threads in the team in round-robin fashion in order of thread number and dynamic where each thread executes a chunk of iterations and then requests another chunk until no chunks remain to be distributed.
From the definitions of the schedule, we can see that static scheduling applies to the situation that all CPU functions are similar due to the complexity of balancing reasons, while the dynamic scheduling applies to the situation of the ability to run large differences between the CPUs.Due to the experimental environment that each CPU computing power is not very different, we use static scheduling so that each thread can average computing tasks [13] (see Algorithm 1).
Its obvious that , , and  directions have similar calculation.

MPI-OpenMP Hybrid
Programming.With the increasing scale of operation, the shared storage processor overhead of process limits parallel performance.We can solve this problem by OpenMP and MPI hybrid model where each node is calculated by MPI processes and each node uses the optimal number of threads.Thus the amount of computation is shared by multiple threads, thereby reducing communication overhead and easing the process workload on each node.
When FDTD algorithm based on MPI-OpenMP model allocates a process for each compute node, each node is assigned a process.Passing MPI initialization and calculation parameters initialization, the process starts OpenMP parallel computing.In this phase, the program should update cycle electric field value and the electric field value with adjacent nodes exchanges the date.After the communication is completed, CPUs use OpenMP programming model to update the value of the magnetic field.When the programs achieve the maximum time step, MPI processes and the main procedures will end.As for the border absorption treatment, process calculation is carried out in a single process in every time step.Therefore, it can also improve program parallelism through OpenMP model.
The pseudocode is as shown in Algorithm 2.

MPI + GPU Programming Model Design
4.1.FDTD Parallel Algorithm Based on GPU Model.FDTD is the most popular method of computational electromagnetic simulation because of its simple algorithm and high computational efficiency.Figure 5 shows the flow of the threedimensional FDTD method for a single GPU computation [14].There are four tasks within each time step for the GPU side in this figure: electric field computation ( field), uniaxial perfectly matched layers (UPML) computation for electric field ( upml) as the absorbing boundary condition, magnetic field computation (ℎ field), and PML for magnetic field (ℎ upml).All field updates in each time step can be parallelized and are offloaded to the GPU [15].Our GPU implementation of the FDTD method is based on the C++ code that runs on the ZiQiang4000 highperformance computer clusters of Shanghai University.Since all the magnetic and electric field computations can be vectored and parallelized, these computations are also candidates for GPU computation with a CUDA kernel.The CUDA kernel including  field,  pml, ℎ field, and ℎ pml follows.And our GPU three-dimensional FDTD program requires no data transfer for field updates because of all computations within each time.The CUDA kernel function [16] would be in Algorithm 3. As the internal memory of GPU, registers and shared memory have a high access speed and a small data latency.Therefore, the data is loaded into the global memory so that shared memory can improve the efficiency of GPU parallel.Consider

The Optimized Memory of Access
The first line of code is carried in the shared memory array definition.Since the shared memory space is very limited, programmer should choose dynamically allocated arrays or static allocation based on the data size.The second line was to load the data into the global memory shared memory which can increase memory access speed by data sharing.Obviously, other data should also do the same optimization.

MPI-CUDA Hybrid Programming.
In the MPI programming model, a problem will be divided into multiple subtasks so that each process to execute a task.But with the growth of the scale, MPI communication overhead increases at the same time; thereby this will reduce the parallel efficiency.However, the model is just coarse-grained parallelism between the nodes, and computing capacity of CPU is not fully utilized.Then the CUDA model is just to make up the deficiency of this section.The use of multithreaded shared memory  [1]; Grid [2] = /blockDim [2]; dim3 grid (Grid[0], Grid [1], Grid [2]); caluH <<<grid, block>>> (arguments); Algorithm 3 mechanism on the GPU achieves data sharing fine-grained parallelism, which can achieve higher speedup [17].
In order to reduce the computational load of the GPU, we use a single node in a multi-GPU programming model [18], which means that each process is calculated by calling the function cudaSetDevice() to get two GPUs in a single node.
Workflow of three-dimensional FDTD program based MPI-CUDA model is as follows.
(2) Divide the regional calculation and allocate GPU memory and copy initial data to GPU.
(3) Time step for-loop Use MPI model.
(a) execute electric field computation on GPU (b) execute magnetic field computation on GPU.

Experiments and Analysis
5.1.Experimental Environment.ZiQiang4000 normal cluster is constituted of 40-unit IBM X240 server blade, which contains two intel E5-2680, 64 G shared memory, and 16 core components in each blade.The GPU computing cluster node contains 11 sets of IBM DX360, which has two intel E5-2680, 64 G shared memory, and plus two Nvidia M2090 GPU constitution.The software environment of the system includes Centos Linux 6.3 operating system, NVCC compiler supports OpenMP guided C/C++ compiler, and MPICH2 parallel environment.

Model Validation and Analysis.
Verifying the correctness of parallel programs is a prerequisite for its performance analysis.In order to verify the correctness of the program and UMPL absorbing boundary absorption effect, this paper makes use of three-dimensional point source radiation to have a situation analysis.we compared the run time between GPUs and CPUs in the case of Gaussian incidence at 900 MHz.What is more, the grid size is 0.02 mm and the calculation scale is of 100×100×100.We use UPML absorbing layer which is set to 10-mesh size and the maximum time step 1000.Figure 6 shows the 50 layers and 65 layers of XY section of the electric field strength calculations at the 1000 steps.As we can see in the figure, electromagnetic waves shape was diffusion, and the further the cross-sectional layer of electric field away from the excitation source, the smaller the value you get.The excitation source is at the center of the field map and gradually spreads to the electromagnetic field borders, which fits the propagation of electromagnetic waves.

Parallel Analysis.
In order to obtain intuitive results, we expand the calculated scale to 1024 × 256 × 128, and the absorbing boundary UMPL mesh size is set to 16.The results were as follows after 1000 steps calculation using MPI model.Table 1 shows the result of point source radiation calculation time based on the MPI model, such as speedup  and efficiency of the communication area.In this paper, we performed 16 processes on each node.It can be seen from the table that, as the number of processes increases, the computing time significantly reduced for linear speedup.And the electromagnetic radiation program has good scalability and parallelism in a distributed process parallel environment.But with the increase in the number of processes, the parallel efficiency becomes more and more low.This is because the increase in the number of processes leading to the communication area is increasing.It will increase the communication overhead and cause the load between nodes to be not balanced.
From Figure 7, we can see that, when calculating small scale, the acceleration effect of GPU is not obvious.This is due to the number of threads too little to create enough block to make SM concurrent execution, reducing GPU occupancy rate.With the increasing of the calculation size, GPU computing resources gradually are called and the speed of computation is improved.Therefore, when calculating a smaller scale, GPU speed increase is limited, and in the largescale electromagnetic calculations, GPU-accelerated effects model is very impressive, which is reaching about 9 times the CPU.After optimizing memory access patterns by shared storage, the max speed ratio achieves 23 when the number of grids is 6 × 10 6 .The reason is segmentation of data blocks fit in GPU memory.
MPI-OpenMP hybrid programming model combines two kinds of models, which can take advantage of shared memory and message passing model and improve FDTD algorithm parallelism.FDTD program is mainly calculated by the parameter initialization, the time step, absorbing boundary calculated, updating the electric field, and data processing.However, over 98% of the total time is conducted in two stages which are absorbed electromagnetic boundaries update phase and calculating electromagnetic fields.We can utilize OpenMP programming model to improve program parallelism at this two stages.In this paper, each node runs two processes, and the number of parallel threads is 8. Similarly, MPI-CUDA models use shared memory and message passing mechanism to calculate the electric field value performed on the GPU.Each block is divided into 16 × 16 × 16, and the calculation results are shown in Table 2.
From Columns 2 and 3 of Table 2, it can be seen that pure MPI program is better than MPI-OpenMP model when the calculating cores is small.This is because MPI communication is very small when the number of processes is small, and OpenMP program will be a corresponding increase in the thread overhead.But with the increase of the number of the GPUs, the growth rate of MPI model will decline as MPI process communication and synchronization overhead increases, while MPI-OpenMP program obtains a better parallelism for saving offset cost of the thread.By comparing the two parallel methods, we can see MPI application performance superior to MPI-OpenMP program  when calculating the cores is less than 32.But with the increase in the number of cores, the advantage of MPI-OpenMP program will gradually be revealed and get better speedup.
From Columns 2 and 4 of Table 2, it can be seen that pure MPI-CUDA program is better than pure MPI model, and it has a linear increase when the number of processes is small.But with the increase in the number of processes, the efficiency of the program decreased very significantly.MPI communication overhead is increased in the proportion of the total time.As the overall calculation region is divided into a plurality of subregions, the size of each process significantly reduced to GPU parallel efficiency.We can find that the speedup of MPI-CUDA programming model is about 20 times the pure MPI program, which effectively improved the parallelism of the program.
Figure 8 shows the result of FDTD algorithm based on two hybrid models.Due to the GPU's high-speed floatingpoint operations, MPI-CUDA parallel model clearly has a better effect.However, this model is limited by memory and is device-dependent.MPI-OpenMP program can save communication costs and have better scalability.These two kinds of parallel models are suitable for mass-scale electromagnetic calculations which conform the actual electromagnetic calculations.Therefore, the programmer should be selected by different hybrid models according to the different circumstances.

Conclusions
With the development of high-performance computing, the new technology provides a better use of space for FDTD algorithm.The various parallel models make hybrid programming model become the mainstream of highperformance computing.In this study, we implemented the three-dimensional FDTD method by MPI-OpenMP model.What is more, we also implemented FDTD method on multi-GPU cluster environment with CUDA and MPI and fulfilled the simulation of three-dimensional numerical electromagnetic radiation.The results indicate that hybrid programming model can better take advantage of distributed shared memory in order to improve parallel performance.

Figure 2 :
Figure 2: The workflow of the FDTD method.

Figure 3 :
Figure 3: The division of rectangular computational model.

Figure 6 :Figure 7 :
Figure 6: The electric field strength calculations of sectional - at 1000 steps.

Figure 8 :
Figure 8: The result of FDTD algorithm based on two hybrid models.
Patterns.The kernel code which can be called by the GPU has role similar to max Figure 5: The workflow of the three-dimensional FDTD method for a single GPU computation.caluH C function.The kernel variable threadIdx, blockIdx, and blockDim can calculate the array indexes , , and .Host-side code then uses caluH ⋘grid, block⋙ (arguments) function to call the kernel code, where grid and block specify the division of thread blocks and the number of threads in the each block.

Table 1 :
The result of FDTD algorithm based on MPI.

Table 2 :
The result of FDTD algorithm based on all models.