A High-Performance Parallel FDTD Method Enhanced by Using SSE Instruction Set

. We introduce a hardware acceleration technique for the parallel ﬁnite di ﬀ erence time domain (FDTD) method using the SSE (streaming (single instruction multiple data) SIMD extensions) instruction set. The implementation of SSE instruction set to parallel FDTD method has achieved the signiﬁcant improvement on the simulation performance. The benchmarks of the SSE acceleration on both the multi-CPU workstation and computer cluster have demonstrated the advantages of (vector arithmetic logic unit) VALU acceleration over GPU acceleration. Several engineering applications are employed to demonstrate the performance of parallel FDTD method enhanced by SSE instruction set.


Introduction
Since the FDTD method is firstly proposed by Yee in 1966 [1] for solving Maxwell's equations as a type of difference algorithm, it has grown into a popular computational electromagnetic technique after decades of development and become a major electromagnetic simulation tool today.Compared to other computational methods, the FDTD method becomes more and more popular for the practical and complex problems because of its simplicity and flexibility.In addition, the broadband characteristics of the FDTD method allow us to achieve the wideband frequency response in a single simulation.For the electrically large and multiscale problems, it may take too long time for the FDTD simulation to reach the convergent solution.In order to overcome this disadvantage of the FDTD method, there are a large number of publications on the related topics such as conformal technique [2] to enlarge the cell size, subgridding scheme [3] to use local fine mesh, and ADI algorithm [4] to increase the size of time step.On other hand, to use the multicore PC, multi-CPU workstation, and computer cluster to speed up the electromagnetic simulation, the parallel processing based on the (message passing interface) MPI library [5] and (open multiprocessing) OpenMP [6].
With development of the multicore processors, (graphics processing unit) GPU [7] has drawn a lot of attentions in the past ten years since an advanced GPU contains up to several hundreds of computing cores.Since the field update formula in the FDTD method is similar to the format that GPU is used to display acceleration, the FDTD method is introduced to the GPU acceleration in the earlier time.However, to use GPU acceleration, the simulation format has to be in the certain requirement.For example, the data inside memory must be continuous; otherwise, the performance will be degraded dramatically.In the practical problems, the data structure inside the memory varies from one case to another.In addition, the GPU can only use the high-performance memory installed on the GPU card.Any communication between the GPU memory and main memory will cause reducing the simulation performance.Unlike the GPU acceleration, VALU acceleration will fully use the features built inside the regular CPU.Though the VALU unit has existed International Journal of Antennas and Propagation inside the CPU for many years since the Intel Pentium III.However, it has not used for the acceleration of engineering simulations.Compared to the GPU acceleration, VALU acceleration is more powerful and flexible, and it is the most important that the VALU acceleration is a general platform, and it does not require any extra hardware devices.
The VALU acceleration has been illustrated through several typical examples on both Intel and AMD processors.Since the AMD processor includes more physical cores today than Intel processor.The FDTD simulation accelerated by using VALU units can get more benefit from the AMD than Intel processors.A waveguide filter and patch antenna array are used to validate the VALU acceleration technique.In this paper, we first briefly introduce the parallel FDTD method in Section 2, and then describe the basic architecture of a regular CPU in the Section 3. Section 4 concentrates on the implementation of SSE instruction set in the parallel FDTD code.Section 5 uses several examples to demonstrate the performance improvement of SSE instruction set for the FDTD simulation.Section 6 uses a waveguide filter and patch array to demonstrate the performance improvement of SSE instruction set for the practical problem solving.Finally, we summarize the works in this paper and point out the future works.

Parallel FDTD Acceleration Method
In FDTD method, Maxwell's equations are discretized into a set of difference equations through the central difference formula.The electromagnetic wave propagation and the interaction of electromagnetic wave with the environment are analyzed through update of the electric and magnetic fields in space and time.The electric/magnetic fields in spatial and time domains can be calculated by the explicit way through its previous value at the same location and the four magnetic/electric fields around it at the half time step earlier as illustrated in Figure 1.
It is observed from Figure 1 that the FDTD method is parallel in nature due to its localized characteristics and suitable for the SSE implementation.
The fields on the interface are simulated at both sides Process 1 Process 2 Using this feature, we can achieve an excellent parallel performance when we use the FDTD code on the multiple core processor and computer cluster because it only requires the exchange of information on the interface between the subdomains.In the several popular data exchanging methods [1], the robust one is described in Figure 2.
The MPI library is designed for the communication of distributed processors.The exchanging data in the FDTD simulation is located on the interface between the adjacent processors, as we can see from Figure 2. The parallel efficiency depends not only on the parallel code structure but also the network system.On other hand, the parallel computing on the multicore processors and multi-CPU workstation is based on the OpenMP, which is a shared memory parallel algorithm.OpenMp requires that all the cores or CPUs have the same hostname.In a practical cluster or multi-CPU workstation, MPI and OpenMp are combined to achieve a better simulation performance.
Vector unit has been existed in a regular CPU for many years; however, it was never used in the engineering simulations.In this paper, we describe the implementation approach to apply the vector unit into the FDTD simulation though the SSE instruction set.Compared to the GPU acceleration, the VALU acceleration technique is general platform and does not require any extra hardware devices.Now, we briefly introduce the architecture of regular CPU and then describe how to implement the VALU unit to speed up the FDTD simulation.Each core in the multicore for(i = 0; i <= nx; i ++) { vHi Coeff = mm load1 ps(&H i Coeff); //load single float value to vector for(j = 0; j <= ny; j++) { vHj Coeff = mm load1 ps(&H j Coeff); //load single float value to vector processor has its own cache, FPU and VALU, as shown in Figure 3. Unlike the FPU, the VALU allows us to operate on four data at the same time.VALU unit is a unit that allows us to arrange for 32-bit floating numbers and operate the four numbers at the same time.By the way described, we use the VALU through the SSE instruction set to accelerate the parallel conformal FDTD code.Next, we use the electric field component update as an example to explain how the FDTD code is accelerated by using VALU unit.For example, the electric field E z can be expressed as follows [8,9]: where E z Coeff, H x Coeff, and H y Coeff are coefficients in the field update equations [4].Using the following procedure VALU can be used to compute four E z at the same time to accelerate the simulation, which is described as follows.
(i) Load the coefficient of the magnetic field H y into the SSE registers.
(ii) Load the coefficient of the magnetic field H x into the SSE registers.
(iii) Convert the float pointer to the SSE 128 bit pointer.
(iv) Calculate the difference of magnetic fields.
(v) Multiply the difference of magnetic fields and their coefficients .
(vi) Calculate the contribution of magnetic fields to the electric fields.
(vii) Multiply the previous electric fields and their coefficients.
(viii) Calculate the electric fields and write them to memory (Algorithm 1).
A 3-D array in the FDTD code is allocated using the malloc function in C language, and aligned malloc function in SSE in this paper.For example, if we need a 3D array array name [x size, y size, z size], we can first define a 1D array array name tmp [N] whose size is N = x size * y size * z size, and then map the 1D memory address to 3D array array name.The pseudocode segment is demonstrated as below (see Algorithm 2).

FDTD Performance Investigation
In this part, we employ a typical example to demonstrate the performance improvement by using the SSE instruction set on the multi-CPU workstations.The acceleration factor here reflects the simulation acceleration gained by using the SSE implementation.A typical test case includes only voltage   7].We compare the FDTD simulation performance on one 18-node high-performance cluster and a popular nVidia Tesla C1060 GPU.The cluster with 36 Intel Xeon X5550 CPUs is installed in 18 nodes, which are connected through  the 10G Ethernet system.The performance comparison between two platforms is summarized in Figure 8.For the ideal test case, one nVidia Tesla C1060 GPU has the similar performance to two Intel X5550 CPUs.Next, we investigate the simulation performance of parallel FDTD code on a workstation installed with 4 AMD Opteron 6168 1.9 GHz CPUs.The 6-layer PML in this case is used to truncate the computational domain.The performance of VALU acceleration is shown in Figure 9.
We can observe from Figure 8 that the VALU can improve the simulation performance 200% for the practical problem with 6-layer PML truncation.Next, we compare the performance of the workstation with 4 AMD Opteron 6168 CPUs and a cluster with 8 Intel Xeon X5550 CPUs for simulation of the same problem.The 4-node cluster with 8 Intel CPUs is connected with each other using the 10G Ethernet.The performance comparison is plotted in Figure 10, which shows that 4 AMD CPUs have the similar performance tot 8 Intel CPUs.Since each AMD CPU includes 12 cores and 12 VALUs, the AMD workstation has total 48 VALUs; however, each Intel X5550 CPU only has 4 cores and 4 VALUs.
In the practical problems, the simulation factors such as output types, dispersive media, and near-to-far field transformation will influence the SSE performance due to the discontinuous data structure inside memory.However, it can be improved by optimizing the cache hit ratio.

Engineering Applications
In this part, we use the parallel FDTD code accelerated by using the SSE instruction set to simulate a waveguide (WR75) filter problem [8], as shown in Figure 11.The filter includes five cavities and is excited by TE 10 mode at one end.The output parameter transmission coefficient for the TE 10 mode is measured at another end.The purpose is to    investigate the performance of SSE acceleration on the 2-CPU workstation for the practical problem.
For the sake of comparison, we use the FEM method [9] to simulate the same problem and plot the results in Figure 11.It is evident from Figure 12 to observe the good agreement.It is worthwhile to mention that the results are same with and without the SSE acceleration.
The parallel FDTD performance with the SSE acceleration is summarized in Table 1.It is observed from Table 1 that the SSE can accelerate the FDTD code 2.37 times for this practical problem.
Next, we use the parallel FDTD performance with the SSE acceleration to simulate a patient room, as shown in Figure 13, in which the field distribution is generated by a dipole antenna, as shown in Figure 13.The room dimension is 10 × 5 × 3 miters, which is discretized into 3333 × 1666 × 1000 uniform cells.The simulation performance is summarized in Table 2.It is observed from Table 2 that a workstation with 4 AMD Opteron 6168 1.9 GHz CPUs has the similar performance as the 10 Intel Xeon X5570 2.93 GHz CPUs.It is worthwhile to mention that, without the SSE acceleration, one AMD AMD Opteron 6168 1.9 GHz CPU has the same performance as one Intel Xeon X5570 2.93 GHz CPU.
The simulation on the cluster platform was provided by an independent third party.The cluster does not support the NUMA architecture.Otherwise, the cluster performance should be better.
Finally, we use the parallel FDTD code enhanced by SSE instruction set to simulate a patch array [10] for IEEE802.11b.The array includes 2 patch antenna elements and arrayed in E-plane.The simulation model and hardware implementation are shown in Figure 14.The substrate is FR4 with relative dielectric constant 4.2.The size of array is 71 mm (width) by 160 mm (length) and by 0.8 mm (thickness).The substrate is 2.5 mm above ground plane.The array is measured at Communication Research Center of Oriental Institute of Technology by vector network analyzer for return loss and near field range for power pattern.It is observed from Figures 15 and 16 for the return loss and power patterns of the array the good agreement between the FDTD simulation and measurement.The E-plane power patterns at frequencies 2.41 GHz, 2.44 GHz, and 2.46 GHz are shown in Figure 16.The copolar patterns are in agreement for both FDTD simulation and measurement.The minor difference of xpolar patterns for both FDTD simulation and measurement may be caused by the improper hardware implementation or measurement.

Conclusions
In this paper, we propose a new hardware acceleration technique based on the SSE instruction set and give an implementation technique.The result shows that this technique

Figure 1 :
Figure 1: Position of E-and H-fields in the FDTD cell.

Figure 2 :Figure 3 :
Figure 2: Concept of field exchange between the subdomains in the parallel processing.

Figure 4 :
Figure 4: Data structure in the y-z plane.

Figure 5 :
Figure 5: Performance improvement of the parallel conformal FDTD code using VALU acceleration technique.

Figure 6 :Figure 7 :
Figure 6: FDTD performance on CPU and CPU value for the ideal test case.

Figure 8 :Figure 9 :Figure 10 :
Figure 8: Performance of parallel FDTD enhanced by using SSE instruction set on a cluster platform and comparison to the GPU acceleration.
Field distribution at 3 GHz inside the patient room.

Figure 13 :
Figure 13: A typical patient room as an example to demonstrate the FDTD performance enhanced with SSE instruction set.
Figure 14: Configuration of a patch array and the prototype.(a) Configuration of the patch array; (b) prototype of the patch array.

Figure 15 :
Figure 15: Return loss of the antenna array.

Table 1 :
Parallel FDTD performance with the SEE acceleration.

Table 2 :
Performance comparison of parallel FDTD method on 4-CPU workstation and high-performance cluster.