We introduce a hardware acceleration technique for the parallel finite difference time domain (FDTD) method using the SSE (Streaming SIMD Extensions, where SIMD stands for single instruction, multiple data) instruction set. Applying the SSE instruction set to the parallel FDTD method achieves a significant improvement in simulation performance. Benchmarks of the SSE acceleration on both a multi-CPU workstation and a computer cluster demonstrate the advantages of VALU (vector arithmetic logic unit) acceleration over GPU acceleration. Several engineering applications are employed to demonstrate the performance of the parallel FDTD method enhanced by the SSE instruction set.
Since the FDTD method was first proposed by Yee in 1966 [
With the development of multicore processors, the GPU (graphics processing unit) [
The VALU acceleration has been illustrated through several typical examples on both Intel and AMD processors. Since AMD processors today include more physical cores than Intel processors, FDTD simulations accelerated by the VALU units benefit more from AMD than from Intel processors. A waveguide filter and a patch antenna array are used to validate the VALU acceleration technique. In this paper, we first briefly introduce the parallel FDTD method in Section
In the FDTD method, Maxwell's equations are discretized into a set of difference equations through the central difference formula. The propagation of electromagnetic waves and their interaction with the environment are analyzed by updating the electric and magnetic fields in space and time. Each electric/magnetic field component can be calculated explicitly from its previous value at the same location and the four surrounding magnetic/electric field components a half time step earlier, as illustrated in Figure
Position of E- and H-fields in the FDTD cell.
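For reference, the explicit update of, e.g., the $E_x$ component in the standard Yee scheme takes the following well-known form, where the material coefficients $C_a$ and $C_b$ absorb the permittivity and conductivity:

$$E_x^{n+1}(i,j,k) = C_a\,E_x^{n}(i,j,k) + C_b\left[\frac{H_z^{n+1/2}(i,j,k) - H_z^{n+1/2}(i,j-1,k)}{\Delta y} - \frac{H_y^{n+1/2}(i,j,k) - H_y^{n+1/2}(i,j,k-1)}{\Delta z}\right]$$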
It is observed from Figure
Using this feature, we can achieve excellent parallel performance when running the FDTD code on multicore processors and computer clusters, because the method only requires the exchange of information on the interfaces between subdomains. Among the several popular data exchange methods [
Concept of field exchange between the subdomains in the parallel processing.
The MPI library is designed for communication among distributed processors. The data exchanged in the FDTD simulation is located on the interface between adjacent processors, as we can see from Figure
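As a sketch, this interface exchange can be implemented with one pair of MPI_Sendrecv calls per neighboring direction; the buffer and rank names below are illustrative assumptions rather than the code's actual identifiers.

```c
#include <mpi.h>

/* Exchange the tangential field components on the two interfaces of a
 * 1-D domain decomposition. left/right are the neighbor ranks (pass
 * MPI_PROC_NULL on an outer boundary, which turns the transfer into a
 * no-op); the buffers hold the interface plane packed as floats. */
void exchange_interface(const float *send_left, float *recv_left,
                        const float *send_right, float *recv_right,
                        int count, int left, int right, MPI_Comm comm)
{
    /* Ship our left interface to the left neighbor while receiving
     * the right neighbor's interface data, then the mirror image. */
    MPI_Sendrecv(send_left,  count, MPI_FLOAT, left,  0,
                 recv_right, count, MPI_FLOAT, right, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(send_right, count, MPI_FLOAT, right, 1,
                 recv_left,  count, MPI_FLOAT, left,  1,
                 comm, MPI_STATUS_IGNORE);
}
```

MPI_Sendrecv avoids the deadlock that naive blocking MPI_Send/MPI_Recv pairs can cause when every subdomain sends at once.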
Vector units have existed in regular CPUs for many years; however, they were rarely used in engineering simulations. In this paper, we describe an approach for applying the vector unit to the FDTD simulation through the SSE instruction set. Compared to GPU acceleration, the VALU acceleration technique runs on general-purpose platforms and does not require any extra hardware devices. We now briefly introduce the architecture of a regular CPU and then describe how to use the VALU to speed up the FDTD simulation. Each core in a multicore processor has its own cache, FPU, and VALU, as shown in Figure
CPU architecture including FPU and VALU.
Next, we use the electric field component update as an example to explain how the FDTD code is accelerated using the VALU. The electric field component is updated through the following steps (see the sketch after this list):

1. Load the coefficient of the magnetic field.
2. Load the coefficient of the magnetic field.
3. Convert the float pointer to the SSE 128-bit pointer.
4. Calculate the difference of the magnetic fields.
5. Multiply the difference of the magnetic fields by their coefficients.
6. Calculate the contribution of the magnetic fields to the electric fields.
7. Multiply the previous electric fields by their coefficients.
8. Calculate the electric fields and write them to memory (Algorithm
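These steps map directly onto SSE intrinsics. The following is a minimal sketch of such a kernel, not the paper's actual code: the array names (ex, hy, hz, ca, cb), the strides, and the assumption that the 1/Δy and 1/Δz factors are folded into the cb coefficients are all illustrative.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, ... */

/* Sketch of a VALU-accelerated Ex update along the unit-stride
 * dimension. The caller passes pointers into the interior of the grid
 * so that i - stride stays in bounds; n is a multiple of 4 and the
 * base arrays are 16-byte aligned. All names here are illustrative. */
void update_ex_sse(float *ex, const float *hy, const float *hz,
                   const float *ca, const float *cb,
                   int n, int stride_y, int stride_z)
{
    for (int i = 0; i < n; i += 4) {
        /* Steps 1-3: load coefficients and fields through 128-bit
         * SSE loads, i.e., four floats per instruction. */
        __m128 e   = _mm_load_ps(ex + i);
        __m128 hz1 = _mm_load_ps(hz + i);
        __m128 hz0 = _mm_loadu_ps(hz + i - stride_y);
        __m128 hy1 = _mm_load_ps(hy + i);
        __m128 hy0 = _mm_loadu_ps(hy + i - stride_z);

        /* Step 4: difference of the magnetic fields (discrete curl). */
        __m128 curl = _mm_sub_ps(_mm_sub_ps(hz1, hz0),
                                 _mm_sub_ps(hy1, hy0));

        /* Steps 5-7: multiply by the coefficients. */
        __m128 contrib = _mm_mul_ps(_mm_load_ps(cb + i), curl);
        __m128 decayed = _mm_mul_ps(_mm_load_ps(ca + i), e);

        /* Step 8: new electric field written back to memory. */
        _mm_store_ps(ex + i, _mm_add_ps(decayed, contrib));
    }
}
```

Four single-precision values are processed per instruction, which is the source of the speedup reported in the benchmarks below when the data is contiguous and the memory bandwidth is not yet saturated.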
A 3-D array in the FDTD code is allocated using the
In the C programming language, the data in memory is contiguous in
Data structure in the
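A minimal sketch of such a contiguous allocation is shown below; the IDX macro and function name are illustrative helpers, not the code's actual identifiers.

```c
#include <stdlib.h>

/* Row-major index into a flat nx*ny*nz block: k varies fastest, so
 * sweeping k gives unit-stride access for both the cache and the
 * 128-bit SSE loads. */
#define IDX(i, j, k, ny, nz) ((((size_t)(i) * (ny) + (j)) * (nz)) + (k))

/* One allocation keeps the whole field contiguous in memory. For the
 * aligned SSE loads, a 16-byte-aligned allocator such as
 * posix_memalign or _mm_malloc can be used instead of calloc. */
float *alloc_field(int nx, int ny, int nz)
{
    return (float *)calloc((size_t)nx * ny * nz, sizeof(float));
}
```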
Using the VALU with the procedure described above can significantly accelerate the field update operations compared to the traditional FDTD code. In any case, data contiguity in the FDTD code and memory bandwidth are the most important factors for the VALU acceleration. To evaluate the FDTD code, we define the performance as follows:
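A common throughput definition for FDTD codes, which we assume here, is the number of cells updated per second:

$$\text{Performance} = \frac{N_x N_y N_z \times N_{\text{steps}}}{T_{\text{simulation}}}\ \text{(cells/s)},$$

usually reported in Mcells/s.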
Performance improvement of the parallel conformal FDTD code using VALU acceleration technique.
Finally, we use the parallel conformal FDTD code to simulate an ideal test case, a hollow box with the simplest excitation and output, whose domain is truncated by a perfect electric conductor (PEC). We ran problems of different sizes using the regular FDTD code (Intel Core i7-965) and the FDTD code with the VALU (Intel Core i7-965 with 4 VALUs; total memory bandwidth 32 GB/s). The simulation summary is plotted in Figure
FDTD performance on the CPU and the CPU with VALU for the ideal test case.
In this part, we employ a typical example to demonstrate the performance improvement obtained by using the SSE instruction set on multi-CPU workstations. The acceleration factor here reflects the simulation speedup gained by the SSE implementation. A typical test case includes only a voltage source and a field output at a specific point. The empty box truncated by the PEC boundary is discretized into 300 × 300 × 300 (27 Mcells), 400 × 400 × 400 (64 Mcells), 500 × 500 × 500 (125 Mcells), 600 × 600 × 600 (216 Mcells), 700 × 700 × 700 (343 Mcells), 800 × 800 × 800 (512 Mcells), and 900 × 900 × 900 (729 Mcells) uniform cells, respectively. The numerical experiments were carried out on a 2-CPU workstation installed with two AMD Opteron 6128 2.0 GHz processors. The simulation performance is summarized in Figure
Performance of parallel FDTD enhanced by using SSE instruction set on a 2-CPU workstation platform for the different operating systems.
We compare the FDTD simulation performance on an 18-node high-performance cluster and a popular nVidia Tesla C1060 GPU. The cluster, with 36 Intel Xeon X5550 CPUs installed across the 18 nodes, is connected through a 10G Ethernet system. The performance comparison between the two platforms is summarized in Figure
Performance of parallel FDTD enhanced by using SSE instruction set on a cluster platform and comparison to the GPU acceleration.
Next, we investigate the simulation performance of the parallel FDTD code on a workstation installed with 4 AMD Opteron 6168 1.9 GHz CPUs. A 6-layer PML is used in this case to truncate the computational domain. The performance of the VALU acceleration is shown in Figure
FDTD performance of the VALU acceleration on one 4-CPU workstation.
We can observe from Figure
Performance comparison of parallel FDTD code on a workstation with 4 AMD CPUs and a cluster with 4 nodes (8 Intel CPUs).
In practical problems, simulation factors such as output types, dispersive media, and near-to-far-field transformation will degrade the SSE performance because of discontinuous data structures in memory. However, this can be mitigated by optimizing the cache hit ratio.
In this part, we use the parallel FDTD code accelerated by the SSE instruction set to simulate a WR75 waveguide filter problem [
Configuration of waveguide filter with 5 open cavities.
For the sake of comparison, we use the finite element method (FEM) [
Transmission coefficient of waveguide filter generated by using enhanced FDTD method and FEM.
The parallel FDTD performance with the SSE acceleration is summarized in Table
Parallel FDTD performance with the SSE acceleration.
| | FDTD with SSE acceleration | FDTD without SSE acceleration |
| --- | --- | --- |
| Workstation | 2 × AMD Opteron 6128 2.0 GHz | 2 × AMD Opteron 6128 2.0 GHz |
| Memory usage | 37 MB | 37 MB |
| Simulation time | 145 sec | 345 sec |
Next, we use the parallel FDTD code with the SSE acceleration to simulate a patient room, as shown in Figure
Performance comparison of parallel FDTD method on 4-CPU workstation and high-performance cluster.
| Platform | CPU type | Network | Simulation time |
| --- | --- | --- | --- |
| Workstation (4 CPUs, 64 GB RAM) | AMD Opteron 6168 1.9 GHz ($795 each) | No network | 319 min. (with hardware acceleration and NUMA) |
| 2 CPUs (24 GB RAM) | Intel Xeon X5570 2.93 GHz ($1465 each) | No network | 1720 min. (with hardware acceleration, no NUMA) |
| 128 CPUs (1536 GB RAM) | Intel Xeon X5570 2.93 GHz ($1465 each) | Infiniband | 29 min. 37 sec. (with hardware acceleration, no NUMA) |
A typical patient room as an example to demonstrate the FDTD performance enhanced with SSE instruction set.
Problem dimension
Field distribution at 3 GHz inside the patient room.
The simulation on the cluster platform was provided by an independent third party. The cluster does not support the NUMA architecture; otherwise, its performance would be even better.
Finally, we use the parallel FDTD code enhanced by the SSE instruction set to simulate a patch array [
Configuration of a patch array and the prototype. (a) Configuration of the patch array; (b) prototype of the patch array.
Top and side views of the patch array
Prototype of the patch array
Return loss of the antenna array.
E-plane far field power pattern of the patch antenna array at 2.41 GHz, 2.44 GHz, and 2.47 GHz.
Far field power pattern at 2.41 GHz.
Far field power pattern at 2.44 GHz.
Far field power pattern at 2.47 GHz.
In this paper, we propose a new hardware acceleration technique based on the SSE instruction set and describe its implementation. The results show that this technique dramatically improves computing efficiency without any extra hardware investment, providing an efficient and economical approach for electromagnetic simulations.