FDTD Acceleration for Cylindrical Resonator Design Based on the Hybrid of Single and Double Precision Floating-Point Computation

Acceleration of FDTD (finite-difference time-domain) is very important for the fields such as computational electromagnetic simulation. We consider the FDTD simulation model of cylindrical resonator design that requires double precision floating-point and cannot be done using single precision. Conventional FDTD acceleration methods have a common problem of memorybandwidth limitation due to the large amount of parallel data access. To overcome this problem, we propose a hybrid of single and double precision floating-point computation method that reduces the data-transfer amount. We analyze the characteristics of the FDTD simulation to find out when we can use single precision instead of double precision. According to the experimental results, we achieved over 15 times of speed-up compared to the CPU single-core implementation and over 1.52 times of speed-up compared to the conventional GPU-based implementation.


Introduction
Computational electromagnetic simulation shows a rapid development recently due to the introduction of processors that have parallel processing capability such as multicore CPUs and GPUs (graphic processing units).The FDTD (finite-difference time-domain) algorithm [1,2] is one of the most popular methods of computational electromagnetic simulations due to its simplicity and very high computational efficiency.It has been widely used in many applications such as coil modeling [3] and resonance characteristics analysis of a cylindrical cavity [4,5].Many of these applications require double precision floating-point computation to satisfy the stability condition [6].
The FDTD simulation requires a large amount of data.When more processor cores are used in parallel, more data transfers occur between the memory and the processor cores.Therefore, the memory-bandwidth limitation is a major problem in the FDTD simulation using computers.To overcome this problem, we have to reduce the data transfer amount, so that we can use more cores in parallel.To do this, we propose a hybrid precision computation method that uses both single and double precision.Single precision data use 4 bytes compared to 8 bytes used in double precision data.Therefore, using single precision reduces the data amount and increases the processing speed.However, using single precision could bring inaccurate results.In some cases, the FDTD simulation does not converge when it is executed for a large number of iterations.
In this paper, we consider the FDTD simulation of a cylindrical resonator [5].It is one of the most fundamental types of resonant cavities and has been used to construct, for example, wavelength filters for microwaves [7,8].It also has a number of important applications in the lightwave field, such as couplers and laser cavities [9][10][11].In such applications, the quality factor (Q) of a cylindrical cavity, which is a basic characteristic, depends on the performance of the cavity walls.The fundamental characteristics, such as resonance wavelength, Q factor, and modal fields, are calculated by numerical simulation.The same simulation is executed many times by changing the parameters such as the radius of the cavity and the thickness and the depth of the cavity wall.Therefore, the processing time of the simulation is very large.
This paper proposes a hybrid of single and double precision computation method to reduce the processing time of the FDTD simulation of a cylindrical resonator.The proposed method can be used in both multicore CPUs and GPU accelerators.We analyze the characteristics of the application to find out where we can use single precision instead of double precision.According to the experimental results, we achieved 1.41 and 5.19 times of speed-up, respectively, for CPU and GPU implementations, compared to the conventional double precision implementation using 12 threads and 6 CPU cores.Compared to the conventional GPU implementation, the proposed hybrid precision method using GPU has over 1.52 times of speed-up.Using the proposed method, we can extract more performance from the same hardware without any additional overhead.

The FDTD Simulation of Cylindrical Resonator
A cylindrical cavity surrounded by a curved resonant grating wall is proposed in [5].Its resonance characteristics are described using the FDTD simulation.In this section, we briefly explain how the FDTD simulation is performed for this application.Figure 1 shows a cross section of the cylindrical cavity which is similar to a ring.The cavity wall has a curved resonant grating.PBC and RBC stand for periodic and radiation boundary conditions, respectively.The depth and the base thickness of the grating are 2 and  0 , respectively.The pitch of the grating is Λ.When the free space wavelength is  0 ,  0 ≈ 2Λ.The FDTD simulation is performed by simplifying the structure as a two-dimensional (2-D) one.Figure 2 shows the 2-D grid for the simulation.The polar coordinates of the computation area in Figure 1 are transformed to a 2-D grid of orthogonal coordinates.Note that the boundary conditions are calculated separately and a small portion of the area inside the ring is calculated using 1-D FDTD.The rest of the area is calculated using 2-D FDTD simulation.According to the experimental results, this simulation does not converge when using only single precision floating-point computation.
The flow-chart of the FDTD simulation is shown in Figure 3.The total number of iterations and the iteration number are given by  tot and , respectively.The electric and magnetic fields are computed one after the other.In between these computations, the boundary conditions are applied.To design a cylindrical resonator for a particular resonance wavelength, the FTDT simulation is executed many times by changing the parameters such as the radius of the cavity and the thickness and the depth of the grated wall.This requires a lot of processing time.
There are many recent works such as [12][13][14][15][16] that use GPUs to accelerate FDTD.GPU acceleration of 2-D and 3-D FDTD for electromagnetic field analysis is proposed in [12,13], respectively.Multi-GPUs are used to accelerate the FDTD in [14,15,17].Although [15] gives good results for few GPUs, the communication overhead between GPUs via host memory could be a serious bottleneck for a system with a large number of GPUs.Periodical data exchange between GPU and CPU is proposed in [16] to overcome the GPU memory limitation.However, it comes with a 10% performance loss.Moreover, recent GPUs have GPU-to-GPU data transfer capability so that multi-GPU method can be better than this approach.The speed-up of these methods comes with an additional hardware cost.In this paper, our approach is different from the previous works.We focus on increasing the throughput of the GPUs without adding extra hardware.We propose a method to reduce the processing time of the FDTD simulation by using a hybrid of single and double precision floating-point computation.We evaluate using both multicore CPU and GPU to show the effectiveness of our method.Moreover, the proposed method can be used together with most of the previous works to increase the speed-up further.

Hybrid of Single and Double Precision
Floating-Point Computation

Hybrid Precision Computation.
Since the FDTD simulation usually requires a large amount of data, its processing speed is decided by the memory-bandwidth.Our idea in this paper is to reduce the data amount by using more single precision floating-point computations instead of double precision.Double precision requires 64-bit data while the single precision requires only 32-bit data.Therefore, the data amount can be reduced by half if only the single precision is used.However, some areas on the computation domain have big fluctuations of the electromagnetic field so that the single precision cannot be used.Therefore, we use single precision computation for the area where the fluctuation of the field is very low.We use double precision on the rest of the area.
We set a partition boundary on the computation domain that separates the area of single precision computation and the double precision computation.The optimal position of the partition boundary in hybrid precision computation is the grid coordinates that separate the single and double precision computation area in such a way that the processing time is minimized, under which the condition of simulation converges.Note that we assume the simulation converges when using double precision computations.The amount of fluctuations of the electromagnetic field depends on the simulation model and it could only be found by doing the simulation.Therefore, the optimal partition boundary is very hard to determine theoretically.Due to this problem, we propose a practical approach to partition the computation domain to achieve a sufficient processing time reduction.This practical approach depends on two factors.The first one is that the electromagnetic field reduces exponentially with the distance from its source.It is proportional to  − , where  is the distance from the center of the ring (the source).The term  depends on the grating thickness Λ of the cylindrical cavity (refer to [5]).The second one is that the partition boundary is placed outside the ring (or cavity wall).Since the resonator is designed to preserve the field of a certain wavelength, a strong field could exist in the cavity and computing it using single precision may not be possible.Considering these two factors, we conduct experiments and observed the electromagnetic field at different distances from the center.According to these experimental results, we found that the field measured  + 4Λ away from the center is very weak and has very small fluctuations.Since  0 ≈ 2Λ, the partition boundary is given by (1), where  is the radius of the ring as shown in Figure 4(a): The partition boundary on the grid is shown by Figure 4(b).
It is calculated by converting the polar coordinates of the partition boundary to the orthogonal coordinates of the grid.We know that the grid-points away from the partition boundary have small fluctuations so that the computations can be done by single precision floating-point computation.However, we do not know the degree of the fluctuations of the electromagnetic field on or near the cavity wall.It depends on the parameters of the simulation model and we have to do the simulation first to find it out.Therefore, similar to the conventional method, we use double precision floating-point for the computation for the grid-points on or near the cavity wall and the boundaries.
Figure 5 shows the flow-chart of the proposed hybrid precision floating-point computation.The total number of iterations and the iteration number are given by  tot and , respectively.A part of the area is computed using double precision while the rest is computed using single precision as shown in Figure 4.In the GPU implementation, the initial and output data transfers are done by the CPU and all the other computations are done by the GPU.The single and double precision areas are identified by the thread-ID [18].Moreover, this hybrid precision computation can also be used in multicore CPUs.In the CPU implementation, single and double precision computation areas are identified by the grid coordinates.

Evaluation
For the evaluation, we use an "Intel Xeon ES-2620" CPU and "Nvidia Tesla C2075" GPU [19].CPU has the hyperthreading  technology with 6 physical cores.The detailed specifications are given in Table 1.We use gcc compiler and CUDA 4.2 [18] for the compilation.GPU is programmed by CUDA which is basically the "C language" with GPU extensions.A detailed description of CUDA architecture and its programming environment are given in [18,20,21].The specifications of the GPU and CPU used for the evaluation are given in Table 1.
Figure 6 shows the electromagnetic field fluctuations against time (iteration) and area.Figure 6(a) shows the electric field at the distance from the center of the ring after 100,000 iterations.The field on and near the cavity wall is strong while the field outside the cavity is weak.We observed similar characteristics from the magnetic field data analysis as well.Figure 6(b) shows the fluctuations of the electric field against the number of iterations.We plot the electric field values of three points: one point inside the ring, one on the ring, and the other outside the ring.The electric field inside the ring remains quite strong while the field outside the ring gets weaker with the time.According to the observations in Figure 6, the electromagnetic field values on and near the cavity walls show large fluctuations, so that high-precision computation is required.Other areas show small fluctuations, so that low-precision computation can be applied.In the proposed hybrid precision computation method, we use single precision for the most of the area  outside the cavity wall.The rest of the computation area and the boundaries are done in double precision.Note that some of the computation area inside the cavity is done by 1-D FDTD, so that the processing time required is very small.Since the boundary area is very small compared to the total grid size, the time required to process the boundary data is also very small.Figure 7 shows the processing time of multicore CPU implementation using conventional double precision and the proposed hybrid precision methods.The parallel thread execution is done using OpenMP with C language.Compiler is gcc.According to the results of the double precision computation, we can see a significant processing time reduction from 1 to 3 threads.However, from 4 threads, the processing time does not change much.The results for the hybrid computation are similar to those of the double precision computation.However, the gap between the curves of double precision and hybrid precision widens after 3 threads.To explain this, we estimated the memory bandwidth of each implementation.
Table 2 shows the estimated memory bandwidth in each implementation.It is estimated by calculating the data capacity and dividing it with the processing time.The estimated data capacity that needs to be transferred in a single iteration is about 490 MB.Since there are 100211 iterations, total of over 49 GB of data are transferred.Since the data transfer amount is so large and the cache memory of the CPU is only 15 MB, we assume that the impact of the cache memory is the same for all single core and multicore implementations.Since we are only comparing the performance of different implementations, we did not include the impact of cache memory for the memory bandwidth estimation.According to the results in Table 2, memory bandwidths of double and hybrid precision computations near their peaks after 3 and 6 parallel threads.After 3 threads, the hybrid precision that has a smaller data capacity than the double precision gives better results.Therefore, the proposed method gives better results for multicore processors with a lot of parallel processing.Note that, in 1 to 2 thread implementations, the memory bandwidth is not a bottleneck so that both double and hybrid precision take similar amount of execution time (although hybrid precision is slightly better).Figure 8 shows the comparison between double and hybrid precision computation on CPU and GPU.The CPU implementation is done with 12 threads on 6 cores, using Intel hyperthreading technology.According to the results, GPU implementation with hybrid precision is 5.19 times and 1.52 times faster than the conventional double precision implementations on the CPU and GPU, respectively.Hybrid precision computation on the GPU is 3.67 times faster than that on the CPU.Moreover, hybrid precision computation on the CPU is 1.41 times faster than the conventional double precision implementation on the CPU.The GPU speed-up factors compared to the CPU in double and hybrid precision computations are 3.41 and 3.67, respectively.Interestingly, these figures very close to the memory bandwidth ratio in Table 1 between GPU and CPU which is 3.38.Therefore, we can assume that the processing times of multicore-CPU and GPU implementations are almost decided by the memory bandwidth.Since single precision floating-point data used in hybrid precision computation need only 4 bytes compared to the 8 bytes in double precision, more data can be transferred with the same bandwidth.As a result, the processing time is reduced.
Figure 9 shows the processing time reduction against the single precision computation area for a simulation model.According to these results, the processing time decreases when the percentage of the single precision computation increases.That is, if we can push the partition boundary towards the center of the ring, we can reduce the processing time further.In this example, we push the partition boundary by increasing the single precision area to find the smallest processing time.According to the results, the minimum processing time is observed when 71% of the area is computed using single precision, while the speed-up is 1.52 times.The proposed partition boundary defined in (1) allows only 64% of the computation to be done by single precision and the speed-up is 1.41 times.Since the speed-up of the proposed method is very similar to the best speed-up, we can say that the proposed partition boundary is very effective.Moreover, if the same simulation is executed numerous times by slightly changing the parameters, it would be useful to move the partition boundary towards the center for aggressive processing time reduction, even if there is a risk of a divergence.If the simulation does not converge, we can always move back the partition boundary away from the wall.
As shown in Figure 9, more than 10% of the computation area must be done in single precision to reduce the processing time.Otherwise, the processing time increases due to the single-double precision conversion overhead on the partition boundary.Since the electric (or the magnetic) field is calculated by its near-by magnetic (or electric) field data, both single and double precision data are required to process the data on the partition boundary.Therefore, if the single precision area is too small, the proposed method cannot be applied.
We compare the FDTD simulation results of double and hybrid precision computations in Figure 10 of the revised paper.The electric fields calculated using double precision and hybrid precision computations are shown in Figures 10(a) and 10(b), respectively.The two electric fields are almost identical where a strong field can be seen on the ring area.We observed that the magnetic fields obtained by double and hybrid precision computations are also very similar.Figure 10(c) shows the absolute difference between the double and hybrid precision computations.The absolute difference is smaller than 1.0 × 10 −9 and is very small compared to the electric field values shown in Figures 10(a) and 10(b).Considering the almost identical electric field distribution and very small computation difference, we conclude that the hybrid precision computation is accurate enough to be used in FDTD simulations of a cylindrical resonator.

Conclusion
In this paper, we proposed an FDTD computation acceleration method that uses hybrid of single/double precision floating-point to extract the maximum performance from GPU accelerators and multicore CPUs.In multicore parallel processing, the speed of the FDTD computation depends on the memory access speed and the memory bandwidth.Since the amount of data required for double precision floatingpoint is two times larger than that required for the single precision floating-point, we can increase the processing speed by doing more computation in single precision.Otherwise, we can reduce the cost by using cheaper medium-range GPUs (or CPUs) to extract the same performance of the expensive high-end GPUs (or CPUs).According to the results, we achieved over 15 times of speed-up compared to the singlecore CPU implementation and over 1.52 times of speed-up compared to the conventional GPU acceleration.

Figure 4 :
Figure 4: Partitioning of the computation domain.

Figure 7 :
Figure 7: Precision versus processing time of FDTD on CPU.

Figure 10 :
Figure 10: Comparison of the simulation results using double and hybrid precision computations.

Table 1 :
Specifications of the evaluation environment.

Table 2 :
Estimated memory bandwidth versus number of threads.