The finite-difference time-domain (FDTD) method is a very popular way of numerically solving partial differential equations. FDTD has a low operational intensity, so its performance on CPUs and GPUs is often restricted by the memory bandwidth. Recently, deeply pipelined FPGA accelerators have shown considerable success by exploiting streaming data flows in FDTD computation. In spite of this success, many FPGA accelerators are not suitable for real-world applications that contain complex boundary conditions. Boundary conditions break the regularity of the data flow, so the performance is significantly reduced. This paper proposes an FPGA accelerator that computes the absorbing and periodic boundary conditions commonly used in many 3D FDTD applications. The accelerator is designed using a “C-like” programming language called OpenCL (open computing language). As a result, the proposed accelerator can be customized easily by changing the software code. According to the experimental results, we achieved over 3.3 times and 1.5 times higher processing speed compared to the CPUs and GPUs, respectively. Moreover, the proposed accelerator is more than 14 times faster than recently proposed FPGA accelerators that are capable of handling complex boundary conditions.
The finite-difference time-domain (FDTD) method is important and widely used in many areas, such as electromagnetic field analysis [
The boundary conditions commonly used in FDTD computation are Dirichlet, periodic, and absorbing boundary conditions. The Dirichlet, or fixed, boundary condition is the simplest one: a constant (usually zero) is used for the data on the boundaries. Fixed boundaries are used when the outside of a grid is a perfect electric or magnetic conductor. In this case, the fields on the boundaries equal zero. Most of the existing works use fixed boundary conditions to gain high speedups by preserving the regularity of the data flow. The absorbing boundary condition (ABC) is often used to simulate infinite domains by applying absorbing layers close to the boundary. Implementing ABC computation can break the regularity of the data flow because of the data dependency near the boundaries. The periodic boundary condition (PBC) is used to simulate infinite periodic structures by computing a small portion called a “unit.” PBCs are applied at the boundaries so that the same unit cell is replicated. Implementing PBC computation without breaking the regularity of the data flow is even more difficult because of the data dependency among different boundaries. However, many real-world FDTD applications [
In this paper, we propose an FPGA accelerator for 3D FDTD computation that efficiently supports absorbing and periodic boundary conditions. This paper is an extension of our previous work [
FPGA accelerators for FDTD computation with simple boundary conditions have already been implemented in previous works [
2D stencil computation using
Order of the data flow
Parallel computation of the cells in multiple iterations. Computations of
Figure
Flowchart of the computation. FPGA processes
The works in [
The works in [
In this section, we explain the FPGA acceleration for 3D FDTD computation when there are periodic and absorbing boundary conditions. Figure
Simulation area and unit-box.
Figure
3D computation domain of a unit-box.
Data dependency on the boundaries.
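The difference between a fixed and a periodic boundary can be illustrated with a toy 1D grid. The following is only a minimal sketch: the 3-point sum stencil, the function names `sweep` and `run`, and the grid values are assumptions made for easy arithmetic and do not represent the FDTD update equations of this paper.

```python
def sweep(u, boundary):
    """Apply one 3-point sum stencil to a 1D grid and return the new grid.

    The stencil is a stand-in chosen for easy arithmetic; it is not the
    FDTD update itself.
    """
    n = len(u)
    out = [0] * n
    for i in range(n):
        if boundary == "periodic":
            # Neighbours wrap around, so the first and the last cells
            # depend on each other.
            left, right = u[(i - 1) % n], u[(i + 1) % n]
        elif boundary == "fixed":
            if i == 0 or i == n - 1:
                continue  # boundary cells keep the constant value zero
            left, right = u[i - 1], u[i + 1]
        else:
            raise ValueError("boundary must be 'fixed' or 'periodic'")
        out[i] = left + u[i] + right
    return out


def run(u, steps, boundary):
    """Repeat the sweep for a given number of iterations."""
    for _ in range(steps):
        u = sweep(u, boundary)
    return u


pulse = [0, 1, 0, 0, 0]
fixed_result = run(pulse, 2, "fixed")
periodic_result = run(pulse, 2, "periodic")
```

In the periodic run, the last cell becomes nonzero after the second sweep only because it reads the first cell. This kind of dependency between opposite boundaries is what makes it hard to keep a purely streaming access order.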
Figure
Flowchart of the 3D FDTD algorithm.
The computation of the electric field in
Our goal in the proposed implementation is to maintain the regularity of the input and output data streams even when there are boundary conditions. However, as explained in Section
Figure
Proposed pipelined computation module (PCM). Computed data are arranged in the correct order to preserve the regularity of the data flow.
Figure
Architecture of the proposed FPGA accelerator. Iteration-parallel computation is achieved by processing
However, this method has a few disadvantages. Since we have to use shift registers to temporarily store the computed data, the area of a PCM increases. As a result, fewer PCMs may fit on the FPGA, which can increase the processing time. In addition, the computed data have to be stored for several clock cycles until they are arranged correctly, which increases the latency and therefore also the processing time.
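The latency cost of buffering data in shift registers can be seen in a small software model. This is a sketch under stated assumptions: the class `ShiftRegister`, the helper `run_chain`, and the depths used here are hypothetical and are not taken from the actual PCM design.

```python
from collections import deque


class ShiftRegister:
    """Fixed-depth delay line: a value comes out `depth` shifts after it went in."""

    def __init__(self, depth, fill=None):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        self.cells = deque([fill] * depth, maxlen=depth)

    def shift(self, value):
        oldest = self.cells[0]
        self.cells.append(value)  # maxlen discards the oldest entry
        return oldest


def run_chain(stream, depths):
    """Feed a stream through cascaded shift registers, one value per cycle."""
    registers = [ShiftRegister(d) for d in depths]
    outputs = []
    for value in stream:
        for register in registers:
            value = register.shift(value)
        outputs.append(value)
    return outputs


single = run_chain(range(6), [3])     # one register of hypothetical depth 3
cascade = run_chain(range(5), [2, 1]) # two registers in series
```

In this model, the first valid output of `single` appears only after three cycles, and the delays of the two registers in `cascade` add up. This mirrors the statement above that storing data for several clock cycles increases the latency.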
This section explains how to implement and optimize the proposed FPGA accelerator using OpenCL [
Listing
Since
To calculate the electric fields
If all the data are stored at different offsets in the global memory, we require at least 9 “load transactions.” To reduce the number of transactions, we use an array-of-structures in the global memory to store the data. Each element in the array holds the present and the previous electric fields and the constants. Therefore, one array element provides all the data necessary to compute one cell. One array element contains 36 bytes (
To reduce the number of memory transactions further, we have to reduce the size of one array element. To do so, we compute the constants in the FPGA instead of loading precomputed ones from the memory. For example, if we compute
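The 36-byte element size is consistent with nine single-precision values of 4 bytes each, which can be checked with a short sketch. The field names below and the split into three present fields, three previous fields, and three constants are assumptions for illustration only. The reduced layout with one field removed is likewise hypothetical and does not reproduce the element size actually used in this paper.

```python
import struct

# Hypothetical field names; only the count (9 values) and the total size
# (36 bytes) come from the text.
FIELDS = [
    "ex_now", "ey_now", "ez_now",     # assumed: present electric fields
    "ex_prev", "ey_prev", "ez_prev",  # assumed: previous electric fields
    "c0", "c1", "c2",                 # assumed: per-cell constants
]


def element_format(num_fields):
    """Little-endian struct format for `num_fields` packed 32-bit floats."""
    return "<%df" % num_fields


def element_bytes(num_fields):
    """Size in bytes of one array element holding `num_fields` floats."""
    return struct.calcsize(element_format(num_fields))


full_size = element_bytes(len(FIELDS))         # all nine values loaded
reduced_size = element_bytes(len(FIELDS) - 1)  # one value computed on chip

# One element packs into a single contiguous record, so one cell needs one
# contiguous read instead of nine scattered ones.
record = struct.pack(element_format(len(FIELDS)), *[float(i) for i in range(9)])
unpacked = struct.unpack(element_format(len(FIELDS)), record)
```

Each value that is computed in the FPGA rather than loaded removes 4 bytes from every array element, which reduces the data read from the global memory per cell.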
Algorithm
The computations of the cells in the core and the boundary are shown from line
For the evaluation, we use two FPGA boards, two GPUs, and two multicore CPUs. The FPGA boards are DE5 [
Table
Comparison of the FPGA accelerators with different data access methods using DE5 board.
Method  Processing time (s)  Area: ALMs (%)  Area: registers (%)  Area: memory in kByte (%)  Area: DSPs (%)  Area: RAM blocks (%)  Frequency (MHz)  

8.01  179,925 (77)  368,483 (39)  2,911 (45)  114 (45)  2,050 (80)  205.6 

14.75  176,370 (75)  357,220 (38)  3,012 (47)  114 (45)  1,953 (76)  206.7 

9.42  170,692 (73)  349,960 (37)  2,958 (45)  114 (45)  1,879 (73)  217.8 

6.31  175,032 (75)  357,283 (38)  2,595 (41)  114 (45)  1,750 (68)  260.0 

6.69  176,637 (75)  359,472 (38)  2,363 (36)  114 (45)  1,618 (63)  243.6 
Table
Comparison with GPUs and CPUs using single-precision floating-point computation.
FPGA  GPU  Multicore CPU  

DE5  395-D8  GTX680  GTX750Ti  i7-4960x  E5-1650 v3  
Number of 
—  —  1152  1024  6  6 
Core clock frequency (MHz)  260  193  980  1127  3600  3500 
Memory bandwidth (GB/s)  25.6  34.1  192.2  86.4  51.2  59.7 
Peak performance (Gflop/s)  193  1502.9  3090  1305  345.6  672 


Processing time (s)  6.31  7.36  9.39  10.71  23.63  20.84 
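The speedups implied by the processing times in the table above can be recomputed directly. The following sketch uses only the tabulated times and takes the DE5 board as the reference. The dictionary and variable names are illustrative.

```python
# Processing times in seconds, copied from the single-precision comparison.
processing_time = {
    "DE5": 6.31,
    "395-D8": 7.36,
    "GTX680": 9.39,
    "GTX750Ti": 10.71,
    "i7-4960x": 23.63,
    "E5-1650 v3": 20.84,
}

reference = processing_time["DE5"]

# Speedup of the DE5-based accelerator over every other device.
speedup = {
    name: seconds / reference
    for name, seconds in processing_time.items()
    if name != "DE5"
}

cpu_speedups = [speedup["i7-4960x"], speedup["E5-1650 v3"]]
gpu_speedups = [speedup["GTX680"], speedup["GTX750Ti"]]

for name, ratio in sorted(speedup.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.2f}x")
```

With these numbers, the ratios against the two CPUs are roughly 3.7 and 3.3, and the ratios against the two GPUs are roughly 1.5 and 1.7.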
The comparison using double-precision floating-point computations is shown in Table
Comparison with GPUs and CPUs using double-precision floating-point computation.
FPGA  GPU  Multicore CPU  

DE5  395-D8  GTX680  GTX750Ti  i7-4960x  E5-1650 v3  
Core clock frequency (MHz)  252  209  980  1127  3600  3500 
Processing time (s)  28.09  25.24  14.44  20.61  69.45  62.13 
Table
Comparison with other FPGA accelerators for 3D FDTD with boundary conditions.
Method  Boundary conditions  FPGA  Performance  Clock frequency (MHz)  Achieved throughput (maximum bandwidth) (GB/s) 

Work in [ 
ABC  n/a  1,000  50  n/a 


Work in [ 
fixed  Xilinx Virtex-6 XC6VSX475T  325  100  n/a 


Work in [ 
fixed  1,820  26.8 (38.6)  
fixed and ABC  Xilinx Virtex-6 XC6VSX475T  1,193  100  29.9 (38.6)  
ABC and PBC  100  n/a  


This paper  ABC and PBC  Altera Stratix V 5SGXEA7N2F45C2  1,427  260  13.4 (25.6) 
n/a: not available.
Memory bandwidths of the method in [
Table
Performance comparison against FPGA accelerators that use simple boundary conditions.
Method  Boundary conditions  Performance (Gflop/s)  Frequency (MHz) 

2D FDTD [ 
fixed  150.3  291 
3D FDTD  ABC  98.5  261 
3D FDTD (this paper)  ABC and PBC  89.9  260 
We may improve the performance using manual HDL-based designs. However, we believe that the performance gap between OpenCL-based and HDL-based designs is narrowing. In our earlier work in [
We have proposed an FPGA accelerator for 3D FDTD that efficiently supports absorbing and periodic boundary conditions. The data flow is regulated in a PCM after the computation of the boundary data. This allows data streaming between multiple PCMs, so that we can implement iteration-parallel computation. The FPGA architecture is implemented using OpenCL. Therefore, we can use it for different applications and boundary conditions just by changing the software code. Since OpenCL is a system design method, we can implement the proposed accelerator on any system that contains an OpenCL-capable FPGA. According to the experimental results, we achieved over 3.3 times and 1.5 times higher processing speeds compared to the CPUs and GPUs, respectively. Moreover, the proposed accelerator is more than 14 times faster than recently proposed FPGA accelerators that are capable of handling boundary conditions.
The authors declare that there are no conflicts of interest regarding the publication of this paper.