
It is argued here that more accurate, though more compute-intensive, alternatives to certain computational methods, long deemed too inefficient and wasteful in serial codes, can be more efficient and cost-effective when implemented in parallel codes designed to run on today's multicore and many-core environments. This argument is most germane to methods that involve large data sets with relatively low computational density, that is, algorithms with small ratios of floating-point operations to memory accesses. The examples chosen here to support this argument represent a variety of high-order finite-difference time-domain (FDTD) algorithms. It will be demonstrated that a three- to eightfold increase in floating-point operations due to higher-order finite differences translates into only a two- to threefold increase in actual run time on today's graphics and central processing units. It is hoped that this argument will convince researchers to revisit certain numerical techniques that have long been shelved and reevaluate them for multicore usability.

General-purpose graphics processing units (GPUs) have emerged in recent years as a viable vehicle for high-performance computing. The emergence of a simplified programming language for GPU computing in the form of CUDA C, along with dedicated GPU hardware designed specifically for high-end workstations as well as supercomputing platforms [

The finite-difference time-domain (FDTD) method investigated here, a dominant method in the field of computational electromagnetics, belongs somewhere between the latter two categories. This is because FDTD is a memory-intensive algorithm with relatively low computational density (a low ratio of floating-point operations applied to a set of data to the number of read/write memory accesses of those data). In GPU computing, this can exact a serious performance penalty in the form of long idle times for the computing cores as they wait for data to process. Most of the efforts reported in the GPU/FDTD literature [
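To put a number on this low computational density: the standard second-order FDTD update of one field component costs roughly 6 floating-point operations against 8 single-precision memory accesses (7 reads plus 1 write), well under one flop per byte, whereas the compute-to-bandwidth ratio of a modern GPU is an order of magnitude higher. A minimal sketch of the calculation (the per-cell counts are the ones tabulated later in this paper):

```c
/* Arithmetic intensity (flops per byte) of a stencil update, given its
   per-cell floating-point operation count and memory-access count.
   For the standard second-order FDTD update: about 6 flops against
   7 reads + 1 write in single precision (4 bytes per access),
   i.e. 6 / 32 = 0.1875 flops per byte. */
double arithmetic_intensity(double flops, double accesses,
                            double bytes_per_access) {
    return flops / (accesses * bytes_per_access);
}
```

At under 0.2 flops per byte, the computing cores spend most of each update waiting on memory rather than computing.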

An alternate route, however, is to turn the disadvantage of memory latency into an advantage by asking the GPU (or CPU) computing cores to perform more useful computations, instead of idling, on the same data they already have at hand. This can be achieved by opting for high-order FDTD algorithms instead of the standard second-order algorithm. The extra computations required by these high-order algorithms go a long way toward alleviating FDTD's biggest shortcoming when modeling electrically large problems (structures with dimensions much larger than the wavelength of interest): the excessive phase error that accumulates in numerically propagated waves. There are several reported flavors of high-order FDTD algorithms [

Four different high-order FDTD algorithms will be investigated in this work to determine their suitability for achieving a favorable balance between computing speed and memory speed. For a fair and straightforward comparison, only algorithms with explicit update equations that are applicable to Yee's standard cubical, staggered mesh are considered. Shared-memory use will not be considered, since its use for high-order FDTD algorithms has not yet been reported in the literature. Indeed, this presentation is probably the first reported GPU implementation of a high-order FDTD algorithm. Furthermore, the GPU experiments will be conducted using CUDA Fortran [

We begin by sampling the update equations of the various FDTD algorithms that will be used to test the proposed hypothesis. Only one of the six field components will be detailed in this section for each algorithm. Furthermore, the coefficients specific to each algorithm will not be detailed here, to simplify the presentation. The object of this study, after all, is not to investigate the relative accuracy of the various algorithms; rather, it is to test their relative efficiency in memory-structure manipulation within CPUs and GPUs.

The standard FDTD algorithm [
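As a serial reference point for the cost counts tabulated below, the second-order Yee update of the Hx component can be sketched in plain C as follows (the array layout, names, and precomputed coefficient arrays Cy and Cz are illustrative assumptions, not the paper's listing):

```c
#include <stddef.h>

/* Grid dimensions for the sketch (hypothetical). */
#define NX 16
#define NY 16
#define NZ 16
#define IDX(i, j, k) ((size_t)(i) * NY * NZ + (size_t)(j) * NZ + (size_t)(k))

/* Second-order Yee update of Hx over the interior of the grid.
   Per cell: 6 floating-point ops, 7 reads (Hx, 2x Ey, 2x Ez, and the two
   coefficients), 1 write -- matching the standard-FDTD row of the
   theoretical cost table. */
void update_hx(float *Hx, const float *Ey, const float *Ez,
               const float *Cy, const float *Cz) {
    for (int i = 0; i < NX; ++i)
        for (int j = 0; j < NY - 1; ++j)
            for (int k = 0; k < NZ - 1; ++k) {
                size_t c = IDX(i, j, k);
                Hx[c] = Hx[c]
                      + Cy[c] * (Ey[IDX(i, j, k + 1)] - Ey[c])
                      - Cz[c] * (Ez[IDX(i, j + 1, k)] - Ez[c]);
            }
}
```

The high-order algorithms below replace the two-point differences in this loop body with wider stencils, multiplying the per-cell operation and read counts.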

The high-order FV24 algorithm is based on converting an integral form of Maxwell’s equations into multiple weighted volume and surface integrals around the field node of interest before discretizing them [

The terms multiplied by the

This algorithm is selected from a class of high-order FDTD algorithms developed by Zygiridis and Tsiboukis [

This algorithm is based on straightforward fourth-order central finite differences in space. It also differs from the other algorithms in that it uses backward fourth-order finite differences in time, as suggested by Hwang and Ihm [
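For orientation, the fourth-order central difference underlying such spatial stencils is the familiar five-point formula f'(x) ≈ [-f(x+2h) + 8f(x+h) - 8f(x-h) + f(x-2h)]/(12h). A minimal sketch for a generic sampled function (not the paper's field arrays):

```c
/* Fourth-order-accurate central-difference approximation of f'(x).
   The truncation error is proportional to h^4 * f'''''(x), so the formula
   is exact (to rounding) for polynomials of degree four or less. */
double central4(double (*f)(double), double x, double h) {
    return (-f(x + 2*h) + 8*f(x + h) - 8*f(x - h) + f(x - 2*h)) / (12*h);
}

/* Sample function for checking the stencil: f(x) = x^3, f'(2) = 12. */
double cube(double x) { return x * x * x; }
```

Note the cost asymmetry this creates: twice as many neighbor reads per axis as the second-order formula, but also twice as many subtractions and multiplications per difference.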

Table

Comparison of the theoretical computational costs for the various FDTD algorithms.

Algorithm | Floats | Reads | Writes | Floats/FDTD floats
---|---|---|---|---
FDTD | 6 | 7 | 1 | 1
FV24 | 45 | 46 | 1 | 7.5
FV24-WB | 28 | 29 | 1 | 4.7
S | 45 | 46 | 1 | 7.5
S44 | 17 | 18 | 1 | 2.8

It is somewhat surprising that an FDTD GPU implementation using CUDA Fortran has yet to appear in the literature. All the relevant references mentioned at the end of this work, for example, unanimously used CUDA C. To add value to this presentation, CUDA Fortran will be used instead; it is a product developed by the Portland Group [

The GPU kernel that computes the standard FDTD update equation can be written as

where

The calling routine for the above kernel from the main program is given by

where dimGridHx and dimBlockHx are special variables that define the configuration of the kernel launch, which determines how the overall grid is dissected into subdomain sizes that yield the best GPU performance. In particular, dimBlockHx sets the (up to) three-dimensional size of the kernel's thread-block, which directly affects its efficiency. Many factors influence the choice of thread-block dimensions. Some are related to the available GPU resources, such as the number of computing cores and their groupings as well as the memory architecture and bandwidth. Other factors are related to the algorithm being computed and its requirements in register space and shared memory. Still other factors are related to the programming language being used, as CUDA C and CUDA Fortran follow their parent languages in memory layout: row-major for C and column-major for Fortran.
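The dissection of the overall grid into thread-blocks amounts to ceiling division in each dimension: enough blocks are launched to cover the grid, with the last block along each axis possibly only partially occupied. A plain-C sketch of that computation:

```c
/* Number of thread-blocks needed to cover n cells with blocks of b threads
   along one axis: ceiling division, so any remainder cells cost one extra
   (partially idle) block. */
int blocks_needed(int n, int b) {
    return (n + b - 1) / b;
}
```

For example, covering a hypothetical 100-cell axis with 32-thread blocks takes 4 blocks, the last of which has 28 of its 32 threads doing useful work; this partial occupancy is one reason block dimensions that divide the grid evenly tend to perform better.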

Table

GPU kernel throughput for the standard FDTD algorithm at several thread-block size configurations.

Thread-block size | MCells/s
---|---
4 × 4 × 4 | 177
8 × 8 × 8 | 221
16 × 8 × 4 | 301
16 × 8 × 8 | 249
32 × 8 × 4 | 350

Optimum thread-block configurations for the various FDTD algorithms.

Algorithm | | |
---|---|---|---
FDTD | 32 | 2 | 3
FV24 | 32 | 3 | 2
FV24-WB | 32 | 2 | 2
S | 32 | 2 | 2
S44 | 32 | 2 | 3

For those not familiar with Fortran, the following is the CUDA C version of the standard FDTD GPU kernel. The only difference from the CUDA Fortran kernel listed above is that the thread-block used in this code is one-dimensional.

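With a one-dimensional thread-block, each thread must first recover its three cell indices from a single linear id (in the kernel itself, blockIdx.x * blockDim.x + threadIdx.x) before applying the same update. A plain-C sketch of that index arithmetic, with hypothetical grid dimensions:

```c
/* Unflatten a linear thread id t into (i, j, k) cell indices for an
   nx-by-ny-by-nz grid stored with k varying fastest (C row-major order;
   a CUDA Fortran kernel would instead vary the first index fastest). */
void unflatten(int t, int ny, int nz, int *i, int *j, int *k) {
    *k = t % nz;
    *j = (t / nz) % ny;
    *i = t / (nz * ny);
}
```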

This code is listed only for direct comparison between CUDA Fortran and CUDA C. All the results shown in the next section are based on the CUDA Fortran kernels.

Each of the five developed kernels is tasked with updating every

Table

Comparison of GPU kernels’ throughputs for the various FDTD algorithms.

Algorithm | Throughput (MCells/s) | FDTD/Algorithm throughput
---|---|---
FDTD | 383 | 1
FV24 | 159 | 2.4
FV24-WB | 198 | 1.9
S | 144 | 2.7
S44 | 215 | 1.8
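These measurements can be read against the theoretical cost table directly: FV24, for example, performs 7.5 times the floating-point work of standard FDTD yet runs only 383/159, about 2.4 times, slower, meaning roughly a threefold increase in arithmetic is effectively hidden behind the same memory traffic. The last column of the table is just this ratio:

```c
/* Slowdown factor of a high-order kernel relative to the standard FDTD
   kernel, from measured throughputs in MCells/s. The values asserted in
   the accompanying checks are the paper's own measurements. */
double slowdown_factor(double fdtd_throughput, double alg_throughput) {
    return fdtd_throughput / alg_throughput;
}
```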

This behavior is by no means limited to GPU computing; today's CPUs also outpace their memory bandwidth. Running the CPU versions of the algorithms discussed above on a single core of a Xeon W5580 processor resulted in the throughputs listed in Table

Comparison of CPU codes’ throughputs for the various FDTD algorithms.

Algorithm | Throughput (MCells/s) | FDTD/Algorithm throughput
---|---|---
FDTD | 75 | 1
FV24 | 32 | 2.3
FV24-WB | 35 | 2.1
S | 26 | 2.9
S44 | 26 | 2.9

The main objective of this paper is to drive home the point that numerical algorithms long considered computationally inefficient may need to be reevaluated in light of the unique nature of today's CPU and GPU computing architectures and programming models. It is possible, especially for memory-bandwidth-bound algorithms, for memory latency to hide much of the computational inefficiency within the framework of modern processors. In such situations, seemingly inefficient algorithms, if they are otherwise of value, might become viable again for certain applications. This possibility is by no means limited to FDTD or to computational electromagnetics; virtually any memory-bandwidth-bound numerical method can potentially benefit from these findings.

This work was supported by Kuwait University Research Grant no. EE02/07.