The Case for Higher Computational Density in the Memory-Bound FDTD Method within Multicore Environments

It is argued here that more accurate though more compute-intensive alternate algorithms to certain computational methods which are deemed too inefficient and wasteful when implemented within serial codes can be more efficient and cost-effective when implemented in parallel codes designed to run on today’s multicore and many-core environments. This argument is most germane to methods that involve large data sets with relatively limited computational density—in other words, algorithms with small ratios of floating point operations to memory accesses. The examples chosen here to support this argument represent a variety of high-order finite-difference time-domain algorithms. It will be demonstrated that a threeto eightfold increase in floating-point operations due to higher-order finite-differences will translate to only twoto threefold increases in actual run times using either graphical or central processing units of today. It is hoped that this argument will convince researchers to revisit certain numerical techniques that have long been shelved and reevaluate them for multicore usability.


Introduction
General-purpose graphical processing units (GPUs) have emerged in recent years as a viable vehicle for high-performance computing.The emergence of a simplified programming language for GPU computing in the form of CUDA C along with dedicated GPU hardware specifically designed for high-end workstations as well as supercomputing platforms [1] was keenly received by the scientific computing community.A literature survey covering almost any field of scientific computing will reveal the unmistakable keen interest in porting codes from central processing units (CPUs) to GPUs.While the fastest CPUs today (e.g., Intel Xeon 8870) are capable of speeds of up to 200-giga-floating point operations per second (GFlops) in singleprecision [2], the fastest dedicated GPUs can deliver performance beyond the tera-Flops (TFlops) mark.Not all numerical methods, however, are created equal in terms of their abilities to approach the GPUs' theoretical performance limits.While some methods fit the GPUs' memory and computing model nicely, others require serious innovation to reshape them for optimum GPU use, whereas others still have to settle for a fraction of the theoretical performance limits.
The finite-difference time-domain (FDTD) method investigated here, which is a dominant method in the field of computational electromagnetics, belongs somewhere between the latter two categories.This is due to the fact that FDTD is a memory intensive algorithm with relatively low computational density (low ratio of floating-point operations applied on a set of data to number of read/write memory accesses of these data).In GPU computing, this can cause a serious performance penalty in the form of long idle times for the computing cores as they wait for data to process.Most of the efforts reported in the GPU/FDTD literature [3][4][5][6][7][8][9][10][11][12][13] were focused on how to best use the available GPU memory bandwidth through exploiting its on-chip shared and texture memory banks, as opposed to the much slower off-chip global memory.To a slightly lesser degree, this memory-bandwidth limitation issue applies to today's high-end CPUs as well.
An alternate route, however, is to turn the disadvantage of memory latency (slowness) into an advantage by asking the GPU (or CPU) computing cores to perform more useful computations (instead of idling) on the same data they have at hand.This could be achieved by opting for highorder FDTD algorithms instead of the standard secondorder algorithm.The extra computations required by these high-order algorithms go a long way to alleviate FDTD's biggest shortcoming when modeling electrically large problems (structures with dimensions much larger than the wavelength of interest) which is excessive accumulated phase errors in numerically propagated waves.There are several reported flavors of high-order FDTD algorithms [14][15][16][17][18][19][20] and they all share the advantage (with various degrees of success) of reduced numerical dispersion, which allows them to model larger electromagnetic structures with better accuracy compared to the standard method when using the same computing resources.
Four different high-order FDTD algorithms will be investigated in this work to determine their suitability to achieving a favorable balance between computing speed and memory speed.For a fair and straightforward comparison between the different algorithms, only algorithms with explicit update equations and applicable on Yee's standard cubical and staggered mesh are considered for the comparison.Shared memory use will not be considered since its use for highorder FDTD algorithms has not been reported in the literature yet.Indeed, this presentation is probably the first reported GPU implementation of a high-order FDTD algorithm.Furthermore, the GPU experiments will be conducted using CUDA Fortran [21] instead of CUDA C, which has not been reported in the FDTD literature either.

FDTD Update Equations
We begin by first sampling the update equations of the various FDTD algorithms which will be used to test the proposed hypothesis.Only one of the six field components will be detailed in this section for each algorithm.Furthermore, the various coefficients which are specific to each algorithm will not be detailed here to simplify the presentation.The object of this study, after all, is not to investigate the relative accuracy of the various algorithms.Rather, it is to test the relative efficiency in memory structure manipulations within CPUs and GPUs.
2.1.The Standard FDTD Algorithm.The standard FDTD algorithm [22] uses second-order finite differences to approximate Maxwell's differential equations which govern electromagnetic propagation and interaction with its surroundings.These finite-difference versions are rearranged as explicit update equations which are in turn evaluated iteratively as waves propagate within the numerical grid.A typical update equation (there are six total per time step) is given by where H x is a typical magnetic field node within the FDTD grid and the E's are the surrounding electric field nodes.In this simplified form, the update equation involves 6 floatingpoint operations per invocation and requires 7 memory reads and one write.Note that some of the i, j, k space and n time indices within the discrete curl operator (all E terms) are not stepped.In the remaining update equations in this work, only stepped indices will be shown to simplify the expressions.
2.2.The FV24 Algorithm.The high-order FV24 algorithm is based on converting an integral form of Maxwell's equations into multiple weighted volume and surface integrals around the field node of interest before discretizing them [20].Among all the other high-order FDTD algorithms, this particular one has the advantage of an almost completely developed suite of tools (see e.g.[23] and the references listed therein).The corresponding FV24 update equation for H x is given by . (2)

The S{2, 2}
Algorithm.This algorithm is selected from a class of high-order FDTD algorithms developed by Zygiridis and Tsiboukis [19] and has the same number of floatingpoint operations and memory reads and writes as the FV24 algorithm.This particular version is selected, however, to illustrate the effect of different memory access patterns between otherwise identical algorithms from a computational point of view.The corresponding update equation for the balanced S{2, 2} algorithm is given by 2.5.The S44 Algorithm.This algorithm is based on straightforward fourth-order central finite-differences in space.It also differs from the other algorithms in that it has backward fourth-order finite differences in time as suggested by Hwang and Ihm [18].The corresponding update equation is given by This update equation has 17 floating-point operations with 18 memory reads and one write.Table 1 summarizes the computational costs of the update equations listed in this section.In listing the memory reads column, in particular, all the D ax and D bx coefficients were assumed position dependent to accommodate modeling inhomogeneous domains.From the table, it is obvious that the high-order algorithms are handicapped by as much as 7.5-fold increase in computational intensity and up to 5.9fold increase in combined (read/write) memory accesses.

CUDA Fortran Implementation
It is somewhat surprising that there is yet to appear in the literature an FDTD GPU implementation using CUDA Fortran.All the relevant references mentioned at the end of this work, for example, unanimously used CUDA C. To add value to this presentation, CUDA Fortan will be used instead, which is a product developed by the Portland Group [21].It would be the obvious choice for Fortran programmers as it affords most of the functionality of CUDA C, coupled with the simplicity and efficiency of Fortran's multidimensional array management.A glaring omission from CUDA Fortran at this time is the lack of texture memory accessibility, though it is expected to be implemented in due course.

International Journal of Antennas and Propagation
The GPU kernel that computes the standard FDTD update equation can be written as Hx(i, j, k) = dax(i, j, k) * Hx(i, j, k) + dbx(i, j, k) * (Ey(i, j, k + 1) -Ey(i, j, k) -Ez(i, j + 1, k) + Ez(i, j, k)) end subroutine Yee kernel where x 1 , x 2 , y 1 , y 2 , z 1 , and z 2 are the overall FDTD grid boundaries, B x , B y , B z define the number of thread-blocks along the three dimensions of this grid, and i, j, k represent the H x mapped position within it.The remaining algorithm's kernels have similar forms and differ only in terms of the actual update equation within the nested if statements and the requisite additional coefficients array declarations.A close look at the above kernel would reveal the absence of the requisite nested for loops in typical CPU codes.The reason behind that is the way GPU kernels work.In GPU kernels, scheduling is performed by the hardware which handles the division of the array elements among the GPU's many cores.All the user needs to do is specify the array bounds and the thread-to-data mapping structure.
The calling routine for the above kernel from the main program is given by call Yee kernel <<<dimGridYee, dimBlockYee>>> where dimGridHx and dimBlockHx are special variables that define the configuration of the kernel launch which decides how the overall grid is dissected into optimum subdomain sizes for best GPU performance.In particular, dimBlockHx decides the (up to) 3-dimensional size of the kernel's threadblock which directly affects its efficiency.Many factors influence the thread-block dimensions choice.Some are related to the available GPU resources such as number of computing cores and their groupings as well as memory architecture and bandwidth.Other factors are related to the algorithm being computed and its requirements of register space and shared memory.Other factors still are related to the programming language being used as CUDA C and CUDA Fortran follow their parent languages in the manner of one being row major (C) and the other being column major (Fortran).Table 2 demonstrates the standard FDTD throughput in MCells/s at different thread-block size configurations.This metric represents the number of millions of FDTD cells that are updated per second which are computed as follows: 1 6

No of FDTD cells in millions Simulation time in seconds
× No of time steps. ( The 1/6 factor is there to highlight the fact that this and subsequent simulation results are based on single and independent field-component kernel timings and not combined components FDTD simulations.With a few extra trials it was found that the optimum thread-block size for this particular algorithm is 32 × 2 × 3. The 32 number is significant in that it fits the GPU access pattern of its global memory (The latest generation GPUs (compute capability 2.0 or later) access their global memory in groups of 32 4-byte floats instead of 16 [24].For optimum results, each thread should read 32 consecutively stored numbers at a time from memory-which is along the first dimension of a Fortran array.Earlier generation GPUs allowed group accesses of multiples of 16 4-byte floats).As can be seen from the table, deciding the optimum thread-block size is critical to the kernel performance.Table 3 lists the optimum thread-block dimensions for each of the tested algorithms.

}}}}}
This code is only listed for direct comparison's sake between CUDA Fortran and CUDA C. All the results shown in the next section are based on the CUDA Fortran kernels.

Runtime Comparisons
Each of the five developed kernels is tasked with updating every H z node in a 300 × 300 × 300 FDTD grid.The kernels were run 1,000 times each and the corresponding runtimes were averaged out.CUDA Fortran's cudaEventElapsedTime and related runtime API's were used for collecting run times and data were converted to units of MCells/s as explained earlier.The GPU used for the comparison is an nVidia Tesla C2050.Table 4 compares the GPU throughput for the five tested FDTD algorithms.As expected, the standard FDTD algorithm has the highest throughput at 383 MCells/s.For the record, the CPU-only version of the code managed a 75 MCells/s which translates into a GPU speed-up of nearly 5 : 1.What is important, however, is the fact that an algorithm such as the FV24 which has 7.5 times the computational density of the standard algorithm managed a 198 MCells/s throughput which is only 198/383 1/2 as slow.This translates into the FV24 algorithm being nearly 4 times faster than predicted by a direct floating-point operations count.This result is even more impressive considering that the FV24 algorithm requires in theory nearly 6 times more memory accesses than the standard FDTD algorithm (see Table 1).
This behavior is by no means limited to GPU computing.Today's CPUs are also guilty of outpacing memory bandwidth.Running the CPU versions of the above discussed algorithms on a single core of a Xeon W5580 processor resulted in the throughputs listed in Table 5.Based on these results, and whether CPUs or GPUs are used, it can be argued that using high-order FDTD algorithms that run only twice as slow as the standard method can be worth the main advantage of such algorithms.For example, it has been reported in [20] that the FV24 algorithm is capable of modeling wave propagation for distances that are 3,000 times longer than in the standard method while accumulating the same overall phase error at the grid resolution of 20 cells per wavelength.

Conclusion
The main objective of this paper is to drive the point that there might be a need to reevaluate numerical algorithms that have long been considered to be computationally inefficient, in light of the unique nature of today's CPU and GPU computing architectures and programming models.It is possible, especially for memory bandwidth-bound algorithms, that memory latency can hide much of the computational inefficiency within the framework of modern processors.In such situations, such seemingly inefficient algorithms, if they are otherwise of value, might become viable again for certain applications.This possibility is by no means limited to FDTD or to computational electromagnetics.Virtually any memory bandwidth-bound numerical method can potentially benefit from these findings.

Table 1 :
Comparison of the theoretical computational costs for the various FDTD algorithms.

Table 2 :
GPU kernel throughput for the standard FDTD algorithm at several thread-block size configurations.

Table 3 :
Optimum thread-block configurations for the various FDTD algorithms.

Table 4 :
Comparison of GPU kernels' throughputs for the various FDTD algorithms.

Table 5 :
Comparison of CPU codes' throughputs for the various FDTD algorithms.