We first present two GPU implementations of the standard Inverse Distance Weighting (IDW) interpolation algorithm: a tiled version that takes advantage of shared memory, and a CDP version implemented using CUDA Dynamic Parallelism (CDP). We then evaluate the power of GPU acceleration for the IDW interpolation algorithm by comparing the performance of the CPU implementation with that of three GPU implementations, namely, the naive version, the tiled version, and the CDP version. Experimental results show that the tiled version achieves speedups of 120x and 670x over the CPU version when the power parameter

Spatial interpolation is a fundamental task in Geosciences, in which a number of data points with known values, such as elevations, are used to predict unknown quantities of a continuous phenomenon at prediction points. The computational cost of the underlying algorithms usually grows with the number of data points entering the interpolation and the number of locations for which interpolated values are needed. Typically, implementing spatial interpolation with conventional sequential programming patterns is computationally expensive for large data sets and thus calls for scalable computing solutions such as parallelization.

The Inverse Distance Weighting (IDW) algorithm is one of the most commonly used spatial interpolation methods in Geosciences mainly due to its straightforward implementation. Shepard [

By taking advantage of the power of traditional CPU-based parallel programming models, Armstrong and Marciano [

Since general-purpose computing on modern Graphics Processing Units (GPUs) can significantly reduce computational cost by performing massively parallel computation, current research efforts are being devoted to parallel IDW algorithms on GPU computing architectures such as CUDA [

When intending to profit from massively parallel computing on GPUs, algorithms need to be carefully implemented according to the inherent features of GPU computing architectures. For example, shared memory is expected to be much faster than global memory; any opportunity to replace global memory accesses with shared memory accesses should therefore be exploited [

In this paper, we first develop a GPU implementation of the IDW algorithm according to the strategy “tiling” [

The paper is organized as follows. Section

CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing platform and programming model created by NVIDIA and implemented on NVIDIA GPUs, which leverages the power of parallel computing on GPUs to solve complex computational problems more efficiently than on a CPU. CUDA comes with a software environment that allows developers to use C as a high-level programming language. More details of the CUDA architecture are presented in [

Dynamic Parallelism in CUDA is an extension to the CUDA programming model that enables a CUDA kernel to create and synchronize new nested work by launching nested kernels [

Parent-child launch nesting (derived from Figure 12 in [

Dynamic Parallelism introduces the concepts of “parent” and “child” grids. A parent grid is one that has launched new nested grid(s), that is, the child grid(s). A child grid is one that has been launched by a parent grid. A child grid must complete before its parent grid is considered complete; in other words, the parent is not considered complete until all of its launched child grids have also completed [

There are several implementation limitations when programming with Dynamic Parallelism. For example, global memory, constant memory, and texture memory are visible to both the parent and child grids and can be written coherently by both. However, only those operations on the above three types of memory performed in the parent thread prior to the child grid’s invocation are visible to the child grid; all memory operations of the child grid are visible to the parent only after the parent has synchronized on the child grid’s completion. Shared memory and local memory are private storage for a thread block or a thread, respectively, and are not visible outside their scopes. It is

Dynamic Parallelism was introduced with the Kepler architecture and requires Compute Capability 3.5 or higher.

The IDW algorithm is one of the most commonly used spatial interpolation methods in Geosciences; it calculates the interpolated values of unknown points (prediction points) as a weighted average of the values of known points (data points). The name given to this type of method reflects the weighting applied: the weights are computed from the inverse of the distance to each known point. Different forms of IDW interpolation differ only in how they calculate the weights.
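As a concrete illustration of this weighted average, a minimal CPU sketch in C++ is given below; the point structure and function name are our own, not taken from the paper's code:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Point { float x, y, z; };  // z is the known value at a data point

// Predict the value at (px, py) from all data points using the simple IDW
// weighting: w_i = 1 / d_i^p, and the prediction is sum(w_i * z_i) / sum(w_i).
float idw_predict(float px, float py,
                  const std::vector<Point>& data, float power) {
    float weight_sum = 0.0f, value_sum = 0.0f;
    for (const Point& d : data) {
        float dx = px - d.x, dy = py - d.y;
        float dist = std::sqrt(dx * dx + dy * dy);
        if (dist == 0.0f) return d.z;  // prediction point coincides with a data point
        float w = 1.0f / std::pow(dist, power);
        weight_sum += w;
        value_sum += w * d.z;
    }
    return value_sum / weight_sum;
}
```

For example, a prediction point equidistant from two data points receives the plain average of their values, since both weights are equal.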

A general form of predicting an interpolated value

The above equation is a simple IDW weighting function, as defined by Shepard [

The naive implementation of the IDW interpolation is straightforward. Assuming that there are

The implemented CUDA kernel of this naive version can be found in [

The CUDA kernel presented in [

In the CUDA architecture, shared memory is expected to be much faster than global memory; any opportunity to replace global memory accesses with shared memory accesses should therefore be exploited [

This optimization strategy, “tiling,” is adopted to implement the IDW interpolation: the coordinates of the data points are first transferred from global memory to shared memory; each thread within a thread block can then access the coordinates stored in shared memory concurrently. Since shared memory is limited per SM (Streaming Multiprocessor), the data in global memory, that is, the coordinates of the data points, need to be split/tiled into small pieces and then transferred to shared memory.

In the tiled implementation, the tile size is set equal to the block size (i.e., the number of threads per block). Each thread within a thread block is responsible for loading the coordinates of one data point from global memory into shared memory and then computing the distances and inverse weights to those data points currently stored in shared memory. After all threads within a block have finished computing the partial distances and weights for the current tile, the next piece of data is loaded from global memory into shared memory and used to calculate the next wave of partial distances and weights. Each thread accumulates the results of all partial weights and all weighted values into two registers. Finally, the interpolated value of each prediction point is obtained from the sums of all partial weights and weighted values and written to global memory.
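The tiled accumulation described above can be sketched as a CPU analogue, where a small tile buffer stands in for shared memory and the function body stands in for one thread's work; the names and the coordinate layout are our own illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// CPU analogue of the tiled kernel: process data points tile by tile,
// accumulating partial weights and weighted values in two "registers".
float idw_tiled(float px, float py,
                const std::vector<float>& dx, const std::vector<float>& dy,
                const std::vector<float>& dz, float power, int tile_size) {
    int n = static_cast<int>(dx.size());
    float weight_sum = 0.0f, value_sum = 0.0f;  // the two per-thread registers
    std::vector<float> tx(tile_size), ty(tile_size), tz(tile_size);  // "shared memory" tile
    for (int base = 0; base < n; base += tile_size) {
        int m = std::min(tile_size, n - base);
        // On the GPU, each thread of the block loads one point of the tile here,
        // followed by __syncthreads(); on the CPU we simply copy the tile.
        for (int i = 0; i < m; ++i) {
            tx[i] = dx[base + i]; ty[i] = dy[base + i]; tz[i] = dz[base + i];
        }
        // Partial distances and weights for the current tile.
        for (int i = 0; i < m; ++i) {
            float ddx = px - tx[i], ddy = py - ty[i];
            float w = 1.0f / std::pow(std::sqrt(ddx * ddx + ddy * ddy), power);
            weight_sum += w;
            value_sum += w * tz[i];
        }
    }
    return value_sum / weight_sum;  // written back to global memory on the GPU
}
```

The result is independent of the tile size; tiling only changes where the coordinates are read from, not the arithmetic.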

By blocking the computation this way, the access to global memory can be reduced since the coordinates of data points are only read (

The basic idea behind implementing the IDW interpolation using Dynamic Parallelism is as follows. There are two levels of parallelism in IDW interpolation.

Level 1: for all prediction points, the interpolated values can be calculated in parallel. The interpolation for each unknown point does not depend on that of any other point and thus can be carried out concurrently.

Level 2: for each prediction point, the distances to all data points must first be calculated, and then the inverse weights. These distances and weights can obviously be calculated in parallel.

The parent kernel is responsible for performing the first level of parallelism, while the child kernel takes responsibility for realizing the second level of parallelism. There are only two levels of kernel launches.
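The two levels map directly onto the parent and child kernels. As a CPU sketch of that structure, with loop nesting standing in for kernel nesting (function and variable names are our own):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// CPU sketch of the two levels of parallelism in the CDP version. The outer
// loop corresponds to the parent grid (one thread per prediction point); the
// inner loop corresponds to the child grid launched by that parent thread
// (one thread per data point).
std::vector<float> idw_two_level(const std::vector<float>& px, const std::vector<float>& py,
                                 const std::vector<float>& dx, const std::vector<float>& dy,
                                 const std::vector<float>& dz, float power) {
    std::vector<float> out(px.size());
    for (size_t i = 0; i < px.size(); ++i) {      // level 1: parent grid
        float wsum = 0.0f, vsum = 0.0f;
        for (size_t j = 0; j < dx.size(); ++j) {  // level 2: child grid
            float ddx = px[i] - dx[j], ddy = py[i] - dy[j];
            float w = 1.0f / std::pow(std::sqrt(ddx * ddx + ddy * ddy), power);
            wsum += w;
            vsum += w * dz[j];
        }
        out[i] = vsum / wsum;                     // one interpolated value per parent thread
    }
    return out;
}
```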

In more detail, the parent grid is responsible for calculating the interpolated values of all prediction points in parallel. Each thread within the parent grid is designed to evaluate the interpolated value of one prediction point by invoking a child grid. Hence, there are at least

The launch arguments of the parent kernel mainly include the coordinates of data points and prediction points. Several arrays are allocated in global memory to store these coordinates. In addition, another array, sum [

In the child grid, each thread within a block is responsible for computing the distance from one data point to the prediction point specified by the parent thread. Each distance is first stored in a register and then transferred to shared memory. After all distances have been computed, the parallel reduction introduced by Harris [

The arguments of the child kernel are almost the same as those of the parent kernel. The only difference is that there is an additional argument, the unique index of the thread in the parent grid. This argument is used to indicate which thread in the parent grid invokes the child kernel.

We evaluate the GPU implementations using an NVIDIA GeForce GT640 (GDDR5) graphics card with CUDA 5.5. Note that the GeForce GT640 card with GDDR5 memory has Compute Capability (CC) 3.5, while it only has Compute Capability 2.1 with DDR3 memory. The CPU experiments were performed on Windows 7 SP1 with a dual-core Intel i5 3.2 GHz processor and 8 GB of RAM. For each set of testing data, we run one CPU implementation and three GPU implementations, all in single precision.

For each implementation, we perform two different forms that have different values of the power parameter

Different specifications of the power

The input of the IDW interpolation is obviously the coordinates of the data points and prediction points. The performance of the CPU and GPU implementations may differ due to different sizes of input data [

the numbers of prediction points and data points are identical;

the number of data points is fixed, and the number of prediction points differs;

the number of prediction points is fixed, and the number of data points differs.

We create five group sizes, that is, 10 K, 50 K, 100 K, 500 K, and 1000 K (1 K = 1024). When the number of prediction points is identical to the number of data points, five tests are performed by setting the numbers of both the data points and the prediction points to these five group sizes. The execution times of the four implementations and the corresponding speedups are shown in Figures

Execution times in the test case when the numbers of prediction points and data points are identical. (a) The form in which the power parameter is set to 2. (b) The form in which the power parameter is set to 3.0.

Speedups in the test case when the numbers of prediction points and data points are identical. (a) The form in which the power parameter is set to 2. (b) The form in which the power parameter is set to 3.0.

In the second test case, the number of data points is fixed at 100 K. Five tests are also carried out by setting the number of prediction points to each of the five group sizes listed above. The experimental results in this case are presented in Figures

Execution times in the test case when the number of data points is fixed and the number of prediction points differs. (a) The form in which the power parameter is set to 2. (b) The form in which the power parameter is set to 3.0.

Speedups in the test case when the number of data points is fixed and the number of prediction points differs. (a) The form in which the power parameter is set to 2. (b) The form in which the power parameter is set to 3.0.

Unlike the second test case, in the third case the number of prediction points rather than data points is fixed at 100 K, and the number of data points is set to one of the five group sizes. The results of the five experimental tests are shown in Figures

Execution times in the test case when the number of prediction points is fixed, and the number of data points differs. (a) The form in which the power parameter is set to 2. (b) The form in which the power parameter is set to 3.0.

Speedups in the test case when the number of prediction points is fixed, and the number of data points differs. (a) The form in which the power parameter is set to 2. (b) The form in which the power parameter is set to 3.0.

According to the results of the above three test cases, we find that when the power is set to 2, the GPU implementations, that is, the naive version, the tiled version, and the CDP version, achieve speedups of 60x, 120x, and 10x over the CPU implementation, respectively; in contrast, when the power is set to 3.0, the speedups are about 380x, 670x, and 78x for the naive, tiled, and CDP implementations, respectively.

Several GPU implementations of the IDW interpolation are presented in the literature [

In this paper, we develop two GPU implementations of IDW interpolation, the tiled version and the CDP version, by taking advantage of fast shared memory and CUDA Dynamic Parallelism. To the best of our knowledge, these two GPU implementations have not been introduced in the existing literature.

In the tiled version, the coordinates of the data points originally stored in global memory are divided into small pieces/tiles that fit the size of shared memory and then loaded from slow global memory into fast shared memory. The coordinates stored in shared memory can be accessed quickly by all threads within a thread block when calculating the distances. By blocking the computation this way, we take advantage of fast shared memory and reduce global memory accesses: the coordinates of data points are only read (

Experimental tests show that the tiled version achieves speedups of 120x and 670x over the CPU version when the power parameter

The basic idea behind implementing the IDW interpolation using Dynamic Parallelism is simple. There are two levels of parallelism in IDW interpolation:

In the standard IDW interpolation, it is necessary to calculate

However, we obtain a negative result in practice. Although the CDP version is about 10x and 78x faster than the CPU version when the power

We analyze the CDP implementation carefully to explain this negative behavior and find that there are probably two main causes.

In the CDP version, the input data are the coordinates of the data points and prediction points, which are originally stored in global memory. When a child grid is invoked to calculate the distances from one prediction point to all data points, only those coordinates stored in global memory can be passed as launch arguments from the parent kernel to the child kernel. The “tiling” optimization strategy described for the tiled version cannot be adopted to reduce global memory accesses, since coordinates that are first divided and then loaded into shared memory cannot be passed as a launch argument when launching a child kernel.

Due to the above implementation limitations of Dynamic Parallelism, there are currently no optimization approaches for reducing global memory access. The coordinates of all data points stored in global memory need to be read n times, where n is the number of prediction points. The amount of global memory access in the CDP version is thus the same as in the naive GPU implementation and greater than in the tiled GPU implementation. This is one of the main causes of the negative result.

In the CDP version, we call the function cudaDeviceSynchronize() after launching the child kernel to guarantee that all child grids have completed execution. We have observed that, without calling the barrier cudaDeviceSynchronize(), only some of the threads within the parent kernel execute and return the expected interpolated values; in other words, the interpolation results in this case are incorrect and incomplete. However, the execution time for the overall interpolation procedure is then much less than when calling the barrier; see Figure

Impact of the barrier cudaDeviceSynchronize() on execution times.

As noted above, the barrier cudaDeviceSynchronize() is needed to guarantee correct and complete interpolation results but is time consuming. This is probably the other main cause that makes the CDP implementation computationally expensive.

For both the CPU and GPU implementations, the form with the power set to 3.0 is more computationally expensive than the form with the power set to 2. In particular, this behavior can be clearly observed for the CPU implementation: the form with the power set to 3.0 is 7.6x slower than the other form. This behavior is due to the expensive sequential calculation of the powered distances. For the GPU implementations, the slowdown is slight because the powered distances are computed in a massively parallel fashion.
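One reason the powered distance is cheaper for p = 2 is that the square root cancels: 1/d² = 1/(dx² + dy²), whereas p = 3.0 requires the distance itself and hence a sqrt (and a pow or extra multiplies). A sketch of the two weight computations, assuming the compiler does not already perform this rewrite (function names are our own):

```cpp
#include <cassert>
#include <cmath>

// Weight for p = 2: since d^2 = dx^2 + dy^2, the square root cancels and
// no sqrt or pow call is needed at all.
inline float weight_p2(float dx, float dy) {
    return 1.0f / (dx * dx + dy * dy);
}

// Weight for p = 3: the distance d itself is required, so the sqrt cannot
// be avoided; d^3 is then formed by two extra multiplies instead of powf.
inline float weight_p3(float dx, float dy) {
    float d = std::sqrt(dx * dx + dy * dy);
    return 1.0f / (d * d * d);
}
```

For example, for dx = 3, dy = 4 (so d = 5), weight_p2 yields 1/25 and weight_p3 yields 1/125.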

In this paper, both the tiled version and the CDP version are GPU implementations of the basic form of the IDW interpolation algorithm, which calculates the interpolated values using all data points (sample points). A practical solution for reducing the computational cost is to use some rather than all of the data points to calculate the interpolated values. The selection of proper subsets of data points can be carried out by domain decomposition [

We have developed two GPU implementations of the IDW interpolation algorithm, the tiled version and the CDP version, by taking advantage of shared memory and CUDA Dynamic Parallelism. We have demonstrated that the tiled version achieves speedups of 120x and 670x over the CPU version when the power parameter

The author declares that there is no conflict of interests regarding the publication of this paper.