
The use of image denoising techniques is an important part of many medical imaging applications. One common application is to improve the image quality of low-dose (noisy) computed tomography (CT) data. While 3D image denoising has previously been applied to several volumes independently, not much work has been done on true 4D image denoising, where the algorithm considers several volumes at the same time. The problem with 4D image denoising, compared to 2D and 3D denoising, is that the computational complexity increases exponentially. In this paper we describe a novel algorithm for true 4D image denoising, based on local adaptive filtering, and how to implement it on the graphics processing unit (GPU). The algorithm was applied to a 4D CT heart dataset with a resolution of 512 × 512 × 445 × 20. The result is that the GPU can complete the denoising in about 25 minutes if spatial filtering is used and in about 8 minutes if FFT-based filtering is used. The CPU implementation requires several days of processing time for spatial filtering and about 50 minutes for FFT-based filtering. The short processing time increases the clinical value of true 4D image denoising significantly.

Image denoising is commonly used in medical imaging to help medical doctors see abnormalities in the images. Image denoising was first applied to 2D images [

Three-dimensional image denoising has previously been applied to several time points independently, but there has not been much work done on

The rapid development of graphics processing units (GPUs) has led to many algorithms in the medical imaging domain being implemented on the GPU, in order to save time and to enable more advanced analysis. To give an example of the rapid GPU development, a comparison of three consumer graphics cards from Nvidia is given in Table

Comparison between three Nvidia GPUs, from three different generations, in terms of processor cores, memory bandwidth, size of shared memory, cache memory, and number of registers; MP stands for multiprocessor and GB/s stands for gigabytes per second. For the GTX 580, the user can for each kernel choose to use 48 KB of shared memory and 16 KB of L1 cache or vice versa.

Property/GPU | 9800 GT | GTX 285 | GTX 580 |
---|---|---|---|
Number of processor cores | 112 | 240 | 512 |
Normal size of global memory | 512 MB | 1024 MB | 1536 MB |
Global memory bandwidth | 57.6 GB/s | 159.0 GB/s | 192.4 GB/s |
Constant memory | 64 KB | 64 KB | 64 KB |
Shared memory per MP | 16 KB | 16 KB | 48/16 KB |
Float registers per MP | 8192 | 16384 | 32768 |
L1 cache per MP | None | None | 16/48 KB |
L2 cache | None | None | 768 KB |

In the area of image denoising, some algorithms have also been implemented on the GPU. Already in 2001 Rumpf and Strzodka [

To our knowledge, there has not been any work done on true 4D image denoising on the GPU. In this work we therefore present a novel algorithm, based on local adaptive filtering, for 4D denoising and describe how to implement it on the GPU, in order to decrease the processing time and thereby significantly increase the clinical value.

In this section, the algorithm that is used for true 4D image denoising will be described.

To show how a higher number of dimensions, the power of dimensionality, can improve the denoising result, a small test is first conducted on synthetic data. The size of the 4D data is

(1) Original test image without noise. There is a large step in the middle, a bright thin line and a shading from the top left corner to the bottom right corner. (2) Original test image with a lot of noise. The step is barely visible, while it is impossible to see the line or the shading. (3) Resulting image after 2D denoising. The step is almost visible and it is possible to see that the top left corner is brighter than the bottom right corner. (4) Resulting image after 3D denoising. Now the step and the shading are clearly visible, but not the line. (5) Resulting image after 4D denoising. Now all parts of the image are clearly visible.

The denoising approach that our work is based on is adaptive filtering. It was introduced for 2D by Knutsson et al. in 1983 [

Compared to more recently developed methods for image denoising (e.g., nonlocal means [

The local structure tensor can, for example, be estimated by using quadrature filters [

The number of filters that are required to estimate the tensor depends on the dimensionality of the data and is given by the number of independent components of the symmetric local structure tensor. The required number of filters is thus 3 for 2D, 6 for 3D and 10 for 4D. The given tensor formula, however, assumes that the filters are evenly spread. It is possible to spread 6 filters evenly in 3D, but it is not possible to spread 10 filters evenly in 4D. For this reason, 12 quadrature filters have to be used in 4D (i.e., a total of 24 filters in the spatial domain, 12 real valued and 12 complex valued). To apply 24 nonseparable filters to a 4D dataset requires a huge number of multiplications. In this paper a new type of filters,
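The filter counts quoted above follow from a simple identity: a symmetric n × n tensor has n(n + 1)/2 independent components. A minimal sketch:

```python
def independent_tensor_components(n):
    """Number of independent components of a symmetric n x n local
    structure tensor: the upper triangle including the diagonal."""
    return n * (n + 1) // 2

# 2D, 3D, and 4D give 3, 6, and 10, matching the filter counts in the text.
for n in (2, 3, 4):
    print(n, independent_tensor_components(n))
```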

Monomial filters also have one radial function

In 2D, the frequency variable is in this work defined as

(a) Two-dimensional monomial filters


The monomial filter response matrices

A total of 14 nonseparable 4D monomial filters (4 odd of the first-order

By using equation (

If monomial filters are used instead of quadrature filters, the required number of 4D filters is thus decreased from 24 to 14. Another advantage is that the monomial filters require a smaller spatial support, which makes it easier to preserve details and contrast in the processing. A smaller spatial support also results in a lower number of filter coefficients, which decreases the processing time.

When the local structure tensor

In the 2D case, the magnitude

Examples of the M-function that maps the magnitude of the structure tensor. If the magnitude of the structure tensor is too low, the magnitude is set to zero for the control tensor, such that no highpass information is used in this part of the data. The overshoot is intended to amplify structures that have a magnitude that is slightly above the noise threshold.

The isotropy

Examples of the mu-function that maps the isotropy of the structure tensor. If the structure tensor is almost isotropic (a high value on the

Three examples of isotropy mappings. (a) Original structure tensors. (b) Mapped control tensors. If the structure tensor is anisotropic, the control tensor becomes even more anisotropic (examples 1 and 2). If the structure tensor is almost isotropic, it becomes more isotropic (example 3).

The control tensor

For matrices of size

To map the isotropy, the structure tensor is first normalized as

The resulting transfer function that maps each eigenvalue is given in Figure

The transfer function that maps the eigenvalues of the structure tensor.

Eleven nonseparable reconstruction filters, one lowpass filter

All the processing steps of the denoising algorithm are given in Table

The table shows the input and output data resolution, the equations used, and the memory consumption for all the processing steps for spatial filtering (SF) and FFT-based filtering (FFTBF). Note that the driver for the GPU is stored in the global memory and normally requires 100–200 MB.

Processing step | Resolution, SF | Memory consumption, SF | Resolution, FFTBF | Memory consumption, FFTBF |
---|---|---|---|---|
Lowpass filtering and downsampling of CT volumes | in / out | 406 MB | in / out | 294 MB |
Filtering with 14 monomial filters and calculating the local structure tensor (( | in / out | 1376 MB | in / out | 1791 MB |
Lowpass filtering of the local structure tensor components (normalized convolution, ( | in / out | 1276 MB | in / out | 720 MB |
Calculating the tensor magnitude and mapping it with the M-function (( | in / out | 1376 MB | in / out | 770 MB |
Mapping the local structure tensor to the control tensor (( | in / out | 1376 MB | in / out | 770 MB |
Lowpass filtering of the control tensor components (normalized convolution, ( | in / out | 1476 MB | in / out | 820 MB |
Filtering with 11 reconstruction filters, interpolating the control | in / out | 2771 MB | in / out | 2110 MB |

While the presented algorithm is straightforward to implement, spatial filtering with 11 reconstruction filters of size

One of the main drawbacks of the presented algorithm is that, using standard convolution, the number of valid elements in the

The loss of valid slices can be avoided by using normalized convolution [
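Normalized averaging, the simplest form of normalized convolution, can be sketched in 1D as follows: both the certainty-weighted signal and the certainty itself are filtered, and the ratio is taken. This illustrates the principle only, not the paper's actual implementation:

```python
def normalized_average(signal, certainty, kernel):
    """Normalized averaging: filter the certainty-weighted data and the
    certainty separately, then divide. Elements with certainty 0 (e.g.,
    outside the valid slices) do not bias the result."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        num = den = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                num += w * certainty[j] * signal[j]
                den += w * certainty[j]
        out.append(num / den if den > 0 else 0.0)
    return out

# A constant signal stays constant all the way to the border, which plain
# convolution with zero padding would not achieve.
sig = [5.0] * 8
cert = [1.0] * 8
print(normalized_average(sig, cert, [1.0, 2.0, 1.0]))
```

With full certainty everywhere, the result is unaffected; where the certainty drops to zero, those elements simply do not contribute, which is why the valid region is not shrunk by the filtering.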

In this section, the GPU implementation of the denoising algorithm will be described. The CUDA (compute unified device architecture) programming language by Nvidia [

The CUDA programming language can easily generate 2D indices for each thread, for example, by using Algorithm
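The index computation itself is the standard CUDA pattern; the sketch below simulates the arithmetic on the CPU (the names mirror CUDA's built-in variables, but this is an illustration rather than the listed algorithm):

```python
def global_2d_index(block_idx, block_dim, thread_idx):
    """Mimics the standard CUDA pattern
    x = blockIdx.x * blockDim.x + threadIdx.x (and likewise for y)."""
    x = block_idx[0] * block_dim[0] + thread_idx[0]
    y = block_idx[1] * block_dim[1] + thread_idx[1]
    return x, y

# Thread (3, 1) in block (2, 0), with 16 x 16 threads per block:
print(global_2d_index((2, 0), (16, 16), (3, 1)))
```

For 3D and 4D data, no such built-in index layout exists in CUDA, so the remaining coordinates have to be derived manually, for example by integer division and modulo on one of the 2D indices.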

Fast-Fourier-transform- (FFT-) based filtering can be very efficient when large nonseparable filters of high dimension are to be applied to big datasets, but spatial filtering is generally faster if the filters are small or Cartesian separable. The main advantage of FFT-based filtering is that the processing time is the same regardless of the spatial size of the filter. A small bonus is that circular filtering is achieved for free. The main disadvantage of FFT-based filtering is, however, the memory requirement, as the filters need to be stored in the same resolution as the data, and as a complex-valued number for each element.

To see which kind of filtering fits the GPU best, both spatial and FFT-based filtering were implemented. For filtering with the small separable lowpass filters (which are applied before the data is downsampled and to smooth the tensor components), only separable spatial filtering is implemented.

Spatial filtering can be implemented in many ways, especially in four dimensions. One easy way to implement 2D and 3D filtering on the GPU is to take advantage of the cache of the texture memory and put the filter kernel in constant memory. The drawback of this approach, however, is that the implementation will be limited by the memory bandwidth rather than by the computational performance. Another problem is that it is not possible to use 4D textures in the CUDA programming language; one would have to store the 4D data as one big 1D texture or as several 2D or 3D textures. A better approach is to take advantage of the shared memory, which increased in size by a factor of 3 between the Nvidia GTX 285 and the Nvidia GTX 580. The data is first read into the shared memory, and then the filter responses are calculated in parallel. By using the shared memory, the threads can share the data in a very efficient way, which is beneficial as the filtering results for two neighbouring elements are calculated from mainly the same data.

As multidimensional filters can be separable or nonseparable (the monomial filters and the reconstruction filters are nonseparable, while the different lowpass filters are separable) two different spatial filtering functions were implemented.

Our separable 4D convolver is implemented by first doing the filtering for all the rows, then for all the columns, then for all the rods and finally for all the time points. The data is first loaded into the shared memory and then the valid filter responses are calculated in parallel. The filter kernels are stored in constant memory. For the four kernels, 16 KB of shared memory is used such that 3 thread blocks can run in parallel on each multiprocessor on the Nvidia GTX 580.
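The row/column/rod/time-point ordering can be sketched with plain CPU loops over a flat row-major array; this illustrates the order of operations, not the CUDA kernel itself:

```python
def convolve_axis(data, dims, axis, kernel):
    """Zero-padded 1D filtering along one axis of a flat 4D array stored
    in row-major order with dims = (t, z, y, x)."""
    strides = [1] * 4
    for a in range(2, -1, -1):
        strides[a] = strides[a + 1] * dims[a + 1]
    half = len(kernel) // 2
    out = [0.0] * len(data)
    for idx in range(len(data)):
        # Recover the position of this element along the filtered axis.
        pos = (idx // strides[axis]) % dims[axis]
        acc = 0.0
        for k, w in enumerate(kernel):
            c = pos + k - half
            if 0 <= c < dims[axis]:
                acc += w * data[idx + (c - pos) * strides[axis]]
        out[idx] = acc
    return out

def separable_convolve_4d(data, dims, kernel):
    """Filter rows, columns, rods, and time points in sequence."""
    for axis in (3, 2, 1, 0):  # x, y, z, t
        data = convolve_axis(data, dims, axis, kernel)
    return data

# An impulse filtered with [1, 2, 1] along all four axes gives 2^4 = 16
# at the centre element.
dims = (4, 4, 4, 4)
data = [0.0] * 256
centre = 2 * 64 + 2 * 16 + 2 * 4 + 2
data[centre] = 1.0
result = separable_convolve_4d(data, dims, [1.0, 2.0, 1.0])
print(result[centre])
```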

The shared memory approach works rather well for nonseparable 2D filtering but not as well for nonseparable 3D and 4D filtering. The size of the shared memory on the Nvidia GTX 580 is 48 KB for each multiprocessor, and it is thereby only possible to, for example, fit


Our nonseparable 2D convolver first reads

The 14 monomial filters are of size

While the CUFFT library by Nvidia supports 1D, 2D, and 3D FFTs, there is no direct support for 4D FFTs. As the FFT is Cartesian separable, it is, however, possible to do a 4D FFT by applying four consecutive 1D FFTs. The CUFFT library supports launching a batch of 1D FFTs, such that many 1D FFTs can run in parallel. The batch of 1D FFTs is applied along the first dimension in which the data is stored (e.g., along
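The separability argument can be checked on a toy example: a naive 1D DFT applied along each dimension in turn gives the same result as the direct multidimensional DFT. A pure-Python 2D sketch of the principle (a real implementation would of course use CUFFT or FFTW):

```python
import cmath

def dft1(x):
    """Naive 1D DFT, straight from the definition."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def dft2_separable(rows):
    """2D DFT as consecutive 1D DFTs: rows first, then columns."""
    step1 = [dft1(r) for r in rows]
    cols = [dft1([step1[i][j] for i in range(len(rows))])
            for j in range(len(rows[0]))]
    return [[cols[j][i] for j in range(len(rows[0]))] for i in range(len(rows))]

def dft2_direct(rows):
    """Direct 2D DFT from the double-sum definition."""
    m, n = len(rows), len(rows[0])
    return [[sum(rows[a][b]
                 * cmath.exp(-2j * cmath.pi * (u * a / m + v * b / n))
                 for a in range(m) for b in range(n))
             for v in range(n)] for u in range(m)]

data = [[1.0, 2.0], [3.0, 4.0]]
sep, direct = dft2_separable(data), dft2_direct(data)
print(all(abs(sep[i][j] - direct[i][j]) < 1e-9 for i in range(2) for j in range(2)))
```

The same argument extends dimension by dimension, which is why four batched 1D FFTs yield a full 4D FFT.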

A forward 4D FFT is first applied to the volumes. A filter is padded with zeros to the same resolution as the data and is then transformed to the frequency domain. To do the filtering, a complex-valued multiplication between the data and the filter is applied, and then an inverse 4D FFT is applied to the filter response. After the inverse transform, an FFT shift is necessary; there is, however, no such functionality in the CUFFT library. When the tensor components and the denoised data are calculated, each of the four coordinates is shifted by using a help function, see Algorithm
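The shift itself is one modular addition per coordinate; a sketch of the arithmetic, assuming even dimension sizes as in this dataset:

```python
def fft_shift_index(i, n):
    """Shift coordinate i in a dimension of (even) size n, moving the
    zero-frequency component to/from the centre of that dimension."""
    return (i + n // 2) % n

# For n = 8, index 0 maps to 4 and index 4 maps back to 0.
print([fft_shift_index(i, 8) for i in range(8)])
```

Applying the same function to all four coordinates of an element gives the 4D FFT shift; for even sizes the function is its own inverse.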


As the monomial filters only have a real part or an imaginary part in the spatial domain, some additional time is saved by putting one monomial filter in the real part and another monomial filter in the imaginary part before the 4D FFT is applied to the zero-padded filter. When the complex multiplication is performed in the frequency domain, two filters are thus applied at the same time. After the inverse 4D FFT, the first filter response is extracted as the real part and the second filter response as the imaginary part. The same trick is used for the 10 highpass reconstruction filters.
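The trick works because convolution is linear: convolving real data with the complex sequence h1 + i·h2 yields the response of h1 in the real part and the response of h2 in the imaginary part. A small sketch that verifies this directly with circular convolution (the filters here are toy examples, not the monomial filters):

```python
def circular_convolve(x, h):
    """Circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(x[k] * h[(i - k) % n] for k in range(n)) for i in range(n)]

x = [1.0, 2.0, 3.0, 4.0]       # real-valued data
h1 = [1.0, 0.0, 0.0, 0.0]      # first real-valued filter (identity)
h2 = [0.0, 1.0, 0.0, 0.0]      # second real-valued filter (circular delay)

# Pack both filters into one complex sequence and convolve once.
packed = [complex(a, b) for a, b in zip(h1, h2)]
resp = circular_convolve(x, packed)

# Real part = response of h1, imaginary part = response of h2.
print([r.real for r in resp], [r.imag for r in resp])
```

Since FFT-based filtering computes exactly this circular convolution, the same separation holds after the inverse transform.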

The main problem of implementing the 4D denoising algorithm on the GPU is the limited size of the global memory (3 GB in our case). This is made even more difficult by the fact that the GPU driver can use as much as 100–200 MB of the global memory. Storing all the CT data on the GPU at the same time is not possible; a single CT volume of the resolution

For the spatial filtering, the algorithm is started with data of the resolution

For the FFT-based filtering, the algorithm is started with data of the resolution

To store the 10 components of the control tensor in the same resolution as the original CT data for one run with spatial filtering (
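To put these figures in perspective, the storage requirements at the full dataset resolution can be estimated directly; a rough illustration assuming 32-bit floats, not the exact buffer sizes of the implementation:

```python
BYTES_PER_FLOAT = 4  # 32-bit floats, as used by all implementations

# Dataset resolution from the paper: 512 x 512 x 445 x 20
voxels_per_volume = 512 * 512 * 445
volume_mb = voxels_per_volume * BYTES_PER_FLOAT / 1024**2
dataset_gb = voxels_per_volume * 20 * BYTES_PER_FLOAT / 1024**3
tensor_gb = dataset_gb * 10  # the 10 control tensor components

print(round(volume_mb), round(dataset_gb, 1), round(tensor_gb, 1))
```

One volume alone is roughly 445 MB, the full 4D dataset about 8.7 GB, and ten tensor components at full resolution would need on the order of 87 GB, far beyond the 3 GB of global memory, which is why the data has to be processed in pieces.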

Table

The 4D CT dataset that was used for testing our GPU implementation was collected with a Siemens SOMATOM Definition Flash dual-energy CT scanner at the Center for Medical Image Science and Visualization (CMIV). The dataset contains almost 9000 DICOM files and the resolution of the data is 512 × 512 × 445 × 20 time voxels. The spatial size of each voxel is

A comparison between the processing times of our GPU implementation and of a CPU implementation was made. The GPU used was an Nvidia GTX 580, equipped with 512 processor cores and 3 GB of memory (the Nvidia GTX 580 is normally equipped with 1.5 GB of memory). The CPU used was an Intel Xeon 2.4 GHz with 4 processor cores, 12 MB of L3 cache, and 12 GB of memory. All the implementations used 32-bit floats. The operating system used was Linux Fedora 14 64-bit.

For the CPU implementation, the OpenMP (open multiprocessing) library [

The processing times are given in Tables

Processing times for filtering with the 14 monomial filters of size

Data size | Spatial filtering CPU | Spatial filtering GPU | GPU speedup | FFT filtering CPU | FFT filtering GPU | GPU speedup |
---|---|---|---|---|---|---|
 | 17.3 min | 5.7 s | 182 | 25 s | 1.8 s | 13.9 |
 | 2.3 h | 36.0 s | 230 | 3.3 min | 14.3 s | 13.9 |

Processing times for lowpass filtering the 10 tensor components, calculating

Data size | CPU | GPU | GPU speedup |
---|---|---|---|
 | 42 s | 1.0 s | 42 |
 | 292 s | 7.3 s | 40 |

Processing times for filtering with the 11 reconstruction filters of size 11×11×11×11 and calculating the denoised data for the different implementations. The processing times for the GPU do NOT include the time it takes to transfer the data to and from the GPU.

Data size | Spatial filtering CPU | Spatial filtering GPU | GPU speedup | FFT filtering CPU | FFT filtering GPU | GPU speedup |
---|---|---|---|---|---|---|
 | 7.5 h | 3.3 min | 136 | 5.6 min | 1.1 min | 5.1 |
 | 2.5 days | 23.9 min | 150 | 45 min | 8.6 min | 5.2 |

Total processing times for the complete 4D image denoising algorithm for the different implementations. The processing times for the GPU DO include the time it takes to transfer the data to and from the GPU.

Data size | Spatial filtering CPU | Spatial filtering GPU | GPU speedup | FFT filtering CPU | FFT filtering GPU | GPU speedup |
---|---|---|---|---|---|---|
 | 7.8 h | 3.5 min | 133 | 6.7 min | 1.2 min | 5.6 |
 | 2.6 days | 26.3 min | 144 | 52.8 min | 8.9 min | 5.9 |

To show the results of the 4D denoising, the original CT data was compared with the denoised data by applying volume rendering. The freely available MeVisLab software development program (

Three comparisons between original CT data (a) and denoised CT data (b). The parameters used for this denoising were

A movie where the original and the denoised data are explored with the two volume renderers was also made. For this video, the data was downsampled by a factor of 2 in the spatial dimensions, in order to decrease the memory usage. The volume renderers automatically loop over all the time points. The video can be found at

By looking at the video, it is easy to see that the amount of noise in the original data varies with time.

We have presented how to implement true 4D image denoising on the GPU. The result is that 4D image denoising becomes practically possible when the GPU is used, which significantly increases its clinical value.

Making a completely fair comparison between the CPU and the GPU is rather difficult. It has been debated [

While spatial filtering can be significantly slower than FFT-based filtering for nonseparable filters, it has some advantages (besides the lower memory usage). One is that a region of interest (ROI) can be selected for the denoising, instead of denoising the whole dataset. Another advantage is that filter networks [

From our results, it is clear that FFT-based filtering is faster than spatial filtering for large nonseparable filters. For data sizes that are not a power of two in each dimension, the FFT-based approach might, however, not be as efficient. Since medical doctors normally do not look at 3D or 4D data as volume renderings, but rather as 2D slices, the spatial filtering approach has the advantage that the denoising can be done for a region of interest (e.g., a specific slice or volume); it is a waste of time to enhance the parts of the data that are not used by the medical doctor. The spatial filtering approach can also handle larger datasets than the FFT-based approach, as it is sufficient to store the filter responses for one slice or one volume at a time. Recently, we acquired a CT dataset with 100 time points, compared to 20 time points; it is not possible to use the FFT-based approach for this dataset.

There are several reasons why the GPU speedup for the FFT-based filtering is much smaller than the GPU speedup for the spatial filtering. First, the CUFFT library does not include any direct support for 4D FFTs, and we had to implement our own 4D FFT as two 2D FFTs applied after each other. Between the 2D FFTs, the storage order of the data is changed; it can take longer to change the order of the data than to actually perform the FFT. If Nvidia includes direct support for 4D FFTs in the CUFFT library, we are sure that their implementation would be much more efficient than ours. Second, the FFT for the CPU is extremely optimized, as it is used in a lot of applications, while our convolver for the CPU is not fully optimized. The CUDA programming language is only a few years old, and the GPU standard libraries are not as optimized as the CPU standard libraries. The hardware design of the GPUs also changes rapidly. Some work has been done to further optimize the CUFFT library. Nukada et al. [

As previously discussed in the paper, 4D image processing in CUDA is harder to implement than 2D and 3D image processing. There are, for example, no 4D textures, no 4D FFTs, and no direct support for 4D (or 3D) indices. However, since fMRI data is also 4D, we have previously gained some experience of how to do 4D image processing with CUDA [

It might seem impossible to have medical image data with more than 4 dimensions, but some work has been done on how to collect 5D data [

To conclude, by using the GPU, true 4D image denoising becomes practically feasible. Our implementation can of course be applied to other modalities as well, such as ultrasound and MRI, and not only to CT data. The short processing time also makes it practically possible to further improve the denoising algorithm and to tune the parameters that are used.

The elapsed time between the development of practically feasible 2D [

This work was supported by the Linnaeus Center CADICS and research Grant no. 2008-3813, funded by the Swedish research council. The CT data was collected at the Center for Medical Image Science and Visualization (CMIV). The authors would like to thank the NovaMedTech project at Linköping University for financial support of the GPU hardware, Johan Wiklund for support with the CUDA installations, and Chunliang Wang for setting the transfer functions for the volume rendering.