The Nonlocal Means (NLM) algorithm is widely considered a state-of-the-art denoising filter in many research fields. Its high computational complexity has led researchers to develop parallel programming approaches and to use massively parallel architectures such as GPUs. In recent years, GPU devices have made it possible to achieve reasonable running times by filtering 3D datasets slice-by-slice with a 2D NLM algorithm. In our approach we design and implement a fully 3D Nonlocal Means parallel approach, adopting different algorithm mapping strategies on a GPU architecture and a multi-GPU framework, in order to demonstrate its high applicability and scalability. The experimental results we obtained encourage the use of our approach in a wide spectrum of application scenarios, such as magnetic resonance imaging (MRI) or video sequence denoising.
Image denoising represents one of the most common tasks of image processing. Several techniques have been developed in the last decades to remove noise from images while preserving small structures from excessive blurring (see [
One of the best-performing and most robust denoising approaches is the Nonlocal Means (NLM) filter, introduced in 2005 by Buades et al. [
The result is a general-purpose denoising scheme whose performance is widely accepted to be superior to that of previous state-of-the-art algorithms, such as total variation, wavelet thresholding, or anisotropic filtering [
GPUs are massively parallel architectures that deliver impressive performance improvements but at the same time require a deep understanding of the underlying architecture and
In this paper we design and implement a fully 3D Nonlocal Means parallel approach based on Compute Unified Device Architecture (CUDA) [
The plan of the paper is as follows. In Section
An
If the image is defined on a discrete, regular grid
Both computational issues and the convenience of introducing a geometric proximity criterion, in addition to the purely radiometric distance measure, led to a change in the original version of the NLM filter [
Therefore, given a search radius
Analogously, given a similarity radius
Finally, the denoised image is
The filter strength, which is determined by
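To make the definitions above concrete, the following is a minimal, unoptimized sketch of the voxelwise 3D NLM computation. The function and parameter names (`search_radius`, `sim_radius`, and the filter strength `h`) are our own labels for the quantities introduced in the text; this is an illustrative CPU reference, not the paper's implementation.

```python
import numpy as np

def nlm3d(vol, search_radius=1, sim_radius=1, h=0.5):
    """Naive 3D Nonlocal Means: each voxel is restored as a weighted average
    of the intensities in its search window, with weights derived from the
    distance between similarity windows (patches)."""
    pad = search_radius + sim_radius
    p = np.pad(vol, pad, mode="reflect").astype(np.float64)
    out = np.zeros_like(vol, dtype=np.float64)
    zs, ys, xs = vol.shape
    for z in range(zs):
        for y in range(ys):
            for x in range(xs):
                cz, cy, cx = z + pad, y + pad, x + pad
                # similarity window centered on the voxel being restored
                ref = p[cz-sim_radius:cz+sim_radius+1,
                        cy-sim_radius:cy+sim_radius+1,
                        cx-sim_radius:cx+sim_radius+1]
                wsum, acc = 0.0, 0.0  # cumulative weight, restored value
                for dz in range(-search_radius, search_radius + 1):
                    for dy in range(-search_radius, search_radius + 1):
                        for dx in range(-search_radius, search_radius + 1):
                            nz, ny, nx = cz + dz, cy + dy, cx + dx
                            patch = p[nz-sim_radius:nz+sim_radius+1,
                                      ny-sim_radius:ny+sim_radius+1,
                                      nx-sim_radius:nx+sim_radius+1]
                            d2 = np.mean((ref - patch) ** 2)  # patch distance
                            w = np.exp(-d2 / (h * h))         # radiometric weight
                            wsum += w
                            acc += w * p[nz, ny, nx]
                out[z, y, x] = acc / wsum  # normalize by the sum of weights
    return out
```

On a constant volume every patch distance is zero, so every weight equals 1 and the filter leaves the data unchanged, which is a quick sanity check of the weighting scheme.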
(1) for each voxel i of the input dataset
(2) Initialize the cumulative sum of weights and the restored value to 0;
(3) for each voxel j in the search window of i
(4) compute the distance between the similarity windows centered in i and j;
(5) compute the weight w(i, j) from the distance and the filter strength;
(6) add w(i, j) to the cumulative sum of weights;
(7) add w(i, j) times the intensity of j to the restored value;
(8) end for
(9) divide the restored value by the cumulative sum of weights;
(10) store the result in voxel i of the output dataset;
(11) end for
General-purpose computing on graphics processing units (GPGPU) is the use of GPUs to perform highly parallelizable computations that would normally be handled by CPU devices. Programming with GPUs requires both a deep understanding of the underlying computing architecture and a massive rethinking of existing CPU-based algorithms. The basic idea behind GPGPU is to handle sequential code on the CPU and process parallel code on the GPU, with the CPU acting as coordinator of the data processing. Additionally, the use of multiple GPUs in one system, or of large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.
We implement the 3D NLM filter on the NVIDIA parallel computing architecture, which consists of a set of cores, or scalar processors (SPs), performing simple mathematical operations. SPs are organized into streaming multiprocessors (SMs), each of which has a shared memory, a shared L1-cache, and several other units. An SM executes one or more thread blocks, while CUDA cores and the other units execute thread instructions; in particular, each SM executes threads in groups of 32, called warps [
View of a NVIDIA FERMI GPU architecture.
As shown in Algorithm
(1) for (each slice k of the dataset) {
(2) for (each row y of the slice) {
(3) for (each voxel x of the row) {
(4) compute the NLM restored value of voxel (x, y, k);
(5) } } }
As we can observe, the nested
With the aim of achieving the best performance of the algorithm and of optimizing the usage of this massively parallel architecture, we perform two different
(1) compute the thread coordinates (x, y) within the slice grid;
(2) for each slice k along the third dimension
(3) initialize the cumulative sum of weights and the restored value to 0;
(4) scan the search window of voxel (x, y, k), accumulating weights and weighted intensities;
(5) normalize the restored value by the cumulative sum of weights;
(6) store the result in voxel (x, y, k) of the output dataset;
(7) end for
Graphic representation of the partial unrolling strategy.
In the second strategy, illustrated in Figure
(1) compute the thread coordinates (x, y, k) within the 3D grid;
(2) initialize the cumulative sum of weights and the restored value to 0;
(3) for each voxel in the search window of (x, y, k)
(4) accumulate the weight and the weighted intensity;
(5) end for
(6) normalize the restored value by the cumulative sum of weights;
(7) store the result in voxel (x, y, k) of the output dataset.
Graphic representation of the full unrolling strategy.
These two strategies require different kernel grid configurations in order to take full advantage of the GPU architecture: the first strategy creates a smaller number of threads, each with a longer computing time, while the second creates a larger grid and therefore a greater number of threads, each with a shorter computing time.
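The trade-off between the two mappings can be made explicit with a small sketch (our own illustrative code, not the paper's kernels): partial unrolling launches one thread per slice position (x, y) and lets each thread loop over the third dimension, while full unrolling launches one thread per voxel.

```python
def partial_unrolling_grid(shape, block=(128, 1)):
    """One thread per (x, y) slice position; each thread loops over all slices."""
    xs, ys, zs = shape
    bx, by = block
    # ceiling division to cover the slice with thread blocks
    grid = ((xs + bx - 1) // bx, (ys + by - 1) // by)
    threads = grid[0] * grid[1] * bx * by
    work_per_thread = zs  # each thread visits every slice
    return grid, threads, work_per_thread

def full_unrolling_grid(shape, block=(128, 1)):
    """One thread per (x, y, z) voxel; no per-thread loop over slices."""
    xs, ys, zs = shape
    bx, by = block
    grid = ((xs + bx - 1) // bx, (ys + by - 1) // by, zs)
    threads = grid[0] * grid[1] * zs * bx * by
    return grid, threads, 1
```

For a 512 × 512 × 128 dataset with a (128, 1) block, partial unrolling creates 512 × 512 threads that each process 128 voxels, whereas full unrolling creates 128 times as many threads that each process a single voxel.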
We clearly observe that these implementations are bounded by the GPU global memory size. Hence, for a fixed dataset, the splitting procedure must: compute the available memory; compute the size of each subdataset; split the dataset.
In a first phase, the function splits the input dataset along the
Graphic representation of the split.
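A sketch of the splitting step follows; it is our own reconstruction of the procedure outlined above, with hypothetical parameter names. The dataset is cut along the third dimension into subdatasets that fit in global memory; we also assume each subdataset needs a halo of extra slices (search plus similarity radius) so that border voxels see their full search window, consistent with the window definitions given earlier.

```python
def split_dataset(num_slices, bytes_per_slice, free_mem, halo):
    """Return (start, stop) slice ranges along the third dimension such that
    each subdataset, plus its halo of extra slices on both sides, fits within
    the available GPU global memory."""
    max_slices = free_mem // bytes_per_slice - 2 * halo
    if max_slices <= 0:
        raise MemoryError("halo slices alone exceed the available device memory")
    chunks, start = [], 0
    while start < num_slices:
        stop = min(start + max_slices, num_slices)
        chunks.append((start, stop))
        start = stop
    return chunks
```

Each chunk is then copied to the device (together with its halo), filtered, and copied back; only the non-halo slices contribute to the output.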
For a better understanding of the entire process and of the different steps that compose the 3D NLM GPU algorithm, we show in Figure
3D NLM work flow.
To exploit even larger degrees of parallelism in our 3D dataset denoising, we deploy a new version of the 3D NLM algorithm that is compatible with multi-GPU architectures. Multiple GPUs avoid many of the limitations of single-GPU utilization (e.g., limited on-chip memory resources) by exploiting the combined resources of several devices. Dividing the algorithm across multiple GPUs requires partitioning the computational space and letting each GPU operate on a subsection of the data. Although in all our experiments we consider only configurations with one or two GPUs, the framework is fully capable of exploiting more than two GPUs.
It is important to underline that the input dataset will be partitioned among the global memories of the GPUs by means of the
In Figure
3D NLM multi-GPU work flow.
Exploiting the multi-GPU support will lead to a drastic reduction of the execution times as we will detail in Section
The basic idea consists in identifying the number of GPU devices present in the architecture; this information can be obtained by means of a CUDA library function and stored in the variable
(1) query the number of available GPU devices;
(2) split the input dataset along the third dimension into as many subdatasets as devices;
(3) for each GPU device, in a separate host thread
(4) select the device;
(5) copy the corresponding subdataset to the device global memory;
(6) launch the 3D NLM kernel on the subdataset;
(7) copy the filtered subdataset back to host memory;
(8) merge the filtered subdatasets into the output dataset.
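The slice partitioning among devices can be sketched as follows (an illustrative helper of our own, not the paper's code): the third dimension is divided into one contiguous range per GPU, as evenly as possible.

```python
def partition_among_gpus(num_slices, num_gpus):
    """Divide slices along the third dimension as evenly as possible,
    returning one contiguous (start, stop) range per GPU."""
    base, extra = divmod(num_slices, num_gpus)
    ranges, start = [], 0
    for g in range(num_gpus):
        size = base + (1 if g < extra else 0)  # spread the remainder
        ranges.append((start, start + size))
        start += size
    return ranges
```

Each host thread would then select its device, process its range, and copy the filtered slices back, after which the subdatasets are merged in order.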
Since data access times in memory can vary depending on the kernel configuration, we focus our attention on testing several configurations, both mono- and bidimensional, for the size of the thread blocks into which each slice is divided; each thread then sequentially processes the voxels along the third dimension. For multi-GPU configurations, the workload is divided along the third dimension. Inside each GPU the workload is divided along the first and second dimensions, in strips (monodimensional configurations) or tiles (bidimensional configurations) of threads. Strips or tiles may cover the slice grid entirely or only partially. In Figure
Schematic representation of thread organization inside a single GPU and between GPUs. Thread workload highlights the voxels that will be processed by a single thread. Moreover, threads can be organized in strips or tiles of the specified Block size. The red cube depicts the search window and the smaller blue one depicts the similarity window.
1D CUDA block configuration (12, 1, 1)
2D CUDA block configuration (12, 12, 1)
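The grid of thread blocks covering one slice follows directly from the block shape; a small sketch (our own, using the ceiling-division rule CUDA grids typically require) shows how 1D blocks yield strips and 2D blocks yield tiles.

```python
def kernel_grid(slice_shape, block):
    """Number of thread blocks needed to cover one slice: 1D blocks
    produce strips, 2D blocks produce tiles."""
    xs, ys = slice_shape
    bx, by, _ = block  # third block dimension is always 1 here
    return ((xs + bx - 1) // bx, (ys + by - 1) // by)
```

For a 512 × 512 slice, the (12, 1, 1) configuration yields a 43 × 512 grid of strips, while (12, 12, 1) yields a 43 × 43 grid of tiles; since 512 is not a multiple of 12, the last strip or tile only partially covers the slice.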
Moreover, we also test the performance impact of L1-cache usage, by using the binary L1-prefer setting, which allows choosing between two possible configurations: 48 KB of shared memory and 16 KB of L1-cache (no L1-prefer), or 16 KB of shared memory and 48 KB of L1-cache (L1-prefer).
The computing environment we used is equipped with 2 Intel Xeon E5620 CPUs (2.4 GHz) and an NVIDIA TESLA S2050 card, based on the FERMI GPU. This device consists of 4 GPGPU units, each with 3 GB of RAM and 448 processing cores working at 1.15 GHz. Moreover, the software environment consists of Linux 2.6.18-194.el5, CUDA release 3.2, and C compiler
General-purpose computing on graphics processing units (often termed GPGPU or GPU computing) supports a broad range of scientific and engineering applications, including physical simulation, signal and image processing, database management, and data mining [
Recently, a number of authors have implemented the 2D Nonlocal Means algorithm on a GPU; in this section, we will describe representative ones. In [
In this section we want to investigate and discuss the performance and the scalability of the proposed GPU algorithm, taking into account also the splitting strategy. To better point out the effective application of this implementation, we report, in Section
As demonstrated in these research articles [
From left to right the frames show a slice of the original dataset, the CPU restored image, the GPU restored image, and the difference between CPU and GPU filtered images (enhanced by a scaling factor of
To demonstrate the advantages and benefits of our parallelization strategy, we report several performance experiments. From a memory usage optimization point of view, we focus our attention on the impact of the cache size on the overall execution time and perform several test runs varying the L1-prefer modality; promising results are shown in Table
L1-prefer switch influence on execution times for the partial and full unrolling algorithm with (128, 1, 1) block size configuration and 3D datasets of normally distributed random numbers.
Partial unrolling algorithm | Full unrolling algorithm | |||||||
---|---|---|---|---|---|---|---|---|
Single GPU | Multi-GPU | Single GPU | Multi-GPU | |||||
L1 | no L1 | L1 | no L1 | L1 | no L1 | L1 | no L1 | |
|
0.77 | 0.83 | 0.38 | 0.42 | 0.58 | 0.58 | 0.23 | 0.23 |
|
2.25 | 2.3 | 1.12 | 1.15 | 2.29 | 2.29 | 0.85 | 0.85 |
|
4.5 | 4.59 | 2.25 | 2.3 | 4.38 | 4.38 | 1.9 | 1.9 |
|
16.24 | 16.54 | 8.12 | 8.25 | 17.51 | 17.51 | 7.58 | 7.58 |
|
32.5 | 33.58 | 16.24 | 16.54 | 34.22 | 34.24 | 15.94 | 15.95 |
|
63.55 | 66.42 | 31.67 | 32.52 | 70.01 | 70 | 30.31 | 30.32 |
|
128.19 | 138.06 | 63.57 | 66.59 | 136.89 | 136.95 | 63.75 | 63.77 |
|
260.93 | 288.84 | 128.63 | 138.2 | 270.61 | 270.75 | 130.78 | 130.67 |
As we mentioned in the previous section, the kernel grid configuration plays a key role in optimizing the execution of the GPU NLM filter. The strip or tile thread subdivision influences the filter performance in terms of computing time, due to the different type of data access. The experimental results we carried out show that an optimal configuration is given by the strip subdivision, and in Table
Partial unrolling algorithm on a single GPU unit. Execution times and speed-up values for several block size configurations and 3D datasets of normally distributed random numbers. Search and similarity windows have been set according to
Dataset size | Execution time/speed-up | ||||
---|---|---|---|---|---|
1 GPU unit | CPU | ||||
(16, 16, 1) | (128, 1, 1) | (256, 1, 1) | (512, 1, 1) | ||
|
0.99/ |
0.77/ |
1.32/ |
2.72/ |
17.5 |
|
2.31/ |
2.25/ |
2.23/ |
4.57/ |
83 |
|
4.62/ |
4.5/ |
4.46/ |
9.22/ |
170 |
|
16.9/ |
16.2/ |
16.6/ |
18.6/ |
696 |
|
33.79/ |
32.5/ |
33.14/ |
37.5/ |
1393 |
|
65.2/ |
63.6/ |
63.9/ |
69.2/ |
2814 |
|
131/ |
128/ |
128/ |
139/ |
5623 |
|
264/ |
261/ |
259/ |
281/ |
11291 |
Partial unrolling algorithm on 2 GPU units. Execution times and speed-up values for several block size configurations and 3D datasets of normally distributed random numbers. Search and similarity windows have been set according to
Dataset size | Execution time/speed-up | ||||
---|---|---|---|---|---|
2 GPU units | CPU | ||||
(16, 16, 1) | (128, 1, 1) | (256, 1, 1) | (512, 1, 1) | ||
|
0.49/ |
0.38/ |
0.66/ |
1.36/ |
17.5 |
|
1.15/ |
1.12/ |
1.11/ |
2.27/ |
83 |
|
2.31/ |
2.25/ |
2.23/ |
4.57/ |
170 |
|
8.44/ |
8.12/ |
8.28/ |
9.26/ |
696 |
|
16.9/ |
16.2/ |
16.6/ |
18.6/ |
1393 |
|
32.6/ |
31.7/ |
31.92/ |
34.5/ |
2814 |
|
65.2/ |
63.6/ |
63.9/ |
69.2/ |
5623 |
|
131/ |
129/ |
128/ |
139/ |
11291 |
Full unrolling algorithm on a single GPU unit. Execution times and speed-up values for several block size configurations and 3D datasets of normally distributed random numbers. Search and similarity windows have been set according to
Dataset size | Execution time/speed-up | ||||
---|---|---|---|---|---|
1 GPU unit | CPU | ||||
(16, 16, 1) | (128, 1, 1) | (256, 1, 1) | (512, 1, 1) | ||
|
0.59/ |
0.58/ |
1.15/ |
2.4/ |
17.5 |
|
2.33/ |
2.29/ |
2.3/ |
4.8/ |
83 |
|
4.46/ |
4.5/ |
4.38/ |
9.2/ |
170 |
|
17.9/ |
17.5/ |
17.5/ |
18.4/ |
696 |
|
34.9/ |
34.2/ |
34.3/ |
36/ |
1393 |
|
71.4/ |
70/ |
70.1/ |
73.9/ |
2814 |
|
140/ |
137/ |
137/ |
145/ |
5623 |
|
276/ |
271/ |
271/ |
286/ |
11291 |
Full unrolling algorithm on 2 GPU units. Execution times and speed-up values for several block size configurations and 3D datasets of normally distributed random numbers. Search and similarity windows have been set according to
Dataset size | Execution time/speed-up | ||||
---|---|---|---|---|---|
2 GPU units | CPU | ||||
(16, 16, 1) | (128, 1, 1) | (256, 1, 1) | (512, 1, 1) | ||
|
0.12/ |
0.23/ |
0.43/ |
0.9/ |
17.5 |
|
0.87/ |
0.85/ |
0.85/ |
1.79/ |
83 |
|
1.94/ |
1.9/ |
1.9/ |
3.99/ |
170 |
|
7.73/ |
7.58/ |
7.59/ |
7.98/ |
696 |
|
16.3/ |
15.9/ |
16/ |
16.8/ |
1393 |
|
30.9/ |
30.3/ |
30.3/ |
32/ |
2814 |
|
65/ |
63.8/ |
63.8/ |
67.3/ |
5623 |
|
133/ |
131/ |
131/ |
138/ |
11291 |
In Figure
Outline of the CPU (blue), 1 GPU (red), and 2 GPU (green) GFlops values for the full unrolling algorithm with different dataset sizes and thread block configuration
Another factor that can strongly influence the performance trend of the algorithm is the size of the search and similarity windows. Hence we investigate the behavior of the full unrolling algorithm, with a 3D dataset of normally distributed random numbers (size = 512 × 512 × 128), against the search
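The influence of the window sizes is easy to quantify: with search radius M and similarity radius d (our notation for the radii defined earlier), each voxel compares (2M+1)³ candidate positions, each involving a patch of (2d+1)³ voxels, so the per-voxel operation count scales with the product of the two window volumes. A quick sketch:

```python
def per_voxel_ops(search_radius, sim_radius):
    """Approximate per-voxel work of the 3D NLM filter: number of search
    positions times the size of each similarity window."""
    search_vol = (2 * search_radius + 1) ** 3   # candidate voxels compared
    patch_vol = (2 * sim_radius + 1) ** 3       # voxels per patch distance
    return search_vol * patch_vol
```

For example, enlarging the search radius from 2 to 5 multiplies the per-voxel cost by 1331/125 ≈ 10.6, which is consistent with the order-of-magnitude growth of the execution times reported in the tables.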
Full unrolling algorithm on a single GPU unit. Execution times and speed-up values for a 3D dataset of normally distributed random numbers (size
|
Execution time/speed-up | ||||
---|---|---|---|---|---|
1 GPU unit | CPU | ||||
(16, 16, 1) | (128, 1, 1) | (256, 1, 1) | (512, 1, 1) | ||
|
71.4/ |
70/ |
70.1/ |
73.9/ |
2814 |
|
514/ |
504/ |
505/ |
538/ |
19790 |
|
254/ |
244/ |
244/ |
267/ |
8133 |
|
1845/ |
1761/ |
1765/ |
1964/ |
58785 |
Full unrolling algorithm on 2 GPU units. Execution times and speed-up values for a 3D dataset of normally distributed random numbers (size
|
Execution time/speed-up | ||||
---|---|---|---|---|---|
2 GPU units | CPU | ||||
(16, 16, 1) | (128, 1, 1) | (256, 1, 1) | (512, 1, 1) | ||
|
30.9/ |
30.3/ |
30.3/ |
32/ |
2814 |
|
196/ |
192/ |
193/ |
205/ |
19790 |
|
107/ |
103/ |
103/ |
113/ |
8133 |
|
685/ |
654/ |
656/ |
729/ |
58785 |
At the end of this section we draw some practical remarks. To have the best hardware performance and results closer to NVIDIA guidelines [
The performance results shown in the previous section encourage the application of this denoising filter to a number of different scenarios, formerly far from feasible due to the filter's high computational demand on large datasets. Video sequence denoising represents one such scenario, where increasing the resolution and duration of a video sequence would lead to denoising times that rule out adopting a serial version of the filter. Accordingly, we performed several performance tests on videos of different resolutions and frame rates, in order to demonstrate the scalability of our splitting approach on huge input datasets. Selecting the most commonly used video formats, PAL (
Partial unrolling algorithms for different video formats. Observe that the execution times include the time needed to copy data from CPU to GPU and from GPU to CPU.
Video size | MB | Execution time | |||
---|---|---|---|---|---|
|
|
|
|
||
PAL |
79.1 | 20.8 | 145 | 72.9 | 508 |
PAL |
158 | 41.6 | 291 | 146 | 1021 |
PAL |
316 | 83.7 | 586 | 294 | 2045 |
PAL |
791 | 212 | 1500 | 744 | 5162 |
HD |
148 | 39 | 271 | 136 | 953 |
HD |
297 | 78.1 | 545 | 274 | 1914 |
HD |
593 | 157 | 1104 | 551 | 3848 |
HD |
1483 | 398 | 2812 | 1395 | 9678 |
FULL-HD |
396 | 97.3 | 678 | 342 | 2386 |
FULL-HD |
791 | 195 | 1365 | 686 | 4788 |
FULL-HD |
1582 | 392 | 2754 | 1375 | 9601 |
FULL-HD |
3955 | 995 | 7018 | 3483 | 24149 |
Full unrolling algorithms for different video formats. Observe that the execution times include the time needed to copy data from CPU to GPU and from GPU to CPU.
Video size | MB | Execution time | |||
---|---|---|---|---|---|
|
|
|
|
||
PAL |
79.1 | 16.8 | 85.8 | 55 | 276 |
PAL |
158 | 38.8 | 239 | 131 | 807 |
PAL |
316 | 82.9 | 545 | 284 | 1868 |
PAL |
791 | 215 | 1463 | 742 | 5052 |
HD |
148 | 31.4 | 161 | 103 | 517 |
HD |
297 | 72.7 | 448 | 246 | 1523 |
HD |
593 | 155 | 1022 | 532 | 3502 |
HD |
1483 | 403 | 2743 | 1391 | 9472 |
FULL-HD |
396 | 78.5 | 402 | 257 | 1294 |
FULL-HD |
791 | 182 | 1119 | 615 | 3781 |
FULL-HD |
1582 | 388 | 2550 | 1327 | 8738 |
FULL-HD |
3955 | 1007 | 6846 | 3473 | 23634 |
To investigate the scalability of our full unrolling approach more in depth, in Figure
Scalability of the full unrolling approach, with the third dimension (fps) fixed.
In Figure
Scalability of the full unrolling approach, with the PAL format fixed, tuning the fps along the
On the
Full unrolling algorithm with split for different video formats. Observe that the splits are performed on the same GPU and that the time needed to copy data from CPU to GPU and from GPU to CPU is reported in bold.
Video size | Splits | Execution time + CPU-GPU copy time | |||
---|---|---|---|---|---|
|
|
|
|
||
PAL |
1 | 16.8 + |
85.8 + |
55 + |
276 + |
PAL |
2 | 33.5 + |
172 + |
110 + |
552 + |
PAL |
4 | 67 + |
343 + |
220 + |
1104 + |
PAL |
10 | 168 + |
858 + |
550 + |
2761 + |
HD |
1 | 31.4 + |
161 + |
103 + |
517 + |
HD |
2 | 62.8 + |
321 + |
206 + |
1035 + |
HD |
4 | 126 + |
643 + |
412 + |
2070 + |
HD |
10 | 314 + |
1607 + |
1030 + |
5175 + |
FULL-HD |
1 | 78.5 + |
402 + |
257 + |
1294 + |
FULL-HD |
2 | 157 + |
804 + |
515 + |
2587 + |
FULL-HD |
4 | 314 + |
1607 + |
1030 + |
5174 + |
FULL-HD |
10 | 785 + |
4018 + |
2575 + |
12935 + |
Finally, in Figure
Scalability comparison between the fully unrolled strategy, the splitting strategy, and the ideal, for the FULL-HD video format.
The NLM filter is a state-of-the-art denoising algorithm. However, its huge computational load prevents the large-scale adoption of the most common implementations of the algorithm, in particular for 3D datasets of great size.
In this paper we design and implement a fully 3D Nonlocal Means parallel approach, adopting different algorithm mapping strategies on GPU and multi-GPU architectures in order to demonstrate its high scalability and consequent applicability. We focus our attention on designing and testing several kernel configurations that rely on two different thread usage schemes: bidimensional and tridimensional CUDA kernels adopting a multi-GPU framework. In this way, we identify a set of optimal settings that guarantee high performance in terms of execution time and scalability, for a wide spectrum of application scenarios. Moreover, we propose two case studies: the first simulates an MRI scenario, where the Nonlocal Means algorithm is widely applied to remove the Rician noise in this kind of medical image; the second involves a video sequence, which represents a suitable example of a huge 3D dataset.
The obtained running times and the scalability demonstrate that, for the most common 3D dataset sizes, our parallelization approaches drastically reduce the computational demand that previously made the Nonlocal Means algorithm inapplicable. The experimental scenarios we presented obtain great benefits from adopting a parallel approach on the GPU architecture, and they therefore encourage the exploration of more sophisticated algorithm variants, reducing the gap between the previous execution times and performance acceptable for real-time scenarios.
The authors declare that there is no conflict of interest regarding the publication of this paper.
The authors thank Drs. G. Guarnieri (ENEA-UTICT-HPC Portici Research Center) and M. Chinnici (ENEA-UTICT-PRA Casaccia Research Center) for support in GPU utilization. The computing resources used for this work have been provided by CRESCO/ENEAGRID High Performance Computing infrastructure and its ITC staff; see