3D Data Denoising via Nonlocal Means Filter by Using Parallel GPU Strategies

Nonlocal Means (NLM) algorithm is widely considered as a state-of-the-art denoising filter in many research fields. Its high computational complexity leads researchers to the development of parallel programming approaches and the use of massively parallel architectures such as the GPUs. In the recent years, the GPU devices had led to achieving reasonable running times by filtering, slice-by-slice, and 3D datasets with a 2D NLM algorithm. In our approach we design and implement a fully 3D NonLocal Means parallel approach, adopting different algorithm mapping strategies on GPU architecture and multi-GPU framework, in order to demonstrate its high applicability and scalability. The experimental results we obtained encourage the usability of our approach in a large spectrum of applicative scenarios such as magnetic resonance imaging (MRI) or video sequence denoising.


Introduction
Image denoising represents one of the most common tasks of image processing. Several techniques have been developed in the last decades to face the problem of removing noise from images, still preserving the small structures from an excessive blurring (see [1] for a review). All those schemes share the belief that an improved value of a given image point can be expressed as a function of the image itself.
One of the most performing and robust denoising approaches is the Nonlocal Means (NLM) filter, introduced in 2005 by Buades et al. [2]. Since its first appearance, the family of NLM algorithm and implementation variants has enormously grown (just to mention some of the most relevant improvements, see [3][4][5][6][7][8][9][10]); nevertheless, all of them assume that the restoring function for a given point is a mean of all the image values, largely weighted according to the radiometric similarity between values and only weakly tied to a spatial proximity criterion.
The result is a general-purpose denoising scheme, whose performances are widely accepted to be better with respect to the previous state-of-the-art algorithms, such as the total variation, the wavelet thresholding, or the anisotropic filtering [11]. In particular, it has been shown that NLM filter guarantees the homogeneity of flat zones, preserves edges and fine structures, and transforms white noise into white noise, thus avoiding the introduction of artifacts and spurious correlated signal [1]. Unsurprisingly, the NLM algorithm is simple but computationally very heavy, because it requires, for denoising of each pixel, a high number of elementary operations, and even some fast versions of the scheme are quite demanding on 2D images and almost daunting on 3D datasets. The huge amount of computational demand has been recently addressed by using accelerated hardware [12,13], the graphic processor units (GPUs) in particular.
The GPUs are massively parallel architectures that efficiently work with impressive performance improvements but at the same time require a deep understanding of the underlying architecture and ad hoc thread based algorithms. Obtaining reasonable performance in terms of execution times of the NLM algorithm allows its implementation in a large spectrum of applicative scenarios such as magnetic resonance imaging (MRI) or video sequence denoising where, especially in the latter one, the input datasets can reach a considerable size. Moreover, in 3D datasets, for example, in the context of the magnetic resonance imaging (MRI), the use of fully 3D filters is more appropriate than a 2D-based slice-by-slice filtering approach to exploit all the information contained in the image. To the best of our knowledge, although there are several 2D GPU-based NLM versions [14][15][16], the 3D version of NLM filter has been poorly investigated in terms of both implementation and performance on GPUs.
In this paper we design and implement a fully 3D Nonlocal Means parallel approach based on Compute Unified Device Architecture (CUDA) [16] that relies on the graphic processor unit (GPU) architecture. Moreover, we adopt a simple but effective strategy to avoid the GPU memory bound issue, due to oversizing 3D datasets, for example, in a video sequence denoising scenario. We report and analyze the performance of our implementation for several 3D datasets. The parallelization strategy of the filter via GPUs gives clinically-feasible MRI denoising execution times with a significant increase in speed.
The plan of the paper is as follows. In Section 2 we briefly describe the NLM algorithm. To follow, in Section 3 we provide the implementation details. Moreover, in Section 4 we report a brief review about the related works and in Section 5 we present and discuss the results of our work. Finally, in Section 6 we draw conclusions and outline the future work.

General
Description. An -D image can be considered as a real function : R → R with a bounded support Ω ⊂ R . The NLM filter [2] is a class of endomorphisms of the image space, identified by 2 parameters ( and ℎ), that acts as follows: The intensity of a given point of the new image is a mean of the intensities of the original image, according to a weight function that disregards any explicit criterion of spatial proximity and only considers a measure (ruled by ℎ) of selfsimilarity between windows of radius centered on each point (radiometric proximity).

Actual
Algorithm. Both computational issues and the convenience to introduce a geometric proximity criterion in addition to the pure radiometric distance measure led to a change in the original version of the NLM filter [6]. Therefore, given a search radius , for each voxel located at x we define a search box as The search box associated with the th voxel defines the ensemble of voxels whose intensities will be available in the following for restoring (denoising) of the intensity (x ), thus reducing the search freedom of (4) (in that case, ≡ Ω). Analogously, given a similarity radius , for each voxel x within a given search box , we can define a similarity box In this case, plays the role of in (4), provided that the original smooth Gaussian kernel is replaced by a binary cutoff. Finally, the denoised image is where it results that the algorithm complexity is The filter strength, which is determined by ℎ, can be tuned to obtain an optimized denoising, independently from the search radius and the standard deviation of noise , where ∼ 1 is an adimensional constant to be tuned: In Algorithm 1 the pseudocode of the NLM filter is showed.

Parallel GPU Implementation
General-purpose computing on graphics processing units (GPGPU) is the use of GPUs to perform highly parallelizable computations that would normally be handled by CPU devices. Programming with GPUs requires both a deep understanding of the underlying computing architecture and Computational and Mathematical Methods in Medicine 3 (1) for each voxel ( 1 , 2 , 3 ) of the 3D image to be filtered do (2) Initialize the cumulative sum of weights and the restored value to 0; Cumulate squared Euclidean distance; (6) end for (7) Calculate and cumulate the weight of the voxel in search window; (8) Cumulate the restored value; (9) end for (10) Normalize restored value to the sum of the weights; (11) end for Algorithm 1: Pseudocode of the NLM algorithm.
Algorithm 2: C code of NLM algorithm.
a massive rethinking of existing CPU based algorithms. The basic idea behind GPGPU is handling the sequential code by CPU and processing parallel code on GPU, still having CPU be the coordinator of the data processing. Additionally, the use of multiple GPU in one system, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. We implement the 3D NLM filter on the NVIDIA parallel computing architecture, which consists in a set of cores, or scalar processors (SPs), performing simple mathematical operations. SPs are organized into a streaming multiprocessor (SM), which have a shared memory, a shared L1-cache, and several other units. A SM executes one or more thread blocks, while CUDA cores and the other units execute thread instructions; in particular, each SM executes threads in groups of 32, called warps [17]. In Figure 1 we show the NVIDIA Fermi architecture, where each SM (vertical rectangular strip) has scheduler and dispatch units (orange portion), execution units (green portion), and a configurable memory of 64 KB (light blue portions), which consists of a register file, an internal shared memory, and an L1 cache. This memory is configurable in 16 KB (or 48 KB) for shared memory and 48 KB (or 16 KB) for L1 cache. Moreover there are 16 load/store (LD/ST) units per SM and 4 special function units (SFUs) (nuances of green portions) available to handle transcendental and other special operations such as sin, cos, exp, and rcp (reciprocal).

GPU Thread Strategy
Mapping. As shown in Algorithm 1, the computational factor that influences the overall performance of this algorithm is represented by a nested iteration structure (see the statement at line 1). A basic serial implementation is presented in Algorithm 2.
As we can observe, the nested for loops represent the main bottleneck in the algorithm computation; according to this, the first step of our parallelization strategy is to unroll the outermost for loops that, in the serial version of the algorithm, perform the scanning of the 3D image.
With the aim of achieving the best performance of the algorithm and to optimize the usage of this massively parallel architecture, we perform two different unrolling strategies. In the first one, illustrated in Figure 2, we unroll the for loops that scan on -axis and -axis, leaving the for loop on -axis (partial unrolling); in this way the GPU model proposes that each thread is in charge of performing the denoising process of all the voxels in its direction (see Algorithm 3).
In the second strategy, illustrated in Figure 3, we unroll all the for loops that scan the 3D images along the directions , , and (full unrolling); in this way the GPU model proposes that each thread is in charge of performing the denoising process of the single voxel associated to it (see Algorithm 4).
These two strategies require different kernel grid configurations, in order to take full advantage of the GPU architecture; while the first strategy aims at creating a smaller number of threads with a longer computing time, the second one aims to create a larger grid and therefore a greater number of threads, which, however, have a shorter computing time.
We clearly observe that these implementations are bounded by the GPU global memory size. Since, for a fixed dataset Ω, we store two objects to manage the overall data (the first one for the input dataset and the second one for the denoised one), the maximum size of available memory is = /2, where is the GPU global memory size in byte.
(1) int const i 1 = threadIdx.x + blockDim.x * blockIdx.x; (2) int const i 2 = threadIdx.y + blockDim.y * blockIdx.y; For datasets that exceed we implement an ad hoc function that tunes and splits the data before starting the denoising. In particular, this function follows the following.
(1) Compute the available memory and the size of the dataset Ω. The function, in a first phase, splits the input dataset along the -axis; then the 3D NLM GPU kernel is executed for each Ω , exactly times, and finally, at the end of the denoising process, it performs a gathering phase, in order to reassemble the entire dataset collecting the different denoised parts from each execution of the kernel function. It should be noted that the size of each subdataset Ω is computed taking into account an overlapping zone between two adjacent subdatasets, to adapt the NLM algorithm 3D on suitable search and similarity windows. In Figure 4 the graphic representation of the split along the -axis is showed, where, to simplify the discussion, we do not consider the overlapped portions.
For a better understanding of the entire process and its different steps that compose the 3D NLM GPU algorithm, we show in Figure 5 its work flow, where the blue colored charts represent the CPU workload and the green colored charts represent the GPU one. The input dataset Ω is splitted in subdatasets Ω , by the "tune and split" function. One at a time, each Ω is copied from the CPU (host) to the GPU (device), then it is processed by the 3D NLM kernel and the resulting denoised subdataset Ω * is copied from the GPU to the CPU. Finally, all the Ω * are merged in order to obtain the denoised dataset Ω * .

Multi-GPU Implementations.
To exploit even larger degrees of parallelism in our 3D dataset denoising, we deploy a new version of the 3D NLM algorithm that is compatible with multi-GPU architectures. The multiple GPUs avoid many of the limitations of single GPU utilization (e.g., on-chip memory resources) by exploiting the combined resources of multiple devices. Dividing the algorithm across multiple GPUs requires partitioning the computational space and allowing each GPU to operate on a subsection of the data. Although, during all our experiments, we consider only configurations with one or two GPU, the framework is fully compatible to exploit the use of more than two GPUs.
It is important to underline that the input dataset will be partitioned among the global memories of the GPUs by means of the ad hoc function for tuning and splitting, above described. Also in this case, we split the input dataset along the -axis, thereby distributing the slices among the available GPUs; then we allocate on each GPU global memory only the required space to store the subset of the input data. At the end of the denoising process, we perform a gathering phase, in order to reassemble the entire dataset collecting the different denoised parts from each GPU. As we will show in the next section, the transfer time between GPU global memory and the host RAM memory minimally affects the total computing time of the algorithm.
In Figure 6 we report a simple flowchart that summarizes our multi-GPU strategy. Let be the number of available GPU units and the number of subdatasets Ω , Computational and Mathematical Methods in Medicine 5 (1) / * "my in img" and "my out img" are respectively the sections of the images "in img" and "out img" splitted between the "n gpus" GPUs. * / (2) int const i 1 = threadIdx.x + blockDim.x * blockIdx.x; (3) int const i 2 = threadIdx.y + blockDim.y * blockIdx.y;   (where = 1, . . . , / and = 1, . . . , ). If the sizes of the input and output datasets, respectively, Ω and Ω * , are bounded by the total GPU global memory size, then is a multiple of , with > ; otherwise = . As we can observe, the input dataset Ω is splitted into subdatasets Ω , , by the "tune and split" function. At this point, the Ω , are copied from the CPU to the available GPU units that, independently of one another, perform the denoising of the assigned Ω , . Then the resulting denoised subdataset Ω * , is copied from the specific GPU unit to the CPU. Finally, all the Ω * , are merged in order to obtain the denoised dataset Ω * .
Exploiting the multi-GPU support will lead to a drastic reduction of the execution times as we will detail in Section 5.1. This kind of approach allows us to take advantage of a second type of parallelism; indeed we observe that the denoising of a single slice along the -axis is independent from the denoising of the other slices along the same axis; therefore it is possible to split the dataset into subsets of contiguous slices that can be processed in parallel by several GPUs.
The basic idea consists in the identification of the number of GPU devices on the architecture; this information can be  returned by means of a CUDA library function and stored in the variablen gpus. Then, the third dimension of the image is "splitted" between the available GPUs. In Algorithm 5, we report a scratch of the multi-GPU implementation with a partial unrolling strategy. For the multi-GPU algorithm with full unrolling strategy, the implementation is in a similar way. As data access times in memory can vary depending on the kernel configuration, we focus our attention on testing several configurations, both mono-and bidimensional, for the thread block size in which each slice is divided; therefore, each thread processes sequentially the voxels along the third dimension. The workload is divided along the third dimension for multi-GPU configurations. Inside each GPU the workload is divided along the first and second dimensions, in strips (monodimensional configurations) and tiles (bidimensional configurations) of threads. Strips or tiles are allowed to cover entirely or only partially the slice grid.  Copy · · · · · · · · · · · · 3D NLM kernel 3D NLM kernel In Figure 7 we report a schematic representation of the data processing subdivision. Moreover, we also test the performance impact of L1cache usage, by using the binary L1-prefer setting, which allows choosing between two possible configurations: 48 KB of shared memory and 16 KB of L1-cache (no L1-prefer), or 16 KB of shared memory and 48 KB of L1-cache (L1prefer).
The computing environment we used is equipped with 2 Intel Xeon CPU E5620 (2.4 GHz) and an NVIDIA TESLA S2050 card, based on the FERMI GPU. This device consists of 4 GPGPU units, each of which with 3 GB of RAM memory and 448 processing cores working at 1.15 GHz. Moreover, the middleware framework has OS Linux 2.6.18-194.el5, CUDA release 3.2, and C compiler gcc v.4.1.2. The numerical code is implemented by using the single precision arithmetic. The CPU system is equipped with an INTEL core i5-2500S (Sandy Bridge) at 2.7-3.7 GHz.

Related Work
General-purpose computing on graphics processing units (often termed GPGPU or GPU computing) supports a broad range of scientific and engineering applications, including physical simulation, signal and image processing, database management, and data mining [18][19][20][21][22].
Recently, a number of authors have implemented the 2D Nonlocal Means algorithm on a GPU; in this section, we will describe representative ones. In [23] a locally constant weight assumption is used in the GPU implementation to speed up the basic algorithm. In [15], a GPU extension of the Nonlocal Means algorithm is proposed to denoise ultrasound images. In this approach, the maximum patch size is limited by the amount of shared memory of the GPU. A technique to incorporate both temporal information and color information into the Nonlocal Means algorithm, in order to restore video sequences, is described in [16]. From a parallel computational point of view, they did not report in depth analysis of the accelerated algorithm; moreover the algorithm they consider is basically a bidimensional implementation, that although it is computationally less expensive, is not able to reach the same denoising quality results of the 3D version we consider and implement in this paper.

Results and Discussions
In this section we want to investigate and discuss the performance and the scalability of the proposed GPU algorithm, taking into account also the splitting strategy. To better point out the effective application of this implementation, we report, in Section 5.1, a magnetic resonance imaging case of study.

A Case of Study: MRI Consistency.
As demonstrated in these research articles [24,25], magnetic resonance imaging can take great advantage of the Nonlocal Means filter, since the MRI images always suffer of Rician noise. Therefore, in order to produce a strictly equivalent GPU implementation of the sequential CPU NLM algorithm, we check the  Figure 7: Schematic representation of thread organization inside a single GPU and between GPUs. Thread workload highlights the voxels that will be processed by a single thread. Moreover, threads can be organized in strips or tiles of the specified Block size. The red cube depicts the search window and the smaller blue one depicts the similarity window.   implementation consistency by comparing voxel-by-voxel the images obtained by CPU and GPU denoising. In Figure 8 we show the 3D NLM filtering results, where the overall simulations are carried out with a 3D brain MRI dataset (http://www.bic.mni.mcgill.ca/brainweb) of 217 × 181 × 181.
The difference between the GPU and CPU restored images falls within machine precision order of magnitude, which are likely to be due to the arithmetic logic unit precision. Since the differences are negligible (six orders of magnitude smaller than signal), they do not impair visual inspection.

Performance.
To demonstrate the advantages and benefits of using our parallelization strategy we report several performance experiments. From a memory usage optimization point of view, we focus our attention on the impact of the cache size on the overall execution time and perform several test runs by varying L1-prefer modality; promising results are shown in Table 1, both for the partial unrolling algorithm and the full unrolling algorithm. Clearly, L1-prefer choice gives a benefit on lager dataset, with a performance improvement that ranges from fraction of percent in the smallest dataset to some 5% in the largest ones. From the other side, as we can observe in Table 1, forcing the L1-prefer setting gives a performance that benefits only the partial unrolling strategy. These results suggest that the L1 miss rate, still with 16 KB, is low enough to have high performance processing even with old generation cards having small amount of cache. As we mentioned in the previous section, the kernel grid configuration plays a key role if we want to optimize the executions of the GPU NLM filter. The strip or tile thread division influences the filter performance in terms of computing time, due to the different type of data access. Experimental results we carry out prove that an optimal configuration is given by the strip subdivision and in Table 1, both for the partial unrolling algorithm and the full unrolling algorithm; we report the running times of (128, 1, 1) configuration on 2 GPUs. Adopting these different kinds of grid organization, Tables 2 and 3 (both for the partial unrolling algorithm) and Tables 4 and 5 (both for the full unrolling algorithm) summarize a detailed comparison between running times of CPU, single GPU, and multi-GPU implementation of 3D NLM filter with various thread block size. Speed-up values suggest that the bigger the dataset to be filtered, the better the scalability of the implementation. In particular, for datasets size typical of MRI clinical practice, the scalability of the proposed implementation is close to be ideal. Moreover, the optimal thread configuration seems to be strips of thread between 128 and 256 elements. This result is strongly consistent with NVIDIA guidelines [26] about optimal thread block size. Finally, the obtained results suggest that a strip Table 6: Full unrolling algorithm on a single GPU unit. Execution times and speed-up values for a 3D dataset of normally distributed random numbers (size = 512 × 512 × 128) for several window configurations. configuration should be preferred to tile one of the same size, because of sequential memory access.
In Figure 9 we outline the CPU (blue curve), 1 GPU (red curve), and 2 GPU (green curve) GFlops for the full unrolling algorithm with variable dataset sizes and thread block configuration (128, 1, 1) (see Tables 4 and 5), since the performance calculation of an algorithm through GFLOPs is very indicative in areas where the floating point calculation is the predominant feature of the calculation itself. It should be remarked that we are able to exploit up to 55% of single precision floating point peak performance of the used GPU.
Another factor that can strongly influence the performance trend of the algorithm is the size of the search and similarity windows. Hence we investigate the behavior of the full unrolling algorithm, with a 3D dataset of normally distributed random numbers (size = 512 × 512 × 128), against the search (| |) and similarity (| |) window cardinalities. In our experiments we consider the search and similarity window size to be larger than the common dimension in the real scenarios. Observing Tables 6 and 7, respectively, for executions on 1 GPU unit and 2 GPU units, we note a high and almost constant speed-up among the various experiments, which makes feasible large window filter testing in a reasonable time.
At the end of this section we draw some practical remarks. To have the best hardware performance and results closer to NVIDIA guidelines [26], configuration tricks are needed. It is strongly recommended to use the L1-prefer feature in the partial unrolling algorithm. This fact is related to the principle locality of the threads for data accesses. Moreover, in overall experiments, we have observed that the optimal thread configuration seems to be strips of thread between 128 and 256 elements.

GPU NLM Video
Denoising. The performance results showed in the previous section encourage the application of this denoising filter to a number of different scenarios, formerly far away from being able to be taken into account, due to the filter high computational demand on large datasets. Denoising video sequences represent one, where the increase of resolution and time of a video sequence would lead to denoising times that do not take into account adopting a serial version of the filter. According to this, we perform several performance tests on different video in resolution and fps, in order to demonstrate the scalability of our splitting approach on huge input datasets. Selecting the most commonly used video formats, PAL (720 × 576), HD (1080 × 720), and FULL-HD (1920 × 1080), we report the execution times for the 3D NLM GPU kernel with the partial unrolling strategy, in Table 8, and with the full unrolling one in Table 9.
Investigating more in depth the scalability of our full unrolling approach, in Figure 10, we fixed the third dimension (fps) and tuned the first two dimensions (see the related Tables 8 and 9). In this case, the algorithm well scales with each research windows configuration. Moreover, in this chart we correlate the ratio between the HD format execution times with respect to the PAL one and the FULL-HD format with respect to the HD one; each color represents a research window configuration of the NLM algorithm. In both histograms, the colored cubes have approximately the same dimension between them, which indicates that the ratio between the execution time remains almost constant.
In Figure 11, we fixed the PAL format and tuned the fps along the third dimension, in order to investigate the scalability of the full unrolling algorithm, for a selected research window of the NLM algorithm.  On the -axis we have four PAL videos differing in fps (50, 100, 200, and 500), on -axis we have four research window sizes (10 = (11 3 , 3 3 ), 20 = (11 3 , 5 3 ), 30 = (21 3 , 3 3 ), and 40 = (21 3 , 5 3 )), and on -axis we have the execution times. The below surface represents the ideal execution times that would be obtained if the algorithm perfectly scales, growing the third video dimension (fps). The top surface instead represents the execution times actually obtained by the algorithm. As we can observe, the two surfaces deviate from each other and this phenomenon grew considerably with the increase of the values along the third dimension. Accordingly, it was necessary to modify the algorithm to increase scalability, adopting the splitting strategy described in the previous section. In detail, splitting a video sequence, in subsequences at 50 fps, we obtain execution times proportional to increasing fps, while augmenting the overall fps number. As example, considering a video sequence of 100 fps, the execution time of the algorithm will be approximately twice with respect to a sequence at 50 fps (both the same format; see Table 10); the time necessary to transfer the data from the CPU to the GPU and conversely is negligible.
Finally, in Figure 12, we report a very encouraging result in terms of scalability; on -axis we have 4 FULL-HD videos differing in fps (50, 100, 200, and 500) and on -axis we have our GPU NLM denoising execution times. In the chart, the dashed line represent the ideal scalability, the red line the scalability obtained by using the fully unrolled strategy, and the green line the scalability by using the splitting strategy. As we can observe, our splitting strategy applied on the GPU version of NLM algorithm presents a scalability very close to the ideal one.

Conclusions
NLM filter is a state-of-the-art denoising algorithm. However, the huge amount of computational load prevents the largescale diffusion of most common implementations of the algorithm, in particular for 3D datasets of great sizes.       In this paper we design and implement a fully 3D Nonlocal Means parallel approach, adopting different algorithm mapping strategies on a GPU and multi-GPU architecture in order to demonstrate its high scalability and consequent applicability. We focus our attention on designing and testing several kernel configurations that rely on two different threads usage schemes: bidimensional and tridimensional CUDA kernels adopting a multi-GPU framework. In this way, we argue to identify a set of optimal settings that guarantee high performance results in terms of execution time and scalability of this approach, for a wide spectrum of application scenarios. Moreover, we propose two different cases of studies; the first simulates an MRI scenario, where the Nonlocal Means algorithm is widely applied to remove the Rician noise in this kind of medical images; the second simulates a video sequence that represents a suitable example of huge 3D dataset.
The obtained running times and the scalability demonstrate that, for most common 3D dataset sizes, thanks to our parallelizing approaches, we are able to drastically reduce the inapplicability of the Nonlocal means algorithm, that is, due to its huge amount of computational demand. The experimental scenarios we showed obtain great benefits by adopting a parallel approach using the GPU architecture and therefore encourage the exploration of more sophisticated algorithm variants and reduce the gap between the previous execution times and acceptable performance for real-time scenarios.