A Correlation Based Strategy for the Acceleration of Nonlocal Means Filtering Algorithm

1Laboratory of Image Science and Technology, Southeast University, Nanjing 210096, China 2Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, Nanjing 210096, China 3Centre de Recherche en Information Biomédicale Sino-Français (LIA CRIBs), 35000 Rennes, France 4INSERM, U1099, 35000 Rennes, France 5Université de Rennes 1, LTSI, 35000 Rennes, France


Introduction
Digital image denoising has been a fundamental and challenging issue for several decades [1,2]. Many contributions have been devoted to recovering images degraded by Gaussian noise. Much attention has been paid to Partial Differential Equation (PDE) approaches, among which the total variation (TV) and Perona-Malik (PM) models are well known [3][4][5]. Other methods rely on transforming the image from the spatial domain to another domain (such as the Fourier, wavelet, or DCT domain); after adjusting the transform coefficients, the image is restored by applying the inverse transform [6,7].
The nonlocal means (NLM) filtering algorithm was proposed by Buades et al. [8]. They suggested that a denoised pixel is the weighted average of its neighboring pixels, with the weights calculated from the normalized Gaussian-weighted Euclidean distance between the blocks centred at those pixels. The NLM algorithm has demonstrated better performance than other mainstream filtering methods, such as the bilateral filter and the TV model, in both visual quality and objective measures [9]. Unfortunately, the computational cost of computing the weights is too high for many applications.
Some solutions have been proposed to alleviate the heavy computational burden of the weight calculation. Liu et al. proposed an approximation to the similarity of neighborhood windows by employing an efficient Summed Square Image (SSI), further combined with the Fast Fourier Transform (FFT) [10,11]. In [12,13], the NLM algorithm is accelerated by eliminating part of the weight computation through a preclassification step based on hard thresholds of local block measures (average intensities, gradients, and first- and second-order moments). Probabilistic Early Termination (PET) schemes, as well as cluster trees, Singular Value Decomposition (SVD), and dictionaries for image blocks and edges, have also been employed to speed up the weight calculation [14][15][16][17].
In contrast with the traditional NLM algorithm, in which the weights are calculated pixel by pixel, the fast strategy proposed in this paper operates on the whole image at once. Specifically, correlation operators are applied to compute the differential image, which leads to a straightforward shortcut for obtaining all the weights. In this way, a large amount of redundant computation is avoided and the speed of NLM filtering is improved.
This paper is organized as follows. Section 2 briefly reviews the NLM algorithm. Section 3 analyzes the repetitive computation that makes the original per-pixel NLM slow and describes our proposed fast NLM algorithm. Section 4 reports experiments showing that our method brings a significant improvement over competitive approaches. Section 5 concludes the paper with some discussion.

The Nonlocal Means Algorithm
Let the noisy image be modeled as

v(i) = u(i) + n(i), (1)

where v(i) denotes the noisy image value, u(i) represents the original noise-free image value, and n(i) is the noise value at the pixel i. The idea behind the NLM filter is to consider that the denoised value of the pixel i is the weighted mean of all pixel values of the noisy image (indexed in the image domain I) [1,8]:

NLM(v(i)) = Σ_{j∈I} w(i, j) v(j). (2)

However, considering the high computational cost, it was suggested in [1,8] that the denoised value v(i) can be estimated using only the pixels within a larger search neighborhood S_i containing M pixels:

NLM(v(i)) = Σ_{m=1}^{M} w(i, j_m) v(j_m), (3)

where NLM(v(i)) is the filtered value of the pixel i and w(i, j_m) represents the weight between v(i) and v(j_m). The weight is obtained by

w(i, j_m) = W(i, j_m) / Z(i), (4)

where Z(i) is a normalization factor,

Z(i) = Σ_{m=1}^{M} W(i, j_m), (5)

and W(i, j_m) denotes the exponential Gaussian-weighted Euclidean distance:

W(i, j_m) = exp(−d_a(i, j_m) / h²). (6)

In (6), h acts as a filtering parameter and d_a(i, j_m) is the Gaussian-weighted Euclidean distance between the blocks B_i ∈ R^{(2r+1)×(2r+1)} and B_{j_m} ∈ R^{(2r+1)×(2r+1)} centred at the pixels i and j_m, respectively, given by

d_a(i, j_m) = Σ_{p=1}^{P} a(p) (B_i(p) − B_{j_m}(p))², (7)

where B_i(p) is obtained by scanning the pixel values of the block B_i column by column and concatenating them, a is a Gaussian kernel, and P is the number of pixels within the square block B_i, whose radius is set to r.
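The per-pixel computation described by (3)-(7) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the parameter names, the reflect-padding, and the use of a plain (unweighted) block distance instead of the Gaussian-weighted one are our own simplifications.

```python
import numpy as np

def nlm_pixel(v, i, j, h, s=7, r=2):
    """Denoise pixel (i, j) of noisy image v by plain per-pixel NLM.

    s: search-neighborhood radius, r: block radius, h: filtering parameter.
    A simplified (unweighted) Euclidean block distance is used.
    """
    pad = np.pad(v, s + r, mode="reflect")
    ci, cj = i + s + r, j + s + r          # centre in padded coordinates
    ref = pad[ci - r:ci + r + 1, cj - r:cj + r + 1]
    num, Z = 0.0, 0.0
    for di in range(-s, s + 1):
        for dj in range(-s, s + 1):
            blk = pad[ci + di - r:ci + di + r + 1, cj + dj - r:cj + dj + r + 1]
            d = np.mean((ref - blk) ** 2)   # block distance d(i, j_m)
            w = np.exp(-d / h ** 2)         # W(i, j_m), cf. (6)
            num += w * pad[ci + di, cj + dj]
            Z += w                          # normalization factor Z(i), cf. (5)
    return num / Z
```

Note that the double loop over the search neighborhood runs for every pixel of the image; this is exactly the redundancy the proposed whole-image strategy removes.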

Our Proposed Fast Algorithm.
To avoid repetitive computation of the pixel differences F_m(i) (1 ≤ m ≤ M, 1 ≤ i ≤ N) in (12) caused by the traditional pixel-by-pixel approach, the proposed correlation-based fast NLM algorithm treats the image Y as a whole. In this case, it is necessary to rewrite (1)-(9) in matrix form; mathematically, it is trivial to extend them to the nonscalar matrix case. The variables in (1)-(9) then refer to all the N pixels in the noisy image Y rather than to a specific pixel i.
To operate on the whole image Y, our algorithm is accordingly divided into eight steps, which follow the reverse order of (1)-(9).
Step 1. Compute the "differential image" F_m, the matrix-form counterpart of F_m(i) in (12). This step can be efficiently implemented via a correlation operator:

F_m = Y ∘ K_m, (13)

where "∘" denotes the correlation operation and F_m ∈ R^{√N×√N} denotes the m-th differential image of the original image Y. K_m ∈ R^{(2s+1)×(2s+1)} is a 2D correlation kernel containing only two nonzero elements: its centre value is 1 and its m-th pixel value is −1. Note that each value of m yields a different correlation kernel K_m, since m determines the position of the value −1. Figure 2 gives an illustration of the m = 5th differential image of the whole image within an 11 × 11 search neighborhood. By means of (13), the obtained differential image contains all the information needed in (12). In fact, (13) can be implemented directly by a translation operation:

F_m = Y − Y_{v_m}, (14)

where Y_{v_m} represents the image shifted by the coordinate vector v_m, 1 ≤ m ≤ M. Here, the m-th vector v_m points to the coordinate (x_m, y_m) from the origin (0, 0), as shown in Figure 2. Note that the correlation kernels K_m and the vectors v_m, 1 ≤ m ≤ M, are in one-to-one correspondence; that is, when

(x, y) is the coordinate of the "−1" entry in K_m shown in Figure 2, (15)

its corresponding v_m is given by

v_m = (x − x_0, y − y_0), (16)

where (x_0, y_0) is the coordinate of the kernel centre.

Step 2. Compute the "square image" S_m, the matrix-form counterpart of F_m(i)² in (11). This step is an element-wise square operation:

S_m = F_m • F_m, (17)

where "•" denotes the element-wise product operator.
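Steps 1 and 2 amount to a shift-and-subtract over the whole image. A sketch using `np.roll` for the translation follows; the periodic (wrap-around) boundary handling is our own simplification, not the paper's.

```python
import numpy as np

def differential_image(Y, v_m):
    """Step 1: m-th differential image F_m = Y - Y_{v_m}, cf. (14).

    v_m is the translation vector (dy, dx); np.roll(Y, -v_m) shifts the
    content so that F_m[i] = Y[i] - Y[i + v_m] under periodic boundaries.
    """
    return Y - np.roll(Y, shift=(-v_m[0], -v_m[1]), axis=(0, 1))

def square_image(F_m):
    """Step 2: element-wise square S_m = F_m • F_m, cf. (17)."""
    return F_m * F_m
```

A single call produces the differences of every image pixel with its m-th neighbor at once, which is the core of the whole-image strategy.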
Step 3. Compute the Gaussian-weighted Euclidean distance image D(v_m), the matrix-form counterpart of d_a(i, j_m) in (7). This step can also be implemented by a correlation:

D(v_m) = S_m ∘ G_a, (18)

where G_a is the Gaussian kernel and D(v_m) ∈ R^{√N×√N} is the Gaussian-weighted Euclidean distance image corresponding to the m-th pixel in the search neighborhood S for the whole image Y. Equation (18) can be efficiently implemented by FFT:

D(v_m) = IFFT(FFT(S_m) • FFT(G_a)). (19)

Step 4. Compute the exponential Gaussian-weighted Euclidean distance image W(v_m), the matrix-form counterpart of W(i, j_m) in (6).
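Steps 3 and 4 can be sketched with SciPy's FFT-based convolution. Because the Gaussian kernel is symmetric, correlation and convolution coincide, so `fftconvolve` realizes (19) directly; the kernel construction (`gaussian_kernel`) and its parameters are our own assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def gaussian_kernel(r, sigma):
    """A normalized 2D Gaussian kernel of radius r (our own construction)."""
    ax = np.arange(-r, r + 1)
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def distance_image(S_m, G):
    """Step 3: Gaussian-weighted distance image D(v_m) = S_m ∘ G_a, cf. (18).

    For a symmetric kernel, correlation equals convolution, so the
    FFT-based fftconvolve implements (19).
    """
    return fftconvolve(S_m, G, mode="same")

def weight_image(D_m, h):
    """Step 4: exponential weights W(v_m) = exp(-D(v_m) / h^2)."""
    return np.exp(-D_m / h ** 2)
```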
Step 6. Compute the normalization factor image Z, the matrix-form counterpart of Z(i) in (5).
Step 7. Compute the weight images w(v_m), the matrix-form counterpart of w(i, j_m) in (4). This step can be implemented by an element-wise division operation.
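Putting the steps together with the final weighted average gives a compact sketch of the whole-image fast filter. This is a didactic version under our own assumptions (periodic boundaries via `np.roll`, our own Gaussian kernel construction); it also omits the symmetry optimization described below for clarity.

```python
import numpy as np
from scipy.signal import fftconvolve

def fast_nlm(Y, h, s=7, r=2, sigma=1.0):
    """Correlation-based NLM computed over the whole image Y at once.

    s: search radius (M = (2s+1)^2 translations), r: block radius,
    h: filtering parameter, sigma: width of the block kernel G_a.
    """
    ax = np.arange(-r, r + 1)
    xx, yy = np.meshgrid(ax, ax)
    G = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    G /= G.sum()                           # Gaussian block kernel G_a

    num = np.zeros_like(Y, dtype=float)    # running weighted sum of pixels
    Z = np.zeros_like(Y, dtype=float)      # normalization factor image (Step 6)
    for dy in range(-s, s + 1):
        for dx in range(-s, s + 1):
            Yv = np.roll(Y, shift=(-dy, -dx), axis=(0, 1))  # translated image
            F = Y - Yv                                       # Step 1, cf. (14)
            S = F * F                                        # Step 2, cf. (17)
            D = fftconvolve(S, G, mode="same")               # Step 3, cf. (19)
            W = np.exp(-D / h ** 2)                          # Step 4
            num += W * Yv
            Z += W
    return num / Z                                           # Steps 7-8
```

Every operation inside the loop is a whole-image array operation, so no per-pixel block comparison is ever repeated.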

Symmetry Property. Note that the computational cost of Step 5 can be further reduced by exploiting the symmetry property of the translation vectors v_m, namely,

v_{M−m+1} = −v_m, (20)

where v_{M−m+1} denotes the (M − m + 1)-th translation vector. The symmetry property is shown in Figure 3. Thus, when we calculate the m-th weight image W(v_m), the (M − m + 1)-th weight image W(v_{M−m+1}) is obtained at the same time as a by-product, which halves the computational cost.
Specifically, when the kernel K_m is employed to produce the m-th differential image F_m according to (14), the (M − m + 1)-th differential image F_{M−m+1} is obtained by means of (20) as

F_{M−m+1} = −(F_m)_{v_{M−m+1}}, (21)

that is, the negated copy of F_m translated by v_{M−m+1}. Next, when S_m is calculated by (17), S_{M−m+1} is also obtained immediately as

S_{M−m+1} = (S_m)_{v_{M−m+1}}, (22)

since the element-wise square removes the sign. Then, by applying the operations of Steps 3-4 only once, W(v_m) and W(v_{M−m+1}) are obtained simultaneously.
Intuitively, this can be observed in Figure 3. By means of the kernel K_13, for each pixel i within the image Y, the 13th differential image F_13(i) is available, corresponding to the difference between the red and green rectangles. At the same time, the negative of F_109(i) is also available, corresponding to the difference between the red and blue rectangles (for the 11 × 11 neighborhood, M = 121 and 109 = M − 13 + 1). Following Steps 3-4, with the operations applied only once, W(v_13) and W(v_109) are obtained simultaneously.
Therefore, according to the symmetry property (see (20)), the number of necessary translation operations in terms of v_m can be reduced from (M − 1) to (M − 1)/2, which halves the computational cost.
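The symmetry property can be checked numerically: the differential image for the opposite translation vector is the negated, shifted copy of the original one, and the two square images coincide after the shift. The sketch assumes periodic boundaries via `np.roll`, which is our own convention.

```python
import numpy as np

def differential(Y, v):
    """F_v = Y - Y_v under periodic boundaries (np.roll convention)."""
    return Y - np.roll(Y, shift=(-v[0], -v[1]), axis=(0, 1))

rng = np.random.default_rng(0)
Y = rng.random((16, 16))
v = (2, -3)

F_m = differential(Y, v)
F_opp = differential(Y, (-v[0], -v[1]))  # opposite translation vector

# The opposite differential image is the negated, shifted F_m, so Steps 1-2
# need only be executed for half of the translation vectors.
assert np.allclose(F_opp, -np.roll(F_m, shift=v, axis=(0, 1)))
# The square images coincide after the same shift, so Steps 3-4 are shared too.
assert np.allclose(F_opp ** 2, np.roll(F_m ** 2, shift=v, axis=(0, 1)))
```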

Computational Complexity.
Note that the differences between the proposed fast algorithm and the traditional per-pixel NLM algorithm lie in Steps 1-3; Steps 4-8 of the proposed algorithm involve the same operations as the original NLM algorithm. The computational complexity of these later steps is therefore not taken into account in the following complexity analysis.
The computational complexity of the proposed algorithm is analyzed in this section. We assume that the FFT and the inverse FFT (IFFT) have the same computational complexity. In general, a two-dimensional FFT of a √N × √N image requires approximately (5/9)N log₂ N multiplications and (4/3)N log₂ N additions [18].
Step 3. Computing D(v_m) in (19) for all 1 ≤ m ≤ (M − 1)/2 involves ((M − 1)/2 + 1) FFTs, (M − 1)/2 IFFTs, and (M − 1)/2 element-wise multiplications. Therefore, it needs approximately (5/9)MN log₂ N + (M − 1)N/2 multiplications and (4/3)MN log₂ N additions. The computational costs of D(v_m) using the proposed algorithm, the original algorithm [8], and the FFT-based algorithm [11] are given in Table 1. Both the proposed algorithm and the algorithm of [11] are superior to the original NLM algorithm in terms of computational complexity. According to Table 1, if condition (23) holds, the complexity relations (24) and (25) also hold. To satisfy (23), the minimum number of pixels in the search neighborhood, M_min, is 3√N, from which the minimal radius s_min of the search region can be deduced. Table 2 lists the M_min and s_min values for different image sizes. These requirements on minimal values are easily met in most practical cases; thus, the proposed method is superior to that of [11].
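The operation counts above can be tabulated for concrete sizes. This sketch uses the FFT constants quoted from [18]; the totals for Step 3 follow the counts reconstructed above and should be read as an illustration, not as the paper's Table 1.

```python
import math

def fft2_mults(N):
    """Approximate multiplications of a 2D FFT on a sqrt(N) x sqrt(N) image [18]."""
    return (5 / 9) * N * math.log2(N)

def fft2_adds(N):
    """Approximate additions of the same 2D FFT [18]."""
    return (4 / 3) * N * math.log2(N)

def step3_mults(N, M):
    """Multiplications of Step 3: M transforms plus (M-1)/2 element-wise products."""
    return (5 / 9) * M * N * math.log2(N) + (M - 1) * N / 2

N = 512 * 512          # the "Lena" test image size used in the experiments
M = 21 * 21            # largest search neighborhood used (21 x 21)
print(f"one 2D FFT:  {fft2_mults(N):.3e} mults, {fft2_adds(N):.3e} adds")
print(f"Step 3 total: {step3_mults(N, M):.3e} mults")
```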

Experiments
In this section, we assess the performance of the proposed algorithm through experiments on the 512 × 512 "Lena" image (Figure 4). The degradation is simulated by adding zero-mean Gaussian noise with standard deviations of 10, 20, and 30. For each noise level, we use different sizes of the search neighborhood S and the block B to obtain the best denoising effect: for standard deviations 10, 20, and 30, the block sizes are set to 3 × 3, 5 × 5, and 7 × 7 and the search neighborhoods to 15 × 15, 17 × 17, and 21 × 21, respectively. For a fair comparison with the methods in [8,11], we use the simplified weight calculation based on the plain Euclidean distance rather than the Gaussian-weighted one. All experiments are performed in MATLAB R2015a on a PC with 16 GB RAM and an Intel i7 CPU. Our method is compared with Buades's classical per-pixel algorithm [8] and Wang's algorithm [11].
We conduct quality assessment in terms of the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) [19]. These quantitative metrics are defined by

PSNR = 10 log₁₀(255² / MSE), MSE = (1/(MN)) Σ_{x=1}^{M} Σ_{y=1}^{N} (u(x, y) − û(x, y))²,

SSIM(x, y) = ((2μ_x μ_y + C_1)(2σ_xy + C_2)) / ((μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2)), (26)

where μ_x and μ_y are the means over the 8 × 8 square window, M and N are the image dimensions, σ_x and σ_y are the corresponding standard deviations, σ_xy is the corresponding covariance, and C_1 and C_2 are constants set according to [19].
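The two metrics can be computed as follows. This is a simplified sketch: SSIM is evaluated over a single global window rather than the 8 × 8 sliding window of [19], and the constants C1, C2 use the conventional values, which are our own assumptions.

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio: 10 log10(peak^2 / MSE), cf. (26)."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def ssim_global(x, y, peak=255.0):
    """SSIM over a single global window (simplified from the 8x8 local form).

    C1, C2 follow the usual choice C1 = (0.01*peak)^2, C2 = (0.03*peak)^2.
    """
    C1, C2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cxy + C2)) / \
           ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2))
```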
The experimental results are reported in Table 3. The proposed algorithm leads to a very significant decrease in computation time. Because all the methods rely on the same Euclidean distance, the three methods yield identical PSNR and SSIM values in this table.

Conclusion and Discussion
In this paper, we proposed a correlation-based strategy to accelerate the NLM algorithm. This method filters the whole image at once instead of pixel by pixel as in the original NLM. Both the computational complexity and the running time are greatly reduced compared with the original NLM method [8] and the method reported in [11]. The acceleration method of [11] must be applied with the simplified Euclidean distance, whereas the proposed approach overcomes this limit by efficiently calculating the Gaussian kernel correlation in (18). Our future work will consider parallelization using GPU techniques and will address applications in medical image processing, such as low-dose CT [20][21][22], as well as color image processing [23][24][25].
Recently, collaborative filtering algorithms such as BM3D (block matching and 3D filtering), which extend the general NLM concepts of grouping and block matching, have shown superior denoising performance. They are described in [26] and applied in [27] to complex data sets with good results. However, these methods also suffer from the large computational cost of the grouping step. Our proposed correlation-based strategy can readily address this problem: the grouping process is analogous to the weight calculation within the NLM algorithm, so our method can be employed for grouping as well. In addition, because of the presence of noise, the traditional grouping strategy is easily disturbed [26,27]. Our method can alleviate this issue to a large extent and enhance the grouping by adjusting the Gaussian filtering kernel employed in the NLM algorithm (see (18)).
Finally, referring to Table 3, it can be observed that the computation time increases with the noise standard deviation. This is because images corrupted by stronger noise require a larger search neighborhood S and block B, and vice versa. Fortunately, the concept of a random resampling mask developed in [28,29] can lower the effective noise standard deviation. Random-mask prefiltering could therefore help reduce the computation time of these methods, which will be further studied in our future work.

Figure 2 :
Figure 2: The introduced correlation kernel matrix and its corresponding v.

Figure 3 :
Figure 3: The schematic representation of the symmetry property for weight calculation.

Figure 4 :
Figure 4: The "Lena" image test in the experiment.
Figure 1: Two adjacent pixels i and g with their blocks B_i and B_g and their search neighborhoods S_i and S_g. The pixels within the 11 × 11 blue frame form the search neighborhood S_i of the pixel i, the 5 × 5 blue frame is the block B_i of the pixel i, and the central blue square is the pixel i itself. The red frames (from small to large) correspond to the pixel g, the block B_g, and the search neighbourhood S_g, respectively.
3.1. Repetitive Computation of the Per-Pixel Algorithm. From the above observations, we can see that, for one specific central pixel i, the Gaussian-weighted Euclidean distance d_a(i, j_m) between the block B_i centred at i and each block B_{j_m} centred at j_m in the search neighborhood S_i needs to be computed. For any pixel g adjoining the pixel i, the distances d_a(g, g_m) between the block B_g centred at g and the blocks B_{g_m} centred at g_m in the neighborhood S_g also need to be computed. Because the two pixels are adjacent, S_i and S_g overlap heavily, and many of the blocks B_{j_m} coincide with the blocks B_{g_m}. Therefore, implementing the NLM algorithm pixel by pixel makes the block-based Gaussian-weighted Euclidean distance calculation highly repetitive.

Table 1 :
The computational complexity of different methods.

Table 2 :
The  min and  min values for different image sizes.

Table 3 :
The experiment results with different methods.