Spatiotemporal Video Denoising Based on Adaptive Thresholding and Clustering

In this paper, we propose a novel video denoising method based on adaptive thresholding and K-means clustering. The proposed method applies adaptive thresholding rather than the conventional hard-thresholding of the VBM3D method. The adaptive thresholding adjusts to the amount of noise: hard-thresholding is applied to the higher-noise areas, while soft-thresholding is applied to the lower-noise areas. Consequently, the noise is removed effectively while the edges of the image are preserved. Moreover, because the clustering approach saves computation time and finds relevant patches more reliably than the block-matching approach, K-means clustering is adopted in the final estimate instead of the block-matching of the VBM3D method; this restricts the search for candidate patches to the region of the reference patch and therefore improves the grouping. Experimental results demonstrate the superiority of the new method over the reference methods in terms of visual quality, Peak Signal-to-Noise Ratio (PSNR), and Image Enhancement Factor (IEF). The proposed algorithm also requires less execution time than the VBM3D algorithm.


Introduction
The search for effective video denoising methods remains a major challenge for researchers. Denoising by spatial domain methods such as total variation, bilateral filtering, and nonlocal means filtering is more effective for still image processing than other algorithms. However, these methods often fail to preserve image features (e.g., edges). A plethora of transform-domain algorithms have been proposed to overcome the flaws of these spatial domain denoising methods [1][2][3]. In these algorithms, the signal is sparsely represented in the transform domain. Nevertheless, because natural images vary greatly, no fixed 2D transform can achieve good sparsity in all situations [4].
In recent decades, the most efficient approach to restoring video sequences has taken advantage of the potential similarity between grouped blocks [5]. Most algorithms in this field have been proposed for signal processing, especially for video denoising [6][7][8][9][10][11].
Dabov et al. [12] proposed a novel image denoising method based on an enhanced sparse representation in the transform domain, known as block-matching and 3D filtering (BM3D). To the best of our knowledge, BM3D is the most efficient image denoising algorithm. In this algorithm, mutually similar 2D image blocks are stacked into a 3D group. Then, the group is filtered by hard-thresholding and Wiener filtering, respectively. The same authors succeeded in applying the BM3D filtering scheme to video denoising, which is termed video block-matching and 3D filtering (VBM3D) [13]. In the VBM3D algorithm, a set of consecutive frames in the video sequence is used to construct the groups. The VBM3D algorithm is implemented as follows. Firstly, the groups are formed by predictive-search block-matching. Secondly, each group is filtered by 3D transform-domain shrinkage (i.e., hard-thresholding and Wiener filtering, respectively). Finally, the final estimate of the true video is computed by aggregating all the obtained local estimates.
Although the VBM3D method represents the state of the art in video denoising, it suffers from several drawbacks, two of which we explain as follows. First, in the first stage of the VBM3D filter, the hard-thresholding is incapable of distinguishing between the areas that contain more noise and those that contain less noise. As a result, a significant amount of the true signal in the less noisy areas is removed, which degrades the visual quality of the output video. Second, the block-matching in VBM3D occasionally searches outside the region that contains the reference block, which results in poor matching in the areas that are heavily contaminated by noise and leads to blurred edges [9].
The above-mentioned disadvantages of the VBM3D filter have been extensively studied in the literature (e.g., [7,11]). However, the challenge remains.
One of the most significant enhancements of the VBM3D filter is the one proposed by Maggioni et al. [5], known as video block-matching and 4D filtering (VBM4D). VBM4D groups mutually similar volumes rather than the blocks grouped in the VBM3D method. In the VBM4D filter, blocks tracked along the trajectories are used to construct 3D spatiotemporal volumes, and the mutually similar volumes are grouped together by stacking them along a fourth dimension. The authors of [5] obtained better results than those of VBM3D. However, the VBM4D method suffers from high computational cost.
In order to overcome the above-stated weaknesses of the VBM3D filter, we propose to replace the fixed hard-thresholding with adaptive thresholding (i.e., a combination of hard-thresholding and soft-thresholding). The hard-thresholding is applied to the areas that are heavily corrupted by noise, while the soft-thresholding is applied to the slightly noisy areas. Applying hard-thresholding to the heavily noisy areas plays a significant role in removing the noise effectively, while applying soft-thresholding to the slightly noisy areas assists in maintaining the edges. To avoid the second drawback of the VBM3D filter, we propose to replace the block-matching in the final estimate with K-means clustering. K-means clustering allows us to find the relevant patches more reliably than block-matching, which improves the grouping and therefore yields sharper edges.
The remainder of this paper is organized as follows. The proposed adaptive thresholding and clustering are described in Section 2. Experimental results are presented in Section 3. Some concluding remarks are given in Section 4.

Proposed Algorithm
Our proposed method consists of the following stages. In the first stage (basic estimate), for every reference block we group the matching blocks in the noisy video and apply the adaptive thresholding: hard-thresholding is applied to the areas that contain more noise and fewer image features, whereas soft-thresholding is applied to the areas that contain less noise and more image features. In the second stage, we apply K-means clustering only to the basic estimate to find the relevant patches and apply the Wiener filter to achieve further improvement in denoising.
In this paper, we consider $z(x, t) = y(x, t) + \eta(x, t)$ as the observed noisy video, where $y$ is the true video signal, $\eta(\cdot) \sim \mathcal{N}(0, \sigma^2)$ is independent and identically distributed white Gaussian noise, and $(x, t) \in X$ are the 3D spatiotemporal coordinates in the domain $X \subset \mathbb{Z}^3$, where the first component $x \in \mathbb{Z}^2$ represents the spatial coordinates, while the last component $t \in \mathbb{Z}$ represents the time index.
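As a minimal sketch of the observation model above, the following Python snippet synthesizes a noisy video by adding i.i.d. zero-mean Gaussian noise to a clean one. The array layout (frames, height, width), the function name `add_awgn`, the constant test clip, and the seed are illustrative assumptions and not part of the proposed method.

```python
import numpy as np

def add_awgn(clean, sigma, seed=0):
    """Return clean + i.i.d. N(0, sigma^2) noise; `clean` is a (T, H, W) float array."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=sigma, size=clean.shape)
    return clean + noise

# Hypothetical 8-frame, 64x64 constant-gray test clip (not one of the paper's sequences).
clean_video = np.full((8, 64, 64), 128.0)
noisy_video = add_awgn(clean_video, sigma=40.0)
```

The sample standard deviation of `noisy_video - clean_video` should be close to the requested `sigma`, which is a quick sanity check before running any denoiser on synthetic data.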

Grouping and 3D Transform.
Suppose $Z_{x_R}$ and $Z_x$ are the reference block and candidate block located at $x_R$ and $x$, respectively. The distance between $Z_{x_R}$ and $Z_x$ [12] is

$$d(Z_{x_R}, Z_x) = \frac{\left\| \Upsilon\left(\mathcal{T}_{2D}(Z_{x_R})\right) - \Upsilon\left(\mathcal{T}_{2D}(Z_x)\right) \right\|_2^2}{N^2}, \qquad (1)$$

where $\Upsilon$ is the adaptive thresholding, $\mathcal{T}_{2D}$ is the normalized 2D linear transform, $\|\cdot\|_2$ denotes the $\ell_2$-norm, and $N$ is the block size. By using the $d$-distance (1), we can find a group containing blocks $Z_x$ all of which are similar to a reference block $Z_{x_R}$:

$$S^{ath}_{x_R} = \left\{ x \in X : d(Z_{x_R}, Z_x) \leq \tau^{ath}_{match} \right\},$$

where $\tau^{ath}_{match}$ is the maximum $d$-distance for two similar blocks.
In the next step, block-matching searches for every block whose distance to the reference block is less than the predetermined threshold $\tau^{ath}_{match}$.
As in [13], we use predictive-search block-matching within a search range $[-N_s, N_s]$.
Under the above-mentioned conditions, the group is formed by stacking the reference block $Z_{x_R}$ and its candidate blocks. Afterwards, we apply the 3D transform to each group.
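The grouping step can be illustrated with a simplified sketch. It searches a single frame exhaustively inside a square window, computes the normalized squared distance directly on pixel values, and stacks the matches into a 3D group. The pre-thresholding in the transform domain and the predictive search across frames are deliberately omitted, and all names (`block_distance`, `group_similar_blocks`, `tau_match`, `search_radius`) are hypothetical.

```python
import numpy as np

def block_distance(ref_block, cand_block):
    """Squared l2 distance between two N x N blocks, normalized by N^2."""
    n = ref_block.shape[0]
    return float(np.sum((ref_block - cand_block) ** 2)) / (n * n)

def group_similar_blocks(frame, ref_xy, block_size, search_radius, tau_match):
    """Return top-left coordinates of blocks whose distance to the reference is <= tau_match."""
    r0, c0 = ref_xy
    ref = frame[r0:r0 + block_size, c0:c0 + block_size]
    rows, cols = frame.shape
    matches = []
    for r in range(max(0, r0 - search_radius), min(rows - block_size, r0 + search_radius) + 1):
        for c in range(max(0, c0 - search_radius), min(cols - block_size, c0 + search_radius) + 1):
            cand = frame[r:r + block_size, c:c + block_size]
            d = block_distance(ref, cand)
            if d <= tau_match:
                matches.append((d, (r, c)))
    matches.sort(key=lambda m: m[0])  # most similar blocks first
    return [xy for _, xy in matches]

# Toy frame with a vertical step edge; the reference block lies in the flat region.
frame = np.zeros((16, 16))
frame[:, 8:] = 100.0
coords = group_similar_blocks(frame, ref_xy=(4, 4), block_size=4, search_radius=3, tau_match=1.0)
stack = np.stack([frame[r:r + 4, c:c + 4] for r, c in coords])  # 3D group of matched blocks
```

In this toy example, every candidate that overlaps the edge is rejected, so the group contains only flat blocks from the same side of the edge as the reference.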
In the 3D transform, each 2D block is first transformed by a 2D transform, and the resulting coefficients are then transformed by a 1D transform along the third dimension of the group [14]. The transform can attain an extremely sparse representation of the true signal in the group, and therefore it becomes easy to separate the noise by shrinkage.
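The separable structure of the 3D transform can be sketched as follows. The choice of an orthonormal DCT for both the 2D and the 1D stage is an assumption made only for illustration (the text does not fix the transforms at this point), and the function names are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def forward_3d(group):
    """2D transform of every block, then a 1D transform along the stacking axis."""
    depth, n, _ = group.shape
    d2, d1 = dct_matrix(n), dct_matrix(depth)
    spec_2d = d2 @ group @ d2.T                      # applied to each block via broadcasting
    return np.tensordot(d1, spec_2d, axes=(1, 0))    # mixes coefficients across blocks

def inverse_3d(spec):
    """Inverse of forward_3d (the transforms are orthonormal, so transposes suffice)."""
    depth, n, _ = spec.shape
    d2, d1 = dct_matrix(n), dct_matrix(depth)
    spec_2d = np.tensordot(d1.T, spec, axes=(1, 0))
    return d2.T @ spec_2d @ d2

# A group of four identical blocks: all energy ends up in the first slice of the spectrum.
block = np.arange(64, dtype=float).reshape(8, 8)
group = np.repeat(block[None, :, :], 4, axis=0)
spectrum = forward_3d(group)
```

Because the stacked blocks are identical here, every coefficient outside the first slice along the third dimension vanishes, which is the sparsity that makes shrinkage effective on groups of similar blocks.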

Thresholding.
In this paper, we propose a novel adaptive thresholding. This thresholding is achieved by shrinkage in a 3D transform domain.
Before presenting the proposed adaptive thresholding, let us briefly describe hard-thresholding and soft-thresholding.

Hard-Thresholding.
In hard-thresholding, if the absolute value of an element is less than or equal to the threshold, the element is set to zero, while the element is retained unchanged if its absolute value is greater than the threshold.
Typically, the results of hard-thresholding are excessively smoothed. It is therefore inadequate for maintaining the edges of the image, and the denoised image usually suffers from blurred edges [10,15].

Soft-Thresholding.
In soft-thresholding, if the absolute value of an element is less than or equal to the threshold, the element is set to zero, while if the absolute value is greater than the threshold, the magnitude of the element is reduced by the threshold and its sign is kept.
The main idea of soft-thresholding is that the coefficients contain contributions from both the informative signal and the noise. Consequently, the retained coefficients are shrunk, which helps to limit the effects of noise.
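The two shrinkage rules described above can be written compactly as follows. This is a minimal sketch operating on an array of transform coefficients; the function names and the sample values are illustrative assumptions.

```python
import numpy as np

def hard_threshold(coeffs, thr):
    """Zero coefficients with |c| <= thr; keep the others unchanged."""
    return np.where(np.abs(coeffs) > thr, coeffs, 0.0)

def soft_threshold(coeffs, thr):
    """Zero coefficients with |c| <= thr; shrink the others toward zero by thr."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thr, 0.0)

coeffs = np.array([-5.0, 0.5, 2.0, 4.0])
kept_hard = hard_threshold(coeffs, 2.0)   # values: -5, 0, 0, 4
kept_soft = soft_threshold(coeffs, 2.0)   # values: -3, 0, 0, 2
```

Both rules zero the same coefficients; they differ only in that soft-thresholding also reduces the magnitude of the surviving coefficients by the threshold.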

Adaptive Thresholding Method.
In this paper, we propose to apply adaptive thresholding (ATH) rather than the conventional hard-thresholding adopted by the VBM3D filter. The proposed adaptive thresholding is applied as the initial denoising of the 3D-transformed group and is implemented as follows:

$$\Upsilon^{ath}(c) = \begin{cases} \Upsilon^{hard}(c), & \sigma \geq \lambda, \\ \Upsilon^{soft}(c), & \sigma < \lambda, \end{cases} \qquad (5)$$

where $c$ denotes a coefficient of the 3D-transformed group, $\Upsilon^{hard}$ and $\Upsilon^{soft}$ are the hard- and soft-thresholding operators, $\lambda$ is the optional test threshold, and $\sigma$ is the noise standard deviation.
From (5) we can expect the following:
(i) In the areas that contain more noise (i.e., $\sigma \geq \lambda$) and fewer image features (such as edges), the proposed thresholding acts as hard-thresholding, which significantly reduces the noise level in these areas.
(ii) In the areas that contain less noise (i.e., $\sigma < \lambda$) and more image features, the proposed thresholding acts as soft-thresholding, which in turn maintains the significant features of the image.
It is worth noting that the shrinkage in the VBM3D model is based on hard-thresholding alone, while the shrinkage in our model additionally employs soft-thresholding in the less noisy areas.
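The switching rule described in (i) and (ii) can be sketched as a single function. The shrinkage threshold `thr`, and how the noise level `sigma` of an area is obtained, are not specified in the text and are treated here as plain inputs; the function name and the sample values are hypothetical.

```python
import numpy as np

def adaptive_threshold(coeffs, thr, sigma, lam):
    """Hard-threshold when sigma >= lam (heavier noise), soft-threshold otherwise."""
    if sigma >= lam:
        # Heavier noise: keep large coefficients unchanged, zero the rest.
        return np.where(np.abs(coeffs) > thr, coeffs, 0.0)
    # Lighter noise: zero small coefficients and shrink the rest by thr.
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thr, 0.0)

coeffs = np.array([-6.0, 1.0, 3.0, 8.0])
heavy = adaptive_threshold(coeffs, thr=3.0, sigma=40.0, lam=25.0)  # hard branch
light = adaptive_threshold(coeffs, thr=3.0, sigma=10.0, lam=25.0)  # soft branch
```

With these inputs, the hard branch returns -6, 0, 0, 8 and the soft branch returns -3, 0, 0, 5.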

Inverse Transform and Aggregation.
After the noise reduction performed by the adaptive thresholding, we apply the inverse transform to obtain a 3D array of block-wise estimates:

$$\widehat{Y}^{ath}_{S^{ath}_{x_R}} = \mathcal{T}^{ath^{-1}}_{3D}\left( \Upsilon^{ath}\left( \mathcal{T}^{ath}_{3D}\left( Z_{S^{ath}_{x_R}} \right) \right) \right),$$

where $\mathcal{T}^{ath}_{3D}$ is the 3D transform, $\Upsilon^{ath}$ is the adaptive thresholding, and $Z_{S^{ath}_{x_R}}$ is the group of noisy blocks matched to the reference block. The block-wise estimates $\widehat{Y}^{ath}_{S^{ath}_{x_R}}$ (basic estimate) constitute a highly redundant representation of the video because the obtained block estimates overlap. Consequently, there are several estimates for each pixel. We therefore aggregate these estimates to form an estimate of the whole video.
The basic estimate of the true video is computed by a weighted average of all the overlapping block-wise estimates. The weights are inversely proportional to the total sample variance of the corresponding block-wise estimates [12].
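The aggregation step can be sketched as a weighted average accumulated in two buffers, one for the weighted sum and one for the sum of weights. The data layout (a list of `(top_left, block, weight)` tuples for a single frame) and the function name are assumptions for illustration; the weights are taken as given rather than derived from the shrinkage.

```python
import numpy as np

def aggregate(frame_shape, estimates):
    """Weighted average of overlapping block estimates.

    `estimates` is a list of ((row, col), block, weight) tuples.
    Pixels covered by no block are left at zero.
    """
    numerator = np.zeros(frame_shape)
    denominator = np.zeros(frame_shape)
    for (r, c), block, weight in estimates:
        h, w = block.shape
        numerator[r:r + h, c:c + w] += weight * block
        denominator[r:r + h, c:c + w] += weight
    out = np.zeros(frame_shape)
    np.divide(numerator, denominator, out=out, where=denominator > 0)
    return out

# Two overlapping 2x2 estimates on a 2x3 frame; the second is trusted three times more.
estimates = [((0, 0), np.full((2, 2), 10.0), 1.0),
             ((0, 1), np.full((2, 2), 20.0), 3.0)]
frame_estimate = aggregate((2, 3), estimates)
```

In the overlapping column the result is (1·10 + 3·20)/4 = 17.5, while the columns covered by a single block keep that block's value.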

Clustering.
Block-matching is a supervised approach. In this approach, the noisy video is processed in a block-wise manner, where the matching blocks are grouped for every reference block.
In the block-matching approach, it is hard to define the threshold that determines how similar a block must be to the reference block to be accepted; therefore, this approach has a high computational cost.
In contrast to the block-matching approach adopted by VBM3D, clustering partitions the image into disjoint areas. As a result, we can obtain similar patches in an unsupervised manner [16].
The threshold in block-matching is predetermined, while the threshold in clustering is adaptively determined by comparing the proximities of the reference patch to the different cluster centers. For this reason, the clustering approach is better able to find the relevant patches than the block-matching approach.
In this paper, we adopt K-means clustering [17] in the final estimate, rather than block-matching.
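The basic idea of the K-means clustering algorithm is summarized as follows.
Step 1. Choose K initial cluster centers.
Step 2. Calculate the distance between each cluster center and each object.
Step 3. Assign each object to the most similar (nearest) cluster.
Step 4. After assigning all the objects, update the center of each cluster as the average of its members.
Steps 2-4 are repeated until the cluster assignments no longer change.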
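The steps listed above can be sketched as a plain NumPy implementation of K-means on flattened patches. Initializing the centers from randomly chosen patches, the iteration cap `n_iter`, the seed, and the function names are assumptions for illustration, not choices stated in the text.

```python
import numpy as np

def assign(points, centers):
    """Index of the nearest center (Euclidean distance) for every point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def kmeans(points, k, n_iter=50, seed=0):
    """Basic K-means: returns (labels, centers) for an (n_points, n_features) array."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:          # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                         # centers stopped moving
        centers = new_centers
    return assign(points, centers), centers

# Eight flattened 2x2 patches: four dark ones and four bright ones.
patches = np.array([[v] * 4 for v in (0, 1, 2, 3, 100, 101, 102, 103)], dtype=float)
labels, centers = kmeans(patches, k=2)
```

On this toy data the dark and bright patches end up in separate clusters, so a reference patch would only be grouped with patches from its own cluster.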
In the proposed method, K-means clustering is carried out only on the basic estimate $\widehat{Y}^{ath}$, in place of the block-matching that [13] conducts on both the basic estimate and the noisy video.
Applying block-matching to the basic estimate usually leads to blurred edges, as can be seen in Figures 5 and 6. Thus, unlike VBM3D [13], the proposed method does not apply block-matching to the basic estimate but uses the efficient K-means clustering method to partition it.
Carrying out K-means clustering on the basic estimate rather than on the noisy video improves the grouping.
The main reason for applying the clustering to the basic estimate, instead of the input noisy video, is the poor accuracy of edge detection in the noisy video. For this reason, the clustering is performed only in the Wiener filtering stage, where a denoised basic estimate of the video is available.

Wiener Filtering.
After the basic shrinkage (adaptive thresholding) of the transform coefficients, we apply empirical Wiener filtering to improve the shrinkage, which in turn further attenuates the noise.
The empirical Wiener shrinkage coefficients are computed from the energy of the 3D transform coefficients of the initial estimate group $\widehat{Y}^{ath}$:

$$W = \frac{\left| \mathcal{T}^{wie}_{3D}\left( \widehat{Y}^{ath} \right) \right|^2}{\left| \mathcal{T}^{wie}_{3D}\left( \widehat{Y}^{ath} \right) \right|^2 + \sigma^2},$$

where $\sigma$ is the noise standard deviation. Wiener filtering in the transform domain $\mathcal{T}^{wie}_{3D}$ is much more efficient and precise than the adaptive thresholding of the 3D spectrum of the noisy video.
The general final estimate $\widehat{Y}^{wie}$ is then computed by aggregating the obtained Wiener-filtered block-wise estimates.
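The empirical Wiener shrinkage can be sketched as follows. The sketch assumes, as in BM3D [12], that the coefficients are computed from the spectrum of the basic-estimate group and then multiplied element-wise with the spectrum of the corresponding noisy group; the function names and sample values are hypothetical.

```python
import numpy as np

def wiener_coefficients(basic_spectrum, sigma):
    """Empirical Wiener weights: signal energy / (signal energy + noise variance)."""
    energy = np.abs(basic_spectrum) ** 2
    return energy / (energy + sigma ** 2)

def wiener_shrink(noisy_spectrum, basic_spectrum, sigma):
    """Attenuate noisy coefficients using weights from the basic estimate."""
    return wiener_coefficients(basic_spectrum, sigma) * noisy_spectrum

basic_spectrum = np.array([0.0, 40.0, 400.0])   # coefficients of the basic estimate
noisy_spectrum = np.array([10.0, 50.0, 390.0])  # coefficients of the noisy group
weights = wiener_coefficients(basic_spectrum, sigma=40.0)
filtered = wiener_shrink(noisy_spectrum, basic_spectrum, sigma=40.0)
```

With `sigma = 40`, a coefficient whose estimated magnitude equals the noise level is halved, one with no estimated energy is removed entirely, and a strong coefficient passes almost unchanged.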

Experimental Results
In this paper, we compare the performance of the proposed algorithm with that of the algorithms in [13,18,19] in terms of visual quality, Peak Signal-to-Noise Ratio (PSNR), and Image Enhancement Factor (IEF).
As mentioned in Section 2, the capacity of the VBM3D method is limited for highly noise-polluted video. Thus, in this section we focus on heavily noisy video sequences in order to highlight the role of the proposed algorithm in improving the denoising performance in this case.
The video sequences Miss America, Salesman, Tennis, and Bus are used to verify the efficiency of the new video denoising algorithm. These video sequences have been corrupted by additive white Gaussian noise (AWGN).
To evaluate the performance of the proposed algorithm, we use visual quality as a qualitative measure and PSNR and IEF as quantitative measures.
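The two quantitative measures are not defined explicitly in the text; the sketch below uses their commonly adopted definitions, which is an assumption: PSNR as 10·log10(peak² / MSE) with an 8-bit peak of 255, and IEF as the ratio of the squared error of the noisy input to the squared error of the denoised output. Function names and test values are illustrative.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB (infinite for identical inputs)."""
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

def ief(reference, noisy, denoised):
    """Image Enhancement Factor: input error energy divided by output error energy."""
    return np.sum((noisy - reference) ** 2) / np.sum((denoised - reference) ** 2)

reference = np.zeros((4, 4))
noisy = np.full((4, 4), 10.0)      # error of 10 at every pixel
denoised = np.full((4, 4), 5.0)    # error of 5 at every pixel
psnr_value = psnr(reference, denoised)
ief_value = ief(reference, noisy, denoised)
```

Under these definitions, halving the per-pixel error multiplies the IEF by four, and an IEF above 1 means the denoised frame is closer to the reference than the noisy input was.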
The proposed algorithm achieves higher performance in both noise removal and edge preservation than the algorithms in [13,18,19]. The experimental results are shown in Figures 1-4. These figures provide a visual comparison of the performance of the four algorithms applied to the test sequences Miss America, Salesman, Tennis, and Bus. These sequences are degraded by AWGN with standard deviation $\sigma = 40$. From these figures, we can see that the proposed method suppresses the noise and preserves sharp edges simultaneously better than the other methods. Figures 5 and 6 offer a zoomed visual comparison of the performance of the four models.
As a qualitative measurement, the proposed model is the best in terms of edge preservation. This is clearly visible in the edges of the books in the Salesman sequence as well as in the lock of hair in the Miss America sequence in Figures 5 and 6.
Applying the clustering approach in our proposed algorithm instead of the block-matching approach has improved the performance of our algorithm, which in turn has led to sharper edges, as shown in Figures 5 and 6.
Figures 7-14 present the PSNR and IEF values of the four models for different noise standard deviations. These figures show that the proposed algorithm outperforms all of the reference algorithms in terms of PSNR and IEF.
Table 1 lists the execution time of the VBM3D and the proposed algorithms for all noise levels. From this table, we can observe that the proposed algorithm runs faster than the VBM3D algorithm; in other words, the proposed algorithm is more economical in computation time.
Tables 2-9 compare the denoising performance of the four algorithms. In these tables, common video sequences have been contaminated by AWGN with standard deviation $\sigma$ increasing from 10 to 100. From Tables 2-5, we can see that the PSNR values decrease as the noise standard deviation increases, which means that the denoising effect becomes worse. However, the PSNR values of the proposed model are the highest among the four models, which implies that the denoising effect of the new model is the best.
The data in Tables 2-9 demonstrate that the proposed algorithm is superior to the VBM3D algorithm in all the experiments. From these tables we can also observe that our proposed algorithm improves on the PSNR and IEF of VBM3D by up to 1.67 dB and 28.68, respectively.

Conclusion
This paper presented a novel model for video denoising based on adaptive thresholding and clustering. The proposed model adapts to the amount of noise. Consequently, this model is able to attenuate the noise in heavily noisy video sequences effectively. In the proposed method, we applied adaptive thresholding instead of hard-thresholding, which proved unable to distinguish between heavily noisy areas and slightly noisy areas. Accordingly, the proposed algorithm succeeded in maintaining the details (e.g., edges) and removing the noise strongly in comparison with the other reference methods. Applying the clustering approach to the basic estimate in our algorithm, rather than the block-matching approach of the VBM3D algorithm, allowed us to obtain sharper edges than those of the VBM3D method. Numerical experiments with four different video sequences and various levels of white Gaussian noise showed that our proposed model achieved a higher noise removal gain than the reference methods, including a PSNR gain of up to 1.67 dB over the VBM3D algorithm. Experimental results emphasize the superiority of the proposed algorithm in terms of visual quality, with obvious improvement in PSNR and IEF. As shown in Table 1, the execution time of the proposed algorithm is less than that of the VBM3D algorithm.

Figure 1: Visual comparison of different algorithms for frame 5 of the Miss America sequence. From left to right and from top to bottom: original frame, noisy frame ($\sigma = 40$), result of the algorithm in [18], result of the algorithm in [19], result of the algorithm in [13], and result of the proposed algorithm.

Figure 2: Visual comparison of different algorithms for frame 5 of the Salesman sequence. From left to right and from top to bottom: original frame, noisy frame ($\sigma = 40$), result of the algorithm in [18], result of the algorithm in [19], result of the algorithm in [13], and result of the proposed algorithm.

Figure 3: Visual comparison of different algorithms for frame 5 of the Tennis sequence. From left to right and from top to bottom: original frame, noisy frame ($\sigma = 40$), result of the algorithm in [18], result of the algorithm in [19], result of the algorithm in [13], and result of the proposed algorithm.

Figure 4: Visual comparison of different algorithms for frame 5 of the Bus sequence. From left to right and from top to bottom: original frame, noisy frame ($\sigma = 40$), result of the algorithm in [18], result of the algorithm in [19], result of the algorithm in [13], and result of the proposed algorithm.

Figure 5: Zoom of the noisy Miss America sequence ($\sigma = 40$). From left to right and from top to bottom: result of the algorithm in [18], result of the algorithm in [19], result of the algorithm in [13], and result of the proposed algorithm.

Figure 6: Zoom of the noisy Salesman sequence ($\sigma = 40$). From left to right and from top to bottom: result of the algorithm in [18], result of the algorithm in [19], result of the algorithm in [13], and result of the proposed algorithm.

Figure 11: IEF graph of the algorithms in [13,18,19] and the new algorithm for various Gaussian noise levels for the Miss America video sequence.

Table 1: Computational cost of the VBM3D and the proposed algorithms for the Bus video sequence.

Table 2: PSNR of different algorithms with different Gaussian noise levels for the Miss America video sequence.

Table 3: PSNR of different algorithms with different Gaussian noise levels for the Salesman video sequence.

Table 4: PSNR of different algorithms with different Gaussian noise levels for the Tennis video sequence.

Table 5: PSNR of different algorithms with different Gaussian noise levels for the Bus video sequence.

Table 6: IEF of different algorithms with different Gaussian noise levels for the Miss America video sequence.

Table 7: IEF of different algorithms with different Gaussian noise levels for the Salesman video sequence.

Table 8: IEF of different algorithms with different Gaussian noise levels for the Tennis video sequence.

Table 9: IEF of different algorithms with different Gaussian noise levels for the Bus video sequence.