Dictionary-Based Image Denoising by Fused-Lasso Atom Selection

We propose an efficient image denoising scheme based on the fused lasso with dictionary learning. The scheme makes two main contributions. First, we learn a patch-based adaptive dictionary by principal component analysis (PCA) after clustering the image patches into many subsets, which better preserves the local geometric structure. Second, we code the patches in each subset by the fused lasso under the dictionary learned for that cluster and propose an iterative Split Bregman method to solve it rapidly. We demonstrate the capabilities of the scheme with several experiments. The results show that the proposed scheme is competitive with several well-regarded denoising algorithms.


Introduction
As an essential low-level image processing procedure, image denoising has been studied extensively and is a classical inverse problem. The general observation with additive noise is modeled as

$$y = x + n, \tag{1}$$

where $y$ is the noisy observation and $x$ and $n$ denote the original image and white Gaussian noise, respectively. Based on the degradation model in (1), many denoising algorithms have been proposed during the past decades, from the earlier spatial and frequency domain filters [1,2] to the more recently developed wavelet and beyond-wavelet shrinkage methods [3][4][5]. Because wavelets cannot adequately represent the local structure in images, an effective dictionary-learning algorithm called KSVD was proposed in [6] and achieved good results in the image denoising task. In this method, a highly overcomplete patch-based dictionary is learned first, and denoising is then implemented by solving the sparse coding of each patch, under the assumption that the patches are sparsely representable under the learned dictionary. Foi et al. proposed the pointwise shape-adaptive DCT, applied to each patch and its neighborhoods, which achieved a very sparse representation of the noisy image and effective denoising results [7]. To lower the complexity, an orthogonal dictionary learning method was proposed in [8], which trains a global dictionary by randomly collecting samples from the degraded image. Though it achieves good performance in image restoration, progress can still be made in learning dictionaries that represent the patches more accurately. To better represent the image, a two-stage denoising method with PCA was proposed in [9], which trains the dictionary from the local neighboring patches.
More recently, a set of approaches based on nonlocal (NL) techniques has been used for removing noise. The idea of NL can be traced to [10], where similar pixels are searched for and averaged with weights to filter each pixel. But the weight in NL is determined only by the intensity distance between patches, so it cannot guarantee finding patches with similar local geometric structure under strong noise. Zhang et al. proposed a novel nonlocal means method with application to MRI denoising [11]. To make full use of nonlocal similarity, Mairal et al. proposed the nonlocal sparse model for image restoration [12]. Also motivated by NL, a collaborative denoising scheme called BM3D was proposed by Dabov et al. [13], in which patch matching is used to search for similar patches, which are then grouped into a 3D array. The algorithm then applies a 3D sparse transform, such as the 3D wavelet or 3D curvelet, to the array and removes the noise with Wiener filtering in the transform domain. Another effective method, proposed by Takeda et al. [14], addresses the denoising problem under the kernel regression (KR) framework. Many classical spatial denoising algorithms, such as bilateral filtering [15] and nonlocal means [16,17], can be seen as special cases of KR with different constraints.
In this paper, we propose a novel scheme for image denoising based on clustering-based dictionary learning. Firstly, we cluster the patches with similar geometric structure by taking a weight function as the feature. Secondly, we learn a patch-based dictionary by principal component analysis for each cluster. Lastly, we code these patches by the fused lasso and develop an iterative Split Bregman method to solve it rapidly.
The rest of the paper is organized as follows. In Section 2, we briefly review the kernel regression framework, since we use the weight function in KR as the feature for clustering. In Section 3, we discuss how to learn a suitable dictionary that better describes the patches in each cluster. In Section 4, an iterative Split Bregman method is proposed to solve the fused lasso, which is used to rapidly code the patches under the dictionary learned in Section 3. Section 5 presents several experimental results compared with current leading algorithms, and Section 6 concludes the paper with a summary.

Clustering with Weight Function in KR
Kernel regression is well studied in statistical signal processing. Recently, KR has been used to address many image restoration tasks, such as denoising, interpolation, and deblurring [18]. The kernel regression expression is given in (2), where $P$ is the number of pixels in the neighborhood, $y_i$ denotes the $i$th pixel in the image, whose location is denoted by $x_i$, $K(\cdot)$ is the local kernel that measures the distance between the center pixel and its neighbors, and $h$ is the smoothing parameter that controls the penalization strength. From (2), the key technique in KR is how to determine an effective form of the kernel function, which was studied in [19,20]. Among these methods, the steering kernel is distinguished by producing local regression weights in which the local gradient is taken into consideration to analyze the similarity between the pixels in a neighborhood. The weights in the steering kernel are expressed in (3), where $w_{ij}$ denotes the structure similarity between the $i$th center pixel and the $j$th pixel in the neighborhood, and $C_j$ is the covariance matrix computed from the gradient at the $j$th pixel. Furthermore, the whole kernel consists of all the weights in the neighborhood.
As introduced in [14], $w_{ij}$ captures the underlying local structure of the patch centered at the $i$th pixel. In addition, Takeda et al. pointed out that patches at different locations with different intensities but similar underlying structure still produce similar kernels. Generally, clustering is implemented with the Euclidean measurement of intensity, as in the NLM denoising algorithm. But, in contrast to the regression algorithm, what we want here is to learn a dictionary that describes patches with similar geometric structure; that is, we do not require them to have similar intensity as well. So we can take $w_{ij}$ as the feature to measure the structure similarity among the patches. The significant distinction between the general Euclidean measurement and the KR weight function is that the latter groups patches with similar structure rather than patches with similar intensity. To this end, we take the weights $w_i = (w_{i1}, w_{i2}, \ldots, w_{iP})$ formed from the steering kernel as the feature, which is advantageous for learning a clustering dictionary that better describes the local structure in each cluster. Also, the $\ell_1$ norm can be used to measure the distance between the features, which exhibit anisotropy. Next, we discuss how to obtain the weights of each patch.
Conveniently, the matrix $C_i$ is divided into three components [14] and can be reformulated as in (4). By (4), two variables remain to be determined. To this end, we calculate the local gradient matrix $G_i$ of the $i$th pixel as in (5), where $z_h(\cdot)$ and $z_v(\cdot)$ denote the horizontal and vertical gradient operators, respectively, $\mathcal{W}_i$ is an analysis window around the $i$th pixel, and $M$ is the number of pixels in the window. With the singular value decomposition (SVD) of $G_i$, we obtain (6). With $S_i = \mathrm{diag}\{s_1, s_2\}$ from this decomposition, we calculate the parameters in (4) as in (7), where $\lambda$ is the regularization parameter, which prevents the ratio from being degenerate, and a second constant keeps the corresponding parameter from being zero. We summarize the calculation of the weights of each patch in Algorithm 1.
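The weight calculation described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' Algorithm 1: it assumes the steering-kernel form of Takeda et al. [14] (elongation and scaling taken from the singular values of the local gradient matrix, direction from its right singular vectors). The function names, the use of `np.gradient` as the gradient operator, the neighborhood radius, and the constant `eps` are hypothetical choices.

```python
import numpy as np

def steering_covariance(gh, gv, lam=1.0, eps=0.01):
    """Assumed steering covariance (after [14]) from the gradients in one window."""
    G = np.column_stack([gh.ravel(), gv.ravel()])      # M x 2 local gradient matrix
    M = G.shape[0]
    _, s, Vt = np.linalg.svd(G, full_matrices=False)   # s[0] >= s[1] >= 0
    sigma = (s[0] + lam) / (s[1] + lam)                # elongation, regularized ratio
    gamma = np.sqrt((s[0] * s[1] + eps) / M)           # scaling, kept away from zero
    v1, v2 = Vt[0], Vt[1]                              # dominant / minor gradient directions
    return gamma * (sigma * np.outer(v1, v1) + np.outer(v2, v2) / sigma)

def steering_weights(image, row, col, radius=2, half_window=4, h=2.4):
    """Weights between the pixel at (row, col) and its (2*radius+1)^2 neighbors."""
    gv, gh = np.gradient(np.asarray(image, dtype=float))  # axis 0 vertical, axis 1 horizontal
    pad = radius + half_window
    gh = np.pad(gh, pad, mode="symmetric")                # symmetric boundary extension
    gv = np.pad(gv, pad, mode="symmetric")
    weights = []
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r, c = row + dr + pad, col + dc + pad          # neighbor in padded coordinates
            window = (slice(r - half_window, r + half_window + 1),
                      slice(c - half_window, c + half_window + 1))
            C = steering_covariance(gh[window], gv[window])
            d = np.array([dc, dr], dtype=float)            # (horizontal, vertical) offset
            scale = np.sqrt(np.linalg.det(C)) / (2.0 * np.pi * h ** 2)
            weights.append(scale * np.exp(-(d @ C @ d) / (2.0 * h ** 2)))
    return np.array(weights)
```

With `half_window=4` the analysis window is 9 × 9. On an image containing a vertical edge, the weights computed this way decay more slowly along the edge than across it, which is the anisotropy that makes them usable as a structure feature.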
Then, we cluster the ordered overlapping patches into subsets $\Omega_k$ ($k$ is the cluster index) by $K$-Means, using the $\ell_1$ norm to measure the similarity between the samples and the cluster centers.
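A minimal sketch of this clustering step is given below, assuming each row of `features` holds the weight vector of one patch. The function name `kmeans_l1`, the random initialization, and the fixed iteration count are hypothetical. Samples are assigned with the $\ell_1$ distance while centers are updated with the mean, as in ordinary K-Means; a median update would be the exact $\ell_1$ minimizer and is a reasonable alternative.

```python
import numpy as np

def kmeans_l1(features, n_clusters, n_iter=20, seed=0):
    """K-Means with l1 assignment; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(features, dtype=float)
    # Assumption: initialize centers with randomly chosen distinct samples.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # l1 distance between every sample and every center
        dist = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members):                  # keep the old center if a cluster empties
                centers[k] = members.mean(axis=0)
    return labels, centers
```

The returned `labels` play the role of the cluster index that defines the subsets of patches.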

Adaptive PCA Dictionary Learning
Once the clusters are formed, we learn a dictionary by principal component analysis suited to each cluster independently. To this end, following the general dictionary learning formulation, we need to solve the following minimization:

$$\min_{D_k, A_k} \left\| \bar{Y}_k - D_k A_k \right\|_F^2, \tag{8}$$

where $\bar{Y}_k = \{\bar{y}_i, i \in \Omega_k\}$ is the centered sample matrix, which satisfies $\bar{Y}_k = Y_k - y^{(k)}$ (the mean is subtracted from every column). $Y_k = \{y_i, i \in \Omega_k\}$ is the sample matrix, whose columns are the patches in the $k$th cluster, $y^{(k)}$ is the mean vector of $Y_k$, $D_k$ and $A_k$ are the dictionary and the coefficient matrix of the $k$th cluster, and $\|\cdot\|_F$ denotes the Frobenius norm.
For the minimization in (8), alternating minimization is used to estimate the two variables; that is, we estimate one variable while the other is fixed. We rewrite (8) column by column as

$$\min_{D_k, \{a_i\}} \sum_{i \in \Omega_k} \left\| \bar{y}_i - D_k a_i \right\|_2^2, \tag{9}$$

where $a_i$ is the $i$th column of $A_k$. When the dictionary is fixed, $a_i$ is given by the least-squares solution

$$a_i = \left( D_k^T D_k \right)^{-1} D_k^T \bar{y}_i. \tag{10}$$

Substituting (10) into (9), we obtain

$$\min_{D_k} \sum_{i \in \Omega_k} \left\| \bar{y}_i - D_k \left( D_k^T D_k \right)^{-1} D_k^T \bar{y}_i \right\|_2^2. \tag{11}$$

The patches in the same cluster have similar structure, so the dictionary need not be highly redundant. To simplify problem (11), we add the constraint $D_k^T D_k = I$ to the dictionary, and (11) becomes

$$\min_{D_k} \sum_{i \in \Omega_k} \left\| \bar{y}_i - D_k D_k^T \bar{y}_i \right\|_2^2 \quad \text{s.t.} \quad D_k^T D_k = I. \tag{12}$$

The minimization problem in (12) can be approximated by finding the first $r^{(k)}$ principal components of the centered matrix $\bar{Y}_k$ within the PCA framework. The number $r^{(k)}$ is chosen by the rule in (13), in which $P$ is the number of pixels in each patch, $c$ is a constant, and $\sigma$ is the noise standard deviation; $s_i$ is the $i$th singular value of $\bar{Y}_k$, with $s_i \ge s_j \ge 0$ if $i < j$. The reason for using (13) to choose $r^{(k)}$ is that it discards the principal components that represent the variance arising from noise. The learning method above trains the dictionary with low complexity. In addition, to make the dictionary more effective and compact, the selection rule in (13) trades off between preserving the essential signal and reducing noise.
We summarize the dictionary learning with PCA in Algorithm 2.
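The per-cluster PCA step can be sketched as follows. This is an illustration rather than the authors' Algorithm 2: the names `learn_pca_dictionary` and `code_patches` are hypothetical, and the rule used to pick the number of components (keep the components whose per-sample variance exceeds `c * sigma**2`) is an assumed stand-in for the selection rule in (13).

```python
import numpy as np

def learn_pca_dictionary(Y, sigma, c=1.0):
    """Y: (P, N) matrix whose columns are the vectorized patches of one cluster."""
    mean = Y.mean(axis=1, keepdims=True)                 # mean vector of the cluster
    Yc = Y - mean                                        # centered sample matrix
    U, s, _ = np.linalg.svd(Yc, full_matrices=False)     # singular values, descending
    variances = s ** 2 / Y.shape[1]                      # variance along each component
    # Assumed selection rule: drop components whose variance is at noise level.
    r = max(1, int(np.sum(variances > c * sigma ** 2)))
    D = U[:, :r]                                         # orthonormal atoms, D.T @ D = I
    return D, mean

def code_patches(Y, D, mean):
    """Least-squares coefficients; with D.T @ D = I they reduce to D.T @ (Y - mean)."""
    return D.T @ (Y - mean)
```

Because the atoms are orthonormal, the least-squares coefficients reduce to a single matrix product, which is where the low complexity of this learning step comes from.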

Patch-Based Coding by Fused Lasso
With the preparation in Sections 2 and 3, in this section we study sparse coding with the fused lasso [21]. The reason for choosing the fused lasso for sparse coding is that it constrains not only the sparsity of the coefficients but also the differences between neighboring coefficients, which is advantageous for recovering the texture in images [22,23]. So we can recover the noisy image by minimizing the cost function of the fused lasso, which is expressed as

$$\min_{\alpha} \frac{1}{2} \left\| y - D\alpha \right\|_2^2 + \lambda_1 \left\| \alpha \right\|_1 + \lambda_2 \sum_{i=2}^{p} \left| \alpha_i - \alpha_{i-1} \right|, \tag{14}$$

where $y$ is the sample patch in the image denoising task, $D$ can be seen as the dictionary, $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_p)$ is the sparse coding of $y$, and $\lambda_1$ and $\lambda_2$ are regularization parameters.
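To make the two penalties concrete, the short sketch below evaluates a fused-lasso cost for a given code. It assumes the usual Lagrangian form with a factor of 1/2 on the data term; the function name and the parameter names `lam1` and `lam2` are illustrative.

```python
import numpy as np

def fused_lasso_cost(y, D, alpha, lam1, lam2):
    """Data fit + sparsity penalty + penalty on neighboring-coefficient differences."""
    residual = y - D @ alpha
    return (0.5 * residual @ residual              # assumed 1/2 scaling of the data term
            + lam1 * np.abs(alpha).sum()           # l1 norm of the coefficients
            + lam2 * np.abs(np.diff(alpha)).sum()) # sum of |alpha_i - alpha_{i-1}|
```

For example, with `D` the 3 × 3 identity, `y = alpha = [1, 1, 0]`, `lam1 = 0.5`, and `lam2 = 2`, the data term vanishes, the sparsity term contributes 1, and the single jump between the second and third coefficients contributes 2.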

Note that the minimization of the augmented Lagrangian function of (17) is expressed in (18). With the alternating direction method (ADM) [28], we can solve (18) with the iterative algorithm in (19). The initial idea for solving (14), presented in [21], is the coordinate descent method (CDM). But, in practice, CDM is slow, complex, and difficult to adapt to some specialized algorithms. In addition, it is not effective for some large-scale problems.
Furthermore, by the iterative Split Bregman method, we can divide (19) into three independent minimizations, whose convergence is guaranteed by [29]. The quadratic minimization in (22) is solved through a linear system in which $I$ denotes the identity matrix of the same size as $D^T D$.
We summarize the coding algorithm in Algorithm 3. With all the preparation above, we summarize the overall denoising algorithm in Algorithm 4.
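The coding step can be illustrated with a generic split Bregman (ADMM-style) iteration for the fused lasso. The sketch below is not the authors' Algorithm 3: it assumes the common splitting with auxiliary variables `a` (a copy of the coefficients) and `b` (their first differences), scaled multipliers `u` and `v`, and penalty parameters `mu1` and `mu2`, all of which are hypothetical names and default values.

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise shrinkage operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fused_lasso_split_bregman(y, D, lam1, lam2, mu1=1.0, mu2=1.0, n_iter=200):
    """Approximately minimize 0.5*||y - D a||^2 + lam1*||a||_1 + lam2*||L a||_1."""
    p = D.shape[1]
    L = np.diff(np.eye(p), axis=0)                     # (p-1) x p first-difference operator
    A = D.T @ D + mu1 * np.eye(p) + mu2 * L.T @ L      # fixed matrix of the quadratic step
    Dty = D.T @ y
    alpha = np.zeros(p)
    a, u = np.zeros(p), np.zeros(p)                    # split copy of alpha and its multiplier
    b, v = np.zeros(p - 1), np.zeros(p - 1)            # split copy of L @ alpha and multiplier
    for _ in range(n_iter):
        # Quadratic subproblem: one linear solve
        rhs = Dty + mu1 * (a - u) + mu2 * L.T @ (b - v)
        alpha = np.linalg.solve(A, rhs)
        # Two shrinkage subproblems
        a = soft_threshold(alpha + u, lam1 / mu1)
        b = soft_threshold(L @ alpha + v, lam2 / mu2)
        # Bregman (multiplier) updates
        u += alpha - a
        v += L @ alpha - b
    return a                                           # thresholded copy, exactly sparse
```

Each iteration costs one linear solve and two shrinkage operations. Since the system matrix does not change across iterations, it could be factorized once, which is one reason this kind of scheme tends to be faster in practice than coordinate descent.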

Experiment Results and Analysis
We conducted various experiments on image denoising to demonstrate the performance of the proposed algorithm. We degraded the images by adding artificial zero-mean Gaussian noise with different standard deviations. The test images are shown in Figure 1. All images in the experiments are of size 256 × 256, and the patch size is 8 × 8; that is, $P$ is 64 in Algorithm 1. Empirically, we set the parameters $\lambda = 1.0$ and $h = 2.4$ for all the test images in Algorithm 1. We fixed the window size in (5) to 9 × 9, because we found that a small size, such as 5 × 5, sometimes fails to capture the local geometric structure of the underlying image data. Of course, the window size can be tuned according to the concrete experimental requirements. In particular, we extend the image boundary with the "symmetric" type according to the window size. The number of clusters in $K$-Means is flexible: it is small for images with compact structure, such as House in Figure 1, and large for images with complex structure, such as Lena. According to the underlying image structure, in our experiments the number of clusters $K$ is chosen between 5 and 10. The maximum number of iterations in Algorithm 4 is 10. In addition, the regularization parameters are set in the algorithms themselves. We compared the proposed algorithm with several leading denoising approaches, including the FGTV method in [30], denoising with the dictionary learned by KSVD in [6] (DKSVD), BM3D with wavelet (BM3DW) in [13], the kernel regression method in [14], and the two-stage denoising method with PCA (TSPCA) in [9]. Due to limited space, we only show the results for Lena and Couple with noise standard deviation $\sigma = 20$ in Figures 2 and 3, respectively. Furthermore, the PSNR results of all the recovered images are reported in Figure 4.
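The experimental setup above can be reproduced in outline with a few helper functions. This is a sketch under assumptions: the function names are hypothetical, the PSNR uses the standard definition with a peak value of 255 (the text does not state the peak), and `np.pad` with `mode="symmetric"` is taken to correspond to the "symmetric" boundary extension.

```python
import numpy as np

def add_gaussian_noise(image, sigma, seed=0):
    """Degrade an image with zero-mean Gaussian noise of standard deviation sigma."""
    rng = np.random.default_rng(seed)
    return np.asarray(image, dtype=float) + rng.normal(0.0, sigma, size=np.shape(image))

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB (assumed peak of 255 for 8-bit images)."""
    diff = np.asarray(reference, dtype=float) - np.asarray(estimate, dtype=float)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def extend_symmetric(image, window=9):
    """Mirror the boundary by half the analysis window on every side."""
    return np.pad(image, window // 2, mode="symmetric")
```

For a 256 × 256 image and a 9 × 9 window, `extend_symmetric` returns a 264 × 264 array, so that every original pixel has a complete analysis window.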
From the denoising results, we note that the FGTV method has the worst visual quality among the compared methods. Because it only constrains the total variation and does not adapt to the local structure, it loses many details of the original image and also performs poorly in PSNR. The KR algorithm generates many mottled artifacts in the denoised image, but it does preserve some texture, such as the curtain in Couple, by capturing the local structure with the kernel function. However, its results decline rapidly with increasing noise standard deviation, both in visual quality and in PSNR. TSPCA produces a characteristic smoothness in the recovered image and is weak at representing textured regions. Another factor in its poor performance on texture is that the PCA dictionary is learned from neighboring patches, which gives weaker results than a nonlocal scheme. The PSNR values of BM3DW, DKSVD, and the proposed method are close to each other in Figure 4, although BM3DW obtains the highest PSNR among the methods. But BM3DW and DKSVD show more distortion in some textured regions than the proposed method. This is because, in BM3DW, the wavelet is not a good representation for all types of images, such as those with complex structures. Also, although KSVD learns a dictionary from the image itself to better represent its structures, it produces a universal dictionary, which may not be effective for certain local structures. Compared to BM3DW and KSVD, the proposed method showed the advantage to present the

Figure 4: PSNR results of test images.