Wavelet Domain Multidictionary Learning for Single Image Super-Resolution

Image super-resolution (SR) aims at recovering the high-frequency (HF) details of a high-resolution (HR) image according to the given low-resolution (LR) image and some priors about natural images. Learning the relationship of the LR image and its corresponding HF details to guide the reconstruction of the HR image is needed. In order to alleviate the uncertainty in HF detail prediction, the HR and LR images are usually decomposed into 4 subbands after 1-level discrete wavelet transformation (DWT), including an approximation subband and three detail subbands. From our observation, we found the approximation subbands of the HR image and the corresponding bicubic interpolated image are very similar but the respective detail subbands are different. Therefore, an algorithm to learn 4 coupled principal component analysis (PCA) dictionaries to describe the relationship between the approximation subband and the detail subbands is proposed in this paper. Comparisons with various state-of-the-art methods by experiments showed that our proposed algorithm is superior to some related works.


Introduction
Image super-resolution (SR) is a technology of generating a high-resolution (HR) image from one or more low-resolution (LR) images.SR has many applications such as criminal investigation and medical diagnosis.
SR approaches can be roughly classified into three categories: interpolation-based methods, reconstruction-based methods, and example-based methods.Interpolation-based methods [1][2][3][4], such as bicubic interpolation, are efficient, but they tend to lose many details of the images.Reconstructionbased methods [5][6][7] usually assume some degraded models and solve the inverse problems to obtain the HR images.These methods perform well when the magnification factor is small, but their performance depends heavily upon a rational prior imposed on the required images.Learningbased methods [8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24] assume that high-frequency (HF) details lost in an LR image can be recovered by learning the relationships between the LR-HR image patch-pairs.This kind of methods shows robust SR capability when a larger magnification ratio is required.In this paper, we focus on the learning-based method.
Recently, various typical learning-based SR methods have been proposed.Most of these methods realize SR by inferring the mapping relationships between the HR and LR image patch-pairs.For example, Freeman et al. [8] applied a belief propagation algorithm to train the Markov network which was utilized to model the relationships between HR and LR patch pairs.However, this method will introduce some unwanted artifacts.By introducing locally linear embedding (LLE) [25] from manifold learning, Chang et al. [9] proposed a neighbor-embedding (NE) algorithm.It assumes that the two manifolds of the LR image patches and the corresponding HR patches have similar local structures.However, the manifold assumption is not always true for the SR problem because there are one-to-many mappings existing between the LR-HR patches.Inspired by the fact that the image patches can be well-represented as a sparse linear combination of elements from an appropriately chosen overcomplete dictionary, Yang et al. [11] proposed a sparse-coding-based SR 2 Journal of Electrical and Computer Engineering by employing a relationship that the LR-HR image patchpairs share with the same sparse coefficients with respect to their own dictionaries.However, the dictionary training process is time-consuming.Zeyde et al. [12] introduced some modifications to Yang's [11] algorithm.They adopted a different approach to train the LR-HR dictionary pairs and generated training data sets with no access to an external set of images.Timofte et al. [15] accelerated the SR based on sparse representation by gathering the dictionary atoms together.Nazzal et al. [23,24] proposed two SR methods by training three or four HR-LR pairs of dictionaries in the wavelet domain, respectively.It is clear that these methods are time-consuming.Yang et al. [16] also proposed a fast regression model for practical SR based on in-place examples by leveraging two fundamental SR methods, learning from an external database and learning from self-examples (a patch in the upper scale image has good matches around its origin location in the lower scale image).This method has low time complexity but will introduce some unwanted artifacts.A Bayesian method constrained by a beta process prior is applied to learn the overcomplete dictionaries by He et al. [19].Chen and Qi [20] proposed an SR method based on the low-rank matrix recovery and neighbor embedding.Xu et al. [21] proposed a new coupled K-SVD dictionary training method for image super-resolution.It is a more efficient method of dictionary training and largely improves the performance of K-SVD.Xu et al. [22] also proposed another useful method for image super-resolution by mid-frequency sparse representation and total variation regularization.
The most important task of image SR is to recover the HF details of an HR image according to the LR image or its prior assumptions.However, the existing methods do not well exploit the information provided by input image for this task.We use 1-level discrete wavelet transform (DWT) to decompose a given image into 4 subbands, including an approximation subband and three detail subbands called horizontal, vertical, and diagonal detail subbands.
There are two reasons to do this decomposition: (1) Most of the sparse representation based algorithms are working on local patches.DWT can well extract the local spatial and frequency information of the images.(2) The relationship between these subbands can be learned from the natural images.The relationship can be utilized in the reconstruction.That means the approximation subband of the bicubic interpolated image can be served as the approximation subband of the resultant image.The detail subbands can be reconstructed by the relationship recorded by the coupled dictionaries.Therefore, we introduce the sparse coding method to train the relationships of the subbands.
Though the universal dictionaries learned from example image patches by using algorithms such as K-SVD [26] are better adapted to local image structures, the K-SVD gets reliable learning only when the number of the data samples is much larger than the dimension of the dictionary.Therefore, the process of dictionary-learning by K-SVD is timeconsuming.In this paper, we train 4 principal component analysis (PCA) dictionaries which can well represent 4 subbands, respectively.Every dictionary is composed of a set of compact subdictionaries which is learned from high-quality training patches.The training patches are clustered into many clusters.Since each cluster consists of many patches with similar patterns, a compact subdictionary can be learned for each cluster.Thus, the PCA subdictionary-learning method does not need a large number of training patches.The atoms obtained by PCA are orthogonal to each other.These relevant atoms will not only decrease the computational cost in sparse coding but also will improve the representation accuracy.The initial reconstruction image of our method and its corresponding subbands can be seen in Figure 1(c).From this figure, we can see that many textures in the detail subbands are reconstructed by sparse representation technology.
The rest of this paper is organized as follows.Section 2 briefly reviews the related work of our algorithm.Section 3 describes the details of the proposed wavelet domain SR method.The experimental parameter settings and the results are given in Section 4. The last section concludes our paper.

2D Discrete Wavelet Transform (DWT).
When digital images are to be viewed or proposed at multiple resolutions, the DWT is the mathematical tool of choice.As an efficient, highly intuitive framework for the representation and storage of multiresolution images, the DWT provides powerful insight into an image's spatial and frequency characteristics.The Fourier transform, however, reveals only an image's frequency attributes.
1-level DWT of a gray-scale image can be presented as follows: where (, ) is the size of image I(, ).(, ) is the scale function.w(, ) is the approximation subimage of I(, ) and (, ) is the position of each pixel.Consider where   (, ) and  ∈ {, , } are the 2-dimensional wavelets and the superscript  is a set of directional variables named horizontal, vertical, and diagonal directions.w  (, ) represents detail subimages corresponding to different directional variables.
where  is a function which can describe the relationship of the LR-HR image patch pairs.The goal of SR is to recover x from y.Because SR remains extremely ill-posed, a sparse prior is used in order to regularize it in this paper.By jointly training two dictionaries for the LR and HR image patches, we enforce the similarity of the sparse representations between the LR and HR image patch pairs with respect to their own dictionaries.Thus, for each input LR patch y  , we can find a sparse representation with respect to LR dictionary D  .The corresponding HR patch x  can be reconstructed by the HR dictionary D ℎ and the sparse representation   .The joint sparse coding model can be formulated as [27]: where   represents the th sparse coefficient vector and ‖ ⋅ ‖ 2 is the  2 -norm of a vector and  is a Lagrange parameter balancing the sparsity penalty and representation fidelity.

Proposed Algorithm
In this section, we describe the specific processes of our proposed algorithm; and the flowchart of our proposed algorithm is shown in Figure 2.

Training Four Subband Dictionaries.
According to the aforementioned content, the goal of the training phase is to learn 4 dictionaries corresponding to the subimages in the wavelet domain.Because the human visual system is more sensitive to the luminance component than the chrominance components, we use the YCbCr color space for color images and only perform SR reconstruction in the luminance component.
Firstly, for an HR luminance image X  (, ), we decompose it into four subimages named approximation subband w  and three detail subbands w   ,  ∈ {, , } by 1-level DWT as follows: Secondly, we extract the gradient features of w  and divide the feature images into patches and reshape them into the columns.For each detail subband, we directly partition it into patches.Now we obtain four patch matrices, W  , W   , W   , and W   , by stacking the corresponding columns.One important issue of sparsity-based SR is the selection of the dictionary.In this paper, we train 4 principal component analysis (PCA) dictionaries which can represent 4 subbands, respectively.Comparing with the dictionaries generated by the famous algorithm K-SVD [26], the PCA dictionary-learning method does not need a large number of training patches, and the atoms of the dictionary are orthogonal to each other.The small dictionary size saves the memory and time costs.In addition, since a given patch can be better represented by the adaptively selected subdictionary, the whole image can be more accurately reconstructed than using a universal dictionary.The same idea has been validated by Dong et al. [13].So we classify the training patches into  clusters and learn a principal component analysis (PCA) subdictionary for each cluster.These  PCA subdictionaries construct a large overcomplete dictionary to characterize all the possible local structures of natural images.
That means for the th cluster of the training patches W   , W    ( = 1, . . ., ),  ∈ {, , }, the joint sparse coding model which is used to train the PCA subdictionaries is as follows: min The solution to (7) is given by the following pseudoinverse expression: Thus, we can train 4 subdictionaries corresponding to 4 subimages by ( 6) and (8).

Test Phase.
The test phase attempts to magnify an LR test image Y. From Figure 1, we assume that the approximation subimages of the desired HR image and the bicubic  interpolated image are very similar.We magnify the test LR image by a factor of 2 and transform the bicubic interpolated image by 1-level DWT to get the four subimages.Then the approximation subband of it can be simply served as the approximation subband of the required HR image.The goal of this phase is to reconstruct the other three detail subbands from the approximation subband.
Firstly, we need to extract the gradient features of the approximation subimage.We partition the feature images into patches and reshape each of them as column of the matrix W  = {y  }  =1 , where  is the number of patches.Secondly, for each patch to be coded, we adaptively select one subdictionary from the trained  approximation subdictionaries to code it.This actually enforces the coding coefficients of this patch over the other subdictionaries to be 0, leading to a very sparse representation of the given patch.So the sparse representation   of each patch y  with respect to the learned approximation subdictionary D   is the solution to the problem: Thirdly, the detail subband patches corresponding to can be computed by the following equation: Three subimages can be produced by merging the corresponding patches.For the overlapping regions between those adjacent patches, averaging fusion is applied to obtain the estimated pixels.
In the end, the initial HR image X 0 will be obtained by inverse DWT of the approximation subband and the received 3 detail subbands.

Postprocessing Phase.
The initial output HR image always loses some details in the input LR image and contains some artifacts.In order to make the image more suitable for human visual system, we need to fit the reconstructed HR image to the input LR image.Because the recently introduced nonlocal redundancies in image are useful for image denoising and restoration [28], we can incorporate the nonlocal selfsimilarities in our postprocessing phase.Therefore, a global constraint such as Iterative Back-Projection (IBP) [29] and a nonlocal similarity constraint are employed to further improve the reconstruction accuracy as follows [19]: where x  and x are patches in X and X 0 , respectively.x  ( = 1, . . ., ) is the th most similar patch to x and   is the nonlocal weight defined in [28]. is the permitted error of the truth patch and the weighted evaluated patches.D and B are downsampling and blurring operations, respectively.
The summary of the training phase can be found in Algorithm 1.

Input:
A group of training images;
In addition, the summary of the test phase is listed in Algorithm 2 as follows.
Algorithm 2 (the test phase of the wavelet domain super-resolution).

Input:
Test image Y; Trained PCA subdictionaries: where C(:, ) is the th column of the matrix C; (ii) Calculate the sparse representation   for each y  by ( 9) and the corresponding patches of the detail subimages by ( 10); (3) Generate the detail subimages by sequentially merging all patches of each detail subimage, respectively; (4) Get the initial HR image X 0 by inverse DWT of the Y 1 and the received three detail subimages; (5) Perform the nonlocal self-similarity and IBP on X 0 to obtain the final HR image X * by (11).

Experimental Results and Analysis
In this section we analysis the experimental settings and results.

Experimental Settings.
In our experiments, we downloaded the software package for [11] as our HR training images and 69 images were used to prepare our training database; and we randomly selected some benchmark images which were often used in others' works to prepare our test database.For all of our training and test images, we use the YCbCr color space for color images and only perform our method in the luminance component.The chrominance components can be upscaled to the desired size with the bicubic interpolation method.We select 150000 textured patches with fine details as our training examples to avoid uninformative patches affecting the efficiency of learning.The settings of parameters always affect the performance of our method.To get the best results, we discuss some important parameters involved in this method.
Firstly, we need to know which wavelet in the wavelet family is the best for our proposed method; the Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index Measurement (SSIM) [30] values of each test image when the number of clusters is 370 and the patch size set 3 with 2 overlapping pixels are shown in Table 1.From Table 1, we can draw a conclusion that the best wavelet of our algorithm is ℎ wavelet.On one hand, shift variance of the wavelet transform is considered as the major drawback for the applications in texture analysis [31].Except for ℎ filter, the strict shift invariance of the wavelet transform is not achievable.On the other hand, the phase linearity (symmetry) of the decomposition is also an important condition in classification.Nonsymmetric filters will lead to the degradation of the image and lower accuracy in classification due to the nonlinear phase.Therefore, we choose ℎ wavelet in our experiments.
Secondly, we cluster the training patches extracted from a set of example images into  clusters by -means algorithm; and for each test patch, we compute the distances of it with the centroids to determine its cluster and corresponding PCA subdictionary.To verify the number of clusters , we set it from 50 to 1000 with a step of 10 on all test images when the patch size is 3 with 2 overlapping pixels.From Figure 3, we can see that, when the number of clusters is smaller than 300, the PSNR values are small.The results indicate that, when the number of the cluster is small, the texture in a cluster is too complex for a compact representation produced by PCA; and, when  is bigger than 610, a small amount of the data samples is considered as a group, which makes the proposed method work like NE.The average PSNR value reaches its peak value when  = 370.Therefore, we use 370 in our following experiments.
Finally, our experiments all work on image patches.We partition the image into patches with some overlapping pixels in raster-scan order.An example of partitioning an image into patches with the patch size of 5 and overlapping pixels of 2 is shown in Figure 4.The overlapping pixels between the adjacent patches may ensure the reconstructed image with consistency and avoid blocking effect.Therefore, to measure the size of image patch and overlapping pixels, we set the patch size from 2 to 5 with different overlapped pixels on all the test images while fixing the number of clusters to 370; and the average PSNR values corresponding to different patch sizes with different overlapping pixels are shown in Figure 5.Many related works have indicated the reputation property of the local patches [10,17].The smaller the patch is, the better reputation is [10].That means it is easier to find similar patches in the training data set if the patch size is smaller.Therefore, smaller patch obtains higher PSNR values.Larger overlap provides smoother results, which meets the smooth prior of the natural images.So the size of image patch is set to 3 with 2 pixels overlapped between the adjacent patches.

Computation Complexity.
We analyze the computational complexity and CPU execution time of our test phase.Supposed that there are  patches in the test set and the dimension of the patch is  2 , the dimension of the gradient feature of test patch is 4 2 .For each sample, we find the nearest cluster by computing the Euclidian distance of this example with the centroids which is obtained in the training process.Suppose the size of the nearest cluster is   , the computation complexity of finding the cluster is (4   2 ).Then we need to compute the sparse representation vector of this example by (9).Suppose the size of the PCA subdictionary with respect to the nearest cluster is    , the computation complexity of computing the sparse representation is (4    2 ).In addition, the three detail patches can be calculated by (10), and the computation complexity of these steps is (3(   ) 2 ).In conclusion, the computation complexity for all  test patches is ((4   2 + 4    2 + 3(   ) 2 )).
For better understanding, the CPU execution time comparisons with some state-of-the-art methods of test phase are shown in Figure 8.The configurations of our computer are as follows: AMD FX (tm)-8120 Eight-Core Processor, 8G RAM, and 64-bit operating system.It is clear that our method is faster than SCSR (Sparse Coding Super-resolution) [11] and slightly slower than Zeyde et al. 's [12].

Conclusion and Future Works
This paper proposes a novel algorithm about single image SR based on training PCA dictionaries in wavelet domain.We decompose the image into 4 subbands, which obtains good performance by alleviating the uncertainty in HF details prediction.In addition, the PCA dictionary provides low time complexity in both the training and testing phases.In the future, we will upscale each subband with different method according to its own characteristics.The dictionary group obtained by the proposed method is trained with the training sets about one magnification factor, so it is only suitable for one magnification factor.Therefore, how to train dictionaries to fit different magnification factors is still a problem.[11,12].

Figure 1 (
Figure 1(a) shows the 1-level DWT result of the ground HR image.It is clear that the approximation subbands of the original HR image and the bicubic interpolated image (Figure 1(b)) of the corresponding LR image after 1-level DWT are very similar but the detail subbands are different.That means the approximation subband of the bicubic interpolated image can be served as the approximation subband of the resultant image.The detail subbands can be reconstructed by the relationship recorded by the coupled dictionaries.Therefore, we introduce the sparse coding method to train the relationships of the subbands.Though the universal dictionaries learned from example image patches by using algorithms such as K-SVD[26] are

Figure 1 :
Figure 1: The images and their corresponding decomposition images after 1-level DWT (the subimages contain an approximation subband and three detail subbands).(a) The ground HR parrot image and its decomposition images.(b) The bicubic interpolated image and its DWT subband images.(c) The HR image reconstructed by our method and its DWT subband images.

D
LL , D LH , D HL , D HH

Figure 2 :
Figure 2: The flowchart of our proposed algorithm.

Figure 3 :
Figure 3: The average PSNR values of all test images versus different numbers of clusters.

Figure 6 :
Figure 6: The comparison of the proposed algorithm and four other methods on  image (2x), and the local magnification in green rectangle is shown in the upper right corner in each example.(a) Bicubic.(b) Zeyde et al. [12].(c) Yang et al. [16].(d) ANR [15].(e) Our proposed.(f) Original.

Figure 7 :
Figure 7: The comparison of the proposed algorithm and four other methods on  image (2x), and the local magnification in yellow rectangle is shown in the lower left corner in each example.(a) Bicubic.(b) Zeyde et al. [12].(c) Yang et al. [16].(d) ANR [15].(e) Our proposed.(f) Original.
is the th PCA subdictionaries corresponding to approximation subimage and the atoms of this dictionary are the   principal components of the patch matrix W   .A   = {   } =1,...,  is the sparse coefficient matrix by calculating A   = D    W   .‖ ⋅ ‖ 0 is the number of the nonzero elements of a vector.The detail subdictionaries D    ,  ∈ {, , } corresponding to D   can be reformulated as follows: D   , . . ., D   }; Cluster the patches in W  into  clusters by -means, and save the cluster centroid matrix as C; (4) Train PCA subdictionaries D   ( = 1, . . ., ) by

Table 1 :
The PSNR and SSIM values of the proposed method with different wavelet-filters (for each wavelet, the left is PSNR values and the right is SSIM values).

Table 2 :
The comparisons of the proposed algorithm with other methods by a magnification factor of 2 (for each method, the values are PSNR and SSIM from left to right).