Single Image Superresolution Using Maximizing Self-Similarity Prior

,


Introduction
The task of single image superresolution (SISR) is concerned with generating a high-resolution (HR) image from an input image of low resolution (LR).Since the same LR image may hypothetically correspond to multiple HR images, the solution space is inherently ambiguous.To resolve the ambiguity, existing methods often rely on certain image priors for selecting the preferred HR reconstruction result.
Reconstruction-based methods derive HR images via imposing constraints according to prior knowledge of the inverse problem [6,7].Shan et al. [6] design a feedback control framework, which uses a continuous function to fit a gradient density distribution for the input image as a prior for the HR reconstruction process.However, these methods often produce undesirable artifacts in the HR results, especially along salient edges.
The basic assumption of the learning-based methods is that the high-frequency details lost in a reconstructed HR image can be learnt from a set of low-and high-resolution image pairs.This category of methods generates an HR image from a single LR image by referring to pairs of lowand high-resolution images as the training data.Extensive research results have demonstrated the promising potential of the approach.Freeman et al. [11,15] proposed a learning based method applicable for processing generic images where 2 Mathematical Problems in Engineering the mapping from LR to HR versions of an image is captured by a Markov random field inferred through belief propagation.Chang et al. [17] extended their method by applying the manifold learning theory onto the correspondence between the HR image patch space and the LR image patch space.The method generates HR patches as a linear combination of its nearest neighboring patches.However, the fixed number of nearest neighbors could cause over-or underfitting.To conquer this shortcoming, Yang et al. [13,18,19] proposed an approach that generates sparse linear combinations using a compact dictionary.The computed sparse representation adaptively selects the relevant patches in the dictionary to represent each patch.It is noted that constructing a compact dictionary suitable for the application is a time-consuming process.These methods severely depend on the quality and relevance of the image training data set.It tends to produce obvious artifacts and generate unwanted noise in its HR generation results.
Some recent studies show that natural images generally possess a great amount of self-similarities [14,[20][21][22], that is, local image structures tend to recur within and across different image scales [14,[20][21][22], and image superresolution can be regularized taken into account these self-similar examples instead of some external database [20][21][22].Particularly, Yang et al. [14] use self-examples within and across multiple image scales to regularize the ill-posed classical superresolution problem.The method extends the example-based superresolution framework with self-examples and iteratively upscale the image.They show that the local self-similarity assumption for natural images holds better for small upscaling factors and the patch search can be conducted in a restricted local region, allowing very fast practical implementation.However, the main problem rests with the fact that the reconstructed edges are usually too sharp to look natural.
Our work is inspired by the observation that almost all small image patches usually reappear themselves at other positions in natural images or its different scales.In conformity with the above observation, the basic assumption we design in this study is that there should at least exit one resemble patch in the input low resolution image for a patch randomly extracted from the well-reconstructed HR image.
The main contributions of this paper include the following: (1) Inspired by the idea that each patch in the reconstructed SR result should resemble patches in the original input image, we propose a new algorithm that infers HR images from their LR versions through multiresolution self-similarity maximization.The results of the new method outperform those generated by the state-of-the-art algorithms through comparative experiments.
(2) For finding the optimal HR image, we introduce GMM to train on a patch set for approximating the joint probability density function of input image with different scales and uses an iterative scheme to resolve the cost function.This framework is easy to be adapted to other problems such as image deblurring and image denoising.
The rest of this paper is organized as follows.We first state the study motivation in Section 2. Section 3 then introduces the proposed algorithm, including how to design the cost function, how to construct and train the reconstruction model, and how to solve the cost function.Section 4 presents the results of our approach as well as comparing them with those produced by the peer methods.Finally, we conclude this paper with discussions on future work directions.

Motivation
Recently, expected patch log likelihood (EPLL) framework [23] using GMM prior for image denoising was proposed with its performance comparable to the state-of-the-art algorithms.EPLL trains a GMM using external natural image patches and, for each patch extracted from the unclean image, finds its highest probability solution in the model by the iterative splitting algorithm (for more details, the reader can refer to [23]).GMM prior shows its powerful denoising ability.However, this scheme is unsuitable for solving SISR problem, except for the time consuming during training phrase, also because there is only one-scale image patches in the training data; thus it cannot introduce high frequency in the SR result and adopt the Wiener filter solution leading to damaging the fine details.We design a simple experiment to illustrate this point which was displayed in Figure 1.In this experiment, first, we upsample the small natural image "bear" with bicubic interpolation method as the input one (but adding no noise in it).Processing it with EPLL (the result in Figure 1(c)), we can notice that the salient edges are preserved very well, but the texture and details are almost erased.The result of EPLL is in agreement with our analysis.Processing the small natural image "bear" with our method, the result (Figure 1(d)) is sharper than the results of bicubic interpolation and EPLL methods.The algorithm of EPLL [23] is specialized for image denoising, which can preserve edge and erase the noise, and, however, it cannot add any correct high-frequency content in the results.So if the input is an image without high frequency, the result of EPLL algorithm is undesirable and our method can refine the result by finding some high frequency in the input image itself.Obviously, it is unfair for the specific denoising algorithm EPLL to be used for finding high frequency; we just use this experiment to illustrate the function of our algorithm.
Inspired by EPLL work, and also due to the image property of self-similarity between different scales, thus the input LR image and its filtering version can serve as a sufficient training data set for learning the model of the image scene.We propose a SISR method which trains GMM on the training data constructed by concatenating the different scales of the input image and can introduce the correct high-frequency component and protect the details in the HR result.

Our Approach
We first explain the notations appearing in our SISR scheme with the help of Figure 2. In Figure 2, the input image is denoted by  ∈  √×√ , from which we obtain its low-frequency component image  0 ∈  √×√ by low pass Gaussian , which is used as the input for producing (c) and (d) using the EPLL and the proposed method, respectively.From (c), we can see that EPLL fails to capture the image texture and its fine details.In comparison, the result of the proposed method (d) appears more natural and realistic; in particular the edges and textures regions are better preserved.
The framework of our approach.For a random patch   in , we wish there exists at least one very similar patch   in .We find the joint PDF of the data pair {  ,   0 }  =1 and thus solve the optimal   when the data pair (  ,   0 ) substitute into the joint PDF.
filtering.We upsample  using bicubic interpolation by a factor of  to get  0 ∈  √×√ ,  =  2 , and  is the upsampling factor.We use  0 to approximate the low-frequency component version of the unknown high-resolution image  ∈  √×√ .From ,  0 , and  0 , the target is to estimate the HR image .We use   and   0 to denote th × image patch sampled from  and  0 , respectively, and   and   0 to denote th  ×  image patch sampled from  and  0 , respectively.{  0 }  =1 and {  0 }  =1 are thus referred to as the low-resolution image patches because they are missing high-frequency components, while {  }  =1 and {  }  =1 are referred to as their highresolution counterparts separately. represents the location of the patch in the corresponding images, and, in our scheme, the overlapped pixels between two neighbored patches are ( − 1); for the boundary pixels, we extend the image by symmetrical method.So the number of total patches extracted from one image is the same as the pixels of the image.

The Cost Function Based on Maximizing Self-Similarity
Prior.For SISR problem, the generation of an LR image from the underlying HR image is expressed as where  ∈   is the observed LR image and  ∈   is the original HR image,  = √/ is the up-sampling factor, and  ∈   is assumed to be an additive Gaussian noise., , and  are all represented in vectorial form. ∈  × ,  ∈  × are downsampling and filtering matrix, respectively.For a random patch   in , Freeman et al. [11,15], Chang et al. [17], Yang et al. [14], and Dong et al. [12] all want to find one nearest neighbor patch   in , by approximating the distance between   in  and   in  to the distance between   0 in  0 and   0 in  0 .However, the approximation is not accurate, since the correspondence between the LR image patches and their counterparts (HR image patches) and the neighborhood relationship cannot be preserved perfectly due to the "one-to-many" mapping existing between one LR image and many HR images.
If we have already got the reconstructed HR image , according to the self-similarity property, a random patch   in  should at least have one nearest neighbor patch   in , and for another random patch   in , being not afraid of the overlapping situation in the low resolution image , it should also at least have one nearest neighbor patch   in .We use (  ) to measure the similarity between   and   .  is a matrix to extract the patch   from .The larger the probability value (  ), the more similar the appearance between   and   .For any patch   extracted from the reconstructed HR image, we hope the probability (  ) of the patch obtains the maximizing value.And we call the property of the reconstructed HR image Maximizing Self-Similarity Prior (MSSP).So, for the whole reconstructed HR image, we can write out the cost function as arg min where  is the parameter for balancing the effect of the fidelity term ‖ − ‖ The cost function is nonconvex and difficult to solve.For finding the optimization solution, we introduce a set of auxiliary variables {  }  =1 , the overlapped patch    corresponding to   .Thus the cost function is changed to arg min Note that when  → +∞, since we want to find the minimize solution of (4),    is equal to   , and the solution of the equation will be converged.We solve this equation by taking an iterative way.Firstly, while we keep   fixed, solve for .This can be solved by setting the derivative of this equation about  to zero and we can get the result where ()  represents transposing matrix , () −1 solves the inverse matrix of , and  is for all overlapping patches in .
And then solving for {  }  =1 gives X.After each iteration, we increase  and continue to do the next iteration.In next two following subsections, we will elaborate how to solve {  }  =1 .  ] ,  = 1, . . ., .

Training
Gaussian mixture model (GMM) is a parametric probability density function (PDF) that is represented as a weighted sum of Gaussian densities.GMM can approximate an arbitrary probability density function accurately [24].For a random variable  ∈  2 2  , we learn a GMM over these concatenated vectors and write its PDF by GMM as where  is the number of Gaussian mixture components and   ,   , and Σ  are the parameters of weight, mean vector, and covariance matrix of th Gaussian mixture component separately.Using EM algorithm to estimate these parameters is not hard.
] for the covariance matrix of the model component densities.According to the self-similarity between the different scales of one image, for a random patch   in the reconstructed HR image, we can predict the occurrence probability in the original input image as where Based on (3), prediction can be obtained by taking the expectation result: At last, the algorithm for SISR using maximizing self-similarity prior is summarized as in Algorithm 1.

Parameter Setting. Experiments have been conducted
to evaluate the result of our method in comparison with several state-of-the-art algorithms.We start by using the original image as LR input and upsample it with a scale factor of 2. For further upsampling, we used the previous output as the input and solved its HR image.Note that, for a color image, it is first transformed from RGB to .Then, the  channel (intensity) is upsampled by our algorithm because human vision is more sensitive to brightness change.
and  channels are interpolated to the desired size by the bicubic interpolation method.Finally, the three channels are combined to form the final HR result.The input LR image  is generated from the original HR image by a downsampling and filtering process, the low-frequency band  0 of the unknown HR image is approximated by bicubic interpolation from , and the low-frequency band of the input LR image  0 is obtained by low pass Gaussian filtering with a standard deviation of 0.5.In all experiments, the size of all patches is 5 × 5 ( = 5),  = 0.5,  = ( × 10) 2 , and  represents the th iteration. represents the number of the iterative times; in this paper, we set  = 5.As we know, the amount of GMM component densities  is important to the performance of the joint PDF learning.However, the accuracy and the efficiency should be trade-off in practical implementation.In our experiments, we apply a Bayesianbased model selection criterion [25] to determine the number of GMM components by minimizing the following metric: (11) where  is the total number of model parameters.

Visual Analysis.
Subjective visual effect is a significant evaluation to superresolution models, which could reflect the advantages and disadvantages intuitively.In this section, we test our model and make some comparisons to the state-ofthe-art models.
As mentioned before, Glasner et al. [20] emphasized the importance of image self-similarity to superresolution problem.Therefore, in Figure 3, we present a comparison with respect to Glasner's patch recurrence (PR) model, our model, and the ground truth.As observed in Figure 3(b), Glasner's patch has some artifacts along the sharp edges and tends to be smoother.By contrast, our model could obtain an output closer to the ground truth.The difference between both of the results could be observed clearly by the close-up and its corresponding pseudocolor version.
In Figure 4, we adopt the natural images which have obvious self-similarity texture and structure components to evaluate our model.For objectiveness, we select the Gaussiantype model to compare. Figure 4 presents the results which produced by He's Gaussian process regression (GPR) model [22] and ours.In the texture case, we note that our result has better detail performance rather than blurring.And in the structure case, we could see that some artifacts appear in GPR result but not in ours.
Furthermore, in 2x upsampling, we compare between Shan's method [6], Glasner's patch recurrence method [20], Yang's SCSR method [13], Peleg's statistical prediction model [26], and ours.The tested images include such characteristics of sharp edges, complicated textures, and repeated structures.In Figure 5, by the close-up, we observe that Shan's, Glasner's, and Peleg's results would yield some artifacts along the sharp edges, for example, the wheel image in the second line.And Yang's SCSR model would generate a relative blurred result.Generally speaking, our model could obtain a series of satisfactory results in these conventional tested images.

Objective Evaluation.
Peak single-to-noise ratio (PSNR) and structure similarity (SSIM) index are used as the objective measures.PSNR is defined as where  is the original HR image, x is the SR result we approximated, and  is the total number of pixels in the image.SSIM is more consistent with human eye than PSNR.The SSIM metric is calculated on various windows of an image and the measure between two windows  and  of common size  ×  is where   ,   are the average values of  and ,    see that SR results with better quality should provide larger PSNR and SSIM values.
In our experiments, fifteen images are randomly selected from RetargetMe benchmark (http://people.csail.mit.edu/mrub/retargetme/), as illustrated in Figure 6.We first apply the Lanczos resampling to yield the low-resolution version (0.5x) and then recover to the original size to compare with the ground truth.Five state-of-the-art models are tested, including Glasner's patch recurrence method [20], SCSR [13], KRR [27], GPR [22], and SPM [26].The PSNR and SSIM metrics are used to evaluate these results, and the corresponding scores are, respectively, recorded in Table 1.Note that our method could obtain the best or the second-best performance in most of the test images.
For saving the computing time, we can compute , and ()   after finishing the GMM training phrase, and use it directly in the iterative process.Figure 7 provides the numerical comparison on executing time with the state-of-the-art algorithms.The images are upscaled from 256 × 256 to 768 × 768 with the conventional images.The hardware configuration of our modern computer is Intel CPU 1.7 GHZ, 4 GB memory, and the software platform is Matlab 7.1.

Conclusion
In this paper, a novel algorithm is proposed for SISR based on the concept of image self-similarity maximization.The proposed method carefully observes and leverages a pair of local and global image constraints between different resolutions of an image.The local constraint dictates that a random patch in the reconstructed HR image should exhibit visual similarity with its corresponding image region in the LR image; the global constraints state that the reconstructed HR image should perceptually resemble the input LR image through downsampling and filtering.The comparative experimental results show that the results generated by the new method preserve details of image more realistically, as determined both perceptually through subjective evaluation and quantitatively via numeric measurements.In the future, we plan to adapt and migrate the method for the problem of video superresolution.

Figure 1 :
Figure1: The comparison between the results of EPLL and the proposed method.(a) The ground truth image.Through downsampling and upsampling the image by the bicubic method, we derive (b), which is used as the input for producing (c) and (d) using the EPLL and the proposed method, respectively.From (c), we can see that EPLL fails to capture the image texture and its fine details.In comparison, the result of the proposed method (d) appears more natural and realistic; in particular the edges and textures regions are better preserved.

Figure 3 :Figure 4 :
Figure 3: Comparisons with the self-similarity measurement.(a) Ground truth.(b) Glasner's PR [20].(c) Our proposed method.Note the artifacts in (b) and ours are very close to the ground truth.

Figure 6 :
Figure 6: Fifteen test images for effectiveness validation.From left to right: bells, bike, Buddha, butterfly, child, eagle, face, family, Fatem, fish, surfer, tiger, trees, two birds, and waterfall.All the test image are extracted from RetargetMe data set.And the validation data are recorded in Table1.
. For solving each auxiliary variable   , it just means estimating the most likely image patch under prior of MSSP.However, we have no HR image  in practice; thus it is difficult to resolve directly.Here, we consider the relationship between  and  0 and extract patches {  }  =1 and { 3.2.2.Predicting.We consider the parameters {  ,   , Σ  } and can rewrite   = [ 2 , 2 are the variances of  and , and  1 ,  2 are two variables to stabilize the division with weak denominator.From the definition, we can

Table 1 .Table 1 :
PSNR and SSIM for scale factor of 2. Figure 7: Comparison of the averaged CPU time between different methods.4.4.Analysis.By analyzing (8) and (10), we can notice that   is decomposed into  layers, and, for each layer, the weight is   (  0 ).This operation can avoid the influence of different frequencies.In each layer, we can see that if Σ