Image Quality Assessment Based on Joint Quality-Aware Representation Construction in Multiple Domains

Image quality assessment that aims to evaluate the image quality automatically by a computational model plays a significant role in image processing systems. To meet the need of accuracy and effectiveness, in the proposed method, complementary features including histogram of oriented gradient, edge information, and color information are employed for joint representation of the image quality. Afterwards, the dissimilarities of the extracted features between the distorted and reference images are quantified. Finally, support vector regression is used for distortion indices fusion and objective quality mapping. Experimental results validate that the proposed method outperforms the state-of-the-art methods in terms of consistency with subjective perception and robustness across various databases and different distortion types.


Introduction
With the rapid development of multimedia technology, millions of digital images need to be processed; however, any operation on images, such as acquisition, reproduction, compression, storage, transmission, and restoration, may introduce distortions into them [1,2]. Image quality assessment (IQA) is employed to estimate the degree of distortion. That is, IQA plays a pivotal part in evaluating or monitoring the performance of an image processing system. Visual perception is the result of interaction between human beings and the environment, so it is difficult to define image quality exactly. The most accurate estimation method for image quality is subjective IQA carried out by human observers. However, subjective methods are costly, time-consuming, and impractical, as they cannot be integrated into real-world systems for real-time visual quality monitoring and control [3,4]. This triggers the need to develop reliable objective IQA methods that are consistent with subjective human evaluation.
Traditional objective IQA methods such as Peak Signal-to-Noise Ratio (PSNR) and Mean Squared Error (MSE) [2] are simple and have clear physical meanings, but they are criticized for their low accuracy because they measure only the statistical information in images. Consequently, in recent decades, an increasing number of researchers have made efforts to develop effective and efficient methods to evaluate image quality automatically. Ideally, perceptual quality would be obtained by mimicking the image perception mechanism of the human visual system (HVS); nevertheless, due to the complexity and limited understanding of the HVS, it is almost impossible to imitate it completely [5]. State-of-the-art methods turn in another direction: they capture statistical properties (features) that represent the information the HVS is interested in and that are closely related to the inherent image quality, and they map these features to perceptual quality. Therefore, extracting effective image features is a significant problem. Furthermore, in most cases, a single measurement cannot provide sufficient information for quality prediction. It has been shown that exploiting image information jointly in the image space, scale, and orientation domains can provide richer clues, which are not evident in any single domain. That is to say, a quality-related representation should be obtained by considering joint occurrences of two or more features [6]. Saha et al. put forward a method that evaluates image quality by extracting multiple features composed of gradient difference, contrast difference, and saliency map difference [7]. In [6], a complementary representation is proposed that employs log-Gabor filters to model the perceptual characteristics and the local derivative pattern to describe detailed texture information.
Furthermore, an effective IQA method should be (i) consistent with the subjective sensation of the HVS and (ii) stable across different types of distortions. To tackle these challenges, a method based on complementary feature extraction and fusion is proposed in this paper. The framework of the proposed method is shown in Figure 1. Firstly, quality-aware features in three domains are extracted to construct the complementary representation. Secondly, the difference of each feature between the reference and distorted images is quantified. These differences, which reflect the degree of quality degradation, are taken as the distortion indices. Thirdly, all the distortion indices are mapped into an objective quality score, where support vector regression (SVR) is adopted to learn the regression model.
The main contributions of this paper are as follows. A complementary representation of image quality is proposed based on joint quality-aware feature extraction, where multiscale histograms of oriented gradient (HOG) are used to describe local structure information, multiscale edge information is employed to describe global information, and color histograms are used to describe color information. With the complementary quality-aware feature extraction and SVR-based fusion, an efficient IQA method is put forward that performs well in terms of consistency with subjective assessment and robustness across different databases and distortion types.
The remainder of this paper is organized as follows. In Section 2, a brief review of IQA and related work is given. Section 3 presents the multiscale quality-aware feature extraction and the corresponding dissimilarity quantification. Section 4 introduces the feature fusion and quality mapping strategy. Experimental results and discussions are presented in Section 5, and, finally, Section 6 draws the conclusion of this paper and indicates future work.

Related Work
Generally, the intrinsic principle of IQA is to measure the degree of perceptual quality degradation by assessing the difference or dissimilarity between the distorted image and its corresponding reference image [3]. The scheme of a generic IQA method, shown in Figure 2, has three stages. In the first stage, features that reflect image quality are extracted by different algorithms. In the second stage, the difference or dissimilarity of each feature between the distorted image and its corresponding reference is quantified. Such differences or dissimilarities are regarded as the distortion indices that measure the degradation of image quality. Finally, in the third stage, all the distortion indices are fused together and mapped into an objective quality score.
Feature extraction is quite important in image quality assessment. Over the last decades, various effective feature extraction methods have been explored in the literature, proceeding in two directions [8]. Since images are ultimately viewed by human beings, some HVS-oriented methods make use of the limited understanding of the human visual system (HVS) and capture features by simulating the way that human beings perceive an image [9]. Studies in neuroscience have discovered that the primary and secondary visual cortexes (V1 and V2) are the earliest receptors of the signal and play the most important part in generating vision [10]. Recent HVS-based methods try to incorporate just noticeable difference (JND) [11], contrast sensitivity function (CSF) [12], contrast masking [13], temporal masking, visual attention, and saliency maps into perceptual feature extraction. A well-known example is visual information fidelity (VIF) [14], which incorporates an HVS model into the information fidelity criterion (IFC) [15]. Since images are naturally multiscale in HVS perception, some weighted multiscale methods that operate at multiple resolutions in agreement with human perception have been explored [16,17]. Furthermore, it has been found that Gabor filters have excellent properties and that the shapes of Gabor wavelets are similar to the receptive fields of simple cells in the primary visual cortex. Thus, visual features are extracted by 2D Gabor filters to reflect the nonlinear mechanism of the HVS [18]. Instead of simulating the functional components of the low-level HVS, some methods take high-level aspects of the HVS, such as visual attention and visual saliency, into account in feature extraction [9,19]. However, it is very difficult to model the complex HVS well given the limited understanding of it. Besides, the high computational complexity also limits the application of these methods.
In this scenario, most state-of-the-art methods turn in the other direction and attempt to extract statistical properties (features) of an image that are closely related to its inherent quality [6]. Such methods aim to extract quality-aware features effectively and have achieved notable success. Based on the hypothesis that the HVS is highly adapted for extracting structural information from images, the structural similarity (SSIM) index, which involves luminance, contrast, and structure information in perceptual quality estimation, is quite attractive [20]. Furthermore, several extensions have been proposed to improve its performance [17,21-23]. Recent studies show that exploiting image information jointly in the color space, scale, and orientation domains can provide richer clues, which are not evident in any single domain [5]. Therefore, multidomain feature extraction that considers joint occurrences of two or more features is necessary and constructive [6,10].
With respect to image quality assessment, distortions introduced into an image inevitably affect the statistical characteristics of its features. Intuitively, the statistical characteristics of the features of a distorted image differ considerably from those of the reference image. Quantifying such deviations from regularity appropriately has been validated as a useful way of assessing perceptual quality. Thus, in the stage of difference quantification, the dissimilarity of each feature between the reference and distorted images is computed. There are many common approaches to quantifying the difference between two features, such as Euclidean distance, histogram intersection distance (HID), histogram quadratic (cross) distance, chi-square distance, and correlation coefficient [24,25]. In [6], the correlation coefficient is employed to quantify the difference between the gradient magnitude and orientation maps, the chi-square distance is used to quantify the difference between the energy maps, and HID is adopted to quantify the difference between the local pattern maps.
As mentioned above, the dissimilarities of features between the distorted and reference images reflect the image quality degradation, so it is reasonable to take them as distortion indices. That is, the quantification can be regarded as a good approximation of the perceived distortion in image quality. In the stage of index fusion and quality mapping, all the distortion indices are fused together and mapped into an objective score. Average fusion, which takes the mean value as the overall quality score, is simple and widely used, but its accuracy is generally criticized. Weighted fusion is another widely used strategy because various features may contribute differently to the final quality. More recently, researchers have tended to construct more complex weighting functions or to employ machine learning techniques to develop the fusion scheme [4]. The k-nearest neighbor (KNN) algorithm, convolutional neural networks (CNN), and support vector regression (SVR) are the most commonly used machine learning tools [26].

Quality-Aware Feature Extraction and Dissimilarity Quantification
It is claimed that human eyes are sensitive to less complex features that are localized, oriented, and bandpass and are interested in color and structure information [27]. In this paper, we attempt to capture complementary quality-aware features from different domains. Quality-aware features represent characteristics that are closely related to image quality. In addition, quality-aware features are sensitive to the degradation of image quality, and the changes in these features caused by image distortions are consistent with human perception of visual quality.

Orientation Feature Extraction and Dissimilarity Quantification.
With regard to feature extraction, local description has received a lot of attention in recent years and performs very well in many image applications, including image retrieval, object recognition, and texture analysis [6]. Such success inspires us to introduce local features extracted by HOG into image quality assessment. As shown in Figure 3, there are five steps for calculating the HOG of an image. In the first step, a pair of masks, $[-1, 0, 1]$ and $[-1, 0, 1]^{T}$, is used to obtain the gradient magnitude and orientation maps; because the weights in these masks are related to the distance of each position from the center pixel, they can achieve relatively better results. Image segmentation is the second step, in which the image is divided into a fixed number of small connected 8 × 8 regions called cells. Histogram accumulation is the third step, where the gradient orientations within each cell are discretized into several angular bins and a local HOG over all pixels within one cell is accumulated [28]. The votes of the local HOG are weighted by the gradient magnitudes. In this paper, the range of orientations is from 0° to 180°, and the gradient orientation is divided into nine bins. The next step is grouping the histograms of cells into blocks, in which the histograms of adjacent cells (2 × 2 cells) are concatenated to constitute the descriptor of one block. After grouping, the values of the histograms within each block are normalized in order to avoid contrast bias. The final step is concatenating the histograms of all blocks into a combined vector, which is regarded as the final feature descriptor.
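The first, third, and fourth HOG steps can be illustrated with a minimal pure-Python sketch. It is not the implementation used in the experiments: the border handling (pixel replication), the hard assignment of each pixel to a single bin, and the L2 block normalization are assumptions, and the function names are hypothetical.

```python
import math

def gradients(img):
    """Gradient magnitude/orientation with [-1, 0, 1] masks (borders replicated)."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mag[y][x] = math.hypot(gx, gy)
            # Unsigned orientation folded into [0, 180) degrees.
            ang[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang

def cell_histogram(mag, ang, y0, x0, cell=8, bins=9):
    """Magnitude-weighted orientation histogram of one cell (hard binning)."""
    hist = [0.0] * bins
    width = 180.0 / bins
    for y in range(y0, y0 + cell):
        for x in range(x0, x0 + cell):
            hist[int(ang[y][x] // width) % bins] += mag[y][x]
    return hist

def normalize_block(cell_hists, eps=1e-6):
    """Concatenate the cell histograms of one block and L2-normalize (assumed norm)."""
    v = [value for hist in cell_hists for value in hist]
    norm = math.sqrt(sum(value * value for value in v) + eps * eps)
    return [value / norm for value in v]
```

Concatenating the normalized vectors of all blocks would then give the final descriptor; a production system would more likely rely on an existing HOG implementation.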
The degree of corruption can be regarded as an approximation of the image quality degradation because an image with better quality has a smaller divergence from its perfect version. Chi-square distance is adopted as the dissimilarity measure to construct the first kind of distortion index:
$$D_{\mathrm{HOG}} = \sum_{i=1}^{N} \frac{\left(\mathrm{HOG}_r(i) - \mathrm{HOG}_d(i)\right)^2}{\mathrm{HOG}_r(i) + \mathrm{HOG}_d(i)},$$
where $N$ is the total number of HOG bins within an image and $\mathrm{HOG}_r(i)$ and $\mathrm{HOG}_d(i)$ denote the values of the reference and distorted images in the same HOG bin $i$, respectively.
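As an illustration, a chi-square distance between two HOG descriptors might be computed as in the sketch below. The form without a 1/2 factor and the skipping of empty bins are assumptions, since several variants of the chi-square distance are in use; `chi_square_distance` is a hypothetical name.

```python
def chi_square_distance(hog_ref, hog_dis, eps=1e-12):
    """Chi-square distance between two descriptors of equal length.

    Bins where both descriptors are (numerically) zero contribute nothing,
    which avoids a division by zero.
    """
    if len(hog_ref) != len(hog_dis):
        raise ValueError("descriptors must have the same number of bins")
    total = 0.0
    for r, d in zip(hog_ref, hog_dis):
        denom = r + d
        if denom > eps:
            total += (r - d) ** 2 / denom
    return total
```

Identical descriptors give a distance of zero, and the value grows as the distorted descriptor departs from the reference one.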

Edge Feature Extraction and Dissimilarity Quantification.
It is well known that the human eye is highly sensitive to the edge and contour information of an image; therefore, edge information is very important for fully exploiting the potential inherent quality. There are many useful tools for edge detection, such as the Canny, Sobel, Scharr, and Prewitt operators. The Prewitt operator is selected to obtain the edge information in this paper. Let $P_x$ and $P_y$ denote the 3 × 3 Prewitt operators in the horizontal ($x$) and vertical ($y$) directions. An edge map is produced by
$$G_r = \sqrt{(I_r \otimes P_x)^2 + (I_r \otimes P_y)^2}, \qquad G_d = \sqrt{(I_d \otimes P_x)^2 + (I_d \otimes P_y)^2},$$
where $I_r$ and $I_d$ are the reference and distorted images, the symbol "⊗" denotes the convolution operator, and $G_d$ and $G_r$ represent the edge maps of the distorted image and its reference image, respectively. As an example, a reference image named "parrots" taken from the LIVE database [29] and its corresponding detected edge information are shown in Figures 4(a)-4(e). For comparison, the corresponding results for its version distorted by white Gaussian noise (WN) are shown in Figures 4(f)-4(j).
Ideally, the edge map should have a continuous outline because the edge distribution of a perfect image is highly organized; in comparison, the integrity of the outline is destroyed in a distorted image [8]. From Figure 4, it can be clearly observed that the outline in Figure 4(e) is much clearer than that in Figure 4(j). Moreover, the degree of degradation reflects the distortion level of the images. The edge similarity map between the distorted and reference images is calculated in a pixel-wise manner, which is defined as
$$\mathrm{ES}(i) = \frac{2\, G_r(i)\, G_d(i) + c}{G_r^2(i) + G_d^2(i) + c},$$
where $c$ is a positive constant utilized for numerical stability and $\mathrm{ES}(i)$ denotes the similarity of the $i$th pixel between the reference and distorted images. In this paper, the mean value EA and the standard deviation ED of ES act as two indices of image quality:
$$\mathrm{EA} = \frac{1}{N_p}\sum_{i=1}^{N_p} \mathrm{ES}(i), \qquad \mathrm{ED} = \sqrt{\frac{1}{N_p}\sum_{i=1}^{N_p}\left(\mathrm{ES}(i) - \mathrm{EA}\right)^2},$$
where $N_p$ is the total number of pixels contained in an image.
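The edge indices can be sketched in a few lines of pure Python. The 1/3 normalization of the Prewitt masks, the restriction to interior pixels, and the default value of the stabilizing constant `c` are assumptions made only for illustration; the filter below is a correlation rather than a flipped convolution, which changes only the sign of the responses and therefore not the edge magnitude.

```python
import math

# Standard 3x3 Prewitt masks; the 1/3 scaling applied below is an assumption.
PREWITT_X = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
PREWITT_Y = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]

def filter3x3(img, kernel):
    """Apply a 3x3 mask to the interior pixels only (no padding)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            acc = 0.0
            for j in range(3):
                for i in range(3):
                    acc += kernel[j][i] * img[y + j - 1][x + i - 1]
            row.append(acc / 3.0)
        out.append(row)
    return out

def edge_map(img):
    """Edge magnitude from the horizontal and vertical Prewitt responses."""
    gx = filter3x3(img, PREWITT_X)
    gy = filter3x3(img, PREWITT_Y)
    return [[math.hypot(a, b) for a, b in zip(rx, ry)] for rx, ry in zip(gx, gy)]

def edge_indices(ref, dis, c=1.0):
    """Mean (EA) and standard deviation (ED) of the pixel-wise edge similarity.

    c is a placeholder value for the stabilizing constant.
    """
    g_r, g_d = edge_map(ref), edge_map(dis)
    es = [(2 * r * d + c) / (r * r + d * d + c)
          for row_r, row_d in zip(g_r, g_d) for r, d in zip(row_r, row_d)]
    ea = sum(es) / len(es)
    ed = math.sqrt(sum((v - ea) ** 2 for v in es) / len(es))
    return ea, ed
```

For an undistorted image the similarity is 1 at every pixel, so EA is 1 and ED is 0; distortion lowers EA and usually raises ED.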

Color Feature Extraction and Dissimilarity Quantification.
HOG and edge detection extract features only in the gray channel. However, color plays a crucial role in the perception of an image [5]. To extract quality-aware features sufficiently, color histograms based on the HSV space are employed in this paper to describe color information. Before the color information is extracted, images are converted from the RGB space into the HSV space, where $H$, $S$, and $V$ are the hue, saturation, and value (intensity), respectively. In general, $H$, $S$, and $V$ are discretized into 8, 3, and 3 bins according to their magnitude, respectively. After discretization, $H$ is an integer between 0 and 7, and $S$ and $V$ are integers between 0 and 2.
Given an image $I$, the color histogram (HI) that describes the global distribution of pixels within the image can be represented as
$$\mathrm{HI}(I) = \{h_0, h_1, \ldots, h_{K-1}\},$$
where $K$ is the number of bins into which the color histogram is divided. In the HSV space, $h_k$ is computed as
$$h_k = \frac{1}{MN}\sum_{x=1}^{M}\sum_{y=1}^{N} \delta\left(L(x, y) - k\right), \qquad \delta(t) = \begin{cases} 1, & t = 0, \\ 0, & \text{otherwise}, \end{cases}$$
where $L(x, y) = w_H \times H(x, y) + w_S \times S(x, y) + w_V \times V(x, y)$, $M$ and $N$ denote the horizontal and vertical pixel numbers of the image, and $w_H$, $w_S$, and $w_V$ are parameters to be defined. In this paper, $w_H = 16$, $w_S = 4$, and $w_V = 1$.
The dissimilarity of the color histograms between the distorted and reference images is then quantified by the histogram intersection distance between $\mathrm{HI}_d$ and $\mathrm{HI}_r$, which represent the color histograms of the distorted image and its reference, respectively, accumulated over all $K$ bins. This dissimilarity is regarded as the third distortion index, $D_{\mathrm{COL}}$.
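A minimal sketch of the color index follows, using the standard-library `colorsys` module for the RGB-to-HSV conversion. Uniform quantization of the H, S, and V channels, the normalization of the histogram to sum to 1, and the "one minus overlap" form of the histogram intersection distance are assumptions; the weights 16, 4, and 1 are the values stated in the text.

```python
import colorsys

def quantize_hsv(r, g, b, h_bins=8, s_bins=3, v_bins=3):
    """Map one RGB pixel (components in [0, 1]) to integer H, S, V levels."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    hq = min(int(h * h_bins), h_bins - 1)
    sq = min(int(s * s_bins), s_bins - 1)
    vq = min(int(v * v_bins), v_bins - 1)
    return hq, sq, vq

def color_histogram(pixels, bins=(8, 3, 3), weights=(16, 4, 1)):
    """Normalized histogram of the code L = 16*H + 4*S + 1*V over all pixels."""
    # Largest reachable code plus one gives the histogram length.
    n_codes = sum(w * (n - 1) for w, n in zip(weights, bins)) + 1
    hist = [0.0] * n_codes
    for r, g, b in pixels:
        levels = quantize_hsv(r, g, b, *bins)
        hist[sum(w * q for w, q in zip(weights, levels))] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]

def histogram_intersection_distance(hi_ref, hi_dis):
    """One minus the overlap of two normalized histograms (assumed form)."""
    return 1.0 - sum(min(r, d) for r, d in zip(hi_ref, hi_dis))
```

Here `pixels` is a flat list of `(r, g, b)` tuples. Two images with identical color distributions give a distance of 0, and two images with no shared color codes give 1.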

Multiscale Features Fusion and Objective Quality Mapping
A multiscale approach is an advantageous way to incorporate image details at different resolutions. It has been found that human eyes can easily identify and process natural images at different scales [1]. Therefore, processing a natural image at different scales can increase the flexibility and adaptability of image quality evaluation [30]. In our IQA model, the three types of features mentioned above are extracted from the original-scale image and its corresponding downsampled versions. With all these quantified multiscale distortion indices, many methods can be employed to construct a function that synthesizes and maps them into objective quality scores. Traditionally, linear approaches are used to fuse and map the indices, for example, a weighted average of the indices [31]. In recent years, machine learning techniques such as KNN, deep learning, and SVR have been introduced to derive the single metric. The theory of KNN is simple, but its accuracy is unsatisfactory, especially for large-scale data. Deep learning is more precise, but it spends a lot of time building the topological structure and determining the values of the parameters. In addition, an efficient network model relies on a large amount of training data; with small samples, deep learning usually suffers from overfitting, so it is difficult to obtain ideal results. SVR is an accurate tool for constructing the regression function in IQA [8]. Since the experimental data provided by the databases are limited, SVR is employed to construct a function that synthesizes the distortion indices and maps them into an objective quality score. To be specific, $\varepsilon$-SVR with the radial basis function (RBF) kernel is employed in the proposed method [8,32]:
$$Q = f_{\mathrm{SVR}}\left(D_{\mathrm{HOG}1}, \mathrm{EA}_1, \mathrm{ED}_1, D_{\mathrm{COL}1}, D_{\mathrm{HOG}2}, \mathrm{EA}_2, \mathrm{ED}_2, D_{\mathrm{COL}2}\right), \tag{11}$$
where $D_{\mathrm{HOG}1}$, $\mathrm{EA}_1$, $\mathrm{ED}_1$, and $D_{\mathrm{COL}1}$ represent the indices of orientation, edge, and color information in the first (original) scale and $D_{\mathrm{HOG}2}$, $\mathrm{EA}_2$, $\mathrm{ED}_2$, and $D_{\mathrm{COL}2}$ represent the corresponding indices in the second scale, respectively. The framework of the dissimilarity quantification and quality mapping is shown in Figure 5, where $I$ is the original-scale image and $I_{2\downarrow}$ represents its reduced-scale version obtained by downsampling by 2, "HOG" denotes the HOG, "GM" is the edge map, "COL" means the color histogram, and the suffixes "_ref" and "_dis" indicate that the features are extracted from a reference image or a distorted image, respectively. $\varepsilon$-SVR is used as the machine learning tool to find the best $f_{\mathrm{SVR}}$ in (11), which minimizes the dissimilarity between subjective and objective quality scores. The function $f$ can be defined as
$$f(x) = \omega^{T}\varphi(x) + b,$$
where $x$ is the vector of distortion indices and $\omega$ and $b$ are the weight vector and the bias (deviation factor), respectively, which are chosen to ensure that the difference between the subjective score $y$ and the objective score $f(x)$ is less than $\varepsilon$ for all training data. $\varphi(\cdot)$ represents a nonlinear function that maps $x$ into a nonlinear feature space, and $\omega$ is represented as
$$\omega = \sum_{i=1}^{n_{\mathrm{SVR}}} \left(\alpha_i^{*} - \alpha_i\right)\varphi(x_i),$$
where $\alpha_i^{*}$ and $\alpha_i$ ($0 \le \alpha_i^{*}, \alpha_i \le C$, where $C$ is the error tradeoff parameter) represent the Lagrange multipliers and $n_{\mathrm{SVR}}$ denotes the number of support vectors [33]. The function $f$ can be rewritten as
$$f(x) = \sum_{i=1}^{n_{\mathrm{SVR}}} \left(\alpha_i^{*} - \alpha_i\right) K(x_i, x) + b,$$
where $K(x_i, x)$ is the RBF kernel function:
$$K(x_i, x) = \exp\left(-\gamma \left\| x_i - x \right\|^2\right),$$
where $\gamma$ controls the width of the kernel function. A high value of $\gamma$ will lead to overfitting, which reduces the generalization ability of the IQA model. On the contrary, the accuracy of the IQA model will be degraded if the value of $\gamma$ is too low. Therefore, the value of $\gamma$ has an influence on quality assessment.
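Once an SVR model has been trained, prediction only needs the support vectors, the differences of the Lagrange multipliers, the bias, and the kernel width. The sketch below evaluates this kernel expansion with an RBF kernel; all numeric values in the usage lines are placeholders rather than trained parameters, and the function names are hypothetical.

```python
import math

def rbf_kernel(x_i, x, gamma):
    """RBF kernel exp(-gamma * ||x_i - x||^2) for two equal-length vectors."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(x_i, x))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Kernel expansion: bias + sum_i coef_i * K(x_i, x).

    dual_coefs[i] plays the role of the multiplier difference for support
    vector i; in practice these values come from a trained model.
    """
    return bias + sum(coef * rbf_kernel(sv, x, gamma)
                      for sv, coef in zip(support_vectors, dual_coefs))

# Placeholder example: an 8-dimensional index vector (four indices, two scales).
indices = [0.12, 0.97, 0.03, 0.05, 0.10, 0.98, 0.02, 0.04]
svs = [indices, [0.0] * 8]
score = svr_predict(indices, svs, dual_coefs=[0.5, -0.2], bias=0.25, gamma=1.0)
```

Training itself would normally be delegated to a library; for instance, scikit-learn provides `sklearn.svm.SVR` with `kernel="rbf"` and the hyperparameters `C`, `epsilon`, and `gamma`.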

Experimental Setup.
Experiments are conducted on four large-scale image databases: LIVE [29], TID2008 [34], TID2013 [35], and CSIQ [36]. Each database contains hundreds or even thousands of distorted images corrupted by various types of distortions. The subjective ratings in each database, given in the form of either mean opinion score (MOS) or difference mean opinion score (DMOS), provide the perceptual quality of each distorted image. Detailed information about the databases is shown in Table 1.
For performance evaluation, the Pearson Linear Correlation Coefficient (PLCC), Spearman Rank-order Correlation Coefficient (SRCC), and Root Mean Square Error (RMSE) between objective and subjective quality scores are adopted as the evaluation metrics, where PLCC and RMSE are used to assess prediction accuracy and SRCC is adopted to evaluate prediction monotonicity. It should be mentioned that, before computing PLCC and RMSE, a nonlinear mapping is carried out between subjective and objective scores using the modified logistic regression model [20] because different databases adopt different schemes to quantify subjective IQA results. The nonlinear mapping function is defined by
$$q(x) = \beta_1\left(\frac{1}{2} - \frac{1}{1 + e^{\beta_2 (x - \beta_3)}}\right) + \beta_4 x + \beta_5,$$
where $x$ indicates the raw objective score and $\beta_1, \ldots, \beta_5$ are the parameters that describe the relationship between $x$ and DMOS. The quality difference scores associated with the distorted images are used for fitting the five parameters and establishing the prediction of the nonlinear mapping function.
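For illustration, the five-parameter logistic function commonly used for this kind of mapping can be written as a small Python function. The exact parameterization is an assumption based on the usual form in the IQA literature, and the parameter values in the usage line are arbitrary; in practice the five parameters are fitted by nonlinear least squares against the subjective scores.

```python
import math

def logistic5(x, b1, b2, b3, b4, b5):
    """Five-parameter logistic: b1 * (1/2 - 1 / (1 + exp(b2 * (x - b3)))) + b4 * x + b5."""
    return b1 * (0.5 - 1.0 / (1.0 + math.exp(b2 * (x - b3)))) + b4 * x + b5

# At x == b3 the logistic term vanishes and only the linear part remains.
mapped = logistic5(2.0, 10.0, 1.5, 2.0, 0.5, 3.0)
```

The logistic term captures the saturation of the objective score at both ends of the quality range, while the linear term `b4 * x + b5` absorbs differences in scale and offset between databases.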
Higher PLCC and SRCC values yet lower RMSE mean better performance of the IQA method.
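The three evaluation metrics can be computed with the standard library alone, as in the sketch below. SRCC is obtained as the Pearson correlation of the ranks, with tied values sharing their average rank; the function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def srcc(x, y):
    """Spearman rank-order correlation coefficient."""
    return pearson(ranks(x), ranks(y))

def rmse(x, y):
    """Root mean square error between two sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
```

PLCC and RMSE would be evaluated on the objective scores after the nonlinear mapping, whereas SRCC depends only on the rank order and is therefore unaffected by any monotonic mapping.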
It should be noted that fivefold cross-validation is adopted. Images from one database are split randomly into two non-overlapping subsets: 80% are used for training and 20% for testing. A predicted quality score is acquired for each image after testing, and then the performance is assessed based on all predicted quality scores. In order to eliminate performance bias, the random train-test split is repeated 1000 times, and the mean value across these 1000 iterations is reported as the final result in this paper.
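The repeated random-split protocol can be sketched as follows. The `evaluate` callback, which is expected to train on the first subset and return a metric such as PLCC on the second, is hypothetical, and so are the function name and the fixed seed; the text does not state whether the split is made per image or per reference content, so `items` may stand for either.

```python
import random

def repeated_random_split(items, evaluate, train_ratio=0.8, repeats=1000, seed=0):
    """Average a metric over repeated random train/test splits.

    evaluate(train, test) is a caller-supplied (hypothetical) function that
    fits a model on `train` and returns a score measured on `test`.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = items[:]
        rng.shuffle(shuffled)
        cut = int(round(train_ratio * len(shuffled)))
        # The two slices never overlap, so training and test sets are disjoint.
        scores.append(evaluate(shuffled[:cut], shuffled[cut:]))
    return sum(scores) / len(scores)
```

Fixing the seed makes the reported mean reproducible; keeping the individual scores instead of only their mean would also allow their spread to be summarized, for example as a box plot.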

Overall Performance across Databases.
Table 2 shows the overall performance comparison with state-of-the-art IQA methods, including MCSD [1], IFS [4], LCSIM3 [5], CLFE [6], CLR [8], VSI [9], SSIM [20], GMSD [25], ESIM2 [37], and QASD [38], on different databases. For SSIM, a framework based on structural similarity is introduced to evaluate the quality of images. With respect to GMSD, the gradient magnitude similarity deviation is presented to construct the IQA model. A number of new methods have recently been developed to extract quality-aware features for assessing image quality; for example, MCSD is utilized to represent the image quality. In addition, an adaptive subdictionaries index is put forward for IQA in [38]. LCSIM3 [5], IFS [4], CLFE [6], CLR [8], ESIM2 [37], and the proposed method try to improve the accuracy and robustness across different databases by multiple feature extraction, indicating that complementary feature construction is a promising solution for effective IQA development. However, one of the vital problems for IQA methods based on multiple features is to find a proper fusion function. With the machine learning tool SVR, our method achieves superior performance. In Table 2, the best results on each database are highlighted in boldface. The performance of the proposed method is superior to all the other compared IQA methods on TID2008 and TID2013 and close to the best on LIVE and CSIQ, whereas the other competitors may perform well on one or two databases but poorly on the others [39]. For example, earlier IQA methods such as SSIM perform well on only one or two databases, as does CLFE [6].
In addition, to analyze the statistical significance of the proposed method, a left-tailed significance test [40] with a significance level of 1% is carried out. It should be mentioned that the test uses the prediction residuals of the objective scores after nonlinear mapping. Table 3 lists the results, where the symbols "1", "0", and "−1" indicate that our method is statistically better than, indistinguishable from, or worse than the other methods, respectively. It can be found that our method is significantly better than the other methods on the TID2008 and TID2013 databases and has a small advantage on LIVE and CSIQ. Only a few indicators appear as 0: the proposed method is comparable to CLR and CLFE on LIVE and CSIQ and comparable to LCSIM3 on LIVE. However, CLR and CLFE perform poorly on the TID2008 and TID2013 databases. Consequently, the proposed method shows a performance advantage over the listed methods, since most of the values in Table 3 appear as 1.
Besides, a good IQA method should predict image quality consistently across many repeated trials. Thus, the box plot of the PLCC values generated during cross-validation on different databases is presented in Figure 6 to demonstrate the stability of the proposed method. In Figure 6, for each box, the center mark is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers, and outliers are plotted individually. The PLCC values of our method are concentrated and have short whiskers on the different databases, which implies that it achieves a stable and impressive performance.

Performance on Individual Distortion Type.
To further verify the robustness of the proposed method, the prediction accuracy on individual distortion types is evaluated. A method with high robustness should not only attain good accuracy on all distortion types but also remain stable across different distortion types. The LIVE database, which contains five distortion types, namely JPEG2000 (JP2K), JPEG, white Gaussian noise (WN), Gaussian blur (GBLUR), and fast fading (FF), is used to conduct the experiments.
Table 4 lists the SRCC comparison of the proposed method with others for all five individual distortion types on the LIVE database. The compared methods include MCSD [1], BIFS [3], LCSIM [5], GLD-SR [7], CLR [8], VSI [9], SSIM [20], GMSD [25], CSV [40], and SURF [41]. In Table 4, the top performance for each distortion type is again highlighted in boldface. It can be found that the proposed method obtains a remarkably high SRCC across the five distortion types. The excellent performance across different distortion types can be explained as follows. (i) WN can be detected by edge features because WN always results in a decrease of high-frequency components, which destroys the edge information. The same holds for JP2K and GBLUR, which cause an increase of low-frequency components and make the edges unclear. (ii) For JPEG, since new edges appear at the block boundaries while the edges of the original image weaken, a block-based feature extraction method is suited to JPEG. Therefore, in the proposed method, JPEG distortion can be detected by the HOG operation. Since HOG is mainly used to detect local distortion, it is also suited to FF, because FF is a local distortion appearing in random areas of an image. A practical image processing system is unpredictable and may encounter complex situations, so it is important for an IQA method to remain stable. In order to validate the performance across various distortion types more directly, the weighted average (w.a.) and standard deviation (std) of the SRCC over the five distortion types are shown in Table 4. For the w.a. calculation, the number of images of each distortion type is taken as the weight, similar to Table 2. A high w.a. value means excellent performance, while a low std value reveals that the method remains stable when evaluating images contaminated by different types of distortions. Our method has the largest w.a. (0.9824) and a comparatively small std (0.0027), which means that the proposed method can deal with all the listed distortion types very well.
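The two summary statistics can be computed as in the following sketch. The SRCC values and image counts in the usage lines are placeholders, not the figures of Table 4, and the use of the population standard deviation is an assumption.

```python
import statistics

def weighted_average(values, weights):
    """Weighted mean; here the weights are the image counts per distortion type."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def spread(values):
    """Population standard deviation of the per-distortion SRCC values."""
    return statistics.pstdev(values)

# Placeholder SRCC values and image counts for five distortion types.
srcc_values = [0.98, 0.98, 0.99, 0.98, 0.97]
image_counts = [100, 100, 100, 100, 100]
wa = weighted_average(srcc_values, image_counts)
sd = spread(srcc_values)
```

With equal weights the weighted average reduces to the plain mean; unequal image counts shift it toward the distortion types that contribute more images.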
To further demonstrate the performance of our method, scatter plots on the LIVE and TID2008 databases are shown in Figure 7, where the vertical axis shows the objective score generated by the nonlinear mapping function, the horizontal axis shows the subjective quality score, and every dot represents an image in the database. The objective scores correlate strongly with the subjective quality scores, as the scatter points lie close to the fitted curve.

Conclusion
In this paper, a complementary image quality representation, which consists of orientation, edge, and color information, is introduced to develop a highly effective IQA method. Experimental results on four public databases show that the proposed method achieves excellent performance in terms of both prediction accuracy and robustness across different databases and distortion types, which provides an insightful and promising solution for high-performance IQA development.

Figure 1: Framework of the proposed method.

Figure 4: (a) An original reference image, (b) gray image, (c) horizontal image with Prewitt operator, (d) vertical image with Prewitt operator, and (e) edge map of image. (f)-(j) A WN distorted version of the image and its corresponding results.

Figure 5: Structure diagram of dissimilarity quantification and quality mapping.

Table 1: Fundamental information about the four large-scale databases used in our experiments.

Table 2: Performance comparison on different databases.

Table 4: SRCC of the proposed method with others for all five individual distortion types on LIVE database.

CLFE [6] works well on LIVE, but it performs poorly on TID2008 and TID2013. The earlier IQA methods perform poorly because they employ only one feature to describe the degradation of image quality. However, in recent years, many IQA methods based on combinational and joint representation of image quality have been proposed. For CLR and CLFE, an edge feature, a local spatial-frequency feature, and a texture feature are used to represent the image quality complementarily, and SVR is employed to fuse the multiple distortion indices. In LCSIM3 and ESIM2, several earlier IQA approaches are used to evaluate the image quality jointly. From Table 2, it can be seen that the methods that employ complementary features obtain the expected outcome. There are only five distortion types contained in the LIVE database. Therefore, a number of IQA methods perform well on the LIVE database, especially CLFE. This means that edge and texture features are enough to describe the degradation of image quality in the LIVE database. However, other features need to be extracted to represent the image quality because edge and texture features cannot accurately represent all types of distortion. Thus, color histograms are employed to describe color information in this paper. Table 2 shows that our algorithm has strong robustness, whereas CLFE performs well only on the LIVE database.