Inferring Visual Perceptual Object by Adaptive Fusion of Image Salient Features

Saliency computational models with active environment perception can be useful for many applications, including image retrieval, object recognition, and image segmentation. Previous work on bottom-up saliency computation typically relies on hand-crafted low-level image features. However, adapting a saliency computational model to different kinds of scenes remains a challenge: a low-level image feature may contribute greatly to saliency computation on some images but be detrimental on others. In this work, a novel data-driven approach is proposed to adaptively select proper features for different kinds of images. The method exploits the low-level features that carry the most distinguishable saliency information per image, and then computes the image saliency with an adaptive weight selection scheme. A large number of experiments are conducted on the MSRA database to compare the performance of the proposed method with state-of-the-art saliency computational models.


Introduction
Saliency computational models with active environment perception can be useful for many applications, including image retrieval, object recognition, and image segmentation. Generally, visual saliency can be defined as what captures human perceptual attention. Saliency detection plays an important role in image analysis and processing because it allows limited computational resources to be allocated effectively. For example, visual saliency detection can be used to automatically zoom the "interesting" areas [1] or automatically crop the "important" areas in an image [2]. Object recognition algorithms can use the results of saliency detection to quickly locate visually salient objects. Salient object detection can also reduce the interference of cluttered backgrounds and thereby improve the performance of image segmentation algorithms and image retrieval systems [3].
Most existing saliency computational models are based on the bottom-up mechanism because visual attention is generally driven by low-level stimuli such as edge [4], color [5,6], orientation [7], and symmetry [8].
These models typically contain two main procedures. The first step extracts low-level features from the input image; the saliency map is then computed by fusing the extracted features. Low-level image features have been studied extensively in the past. However, selecting the proper features that carry saliency information for a given image remains complex and difficult, mainly because no single well-defined feature can exhaustively capture saliency information across different images. Most existing saliency computational models therefore face great difficulty in adaptively selecting low-level image features for different images.
To address this problem, this paper puts forward an adaptive fusion scheme over low-level image features for saliency detection. The method first extracts various low-level features that broadly reflect the saliency information of the image. The visual perceptual object is then detected through adaptive weight selection over these low-level features, which retains the most significant features for saliency computation. The flowchart of the proposed method is illustrated in Figure 1.

Related Works
Visual saliency reflects how much an image region or object stands out from its surroundings. A saliency computational model aims to provide a numerical measure of the degree of visual saliency and is useful for a wide range of applications. Consequently, a number of computational approaches for salient object detection have been developed in recent years. Based on the biologically plausible visual attention architecture [9] and the feature integration theory [10], Itti et al. [11] proposed a bottom-up saliency model which depends only on low-level image features. The salient object in an image is determined using dynamic neural networks for feature fusion. Following this model, many state-of-the-art saliency computational models focus on low-level features such as color and contrast [12,13].
For example, Achanta et al. [14,15] used luminance and color features to detect the salient object in an image. Their method calculated the contrast between a local image region and its surrounding region and used the average color vector difference to obtain the saliency value. Aziz and Mertsching [8], Liu et al. [16], and Cheng et al. [17] proposed contrast computational models based on region segmentation. These methods used image segmentation algorithms to divide the image into regions according to the homogeneity of low-level image features such as luminance, texture, and color.
Most existing saliency computational models are based on low-level image features, and various formulations have been proposed for exploiting well-defined salient features. Gao and Vasconcelos [18] studied the statistics of natural images and constructed an optimal detector based on discriminant saliency, which can effectively integrate saliency features. Klein and Frintrop [19] detected the salient object by reconstituting the cognitive visual attention model; their approach computes the saliency of all feature channels in an information-theoretic way, and the optimal features are determined by fusing the channels with the Kullback-Leibler divergence (KLD). Lu et al. [20] put forward a diffusion-based salient object detection model, which learns optimal saliency seeds by combining two kinds of features: bottom-up saliency maps and mid-level vision cues.
However, most existing saliency computational models fail to consider the adaptability of different low-level features to different images. Low-level features that contribute greatly on some images may actually be detrimental to saliency detection on others. In this paper, a novel data-driven approach is proposed to adaptively select proper features for different images. The method exploits the most distinguishable low-level features carrying salient information per image, and then calculates the image saliency based on an adaptive weight selection scheme.
The rest of this paper is organized as follows. Section 3 details the extraction of the different low-level image features. Section 4 puts forward the adaptive feature fusion scheme. The experimental results are given in Section 5. Finally, the conclusions are drawn in Section 6.

Low-Level Image Feature Extraction
Physiological experiments indicate that human attention to an image is mainly driven by low-level image features [21]. In order to fully and accurately describe the saliency information in an image, ten low-level image features are chosen according to their visual properties: lightness, color, contrast, intensity, edge, orientation, shape, gradient, coarseness, and sharpness. This section describes the extraction of these features and analyzes the contribution of each feature to saliency computation.
The low-level features are extracted using a block-based method: the input image (denoted by I) is divided into 8 × 8 blocks (denoted by b) with 50% overlap. Let F_k(I), k = 1, …, 10, denote the saliency map corresponding to each low-level feature; F_k(b) then denotes the saliency value of block b.
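The block decomposition above can be sketched in Python with NumPy. The 8 × 8 block size and 50% overlap (i.e., a stride of 4 pixels) follow the text; the function name is illustrative:

```python
import numpy as np

def blocks_50_overlap(img, size=8):
    """Split a 2-D array into size x size blocks with 50% overlap
    (stride = size // 2), as described in the text."""
    step = size // 2
    h, w = img.shape
    out = []
    for y in range(0, h - size + 1, step):
        for x in range(0, w - size + 1, step):
            out.append(img[y:y + size, x:x + size])
    return out

img = np.arange(16 * 16, dtype=float).reshape(16, 16)
blocks = blocks_50_overlap(img)
print(len(blocks))  # 3 block positions per axis on a 16x16 image -> 9
```

Each per-block feature value F_k(b) is then computed on these overlapping windows.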

Lightness Feature Extraction.
Lightness is the attribute of human visual perception corresponding to the amount of light radiated or emitted by a visible object. The lightness feature may affect the performance of other features to a certain extent.
The lightness feature measures how the brightness of each block differs from the average brightness of the image; an object of high brightness tends to be treated as salient by the human eye. Let F_1(b) denote the Euclidean distance between the average lightness value of image block b and that of the image I; a larger distance stands for a greater lightness difference. The computation is as follows.
Firstly, the input image is converted from the RGB color space to the perceptually uniform Lab color space. The lightness feature is then obtained as
F_1(b) = |l(b) − l(I)|,
where l(b) and l(I) represent the average lightness components (the L channel in Lab space) of block b and image I, respectively. The lightness feature can well distinguish the illumination differences of objects in an image.
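Assuming the RGB-to-Lab conversion has already been performed, the lightness feature reduces to a mean-difference computation on the L channel; a minimal sketch:

```python
import numpy as np

def lightness_feature(L_block, L_image):
    # F_1(b): absolute difference between the block's mean lightness
    # and the global mean lightness (L channel of the Lab image).
    return abs(L_block.mean() - L_image.mean())

L_img = np.full((16, 16), 40.0)
L_img[4:12, 4:12] = 80.0              # bright square on a dark background
bright_block = L_img[4:12, 4:12]
background_block = L_img[0:8, 0:8]
print(lightness_feature(bright_block, L_img))      # 30.0 (global mean is 50)
print(lightness_feature(background_block, L_img))  # 0.0
```

The bright square scores high while a block matching the global mean scores zero, which is exactly the behavior the feature is designed for.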

Color Feature Extraction.
As a salient object has strong contrast or strong variation, its color value lies relatively far from the average color value of the whole image, whereas the background can be seen as a smooth area whose color value is close to the average. Therefore, the color feature is extracted by calculating the Euclidean distance between the average color value of each image block and that of the whole image. The color feature F_2(b) of image block b is computed via
F_2(b) = {[a*(b) − a*(I)]² + [b*(b) − b*(I)]²}^{1/2},
where a*(·) and b*(·) represent the average a and b color components in the Lab color space, respectively. Most existing saliency computational models operate in the RGB or Lab color space. RGB is mainly used for color representation, while CIELAB provides a representation of color that corresponds to how observers perceive chromatic differences. Thus, this method extracts the color feature in the Lab color space.
The color feature gives a simple description of the global distribution of colors in an image and thus measures the proportion of the whole image accounted for by different colors. It is especially suitable for images in which the spatial position of the object need not be taken into account.
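The color distance F_2(b) over the a and b channels can be sketched the same way; the toy arrays below stand in for real Lab chromaticity channels:

```python
import numpy as np

def color_feature(a_block, b_block, a_image, b_image):
    # F_2(b): Euclidean distance between the block's mean (a, b)
    # chromaticity and the whole image's mean (a, b) in Lab space.
    da = a_block.mean() - a_image.mean()
    db = b_block.mean() - b_image.mean()
    return float(np.hypot(da, db))

a_img = np.zeros((8, 8)); a_img[0:4, 0:4] = 30.0   # reddish patch (a > 0)
b_img = np.zeros((8, 8))                           # neutral b channel
patch_a, patch_b = a_img[0:4, 0:4], b_img[0:4, 0:4]
print(color_feature(patch_a, patch_b, a_img, b_img))  # 22.5
```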

Contrast Feature Extraction.
The saliency of an image block b in image I depends on the difference between the block and its surroundings rather than on the visual characteristics of the block itself. Thus, the greater the difference between the image block and its surroundings, the more likely it is a salient region. The contrast feature F_3(b) of image block b is extracted as follows.
First, each block is converted to a luminance block L(b) via
L(x, y) = (b₀ + k · g(x, y))^γ,
where g(x, y) is the grayscale value and b₀ = 0.7656, k = 0.0364, and γ = 2.2 describe the display conditions of the Adobe RGB color space. The contrast feature of image block b can then be expressed as
F_3(b) = σ(L(b)) / μ(L(b)),
where σ(L(b)) and μ(L(b)) represent the standard deviation and the average value of L(b), respectively. Contrast is critically important to visual attention; generally, the greater the contrast, the sharper the image appears.
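A sketch of the contrast computation, assuming the luminance model L = (b₀ + k·g)^γ with the constants quoted in the text:

```python
import numpy as np

B0, K, GAMMA = 0.7656, 0.0364, 2.2   # display-model constants from the text

def contrast_feature(block):
    # Luminance via L = (B0 + K * g) ** GAMMA, then F_3(b) = sigma / mu.
    L = (B0 + K * block.astype(float)) ** GAMMA
    return L.std() / L.mean()

flat = np.full((8, 8), 128.0)
checker = (np.indices((8, 8)).sum(axis=0) % 2) * 255.0
print(contrast_feature(flat))                              # 0.0
print(contrast_feature(checker) > contrast_feature(flat))  # True
```

A uniform block yields zero contrast, while a high-frequency checkerboard yields a ratio close to one.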

Intensity Feature Extraction.
The intensity feature (denoted by v(I)) can be treated as a lightness-like feature and is obtained by averaging the three RGB color channels R(I), G(I), and B(I). The intensity feature also reflects the brightness information and the color variations of an image, which correspond to human subjective perception. The intensity feature F_4(b) of image block b is generated by calculating the Euclidean distance between the average intensity of the block and that of the whole image:
F_4(b) = |v(b) − v(I)|,
where v(b) and v(I) represent the average intensity values of b and I, respectively. Human perception of brightness is mainly associated with the luminous intensity of the observed object.

Edge Feature Extraction.
An edge refers to a collection of image pixels at which the grayscale intensity exhibits a strong contrast change; edges are often associated with discontinuities of the image grayscale. We therefore use the edge feature to capture image regions with dramatic brightness variations. Let E(b) denote the binary map obtained through Roberts edge detection. The edge feature F_5(b) of image block b is then the average value of E(b):
F_5(b) = μ(E(b)).

Orientation Feature Extraction.
The orientation feature, like the color feature, is a global feature. Based on the Itti model, the proposed method calculates the orientation feature by running Gabor filters on the grayscale image (denoted by g(x, y)) of the input image I. The Gabor wavelet extracts directional features effectively while eliminating redundant information. The two-dimensional Gabor function is
G(x, y) = exp(−(x′² + γ²y′²)/(2σ²)) cos(2πx′/λ + ψ),  with x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ,
where γ = 0.5 represents the spatial aspect ratio and σ = 0.56 represents the standard deviation of the Gaussian factor. The parameters λ, ψ, and θ denote the wavelength, the phase offset, and the angle, respectively. The orientation maps (denoted by G_θ(I), θ ∈ {0°, 45°, 90°, 135°}) are obtained by convolving the grayscale image with the corresponding Gabor kernels. Let F_6(b) denote the orientation feature of image block b, computed as the Euclidean distance between the average orientation responses of the block and of the whole image:
F_6(b) = [Σ_θ (o_θ(b) − o_θ(I))²]^{1/2},
where o_θ(b) and o_θ(I) represent the average orientation values of G_θ(b) and G_θ(I), respectively. The orientation feature has global properties and thus yields stable and feasible saliency information even in complex scenes.
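A bank of Gabor kernels for the four orientations can be sketched as follows. Only γ = 0.5 is fixed by the text; the wavelength λ and kernel radius are illustrative, the Gaussian σ is taken as 0.56·λ (a common convention, whereas the text simply states σ = 0.56), and the phase offset ψ is set to 0 — all assumptions of this sketch:

```python
import numpy as np

def gabor_kernel(theta, lam=4.0, gamma=0.5, half=7):
    # 2-D Gabor kernel: Gaussian envelope times an oriented cosine wave.
    sigma = 0.56 * lam            # assumption: sigma scales with wavelength
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * xr / lam)

thetas = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]   # 0, 45, 90, 135 degrees
kernels = [gabor_kernel(t) for t in thetas]
print([k.shape for k in kernels])
```

Convolving the grayscale image with each kernel gives the four orientation maps G_θ(I); a vertical grating responds most strongly to the 0° kernel.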
Shape Feature Extraction.
The shape feature is extracted using Hu's moment invariants, which are invariant to rotation, translation, and scale. Hu moment invariants are commonly used to identify large objects in an image and describe the shape of an object well. The proposed method calculates the seven moments (denoted by H_m, m = 1, …, 7) by Hu's method, obtained by normalizing the central moments of orders two and three.
Let H_m(b) and H_m(I), m = 1, …, 7, denote the seven moments of image block b and input image I, respectively. The shape feature F_7(b) of b is then the Euclidean distance between the two moment vectors:
F_7(b) = [Σ_{m=1}^{7} (H_m(b) − H_m(I))²]^{1/2}.
The shape feature is insensitive to lightness and contrast changes and thus effectively reduces the influence of illumination.
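Hu's seven invariants are built from normalized central moments; a sketch of the first two (the full set adds five more combinations involving order-3 moments):

```python
import numpy as np

def hu_first_two(img):
    # Normalized central moments eta_pq, then Hu's first two invariants
    # phi1 and phi2 (the remaining five use order-3 moments similarly).
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    m00 = img.sum()
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00

    def eta(p, q):
        mu = ((x - xc) ** p * (y - yc) ** q * img).sum()
        return mu / m00 ** (1 + (p + q) / 2.0)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return np.array([phi1, phi2])

square = np.zeros((32, 32)); square[8:24, 8:24] = 1.0
shifted = np.zeros((32, 32)); shifted[4:20, 12:28] = 1.0
print(np.allclose(hu_first_two(square), hu_first_two(shifted)))  # True
```

The printed check illustrates the translation invariance the text relies on: the same square moved elsewhere yields identical invariants.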

Gradient Feature Extraction.
The gradient feature is sensitive to gradient variation but insensitive to the absolute grayscale of the image. The image gradient captures contour, silhouette, and some texture information while further weakening the influence of illumination. Let g(x, y) be the grayscale at pixel (x, y) in an image region of size M × N; the gradient feature F_8(b) of image block b is calculated by averaging the squared horizontal (abscissa) gradient G_x and the squared vertical (ordinate) gradient G_y:
F_8(b) = (1/(MN)) Σ_{x,y} [G_x(x, y) + G_y(x, y)].
The gradient value describes the magnitude of dramatic changes in pixel values; thus the gradient map constituted by the pixel gradient values reflects local grayscale changes in the image.
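A sketch of the block gradient feature using finite differences:

```python
import numpy as np

def gradient_feature(block):
    # F_8(b) sketch: mean of squared horizontal plus squared vertical
    # finite differences over the block.
    g = block.astype(float)
    gx = np.diff(g, axis=1)   # abscissa (horizontal) differences
    gy = np.diff(g, axis=0)   # ordinate (vertical) differences
    return (gx ** 2).mean() + (gy ** 2).mean()

flat = np.full((8, 8), 7.0)
ramp = np.tile(np.arange(8.0), (8, 1))   # grayscale rises left to right
print(gradient_feature(flat), gradient_feature(ramp))  # 0.0 1.0
```

Note the feature depends only on grayscale differences, not on the absolute level: adding a constant to the block leaves it unchanged.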
Coarseness Feature Extraction.
Coarseness is the fundamental perceptual texture feature. It measures the particle size of the texture pattern: a larger particle size means a coarser texture. The coarseness feature is calculated as follows.
Firstly, the average gray value A_k(x, y), k = 1, 2, …, 5, over the neighborhood of size 2^k × 2^k in image I is calculated as
A_k(x, y) = Σ_{i=x−2^{k−1}}^{x+2^{k−1}−1} Σ_{j=y−2^{k−1}}^{y+2^{k−1}−1} g(i, j) / 2^{2k},
where g(i, j) is the gray value at pixel (i, j) in the active window.
Then, for each pixel, the average intensity differences E_{k,h}(x, y) and E_{k,v}(x, y) are calculated between the nonoverlapping neighborhoods on opposite sides of the pixel in the horizontal and vertical directions, respectively:
E_{k,h}(x, y) = |A_k(x + 2^{k−1}, y) − A_k(x − 2^{k−1}, y)|,
E_{k,v}(x, y) = |A_k(x, y + 2^{k−1}) − A_k(x, y − 2^{k−1})|.
Finally, for each pixel the optimal size is set by the k which gives the highest value of E, and the coarseness crs(I) is the average of S_best(x, y) = 2^k over the image:
crs(I) = (1/(MN)) Σ_{x,y} S_best(x, y).
Let crs(b) and crs(I) denote the coarseness of image block b and image I, respectively. The coarseness feature F_9(b) of b is then the Euclidean distance between them:
F_9(b) = |crs(b) − crs(I)|.
The coarseness feature represents the surface properties of the whole image and describes the integrity of the salient object well. It also has good rotation invariance and can effectively resist the interference of noise.
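The Tamura-style coarseness described above can be sketched with an integral image for the neighborhood averages. The maximum scale kmax and the edge padding are implementation choices not fixed by the text:

```python
import numpy as np

def tamura_coarseness(img, kmax=3):
    # A_k: mean over 2^k x 2^k neighbourhoods (via an integral image);
    # E_k: differences between non-overlapping neighbourhoods on either
    # side of each pixel; S_best = 2^k for the k maximising E.
    g = img.astype(float)
    h, w = g.shape
    pad = 2 ** kmax
    gp = np.pad(g, pad, mode="edge")
    c = gp.cumsum(axis=0).cumsum(axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))        # c[i, j] = sum of gp[:i, :j]

    def box_mean(y0, x0, s):               # mean of gp[y0:y0+s, x0:x0+s]
        return (c[y0 + s, x0 + s] - c[y0, x0 + s]
                - c[y0 + s, x0] + c[y0, x0]) / (s * s)

    best = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            yy, xx = y + pad, x + pad
            e_best, s_best = -1.0, 2
            for k in range(1, kmax + 1):
                s, half = 2 ** k, 2 ** (k - 1)
                eh = abs(box_mean(yy - half, xx, s) - box_mean(yy - half, xx - s, s))
                ev = abs(box_mean(yy, xx - half, s) - box_mean(yy - s, xx - half, s))
                if max(eh, ev) > e_best:
                    e_best, s_best = max(eh, ev), s
            best[y, x] = s_best
    return best.mean()

texture = (np.indices((16, 16)).sum(axis=0) // 4 % 2) * 255.0  # 4-px blocks
print(tamura_coarseness(texture))
```

The integral image makes each neighborhood mean an O(1) lookup, so the per-pixel scale search stays cheap even for the larger windows.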
Sharpness Feature Extraction.
The sharpness feature measures how the acutance of each region differs from its surroundings, indicating the contrast between adjacent regions. The proposed method extracts the sharpness value s(p) at position p by convolving the input image with the first-order derivatives of the Gaussian, G′_x(x, y) and G′_y(x, y), in the horizontal and vertical directions, where g(x, y) is the grayscale at pixel (x, y) and σ is the scale of the Gaussian filter.
The Gaussian-derivative method measures acutance variation while suppressing the influence of noise and illumination. Let s(b) and s(I) denote the average sharpness values of image block b and input image I, respectively. The sharpness feature F_10(b) of b is then the Euclidean distance between s(b) and s(I):
F_10(b) = |s(b) − s(I)|.
Sharpness represents image definition and edge acuteness: the higher the sharpness, the higher the image contrast. The sharpness feature is comparatively insensitive to local variations.
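A sketch of the sharpness map. The two directional Gaussian-derivative responses are combined as a gradient magnitude, which the paper does not spell out and is therefore an assumption of this sketch:

```python
import numpy as np

def gaussian_deriv_kernels(sigma=1.0, half=3):
    # First-order derivatives of a 2-D Gaussian at scale sigma.
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    g = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return -x / sigma ** 2 * g, -y / sigma ** 2 * g

def correlate_valid(img, k):
    # Plain 'valid'-mode correlation, to keep the sketch dependency-free.
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * img[i:i + oh, j:j + ow]
    return out

def sharpness_map(img, sigma=1.0):
    gx, gy = gaussian_deriv_kernels(sigma)
    sx = correlate_valid(img.astype(float), gx)
    sy = correlate_valid(img.astype(float), gy)
    return np.hypot(sx, sy)   # combine directions as gradient magnitude

flat = np.full((16, 16), 9.0)
step = np.zeros((16, 16)); step[:, 8:] = 255.0   # sharp vertical edge
print(sharpness_map(flat).max(), sharpness_map(step).max() > 0.0)
```

A flat region produces (numerically) zero response, while a sharp edge produces a strong one, which is the contrast the feature encodes.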

Adaptive Fusion of Low-Level Image Features
According to the discreteness and clarity of the saliency map of each low-level feature, different weights are assigned to different features. Low-level features that contribute greatly on a given image are assigned a large weight; features that may be detrimental to saliency detection are assigned a low weight or ignored entirely during saliency computation. Let ν_k denote the statistical validity and ω_k the weight of feature map F_k(I), k = 1, 2, …, 10. The statistical validity ν_k is computed from the variance σ_k² and the kurtosis κ_k of F_k(I).
The weights ω_k of the feature maps F_k(I) are determined by the numerical magnitudes of the statistical validities ν_k, where ν* = sort{ν_k} in descending order; that is, the proposed method assigns weights according to the rank of ν_k.
The final fusion map F_S is calculated as the weighted sum of the ten feature maps:
F_S = Σ_{k=1}^{10} ω_k F_k(I).
To enhance the robustness of the detection process and achieve a preferable visual effect, the proposed method is performed at three scales {100%, 50%, 25%} to better restrain the background information.
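The fusion pipeline can be sketched end-to-end. The validity score below (variance times kurtosis, favoring peaked, high-variance maps) is a stand-in for the paper's exact formula, and the linearly decaying rank-based weights are likewise illustrative:

```python
import numpy as np

def kurtosis(x):
    x = x.ravel().astype(float)
    m, s2 = x.mean(), ((x - x.mean()) ** 2).mean()
    if s2 == 0.0:
        return 0.0          # guard: a constant map carries no information
    return ((x - m) ** 4).mean() / s2 ** 2

def fuse(feature_maps):
    # Hypothetical statistical validity (stand-in for the paper's nu_k),
    # then weights that decay linearly with the validity rank.
    v = np.array([f.var() * kurtosis(f) for f in feature_maps])
    order = np.argsort(-v)                       # descending validity
    w = np.zeros(len(feature_maps))
    w[order] = np.linspace(1.0, 0.0, len(feature_maps))
    w /= w.sum()
    fused = sum(wi * f for wi, f in zip(w, feature_maps))
    return fused, w

rng = np.random.default_rng(0)
blob = np.zeros((32, 32)); blob[15:17, 15:17] = 1.0   # compact salient blob
noise = rng.uniform(size=(32, 32))                     # uninformative map
flat = np.zeros((32, 32))                              # empty map
fused, w = fuse([blob, noise, flat])
print(w)   # largest weight goes to the blob map
```

The worst-ranked map receives weight zero, matching the text's point that detrimental features can be ignored entirely; in the full method this fusion would be repeated at the three scales and the results combined.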
The obtained saliency map is then refined using the center prior to enhance the visual effect. When humans watch a picture, they naturally gaze at objects near the center of the image [22]. Thus, to obtain salient objects closer to human visual fixations, more weight is added toward the image center. A distance term D(b) is therefore included to indicate the distance between each image block and the image center, and each block feature F_k(b), k = 1, …, 10, is recalculated accordingly, where (x(b), y(b)) denotes the upper-left coordinate of image block b, (x*, y*) the center coordinate of the estimated image region, and W and H the width and height of the image region, respectively. The feature map F*_k(I) of the whole image is generated by combining the recalculated features F*_k(b) of all blocks and then normalized. Finally, the generated saliency map is smoothed by a Gaussian filter (template size 10 × 10, σ = 2.5). The saliency maps of each low-level feature and the final saliency maps are shown in Figure 2. As illustrated in Figure 2, the ten low-level image features have their own advantages and disadvantages on different images; in contrast, the final saliency map adaptively fuses the optimal features to achieve better performance.

Mathematical Problems in Engineering
In addition, we provide statistics of the weights of the ten low-level image features over different images; the results are shown in Figure 3. As illustrated in Figure 3, the sharpness feature and the shape feature exhibit good stability in our testing: their weights change little under the deviations caused by local contrast transformations. In contrast, the gradient feature reflects only local differences.

Experimental Results
The performance evaluation is conducted on the MSRA salient object database [16]. This database contains two parts: (i) image set A, containing 20,000 images whose principal salient objects are labeled by three users, and (ii) image set B, containing 5,000 images whose principal salient objects are labeled by nine users. The proposed multifeature fusion (MF) approach is compared with seven state-of-the-art methods: Itti's (IT) method [11], the spectral residual (SR) method [23], the saliency using natural statistics (SUN) method [24], the frequency-tuned (FT) method [25], the S3 method [26], the nonparametric (NP) method [27], and the context-aware (CA) method [28].
Figure 4 illustrates the performance comparison of the various salient region detection methods. As can be seen from Figure 4, the saliency maps extracted by the proposed method are more consistent with the ground-truth rectangles, and the detected salient objects are more similar to the ground-truth binary masks. The approaches developed in [23,24,26] fail to clearly identify the location of the salient object against a complex background. The saliency maps generated by [11,27] look rather blurry, making it difficult to distinguish the salient region clearly. The saliency map of [25] retains a lot of background information. The CA method [28] achieves good detection performance on some images; however, it is unable to highlight the entire salient object.
The objective assessment is implemented by computing the true positive rate (TPR) and the false positive rate (FPR). Given the ground-truth binary mask G(x, y) and the obtained saliency map S(x, y) (0 ≤ S(x, y) ≤ 1), a threshold t (0 ≤ t ≤ 1) is used to obtain the binary mask B_t(x, y), in which 0 denotes the background and 1 denotes the salient objects. Figure 5 shows the TPR and FPR results of the seven reference methods and the proposed method. As seen in Figure 5, the overall performance of the proposed method exceeds that of the other seven methods. Given the generated saliency map S(x, y), we set a threshold (computed by Otsu's method) to segment the salient objects; the resulting binary mask is denoted by B(x, y).

Figure 4: Saliency maps obtained from the proposed method (MF) and state-of-the-art saliency computational models.
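TPR and FPR at a given threshold can be sketched as:

```python
import numpy as np

def tpr_fpr(saliency, gt, t):
    # Binarise the saliency map at threshold t, then compare with the
    # ground truth: TPR = TP / positives, FPR = FP / negatives.
    pred = saliency >= t
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    return float(tp / gt.sum()), float(fp / (~gt).sum())

gt = np.zeros((8, 8), dtype=bool); gt[2:6, 2:6] = True
sal = np.where(gt, 0.9, 0.2)       # an idealised saliency map
print(tpr_fpr(sal, gt, 0.5))       # (1.0, 0.0)
```

Sweeping t from 0 to 1 and plotting the (FPR, TPR) pairs yields the ROC curves of Figure 5.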
Let G and B denote the ground-truth binary mask and the binary mask of the proposed approach, respectively. Then Precision = A(B ∩ G)/A(B) and Recall = A(B ∩ G)/A(G), where A(·) denotes the area of the salient region. The F-measure evaluation criterion is computed via
F_β = ((1 + β²) · Precision · Recall) / (β² · Precision + Recall).
The proposed method uses β² = 0.5 to weight precision more heavily than recall. The precision, recall, and F-measure of these methods are shown in Table 1.
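The precision/recall/F-measure computation with β² = 0.5 can be sketched as:

```python
import numpy as np

def f_measure(pred, gt, beta2=0.5):
    # F = (1 + beta^2) * P * R / (beta^2 * P + R); beta^2 = 0.5 weights
    # precision more heavily than recall, as in the text.
    inter = np.logical_and(pred, gt).sum()
    precision = inter / pred.sum()
    recall = inter / gt.sum()
    return float((1 + beta2) * precision * recall
                 / (beta2 * precision + recall))

gt = np.zeros((10, 10), dtype=bool); gt[2:8, 2:8] = True      # 36-px object
pred = np.zeros((10, 10), dtype=bool); pred[2:8, 2:6] = True  # left 24 px
print(round(f_measure(pred, gt), 4))   # precision 1.0, recall 2/3 -> 0.8571
```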
Finally, we compare the computational complexity of the saliency computational models discussed. All models are implemented in MATLAB and run on a PC with a Pentium G2020 CPU and 4 GB of RAM. Table 2 shows the results for the proposed method and the other methods. The proposed method incurs a slightly higher computational load than the conventional approaches; however, it achieves more accurate saliency detection across various images.

Conclusion
In this paper, a novel feature selection scheme is proposed to adaptively select the proper features for different images. The method exploits the most distinguishable salient information among ten low-level features per image. The generated saliency map can highlight the salient object in different images, even those with complex backgrounds. A large number of experiments are conducted on the MSRA database to compare the proposed method with state-of-the-art saliency computational models, and the experimental results indicate that the proposed method achieves better performance.

Figure 1: Flowchart of the proposed saliency computational model.

Figure 5: ROC curves obtained from the proposed method and state-of-the-art saliency computational models.

Table 1: The precision, recall, and F-measure of the saliency computational models.

Table 2: Average execution time (in seconds) for the saliency computational models.