Salient Region Detection via Feature Combination and Discriminative Classifier

We introduce a novel approach to detect salient regions of an image via feature combination and a discriminative classifier. Our method, which is based on hierarchical image abstraction, uses logistic regression to map a regional feature vector to a saliency score. Four saliency cues are used as atomic features of each segmented region in the image: color contrast in a global context, center-boundary priors, spatially compact color distribution, and objectness. By mapping the four-dimensional regional feature to a fifteen-dimensional feature vector, we can linearly separate the salient regions from the cluttered background by finding an optimal linear combination of feature coefficients in the fifteen-dimensional feature space; we then fuse the saliency maps across multiple levels. Furthermore, we introduce the weighted salient image center into our saliency analysis task. Extensive experiments on two large benchmark datasets show that the proposed approach achieves the best performance among several state-of-the-art approaches.


Introduction
Humans have the ability to locate the most interesting region in a cluttered visual scene by selective visual attention. One task of computer vision is to simulate this human ability, and the related research has been carried out for many years. The study of human visual systems suggests that saliency is related to the rarity, uniqueness, and surprise of a scene. Saliency detection has recently gained much attention, as it has been brought into various applications, including image classification [24], object recognition [25], and content-aware image editing [26].
Existing salient region detection methods can be roughly classified into two categories: bottom-up, data-driven approaches and top-down, task-driven approaches. Bottom-up methods, which utilize low-level image features such as color, intensity, and texture, determine the contrast of image regions to their surroundings, while top-down methods make use of high-level knowledge about "interesting" objects. Most bottom-up models can be roughly divided into local and global schemes.
Inspired by the early work of Treisman and Gelade [27] and Koch and Ullman [28], Itti et al. [1] proposed a highly influential, biologically plausible saliency analysis method that defines image saliency using local center-surround operators across multiscale image features, including intensity, color, and orientation. Harel et al. [4] proposed a method that generates a saliency map by nonlinearly combining local uniqueness maps from different feature channels. Ma and Zhang [29] proposed an approach that directly computes the center-surround color difference in a fixed neighborhood for each pixel and then uses a fuzzy growth model to extract the salient region of the image. They classify saliency into three levels: attended view, attended areas, and attended points. Liu et al. [19] proposed a set of features, including center-surround histogram, multiscale contrast, and color spatial distribution, which are unified in a CRF learning framework to detect salient regions in images.
Later on, many saliency models were proposed that exploit various types of image features in a global scope. Hou and Zhang [3] proposed a spectral residual method that relies on frequency domain processing. Zhai and Shah [2] defined pixel-level saliency based on a pixel's contrast to all other pixels; to improve computational efficiency, they introduced the color histogram to analyze image saliency. Achanta et al. [5] proposed a frequency-tuned method that achieves globally consistent results by defining saliency as the distance between the pixel color and the overall mean image color. Cheng et al. [6] also utilized the color histogram and segmented regions to analyze image saliency, which enables the assignment of comparable saliency values across similar image regions.
High-level priors have been used to analyze image saliency in recent years. Judd et al. [30] train an SVM model using a combination of low-, middle-, and high-level image features, making their approach potentially suitable for specific high-level computer vision tasks. The concept of center prior was considered in their approach. Shen and Wu [8] unify three higher-level priors, including location prior, semantic prior, and color prior, in a low-rank matrix recovery framework. A shape prior is proposed in Jiang et al. [7]; concavity context is utilized by [31]. Wei et al. [32] turn to background priors to analyze image saliency, assuming that the image boundary is mostly background. Subsequently, many recent approaches use the boundary prior to guide saliency detection, such as GMR [11], SO [17], PDE [33], AMC [15], and DSR [34], and these methods obtain state-of-the-art performance on several publicly available datasets.
Recent studies indicate that a single saliency cue is far from comprehensive. Some methods, such as LC [2], FT [5], and HC [6], use only the contrast cue, and the generated saliency maps are disappointing; the contrast cue sometimes produces high saliency values for background regions, especially for regions with complex structures. To alleviate these problems, some approaches, such as SF [18], PD [9], GC [12], PISA [21], PR [22], UFO [14], and HI [10], use multiple cues. Perazzi et al. [18] formulate saliency estimation using high-dimensional Gaussian filters, in which region color and region position are exploited to measure region uniqueness and distribution, respectively. Cheng et al. [12] and Tong et al. [22] also consider the color contrast cue and the color distribution cue when computing the saliency map. Margolin et al. [9] combine pattern distinctness, color uniqueness, and organization priors to generate the saliency result. Shi et al. [21] present a generic framework for saliency detection that employs three terms: a color-based contrast term, a structure-based contrast term, and spatial priors. Jiang et al. [14] propose an algorithm that integrates three saliency cues, namely, uniqueness, focusness, and objectness. Yan et al. [10] propose a multilayer approach to analyze image saliency. To determine the single-layer saliency, they exploit two saliency cues, local contrast and a location heuristic, and then use a hierarchical inference framework to generate the final saliency map. The above-mentioned algorithms compute saliency maps from various cues and heuristically combine them to get the final results.
These methods can generate good saliency maps when dealing with simple images. When processing images with complex backgrounds, some methods such as [9,12,18] can only highlight part of the salient object. Though methods such as [10,14,22] can highlight the entire object uniformly, the background may be highlighted too. Thus, to differentiate real salient regions from high-contrast parts, more saliency cues, including low-level features and high-level priors, need to be integrated. To the best of our knowledge, few works model the interaction between different saliency cues. Inspired by the work in [10,23], we propose a feature combination strategy that can capture the interaction between different cues. Our main contributions lie in three aspects. Firstly, we introduce feature combination to model the interaction between different cues, which differs from most existing methods that generate saliency maps heuristically from various cues. Secondly, we formulate saliency estimation as a classification problem and learn a logistic classifier that directly maps a fifteen-dimensional feature vector to a saliency value. Thirdly, the use of smoothing and the weighted salient image center further improves the detection performance. The experimental results show that our method can generate reasonable saliency maps, even when the image contains a complex background and the salient object has a color similar to the background.
The framework of the approach is presented in Figure 1. Our approach includes four main parts. First, hierarchical image abstraction segments the image into homogeneous regions across several layers by using the efficient graph-based image segmentation [35]. Second, four saliency cues, including color contrast in a global context, center-boundary priors, spatially compact color distribution, and objectness, are used as atomic regional features; we then map the four-dimensional regional feature to a fifteen-dimensional feature vector, which can capture the interaction between different features. Third, a logistic regression classifier is trained to map a fifteen-dimensional feature vector to a saliency value. Finally, we combine the saliency maps of the different layers to obtain our saliency map. Figure 2 shows samples of saliency maps generated by state-of-the-art methods and by ours.
The remainder of this paper is organized as follows. The proposed model is introduced in Section 2. Section 3 presents experiments and results. This paper is summarized in Section 4.

The Proposed Approach
Our method can be divided into four main stages: hierarchical image abstraction, regional feature generation, training a logistic regression classifier, and multilayer saliency map integration and reinforcement. In the following, we describe the details of the proposed approach.
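The segmentation result of the first layer is described as $R^1 = \{R^1_1, R^1_2, \ldots, R^1_{N_1}\}$, and the segmentation results of the other layers can be described in a similar way. Each superpixel $R^m_i$ is represented by its mean color $c^m_i$ (in CIELab) and its spatial position $p^m_i$ ($x$-coordinate and $y$-coordinate), which are defined as
$$c^m_i = \frac{1}{|R^m_i|} \sum_{u \in R^m_i} C(u), \qquad p^m_i = \frac{1}{|R^m_i|} \sum_{u \in R^m_i} P(u),$$
where $u$ stands for a pixel in the region $R^m_i$, $C(u)$ represents the color vector of pixel $u$, $P(u)$ represents the coordinate vector of pixel $u$, and $|R^m_i|$ is the number of pixels in the segmented region $R^m_i$.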
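As a minimal sketch of this region abstraction, the mean color and mean position of every region can be accumulated from a label map. The function name `region_descriptors`, the integer label map `labels`, and the normalization of coordinates to [0, 1] are assumptions made for illustration; the segmentation itself (graph-based segmentation [35]) is taken as given.

```python
import numpy as np

def region_descriptors(lab_image, labels):
    """Per-region mean colour, mean normalised (x, y) position and pixel count.

    lab_image: H x W x 3 array (assumed to be already in CIELab).
    labels:    H x W integer array, region ids 0..n-1 (hypothetical input).
    """
    h, w = labels.shape
    flat = labels.ravel()
    n_regions = int(flat.max()) + 1
    counts = np.bincount(flat, minlength=n_regions).astype(float)

    # Pixel coordinates, normalised to [0, 1] (an assumption of this sketch).
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel() / max(w - 1, 1),
                       ys.ravel() / max(h - 1, 1)], axis=1)
    colors = lab_image.reshape(-1, lab_image.shape[2]).astype(float)

    # Sum colours and coordinates per region, then divide by region size.
    color_sum = np.zeros((n_regions, colors.shape[1]))
    pos_sum = np.zeros((n_regions, 2))
    np.add.at(color_sum, flat, colors)
    np.add.at(pos_sum, flat, coords)
    safe = np.maximum(counts, 1.0)[:, None]
    return color_sum / safe, pos_sum / safe, counts
```

The returned arrays hold one row per region, so later cues can be computed at region level rather than pixel level.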

Regional Feature Generation
(1) Color Contrast Cue. Given the segmentation result in the $m$th layer of the image pyramid, described as $R^m = \{R^m_1, R^m_2, \ldots, R^m_{N_m}\}$, the color contrast $f^{(1)}_i$ of a region $R^m_i$ is formulated from two quantities evaluated against the other regions $R^m_j$ of the layer: a smooth term, which considers the distance between the two regions, and the color distance between $R^m_i$ and $R^m_j$.
(2) Color Distribution Cue. Inspired by Liu et al. [19], we use the nonoverlapping regions as computing units to compute the region color distribution. First, all region colors are represented by Gaussian Mixture Models (GMMs) $\{w_q, \mu_q, \Sigma_q\}_{q=1}^{5}$, where $w_q$, $\mu_q$, and $\Sigma_q$ are the weight, the mean color, and the covariance matrix of the $q$th Gaussian component. The probability of a region belonging to the $q$th component is given by
$$p(q \mid c^m_i) = \frac{w_q \, \mathcal{N}(c^m_i \mid \mu_q, \Sigma_q)}{\sum_{q'=1}^{5} w_{q'} \, \mathcal{N}(c^m_i \mid \mu_{q'}, \Sigma_{q'})}.$$
The number of Gaussian components is set to 5 in the subsequent experiments. We use the $k$-means algorithm to initialize the parameters of the GMMs and the EM algorithm to train them. Following [19], we define the horizontal spatial variance $V_h(q)$ of the $q$th clustered component of the GMMs; the vertical spatial variance $V_v(q)$ can be defined in the same way. Unlike Liu et al. [19], who use both variances to compute the saliency cue, we use only the horizontal spatial variance. The color distribution of region $R^m_i$ is then defined from the component probabilities and this variance.
(3) Center-Boundary Prior Cue. Location is an important factor in saliency detection. Center and boundary are two priors that are widely used in previous saliency detection methods. Considering both priors, we define our center-boundary heuristic from two terms. The center prior term $f^{(cp)}_i$ measures the distance between the region $R^m_i$ and the image center and is defined as $f^{(cp)}_i = 1/(\lambda + \sqrt{\|p^m_i - p_c\|^2/2})$, where $p_c$ is the center of the image, set to $(0.5, 0.5)$, and the parameter $\lambda$, which controls the sensitivity of the center prior, is set to 1 in the experiments. The boundary prior term $f^{(bp)}_i$ measures the color distance between the region $R^m_i$ and the image boundary. Inspired by the approach proposed by Yang et al. [11], we define the background feature of region $R^m_i$ using $f^{(top)}_i$, the sum of distances from region $R^m_i$ to the top boundary of the image, which is different from Yang et al. [11].
(5) Cues Smoothing. We thus obtain four saliency cues, which are normalized to the range [0, 1] using minimum-maximum normalization. Although the four saliency cues can be computed efficiently, at least two problems remain. Firstly, some regions with similar properties will have very different saliency values and, secondly, some adjacent regions will be assigned very different saliency values. To reduce the noisy saliency results caused by these issues, we use two smoothing procedures to refine the saliency value of each region.
$k$-Means Clustering Based Smoothing. Given the segmentation result in the $m$th layer of the image pyramid, described as $R^m = \{R^m_1, R^m_2, \ldots, R^m_{N_m}\}$, we first use the $k$-means clustering algorithm to divide the segmented regions of each layer into $K$ clusters. Referring to [36], we define an objective function, sometimes called a distortion measure, given by
$$J = \sum_{i=1}^{N_m} \sum_{k=1}^{K} r_{ik} \, \|c^m_i - \mu_k\|^2,$$
which we can easily solve for $\mu_k$ to give
$$\mu_k = \frac{\sum_{i} r_{ik} \, c^m_i}{\sum_{i} r_{ik}}.$$
Here $r_{ik}$ are binary indicator variables: if a region is assigned to cluster $k$, then $r_{ik} = 1$, and $r_{ij} = 0$ for $j \neq k$. This is known as the 1-of-$K$ coding scheme. The two phases of reassigning data points to clusters and recomputing the cluster means are repeated in turn until there is no further change in the assignments. We then obtain the number of regions in each cluster, $\mathrm{cl}(k) = \sum_{i=1}^{N_m} r_{ik}$. We replace the saliency value of each region by the weighted average of the saliency values of the regions in the same cluster (with weights measured by $L^*a^*b^*$ distance). This refinement is applied to each cue $f^{(\mathrm{tmp})}_i$ of region $R^m_i$, where tmp can be replaced by 1, 2, 3, and 4. A parameter controls the importance of the color space smoothing term. In our experiment, this parameter is set to 0.
A spatial based approach is then used to refine saliency between adjacent regions; the procedure is very similar to color space smoothing. We replace the saliency value of each region by the weighted average of the saliency values of its neighbors.
(6) Regional Feature. After completing the above steps, we obtain four atomic features ($f^{(1)}_i$, $f^{(2)}_i$, $f^{(3)}_i$, and $f^{(4)}_i$) for each segmented region: color contrast, center-boundary prior, color distribution, and objectness. To capture the interaction between the four features, a novel feature is generated by mapping the four-dimensional regional feature to a fifteen-dimensional feature vector. There are four kinds of combinations: single term, double term, triple term, and quadruple term. For the single terms, we use the four atomic features ($x_1$ to $x_4$ in the vector). For the double terms, there are six elements, each a combination of two atomic features ($x_5$ to $x_{10}$). The triple terms contribute four elements and the quadruple term contributes one, which gives $4 + 6 + 4 + 1 = 15$ elements in total. Finally, we obtain a novel fifteen-dimensional feature vector.
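The mapping from four atomic features to fifteen combined features can be sketched as follows. The text does not survive in enough detail to say how two or more features are combined into one element, so this sketch assumes a simple product of the selected features; the function name `combine_features` and the ordering of the terms are likewise illustrative assumptions.

```python
from itertools import combinations

import numpy as np

def combine_features(f):
    """Map 4 atomic features to a 15-dimensional vector.

    Terms are ordered as 4 single, 6 double, 4 triple and 1 quadruple term.
    ASSUMPTION: each combined term is the product of the selected features.
    """
    f = np.asarray(f, dtype=float)
    if f.shape != (4,):
        raise ValueError("expected exactly four atomic features")
    terms = []
    for order in range(1, 5):                      # 1, 2, 3, 4 features at a time
        for idx in combinations(range(4), order):  # C(4, order) index subsets
            terms.append(np.prod(f[list(idx)]))
    return np.array(terms)
```

For example, `combine_features([0.5, 1.0, 0.2, 0.4])` returns a vector of length 15 whose first four entries are the atomic features themselves. Because the classifier is linear in this vector, the combined terms are what allow it to weight interactions between cues.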

Learning Framework for Saliency Estimation.
The logistic function is useful because it can take an input with any value from negative to positive infinity, whereas the output always takes values between zero and one [37]. We take full advantage of this property; thus our saliency estimation can be formulated in a probabilistic framework. Let us assume that
$$P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x),$$
where $h_\theta(x) = g(\theta^T x) = 1/(1 + e^{-\theta^T x})$ is our hypothesis. The function $g(z) = 1/(1 + e^{-z})$ is called the logistic function or the sigmoid function. Notice that $g(z)$ tends towards 1 as $z \to +\infty$ and towards 0 as $z \to -\infty$. Hence, our hypothesis is always bounded between 0 and 1, and a higher value indicates that the region is more likely to belong to a salient object.
The parameter $\theta$ is what we want to learn from the data. We use the first layer of the image pyramid for training; its segmentation result is described as $R^1 = \{R^1_1, R^1_2, \ldots, R^1_{N_1}\}$. A segmented region is considered positive if the number of its pixels belonging to the salient object exceeds 90% of the number of pixels in the region, and its saliency value is set to 1. Conversely, a region is considered negative if the number of its pixels belonging to the salient object is under 10% of the number of pixels in the region, and its saliency value is set to 0. As mentioned above, each segmented region is described by a fifteen-dimensional vector $x$. We learn the logistic regression parameter $\theta$ from the training data $X = \{x_1, x_2, \ldots, x_n\}$ and the saliency values $Y = \{y_1, y_2, \ldots, y_n\}$. Once the parameter $\theta$ is obtained, we can quickly perform the saliency estimation using (10).
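The labeling rule and the classifier can be sketched as below. This is a minimal illustration that fits the parameter by plain batch gradient descent; the optimizer, the learning rate, the iteration count, the added bias term, and all function names are assumptions of this sketch rather than details given in the text.

```python
import numpy as np

def label_region(salient_fraction, hi=0.9, lo=0.1):
    """Return 1 (positive), 0 (negative) or None (region not used for training)."""
    if salient_fraction > hi:
        return 1
    if salient_fraction < lo:
        return 0
    return None

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def _with_bias(X):
    # Prepend a constant 1 so theta[0] acts as an intercept (an assumption).
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def train_logistic(X, y, lr=0.5, n_iter=2000):
    """Fit theta by gradient descent on the logistic log-loss."""
    Xb = _with_bias(X)
    y = np.asarray(y, dtype=float)
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        grad = Xb.T @ (sigmoid(Xb @ theta) - y) / len(y)
        theta -= lr * grad
    return theta

def predict_saliency(theta, X):
    """Saliency score in (0, 1) for each feature vector in X."""
    return sigmoid(_with_bias(X) @ theta)
```

In use, `X` would hold one fifteen-dimensional vector per labeled training region, and regions for which `label_region` returns `None` would simply be left out of the training set.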

Multilayer Saliency Map Integration.
We combine the layers of the image pyramid, which is the multiscale representation of the image, to suppress background regions. Similar to [1], the final map is obtained by rescaling the saliency maps of the layers to the same scale and adding them point by point. The fusion strategy is given by $S_{\mathrm{Fusion}}(I) = \bigoplus_{m=1}^{M} \mathrm{sal}(I_m)$, where $I$ is the input image, $M$ is the number of layers, $I_m$ is the $m$th layer of the image pyramid, and $\mathrm{sal}(I_m)$ is the saliency detection result of the $m$th layer of the image pyramid.
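The across-scale addition can be illustrated with a short sketch. Nearest-neighbor resizing and the final rescaling to [0, 1] are assumptions chosen to keep the example dependency-free; the function names `resize_nearest` and `fuse_layers` are hypothetical.

```python
import numpy as np

def resize_nearest(sal, shape):
    """Resize a 2-D map to `shape` by nearest-neighbour sampling."""
    h, w = shape
    rows = np.arange(h) * sal.shape[0] // h
    cols = np.arange(w) * sal.shape[1] // w
    return sal[np.ix_(rows, cols)]

def fuse_layers(layer_maps, shape):
    """Bring all layer maps to a common scale and add them point by point."""
    fused = np.zeros(shape, dtype=float)
    for sal in layer_maps:
        fused += resize_nearest(np.asarray(sal, dtype=float), shape)
    lo, hi = fused.min(), fused.max()
    # Rescale to [0, 1]; a constant map is returned as all zeros (assumption).
    return (fused - lo) / (hi - lo) if hi > lo else np.zeros(shape)
```

A region that is salient in only one layer contributes less to the sum than a region that is salient in every layer, which is how the fusion suppresses background responses.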

Reinforcement of Salient Region.
A salient object is usually concentrated in a local region of the image, while the background has a high degree of dispersion. To use this property, we introduce the weighted salient image center into our saliency estimation. The salient center is defined as a saliency-weighted average over the pixels of the image, where $N$ is the number of pixels in the image and $u_k$ is the $k$th pixel. The final pixel-level saliency of a pixel $u$ is then obtained by applying a spatial weight to its fused saliency, where $d(u)$ is the Euclidean distance between the pixel $u$ and the weighted salient image center, and the parameter $\sigma^2$ is the smooth term that controls the strength of the spatial weight; we set $\sigma^2$ to 0.4.
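One possible reading of this reinforcement step is sketched below. The two defining equations are not available here, so both the saliency-weighted mean used for the center and the Gaussian falloff `exp(-d^2 / sigma2)` are assumptions made for illustration, as are the function names and the normalization of coordinates to [0, 1].

```python
import numpy as np

def _grid(shape):
    """Normalised x and y coordinates in [0, 1] for every pixel."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    return xs / max(w - 1, 1), ys / max(h - 1, 1)

def weighted_center(sal):
    """Saliency-weighted mean position (x, y); image centre if sal is all zero."""
    sal = np.asarray(sal, dtype=float)
    xs, ys = _grid(sal.shape)
    total = sal.sum()
    if total == 0:
        return 0.5, 0.5
    return float((sal * xs).sum() / total), float((sal * ys).sum() / total)

def reinforce(sal, sigma2=0.4):
    """Down-weight pixels far from the weighted salient centre.

    ASSUMPTION: the spatial weight is exp(-d^2 / sigma2), with d the
    Euclidean distance to the centre in normalised coordinates.
    """
    sal = np.asarray(sal, dtype=float)
    cx, cy = weighted_center(sal)
    xs, ys = _grid(sal.shape)
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return sal * np.exp(-d2 / sigma2)
```

Compared with a fixed image-center prior, a center computed from the map itself follows the object when it lies off-center.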

Experiments and Results
To validate our proposed approach, we have performed experiments on two publicly available datasets. (1) The first one is the MSRA dataset [19], which contains 5000 images with pixel-level ground truth. We used the same training set, validation set, and test set as Jiang et al. [23]. The training set contains 2500 images, the validation set contains 500 images, and the test set contains 2000 images.

Evaluation Methods.
Following [5,6,8], we evaluate the performance of our method by measuring its precision and recall rates. Precision measures the percentage of salient pixels correctly assigned, while recall measures the percentage of the salient object detected. To study the performance of saliency detection approaches, we use two kinds of objective comparison measures used in previous studies.
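Precision and recall for a binarized saliency map can be computed as in the following sketch, which also sweeps a fixed threshold over the range 0 to 255 to produce the points of a precision-recall curve. The 8-bit value range, the convention of returning 0 when a denominator is empty, and the function names are assumptions of this illustration.

```python
import numpy as np

def precision_recall(sal, gt, threshold):
    """Precision and recall of the map binarised at `threshold`.

    sal: saliency map (assumed to hold values in 0..255).
    gt:  binary ground-truth mask of the same shape.
    """
    pred = np.asarray(sal) >= threshold
    gt = np.asarray(gt).astype(bool)
    tp = np.logical_and(pred, gt).sum()
    # Convention of this sketch: an empty denominator yields 0.0.
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    return float(precision), float(recall)

def pr_curve(sal, gt, thresholds=range(256)):
    """One (precision, recall) pair per fixed threshold."""
    return [precision_recall(sal, gt, t) for t in thresholds]
```

Averaging the per-image pairs at each threshold over a dataset gives one curve per method.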
Secondly, we follow [5,6,8] and segment a saliency map by adaptive thresholding. The image is first segmented by the mean-shift clustering algorithm. We then calculate the average saliency value of each nonoverlapping region, as well as the overall mean saliency value over the entire saliency map. The mean-shift segments whose saliency value is larger than twice the overall mean saliency value are marked as foreground; the threshold is defined as
$$T_a = \frac{2}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} S(x, y),$$
where $W$ and $H$ are the width and height of the saliency map $S$, respectively. In many applications, both high precision and high recall are required. In addition to precision and recall, we thus estimate the $F$-measure $F_\beta$, which is defined as
$$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}},$$
where we set $\beta^2 = 0.3$ as suggested in [5,6,18].
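The adaptive threshold and the F-measure are simple enough to state directly in code. The sketch below assumes the standard weighted harmonic mean form of the F-measure with $\beta^2 = 0.3$; the function names are illustrative, and the mean-shift segmentation step is omitted.

```python
import numpy as np

def adaptive_threshold(sal):
    """Twice the mean saliency over the whole map."""
    return 2.0 * float(np.mean(sal))

def f_measure(precision, recall, beta2=0.3):
    """Weighted harmonic mean of precision and recall (beta^2 = 0.3)."""
    denom = beta2 * precision + recall
    if denom <= 0:
        return 0.0
    return (1.0 + beta2) * precision * recall / denom
```

With the MSRA precision and recall reported in this paper (0.8524 and 0.7794), `f_measure` evaluates to roughly 0.834. Setting $\beta^2$ below 1 weights precision more heavily than recall.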
Performance on MSRA Dataset.
We report both quantitative and qualitative comparisons of our method with 18 state-of-the-art saliency detection approaches on the MSRA dataset.
Quantitative Comparison. Figures 3(a) and 3(b) show the precision-recall curves of all the algorithms on the MSRA-5000 dataset. As observed from Figure 3, the curve of our method is consistently higher than the others on this dataset. In addition, we compare the performance of the various methods using adaptive thresholding. Each of our precision, recall, and $F_\beta$ values (0.8524, 0.7794, and 0.8343) ranks first among the 18 state-of-the-art methods.

Performance on ECSSD Dataset.
The ECSSD dataset is a more challenging dataset provided by Yan et al. [10]. As shown in Figure 5, our approach achieves the best precision-recall curve. We also evaluate average precision, recall, and $F_\beta$ using adaptive thresholding; our recall and $F_\beta$ values rank first among all the methods. We also provide a visual comparison of the different approaches in Figure 6, from which we see that our approach produces the best detection results on these images and can highlight the entire salient object uniformly. We only consider the thirteen most recent models: FT [5], HC [6], RC [6], LR [8], PD [9], GC [12], HI [10], GMR [11], BMS [13], UFO [14], AMC [15], HDCT [16], and SO [17].

Analysis of the Influencing Factors of Segmentation.
Recently, low-level image segmentation methods have been widely used for saliency analysis. The SLIC [38] and superpixel [35] approaches are two efficient algorithms whose source codes are publicly available. Because they use different segmentation criteria, their segmentation results are quite different from each other, as shown in Figure 8. From the figure, we can see that the result of the SLIC method has more local compactness than that of the superpixel method. We also provide a visual comparison of the four saliency cues and the final saliency maps produced with the two segmentation algorithms. Figure 9 shows that different segmentation algorithms can produce different saliency cues and final saliency maps. The superpixel approach can generate high-quality saliency maps, while the SLIC segmentation algorithm may highlight some nonsalient regions. Finally, we provide a quantitative comparison of the SLIC and superpixel segmentation algorithms. To verify the effectiveness of the two segmentation algorithms, we plot the corresponding precision-recall curves on the ASD dataset. As observed from Figure 10, the superpixel algorithm obtains better precision-recall curves than the SLIC clustering algorithm.

Conclusion
In this paper, a novel salient region detection approach based on feature combination and a discriminative classifier is presented. We use four saliency cues as atomic features of each segmented region in the image. To capture the interaction among different features, a novel feature vector is generated by mapping the four-dimensional regional feature to a fifteen-dimensional feature vector. A logistic regression classifier is trained to map a regional feature to a saliency value. We further introduce multilayer saliency map integration and the salient center for improvement. We evaluate the proposed approach on two publicly available datasets, and the experimental results show that our model can generate high-quality saliency maps that uniformly highlight the entire salient object.

Figure 1: An overview of our weighted feature combination framework. We extract four image layers from the input and then train a logistic regression classifier by using the four atomic features. Initial saliency maps of the four layers can be obtained by weighted feature combination. Finally, we fuse the saliency maps of the different layers to obtain the final saliency map.

Figure 3: Experimental results on the MSRA dataset. (a) and (b) are precision-recall curves of all approaches, obtained using fixed thresholds. The histogram (c) (precision, recall, and $F_\beta$) is obtained using adaptive thresholding.

Figure 5: Experimental results on the ECSSD dataset. (a) and (b) are precision-recall curves of all approaches, obtained using fixed thresholds. The histogram (c) (precision, recall, and $F_\beta$) is obtained using adaptive thresholding.

Figure 8: Visual comparison of SLIC and superpixel segmentation result, from left to right: input image, SLIC segmentation result, and superpixel segmentation result.

Figure 9: Comparison of salient features with different segmentation methods. (a) Input image. (b) Ground truth. (c) Saliency map generated by using the SLIC method. (d) Saliency map generated by using the superpixel method. (e) Color contrast based salient feature by using the SLIC method. (f) Color contrast based salient feature by using the superpixel method. (g) Color distribution based salient feature by using the SLIC method. (h) Color distribution based salient feature by using the superpixel method. (i) Objectness based salient feature by using the SLIC method. (j) Objectness based salient feature by using the superpixel method. (k) High prior based salient feature by using the SLIC method. (l) High prior based salient feature by using the superpixel method.
In the boundary prior term, the sum of distances to the top boundary is taken over the regions that intersect with the top image boundary, and their number is used in its computation. A simple approach is used to compute this term.
(4) Objectness Cue. Recently, a generic objectness measure has been proposed to quantify how likely it is for an image window to contain an object of any class. The measure is based on low-level image cues. As our goal is to obtain a saliency map for the whole image, we first transfer the objectness value from the bounding boxes to the pixel level and then obtain a region-level objectness measure. For more details, please refer to UFO [14]. For each region, we obtain its region-level objectness $f^{(4)}_i = \sum_{u \in R^m_i} O(u) / |R^m_i|$, where $O(u)$ is the objectness value of pixel $u$.