Salient Region Detection via Multiple Hierarchy and Competition Mechanism

The detection of salient regions has attracted increasing attention in machine vision. In this study, a novel and effective framework for salient region detection is proposed to solve the problem of the low detection accuracy of traditional methods. First, we divide the image into three levels. Second, each level uses three different feature methods to generate different feature saliency maps. Subsequently, a novel integration mechanism, termed the competition mechanism, is applied to the coarse saliency maps at the same level, and the two coarse saliency maps with the highest similarity are selected for fusion to ensure the effectiveness of the salient region map. Accordingly, after adjusting the scales of the fused saliency maps of the different levels, the two maps with the most significant difference among the three coarse saliency maps of the different levels are selected and fused to obtain the final refined saliency map. Finally, experiments were conducted on three benchmark datasets using the proposed method. As demonstrated by the experimental results, the proposed algorithm is superior to other state-of-the-art methods.


Introduction
With the development of image processing technology, large amounts of image data can be quickly acquired, which poses a challenge to the efficiency of image processing algorithms. According to the visual attention mechanism, the human eye can quickly select the most interesting objects and ignore useless information in a scene, and the saliency detection task aims to identify these important regions. In recent years, saliency detection has gradually become a research hotspot in machine vision, and an increasing number of researchers have focused on this problem. As an important preprocessing procedure, saliency detection contributes to many machine vision and image processing tasks such as visual tracking [1], image compression [2], object recognition [3], and image segmentation [4]. Saliency detection comprises salient object detection and gaze prediction [5]. Existing salient object detection methods can be divided into two categories: top-down [6] and bottom-up [7, 8]. Top-down approaches are task-driven; supervised learning frameworks typically use top-down strategies such as deep learning or convolutional neural networks. Although deep learning has achieved good results in salient object detection, it requires numerous datasets for training and involves many parameters, which complicates parameter tuning. Conversely, bottom-up methods are driven by vision; their performance relies on various low-level visual feature extraction methods, such as color, texture, and orientation features.
Many salient object detection models exist, among which graph-based methods measure the similarity between nodes when generating saliency maps [9]. For superpixel images with superpixels as nodes, the similarity weights between nodes define a graph model, and the saliency information of the original image can be extracted through node propagation. In these methods, initial node selection is key to the salient object detection process. Existing node selection methods introduce various types of prior knowledge to ensure the accuracy of foreground or background node selection. For example, the boundary prior [10] assumes that regions strongly connected to the image boundary are background, and the center prior [11] assumes that the salient object is located at the center of the image. However, these assumptions often do not hold for natural images, where salient object locations are randomly distributed. This is a significant challenge for existing prior-based methods.
Regarding the advantages and disadvantages of node-selection-based salient object detection in graph methods, we can conclude that the determination of prior knowledge plays an important role in node selection; however, the saliency maps generated by different prior-knowledge determinations of the initial nodes also differ. Some researchers have attempted to improve the accuracy of saliency maps by refining the determination of the initial nodes, but if an initial node is misjudged at the outset, the accuracy of the saliency map is significantly reduced, and the corresponding feature method may even become invalid. In addition, for superpixel images generated using the simple linear iterative clustering (SLIC) algorithm [12], images of different sizes produce nodes of different shapes and numbers. In other words, image layering enriches the input features of salient objects, thereby ensuring the robustness of the detection results; it also provides a basis for subsequent feature selection. Therefore, a multiple image hierarchy and a competition mechanism are particularly important for saliency detection. In this study, we focus on bottom-up saliency detection models. In contrast to traditional methods, the proposed scheme improves the accuracy and consistency of the saliency results by fusing multilevel saliency priors with an integration mechanism based on global similarity cues. The main contributions of the proposed method are as follows: (1) The use of multilayered images overcomes the nonrobustness of saliency measurements with respect to the size of salient objects. Simultaneously, the shapes and numbers of nodes generated by images of different scales differ, enriching the input features of the salient objects. (2) Three different feature methods are applied in parallel to prevent prior knowledge from misjudging nodes at the outset, which benefits the later feature selection. (3) A novel feature saliency map fusion mechanism is established.
First, feature maps are fused at the same level, with the two most similar coarse saliency maps selected for fusion. Then, across levels, two feature saliency maps with large differences are selected for fusion. The first competition removes features with low detection accuracy, thereby ensuring the effectiveness of the saliency maps at the same level. The second competition enriches the boundary features of the feature maps and ensures the detection accuracy of the saliency map. The remainder of this paper is organized as follows. In Section 2, we review related studies on saliency detection. In Section 3, the construction of the multiple hierarchical coarse saliency maps is presented. Section 4 presents the saliency map refinement via the competition mechanism. Section 5 presents the experimental data and analysis. Finally, Section 6 presents the conclusions and discussion.

Related Work
Existing salient object detection methods can be divided into two categories: bottom-up and top-down. Bottom-up methods are driven by vision, and their performance relies on various low-level visual feature extractions such as color, texture, and orientation features. To date, many low-level features have been used in saliency detection work, and it has been confirmed that features such as color, salient object size, location, and image boundaries directly influence our visual attention mechanism and play an important role in saliency detection. Top-down approaches, in contrast, are task-driven; supervised learning frameworks typically use top-down strategies such as deep learning or convolutional neural networks.
Visual saliency detection is mostly based on bottom-up methods that use low-level visual features in images and videos. An early study on saliency detection was presented by Itti et al. [13], in which a final saliency map of an input image was obtained by exploiting the center-surround differences in low-level image features based on visual features (including color, intensity, and orientation maps). Ma and Zhang [14] proposed a method to detect regions of visual attention in images based on local contrast analysis, which in turn is based on center-surround and color difference calculations. Wei et al. [15] proposed two priors, boundary and connectivity, and assumed that the image boundary was mainly background. Combining this prior with the geodesic distance, an effective salient object detection model called the GS model was designed, in which saliency detection was performed using the background priors and distances. Perazzi et al. [16] decomposed a given image into compact and perceptually homogeneous elements, evaluated the uniqueness and spatial distribution of these elements, inferred the contrast of these elements to derive saliency measures, and generated pixel-accurate saliency maps. Yang et al. [17] ranked the similarity of image elements to foreground or background cues through graph-based manifold ranking (MR), representing an image as a closed-loop graph with superpixels as nodes; the ranking is computed from an affinity matrix based on background and foreground queries to effectively extract background regions and foreground salient objects. Wu et al. [7] used two perceptual cues, proximity and similarity, to generate background nodes through background probability measures and established a label propagation model to generate saliency maps. Wu et al. [18] proposed a boundary-guided graph structure to explore the correlations between superpixels and built an iterative propagation mechanism to refine the saliency map. Wang et al. 
[19] used the convex hull of interest points to estimate the locations of salient objects by employing random walks to propagate nodes and generate saliency maps. Jian et al. [20] used a discrete wavelet transform to extract directional features, calculated the centroids of salient objects, and combined the color contrast and background features to generate robust saliency maps.
As proposed by Al-Azawi et al. [21], local saliency detection uses the structure of objects to determine saliency, whereas global saliency detection identifies saliency based on background-related contrast, with irregularity used as a two-stage saliency measure. Chakraborty and Mitra [22] used random walks on Markov chains to obtain saliency maps, with k-dense submaps enhancing the salient regions in the image.
Among these saliency detection methods, graph-based methods generate saliency maps by measuring the similarity between nodes based on different rules. In this class of methods, the initial node selection is key to the salient object detection process. Existing node selection methods introduce various types of prior knowledge to ensure the accuracy of foreground or background node selection. These priors work well for carefully composed samples such as ID photos. However, this is not the case for natural images, where the salient object locations are randomly distributed. This is a significant challenge for existing prior-based methods. In addition, for superpixel node images generated using the SLIC algorithm, images of different sizes produce nodes of different shapes and numbers. In other words, image layering enriches the input features of salient objects, thereby ensuring the robustness of the detection results and providing a basis for the subsequent feature selection.
Regarding the advantages and disadvantages of node-selection-based salient object detection in graphical methods, we can conclude that the determination of prior knowledge plays an important role in node selection; however, the saliency maps generated by different prior-knowledge determinations of the initial nodes also differ. Some researchers have attempted to improve the accuracy of saliency maps by refining the determination of the initial node. If the initial node is misjudged at the outset, the accuracy of the saliency map is significantly reduced, and the corresponding feature method may even become invalid.
Consequently, we propose a saliency map refinement method using a competition mechanism. First, multiple feature maps are generated in parallel; then, the feature saliency maps are further screened to retain useful feature maps and eliminate useless ones. Here, the similarity of the saliency maps generated by a certain number of features is compared, and the effective feature maps are selected. For convenience, this study adopts three different types of features and compares the similarity of the saliency maps generated by these three features.
In contrast to bottom-up saliency detection models, top-down methods are based on high-level priors on images and are thus always task-specific, driven by training and learning. In [6], Yang et al. used joint learning of a conditional random field (CRF) and a visual dictionary for salient object detection, comprising a hierarchical structure from top to bottom: CRF, sparse coding, and image patches. The CRF is learned in a feature-adaptive manner as an intermediate layer and also serves as an output layer to learn a dictionary under structured supervision, yielding a top-down saliency detection model. Shan et al. [23] proposed a three-stage layered neural network: first, Fast R-CNN extracts features for each superpixel; second, an attention mechanism is used to expand the receptive field from one superpixel to surrounding and related superpixels; finally, a saliency score is obtained using a global regression model to generate a saliency map for top-down saliency detection. Although deep learning-based salient object detection has achieved good results [24], these methods are extremely sensitive to data attributes, which may limit their adaptability to different situations.

Construction of Multiple Layered Coarse Saliency Maps
When the selection of initial nodes differs, the detection results of the saliency graphs also differ significantly, and a competitive and complementary relationship exists between them. Based on various factors, such as salient object size, color, location, and image boundary, a multilevel hierarchical saliency detection algorithm with different prior knowledge is designed to determine the initial nodes. The algorithm framework of this study is illustrated in Figure 1. First, we divide the input image into three levels, Z1, Z2, and Z3, of different sizes. Subsequently, we simultaneously process Z1, Z2, and Z3 by building three different types of saliency graph models. Finally, we define a competition mechanism that fuses the saliency maps twice to obtain a refined saliency map. In Figure 1, ① adopts the SLIC algorithm to generate superpixel images; ②, ③, and ④ represent the feature method based on the center prior, the feature method based on the color prior, and the feature method based on the boundary connectivity prior, respectively; ⑤, ⑥, and ⑦ represent the feature analysis of the respective feature maps; ⑧ represents feature map comparison; and ⑨ represents feature fusion.
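The three-level hierarchy described above can be sketched in code. This is a minimal sketch under stated assumptions: the paper does not specify the downscaling method or the scale factors, so the function name `build_levels`, the average-pooling scheme, and the factors (1, 2, 4) are all illustrative choices, not the authors' implementation.

```python
import numpy as np

def build_levels(image, factors=(1, 2, 4)):
    """Build a three-level hierarchy (Z1, Z2, Z3) by average-pooling.

    `factors` are hypothetical downscaling factors; the paper only
    states that the image is divided into three levels of different
    sizes, which changes the shapes and numbers of superpixel nodes.
    """
    levels = []
    h, w = image.shape[:2]
    for f in factors:
        # Crop so height/width divide evenly, then block-average.
        hc, wc = (h // f) * f, (w // f) * f
        img = image[:hc, :wc]
        pooled = img.reshape(hc // f, f, wc // f, f, -1).mean(axis=(1, 3))
        levels.append(pooled)
    return levels

# Example: a toy 8x8 RGB image yields levels of sizes 8x8, 4x4, and 2x2;
# each level would then be segmented into superpixels (e.g., via SLIC).
toy = np.random.rand(8, 8, 3)
z1, z2, z3 = build_levels(toy)
```

Each level would subsequently be processed by the three feature methods in parallel, as the framework in Figure 1 describes.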

Coarse Saliency Map Generation Based on Boundary Prior.
Zhu et al. [10] proposed the boundary connectivity to measure the degree of connection between a region Y and the image boundary, which characterizes how strongly a region touches the boundary. It is expressed as the ratio of the perimeter of the region on the boundary to the square root of the area of the region:

BndCon(Y) = Len_bnd(Y) / sqrt(Area(Y)),
where v_i is a node in region Y, C is the collection of image boundary nodes, Len_bnd(·) represents the perimeter of a region on the boundary, and Area(·) represents the area of a region.
To facilitate the calculation of Area(Y), the region of v_i is taken as the set of nodes with similar colors.
where Z_i^Y represents the total number of pixels of node v_i, and |Q_i^c| represents the number of nodes in Q_i^c. Similarly, the image boundary length is defined, where δ(·) is 1 at the boundary and 0 otherwise. A similarity also exists in the colors among these nodes. The salient object in an image exhibits smaller spatial variation, whereas background areas distributed over the entire image exhibit higher spatial variation. In other words, compactness is an appropriate supplement to boundary connectivity, which can suppress erroneously salient areas. The spatial compactness D can be calculated using the intraclass distribution.
where p_i is the position of node v_i and μ_i is the center position of Q_i^c. Furthermore, the above formula is extended to a background probability based on the boundary connectivity value of node v_i, where c_B is the adjustment parameter used to control the intensity of A_bg. Because the background probability evaluates the possibility that a node belongs to the background, likely background nodes can be distinguished by it. Therefore, the background exclusion can be expressed with a threshold, where ϑ_bg is the parameter that controls the background probability threshold.

Coarse Saliency Map Generation Based on Color Prior.
Kim et al. [25] proposed creating an image saliency map via a linear combination of colors in a high-dimensional color space, applicable when the salient object position is independent of node position. The histogram feature is an effective measure of saliency. The histogram feature of the i-th superpixel is measured using the chi-square distance to the histograms of the other superpixels:

χ²(h_i, h_j) = Σ_b (h_i(b) − h_j(b))² / (h_i(b) + h_j(b)),

where b indexes the histogram bins and each histogram has eight bins. The global contrast feature U_i^G of the i-th superpixel is defined in terms of d(c_i, c_j), the Euclidean distance between the color value c_i of the i-th superpixel node and the color value c_j of the j-th superpixel node; the eight color channels of RGB, CIELab, hue, and saturation are used to calculate the color contrast feature. The local color contrast feature U_i^L is defined analogously with a spatial weighting, where p_i ∈ [0, 1] × [0, 1] represents the normalized position of the i-th superpixel node and E_i is a normalization term. We set κ² = 0.25 according to [16].
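A minimal sketch of the two building blocks above, the chi-square histogram distance and a global color contrast, assuming plain Euclidean distances over per-superpixel mean colors (the function names and the `eps` guard against empty bins are illustrative, not from the paper):

```python
import numpy as np

def chi_square_distance(h1, h2, eps=1e-12):
    """chi^2(h1, h2) = sum_b (h1_b - h2_b)^2 / (h1_b + h2_b).

    `eps` avoids division by zero for bins empty in both histograms.
    """
    return np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def global_contrast(colors, i):
    """U^G_i sketch: sum of Euclidean color-space distances from
    superpixel i to every other superpixel (rows of `colors`)."""
    d = np.linalg.norm(colors - colors[i], axis=1)
    return d.sum()

# Example: identical histograms have distance 0; a superpixel whose color
# differs strongly from the rest gets a high global contrast.
h = np.array([0.5, 0.5])
colors = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
```

In the paper the color vectors would span the eight channels of RGB, CIELab, hue, and saturation rather than the toy 3-vectors used here.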

Mathematical Problems in Engineering
For texture and shape features, the superpixel regions use the histogram of gradients and singular value features. The histogram of gradients uses the gradient information of the pixels to quickly provide appearance features. The singular value feature [26] is based on eigen-images: it decomposes an image into a weighted sum of multiple eigen-images, where each weight is a singular value obtained through singular value decomposition. The eigen-images corresponding to the larger singular values determine the overall outline of the original image, whereas the smaller singular values describe detailed information. Therefore, for blurred images, the larger singular values occupy a higher weight.
First, the initial image is divided into 2 × 2, 3 × 3, and 4 × 4 regions, and a threshold is applied to each region. Owing to changes in color saturation or illumination, the threshold setting is influenced to a certain extent. Here, Otsu's multilevel adaptive threshold [27] is used to control the ratio between the foreground, background, and unknown regions. Seven-level thresholds were used for each subregion. After the three differently thresholded saliency maps were fused, we obtained a 21-level saliency map with local thresholds, such that the saliency map has better local contrast. Therefore, even if a local area is not the most prominent area globally in the entire image, the most prominent area can still be captured locally. Finally, the coarse saliency map is obtained using a global threshold.
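The building block of the adaptive thresholding above is Otsu's method, which picks the threshold maximizing the between-class variance. The sketch below implements only the single-threshold case as an illustration; the paper uses Otsu's multilevel variant [27], and the brute-force search and bin count here are assumptions.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Single-threshold Otsu: exhaustively pick the threshold that
    maximizes the between-class variance of the two resulting groups.
    """
    hist, edges = np.histogram(values, bins=bins)
    p = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for k in range(1, bins):
        w0, w1 = p[:k].sum(), p[k:].sum()   # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (p[:k] * centers[:k]).sum() / w0  # class means
        mu1 = (p[k:] * centers[k:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, centers[k]
    return best_t
```

Applying such a threshold separately inside each 2 × 2, 3 × 3, and 4 × 4 subregion gives the local contrast behavior the paragraph describes; a multilevel extension would search for several thresholds jointly.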

Coarse Saliency Map Generation Based on Center Prior.
Lou et al. [28] proposed a color name method based on the center prior, applicable when the salient object is located in the middle of the image. For input images of different sizes, the input image is first fixed to a certain pixel width to obtain the optimal parameters of the structural elements at a certain scale. Subsequently, the color name model provides the im2c function and the mapping matrix to convert the scaled image I from the RGB space to the color space C, where ⌊·⌋ represents rounding down, and the im2c function takes the row corresponding to the index Ind_{x,y} from the mapping matrix to obtain an 11-dimensional vector. Through this conversion, the RGB color value (R_{x,y}, G_{x,y}, B_{x,y}) of each pixel in I is mapped to an 11-dimensional color name vector. Each element in the color name vector is a floating-point number in the interval [0, 1], indicating the probability that the pixel in I belongs to each color name, and the 11 elements of the vector sum to 1. Because only one color name probability map can be obtained each time the im2c function is called, the Color Name Space (CNS) method calls the im2c function 11 times to obtain 11 color name probability maps. For convenience of expression, each color probability map is referred to as a color name channel. The entire color name space C is composed of M color name channels, C = {C_1, C_2, ..., C_M}, where M = 11. To obtain Boolean maps using a set of serialized segmentation thresholds, each color channel is normalized to the integer interval [0, 255]. Although each channel is a probability map, it essentially reflects the intensity information of the color name; that is, the color names formed by these channels contain the perceptual color characteristics under linguistic description. The color name channels are used to generate Boolean maps as follows: first, take a sampling interval δ in [0, 255] to obtain a set of serialized segmentation thresholds.
Assuming that the total number of segmentation thresholds is m, the j-th threshold is denoted as ϕ_j, and each color name channel C_i ∈ C is segmented using these m thresholds, where H_i^j is the Boolean map obtained by thresholding the color name channel C_i with ϕ_j, the subscript i ∈ {1, 2, ..., M} represents the color name channel number, and the superscript j ∈ {1, 2, ..., m} represents the sequence number of the segmentation threshold. The THRESH(·) function sets the element value H_i^j(x, y) = 1 at coordinate (x, y) when C_i(x, y) ≥ ϕ_j; otherwise, H_i^j(x, y) = 0. When the salient object area has a high color intensity, serialized segmentation of the positive channel can recover the object area. Conversely, if the salient object has a low color intensity, segmenting only the positive channel relegates the object area to the background; in this case, an inverted map H̄_i^j is required.
This yields M × 2 × m Boolean maps, comprising the positive and inverted versions of the M color name channels. Three morphological operations are performed on the serialized Boolean maps generated above, namely closing, hole-filling, and clearing of border-touching objects, to obtain a saliency map based on surrounding cues.
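The serialized threshold segmentation of one channel can be sketched as follows. This is a minimal sketch assuming a uniform sampling interval δ starting at δ itself; the function name `boolean_maps` and the default interval are illustrative, and the morphological post-processing (closing, hole-filling, border clearing) is omitted.

```python
import numpy as np

def boolean_maps(channel, delta=32):
    """Serialized threshold segmentation of one color name channel.

    `channel` is assumed normalized to the integer interval [0, 255];
    thresholds are sampled every `delta` levels. Returns the positive
    Boolean maps H_i^j and their inverted counterparts for objects of
    low color intensity.
    """
    thresholds = np.arange(delta, 256, delta)  # m = len(thresholds)
    positive = [(channel >= t).astype(np.uint8) for t in thresholds]
    inverted = [1 - bmap for bmap in positive]
    return positive, inverted

# With M = 11 color name channels, this procedure yields M * 2 * m
# Boolean maps in total, matching the count stated in the text.
pos, inv = boolean_maps(np.array([[0, 255], [128, 64]]), delta=64)
```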
After obtaining the saliency maps of all color name channels, these saliency maps are combined in a linear average manner to obtain a more stable saliency map.
where S_i^j is the positive saliency map of the i-th color name channel, S̄_i^j is the inverted saliency map of the i-th color name channel, S_i is the saliency map of the i-th color name channel, and S_C is the average saliency map of the M color name channels.

Integration Mechanism of the Saliency Map
Useful coarse saliency maps are fused from the coarse saliency maps produced by the above three feature methods, and coarse saliency maps with low detection accuracy are eliminated. The initial nodes determined by different prior knowledge have different effects on salient object detection, and the feature maps generated by each feature exhibit certain differences. To ensure the selection of effective feature saliency maps when merging at the same level, the two feature saliency maps with the most significant similarity are selected for fusion. This also effectively prevents a feature saliency map of low accuracy from reducing the saliency of salient objects. Therefore, a novel integration mechanism is defined: among the three feature saliency maps at the same level (Z1, Z2, or Z3), the two most similar saliency maps are selected for fusion, which avoids introducing new errors when a certain method has low detection accuracy or fails. The competition mechanism is shown below.
where s_A^{Z_i} is the S_A feature saliency map of the Z_i-th level; similarly, s_B^{Z_i} and s_C^{Z_i} are the S_B and S_C feature saliency maps of the Z_i-th level, respectively. The two selected saliency maps are then fused, where s_1^{Z_i} and s_2^{Z_i} are the saliency maps selected by the first screening, and w_s is the weight calculated from s_1^{Z_i} and s_2^{Z_i} together with a normalization term. Thus, the feature saliency map is further refined: the saliency maps of the three levels are adjusted to the original image size, and the different levels (Z_1, Z_2, and Z_3) are merged again. To ensure the effectiveness of the feature saliency map, maximally enrich its boundary information, and improve the boundary accuracy of the salient object, the two feature saliency maps of different levels with the largest difference are selected for fusion. This avoids both the direct fusion of all three level feature maps, which reduces boundary saliency, and the fusion of two similar feature maps, which loses the boundary information of the salient object. The fusion method follows, where S_{Z1}, S_{Z2}, and S_{Z3} are the feature saliency maps of the Z1, Z2, and Z3 levels adjusted to a uniform scale, respectively. After the second screening, the two feature saliency maps are fused, where S_1 and S_2 are the two saliency maps screened the second time, α is a set parameter, W_f is the weight calculated from S_1 and S_2, and E_i is a normalization term.
Here, based on the first fusion, the two saliency maps with the largest differences are selected such that the boundaries of the salient object are more abundant. The integration of the two feature saliency maps does not simply fuse all feature saliency maps. In the first step, two similar feature maps at the same level are selected for fusion, which effectively prevents the failure of a single-feature method from degrading the salient object after feature fusion, thereby ensuring the validity of the feature saliency map. The second feature map fusion selects, on this basis, feature saliency maps with significant differences across levels, ensuring the accuracy of the salient object boundary.
The proposed integration mechanism is based on two aspects: (1) When fusing feature maps of the same level (Z1, Z2, or Z3), to prevent a single feature method from failing, the two most similar feature maps are selected for fusion; pixel regions in which both saliency maps have values close to 1 reinforce each other, thereby ensuring their saliency. (2) For the fusion of feature saliency maps of different levels, based on the first fusion, the two feature saliency maps with the most significant differences are selected to ensure the richness of the boundary features of the salient objects; feature redundancy is thereby suppressed to a certain extent, and the boundary accuracy of the salient objects is increased.
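The two-step competition can be sketched as follows. This is a minimal sketch under stated assumptions: the paper does not give the similarity measure or the fusion weights in the extracted text, so mean absolute difference as the (dis)similarity measure, the simple averaging in `fuse`, and all function names are illustrative.

```python
import numpy as np

PAIRS = [(0, 1), (0, 2), (1, 2)]  # the three possible pairs of maps

def most_similar_pair(maps):
    """First competition: among the three coarse maps of one level,
    pick the two with the highest similarity (lowest mean absolute
    difference, a hypothetical similarity measure)."""
    diffs = [np.abs(maps[a] - maps[b]).mean() for a, b in PAIRS]
    return PAIRS[int(np.argmin(diffs))]

def most_different_pair(maps):
    """Second competition: across the rescaled levels, pick the two
    maps with the largest difference to enrich boundary information."""
    diffs = [np.abs(maps[a] - maps[b]).mean() for a, b in PAIRS]
    return PAIRS[int(np.argmax(diffs))]

def fuse(m1, m2):
    """Illustrative fusion: equal-weight average, renormalized to [0, 1]."""
    fused = 0.5 * (m1 + m2)
    return (fused - fused.min()) / (fused.max() - fused.min() + 1e-12)

# Example: with one outlier map, the first competition keeps the two
# agreeing maps, and the second competition keeps the two extremes.
maps = [np.zeros((2, 2)), np.full((2, 2), 0.1), np.ones((2, 2))]
```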

Experimental Results and Analysis
In this section, a series of experiments is performed on the proposed method. To provide a fair experimental evaluation, we first introduce the evaluation indicators and benchmark datasets. Further, we use the proposed framework to optimize other advanced methods to illustrate the robustness of the proposed framework. Finally, we compare the proposed algorithm with seven other saliency detection algorithms on three benchmark datasets.

Evaluation Metrics and Datasets.
In this study, we verified the performance of the proposed algorithm using five evaluation indicators: the Precision-Recall (PR) curve, the Receiver Operating Characteristic (ROC) curve, the F-measure curve, the F-measure score, and the Mean Absolute Error (MAE). The PR curve is widely used in most salient object detection methods. The calculation method is as follows: given a saliency map, we segment the map with a threshold from 0 to 255 and then compare each result with the ground truth to generate a precision-recall curve. As a supplement, we also introduce the F-measure score, which summarizes the precision and recall values:

F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall),

where β² is set to 0.3 according to [29]. The combination of the PR curve and the F-measure score is common in existing studies.
The MAE measures the average difference between the estimated saliency map S(x, y) and the ground truth G(x, y):

MAE = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} |S(x, y) − G(x, y)|,

where W and H are the width and height of the given image, respectively.
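The two scores above are straightforward to compute; the sketch below follows the formulas as written, with β² = 0.3 per [29] (function names are illustrative).

```python
import numpy as np

def f_measure(precision, recall, beta2=0.3):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

def mae(saliency, ground_truth):
    """MAE = mean absolute pixel difference between the saliency map
    and the binary ground truth, both assumed in [0, 1]."""
    return np.abs(saliency - ground_truth).mean()

# Example: precision = recall = 0.5 gives an F-measure of 0.5, and a
# saliency map that is everywhere wrong by 1 gives an MAE of 1.
```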
In this study, all the methods were compared on three benchmark datasets. The ECSSD dataset [30] contains 1000 images and is composed of many complex scenes, such as animals, trees, and people.
The quantitative results of all methods are shown in Figure 2. The three datasets, ECSSD, ASD, and Image Pair, contain a large number of complex scenarios. On the information-rich ECSSD dataset, the proposed method achieved the best performance. The PR curves of all methods are shown in Figure 2.
As observed, the proposed method achieved the best overall performance on all datasets, particularly on the ECSSD dataset. Among the comparison methods, RBD, MR, SF, DGL, GS, CNS, and HDCT also achieved outstanding results; however, they are still inferior to our method. The three benchmark datasets contained a large number of complex scenarios, and the experimental results demonstrated the superiority of our method in complex scenarios.
From top to bottom: ASD dataset, ECSSD dataset, and Image Pair dataset. Table 1 shows the F-measure scores, that is, the comprehensive performance of the precision and recall values; the proposed algorithm is superior to the seven comparison methods. Table 2 compares the MAE, where the proposed algorithm achieved the optimal value. In summary, the proposed algorithm was validated on three benchmark datasets containing a large number of complex scenarios, and the results demonstrate the superiority of our method in complex scenarios. A visual comparison is shown in Figure 3. It can be observed that the proposed algorithm can simultaneously highlight salient regions and suppress irrelevant background regions, as in the second and fifth rows. For salient objects with rich texture features, the competition mechanism of coarse and fine saliency maps can overcome the failure of a certain feature map, as in the sixth row. The proposed method can correctly label salient regions and completely suppress background regions, particularly shadow regions. This is because the proposed graph model not only guarantees that the foreground is not influenced by the surrounding background regions but also enhances the contrast between the foreground and background. The visual comparison results further demonstrate the superiority of the proposed method.

Conclusions
In this study, we proposed a novel saliency detection framework. In the first stage, we divide the image into three different levels. On the one hand, the shapes and numbers of nodes generated by images of different scales differ, which enriches the features of the salient objects; on the other hand, this compensates for details that the feature saliency map may lose in the first fusion. Subsequently, different types of coarse saliency maps were constructed using three different types of prior knowledge features. In the second stage, we fully utilized the competition and complementarity between the feature saliency maps by introducing a novel competition mechanism. In the first fusion, the two most similar feature saliency maps within a level were selected, ensuring both the validity of the selected features and the saliency of the salient objects. In the second fusion, the feature maps of different levels were first adjusted to a unified scale, and then two feature saliency maps with a significant difference were selected for fusion, which reduced feature redundancy and preserved the information on the boundaries of the salient objects. The proposed framework overcomes the limitations of previous graph-based methods in complex scenarios. We compared the proposed method with seven other state-of-the-art methods, including graph- and nongraph-based methods. Furthermore, considering the importance of existing saliency detection methods, the proposed competition mechanism can be applied to other nongraph-based methods in future studies, such as machine learning-based methods, because it can better exploit these feature methods.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.