Saliency Detection by Multilevel Deep Pyramid Model



Introduction
Visual saliency aims to extract the most significant regions and targets in a scene by simulating the human visual attention system. In recent years, visual salient object detection models have been applied to many applications such as video summarization [1], specific object retrieval [2], and object detection [3].
To calculate visual saliency, traditional models usually rely on image contrast. For example, the global contrast-based salient model [4] divides the image into several small regions and uses the contrast between these regions to highlight salient targets. This kind of model captures overall contrast well and locates salient targets accurately, but the contours of the salient targets are relatively fuzzy. Other salient models [5] obtain the saliency map by comparing neighboring pixels in the image. This method can extract the contours of salient targets but is susceptible to complex backgrounds, which reduces the accuracy of target detection.
At the same time, there are many salient models based on low-level features, the most famous of which is the traditional pyramid model proposed by Itti and Koch [6]. In this model, low-level features such as color, direction, and brightness are extracted in different channels. This model can simulate the biological central-surrounding suppression mechanism of the human visual system, and the saliency map is obtained by multiscale feature fusion. However, only low-level features are extracted, so the contours of salient targets are fuzzy due to background noise. To avoid background noise interference, a model based on the fusion of background and foreground [7] was proposed. This model is useful in avoiding background noise interference, but it is still not accurate enough.

Meanwhile, with the continuous improvement of deep learning networks and deep convolutional networks [8] in computer vision and image processing, a salient model using a deep convolutional network based on high-level features was proposed [9]. It divides the original image into small image blocks, which are convolved and pooled at different levels and iterated to obtain the feature dictionary of the original image. Then, the saliency score of each pixel is calculated using the feature dictionary and a support vector machine (SVM). This kind of model combines the advantages of models based on local contrast with the extraction of high-level features (such as details of human faces). However, due to the lack of color, space, and other low-level features, the saliency map is easily affected by background noise.
To solve these problems, we propose a deep pyramid model called the MLDP (multilevel deep pyramid) model, which integrates low-level features, high-level features, and local contrast. The MLDP model is based on the structure of the pyramid [6], the VGG16 deep model, the superpixel segmentation mapping structure, and the background noise filtering structure [7].
First, we feed the entire image into the VGG16 model to obtain an initial saliency map. Then, we apply a pyramid structure that simulates the central-surrounding suppression mechanism, because this pyramid structure can extract features based on local contrast. We divide the initial saliency map into six different scales to form the initial saliency map pyramid, which lets us compare the multiscale images and extract local contrast features. The VGG16 model has a relatively small convolution kernel, small strides, and good accuracy; thus, we selected it for our pyramid to extract high-level features. The VGG16 model extracts high-level features from the image at each scale of the pyramid to form a deep pyramid with high-level features, and a multiscale feature map is formed by center-surround difference. For the obtained feature map, the "winner-takes-all" policy and inhibition of return are used to create the saliency map based on the deep pyramid. Because the deep pyramid uses the VGG16 deep model to extract high-level features, it has better feature extraction accuracy than a shallow model.

The superpixel segmentation mapping structure is designed to extract color, brightness, texture, and other low-level features, because superpixels are small areas whose pixels are similar to each other in position, color, brightness, texture, and other low-level features. By mapping between the superpixels and the saliency map based on the deep pyramid, low-level features can be added to form a saliency map fused with superpixels. Because the saliency map fused with superpixels is sensitive to background noise, the background noise filtering structure (based on low-level factors such as color and spatial distance) is used to eliminate the effect of background noise. The background noise filtering structure can also enhance the extraction of low-level features in order to obtain a saliency map without background noise (a saliency map based on the foreground). The final saliency map is a fusion of the saliency map based on the foreground and the saliency map fused with superpixels, so the feature extraction of the final saliency map is more comprehensive and accurate.
Our MLDP model makes the following four contributions: (1) We do not use the traditional pyramid structure to extract the low-level features of three channels (color, direction, and brightness). Instead, we use the VGG16 model to extract high-level features and form an initial saliency map that serves as the input to the deep pyramid structure. (2) We add an extra spatial pyramid pooling layer to the VGG16 model in order to adapt the deep pyramid structure to different scales.

Our Approach
In recent years, researchers have been inspired by the human visual attention system and have proposed many visual salient object detection models [10][11][12][13][14][15]. In this section, we introduce our MLDP model, as shown in Figure 1. The model consists of five parts, described in Sections 2.1 to 2.5, the last of which is (5) the weighted fusion structure of multilevel saliency maps. Next, we introduce these five parts in turn.
2.1. Forming the Initial Saliency Map. We take the original image as the input to a VGG16 model to extract high-level features. The VGG16 model in our MLDP model is similar to the traditional VGG16 model [8] with regard to feature extraction, which consists of iterated convolution and pooling.
The difference is that we add a spatial pyramid pooling layer [16] in front of the fully connected layer of the VGG16 model in order to adapt to the different scales of the images in the pyramid. The structure of our VGG16 model is shown in Figure 2. We use five convolution layers and obtain the initial global saliency map through the activation of the fully connected layer. The five convolution layers extract the global high-level features, and the spatial pyramid pooling layer keeps the parameters of the fully connected layer from changing when the size of the initial saliency map changes, which makes training easier. However, because the initial saliency map is based on global high-level feature information and ignores local contrast and low-level feature information, the initial global saliency map cannot capture the details of salient targets. Therefore, we use the deep pyramid structure and superpixel segmentation to extract the local and low-level features.

2.2. Multiscale Deep Pyramid.
The multiscale image pyramid is a feature extraction method based on local contrast. Unlike a traditional pyramid, ours ignores low-level features such as color, brightness, and direction, because the pyramid in our model is used to extract high-level features with local contrast. We use the initial global saliency map as the input to the pyramid structure and apply the VGG16 model with a spatial pyramid pooling layer to each scale in the pyramid. The main motivation for applying the VGG16 model to the Gaussian pyramid is that the two are complementary: the VGG16 model has an excellent ability to extract features but lacks the feature confrontation mechanism that exists in the pyramid, and this confrontation mechanism has been shown to be significant in salient object detection by the human visual system [17].
Without the confrontation mechanism, the performance of the VGG16 model in salient object detection falls. Conversely, if the pyramid lacks the VGG16 model, the poor performance of the traditional pyramid in extracting features such as color, brightness, and direction restricts its performance in salient object detection. Therefore, it is important to apply the VGG16 model to the Gaussian pyramid, as shown in Figure 3.

Journal of Sensors
The VGG16 model requires input of a fixed scale, whereas our Gaussian pyramid has different scales; to use the VGG16 model in the Gaussian pyramid, the multiscale requirement of the pyramid must be met. To address this problem, we add a spatial pyramid pooling layer [16], which is another main contribution in applying the VGG16 model to the pyramid. As shown in Figure 2, the spatial pyramid pooling layer divides its input into fixed grids of 4 × 4, 2 × 2, and 1 × 1. Through these fixed grids, the final output of the fully connected layers is normalized to 4096 × 1 for inputs of different scales. In our model, the VGG16 model containing a spatial pyramid pooling layer can therefore adapt to the multiscale requirements of the pyramid and form a deep pyramid.
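The fixed-grid pooling described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `spatial_pyramid_pool`, the use of max pooling inside each grid cell, and the (channels, height, width) layout are assumptions made for illustration. The point it shows is that the concatenated output length depends only on the number of channels and the grid sizes, not on the input height and width.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, grids=(4, 2, 1)):
    """Max-pool a (C, H, W) feature map over fixed grids and concatenate.

    The output length is C * sum(g * g for g in grids), whatever H and W are.
    Max pooling per cell is an assumption; the source only names the grids.
    """
    channels, height, width = feature_map.shape
    pooled = []
    for g in grids:
        # Integer bin edges chosen so that the g bins cover the whole map.
        row_edges = np.linspace(0, height, g + 1).astype(int)
        col_edges = np.linspace(0, width, g + 1).astype(int)
        for r in range(g):
            for c in range(g):
                # Force every cell to contain at least one row and one column.
                r0, r1 = row_edges[r], max(row_edges[r + 1], row_edges[r] + 1)
                c0, c1 = col_edges[c], max(col_edges[c + 1], col_edges[c] + 1)
                cell = feature_map[:, r0:r1, c0:c1]
                pooled.append(cell.max(axis=(1, 2)))
    return np.concatenate(pooled)
```

With the 4 × 4, 2 × 2, and 1 × 1 grids, each channel contributes 16 + 4 + 1 = 21 values, so maps of size 28 × 28 and 16 × 9 with the same channel count produce vectors of the same length, which is what allows a fixed-size fully connected layer to follow.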
The idea of the multiscale deep pyramid in our model is inspired by human visual attention. Research on the human visual attention system [17] shows that the human visual system detects visually salient objects through the confrontation between the central area and the surrounding area of the receptive field of a visual cell. If the central area is more salient than the surrounding area, the salient object is the central area, as shown in Figures 4(a) and 4(c); otherwise, the salient object is the surrounding area, as shown in Figures 4(b) and 4(d). The central and surrounding areas therefore confront each other to produce the final result. In our model, the scales chosen as the surrounding area in the Gaussian pyramid are 896 × 896, 640 × 640, and 520 × 520, and the scales chosen as the central area are 448 × 448, 256 × 256, and 128 × 128. Similarly to the Gaussian pyramid, we choose the scales 448 × 448, 160 × 160, and 65 × 65 as the surrounding area and the scales 28 × 28, 16 × 16, and 8 × 8 as the central area. Through the central-surrounding difference mechanism described below, our model can simulate the confrontation between the central area and the surrounding area of the receptive field of a visual cell in the human visual system.
The number of layers in the VGG16 model varies with the scale of the pyramid to avoid an excessive image size, which may cause distortion, as shown in Table 1. The flattening and fully connected layers are then used to obtain the high-level feature map of the deep pyramid at multiple scales. Its formula is as follows:

$$v_x = \mathrm{fc}\big(f(v(x))\big)$$

where $v(\cdot)$ is the extraction of high-level features with the VGG16 model, $x$ is the high-level feature saliency map at multiple scales, $f(\cdot)$ is the flattening operation, $\mathrm{fc}(\cdot)$ represents the fully connected layer, and $v_x$ is the multiscale high-level feature map of the deep pyramid.
The deep pyramid simulates the central-surrounding difference of the human visual system, which can extract local contrast features. We subtract the different levels of the high-level feature map of the deep pyramid from one another. The formula is as follows:

$$q(c, s) = v_c \otimes v_s$$

where $v_c$ and $v_s$ represent the multiscale high-level feature maps of the deep pyramid at the central and surrounding scales, respectively, $\otimes$ indicates the point-to-point subtraction between the multiscale high-level feature maps of the deep pyramid, and $q(c, s)$ is the obtained multiscale local contrast feature map of the deep pyramid.
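As a rough illustration of the point-to-point subtraction between a central and a surrounding level, the sketch below resizes the surrounding map to the central map's size and takes the difference. The helper names `resize_nearest` and `center_surround`, the nearest-neighbour resampling, and the absolute value are assumptions; the source does not state how maps of different sizes are aligned or whether the difference is signed.

```python
import numpy as np

def resize_nearest(feature_map, out_shape):
    """Resize a 2-D map to out_shape with nearest-neighbour sampling."""
    in_h, in_w = feature_map.shape
    out_h, out_w = out_shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return feature_map[np.ix_(rows, cols)]

def center_surround(v_c, v_s):
    """Point-to-point difference between a central and a surrounding map.

    The surrounding map is first brought to the central map's size.
    Taking the absolute value is an assumption made for this sketch.
    """
    v_s_resized = resize_nearest(v_s, v_c.shape)
    return np.abs(v_c - v_s_resized)
```

For example, `center_surround(v_c, v_s)` with a 28 × 28 central map and a 448 × 448 surrounding map returns a 28 × 28 local contrast map.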
In order to fuse the multiscale local contrast feature maps of the deep pyramid, the deep pyramid defines a normalization function. In this function, $x$ is the input multiscale local contrast feature map of the deep pyramid, $i$ represents the feature scores of the local contrast feature map, $M$ is the maximum value of the local contrast feature map, and $m$ is the feature average of the local contrast feature map.

We normalize the local contrast features at different scales using the normalization function and perform multiscale fusion. The result of this fusion, $J$, is the saliency map based on the deep pyramid.
On the basis of the high-level features of the initial saliency map, the deep pyramid is used to further extract high-level features, and the local contrast features are also fused to enhance the extraction of salient targets.
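The normalization and multiscale fusion steps can be sketched as follows. Because the exact normalization formula is not recoverable from the text, this sketch assumes an Itti-style form: the map is rescaled to [0, 1] and then weighted by the squared gap between its maximum $M$ and its average $m$, so that maps with a few strong peaks are promoted over maps with broadly high values. The function names `normalize_map` and `fuse`, and the plain summation used for fusion, are likewise assumptions.

```python
import numpy as np

def normalize_map(x, eps=1e-12):
    """Assumed Itti-style normalization: rescale to [0, 1], weight by (M - m)^2."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span < eps:
        # A flat map carries no local contrast.
        return np.zeros_like(x)
    scaled = (x - x.min()) / span
    M = scaled.max()    # maximum value of the map (1.0 after rescaling)
    m = scaled.mean()   # feature average of the map
    return scaled * (M - m) ** 2

def fuse(contrast_maps):
    """Sum normalized contrast maps that already share one common shape."""
    return sum(normalize_map(q) for q in contrast_maps)
```

A map with a single strong peak keeps most of its amplitude after `normalize_map`, while a map that is high almost everywhere is suppressed, which mirrors the role of the normalization function in the fusion described above.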

2.3. Using Superpixel Segmentation to Mitigate the Missing Salient Scores of Low-Level Features. Superpixel segmentation is based on similarities between pixels in low-level features such as color and spatial distance. Pixels with similar low-level features are grouped into a region, so that the image is segmented into regions whose pixels are similar in low-level features. These regions are mapped onto the saliency map based on the deep pyramid, which segments that saliency map into regions according to low-level feature similarity. For each region, the average of the salient scores of its pixels is calculated. Because pixels in the same region are similar in color, spatial distance, and other low-level features, if the salient score of a pixel is lower than the average score of its region, the pixel's salient score is replaced by the average score. This method thus adjusts the high-level salient scores according to the similarity of the pixels' low-level features within the same region. The region average is

$$\bar{d} = \frac{1}{n} \sum_{(x, y)} d(x, y)$$

where $n$ indicates the total number of pixels in a small region, $d(x, y)$ is the salient score of each pixel in a small region of the saliency map based on the deep pyramid, and $(x, y)$ are the coordinates of the pixel. Therefore, this method can compensate for the reduction of salient scores caused by a lack of low-level features.
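The replacement rule described above (raise every below-average score to its region's average) can be written compactly with NumPy. This is a sketch under stated assumptions: `labels` is a hypothetical integer array that assigns each pixel to a superpixel region, as a segmentation algorithm such as SLIC might produce (the source does not name the algorithm), and `map_onto_superpixels` is an illustrative name.

```python
import numpy as np

def map_onto_superpixels(saliency, labels):
    """Raise every below-average score to the mean of its superpixel region.

    saliency: 2-D array of salient scores d(x, y).
    labels:   2-D array of non-negative integer region ids, same shape.
    """
    saliency = np.asarray(saliency, dtype=float)
    labels = np.asarray(labels)
    flat_labels = labels.ravel()
    # Per-region sum of scores and pixel count n, then the region average.
    sums = np.bincount(flat_labels, weights=saliency.ravel())
    counts = np.bincount(flat_labels)
    means = sums / np.maximum(counts, 1)
    region_mean = means[labels]
    # Scores at or above the region average are left unchanged.
    return np.maximum(saliency, region_mean)
```

Using `np.maximum` leaves pixels that already score above their region average untouched, so the operation can only raise scores, never lower them.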
2.4. Background Noise Filtering. We use the saliency map extraction method based on foreground clues [7] to filter the interference of background noise. This is primarily divided into two parts, as detailed below.
2.4.1. The Choice of Foreground Clue. We use the adaptive threshold method [18] to segment the saliency map fused with superpixels, and select those pixels whose salient scores are greater than the threshold as the foreground clue. We use adaptive thresholds rather than a fixed threshold because adaptive thresholds can adapt to different inputs and give good accuracy.
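A minimal sketch of foreground clue selection follows. The source cites [18] for the adaptive threshold without giving the rule, so the rule used here (a multiple of the mean salient score of the image, with a default factor of 2) is only an assumption chosen because it adapts to each input; the function name `foreground_clue` and the `factor` parameter are hypothetical.

```python
import numpy as np

def foreground_clue(saliency, factor=2.0):
    """Boolean mask of pixels whose score exceeds an image-dependent threshold."""
    saliency = np.asarray(saliency, dtype=float)
    # Assumed rule, not taken from [18]: threshold scales with the mean score.
    threshold = factor * saliency.mean()
    return saliency > threshold
```

Because the threshold is derived from each map's own statistics, a dim map and a bright map both yield a non-trivial foreground set, which a single fixed threshold would not guarantee.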

2.4.2. Saliency Map Filtering Background Noise. We measure the salient score of a region by calculating the color distance and the spatial distance between the superpixel regions that match the foreground clue and the superpixel regions that do not. In the formula for this score, $FS$ is the set of foreground clues, $d(a_i, a_j)$ represents the color distance between the superpixel regions that match the foreground clue and those that do not, and $d(k_i, k_j)$ represents the spatial distance between the superpixel regions that match the foreground clue and those that do not. In order to avoid the self-similarity of zero within the foreground clue, we calculate the salient score using a second formula, in which $|FS|$ is the cardinality of the foreground clue set $FS$ and $S_i$ is the salient score of each region segmented by superpixels.
Because our foreground clue consists of pixels that have high salient scores, selected by the adaptive threshold method from our saliency map fused with superpixels, extracting foreground clues can filter out those pixels with low salient scores caused by background interference. Thus, a saliency map composed of $S_i$ can filter out the interference of background noise.
2.5. Weighted Fusion of Different Saliency Maps. To counter the weakening of salient target extraction that background noise filtering may cause, we use weighted fusion [19] of the saliency map fused with superpixels and the saliency map based on the foreground. The formula is:

$$Q = w_1 S + w_2 D$$

where $w_1$ and $w_2$ are the weights of the saliency maps $S$ and $D$, respectively, obtained by least squares estimation, and $Q$ is the final saliency map.
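One way the two weights could be estimated by least squares is sketched below. The text does not say what the weights are fitted against, so this sketch assumes they are chosen to make $w_1 S + w_2 D$ approximate ground-truth masks on a set of training images; the function names `fit_fusion_weights` and `fuse_maps` are hypothetical.

```python
import numpy as np

def fit_fusion_weights(S_maps, D_maps, ground_truths):
    """Least-squares (w1, w2) so that w1 * S + w2 * D approximates the target.

    Fitting against ground-truth masks is an assumption of this sketch.
    """
    s = np.concatenate([np.ravel(m) for m in S_maps])
    d = np.concatenate([np.ravel(m) for m in D_maps])
    g = np.concatenate([np.ravel(m) for m in ground_truths])
    # One row per pixel, one column per saliency map.
    A = np.stack([s, d], axis=1).astype(float)
    weights, *_ = np.linalg.lstsq(A, g.astype(float), rcond=None)
    return weights[0], weights[1]

def fuse_maps(S, D, w1, w2):
    """Final saliency map Q = w1 * S + w2 * D."""
    return w1 * np.asarray(S, dtype=float) + w2 * np.asarray(D, dtype=float)
```

Once fitted, the same pair of weights is applied to every test image through `fuse_maps`.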

Experiments
3.1. Datasets. To test our model, we select the MSRA, ECSSD, and PASCAL datasets. The MSRA dataset contains 5000 images with different complex backgrounds, each image of the ECSSD dataset has a well-defined salient target, and the PASCAL dataset contains 1000 real-world images that have more than one salient object.

3.2. Evaluation Metrics.
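In addition to the PR curve, we use the F-measure score [20] to evaluate the extraction of the salient target. The F-measure score is calculated as follows:

$$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$$

where $\beta^2$ is set to 0.3, and Precision and Recall are obtained by the adaptive threshold segmentation method.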
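The F-measure computation can be sketched for a single image as follows. The weighting of 0.3 comes from the text; the adaptive threshold used to binarize the saliency map (twice the mean score, capped at the map maximum) is an assumption, since the text does not spell out the rule, and `f_measure` is an illustrative name.

```python
import numpy as np

def f_measure(saliency, ground_truth, beta_sq=0.3):
    """F-measure of a saliency map against a binary ground-truth mask."""
    saliency = np.asarray(saliency, dtype=float)
    truth = np.asarray(ground_truth).astype(bool)
    # Assumed adaptive threshold: twice the mean score, capped at the maximum.
    threshold = min(2.0 * saliency.mean(), saliency.max())
    predicted = saliency >= threshold
    true_positives = np.logical_and(predicted, truth).sum()
    precision = true_positives / predicted.sum() if predicted.sum() else 0.0
    recall = true_positives / truth.sum() if truth.sum() else 0.0
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return float((1 + beta_sq) * precision * recall / denominator)
```

With a weighting below 1, precision counts for more than recall, so a map that marks too much background as salient is penalized more heavily than one that misses part of the target.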
We evaluate the salient target extraction results of these models on the different datasets, as shown in Figures 5 and 6.

As shown in Figure 5, our MLDP model achieves good precision and recall across the MSRA, ECSSD, and PASCAL datasets. Although the recall rate of the other models is slightly higher than that of our MLDP model when the recall is between 0.28 and 0.3 on the ECSSD dataset, the MLDP model is better across most other ranges in terms of precision and recall. Thus, our MLDP model has wide-ranging applicability.
We can see from Figure 6 that our MLDP model is better than the other models, as evidenced by its higher F-measure scores. The comprehensive results show that our MLDP model is quite effective in recall, precision, and F-measure.
We present a visual comparison of the results of each model in Figure 7. As shown, our MLDP model not only accurately locates the salient target but also extracts significant details of salient targets with clear contours, particularly in the case of complex backgrounds such as rows 2 and 8. Most other models confuse the salient targets and the background. Because the MLDP model can eliminate the interference of complex background factors, its extraction of salient targets is better than that of most other models. Whether the salient targets are small (row 11) or large (row 10), our MLDP model extracts them better, especially when the salient targets are large and close to the edge of the image (rows 10 and 5). Most other models are affected by the edges of the image, which impacts the clarity of salient targets. The MLDP model also shows good results for low-contrast colors (rows 1, 3, and 6). It not only gives good results on a single salient target but also works well on multiple targets (rows 4, 7, and 9).

Figure 8 shows that the MLDP model (with a deep pyramid) is better than the BFS model (without a deep pyramid). Although both the MLDP and BFS models have a background filtering structure, the BFS model is still based on low-level features extracted by traditional methods. The MLDP model, however, uses a deep pyramid to extract high-level features, resulting in greatly improved outcomes.
3.4.2. The Importance of Mapping on Superpixels. We do not extract low-level features directly, as traditional approaches do, but extract them indirectly by mapping onto superpixels, which are themselves based on low-level features. The benefits can be seen in Figure 9. The red boxes show that mapping on superpixels can make up for the lack of shape in salient targets, and the yellow boxes show that it can also reduce background interference. Mapping on superpixels thus has two major benefits in our MLDP model.

3.4.3. The Importance of Background Noise Filtering.
Because the extraction of salient targets is easily affected by background noise factors, particularly in cases with complex backgrounds, we adopt background noise filtering to eliminate this effect. To illustrate the importance of background noise filtering, we compare the results of a model without background noise filtering and our MLDP model in Figure 10.
Compared with the model without background noise filtering, the MLDP model eliminates the interference of background noise factors. Figure 10 demonstrates that the model without background noise filtering extracts salient targets less accurately because of interference from a complex background, and that this background interference appears around the salient targets. Meanwhile, the MLDP model with background noise filtering can eliminate these errors caused by background noise factors.

Conclusions
In this paper, we propose the MLDP model, which builds on a pyramid structure for extracting low-level features and adds a deep learning model to extract high-level features. The results of the MLDP model are better than those of most state-of-the-art methods, and the model is able to identify salient targets against complex backgrounds by eliminating the interference of background noise factors.

Figure 2: The architecture of the VGG16 model with spatial pyramid pooling.
Figure 3: The importance of VGG16 and Gaussian pyramid. (a) Image. (b) Saliency map based on deep pyramid without VGG16 model. (c) Saliency map based on deep pyramid without Gaussian pyramid. (d) Saliency map based on deep pyramid with Gaussian pyramid and VGG16 model.

Figure 1: The overall architecture of the proposed MLDP model.

Figure 4: Sketch of the confrontation mechanism in the human visual system.

Figure 8: Comparisons between BFS without a deep pyramid and MLDP.
MLDP without mapping on superpixels (c) MLDP with mapping on superpixels

Table 1: The number of layers at different scales of the pyramid.