Saliency Aggregation: Multifeature and Neighbor Based Salient Region Detection for Social Images



Introduction
Images and videos are two of the main media for social entertainment and communication. With the popularity of photo sharing websites, social images have become an important type of image. The most obvious feature of social images is that they typically carry several tags that describe their contents. How to use these tags for multimedia tasks, such as image indexing and retrieval [1,2], has attracted increasing attention [3]. However, tags are seldom considered in state-of-the-art salient region detection models. Therefore, in this paper, we focus on salient region detection of social images using both appearance features and tag features.
With the development of saliency detection, a large number of saliency detection algorithms have been proposed [4][5][6]. It has been found that relying only on low-level features cannot achieve satisfactory results. Research has shown that hierarchical and deep architectures [7][8][9][10][11][12] are very effective for salient region detection. Thus, a salient region detection method based on deep learning is proposed in this paper. In addition, various priors are also very important in salient region detection [13], for example, face [14][15][16], car [17], color [14], center bias [13], and objectness [18][19][20]. Intuitively, tags could be important high-level semantic cues for salient region detection [16,21]. Thus, tags are incorporated into our salient region detection model.
It is observed that different methods perform differently in saliency analysis [22]. The performance of a saliency detection method varies from image to image. This holds for both deep feature based methods and handcrafted feature based methods, so handcrafted feature based detection methods can be considered complementary to deep feature based detection methods. However, the fusion process has no ground truth, so it is nontrivial to determine which saliency map is better. A good saliency aggregation model should work on each individual image and account for the performance gaps appropriately. Therefore, how to fuse the saliency maps of different detection methods is a key issue to be solved in this paper.
The framework of salient region detection is shown in Figure 1. It includes two parts: deep learning based salient region detection and handcrafted feature based salient region detection.
There are a variety of saliency detection benchmark datasets, either from the saliency detection field [7,8,23,24,25,26] or from the image segmentation field [27][28][29]. To promote further research on and evaluation of visual saliency detection for social images, it is necessary to construct a new dataset of social images.
This paper focuses on salient region detection of social images. The contributions of this paper are twofold. First, a deep learning based salient region detection method for social images is proposed, considering both appearance features and tag features. Second, a tag neighbor and appearance neighbor based saliency aggregation method is proposed, which fuses state-of-the-art handcrafted feature based detection methods with our deep learning based detection method. The aggregation method adapts to each individual image and considers the saliency performance gaps appropriately. The detection model thus takes full advantage of image tags.
The rest of the paper is organized as follows. The deep learning based model is proposed in Section 2. Section 3 discusses the handcrafted feature based detection models. In Section 4, the saliency aggregation method is proposed. Spatial coherence optimization is discussed in Section 5. In Section 6, the new saliency dataset of social images is introduced. In Section 7, extensive experiments are performed and analyzed. Finally, conclusions are given in Section 8.

Deep Learning Based Salient Region Detection
Deep learning based salient region detection uses two types of features: appearance based CNN (convolutional neural network) features and social image tag features. They are discussed in the following subsections.
The convolution layers are responsible for multiscale feature extraction. To achieve translation invariance, a max pooling operation is performed after the convolution operation. The learned feature is composed of 4096 elements. The fully connected layers are followed by ReLU (Rectified Linear Units) for nonlinear mapping, and dropout is used to avoid overfitting. ReLU performs the following operation on each element $x$:
$$f(x) = \max(0, x).$$
The output layer uses softmax regression to calculate the probability of image patches being salient.
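As a minimal sketch of the two element-wise operations mentioned above (not the authors' implementation), the following pure-Python functions apply ReLU to a feature vector and turn two output scores into salient and nonsalient probabilities with softmax. The function names and the toy inputs are illustrative assumptions.

```python
import math
from typing import List


def relu(features: List[float]) -> List[float]:
    """Apply max(0, x) to every element of a feature vector."""
    return [max(0.0, x) for x in features]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over class scores."""
    m = max(scores)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example (assumed values): index 0 = nonsalient, index 1 = salient.
hidden = relu([-1.5, 0.0, 2.0])
probs = softmax([0.2, 1.4])
```

Here `probs[1]` plays the role of the probability that an image patch is salient.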

Multiscale CNN Feature Computation.
In an image, salient regions are unique, scarce, and clearly different from their neighborhoods. Inspired by [8], in order to compute saliency effectively, three types of differences are computed: the difference between the region and its neighborhoods, the difference between the region and the whole image, and the difference between the region and the image boundaries. To compute these differences, four types of regions are extracted: (1) rectangle ...
In the fine-tuning process, the cost function is the softmax loss with weight decay, given by
$$L(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=0}^{1} 1\{l_i = j\}\,\log P(l_i = j \mid \theta) + \lambda \sum_{k} \lVert W_k \rVert_F^2,$$
where $\theta$ is the learnable parameter of the convolutional neural network, including the biases and weights of all layers; $1\{\cdot\}$ is the indicator function; $P(l_i = j \mid \theta)$ is the predicted probability that the $i$th sample has label $j$ (salient or nonsalient); $\lambda$ is the weight decay parameter; and $W_k$ is the weight of the $k$th layer. We use stochastic gradient descent to train the network with batch size $m = 256$ and $\lambda = 0.0005$.
The initial learning rate is 0.01. When the cost stabilizes, the learning rate is decreased by a factor of 0.1. Training runs for 80 epochs. The dropout rate is set to 0.5 to avoid overfitting.
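To make the training objective concrete, the following sketch evaluates a softmax loss with weight decay on a toy batch and applies the step-wise learning rate reduction described above. It is an illustration under assumptions: the regularizer is taken to be the sum of squared weights, and the function names and toy inputs are hypothetical rather than taken from the paper.

```python
import math
from typing import Sequence


def softmax_loss_with_decay(
    probs: Sequence[Sequence[float]],          # probs[i][j]: predicted probability of label j for sample i
    labels: Sequence[int],                     # ground-truth label (0 or 1) of each sample
    layer_weights: Sequence[Sequence[float]],  # flattened weights of each layer
    weight_decay: float = 0.0005,
) -> float:
    """Mean negative log-likelihood plus an (assumed) squared-weight penalty."""
    m = len(labels)
    # The indicator function keeps only the log-probability of the true label.
    data_term = -sum(math.log(probs[i][labels[i]]) for i in range(m)) / m
    reg_term = weight_decay * sum(w * w for layer in layer_weights for w in layer)
    return data_term + reg_term


def step_learning_rate(initial_lr: float, num_plateaus: int, factor: float = 0.1) -> float:
    """Learning rate after the cost has stabilized `num_plateaus` times."""
    return initial_lr * factor ** num_plateaus


loss = softmax_loss_with_decay([[0.5, 0.5], [0.5, 0.5]], [0, 1], [[1.0, 2.0]])
lr_after_two_drops = step_learning_rate(0.01, 2)
```

How a plateau is detected is left open here; the text only states that the rate is reduced once the cost has stabilized.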

Tag Semantic Feature Computation.
Because objects are closely related to salient regions, we use object tags to compute semantic features. The probability that a region is a particular object reflects, to some extent, the possibility that it is a salient region. Therefore, the probabilities that regions are specific objects can be regarded as priors. RCNN (Regions with CNN) [31] is based on deep learning and has been widely used because of its excellent object detection accuracy. In this paper, RCNN is used to detect objects; thus tag semantics are transformed into RCNN features.
Suppose there are $K$ object detectors. For the $k$th detector, the detection process is as follows.
(1) Select $N$ proposals which are more likely to contain the specific object.
(2) Compute the probability $p_i^k$ of the $i$th proposal being the $k$th object, $1 \le k \le K$, $1 \le i \le N$. Each pixel in the $i$th proposal is assigned the same probability $p_i^k$.
(3) Over the $N$ proposals, each pixel obtains a score for the $k$th object.
After detection with the $K$ object detectors, a $K$-dimensional feature is obtained for each pixel. The $K$-dimensional feature is normalized as $f$, $f \in \mathbb{R}^K$. Each dimension of $f$ indicates the probability of being a specific object.
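The steps above can be sketched as follows. This is only one possible reading: the per-pixel score is assumed to be the sum of the probabilities of all proposals that cover the pixel, and the normalization is assumed to make each pixel's feature vector sum to 1. Neither choice is stated in the text, and the box format and function name are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive; assumed format


def pixel_tag_features(
    width: int,
    height: int,
    detections: List[List[Tuple[Box, float]]],  # detections[k]: (box, probability) proposals of detector k
) -> List[List[List[float]]]:
    """Return feats[y][x], the normalized per-object score vector of pixel (x, y)."""
    num_objects = len(detections)
    feats = [[[0.0] * num_objects for _ in range(width)] for _ in range(height)]
    for k, proposals in enumerate(detections):
        for (x0, y0, x1, y1), prob in proposals:
            # Every pixel inside a proposal inherits the proposal's probability.
            for y in range(max(0, y0), min(height, y1)):
                for x in range(max(0, x0), min(width, x1)):
                    feats[y][x][k] += prob  # assumed accumulation rule: sum
    for row in feats:
        for vec in row:
            total = sum(vec)
            if total > 0:  # assumed normalization: scores sum to 1
                vec[:] = [v / total for v in vec]
    return feats


# Toy example: a 2 x 1 image, two detectors, one proposal each.
toy = pixel_tag_features(2, 1, [[((0, 0, 1, 1), 0.6)], [((0, 0, 2, 1), 0.2)]])
```

Pixels that no proposal covers keep an all-zero vector in this sketch.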

Fusion of CNN Based Saliency and Tag Semantic Features.
Assume that the CNN based saliency map and the RCNN based semantic features have been computed. In the fusion, the tags are priors and act as weights, and the result is the fused saliency map.

Handcrafted Feature Based Salient Region Detection
It is observed that different methods perform differently in saliency analysis [22]. Although the overall detection performance of deep features is better than that of handcrafted features, differences still exist on individual images. So handcrafted feature based saliency maps can be considered complementary to deep feature based saliency maps. In Figure 4, the first column shows the original social images; the second shows the ground truth masks; the third shows the saliency maps of the DRFI method [25], which is based on handcrafted features; the last shows the saliency maps of the MDF method [8], which is based on deep features. We can see that the last column includes incomplete parts, unclear boundaries, and false detections. So in this paper, some state-of-the-art salient region detection methods based on handcrafted features are selected as complements to our proposed deep detection method.

Saliency Aggregation
4.1. Main Idea. It is observed that if a salient region detection method performs well on a social image, it is very likely to perform well on similar images. The main idea of aggregation is based on this assumption.
In the training process, sort lists of all detection methods on all images are obtained. The sort lists serve as priors in testing.
In the testing process, we search the training set for the KNN (K nearest neighbors) images that are similar to the test image. The sort lists of the KNN images are known from the training stage, so the KNN images can vote for detection methods through their sort lists. Thus, the test image obtains its own sort list based on voting. The saliency map of the test image is then computed by aggregating its saliency maps from the different methods according to this sort list.
Training process and testing process are shown in Figures 5 and 6.
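The voting idea can be illustrated with a short sketch. The exact voting and weighting rule is not recoverable from the text here, so a Borda-style count is swapped in as an assumption: in each neighbor's sort list the best of M methods earns M points, the points are normalized into weights, and the test image's saliency maps are combined as a weighted sum. All function names are hypothetical.

```python
from typing import Dict, List, Sequence


def vote_weights(neighbor_sort_lists: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Borda-style vote over sort lists (each ordered from best to worst method)."""
    points: Dict[str, float] = {}
    for sort_list in neighbor_sort_lists:
        m = len(sort_list)
        for rank, method in enumerate(sort_list):
            points[method] = points.get(method, 0.0) + (m - rank)
    total = sum(points.values())
    return {method: p / total for method, p in points.items()}


def aggregate(maps: Dict[str, List[List[float]]], weights: Dict[str, float]) -> List[List[float]]:
    """Weighted per-pixel sum of the test image's saliency maps."""
    first = next(iter(maps.values()))
    fused = [[0.0] * len(first[0]) for _ in first]
    for method, sal in maps.items():
        w = weights.get(method, 0.0)
        for y, row in enumerate(sal):
            for x, value in enumerate(row):
                fused[y][x] += w * value
    return fused


# Toy example: three neighbors vote on two methods; maps are 1 x 1 images.
weights = vote_weights([["DBS", "DRFI"], ["DBS", "DRFI"], ["DRFI", "DBS"]])
fused = aggregate({"DBS": [[1.0]], "DRFI": [[0.0]]}, weights)
```

A rank-based vote ignores how large the AUC gaps between methods are; a variant that weights by the AUC values stored in the sort lists would be an equally plausible reading.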

Training Process.
Given an image $I$ in the training set, its ground truth is given by $G$, and its saliency maps computed by the different detection methods are denoted as $S = \{S_1, S_2, S_3, \ldots, S_i, \ldots, S_M\}$. In this saliency map set, $M$ is the number of detection methods, and $S_i$ is the saliency map of the $i$th method.
For every detection method, its saliency map is compared with the ground truth $G$ to yield an AUC (Area under ROC Curve) value. The greater the AUC value, the better the saliency detection performance. After the AUC values are computed, the sort lists of all methods can be obtained.
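A small sketch of this training step follows, assuming flattened saliency maps and binary ground truth masks. The AUC is computed through its rank interpretation (the probability that a salient pixel receives a higher score than a background pixel, with ties counted as one half); the quadratic pairwise loop is chosen for clarity rather than speed, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def auc(saliency: List[float], truth: List[int]) -> float:
    """AUC via pairwise comparison; assumes both classes are present in `truth`."""
    pos = [s for s, g in zip(saliency, truth) if g == 1]
    neg = [s for s, g in zip(saliency, truth) if g == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sort_list(maps: Dict[str, List[float]], truth: List[int]) -> List[Tuple[float, str]]:
    """(AUC, method) pairs ordered from best to worst for one training image."""
    scored = [(auc(sal, truth), name) for name, sal in maps.items()]
    return sorted(scored, reverse=True)


# Toy example with two hypothetical methods "A" and "B" on a 4-pixel image.
truth = [1, 1, 0, 0]
ranking = sort_list({"A": [0.9, 0.8, 0.1, 0.2], "B": [0.9, 0.1, 0.8, 0.2]}, truth)
```

Each `(AUC, method)` pair mirrors the payload of one node in a sort list.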
For convenience, it is assumed that there are four detection methods. The sort lists are shown in Figure 7. The data structure is a singly linked list. The data field of the header node denotes the image, and the pointer field of the header node points to the first data node. Each nonheader node has three fields: the first is the AUC value, the second is the method index, and the last is a pointer.
There are 37 object tags in the new dataset, including animal, bear, birds, cat, fox, zebra, horses, tiger, cow, dog, elk, fish, whale, vehicles, boats, cars, plane, train, person, police, military, tattoo, computer, coral, flowers, flags, tower, statue, sign, book, sun, leaf, sand, tree, food, rocks, and toy.
Among these categories, animal has a superclass and subclass relationship with bear, birds, cat, fox, zebra, horses, tiger, ... Although a superclass and its subclasses are closely related in the class definition, many subclasses differ widely in environment and appearance. So, for the animal class, subclasses need exact matching to find neighbors; for the vehicles class, subclasses also need exact matching to find neighbors; because of the particularity of the person class, if there is no exact subclass match, matching can be performed at the person level.

Appearance
where $S_i(T)$ is the saliency map of the $i$th detection method on the test image $T$. The fused saliency map can be computed as follows.

Spatial Coherence Optimization
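In the saliency computations above, the spatial relationship of adjacent regions is not considered, which results in noise in the salient regions. In the field of image segmentation, researchers use the fully connected CRF (conditional random field) model [49] to achieve better segmentation results. Therefore, we use the fully connected CRF model to optimize the spatial coherence of saliency maps. The objective function is defined as follows:
$$E(L) = -\sum_{i} \log P(l_i) + \sum_{i,j} \theta_{ij}(l_i, l_j),$$
where $L$ is the binary label assignment (salient or not) of all pixels and $P(l_i)$ is the probability of pixel $x_i$ having label $l_i$. Initially, $P(1) = S_i$ and $P(0) = 1 - S_i$, where $S_i$ is the saliency of pixel $i$. $\theta_{ij}$ is defined as follows:
$$\theta_{ij} = \mu(l_i, l_j)\left[\omega_1 \exp\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma_1^2} - \frac{\lVert I_i - I_j \rVert^2}{2\sigma_2^2}\right) + \omega_2 \exp\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma_3^2}\right)\right].$$
If $l_i \ne l_j$, then $\mu(l_i, l_j) = 1$; otherwise it is 0. Both position information and color information are considered in $\theta_{ij}$: $p_i$ and $p_j$ are the positions of pixels $i$ and $j$, and $I_i$ and $I_j$ are their colors.
The first kernel, $\omega_1 \exp(-\lVert p_i - p_j \rVert^2 / 2\sigma_1^2 - \lVert I_i - I_j \rVert^2 / 2\sigma_2^2)$, suggests that adjacent pixels with similar colors should have similar saliency; $\sigma_1$ and $\sigma_2$ control the degree of distance proximity and color similarity. The second kernel, $\omega_2 \exp(-\lVert p_i - p_j \rVert^2 / 2\sigma_3^2)$, only considers position information. Its purpose is to remove small isolated areas.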
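The pairwise term can be illustrated with a short sketch that combines a position-and-color kernel with a position-only kernel, in the standard form used for fully connected CRFs. The weights and bandwidths below are hypothetical placeholder values, not the paper's settings, and the function names are illustrative.

```python
import math
from typing import Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def pairwise_potential(
    label_i: int, label_j: int,
    pos_i: Sequence[float], pos_j: Sequence[float],
    color_i: Sequence[float], color_j: Sequence[float],
    w1: float = 1.0, w2: float = 1.0,                                 # hypothetical kernel weights
    sigma1: float = 50.0, sigma2: float = 10.0, sigma3: float = 3.0,  # hypothetical bandwidths
) -> float:
    """Penalty for giving two pixels different labels."""
    if label_i == label_j:
        return 0.0  # the label compatibility term is 0 for equal labels
    d_pos = sq_dist(pos_i, pos_j)
    d_col = sq_dist(color_i, color_j)
    # Kernel 1: nearby pixels with similar colors should share a label.
    appearance = w1 * math.exp(-d_pos / (2 * sigma1 ** 2) - d_col / (2 * sigma2 ** 2))
    # Kernel 2: position only, which discourages small isolated regions.
    smoothness = w2 * math.exp(-d_pos / (2 * sigma3 ** 2))
    return appearance + smoothness


near = pairwise_potential(0, 1, (0, 0), (1, 0), (10, 10, 10), (12, 10, 10))
far = pairwise_potential(0, 1, (0, 0), (40, 0), (10, 10, 10), (200, 10, 10))
```

Minimizing the full energy over all pixel pairs requires an approximate inference scheme such as mean field, which is beyond this sketch.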

Construction of Saliency Dataset of Social Images
This paper focuses on salient region detection of social images, so it is necessary to construct a new dataset of social images to promote further research on and evaluation of visual saliency models. The construction is discussed in detail below.
6.1. Data Source. The NUS-WIDE dataset [50] is a web image dataset constructed by the NUS lab for media search. The images and tags of this dataset come from Flickr, a popular social website. We randomly select 10000 images from the NUS-WIDE dataset. The images come from thirty-eight folders of the NUS-WIDE dataset: carvings, castle, cat, cell phones, chairs, chrysanthemums, classroom, cliff, computers, cooling tower, coral, cordless, cougar, courthouse, cow, coyote, dance, dancing, deer, den, desert, detail, diver, dock, close-up, cloverleaf, cubs, doll, dog, dogs, fish, flag, eagle, elephant, elk, f-16, facade, and fawn.

Salient Region Annotation.
Since bounding boxes for salient regions are rough and cannot reveal region boundaries, we adopt pixel-wise annotation. In the annotation process, nine subjects are asked to specify the attractive regions according to their first glance at the image.
To reduce label inconsistency in the annotation results, a pixel consistency score is computed. A pixel is considered salient if at least 50% of the subjects have selected it [23].
Finally, two subjects use Adobe Photoshop to segment salient regions.
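The consistency rule described above can be sketched as a per-pixel vote over the subjects' binary masks. The nested-list mask layout and the function name are assumptions, and treating exactly 50% as sufficient is one reading of the threshold.

```python
from typing import List


def consensus_mask(annotations: List[List[List[int]]], threshold: float = 0.5) -> List[List[int]]:
    """Mark a pixel salient when at least `threshold` of the subjects selected it."""
    n = len(annotations)
    height, width = len(annotations[0]), len(annotations[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # Pixel consistency score: fraction of subjects who marked this pixel.
            score = sum(a[y][x] for a in annotations) / n
            row.append(1 if score >= threshold else 0)
        mask.append(row)
    return mask


# Toy example: three subjects annotate a 1 x 2 image.
mask = consensus_mask([[[1, 0]], [[1, 0]], [[0, 1]]])
```

With nine subjects, this threshold means a pixel needs at least five votes.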

Image Selection.
First, 10000 images are randomly selected from the NUS-WIDE dataset. Then, the images are further selected by the following criteria.
(3) At least ten percent of the salient regions are connected with the image boundaries.
After 5 rounds of selection, the dataset contains 5429 images.
In the new dataset, the images have one or more salient regions, and the positions of salient regions are not limited to image centers. The sizes of salient regions vary. Many images have complex or cluttered backgrounds. There are 78 tags, which come from the 81 tags of the NUS-WIDE dataset. All of these bring challenges to salient region detection.

Typical Images of the New Dataset.
In this section, typical examples of images, ground truth masks, and tags are shown. Images can have one or multiple salient regions, as in Figure 8. Images may have cluttered and complex backgrounds, as in Figure 9. The sizes of salient regions vary widely, as in Figure 10.
We selected 20 object tags: bear, birds, boats, buildings, cars, cat, computer, coral, cow, dog, elk, fish, flowers, fox, horses, person, plane, tiger, train, and zebra. Correspondingly, 20 RCNN object detectors were chosen to extract RCNN features. The top 1000 proposals of each detector were used to compute the RCNN features.
In addition, we also verify the performance of the aggregation method in Section 7.2.2.

Experiments on State-of-the-Art Datasets.
We also carried out experiments on six state-of-the-art datasets to validate our method. These datasets are MSRA1000 [23], DUT-OMRON [24], ECSSD [7], HKU-IS [8], PASCAL-S [51], and SOD [27]. Among these datasets, SOD [27] comes from the segmentation field; the others come from the saliency field. Because these datasets have no image level tags, we extract the objectness feature [19] of their images. Objectness is a kind of high-level semantic cue, so the objectness cue is similar to the tag feature. In contrast to DBS, the method that uses the objectness feature instead of the tag feature is abbreviated as OBS (Objectness Based Saliency).
7.1.3. Evaluation Criteria. We adopted popular evaluation measures to quantitatively evaluate the results: PR (Precision Recall) curves, ROC (Receiver Operating Characteristic) curves, the F-measure, the AUC (Area under ROC Curve) value, and the MAE (Mean Absolute Error) value.
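For reference, the scalar measures can be sketched as follows on flattened maps. The weighting $\beta^2 = 0.3$ in the F-measure is a common choice in the saliency literature and is an assumption here, since the text does not state the value used; the function names are hypothetical.

```python
from typing import List, Tuple


def precision_recall(pred: List[int], truth: List[int]) -> Tuple[float, float]:
    """Precision and recall of a binarized saliency map against a binary mask."""
    tp = sum(1 for p, g in zip(pred, truth) if p == 1 and g == 1)
    predicted = sum(pred)
    actual = sum(truth)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


def f_measure(precision: float, recall: float, beta_sq: float = 0.3) -> float:
    """Weighted harmonic mean of precision and recall (beta_sq is assumed)."""
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom else 0.0


def mae(saliency: List[float], truth: List[int]) -> float:
    """Mean absolute error between a continuous saliency map and the mask."""
    return sum(abs(s - g) for s, g in zip(saliency, truth)) / len(truth)


p, r = precision_recall([1, 1, 0, 0], [1, 0, 1, 0])
f = f_measure(p, r)
err = mae([0.5, 0.0], [1, 0])
```

PR and ROC curves follow from repeating the binarization over a range of thresholds and recording the resulting rates.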

Experiments of Deep Learning Based Detection Method.
DBS is compared with 27 state-of-the-art methods. The results are given in Table 1 and Figure 11.
Among the 28 methods in Table 1, the top four methods are all deep learning based methods: MCDL [9], RFCN [11], MDF [8], and DBS. To some extent, deep learning based detection methods are better than handcrafted feature based methods, in terms of both completeness and accuracy of the saliency maps. The AUC value of DBS is the highest. The F-measure of DBS is slightly lower than that of RFCN [11]. The MAE of DBS is the third lowest. Overall, DBS performs well.
Typical saliency maps are shown in Figure 11.

Experiments of Aggregation Method.
The handcrafted feature based detection methods used as complements to DBS are DRFI [25], SMD [46], BL [32], and MC [40]. In neighbor searching, the number of tag neighbors is 4 and the number of appearance neighbors is 4.
To verify the effect of the neighbors, the appearance neighbor based method and the tag neighbor based method are evaluated separately. The appearance neighbor based aggregation method is abbreviated as ABS (Appearance Based Saliency). The tag neighbor based aggregation method is abbreviated as TBS (Tag Based Saliency). The aggregation method based on both tag neighbors and appearance neighbors is abbreviated as FBS (Fusion Based Saliency).
The detection performances of DBS, ABS, TBS, and FBS are compared in Table 2.
PR and ROC curves are shown in Figures 12 and 13. The PR and ROC curves of FBS are higher than those of the 27 state-of-the-art methods.
Examples of typical saliency maps of the FBS and DBS methods are shown in Figure 14. It can be seen that the aggregation results are more complete and the details are better.

Experiments on State-of-the-Art Datasets.
The experimental results are given in Table 3. We can see that the AUC values of OBS are the highest on all datasets, the F-measure values of OBS are the highest on all datasets, and the MAE values are the lowest or the second lowest. Overall, OBS performs best. However, the improvements of OBS are not very pronounced because the objectness feature is not an accurate tag feature. Thus we believe that the results would improve noticeably with accurate tag annotations of the images. Experiments on state-of-the-art datasets validate the effectiveness of our proposed method DBS.

Conclusions
This paper focuses on salient region detection of social images. First, the proposed deep learning based salient region detection method considers both appearance features and tag features. Tag features are detected by RCNN models. Second, tag neighbor features and appearance neighbor features are added to the saliency aggregation model. Finally, a new dataset of challenging social images with pixel-wise saliency annotations is constructed, which can promote further research on and evaluation of visual saliency models.

2.1. CNN Based Salient Region Detection
2.1.1. Network Architecture. The deep network for appearance feature extraction has 8 layers [30], as shown in Figure 2. It includes 5 convolution layers, 2 fully connected layers, and 1 output layer. The bottom layer represents the input image and the adjacent upper layer represents the regions for deep feature extraction.

Figure 4: Examples of saliency detection results. The columns show, from left to right, the original images, the ground truth masks, the saliency maps of the DRFI method [25], and the saliency maps of the MDF method [8].

Figure 7: Images and their sort lists.

Figure 8: Images with one or multiple salient regions.

Figure 9: Images with cluttered and complex backgrounds.

Figure 10: Images in various size levels.

7.1. Experimental Setup
7.1.1. Experiments on the New Dataset. The aim of the paper is to solve salient region detection of social images. So the main experimental dataset is our new dataset, which is abbreviated as TBD (Tag Based Dataset).

Figure 12: PR curves of FBS and 27 state-of-the-art methods.

Figure 14: Visual comparisons of FBS with DBS. The order of images is original image, ground truth mask, FBS, and DBS.
Appearance Based Neighbor Search. A 256-dimensional histogram of the RGB color space is used, and the $\chi^2$ distance is computed.
4.4. Vote Based Saliency Maps Aggregation. Suppose the test image is $T$, the number of tag neighbors is $K_t$, and the number of appearance neighbors is $K_a$. After the tag based search in the training set, the number of detected neighbors is $n$. If $n$ is bigger than $K_t$, then $K_t$ images are selected from the $n$ images according to appearance similarity. Finally, the tag based neighbor set is obtained.
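The appearance based search can be sketched with a color histogram and the chi-square distance. How the 256 bins are laid out over the RGB space is not stated, so an 8 x 8 x 4 quantization is assumed here; the symmetric form of the chi-square distance with a factor of one half is also an assumption, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit (R, G, B)


def rgb_histogram(pixels: Sequence[Pixel]) -> List[float]:
    """Normalized 256-bin histogram; the 8 x 8 x 4 bin split is an assumption."""
    hist = [0.0] * 256
    for r, g, b in pixels:
        hist[(r // 32) * 32 + (g // 32) * 4 + (b // 64)] += 1.0
    n = len(pixels)
    return [h / n for h in hist]


def chi_square(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Symmetric chi-square distance; bins that are empty in both are skipped."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def nearest_neighbors(query: Sequence[float], candidates: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Names of the k training images whose histograms are closest to the query."""
    ranked = sorted(candidates, key=lambda name: chi_square(query, candidates[name]))
    return ranked[:k]


# Toy example with three single-color or two-color "images".
red = rgb_histogram([(255, 0, 0)] * 4)
blue = rgb_histogram([(0, 0, 255)] * 4)
mixed = rgb_histogram([(255, 0, 0)] * 3 + [(0, 0, 255)])
closest = nearest_neighbors(red, {"blue": blue, "mixed": mixed}, 1)
```

The same ranking function could be reused to pick the required number of tag neighbors by appearance similarity when the tag based search returns too many candidates.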

Table 1: F-measure, AUC, and MAE of DBS and 27 state-of-the-art methods.