MRBENet: A Multiresolution Boundary Enhancement Network for Salient Object Detection

Salient object detection (SOD) simulates human visual perception by locating the most attractive objects in an image. Existing methods based on convolutional neural networks have proven highly effective for SOD. However, in some cases these methods cannot both detect intact objects accurately and preserve their boundary details. In this paper, we present a Multiresolution Boundary Enhancement Network (MRBENet) that exploits edge features to optimize the location and boundary fineness of salient objects. We incorporate a deeper convolutional layer into the backbone network to extract high-level semantic features and indicate the location of salient objects. Edge features of different resolutions are extracted by a U-shaped network. We design a Feature Fusion Module (FFM) to fuse edge features and salient features. A Feature Aggregation Module (FAM) based on spatial attention performs multiscale convolutions to enhance salient features. Together, the FFM and FAM allow the model to accurately locate salient objects and enhance boundary fineness. Extensive experiments on six benchmark datasets demonstrate that the proposed method is highly effective and improves the accuracy of salient object detection compared with state-of-the-art methods.


Introduction
The goal of salient object detection (SOD) is to find the most distinct and salient objects in an image. As an important preprocessing task in computer vision applications, salient object detection has been widely applied in many fields, such as semantic segmentation [1,2], video segmentation [3], object recognition [4,5], and cropping [6].
Inspired by cognitive studies of visual attention, most early works were based on handcrafted features, such as contrast [7,8], center prior [9,10], and so on [11-13]. With the improvement of GPU computing power, deep convolutional neural networks (CNNs) [14] have successfully broken the limits of traditional methods. These CNN-based SOD methods have achieved great success on widely used benchmarks.
Inspired by the excellent performance of fully convolutional networks (FCNs) [15] in semantic segmentation, FCNs have also been widely applied to SOD, for example in several end-to-end deep network structures [16-18]. The basic units of the saliency maps output by these end-to-end networks are the individual pixels of the image, which can highlight the salient information. As the depth of the convolutional layers increases, the location of salient objects becomes more accurate. However, the detailed boundaries of salient objects are lost due to the pooling operations (see Figure 1).
Boundary information is critical for SOD. Therefore, many SOD works also try to enhance boundary details by different means. Some salient object detection models [19-21] refine high-level features with local information by combining U-Net with a bidirectional or recursive approach to obtain accurate boundary details. Some methods use preprocessing (superpixels) [22] and postprocessing (CRF) [17,21] to preserve object boundary information. Loss functions have also been used to obtain high-quality salient objects. For example, BASNet [23] uses a hybrid loss to improve boundary accuracy. Some methods [24-26] use the edge as supervision for training SOD models, which significantly improves the accuracy of the saliency map.
We explicitly model edge features and use an attention mechanism to fuse edge features and salient features, obtaining salient objects with high-quality edges.
Our contributions can be summarized as follows: (A) We propose MRBENet, a network that utilizes the FFM to fuse salient edge features and thereby enhance the boundary and semantic features of salient objects. From the top layer to the bottom layer, the edge details of the salient features are optimized sequentially. When edge features are extracted, the guidance of high-level semantic features effectively suppresses the influence of shallow noise, which our experimental results confirm. (B) The edge features are first supervised by the salient edge ground truth and then fused with the salient object features through a feature fusion module. The feature aggregation module extends the receptive field through multiscale convolution, which not only aggregates features effectively but also promotes feature fusion and enhances the edge details of salient features.

Related Work
Traditional non-deep-learning methods predict salient objects mainly from low-level features, such as pixel contrast [12], average image color difference [27], and the phase spectrum of the Fourier transform [28].
Compared with traditional methods, the convolutional neural network (CNN) performs remarkably well. In [15], Long et al. first proposed a fully convolutional neural network (FCN) to predict each pixel. The FCN replaces the last fully connected layer of the convolutional network with a convolutional layer. At the end of the network structure, the feature map is upsampled, and the upsampled feature map is then classified pixel by pixel. The final output is an image produced end to end.
In recent years, most neural network models for salient object detection have extended or improved fully convolutional neural networks. HED [29] added a series of side output layers after the last convolutional layer of each stage of VGGNet [30] and fused the feature maps output by each layer to obtain the final result map. In DSS [17], Hou et al. added several short connections from the deeper side outputs to the shallower side outputs, so that higher-level features can help locate lower-level features, and lower-level features can enrich the details of higher-level features. This combination of higher-level and lower-level features makes it possible to detect salient objects accurately. PoolNet [26] made full use of pooling and incorporated three residual blocks. Wu et al. [31] embedded a mutual learning module and an edge module in the model. Each module is separately supervised by the salient object and the edge of the salient object and is trained in an intertwined bottom-up and top-down manner. Wang et al. [32] designed a pyramid attention module for salient object detection and proposed an edge detection module. The former extends the receptive field and provides multiscale clues. The latter uses explicit edge information to locate and enhance the saliency of the object edge. Wang et al. [33] proposed an iterative collaborative top-down and bottom-up reasoning network for salient object detection. The top-down and bottom-up processes are executed alternately to complement and enhance the fine-grained saliency and high-level saliency estimation. Noori et al. [34] proposed a multiscale attention guidance module and an attention-based multilevel integrator module. These two modules not only extract multiscale features but also assign different weights to multilevel feature maps.
Given the huge body of work in this field, the latest research progress in SOD can be quickly grasped through relevant surveys [35,36]. Recently, RGB-D/RGB-T SOD has become a growing trend. The accuracy of saliency detection can be improved by learning from multimodal information simultaneously. For example, Ji et al. [37] proposed a depth calibration framework (DCF) learning strategy. DCF generates quality weights for depth images by classifying positive and negative samples of depth images. The depth images are then calibrated based on these weights. Through this strategy, depth information improves the accuracy of saliency estimation, and the influence of poor-quality depth images on the saliency results is reduced.

Feature Extraction.
We use the VGG16 network as the backbone network. As shown in Figure 2, we delete the fully connected layers, and the side-output features are denoted as
$$C = \{C^{(1)}, C^{(2)}, C^{(3)}, C^{(4)}, C^{(5)}, C^{(6)}\}.$$
Side path 1 is abandoned because it is too close to the input image, so its receptive field is very small. In addition, encoding shallow features significantly increases the computational cost [38], and side path 1 has little effect on the final result.
Since the side-output features have different resolutions and numbers of channels, we first use a set of CP modules to compress the side-output features to an identical, smaller number of channels, denoted as k. This reduces the amount of subsequent computation and makes the subsequent elementwise operations possible. Each compressed side feature is computed as $\phi(\mathrm{Trans}(C^{(i)}; \theta))$, where $\mathrm{Trans}(C^{(i)}; \theta)$ represents the convolutional layers with parameters $\theta$ (which can change the number of channels of the feature map), $\phi(\cdot)$ represents the ReLU activation function, and $i \in \{2, 3, 4, 5, 6\}$.
Low-level features carry rich information, but some of it interferes with the SOD task, so it is necessary to highlight the salient information in low-level features. The added GGM has the largest receptive field. Therefore, we predict a coarse saliency map from this layer to guide the network to extract useful details from low-level features. The coarse prediction map roughly locates the salient object regions, which have larger saliency values (weights) than the background regions. We upsample the coarse saliency map so that its resolution matches that of the low-level feature layers. To find the right details of salient objects in low-level features, we combine the low-level features with the coarse prediction map to enhance the useful details of salient objects. The data flow indicated by the purple dotted arrow in Figure 2 represents the guidance of the coarse saliency map to the low-level features. The guided features are denoted as $F^{(i)}$, $i \in \{2, 3, 4, 5\}$, and the upsampling Up is a bilinear interpolation operation.
We explicitly model the edge features on side path 2. We utilize a U-shaped network to extract edge features at four different resolutions. This network consists of a CP module and six convolution blocks. $C^{(2)}$ and $F^{(6)}$ are elementwise summed and input into the network. The CP module compresses the input features into k channels. Each convolution block consists of two convolution layers that enhance and extract edge features. We also add four edge supervisions to this network. As shown in Figure 2, we obtain the edge features $F_e^{(i)}$, $i \in \{2, 3, 4, 5\}$.
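The bilinear upsampling that brings the coarse saliency map to the resolution of a low-level feature layer can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the paper: the function name `bilinear_upsample` is hypothetical, and corner-aligned sampling is an assumption, since the text does not state which sampling convention is used. How the upsampled map is combined with the low-level features is also not reproduced here.

```python
def bilinear_upsample(grid, out_h, out_w):
    """Resize a 2-D list of floats to (out_h, out_w) by bilinear interpolation.

    Assumption: corner-aligned sampling, so the four corner values of the
    input are preserved exactly in the output.
    """
    in_h, in_w = len(grid), len(grid[0])
    # Scale factors that map output indices to fractional input coordinates.
    sy = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = []
    for r in range(out_h):
        y = r * sy
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0  # vertical weight of the lower neighbor
        row = []
        for c in range(out_w):
            x = c * sx
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0  # horizontal weight of the right neighbor
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# A 2 x 2 coarse map upsampled to 3 x 3.
coarse = [[0.0, 1.0],
          [1.0, 0.0]]
up = bilinear_upsample(coarse, 3, 3)
```

In a deep learning framework this step would normally be a single library call. The explicit loops are only meant to show what the interpolation computes.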

Feature Fusion Module (FFM).
As shown in Figure 3, the Spatial Attention (SA) module performs max pooling and average pooling on the features. The two pooling operations aggregate the channel information of the features and produce two single-channel maps. The two maps are concatenated and passed through a standard convolutional layer to obtain a spatial attention map. Spatial attention assigns a weight to each part of the feature map. Therefore, the spatial attention model can find the most important part (the salient object) in the feature map, which makes it well suited to SOD tasks.
The corresponding salient feature $F_S$ and edge feature $F_E$ are input into the FFM for feature fusion. We first enhance the edge features in the salient features by multiplication. Then we use a 3 × 3 convolution layer to derive the preliminarily fused feature P. Meanwhile, we apply the spatial attention module to the salient features to get an attention map, and then multiply it with the edge feature to obtain the feature Υ. The final fusion feature T is obtained from P and Υ.
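The spatial attention step described above (channel-wise max and average pooling, followed by a convolution over the two pooled maps) can be sketched as follows. This is a simplified illustration under stated assumptions: the learned convolution is replaced by a fixed 1 × 1 mix whose weights `w_max`, `w_avg`, and `bias` are hypothetical stand-ins, and a sigmoid is assumed to squash the result into (0, 1). The function names are also hypothetical.

```python
import math


def spatial_attention(feature, w_max=1.0, w_avg=1.0, bias=0.0):
    """Compute an H x W attention map from a C x H x W feature (nested lists).

    Assumptions: the learned convolution is replaced by a fixed 1 x 1 mix of
    the two pooled maps, and a sigmoid maps the result into (0, 1).
    """
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    attention = []
    for r in range(height):
        row = []
        for c in range(width):
            values = [feature[ch][r][c] for ch in range(channels)]
            max_pool = max(values)             # max pooling across channels
            avg_pool = sum(values) / channels  # average pooling across channels
            logit = w_max * max_pool + w_avg * avg_pool + bias
            row.append(1.0 / (1.0 + math.exp(-logit)))
        attention.append(row)
    return attention


def reweight(feature, attention):
    """Multiply every channel of a C x H x W feature by the attention map."""
    return [
        [[v * a for v, a in zip(f_row, a_row)]
         for f_row, a_row in zip(channel, attention)]
        for channel in feature
    ]


# Attention computed on a salient feature, then applied to an edge feature.
salient = [[[0.0, 2.0]], [[0.0, -2.0]]]  # 2 channels, 1 x 2 spatial size
edge = [[[1.0, 1.0]]]                    # 1 channel, 1 x 2 spatial size
attn = spatial_attention(salient)
weighted_edge = reweight(edge, attn)
```

Pooling across channels keeps the spatial layout intact, so the resulting map can weight any feature of the same resolution, which is how the attention computed on the salient feature is applied to the edge feature here.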

Decoder and Feature Aggregation Module (FAM).
As shown in Figure 4, the FAM utilizes multiscale learning (dilated convolutions with different dilation rates) to expand the receptive field, enhance the boundary details of salient objects, and promote the fusion of salient features and edge features. The input feature is denoted as χ. We expand the number of channels of χ by a factor of M with a 1 × 1 convolution. Depthwise separable convolutions with different dilation rates $d_1, d_2, d_3, d_4$, taken as 1, 2, 3, 4 here, are then used for multiscale learning, together with batch normalization (BN). Here we also set up a residual connection. Then, we apply spatial attention to χ and obtain the attention vector χ′, from which the final output feature is produced. We obtain the final saliency prediction map from top to bottom through the FAMs. We add deep supervision (purple arrow in Figure 2) after the four FAM modules to refine the saliency map by learning the error between the saliency map and the ground truth.
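To make concrete how the dilation rates 1, 2, 3, 4 expand the receptive field, the short sketch below computes the effective kernel size of a dilated convolution, k + (k − 1)(d − 1), and the padding that keeps the spatial size unchanged at stride 1. The 3 × 3 kernel size is an assumption, since the text does not state the kernel size of the dilated convolutions, and the helper names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel along one axis: k + (k - 1)(d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size, dilation):
    """Padding that preserves the spatial size (odd kernel, stride 1)."""
    return dilation * (kernel_size - 1) // 2


KERNEL_SIZE = 3              # assumption: 3 x 3 kernels
DILATION_RATES = (1, 2, 3, 4)  # the rates given in the text

sizes = {d: effective_kernel_size(KERNEL_SIZE, d) for d in DILATION_RATES}
paddings = {d: same_padding(KERNEL_SIZE, d) for d in DILATION_RATES}
```

Under this assumption the four branches cover spans of 3, 5, 7, and 9 pixels with the same number of weights per kernel, which is what allows the module to aggregate context at several scales at little extra cost.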

Loss Function.
The total loss function of our network consists of the edge loss $L_e$ and the saliency loss $L_S$. Let $G$ and $G_e$ denote the supervision from the saliency ground truth and the edge ground truth, and let $P_e^k$ and $P_S^k$ denote the edge prediction maps and the saliency prediction maps. The total loss combines the edge losses and the saliency losses over these prediction maps. Both $L_e$ and $L_S$ use the widely used cross-entropy loss function:
$$L(P, G) = -\sum_i \left[ G_i \log P_i + (1 - G_i) \log (1 - P_i) \right],$$
where i represents the pixel index, $P \in \{P_e^k, P_S^k\}$, and $G$ is replaced by $G_e$ for the edge prediction maps.
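The cross-entropy supervision described above can be sketched as follows. This is an illustrative sketch, not the training code of the paper: the function names are hypothetical, the loss is summed over pixels, every prediction map is assumed to have the same resolution as its ground truth, and the edge and saliency terms are added without weights, which is an assumption because the weighting of the individual terms is not given in the text.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy summed over all pixels of one prediction map.

    pred and target are H x W nested lists; pred holds probabilities and
    target holds labels in {0, 1}. eps clamps pred to avoid log(0).
    """
    total = 0.0
    for p_row, g_row in zip(pred, target):
        for p, g in zip(p_row, g_row):
            p = min(max(p, eps), 1.0 - eps)
            total -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
    return total


def total_loss(saliency_preds, saliency_gt, edge_preds, edge_gt):
    """Saliency loss plus edge loss over all supervised outputs.

    Assumption: an unweighted sum of the individual terms.
    """
    loss_s = sum(bce_loss(p, saliency_gt) for p in saliency_preds)
    loss_e = sum(bce_loss(p, edge_gt) for p in edge_preds)
    return loss_s + loss_e


# One saliency output and one edge output on a 1 x 2 image.
loss = total_loss(
    saliency_preds=[[[0.5, 0.5]]], saliency_gt=[[1.0, 0.0]],
    edge_preds=[[[0.5, 0.5]]], edge_gt=[[0.0, 1.0]],
)
```

Passing lists of prediction maps mirrors the several supervised outputs of the network, each compared against the same saliency or edge ground truth.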

Experiment
Datasets.
We train our network on DUTS-TR, the training subset of the DUTS dataset [39]. We evaluate the proposed network on six standard benchmark datasets: DUT-OMRON [40], DUTS [39], ECSSD [41], PASCAL-S [42], HKU-IS [43], and SOD [44,45]. DUT-OMRON contains 5168 high-quality images, each with one or more salient objects and a relatively complex background structure. DUTS is so far the largest salient object detection dataset available. It contains two subsets: the training subset DUTS-TR, with 10553 images, and the test subset DUTS-TE, with 5019 images. ECSSD contains 1000 semantically meaningful images with various complex scenes. PASCAL-S has 850 images with chaotic backgrounds and complex foregrounds. HKU-IS has 4447 images with high-quality annotations, and most of its images contain multiple connected or disconnected salient objects. SOD contains 300 high-quality images with complex backgrounds. It was originally designed for image segmentation [44]. Pixel-level annotations of salient objects were generated in [45] and are used for salient object detection. Although the SOD dataset has fewer images, it is currently one of the most challenging datasets, since most of its images contain multiple salient objects, and some salient objects overlap with the image boundary or have low contrast.
Computational Intelligence and Neuroscience

Experimental Details.
We train our network on the DUTS-TR dataset and use VGG16 as the backbone network. All weights of the newly added convolutional layers are randomly initialized from a truncated normal distribution (σ = 0.01), and the biases are initialized to 0. The hyperparameters of our network model are set as follows: learning rate = 2e-5, weight decay = 0.0005, momentum = 0.9, and batch size = 8. Backpropagation is processed in groups of 50 images. We do not use a validation dataset during the training process. The model is trained for 30 epochs, and the learning rate is divided by 10 after 15 epochs. We implement our network model with the publicly available PyTorch framework. We use an RTX 2080 Ti GPU (12 GB RAM) to train and test our model.
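The training schedule described above can be written down as a small configuration sketch. The dictionary keys and the function name are hypothetical, and epochs are assumed to be counted from zero, so the reduced learning rate applies from the sixteenth epoch onward. The optimizer itself is not named in the text and is therefore left out.

```python
# Hyperparameters as reported in the text; the key names are illustrative.
HYPERPARAMS = {
    "learning_rate": 2e-5,
    "weight_decay": 0.0005,
    "momentum": 0.9,
    "batch_size": 8,
    "epochs": 30,
    "decay_epoch": 15,   # learning rate is divided by 10 after this many epochs
    "decay_factor": 10.0,
}


def learning_rate(epoch, params=HYPERPARAMS):
    """Step schedule for a zero-indexed epoch (assumption: zero-indexed)."""
    if epoch >= params["decay_epoch"]:
        return params["learning_rate"] / params["decay_factor"]
    return params["learning_rate"]


# Learning rate for each of the 30 training epochs.
schedule = [learning_rate(e) for e in range(HYPERPARAMS["epochs"])]
```

In a framework such as PyTorch the same behavior is usually obtained with a built-in step scheduler. The explicit function is only meant to make the schedule unambiguous.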

Evaluation Metrics.
We use several widely used standard metrics, namely the F-measure, the mean absolute error (MAE) [7], the S-measure [46], and the PR curve, to evaluate our model and other advanced models. The PR curve is a standard method for evaluating the probability map of a saliency prediction. It plots precision against recall, with recall on the abscissa and precision on the ordinate.
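A PR curve of this kind is commonly obtained by binarizing the saliency map at a series of thresholds and computing precision and recall against the ground truth at each one. The sketch below follows that common procedure and is not necessarily the exact evaluation code used here: the function names are hypothetical, saliency values are assumed to lie in [0, 1], 256 evenly spaced thresholds are an assumption, and precision is defined as 1.0 when no pixel is predicted as salient, which is a convention.

```python
def precision_recall(saliency, ground_truth, threshold):
    """Precision and recall of a saliency map binarized at one threshold."""
    tp = fp = fn = 0
    for s_row, g_row in zip(saliency, ground_truth):
        for s, g in zip(s_row, g_row):
            predicted = s >= threshold
            if predicted and g:
                tp += 1
            elif predicted:
                fp += 1
            elif g:
                fn += 1
    # Convention (assumption): precision is 1.0 when nothing is predicted.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pr_curve(saliency, ground_truth, num_thresholds=256):
    """List of (precision, recall) pairs for evenly spaced thresholds in [0, 1]."""
    return [
        precision_recall(saliency, ground_truth, t / (num_thresholds - 1))
        for t in range(num_thresholds)
    ]


saliency_map = [[0.9, 0.6],
                [0.4, 0.1]]
gt_mask = [[1, 1],
           [0, 0]]
curve = pr_curve(saliency_map, gt_mask)
```

Plotting the recall values on the horizontal axis and the precision values on the vertical axis gives the curve. A method whose curve lies closer to the upper right corner is better.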
The F-measure is an overall performance measurement, computed as the weighted harmonic mean of precision and recall:
$$F_\beta = \frac{(1 + \beta^2) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^2 \times \mathrm{Precision} + \mathrm{Recall}},$$
where $\beta^2$ is set to 0.3 to weight precision more than recall. The MAE value represents the average absolute pixel difference between the saliency map (denoted by S) and the ground truth map (denoted by G):
$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right|,$$
where W and H represent the width and height of the saliency map, respectively.
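The two metrics described above translate directly into code. The following is a minimal sketch with hypothetical function names. Saliency maps and ground truth maps are assumed to be nested lists of the same size with values in [0, 1], and the F-measure is defined as 0 when both precision and recall are 0, which is a convention rather than something stated in the text.

```python
def f_measure(precision, recall, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall with beta^2 = 0.3."""
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0  # convention (assumption) when precision = recall = 0
    return (1 + beta_sq) * precision * recall / denominator


def mae(saliency, ground_truth):
    """Mean absolute pixel difference between two H x W maps."""
    height, width = len(saliency), len(saliency[0])
    total = sum(
        abs(s - g)
        for s_row, g_row in zip(saliency, ground_truth)
        for s, g in zip(s_row, g_row)
    )
    return total / (height * width)


score = f_measure(0.5, 1.0)
error = mae([[1.0, 0.0], [0.5, 0.5]],
            [[1.0, 0.0], [1.0, 0.0]])
```

The maximum F-measure reported in comparisons is typically the largest value of `f_measure` over all binarization thresholds of the saliency map.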
The S-measure focuses on evaluating the structural information of saliency maps and is closer to the human visual system than the F-measure. It is computed as
$$S = \gamma S_o + (1 - \gamma) S_r,$$
where $S_o$ and $S_r$ denote the object-aware and region-aware structural similarity, respectively, and $\gamma$ is set to 0.5 by default.

Ablation Experiment and Analysis.
In this section, we use DUTS-TR as the training set to verify the effectiveness of the key components of the proposed network. We also discuss the effects of the different components on different datasets. The baseline model is an encoder-decoder structure that can integrate multiscale features. We adopt saliency supervision and the cross-entropy loss function in this model. As Table 1 shows, the U-shaped network built on VGG16 already performs well.
The Base + E model adds edge supervision to side path 2 of the Baseline model. The saliency prediction map is obtained by fusing salient edge features and salient features. As shown in Figure 5(f), there is a lot of redundant information in the edge features of the image. As Table 1 shows, the evaluation metrics improve after edge information is incorporated into the Baseline model.
The Base + U-E model uses a U-shaped structure to extract edge features and fuses salient features and edge features by elementwise addition. Figure 5(e) shows the feature prediction map obtained at the largest of the four resolutions. Compared with Figure 5(f), Figure 5(e) has clearer object boundaries and less redundant information. Although the edge map obtained by the Base + U-E model is finer than that obtained by the Base + E model, the evaluation metrics in Table 1 show that their SOD performance does not differ significantly. Therefore, the network has to be optimized further.
The Base + U-E-G model adds a GGM module to the Base + U-E model. Although the saliency map obtained by the GGM has blurred boundaries (see the coarse prediction map in Figure 2), its spatial location information is the most abundant. The predicted coarse saliency map serves as guidance to enhance the saliency information of the side-output features. By fusing the top-level semantic feature, the edge features extracted by the U-shaped network become even finer, as shown in Figure 5(d). The evaluation metrics are also significantly improved.
Our final model adds the FFM and FAM modules to the Base + U-E-G model. The data in Table 1 show that, through the optimization of the FFM and FAM, our model achieves the best performance. This verifies that the proposed FFM and FAM modules promote the fusion of edge features and salient features more effectively and thereby improve performance.
As shown in Table 2, experiments are conducted with different feature fusion methods on the SOD, HKU-IS, and PASCAL-S datasets. Method (a) uses elementwise addition instead of the FFM to fuse salient features and edge features. Method (b) uses elementwise multiplication instead of the FFM. Method (c) concatenates the two feature maps and performs a convolution operation to fuse the features. Method (d) applies a spatial attention module after elementwise addition. Compared with method (a), method (d) performs better owing to the added spatial attention. In our model (e), the FFM combines elementwise addition and multiplication with convolution and spatial attention. Comparing the indicators across these datasets, our FFM module performs best.
Quantitative evaluation. We compare our model MRBENet with other advanced models on six datasets. Table 3 lists the MAE, max F-measure, and S-measure values of the different methods on the different datasets, and Figure 6 shows the PR curves of the different methods. Taken together, the curves and tables show that our method outperforms most methods. Our VGG16-based model performs better than some ResNet-based models such as CPD and BASNet. After the backbone network is replaced with ResNet50, the performance of the model improves further.
Visual comparison. In Figure 7, we show the visualization results of different methods. It can be seen that our method performs well on images with low contrast (rows 1 to 3), complex backgrounds (rows 4 to 6), blurred borders (rows 1 to 5), and multiple objects (rows 7 to 8). Our method makes full use of high-level semantic information and edge information, and can still recognize salient objects in complex scenes.

Conclusion
In this paper, we propose MRBENet, a network that enhances the fineness of salient objects through the multiscale fusion of salient edge features. The GGM incorporated into the backbone network extracts high-level semantic features, which help locate object boundaries accurately in shallow features. The FFM fuses edge features and salient object features to enhance the edges of salient objects. Our model performs well against the state-of-the-art methods on six datasets. The experimental results show that the model improves salient object localization and edge fineness even when the images have complex backgrounds and low contrast. In the future, we will continue to explore how to use edge information to improve saliency detection performance.
In [35], Borji et al. comprehensively reviewed the works and development trends of salient object detection before 2019 and discussed the impact of evaluation indicators and dataset bias on model performance. Recently, Wang et al. conducted a comprehensive survey [36] covering all aspects of SOD. They summarized the existing SOD evaluation datasets and evaluation indicators, constructed a new SOD dataset with rich attribute annotations, and analyzed the robustness and portability of deep SOD models for the first time in this field.

Figure 1: Comparison of our method with BMPM, PiCANet, and RAS. The boundary maps in the figure are all calculated using the Canny algorithm on the respective saliency feature maps.

Figure 2: Framework of MRBENet. The GGM (global guidance module) consists of a set of deeper, successive convolution layers and a CP module. The CP module consists of two convolution layers and a ReLU layer.

Figure 4: FAM: feature aggregation module. SA: spatial attention. D-conv denotes depthwise separable convolution, and d is the dilation rate.

Figure 5: Examples of visual edge maps from the ablation experiment. (a) RGB. (b) GT. (c) Edge GT. (d) Base + U-E-G. (e) Base + U-E. (f) Base + E.

Figure 7: Qualitative comparison of our method with nine other methods.

Table 2: Ablation experiment for feature fusion.

Table 3: Quantitative comparison of MaxF, MAE, and S-measure on six widely used datasets. ↑ means larger is better, and ↓ means smaller is better. Red and blue represent the best and second-best results, respectively. "-R" indicates that the backbone network is ResNet.