Multistrengthening Module-Based Salient Object Detection

Object detection is a classical research problem in computer vision and is widely used in automatic monitoring for production safety. However, current object detection techniques often suffer from low detection accuracy when an image has a complex background. To solve this problem, this paper proposes a double U-shaped multireinforced unit structure network (DUMRN). The proposed network consists of a detection module (DM), a reinforced module (RM), and a salient loss function (SLF). Extensive experiments are conducted on five public datasets and a practical application dataset, comparing against nine state-of-the-art methods. The experimental results show the superiority of our method over the state of the art.


Introduction
Object detection in computer vision is widely used in the field of production safety monitoring, for example, in abnormal behavior detection, regional invasion detection, and dress code detection. The practical applications of object detection effectively solve many problems and defects in safety production management while reducing preventable accidents in the workplace.
In many practical production safety applications of object detection, we found that the following problems still exist in safety harness detection: (1) the color contrast between the safety harness and the work clothes is low, which makes it difficult to detect the safety harness accurately; (2) the structure of the safety harness is complex, and its detection is easily interfered with by the texture of the work clothes, as shown in Figure 1.
In order to solve the above problems, salient object detection technology is considered a feasible solution. Salient object detection imitates the mechanism of human visual attention: the eyes locate the object of interest in the visual scene and transmit it to the brain for understanding and optimization, so that the desired information can be obtained from the scene quickly. Since salient object detection can ignore irrelevant information, the region of interest can be effectively segmented and passed to the subsequent detection process. Therefore, salient object detection, as an effective preprocessing technology, is widely used in many computer vision tasks.
Based on an extensive literature review, we briefly introduce current salient object detection methods from the following four aspects.

Salient Object Detection Based on Traditional Methods.
Early salient object detection methods were based on low-level features and heuristic prior knowledge, such as color contrast [1], background prior [2], and center prior [3]. These methods detect salient objects by searching for pixels according to a predefined saliency measure computed from handcrafted features [4, 5]. Borji et al. [6] provided a comprehensive survey of this field. Encouraged by the advances of deep CNNs in image classification [7, 8], early deep salient object detection methods searched for salient objects by classifying image pixels or superpixels as salient or nonsalient based on local image patches extracted at single or multiple scales [9, 10].

Salient Object Detection Based on Feature Enhancement.
Wang et al. [11] used a weight sharing method to refine features iteratively and promoted mutual fusion between features. Li et al. [12] proposed a novel dense attentive feature enhancement (DAFE) module for efficient feature enhancement in saliency detection. Zhang et al. (UCF) [13] developed a reformulated dropout and a hybrid upsampling module to reduce the checkerboard artifacts of deconvolution operators and to aggregate multilevel convolutional features for saliency detection. Hu et al. [14] applied a level set [15] function to output accurate boundaries and compact saliency. Luo et al. (NLDF+) [16] designed a network with a 4 × 5 grid structure to combine local and global information and used a fused loss of cross-entropy and boundary IoU inspired by Mumford and Shah [17]. Hou et al. (DSS+) [18] extended the holistically nested edge detector (HED) [19] by introducing short connections to its skip layers for saliency prediction. Chen et al. (RAS) [20] refined the side outputs of HED iteratively using a reverse attention model. Zhang et al. (LFR) [21] predicted saliency with clear boundaries by proposing a sibling architecture and a structural loss function. Yao and Wang [22] proposed an enhancing region and boundary awareness network (ERBANet) equipped with attentional feature enhancement (AFE) modules to improve detection performance.

Salient Object Detection Based on the Attention Mechanism.
The gate unit in [24] combines two consecutive feature maps of different resolutions from the encoder to generate rich contextual information. Li et al. [5] proposed an attention steered interweave fusion network (ASIF-Net) to detect salient objects, which progressively integrates cross-modal and cross-level complementarity from the RGB image and the corresponding depth map via the steering of an attention mechanism. Xu et al. [25] proposed a dual pyramid network (DPNet) for salient object detection by formulating the self-attention mechanism over subregion-based contexts. Zhou et al. [26] proposed a simple yet effective hierarchical U-shape attention network (HUAN) to learn a robust mapping function for salient object detection and formulated a novel attention mechanism to improve the well-known U-shape network. Li et al. [27] proposed a multiattention guided feature fusion network (MAF), which uses a novel channel-wise attention block (CAB) in charge of message passing layer by layer from a global view; the semantic cues in the higher convolutional block instruct the feature selection in the lower block. Zhang [31] predicted pixel-wise attention maps with a contextual attention network and then incorporated them with U-Net.

Salient Object Detection Based on Edge Optimization.
To capture finer structures and more accurate boundaries, numerous refinement strategies have been proposed. Wu et al. [32] proposed a novel stacked cross refinement network (SCRN) for salient object detection which aims to simultaneously refine the multilevel features of salient object detection and edge detection by stacking a cross refinement unit (CRU). Wang et al. (SRM) [33] proposed to capture global context information with a pyramid pooling module and a multistage refinement mechanism for saliency map refinement. Amirul et al. [34] proposed an encoder-decoder network that utilizes a refinement unit to recurrently refine the saliency maps from low resolution to high resolution. Deng et al. (R3Net+) [23] developed a recurrent residual refinement network for saliency map refinement by incorporating shallow and deep layers' features alternately. Fu et al. [35] proposed an end-to-end deep-learning-based refinement model named Refinet: edge-aware intermediate saliency maps are computed from segmentation-based pooling and then fed to a two-tier fully convolutional network for effective fusion and refinement. Researchers have improved salient object detection from the above four aspects, but the following two problems still exist. Salient object detection methods based on the fully convolutional neural network (FCN) can extract multilevel features better than previous methods. However, after continuous convolution and pooling operations, the lost shallow fine details cannot be reconstructed by the upsampling operation, resulting in defects in the fine structure or boundary, as shown in Figure 2. Saliency is defined primarily in terms of the global features of an image, rather than local or pixel-level features. In order to obtain more accurate results, salient object detection methods still need to capture both the global saliency of the whole image and the structural details of the object [19].

The Issue of Complex Background.
Most salient object detection networks adopt the U-Net structure as the encoder and decoder, and the multistage features provided by U-Net are used to reconstruct high-resolution feature maps. Whether the effective features of the encoder can be transmitted to the decoder determines whether the decoder can output an accurate salient object. However, most U-Net-based methods only consider the information interaction between different levels within the encoder or the decoder and directly use an all-pass skip-layer structure to connect encoder features to the decoder. In these methods, information interference often occurs between different blocks, especially when an image has a complex background.
Qin et al. [36] proposed a method that divides the task into two parts but optimizes the edges without taking into account the loss of fine structure and the interference of a complex background. Inspired by the BASNet [36] structure, we propose a double U-shaped multireinforced unit structure network (DUMRN) to solve the above two problems simultaneously. This network can achieve fine prediction of object boundaries and accurate salient object detection under a complex background. The main contributions of this paper include the following: (1) We propose a new detection module which includes an information processing unit, a dual-flow branch unit, and a semantic reinforcement unit. The information processing unit controls the amount of information flowing from each encoder block to the decoder while enhancing effective information and suppressing irrelevant information. The dual-flow branch unit fuses the output of the information processing unit with a supplementary branch that optimizes the residual information of the trunk branch. The semantic reinforcement unit makes full use of top-level semantic information and integrates multilevel context information to obtain more accurate spatial information and finer boundary information.
(2) We propose a new reinforcement module. It includes a feature reinforcement unit and a heat map unit. The feature reinforcement unit further fuses the information in the preliminary salient map through a U-shaped encoder-decoder structure. The heat map unit uses an improved activation function to sharpen the feature map.
(3) We design a loss function for salient object detection.
It combines a binary cross-entropy (BCE) loss, a structural similarity (SSIM) loss, and an IoU loss and can learn from the ground truth at pixel, patch, and map levels.

Materials and Methods
Because encoder-decoder structures can make full use of context features in salient object detection, two encoder-decoder structures are designed to form a double U-shaped network, which is divided into the detection module (DM) and the reinforcement module (RM), as shown in Figure 3.
In the previous double U-shaped network [36], the first U-shaped structure is a simple encoder-decoder, which often cannot effectively handle the loss of semantic information and the interference of redundant information. We therefore add several optimization units to the first U-shaped encoder-decoder structure: an information processing unit (IPU), a dual-flow branch unit (DFBU), and a semantic reinforcement unit (SRU). An image is input into the encoder-decoder, and after the information allocation of the IPU and the information supplement of the DFBU, the outputs of D1 and the SRU are added to obtain the preliminary feature map. Experimental results with a double U-shaped structure show that adding a second encoder-decoder can further enhance the information [36]. The second U-shaped structure contains a heat map unit (HMU) and a feature reinforcement unit (FRU). The preliminary salient map is input into the reinforcement module, where the operations of the two branches are carried out in parallel. Finally, the output S′1 of the FRU and the output of the HMU are added to obtain the final result.
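The top-level data flow described above can be sketched as follows. Here `dm`, `fru`, and `hmu` are hypothetical stand-ins for the detection module, the FRU encoder-decoder, and the heat map unit (each mapping an array to an array of the same shape); the actual layer configurations are those of Figures 3-9.

```python
import numpy as np

def dumrn_forward(image, dm, fru, hmu):
    """Top-level DUMRN data flow (sketch). The detection module produces
    the preliminary salient map (D1 output + SRU output); the two
    reinforcement-module branches then process it in parallel, and
    their outputs are added to give the final saliency map."""
    preliminary = dm(image)
    return fru(preliminary) + hmu(preliminary)
```

With identity stand-ins this simply doubles the input, which exercises only the additive combination of the two reinforcement branches.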

Detection Module.
The detection module is mainly aimed at solving the problems of information interference under a complex background and edge blurring. It is a U-shaped encoder-decoder structure, which mainly contains the IPU, DFBU, and SRU. The IPU controls and processes the information exchange between the encoder and the decoder to solve the interference problem caused by a complex background. The DFBU supplements the main information and addresses the problem of detailed information. The SRU makes better use of multilevel semantic information to address the edge structure problem.

Information Processing Unit (IPU).
Compared with previous methods, the U-Net structure can obtain both deep semantic information and shallow spatial information. However, interference arises when the encoder and decoder exchange information: the transmitted information contains much invalid content, and this interference degrades the quality of the transmitted information.
To solve this problem, we add an IPU between each pair of corresponding encoder and decoder blocks to distribute the information from the encoder and then transmit it to the decoder, as shown in Figure 4. Here, E_i represents the i-th layer feature of the encoder, T_i represents the i-th layer feature parallel to the encoder, and D_(i+1) represents the decoder feature of the (i+1)-th layer. When information is allocated, E_i, T_i, and D_(i+1) are input into the IPU for a series of convolution, activation, and pooling operations to obtain a weight map allocated to X_i:

X_i = S(Conv(Cat(E_i, T_i, D_(i+1)))),

where Cat(·) is the concatenation operation between channels, Conv(·) is the convolution layer, and S(·) is the element-wise sigmoid function.
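A minimal sketch of this gating computation, with a single 1 × 1 convolution standing in for the IPU's convolution/activation/pooling stack (an assumption; the decoder feature is also assumed to be already resized to match E_i):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1x1(x, w):
    # 1x1 convolution as a channel-mixing contraction:
    # (C_out, C_in) x (C_in, H, W) -> (C_out, H, W)
    return np.einsum('oc,chw->ohw', w, x)

def ipu(E_i, T_i, D_next, w):
    """Information processing unit (sketch): gate the encoder feature
    E_i with a per-pixel weight map computed from E_i, T_i, and the
    decoder feature D_{i+1}, so that effective information is passed
    to the decoder and interference is suppressed."""
    fused = np.concatenate([E_i, T_i, D_next], axis=0)  # channel concat
    gate = sigmoid(conv1x1(fused, w))                   # weights in (0, 1)
    return gate * E_i                                   # gated pass-through
```

Because the gate lies strictly in (0, 1), the output magnitude never exceeds that of the encoder feature: the unit can only attenuate, never amplify, the transmitted information.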

Dual-Flow Branch Unit (DFBU).
In the DFBU structure, the trunk branch fuses multilevel information to predict the overall information of the object, while the supplementary branch combines more low-level information to supplement and optimize the trunk branch. The IPU-processed information X_i is split into two branches that enter the DFBU, respectively. One part enters the trunk branch to obtain the convolution layer D_i; after convolution, activation, and pooling operations, the output of X_i is added to obtain the convolution layer D_(i-1). The other part enters the supplementary branch, where the information at all levels is added in turn. Finally, the supplementary output is added to the D_1 output of the trunk branch to obtain the DFBU result, denoted output_1, as shown in Figure 5.

Figure 3: DUMRN architecture. The network is divided into a detection module and a reinforcement module. The detection module includes an information processing unit (IPU), a dual-flow branch unit (DFBU), and a semantic reinforcement unit (SRU). The reinforcement module includes a heat map unit (HMU) and a feature reinforcement unit (FRU). DM represents the detection module, RM represents the reinforcement module, E_i and S_i represent convolution layers of the encoder, D_i and S′_i represent convolution layers of the decoder, and T_i represents transition convolution layers with the same size as E_i.
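The two-branch aggregation can be sketched as follows. All per-level convolutions are abstracted away (a ReLU stands in for the trunk's convolution/activation/pooling step, an assumption for illustration), and all levels are assumed resized to a common shape:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def dfbu(xs):
    """Dual-flow branch unit (sketch). `xs` holds the IPU outputs
    [X_1, ..., X_n] from shallow to deep.
    Trunk branch: fold the levels top-down, refining and adding each
    X_i in turn (D_{i-1} = refine(D_i) + X_{i-1}).
    Supplementary branch: add the levels directly, preserving
    low-level detail. The two streams are summed to give output_1."""
    trunk = xs[-1]
    for x in reversed(xs[:-1]):
        trunk = relu(trunk) + x        # refine-and-add, top-down
    supplement = xs[0]
    for x in xs[1:]:
        supplement = supplement + x    # levels added in turn
    return trunk + supplement          # output_1
```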

Semantic Reinforcement Unit (SRU).
Since spatial information and detailed information cannot be fully integrated in the U-Net structure, shallow semantic information is lost step by step during the continuous convolution and pooling of the input. Rich semantic information and accurate detailed information play an important role in salient object detection. Due to the lack of shallow and deep features, the generated salient map cannot obtain fine boundaries even when the approximate salient region is captured. Since the highest layer of the encoder has rich semantic features, we fuse the features of multiple encoder layers (E_2, E_3, E_4, E_5, and E_6) with E_1, respectively, to obtain convolution layers with the same size as E_1. Finally, we add the five fused convolution layers Y_2 to Y_6 and E_1 to obtain output_2:

output_2 = E_1 + Y_2 + Y_3 + Y_4 + Y_5 + Y_6,  (2)

as shown in Figure 6. Finally, output_1 and output_2 are added to obtain the preliminary feature map output by the detection module.
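Formula (2) can be sketched as follows. The fusion of each upsampled E_i with E_1 is approximated here by addition, and nearest-neighbour upsampling stands in for the enlarging operation of the F unit (both are assumptions for illustration):

```python
import numpy as np

def upsample_to(x, H, W):
    # nearest-neighbour upsampling by integer factors (sketch)
    c, h, w = x.shape
    return np.repeat(np.repeat(x, H // h, axis=1), W // w, axis=2)

def sru(encoder_feats):
    """Semantic reinforcement unit (sketch): fuse each deeper encoder
    feature E_2..E_n with E_1 after resizing to E_1's resolution,
    then sum the fused maps Y_i together with E_1 (formula (2))."""
    E1 = encoder_feats[0]
    _, H, W = E1.shape
    out = E1.copy()
    for Ei in encoder_feats[1:]:
        Yi = upsample_to(Ei, H, W) + E1   # fusion approximated by addition
        out = out + Yi
    return out                            # output_2
```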

Heat Map Unit (HMU).
The heat map unit mainly intensifies and weakens the features in the feature map. We introduce a nonlinear activation function, the sigmoid, to adjust the features in the map. The sigmoid, also known as the logistic activation function, compresses a real value into the range 0 to 1 and can be applied to the output layer when the ultimate goal is to predict a probability: it turns large negative numbers into zero and large positive numbers into one. We adjust this function, as shown in Figure 8, to make its graph steeper near 0 in the x-direction. This function strengthens salient features and suppresses nonsalient features in the input image, thus forming a feature map similar to a heat map. Suppressing the information on the left side of the y-axis makes the background of the salient map cleaner.
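The paper does not give the exact form of the adjusted sigmoid, so the sketch below assumes a temperature parameter k that steepens the curve around x = 0:

```python
import numpy as np

def heat_map_activation(x, k=10.0):
    """Steepened sigmoid as a sketch of the HMU activation. The
    temperature k (an assumption) sharpens the transition at x = 0:
    salient responses (x > 0) are pushed toward 1 and background
    responses (x < 0) toward 0, yielding a heat-map-like output
    with a cleaner background."""
    return 1.0 / (1.0 + np.exp(-k * np.asarray(x, dtype=float)))
```

Compared with the standard sigmoid, the steepened curve assigns a higher value to the same weakly positive response, which is the "strengthen salient, suppress nonsalient" behavior described above.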

Feature Reinforcement Unit (FRU).
The feature reinforcement unit is the second U-shaped encoder-decoder structure of the network. When the HMU suppresses nonsalient information, it may also lose some valid information mixed in with it, so the FRU mainly supplements the information lost in the HMU. By exploiting the characteristics of the U-Net structure, the FRU makes better use of deep and shallow information to reinforce the features of the initial salient feature map and outputs the result at the last convolution layer S′_1 of the decoder. This output is fused with the output of the heat map unit, and the features of the convolution layer are further reinforced to obtain the final result, as shown in Figure 9.

Loss Function.
In order to train the salient object detection model, we design a saliency loss function following the format of the saliency loss in the previous method [36]. It combines a BCE loss, an SSIM loss, and an IoU loss:

L = Σ_{i=1}^{k} ( α^(i) L_bce^(i) + β^(i) L_ssim^(i) + γ^(i) L_iou^(i) ),
where L^(i) represents the output loss at the i-th side; k represents the total number of sides; α^(i), β^(i), and γ^(i) represent the weights of the BCE, SSIM, and IoU losses, respectively; and L_bce, L_ssim, and L_iou represent the BCE loss, SSIM loss, and IoU loss, respectively. Our model has 8 outputs, namely, k = 8, including 7 outputs of the detection module and 1 output of the reinforcement module. L_bce is defined as

L_bce = -Σ [ t log(p) + (1 - t) log(1 - p) ],

where t represents the ground truth value and p represents the predicted value. SSIM was originally proposed for image quality evaluation; it explores the structural information in an image by separating out the effect of brightness. The SSIM similarity measurement is composed of three comparison modules: luminance l(x, y), contrast c(x, y), and structure s(x, y):

l(x, y) = (2 μ_x μ_y + C_1) / (μ_x^2 + μ_y^2 + C_1),
c(x, y) = (2 σ_x σ_y + C_2) / (σ_x^2 + σ_y^2 + C_2),
s(x, y) = (σ_xy + C_3) / (σ_x σ_y + C_3),

and L_ssim is defined as

L_ssim(x, y) = 1 - [l(x, y)]^α [c(x, y)]^β [s(x, y)]^γ,

where x and y are corresponding image patches; μ_x, μ_y, σ_x, σ_y, and σ_xy are their means, standard deviations, and covariance, computed with a symmetric Gaussian weighting function whose values w_i are taken at the positions of the local SSIM index in the map; and the constants C_1, C_2, and C_3 avoid the instability caused when μ_x^2 + μ_y^2 approaches 0. The exponents α, β, and γ are greater than zero and are set to 1 in practice.
L_iou is used as a standard evaluation measure and a training loss for object detection and segmentation, and it reflects the detection quality. The expression is as follows:

L_iou = 1 - |GroundTruth ∩ Predict| / |GroundTruth ∪ Predict|,

where GroundTruth is the manually annotated result and Predict represents the result predicted by the algorithm. The IoU standard measures the correlation between the true and predicted results: the higher the correlation, the higher the value.
We have introduced the principle and computation of the three loss functions. These functions act at different granularities during training. L_bce is a pixel-level convergence assessment, and different weights are assigned to the foreground and background. L_ssim is computed over the local neighborhood of each pixel and assigns a higher weight to the boundary to make it clearer. L_iou measures the correlation between real and predicted values. When combining the three losses, we use L_bce to maintain a smooth gradient for all pixels, L_iou to pay more attention to the foreground, and L_ssim to enhance the object boundary information in the feature map.
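A minimal NumPy sketch of the hybrid loss, assuming equal weights and a single global SSIM window (the paper uses local Gaussian-weighted windows, so this is a simplification):

```python
import numpy as np

def bce_loss(p, t, eps=1e-8):
    # pixel-level binary cross-entropy
    return -np.mean(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

def ssim_loss(p, t, C1=0.01 ** 2, C2=0.03 ** 2):
    # patch-level loss: 1 - SSIM, computed here over the whole map
    mu_p, mu_t = p.mean(), t.mean()
    var_p, var_t = p.var(), t.var()
    cov = ((p - mu_p) * (t - mu_t)).mean()
    ssim = ((2 * mu_p * mu_t + C1) * (2 * cov + C2)) / \
           ((mu_p ** 2 + mu_t ** 2 + C1) * (var_p + var_t + C2))
    return 1.0 - ssim

def iou_loss(p, t, eps=1e-8):
    # map-level loss: 1 - soft intersection-over-union
    inter = (p * t).sum()
    union = p.sum() + t.sum() - inter
    return 1.0 - inter / (union + eps)

def salient_loss(p, t):
    # equal weights for the three terms, as set in practice
    return bce_loss(p, t) + ssim_loss(p, t) + iou_loss(p, t)
```

A perfect prediction drives all three terms to (near) zero, while an inverted prediction is heavily penalized, dominated by the pixel-level BCE term.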
Figure 5: Dual-flow branch unit. X_i is the IPU output and DFBU input, D_i denotes each convolution layer of the trunk branch, and output_1 is the output result.

Experimental Dataset.
In this section, we first test the proposed method on the following five image saliency detection datasets: (1) ECSSD contains 1000 images with complex structures. (2) DUT-OMRON contains 5168 images with a complex foreground structure, each of which usually has a complex background or multiple foreground objects. The remaining three datasets are HKU-IS, DUTS-TE, and PASCAL-S. We then apply the proposed method to the safety harness detection task as a practical application. For this application, we collected 2200 images from power construction sites: 2000 for model training and 200 for testing.

Implementation and Experimental Setup.
We use an eight-core PC with an AMD Ryzen 1800X 3.5 GHz CPU (32 GB memory) and a GTX 1080 Ti GPU (11 GB memory) for training and testing. We build our model on the basis of the BASNet framework, and the proposed network is implemented in PyTorch. We train the network on the DUTS-TR dataset. During training, each image is first resized to 256 × 256 pixels and randomly cropped to 224 × 224 pixels. We use the Adam optimizer to train our network with the following hyperparameters: initial learning rate lr = 1e-3, betas = (0.9, 0.999), eps = 1e-8, and weight decay = 0. During testing, each input image is resized to 256 × 256 pixels and input into the network to obtain the saliency map, which is then resized to the input image size using bilinear interpolation.

Figure 7: F unit of the semantic reinforcement unit. Each convolution layer other than E_1 is enlarged to the same size as E_1 and fused with E_1 to obtain Y_i.
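The training-time preprocessing can be sketched as follows. Only the 256 → 224 resize-and-random-crop numbers come from the paper; the nearest-neighbour resize is a simplification (a real pipeline would use a library resize with interpolation):

```python
import numpy as np

def train_augment(img, rng, resize=256, crop=224):
    """Resize an (H, W, C) image to resize x resize, then take a
    random crop x crop patch, matching the paper's training setup."""
    H, W, _ = img.shape
    ys = np.arange(resize) * H // resize       # nearest-neighbour row index
    xs = np.arange(resize) * W // resize       # nearest-neighbour col index
    resized = img[ys][:, xs]
    top = rng.integers(0, resize - crop + 1)   # random crop offsets
    left = rng.integers(0, resize - crop + 1)
    return resized[top:top + crop, left:left + crop]
```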

Evaluation Metrics.
We use three metrics to evaluate the proposed method: the F-measure, MAE, and the S-measure. The F-measure F_β is computed as

F_β = (1 + β²) × Precision × Recall / (β² × Precision + Recall),

where TP denotes positive samples correctly recognized by the classifier; TN denotes negative samples correctly recognized; FP denotes negative samples incorrectly recognized as positive; and FN denotes positive samples incorrectly recognized as negative. Precision is defined as

Precision = TP / (TP + FP).

Recall is defined as

Recall = TP / (TP + FN).

MAE represents the average absolute error between the predicted and observed values, namely, the per-pixel average absolute deviation between the saliency map and its ground truth mask. MAE is a linear score, meaning that all individual differences are weighted equally in the average. As a supplement to the PR curve, it is calculated as the average absolute difference between the pixel saliency values and the ground truth:

MAE = (1/m) Σ_{i=1}^{m} |S_i − G_i|,

where m represents the area of the saliency map; S_i represents the saliency probability of pixel i; and G_i represents the ground truth value of pixel i. The S-measure takes into account the structural similarity of region-aware perception (S_r) and object-aware perception (S_o):

S = α × S_o + (1 − α) × S_r,

where α is set to 0.5.
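As a minimal sketch, MAE and the F-measure at a single fixed threshold can be computed as below (the literature often reports the maximum or adaptive-threshold F-measure instead; β² = 0.3 is the value conventionally used in salient object detection):

```python
import numpy as np

def mae(sal, gt):
    # mean absolute error between saliency map and ground-truth mask
    return np.abs(sal - gt).mean()

def f_measure(sal, gt, thresh=0.5, beta2=0.3):
    """F-measure at a fixed binarization threshold (sketch)."""
    pred = sal >= thresh
    pos = gt > 0.5
    tp = np.logical_and(pred, pos).sum()
    fp = np.logical_and(pred, ~pos).sum()
    fn = np.logical_and(~pred, pos).sum()
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)
```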

Ablation Study.
In this section, we verify the validity of each component of our model through ablation experiments on the ECSSD dataset. To demonstrate the effectiveness of our detection enhancement network, we first use the FPN branch as the baseline, and the proposed IPU, DFBU, SRU, HMU, and FRU are then added in turn. Table 1 gives the results of this ablation study.

Quantitative Evaluation.
We compare our model with nine other models: AFNet, BASNet, EGNet, F3Net, GateNet, ITSD, LDF, MINNet, and PoolNet. To evaluate the quality of the segmented salient objects, Table 2 summarizes the F-measure (F_β), S-measure (S_m), and MAE on all datasets. As Table 2 shows, the proposed method, using ResNet-50 as the backbone, outperforms the other methods in both region and boundary measures. In particular, our method improves by 4.1%, 5.1%, 6.2%, 3.4%, and 5.9% on the ECSSD, HKU-IS, DUT-OMRON, DUTS-TE, and PASCAL-S datasets, respectively. To further demonstrate the superior performance of our method, we show a qualitative comparison with the other methods in Figure 10. The proposed method suppresses interference information in the case of a complex background and strengthens the effective information of saliency targets in the images.

Practical Application.
In order to solve the problem of safety harness detection in power production safety monitoring, we apply the proposed double U-shaped multireinforced unit structure network to the YOLOv5 detection model and test its performance on the aforementioned power construction site dataset. Figure 11 shows that the proposed method can accurately detect target saliency maps against the complex background of power construction sites, and the detection accuracy improves by 10% compared with the original YOLOv5 network, as shown in Figure 12.

Conclusions
In this paper, we have proposed a double U-shaped multireinforced unit structure network (DUMRN) to improve object detection. The proposed network consists of the detection module (DM), the reinforced module (RM), and the salient loss function (SLF). Quantitative evaluation on five public datasets shows that the proposed method achieves accurate performance and outperforms nine state-of-the-art methods. In addition, the safety harness detection experiment further verifies the effectiveness of the proposed method in a practical application. However, the proposed method still has some shortcomings. First, compared with general object detection methods, it consumes more time due to the salient object detection preprocessing. Second, it may not provide stable performance for small target detection. In the future, we will further expand the datasets for more practical applications and improve the speed of the proposed method by optimizing the network structure.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.