Fusion Attention Mechanism for Foreground Detection Based on Multiscale U-Net Architecture

Foreground detection is a classic video processing task, widely used in video surveillance and other fields, and is the basic step of many computer vision tasks. The scene in the real world is complex and changeable, and it is difficult for traditional unsupervised methods to accurately extract foreground targets. Based on deep learning theory, this paper proposes a foreground detection method based on the multiscale U-Net architecture with a fusion attention mechanism. The attention mechanism is introduced into the U-Net multiscale architecture through skip connections, causing the network model to pay more attention to the foreground objects, suppressing irrelevant background regions, and improving the learning ability of the model. We conducted experiments and evaluations on the CDnet-2014 dataset. The proposed model inputs a single RGB image and only utilizes spatial information, with an overall F-measure of 0.9785. The input of multiple images is fused, and the overall F-measure can reach 0.9830 by using spatiotemporal information. Especially in the Low Framerate category, the F-measure exceeds the current state-of-the-art methods. The experimental results demonstrate the effectiveness and superiority of our proposed method.


Introduction
Intelligent video surveillance plays an important role in transportation, security, and industrial production, where detecting foreground targets such as people, animals, and vehicles is essential. As the underlying task of intelligent video surveillance, foreground detection is an important basis for subsequent high-level tasks such as target tracking, target recognition, and behavior analysis. The quality of its detection directly affects the performance of these subsequent tasks. Foreground detection, also known as foreground segmentation, is one of the research hotspots in the field of computer vision. Its application is not limited to intelligent video surveillance; it is also widely used in human-computer interaction [1], video coding [2], automatic driving [3], and other fields. In practical application scenarios, there are often many interference factors such as dynamic backgrounds, camera jitter, illumination changes, and shadows, so it is very important to study more robust and efficient foreground detection methods.
Foreground detection methods are generally divided into three categories: the optical flow method, the inter-frame difference method, and the background modeling method. The optical flow method can adapt to changes in dynamic scenes, but it is difficult to use in practice due to the high complexity of the algorithm. The inter-frame difference method has low algorithmic complexity and is not very sensitive to illumination changes in the scene, but it leaves many holes in the detected foreground objects, which degrades the detection result. The background modeling method is currently the mainstream method and is the most widely used. It usually has the following four steps: feature extraction, background model initialization, background model maintenance, and foreground detection (see Figure 1). Generally, a background model is established by designing a feature algorithm, the current input image is compared with the background model, each pixel is classified as foreground or background according to a threshold, and the result is represented by a binary image. The foreground detection performance largely depends on the accuracy of the background model. Traditional background modeling methods are generally based on handcrafted features and usually rely on strong prior conditions. They perform well only in specific types of scenes, have difficulty adapting to the various interference challenges, and perform poorly in complex scenes.
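The comparison-and-threshold step of the background modeling pipeline described above can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists and a fixed background image; the function name `subtract_background` and the threshold value of 30 are hypothetical choices for illustration and do not correspond to any specific method reviewed in this paper.

```python
def subtract_background(frame, background, threshold=30):
    """Label each pixel as foreground (1) or background (0).

    frame and background are equally sized 2-D lists of grayscale
    intensities. The threshold is a hypothetical value for illustration.
    """
    mask = []
    for frame_row, bg_row in zip(frame, background):
        mask.append([1 if abs(p - b) > threshold else 0
                     for p, b in zip(frame_row, bg_row)])
    return mask


# A static background and a frame in which the middle column has changed.
background = [[10, 10, 10], [10, 10, 10]]
frame = [[12, 200, 11], [9, 180, 10]]
mask = subtract_background(frame, background)
```

In practice the background model is updated over time rather than fixed, which is the model maintenance step listed above.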
In recent years, with the rapid development of deep learning technology, convolutional neural networks (CNNs) [4] have been proven to effectively learn deep abstract features and have achieved great success in computer vision fields such as image classification, image segmentation, and object detection. The purpose of foreground detection is to separate the foreground objects from the background, which is a typical pixel-level binary classification problem and therefore also an image segmentation task. Compared with traditional background modeling methods, a CNN can learn powerful feature extraction capabilities from training data, and the extracted features are often better than handcrafted features, which can significantly improve the effect of foreground detection.
In 2015, Long et al. [5] proposed the fully convolutional network (FCN). By replacing the fully connected layers of the VGG [6] network with convolutional layers, dense prediction at the pixel level of the image is achieved. For application scenarios with strict segmentation accuracy requirements, such as medical image segmentation, Ronneberger et al. [7] proposed a multiscale fully convolutional neural network, U-Net, that can fuse shallow and deep features. The feature maps of different scales in the encoding network are passed to the decoding network of the corresponding size through skip connections and are concatenated with the feature map channels of the decoding network to achieve multiscale feature fusion. Oktay et al. [8] added an attention mechanism to the U-Net network, which better focuses on salient regions and suppresses irrelevant background regions, and obtained satisfactory results.
Inspired by the above research work, this paper proposes a model with a fusion attention mechanism based on the multiscale U-Net architecture for foreground detection. We name the model AMU-Net, which is an end-to-end encoder-decoder structure. The encoder adopts a pretrained VGG-16 [6] network for downsampling to extract feature information, and the decoder learns the mapping from feature space to image space by upsampling with transposed convolutions. The attention mechanism is added to the skip connections between the encoder and the decoder so that the network learns more features related to the foreground target and suppresses the learning of background features that are irrelevant to the task. We test and evaluate the model on the CDnet-2014 dataset [9], and the results show that the proposed method outperforms most existing methods. The main contributions of our work are as follows: (1) We propose a network model based on a multiscale feature fusion attention mechanism for foreground detection, which requires only a small amount of training data and only uses image spatial information to achieve accurate foreground segmentation. (2) We conducted two types of multi-input experiments, which not only utilized spatial information but also added time-varying information to further improve the detection performance of the model. Especially in the very challenging Low Framerate category, the multi-input methods show excellent results. The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 details the proposed model architecture. Section 4 presents the experimental results on the CDnet-2014 dataset and compares them with other state-of-the-art methods. Finally, Section 5 concludes the paper.

Related Works
In the past three decades, many scholars have proposed various background models and algorithms to improve the foreground detection effect. Bouwmans et al. [10][11][12][13] review the existing various foreground detection methods and summarize them well. We mainly review the more representative methods, which can be divided into traditional unsupervised methods and deep learning-based supervised methods according to whether the annotation information is used or not.

Traditional Unsupervised Methods. In 1999, Stauffer and Grimson [14] proposed the classical Gaussian mixture model (GMM), which uses multiple Gaussian distributions to estimate changes in background color over time; however, it has difficulty modeling rapidly changing backgrounds and tends to produce more false detections. Elgammal et al. [15] proposed a background model based on kernel density estimation (KDE). By estimating the video sample data with a kernel function, the sample data with the highest probability density is selected as the background. The method is costly in both time and memory because it needs to store a large amount of historical data. Kim et al. [16] proposed a codebook model, which builds a codebook for each pixel, and each codebook contains a set of codewords. First, background learning is performed on the video sequence to obtain the background model codebook. When detecting the foreground target, the pixels are matched with the codewords in the corresponding codebook. If the pixel value falls within the corresponding codeword, it is classified as a background pixel; otherwise, it is a foreground pixel. Heikkila and Pietikainen [17] first introduced the Local Binary Pattern (LBP) into the foreground detection task. The pixels of an image region are labeled by thresholding the neighboring pixels against the center pixel, and the result is represented in binary form. The SuBSENSE [21] and PAWCS [22] algorithms improved the feature expression ability by combining texture and color features and used a pixel-level feedback strategy to automatically adjust internal parameters. Wang et al. [23] proposed the FTSG algorithm, which combines a split Gaussian model with a flux tensor (an optical flow feature) to improve the detection effect. The algorithm uses a Gaussian mixture model for the background and a single Gaussian for the foreground; by computing the flux tensor [24], it can account for optical flow variations within a local 3D spatiotemporal volume and is used to detect blob motion.
Bianco et al. [25] proposed the IUTIS-5 method, which integrates a variety of state-of-the-art algorithms through genetic programming to deal with complex background scenes. The above traditional unsupervised methods primarily perform foreground detection with handcrafted features, which are sensitive to changes in the video scene. Therefore, such methods are generally suitable only for specific or simple video scenes, and their performance is poor when they face sudden lighting changes, shadows, camouflage, and similar challenges. In addition, although these algorithms are unsupervised, a large number of parameters must be tuned manually to improve their robustness. Therefore, traditional unsupervised methods are not sufficient to handle complex real-world scenarios.

Methods Based on Deep Learning. In 2016, Braham and Van Droogenbroeck [26] first introduced a convolutional neural network into foreground detection. The method first extracts the background image through median filtering over 150 initialization frames, then extracts image blocks centered on each pixel from the current input frame and the background image, and finally inputs the combined image blocks into the trained network model to calculate the foreground probability of that pixel. Babaee et al. [27] proposed an improvement to the generation of background images, which enhanced the background images with the output results of the unsupervised algorithms SuBSENSE [21] and FTSG [23]. Wang et al. [28] proposed a multiscale cascaded convolutional neural network. The model has two key structures: (1) In the multiscale structure, the model downsamples the current frame with ratios of 0.75 and 0.5; then the current frame and the downsampled frames are sent to the CNN model; and finally, the output results of different sizes are upsampled, and average pooling is performed to achieve the segmentation of the foreground image. (2) In the cascade structure, to reduce pixel misclassification caused by local information, the foreground probability map output by the first CNN model and the current image are input to a second CNN model, and a more refined foreground probability map is output. Unlike the above three patch-based convolutional neural networks, Zeng and Zhu [29] proposed a network architecture based on FCN [5] for foreground detection.
The whole image is used as the input of the network for multiscale feature extraction, and a contrast layer is used to strengthen the learning of the difference between the foreground target and the background region. Lim and Yalim Keles [30] proposed multiscale segmentation architectures named FgSegNet_M and FgSegNet_S, both of which are encoder-decoder structures. FgSegNet_M uses a pretrained VGG-16 [6] as the encoder; the current image is resized to three different scales and input into three sets of encoders, whose outputs are decoded using a transposed convolutional neural network to obtain the final segmentation map. FgSegNet_S keeps only one set of encoders and uses a feature pooling module (FPM) with atrous convolution to extract multiscale information. In further research, they proposed an improved architecture, FgSegNet_v2 [31], which modifies the FPM and adds skip connections between the encoder and decoder.
This algorithm currently ranks first on the CDnet-2014 dataset [9], outperforming all other methods. Tezcan et al. [32] proposed the BSUV-Net model for foreground detection in unseen videos. The method uses a U-Net [7] type fully convolutional neural network with skip connections. The input to the network consists of the current frame and two background frames at different time scales, together with their semantic information. Although the method generalizes to some extent, its performance needs to be improved in challenging situations such as camera jitter and dynamic background. Sakkos et al. [33] first used 3D convolutional neural networks [34] to build a background subtraction model, spatiotemporally encoding video sequences and capturing both the features of the spatial dimensions and the changes in the temporal dimension. Hu et al. [35] and Akilan et al. [36] both use 3D convolutional layers and ConvLSTM layers to extract spatiotemporal information. Zheng and Wang [37] proposed a foreground detection model, BSGAN, based on the Bayesian Generative Adversarial Network (Bayesian GAN) [38]. BSGAN uses convolutional neural networks to build the generator and the discriminator and trains them adversarially so that the generator acquires the ability to segment foreground and background. To improve the robustness of the model, Zheng et al. further proposed BSPVGAN [39], which introduces the parallel vision theory on the basis of BSGAN to improve foreground detection in complex scenes. In many complex scenes, the features required for foreground detection are extremely difficult to design by hand, and handcrafted features cannot meet the requirements. Deep learning methods, in contrast, can extract the required abstract features from the data and learn the feature mapping between the input and output data, which makes them suitable for such complex computer vision problems.
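The background-image extraction used by the patch-based method [26] reviewed above (a temporal median over a set of initialization frames) can be sketched as follows. This is a minimal illustration that assumes grayscale frames stored as nested lists; the function name `median_background` is hypothetical, and the example uses three tiny frames rather than the 150 initialization frames mentioned in the text.

```python
from statistics import median


def median_background(frames):
    """Per-pixel temporal median over a list of equally sized 2-D frames."""
    height, width = len(frames[0]), len(frames[0][0])
    return [[median(f[r][c] for f in frames) for c in range(width)]
            for r in range(height)]


# Three toy frames; a moving object briefly covers one pixel in the second.
frames = [
    [[10, 10], [10, 10]],
    [[10, 250], [10, 10]],
    [[12, 10], [10, 11]],
]
background = median_background(frames)
```

Because the median ignores values that appear in only a minority of frames, the briefly occluded pixel is recovered as background.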

The Proposed Method
The proposed AMU-Net architecture is shown in Figure 2; it is an end-to-end fully convolutional encoder-decoder network. We divide it into three parts: the encoder network, the decoder network, and the attention mechanism module. The encoder network aims to learn more semantic information by gradually reducing the spatial size of the feature maps, the decoder network restores the spatial size of the feature maps through upsampling operations, and the skip connections fused with the attention mechanism capture local and global context information from features at different resolutions (scales), resulting in more accurate foreground detection results. The input of the network is a three-channel RGB image, and the output of the network is a foreground probability map of the same size as the input. This method does not need to pre-extract the background image.

Encoder-Decoder Network. As shown in Figure 2, the encoder network in the model is a VGG-16 network. Because we only use a small amount of training data in the experiments, we use transfer learning and adopt a VGG-16 pretrained on the ImageNet dataset [40]. Transfer learning is widely used in many fields [41,42]. By initializing with a pretrained network and then fine-tuning the weights on the new task, this approach converges faster than training a new network from random initialization and generally obtains higher accuracy. Through a series of convolution and pooling operations in the encoder network, the feature map is downsampled four times, and the size of the corresponding feature maps is shown in Figure 2. It can be seen that the spatial resolution of the lower layers is higher; as the feature layers deepen, the spatial resolution continues to decrease; and the encoder achieves high-level semantic coding by continuously reducing the resolution while increasing the number of feature maps. We made some modifications to the VGG-16 network to make it suitable for the task of this paper.
We removed the last pooling layer and all fully connected layers of the VGG-16 network. Although the fully connected layers contain more high-level semantic information, they lack spatial details and are therefore not suitable for pixel-level foreground detection tasks; their computational cost is also very high. Some recent studies [6,42] have shown that features extracted from different convolutional layers have different roles. The resolution of the lower layers is relatively high, but they perceive only the local details of the image, while the deeper layers can obtain the global context information of the image at a lower resolution. Using deep feature information from different layers not only yields more accurate location information but also preserves high-level semantic information. Therefore, to fuse the multiscale features of different feature layers, we extract the convolution output before each pooling layer, pass it through the attention mechanism on the skip connection, and concatenate it with the corresponding convolutional layer of the decoder network. That way, the detected foreground target boundary information is more complete, and the final foreground target is more accurate.
The resolution of the AMU-Net input image is uniformly adjusted to 640 × 480, and after feature extraction by the VGG-16 encoder, 512 feature maps with a resolution of 40 × 30 are finally generated. These feature maps are fed into the decoder network through a 1 × 1 convolution. The decoder network is divided into four stages, and the feature maps of each stage are upsampled and then combined with the low-level features of the encoder passed through skip connections. This way, multiscale information from different feature layers is obtained. Upsampling is achieved by transposed convolution. To reduce the checkerboard effect [43], a transposed convolution with a 4 × 4 kernel and stride 2 is used to enlarge the resolution of the feature map. The size is gradually enlarged with the upsampling; the concatenated feature maps are successively passed through a 3 × 3 convolution and a ReLU activation function; the final network output size is consistent with the input size; and finally, the number of feature maps is reduced to 1 through a 1 × 1 convolution. The decoder achieves foreground/background semantic decoding by continuously reducing the number of feature maps while increasing the resolution. The final output layer of the decoder uses the sigmoid activation function to map the features to values between 0 and 1 to generate a foreground probability map, and the probability map is binarized to obtain the foreground/background segmentation mask.
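The resolution bookkeeping of the decoder described above can be checked with a short sketch. It uses the standard output-size formula of a transposed convolution without dilation or output padding. The padding value of 1 is an assumption, since the text specifies only the 4 × 4 kernel and the stride of 2, and the function names are hypothetical.

```python
def transposed_conv_size(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution along one dimension.

    Standard formula without dilation or output padding:
    (size - 1) * stride - 2 * padding + kernel.
    padding=1 is an assumption; the text gives only kernel and stride.
    """
    return (size - 1) * stride - 2 * padding + kernel


def decoder_resolutions(width=40, height=30, stages=4):
    """Feature-map resolution at the encoder output and after each stage."""
    sizes = [(width, height)]
    for _ in range(stages):
        width = transposed_conv_size(width)
        height = transposed_conv_size(height)
        sizes.append((width, height))
    return sizes


resolutions = decoder_resolutions()
```

Under this assumption each stage exactly doubles the resolution, so four stages take the 40 × 30 encoder output back to the 640 × 480 input size. A kernel size divisible by the stride is the usual way to limit checkerboard artifacts.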

Attention Mechanism Module.
The attention mechanism [44] in the field of deep learning is essentially similar to the human visual selective attention mechanism, which tends to focus on some specific parts of all the information that has been collected. Given its advantages in discrimination and focusing, the attention mechanism has been widely used in various fields of artificial intelligence. In the standard encoder-decoder U-Net network, although there is more spatial information in the shallow layers, the learned features are not as rich as those of the deep layers. In addition, similar shallow features should not be extracted repeatedly when using skip connections for feature fusion; otherwise, this leads to computational burden and model parameter redundancy. Therefore, this paper applies the attention mechanism to the skip connections to increase the weight of foreground objects and suppress the interference of background pixels, thereby improving the learning ability of the model. The internal structure of the attention mechanism module is shown in Figure 3. The module has two inputs: one is the upsampled feature $g$ (of size $F_g \times H_g \times W_g$), and the other is the encoded feature $x^l$ (of size $F_l \times H_x \times W_x$) with the same resolution transmitted through the skip connection, where the upsampled feature $g$ can be regarded as a gating signal that enhances the learning ability of $x^l$. The two inputs are first subjected to a 1 × 1 convolution operation to obtain $W_g^T g_i$ and $W_x^T x_i^l$. Then, the two results are added, and the ReLU activation ($\sigma_1(x) = \max(0, x)$) is applied. The fused features undergo a convolution operation again, and through the sigmoid activation function ($\sigma_2$), the attention coefficient $\alpha$ is obtained; that is,
$$\alpha = \sigma_2\left(\varphi^T \sigma_1\left(W_g^T g_i + W_x^T x_i^l + b_g\right) + b_\varphi\right), \qquad (1)$$
where $b_g$ and $b_\varphi$ represent the bias terms and $\varphi$ represents the convolution kernel of size 1 × 1. Finally, the encoded features are multiplied by the attention coefficient $\alpha$ to output a new feature map $y^l$.
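The gating steps described above can be traced at a single spatial location with a minimal sketch. Because a 1 × 1 convolution applies the same linear map at every pixel, a per-pixel version is sufficient to show the data flow. The function name, the intermediate channel count, and all weight values below are hypothetical and chosen only for illustration.

```python
import math


def attention_gate_pixel(x, g, W_x, W_g, b_g, phi, b_phi):
    """Attention coefficient and gated feature at one spatial location.

    x: encoder feature vector; g: gating (upsampled decoder) vector.
    W_x and W_g are lists of weight rows standing in for the two 1x1
    convolutions; phi stands in for the second 1x1 convolution.
    """
    hidden = []
    for row_x, row_g, bias in zip(W_x, W_g, b_g):
        s = sum(w * v for w, v in zip(row_x, x))
        s += sum(w * v for w, v in zip(row_g, g))
        hidden.append(max(0.0, s + bias))        # ReLU (sigma_1)
    q = sum(p * h for p, h in zip(phi, hidden)) + b_phi
    alpha = 1.0 / (1.0 + math.exp(-q))           # sigmoid (sigma_2)
    return alpha, [alpha * v for v in x]         # scale the encoder feature


# Hypothetical weights with two input channels and two intermediate channels.
alpha, y = attention_gate_pixel(
    x=[1.0, 2.0], g=[0.5, -0.5],
    W_x=[[1.0, 0.0], [0.0, 1.0]],
    W_g=[[1.0, 1.0], [1.0, -1.0]],
    b_g=[0.0, 0.0], phi=[1.0, -1.0], b_phi=0.0)
```

The coefficient always lies between 0 and 1, so the gate can only attenuate the skip-connection features, never amplify them.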

Loss Function.
A common problem in foreground detection tasks is that there are far more background pixels in the scene than foreground pixels; this is known as the class imbalance problem. We take foreground pixels as positive samples and background pixels as negative samples. The class imbalance problem has two consequences:
(1) The proportion of positive and negative samples is extremely unbalanced. The number of negative samples greatly exceeds that of positive samples, and many negative samples are relatively easy to identify.
(2) The difficulty of sample classification is unbalanced. Although the loss values of the easily identified negative samples are individually small, their number is so large that the sum of these loss values has a very large impact on the total loss. The gradient optimization process during training is then dominated by the easily identifiable negative samples, so the model pays too much attention to these loss values and eventually converges to an unsatisfactory result.
To suppress the loss caused by the large number of easy-to-learn background samples and prevent the network from being misled, this paper adopts a loss function composed of the binary cross-entropy loss (2) and the Tversky loss [45] (3). The binary cross-entropy loss is defined as follows:
$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p(y_i) + (1 - y_i)\log\left(1 - p(y_i)\right)\right], \qquad (2)$$
where $y$ is the binary label value, $p(y)$ is the predicted probability of $y$, and $N$ is the number of pixels. The Tversky loss function finds a better balance between recall and precision. Following [45], it is defined as follows:
$$L_{Tversky} = 1 - \frac{|P \cap G|}{|P \cap G| + \alpha\,|P \setminus G| + \beta\,|G \setminus P|}, \qquad (3)$$
where $P$ and $G$ represent the predicted and true values, $|P \setminus G|$ counts the false positives (FP), and $|G \setminus P|$ counts the false negatives (FN); the trade-off between FN and FP can be controlled by adjusting the hyperparameters $\alpha$ and $\beta$. We set $\alpha$ to 0.3 and $\beta$ to 0.7. The final model loss is a weighted combination of the binary cross-entropy loss and the Tversky loss, with the weight $\omega$ set to 0.5 based on experience. The resulting loss function alleviates the class imbalance problem between foreground and background pixels.
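The loss terms described above can be sketched in plain Python on flattened lists of per-pixel probabilities and labels. Two points are assumptions and not taken from the text: the two terms are combined as a convex combination, omega * BCE + (1 - omega) * Tversky, and alpha weights the false positives while beta weights the false negatives, as in the usual Tversky formulation. The small `eps` constant is added only for numerical safety.

```python
import math


def bce_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)


def tversky_loss(probs, labels, alpha=0.3, beta=0.7, eps=1e-7):
    """1 - Tversky index, with soft counts of TP, FP, and FN."""
    tp = sum(p * y for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def combined_loss(probs, labels, omega=0.5):
    """Assumed form: omega * BCE + (1 - omega) * Tversky."""
    return (omega * bce_loss(probs, labels)
            + (1.0 - omega) * tversky_loss(probs, labels))


# One missed foreground pixel versus one false alarm, each with one hit.
loss_miss = tversky_loss([1.0, 0.0, 0.0], [1, 1, 0])
loss_false_alarm = tversky_loss([1.0, 1.0, 0.0], [1, 0, 0])
```

With beta larger than alpha, a missed foreground pixel is penalized more heavily than a false alarm, which pushes the model toward higher recall on the rare foreground class.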

Implementation and Training Details.
Since the CDnet-2014 dataset is a test dataset with no separate training or validation set, we adopt the same strategy as FgSegNet_v2 [31] and manually select 200 frames from each video sequence as training data, using the rest as test data. There are 53 video sequences in the dataset, giving 10,600 training images in total, which account for about 6.65% of the total number of images in the dataset. All training and testing data are resized to a uniform size of 640 × 480, that is, 640 in length and 480 in width. To demonstrate the effectiveness of our proposed model, our method does not use any preprocessing or post-processing procedures. Table 1 shows the configuration of the proposed AMU-Net model, where "conv" means convolution operation, "maxpool" means max-pooling operation, "attention" means attention operation, and "tranconv" means transposed convolution operation. The hardware platform of the experiment is based on an Intel Core i7-9700 8-core CPU and a single NVIDIA GeForce RTX 2080 Ti 11G GPU; the software environment is Windows 10 + Python 3.8 + PyTorch 1.10.0; the parallel computing framework is CUDA Toolkit 10.2; and the acceleration library is cuDNN 7.6.5. The weights of the VGG-16 module used in AMU-Net are initialized with weights pretrained on ImageNet. The network model parameters are updated using the Adam optimization algorithm during training, the momentum parameters are set to 0.9 and 0.999, and the batch size is set to 4. The initial learning rate is set to 1e-4, a total of 40 epochs are trained, and the learning rate is reduced by a factor of 10 after every 20 epochs. 90% of the training data is used for training the model and 10% for validation. The upsampling layers are implemented by transposed convolution, and their weights are trainable parameters. On the CDnet-2014 dataset, it takes about 12 hours (about 19 minutes per epoch) to complete the entire training process of AMU-Net.
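The step schedule and the data split described above amount to a few lines of arithmetic, sketched below. The function name `learning_rate` and the 0-based epoch indexing are assumptions made for illustration. In a PyTorch training script this schedule would typically be expressed with a step scheduler rather than written by hand.

```python
def learning_rate(epoch, base_lr=1e-4, step=20, factor=0.1):
    """Step decay: multiply by `factor` after every `step` epochs.

    `epoch` is 0-based, which is an assumption about the indexing.
    """
    return base_lr * factor ** (epoch // step)


# Learning rate for each of the 40 training epochs.
schedule = [learning_rate(epoch) for epoch in range(40)]

# 200 manually selected frames from each of the 53 sequences.
training_frames = 53 * 200
train_count = training_frames * 9 // 10        # 90% used for training
val_count = training_frames - train_count      # 10% used for validation
```

Under these settings the first 20 epochs run at 1e-4 and the last 20 at 1e-5.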

Ablation Analysis.
To verify the effectiveness of the attention mechanism in the design of this model, we conducted ablation experiments on AMU-Net. We removed the attention mechanism module and call the resulting network MU-Net; it is similar to the original U-Net in that the corresponding feature maps of the decoder and encoder are directly concatenated through skip connections. The same training strategy is used during training to ensure the credibility of the comparison. We use precision, recall, and F-measure as comparison indicators. The specific experimental results are shown in Table 2. It can be seen from the experimental results that adding the attention mechanism improves the performance of the model on all three metrics. Because the attention mechanism enhances the target features and suppresses the background, it helps improve the detection quality.
Figure 3: The internal structure of the attention mechanism module.
Computational Intelligence and Neuroscience

Quantitative Evaluation.
To evaluate the performance of the proposed AMU-Net model, we conducted validation experiments on the CDnet-2014 dataset, and the results are shown in Table 3. In the Baseline, Camera Jitter, Dynamic Background, and Shadow categories, the F-measure score of AMU-Net is higher than 0.99, which shows that our model handles camera jitter, dynamic background, and shadow scenes well. In challenging categories such as PTZ, Night Videos, Intermittent Object Motion, Bad Weather, Thermal, and Turbulence, the F-measure score is also around 0.98. Taken together, an F-measure score of 0.9785 and a PWC score of 0.0603 were obtained on the entire dataset. According to [28], if the F-measure score of an algorithm is higher than 0.94 and the PWC score is lower than 0.9, the foreground detection results of the algorithm can be considered as good as the benchmark results of manual labeling. Therefore, the segmentation accuracy of our model reaches manual labeling accuracy in most video scenes. The AMU-Net model does not perform as well in the Low Framerate category as in other categories, because some video sequences in this category (such as the port_0_17fps sequence) contain extremely small foreground objects, and there are lighting changes and dynamic backgrounds in the scene. In this case, the features of foreground objects are difficult to extract effectively, and the model may pay more attention to the main category (background) and less attention to the rare category (foreground), resulting in the misclassification of very small foreground objects as background. The recall score is only 0.8921, so the F-measure score is low.
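The metrics quoted above follow directly from the pixel-level confusion counts. The sketch below uses the standard definitions of precision, recall, F-measure, and PWC (percentage of wrong classifications); the function name `evaluate` and the counts in the example are hypothetical and are not taken from the experiments.

```python
def evaluate(tp, fp, fn, tn):
    """Precision, recall, F-measure, and PWC from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Percentage of wrong classifications over all pixels.
    pwc = 100.0 * (fp + fn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, pwc


# Hypothetical counts for a 10,000-pixel frame.
precision, recall, f_measure, pwc = evaluate(tp=90, fp=10, fn=10, tn=9890)
```

Because the F-measure is the harmonic mean of precision and recall, a low recall such as the one reported for the Low Framerate category pulls the F-measure down even when precision is high.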
We compared the AMU-Net method with some classic and state-of-the-art methods, mainly comparing the F-measure scores on the CDnet-2014 dataset for the different categories and for the overall dataset. We choose the following methods: GMM [14], FTSG [23], SuBSENSE [21], IUTIS-5 [25], DeepBS [27], CascadeCNN [28], BSPVGAN [39], and FgSegNet_v2 [31], where DeepBS, CascadeCNN, BSPVGAN, and FgSegNet_v2 are deep learning methods, and the rest are traditional unsupervised methods. Table 4 presents a quantitative comparison of the F-measure scores, and the first, second, and third results are marked with bold red, blue, and green fonts, respectively. It can be seen that the top three F-measure scores all belong to deep learning methods, and the results are much higher than those of traditional methods, especially in very challenging categories (such as Night Videos and PTZ), which shows the advantages of deep learning in the foreground detection task. The performance of our proposed AMU-Net is close to that of the current state-of-the-art methods, significantly outperforming BSPVGAN and CascadeCNN and only slightly worse than FgSegNet_v2. It should be noted that since FgSegNet_v2 was proposed, it has been ranked first on the CDnet-2014 official website, and its F-measure in most categories is close to 1, which is relatively difficult to improve. However, FgSegNet_v2 is a scene-specific method, which trains a separate network model for each video sequence, resulting in a total of 53 networks with different parameters. It requires more weight parameters and takes longer to train. In contrast, our method only needs to train one general network model for all 53 video sequences, the required weight parameters are reduced by a factor of nearly 20, and the training time is greatly shortened. Table 5 shows the differences between AMU-Net and FgSegNet_v2.

Qualitative Evaluation.
To evaluate the performance of our proposed method in different scenarios from a qualitative point of view, we selected the following representative video sequences (excluding training frames) from the CDnet-2014 dataset for evaluation, covering several typical challenging scene categories: the highway sequence in the Baseline category (#820), the badminton sequence in the Camera Jitter category (#1139), the skating sequence in the Bad Weather category (#1910), the sofa sequence in the Intermittent Object Motion category (#2023), and the park sequence in the Thermal category, among others. The detection results are shown in Figure 4, where the first column shows the input frames, the second column shows the corresponding benchmark ground truth images, and the third to seventh columns show the results of the following methods: our method (AMU-Net), CascadeCNN [28], IUTIS-5 [25], SuBSENSE [21], and GMM [14]. It can be seen visually that the results of our method significantly outperform all other methods, which is consistent with the above quantitative evaluation results. The first row is the highway sequence in the Baseline category. Our model can resist the interference of dynamic backgrounds (shaking branches) and shadows. The segmentation results are almost consistent with the ground truth.
The foreground detection by the other methods is affected by shadows and often includes shaded parts. The second row is the badminton sequence in the Camera Jitter category. The violent shaking of the camera causes global motion of the video background. There is a lot of noise in the detection results of the traditional algorithms, but our model segments the foreground target relatively completely. The third row is the skating sequence in the Bad Weather category. Our model detects the human head and torso very clearly in the scene, whereas the human head is hardly visible in the detection results of the other methods. The fourth row is the sofa sequence in the Intermittent Object Motion category. The color and texture of the pants worn by the people in the picture are very similar to the sofa behind them, which leads to holes in the foreground detection results of other methods, while our model detects the complete outline of the person well. In addition, it can be seen that the traditional algorithms absorb the box on the sofa into the background, resulting in missed detection, while our method accurately detects the target. The fifth row is the park sequence in the Thermal category. Due to the serious loss of color information in the infrared image, it is difficult even for a person to accurately segment the foreground object from the background in this case, but our method still segments the pedestrian more accurately. The sixth row is the turnpike_0_5fps sequence in the Low Framerate category. The low frame rate leads to a large difference in foreground targets between adjacent frames. It can be seen that the segmentation results of the proposed model are the closest to the ground truth, while the results of other methods have a lot of noise and fail to detect foreground targets completely. The seventh row is the twoPositionPTZCam sequence in the PTZ category.
The global [...]

Overall, the proposed method performs well in the face of various interference challenges, because introducing attention mechanisms into the multiscale structure improves the robustness of the model to various kinds of background noise, so that the model can be more "focused" on foreground targets of different sizes.

Multiple Input Experiments.
Our proposed AMU-Net model takes a single RGB image as input and uses only the spatial information of the image, yet it segments the foreground target very well. To further improve the detection ability of the model by exploiting the temporal information of the video, we considered two multiple-input methods.
The first follows the method of [46]: the input RGB image is converted into a grayscale image as a spatial appearance cue, the segmentation mask of the SuBSENSE algorithm is used as a change cue, and the output of the FTSG algorithm is used as a motion cue, so that the original RGB three-channel input becomes a three-channel input of grayscale, change, and motion cues. To distinguish it from the single-input model, we call it AMU-Net_M1 (see Figure 5). The second converts the current input frame and its previous and next frames into grayscale images and uses them as the three-channel input; we call it AMU-Net_M2 (see Figure 6). Through the fused input of multiple images, more spatiotemporal information is included, and the deep learning framework can learn richer features from it, thereby achieving a more robust output.
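The two input constructions described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the helper names (`to_gray`, `build_m1_input`, `build_m2_input`), the BT.601 grayscale weights, the scaling to [0, 1], the assumption that the SuBSENSE and FTSG outputs are available as 0/255 masks, and the clamping of frame indices at the ends of a sequence are all assumptions made here for illustration.

```python
import numpy as np

def to_gray(rgb):
    """Convert an HxWx3 uint8 RGB frame to an HxW float32 grayscale image in [0, 1]."""
    # ITU-R BT.601 luma weights; the paper does not state which conversion it uses (assumption).
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return (rgb.astype(np.float32) / 255.0) @ weights

def build_m1_input(rgb, change_mask, motion_mask):
    """AMU-Net_M1-style input: grayscale appearance cue, change cue, motion cue.

    change_mask and motion_mask stand for precomputed SuBSENSE and FTSG
    outputs, assumed here to be HxW uint8 masks with values 0 or 255.
    """
    gray = to_gray(rgb)
    change = change_mask.astype(np.float32) / 255.0
    motion = motion_mask.astype(np.float32) / 255.0
    return np.stack([gray, change, motion], axis=-1)  # shape HxWx3

def build_m2_input(frames, t):
    """AMU-Net_M2-style input: grayscale versions of frames t-1, t, and t+1."""
    last = len(frames) - 1
    # Clamp at the sequence boundaries so the first and last frames still
    # yield three channels (a choice made for this sketch).
    indices = [max(t - 1, 0), t, min(t + 1, last)]
    return np.stack([to_gray(frames[i]) for i in indices], axis=-1)  # shape HxWx3
```

Both functions return an array with the same HxWx3 layout as an RGB frame, which is why a network designed for three-channel input can consume either variant without architectural changes.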
We again conducted experiments and model evaluations on the CDnet-2014 dataset, comparing AMU-Net_M1, AMU-Net_M2, AMU-Net, SuBSENSE, and FTSG. We begin with the quantitative analysis; the results are shown in Table 6. AMU-Net_M1 has the highest scores in the three categories of Intermittent Object Motion, Low Framerate, and Turbulence. In the Low Framerate category in particular, the F-measure increases from 0.9030 for AMU-Net to 0.9603, a relative increase of 6.35%, and even surpasses the score of FgSegNet_v2 in this category (0.9579); the overall F-measure increases to 0.9827. AMU-Net_M2 achieves the highest score in the four categories of PTZ, Baseline, Shadow, and Thermal, with an overall F-measure of 0.9830. These results show that the input of multiple images allows the deep neural network to learn more abstract spatiotemporal features, thereby achieving more robust detection performance.
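The 6.35% figure is a relative increase, which can be checked with a few lines of arithmetic. The sketch below uses the standard F-measure definition (the harmonic mean of precision and recall) together with the two Low Framerate scores quoted above; the function names are hypothetical and the code is only an illustrative check, not part of the evaluation pipeline.

```python
def f_measure(tp, fp, fn):
    """F-measure from pixel counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def relative_gain_percent(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Low Framerate category: AMU-Net (0.9030) versus AMU-Net_M1 (0.9603).
gain = relative_gain_percent(0.9030, 0.9603)
print(f"{gain:.2f}%")  # prints "6.35%"
```

The absolute difference between the two scores is 0.0573; dividing by the AMU-Net score of 0.9030 gives the relative figure reported in the text.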

[Figure 4: columns, left to right: Input Frame, Ground Truth, AMU-Net, CascadeCNN, IUTIS-5, SuBSENSE, GMM.]

The qualitative evaluation results are shown in Figure 7. We selected the following video sequences (excluding training frames) from the CDnet-2014 dataset for evaluation: the office sequence in the Baseline category (#1717), the winterDriveway sequence in the Intermittent Object Motion category (#1826), the port_0_17fps sequence in the Low Framerate category (#1082), the peopleInShade sequence in the Shadow category (#1085), the library sequence in the Thermal category (#2615), and the turbulence0 sequence in the Turbulence category (#2265). The AMU-Net_M1 and AMU-Net_M2 methods give the best detection results, which is consistent with the quantitative analysis above. In particular, in the port_0_17fps sequence in the Low Framerate category in the second row, the moving foreground target is extremely small and is also disturbed by the dynamic background and lighting. The AMU-Net model pays more attention to the background and ignores the foreground, so it can hardly detect the foreground target. With the help of multiple inputs, however, the AMU-Net_M1 and AMU-Net_M2 models learn the relevant features of the foreground targets better, so their detection results are essentially the same as the ground truth.

Conclusion
This paper proposed a foreground detection method based on the multiscale U-Net architecture with a fusion attention mechanism. Introducing the attention mechanism into the multiscale structure increases the weight of foreground objects and suppresses the interference of background pixels. The model needs only a small amount of training data and uses only spatial information to achieve accurate foreground segmentation. We also conducted two types of multi-input experiments on the model. In the first, the foreground masks of the traditional unsupervised SuBSENSE and FTSG algorithms are fed into the model together with the grayscale image of the input frame; in the second, the current input frame and its previous and next frames are converted into grayscale images and used as input. Through multiple inputs, the model can learn more motion and spatiotemporal features, which improves the robustness of its detection. We described the architecture of the network in detail and set up a series of comparative experiments from both qualitative and quantitative perspectives using the CDnet-2014 dataset. The proposed foreground detection method outperforms many existing methods without any training tricks or post-processing, and its F-measure score is significantly improved compared to other methods. However, the generalization ability of the proposed method is not strong, and its real-time performance is slightly poor. For future work, we plan to further improve the model architecture, introduce more spatiotemporal information, and generalize the method to unseen videos.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.