Foreground Detection Based on Superpixel and Semantic Segmentation

Foreground detection is an essential step in computer vision and video processing. Accurate foreground object extraction is crucial for subsequent high-level tasks such as target recognition and tracking. Although many foreground detection algorithms have been proposed, foreground detection in complex scenes is still a challenging problem. This paper presents a foreground detection algorithm based on superpixel and semantic segmentation. It first uses multiscale superpixel segmentation to obtain the initial foreground mask. At the same time, a semantic segmentation network is applied to separate potential foreground objects, and defined rules are then used to combine the results of superpixel and semantic segmentation to obtain the final foreground object. Finally, the background model is updated with the refined foreground result. Experiments on the CDNet2014 dataset demonstrate the effectiveness of the proposed algorithm, which can accurately segment foreground objects in complex scenes.


Introduction
Foreground detection is a fundamental lower-level task of computer vision. The performance of foreground detection directly affects the performance of a series of subsequent assignments, like object recognition and tracking. The primary purpose of foreground detection is to extract foreground objects from video frames and obtain their feature information. Foregrounds are usually moving objects, such as moving human bodies and vehicles. The foreground detection process can be regarded as a classification of image pixels: looking for differences in continuous image sequences (i.e., videos) and extracting the moving objects of interest according to those differences. The moving objects of interest are the foreground, and the rest is classified as background. Foreground object detection is generally realized by background subtraction, which comprises unsupervised and supervised algorithms. Recent surveys [1][2][3][4] study various background subtraction algorithms from multiple aspects. With the continuous development of computer vision-related technologies, foreground detection has achieved impressive results in some application scenarios. However, it still faces significant challenges in real-world applications due to various complex interferences in natural scenes, such as dynamic backgrounds, shadows, bad weather, and camouflage. From Figure 1, we note that traditional methods often produce false positives and false negatives.
Superpixel is an image segmentation technique proposed and developed by Ren and Malik [5]; a superpixel is an irregular pixel block with specific visual significance, composed of adjacent pixels with similar texture, color, brightness, and other features. The pixels in a superpixel have a more robust connection with each other than pixels in an arbitrary region. Some applications [6, 7] based on superpixels verify their effectiveness in foreground detection. Semantic segmentation is a deep learning task that associates a label or class with each pixel of an image, identifying the collection of pixels that make up each distinct class; it is exemplified in Figure 2. Semantic segmentation shows powerful performance in foreground detection [8]. Based on these motivations, a new approach named superpixel and semantic segmentation foreground detection (SSSFD) is proposed to address the challenges mentioned above. SSSFD can accurately detect moving objects according to the experiments on the CDNet2014 dataset [9]. The contributions of this paper are the following: (1) a background subtraction algorithm for multiscale superpixel segmentation with postoptimization is proposed; (2) a semantic segmentation network is trained on a CDNet2014 subset; (3) the final results combining superpixel and semantic segmentation participate in the background model update in real time. The rest of the paper is organized as follows: Section 2 introduces the works related to our proposed method. In Section 3, we report the implementation of SSSFD in detail. The experiments and the qualitative and quantitative analysis of the experimental results are described in Section 4, and finally, the paper is briefly summarized in Section 5.

Related Works
Over the past 30 years, plenty of foreground detection approaches have been studied. They can be grouped into unsupervised and supervised methods.

Unsupervised Methods.
An unsupervised background subtraction algorithm usually combines handcrafted features with advanced background modeling rules to achieve foreground separation. It typically consists of three parts: the initialization of the background model, the extraction of the foreground region, and the update of the background model. The first unsupervised algorithm widely used for foreground detection was the Gaussian Mixture Model (GMM) proposed by Stauffer and Grimson [10], which is a statistics-based parametric model. The basic principle of GMM is to represent the distribution of image pixels in the time domain with multiple (3-5) Gaussian models. The central assumption of the GMM is that each pixel obeys a Gaussian distribution, and the variance of a pixel determines its class: if the variance of the pixel is higher than a specific value, the pixel is considered a foreground pixel. This method assumes that the variance of background pixels stays relatively low, which is not always accurate in practice. In addition, when the background changes rapidly, it is hard for this method to keep up with the background variations, and the detection of fast-moving objects will suffer from missed detections and ghosts. Moreover, efficiently updating the model parameters to match changes in the background is also a challenging problem.
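As a minimal sketch of the per-pixel mixture decision described above: the component count, the 2.5-standard-deviation matching rule, and the assumption that the highest-weight components model the background are common choices for GMM-style background subtraction, not values fixed by [10].

```python
import numpy as np

def gmm_classify(pixel, means, variances, weights, match_thresh=2.5):
    """Classify one pixel against its per-pixel Gaussian mixture.

    A pixel matches a component when it lies within `match_thresh`
    standard deviations of that component's mean; matching one of
    the heaviest (background) components marks it as background.
    """
    d = np.abs(pixel - means) / np.sqrt(variances)
    matched = np.where(d < match_thresh)[0]
    if matched.size == 0:
        return "foreground"            # no component explains the pixel
    # Components with the largest weights are assumed to model the background.
    order = np.argsort(weights)[::-1]
    background = set(order[:2].tolist())
    return "background" if any(k in background for k in matched) else "foreground"
```

For a pixel whose history clusters around gray value 100 with a dominant weight, a new observation of 102 matches the background component, while an outlier such as 150 matches nothing and is declared foreground.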
To overcome the problems arising from the above GMM, nonparametric statistical background models were proposed to improve foreground detection performance effectively.
The nonparametric estimation background subtraction methods do not need to preassume a distribution model; representative examples are Kernel Density Estimation (KDE) [11] and Visual Background Extractor (ViBe) [12]. KDE reads the historical data in the recent samples of each pixel and calculates the actual probability density of a pixel from the kernel density estimate for foreground detection. This method requires a large amount of computation, its real-time performance is not good enough, and its detection accuracy is average; in some cases, a smear phenomenon will occur. ViBe is a pixel-level foreground detection algorithm. The key idea of this method is to store a sample set for each pixel in the image. The sample set consists of past pixel values at the pixel position and pixel values from its neighborhood. Whether the pixel is a foreground or background point is determined by comparing it against the sample set of the current pixel: if the pixel matches the characteristics of the sample set, it is a background point; otherwise, it is a foreground point.
The ViBe model uses a random strategy to establish and maintain the background model, and it adopts a time-subsampling update strategy to reduce the update frequency.
This method requires less computation, less memory usage, and provides faster processing, but it has the disadvantage of ghosts. In order to eliminate ghosts and further improve the performance of ViBe, various algorithms based on it have been proposed. St-Charles et al. [13] presented a pixel-level segmentation method with spatiotemporal binary features and color information. In addition, an adaptive feedback mechanism was used to dynamically adjust parameters such as model fidelity, segmentation noise level, distance threshold, and local update rate.
This approach achieved effective and robust results in the most challenging scenes. Lu and Lu [14] proposed a weight-sample-based method using samples with variable weights, a minimum-weight update policy, a reward-and-penalty weighting strategy, and a spatial-diffusion policy for foreground detection. Nonparametric models do not need to specify the distribution model of pixels but still need a threshold-based decision for each pixel. Due to the different characteristics of various scenes, it is difficult to infer the threshold of each pixel through a unified calculation model.
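The sample-set decision and the random time-subsampled update that ViBe relies on can be sketched as follows; the sample-set size (16), color distance (20), minimum match count (2), and subsampling factor are the commonly cited defaults, used here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def vibe_classify(pixel, samples, radius=20, min_matches=2):
    """ViBe-style decision: a pixel is background when it lies within
    `radius` of at least `min_matches` of its stored samples."""
    matches = np.sum(np.abs(samples - pixel) < radius)
    return "background" if matches >= min_matches else "foreground"

def vibe_update(pixel, samples, subsample=16):
    """Time-subsampled update: with probability 1/subsample, overwrite
    one randomly chosen stored sample with the current pixel value."""
    if rng.integers(subsample) == 0:
        samples[rng.integers(len(samples))] = pixel
    return samples
```

With a sample set drawn mostly from a stable gray level, a nearby observation is classified background while a distant one is foreground; the update only occasionally touches the model, which is what keeps ViBe cheap.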

Supervised Methods.
Supervised background subtraction algorithms typically rely on a trained deep neural network (DNN) to segment the foreground from the background. These methods generally follow four steps: (1) use a particular algorithm to get a clean background image; (2) generate a dataset of specific scenes for training; (3) exploit the dataset to train the deep neural network; (4) utilize the trained neural network and the background image to perform foreground detection on the image sequence. Supervised methods can achieve good results for foreground detection tasks in many complex scenes, significantly improving the accuracy of foreground detection.
Braham and Van Droogenbroeck [15] proposed ConvNet, the first background subtraction model based on convolutional neural networks (CNNs). Temporal median filtering generates a static background image from 150 consecutive frames. A pair of 27 × 27 image patches is extracted from the background image and the corresponding original image and input into the constructed CNN model. The output is the foreground probability of the center pixel of the input image patch. Because ConvNet is one of the most straightforward approaches to model the differences between the background and the foreground using CNNs, it has several limitations: (1) high-level information is challenging to learn through patches [16]; (2) the trained network only works in specific scenes due to the highly redundant data of the video frame sequences used in training; if ConvNet is used for other video scenarios, it must be retrained; (3) since ConvNet adopts a pixelwise scheme, treating each pixel as independent, isolated false negatives and false positives may appear in the foreground mask; (4) ConvNet is a deep encoder-decoder network with a generator network, and one drawback of this generator network is that object edges cannot be preserved, leading to blurry foreground areas. Since this first valuable study, various subsequent algorithms have been proposed to alleviate these limitations. Wang et al. [17] designed a new background model based on CNNs with two critical ideas. The first is a multiscale structure: for the model to segment foreground objects of different sizes accurately, the original image is downsampled by factors of 0.75 and 0.5, respectively, and the sampled images and the original image are used as the model's input. The second is a cascaded architecture.
In order to reduce pixel misclassification caused by local information, the foreground probability map output by the first multiscale structure and the original image are re-input to the second multiscale structure. The second CNN model then computes a more accurate foreground probability map.
A fully convolutional network (FCN) was first proposed to process the semantic segmentation of an image by Long et al. [18]. The FCN consists entirely of convolutional layers, converting the fully connected layers of CNNs into deconvolutional layers and thereby preserving the location and semantic information of pixels. FCN can efficiently learn to make dense predictions for per-pixel tasks. Therefore, it is widely used in pixel-by-pixel computer vision tasks such as pixel classification. Based on the traditional FCN with atrous convolution, the pyramid scene parsing network (PSPNet), which adds a PSP module between the encoder and the decoder, was presented by Zhao et al. [19]. The PSP module can obtain more global context information, producing high-quality results in scene parsing assignments. DeepLab v1 [20] combines deep convolutional neural networks with a conditional random field (CRF), which overcomes the local properties of deep networks and can better locate segmentation boundaries. DeepLab v2 [21] makes full use of atrous convolution, which can effectively expand the receptive field and incorporate more context without increasing the number of parameters. In addition, deep convolutional neural networks (DCNNs) are combined with a CRF to optimize network performance further. Moreover, the atrous spatial pyramid pooling (ASPP) module enhances the robustness of the network for multiclass segmentation at multiple scales: it uses different sampling scales and receptive fields to extract input features and capture target and context information at multiple scales. DeepLab v3 [22] designs cascaded or parallel atrous convolution modules and improves ASPP. DeepLab v3+ [23] optimizes the segmentation results by adding a simple yet efficient decoder module on top of DeepLab v3. Badrinarayanan et al. [24] proposed SegNet, a deep network for semantic segmentation based on FCN and a modified VGG-16 [25].
It still adopts the encoder-decoder structure, and the encoder part uses the first 13 convolutional layers of VGG-16. Each encoder layer corresponds to one decoder layer. The output of the final decoder is fed into a softmax classifier, which generates class probabilities for each pixel independently.

Proposed Approach: SSSFD
SSSFD combines an unsupervised algorithm based on multiscale superpixel segmentation (MSS) and a supervised algorithm based on semantic segmentation. The detailed implementation steps are introduced in this section. Figure 3 shows the pipeline of SSSFD. First, the semantic segmentation network is trained; then, the background subtraction MSS is performed on the input image to obtain the coarse foreground mask. This process mainly consists of multiscale superpixel segmentation, foreground detection, and postprocessing. At the same time, the semantic foreground segmentation map is obtained by semantically segmenting the input image with the pretrained network. Subsequently, the background subtraction and semantic segmentation results are combined with specific rules to get the refined foreground result. Finally, the background model is updated with this refined foreground result.

Unsupervised Method MSS
3.1.1. Preprocessing. We first apply a Gaussian filter to the current input image. Gaussian filtering is a linear smoothing filter that can eliminate Gaussian noise and is widely used in the noise reduction stage of image processing. It is a weighted-averaging process over the entire image: each Gaussian-filtered pixel value is obtained by a weighted average of the original pixel and its neighboring pixels, where the filter template defines the weights and the neighborhood.
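The template-based weighted average described above can be sketched as follows; the 5 × 5 template size and sigma of 1.0 are illustrative defaults, and a practical implementation would use a library routine or a separable filter instead of this direct double loop.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Discrete Gaussian template: the neighborhood weights, normalized to sum to 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_filter(image, size=5, sigma=1.0):
    """Each output pixel is the weighted average of the input pixel and
    its neighbors, with weights given by the Gaussian template."""
    k = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros_like(image, dtype=float)
    h, w = image.shape
    for y in range(h):
        for x in range(w):
            out[y, x] = np.sum(padded[y:y + size, x:x + size] * k)
    return out
```

Because the template is normalized, a constant image passes through unchanged, which is a quick sanity check on any smoothing filter.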
Next, we perform superpixel segmentation at different scales on the Gaussian-filtered image to get foreground probability maps at each scale. The number of pixels in a superpixel block, that is, how many superpixels an image is divided into, is significant. Suppose the size of the superpixel block is large; in that case, the pixels in the superpixel block may contain both foreground and background object pixels, which will reduce the accuracy and sensitivity of the final detection result. If the size of the superpixel block is small, the accuracy of the final detection result is not significantly improved compared with pixelwise foreground detection algorithms, and the robustness is poor. Therefore, we perform multiscale superpixel segmentation on video sequences and then execute foreground detection. We use the SLIC algorithm [37] to segment the current frame, and the segmented superpixel blocks at multiple scales can be expressed as SP = {SP_(i,j) | i = 1, ..., K; j = 1, ..., N_i}, where i is the scale subscript used in multiscale segmentation, K is the number of scales used, N_i is the number of superpixels at scale i, and j is the index of the superpixel block at a particular scale.
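The multiscale segmentation loop can be sketched as follows. Since the paper uses SLIC [37], which is not reproduced here, a regular-grid partition stands in for the segmenter; `grid_superpixels` is a hypothetical helper whose only purpose is to return a label map with roughly the requested number of blocks per scale.

```python
import numpy as np

def grid_superpixels(image, n_segments):
    """Toy stand-in for SLIC: partition the image into a regular grid of
    roughly `n_segments` blocks and return a per-pixel label map.  A real
    implementation would refine these seeds by color/space clustering."""
    h, w = image.shape[:2]
    side = int(np.ceil(np.sqrt(n_segments)))
    ys = np.minimum(np.arange(h) * side // h, side - 1)
    xs = np.minimum(np.arange(w) * side // w, side - 1)
    return ys[:, None] * side + xs[None, :]

def multiscale_superpixels(image, scales=(100, 200, 400)):
    """Segment the frame at K scales; element i is the label map whose
    number of superpixels N_i grows with the scale index i."""
    return [grid_superpixels(image, n) for n in scales]
```

The scale list (100, 200, 400) is an illustrative choice; the paper does not fix the per-scale superpixel counts in this excerpt.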

Foreground Detection.
In foreground detection, the feature that distinguishes foreground and background is the RGB color feature. We use an improved frame difference (IFD) to compute the difference between two corresponding superpixel patches in the current and background images. The IFD approach does not calculate the feature values pixel by pixel but performs the subtraction over a region of radius R around a pixel and takes the minimum of the result as the variation of the pixel. This measure can effectively improve the robustness of the background subtraction algorithm and cope with the foreground detection of complex scenes such as dynamic backgrounds. IFD can be defined using the following formula:

D(x, y) = min_((m,n) in the R-neighborhood of (x, y)) |I_t(x, y) − I_b(m, n)|,   (2)

where (x, y) is the coordinate of a pixel point in the superpixel (i, j), I_t represents the pixel value of the current frame, I_b represents the pixel value of the background image, (m, n) is the coordinate of a pixel in the neighbor region of (x, y), and R is the radius of the neighbor region. We use a superpixel instead of a single pixel for background subtraction. The superpixel, encoding the similarity and spatial relationship between pixels, can improve the performance of the proposed unsupervised algorithm. The foreground image consists of superpixels with foreground or background labels. Whether each superpixel is marked as foreground or background is determined by the variation in the RGB color characteristics of the superpixel itself. The color feature variation of a superpixel can be calculated by averaging the color feature changes of the pixels that compose the superpixel; this average, in turn, acts as the feature variation value for each pixel in the superpixel. Formula (3) represents the process:

M_SP(i,j) = (1 / |SP_(i,j)|) Σ_((x,y) ∈ SP_(i,j)) D(x, y),   (3)

where D(x, y) is obtained from formula (2) and represents the minimum difference of the feature values between the current frame and the background model at coordinate (x, y).
|SP_(i,j)| represents the number of pixels in the superpixel block. After calculating the average value of the superpixel, M_SP(i,j) is compared with a threshold T_sp to determine whether the superpixel is foreground or background according to formula (4), so as to obtain the foreground segmentation map at a certain scale.
Fg_i(x, y) = 1, if M_SP(i,j) > T_sp; 0, otherwise,   (4)

where Fg_i(x, y) means the foreground label of the pixel with coordinate (x, y) at scale i, and T_sp is the threshold judging whether a superpixel belongs to the foreground or not. The number of superpixels in an image will affect the results of foreground detection. We use a multiscale strategy to effectively reduce the performance degradation caused by the uncertainty of the number of superpixels and to better detect foreground objects. The formula is as follows:

Fg(x, y) = (1 / K) Σ_(i=1..K) Fg_i(x, y),   (5)

where Fg(x, y) is the foreground probability of the pixel, and K is the total number of used scales. Finally, we compare the foreground probability map with a threshold to obtain the foreground detection result of the image:

BS_t(x, y) = 1, if Fg(x, y) > T_fg; 0, otherwise,   (6)

where T_fg is the threshold for judging the foreground.
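The chain from formula (2) through formula (6) can be sketched end to end as follows; the thresholds passed in the example are illustrative, and the label map would come from the multiscale SLIC segmentation in practice.

```python
import numpy as np

def ifd(current, background, R=1):
    """Formula (2): for each pixel, the minimum absolute difference between
    the current value and the background values in its R-neighborhood."""
    h, w = current.shape
    pad = np.pad(background, R, mode="edge")
    stack = [np.abs(current - pad[dy:dy + h, dx:dx + w])
             for dy in range(2 * R + 1) for dx in range(2 * R + 1)]
    return np.min(stack, axis=0)

def superpixel_foreground(D, labels, T_sp):
    """Formulas (3)-(4): average D over each superpixel and mark the whole
    superpixel as foreground when its mean variation exceeds T_sp."""
    fg = np.zeros_like(D)
    for lab in np.unique(labels):
        mask = labels == lab
        fg[mask] = 1.0 if D[mask].mean() > T_sp else 0.0
    return fg

def fuse_scales(fg_maps, T_fg):
    """Formulas (5)-(6): average the per-scale masks into a foreground
    probability map and threshold it with T_fg to obtain BS_t."""
    prob = np.mean(fg_maps, axis=0)
    return (prob > T_fg).astype(np.uint8)
```

With a static zero background and a bright block in the current frame, the block's superpixel averages far above the threshold and the whole block is labeled foreground, while untouched superpixels stay background.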

Postprocessing.
We use a region-based strategy when using the IFD to calculate the pixel difference between the current frame and the background model. This strategy leads to the detection result being always slightly larger than the ground truth, called Region of Moving Object Slightly Larger (RMO-SL), as illustrated in [38]. Therefore, we use the contour optimizer (CO) [38] to optimize the background subtraction results and obtain the final foreground segmentation map of the unsupervised method. The CO postprocessing method consists of suppression region detection, edge detection, and wrong-edge detection. Suppression region detection aims to determine areas where foreground moving objects must not appear. Edge detection generates the edges of the result. The wrong edges are detected by combining the suppression regions and the detected edges. After eliminating the wrong edges, we get the final MSS output BS_t, shown in the second column of Figure 4.

Train the Network for Semantic Segmentation.
We adopt an advanced semantic segmentation network called DeepLabv3plus [23] and pretrain it on the ADE20K dataset [39] to acquire initial weight parameters. ADE20K can be used for scene perception, parsing, segmentation, multiobject recognition, and semantic understanding, with 20,210/2,000 densely annotated images for training and validation. There are 150 common object classes in the dataset, denoted as C = {c_1, c_2, c_3, ..., c_150}. Following the same procedure as [40], we divide these objects into foreground and background classes. The foreground class is Fc = {person, car, cushion, box, book, boat, bus, truck, bottle, van, bag, bicycle}, and the background class Bc contains the remaining objects. Since there is no unified training dataset for foreground detection, a certain number (N) of frames is generally chosen from CDNet2014 to form a training set when training a semantic segmentation network. There are usually three strategies for selecting frames [17]: the first is to select one frame every M/N frames, where M is the total number of video frames; the second is to choose N frames randomly from the video; the third is to choose N frames manually. A video contains many redundant frames, which can cause insufficient foreground information and an unbalanced number of positive and negative samples, affecting the training of the network. Therefore, we select frames manually, choosing frames that contain foreground objects as much as possible, which improves the accuracy of the network for foreground segmentation: actively screening valid frames instead of randomly selecting the training data not only effectively reduces the redundancy but also alleviates the severe imbalance between positive and negative samples.
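The first two selection strategies are mechanical and can be sketched directly; manual selection, which the paper actually uses, has no code equivalent. The function names are illustrative.

```python
import random

def select_uniform(num_frames, n):
    """Strategy 1: pick one frame every M/N frames (M = num_frames)."""
    step = max(num_frames // n, 1)
    return list(range(0, num_frames, step))[:n]

def select_random(num_frames, n, seed=0):
    """Strategy 2: pick N distinct frames at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_frames), n))
```

Either strategy yields N frame indices, but neither guarantees that the chosen frames contain foreground objects, which is the weakness the manual strategy addresses.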
We employ the dataset Sub-CDNet2014 provided by [41], which consists of 200 frames manually selected from each category of CDNet2014. Sub-CDNet2014 contains a total of 10,600 images, of which 80% are used for training and the remaining 20% for validation. The pretrained network is then transferred to Sub-CDNet2014.
Before training, to further improve the performance of the network, we perform data augmentation and compute positive and negative sample weights for the images in Sub-CDNet2014. Data augmentation adopts standard geometric transformations such as flipping and translation. The weights of positive and negative samples are calculated as follows. Expand every input ground truth into a corresponding one-dimensional vector, denoted as P_1, P_2, ..., P_N, where N is the number of input ground truths, and calculate the total number of pixels in P (formula (7)), where L represents the number of pixels in an image and t is the time index. Compute the classes of distinct pixel values in P (formula (8)), where "unique" means finding the different classes in the ground truth. Calculate the frequency of occurrence of each value in each class (formula (9)), where "idx" represents the class and "bincount" counts the frequency of pixels in each class. Finally, compute the weight value of each class from these frequencies (formula (10)). After the training is completed, the trained network is executed on the CDNet2014 dataset to obtain each image's semantic foreground probability map. The probability is mapped to SS_t in the range 0-255 according to formula (11), where x represents a fixed pixel location and map_t is the image's semantic foreground probability map. The third column of Figure 4 shows some semantic foreground segmentation results.
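The weight computation of formulas (7)-(10) can be sketched as follows. The exact form of formula (10) is not given in this excerpt; median-frequency balancing (rarer classes receive larger weights) is used here as an assumption, while the `unique` and `bincount` steps follow the text directly.

```python
import numpy as np

def class_weights(ground_truths):
    """Frequency-balancing class weights, per formulas (7)-(9) as described:
    flatten every ground-truth mask into P, count per-class pixel frequency
    with bincount, then weight each class inversely to its frequency
    (median-frequency balancing -- an assumed form for formula (10))."""
    P = np.concatenate([gt.ravel() for gt in ground_truths])  # formula (7)
    classes = np.unique(P)                                    # formula (8)
    counts = np.bincount(P)[classes]                          # formula (9)
    freq = counts / P.size
    weights = np.median(freq) / freq                          # assumed formula (10)
    return dict(zip(classes.tolist(), weights.tolist()))
```

On a mask that is 75% background and 25% foreground, the foreground class ends up with the larger weight, which is the imbalance correction the section is after.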

Temporal Analysis of Semantic Background Model.
In the semantic segmentation result SS_t, stationary semantic foreground objects, such as cars in a parking lot, will be incorrectly classified as foreground. This problem can be addressed by an approach similar to [8], that is, establishing a semantic background model (SBM). Two pieces of information with high confidence in the foreground and background, SS_FG_t and SS_BG_t respectively, are derived from SS_t. SS_FG_t can be used to reliably detect foreground pixels, while SS_BG_t provides the necessary information for detecting background pixels. They are defined in formulas (12) and (13), where SBM_t(x) represents the semantic background model at pixel x at time t. SBM_t(x) is initialized with the first frame. We then continuously update the semantic background model to adapt to scene variations using a conservative update strategy that alleviates the problem of slow and intermittently moving objects blending into the background model. Specifically, the foreground pixels FG of the final detection result DR are not updated, and the background pixels BG are updated according to the learning rate z.
Figure 5 shows the updating process of SBM_t(x).
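The conservative update can be sketched as follows; interpreting the learning rate z as "each background pixel is refreshed with probability 1/z" is an assumption borrowed from common random-subsampling schemes, since the exact update formula (14) is not reproduced in this excerpt.

```python
import numpy as np

def update_sbm(sbm, ss, dr, z=16, rng=None):
    """Conservative semantic-background-model update: pixels detected as
    foreground in the final result DR are left untouched; background
    pixels are refreshed from SS_t with probability 1/z (assumed
    interpretation of the learning rate z)."""
    rng = rng or np.random.default_rng(0)
    bg = (dr == 0) & (rng.integers(z, size=sbm.shape) == 0)
    out = sbm.copy()
    out[bg] = ss[bg]
    return out
```

Because foreground pixels never feed the model, a car that stops in frame keeps disagreeing with the SBM instead of being absorbed into it, which is exactly the failure mode this model is built to avoid.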

Combination Rules.
Using the detection results of MSS and semantic segmentation, we can determine the class of a pixel according to the defined rules. First, in SS_BG_t, pixels with a low semantic probability should be classified as background points, regardless of the background subtraction result BS_t. With this rule, challenges such as dynamic backgrounds, illumination variations, and cast shadows, which seriously affect the performance of unsupervised foreground detection algorithms, can be handled, and false positives in such scenarios are greatly reduced, as shown in the first and second rows of Figure 4.
where Th_BG is the decision threshold for background pixels. Second, in SS_FG_t, pixels with high semantic probability should be classified as foreground points. This rule reduces holes in foreground objects due to occlusion or model absorption, which tend to be false negatives, as shown in the third row of Figure 4, where Th_FG denotes the decision threshold for the foreground, with Th_FG > Th_BG. Finally, if the value of a pixel meets neither of the above two conditions, the semantic segmentation result cannot provide enough evidence to determine the classification of the pixel; in this case, we directly use the MSS result BS_t as the pixel class. The complete collaboration process is summarized in Table 1.
In brief, this process can also be represented by the following piecewise formula:

DR_t(x) = BG, if SS_t(x) < Th_BG; FG, if SS_t(x) > Th_FG; BS_t(x), otherwise.   (18)
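The three combination rules can be sketched as follows; the threshold values 20 and 200 are illustrative picks within the 0-255 range of SS_t, not the paper's settings.

```python
import numpy as np

def combine(bs, ss, th_bg=20, th_fg=200):
    """Combination rules: low semantic probability forces background,
    high semantic probability forces foreground, and in between the
    MSS background-subtraction result BS_t decides (Th_FG > Th_BG)."""
    dr = bs.copy()
    dr[ss < th_bg] = 0    # rule 1: confident background overrides BS_t
    dr[ss > th_fg] = 1    # rule 2: confident foreground overrides BS_t
    return dr
```

Note that the two confident rules override BS_t in both directions, while pixels in the uncertain middle band pass through unchanged.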

Background Update.
The background update is significant: a timely and accurate update enables the algorithm to better cope with complex scene changes and improves its robustness and accuracy. We adopt different update strategies for pixels detected as foreground or background.
If a pixel is detected as a background pixel during foreground detection, the pixel value of the corresponding pixel in the background model is directly replaced with the current pixel value. If a pixel is detected as foreground, we use a statistical histogram model to update it. Figure 6 shows the histogram statistics of the values of one pixel within a certain period: the abscissa is the pixel's gray value at different times, and the ordinate is the frequency of occurrence of that gray value in this period. We utilize the peak of the statistical histogram to replace the background-model pixel at the same location as the foreground pixel of the current frame. The process is denoted as p_bm = M(P), where p_bm is the pixel value in the background model at the same position as the foreground pixel of the current frame, M represents the mode operation that returns the most frequent value in a set, and P is the set of pixel values in the time sequence from 1 to t. The detailed procedure of the SSSFD algorithm for detecting moving objects is summarized in Algorithm 1.
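The two update paths can be sketched as follows, assuming grayscale values and a stored per-pixel history stack; the history length and array layout are illustrative.

```python
import numpy as np

def update_background(bg, frame, dr, history):
    """Background update: background pixels (DR == 0) are replaced directly
    by the current value; foreground pixels (DR == 1) are replaced by the
    peak (mode) of their gray-value histogram over the stored history."""
    out = bg.copy()
    out[dr == 0] = frame[dr == 0]
    for y, x in np.argwhere(dr == 1):
        values = history[:, y, x].astype(np.int64)
        out[y, x] = np.bincount(values).argmax()  # histogram peak = mode
    return out
```

Using the histogram peak instead of the current value keeps a transient occluder (e.g. a passing pedestrian) from corrupting the model at pixels currently covered by foreground.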

Dataset and Evaluation Metrics
The CDNet2014 [9] dataset is currently the most extensive video dataset with pixel-level labels. It consists of 53 real-world scenarios in 11 different categories. All the video sequences are captured by fixed cameras. This paper runs our proposed algorithm on five categories of the CDNet2014 dataset, evaluates its performance, and compares it with other representative foreground detection methods. For the quantitative evaluation, we choose recall, precision, and F-measure [42] to evaluate the performance of the proposed approach. Recall is the detection rate, indicating the percentage of pixels correctly classified as foreground relative to the foreground pixels in the ground truth. Precision is the accuracy rate, meaning the proportion of detected true positives to the number of all detected positives. F-measure is the weighted harmonic mean of recall and precision, which is the comprehensive evaluation metric that best reflects the performance of a foreground detection algorithm. The higher the values of recall, precision, and F-measure, the better the performance of the algorithm. They are defined as follows:

Recall = TP / (TP + FN),
Precision = TP / (TP + FP),
F-measure = 2 × Recall × Precision / (Recall + Precision),

where TP, FP, TN, and FN are true positives, false positives, true negatives, and false negatives, respectively.
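The three metrics translate directly into code:

```python
def metrics(tp, fp, fn):
    """Recall, precision, and their harmonic mean (F-measure) from the
    true-positive, false-positive, and false-negative pixel counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure
```

For example, a detector with 80 true positives, 20 false positives, and 20 false negatives scores 0.8 on all three metrics, since recall and precision coincide there.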

Experimental Results and Analysis.
This section executes several experiments on the CDNet2014 dataset to analyze the proposed SSSFD algorithm in detail. We conducted an ablation study of SSSFD and performed qualitative and quantitative comparisons with other fundamental and state-of-the-art approaches.

Algorithm 1: SSSFD.
Input: Image sequence; the hyperparameters i, j, R, and T_SP of MSS; the hyperparameters Th_BG, Th_FG, and z of semantic segmentation.
Output: Foreground segmentation map DR_t.
(1) for each frame in image sequence
(2) if the first frame
(3) Initialize the background model of SSSFD.
(4) Initialize the semantic background model SBM_t.
(5) end if
(6) Obtain the foreground mask BS_t of background subtraction using the MSS algorithm.
(7) Get the semantic segmentation map SS_t using the semantic segmentation network.
(8) Compute SS_FG_t using (12).
(9) Compute SS_BG_t using (13).
(10) Combine BS_t, SS_FG_t, and SS_BG_t to get the final result DR_t using (18).
(11) Update SBM_t using (14).
(12) Utilize DR_t to update the background model of SSSFD using Section 3.4.
(13) current frame ← next frame.
(14) end for

A few hyperparameters should be set before initializing a background model, as shown in Table 2. Figure 7 presents the visual comparison of MSS and SSSFD, and Figure 8 reports the quantitative comparison of SSSFD and MSS on some scene categories of the CDNet2014 dataset. From the chart, we can note that the SSSFD algorithm, which combines superpixel and semantic segmentation, outperforms the superpixel-only foreground detection algorithm MSS in all categories.

Ablation Study for SSSFD.
SSSFD improves the F-measure by a significant margin of 12.5% on the baseline category. This is mainly because the semantic segmentation network can completely segment potential foreground objects encountered in the training process; the F-measure is lifted by the improved foreground detection rate and the more correctly recalled foreground pixels. SSSFD provides around an 11.4% improvement over MSS in the bad weather category. MSS may wrongly detect large snowflakes as foreground; semantic segmentation can eliminate this false detection, reducing false positives and improving precision and F-measure. Since MSS exploits the RGB color feature, it easily detects moving shadows as foreground. The high-performance semantic segmentation network can determine the shadow position more accurately through the extracted global information, eliminating the shadow. SSSFD thus also shows a remarkable improvement relative to MSS in the shadow category, around 19.9%. The dynamic background category poses a different kind of challenge from camera jitter. In a dynamic background, some background objects move, such as curtains moving with the wind, swaying leaves, and sparkling water. A camera jitter video background contains no moving objects; instead, the position of the entire camera changes irregularly due to external forces (such as wind or vibration), resulting in blurred video. Camera jitter causes pixels on an object's edge to be detected as foreground points in the previous frame, background points in the current frame, and possibly foreground points again in the next frame. In the "fountain01" video sequence with a dynamic background, the MSS algorithm detects the gushing fountain as foreground while missing small natural moving objects, making the detection on this sequence fail. When SSSFD is adopted, the false foregrounds are eliminated, and the correct ones are recalled; the F-measure shows a surprising improvement on this sequence.
The improvement in F-measure across the dynamic background category is 8.3%. Although MSS can recall most foreground pixels of moving objects in the camera jitter category, drastic camera jitter also causes many false foreground pixels to be identified, resulting in low precision of the recalled pixels. SSSFD, however, segments the correct foreground object in each input frame regardless of camera shake. Compared with MSS, the F-measure of SSSFD is improved by an astonishing 80.9%. To assess the contribution of MSS, it is removed from SSSFD, and the quantitative foreground results of the semantic segmentation network (SSN) alone are compared with those of SSSFD, as shown in Table 3. SSSFD outperforms SSN on every metric, which confirms the role of MSS in SSSFD: for pixels whose class the semantic segmentation network cannot decide confidently, MSS determines the final label.
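This fusion rule can be sketched as follows. The function name, the confidence thresholds, and their values are illustrative assumptions, not the paper's exact parameters: confidently classified pixels keep the semantic network's decision, while uncertain pixels fall back to the multiscale superpixel (MSS) mask.

```python
import numpy as np

def fuse_masks(semantic_prob, mss_mask, t_fg=0.8, t_bg=0.2):
    """Hypothetical sketch of the fusion rule.

    semantic_prob: per-pixel foreground probability in [0, 1] from the
                   semantic segmentation network.
    mss_mask:      binary foreground mask from multiscale superpixel
                   segmentation (1 = foreground).
    """
    fused = np.where(semantic_prob >= t_fg, 1,        # confident foreground
            np.where(semantic_prob <= t_bg, 0,        # confident background
                     mss_mask))                       # uncertain: MSS decides
    return fused.astype(np.uint8)
```

The thresholds t_fg and t_bg control how much of the image is delegated to MSS; tighter thresholds hand more pixels to the unsupervised branch.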

Comparison with Other Algorithms.
This section presents an exhaustive evaluation of SSSFD. The proposed algorithm is compared with fourteen fundamental and state-of-the-art approaches, including unsupervised methods like GMM [44], KDE [11], ViBe [12], PAWCS [45], SuBSENSE [13], WeSamBE [14], SemanticBGS [8], and supervised methods such as IUTIS-5 [46], GraphMOS [47], BSUV-Net 2.0 [48], DVTN [49], Cascaded CNN [17], 3DCNN [35], and DeepBS [29]. Qualitative comparisons of the foreground results are shown in Figure 9, which involves ten challenging video sequences from the CDNet2014 dataset. Table 4 reports the quantitative results of the SSSFD algorithm on the baseline, bad weather, dynamic background, shadow, and camera jitter categories. Note that the best and second-best performing algorithms are displayed in bold red and blue, respectively. It is apparent from Figure 9 that SSSFD presents the best visual results, while GraphMOS and Cascaded CNN show competitive performance on these sequences. In the severe weather conditions of "blizzard," only SSSFD and Cascaded CNN detect all six cars moving in the distance, while the other algorithms miss foreground objects to varying degrees. The sequence "fall" presents a dynamic background variation challenge because of a large area of swaying leaves. GMM's result contains many false positives because a single statistical model cannot describe the complex dynamic background. GraphMOS detects a stationary car in a parking lot as a foreground object, a failure mode that can occur in supervised algorithms. Benefiting from the supplementary information in the semantic background model, SSSFD correctly classifies stationary cars as background. The "overpass" is another typical dynamic background sequence that also contains slight shadows. Only SSSFD and GraphMOS can accurately locate the pedestrian. However, due to the oversmoothing of GraphMOS, the pedestrian appears visually slightly rounder than the ground truth.
In the "peopleInShade" and "badminton" sequences, a similar situation also occurs. e "sidewalk" is a video shot by a camera shakes violently. e camera jitter causes the position of all objects to move throughout the image, which is a massive challenge for any foreground detection algorithms. We observe that GMM cannot detect pedestrians because the variations of pixels in the picture do not follow a Gaussian distribution. e foreground masks provided by SuBSENSE and IUTIS-5 are also an incomplete part of the pedestrian's body. e detection results of SSSFD also lose part of the foreground, which is caused by the inability of the semantic segmentation network to segment small objects with irregular shapes accurately. Table 4 proves that the performance of the deep learning-based foreground detection algorithm is much higher than that of the traditional unsupervised algorithm. Although it is controversial whether supervised foreground detection algorithms can be applied in actual video surveillance, we believe that high-performance supervised algorithms are helpful for some monitoring situations where foreground objects are relatively fixed. SSSFD achieves the best results in four out of five challenges, including baseline (0.9818), bad weather (0.9743), dynamic background (0.9756), and camera jitter (0.9819), while it gains encouraging performance for the remaining categories shadow (0.9840).
In conclusion, it is evident from the qualitative and quantitative results that SSSFD obtains better performance and is closer to the ground truth in most video sequences.

Conclusion
This paper proposes a foreground detection algorithm called SSSFD. It uses an unsupervised algorithm based on multiscale superpixel segmentation to acquire background subtraction results. We train a semantic segmentation network on a subset of manually selected frames from CDNet2014 and employ this network to perform semantic segmentation on input images. Updating the background model with the refined foreground results, composed from superpixel and semantic segmentation, dramatically improves the accuracy of foreground detection. The experimental results on the CDNet2014 dataset verify the effectiveness of the proposed algorithm: SSSFD achieves state-of-the-art performance in both qualitative and quantitative evaluation compared with other algorithms. The main limitation of SSSFD is its poor real-time performance. First, multiscale superpixel segmentation is computationally expensive; second, the high-precision semantic segmentation network is slow. Therefore, our future work will investigate more accurate and faster semantic segmentation networks and unsupervised background subtraction algorithms, as well as efficient strategies for fusing semantic information with background subtraction results.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.