Moving Object Detection for Dynamic Background Scenes Based on Spatiotemporal Model


Compared with optical flow [10,11] and interframe difference algorithms [12], background subtraction requires less computation, performs better, and is more flexible and effective. The idea of background subtraction is to compare the current image with a reference background model. Such algorithms first initialize a background model that represents the scene without moving objects and then detect moving objects by computing the difference between the current frame and the background model. Dynamic backgrounds, such as waving tree leaves and ripples on a river, are a challenge for background subtraction. In the past several years, many background subtraction algorithms have been proposed, and most of them focus on building a more effective background model to handle dynamic backgrounds, as follows:
(1) Features: texture and color [13][14][15];
(2) Combining methods: combining two or more background models into a new model [16];
(3) Updating the background model [17].
In this paper, a new pixelwise, nonparametric moving object detection method is proposed. The background model is built from the first $N_1$ frames by randomly sampling $K$ values from the 3 × 3 neighborhood of each pixel in every frame. On the one hand, the spatiotemporal model represents dynamic background scenes well; on the other hand, a new update strategy keeps the background model adapted to the dynamic background. In addition, the proposed method handles ghosts well. Experimental results show that the proposed method detects moving objects in dynamic backgrounds efficiently and correctly.
This paper is organized as follows. Section 2 presents an overview of existing background subtraction approaches. Section 3 describes the proposed method in detail, and Section 4 provides the experimental results and a comparison with other methods. Section 5 presents conclusions and further research directions.
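As a minimal illustration of the difference-and-threshold idea described above, the Python sketch below compares a grayscale frame with a single static background image. The nested-list image format, the fixed `threshold`, and the name `subtract_background` are hypothetical simplifications; the method proposed in this paper replaces the single background image with a sample-based spatiotemporal model.

```python
# Illustrative sketch; the function name and threshold are hypothetical.

def subtract_background(frame, background, threshold):
    """Return a binary mask: 1 (foreground) where |frame - background| > threshold."""
    return [
        [1 if abs(p - b) > threshold else 0 for p, b in zip(row_f, row_b)]
        for row_f, row_b in zip(frame, background)
    ]


background = [[10, 10, 10], [10, 10, 10]]  # scene without moving objects
frame = [[10, 12, 90], [11, 10, 95]]       # a bright object enters on the right
print(subtract_background(frame, background, threshold=20))  # [[0, 0, 1], [0, 0, 1]]
```

A single background image cannot represent a pixel that legitimately alternates between several values (leaves, water), which is why the methods reviewed below keep richer per-pixel models.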

Related Work
In this section, several background subtraction methods are introduced, grouped into parametric and nonparametric models.
For parametric models, the most commonly used method is the Gaussian Mixture Model (GMM) [18]. Before GMM, a per-pixel Gaussian model was proposed [19], which first calculated the mean and standard deviation of each pixel and then compared the probability of the current value with a threshold to classify the pixel as background or foreground. However, this Gaussian model cannot cope with noise and dynamic situations, and GMM was proposed to solve these problems. GMM typically uses three to five Gaussian components for each pixel and updates the model after matching. In recent years, several papers [20,21] have improved GMM to make it more flexible and efficient.
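The per-pixel Gaussian idea of [19] can be sketched for a single pixel as follows. This is a simplified reading, not the original implementation: the mean and standard deviation are estimated from a short training history, and the probability threshold is expressed, as is common, as a test of whether the value lies more than `k` standard deviations from the mean. The names, `k`, and `min_sigma` are hypothetical.

```python
# Illustrative sketch; names, k and min_sigma are hypothetical.
import statistics


def fit_pixel_gaussian(history):
    """Estimate the mean and (population) standard deviation of one pixel."""
    return statistics.mean(history), statistics.pstdev(history)


def is_foreground(value, mu, sigma, k=2.5, min_sigma=1.0):
    """Foreground if value lies more than k standard deviations from the mean.

    min_sigma keeps a perfectly static pixel from getting sigma = 0,
    which would flag every tiny change as foreground.
    """
    return abs(value - mu) > k * max(sigma, min_sigma)


history = [100, 102, 98, 101, 99]  # training values of one pixel
mu, sigma = fit_pixel_gaussian(history)
print(mu, round(sigma, 3))
print(is_foreground(101, mu, sigma), is_foreground(140, mu, sigma))
```

A single Gaussian per pixel has one mode, so a pixel that flickers between leaf and sky is poorly described; GMM addresses this by keeping several weighted components.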
In contrast to parametric models, nonparametric models are usually built from a collection of observed values of each pixel or of its neighboring pixels. Kernel Density Estimation (KDE) [22] opened the door to extensive research on nonparametric methods. In [13], a clustering technique was proposed to set up a nonparametric background model: the background samples of each pixel were clustered into a set of codewords. In [23], Wang et al. chose to include a large number (up to 200) of samples in the background model. Since the background models of [13,23] are based only on temporal information, they cannot handle dynamic background scenes well without spatial information. In ViBe [24,25], a random scheme was introduced to set up and update background models: the background model was initialized from the first frame, and the model elements were sampled randomly from each pixel's neighborhood. ViBe shows robustness and effectiveness in dynamic background scenes to some extent. To improve ViBe further, Hofmann et al. [17] proposed an adaptive scheme that automatically tunes the decision threshold based on previous decisions made by the system. However, the background models of [17,24,25] are based only on spatial information, and the lack of temporal information makes it hard for them to handle time-related situations. In [26], a modified Local Binary Similarity Pattern (LBSP) descriptor was proposed to set up the background model in feature space. Unlike LBP, it computes the descriptor from absolute differences. Moreover, intra-LBSP and inter-LBSP codes are calculated with the same predetermined pattern to capture both texture and intensity changes. Change detection results based on LBSP proved efficient compared with many complex algorithms. Reference [27] improved the thresholding of LBSP and combined it with the ViBe method to detect motion; the improvement was obvious in noisy and blurred regions. Reference [28] proposed a spatiotemporal background model that integrates the concepts of a local feature-based approach and a statistical approach into a single framework; the results show that it handles illumination changes and dynamic background scenes well. These algorithms contain both temporal and spatial information and achieve good performance.
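To make the LBSP idea more concrete, the sketch below computes a simplified LBSP code over the 8-neighborhood of a pixel. Each bit records whether a neighbor is similar to a center value, measured by absolute difference, which is the point that distinguishes LBSP from LBP. Using the pixel's own value as the center gives an intra-LBSP code; using the co-located value from another frame roughly corresponds to an inter-LBSP code. The 3 × 3 pattern, the `threshold`, and the function names are hypothetical; the descriptor in [26] uses its own predetermined pattern.

```python
# Illustrative sketch; the 3x3 pattern, threshold and names are hypothetical.

# Clockwise 8-neighborhood offsets (dy, dx); bit i corresponds to OFFSETS[i].
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbsp(image, y, x, center_value, threshold):
    """Simplified LBSP code of pixel (y, x) (must not lie on the image border).

    Bit i is 1 when |neighbor_i - center_value| <= threshold ("similar").
    center_value = image[y][x] gives an intra code; a value taken from
    another frame (e.g. the background) gives an inter code.
    """
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if abs(image[y + dy][x + dx] - center_value) <= threshold:
            code |= 1 << bit
    return code


def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


img = [[10, 10, 10],
       [10, 10, 50],
       [10, 50, 50]]
intra = lbsp(img, 1, 1, img[1][1], threshold=5)  # center from the same image
inter = lbsp(img, 1, 1, 50, threshold=5)         # center from another frame
print(intra, inter, hamming(intra, inter))
```

Comparing codes by Hamming distance makes the descriptor sensitive to local texture changes rather than only to the intensity of the center pixel.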
Initialization and the update strategy are important steps common to background modeling methods. For initialization, some background subtraction methods initialized the background model with the pixel values at each pixel in the first frames [16]. However, this is not effective for dynamic backgrounds because neighboring pixel information is missing. Reference [24] initialized the model from the first frame by randomly choosing neighborhood pixel values as samples. However, it initialized the background model from only one frame. In addition, it sampled 20 pixel values from the neighborhood of the current pixel, which contains only 8 pixels, so repeated selection was inevitable. This ill-considered model in turn affects the segmentation decision. Reference [29] proposed a different way to initialize the background model: every element of the model contained a pixel value and an efficacy value, and the element with the lowest efficacy was removed or updated. However, in dynamic background scenes, the element with the lowest efficacy is not necessarily the worst element. As for the update strategy, in [24], when a pixel has been classified as background, a random process determines whether this pixel is used to update the corresponding pixel model. This works, but it is too undirected to update the model well.
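The repeated-selection argument above is a pigeonhole count: 20 picks among 8 neighbors can contain at most 8 distinct positions, so at least 12 picks are repeats, whatever the random generator does. The Python sketch below checks this and contrasts it with drawing fewer than 8 neighbors without replacement. The helper names are hypothetical.

```python
# Illustrative sketch; helper names are hypothetical.
import random

# The 8 neighbors of a pixel in a 3 x 3 window, as (dy, dx) offsets.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                    (0, 1), (1, -1), (1, 0), (1, 1)]


def draw_with_replacement(n, rng):
    """n independent random picks among the 8 neighbors (repeats allowed)."""
    return [rng.choice(NEIGHBOR_OFFSETS) for _ in range(n)]


def draw_without_replacement(k, rng):
    """k distinct neighbors; random.sample raises ValueError if k > 8."""
    return rng.sample(NEIGHBOR_OFFSETS, k)


rng = random.Random(0)
picks = draw_with_replacement(20, rng)
repeats = len(picks) - len(set(picks))
print(repeats >= 12)  # True for every seed: at most 8 of the 20 picks are distinct
distinct = draw_without_replacement(5, rng)
print(len(set(distinct)) == 5)  # True
```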
Herein, a nonparametric model that collects both historical and neighborhood pixel values is presented to improve performance in dynamic background scenes. The proposed method, based on a spatiotemporal model, collects samples from the history and the neighborhood of each pixel: the model elements are sampled from the neighborhood region in each of the first $N_1$ frames. For the update strategy, the range of randomness is narrowed to increase accuracy. These choices distinguish the proposed method from other methods based on spatiotemporal models.

Spatiotemporal Model for Background
Normally, a background model fits only one kind of scene, and it is difficult to obtain a universal background model that handles all complex and diverse scenes. Some background subtraction methods combine different models or features, such as texture, to obtain universal models. These methods treat every frame as the most complex possible scene and therefore require a large amount of computation. To address this, this paper proposes a novel and simple method to model the background in dynamic background scenes, in which the idea of combining spatial and temporal information is employed to initialize the model. The details of our spatiotemporal model are introduced next. The diagram of the proposed method is shown in Figure 1.

Initialization.
The proposed method initializes the background model from the first $N_1$ frames. First, for each of these frames $i = 1, \ldots, N_1$, the spatial model $BN^{i}(x)$ of pixel $x$ is initialized by randomly picking $K$ pixel values from the 3 × 3 neighborhood of $x$, where $K$ is less than 8.
Then these spatial models are combined to construct the spatiotemporal model $B(x)$:
$$B(x) = \bigcup_{i=1}^{N_1} BN^{i}(x).$$
For convenience of notation, the elements of $B(x)$ are written as
$$B(x) = \{b_1(x), b_2(x), \ldots, b_{N_1 K}(x)\}.$$
The values of $N_1$ and $K$ are discussed in Section 4. In this way, spatial and temporal information are integrated without a large amount of computation, and the effectiveness of the resulting background model is demonstrated by the experiments in Section 4.
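The initialization can be sketched for a single pixel as below, assuming grayscale frames stored as nested lists. Drawing the $K$ neighbors of each frame without replacement (via `random.Random.sample`) is one way to respect $K < 8$ without repeated selection; this choice, the clamping of coordinates at the image border, and the name `init_pixel_model` are assumptions, not details given in the text.

```python
# Illustrative sketch; names, border clamping and sampling without
# replacement are assumptions, not details from the paper.
import random

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]


def init_pixel_model(frames, y, x, k, rng):
    """Build the spatiotemporal sample list B(x) for pixel (y, x).

    frames: the first N1 grayscale frames, each a list of rows.
    k:      values drawn per frame (K < 8); neighbors are drawn without
            replacement, so no neighbor is picked twice within one frame.
    The union over all frames holds N1 * k samples.
    """
    if not 0 < k < 8:
        raise ValueError("k must satisfy 0 < k < 8")
    model = []
    for frame in frames:  # temporal dimension: one spatial model BN^i(x) per frame
        h, w = len(frame), len(frame[0])
        for dy, dx in rng.sample(OFFSETS, k):  # spatial dimension
            yy = min(max(y + dy, 0), h - 1)    # clamp at the border (assumption)
            xx = min(max(x + dx, 0), w - 1)
            model.append(frame[yy][xx])
    return model


# Eight synthetic 5 x 5 frames whose pixels all equal the frame index.
frames = [[[i] * 5 for _ in range(5)] for i in range(8)]
model = init_pixel_model(frames, 2, 2, k=5, rng=random.Random(42))
print(len(model))  # 40 = 8 frames x 5 samples
```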

Segmentation Decision.
Since the proposed model stores only the grayscale values of pixels, the segmentation decision in our single model is simple: the current pixel value is compared with each element of the background model, and the pixel is classified as follows:
$$F_t(x) = \begin{cases} 1, & \text{if } \#\{\, b_k(x) \in B(x) : \operatorname{dist}(I_t(x), b_k(x)) < R(x) \,\} < \#_{\min}, \\ 0, & \text{otherwise}, \end{cases}$$
where $I_t(x)$ is the grayscale value of pixel $x$ in the current frame, $b_k(x)$ is the $k$th element of the model $B(x)$, $R(x)$ is the distance threshold, and $\#_{\min}$ is the minimum number of model elements that must meet the threshold condition. If $F_t(x) = 1$, the pixel belongs to the foreground; otherwise, it belongs to the background.
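The decision rule can be sketched in a few lines of Python, assuming that $\operatorname{dist}$ is the absolute grayscale difference (a reasonable reading given that the model stores only grayscale values, but an assumption here). The loop stops as soon as $\#_{\min}$ matching samples are found, since further matches cannot change the outcome. The function and parameter names are hypothetical.

```python
# Illustrative sketch; dist = absolute grayscale difference is an assumption,
# and the names are hypothetical.

def segment(value, model, radius, min_matches):
    """Return 1 (foreground) or 0 (background) for one pixel.

    A sample "matches" when |value - sample| < radius. The pixel is
    background as soon as min_matches samples match; otherwise foreground.
    """
    matches = 0
    for sample in model:
        if abs(value - sample) < radius:
            matches += 1
            if matches >= min_matches:
                return 0  # enough support: stop early, pixel is background
    return 1


model = [100, 102, 98, 150, 151]
print(segment(101, model, radius=5, min_matches=2))  # 0: background
print(segment(125, model, radius=5, min_matches=2))  # 1: foreground
```

Because the model mixes samples from several frames, a pixel that alternates between two appearances (here around 100 and around 150) is accepted as background in either state.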

Updating Process.
In dynamic background scenes, the background changes constantly, so the background model must be updated regularly to follow it. This section describes in detail the update of the spatiotemporal model and the adaptive update of the decision threshold.

Update of the Spatiotemporal Model.
The proposed method divides the model elements into two parts: a high-efficacy part and a low-efficacy part. The elements $b_k(x)$ that satisfy $\operatorname{dist}(I_t(x), b_k(x)) < R(x)$, that is, those within the distance threshold $R(x)$ of the current value $I_t(x)$, belong to the high-efficacy part, and the rest belong to the low-efficacy part. Random replacement is then performed only among the elements of the low-efficacy part. In addition, the learning rate is determined experimentally to suit the proposed method.
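A possible sketch of this update for one background pixel is shown below. How the learning rate enters is not specified in the text; the sketch assumes, as in ViBe, that an update is performed with probability 1/`subsampling`. This parameter, the name `update_model`, and the choice to leave the model unchanged when the low-efficacy part is empty are assumptions.

```python
# Illustrative sketch; `subsampling` as the learning rate and all names are
# hypothetical assumptions.
import random


def update_model(model, value, radius, subsampling, rng):
    """Update one background pixel's sample list in place.

    Samples with |value - s| < radius form the high-efficacy part and are
    never touched; one sample from the low-efficacy part is overwritten with
    the current value. The update happens with probability 1 / subsampling.
    Returns the index that was replaced, or None if nothing changed.
    """
    if rng.randrange(subsampling) != 0:
        return None
    low = [i for i, s in enumerate(model) if abs(value - s) >= radius]
    if not low:
        return None  # every sample already agrees with the current value
    idx = rng.choice(low)
    model[idx] = value
    return idx


model = [100, 101, 140, 160]
idx = update_model(model, 102, radius=5, subsampling=1, rng=random.Random(7))
print(idx, model)  # idx is 2 or 3; samples 100 and 101 are kept
```

Restricting replacement to the low-efficacy part means samples that still describe the current background are never discarded by chance, unlike a fully random choice over the whole model.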

Update of the Neighborhood.
Background pixels usually occur together in regions, so the neighbors of a pixel classified as background are likely to be background pixels as well, although this may not hold in edge regions.
Therefore, pixels in the neighborhood of a background pixel are more likely to be background pixels than other pixels, so the background model of a neighboring pixel is also updated with the method introduced in Section 3.3.1. After this update, the parameter $\#_{\min}$ is reduced to $\#_{\min} - 1$ when the segmentation decision is made in the neighborhood, which acts as an adaptive update of the decision threshold.
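The neighborhood step can be sketched as follows. The text does not specify how many neighbors are updated or how $\#_{\min} - 1$ is bounded; this sketch assumes, ViBe-style, that one randomly chosen in-bounds 8-neighbor receives the current value, and that the relaxed count never drops below 1. All names are hypothetical.

```python
# Illustrative sketch; one random neighbor per update and the floor of 1 on
# the relaxed count are assumptions. Names are hypothetical.
import random

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]


def replace_low_efficacy(model, value, radius, rng):
    """Overwrite one random low-efficacy sample (|value - s| >= radius), if any."""
    low = [i for i, s in enumerate(model) if abs(value - s) >= radius]
    if low:
        model[rng.choice(low)] = value


def update_neighbor(models, y, x, value, radius, rng):
    """After (y, x) is classified as background, push its value into the
    model of one randomly chosen in-bounds 8-neighbor. Returns that neighbor."""
    h, w = len(models), len(models[0])
    candidates = [(y + dy, x + dx) for dy, dx in OFFSETS
                  if 0 <= y + dy < h and 0 <= x + dx < w]
    ny, nx = rng.choice(candidates)
    replace_low_efficacy(models[ny][nx], value, radius, rng)
    return ny, nx


def neighbor_min_matches(min_matches):
    """#min - 1 for decisions in the neighborhood, floored at 1 (assumption)."""
    return max(min_matches - 1, 1)


models = [[[0, 0, 0] for _ in range(2)] for _ in range(2)]  # 2 x 2 image, 3 samples each
ny, nx = update_neighbor(models, 0, 0, 50, radius=5, rng=random.Random(3))
print((ny, nx), models[ny][nx])  # one neighbor now holds the value 50
```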
The update method above is a memoryless update strategy. A sample in the background model at time $t$ is preserved after one update of the pixel model with probability $(N-1)/N$, where $N$ is the number of samples in the pixel model. For any later time $t + dt$, this probability is
$$P(t, t+dt) = \left(\frac{N-1}{N}\right)^{(t+dt)-t},$$
which can also be written as
$$P(t, t+dt) = e^{-\ln\left(\frac{N}{N-1}\right) dt},$$
where $P(t, t+dt)$ denotes the probability that the sample is still in the model after time $dt$. This shows that the expected remaining lifespan of any sample in the model decays exponentially.
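The two forms of $P(t, t+dt)$ can be checked numerically. The sketch below evaluates both, under the assumption built into the formula that each update overwrites one of the $N$ samples chosen uniformly; $N = 40$ is an illustrative value corresponding to 8 frames × 5 samples per frame, not a setting stated for this formula.

```python
# Illustrative check of the memoryless-update formula; N = 40 is an
# assumed example value.
import math


def survival_probability(n_samples, dt):
    """P(t, t + dt) = ((N - 1) / N) ** dt: chance that a given sample survives
    dt updates when each update overwrites one of the N samples uniformly."""
    return ((n_samples - 1) / n_samples) ** dt


def survival_probability_exp(n_samples, dt):
    """Equivalent exponential form exp(-ln(N / (N - 1)) * dt)."""
    return math.exp(-math.log(n_samples / (n_samples - 1)) * dt)


N = 40  # illustrative: 8 frames x 5 samples per frame
for dt in (1, 10, 50, 100):
    p = survival_probability(N, dt)
    assert math.isclose(p, survival_probability_exp(N, dt))
    print(dt, round(p, 4))
```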

Experiments and Results
In this section, a series of experiments is conducted to analyze the parameter settings and to compare the performance of the proposed method with that of other methods. We thank changedetection.net [34], which provides the datasets for our experiments. The datasets include six test videos in the "dynamic background" category, together with several objective indexes for evaluating performance quantitatively:
$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad F\text{-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}},$$
where True Positive (TP) is the number of correctly classified foreground pixels and True Negative (TN) is the number of correctly classified background pixels, while False Positive (FP) is the number of background pixels incorrectly classified as foreground and False Negative (FN) is the number of foreground pixels incorrectly classified as background. Recall is the percentage of ground-truth foreground pixels that are correctly detected. Precision is the percentage of detected foreground pixels (true and false foreground) that are actually foreground. F-Measure combines Recall and Precision into a single index and is primarily used here to compare different parameter settings and different methods.
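For reference, the three indexes can be computed from a predicted binary mask and a ground-truth mask as in the sketch below. It assumes both masks use 1 for foreground and 0 for background; the function names are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here.

```python
# Illustrative sketch; function names and the zero-denominator convention
# are hypothetical.

def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN for binary masks (1 = foreground, 0 = background)."""
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1  # detected foreground, actually background
            elif t:
                fn += 1  # missed foreground
            else:
                tn += 1
    return tp, fp, fn, tn


def recall_precision_f(tp, fp, fn):
    """Recall, Precision and F-Measure; 0.0 when a denominator is zero."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f


pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
tp, fp, fn, tn = confusion_counts(pred, truth)
print(tp, fp, fn, tn)  # 2 1 1 2
print(recall_precision_f(tp, fp, fn))
```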
The proposed method is implemented in C++ with OpenCV 2.4.9 on a Core i3 CPU at 3.0 GHz with 2 GB of RAM.

Parameter Setting.
As described in Section 3, the model is initialized from the first $N_1$ frames, with $K$ elements sampled randomly from the neighborhood in each frame. A series of experiments was conducted to tune $K$ and $N_1$ with the threshold $R$, the learning rate, and $\#_{\min}$ fixed, and without postprocessing.
Figure 2 shows that performance is better with $K$ from 5 to 6 and $N_1$ from 6 to 10. Results of further experiments with different parameter values are listed in Table 1. With $N_1$ and $K$ fixed, the threshold $R$ and $\#_{\min}$ are then determined experimentally: the results for selecting $R$ are shown in Figure 3, and those for selecting $\#_{\min}$ are shown in Figure 4.
A median filter step was also applied, and Table 2 shows that a 9 × 9 window gives better results. Median filtering is only a postprocessing step to improve the results, so it is omitted when comparing with other algorithms.
Figures 5(b) and 5(c) show the detection results of [13] and of the proposed method, respectively, for the input frame in Figure 5(a). The waving tree leaves in Figure 5(a) are the dynamic background. Since [13] is a temporal-only model, its background model lacks neighborhood pixel information and therefore detects the dynamic background as moving objects. The proposed method considers both temporal and spatial information, setting up the background model from the first 8 frames and sampling 5 times randomly in the 3 × 3 neighborhood region. Therefore, it performs better than [13] in dynamic background scenes.
Figure 6 shows the detection results of ViBe [24] and of the proposed method. Since ViBe [24] sets up the background model based only on spatial information, time-related artifacts such as ghosts may appear. As shown in Figure 6(c), ViBe sets up the background model from the first frame alone and regards all of its pixels as background. If the first frame contains moving objects that later move away (the fiftieth frame, Figure 6(b)), they are detected as ghosts (the cars marked with red rectangles in Figure 6(c)). The background model of the proposed method contains both spatial and temporal information, so it can recognize the moving objects in the first frame, and ghosts are effectively eliminated.
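The median filtering step can be sketched for a binary foreground mask as below; a 9 × 9 window corresponds to `window=9`. Near the image border the window is simply truncated to the pixels that exist, and `statistics.median_low` is used so that ties in even-sized truncated windows resolve to 0 and the output stays binary. Both choices and the function name are assumptions, not the paper's implementation.

```python
# Illustrative sketch; border truncation, median_low tie-breaking and the
# function name are assumptions.
import statistics


def median_filter(mask, window):
    """Median-filter a 2-D mask with a window x window neighborhood (odd size)."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    r = window // 2
    h, w = len(mask), len(mask[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Collect the values inside the (border-truncated) window.
            vals = [mask[yy][xx]
                    for yy in range(max(0, y - r), min(h, y + r + 1))
                    for xx in range(max(0, x - r), min(w, x + r + 1))]
            row.append(statistics.median_low(vals))
        out.append(row)
    return out


noisy = [[0] * 5 for _ in range(5)]
noisy[2][2] = 1  # an isolated false detection
print(median_filter(noisy, 3))  # all zeros: the isolated pixel is removed
```

Larger windows remove larger noise blobs but also erode thin or small true objects, which is the trade-off behind choosing the window size experimentally.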

Comparison with Other Methods.
The proposed method focuses on building and updating a more effective background model to handle dynamic background scenes. The public dynamic background videos from changedetection.net are used in the experiments: "Boats" (7999 frames), "Canoe" (1189 frames), "Fall" (4000 frames), "Fountain01" (1184 frames), "Fountain02" (1499 frames), and "Overpass" (3000 frames). For a fair comparison, no postprocessing is applied to the results of the proposed method. ViBe [24] and CodeBook [13] are two classical background segmentation methods, so the proposed method is compared with them. Experimental results are shown in Figure 7.
[Figure: detection results. Rows: (1) "Bad Weather"; (2) "Baseline"; (3) "Thermal"; (4) "Intermittent Object Motion" and "Turbulence"; (5) "Low Framerate" and "Night Videos"; (6) "Camera Jitter" and "PTZ".]
Beyond dynamic background scenes, the results for other categories of changedetection.net are shown in Figure 8. The proposed method performs well in several categories, such as "Bad Weather," "Baseline," "Thermal," and "Intermittent Object Motion." In other categories, however, it does not perform as well. For example, in the "PTZ" category, after the camera moves, the proposed method needs a rather long time to learn the new background through the update process, which may cause false detections in the meantime. Thus, although the proposed method is not a universal method, it handles most scenes satisfactorily.
The quantitative comparison between the proposed method and other background subtraction methods on the "dynamic background" category is shown in Table 3. Among these methods, ViBe [24] is a nonparametric algorithm from which the proposed method is derived. LOBSTER [27] and Multiscale Spatiotemporal BG Model [30] are spatiotemporal background modeling algorithms similar to the proposed method. EFIC [31] is a popular method on changedetection.net. TVRPCA [32] is an advanced RPCA-based method that is also designed for dynamic background scenes. As shown in Table 3, AAPSA [33] achieves the highest F-Measure owing to its autoadaptive strategy. Apart from AAPSA, the proposed method obtains the highest F-Measure. Thus, although its F-Measure is not the highest overall, the proposed method not only handles dynamic background scenes well but also eliminates ghosts.

Conclusion
In this paper, a novel nonparametric background subtraction method for change detection in dynamic background scenes is proposed. The background model is built by randomly sampling 5 times in the 3 × 3 neighborhood region in each of the first 8 frames. The samples of the background model are separated into a high-efficacy part and a low-efficacy part, and only the samples in the low-efficacy part are replaced randomly. This update strategy continuously optimizes the background model. The experimental results show that, compared with other methods, the proposed method is robust in dynamic background scenes and effective at eliminating ghosts.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.