Automatic Moving Object Segmentation for Freely Moving Cameras

This paper proposes a new moving object segmentation algorithm for freely moving cameras, which are very common in outdoor surveillance systems, car built-in surveillance systems, and robot navigation systems. A two-layer based affine transformation model optimization method is proposed for camera compensation, where the outer layer iteration is used to filter out the nonbackground feature points, and the inner layer iteration is used to estimate a refined affine model based on the RANSAC method. Then the feature points are classified into foreground and background according to the detected motion information. A geodesic based graph cut algorithm is then employed to extract the moving foreground based on the classified features. Unlike the existing global optimization or long term feature point tracking based methods, our algorithm operates on only two successive frames to segment the moving foreground, which makes it suitable for online video processing applications. The experimental results demonstrate the effectiveness of our algorithm in terms of both accuracy and speed.


Introduction
Moving object detection and segmentation is a basic technique for many applications such as intelligent video surveillance, intelligent transportation systems, video content analysis, video event detection, and video semantic annotation. In all these applications, the cameras capturing the videos may not be static. For example, the camera of an outdoor surveillance system may shake slightly because of strong winds, and the video used for content analysis or event detection may be captured by a hand-held camera. Thus a moving object detection and segmentation algorithm that can handle freely moving cameras is necessary for these cases. However, most of the existing moving object detection and segmentation algorithms are designed only for static cameras, such as the Gaussian mixture models proposed by Stauffer and Grimson [1] and the kernel density estimation (KDE) used in [2]. Many methods have been proposed to improve these kinds of algorithms: Sun et al. [3] employed the graph cut algorithm [4] to improve the accuracy of the segmentation results, and Patwardhan et al. [5] constructed a layer model of the scene to improve the robustness of foreground detection and segmentation. However, none of these methods can be directly extended to freely moving cameras.
In recent years, several moving object detection and segmentation algorithms for freely moving cameras have been proposed [6][7][8][9][10][11][12]. Liu and Gleicher [6] proposed to learn a moving object model by collecting the sparse and insufficient motion information throughout the video. They first detect the moving patches of the foreground object and then combine the moving patches of many frames to learn a color model of the foreground object, which is used for segmentation. However, this kind of method can only process video sequences offline and cannot be applied to online cameras. Kundu et al. [7] proposed a motion detection framework based on multiview geometric constraints such as the epipolar constraint. However, this method needs to calibrate the robot camera with a chessboard and can only detect rough moving regions instead of producing an accurate object segmentation, which restricts its application. Zhang et al. [8] proposed to use a structure from motion method to detect and segment the moving foreground object. This method first estimates a dense depth map for each frame; then, in the segmentation step, a global optimization is applied to multiple frames to extract the moving object. The depth map estimation and object segmentation steps are run iteratively several times in order to obtain accurate results. This method is very time consuming and can only be used for offline video sequences.
Several algorithms [9][10][11][13] that employ point trajectories to segment the moving objects have been proposed in recent years. The intuition behind these methods is that the motion caused by the camera movement is restricted by some geometric constraints, while the motion caused by the object movement is not. Thus the moving object can be detected and segmented by analyzing the long term trajectories of the key points. However, these methods usually need to calculate the dense optical flow over long time spans, which may be too time consuming to run in real time. Once again, these methods cannot be used in an online scenario, because they do not process the video frame by frame. Elqursh and Elgammal [14] improve the point trajectory based method by adding a Bayesian filtering framework to estimate the motion and appearance models. Their method also updates the point trajectories and the motion/appearance models online, so that it can be used for online video segmentation. However, the high computational cost is still a problem.
In this paper, we propose a novel moving object detection and segmentation algorithm for the freely moving cameras.
Compared to the existing moving object segmentation algorithms for freely moving cameras, our algorithm has the following characteristics.
(1) Unlike most of the existing algorithms, our algorithm does not employ global optimization or long term feature point tracking. It only uses two successive frames to extract the moving object, which makes it suitable for online video processing tasks.
(2) A two-layer iteration based camera motion compensation method is proposed, where the outer layer iteration is used to update the foreground and background feature sets according to the current parameters of the camera motion compensation model, and the inner layer iteration employs a RANSAC method to estimate the parameters of the camera motion compensation model based on the current background feature set. This two-layer iteration makes the camera motion compensation more robust and accurate.
(3) A feature classification and filtering algorithm based on a GMM color model is proposed, and the classified feature points are used as the input of the geodesic distance based graph cut algorithm, which returns a very accurate segmentation result.
The rest of the paper is arranged as follows. Section 2 gives an overview of our algorithm, and Section 3 describes its details. After the experiments and discussions in Section 4, the conclusions are presented in Section 5.

Algorithm Overview
Figure 1 shows a flow chart of our algorithm. As described before, our algorithm is based on only two successive frames, so its input is the former and current frames of a video. The algorithm has 3 steps.
(1) Camera Motion Compensation. Since the camera movement between two successive frames is very small in most cases, we can simply assume that the background between the former frame and the current frame undergoes only translation and rotation. Thus an affine transformation model can be employed to describe the movement of the background. When estimating the affine transformation parameters, the corresponding feature points are first found by a forward and backward optical flow algorithm, and then a two-layer iteration based method is used to estimate the parameters.
(2) Feature Extraction and Classification. The edge and corner features [15] are extracted and then classified into moving foreground features (denoted as red points) and background features (denoted as blue points) according to the detected motion regions. The foreground and background feature sets are then filtered by GMM color models.
(3) Foreground Extraction with Geodesic Distance Based Graph Cut. After the foreground and background feature sets are obtained, the geodesic distances from the other pixels to the feature points are calculated, and a geodesic confidence map is generated. By incorporating the geodesic distance and the geodesic confidence map into the graph cut algorithm, an accurate foreground object can be segmented.

Details of Our Algorithm
3.1. Camera Motion Compensation. For most videos, the camera moves only slightly between two successive frames; thus it is assumed that the camera undergoes only translation and rotation in such a short interval, which can be modeled by an affine transformation. In this model, the displacement vector $\mathbf{u} = (u, v)$ of pixel $(x, y)$ is assumed to be an affine function of the coordinate $(x, y)$:

$$\mathbf{u}(x, y) = \mathbf{R} \begin{pmatrix} x \\ y \end{pmatrix} + \mathbf{T}, \tag{1}$$

where $\mathbf{R}$ is the rotation matrix and $\mathbf{T}$ is the translation vector, with parameters

$$\mathbf{R} = \begin{pmatrix} r_1 & r_2 \\ r_3 & r_4 \end{pmatrix}, \qquad \mathbf{T} = \begin{pmatrix} t_1 \\ t_2 \end{pmatrix}.$$

Here the rotation matrix $\mathbf{R}$ also contains the scale change parameters $r_1$ and $r_4$; thus this model can handle scale changes of the background scene, such as those in a video captured by a forward or backward moving camera.
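As a small illustration of this displacement model, the following sketch evaluates the affine displacement at two pixels. The function name `affine_displacement`, the names `R` and `t`, and the numeric values are hypothetical choices made for this example only; they are not parameters reported in the paper.

```python
import numpy as np

def affine_displacement(points, R, t):
    """Displacement u = R @ p + t for each row p = (x, y) of `points`."""
    points = np.asarray(points, dtype=float)
    return points @ np.asarray(R, dtype=float).T + np.asarray(t, dtype=float)

# Hypothetical parameters: a 2% zoom about the origin (diagonal of R)
# plus a shift of (1.5, -0.5) pixels.
R = np.array([[0.02, 0.0],
              [0.0, 0.02]])
t = np.array([1.5, -0.5])

pts = np.array([[0.0, 0.0], [100.0, 50.0]])
disp = affine_displacement(pts, R, t)   # one (u, v) row per input pixel
```

The pixel at the origin moves only by the translation, while the pixel at (100, 50) additionally picks up a displacement proportional to its coordinates, which is how the diagonal entries account for a camera moving forward or backward.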
Since the camera motion and the foreground motion are distinct, the foreground motion cannot be appropriately modeled by the affine transformation model. Ideally, therefore, the pixels used to estimate the affine parameters should contain only background pixels. This can be achieved by our two-layer iteration based method, as shown in Figure 2. The outer layer iteration is used to update the foreground and background feature points according to the motion regions detected with the current affine parameters. The inner layer RANSAC process is used to estimate the affine parameters based on the updated background features.
The feature points used in our paper are the edge and corner points, which can be detected using the method described in [15]. In order to estimate the affine model parameters, the corresponding feature points of the two successive frames should be detected. We employ forward and backward optical flow estimation to achieve this goal. For the current frame $I_t$, we first extract its feature points (denoted as $F^t = \{f_i^t, i = 1 \cdots N\}$, where $N$ is the number of feature points) and then use the pyramid Lucas-Kanade optical flow [16] to track these features to the next frame $I_{t+1}$. Thus we obtain a set of feature points on frame $I_{t+1}$ by this forward optical flow, denoted as $F^{t+1} = \{f_i^{t+1}, i = 1 \cdots N\}$. Then we track the features $f_i^{t+1}$ back from $I_{t+1}$ to $I_t$ using the backward optical flow, obtain a new set of features on $I_t$, and denote it as $\hat{F}^t = \{\hat{f}_i^t, i = 1 \cdots N\}$. In the ideal case, $F^t$ and $\hat{F}^t$ should be the same. However, due to the errors of the optical flow estimation, they are not identical. By comparing $F^t$, $\hat{F}^t$, and $F^{t+1}$, we can remove the feature points that have erroneous optical flow, so as to find the correct corresponding feature points between the two successive frames. We use two criteria to filter the optical flow errors. The first is the ZNCC (zero-mean normalized cross correlation), which is defined as

$$\mathrm{ZNCC}(\mathbf{x}, \mathbf{x}') = \frac{\sum_{\mathbf{d} \in W} \left(I_t(\mathbf{x} + \mathbf{d}) - \bar{I}_t(\mathbf{x})\right)\left(I_{t+1}(\mathbf{x}' + \mathbf{d}) - \bar{I}_{t+1}(\mathbf{x}')\right)}{\sqrt{\sum_{\mathbf{d} \in W} \left(I_t(\mathbf{x} + \mathbf{d}) - \bar{I}_t(\mathbf{x})\right)^2 \sum_{\mathbf{d} \in W} \left(I_{t+1}(\mathbf{x}' + \mathbf{d}) - \bar{I}_{t+1}(\mathbf{x}')\right)^2}}, \tag{2}$$

where $\mathbf{x}$ and $\mathbf{x}'$ are the coordinates of the corresponding feature points in $F^t$ and $F^{t+1}$, respectively, $W$ is the set of pixel offsets inside a $w \times w$ window ($w = 11$ in our experiment), and $\bar{I}_t(\mathbf{x})$ and $\bar{I}_{t+1}(\mathbf{x}')$ are the mean pixel intensities of the windows centered at $\mathbf{x}$ and $\mathbf{x}'$, respectively. The ZNCC score is calculated for each pair of feature points in $F^t$ and $F^{t+1}$, and part of the feature points with erroneous optical flow can then be filtered out by setting a threshold $T_1$; that is, if $\mathrm{ZNCC}(\mathbf{x}, \mathbf{x}') < T_1$, then the optical flow from $\mathbf{x}$ to $\mathbf{x}'$ is considered an error. In our experiment, we find that setting $T_1$ to the median value of the ZNCC scores gives good enough results. The second criterion for filtering the erroneous optical flow is the displacement between the corresponding points in $F^t$ and $\hat{F}^t$, which is defined as the Euclidean distance between their coordinates and denoted as $\mathrm{Dis}(\mathbf{x}, \hat{\mathbf{x}})$. Similarly, if $\mathrm{Dis}(\mathbf{x}, \hat{\mathbf{x}}) > T_2$, the forward optical flow from $\mathbf{x}$ to $\mathbf{x}'$ and the backward optical flow from $\mathbf{x}'$ to $\hat{\mathbf{x}}$ are considered errors. $T_2$ is also set to the median value of the displacements of the corresponding feature points. After filtering the erroneous optical flow, we obtain the feature point matching results shown in Figure 2. The matching feature points are denoted as a feature set $S^t$.
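The two filtering criteria can be sketched as follows, assuming grayscale frames stored as NumPy arrays and feature coordinates given as integer `(x, y)` pairs far enough from the image border. The helper names `zncc` and `filter_matches` are hypothetical, and the optical flow tracking itself is left out; the sketch only covers the scoring and the median thresholds.

```python
import numpy as np

def zncc(img_a, img_b, pa, pb, half=5):
    """ZNCC of two (2*half+1) x (2*half+1) windows centered at (x, y) points."""
    xa, ya = pa
    xb, yb = pb
    wa = img_a[ya - half:ya + half + 1, xa - half:xa + half + 1].astype(float)
    wb = img_b[yb - half:yb + half + 1, xb - half:xb + half + 1].astype(float)
    da, db = wa - wa.mean(), wb - wb.mean()
    denom = np.sqrt((da ** 2).sum() * (db ** 2).sum())
    return float((da * db).sum() / denom) if denom > 0 else 0.0

def filter_matches(zncc_scores, fb_dist):
    """Keep matches with ZNCC >= median and forward-backward distance <= median."""
    zncc_scores = np.asarray(zncc_scores, dtype=float)
    fb_dist = np.asarray(fb_dist, dtype=float)
    t1 = np.median(zncc_scores)   # threshold on the correlation score
    t2 = np.median(fb_dist)       # threshold on the forward-backward displacement
    return (zncc_scores >= t1) & (fb_dist <= t2)

# Synthetic check: a frame shifted by 2 pixels to the right.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(40, 40)).astype(np.uint8)
shifted = np.roll(frame, shift=2, axis=1)
score = zncc(frame, shifted, (20, 20), (22, 20))   # windows have identical content

keep = filter_matches([0.9, 0.2, 0.95, 0.8], [0.1, 0.2, 5.0, 0.3])
```

With `half=5` the window is 11 × 11, matching the window size used in the experiments. Because both thresholds are medians, each criterion alone discards roughly half of the candidates, so the surviving set is deliberately conservative.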
Once the matching feature points are obtained, the two-layer iteration is performed. The details are given in Algorithm 1.
In the inner-layer iteration, the RANSAC algorithm requires 3 pairs of corresponding feature points to estimate the affine parameters $(r_1, r_2, r_3, r_4, t_1, t_2)$. Since these 6 parameters describe the global motion of the whole image, the 3 pairs of feature points sampled from $S_b$ should be distributed over the whole image instead of a local area. For the moving region detection, we use the estimated affine parameters to compensate the camera motion and calculate the frame difference to find the moving regions. Then $S_b$ can be updated by classifying the feature set $S^t$ into foreground and background according to the frame difference:

$$S(x, y) \in \begin{cases} S_f, & D(x, y) > T, \\ S_b, & \text{otherwise}, \end{cases} \tag{3}$$

where $S(x, y)$ denotes the feature point at location $(x, y)$, $S_f$ and $S_b$ are the sets of foreground and background feature points, respectively, and $T$ is a threshold value. $D(x, y)$ is the frame difference value at pixel $(x, y)$ and is calculated as

$$D(x, y) = \left| I(x, y, t+1) - I'(x, y, t) \right|, \tag{4}$$

where $I'(x, y, t)$ is the affine warped current frame.

Algorithm 1: Two-layer iteration based camera motion compensation.
Initialization: The background feature point set is initialized as $S_b = S^t$.
Step 1. Inner-layer iteration: employ the RANSAC algorithm to estimate the affine parameters based on the current feature set $S_b$.
Step 2. Moving region detection: find the moving regions based on the current affine parameters.
Step 3. Update the background feature set $S_b$ according to the detected moving regions.
Step 4. Jump to Step 1 to start a new outer-layer iteration until convergence, that is, until the feature points in $S_b$ are stable.
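The structure of Algorithm 1 can be sketched as below. This is a minimal illustration under several assumptions rather than the implementation used in the experiments: the affine map is fitted to point positions (so the displacement is the mapped position minus the original one), the RANSAC sampling does not enforce the spatial spread of the 3 sampled pairs, and the per-feature motion score is supplied by a hypothetical callable `moving_score`. In the demo that callable is the reprojection residual, used as a stand-in for the frame difference at each feature location.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine map dst ~ A @ src + b, returned as a 2x3 matrix [A | b]."""
    X = np.hstack([src, np.ones((len(src), 1))])
    M, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return M.T

def apply_affine(M, pts):
    return pts @ M[:, :2].T + M[:, 2]

def ransac_affine(src, dst, n_iter=200, tol=1.0, rng=None):
    """Inner layer: RANSAC over 3-point samples, then refit on the best inlier set."""
    rng = np.random.default_rng(0) if rng is None else rng
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(src), size=3, replace=False)
        M = fit_affine(src[idx], dst[idx])
        inliers = np.linalg.norm(apply_affine(M, src) - dst, axis=1) < tol
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 3:                      # degenerate case: fall back to all points
        best[:] = True
    return fit_affine(src[best], dst[best])

def two_layer_compensation(src, dst, moving_score, thresh, max_outer=10):
    """Outer layer: alternate RANSAC on the background set with re-classification."""
    background = np.ones(len(src), dtype=bool)       # initialization: S_b = S^t
    for _ in range(max_outer):
        M = ransac_affine(src[background], dst[background])       # Step 1
        new_background = moving_score(M) <= thresh                # Steps 2 and 3
        if np.array_equal(new_background, background) or new_background.sum() < 3:
            break                                                 # Step 4: stable
        background = new_background
    return M, background

# Synthetic demo: 60 matched features, the first 15 lying on a moving object.
rng = np.random.default_rng(1)
src = rng.uniform(0, 320, size=(60, 2))
true_M = np.array([[1.01, 0.02, 3.0], [-0.02, 1.01, -2.0]])
dst = apply_affine(true_M, src)
dst[:15] += np.array([8.0, 5.0])

def residual(M):
    # Stand-in for the frame difference D(x, y) sampled at the feature points.
    return np.linalg.norm(apply_affine(M, src) - dst, axis=1)

M_est, bg = two_layer_compensation(src, dst, residual, thresh=1.0)
```

In a real pipeline `moving_score` would warp the frame with the current parameters and read the frame difference at each feature, and the loop stops as soon as the background set no longer changes.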

3.2. Feature Extraction and Classification.
After obtaining the final affine parameters, we can obtain the frame difference using (4) and then classify the feature points $F^{t+1}$ of the current frame into the foreground and background feature sets $F_f$ and $F_b$ using (3), as shown in Figure 1. $F_f$ and $F_b$ cannot be used directly by the graph cut algorithm in the following step to extract the foreground object, because there usually exist some classification errors. As pointed out by the green ellipses in Figure 3(a), some feature points are misclassified. This is because the moving regions detected by the frame difference, as shown in Figure 3(c), are composed of both real foreground regions and false foreground regions. These false foreground regions are actually background regions occluded by the moving object. In order to eliminate these misclassifications, we further perform a refining process. Since we already have an initial classification of the feature points, we can build two Gaussian mixture models (GMMs) for $F_f$ and $F_b$ and then use these two models to reestimate the probability of each feature point belonging to the foreground. The feature points in $F_f$ and $F_b$ are first classified into $K$ clusters, respectively, by a farthest-point clustering algorithm [17], and then the mean and variance $(\mu, \sigma)$ of each cluster are calculated to construct the GMMs. The probability of a feature point belonging to the foreground or the background can be estimated as

$$P_f(\mathbf{c}) = \sum_{k=1}^{K} w_f(k) \prod_{i} \Phi\left(c_i; \mu_{f,k,i}, \sigma_{f,k,i}\right), \qquad P_b(\mathbf{c}) = \sum_{k=1}^{K} w_b(k) \prod_{i} \Phi\left(c_i; \mu_{b,k,i}, \sigma_{b,k,i}\right), \tag{5}$$

where $\mathbf{c}$ is the color vector of the feature point to be estimated, $w_f(k)$ and $w_b(k)$ are the prior probabilities of the $k$-th Gaussian component, calculated as the ratio between the number of feature points in this component and the number of feature points in the whole GMM, $i$ denotes the color channel, and $\Phi$ denotes the Gaussian kernel. Then, for each feature point in $F_f$, if $P_f \le P_b$, the point is removed from $F_f$. Similarly, for each feature point in $F_b$, if $P_f \ge P_b$, the point is removed from $F_b$. It should be noted that the feature points removed from $F_f$ (or $F_b$) are not added to $F_b$ (or $F_f$); they are all marked as unknown and will be assigned a label by the graph cut algorithm. The feature classification after eliminating these errors becomes much better, as shown in Figure 3(b).
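The refinement step can be sketched as follows. This is an illustrative version under stated assumptions: each Gaussian component uses a diagonal covariance (one variance per color channel), variances are clamped from below by a hypothetical `min_var` to avoid degenerate components, and the farthest-point clustering is a simple greedy variant rather than the exact algorithm of [17]. All function names are hypothetical.

```python
import numpy as np

def farthest_point_clusters(colors, k):
    """Pick k centers by farthest-point traversal; assign each color to the nearest."""
    centers = [0]
    dist = np.linalg.norm(colors - colors[0], axis=1)
    for _ in range(1, k):
        nxt = int(dist.argmax())
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(colors - colors[nxt], axis=1))
    d = np.linalg.norm(colors[:, None, :] - colors[centers][None, :, :], axis=2)
    return d.argmin(axis=1)

def fit_gmm(colors, k=2, min_var=1.0):
    """Per-cluster (weight, mean, per-channel variance) triples."""
    labels = farthest_point_clusters(colors, k)
    comps = []
    for c in range(k):
        members = colors[labels == c]
        if len(members) == 0:
            continue
        comps.append((len(members) / len(colors), members.mean(axis=0),
                      np.maximum(members.var(axis=0), min_var)))
    return comps

def gmm_prob(comps, color):
    """Sum over components of weight times the product of per-channel Gaussians."""
    total = 0.0
    for w, mu, var in comps:
        g = np.exp(-0.5 * (color - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        total += w * g.prod()
    return total

def refine(fg_colors, bg_colors, k=2):
    """Boolean keep-masks; dropped points become 'unknown', they are not relabeled."""
    fg_gmm, bg_gmm = fit_gmm(fg_colors, k), fit_gmm(bg_colors, k)
    keep_fg = np.array([gmm_prob(fg_gmm, c) > gmm_prob(bg_gmm, c) for c in fg_colors])
    keep_bg = np.array([gmm_prob(bg_gmm, c) > gmm_prob(fg_gmm, c) for c in bg_colors])
    return keep_fg, keep_bg

# Synthetic colors: a red/blue object on a green/gray background,
# with 3 green background features wrongly placed in the foreground set.
rng = np.random.default_rng(0)
red = rng.normal([200, 40, 40], 5.0, size=(20, 3))
blue = rng.normal([30, 30, 220], 5.0, size=(20, 3))
green = rng.normal([40, 160, 60], 5.0, size=(25, 3))
gray = rng.normal([128, 128, 128], 5.0, size=(25, 3))
fg_colors = np.vstack([red, blue, green[:3]])
bg_colors = np.vstack([green, gray])
keep_fg, keep_bg = refine(fg_colors, bg_colors)
```

One design caveat visible in this sketch: if the number of clusters is large relative to the number of misclassified points, an outlier can seed its own tight cluster and survive the filter, so the cluster count should stay small compared with the size of each feature set.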

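3.3. Foreground Extraction with Geodesic Graph Cut. So far, we have obtained the foreground and background key points; that is, a subset of the pixels has been labeled as foreground or background. Starting from this initial labeling, we obtain a complete foreground segmentation by employing a geodesic graph cut algorithm [18], where the geodesic distance and color models are used to calculate the energy function of the graph cut algorithm, which is defined as

$$E(\mathbf{L}) = \lambda \sum_{i} R_{L_i}(x_i) + \sum_{(i, j) \in \mathcal{N}} B(x_i, x_j) \left| L_i - L_j \right|, \tag{6}$$

where $\mathbf{L} = (L_i)$ is a binary vector, $L_i$ is the label $F$ or $B$ for pixel $x_i$, and $\mathcal{N}$ is the set of neighboring pixel pairs. $R_{L_i}(x_i)$ is the unary term and $B(x_i, x_j)|L_i - L_j|$ is the pairwise term of the energy function. $\lambda$ is a weight that balances the unary and pairwise terms. The unary term is defined as follows:

$$R_{L_i}(x_i) = s_{L_i}(x_i) + \alpha(x_i)\, G_{L_i}(x_i), \tag{7}$$

where $s_{L_i}$ is a constraint for the foreground and background feature points:

$$s_{L_i}(x_i) = \begin{cases} \infty, & x_i \in \Omega_{\bar{L}_i}, \\ 0, & \text{otherwise}, \end{cases} \tag{8}$$

where $\Omega_{L_i}$ indicates the foreground and background features and $\bar{L}_i$ denotes the label opposite to $L_i$ (i.e., if $L_i = F$, then $\bar{L}_i = B$). $G_{L_i}(x_i)$ is computed by normalizing the relative foreground/background geodesic distances:

$$G_{L_i}(x_i) = \frac{D_{L_i}(x_i)}{D_F(x_i) + D_B(x_i)}, \tag{9}$$

where the geodesic distances $D_F$ and $D_B$ from each pixel to the foreground and background feature points are computed efficiently by the method proposed in [19]. $\alpha(x_i)$ is the geodesic confidence, which is defined as

$$\alpha(x_i) = \frac{\left| D_F(x_i) - D_B(x_i) \right|}{D_F(x_i) + D_B(x_i)}. \tag{10}$$

The pairwise term $B(x_i, x_j)$ is defined as

$$B(x_i, x_j) = \frac{1}{1 + \left\| C(x_i) - C(x_j) \right\|^2}, \tag{11}$$

where $C(x_i)$ and $C(x_j)$ are pixel colors.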
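To make the geodesic ingredients of the unary term concrete, the sketch below computes geodesic distances from seed pixels and derives a normalized distance and a confidence map from them. Several points are assumptions made for illustration: plain Dijkstra on the 4-connected grid replaces the faster method of [19], the edge cost is the absolute intensity change on a grayscale image, and the confidence is taken as the normalized absolute difference of the two distances. The function names are hypothetical, and no graph cut is run here; labels are simply read off the smaller normalized distance.

```python
import heapq
import numpy as np

def geodesic_distance(image, seeds):
    """Dijkstra on the 4-connected pixel grid; edge cost = absolute intensity change."""
    h, w = image.shape
    dist = np.full((h, w), np.inf)
    heap = []
    for (y, x) in seeds:
        dist[y, x] = 0.0
        heapq.heappush(heap, (0.0, y, x))
    while heap:
        d, y, x = heapq.heappop(heap)
        if d > dist[y, x]:
            continue                      # stale queue entry
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                nd = d + abs(float(image[ny, nx]) - float(image[y, x]))
                if nd < dist[ny, nx]:
                    dist[ny, nx] = nd
                    heapq.heappush(heap, (nd, ny, nx))
    return dist

def geodesic_terms(d_fg, d_bg, eps=1e-12):
    """Normalized distances for labels F and B, and the confidence map."""
    total = d_fg + d_bg + eps
    return d_fg / total, d_bg / total, np.abs(d_fg - d_bg) / total

# Toy image: dark left half (background), bright right half (object).
image = np.zeros((4, 6))
image[:, 3:] = 100.0
d_fg = geodesic_distance(image, seeds=[(1, 4)])   # one foreground feature (row, col)
d_bg = geodesic_distance(image, seeds=[(1, 1)])   # one background feature (row, col)
g_f, g_b, alpha = geodesic_terms(d_fg, d_bg)
labels = g_f < g_b                                # True where foreground is cheaper
```

Even with a single seed per class, the labels spread over each homogeneous region because moving inside a region costs nothing, which is the property that lets a sparse set of feature points constrain the whole segmentation.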

Experiment and Discussion
4.1. Test of Our Algorithm. We test our algorithm on many videos captured by freely moving cameras. Some results are shown in Figure 4.
It can be seen that the image alignment algorithm employed in our algorithm is very efficient, so that the moving object regions can be well detected by the frame difference algorithm as shown in Figure 4(b), and the feature points can be classified accurately as shown in Figure 4(c).From Figure 4(d), we can see that although we only constrain a small part of pixels (feature points), the labels can be correctly propagated to other nonfeature pixels, and the accurate segmentation can be obtained.
We test our algorithm on a laptop with a four-core 2.1 GHz CPU and 8 GB of RAM. The image alignment step takes most of the computational time. However, we speed up this step by using the affine transformation parameters obtained from the former pair of frames to initialize the parameters of the current pair of frames. Thus the whole system can run at about 15 fps for videos of size 320 × 240. For the image alignment step, we suggest downsampling the image to a relatively small scale to estimate the affine transformation parameters. This not only improves the speed but also helps improve the accuracy. The reason is that, for a large scale image, the background displacement may be very large and may not be well estimated by the affine transformation model, even though the affine estimation algorithm already addresses this problem by employing a pyramid model. Figure 5 shows a comparison of the image alignment results for the same pair of frames at different scales. When aligning the images at the 640 × 480 scale, the frame difference for the background regions is a little high, so that some background feature points are misclassified into foreground, as shown in Figure 5(b). In comparison, aligning the images at the 320 × 240 scale obtains much better results, as shown in Figure 5(c).
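When the affine parameters are estimated on a downsampled frame, they have to be converted before they can be applied at full resolution. A minimal sketch of that conversion follows, assuming the affine map acts on pixel positions, is stored as a 2 × 3 matrix, and that coordinates scale by a single factor (half-pixel center offsets are ignored). The function name `upscale_affine` and the numeric values are hypothetical.

```python
import numpy as np

def upscale_affine(M_small, scale):
    """Convert a 2x3 affine [A | b] estimated at low resolution to full resolution."""
    M_full = np.array(M_small, dtype=float)
    M_full[:, 2] *= scale   # only the translation depends on the pixel scale
    return M_full

def apply_affine(M, p):
    return M[:, :2] @ np.asarray(p, dtype=float) + M[:, 2]

# Parameters estimated at 320 x 240, applied at 640 x 480 (scale factor 2).
M_small = np.array([[1.0, 0.01, 2.0],
                    [-0.01, 1.0, -1.0]])
M_full = upscale_affine(M_small, scale=2.0)

q_small = apply_affine(M_small, (50.0, 30.0))    # point in the small frame
q_full = apply_affine(M_full, (100.0, 60.0))     # same point at full resolution
```

The linear part is unchanged because it is dimensionless, while the translation is measured in pixels and therefore grows with the scale factor.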
Our algorithm also works for static camera videos and can usually run at very high speed in that case, because the image alignment step can be removed. According to our test, our system can run at more than 25 fps on the laptop mentioned above. Figure 6 shows some results of applying our algorithm to videos captured by static cameras.

4.2. Comparison with Existing Algorithms. We compare our algorithm with three recently proposed algorithms [20][21][22]. The comparison results are shown in Figure 7. As can be seen, the results obtained by our method are clearly more accurate than the results obtained by Zhang et al. [20] and Zhou et al. [21]. Both Zhang's and Zhou's methods may mistakenly label large areas of foreground or background. More concretely, our algorithm is more robust to topology changes of the object, while Zhang's and Zhou's methods tend to segment the object erroneously when its topology changes greatly. Our algorithm obtains comparable or even better segmentation results than the method of Papazoglou and Ferrari [22], as shown in the last rows of Figure 7; that method is reported to outperform the state-of-the-art algorithms [22]. The comparison can be observed more clearly through the Error Rate evaluation of these algorithms. Error Rate is a criterion commonly used for comparing the accuracy of object segmentation algorithms, and it is defined as

$$\text{Error Rate} = \frac{\#\ \text{mislabeled pixels in frame } t}{\#\ \text{all the pixels in frame } t}, \tag{12}$$

where "#" means "the number of". The error rate comparison results for each frame of the two test image sequences are shown in Figure 8. It can be seen that our results are better than the results obtained by Zhang et al. [20] and Zhou et al. [21].
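The Error Rate of a single frame can be computed from a predicted mask and a ground-truth mask as in the following sketch. The function name `error_rate` and the toy masks are hypothetical and serve only to illustrate the definition.

```python
import numpy as np

def error_rate(pred_mask, gt_mask):
    """Fraction of pixels whose predicted label differs from the ground truth."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    if pred.shape != gt.shape:
        raise ValueError("masks must have the same shape")
    return float(np.count_nonzero(pred != gt)) / pred.size

# Toy 4 x 5 frame: a 2 x 3 object, one false positive and one false negative.
gt = np.zeros((4, 5), dtype=bool)
gt[1:3, 1:4] = True
pred = gt.copy()
pred[0, 0] = True     # background pixel labeled as foreground
pred[1, 1] = False    # foreground pixel labeled as background
rate = error_rate(pred, gt)   # 2 mislabeled pixels out of 20
```

Because the denominator counts all pixels, false positives and false negatives are penalized equally, and a small object in a large frame yields a low error rate even when the object is segmented poorly.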
Although the results obtained by our algorithm are comparable with Papazoglou and Ferrari [22], our algorithm runs much faster than Papazoglou and Ferrari [22] as discussed in the following section.

4.3. Run Time.
As described before, our algorithm runs at high speed: about 15 fps for videos captured by nonstatic cameras and 25 fps for videos captured by static cameras. The three existing algorithms [20][21][22] all run at a relatively low speed compared to our method. According to our test, Zhou's algorithm [21] takes more than 2 s on average to process a 320 × 240 image, which means it takes at least 30 times longer than our method to process the same sequence. Papazoglou and Ferrari [22] have reported that, given the optical flow and superpixels, their fast segmentation method takes only 0.5 s/frame, which is still slower than our algorithm. Furthermore, the optical flow and superpixels must also be calculated, which takes a long time. For example, Papazoglou and Ferrari [22] employ Brox's optical flow estimator [23], which takes more than 7 s and is therefore very time consuming. Zhang et al. [20] employ an even more complex method including optical flow estimation, GMM and EM algorithms, and a graph cut optimization algorithm. This takes even longer, that is, more than 8.5 s/frame in our test.

Conclusion
This paper proposed a real time online moving object detection and segmentation algorithm for videos captured by freely moving cameras, which uses only two successive frames to segment the moving object. A two-layer iteration algorithm is proposed to accurately estimate the affine transformation parameters between two successive frames. A feature point detection and filtering algorithm is proposed to remove the erroneous foreground and background feature points. The object is finally extracted by a geodesic graph cut algorithm. The algorithm is demonstrated to be effective on many videos. Compared to the existing long term key point trajectory based algorithms, our algorithm not only works in online processing mode but also runs at high speed. This makes our algorithm practical for many applications.

Figure 1: The flow chart of our algorithm.

Figure 2: Camera motion compensation based on two-layer iteration.

Figure 3: Feature classification errors and elimination.

Figure 4: Foreground extraction results for videos captured by moving cameras. (a) shows the original frames, (b) shows the frame difference after the image alignment, (c) shows the feature extraction and classification, and (d) shows the binary mask of the moving object.

Figure 5: Image alignment results at different scales. (a) is the original current image, (b) is obtained at 640 × 480 scale, and (c) is obtained at 320 × 240 scale.

Figure 7: Comparison with the algorithms proposed recently by Zhang et al. [20], Zhou et al. [21], and Papazoglou and Ferrari [22]. For both (a) and (b), the top row shows some images of image sequence-1 and sequence-2, the following three rows show the foreground segmentation results obtained by the three existing algorithms, and the last row shows the results obtained by our algorithm.