Video background modeling is an important preprocessing stage for various applications, and principal component pursuit (PCP) is among the state-of-the-art algorithms for this task. One of the main drawbacks of PCP is its sensitivity to jitter and camera movement. This problem has only been partially solved by a few methods devised for jitter or small transformations. However, such methods cannot handle the case of moving or panning cameras in an incremental fashion. In this paper, we greatly expand the results of our earlier work, in which we presented a novel, fully incremental PCP algorithm, named incPCP-PTI, which was able to cope with panning scenarios and jitter by continuously aligning the low-rank component to the current reference frame of the camera. To the best of our knowledge, incPCP-PTI is the first low-rank plus additive incremental matrix method capable of handling these scenarios in an incremental way. The results on synthetic videos and Moseg, DAVIS, and CDnet2014 datasets show that incPCP-PTI is able to maintain a good performance in the detection of moving objects even when panning and jitter are present in a video. Additionally, in most videos, incPCP-PTI obtains competitive or superior results compared to state-of-the-art batch methods.
1. Introduction
Video background modeling consists of segmenting the “foreground” or moving objects from the static “background.” It is an important first step in various computer vision applications [1] such as abnormal event identification [2] and surveillance [3].
Several video background modeling methods, using different approaches such as Gaussian mixture models [4], kernel density estimation [5], or neural networks [6], exist in the literature. More comprehensive surveys of other methods are presented in [1, 7]. Principal component pursuit (PCP) is currently considered to be one of the leading algorithms for video background modeling [8]. Formally, PCP was introduced in [9] as the nonconvex optimization problem
$$\arg\min_{L,S}\ \operatorname{rank}(L) + \lambda \|S\|_0, \quad \text{s.t. } D = L + S, \tag{1}$$
where the matrix $D \in \mathbb{R}^{m \times n}$ is formed by the $n$ observed frames, each of size $m = N_r \times N_c \times N_d$ (rows, columns, and number of channels, respectively); $L \in \mathbb{R}^{m \times n}$ is a low-rank matrix representing the background; $S \in \mathbb{R}^{m \times n}$ is a sparse matrix representing the foreground; $\lambda$ is a fixed global regularization parameter; $\operatorname{rank}(L)$ is the rank of $L$; and $\|S\|_0$ is the $\ell_0$ norm of $S$.
The convex relaxation
$$\arg\min_{L,S}\ \|L\|_* + \lambda \|S\|_1, \quad \text{s.t. } D = L + S, \tag{2}$$
where $\|L\|_*$ is the nuclear norm of matrix $L$ (i.e., $\sum_k \sigma_k(L)$, the sum of the singular values of $L$) and $\|S\|_1$ is the $\ell_1$ norm of $S$, is at the core of most PCP algorithms, including the augmented Lagrange multiplier (ALM) and inexact ALM (iALM) algorithms [10, 11]; however, several other formulations exist (for a complete list, see [12], Table 4). In particular, we point out
$$\arg\min_{L,S}\ \frac{1}{2}\|L + S - D\|_F^2 + \lambda \|S\|_1, \quad \text{s.t. } \operatorname{rank}(L) \le r, \tag{3}$$
where $\|\cdot\|_F$ is the Frobenius norm. This formulation, originally proposed in [13], will be used as the starting point of our proposed method (Sections 2.2.2 and 3.1).
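Since (3) will serve as the starting point of the proposed method, it is worth sketching how it can be attacked by alternating minimization: with $S$ fixed, the $L$-update is the best rank-$r$ approximation of $D - S$ (truncated SVD), and with $L$ fixed, the $S$-update is elementwise soft thresholding of $D - L$. A minimal NumPy sketch of this idea (illustrative only; function names are ours, and this is the batch form, not an incremental implementation):

```python
import numpy as np

def shrink(x, lam):
    # Element-wise soft thresholding (the prox operator of the l1 norm).
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def pcp_frobenius(D, r, lam, n_iter=50):
    """Alternating-minimization sketch for
    argmin_{L,S} 0.5*||L + S - D||_F^2 + lam*||S||_1  s.t. rank(L) <= r."""
    S = np.zeros_like(D)
    for _ in range(n_iter):
        # L-update: best rank-r approximation of the residual D - S.
        U, s, Vt = np.linalg.svd(D - S, full_matrices=False)
        L = (U[:, :r] * s[:r]) @ Vt[:r, :]
        # S-update: soft thresholding of the residual D - L.
        S = shrink(D - L, lam)
    return L, S
```

Each half-step solves its subproblem exactly, so the objective of (3) is monotonically nonincreasing over the iterations.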
Bouwmans and Zahzah [8] showed that PCP provides state-of-the-art performance in video background modeling problems but also pointed out some of its limitations.
First, PCP is inherently a batch method with high computational and memory requirements. This problem has been addressed in the past by means of solutions based on rank-1 updates for thin SVD [14, 15] (applied to (3)), by low-rank subspace tracking [16] (applied to (2)), stochastic optimization [17] (which applies the maximum-margin matrix factorization (M3F) method [18] to (2)), or random sampling [19] (also applied to (2)).
The second shortcoming of PCP, which is particularly relevant to the present work, is its sensitivity to jitter and its inability to cope with panning video frames. For a general review and classification of methods for motion segmentation able to cope with different degrees of camera motion, we recommend [20] and the many references therein. Among the methods based on the low-rank plus additive matrices model, we highlight the robust alignment by sparse and low-rank decomposition (RASL) method [21]. This method uses (2) as its starting point and addresses the problem of jitter in PCP using a series of geometric transformations on the observed frames, but as originally cast, it is a batch method. In contrast, t-GRASTA [22] and incPCP-TI [23], which used (2) and (3), respectively, as their starting points, addressed the problem of jitter in a semi-incremental or fully incremental way by applying geometric transformations to the observed frames or to the low-rank component, respectively. Other proposed methods are robust against moving cameras and panning [24–26], but all of them are batch or semibatch methods; furthermore, all of them used (2) as their starting point together with the same general ideas as RASL (see (4)). Recently, Gao et al. [27] presented a new batch PCP method that produces a panoramic low-rank component spanning the entire field of view, which gives much better results in long panning sequences. We note, nonetheless, that a fully online PCP algorithm able to cope with both jitter and panning is still an open problem. This limitation is of particular importance in applications such as surveillance systems that use moving traffic or aerial cameras.
In the present study, we expand our previous work [28], where we proposed to address the panning problem by modifying the optimization problem solved by incPCP-TI [23], which in turn uses (3) as its starting point, and by applying a set of transformations to the low-rank component that is updated with each incoming frame. We substantially expand [28] by
expanding the theoretical basis of our algorithm so that the present manuscript is self-contained
testing our algorithm in an additional real-life dataset with moving cameras
comparing our algorithm with two previously proposed batch methods
Our computational experiments on synthetically created datasets and publicly available videos of the Moseg [29], DAVIS [30], and CDnet2014 [31] datasets show that the proposed algorithm, henceforth referred to as panning and transformation invariant incPCP (incPCP-PTI), is able to correctly handle video background modeling under panning and basic jitter conditions.
2. Previous Related Work
2.1. Batch Methods
In this section, two previous motion segmentation batch methods that work under jitter/panning conditions are reviewed. For a more complete review of all available methods, refer to [20]. The two algorithms described here work in a batch fashion but were chosen as comparison benchmarks in this paper due to the public availability of their code and/or binary executables. It is noted that although [27] recently published a new PCP method for moving cameras, that algorithm was not chosen for comparison due to the unavailability of public code or executables.
2.1.1. Segmentation by Long-Term Video Analysis
In [29], the authors proposed to use a dense point tracker based on variational optical flow in which, instead of the classical two-frame approach of optical flow, long-term analysis is used. It is worth mentioning that, following the general classification of motion segmentation methods proposed in [12], the method of [29] was catalogued as a trajectory classification method (see Table 3 of [12] for a summary of the aforementioned classification, along with the associated properties). After the initial tracking of points, spectral clustering with a spatial regularity constraint is utilized to form groups of point trajectories corresponding to different objects in the image. Finally, an energy minimization model is used to transform the clusters into a dense segmentation of moving objects. Throughout this paper, this method will be referred to as LTVA.
2.1.2. DECOLOR
The detecting contiguous outliers in the low-rank representation (DECOLOR) method [25] uses a nonconvex penalty and a Markov random field [32] model to detect outliers that correspond to moving objects. Bouwmans et al. [12] classified this algorithm as a low-rank and sparse representation method [21]. For moving cameras, the method uses a transformation obtained from a prealignment to the middle frame. The prealignment is performed using the robust multiresolution method proposed in [33], and DECOLOR then iteratively refines this transformation.
2.2. Online or Semionline Methods
In this section, two previous online or partially online PCP methods that work under jitter conditions are reviewed. It should be noted that, without modification, these two methods are not directly applicable to panning scenarios.
2.2.1. t-GRASTA
The Grassmannian robust adaptive subspace tracking algorithm (GRASTA) [16] is a semionline method for low-rank subspace tracking that has been applied to the foreground-background separation problem. GRASTA is not a fully online algorithm, as it requires an initialization stage to obtain an initial low-rank subspace from the first $p$ frames. A modification called t-GRASTA was presented in [22]; it is based on the robust alignment by sparse and low-rank decomposition (RASL) algorithm [21]. RASL handles misalignment in the video frames by solving
$$\arg\min_{L,S,\tau}\ \|L\|_* + \lambda \|S\|_1, \quad \text{s.t. } \tau(D) = L + S, \tag{4}$$
where $\tau(\cdot)$ is a series of per-frame transformations that align all the observed frames; it is straightforward to note that (4) is an extension of (2).
The nonlinearity in the transformations $\tau$ of (4) is handled via a linearization using the Jacobian. The main drawback of t-GRASTA is that, aside from the required low-rank subspace initialization, the initial transformation $\tau$ is estimated using a similarity transformation obtained from three points manually chosen in each of the $p$ initial frames. This initialization stage severely constrains its application in automatic processes and reduces its applicability in panning scenarios, as the feature points in the initial frames may not be present in subsequent frames.
2.2.2. incPCP-TI
The incPCP-TI method [23] considers the optimization problem
$$\arg\min_{L^*,S,\mathcal{T}}\ \frac{1}{2}\|D - \mathcal{T}(L^*) - S\|_2^2 + \lambda\|S\|_1, \quad \text{s.t. } \operatorname{rank}(L^*) \le r, \tag{5}$$
where $D$ is the observed video sequence, which suffers from jitter; $L^*$ is the properly aligned low-rank representation; and $\mathcal{T} = \{\mathcal{T}_k\}$ is a set of transformations that compensate translational and rotational jitter; that is,
$$D = \mathcal{T}(D^*) = H * R(D^*, \alpha), \tag{6}$$
where $D^*$ represents the unobserved jitter-free video sequence, $H = \{h_k\}$ is a set of filters that independently models the translation of each frame, $*$ represents convolution, and $R(D^*, \alpha)$ is a set of independent rotations applied to each frame with angles $\alpha = \{\alpha_k\}$. It is interesting to note that $\mathcal{T} = \tau^{-1}$; that is, the transformation used in (5) can be understood as the inverse of the transformation used in RASL or t-GRASTA (4).
In [14, 15], a computationally efficient and fully incremental algorithm, based on rank-1 updates for thin SVD [34–37] (see also Section 2.3), was proposed to solve (3); in [23], it was shown that, since (5) is based on (3), such an incremental solution can also be used: letting $d_k$, $k \in \{1, 2, \ldots, n\}$, represent each frame of the observed video $D$, and using similar relationships for $s_k$ and $l_k^*$ w.r.t. $S$ and $L^*$, respectively, the solution of
$$\arg\min_{L^*,S,H,\alpha}\ \frac{1}{2}\sum_k \left\| h_k * R(l_k^*, \alpha_k) + s_k - d_k \right\|_F^2 + \lambda\|S\|_1 + \gamma\sum_k\|h_k\|_1, \quad \text{s.t. } \operatorname{rank}(L^*) \le r, \tag{7}$$
can indeed be efficiently computed in an incremental fashion (see [23], Section 3.3 for details).
2.3. Incremental and Rank-1 Modifications for Thin SVD
Given a matrix $D \in \mathbb{R}^{m \times l}$ with thin SVD $D = U_0 \Sigma_0 V_0^T$, where $\Sigma_0 \in \mathbb{R}^{r \times r}$, and column vectors $a$ and $b$ (with $m$ and $l$ elements, respectively), note that
$$D + ab^T = \begin{bmatrix} U_0 & a \end{bmatrix} \begin{bmatrix} \Sigma_0 & 0 \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} V_0 & b \end{bmatrix}^T, \tag{8}$$
where $0$ is a zero column vector of the appropriate size. Based on [35, 36], as well as on [37], we will briefly describe an incremental (thin) SVD and rank-1 modifications (update, downdate, and replace) for thin SVD.
The generic operation consisting of the Gram–Schmidt orthonormalization of $a$ and $b$ w.r.t. $U_0$ and $V_0$, i.e., $x = U_0^T a$, $z_x = a - U_0 x$, $\rho_x = \|z_x\|_2$, $p = z_x/\rho_x$ and $y = V_0^T b$, $z_y = b - V_0 y$, $\rho_y = \|z_y\|_2$, $q = z_y/\rho_y$, is used as a first step for all the cases described below.
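In NumPy, this common first step can be sketched as follows (an illustrative helper, not library code; variable names mirror the text):

```python
import numpy as np

def gs_step(U0, V0, a, b):
    """Gram-Schmidt step: project a onto range(U0) and b onto range(V0),
    and normalize the orthogonal residuals."""
    x = U0.T @ a
    z_x = a - U0 @ x
    rho_x = np.linalg.norm(z_x)
    p = z_x / rho_x          # unit residual orthogonal to range(U0)
    y = V0.T @ b
    z_y = b - V0 @ y
    rho_y = np.linalg.norm(z_y)
    q = z_y / rho_y          # unit residual orthogonal to range(V0)
    return x, rho_x, p, y, rho_y, q
```

By construction, $a = U_0 x + \rho_x p$ with $U_0^T p = 0$ and $\|p\|_2 = 1$ (and analogously for $b$), which is what the rank-1 formulas of the following subsections rely on.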
2.3.1. Incremental or Update Thin SVD
Given $d \in \mathbb{R}^{m \times 1}$, we want to compute the thin SVD $[D\ d] = U_1 \Sigma_1 V_1^T$, with (i) $\Sigma_1 \in \mathbb{R}^{(r+1) \times (r+1)}$ or (ii) $\Sigma_1 \in \mathbb{R}^{r \times r}$. In this case, we note that $[D\ 0] = U_0 \Sigma_0 \begin{bmatrix} V_0 \\ 0^T \end{bmatrix}^T$ and that $[D\ d] = [D\ 0] + d e^T$, where $e$ is a unit vector (with $l+1$ elements in this case); then, (8) is equivalent to (9) and (10), where $\hat{\Sigma} \in \mathbb{R}^{(r+1) \times (r+1)}$:
$$[D\ 0] + d e^T = \begin{bmatrix} U_0 & p \end{bmatrix} \cdot G \hat{\Sigma} H^T \cdot \begin{bmatrix} V_0 & 0 \\ 0^T & 1 \end{bmatrix}^T, \tag{9}$$
$$G \hat{\Sigma} H^T = \mathrm{SVD}\!\left(\begin{bmatrix} \Sigma_0 & x \\ 0^T & \rho_x \end{bmatrix}\right). \tag{10}$$
Using (11), we get $\mathrm{SVD}([D\ d])$ with (i) $\Sigma_1 \in \mathbb{R}^{(r+1)\times(r+1)}$; similarly, using (12), we get $\mathrm{SVD}([D\ d])$ with (ii) $\Sigma_1 \in \mathbb{R}^{r \times r}$ (Matlab notation is used to indicate array slicing operations):
$$U_1 = \begin{bmatrix} U_0 & p \end{bmatrix} \cdot G, \quad \Sigma_1 = \hat{\Sigma}, \quad V_1 = \begin{bmatrix} V_0 & 0 \\ 0^T & 1 \end{bmatrix} \cdot H, \tag{11}$$
$$U_1 = U_0 \cdot G(1\!:\!r, 1\!:\!r) + p \cdot G(r\!+\!1, 1\!:\!r), \quad \Sigma_1 = \hat{\Sigma}(1\!:\!r, 1\!:\!r), \quad V_1 = \begin{bmatrix} V_0 \cdot H(1\!:\!r, 1\!:\!r) \\ H(r\!+\!1, 1\!:\!r) \end{bmatrix}. \tag{12}$$
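Case (i) can be sketched directly from (9)–(11): form the small $(r+1) \times (r+1)$ core matrix of (10), take its SVD, and rotate the extended singular vector bases. An illustrative NumPy sketch (not the authors' incSVD implementation):

```python
import numpy as np

def incsvd(U0, s0, V0, d):
    """Append column d to the thin SVD U0 @ diag(s0) @ V0.T, keeping all
    r+1 singular values (case (i) of equations (9)-(11))."""
    x = U0.T @ d
    z = d - U0 @ x
    rho = np.linalg.norm(z)
    p = z / rho
    r = s0.size
    # Small (r+1) x (r+1) core matrix of equation (10).
    K = np.zeros((r + 1, r + 1))
    K[:r, :r] = np.diag(s0)
    K[:r, r] = x
    K[r, r] = rho
    G, s1, Ht = np.linalg.svd(K)
    U1 = np.hstack([U0, p[:, None]]) @ G
    # Extend V0 with a zero row and a unit column, then rotate by H.
    V0e = np.zeros((V0.shape[0] + 1, r + 1))
    V0e[:-1, :r] = V0
    V0e[-1, r] = 1.0
    V1 = V0e @ Ht.T
    return U1, s1, V1
```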
2.3.2. Downdate Thin SVD
Given $[D\ d] = U_0 \Sigma_0 V_0^T$, with $\Sigma_0 \in \mathbb{R}^{r \times r}$, we want to compute the thin SVD $D = U_1 \Sigma_1 V_1^T$ with $r$ singular values. Noting that $[D\ 0] = [D\ d] + (-d)e^T$, the rank-1 modification (8) is equivalent to
$$[D\ d] + (-d)e^T = \begin{bmatrix} U_0 & 0 \end{bmatrix} \cdot G \hat{\Sigma} H^T \cdot \begin{bmatrix} V_0 & q \end{bmatrix}^T, \quad G \hat{\Sigma} H^T = \mathrm{SVD}\!\left(\begin{bmatrix} \Sigma_0 - \Sigma_0 y y^T & -\rho_y \cdot \Sigma_0 y \\ 0^T & 0 \end{bmatrix}\right), \tag{13}$$
from which we can compute the thin SVD of $D$ via the following equation:
$$U_1 = U_0 \cdot G(1\!:\!r, 1\!:\!r), \quad \Sigma_1 = \hat{\Sigma}(1\!:\!r, 1\!:\!r), \quad V_1 = V_0 \cdot H(1\!:\!r, 1\!:\!r) + q \cdot H(r\!+\!1, 1\!:\!r). \tag{14}$$
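The downdate can be sketched from (13) and (14); since $d$ lies in the span of $U_0$, the left residual vanishes ($\rho_x = 0$, so $x = -\Sigma_0 y$ and the $p$ vector drops out). An illustrative NumPy sketch (not the authors' downSVD implementation) that zeroes out the last column:

```python
import numpy as np

def downsvd(U0, s0, V0):
    """Downdate: from the thin SVD of [D d], compute the thin SVD of [D 0]
    (last column zeroed), following equations (13) and (14)."""
    r = s0.size
    l = V0.shape[0]
    e = np.zeros(l)
    e[-1] = 1.0
    y = V0.T @ e                       # equals the last row of V0
    z_y = e - V0 @ y
    rho_y = np.linalg.norm(z_y)
    q = z_y / rho_y
    # Core matrix of equation (13); its last row is zero because rho_x = 0.
    K = np.zeros((r + 1, r + 1))
    K[:r, :r] = np.diag(s0) - np.outer(s0 * y, y)
    K[:r, r] = -rho_y * (s0 * y)
    G, sh, Ht = np.linalg.svd(K)
    H = Ht.T
    U1 = U0 @ G[:r, :r]
    s1 = sh[:r]
    V1 = V0 @ H[:r, :r] + np.outer(q, H[r, :r])
    return U1, s1, V1
```

The sketch assumes $r$ is strictly smaller than the number of columns, so that the right residual $q$ is well defined.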
2.3.3. Replace Thin SVD
Given $[D\ d] = U_0 \Sigma_0 V_0^T$, with $\Sigma_0 \in \mathbb{R}^{r \times r}$, we want to compute the thin SVD $[D\ \hat{d}] = U_1 \Sigma_1 V_1^T$ with $r$ singular values. This case can be understood as a mixture of the previous cases and can be easily derived by noticing that $[D\ \hat{d}] = [D\ d] + (\hat{d} - d)e^T$.
Finally, we point out that the computational complexity of any of the above procedures ([35], Section 3 and [37], Section 4) is upper bounded by $O(10 \cdot m \cdot r) + O(r^3) + O(3 \cdot r \cdot l)$. If $r \ll m, l$ holds, then the complexity is dominated by $O(10 \cdot m \cdot r)$.
3. Methods
3.1. Proposed incPCP-PTI Method
The proposed algorithm (named incPCP-PTI) is a modification of the previously proposed incPCP-TI [23] that enables it to handle panning and camera motion. It was briefly presented in [28] and is more thoroughly explained and evaluated in this work. The method continuously estimates the alignment transformation $\mathcal{T}$ such that $\mathcal{T}(l_k^*) = d_k$, i.e., the transformation that aligns the previous low-rank representation with the current reference frame of the camera. Thus, incPCP-PTI effectively uses $\mathcal{T}(l_k^*)$ as a local estimate of a composite panoramic background image. After applying such a transformation to $L^*$, the PCP problem can be solved in the reference frame of $d_k$. After this initial alignment, it is considered that only minor jitter remains in the image, and so a procedure similar to incPCP-TI is utilized by estimating a transformation $\xi_k$ for the $k$-th frame. However, instead of solving the affinely constrained matrix rank minimization [38] as in the original incPCP-TI [14], the low-rank approximation problem is solved in the reference frame of $d_k$ by applying $\xi_k^{-1}$ to the residual $d_k - s_k$. The whole procedure is presented in Algorithm 1, which makes use of the incSVD, repSVD, and downSVD operators, corresponding to the thin SVD update, replacement, and downdate operators, respectively (Section 2.3).
Algorithm 1: IncPCP-PTI.
Input: observed video D, internal parameters for shrinkage, internal parameters for transformation estimation, number of innerLoops iL, background frames bl, m=k0
In line 3 of Algorithm 1, the latest low-rank frame $l_k$ is aligned to the current frame $d_k$. The transformation is estimated as the composition of a translation and a rotation. The resulting alignment transformation $T_k^L$ is then used to update the whole low-rank matrix representation $L$ to the current reference axes (lines 4 and 5 of Algorithm 1) in order to obtain $L^*$. After this initial alignment is performed, it is assumed that only minor misalignments due to jitter, modeled by $\xi_k$, remain (line 10 of Algorithm 1).
The ghosting suppression mentioned in line 16 is detailed in Section 3.3. The shrinkage in line 9 of Algorithm 1 can be performed either by soft thresholding or by projection onto the $\ell_1$ ball. Soft thresholding is performed with a simple element-wise shrinkage operator, $\mathrm{shrink}(x, \lambda) = \mathrm{sign}(x)\max(0, |x| - \lambda)$. Projection onto the $\ell_1$ ball is detailed in Section 3.2. For all our experiments, the latter was chosen.
3.2. Projection on the $\ell_1$ Ball
Although theoretical guidance is available for selecting a minimax optimal regularization parameter λ in (2) [39], practical problems do not fully satisfy the idealized assumptions, and thus λ often has to be heuristically tuned. This problem is also observed if (3) is used instead of (2).
To tackle this problem, Rodríguez and Wohlberg [40] introduced the alternative relaxation of (1) given by
$$\arg\min_{L,S}\ \frac{1}{2}\|L + S - D\|_F^2, \quad \text{s.t. } \|S\|_1 \le \mu,\ \operatorname{rank}(L) \le r, \tag{15}$$
which can also be incrementally solved via rank-1 updates for thin SVD (as is the case for incPCP and related algorithms [14, 15, 23]); however, (15) has the advantage that a simple heuristic can be derived for the adaptive selection of $\mu$ for each frame. Furthermore, $\mu$ can be spatially adapted in order to reduce ghosting effects. The algorithm they propose is very similar to incPCP, save for the shrinkage step, which is computed as $s_k = \mathrm{proj}_{\|\cdot\|_1}(d_k - l_k, \mu)$, where
$$\mathrm{proj}_{\|\cdot\|_1}(u, \mu) \triangleq \arg\min_x \frac{1}{2}\|x - u\|_2^2, \quad \text{s.t. } \|x\|_1 \le \mu. \tag{16}$$
Thus, for the shrinkage step, the solution is given by a projection onto the $\ell_1$ ball of radius $\mu$.
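For reference, (16) can be solved with a simple sort-based scheme: find the soft threshold $\theta$ such that the thresholded vector has $\ell_1$ norm exactly $\mu$. The sketch below is an $O(n \log n)$ variant in the spirit of the sort-based projection algorithms, not the faster algorithm of [44] used in the paper:

```python
import numpy as np

def proj_l1_ball(u, mu):
    """Euclidean projection of u onto the l1 ball of radius mu, i.e.
    argmin_x 0.5*||x - u||_2^2  s.t.  ||x||_1 <= mu  (equation (16))."""
    v = np.abs(u.ravel())
    if v.sum() <= mu:
        return u.copy()           # already feasible
    # Find the soft threshold theta with sum(max(v - theta, 0)) == mu.
    w = np.sort(v)[::-1]          # sorted magnitudes, descending
    cssv = np.cumsum(w)
    ks = np.arange(1, w.size + 1)
    rho = np.nonzero(w * ks > (cssv - mu))[0][-1]
    theta = (cssv[rho] - mu) / (rho + 1.0)
    x = np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)
    return x.reshape(u.shape)
```

For example, projecting $u = (3, 1)$ onto the ball of radius $\mu = 2$ yields $\theta = 1$ and the projection $(2, 0)$.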
While there are several well-known and efficient algorithms that solve (16), the studies [40–43] used the recently published algorithm of [44], which has better computational performance than either the algorithm in [41] or that in [42].
Furthermore, Rodríguez and Wohlberg [40] also proposed a simple scheme for adapting $\mu_k$ with every frame, given by
$$\mu_k = \alpha \cdot \|d_k - l_k\|_1, \tag{17}$$
where $\alpha$ is a value between 0.5 and 0.75.
3.3. Ghosting Suppression
Ghosting refers to foreground estimates that include phantoms or smeared replicas of actual moving objects. Rodríguez and Wohlberg [45] proposed a procedure for ghosting suppression in the incPCP algorithm which consists of using binary masks obtained from different frames in order to remove the ghosts from the low-rank component. In this approach, two sparse components at different time steps $n_1$ and $n_2$ are used to compute respective binary masks $m_k^{n_1}$ and $m_k^{n_2}$. These masks will include the moving objects as well as the ghosts. A new binary mask $b_k = (m_k^{n_1} \cap m_k^{n_2})^C$, i.e., the complement of the intersection of the binary masks obtained from the aforementioned two frames, will include, with high probability, all pixels of the background that are not occluded by a moving object. $b_k$ can then be used to generate a modified input frame $\hat{d}_k = d_k \odot b_k + l_k \odot (1 - b_k)$, where $\odot$ represents the Hadamard product, which is used to update the low-rank component. Additionally, if the $\ell_1$ ball projection described in Section 3.2 is used for the shrinkage step, $\mu_k$ can be spatially adapted in order to reduce ghosting [40]. Based on the difference between the current and previous sparse approximations, $z_k = s_k - s_{k-1}$, a binary mask $m_k$ can be computed, and the sparse component is then modified as
$$s_k = (1 - m_k) \odot \hat{s}_k + m_k \odot \tilde{s}_k, \tag{18}$$
where $\hat{s}_k = \mathrm{proj}_{\|\cdot\|_1}(d_k - l_k, \mu_k)$, $\tilde{s}_k = \mathrm{proj}_{\|\cdot\|_1}(m_k \odot \hat{s}_k, \mu_k^g)$, and $\mu_k^g = \beta \cdot \|m_k \odot \hat{s}_k\|_1$, where $\mathrm{proj}_{\|\cdot\|_1}(\cdot)$ is defined in (16) and $\beta$ is suggested to take values between 0.1 and 0.3.
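The mask-combination step of [45] can be sketched as follows (the binarization threshold and the helper name are illustrative assumptions of ours, not values from the cited work):

```python
import numpy as np

def suppress_ghosts(d_k, l_k, s_n1, s_n2, thresh=0.1):
    """Sketch of mask-based ghosting suppression: combine binary masks of
    the sparse component at two time steps and rebuild the input frame."""
    m_n1 = np.abs(s_n1) > thresh
    m_n2 = np.abs(s_n2) > thresh
    # Pixels not flagged in BOTH masks are background with high probability:
    # b_k is the complement of the mask intersection.
    b_k = ~(m_n1 & m_n2)
    # Modified input frame: observed pixels where background, low-rank elsewhere.
    d_hat = d_k * b_k + l_k * (~b_k)
    return d_hat, b_k
```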
4. Description of Datasets and Computational Experiments
For the evaluation of the proposed incPCP-PTI algorithm, four datasets were considered. The first consists of synthetic jitter and panning videos. The second consists of real panning videos taken from the Moseg dataset [29]. The third consists of videos from the recently published DAVIS dataset [30, 46]. The last one was obtained from the CDnet2014 dataset [31]. All datasets are detailed in this section. All tests were carried out using GPU-enabled Matlab code running on an Intel i7-2600K CPU (8 cores, 3.40 GHz, 8 MB cache, and 32 GB RAM) with a 12 GB NVIDIA Tesla K40C GPU card. To the best of our knowledge, no other incremental low-rank plus additive matrix video background modeling technique capable of handling panning has been reported in the literature. This situation puts some constraints on our evaluations, and for most of our tests, we make comparisons with batch methods. Specific details for each of the datasets are described in their corresponding sections. Furthermore, to test the stability of incPCP-PTI under jitter, we also use stab + incPCP-PTI, which consists of a preprocessing stage using a recent state-of-the-art video stabilization technique [47] followed by incPCP-PTI.
4.1. Synthetic Datasets
A dataset with synthetic panning and jitter was generated from the 3rd Tower video of the USC Neovision2 dataset [48], which consists of 900 frames of size 1920 × 1088 pixels at 25 fps. For this purpose, a subregion of 720 × 480 pixels was selected from each frame, and the centroid of the subregion was translated with each new frame in order to simulate an aerial panning scenario using the piecewise linear trajectory $u_n$ given by
$$u_n = \begin{cases} u_{n-1} + v \cdot (1, 1), & u_{n-1}^x < Q, \\ u_{n-1} + v \cdot (1, 0), & u_{n-1}^x \ge Q, \end{cases} \tag{19}$$
where $u_0$ is the initial point (in this case, chosen as $(150, 688)$ pixels) and $Q$ is the point of slope change in the curve (chosen as $Q = 500$ pixels). This process is depicted in Figure 1. The panning velocity $v$ was taken as 1, 3, and 5 pixels per frame. A fourth case, in which the velocity changed randomly between 1 and 7 pixels per frame, was also considered. This dataset will be referred to as the SP (“synthetic panning”) dataset. Additionally, the same procedure was used to construct a dataset from jittered versions of the original frames. Each frame of the 3rd Tower video was jittered with random uniformly distributed translations in the $[-10, 10]$ pixel range and random uniformly distributed rotations in the $[-0.5, 0.5]$ degree range. The same trajectory and subregion selection of the SP dataset were used. This second synthetic dataset will be referred to as the SPJ (“synthetic panning and jitter”) dataset. For both the SP and SPJ datasets, the sparse approximation via the batch iALM method [10] using 20 outer iterations was used as a proxy ground truth by selecting the same regions that were selected from the original frames. The iALM was chosen as the proxy ground truth since, as reported in [8] (Tables 6 and 7), its segmentation is considered reliable.
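For reproducibility, the trajectory of (19) can be sketched as follows (an illustrative helper; the name is ours):

```python
import numpy as np

def panning_trajectory(u0, v, Q, n_frames):
    """Piecewise linear panning trajectory of equation (19): diagonal motion
    until the x-coordinate reaches Q, then purely horizontal motion."""
    u = np.zeros((n_frames, 2))
    u[0] = u0
    for n in range(1, n_frames):
        if u[n - 1, 0] < Q:
            u[n] = u[n - 1] + v * np.array([1.0, 1.0])   # diagonal segment
        else:
            u[n] = u[n - 1] + v * np.array([1.0, 0.0])   # horizontal segment
    return u
```

With $u_0 = (150, 688)$, $Q = 500$, and $v = 5$, the slope change occurs once the x-coordinate reaches 500, after which the y-coordinate stays constant.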
For these synthetic datasets, the performance of the proposed algorithm was measured in terms of the normalized $\ell_1$ distance
$$M(s_k) = \frac{\|s_k^{gt} - s_k\|_1}{N}, \tag{20}$$
where $s_k^{gt}$ and $s_k$ are the ground-truth and computed sparse components for frame $k$, respectively, and $N$ is the number of pixels in the frame. For images normalized between 0 and 1, the value of $M(s_k)$ varies from 0 (perfect match with the ground truth) to 1.
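This metric amounts to a single line of code; a sketch (the helper name is ours):

```python
import numpy as np

def m_distance(s_gt, s_k):
    """Normalized l1 distance of equation (20) between the ground-truth and
    computed sparse components (frames assumed normalized to [0, 1])."""
    return np.abs(s_gt - s_k).sum() / s_gt.size
```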
Construction of the synthetic panning and jitter dataset. The selected region (blue rectangles) was of 720 × 480 pixel, and the centroid of the region was translated at a velocity v (red vector) along a piecewise linear trajectory (green). This figure is a slightly modified version of Figure 1 in [28].
For the SP dataset, only the incPCP-PTI method was evaluated. For the SPJ dataset, we evaluated incPCP-PTI and, as mentioned in Section 4, stab + incPCP-PTI, which consists of a preprocessing stage using a recent state-of-the-art video stabilization technique [47] followed by incPCP-PTI. This comparison was intended to determine whether jitter is handled correctly by incPCP-PTI [23] alone. Additionally, we include a baseline comparison with the sparse components obtained with incPCP on the full Neovision2 Tower video and then segmented using the same procedure described in Section 4.1.
4.2. Moseg Dataset
We used 15 video sequences of the Freiburg-Berkeley Motion Segmentation (Moseg) dataset [29, 49, 50], selecting sequences that contain panning or camera movement. For all incPCP-PTI variants, three inner loops and a window size of 30 background frames were used. For the $\ell_1$ ball projection, $\alpha$ was set to 0.75, and the ghosting suppression gap $n_2 - n_1$ was set to 20 frames. $\alpha$ controls the adaptation of $\mu_k$ (17), with lower $\alpha$ forcing a sparser solution, whereas the difference $n_2 - n_1$ controls the number of frames used for ghosting suppression. The binary mask obtained with incPCP-PTI was postprocessed by computing the convex hull of the connected objects [51]. It is noted that this postprocessing was not applied to the other methods, as it tended to reduce their performance.
For comparison, the LTVA and DECOLOR algorithms were used as references. The LTVA code was obtained from [52], and the default spatial subsampling parameter of 8 was used for the point tracking. The tracking component of the algorithm runs in single-threaded C, whereas the dense clustering component runs in CUDA C 5.5. The single-threaded Matlab DECOLOR code was obtained from [53], and a tolerance of 1e-4 was used. All other parameters were left at their default values. For reporting the results, we further subdivided the selected videos into two categories: short panning (comprising nine cars sequences and one people sequence) and long panning (comprising five marple videos). In the former category, the final and first frames in the panning motion still share some common area, while in the latter, these two frames do not. This subdivision was necessary because DECOLOR performs a preregistration and was not able to work properly on the long panning sequences.
For all videos, the binary masks of the methods were compared to the ground truth provided with the dataset in order to obtain an F measure, defined as
$$F = \frac{2 \cdot P \cdot R}{P + R}, \quad P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \tag{21}$$
where $P$ and $R$ stand for precision and recall, respectively, and $TP$, $FP$, and $FN$ are the numbers of true positive, false positive, and false negative pixels, respectively. It is noted that only approximately one in ten frames possessed ground-truth information.
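A pixel-wise sketch of (21) (an illustrative helper, not the dataset's official evaluation script):

```python
import numpy as np

def f_measure(mask, gt):
    """Pixel-wise F measure of equation (21) between a computed binary mask
    and a ground-truth binary mask."""
    tp = np.sum(mask & gt)        # true positives
    fp = np.sum(mask & ~gt)       # false positives
    fn = np.sum(~mask & gt)       # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```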
4.3. DAVIS Dataset
We used 10 video sequences of the DAVIS dataset [30, 46], selecting sequences that contain panning or camera movement and at least 50 frames. The algorithms were configured with the same parameters specified in Section 4.2, and the same F measure comparison was performed. However, as mentioned before, DECOLOR is not able to run on all sequences, and only the F measures of the sequences where its prealignment phase ran correctly are reported. Additionally, in this dataset, LTVA oversegmented some objects (a problem not present in the Moseg sequences), and accordingly, two methods to obtain the final binary mask of moving objects are considered. The first, reported as LTVA-aPP (automatic postprocessing), automatically selects the mask by designating the object label with the largest area as the background and considering all other labels as foreground objects. The second, reported as LTVA-mPP (manual postprocessing), entails the manual selection of the labels corresponding to moving objects. Although the latter is more accurate, for an automatic pipeline, LTVA-aPP would be the more realistic option.
4.4. Change Detection 2014 (CDnet2014) Dataset
Two videos from the PTZ category of the CDnet2014 dataset [31] were chosen:
continuousPan (CP): 704 × 480 pixel, 1700-frame color video containing the continuous panning of a PTZ camera. The video is almost jitter free.
intermittentPan (IP): 560 × 368 pixel, 3500-frame color video of a PTZ camera that alternates between two fixed positions. The video contains intermittent panning and additional real jitter.
As mentioned in the previous section, DECOLOR is unable to work on this type of long panning sequence; accordingly, only incPCP-PTI and LTVA are compared. The F measure was evaluated only on frames that contained ground-truth motion. LTVA presented the same segmentation problems as in the DAVIS dataset; however, in this case, the masks did not provide a segmentation good enough for manual postprocessing, and thus, only the results of LTVA-aPP are presented. All the parameters for incPCP-PTI and LTVA are the same as in the previous sections. We also included a comparison with the edge based foreground background segmentation and interior classification (EFIC) method [54] and its color version, C-EFIC [55]. These methods were chosen because they obtained the second and third best F measures in the PTZ category of the CDnet2014 dataset results [56]. The top performer in the category was not selected, as it corresponds to a supervised convolutional neural network that needs proper training before classification. Unfortunately, no open code is available for EFIC and C-EFIC, and we only had access to the segmented binary masks submitted to the challenge [57, 58]. Due to this limitation, only a referential F measure could be computed. The absence of open code also makes it difficult to ascertain whether EFIC and C-EFIC can be implemented in a fully incremental way and to compare them in terms of computational performance. Additionally, as EFIC and C-EFIC already include a postprocessing step on the binary mask, we did not apply the convex hull postprocessing of the connected objects [51] that was used for incPCP-PTI.
5.1.1. SP Dataset
The distance $M(s_k)$ (20) computed for each frame of the different videos of the SP dataset is shown in Figure 2. Table 1 shows the average distance $\overline{M(s_k)}$ and the average time for processing one frame, along with the baseline metric described in Section 4.1. It can be noticed that the distance tends to increase as the panning velocity increases but remains relatively small in all cases (below 0.01).
Value of the distance $M(s_k)$ between binary mask and ground-truth frames for incPCP-PTI on the SP dataset (Section 4.1). This figure is a slightly modified version of Figure 2 in [28].
Value of the average distance $\overline{M(s_k)}$ between binary mask and ground-truth frames for incPCP-PTI and baseline incPCP on the SP dataset (Section 4.1). The computational time per frame for incPCP-PTI is also shown.

Dataset | incPCP-PTI $\overline{M(s_k)}$ | incPCP-PTI average time per frame (s) | Baseline incPCP $\overline{M(s_k)}$
$v = 1$ | 0.0028 | 2.10 | 0.0024
$v = 3$ | 0.0053 | 2.10 | 0.0047
$v = 5$ | 0.0064 | 2.11 | 0.0054
Changing $v$ | 0.0055 | 2.09 | 0.0029
5.1.2. SPJ Dataset
Representative frames of the SPJ video with changing velocity, along with the sparse components segmented with incPCP-PTI and stab + incPCP-PTI, are shown in Figure 3. The distance $M(s_k)$ computed for each frame of the different videos of the SPJ dataset is shown in Figures 4 and 5 for incPCP-PTI and stab + incPCP-PTI, respectively.
Representative frames of the video and the segmented sparse components for the SPJ dataset. Frames (a) 100 and (b) 355. This figure is a slightly modified version of Figure 3 in [28].
Value of distance Msk between binary mask and ground truth frames for incPCP-PTI on the SPJ dataset (Figure 5). This figure is a slightly modified version of Figure 4 in [28].
Value of distance Msk between binary mask and ground truth frames for stab + incPCP-PTI on the SPJ dataset (Figure 4 and Section 4.1). This figure is a slightly modified version of Figure 5 of [28].
5.2. Moseg Dataset
Representative frames of the video and the segmented sparse components for the cars8, people1, and marple13 videos of the Moseg dataset are shown in Figure 6. Tables 2 and 3 show the F measure obtained in the short and long panning subsets, respectively. As noted in the previous sections, DECOLOR did not properly work on the long panning sequences, and so it is excluded from the comparisons in this subset.
Selected frames from three sequences of the Moseg dataset (Section 4.2) along with the segmented sparse components for incPCP-PTI, DECOLOR, and LTVA. The sequences where DECOLOR did not work properly are left blank.
Average F measure performance of incPCP-PTI, DECOLOR, and LTVA on the short panning Moseg sequences.
Sequence | IncPCP-PTI | DECOLOR | LTVA
cars1 | 0.78 | 0.50 | 0.90
cars2 | 0.42 | 0.66 | 0.75
cars3 | 0.76 | 0.83 | 0.93
cars4 | 0.77 | 0.81 | 0.92
cars5 | 0.73 | 0.82 | 0.86
cars6 | 0.73 | 0.75 | 0.94
cars7 | 0.72 | 0.84 | 0.92
cars8 | 0.74 | 0.47 | 0.85
cars9 | 0.50 | 0.41 | 0.49
people1 | 0.70 | 0.94 | 0.75
Average | 0.68 | 0.70 | 0.83
Average F measure performance of incPCP-PTI, DECOLOR, and LTVA on the long panning Moseg sequences.
Sequence | IncPCP-PTI | DECOLOR | LTVA
marple1 | 0.29 | — | 0.87
marple3 | 0.12 | — | 0.29
marple7 | 0.22 | — | 0.37
marple10 | 0.25 | — | 0.11
marple13 | 0.55 | — | 0.82
Average | 0.29 | — | 0.49
5.3. DAVIS Dataset
Representative frames of the video and the segmented sparse components for the tennis, horsejump-high, swing, and dog-gooses videos of the DAVIS dataset are shown in Figure 7. Table 4 shows the F measure obtained on the DAVIS sequences. The sequences in which DECOLOR did not run properly are left unreported. As described in Section 4.3, LTVA-aPP and LTVA-mPP refer to automatic and manual postprocessing of the binary masks generated by the LTVA method, respectively.
Selected frames from three sequences of the DAVIS dataset (Section 4.3) along with the segmented sparse components for incPCP-PTI, DECOLOR, and LTVA (aPP and mPP). The sequences in which DECOLOR did not work properly are left blank.
Average F measure performance of incPCP-PTI, DECOLOR, and LTVA on the DAVIS dataset sequences.
Sequence         incPCP-PTI   DECOLOR   LTVA-aPP   LTVA-mPP
tennis           0.60         —         0.15       0.73
soapbox          0.53         —         0.48       0.79
bmx-bumps        0.31         —         0.30       0.46
horsejump-high   0.46         0.50      0.22       0.82
dance-jump       0.19         0.51      0.30       0.30
swing            0.51         —         0.37       0.88
dog-gooses       0.55         —         0.02       0.02
skate-park       0.27         0.17      0.15       0.59
bmx-trees        0.35         —         0.46       0.46
scooter-gray     0.29         —         0.40       0.85
Average          0.40         0.39      0.28       0.59
5.4. CDnet2014 Dataset
Representative frames of the video and the segmented sparse components for the CP and IP videos are shown in Figures 8 and 9, respectively. Figure 10 shows the F measure (with no postprocessing) for incPCP-PTI (grayscale and color versions), EFIC, and C-EFIC on the frames of the CP video, while Figure 11 shows the same metric for all methods on the frames of the IP video. Tables 5 and 6 show the average F measure and computational time obtained over all frames. For stab + incPCP-PTI, the computational time is reported as (total stabilization time) + (incPCP-PTI time per frame). For LTVA, the total time of the batch execution was divided by the total number of frames in order to obtain an average time per frame.
Frames 988 (a) and 1008 (b) of the video and the segmented sparse components for the CP video obtained with both incPCP-PTI and LTVA. It is observed that LTVA is unable to separate the moving objects in any of the frames. This figure is a modified version of Figure 6 in [28].
Frames 1330 (a) and 1870 (b) of the video and the segmented sparse components for the IP video obtained with incPCP-PTI, stab + incPCP-PTI, and LTVA-aPP. It is observed that LTVA is unable to separate the moving objects in frame 1870. This figure is a modified version of Figure 7 in [28].
F measure per frame on the CP video for the grayscale-based methods (a) and the color methods (b). Results are shown only for the frames available in the dataset. This figure is a slightly modified version of Figure 8 in [28].
F measure per frame on the IP video for the grayscale-based methods (a) and the color methods (b). Results are shown only for the frames available in the dataset. This figure is a slightly modified version of Figure 9 in [28].
Average F measure and average time per frame for grayscale and color incPCP-PTI, LTVA-aPP, EFIC, and C-EFIC on the CP video.
Method                 Average F measure   Average time per frame (seconds)
Grayscale incPCP-PTI   0.50                2.10
Color incPCP-PTI       0.49                3.58
LTVA-aPP               0.07                3.32
EFIC                   0.42                —
C-EFIC                 0.46                —
Average F measure and average time per frame for grayscale and color incPCP-PTI, stab + incPCP-PTI, LTVA-aPP, EFIC, and C-EFIC on the IP video.
Method                        Average F measure   Average time per frame (seconds)
Grayscale incPCP-PTI          0.69                1.41
Color incPCP-PTI              0.70                2.31
Grayscale stab + incPCP-PTI   0.63                (89) + (1.41)
Color stab + incPCP-PTI       0.64                (89) + (2.31)
LTVA-aPP                      0.27                9.01
EFIC                          0.68                —
C-EFIC                        0.64                —
6. Discussion
The results of Section 5.1.1 show that, as expected, the distance Msk increased (i.e., the sparse approximation worsened) as the panning velocity increased. Even so, incPCP-PTI is able to maintain an adequate performance even when the panning velocity changes. Also as expected, adding jitter to the panning scenario (Section 5.1.2) increased the distance Msk for all panning velocities with respect to their jitter-free counterparts. The overall stability of the estimated distance also decreased, as evidenced by the higher variability of the curves in Figure 4. The inclusion of a video stabilization preprocessing step (stab + incPCP-PTI) seemed to reduce such variability, as evidenced in Figure 5. Nevertheless, even with jitter, standalone incPCP-PTI maintained a low average Msk distance, and its performance is comparable with that of stab + incPCP-PTI, as can be observed in Table 7. Furthermore, although incPCP-PTI obtained higher distances than baseline incPCP, the values tend to be close to each other, and for all tested velocities, incPCP-PTI maintained a very small distance from the ground truth (below 0.01 in all cases).
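The Msk values below 0.01 are consistent with a distance normalized by the number of pixels. Assuming Msk is the fraction of pixels on which the estimated binary mask and the ground truth disagree (a normalized Hamming distance; the exact definition is given in Section 4.1 and may differ), a minimal sketch:

```python
import numpy as np

def mask_distance(mask, gt):
    """Normalized Hamming distance between two binary masks:
    fraction of pixels on which they disagree (0 = identical)."""
    mask = np.asarray(mask, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    return np.logical_xor(mask, gt).mean()
```

Under this reading, a value of 0.0057 (Table 7, v = 1) means roughly 0.6% of pixels differ from the proxy ground truth.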
Value of average distance Msk¯ for incPCP-PTI, stab + incPCP-PTI, and baseline incPCP (Section 4.1) on the SPJ dataset.
Dataset      incPCP-PTI   stab + incPCP-PTI   Baseline incPCP
v = 1        0.0057       0.0038              0.0015
v = 3        0.0064       0.0064              0.0021
v = 5        0.0071       0.0079              0.0022
Changing v   0.0066       0.0065              0.0024
The results of Table 2 (related to the Moseg dataset, Section 4.2) suggest that incPCP-PTI performs comparably to DECOLOR, even though the latter is a batch method while our proposed method is incremental. LTVA attains a substantially higher average F measure on this particular dataset, although it works in a batch fashion. The same trend is observed in Table 3. As mentioned above, DECOLOR has problems on these sequences because its prealignment phase fails to find a suitable unique reference frame. The low performance of incPCP-PTI on some of the Moseg sequences might stem from the small number of video frames, which makes the initial low-rank estimation of PCP less precise.
The results of Sections 5.3 and 5.4 (related to the DAVIS and CDnet datasets, respectively, Sections 4.3 and 4.4) suggest that incPCP-PTI can perform adequately on longer real panning videos with more complex scenarios. On the DAVIS dataset, the highest performance is obtained by LTVA-mPP. Nevertheless, this is a batch method, and its final binary segmentation required human interaction. In contrast, incPCP-PTI shows an average performance superior to LTVA-aPP and comparable to DECOLOR, although the latter did not run on all tested sequences.
The representative frames of Figures 8 and 9 exhibit different positions of the PTZ camera and thus evidence the ability of incPCP-PTI to handle the panning movements in the scene. IncPCP-PTI presents a relatively good F measure for both videos; the metric tends to be higher for the color version of the algorithm. In Figure 11, it can be observed that the F measure drops at specific intervals of the video that coincide with sudden movements of the PTZ camera. After these sudden movements, however, the algorithm is able to restabilize and perform correctly. In contrast, LTVA fails to track moving objects in a large number of frames. Its lower performance on the CDnet dataset might be caused by the faster-moving objects and panning movements, which complicate the optical flow tracking used by LTVA. Additionally, the higher complexity of the objects and panning movements causes its clustering stage to produce a large number of false positives.
For both the CP and IP videos (described in Section 4.4), incPCP-PTI showed a higher F measure than stab + incPCP-PTI; a possible explanation is the misalignment between the ground truth reference frame and the reference frame of the stabilization algorithm. In any case, visual inspection of the frames and the results on the SPJ dataset suggest that incPCP-PTI is able to handle the presence of jitter in a panning scenario and does not need a stabilization preprocessing step. Compared to EFIC, incPCP-PTI showed a superior F measure on the CP video, even without the postprocessing stage; on the IP video, its F measure is comparable or superior to that of EFIC. As mentioned, the absence of open code for EFIC makes it difficult to carry out a more thorough comparison and to draw further conclusions. Compared to LTVA, incPCP-PTI shows a much higher F measure in both cases. These results suggest that incPCP-PTI might be more adequate than LTVA for faster panning and more complex scenarios. It is also noteworthy that incPCP-PTI attains these results in an incremental manner and with comparable or lower average computational time per frame, despite the fact that the LTVA public code is implemented in C and CUDA.
7. Conclusion
We have presented a novel algorithm, incPCP-PTI, and have shown with artificial datasets and real videos from the Moseg, DAVIS, and CDnet2014 datasets that it can adequately detect moving objects in scenarios with simultaneous panning and jitter. To the best of our knowledge, this is the first incremental PCP-like method able to handle panning conditions. On the synthetic datasets, the algorithm maintained a low distance with respect to a proxy ground truth, and on the real videos, it maintained an adequate F measure and was able to stabilize after sudden panning of the camera. Additionally, the comparisons with stab + incPCP-PTI (independent video stabilization followed by incPCP-PTI) suggest that a stabilization stage preceding incPCP-PTI is not needed, as the algorithm is able to handle the jitter present in the camera motion. The evaluations on real videos show that incPCP-PTI can be comparable or superior, depending on the case, to state-of-the-art batch PCP (e.g., DECOLOR) and non-PCP-like (e.g., LTVA, EFIC) foreground separation methods.
Further improvements of the algorithm might focus on (i) making it able to handle other types of distortion, such as perspective changes or zooming in/out of the camera, and (ii) reducing its time per frame in order to make it more readily applicable to high-frame-rate real-time applications.
Data Availability
The video data used to support the findings of this study are included within the article. The datasets used in the article are referenced and can be found at publicly available sites, namely, (1) synthetic datasets (Section 4.1) were constructed from: USC Neovision2 Project, https://goo.gl/5Si2Nm; (2) Moseg dataset (Section 4.2): “Freiburg-Berkeley motion segmentation dataset,” https://goo.gl/bzEvvi; (3) DAVIS dataset (Section 4.3): “DAVIS: Densely Annotated VIdeo Segmentation,” https://goo.gl/G8Hb7o; (4) CDnet2014 dataset (Section 4.4): http://www.changedetection.net/. Additionally, our implemented method can be found at http://goo.gl/4jEvck.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was supported by the “Programa Nacional de Innovación para la Competitividad y Productividad” (Innóvate Perú) Program, 169-Fondecyt-2015.
References
[1] Xu Y., Dong J., Zhang B., Xu D., “Background modeling methods in video analysis: a review and comparative evaluation.”
[2] Calderara S., Cucchiara R., Prati A., “A distributed outdoor video surveillance system for detection of abnormal people trajectories,” Proceedings of the ACM/IEEE International Conference on Distributed Smart Cameras, September 2007, Vienna, Austria, 364–371.
[3] Bouwmans T., Porikli F., Höferlin B., Vacavant A. (Eds.), Background Modeling and Foreground Detection for Video Surveillance.
[4] Zivkovic Z., “Improved adaptive Gaussian mixture model for background subtraction,” Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 2, August 2004, Cambridge, UK, 28–31.
[5] Elgammal A., Harwood D., Davis L., “Non-parametric model for background subtraction.”
[6] Maddalena L., Petrosino A., “A self-organizing approach to background subtraction for visual surveillance applications.”
[7] Shah M., Deng J. D., Woodford B. J., “Video background modeling: recent approaches, issues and our proposed techniques.”
[8] Bouwmans T., Zahzah E. H., “Robust PCA via principal component pursuit: a review for a comparative evaluation in video surveillance.”
[9] Wright J., Ganesh A., Rao S., Peng Y., Ma Y., “Robust principal component analysis: exact recovery of corrupted low-rank matrices via convex optimization,” Proceedings of Advances in NIPS, December 2009, Vancouver, BC, Canada, 2080–2088.
[10] Lin Z., Chen M., Ma Y., “The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices,” 2010, http://arxiv.org/abs/1009.5055.
[11] Liu G., Lin Z., Yu Y., “Robust subspace segmentation by low-rank representation,” Proceedings of the 27th International Conference on Machine Learning (ICML 2010), June 2010, Haifa, Israel, 663–670.
[12] Bouwmans T., Sobral A., Javed S., Jung S. K., Zahzah E.-H., “Decomposition into low-rank plus additive matrices for background/foreground separation: a review for a comparative evaluation with a large-scale dataset.”
[13] Rodríguez P., Wohlberg B., “Fast principal component pursuit via alternating minimization,” Proceedings of the IEEE International Conference on Image Processing (ICIP 2013), September 2013, Melbourne, VIC, Australia, 69–73.
[14] Rodríguez P., Wohlberg B., “Incremental principal component pursuit for video background modeling.”
[15] Rodríguez P., Wohlberg B., “A Matlab implementation of a fast incremental principal component pursuit algorithm for video background modeling,” Proceedings of the IEEE International Conference on Image Processing (ICIP 2014), October 2014, Paris, France, 3414–3416.
[16] He J., Balzano L., Szlam A., “Incremental gradient on the Grassmannian for online foreground and background separation in subsampled video,” Proceedings of IEEE CVPR, 2012, Providence, RI, USA, 1568–1575.
[17] Feng J., Xu H., Yan S., “Online robust PCA via stochastic optimization,” Proceedings of Advances in NIPS, December 2013, Lake Tahoe, NV, USA, 404–412.
[18] Srebro N., Rennie J., Jaakkola T., “Maximum-margin matrix factorization,” Proceedings of Advances in NIPS, December 2005, Vancouver, BC, Canada, MIT Press, 1329–1336.
[19] Rahmani M., Atia G., “High dimensional low rank plus sparse matrix decomposition.”
[20] Yazdi M., Bouwmans T., “New trends on moving object detection in video images captured by a moving camera: a survey.”
[21] Peng Y., Ganesh A., Wright J., Xu W., Ma Y., “RASL: robust alignment by sparse and low-rank decomposition for linearly correlated images.”
[22] He J., Zhang D., Balzano L., Tao T., “Iterative Grassmannian optimization for robust image alignment.”
[23] Rodríguez P., Wohlberg B., “Translational and rotational jitter invariant incremental principal component pursuit for video background modeling,” Proceedings of the IEEE International Conference on Image Processing (ICIP 2015), September 2015, Quebec City, QC, Canada, 537–541.
[24] Chen C., Li S., Qin H., Hao A., “Robust salient motion detection in non-stationary videos via novel integrated strategies of spatio-temporal coherency clues and low-rank analysis.”
[25] Zhou X., Yang C., Yu W., “Moving object detection by detecting contiguous outliers in the low-rank representation.”
[26] Ebadi S., Ones V., Izquierdo E., “Efficient background subtraction with low-rank and sparse matrix decomposition,” Proceedings of the IEEE International Conference on Image Processing (ICIP 2015), September 2015, Quebec City, QC, Canada, IEEE, 4863–4867.
[27] Gao C., Moore B. E., Nadakuditi R. R., “Augmented robust PCA for foreground-background separation on noisy, moving camera video,” 2017, http://arxiv.org/abs/1709.09328.
[28] Chau G., Rodríguez P., “Panning and jitter invariant incremental principal component pursuit for video background modeling,” Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), October 2017, Venice, Italy, 1844–1852.
[29] Ochs P., Malik J., Brox T., “Segmentation of moving objects by long term video analysis.”
[30] Pont-Tuset J., Caelles S., Perazzi F., “The 2018 DAVIS challenge on video object segmentation,” 2018, http://arxiv.org/abs/1803.00557.
[31] Wang Y., Jodoin P., Porikli F., Konrad J., Benezeth Y., Ishwar P., “CDnet 2014: an expanded change detection benchmark dataset,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2014), June 2014, Columbus, OH, USA, 387–394.
[32] Geman S., Geman D., “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images.”
[33] Odobez J.-M., Bouthemy P., “Robust multiresolution estimation of parametric motion models.”
[34] Bunch J., Nielsen C., “Updating the singular value decomposition.”
[35] Chahlaoui Y., Gallivan K., Van Dooren P., “Computational information retrieval.”
[36] Baker C., Gallivan K., Van Dooren P., “Low-rank incremental methods for computing dominant singular subspaces.”
[37] Brand M., “Fast low-rank modifications of the thin singular value decomposition.”
[38] Goldfarb D., Ma S., “Convergence of fixed-point continuation algorithms for matrix rank minimization.”
[39] Candès E., Li X., Ma Y., Wright J., “Robust principal component analysis?”
[40] Rodríguez P., Wohlberg B., “An incremental principal component pursuit algorithm via projections onto the ℓ1 ball,” Proceedings of the XXIV International Conference on Electronics, Electrical Engineering and Computing (INTERCON), August 2017, Cusco, Peru.
[41] Duchi J., Shalev-Shwartz S., Singer Y., Chandra T., “Efficient projections onto the ℓ1-ball for learning in high dimensions,” Proceedings of the 25th International Conference on Machine Learning (ICML), 2008, New York, NY, USA, 272–279.
[42] Condat L., “Fast projection onto the simplex and the ℓ1 ball.”
[43] Rodríguez P., “Accelerated gradient descent method for projections onto the ℓ1-ball,” Proceedings of the IEEE 13th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), June 2018, Zagorochoria, Greece.
[44] Rodríguez P., “An accelerated Newton’s method for projections onto the ℓ1-ball,” Proceedings of the 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), September 2017, Tokyo, Japan, 1–6.
[45] Rodríguez P., Wohlberg B., “Ghosting suppression for incremental principal component pursuit algorithms,” Proceedings of IEEE GlobalSIP, December 2016, Washington, DC, USA, 197–201.
[46] DAVIS: densely annotated video segmentation, https://goo.gl/G8Hb7o.
[47] Dong J., Liu H., “Video stabilization for strict real-time applications.”
[48] USC Neovision2 Project, http://goo.gl/5Si2Nm.
[49] Freiburg-Berkeley motion segmentation dataset, https://goo.gl/bzEvvi.
[50] Brox T., Malik J., “Object segmentation by long term analysis of point trajectories,” Proceedings of the European Conference on Computer Vision, September 2010, Crete, Greece, Springer, 282–295.
[51] Chassery J., Garbay C., “An iterative segmentation method based on a contextual color and shape criterion.”
[52] Object segmentation by long term analysis of point trajectories, https://goo.gl/VtdQbb.
[53] Detecting contiguous outliers in the low-rank representation, https://goo.gl/xAp1Nc.
[54] Allebosch G., Deboeverie F., Veelaert P., Philips W., “EFIC: edge based foreground background segmentation and interior classification for dynamic camera viewpoints,” Proceedings of the 16th International Conference on Advanced Concepts for Intelligent Vision Systems (ACIVS), October 2015, Catania, Italy, Springer.
[55] Allebosch G., Van Hamme D., Deboeverie F., Veelaert P., Philips W., “C-EFIC: color and edge based foreground background segmentation with interior classification,” Proceedings of VISIGRAPP, March 2015, Berlin, Germany, Springer, 433–454.
[56] Results for CDnet 2014, http://goo.gl/SSBvFA.
[57] EFIC results, http://goo.gl/LQBeKR.
[58] C-EFIC results, http://goo.gl/ctqmNs.