Online Detection of Abnormal Events in Video Streams

,


Introduction
Visual surveillance is one of the major research areas in computer vision. In a crowd image analysis problem, the scientific challenges include abnormal event detection. For instance, Figure 1(a) illustrates a normal scene where the people are walking. In Figure 1(b), all the people are suddenly running in different directions. This dataset imitates panic-driven scenes.
Trajectory analysis of objects was described in [1-3]. The moving object was labeled by a blob in consecutive frames, and a trajectory was then produced. Deviation from the learnt trajectories was defined as an abnormal event. Tracking-based approaches are suitable for sparse scenes with a few objects, but the target might be lost due to occlusion.
In [4, 5], abnormal detection approaches using features encoding the motion, texture, and size of objects were introduced. Local image regions in a video were analyzed with a background subtraction method; a dynamic Bayesian network (DBN) was then constructed to model normal and abnormal behavior, and finally a likelihood ratio test was applied to detect abnormal behaviors. In [6], a space-time Markov random field (MRF) model for detecting abnormal activities in a video was proposed, in which a mixture of probabilistic principal component analyzers (MPPCA) was adopted to model local optical flow. These predictions rest on probabilistic assumptions that require an accurate model; in the many situations where a robust and tractable model cannot be obtained, model-free methods need to be studied.
Spatiotemporal motion features described in the context of bag-of-video-words were adopted to detect abnormal events. In [7], the authors presented an algorithm which monitored optical flow at a set of fixed spatial positions and constructed a histogram of optical flow. The likelihood of the behavior in a newly arrived frame with respect to the probability distribution of the statistically learnt behavior was computed; if the likelihood fell below a preset threshold, the behavior was considered abnormal. In [8], irregular behavior in images or videos was detected by an inference process in a probabilistic graphical model. In [9, 10], the video pixels were densely sampled to form the feature. These methods are based on partial information of the images, such as small blocks in a frame, without fully exploiting the global information of the feature. In [11-13], spatiotemporal features modeled the motion regions of the frame as background, and anomalies were detected by comparing each new sample against the background template. These works resemble change detection methods when the background is not stable.
In this paper, the proposed algorithm is composed of two parts. First, a covariance feature descriptor is constructed over the whole video frame; then a nonlinear one-class support vector machine algorithm is applied in an online fashion in order to detect abnormal events. The features are extracted from the optical flow, which captures the movement information. Experiments on a real surveillance video dataset show that our online abnormal detection techniques obtain satisfactory performance. The rest of the paper is organized as follows. In Section 2, the covariance matrix descriptor of motion features is introduced. In Section 3, the online one-class SVM classification method is presented. In Section 4, two abnormal detection strategies based on online nonlinear one-class SVM are proposed. In Section 5, we present results on real-world video scenes. Finally, Section 6 concludes the paper.

Covariance Descriptor of Frame Behavior
The optical flow is a feature which represents the direction and the amplitude of a movement. It can provide important information about the spatial arrangement of the objects and the change rate of this arrangement [14]. We adopt the Horn-Schunck (HS) optical flow computation method in our work. The optical flow of the gray scale image is formulated as the minimizer of the following global energy functional:

E = \iint \left[ (I_x u + I_y v + I_t)^2 + \alpha^2 \left( \|\nabla u\|^2 + \|\nabla v\|^2 \right) \right] dx \, dy,

where I is the intensity of the image, I_x, I_y, and I_t are the derivatives of the image intensity value along the x, y, and time t dimensions, u and v are the components of the optical flow in the horizontal and vertical directions, and \alpha represents the weight of the regularization term.

We introduce the covariance matrix encoding the optical flow and intensity of each frame as the descriptor representing the movement. The covariance feature descriptor was originally proposed by Tuzel et al. [15] for pattern matching in a target tracking problem. The descriptor is defined as

F(x, y) = \phi(I, x, y),

where I is the color information of an image (which can be gray, RGB, HSV, HLS, etc.), \phi is a mapping relating the image with the ith feature from the image, F is the W × H × d dimensional feature extracted from image I, W and H are the image width and height, and d is the number of chosen features. For each frame, the feature can be represented as a d × d covariance matrix:

C = \frac{1}{n-1} \sum_{k=1}^{n} (\mathbf{z}_k - \boldsymbol{\mu})(\mathbf{z}_k - \boldsymbol{\mu})^{T},

where n is the number of pixels sampled in the frame, z_k is the feature vector of pixel k, \mu is the mean of all the selected points, and C is the covariance matrix of the feature vectors. The covariance descriptor C of each frame does not retain any information regarding the sample ordering or the number of points [15]. Because the feature F can be designed in different ways to fuse features, the covariance matrix descriptor provides a way to merge multiple parameters. Different choices of feature vector extraction are shown in Table 1, where I is the intensity of the gray image, u and v are the horizontal and vertical components of the optical flow, I_x, u_x, and v_x are the first derivatives of the intensity, horizontal optical flow, and vertical optical flow in the x direction, respectively, I_y, u_y, and v_y are the first derivatives of the corresponding features in the y direction, I_xx, u_xx, and v_xx are the second derivatives in the x direction, and I_yy, u_yy, and v_yy are the second derivatives in the y direction. The flowchart of covariance matrix descriptor computation is shown in Figure 2. The optical flow and its partial derivatives characterize the interframe information and can be regarded as the movement information. The intensity of the frame and the partial derivatives of the intensity describe the intraframe information; they encode the appearance of the frame.
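As a concrete illustration, the following sketch computes a per-frame covariance descriptor from stacked per-pixel feature vectors (intensity, its spatial derivatives, and the optical flow components). The flow values here are random stand-ins, since the Horn-Schunck solver itself is outside the scope of this snippet:

```python
import numpy as np

def covariance_descriptor(features):
    """features: (n_pixels, d) array of per-pixel feature vectors z_k.
    Returns C = sum_k (z_k - mu)(z_k - mu)^T / (n - 1), a d x d matrix."""
    mu = features.mean(axis=0)
    z = features - mu
    return z.T @ z / (features.shape[0] - 1)

# toy frame: intensity plus synthetic flow components (stand-ins for HS output)
rng = np.random.default_rng(0)
h, w = 32, 32
I = rng.random((h, w))
Iy, Ix = np.gradient(I)                        # intensity derivatives I_y, I_x
u, v = rng.random((h, w)), rng.random((h, w))  # hypothetical optical flow
F = np.stack([I, Ix, Iy, u, v], axis=-1).reshape(-1, 5)  # d = 5 features
C = covariance_descriptor(F)                   # 5 x 5 descriptor of the frame
```

Note that the resulting descriptor is invariant to pixel ordering, matching the property of the covariance descriptor noted above.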
If proper parameters are given, traditionally used kernels such as the Gaussian, polynomial, and sigmoidal kernels have similar performances [19]. The Gaussian kernel k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2) is chosen for our spatial features. However, the covariance matrix is an element of a Lie group, and the Gaussian kernel on Euclidean spaces is not suitable for covariance descriptors. The Gaussian kernel on a Lie group is defined as [20, 21]

k(X_i, X_j) = \exp\left( -\left\| \log\left( X_i^{-1} X_j \right) \right\|^2 / 2\sigma^2 \right),

where X_i and X_j are matrices in the Lie group G and \log denotes the matrix logarithm.
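A minimal sketch of such a kernel for symmetric positive definite covariance descriptors, using the log-Euclidean distance between matrix logarithms as a stand-in for the exact Lie-group construction in [20, 21] (the matrix logarithm is computed via eigendecomposition, which is valid for SPD matrices):

```python
import numpy as np

def spd_logm(X):
    """Matrix logarithm of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def lie_gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian kernel on covariance descriptors via the log-Euclidean distance."""
    d = np.linalg.norm(spd_logm(X) - spd_logm(Y), 'fro')
    return float(np.exp(-d**2 / (2.0 * sigma**2)))

# two SPD matrices standing in for frame covariance descriptors
rng = np.random.default_rng(1)
M1, M2 = rng.random((3, 3)), rng.random((3, 3))
A = M1 @ M1.T + np.eye(3)
B = M2 @ M2.T + np.eye(3)
```

As with any Gaussian kernel, identical descriptors give a kernel value of 1, and the value decays toward 0 as the descriptors move apart on the manifold.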

Online One-Class SVM
The essence of an abnormal detection problem is that only normal scene samples are available, so the one-class SVM framework is well suited to it. The support vector machine (SVM) was initially proposed by Vapnik and Lerner [22, 23]. It is a method based on statistical learning theory and performs well for classifying data and recognizing patterns. There are two frameworks of one-class SVM: support vector data description (SVDD), presented in [24, 25], and the ν-support vector classifier (ν-SVC), introduced in [26]. The SVDD formulation is adopted in our work. It computes a sphere-shaped decision boundary with minimal volume around a set of objects. The center c and the radius r of the sphere are determined via the following optimization problem:

\min_{\mathbf{c}, r, \boldsymbol{\xi}} \; r^2 + C \sum_{i=1}^{n} \xi_i

subject to

\|\Phi(\mathbf{x}_i) - \mathbf{c}\|^2 \le r^2 + \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n,

where n is the number of training samples and ξ_i is the slack variable penalizing the outliers. The hyperparameter C weights the slack variables; it tunes the number of acceptable outliers. The nonlinear function Φ : X → H maps a datum x_i into the feature space H; it allows solving a nonlinear classification problem by designing a linear classifier in the feature space H. k is the kernel function computing dot products in H, k(x, x') = ⟨Φ(x), Φ(x')⟩. By introducing Lagrange multipliers, the dual problem associated with the above optimization is written as the following quadratic optimization problem:

\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i k(\mathbf{x}_i, \mathbf{x}_i) - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(\mathbf{x}_i, \mathbf{x}_j), \quad \text{subject to} \quad 0 \le \alpha_i \le C, \; \sum_{i=1}^{n} \alpha_i = 1.

The decision function is

f(\mathbf{x}) = \operatorname{sign}\left( r^2 - k(\mathbf{x}, \mathbf{x}) + 2 \sum_{i=1}^{n} \alpha_i k(\mathbf{x}_i, \mathbf{x}) - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(\mathbf{x}_i, \mathbf{x}_j) \right).

For large training data the solution cannot be obtained easily, and an online strategy for training is used in our work. Let c_D denote a sparse model of the center c_n = (1/n) \sum_{i=1}^{n} \Phi(\mathbf{x}_i) that uses a small subset of the available samples called the dictionary:

\mathbf{c}_D = \sum_{j \in D} \beta_j \Phi(\mathbf{x}_j),

where D ⊂ {1, 2, ..., n}, and let n_D denote the cardinality of this subset {x_j}_{j∈D}. The distance of a mapped datum Φ(x) to the center c_D can be calculated by

\|\Phi(\mathbf{x}) - \mathbf{c}_D\|^2 = k(\mathbf{x}, \mathbf{x}) - 2 \sum_{j \in D} \beta_j k(\mathbf{x}_j, \mathbf{x}) + \sum_{j, l \in D} \beta_j \beta_l k(\mathbf{x}_j, \mathbf{x}_l).

A modification of the original formulation of the one-class classification algorithm consists of minimizing the approximation error

\min_{\boldsymbol{\beta}} \; \Big\| \mathbf{c}_n - \sum_{j \in D} \beta_j \Phi(\mathbf{x}_j) \Big\|^2.

The final solution is given by

\boldsymbol{\beta} = \mathbf{K}^{-1} \boldsymbol{\kappa},

where K is the Gram matrix with (j, l)th entry k(x_j, x_l) and κ is the column vector with entries (1/n) \sum_{i=1}^{n} k(\mathbf{x}_i, \mathbf{x}_j), j ∈ D.

In the online scheme, a new sample arrives at each time step. Let β_t denote the coefficients, K_t the Gram matrix, and κ_t the vector at time step t. A criterion is used to determine whether the new sample should be included in the dictionary. A threshold μ_0 is preset; for the datum x_{t+1} at time step t, the coherence-based sparsification criterion [29, 30] is

\mu_t = \max_{j \in D} \left| k(\mathbf{x}_{t+1}, \mathbf{x}_j) \right|.

First Case (μ_t > μ_0). In this case, the new datum Φ(x_{t+1}) is not included in the dictionary. The Gram matrix is unchanged, K_{t+1} = K_t, while κ changes online:

\boldsymbol{\kappa}_{t+1} = \frac{1}{t+1} \left( t \, \boldsymbol{\kappa}_t + \mathbf{b} \right),

where b is the column vector with entries k(x_j, x_{t+1}), j ∈ D.

Second Case (μ_t ≤ μ_0). In this case, the new datum Φ(x_{t+1}) is included in the dictionary D. The Gram matrix changes:

\mathbf{K}_{t+1} = \begin{bmatrix} \mathbf{K}_t & \mathbf{b} \\ \mathbf{b}^{T} & k(\mathbf{x}_{t+1}, \mathbf{x}_{t+1}) \end{bmatrix}.

By using the Woodbury matrix identity, K^{-1}_{t+1} can be calculated iteratively from K^{-1}_t. The vector κ_{t+1} is updated from κ_t as in the first case, augmented with a new entry for the added atom. Computing that entry exactly, (1/(t+1)) \sum_{i=1}^{t+1} k(\mathbf{x}_i, \mathbf{x}_{t+1}), requires keeping all the samples \{x_i\}_{i=1}^{t+1} in memory. To overcome this issue, it is approximated by the instantaneous estimate k(x_{t+1}, x_{t+1}). With this update we have an online implementation of the one-class SVM learning phase.
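The two update cases above can be sketched as follows. This is a simplified illustration under stated assumptions (a plain Gaussian kernel on vectors rather than the Lie-group kernel on covariance descriptors, and the new κ entry approximated by the instantaneous estimate, as discussed above); the class and variable names are ours, not the paper's:

```python
import numpy as np

def gauss(x, y, s=1.0):
    """Gaussian kernel on plain vectors (stand-in for the Lie-group kernel)."""
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * s ** 2)))

class OnlineOneClassSVM:
    def __init__(self, mu0=0.5):
        self.mu0 = mu0      # coherence threshold
        self.D = []         # dictionary samples
        self.Kinv = None    # inverse Gram matrix of the dictionary
        self.kappa = None   # running-mean kernel vector
        self.t = 0          # number of samples seen

    def update(self, x):
        self.t += 1
        if not self.D:
            self.D = [x]
            self.Kinv = np.array([[1.0 / gauss(x, x)]])
            self.kappa = np.array([gauss(x, x)])
            return
        b = np.array([gauss(xj, x) for xj in self.D])
        if b.max() > self.mu0:
            # first case: coherent with the dictionary, only kappa changes
            self.kappa = ((self.t - 1) * self.kappa + b) / self.t
        else:
            # second case: add x; block (Schur-complement) update of K^{-1}
            kxx = gauss(x, x)
            q = self.Kinv @ b
            s = kxx - b @ q
            n = len(self.D)
            Kinv = np.empty((n + 1, n + 1))
            Kinv[:n, :n] = self.Kinv + np.outer(q, q) / s
            Kinv[:n, n] = -q / s
            Kinv[n, :n] = -q / s
            Kinv[n, n] = 1.0 / s
            self.Kinv = Kinv
            self.D.append(x)
            # new kappa entry: instantaneous estimate k(x, x)
            self.kappa = np.append(((self.t - 1) * self.kappa + b) / self.t, kxx)

    def coeffs(self):
        return self.Kinv @ self.kappa   # beta = K^{-1} kappa
```

Feeding a stream of samples grows the dictionary only when a new sample is sufficiently incoherent with the stored atoms, which is what keeps the memory footprint bounded.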

Abnormal Events Detection
In an abnormal event detection problem, it is assumed that a set of training frames {I_1, ..., I_n} (the positive class) describing the normal behavior is available. The general architectures of abnormal detection are introduced below.
The offline training strategy refers to the case where all the training samples are learnt as one batch, as shown in Figure 3(a). We propose two abnormal detection strategies; the difference between them is the time at which the dictionary is fixed. These two strategies are shown in Figures 3(b) and 3(c). Strategy 1 is shown in Figure 3(b): the training data are learnt one by one, and when the training period is finished, the dictionary and the classifier are fixed. Each test datum is then classified based on the dictionary. Figure 3(c) illustrates Strategy 2. The training procedure is the same as in Strategy 1, but in the testing period the dictionary is updated whenever a datum satisfies the dictionary update condition. The details of these two strategies are explained in the following.
Figure 4: Major processing stages of the proposed abnormal frame event detection method. The covariance descriptor of each frame is computed.

Strategy 1.
In Strategy 1, the dictionary is updated only during the training period.
Step 1. The first step is calculating the covariance matrix descriptors of the training frames based on the image intensity and the optical flow. This step can be generalized as

\{(I_1, \mathrm{OP}_1), (I_2, \mathrm{OP}_2), \ldots, (I_n, \mathrm{OP}_n)\} \longrightarrow \{C_1, C_2, \ldots, C_n\}.

The set {C_1, C_2, ..., C_m} of the first m covariance matrix descriptors of the training frames is the original dictionary C_D. In one-class SVM, the majority of the training samples do not contribute to the definition of the decision function; only a minority subset of the training samples, {α_1, α_2, ..., α_s}, s ≤ m, the support vectors, contributes to its definition.
Step 3. After learning the dictionary C_D, which includes the first m, 1 ≤ m ≪ n, samples, the remaining training samples {C_{m+1}, C_{m+2}, ..., C_n} are learnt online via the technique described in Section 3. Here C_D is the dictionary obtained in Step 2 and C_t is a new sample from the remaining training dataset. According to the criterion introduced in Section 3, if the new sample C_t satisfies the dictionary update condition, it is included in the dictionary C_D.
Step 4. Based on the dictionary and the classifier obtained from the training frames, each incoming frame sample C_{n+t} is classified. The workflow of Strategy 1 is shown in Figure 4. Here C_{n+t} is the covariance matrix descriptor of the (n+t)th frame to be classified, and the C_j are the samples of the dictionary C_D. The label "1" corresponds to a normal frame; "−1" corresponds to an abnormal frame.
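Step 4 amounts to thresholding the feature-space distance between the new frame's descriptor and the sparse center c_D from Section 3. A sketch, with a generic Gaussian kernel on vectors as a stand-in for the covariance kernel and a hypothetical squared-radius parameter `radius2` that would come from training:

```python
import numpy as np

def gauss(x, y, s=1.0):
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * s ** 2)))

def distance_to_center(x, dictionary, beta, s=1.0):
    """Squared feature-space distance ||Phi(x) - c_D||^2 from Section 3."""
    kx = np.array([gauss(xj, x, s) for xj in dictionary])
    K = np.array([[gauss(xa, xb, s) for xb in dictionary] for xa in dictionary])
    return gauss(x, x, s) - 2.0 * beta @ kx + beta @ K @ beta

def classify(x, dictionary, beta, radius2, s=1.0):
    """Returns 1 for a normal frame, -1 for an abnormal frame."""
    return 1 if distance_to_center(x, dictionary, beta, s) <= radius2 else -1

# illustrative dictionary and coefficients (not learnt from real data)
D = [np.array([0.0, 0.0]), np.array([0.5, 0.0])]
beta = np.array([0.5, 0.5])
```

A sample close to the dictionary atoms falls inside the sphere and is labeled normal; a distant sample is labeled abnormal.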

Strategy 2.
In this strategy, the dictionary is updated during both the training and testing periods. The feature extraction step (Step 1) and the online training steps (Steps 2 and 3) are the same as in Strategy 1; the testing step is different. A newly arriving datum which is detected as normal but satisfies the dictionary update condition is included in C_D. The dictionary thus keeps being updated through the testing period to include new samples.
Step 4 (Strategy 2). If the incoming frame sample C_{n+t} is classified as normal, f(C_{n+t}) = 1, the datum is checked by the criterion described in Section 3. When it satisfies the dictionary update criterion, this testing sample is included in the dictionary.
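The Strategy 2 test step can be condensed into one function: classify the sample, and only when it is normal and incoherent with the dictionary flag it for inclusion. As before, this is a sketch with a plain Gaussian kernel and illustrative parameter values of our own choosing:

```python
import numpy as np

def gauss(x, y, s=1.0):
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * s ** 2)))

def strategy2_test_step(x, D, beta, radius2, mu0=0.5, s=1.0):
    """Returns (label, add_to_dictionary) for one incoming sample x.
    label: 1 normal / -1 abnormal; on inclusion, the Gram-matrix and
    coefficient updates follow the second case of Section 3."""
    kx = np.array([gauss(xj, x, s) for xj in D])
    K = np.array([[gauss(xa, xb, s) for xb in D] for xa in D])
    dist2 = gauss(x, x, s) - 2.0 * beta @ kx + beta @ K @ beta
    label = 1 if dist2 <= radius2 else -1
    add = (label == 1) and (kx.max() <= mu0)
    return label, add

# illustrative dictionary and coefficients (not learnt from real data)
D = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
beta = np.array([0.5, 0.5])
```

Only normal samples that bring new information (low coherence with every atom) enlarge the dictionary, which is what lets Strategy 2 adapt over long sequences without unbounded growth.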

Abnormal Detection Result
This section presents the results of experiments conducted to analyze the performance of the proposed method. Competitive performance of both Strategy 1 and Strategy 2 on the UMN dataset [31] is presented.

Abnormal Visual Events Detection: Strategy 1.
The results of the proposed abnormal event detection method via Strategy 1 online one-class SVM on the UMN dataset [31] are shown below.
The UMN dataset includes eleven video sequences of three different scenes (lawn, indoor, and plaza) of crowded escape events. The normal samples are used for training.

Table 3: Comparison of the proposed method with the state-of-the-art methods for abnormal event detection on the UMN dataset.

Method (area under ROC)   Lawn    Indoor  Plaza
Social force [16]         0.96
Optical flow [16]         0.84
NN [17]                   0.93
SRC [17]                  0.995   0.975   0.964
STCOG [18]                0

Online one-class SVM is appropriate for detecting abnormal visual events. There are 1431 frames in the lawn scene, and 480 normal frames are used for training. In the offline strategy, the covariance matrices of all 480 frames have to be kept in memory. In Strategy 1, the covariance matrices of the first 100 frames are taken as the initial dictionary. When the f_5 (17 × 17) feature is adopted to construct the covariance descriptor, with Gaussian kernel variance σ = 1 and preset criterion threshold μ_0 = 0.5, the dictionary size increases from 100 to only 101, and the maximum accuracy of the detection results is 91.69%. In the indoor scene, there are 2975 normal frames and 1057 abnormal frames; in the plaza scene, there are 1831 normal frames and 286 abnormal frames. The experiments proceed as for the lawn scene. With feature vector f_5 (17 × 17), σ = 1, and μ_0 = 0.5, the dictionary size for these two scenes remains 100. The online strategy keeps the memory size almost unchanged as the training dataset grows.

If a sample is classified as an anomaly, the dictionary and the classifier are not changed. Otherwise, if the sample is classified as normal, the sparsification criterion introduced in Section 3 is used to check the correlation between the current dictionary and the new datum, which is included in the dictionary when it satisfies the update condition. The dictionary is therefore updated through the whole testing period. The other two scenes, indoor and plaza, are handled in the same way. When the f_5 (17 × 17) feature is adopted, with Gaussian kernel variance σ = 1 and preset criterion threshold μ_0 = 0.5, the dictionary sizes of the lawn, indoor, and plaza scenes increase from 100 to 106, 102, and 102, respectively. The ROC curves of the detection results of these three scenes are shown in Figures 8(a), 8(b), and 8(c).

The performances of the offline strategy, Strategy 1, and Strategy 2 are shown in Table 2. The performances of the two online strategies are similar to those obtained when all training samples are learnt together. When f_4 (12 × 12) or f_5 (17 × 17) is chosen as the feature to form the covariance matrix descriptor, the results have the best performance.

These two features are richer; they include both movement and intensity information.
The performance of the covariance matrix descriptor based online one-class SVM method and of the state-of-the-art methods is shown in Table 3. The covariance matrix based online abnormal frame detection method obtains competitive performance. In general, our method is better than the others except sparse reconstruction cost (SRC) [17] in the lawn and indoor scenes. In that paper, multiscale HOF is taken as the feature, and a testing sample is classified by its sparse reconstruction cost, through a weighted linear reconstruction over an overcomplete normal basis set. However, computing the HOF might take more time than calculating the covariance. By adopting the integral image [15], the covariance matrix descriptor of a subimage can be computed conveniently, so the covariance descriptor is also suited to analyzing partial movement.
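The integral-image trick mentioned above can be sketched as follows: by keeping running sums of the feature vectors and of their outer products, the covariance of any axis-aligned window is obtained in O(d²) time regardless of window size. This is a compact reading of the technique from [15], with our own variable names:

```python
import numpy as np

def region_covariance(F):
    """F: (H, W, d) per-pixel feature image. Returns a function giving the
    covariance descriptor of any window [y0, y1) x [x0, x1) in O(d^2)."""
    H, W, d = F.shape
    P = np.zeros((H + 1, W + 1, d))        # integral image of feature sums
    Q = np.zeros((H + 1, W + 1, d, d))     # integral image of outer products
    P[1:, 1:] = F.cumsum(axis=0).cumsum(axis=1)
    outer = F[..., :, None] * F[..., None, :]
    Q[1:, 1:] = outer.cumsum(axis=0).cumsum(axis=1)

    def cov(y0, x0, y1, x1):
        n = (y1 - y0) * (x1 - x0)
        p = P[y1, x1] - P[y0, x1] - P[y1, x0] + P[y0, x0]   # sum of z_k
        q = Q[y1, x1] - Q[y0, x1] - Q[y1, x0] + Q[y0, x0]   # sum of z_k z_k^T
        return (q - np.outer(p, p) / n) / (n - 1)
    return cov

rng = np.random.default_rng(2)
F = rng.random((8, 8, 3))   # toy 8 x 8 frame with d = 3 features per pixel
cov = region_covariance(F)
```

Sliding such windows over a frame yields localized descriptors, which is what makes the covariance descriptor suitable for analyzing partial movement.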

Conclusions
A method for frame-level abnormal event detection has been proposed. The method combines a covariance matrix descriptor encoding the movement features with an online nonlinear one-class SVM classification method. We have developed two nonlinear one-class SVM based abnormal event detection techniques that update the normal models of the surveillance video data in an online framework. The proposed algorithm has been tested on a video dataset, yielding successful abnormal event detection results.

Figure 1: Examples of the normal and abnormal scenes. (a) Normal lawn scene: all the people are walking. (b) Abnormal lawn scene: all the people are running.

Figure 2: Flowchart of the covariance matrix descriptor computation.

Figure 3: Offline and two online abnormal event detection strategies based on one-class SVM. (a) Offline strategy: the training data are learnt as one batch offline. (b) Strategy 1: the dictionary is fixed when all the training data are learnt. (c) Strategy 2: the dictionary continues being updated through the testing period.

Figure 5: Detection results of the lawn scene. (a) Detection result of one normal frame. (b) Detection result of one abnormal panic frame. (c) ROC curves for different features f of the lawn scene via one-class SVM, with all training samples learnt together offline; the biggest AUC value is 0.9591. (d) ROC curves for different features f via Strategy 1 online one-class SVM; the biggest AUC value is 0.9581.

Figure 6: Detection results of the indoor scene. (a) Detection result of one normal frame. (b) Detection result of one abnormal panic frame. (c) ROC curves for different features f of the indoor scene via one-class SVM, with all training samples learnt together offline; the biggest AUC value is 0.8649. (d) ROC curves for different features f via Strategy 1 online one-class SVM; the biggest AUC value is 0.8628.

Figure 7: Detection results of the plaza scene. (a) Detection result of one normal frame. (b) Detection result of one abnormal panic frame. (c) ROC curves for different features f of the plaza scene via one-class SVM, with all training samples learnt together offline; the biggest AUC value is 0.9649. (d) ROC curves for different features f via Strategy 1 online one-class SVM; the biggest AUC value is 0.9632.

Figure 8: ROC curves on the UMN dataset. (a) ROC curves for different features f via Strategy 2 for the lawn scene; the biggest AUC value is 0.9605. (b) Strategy 2 results for the indoor scene; the biggest AUC value is 0.8495. (c) Strategy 2 results for the plaza scene; the biggest AUC value is 0.9746. (d) ROC curves of the best performance for the lawn, indoor, and plaza scenes when the training samples are learnt offline; the biggest AUC values are 0.9591, 0.8649, and 0.9649, respectively.

Besides the memory savings it shares with Strategy 1, Strategy 2 also has the advantage of adapting to long-duration sequences.

Table 1: Different choices of the feature vector f used to construct the covariance descriptor.
Here (I_1, OP_1), (I_2, OP_2), ..., (I_n, OP_n) are the image intensities and the corresponding optical flows of the 1st to nth frames, and {C_1, C_2, ..., C_n} are the covariance matrix descriptors.

Step 2. The second step consists of applying one-class SVM on a small subset of the extracted descriptors of the normal training frames to obtain the support vectors. Consider a subset {C_i}, i = 1, ..., m, 1 ≤ m ≪ n, of data selected from the full training sample set {C_i}, i = 1, ..., n; without loss of generality, assume that the first m examples are chosen. This set of m examples is called the dictionary C_D.

Table 2: AUC of the abnormal event detection method for different features f via the original SVDD (which learns the training samples offline), Strategy 1 online one-class SVM, and Strategy 2 online one-class SVM. The biggest value for each method is shown in bold.
Abnormal Visual Events Detection: Strategy 2.
The results of the abnormal event detection method via Strategy 2 on the UMN dataset are shown as follows. In the experiment on the lawn scene, 100 normal samples from the training set are learnt first, and then the other 380 training data are learnt online one by one. After these two training steps, we obtain the basic dictionary and the classifier from the training samples. In the subsequent testing step, the dictionary is updated whenever a sample satisfies the dictionary update criterion. When a new sample arrives, it is first evaluated by the previous classifier.