Autonomous surveillance has attracted growing attention in recent years, and anomaly detection in crowded scenes is one of its most significant tasks. Traditional approaches rely on handcrafted features followed by a separate model-learning step. They mostly extract low-level spatiotemporal features from videos and neglect semantic information. Deep learning (DL) methods have recently emerged in various domains, especially CNNs for visual problems, with the ability to extract high-level information at the higher layers of their architectures. On the other hand, topic modeling approaches such as NMF can extract more semantic representations. Here, we investigate a new hybrid visual embedding method based on deep features and a topic model for anomaly detection. Features per frame are computed hierarchically through a pretrained deep model, and in parallel, topic distributions are learned through multilayer nonnegative matrix factorization that incorporates information from the extracted deep features. Training uses normal samples only. K-means is then applied to find typical normal clusters. At test time, after the deep features and the topic distribution of each test frame are computed, the earth mover distance (EMD), a statistical distance, is used to measure the difference between the normal cluster centroids and the test topic distribution. A difference that exceeds a threshold is flagged as an anomaly. Experimental results on the benchmark UCSD Ped1 and Ped2 datasets demonstrate the effectiveness of the proposed method for anomaly detection.
1. Introduction
Automatic video surveillance has recently attracted the attention of researchers, since the large number of cameras installed in the surrounding environment makes error-free human-based surveillance impractical. Computer vision and machine learning therefore help analyze the recorded videos for various tasks such as automatic recognition and anomaly detection. Originally, information was extracted from raw signals through machine learning techniques [1]. However, the high dimensionality of video signals captured by high-resolution cameras makes traditional methods computationally expensive. To combat this curse of dimensionality, dimensionality reduction techniques have received more attention. Linear and nonlinear dimensionality reduction approaches can be applied depending on the task; PCA, MDS, LLE, and autoencoders are a few examples. Generally speaking, feature extraction methods in computer vision, such as handcrafted features (SIFT, HOG, etc.), can also be considered a kind of dimensionality reduction.
Embedding methods, originally introduced in natural language processing (NLP), map the original high-dimensional signals to an embedding space and capture high-level information, so that, besides compression, the semantic relations between signals are also preserved [2, 3]. Embedding techniques in NLP represent each word as a vector in a vector space model. Basic one-hot encoding fails to preserve semantic relations, since the orthogonality between word vectors ignores any coherence between the words. Topic-based representations such as LSA, probabilistic LSA, LDA, and NMF try to capture semantics [3].
Embedding can also be applied to vision tasks to bridge the semantic gap in image or video analysis. Recently, deep learning architectures (CNN, RNN, AE, RBM, etc.) have been well studied for anomaly detection [4]. By exploiting high-level features, they have shown considerable improvements over handcrafted features. Supervised CNNs consist of convolutional layers for feature extraction and fully connected (FC) layers for classification/recognition. Most of the parameters of a CNN lie in the final FC layers, which may cause overfitting when training from scratch on a limited dataset. Therefore, the trend is toward using only the pretrained convolutional layers for feature extraction and powerful image representations, setting the FC layers aside.
In most studies, anomaly detection is formulated by defining one or more models on normal samples and detecting anomalies as deviations from this normality. The deviation can be measured either by likelihood or by similarity. In [5], an anomaly was defined based on interaction forces between pedestrians using the social force model (SFM), and LDA was used to compute the likelihood of the test set in order to evaluate its deviation from a normal model in a probabilistic framework. In [6, 7], by contrast, normal training samples were used to create a dictionary model, and the deviation was calculated as a high sparse reconstruction cost between an original test sample and its reconstruction through a linear combination of normal bases in the Euclidean space.
In this paper, we investigate a combination of a deep model, a topic model, and a statistical distance for anomaly detection. In contrast to previous methods, which were based on either handcrafted or deep features and neglected semantic and interpretable information, we combine a deep model with a topic model hierarchically to produce a semantic representation. We apply a pretrained deep model for hierarchical feature extraction from different layer levels for each training image. We then take advantage of nonnegative matrix factorization (NMF) as a topic modeling approach for capturing semantic features. Specifically, we apply a multilayer NMF for hierarchical topic representation, injecting the information extracted from the hierarchical layers of the deep model into the hierarchical decompositions. After learning the topic distribution per frame in the training stage, we apply K-means clustering to compute cluster centroids as typical normal topic-based representations. At test time, the semantic representation of each test frame is calculated with the same feature extraction pipeline as in the training stage and compared to the typical normal topic distributions through a statistical distance. Here, the earth mover distance (EMD) is chosen as the distance metric, since it has been shown to perform well when comparing distributions.
Our main contributions are as follows:
(1) We take advantage of both the deep model (pretrained VGG-Net) and the topic model (multilayer NMF), hierarchically and in combination, to reach a high-level and semantic frame representation.
(2) Since topic distributions are extracted at the final level as the frame representations, K-means clustering yields a set of representative normal topic distributions, and the EMD statistical distance is then applied in a clustering-based anomaly detection framework.
The rest of this paper is organized as follows: a literature review covering the three domains of anomaly detection, topic modeling, and statistical distance is provided in Section 2. Section 3 introduces our proposed pipeline for crowd anomaly detection. Experimental results are reported in Section 4. Finally, Section 5 concludes this paper.
2. Literature Review
In this section, we review research on anomaly detection, topic modeling, and statistical distance separately.
2.1. Anomaly Detection
Video surveillance studies on anomaly detection started with traditional handcrafted feature extraction and model learning and have improved over the years through end-to-end deep architectures. Formerly, low-level features such as color, texture and its variants like the mixture of dynamic textures (MDT), SIFT, SURF, optical flow, and trajectories were extracted from appearance, motion, or both, depending on the anomaly definition. At the model learning stage, binary classifiers like SVM, decision tree, and NN have been applied in supervised scenarios [1]. In semisupervised and unsupervised scenarios, however, only normal videos are given at the training stage; a model of normal behavior is created and an anomaly is detected as a deviation from this model. This has been done, for instance, with a one-class SVM (OCSVM) or by fitting a Gaussian model to normal samples. Some researchers exploited the inherent sparsity of visual data: a dictionary was learned from normal samples, and at test time, a large reconstruction error was interpreted as an anomaly. Reconstruction was done as a linear combination of dictionary bases that are representative of all normal samples. The dictionary can be learned offline through codebook generation or online by updating it as new normal samples are observed [8].
Recently, deep learning methods have entered practical domains such as vision, language, and speech. The intermediate image representations learned by CNNs, especially when trained on large-scale datasets like ImageNet, have proven to be powerful image descriptors.
In [9], anomalous behaviors were captured through a novel concept of aggregation of ensembles (AOE), based on fine-tuning different pretrained ConvNets and a pool of classifiers. The authors assumed that different CNN architectures learn different levels of representation from crowd videos, so that an ensemble of CNNs allows enriched feature sets to be extracted. Autoencoder-based architectures were also studied, where a large reconstruction error was taken as the anomaly score. The autoencoder can reduce dimensionality and is widely used in unsupervised learning problems or as the preliminary stage of a supervised task [10]. In particular, after training an AE or sparse AE on normal samples, the bottleneck layer can serve as a feature extraction layer for any test sample. Some researchers tried to incorporate both handcrafted and deep features in a unified configuration. In [11], a trajectory-pooled deep convolutional descriptor was introduced that combines dense trajectories and convolutional feature maps, which results in highly discriminative features. Convolutional networks outperform both traditional low-level features and their compositional forms like BoW, Fisher kernel, and VLAD [12], although the two are sometimes used together. In [12], features extracted from the internal layers of a convolutional network were encoded with VLAD to compress the data and subsequently fed to an SVM for classification. Wimmer et al. [13] applied Fisher vector encoding to the output feature maps of a CNN to obtain a fixed-length representation for image classification.
Sabokrou et al. investigated video anomaly detection through different deep architectures [14–21]. Autoencoder-based anomaly detection and localization using sparsity was introduced in [14, 15]. An architecture based on a deep 3D autoencoder, a deeper 3D convolutional neural network (CNN), and a cascade of two classifiers was proposed in [16] for anomaly detection. Fast and accurate detection and localization of anomalies were achieved in [18] using fully convolutional neural networks (FCNs) and cascaded outlier detection. Some studies applied generative adversarial networks and their variants to image anomaly detection [17, 19, 22]. Semisupervised anomaly detection was analyzed in [23] based on information theory. A novel self-supervised representation learning method, based on integrating a neighbourhood-relational encoding (NRE) among the training data with an encoder-decoder structure, was proposed in [20]. In [21], an adversarial training approach was proposed to detect out-of-distribution samples in an end-to-end model by jointly training two deep neural networks that collaborate at test time to detect novelties.
2.2. Topic Modeling
Topic modeling is an unsupervised method that was originally introduced for text analysis but has also been adopted in vision. It is based on the idea that documents with similar content are likely to use a similar set of words, which is captured by topics. Given an unlabeled collection of documents made up of words, topic modeling discovers patterns as a low-dimensional latent representation. pLSA, LDA, and NMF are among the most common probabilistic topic modeling approaches [24–26]. Topic models take as input a set of documents $J$, a set of words $V$, and a co-occurrence matrix of words and documents $F = (n_{wj})_{w \in V,\, j \in J}$ (or a BoVW representation), and produce a set of topics $T$, or more specifically $p(w \mid k)$ and $p(k \mid j)$ for $w \in V$, $j \in J$, $k \in T$, as the word distribution per topic and the topic distribution per document, respectively. Here $n_{wj}$ is the number of times the word $w$ appears in document $j$; documents can then be represented as mixtures of topics.
$F$ can be decomposed into two matrices, $F = \Phi\Theta$, where $\Phi = (\phi_{wk})_{w \in V,\, k \in T}$ is a word-topic matrix with $\phi_{wk} = p(w \mid k)$ and columns $\phi_k = (\phi_{wk})_{w \in V}$, and $\Theta = (\theta_{kj})_{k \in T,\, j \in J}$ is a topic-document matrix with $\theta_{kj} = p(k \mid j)$ and columns $\theta_j = (\theta_{kj})_{k \in T}$. The decomposition can be solved with various topic model algorithms that make different assumptions. For instance, LDA uses a predefined number of topics, whereas the hierarchical Dirichlet process (HDP) [27] estimates the best number of topics from the training dataset.
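As a small numeric illustration of the factorization $F = \Phi\Theta$, the sketch below builds a toy word-topic matrix and a toy topic-document matrix and checks that every column of their product is again a distribution over words. All values are made up for illustration (4 words, 2 topics, 3 documents), and the sketch assumes that the co-occurrence counts have been normalized per document.

```python
import numpy as np

# Toy sizes and values are illustrative assumptions, not taken from the paper.
# Phi[w, k] = p(w | k): each column is the word distribution of one topic.
Phi = np.array([
    [0.5, 0.0],
    [0.3, 0.1],
    [0.2, 0.3],
    [0.0, 0.6],
])

# Theta[k, j] = p(k | j): each column is the topic distribution of one document.
Theta = np.array([
    [1.0, 0.5, 0.2],
    [0.0, 0.5, 0.8],
])

# p(w | j) = sum_k p(w | k) p(k | j): the per-document word distribution,
# i.e. the column-normalized co-occurrence matrix.
F = Phi @ Theta

print(F.round(3))
print(F.sum(axis=0))  # every column sums to 1
```

The first document uses only the first topic, so its column of `F` equals the first column of `Phi`; the other two documents are mixtures of both topics.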
In [28], Niebles et al. studied the application of latent topic models, namely pLSA and LDA, to action categorization. Specifically, they extracted spatiotemporal interest points along the input volumes, followed by codebook generation. In an unsupervised fashion, they succeeded in detecting and localizing actions, which were treated as latent topics. New learning algorithms based on EM and variational Bayes inference were proposed in [29] for activity analysis in videos, where activities and behaviors were described by a dynamic topic model. The authors also evaluated anomaly localization procedures in the topic modeling framework. In [30], scene classification was performed by discovering the objects in each image in an unsupervised fashion using pLSA; the object distribution of each image was then used for scene classification with a supervised kNN. Topic modeling-based abnormal behavior recognition has previously been investigated in [5, 31]. In almost all cases, a low likelihood corresponds to an abnormal test sample. Unsupervised topic model (pLSA) anomaly detection and localization was studied in [32], where location and size information was added to quantized spatiotemporal gradient descriptors to create a more informative vocabulary over visual clips. Each document (frame) is fully described by its distribution over topics.
2.3. Statistical Distance
Statistical distances measure the distance between two statistical objects, and when they also satisfy the symmetry property, they are known as metrics. In the anomaly detection area, distance measures such as the Jensen-Shannon divergence or the Z-score have been applied to compare a query observation with the patterns extracted from normal samples [33]. The anomaly is detected by comparing this distance to a threshold. As a powerful statistical distance, the earth mover distance (EMD), also known as the Wasserstein metric, has been applied in the image domain [34, 35] to compare two probability distributions, mainly based on low-level features like color or texture. It computes a statistical distance between two signatures. A typical signature consists of a list of pairs:
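$$S = \{(x_1, m_1), (x_2, m_2), \ldots, (x_n, m_n)\}, \tag{1}$$
where each $x_i$ is a certain feature and $m_i$ is its mass (how many times that feature occurs in the record). Consider two signatures $P$ and $Q$ which contain $m$ and $n$ clusters, respectively,
$$P = \{(p_1, w_{p_1}), (p_2, w_{p_2}), \ldots, (p_m, w_{p_m})\}, \tag{2}$$
$$Q = \{(q_1, w_{q_1}), (q_2, w_{q_2}), \ldots, (q_n, w_{q_n})\}, \tag{3}$$
where $p_i$ ($q_j$) is a cluster representative and $w_{p_i}$ ($w_{q_j}$) is the weight of that cluster. Also, let $D = [d_{ij}]$ be the ground distance between clusters $p_i$ and $q_j$. It can be chosen or learned according to the problem at hand. The aim is to find the flow matrix $F = [f_{ij}]$, where $f_{ij}$ is the flow between $p_i$ and $q_j$, such that the overall cost below is minimized subject to its constraints:
$$
\begin{aligned}
\min_{F}\; & \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\, d_{ij} \\
\text{s.t. } & f_{ij} \ge 0, && 1 \le i \le m,\; 1 \le j \le n, \\
& \sum_{j=1}^{n} f_{ij} \le w_{p_i}, && 1 \le i \le m, \\
& \sum_{i=1}^{m} f_{ij} \le w_{q_j}, && 1 \le j \le n, \\
& \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij} = \min\!\left(\sum_{i=1}^{m} w_{p_i},\; \sum_{j=1}^{n} w_{q_j}\right).
\end{aligned}
\tag{4}
$$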
This optimization can be solved via linear programming. It is based on solving a kind of transportation problem. Once the flow F is calculated, then the EMD is defined as the work normalized by the total flow:
$$\mathrm{EMD}(P, Q) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\, d_{ij}}{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}}. \tag{5}$$
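To make the definition concrete, the sketch below covers only a special case in which no linear program is needed: two histograms over the same equally spaced bins, normalized to equal total mass, with ground distance $|i - j|$ between bins. In that case the EMD reduces to the sum of absolute differences between the two cumulative distributions. The function name `emd_1d` and the example histograms are illustrative assumptions; this is not the wavelet EMD used later in the paper.

```python
from itertools import accumulate


def emd_1d(p, q):
    """EMD between two histograms defined on the same equally spaced bins.

    Assumes ground distance |i - j| between bins. Both histograms are
    normalized to unit mass, so the total flow (the denominator of the
    EMD definition) equals 1 and the EMD is the L1 distance between CDFs.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total mass")
    cdf_p = accumulate(x / total_p for x in p)
    cdf_q = accumulate(x / total_q for x in q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# All mass moves two bins to the right, so the work is 2.
print(emd_1d([1, 0, 0], [0, 0, 1]))
# Each half of the mass moves one bin to the right, so the work is 1.
print(emd_1d([0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))
```

For general signatures with different supports or unequal masses, the transportation problem in (4) has to be solved instead.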
EMD suffers from high computational complexity, $O(N^3 \log N)$. Wavelet EMD was proposed in [36] as a linear-time algorithm that approximates the EMD for low-dimensional histograms using the sum of absolute values of the weighted wavelet coefficients of the difference histogram.
Few studies have used EMD for anomaly detection. To the best of our knowledge, only in [7] was wavelet EMD applied, in conjunction with sparse representation, in place of the Euclidean distance because of its robustness. In this paper, we investigate wavelet EMD within our proposed clustering-based anomaly detection.
3. Proposed Method
In this paper, we analyze anomaly detection at frame level in crowded scenes. Our proposed architecture is shown in Figure 1. The pipeline consists of two stages: (1) feature extraction and (2) anomaly detection. The feature extraction stage itself consists of two parts entangled with each other: (1) hierarchical feature extraction through pretrained VGG-Net [37] and (2) hierarchical latent representation from multilayer NMF. Both architectures start from low-level features and increase in depth to high-level information resulting in ultimate representation.
Figure 1: Our proposed architecture for anomaly detection. It consists of two stages: hierarchical feature representation and cluster-based anomaly detection.
In the second stage, we apply clustering-based anomaly detection. Specifically, K-means is applied to the ultimate representations of all processed training samples to create typical normal clusters. Since the training dataset consists of normal samples only, the cluster centroids are representatives of normal frames. At test time, test frames are processed so that they are represented in the topic space learned in the training stage and are compared to each cluster centroid. A large statistical distance from all centroids is detected as an anomaly. In the following, we explain each part in more detail.
3.1. Preprocessing and Feature Extraction
The dataset is separated into a training set and a test set. Let $X_{\mathrm{train}} = [x_1, x_2, \ldots, x_{n_{\mathrm{Train}}}]^{T} \in \mathbb{R}^{n_{\mathrm{Train}} \times B_0}$, where $n_{\mathrm{Train}}$ is the number of frames in the training set, $B_0 = m \times n \times c$, and $m$, $n$, and $c$ are the width, height, and number of channels, respectively, of the originally captured image.
3.1.1. Deep Representation
A pretrained model is used for feature extraction in problems where training data are scarce, since training from scratch may result in overfitting. As higher-layer feature maps are task specific, we extract more general features from lower layers. We resize each frame to the input size of the VGG-Net model ($m_0 \times n_0 \times c_0$) and extract features hierarchically from different depths of the architecture. Let $a_0 = x \in \mathbb{R}^{m_0 \times n_0 \times c_0}$ be a typical training image resized to the VGG input layer. Then,
$$a_l = f(w_{l-1}\, a_{l-1} + b_{l-1}) \in \mathbb{R}^{m_l \times n_l \times c_l} \tag{6}$$
is the output feature map of layer $l$, where $w_{l-1}$ and $b_{l-1}$ are the pretrained VGG weights and biases, respectively, for layer $l$, $m_l \times n_l$ is the spatial size of the feature map, and $c_l$ is the depth of the feature map at layer $l$. We extract feature maps from $L$ different depths, $l = 1, \ldots, L$; the feature map at each layer $l$ is then fed separately to a global average pooling (GAP) layer to obtain a representation in vector format. A GAP layer takes an input volume of size $m_l \times n_l \times c_l$ and creates a $1 \times c_l$ dimensional vector by spatial averaging. Therefore, for each frame $x$, we now have $L$ vector representations $f_{D_l} \in \mathbb{R}^{c_l}$, $l = 1, 2, \ldots, L$. Considering all training samples, we now have $L$ matrices of different sizes, $M_l \in \mathbb{R}^{n_{\mathrm{Train}} \times c_l}$, whose rows are the vectors $f_{D_l}$ of the individual frames.
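The pooling step can be sketched as follows. To stay self-contained, the sketch uses random nonnegative arrays as stand-ins for the VGG feature maps rather than loading a pretrained network; the three shapes are the ones reported in the experimental section, while `n_train = 4` and the helper name `global_average_pool` are illustrative assumptions.

```python
import numpy as np


def global_average_pool(feature_map):
    """Average an (m_l, n_l, c_l) feature map over its two spatial axes -> (c_l,)."""
    return feature_map.mean(axis=(0, 1))


rng = np.random.default_rng(0)

# Stand-ins for the feature maps a_l of one frame at L = 3 depths.
shapes = [(56, 56, 128), (28, 28, 256), (14, 14, 512)]
feature_maps = [rng.random(s) for s in shapes]

# One vector f_{D_l} per depth for this frame.
f_D = [global_average_pool(a) for a in feature_maps]
print([v.shape for v in f_D])  # [(128,), (256,), (512,)]

# Stacking the pooled vectors of all frames row by row gives M_l.
n_train = 4  # tiny illustrative value
M = [
    np.stack([global_average_pool(rng.random(s)) for _ in range(n_train)])
    for s in shapes
]
print([m.shape for m in M])  # [(4, 128), (4, 256), (4, 512)]
```

Because the pooled values are averages of nonnegative activations, each $M_l$ stays nonnegative, which is what allows it to be combined with NMF factors later.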
3.1.2. Topic-Based Representation
In parallel, we try to capture semantic information with a topic model. Specifically, we apply multilayer NMF, since multiple layers have been shown to improve performance by capturing more semantic features [38]. We adopt an approach similar to [39] by considering a frame as a document and extracting a topic distribution per document; however, we apply multilayer NMF for hierarchical topic modeling. Single-layer NMF decomposes a nonnegative matrix $V$ into two low-rank nonnegative matrices, a basis matrix $W$ and a coefficient matrix $H$:
$$V = WH', \quad V \in \mathbb{R}^{m \times n},\; W \in \mathbb{R}^{m \times k},\; H \in \mathbb{R}^{n \times k}, \tag{7}$$
where $H$ is the new low-dimensional representation for $V$. The decomposition is solved as an optimization problem through a multiplicative update approach. In multilayer NMF, the latent representation computed in a preceding layer is decomposed hierarchically in the subsequent layers. Consider $X_{\mathrm{train\text{-}pca}} = \mathrm{PCA}(X_{\mathrm{train\text{-}vec}})$ with $X_{\mathrm{train\text{-}pca}} = [x_1, x_2, \ldots, x_{n_{\mathrm{Train}}}]^{T} \in \mathbb{R}^{n_{\mathrm{Train}} \times D_0}$, where PCA is applied to each vectorized frame to decrease its dimensionality from $m_0 \times n_0$ to $D_0 < m_0 \times n_0$, and the result is scaled to stay in the range $[0, 1]$. Let $H_0 = X_{\mathrm{train\text{-}pca}}$ be the input to the first stage of multilayer NMF. Then, it can be decomposed as $H_0 = W_1 H_1$. Instead of directly applying the second NMF to $H_1$ as the new low-dimensional representation, $H_1$ is processed into $V_1$ before being passed to the next layer. $V_l$ is computed as $V_l = f(H_l, M_l)$, $l = 1, \ldots, L$, where $f(\cdot)$ is a nonlinear function such as softmax and $M_l$ is the feature representation from the pretrained VGG-Net at layer $l$:
$$V_l = f(H_l, M_l) = W_{l+1} H'_{l+1}, \quad W_{l+1} \in \mathbb{R}^{D_{l-1} \times D_l},\; H_{l+1} \in \mathbb{R}^{n_{\mathrm{Train}} \times D_l}. \tag{8}$$
Here, we use softmax as the nonlinear function to obtain a distribution-like representation. Since the ReLU activation function is applied in the deep architecture, nonnegativity is preserved. Bringing the matrices $M_l$ into the multilayer NMF decomposition yields both high-level and semantic information, which can improve the performance of subsequent tasks. By decomposing $V_l$ in the next layer, we force the architecture to learn how to combine information from the previous layer; therefore, $D_l < D_{l-1}$. By training each NMF layer separately to learn $W_l$ and $H_l$, the ultimate data representation $V_L$ is acquired. Finally, $V_L$ integrates features from both the deep model and the topic model.
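A minimal sketch of one NMF layer with multiplicative updates is given below, following the convention $V \approx WH'$ of (7) and the standard Frobenius-norm update rules. It is followed by a row-wise softmax that turns each row of $H$ into a distribution. The paper does not spell out how $H_l$ and $M_l$ are combined inside $f$, so that fusion step is deliberately left out; the function names, the toy matrix, and the iteration count are illustrative assumptions.

```python
import numpy as np


def nmf_multiplicative(V, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative V (m x n) as W @ H.T with W (m x k), H (n x k).

    Uses multiplicative updates for the Frobenius-norm objective; the
    reconstruction error after every iteration is returned for inspection.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((n, k)) + eps
    errors = []
    for _ in range(n_iter):
        # Elementwise ratios keep both factors nonnegative.
        H *= (V.T @ W) / (H @ (W.T @ W) + eps)
        W *= (V @ H) / (W @ (H.T @ H) + eps)
        errors.append(np.linalg.norm(V - W @ H.T))
    return W, H, errors


def softmax_rows(X):
    """Row-wise softmax, giving one distribution per row."""
    Z = X - X.max(axis=1, keepdims=True)  # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)


# Toy nonnegative matrix of rank 4 (illustrative data only).
rng = np.random.default_rng(1)
V = rng.random((30, 4)) @ rng.random((4, 20))

W, H, errors = nmf_multiplicative(V, k=4)
topics = softmax_rows(H)

print(errors[0], errors[-1])    # the error shrinks over the iterations
print(topics.sum(axis=1)[:3])   # each row sums to 1
```

Stacking several such layers, with $D_l < D_{l-1}$ and each layer trained separately, gives the multilayer structure described above.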
3.2. Anomaly Detection
Upon training completion, $V_L \in \mathbb{R}^{n_{\mathrm{Train}} \times D_L}$ is acquired from the normal frames in the training set. We apply the K-means algorithm to $V_L$ to find $K$ cluster centroids as representatives of normality. We therefore have $K$ cluster centroids $s_i$, $i = 1, \ldots, K$, which are used in cluster-based anomaly detection. Each test frame $x_{\mathrm{test}}$ is fed to the feature extraction block learned in the training phase, and its ultimate representation $V_{L,\mathrm{test}}$ is acquired. $V_{L,\mathrm{test}}$ can be considered the final topic distribution for $x_{\mathrm{test}}$. $V_{L,\mathrm{test}}$ is compared to each $s_i$, and a frame whose wavelet EMD distance exceeds the threshold $th$ is detected as an anomaly:
$$\min_{i = 1, \ldots, K} d_{\mathrm{EMD}}\!\left(V_{L,\mathrm{test}},\, s_i\right) > th \;\Longrightarrow\; V_{L,\mathrm{test}} \text{ is an abnormal frame.} \tag{9}$$
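The clustering and decision steps can be sketched as follows. This is a toy illustration under several assumptions that are not from the paper: a plain Lloyd-style K-means written in NumPy, synthetic 4-dimensional topic distributions drawn around two "normal" modes, $K = 2$ instead of 50, and a simple CDF-based 1D EMD standing in for the wavelet EMD. The threshold value is used purely for illustration.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd iterations; returns the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


def emd_1d(p, q):
    """Stand-in distance: EMD between two unit-mass histograms on the same bins."""
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())


def is_anomalous(v_test, centroids, th):
    """Flag a frame when even its closest normal centroid is farther than th."""
    return min(emd_1d(v_test, s) for s in centroids) > th


# Synthetic "normal" topic distributions around two modes (illustrative only).
rng = np.random.default_rng(0)
mode_a = np.array([0.7, 0.1, 0.1, 0.1])
mode_b = np.array([0.1, 0.7, 0.1, 0.1])
V_L = np.vstack([
    rng.dirichlet(200 * mode_a, size=100),
    rng.dirichlet(200 * mode_b, size=100),
])

centroids = kmeans(V_L, k=2)
th = 0.33  # illustrative threshold

print(is_anomalous(mode_a, centroids, th))                              # normal-looking frame
print(is_anomalous(np.array([0.05, 0.05, 0.05, 0.85]), centroids, th))  # unusual frame
```

Taking the minimum over centroids means a frame only has to resemble one of the normal clusters to be accepted, which is what lets several distinct kinds of normal behavior coexist.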
4. Results and Discussion
We conducted our experimental analysis on the UCSD dataset, one of the benchmark datasets in crowd anomaly detection, which was introduced in [40] and recorded with a static camera at 10 fps. This dataset contains two scenes, Ped1 and Ped2, each of which is split into training and test sequences. Nonpedestrian objects, such as bikers, skaters, and small carts, are considered anomalies. More details about this dataset are provided in Table 1. Typical normal and abnormal sample frames for the Ped1 and Ped2 datasets are shown in Figure 2.
Table 1: UCSD dataset in detail.

| Dataset | Resolution | Number of training sequences | Number of test sequences |
|---|---|---|---|
| Ped1 | 158×238 | 34 (~200 images) | 36 (~200 images) |
| Ped2 | 240×360 | 16 (120~200 images) | 12 (120~200 images) |
Figure 2: Typical normal and abnormal samples of the UCSD dataset. Left to right: normal frame and abnormal frame for Ped1, and normal frame and abnormal frame for Ped2.
When originally introduced, VGG [37] was trained on the ImageNet dataset, which consists only of object classes; more recently, a VGG model pretrained on both the ImageNet and Places datasets has been provided, which covers scene classes as well. The 1000 classes from ImageNet and the 365 classes from Places365Standard [41] were merged to train a VGG16-based model (Hybrid1365-VGG [42]). We use this VGG model pretrained on both ImageNet and Places to improve the capability of our deep feature extraction block to capture both object and scene features. Our algorithms were implemented in Python and run on a PC with a 2.9 GHz Core i5 CPU, a GTX 1080 GPU, and 16 GB of RAM. Original frames are resized to 224×224×3, the input size accepted by VGG. Feature maps from three depths of the VGG architecture, namely block2_pool, block3_pool, and block4_pool, were extracted, giving 56×56×128, 28×28×256, and 14×14×512 feature maps, respectively. We then applied global average pooling to each feature map separately, which results in the representation vectors $f_{D_1}$ (128-D), $f_{D_2}$ (256-D), and $f_{D_3}$ (512-D) in hierarchical order. In parallel, we applied multilayer NMF with $L = 3$ to our training set after reducing its dimensionality with PCA (a 2000-D vector per frame). $W_0$, $W_1$, and $W_2$ are learned separately with multiplicative updates. $D_1$, $D_2$, and $D_3$ are chosen as 512, 256, and 128, respectively. K-means clustering with $K = 50$ is applied to the final representation $V_L \in \mathbb{R}^{n_{\mathrm{Train}} \times D_L}$ to generate typical representative centroids. In the UCSD dataset, $n_{\mathrm{Train}} = 6800$ for Ped1 and $n_{\mathrm{Train}} = 2550$ for Ped2.
Our experiments involve several parameters whose values we investigated and then fixed after evaluation. These parameters are shown in Table 2.
Table 2: Fixed parameters used in the proposed algorithm.

| Dataset/parameters | Ped1 | Ped2 |
|---|---|---|
| Number of training samples | 6800 | 2550 |
| L (number of levels for feature hierarchies) | 3 | 3 |
| K (K-means clustering) | 50 | 50 |
| Threshold (for WEMD distance comparison) | 0.33 | 0.24 |
VGG16 consists of several layers (C11-C12-P1-C21-C22-P2-C31-C32-C33-P3-C41-C42-C43-P4-C51-C52-C53-P5-FC1-FC2-FC3). The convolutional and fully connected layers have trainable parameters. The last three fully connected layers provide task-specific features, so we focus on the first 5 convolution layers. We chose $L = 3$ as a trade-off between accuracy and complexity. The number of clusters in K-means clustering was evaluated for $K = 30, 40, 50, 60$ and set to $K = 50$ based on accuracy. We chose the threshold for the WEMD comparison based on the average distance of the training sample representations, since the training dataset consists only of normal samples.
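One way to read the threshold rule above is sketched below: compute, for every training representation, the distance to its nearest centroid, and take the average. The exact procedure is not detailed in the text, so this is an assumption; the optional `scale` factor, the function names, the toy data, and the CDF-based 1D EMD used in place of the wavelet EMD are all hypothetical.

```python
import numpy as np


def emd_1d(p, q):
    """Stand-in distance: EMD between two unit-mass histograms on the same bins."""
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())


def threshold_from_training(V_train, centroids, scale=1.0):
    """Average nearest-centroid distance over the (normal) training frames.

    `scale` is a hypothetical safety factor; scale=1.0 gives the plain average.
    """
    nearest = [min(emd_1d(v, s) for s in centroids) for v in V_train]
    return scale * float(np.mean(nearest))


# Tiny illustrative example: two training frames, one centroid.
V_train = np.array([[0.6, 0.4], [0.4, 0.6]])
centroids = np.array([[0.5, 0.5]])

th = threshold_from_training(V_train, centroids)
print(th)
```

Because the training set contains only normal frames, this average describes how far a typical normal frame sits from its cluster, and test frames well beyond it are candidates for anomalies.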
For Ped1, we compare our proposed approach both to traditional methods (SRC [6], MPPCA [43], and MDT [40]) and to high-level deep learning-based methods (AVID [19], Sabokrou [8], and deep cascade [16]). As introduced and calculated in [26], evaluation metrics such as the equal error rate (EER) and the area under the curve (AUC) are computed at the frame level and compared to those of the state-of-the-art methods. The EER is the operating point at which the false positive rate equals the false negative rate; the lower the EER, the higher the accuracy. The EER of our proposed approach is compared with that of previous methods in Table 3 for Ped1. The results show comparable performance for our proposed method. In addition, the AUC, i.e., the area under the ROC curve, is computed and compared to the state of the art. The proposed approach also achieves the highest AUC among the compared methods.
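The two frame-level metrics can be sketched in plain Python as follows. The scores and labels are made-up toy values (a higher score means "more anomalous", label 1 marks an abnormal frame), and the EER is approximated by the ROC point at which the false positive and false negative rates are closest, without interpolation.

```python
def roc_points(scores, labels):
    """Sweep the threshold from high to low and return (fpr, tpr) lists."""
    pos = sum(labels)
    neg = len(labels) - pos
    fpr, tpr = [0.0], [0.0]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
        for i in range(len(fpr) - 1)
    )


def eer(fpr, tpr):
    """Approximate EER: ROC point where FPR and FNR = 1 - TPR are closest."""
    i = min(range(len(fpr)), key=lambda j: abs(fpr[j] - (1 - tpr[j])))
    return (fpr[i] + (1 - tpr[i])) / 2


# Toy frame-level anomaly scores and ground-truth labels (illustrative only).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

fpr, tpr = roc_points(scores, labels)
print(auc(fpr, tpr))  # 0.8125
print(eer(fpr, tpr))  # 0.25
```

In this toy example 13 of the 16 abnormal/normal frame pairs are ranked correctly, which is where the AUC of 0.8125 comes from.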
Table 3: Comparison of EER and AUC performance (%) for the UCSD Ped1 dataset at the frame level.

| Method | EER | AUC |
|---|---|---|
| SRC [6] | 19 | 86 |
| MPPCA [43] | 40 | 59 |
| MDT [40] | 25 | 81.8 |
| AVID [19] | 12.3 | - |
| Sabokrou [8] | 8.4 | 93.2 |
| Deep cascade [16] | 9.1 | - |
| Proposed approach | 8.1 | 93.9 |
The Ped1 dataset suffers from perspective distortion, and for this reason, most studies have been conducted on Ped2. For Ped2, we compare our proposed approach both to traditional methods (SF [5], MPPCA [43], and MDT [40]) and to high-level deep learning-based methods (Conv-AE [44], AVID [19], deep anomaly [18], deep cascade [16], ALOCC [17], and ST-AE [45]). The EER of our proposed approach is compared with that of previous methods in Table 4 for Ped2. The results show comparable performance for our proposed method. In addition, the AUC is computed and compared to the state of the art, and the proposed approach achieves the highest AUC among the compared methods.
Table 4: Comparison of EER and AUC performance (%) for the UCSD Ped2 dataset at the frame level.

| Method | EER | AUC |
|---|---|---|
| SF [5] | 42 | 63 |
| MPPCA [43] | 36.0 | 71 |
| MDT [40] | 24.0 | 85 |
| Conv-AE [44] | 21.7 | 90 |
| AVID [19] | 14. | - |
| Deep anomaly [18] | 13.5 | - |
| Deep cascade [16] | 9. | - |
| ALOCC [17] | 13 | - |
| ST-AE [45] | 12.0 | 87.4 |
| Proposed approach | 6.1 | 97.3 |
Moreover, we evaluated accuracy as
$$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}. \tag{10}$$
The results, shown in Table 5 for the Ped1 and Ped2 datasets, indicate the high performance of our proposed method.
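For reference, the accuracy in (10) can be computed from frame-level decisions as in the short sketch below. The predicted and ground-truth flags are made-up toy values (`True` means "abnormal frame"), and the helper names are illustrative.

```python
def confusion_counts(predicted, actual):
    """Count TP, TN, FP, FN for boolean frame-level decisions."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of frames that were classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no frames to evaluate")
    return (tp + tn) / total


# Toy decisions for ten frames (illustrative only).
predicted = [True, True, False, False, True, False, False, False, True, False]
actual = [True, True, False, False, False, False, True, False, True, False]

tp, tn, fp, fn = confusion_counts(predicted, actual)
print(tp, tn, fp, fn)            # 3 5 1 1
print(accuracy(tp, tn, fp, fn))  # 0.8
```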
Table 5: Accuracy (%) for the Ped1 and Ped2 datasets.

| Dataset/criteria | Ped1 | Ped2 |
|---|---|---|
| Accuracy | 90.3 | 95.4 |
5. Conclusions
In this paper, we presented a new semantic and statistical distance-based approach to crowd anomaly detection at the frame level. In particular, inspired by the earth mover distance previously applied to low-level vision features, we applied this statistical distance to features learned hierarchically through a pretrained deep convolutional neural network and a topic model. Features from VGG-Net, pretrained on a hybrid dataset (the Places and ImageNet datasets), and from multilayer NMF, as semantic and interpretable features, were computed in combination as a hierarchical representation and used in clustering-based anomaly detection with the wavelet EMD statistical distance. Experimental results on the UCSD Ped1 and Ped2 datasets show that the proposed approach achieves a higher AUC than the compared methods. In the future, we will investigate anomaly localization by patch analysis through the convolutional kernel network (CKN) [46] and EMD in a similar framework.
Data Availability
The readers can access the UCSD Ped1 and Ped2 datasets in http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
References
[1] T. Li, H. Chang, M. Wang, B. Ni, R. Hong, and S. Yan, "Crowded scene analysis: a survey," vol. 25, no. 3, pp. 367-386, 2015. doi:10.1109/tcsvt.2014.2358029
[2] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," pp. 3111-3119, 2013.
[3] P. Wiriyathammabhum, D. Summers-Stay, C. Fermuller, and Y. Aloimonos, "Computer vision and natural language processing: recent approaches in multimedia and robotics," vol. 49, no. 4, pp. 1-44, 2016.
[4] R. Wang, K. Nie, T. Wang, Y. Yang, and B. Long, "Deep learning for anomaly detection," Proceedings of the 13th International Conference on Web Search and Data Mining, Houston, TX, USA, pp. 894-896, 2020.
[5] R. Mehran, A. Oyama, and M. Shah, "Abnormal crowd behavior detection using social force model," 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, pp. 935-942, 2009.
[6] Y. Cong, J. Yuan, and J. Liu, "Sparse reconstruction cost for abnormal event detection," CVPR 2011, Colorado Springs, CO, USA, pp. 3449-3456, 2011.
[7] X. Zhu, J. Liu, J. Wang, C. Li, and H. Lu, "Sparse representation for robust abnormality detection in crowded scenes," vol. 47, no. 5, pp. 1791-1799, 2014. doi:10.1016/j.patcog.2013.11.018
[8] M. Sabokrou, M. Fathy, Z. Moayed, and R. Klette, "Fast and accurate detection and localization of abnormal behavior in crowded scenes," vol. 28, no. 8, pp. 965-985, 2017. doi:10.1007/s00138-017-0869-8
[9] K. Singh, S. Rajora, D. K. Vishwakarma, G. Tripathi, S. Kumar, and G. S. Walia, "Crowd anomaly detection using aggregation of ensembles of fine-tuned ConvNets," vol. 371, pp. 188-198, 2020. doi:10.1016/j.neucom.2019.08.059
[10] B. R. Kiran, D. M. Thomas, and R. Parakkal, "An overview of deep learning based methods for unsupervised and semisupervised anomaly detection in videos," vol. 4, no. 2, p. 36, 2018. doi:10.3390/jimaging4020036
[11] L. Wang, Y. Qiao, and X. Tang, "Action recognition with trajectory-pooled deep-convolutional descriptors," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, pp. 4305-4314, 2015.
[12] Z. Xu, Y. Yang, and A. G. Hauptmann, "A discriminative CNN video representation for event detection," IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, pp. 1798-1807, 2015.
[13] G. Wimmer, A. Vécsei, M. Häfner, and A. Uhl, "Fisher encoding of convolutional neural network features for endoscopic image classification," vol. 5, no. 3, article 034504, 2018. doi:10.1117/1.jmi.5.3.034504
[14] M. Sabokrou, M. Fathy, M. Hoseini, and R. Klette, "Real-time anomaly detection and localization in crowded scenes," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, pp. 56-62, 2015.
[15] M. Sabokrou, M. Fathy, and M. Hoseini, "Video anomaly detection and localisation based on the sparsity and reconstruction error of auto-encoder," vol. 52, no. 13, pp. 1122-1124, 2016. doi:10.1049/el.2016.0440
[16] M. Sabokrou, M. Fayyaz, M. Fathy, and R. Klette, "Deep-cascade: cascading 3d deep neural networks for fast anomaly detection and localization in crowded scenes," vol. 26, no. 4, pp. 1992-2004, 2017. doi:10.1109/TIP.2017.2670780
[17] M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli, "Adversarially learned one-class classifier for novelty detection," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 3379-3388, 2018.
[18] M. Sabokrou, M. Fayyaz, M. Fathy, Z. Moayed, and R. Klette, "Deep-anomaly: fully convolutional neural network for fast anomaly detection in crowded scenes," vol. 172, pp. 88-97, 2018. doi:10.1016/j.cviu.2018.02.006
[19] M. Sabokrou, M. Pourreza, M. Fayyaz, R. Entezari, M. Fathy, J. Gall, and E. Adeli, "Avid: adversarial visual irregularity detection," Asian Conference on Computer Vision, Springer, Cham, pp. 488-505, 2018.
[20] M. Sabokrou, M. Khalooei, and E. Adeli, "Self-supervised representation learning via neighborhood-relational encoding," Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), pp. 8010-8019, 2019.
[21] M. Sabokrou, M. Fathy, G. Zhao, and E. Adeli, "Deep end-to-end one-class classifier," vol. 32, no. 2, pp. 675-684, 2021. doi:10.1109/tnnls.2020.2979049
[22] L. Deecke, R. Vandermeulen, L. Ruff, S. Mandt, and M. Kloft, "Image anomaly detection with generative adversarial networks," Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, Cham, pp. 3-17, 2018.
[23] L. Ruff, R. A. Vandermeulen, N. Görnitz, A. Binder, E. Müller, K. R. Müller, and M. Kloft, "Deep semi-supervised anomaly detection," arXiv preprint arXiv:1906.02694, 2019.
[24] R. Alghamdi and K. Alfalqi, "A survey of topic modeling in text mining," vol. 6, no. 1, pp. 147-153, 2015. doi:10.14569/ijacsa.2015.060121
[25] X. Wang, X. Ma, and E. Grimson, "Unsupervised activity perception by hierarchical Bayesian models," 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, pp. 1-8, 2007.
[26] X. Wang and E. Grimson, "Spatial latent Dirichlet allocation," Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, USA; London, UK, pp. 1577-1584, 2008.
[27] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, "Hierarchical Dirichlet processes," vol. 101, no. 476, pp. 1566-1581, 2006. doi:10.1198/016214506000000302
[28] J. C. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised learning of human action categories using spatial-temporal words," vol. 79, no. 3, pp. 299-318, 2008. doi:10.1007/s11263-007-0122-4
[29] O. Isupova, D. Kuzin, and L. Mihaylova, "Learning methods for dynamic topic modeling in automated behavior analysis," vol. 29, no. 9, pp. 3980-3993, 2018. doi:10.1109/TNNLS.2017.2735364
[30] A. Bosch, A. Zisserman, and X. Munoz, "Scene classification via pLSA," Springer, Berlin, Heidelberg, pp. 517-530, 2006.
[31] O. P. Popoola and K. Wang, "Video-based abnormal human behavior recognition: a review," vol. 42, no. 6, pp. 865-878, 2012. doi:10.1109/TSMCC.2011.2178594
[32] D. Pathak, A. Sharang, and A. Mukerjee, "Anomaly localization in topic based analysis of surveillance videos," 2015 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, pp. 389-395, 2015.
[33] B. Du and L. Zhang, "A discriminative metric learning based anomaly detection method," vol. 52, no. 11, pp. 6844-6857, 2014. doi:10.1109/tgrs.2014.2303895
[34] Y. Rubner, C. Tomasi, and L. J. Guibas, "The earth mover's distance as a metric for image retrieval," vol. 40, no. 2, pp. 99-121, 2000. doi:10.1023/A:1026543900054
[35] M. A. Ruzon and C. Tomasi, "Edge, junction, and corner detection using color distributions," vol. 23, no. 11, pp. 1281-1295, 2001. doi:10.1109/34.969118
[36] S. Shirdhonkar and D. W. Jacobs, "Approximate earth mover's distance in linear time," 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, pp. 1-8, 2008.
[37] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[38] H. A. Song, B. K. Kim, T. L. Xuan, and S. Y. Lee, "Hierarchical feature extraction by multi-layer non-negative matrix factorization network for classification task," vol. 165, pp. 63-74, 2015. doi:10.1016/j.neucom.2014.08.095
[39] X. Wan, "A novel document similarity measure based on earth mover's distance," vol. 177, no. 18, pp. 3718-3730, 2007. doi:10.1016/j.ins.2007.02.045
[40] W. Li, V. Mahadevan, and N. Vasconcelos, "Anomaly detection and localization in crowded scenes," vol. 36, no. 1, pp. 18-32, 2014. doi:10.1109/tpami.2013.111
[41] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: a 10 million image database for scene recognition," vol. 40, no. 6, pp. 1452-1464, 2018. doi:10.1109/tpami.2017.2723009
[42] L. Wang, Z. Wang, W. Du, and Y. Qiao, "Object-scene convolutional neural networks for event recognition in images," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, pp. 30-35, 2015.
[43] J. Kim and K. Grauman, "Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR '09), Miami, Fla, USA, pp. 2921-2928, 2009.
[44] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, "Learning temporal regularity in video sequences," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp. 733-742, 2016.
[45] Y. Zhao, B. Deng, C. Shen, Y. Liu, H. Lu, and X. S. Hua, "Spatio-temporal autoencoder for video anomaly detection," Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, California, USA, pp. 1933-1941, 2017.
[46] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid, "Convolutional kernel networks," pp. 2627-2635, 2014.