A New Semantic and Statistical Distance-Based Anomaly Detection in Crowd Video Surveillance

Fariba Rezaei (https://orcid.org/0000-0002-8041-3643) and Mehran Yazdi (https://orcid.org/0000-0002-8889-7048)

School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran (shirazu.ac.ir)

Academic Editor: Alireza Jolfaei

Research Article. Wireless Communications and Mobile Computing (WCMC), Hindawi, 17 May 2021. ISSN: 1530-8677, 1530-8669. Article ID 5513582. DOI: 10.1155/2021/5513582.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Attention toward autonomous surveillance has recently intensified, and anomaly detection in crowded scenes is one of its most significant tasks. Traditional approaches extract handcrafted features, which must be followed by a separate model learning step. They mostly capture low-level spatiotemporal features of videos and neglect semantic information. Deep learning (DL) methods have recently emerged in various domains, especially CNNs for visual problems, with the ability to extract high-level information at the higher layers of their architectures. On the other hand, topic modeling-based approaches such as NMF can extract more semantic representations. Here, we investigate a new hybrid visual embedding method based on deep features and a topic model for anomaly detection. Features per frame are computed hierarchically through a pretrained deep model, and in parallel, topic distributions are learned through multilayer nonnegative matrix factorization (NMF) that incorporates information from the extracted deep features. Training uses normal samples only. Thereafter, K-means is applied to find typical normal clusters. At test time, after the deep feature representation and topic distribution of each test frame are obtained, the earth mover distance (EMD), a statistical distance, is evaluated to measure the difference between the normal cluster centroids and the test topic distributions. A distance above a threshold is detected as an anomaly. Experimental results on the benchmark UCSD Ped1 and Ped2 datasets demonstrate the effectiveness of the proposed method in anomaly detection.

1. Introduction

Automatic video surveillance has recently attracted the attention of researchers, since the large number of cameras installed in the surrounding environment makes error-free human-based surveillance impractical. Thus, computer vision and machine learning help analyze the output videos for various tasks of automatic recognition and anomaly detection. Originally, raw signals were used to extract information through machine learning techniques [1]. However, the high dimensionality of video signals captured by high-resolution cameras makes traditional methods computationally complex. To combat the curse of dimensionality, dimensionality reduction techniques have therefore received more attention. Linear and nonlinear dimensionality reduction approaches can be applied as task-dependent techniques; PCA, MDS, LLE, and autoencoders are a few examples. Generally speaking, computer vision feature extraction methods such as handcrafted features (SIFT, HOG, etc.) can also be considered a kind of dimensionality reduction.

Emerging embedding methods, originally introduced in natural language processing (NLP), map the original high-dimensional signals to embedding spaces and thereby capture high-level information; besides compressing the signals, they also preserve their semantic relations [2, 3]. Embedding techniques in NLP represent each word as a vector in a vector space model. Basic one-hot encoding fails to preserve semantic relations, since the orthogonality between word vectors neglects any coherence between the words. Topic-based representations such as LSA, probabilistic LSA, LDA, and NMF try to capture semantics [3].

Embedding can also be applied to vision tasks to bridge the semantic gap in image or video analysis. Recently, deep learning architectures (CNN, RNN, AE, RBM, etc.) have been well studied for anomaly detection [4]. By exploiting high-level features, they have shown considerable improvements over handcrafted features. Supervised CNNs consist of convolutional layers for feature extraction and fully connected (FC) layers for classification/recognition. Most of the parameters of a CNN belong to these final FC layers, which may cause overfitting on limited datasets when training from scratch. Therefore, attention has shifted toward using only the pretrained convolutional layers for feature extraction and powerful image representations, setting the FC layers aside.

In most studies, anomaly detection is investigated by defining one or more models on normal samples and detecting anomalies as deviations from this normality. This deviation can be measured either by likelihood or by similarity. In [5], an anomaly was defined based on interaction forces between pedestrians using the social force model (SFM), and LDA was used to compute the likelihood of the test set to evaluate deviation from a normal model in a probabilistic framework. In [6, 7], by contrast, normal training samples were used to create a dictionary model, and deviation was calculated as a high sparse reconstruction cost between an original test sample and its reconstruction through a linear combination of normal bases in the Euclidean space.

In this paper, we investigate a combination of a deep model, a topic model, and a statistical distance for anomaly detection. In contrast to previous methods, which were based on either handcrafted or deep features and neglected semantic and interpretable information, we combine a deep model with a topic model hierarchically to produce a semantic representation. We apply a pretrained deep model for hierarchical feature extraction from different layer levels for each training image. Thereafter, we take advantage of nonnegative matrix factorization (NMF) as a topic modeling approach for capturing semantic features. Specifically, we apply a multilayer NMF for hierarchical topic representation, injecting the information extracted from the hierarchical layers of the deep model into the hierarchical decompositions. After learning the topic distribution per frame in the training stage, we apply K-means clustering to compute cluster centroids as typical normal topic-based representations. At test time, using the same feature extraction pipeline as in the training stage, the semantic representation of each test frame is calculated and compared to the typical normal topic distributions through a statistical distance metric. Here, the earth mover distance (EMD) is chosen as the distance metric, since it has shown efficient performance in comparing distributions.

Our main contributions are as follows:

(1) We take advantage of both the deep model (pretrained VGG-Net) and the topic model (multilayer NMF), hierarchically and in combination, to reach a high-level and semantic frame representation.

(2) Since topic distributions are extracted at the final level as the frame representations, K-means clustering yields representative normal topic distributions, and the EMD statistical distance metric is then applied in a clustering-based anomaly detection framework.

The rest of this paper is organized as follows: a literature review in the three domains of anomaly detection, topic modeling, and statistical distance is provided in Section 2. Section 3 introduces our proposed pipeline for crowd anomaly detection. Experimental results are reported in Section 4. Finally, Section 5 concludes this paper.

2. Literature Review

In this section, we review research on anomaly detection, topic modeling, and statistical distance separately.

2.1. Anomaly Detection

Video surveillance studies on anomaly detection started with traditional handcrafted feature extraction and model learning and improved over the years by applying end-to-end deep architectures. Formerly, low-level features such as color, texture, and their variants, like mixture of dynamic texture (MDT), SIFT, SURF, optical flow, and trajectories, were extracted from appearance, motion, or both, depending on the anomaly definition. At the model learning stage, binary classifiers such as SVM, decision tree, and NN have been applied in supervised scenarios [1]. In semisupervised and unsupervised scenarios, however, given only normal videos at the training stage, a model of normal behavior is created and an anomaly is detected as a deviation from this model. This has been done, for instance, by one-class SVM (OCSVM) or by fitting a Gaussian model on normal samples. Some researchers exploited the inherent sparsity of vision: a dictionary was learned from normal samples, and at test time, a large reconstruction error was interpreted as an anomaly. Reconstruction was done as a linear combination of dictionary bases that are representative of all normal samples. The dictionary can be learned offline through codebook generation or online by updating it as new normal samples are observed [8].

Recently, deep learning methods have entered practical domains such as vision, text, and speech. The intermediate image representations learned by CNNs, especially when trained on large-scale datasets like ImageNet, have proven to be powerful image descriptors.

In [9], anomalous behaviors were captured through a novel concept of aggregation of ensembles (AOE), based on fine-tuning different pretrained ConvNets and a pool of classifiers. The authors assumed that different CNN architectures learn different levels of representation from crowd videos, and thus, an ensemble of CNNs enables enriched feature sets to be extracted. Autoencoder-based architectures were also studied, where a large reconstruction error was considered a sign of anomaly. The autoencoder can reduce dimensionality and is widely used in unsupervised learning problems or as the preliminary stage of a supervised task [10]. In particular, after training an AE or sparse AE on normal samples, the bottleneck layer can be used as a feature extraction layer for any test sample. Some researchers tried to incorporate both handcrafted and deep features in a unified configuration. In [11], a trajectory-pooled deep convolutional descriptor was introduced, combining dense trajectories and convolutional feature maps, which results in highly discriminative features. Convolutional networks outperform both traditional low-level features and their compositional forms like BoW, Fisher kernel, and VLAD [12], although the two are sometimes used cooperatively. In [12], features extracted from within the layers of a convolutional network were encoded with VLAD to compress the data and subsequently fed to an SVM for classification. Wimmer et al. [13] applied Fisher vector encoding to the output feature maps of a CNN to find a fixed-length representation for image classification.

Sabokrou et al. investigated video anomaly detection through different deep architectures [14–21]. Autoencoder-based anomaly detection and localization using sparsity was introduced in [14, 15]. An architecture based on a deep 3D autoencoder, a deeper 3D convolutional neural network (CNN), and two cascaded classifiers was proposed in [16] for anomaly detection. High-speed and accurate detection and localization of anomalies were achieved in [18] using fully convolutional neural networks (FCNs) and cascaded outlier detection. Some studies applied generative adversarial networks and their variants to image anomaly detection [17, 19, 22]. Semisupervised anomaly detection was analyzed in [23] based on information theory. A novel self-supervised representation learning method, based on integrating a neighbourhood-relational encoding (NRE) among the training data with an encoder-decoder structure, was proposed in [20]. In [21], an adversarial training approach was proposed to detect out-of-distribution samples in an end-to-end model by jointly training two deep neural networks that collaborate at test time to detect novelties.

2.2. Topic Modeling

Topic modeling is an unsupervised method that was originally introduced for text analysis but has also attracted attention in vision. It is based on the idea that documents with similar contents will likely use a similar set of words, which are indicated by topics. Given an unlabeled collection of documents made up of words, topic modeling discovers patterns as a low-dimensional latent representation. pLSA, LDA, and NMF are among the most common probabilistic topic modeling approaches [24–26]. Topic models take as input a set of documents $J$, a set of words $V$, and a co-occurrence matrix of words and documents $F = (n_{wj})_{w \in V, j \in J}$ (or a BoVW representation), and produce a set of topics $T$, or more specifically $p(w \mid k)$ and $p(k \mid j)$ for $w \in V$, $j \in J$, $k \in T$, as the word distribution per topic and the topic distribution per document, respectively. Here, $n_{wj}$ is the number of times the word $w$ appears in document $j$; documents can then be represented as mixtures of topics.

$F$ can be decomposed into two matrices, $F = \Phi\Theta$, where $\Phi = (\phi_{wk})_{w \in V, k \in T}$ is a word-topic matrix with $\phi_{wk} = p(w \mid k)$ and $\phi_k = (\phi_{wk})_{w \in V}$, and $\Theta = (\theta_{kj})_{k \in T, j \in J}$ is a topic-document matrix with $\theta_{kj} = p(k \mid j)$ and $\theta_j = (\theta_{kj})_{k \in T}$. The decomposition can be solved by various topic model algorithms under different assumptions. For instance, LDA uses a predefined number of topics, whereas the hierarchical Dirichlet process (HDP) [27] estimates the best number of topics from the training dataset.
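As a small illustration of this decomposition, the sketch below multiplies a toy word-topic matrix by a toy topic-document matrix and checks that each resulting column is again a distribution, that is, $p(w \mid j) = \sum_k p(w \mid k)\, p(k \mid j)$. All sizes and values are assumptions chosen for illustration, and the product is read as approximating the column-normalized count matrix rather than the raw counts.

```python
# Toy sizes (assumed): 4 words, 2 topics, 3 documents.
# phi[w][k] = p(w | k): each topic column sums to 1 over the words.
phi = [
    [0.5, 0.0],
    [0.3, 0.1],
    [0.2, 0.3],
    [0.0, 0.6],
]
# theta[k][j] = p(k | j): each document column sums to 1 over the topics.
theta = [
    [1.0, 0.5, 0.2],
    [0.0, 0.5, 0.8],
]


def matmul(a, b):
    """Plain-Python product of a (r x s) and b (s x t)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# p(w | j) = sum_k p(w | k) * p(k | j): every document is a mixture of topics.
p_word_given_doc = matmul(phi, theta)

# Each document column of the product is itself a distribution over words.
column_sums = [sum(row[j] for row in p_word_given_doc) for j in range(3)]
```

The second document mixes both topics equally, so its word distribution is the average of the two topic columns of `phi`.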

In [28], Niebles et al. studied the application of latent topic models, namely, pLSA and LDA, to action categorization. Specifically, they extracted spatiotemporal interest points along the input volumes, followed by codebook generation. In an unsupervised fashion, they succeeded in detecting and localizing actions, which were considered latent topics. New learning algorithms based on EM and variational Bayes inference were proposed in [29] for activity analysis in videos, where activities and behaviors were described by a dynamic topic model. The authors also evaluated anomaly localization procedures in the topic modeling framework. In [30], scene classification was performed by discovering objects per image in an unsupervised fashion using pLSA; the object distribution in each image was subsequently used for scene classification with a supervised kNN. Topic modeling-based abnormal behavior recognition has been previously investigated in [5, 31]. In almost all cases, a low likelihood corresponds to an abnormal test sample. Unsupervised topic model (pLSA) anomaly detection and localization was studied in [32], using extra information on location and size besides quantized spatiotemporal gradient descriptors to create a more informative vocabulary over visual clips. Each document (frame) is fully described by a corresponding distribution over topics.

2.3. Statistical Distance

Statistical distances measure the distance between two statistical objects, and when they also satisfy properties such as symmetry, they are known as metrics. In the anomaly detection area, distance measures such as the Jensen-Shannon divergence or the Z-score have been applied to compare a query observation with the patterns extracted from normal samples [33]. An anomaly is detected by evaluating this distance against a threshold. As a powerful statistical distance, the earth mover distance (EMD), also known as the Wasserstein metric, has been applied in the image domain [34, 35] to compare two probability distributions, mainly based on low-level features such as color or texture. It computes a statistical distance between two signatures. A typical signature consists of a list of pairs:

$$S = \{(x_1, m_1), (x_2, m_2), \dots, (x_n, m_n)\}, \tag{1}$$

where each $x_i$ is a certain feature and $m_i$ is its mass (how many times that feature occurs in the record). Consider two signatures $P$ and $Q$ that contain $m$ and $n$ clusters, respectively:

$$P = \{(p_1, w_{p_1}), (p_2, w_{p_2}), \dots, (p_m, w_{p_m})\}, \tag{2}$$

$$Q = \{(q_1, w_{q_1}), (q_2, w_{q_2}), \dots, (q_n, w_{q_n})\}, \tag{3}$$

where $p_i$ ($q_j$) is the cluster representative and $w_{p_i}$ ($w_{q_j}$) is the weight of cluster $i$ ($j$). Also, consider $D = [d_{i,j}]$ as the ground distance between clusters $p_i$ and $q_j$; it can be chosen or learned according to the problem at hand. The aim is to find a flow matrix $F = [f_{i,j}]$, where $f_{i,j}$ is the flow between $p_i$ and $q_j$, such that the overall cost below is minimized under its related constraints:

$$\begin{aligned} \min \; & \sum_{i=1}^{m} \sum_{j=1}^{n} f_{i,j}\, d_{i,j}, \\ \text{subject to} \; & f_{i,j} \ge 0, \quad 1 \le i \le m, \; 1 \le j \le n, \\ & \sum_{j} f_{i,j} \le w_{p_i}, \quad 1 \le i \le m, \\ & \sum_{i} f_{i,j} \le w_{q_j}, \quad 1 \le j \le n, \\ & \sum_{i=1}^{m} \sum_{j=1}^{n} f_{i,j} = \min\left( \sum_{i=1}^{m} w_{p_i}, \; \sum_{j=1}^{n} w_{q_j} \right). \end{aligned} \tag{4}$$

This optimization can be solved via linear programming, as it is a kind of transportation problem. Once the flow $F$ is calculated, the EMD is defined as the work normalized by the total flow:

$$\mathrm{EMD}(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{i,j}\, d_{i,j}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{i,j}}. \tag{5}$$
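For intuition, the transportation problem has a closed form in one special case: two histograms over the same ordered bins, normalized to unit mass, with ground distance $|i - j|$ between bins. Under these assumptions the total flow is 1 and the minimal work equals the sum of absolute differences between the cumulative sums. The sketch below implements only this special case; it is not the general linear program and not the wavelet EMD discussed later, and the function name `emd_1d` is introduced here purely for illustration.

```python
from itertools import accumulate


def emd_1d(p, q):
    """EMD between two histograms defined over the same ordered bins.

    Assumes the ground distance between bins i and j is |i - j| and
    normalizes both histograms to unit mass, so the total flow is 1.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total mass")
    p = [value / total_p for value in p]
    q = [value / total_q for value in q]
    # In 1-D, the optimal transport cost is the L1 distance between the CDFs.
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


far = emd_1d([1, 0, 0], [0, 0, 1])            # all mass moves two bins
near = emd_1d([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])
same = emd_1d([0.2, 0.8], [0.2, 0.8])
```

Unlike a bin-by-bin comparison, this distance grows with how far the mass has to travel, which is the property that makes EMD attractive for comparing distributions.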

EMD suffers from a high computational complexity of $O(N^3 \log N)$. Wavelet EMD was proposed in [36] as a linear-time algorithm for approximating the EMD of low-dimensional histograms, using the sum of absolute values of the weighted wavelet coefficients of the difference histogram.

Few studies have exploited EMD in anomaly detection. To the best of our knowledge, only in [7] was wavelet EMD applied, in conjunction with sparse representation, for anomaly detection instead of the Euclidean distance, owing to its robustness. In this paper, we investigate wavelet EMD within our proposed clustering-based anomaly detection.

3. Proposed Method

In this paper, we analyze anomaly detection at frame level in crowded scenes. Our proposed architecture is shown in Figure 1. The pipeline consists of two stages: (1) feature extraction and (2) anomaly detection. The feature extraction stage itself consists of two parts entangled with each other: (1) hierarchical feature extraction through pretrained VGG-Net [37] and (2) hierarchical latent representation from multilayer NMF. Both architectures start from low-level features and increase in depth to high-level information resulting in ultimate representation.

Figure 1

Our proposed architecture for anomaly detection. It consists of two stages of hierarchical feature representation and cluster-based anomaly detection.

In the second stage, we apply clustering-based anomaly detection. Specifically, K-means is applied to the ultimate representations of all processed training samples to create typical normal clusters. Since the training dataset consists of normal samples only, the cluster centroids are representatives of normal frames. At test time, test frames are represented in the topic space learned in the training stage and compared to each cluster centroid. A large statistical distance from all centroids is detected as an anomaly. In the following, we explain each part in more detail.
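The clustering step described above can be sketched with a plain implementation of Lloyd's K-means algorithm. This is a minimal illustration on assumed toy data, not the implementation used in the experiments; the initialization by random sampling and the function names are choices made for this sketch.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members)
                     for d in range(dim)])
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster as is
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids


# Toy "normal" frame representations: two well-separated groups (assumed data).
normal_points = [[0.1, 0.9], [0.12, 0.88], [0.08, 0.92],
                 [0.9, 0.1], [0.88, 0.12], [0.92, 0.08]]
centroids = sorted(kmeans(normal_points, k=2))
```

Each returned centroid plays the role of one typical normal representation against which test frames are later compared.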

3.1. Preprocessing and Feature Extraction

The dataset is separated into two subsets, a train set and a test set. Let $X_{train} = [x_1, x_2, \dots, x_{n_{Train}}]^T \in \mathbb{R}^{n_{Train} \times B_0}$, where $n_{Train}$ is the number of frames in the train dataset, $B_0 = m \times n \times c$, and $m$, $n$, and $c$ are the width, height, and number of channels, respectively, of the original captured image.

3.1.1. Deep Representation

A pretrained model is applied for feature extraction in problems with scarce training data, since training from scratch may result in overfitting. As higher-layer feature maps are task specific, we extract more general features from lower layers. We resize each frame to the input size of the VGG-Net model ($m_0 \times n_0 \times c_0$) and extract features hierarchically from different depths of the architecture. Let $a_0 = x \in \mathbb{R}^{m_0 \times n_0 \times c_0}$ be a typical training image resized to match the VGG input layer. Then,

$$a_l = f(w_{l-1} a_{l-1} + b_{l-1}) \in \mathbb{R}^{m_l \times n_l \times c_l} \tag{6}$$

is the output feature map of layer $l$, where $w_{l-1}$ and $b_{l-1}$ are the pretrained VGG weights and biases, respectively, for layer $l$, $m_l \times n_l$ is the spatial size of the feature map, and $c_l$ is the depth of the feature map at layer $l$. We extract feature maps from $L$ different depths, $l = 1, \dots, L$; the feature map at each layer $l$ is then separately fed to a global average pooling (GAP) layer to obtain a representation in vector format. A GAP layer takes an input volume of size $m_l \times n_l \times c_l$ and creates a $1 \times c_l$ dimensional vector by spatial averaging. Therefore, for each frame $x$, we now have $L$ vector representations $f_{D_l} \in \mathbb{R}^{c_l}$, $l = 1, 2, \dots, L$. Considering all training samples, we have $L$ matrices of different sizes, $M_l \in \mathbb{R}^{n_{Train} \times c_l}$, whose rows are the vectors $f_{D_l}$.
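The spatial averaging performed by a GAP layer can be written in a few lines. The sketch below operates on a nested-list feature map of shape $m_l \times n_l \times c_l$; the toy values and the function name are assumptions for illustration only.

```python
def global_average_pooling(feature_map):
    """Average an (m x n x c) nested-list feature map over its spatial axes."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    channels = len(feature_map[0][0])
    pooled = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for ch, value in enumerate(pixel):
                pooled[ch] += value
    # One averaged value per channel: a vector of length c.
    return [total / (rows * cols) for total in pooled]


# Toy 2 x 2 spatial map with 3 channels (assumed values).
toy_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 4.0, 2.0], [7.0, 0.0, 2.0]],
]
vector = global_average_pooling(toy_map)
```

The output length depends only on the number of channels, which is why feature maps of different spatial sizes all reduce to fixed-length vectors.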

3.1.2. Topic-Based Representation

In parallel, we try to capture semantic information based on the topic model. Specifically, we apply multilayer NMF, since a multilayer structure has been shown to improve performance by capturing more semantic features [38]. We adopt an approach similar to [39] by considering a frame as a document and extracting a topic distribution per document; however, we apply multilayer NMF for hierarchical topic modeling. Single-layer NMF decomposes a nonnegative matrix $V$ into two low-rank nonnegative matrices, the basis matrix $W$ and the coefficient matrix $H$:

$$V = W H', \quad V \in \mathbb{R}^{m \times n}, \; W \in \mathbb{R}^{m \times k}, \; H \in \mathbb{R}^{n \times k}, \tag{7}$$

where $H$ is the new low-dimensional representation of $V$. The decomposition is solved as an optimization problem through a multiplicative update approach. In multilayer NMF, the latent representation computed in the preceding layer is decomposed hierarchically in the subsequent layers. Consider $X_{train-pca} = \mathrm{PCA}(X_{train-vec})$ with $X_{train-pca} = [x_1, x_2, \dots, x_{n_{Train}}]^T \in \mathbb{R}^{n_{Train} \times D_0}$, where PCA is applied to each vectorized frame to decrease the dimensionality from $m_0 \times n_0$ to $D_0 < m_0 \times n_0$ per frame, and the result is standardized to stay in the range $[0, 1]$. Let $H_0 = X_{train-pca}$ be the input to the first stage of the multilayer NMF. Then, it can be decomposed as $H_0 = W_1 H_1$. Instead of directly applying the second NMF to the new low-dimensional representation $H_1$, $H_1$ is processed into $V_1$ before being introduced to the next layer. $V_l$ is computed as $V_l = f(H_l, M_l)$, $l = 1, \dots, L$, where $f(\cdot)$ is a nonlinear function, like softmax, and $M_l$ is the feature representation from the pretrained VGG-Net at layer $l$:

$$V_l = f(H_l, M_l) = W_{l+1} H'_{l+1}, \quad W_{l+1} \in \mathbb{R}^{D_{l-1} \times D_l}, \; H_{l+1} \in \mathbb{R}^{n_{Train} \times D_l}. \tag{8}$$

Here, we use softmax as the nonlinear function to obtain a distribution-like representation. Since the ReLU activation function is used in the deep architecture, nonnegativity is preserved. Bringing the matrices $M_l$ into the multilayer NMF decomposition yields both high-level and semantic information, which can improve the performance of the subsequent tasks. By decomposing $V_l$ in the next layer, we force the architecture to learn how to combine information from the previous layer; therefore, $D_l < D_{l-1}$. By training each NMF layer separately to learn $W_l$ and $H_l$, the ultimate data representation $V_L$ is acquired. Finally, $V_L$ integrates features from both the deep model and the topic model.
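To make the single-layer decomposition in (7) concrete, the sketch below runs the standard Lee-Seung multiplicative updates for the Frobenius-norm objective on a small toy matrix. It covers only one NMF layer; the fusion function $f(H_l, M_l)$ and the layer stacking are not reproduced here. The variable `ht` stores $H'$ (size $k \times n$), and the toy matrix, iteration count, and `eps` safeguard are assumptions of this illustration.

```python
import random


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, ht):
    approx = matmul(w, ht)
    return sum((v[i][j] - approx[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor a nonnegative v (m x n) as w (m x k) times ht (k x n)."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    ht = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    history = [frobenius_error(v, w, ht)]
    for _ in range(iterations):
        # ht <- ht * (w^T v) / (w^T w ht), elementwise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), ht)
        ht = [[ht[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)]
              for a in range(k)]
        # w <- w * (v ht^T) / (w ht ht^T), elementwise
        h = transpose(ht)
        num, den = matmul(v, h), matmul(w, matmul(ht, h))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(m)]
        history.append(frobenius_error(v, w, ht))
    return w, ht, history


# Toy nonnegative matrix of rank 2 (assumed data).
v = matmul([[1, 0], [0, 1], [1, 1], [2, 1]],
           [[1, 2, 0, 1], [0, 1, 3, 1]])
w, ht, history = nmf(v, k=2)
```

Because every update multiplies nonnegative entries by nonnegative ratios, both factors stay nonnegative throughout, which is what allows the rows of the coefficient matrix to be read as topic weights.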

3.2. Anomaly Detection

Upon training completion, $V_L \in \mathbb{R}^{n_{Train} \times D_L}$ is acquired from the normal frames in the training set. We apply the K-means algorithm to $V_L$ to find $K$ cluster centroids $s_i$, $i = 1, \dots, K$, as normality representatives, which are used in cluster-based anomaly detection. Each test frame $x_{test}$ is fed to the feature extraction block learned in the training phase, and its ultimate representation $V_{L,test}$ is acquired. $V_{L,test}$ can be considered the final topic distribution of $x_{test}$. $V_{L,test}$ is compared to each $s_i$, and the frame is detected as an anomaly when its smallest wavelet EMD distance to the centroids exceeds a threshold $th$:

$$\min_{i = 1, \dots, K} d_{EMD}(V_{L,test}, s_i) > th \;\Rightarrow\; V_{L,test} \text{ is an abnormal frame}. \tag{9}$$
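The decision rule in (9) can be sketched as follows. The distance used here is a simple stand-in, namely the 1-D EMD between unit-mass histograms computed from cumulative sums, and not the wavelet EMD used in the paper; the centroids, the threshold, and the function names are hypothetical values chosen for illustration.

```python
from itertools import accumulate


def emd_1d(p, q):
    """Stand-in distance: 1-D EMD for unit-mass histograms on ordered bins."""
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def is_anomalous(representation, centroids, threshold, distance=emd_1d):
    """Flag a frame when even its nearest normal centroid is too far away."""
    nearest = min(distance(representation, c) for c in centroids)
    return nearest > threshold, nearest


# Hypothetical normal centroids and threshold (not values from the paper).
centroids = [[0.7, 0.2, 0.1, 0.0], [0.1, 0.6, 0.2, 0.1]]
threshold = 0.5

normal_flag, normal_dist = is_anomalous([0.6, 0.3, 0.1, 0.0], centroids, threshold)
odd_flag, odd_dist = is_anomalous([0.0, 0.0, 0.1, 0.9], centroids, threshold)
```

Taking the minimum over centroids means a frame is accepted as normal as soon as it resembles any one of the typical normal patterns.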

4. Results and Discussion

We conducted experiments on the UCSD dataset, one of the benchmark datasets in crowd anomaly detection, introduced in [40] and recorded with a static camera at 10 fps. This dataset contains two scenes, Ped1 and Ped2, each of which is split into train and test sequences. Nonpedestrian objects, such as bikers, skaters, and small carts, are considered anomalies. More details about this dataset are provided in Table 1. Typical normal and abnormal sample frames of the Ped1 and Ped2 datasets are shown in Figure 2.

Table 1

UCSD dataset in detail.

Dataset	Resolution	Number of training sequences	Number of test sequences
Ped1	158×238	34 (~200 images each)	36 (~200 images each)
Ped2	240×360	16 (120~200 images each)	12 (120~200 images each)

Figure 2

Typical normal and abnormal samples of the UCSD dataset. Left to right: normal frame and abnormal frame for Ped1 and normal frame and abnormal frame for Ped2.

When originally introduced, VGG [37] was trained on the ImageNet dataset, which consists only of object classes; more recently, a VGG model pretrained on both the ImageNet and Places datasets has been provided, which also covers scene classes. The 1000 classes from ImageNet and the 365 classes from Places365-Standard [41] were merged to train a VGG16-based model (Hybrid1365-VGG [42]). We use this VGG model, pretrained on both the ImageNet and Places datasets, to improve the capability of our deep feature extraction block in capturing both object and scene features. Our algorithms were implemented in Python and run on a PC with a 2.9 GHz Core i5 CPU, a GTX1080 GPU, and 16 GB of RAM. Original frames are resized to be compatible with VGG, which accepts input of size 224×224×3. Feature maps from different depths of the VGG architecture, namely, block2_pool, block3_pool, and block4_pool, were extracted, resulting in 56×56×128, 28×28×256, and 14×14×512 feature maps, respectively. Then, we applied global average pooling to each feature map separately, which results in representation vectors $f_{D_1}$ (128-D), $f_{D_2}$ (256-D), and $f_{D_3}$ (512-D) in hierarchical order. In addition, we applied multilayer NMF with $L = 3$ to our train set after reducing its dimensionality by PCA (a 2000-D vector per frame). $W_0$, $W_1$, and $W_2$ are learned separately with multiplicative updates. $D_1$, $D_2$, and $D_3$ are chosen as 512, 256, and 128, respectively. K-means clustering with $K = 50$ is applied to the final representation $V_L \in \mathbb{R}^{n_{Train} \times D_L}$ to generate typical representative centroids. In the UCSD dataset, there are $n_{Train} = 6800$ training frames for Ped1 and $n_{Train} = 2550$ for Ped2.

In our experiments, we investigated the values of several parameters and fixed them after evaluation. These parameters are shown in Table 2.

Table 2

Fixed parameters used in the proposed algorithm.

Dataset/parameters	Ped1	Ped2
Number of training samples	6800	2550
L (number of levels for feature hierarchies)	3	3
K (K-means clustering)	50	50
Threshold (for WEMD distance comparison)	0.33	0.24

VGG16 consists of several layers (C11-C12-P1-C21-C22-P2-C31-C32-C33-P3-C41-C42-C43-P4-C51-C52-C53-P5-FC1-FC2-FC3). The convolutional and fully connected layers have trainable parameters. The last three fully connected layers provide task-specific features, so we focus on the first 5 convolution layers. We chose $L = 3$ to achieve a trade-off between accuracy and complexity. The number of clusters in K-means clustering was evaluated for $K = 30, 40, 50, 60$ and chosen as $K = 50$ based on accuracy. We set the threshold for the WEMD comparison based on the average distance of the training sample representations, since the training dataset consists of normal samples only.

For Ped1, we compare our proposed approach both to traditional methods (SRC [6], MPPCA [43], and MDT [40]) and to high-level deep learning-based methods (AVID [19], Sabokrou [8], and deep cascade [16]). As introduced and calculated in [26], evaluation metrics such as the equal error rate (EER) and the area under the curve (AUC) are computed at the frame level and compared to the state-of-the-art methods. The EER is the operating point at which the false positive rate equals the false negative rate; the lower the EER, the higher the achievable accuracy. A comparison of the EER of our proposed approach with that of previous methods is shown in Table 3 for Ped1. The results show comparable performance for our proposed method. In addition, the AUC, the area under the ROC curve, is computed and compared to the state of the art. The results show that our proposed approach also outperforms the compared methods in AUC.
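As an illustration of how these two frame-level metrics can be obtained from anomaly scores, the sketch below sweeps a threshold over the scores, builds the ROC points, integrates them with the trapezoidal rule for the AUC, and takes the point where the false positive and false negative rates are closest as an approximate EER. The scores and labels are made-up toy values, not results from the experiments, and the helper names are assumptions of this sketch.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores.

    labels: 1 for anomalous frames, 0 for normal frames.
    A frame is predicted anomalous when its score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def equal_error_rate(points):
    """Approximate EER: the ROC point where FPR and FNR = 1 - TPR are closest."""
    fpr, tpr = min(points, key=lambda p: abs(p[0] - (1 - p[1])))
    return (fpr + (1 - tpr)) / 2


# Toy anomaly scores (e.g., distance to the nearest centroid) and ground truth.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
toy_auc = auc(points)
toy_eer = equal_error_rate(points)
```

With only a handful of thresholds the EER is read off the nearest ROC point; on real data with many frames, interpolating between neighboring points gives a finer estimate.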

Table 3

Comparison of EER and AUC performance for the UCSD Ped1 dataset at the frame level.

Method	SRC [6]	MPPCA [43]	MDT [40]	AVID [19]	Sabokrou [8]	Deep cascade [16]	Proposed approach
EER	19	40	25	12.3	8.4	9.1	8.1
AUC	86	59	81.8	—	93.2	—	93.9

The Ped1 dataset suffers from the perspective problem, and for this reason, most studies have been conducted on Ped2. For Ped2, we compare our proposed approach both to traditional methods (SF [5], MPPCA [43], and MDT [40]) and to high-level deep learning-based methods (Conv-AE [44], AVID [19], deep anomaly [18], deep cascade [16], ALOCC [17], and ST-AE [45]). A comparison of the EER of our proposed approach with that of previous methods is shown in Table 4 for Ped2. The results show comparable performance for our proposed method. In addition, the AUC is computed and compared to the state of the art. The results show that our proposed approach outperforms the compared methods in AUC.

Table 4

Comparison of EER and AUC performance for the UCSD Ped2 dataset at the frame level.

Method	SF [5]	MPPCA [43]	MDT [40]	Conv-AE [44]	AVID [19]	Deep anomaly [18]	Deep cascade [16]	ALOCC [17]	ST-AE [45]	Proposed approach
EER	42	36.0	24.0	21.7	14.	13.5	9.	13	12.0	6.1
AUC	63	71	85	90	—	—	—	—	87.4	97.3

Moreover, we evaluated accuracy as

$$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}. \tag{10}$$

The results, shown in Table 5 for the Ped1 and Ped2 datasets, indicate the high performance of our proposed method.

Table 5

Accuracy criteria for the Ped 1 and Ped 2 datasets.

Dataset/criteria	Ped1	Ped2
Accuracy	90.3	95.4

5. Conclusions

In this paper, we discussed a new semantic and statistical distance-based crowd anomaly detection method at the frame level. In particular, inspired by the earth mover distance metric previously applied to low-level vision features, we applied this statistical distance to features learned hierarchically through a pretrained deep convolutional neural network and a topic model for anomaly detection. Features from VGG-Net, pretrained on a hybrid dataset (the Places and ImageNet datasets), and from multilayered NMF, as semantic interpretable features, were combined into a hierarchical representation and used in clustering-based anomaly detection with the wavelet EMD statistical distance. Experimental results show that the proposed approach outperforms the compared methods on the UCSD datasets. In the future, we will investigate anomaly localization by patch analysis through the convolutional kernel network (CKN) [46] and EMD in a similar framework.

Data Availability

Readers can access the UCSD Ped1 and Ped2 datasets at http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] Chang, Wang, Hong, and Yan, "Crowded scene analysis: a survey," IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 3, pp. 367–386, 2015. DOI: 10.1109/tcsvt.2014.2358029.

[2] Mikolov, Sutskever, Chen, G. S. Corrado, and Dean, "Distributed representations of words and phrases and their compositionality," Advances in Neural Information Processing Systems, MIT Press, NeurIPS Proceedings, pp. 3111–3119, 2013.

[3] Wiriyathammabhum, Summers-Stay, Fermuller, and Aloimonos, "Computer vision and natural language processing: recent approaches in multimedia and robotics," ACM Computing Surveys (CSUR), vol. 49, no. 4, pp. 1–44, 2016.

[4] Wang, Nie, Wang, Yang, and Long, "Deep learning for anomaly detection," in Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 894–896, Houston, TX, USA, 2020.

[5] Mehran, Oyama, and Shah, "Abnormal crowd behavior detection using social force model," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 935–942, Miami, FL, USA, 2009.

[6] Cong, Yuan, and Liu, "Sparse reconstruction cost for abnormal event detection," in CVPR 2011, pp. 3449–3456, Colorado Springs, CO, USA, 2011.

[7] Zhu, Liu, and Wang, "Sparse representation for robust abnormality detection in crowded scenes," Pattern Recognition, vol. 47, no. 5, pp. 1791–1799, 2014. DOI: 10.1016/j.patcog.2013.11.018.

[8] Sabokrou, Fathy, Moayed, and Klette, "Fast and accurate detection and localization of abnormal behavior in crowded scenes," Machine Vision and Applications, vol. 28, no. 8, pp. 965–985, 2017. DOI: 10.1007/s00138-017-0869-8.

[9] Singh, Rajora, D. K. Vishwakarma, Tripathi, Kumar, and G. S. Walia, "Crowd anomaly detection using aggregation of ensembles of fine-tuned ConvNets," Neurocomputing, vol. 371, pp. 188–198, 2020. DOI: 10.1016/j.neucom.2019.08.059.

[10] B. R. Kiran, D. M. Thomas, and Parakkal, "An overview of deep learning based methods for unsupervised and semisupervised anomaly detection in videos," Journal of Imaging, vol. 4, no. 2, p. 36, 2018. DOI: 10.3390/jimaging4020036.

[11] Wang, Qiao, and Tang, "Action recognition with trajectory-pooled deep-convolutional descriptors," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4305–4314, Boston, MA, USA, 2015.

[12] Yang and A. G. Hauptmann, "A discriminative CNN video representation for event detection," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 1798–1807, Boston, MA, USA, 2015.

[13] Wimmer, Vécsei, Häfner, and Uhl, "Fisher encoding of convolutional neural network features for endoscopic image classification," Journal of Medical Imaging, vol. 5, no. 3, article 034504, 2018. DOI: 10.1117/1.jmi.5.3.034504.

[14] Sabokrou, Fathy, Hoseini, and Klette, "Real-time anomaly detection and localization in crowded scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 56–62, Boston, MA, USA, 2015.

[15] Sabokrou, Fathy, and Hoseini, "Video anomaly detection and localisation based on the sparsity and reconstruction error of auto-encoder," Electronics Letters, vol. 52, no. 13, pp. 1122–1124, 2016. DOI: 10.1049/el.2016.0440.

[16] Sabokrou, Fayyaz, Fathy, and Klette, "Deep-cascade: cascading 3D deep neural networks for fast anomaly detection and localization in crowded scenes," IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 1992–2004, 2017. DOI: 10.1109/TIP.2017.2670780.

[17] Sabokrou, Khalooei, Fathy, and Adeli, "Adversarially learned one-class classifier for novelty detection"

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

2018

Salt Lake City, UT, USA

33793388

Sabokrou

Fayyaz

Fathy

Moayed

Klette

Deep-anomaly: fully convolutional neural network for fast anomaly detection in crowded scenes

Computer Vision and Image Understanding20181728897

10.1016/j.cviu.2018.02.006

2-s2.0-85042595485

Sabokrou

Pourreza

Fayyaz

Entezari

Fathy

Gall

Adeli

Avid: adversarial visual irregularity detection

Asian Conference on Computer Vision

2018

Springer, Cham

488505

Sabokrou

Khalooei

Adeli

Self-supervised representation learning via neighborhoodrelational encoding

Proceedings of the IEEE/CVF International Conference on Computer Vision

2019

Seoul, Korea (South)

80108019

Sabokrou

Fathy

Zhao

Adeli

Deep end-to-end one-class classifier

IEEE transactions on neural networks and learning systems2021322675684

10.1109/tnnls.2020.2979049

32275608

Deecke

Vandermeulen

Ruff

Mandt

Kloft

Image anomaly detection with generative adversarial networks

Joint European conference on machine learning and knowledge discovery in databases

2018

Springer, Cham

317

Ruff

Vandermeulen

R. A.

Görnitz

Binder

Müller

K. R.

Kloft

Deep semi-supervised anomaly detection

2019arXiv preprint arXiv:1906.02694

Alghamdi

Alfalqi

A survey of topic modeling in text mining

International Journal of Advanced Computer Science and Applications201561147153

10.14569/ijacsa.2015.060121

Wang

Grimson

Unsupervised activity perception by hierarchical Bayesian models

2007 IEEE conference on computer vision and pattern recognition

2007

Minneapolis, MN, USA

Wang

Grimson

Spatial latent Dirichlet allocation

Advances in neural information processing systems. MIT Press: Cambridge

2008

MA, USA, London, UK

15771584

Teh

Y. W.

Jordan

M. I.

Beal

M. J.

Blei

D. M.

Hierarchical Dirichlet processes

Journal of the American Statistical Association200610147615661581

10.1198/016214506000000302

2-s2.0-33749249312

Niebles

J. C.

Wang

Fei-Fei

Unsupervised learning of human action categories using spatial-temporal words

International Journal of Computer Vision2008793299318

10.1007/s11263-007-0122-4

2-s2.0-45049084813

Isupova

Kuzin

Mihaylova

Learning methods for dynamic topic modeling in automated behavior analysis

IEEE transactions on neural networks and learning systems201829939803993

10.1109/TNNLS.2017.2735364

2-s2.0-85030773240

28961126

Bosch

Zisserman

Munoz

Scene classification via pLSA

European Conference on Computer Vision2006

Berlin, Heidelberg

Springer

517530

Popoola

O. P.

Kejun Wang

Video-based abnormal human behavior recognition| a review

IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews)2012426865878

10.1109/TSMCC.2011.2178594

2-s2.0-84867846829

Pathak

Sharang

Mukerjee

Anomaly localization in topic based analysis of surveillance videos

2015 IEEE Winter Conference on Applications of Computer Vision

2015

Waikoloa, HI, USA

389395

Bo duLiangpei Zhang

A discriminative metric learning based anomaly detection method

IEEE Transactions on Geoscience and Remote Sensing2014521168446857

10.1109/tgrs.2014.2303895

2-s2.0-84902073586

Rubner

Tomasi

Guibas

L. J.

The earth mover’s distance as a metric for image retrieval

International Journal of Computer Vision200040299121

10.1023/A:1026543900054

2-s2.0-0034313871

Ruzon

M. A.

Tomasi

Edge, junction, and corner detection using color distributions

IEEE Transactions on Pattern Analysis and Machine Intelligence2001231112811295

10.1109/34.969118

2-s2.0-0035510301

Shirdhonkar

Jacobs

D. W.

Approximate earth mover’s distance in linear time

2008 IEEE Conference on Computer Vision and Pattern Recognition

2008

Anchorage, AK, USA

Simonyan

Zisserman

Very deep convolutional networks for large-scale image recognition

2014arXiv preprint arXiv:1409.1556

Song

H. A.

Kim

B. K.

Xuan

T. L.

Lee

S. Y.

Hierarchical feature extraction by multi-layer non-negative matrix factorization network for classification task

Neurocomputing20151656374

10.1016/j.neucom.2014.08.095

2-s2.0-84929954089

Wan

A novel document similarity measure based on earth mover's distance

Information Sciences20071771837183730

10.1016/j.ins.2007.02.045

2-s2.0-34250782737

Weixin LiMahadevan

Vasconcelos

Anomaly detection and localization in crowded scenes

IEEE Transactions on Pattern Analysis and Machine Intelligence20143611832

10.1109/tpami.2013.111

2-s2.0-84890419942

Zhou

Lapedriza

Khosla

Oliva

Torralba

Places: a 10 million image database for scene recognition

IEEE Transactions on Pattern Analysis and Machine Intelligence201840614521464

10.1109/tpami.2017.2723009

2-s2.0-85023199574

28692961

Wang

Qiao

Objectscene convolutional neural networks for event recognition in images

Proceedings of the IEEE conference on computer vision and pattern recognition workshops

2015

Boston, MA, USA

3035

Kim

Grauman

Observe locally, infer globally: a spacetime MRF for detecting abnormal activities with incremental updates

Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR ‘09)

2009

Miami, Fla, USA

29212928

Hasan

Choi

Neumann

Roy-Chowdhury

A. K.

Davis

L. S.

Learning temporal regularity in video sequences

Proceedings of the IEEE conference on computer vision and pattern recognition

2016

Las Vegas, NV, USA

733742

Zhao

Deng

Shen

Liu

Hua

X. S.

Spatio-temporal autoencoder for video anomaly detection

Proceedings of the 25th ACM international conference on multimedia, Mountain View

2017

California, USA

19331941

Mairal

Koniusz

Harchaoui

Schmid

Convolutional kernel networks

Advances in Neural Information Processing Systems the MIT press:neurIPS proceedings201426272635