A New Method of Pedestrian Abnormal Behavior Detection Based on Attention Guidance

Introduction
Public safety has always received much attention. In recent years, monitoring equipment in public places has gradually increased and improved. Unsafe factors can therefore be identified from massive surveillance video data [1], which makes it possible to promptly find and stop suspected unsafe behavior and more effectively maintain public safety. To explore pedestrian behavior in surveillance video, researchers first started with pedestrian trajectories, obtaining complete traces of pedestrians from videos to summarize their behavior [2]. Through the analysis of traces, abnormal behaviors in surveillance can be divided into group and individual abnormal behaviors. Group abnormal behavior includes the sudden dispersion or gathering of a crowd and other behaviors that can be classified directly. Thus, researchers applied the general deep action recognition framework [3] to group abnormal behavior detection, using optical flow and trajectories to classify and identify the obvious abnormal movements of a crowd, and obtained good results [4, 5]. Individual abnormal behavior is usually personal misconduct that violates public order, such as driving a vehicle on the sidewalk or walking against the direction stipulated in public places. Since surveillance video in public places contains many kinds of individual abnormal behavior but few samples within each class, and individual abnormal behavior accounts for a fairly small proportion of all behavior data, this imbalance seriously affects the accuracy of general detection methods. Therefore, researchers have divided all behavior into two classes and proposed weakly supervised methods that learn the features of normal behavior data. All behavior data are then compared with the behavior data reconstructed or predicted from normal behavior features, and frames with large differences are judged to be abnormal [6].
At present, weakly supervised methods for abnormal behavior detection are divided into two kinds according to the difference in input data: reconstruction and prediction [7]. The reconstruction method, with a single-frame input, can be applied to image and video data, while the prediction method, with a multiframe input, can only be applied to video data. With the existing weakly supervised methods, the key point is to make the refactored abnormal behavior frames obviously different from the original data. Models based on reconstruction and prediction mainly approach this problem from two aspects. On the one hand, a variety of input data are fed into the model trained on normal behavior frames, and appearance and motion features are extracted through a multibranch model to accurately refactor normal behavior [8, 9]. On the other hand, new modules have been proposed to record typical normal behavior features and increase the proportion of normal behavior features in the refactoring process [7, 10, 11]. These studies help to ensure that refactored abnormal behavior is far from the original data while refactored normal behavior is similar to it, so that the weakly supervised method can detect abnormal behavior accurately. However, surveillance videos contain a large number of pedestrians, and each pedestrian's behavior usually occupies a small area. Moreover, the normal behavior features extracted by existing models lack specificity, and the details of key regions are easily ignored, so the obvious difference between abnormal and normal behavior fails to stand out during refactoring. To solve this problem, we introduce different attention modules into the encoder-decoder structure to capture the key areas of the input data and attend to detailed features with increased weights.
At the same time, the added edge loss function focuses on the texture detail information of the data. In addition, existing models compute abnormal scores for the refactored data in a single and inaccurate way, which makes it difficult to select an appropriate division threshold that accurately distinguishes abnormal from normal behavior. To this end, we improve the abnormal evaluation method by combining multiple measures of refactoring quality, which increases the discrimination between refactored abnormal behavior and original abnormal behavior.
The main contributions of this paper are as follows. First, we propose an attention-directed deep model that combines the encoder-decoder structures of the reconstruction and prediction methods, adds attention modules that fit the structure of the encoder or decoder, and uses an edge loss to focus on the behavior region. Second, we propose a new abnormal evaluation method that analyzes the quality of the refactored images from multiple aspects, including the pixels and the image as a whole, and uses the abnormal score as the standard to distinguish normal from abnormal behavior. Finally, we perform experiments on different abnormal behavior detection datasets and compare the proposed models with existing models through various experiments, including ablation, quantitative, qualitative, and visualization experiments.

Related Work
Due to the imbalance between normal and abnormal behavior data, weakly supervised methods are usually adopted, with the autoencoder generally serving as the basic architecture for extracting normal behavior features. All behavior data are then reconstructed or predicted based on the extracted features, and normal and abnormal behavior are distinguished according to the error between the reconstructed or predicted data and the original data. Chen and He [12] refactored video segments and video frames containing only normal behavior by using different branches to incorporate spatiotemporal information, and fused the reconstruction errors based on Bayes' law. To increase the similarity between reconstructed normal behavior and the input normal behavior, Gong et al. [7] added a memory module that records the features of normal behavior, which are input into the decoder.
The autoencoder structure is prone to losing features due to linear compression, so the deep learning model U-Net, which is similar to the autoencoder, is often used as the basic model. Li et al. [10] proposed a spatial-temporal model that combines U-Net, for representing spatial information, with ConvLSTM, for extracting motion information. Park et al. [13] improved the model proposed by Gong et al. [7]: U-Net was selected as the basic model instead of the autoencoder, the memory module was set with an appropriate amount of memory, and feature compactness and separation losses were added to update the memory module effectively. Building on Park et al. [13], Lv et al. [11] set the memory module as a dynamic unit that does not occupy memory, reducing the cost of the model.
To process the large number of features in deep learning models, researchers have begun to apply attention mechanisms to various models for video anomaly detection. Inspired by the visual attention mechanism, Huan et al. [14] used the input data to obtain spatial-temporal anomaly salient maps and then combined the maps with video frames to detect abnormal behavior frames. Zhu and Newsam [15] used an attention mechanism to assign weights to different instances in the loss function of the multiple instance method. Wei et al. [16] also used attention for multiple instance anomaly detection, assigning weights to the input C3D and optical flow features. In contrast, Sun and Ji [17] adopted attention-based addressing when looking for the best-matching normal feature in the memory module.
Since the attention mechanism has many variants, it can be divided into two categories according to its adaptability: variants that adapt to only one model and variants that transfer to multiple models. Attention U-Net (UA) [18] belongs to the former, being based on U-Net; it connects and processes features with the same number of channels on both sides of the U-shaped structure to obtain feature weights. UA is currently used mainly in medical image segmentation, such as liver tumor segmentation [19] and retinal vascular segmentation [20]. SE (squeeze-and-excitation) attention [21] belongs to the latter and can be flexibly transferred; it selects valid features by adjusting the channel weights of model features. SE is commonly used in image classification, such as microexpression recognition [22] and ECG (electrocardiogram) classification [23].
Inspired by the abovementioned research, we use U-Net as the basic model and combine the multilayer attention modules UA and SE, which suit the two different networks of the reconstruction and prediction methods. As attention-directed models, the two new networks strengthen the extraction of feature details for video anomaly detection.

Model Structure.
This paper proposes an attention-directed model named Att_AE. The model structure is shown in Figure 1 and includes the encoder, decoder, memory module, loss function, and abnormal evaluation module. Because the network structures of the reconstruction and prediction methods differ, their encoder and decoder parts differ as well, so two attention mechanisms adapted to the two networks are added to the encoder or decoder module to form new structures. The multilayer attention module UA [18] is constructed in the decoder of the prediction method, where the features of different layers are calculated. The multilayer SE attention module [21] is added to the encoder of the reconstruction method to calculate the feature weights between channels.
The memory module is located between the encoder and decoder and has the same function in the reconstruction and prediction methods: it selects and stores the normal behavior features obtained by the encoder. The stored features are concatenated with the features obtained by the encoder and then input into the decoder. By increasing the proportion of normal behavior features in the decoder input, the memory module widens the gap between reconstructed or predicted normal and abnormal behavior frames. In addition, the training loss is improved with an edge constraint on the images based on the image gradient. Moreover, a new abnormal behavior evaluation module is proposed to improve the calculation of abnormal scores for testing data, so that the model can more accurately distinguish normal and abnormal behavior frames.
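The memory read described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: softmax-weighted cosine-similarity addressing and the simple concatenation of the query with the read vector are assumptions for illustration.

```python
import numpy as np

np.random.seed(0)

def memory_read(query, memory):
    """Match the encoder feature (query) against stored normal-behavior
    prototypes and return the query concatenated with a weighted read."""
    # cosine similarity between the query and every memory item
    q = query / (np.linalg.norm(query) + 1e-8)
    m = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    sim = m @ q                              # (num_items,)
    w = np.exp(sim - sim.max())
    w /= w.sum()                             # softmax addressing weights
    read = w @ memory                        # weighted sum of stored normal features
    return np.concatenate([query, read])     # decoder input: query ++ read

memory_items = np.random.randn(10, 64)   # 10 stored normal-behavior features
query = np.random.randn(64)              # encoder output for one frame
decoder_input = memory_read(query, memory_items)
print(decoder_input.shape)               # (128,)
```

Because the read vector is built only from stored normal-behavior features, abnormal inputs receive a decoder input biased toward normality, which is what widens the refactoring gap.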

Attention-Directed Encoder-Decoder Model.
The encoder and decoder modules in U-Net realize attention guidance by incorporating attention mechanisms to extract important features. In this paper, the corresponding attention mechanisms are set up according to the reconstruction and prediction model structures.

Attention Encoder-Decoder Structure in Prediction Method.
For the prediction method, which inputs four consecutive frames to predict the next frame, a multiframe input model is proposed: the multiframe images split from one video are input to the encoder in a multichannel way. The attention module UA, which fits the skip connections in the prediction network structure, is introduced into it. The transformed encoder features are connected to the decoder features with the same number of channels and input into the attention module. Then, the feature weight of the next layer in the decoder is calculated. Finally, the three-layer attention modules of the decoder improve the weights of local regions of interest and suppress the regions of no interest. The details of the prediction model are shown in Figure 2.
In Figure 2, four continuous frames are input into the encoder as a multichannel feature, and the spatial resolution is reduced by convolution and max-pooling to obtain high-level semantic features. Then, the output of the encoder is input into the memory module, and the input feature is stitched with the most similar normal behavior feature stored there. After stitching, the feature is entered into the decoder. To avoid the loss of feature details, skip connections are used to stitch features of the same resolution. To increase the attention to key features, we combine the skip connections with attention: the same-resolution features in the encoder and decoder are input into the UA module, which assigns them weights; the skip connection with the encoder features is then realized; and finally the image resolution is restored through convolution and deconvolution to output the refactored image. Figure 3 displays a UA structure in the prediction method. The input is an intermediate feature in U-Net. F_g and F_l are the outputs of different layers in the encoder and decoder: F_l is the output of the decoder layer whose next layer has the same dimensions as F_g, and the subscripts g and l indicate that the features come from the encoder and decoder, respectively. First, an upsampling step ensures that the numbers of channels of F_l and F_g are consistent. Then, the processed F_l and F_g are added, input into a ReLU layer for sparse processing, and passed through a Sigmoid layer for nonlinearity to obtain the attention weight coefficient O, which is multiplied with the original input F_l to get the output F_l′ of the current attention module.
This output is connected to F_l and used to calculate the next attention weight coefficient with the corresponding encoder input. Finally, more efficient features are obtained through the multilayer attention module.
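A single UA gate of the kind described above can be sketched in NumPy. The projection matrices `W_g`, `W_l`, and `psi` stand in for the 1 × 1 convolutions of Attention U-Net, and all shapes and parameter values here are illustrative assumptions, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(f_g, f_l, W_g, W_l, psi):
    """One UA gate: fuse an encoder feature f_g and a decoder feature f_l
    (both C x H x W, already resampled to matching shapes), produce a
    spatial attention map O in (0, 1), and reweight f_l with it."""
    g = np.einsum('ic,chw->ihw', W_g, f_g)       # project encoder feature channels
    l = np.einsum('ic,chw->ihw', W_l, f_l)       # project decoder feature channels
    a = np.maximum(g + l, 0.0)                   # additive fusion + ReLU (sparsity)
    O = sigmoid(np.einsum('c,chw->hw', psi, a))  # attention coefficients O
    return f_l * O[None, :, :], O                # F_l' = O * F_l

C, C_int, H, W = 8, 4, 16, 16
f_g = rng.standard_normal((C, H, W))     # feature from the encoder side
f_l = rng.standard_normal((C, H, W))     # feature from the decoder side
W_g = rng.standard_normal((C_int, C))
W_l = rng.standard_normal((C_int, C))
psi = rng.standard_normal(C_int)
f_l_out, O = attention_gate(f_g, f_l, W_g, W_l, psi)
```

Stacking three such gates along the decoder, as in Figure 2, progressively raises the weight of regions of interest while suppressing the rest.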

Attention Encoder-Decoder Structure in Reconstruction Method.
For the reconstruction method, which inputs a single frame and refactors it, Figure 4 shows the specific network structure. The input is a single video frame whose resolution is reduced by convolution and max-pooling. To select and strengthen important features, the SE attention module is added to extract the attention between channels. The SE module is located after each convolutional block of the encoder to calculate the weight of each channel. After adding the weighted feature to the original feature, the new feature is input to the next layer of the network for continued training. Moreover, the more effective normal behavior features are selected by the multilayer attention module, so that the memory module recording the normal behavior features and the decoder can train with more effective features.
Since the input is a single frame, skip connections combined with the powerful generative ability of the deep learning model could cause the refactored image to become a mere replica of the input image, so the skip connections are removed from the reconstruction model. Then, through a series of convolution and deconvolution operations, the output is restored to the same size as the input and the refactored image is produced.
The structure of the SE attention module in the reconstruction method is shown in Figure 5. A feature S in the encoder is input into the SE module after the convolutional blocks, and its size is converted from H × W × Z to 1 × 1 × Z by global average pooling. The pooling acts like a special fully connected layer, and the calculation is as follows:

T = (1/(H × W)) ∑_{m=1}^{H} ∑_{n=1}^{W} S(m, n). (2)
The output T is the new feature of size 1 × 1, obtained by averaging the pixel values of each channel; m and n represent the height and width coordinates of the pixels in the feature map, respectively. Then, T is input into a fully connected layer FC with coefficients (Z, Z/r), where r = 16, in order to reduce the number of feature channels and the amount of computation. Next, the result passes through a ReLU layer for a nonlinear operation, then through another fully connected layer FC′ with coefficients (Z/r, Z) to recover the original channel number, and finally through a Sigmoid layer for normalization. The whole process can be expressed by the following equation:

U = σ(E_2 δ(E_1 T)), (3)

where σ represents the Sigmoid layer, δ represents the ReLU layer, and E_1 and E_2 are the parameters of FC and FC′. The channel attention weight U of size 1 × 1 × Z is obtained and multiplied with the original input S to get the output V of the attention module. Finally, the features obtained by the multilayer convolutional and attention blocks in the encoder are input into the memory and decoder modules to continue training.
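The squeeze-and-excitation steps above can be sketched in NumPy. The feature map is assumed to be in Z × H × W layout, and the FC parameters E1 and E2 are random placeholders purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def se_block(S, E1, E2):
    """SE attention over a feature map S of shape (Z, H, W).
    E1: squeeze FC weights (Z/r, Z); E2: excitation FC weights (Z, Z/r)."""
    T = S.mean(axis=(1, 2))               # squeeze: global average pooling, eq. (2)
    h = np.maximum(E1 @ T, 0.0)           # FC with coefficients (Z, Z/r) + ReLU
    U = 1.0 / (1.0 + np.exp(-(E2 @ h)))   # FC' recovering Z channels + Sigmoid, eq. (3)
    return S * U[:, None, None]           # V: channel-reweighted feature

Z, r, H, W = 32, 16, 8, 8
S = rng.standard_normal((Z, H, W))
E1 = rng.standard_normal((Z // r, Z))     # channel-reduction parameters E_1
E2 = rng.standard_normal((Z, Z // r))     # channel-recovery parameters E_2
V = se_block(S, E1, E2)
```

The reduction ratio r = 16 matches the value in the text; each channel of V is the corresponding channel of S scaled by a weight in (0, 1).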

Training Loss.
The loss function has four parts: the intensity loss Loss_p [7], which constrains pixel similarity; the feature aggregation loss Loss_fg and separation loss Loss_fs [13], which ensure that the memory module records typical normal behavior; and the edge loss Loss_g, which characterizes the edges of the image. The loss function is defined as follows:

Loss = Loss_p + c Loss_fg + θ Loss_fs + β Loss_g, (4)

where c, θ, and β are weight factors corresponding to the different parts of the loss function (0 < c, θ, β < 1). The intensity loss reduces the difference between real and refactored data by penalizing the pixel-wise distance between them. Specifically, Loss_p is calculated as the mean squared difference between the two images:

Loss_p = (1/I) ∑_{i=1}^{I} (x_i − x_i′)², (5)

where x_i and x_i′ represent the pixel values of the original image and the predicted or reconstructed image, respectively, and I is the total number of pixels in the image.
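The intensity loss and the weighted combination of the four loss terms can be sketched directly; the weight values below are illustrative defaults, not the tuned values selected in the experiments:

```python
import numpy as np

def intensity_loss(x, x_rec):
    """Mean squared pixel difference between original and refactored frame."""
    return np.mean((x - x_rec) ** 2)

def total_loss(loss_p, loss_fg, loss_fs, loss_g, c=0.01, theta=0.01, beta=0.01):
    """Weighted combination of the four training losses; c, theta, beta
    are the weight factors from the text (0 < c, theta, beta < 1)."""
    return loss_p + c * loss_fg + theta * loss_fs + beta * loss_g

x = np.zeros((4, 4))
x_rec = np.full((4, 4), 2.0)
lp = intensity_loss(x, x_rec)   # 4.0
```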
The feature aggregation loss is set for the normal behavior features stored in the memory module. Loss_fg ensures that the input features are similar to the recorded features in the memory module:

Loss_fg = ∑_{k=1}^{K} ‖f_k − mem_c‖_2², (6)

where f_k represents a feature extracted by the model, K is the total number of extracted features, and mem_c is the normal behavior feature in the memory module closest to f_k. By constraining the difference between the input features and the features in the memory module, equation (6) keeps the stored features similar to the input normal features, which benefits the refactoring quality of normal behavior frames once both are input into the decoder.
Since each input feature is matched with the most similar feature in the memory module, the number of features stored in the memory module must also be controlled to reduce memory consumption. Therefore, the stored features are required to be representative and mutually different.

The feature separation loss Loss_fs guarantees the diversity of features in the memory module:

Loss_fs = ∑_{k=1}^{K} max(‖f_k − mem_c‖_2 − ‖f_k − mem_s‖_2 + α, 0), (7)

where mem_s is the second-closest normal behavior feature to f_k in the memory module and the margin α keeps the loss greater than zero. The feature separation loss limits the distance between the input feature and the second most similar feature in the memory module, which increases the difference between mem_c and mem_s and thus among all features in the memory module. To delineate the outline edges of the contents of the input frame more clearly, the model sets the edge loss Loss_g to account for the details of the texture structure. This loss constrains the horizontal and vertical gradients of the image calculated with the Sobel operator [24].
The image gradients are calculated as follows:

X_g = G_x ∗ x, Y_g = G_y ∗ x, (8)
G = |X_g| + |Y_g|, (9)

where G_x and G_y are the convolutional kernels that compute the image gradient horizontally and vertically, respectively; X_g and Y_g are the horizontal and vertical gradients obtained by convolving the original image with the Sobel operator, and X_g′ and Y_g′ are those of the reconstructed or predicted image, computed in the same way. G and G′ are the gradient maps of the original image and of the reconstructed or predicted image, respectively. We define the gradient difference as the edge loss and use the L_1 distance to calculate it:

Loss_g = ‖G − G′‖_1. (10)

The edge loss in equation (10) improves the detail of the refactored image by constraining the gradient difference between the real image and the output refactored image.
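A minimal NumPy sketch of the Sobel-based edge loss follows; the naive valid-mode convolution loop stands in for a convolution layer and is for illustration only:

```python
import numpy as np

GX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal Sobel kernel G_x
GY = GX.T                                                          # vertical Sobel kernel G_y

def conv_valid(img, kernel):
    """Plain 'valid' 2D correlation with a 3x3 kernel."""
    H, W = img.shape
    out = np.empty((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * kernel)
    return out

def gradient_map(img):
    """G = |X_g| + |Y_g|: combined magnitude of the Sobel gradients."""
    return np.abs(conv_valid(img, GX)) + np.abs(conv_valid(img, GY))

def edge_loss(x, x_rec):
    """L1 distance between the gradient maps of the two images."""
    return np.mean(np.abs(gradient_map(x) - gradient_map(x_rec)))
```

A frame refactored with blurred or displaced edges produces a different gradient map and therefore a large edge loss, which is exactly the texture-detail constraint the loss is meant to impose.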

Abnormal Behavior Evaluation.
After training with the normal behavior data, the abnormal behavior evaluation provides abnormal scores for the testing frames. This paper proposes a new method to judge abnormal behavior, shown in equation (11), which includes the feature similarity D, the pixel error PSNR, and the image similarity SSIM. The three values are fused in a certain proportion to calculate the final abnormal score: the higher the value, the more likely the frame is to contain abnormal behavior.
where 0 < λ, η, φ < 1 with λ = 1 − η − φ, and x and x′ are the vectors of the input image and output image. The feature similarity is the difference between the normal behavior features in the memory module and the features obtained during testing, and it is an effective indicator of an abnormal behavior frame: the larger the distance, the larger the probability that the frame is abnormal. The model uses the L_2 distance:

D = ‖f − mem_c‖_2, (12)

where f is the feature extracted from the testing frame and mem_c is its closest normal behavior feature in the memory module. PSNR is one of the most widely used image evaluation indicators [8] and usually measures the gap between a distorted image and the original image; this paper uses it to judge the pixel-level difference between the output image and the input image:

PSNR(x, x′) = 10 log_{10}( max(x′)² / ((1/I) ∑_{i=1}^{I} (x_i − x_i′)²) ), (13)

where max(x′) represents the maximum pixel value of the output image. A high PSNR indicates that the output image is similar to the input image, and the input image is judged to be a normal behavior frame; otherwise, it is judged to be an abnormal behavior frame. Since PSNR is concerned only with pixel error, the image similarity criterion SSIM [25] is added to calculate the similarity between whole images from three aspects: brightness, contrast, and structural similarity. The lower its value, the greater the probability of an abnormal behavior frame:

SSIM(x, x′) = l(x, x′) · c(x, x′) · s(x, x′), (14)

where μ_x and μ_x′ represent the average gray levels of the input image and the output image, respectively. The intensity similarity l(x, x′) is calculated as

l(x, x′) = (2 μ_x μ_x′ + C_1) / (μ_x² + μ_x′² + C_1), (15)

where C_1 is set to prevent the denominator from being zero, and l(x, x′) always lies in (0, 1]. σ_x and σ_x′ represent the standard deviations of the input image and output image, respectively. The contrast similarity c(x, x′) is calculated as

c(x, x′) = (2 σ_x σ_x′ + C_2) / (σ_x² + σ_x′² + C_2), (16)

where C_2 has the same function as C_1, and the range of c(x, x′) is the same as that of l(x, x′).
Finally, the structural similarity s(x, x′) of the images is calculated as

s(x, x′) = (σ_xx′ + C_3) / (σ_x σ_x′ + C_3), (17)

where C_3 is also set to prevent the denominator from being zero and σ_xx′ represents the covariance of the input image and output image.
The similarity between the input image and output image can then be calculated according to equation (14). Since the mean and variance usually vary considerably over an entire image, a sliding window is used to calculate the SSIM of multiple regions of the image, and the average of the per-region SSIM values is taken as the final result.
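The PSNR score and the sliding-window SSIM average described above can be sketched as follows; the window size and the constants C1 and C2 (the standard values for 8-bit images) are illustrative assumptions:

```python
import numpy as np

def psnr(x, x_rec, eps=1e-10):
    """Pixel-error score: higher means the two frames are more similar."""
    mse = np.mean((x - x_rec) ** 2)
    return 10.0 * np.log10(np.max(x_rec) ** 2 / (mse + eps))

def ssim_window(x, y, C1=6.5025, C2=58.5225):
    """SSIM of one window: product of intensity, contrast, and structure terms."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + C1) * (2 * cov + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))

def mean_ssim(x, y, win=8):
    """Slide a window over the frame and average the per-region SSIM values."""
    H, W = x.shape
    scores = [ssim_window(x[i:i + win, j:j + win], y[i:i + win, j:j + win])
              for i in range(0, H - win + 1, win)
              for j in range(0, W - win + 1, win)]
    return float(np.mean(scores))
```

Averaging per-window SSIM rather than computing one global value keeps the score sensitive to small abnormal regions, which matches the motivation given in the text.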

Experiments
To verify the performance of the model, the proposed model is trained on three classic abnormal behavior detection datasets. The experiments include parameter selection, ablation experiments, comparison experiments, and visualization of abnormal behavior.
The UCSD Ped1 and UCSD Ped2 datasets [26] are pedestrian monitoring videos from the University of California, San Diego. The Ped1 dataset contains 34 training videos and 36 testing videos with a total of 14,000 frames. The Ped2 dataset contains 16 training videos and 12 testing videos with a total of 4,560 frames. Both datasets have frame-level labels; the training sets contain only normal behavior frames, and the testing sets contain normal and abnormal behavior frames. The videos record crowd behavior on campus sidewalks, and the labeled abnormal behavior mainly includes cycling, skateboarding, and driving.
The CUHK Avenue dataset [27] contains 16 training videos and 21 testing videos with a total of 30,652 frames, also with frame-level labels. The division into training and testing sets is similar to that of the Ped1 and Ped2 datasets. The Avenue dataset records crowd behavior at a subway entrance, and the abnormal behavior mainly includes fast running, abnormal actions in place, and walking in the wrong direction.
The experimental framework is Python 3.6, the development tool is PyCharm, and the graphics card is a TITAN XP with 12 GB of video memory. The number of training epochs is 60, the initial learning rates of the prediction and reconstruction methods are 2e−4 and 2e−5, respectively, and the optimizer is Adam. The two models use the same loss function, as defined in Section 3.3. With these parameters, the training times on the Ped1, Ped2, and Avenue datasets are about 5, 2, and 10 hours, respectively, and the inference speeds are 58.5 FPS (frames per second), 58.5 FPS, and 50 FPS, respectively.

Model Parameters.
The model parameters mainly include the weights of the loss function and the weights of the abnormal behavior evaluation module. The parameters are selected according to the experimental results obtained by combining different weights. The evaluation indicator of the model is AUC [7], which is used for binary classification models: the larger the AUC, the better the result of the model.

Weights of Training Loss.
The loss function of the model, defined in Section 3.3, consists of four parts: intensity loss, feature aggregation loss, feature separation loss, and edge loss.
According to equation (4) in Section 3.3, only three weight parameters need to be determined, defined as c, θ, and β. Following [13], c and θ are set to 0.01 in the reconstruction method and 0.1 in the prediction method. Taking β as the only variable, the weight that yields the best result is chosen by training models with different values. Figure 6 shows the experimental results of the prediction and reconstruction models with different β on the three datasets. According to Figure 6, the reconstruction model performs best with β = 0.01 on all three datasets, while the prediction model performs best with β = 0.01 on the Ped1 dataset and β = 0.5 on the Ped2 and Avenue datasets.

Weights of Abnormal Behavior Evaluation.
The abnormal behavior evaluation module consists of three parts: feature similarity, pixel error, and image similarity, with the corresponding weight parameters λ, η, and φ defined in Section 3.4. Since the sum of the feature similarity and pixel error weights is set to 1 in [13], this paper sets λ = 1 − η − φ. In [13], η is set to 0.7 and 0.6 in the reconstruction and prediction methods, respectively; therefore, this paper sets η around those values and then adjusts the newly added weight φ. Table 1 shows the results (AUC%) of the prediction model with different weights on the Ped2 dataset.
According to the experimental results in Table 1, the prediction model performs best on the Ped2 dataset when η is 0.6 and φ is 0.01. Following the same procedure, the trained models achieve their optimal performance when η is 0.6 for the prediction models on the Ped2 and Avenue datasets, η is 0.75 for the other models on the other datasets, and φ is 0.01 for all models.

Ablation Experiments.
The ablation results are shown in Table 2. The data in the first row of each method area are the baselines on the three datasets, and the other rows are the results of models with different parts added. Through data comparison, it can be seen that both single parts and different combinations of parts improve the performance of abnormal behavior detection.
Regarding single added parts: in the reconstruction model, the attention module brings the largest gains of 0.8%, 0.4%, and 1.6% on the Ped1, Ped2, and Avenue datasets; in the prediction model, the edge loss brings the largest gains of 1.1%, 1.6%, and 3.6% on the three datasets; and the other added parts also improve performance. Thus, among single added parts, the attention module and the edge loss bring the most improvement in the reconstruction and prediction models, respectively. Regarding combinations of parts: in the reconstruction model, combining the attention module and the edge loss brings gains of 1.1%, 1.3%, and 2.1% on the Ped1, Ped2, and Avenue datasets, and the final model brings the largest improvements of 1.3%, 1.4%, and 2.2%. In the prediction model, combining the attention module and the abnormal behavior evaluation brings a gain of 2.3% on the Ped1 dataset; combining the attention module and the edge loss brings gains of 2.5% and 3.8% on the Ped2 and Avenue datasets; and the final model brings the largest improvements of 2.9%, 2.5%, and 3.9% on the three datasets. These results show that the attention module, edge loss, and abnormal behavior evaluation are all positively correlated with model performance, and all combinations of the three parts are beneficial.

Quantitative Experiments.
In this section, the experimental results of the proposed model are compared with existing abnormal behavior detection models using the same indicators on the three datasets. Among the comparison models, unmasking [28] learns an effective classifier through sliding windows; the level set method [29] uses level set detection to extract image descriptors; sRNN [32] and sRNN-AE [35] use RNNs as the basic model; GAN_pred [33] uses a GAN combined with U-Net; PST [36] uses pose components for abnormal behavior detection; and the others are variant models based on the autoencoder and U-Net. Table 3 shows the results of the proposed model and the comparison models on the three datasets. In Table 4, another evaluation index, the equal error rate (EER) [30], is used to compare the models on the same datasets; a smaller EER means a lower error rate. The model in this paper includes two methods for detecting abnormal behavior, so Tables 3 and 4 are each divided into three areas: the first includes abnormal behavior detection models not based on reconstruction or prediction; the second includes models using the reconstruction method; and the third includes models using the prediction method. Att_AE_recon and Att_AE_pred are the proposed networks with the reconstruction and prediction methods, respectively. In Tables 3 and 4, the results of Att_AE_recon and Att_AE_pred are significantly better than those of the comparison models in the corresponding areas, and Att_AE_pred outperforms all models.

Abnormal Behavior Scores.
The abnormal behavior datasets selected in this paper have frame-level labels, 0 and 1, where 0 represents a normal behavior frame and 1 represents an abnormal behavior frame. In this paper, the model scores the testing videos containing normal and abnormal behavior frames through the abnormal behavior evaluation defined in

Advances in Multimedia 9
Section 3.4 and sets the appropriate threshold to divide normal and abnormal behavior frames. In Figures 7-9, it is shown that the partial anomaly scores and corresponding division thresholds on the test set given by the prediction model of this paper and the literature [13] in the Ped1, Ped2, and Avenue datasets.
In Figures 7-9, score1 represents the model scores of the literature [13], score2 represents the model scores of this paper, label represents the labels of the testing videos, and the threshold is the division value between abnormal and normal behavior. The abnormal score is the evidence used to divide normal and abnormal behavior frames: if the abnormal scores are close to the corresponding labels, normal and abnormal behavior are easy to distinguish. In Figures 7-9, it can be seen that the abnormal scores of the proposed model are closer to the dataset labels. It can also be seen from the thresholds in Figures 7-9 that the model in this paper divides normal and abnormal behavior with higher accuracy. Therefore, the model in this paper can divide normal and abnormal behavior frames more accurately than the model in the literature [13].
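The scoring-and-thresholding step can be sketched as follows. The paper's exact evaluation is defined in its Section 3.4; here we assume the common min-max normalization of per-frame errors to [0, 1] anomaly scores, so both function names and the normalization choice are illustrative:

```python
import numpy as np

def anomaly_scores(frame_errors):
    """Min-max normalize per-frame errors to [0, 1] anomaly scores
    (a common convention; the paper's own evaluation may differ)."""
    e = np.asarray(frame_errors, dtype=float)
    return (e - e.min()) / (e.max() - e.min() + 1e-8)

def divide_frames(scores, threshold):
    """Label each frame: 1 = abnormal, 0 = normal."""
    return (np.asarray(scores) >= threshold).astype(int)
```

Frames whose normalized score exceeds the division threshold are reported as abnormal, which is exactly the comparison visualized against the ground-truth labels in Figures 7-9.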

Abnormal Behavior Detection Effect.
For abnormal behavior detection, we reconstruct the test set data, which mix abnormal and normal behavior, according to the features of normal behavior, and obtain the reconstruction error from the output image and the input image. If a frame shows normal behavior, its reconstruction error should be small; if it shows abnormal behavior, the reconstruction error should be large. To evaluate model performance, we visualize the reconstruction errors obtained by the prediction model of the literature [13] and by this paper. The good performance of the proposed model in abnormal behavior detection can be seen through the comparison of the figures. Figures 10 and 11 show some reconstruction error images obtained from the three datasets.
In Figures 10 and 11, the first column is the original video frame, the second column is the reconstruction error obtained by the literature [13], and the third column is the reconstruction error obtained by this paper. The images in the figures are selected from the Ped1, Ped2, and Avenue datasets in row-major order. Figure 10 shows the reconstruction error images of normal behavior frames in the literature [13] and in this paper; it can be seen that the reconstruction error of the normal behavior frames in this paper is smaller than that of the literature [13]. Figure 11 shows the reconstruction error images of abnormal behavior, where the abnormal behavior is marked with a green rectangle. Through the comparison of the second and third columns in Figure 11, it can be seen that the reconstruction error obtained in this paper for the abnormal behavior frames is larger than that of the literature [13], especially in the marked abnormal behavior areas. Therefore, based on the visual comparison, the abnormal behavior detection performance of the proposed model is better.
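The error images in Figures 10 and 11 are essentially per-pixel differences between each input frame and its reconstructed or predicted output. A minimal sketch of how such images can be produced, assuming squared error and a simple [0, 255] rescaling for display (both are our assumptions, not necessarily the authors' exact rendering):

```python
import numpy as np

def reconstruction_error_map(inp, out):
    """Per-pixel squared error between an input frame and its
    reconstructed/predicted frame; larger values mean larger error."""
    return (inp.astype(float) - out.astype(float)) ** 2

def to_heatmap(err):
    """Rescale an error map to [0, 255] for visualization:
    brighter pixels indicate regions the model failed to reproduce."""
    err = err - err.min()
    return (255 * err / (err.max() + 1e-8)).astype(np.uint8)
```

On normal frames the map stays dark almost everywhere, while abnormal regions light up, which is the contrast being compared between the second and third columns.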

Comparison of Attention Mechanisms.
To explore the effectiveness of the attention mechanisms in this paper, the CBAM module, which computes spatial and channel attention [37], and the ECA module, which computes channel attention [38], are selected for comparison. The same number of different attention modules is added in the same way, and the prediction and reconstruction networks are trained on the Ped2 dataset. The AUC (%) of the different reconstruction and prediction networks is shown in Figures 12 and 13. The SE module in the reconstruction method and the UA module in the prediction method achieve better results, so they are used in this paper. In addition, the average time performance of these networks is around 55 FPS, which shows that the attention mechanisms used in this paper are the more effective scheme.
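For reference, the SE module used in the reconstruction method follows the standard squeeze-and-excitation pattern: global average pooling, a two-layer bottleneck, and sigmoid channel gating. The NumPy sketch below illustrates only this generic pattern; the weight shapes and reduction ratio are assumptions, and the authors' in-network implementation may differ:

```python
import numpy as np

def se_block(x, w1, w2):
    """Channel attention in the style of a squeeze-and-excitation block.
    x: feature map of shape (C, H, W); w1: (C//r, C); w2: (C, C//r),
    where r is the reduction ratio."""
    z = x.mean(axis=(1, 2))               # squeeze: global average pool -> (C,)
    h = np.maximum(w1 @ z, 0.0)           # excitation: FC + ReLU -> (C//r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))   # FC + sigmoid -> channel weights in (0, 1)
    return x * s[:, None, None]           # reweight each channel of x
```

Because the gate values lie in (0, 1), the block suppresses uninformative channels while preserving the feature map's shape, which is what lets it be dropped into the encoder-decoder without structural changes.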

Visualization of Abnormal Behavior.
The model in this paper fully learns normal behavior features during training, so normal behavior frames can be effectively reconstructed or predicted in the testing set, whereas abnormal behavior frames in the testing set have a large reconstruction or prediction error. The reconstruction or prediction error is the difference between the input image and the output image, and the specific areas of abnormal behavior can be observed by visualizing this error. As shown in Figures 14 and 15, the prediction errors on the Ped2 dataset and the Avenue dataset are visualized by the prediction network.
In Figure 14, the first column includes the abnormal behavior frames selected from the original testing set. The second column includes the difference between the input images and the output images, which identifies the specific areas of abnormal behavior more clearly. The third column includes the abnormal behavior areas marked on the original frames. By marking the prediction error obtained by the proposed model in red, the scene and specific meaning of the abnormal behavior can be explored. In Figure 14, it can be seen that the abnormal behavior selected from the Ped2 dataset includes cycling, driving cars, and skateboarding.
There are also three columns of images in Figure 15, and each column has the same meaning as in Figure 14. Through the red annotations in the third column, it can be seen that the abnormal behavior selected from the Avenue dataset includes walking in the wrong direction, running fast, and making abnormal actions (throwing a backpack in place). Therefore, while correctly distinguishing between normal and abnormal behavior frames, the proposed model can locate the specific area of abnormal behavior, which enables more accurate analysis and induction of abnormal behavior in combination with the scene.
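Marking the abnormal area on the original frame, as in the third columns of Figures 14 and 15, amounts to thresholding the error map and locating the high-error pixels. A minimal sketch, assuming a simple bounding-box localization (the function name and box-based marking are our illustration, not the authors' exact procedure):

```python
import numpy as np

def mark_abnormal_region(err_map, threshold):
    """Locate the abnormal area by thresholding the prediction error map.
    Returns the bounding box (x_min, y_min, x_max, y_max) of the
    high-error pixels, or None if no pixel exceeds the threshold."""
    ys, xs = np.where(err_map > threshold)
    if ys.size == 0:
        return None                       # frame looks normal
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

The returned box can then be drawn (e.g., in red) on the original frame to expose the scene-level meaning of the anomaly.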

Runtime.
With a TITAN XP, the average speed of the two baseline models [13] is 54.8 FPS, and the average speed of the models in this paper is 55.7 FPS. The models in this paper are also faster than other state-of-the-art models; for example, the average speeds of unmasking [28], sRNN [32], GAN_pred [33], and sRNN-AE [35] are 20 FPS, 50 FPS, 53.4 FPS, and 10 FPS, respectively.

Discussion
The proposed model is trained and tested on three abnormal behavior detection datasets, and its good performance is verified by comparison with the results of other models. However, during analysis and induction, we also find that the proposed model still has some problems that lead to errors in distinguishing abnormal from normal behavior. In this section, we discuss some problems arising from the proposed model, analyze their possible causes, and then propose solutions as directions for future work. Figure 16 shows some detection errors, with data selected from different datasets. In the abnormal score charts that include the error frames, the locations of the detection error frames are marked with green circles, and the original data and the corresponding reconstruction errors are provided for analyzing the causes of the detection errors. It can be seen that the detection error frames are located near the boundary between abnormal and normal behavior. The abnormal behavior frame in the first row is judged to be a normal behavior frame. From the abnormal behavior area marked by the green box, it can be seen that the abnormal behavior in Figure 16 is a cyclist in the crowd, which may be judged as normal behavior because it is located at the edge of the image and is obscured by the pedestrian in front. The second row includes a normal behavior frame. The abnormal behavior preceding this frame is a pedestrian throwing a backpack in place and walking quickly. We think this frame may be judged as abnormal because changes in motion speed are not considered and the frame is similar to the preceding abnormal data.
Based on the above analysis, we believe that the next step should focus on how to extract comprehensive features and resolve the judgment errors caused by occlusion and motion. Given the difference in motion state between abnormal and normal behavior frames, we consider extracting motion features based on interframe variation to capture this difference and improve the judgment of frames near the dividing line.
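One simple instance of such an interframe-variation feature is the mean absolute difference between consecutive frames, which responds to sudden speed changes such as a pedestrian starting to run. This sketch is only an illustration of the proposed direction, not an implemented component of the model:

```python
import numpy as np

def interframe_motion_energy(frames):
    """Per-frame motion cue from interframe variation: the mean absolute
    difference between consecutive frames. frames has shape (T, H, W);
    the first frame is padded with 0 so the output aligns to T frames."""
    frames = np.asarray(frames, dtype=float)
    diffs = np.abs(frames[1:] - frames[:-1])  # (T-1, H, W)
    energy = diffs.mean(axis=(1, 2))          # one value per transition
    return np.concatenate([[0.0], energy])
```

A spike in this signal around a frame near the dividing line could be combined with the appearance-based reconstruction error to reduce the misjudgments discussed above.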

Conclusions
Under weak supervision, this paper proposes an attention-guided abnormal behavior detection model for situations in which local abnormal behavior cannot be detected effectively. Multilayer attention modules are added to obtain key features adapted to the different structures of the prediction and reconstruction methods. On this basis, this paper also modifies the loss function and proposes a new abnormal behavior evaluation module to increase the gap between normal and abnormal behavior frames after reconstruction or prediction, which benefits the effective detection of abnormal behavior. Experiments on the Ped1, Ped2, and Avenue datasets verify the advantages of the proposed model.
The proposed model has improved the performance of abnormal behavior detection to a certain extent, but it still has some problems. For example, at some boundaries between normal and abnormal behavior frames in videos, the difference between the abnormal behavior data and the normal behavior features extracted by the model is still small, which may lead to misjudgment. Therefore, further work is required to study how to extract full and discriminating features of normal behavior for more effective abnormal behavior detection.

Conflicts of Interest
The authors declare that there are no conflicts of interest.

Advances in Multimedia