MTAD-TF: Multivariate Time Series Anomaly Detection Using the Combination of Temporal Pattern and Feature Pattern

Multivariate time series anomaly detection has made great progress in many fields and has become increasingly important. A common limitation of many related studies is that they model only the temporal pattern without capturing the relationships between variables, and the resulting loss of information leads to false alarms. This article proposes an unsupervised multivariate time series anomaly detection method. In the prediction part, multiscale convolution and a graph attention network are used to capture information in the temporal pattern together with the feature pattern. The threshold selection part applies extreme value analysis to the root mean square error between the predicted value and the actual value to obtain the threshold. Finally, the model outperforms other recent models on real-world datasets.


Introduction
Anomaly detection of time series data has always been a hot issue in academia and industry. The detection of abnormal points and the location of abnormal areas can provide important information at critical moments, so that people can intervene in a targeted way to prevent or eliminate abnormal events. Anomaly detection of time series data has attracted attention in industry, finance, military, medical treatment, insurance, robotics, multiagent systems, network security, IoT, complex biological systems, etc. [1,2].
The anomaly detection of time series aims to detect points with outliers, oscillations, or other abnormal conditions. In general, the proportion of anomalies in the overall time series is very low, so people hope to capture the outliers by learning the distribution or other characteristics of the original data through an algorithm. Univariate anomaly detection is carried out on time series with only one feature. Since there is only one dimension of data, many traditional filtering algorithms can be used, for example, the spectral residual algorithm [3]. Multivariate time series anomaly detection refers to the anomaly detection of time series data with multiple sequences. This kind of problem extends univariate time series anomaly detection. The occurrence of anomalies in multivariate time series data is often determined by multiple features, and the individual analysis of each feature cannot accurately locate the anomalies. Complex biological systems generally have this characteristic. For example, time series data from an epidemic model may include the number of patients, the number of healthy people, the infection rate, the immunization rate, etc. The severity of an epidemic cannot be judged from only some of these characteristics. Therefore, a more reasonable method is to comprehensively analyse multiple variables to identify anomalies.
At present, significant progress has been made in the study of MTAD (multivariate time series anomaly detection) with deep learning. For example, Malhotra et al. [4] proposed an LSTM-based encoder-decoder network, which modelled the reconstruction probability of "normal" time series and used reconstruction errors to detect anomalies in multiple sensors. Hundman et al. [5] used a long short-term memory (LSTM) network to detect anomalies in spacecraft multivariate time series based on prediction loss. Ding et al. [6] proposed RADM, a real-time anomaly detection algorithm based on Hierarchical Temporal Memory (HTM) and Bayesian Network (BN), which improved the performance of real-time anomaly detection. However, most of the proposed methods rely on an RNN (Recurrent Neural Network) to learn properties and distributions in the temporal pattern; the relationship between sequences is still unutilized. Therefore, we believe that new latent dependencies can be exploited from the feature pattern, which is more conducive to anomaly detection. We propose a method that combines the temporal pattern and the feature pattern.
Our main contributions are as follows:
(1) To the best of our knowledge, this is the first study on multivariate time series anomaly detection from a graph-based perspective that uses a graph attention network for forecasting.
(2) We propose a new model that combines the temporal pattern with the feature pattern, capturing more latent relationships between variables.
(3) Experimental results show that our method outperforms the state-of-the-art methods on 3 benchmarks.
The arrangement of this article is as follows. We review related work on time series anomaly detection in Section 2. In Section 3, the prerequisite knowledge of GAT and GRU used in the model is introduced. In Section 4, the proposed method is introduced in detail. Section 5 presents experiments and analysis. Finally, we summarize the full text.

Related Work
Anomaly detection is also known as novelty detection, outlier detection, or event detection in other related fields [7]. Time series anomaly detection is one of the most widely studied problems. It can be classified into supervised, semisupervised, and unsupervised anomaly detection according to whether labels are used during training. Supervised learning methods [8] require labelled data for training and can only identify known anomaly types [9], so their application scope is limited. Semisupervised methods combine supervised learning and unsupervised learning: they use a large amount of unlabelled data together with labelled data, and they are rarely studied in the field of TSAD (Time Series Anomaly Detection). Therefore, research on TSAD focuses on the unsupervised problem.
According to the number of sequences in the data, the problem can be divided into univariate and multivariate time series anomaly detection. Univariate time series anomaly detection [3,10,11] only considers whether the variable conforms to its long-term pattern; when there is a big difference between a data value and the overall distribution, the value is regarded as an outlier. The traditional approach in univariate time series anomaly detection is to use mainly handcrafted features to model patterns of normal and abnormal events [12]. Examples include SVD [13], wavelet analysis [9], ARIMA [14], and so on. Besides, Netflix released a method based on robust Principal Component Analysis [15], which was well received. Twitter also published a method that uses the Seasonal Hybrid Extreme Studentized Deviate test (S-H-ESD) [16]. In addition, the use of neural networks for detection has also made great progress [17]. Multivariate problems have multiple variables at each timestamp [18].
The existing multivariate time series anomaly detection methods can be divided into two categories: (1) univariate-based anomaly detection [15], where each sequence is monitored separately by a univariate algorithm and the results are aggregated to give the final judgment, and (2) direct anomaly detection [19], where multiple features are analysed at the same time. We focus on the second type of approach. Zong et al. [20] proposed a model that uses a deep autoencoder to generate a low-dimensional representation and the reconstruction error for each input data point, which are then fed into a Gaussian mixture model (GMM) for multivariate anomaly detection. The LSTM-VAE algorithm [7] is an LSTM-based encoder-decoder network that reconstructs the time series and uses the reconstruction error to detect abnormal conditions in some sensors. LSTM-NDT [5] is an unsupervised algorithm with nonparametric threshold selection. The objective of that work is to establish an anomaly detection system to monitor the data sent back by a spacecraft, which is labelled by experts in related fields.
Graph neural networks have been very popular in recent years and have made great progress in dealing with spatial dependencies among entities in a network. Gugulothu et al. [21] combined non-temporal dimension reduction techniques and recurrent autoencoders through an end-to-end learning framework for time series modelling. OmniAnomaly [22] proposes a stochastic recurrent neural network that captures the normal patterns of multivariate time series by modelling the data distribution with stochastic variables.

Problem Statement.
When analysing real-world datasets, a common requirement is to find out those instances that can be considered as outliers, which are significantly different from most other points.
The goal of the anomaly detection task is to find, in a data-driven way, the abnormal samples among all samples. In our work, we are concerned with multivariate data $X = \{x_1, x_2, \ldots, x_n\} \in \mathbb{R}^{m \times n}$; the value at time $i$ is $x_i \in \mathbb{R}^m$, $i = 1, 2, \ldots, n$, where $m$ is the number of variables and $n$ is the length of the data. Our target is to determine whether $x_t$ is an abnormal point. This is a time series problem with a large amount of data, and historical data is helpful for understanding the current moment $x_t$. To use and learn the information in $X$ efficiently, a sliding window of size $w$, $x_{t-w}, x_{t-w+1}, \ldots, x_{t-1}$, is used to predict the value of $x_t$ that would be considered normal. The difference between the predicted $x_t$ and the ground truth is passed to the threshold selection module: the larger the difference, the greater the possibility that $x_t$ is abnormal, and when the difference exceeds the threshold we set, we consider $x_t$ to be an anomaly.
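The windowing step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the helper name `sliding_windows` is hypothetical, and the series is stored time-major (one list of $m$ values per time step) for readability, whereas the text writes $X \in \mathbb{R}^{m \times n}$.

```python
from typing import List, Tuple

Vector = List[float]


def sliding_windows(series: List[Vector], w: int) -> List[Tuple[List[Vector], Vector]]:
    """Pair every target x_t with its history window x_{t-w}, ..., x_{t-1}.

    `series` holds n time steps, each a list of m variable values.
    """
    if w <= 0 or w >= len(series):
        raise ValueError("window size must satisfy 0 < w < len(series)")
    return [(series[t - w:t], series[t]) for t in range(w, len(series))]


# Toy series with n = 6 time steps and m = 2 variables (illustrative values only).
series = [[float(t), float(10 * t)] for t in range(6)]
pairs = sliding_windows(series, w=3)

for history, target in pairs:
    print(history, "->", target)
```

Each pair gives the forecasting model one input window and the ground-truth value it is compared against; the first $w$ points have no complete history and are therefore not scored.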

GAT (Graph Attention Network).
We know that many data lie in Euclidean space. The most significant characteristic of data in Euclidean space is that it has a regular spatial structure. For example, a picture is a regular square grid, voice data is a one-dimensional sequence, and so on. These data can be represented by a one-dimensional or two-dimensional matrix. However, many data in real life do not have a regular spatial structure, that is, they are data in non-Euclidean space, such as abstract graphs of electronic transactions, recommendation systems, social networks, and so on; each node in the graph is related to other nodes, and the connections are not fixed. Therefore, people use graph neural networks to model data in non-Euclidean spaces. In recent years, due to the strong expressiveness of graph structure, research on analysing graphs with machine learning methods has received more and more attention. The graph neural network (GNN) is a deep learning method for processing graph-structured information. Due to its good performance and interpretability, GNN has become a widely used graph analysis method. Commonly used graph neural networks include graph convolution networks, graph attention networks, and graph autoencoders. Among them, GAT [23] uses the attention mechanism to compute a weighted sum of the features of neighbouring nodes. The weights of the neighbouring node features depend entirely on the node features and are independent of the graph structure. In our model, to find the latent relationships between variables, we use GAT to calculate the correlation between nodes. The specific details are explained in Section 4.3.

GRU (Gated Recurrent Unit).
A recurrent neural network (RNN) is a kind of neural network that captures the dynamic information in sequential data through recurrent connections between nodes in the hidden layer. Unlike feedforward neural networks, an RNN can save the state of a context and can store, learn, and express relevant information over arbitrarily long context windows. It is no longer limited to the spatial boundaries of traditional neural networks and can be extended along the time dimension. Intuitively speaking, there is an edge between the nodes of the hidden layer at the current moment and those of the hidden layer at the next moment. However, the most significant drawback of RNNs is that they cannot learn to preserve and exploit older information because of vanishing and exploding gradients. Sepp Hochreiter and Jürgen Schmidhuber proposed long short-term memory (LSTM) in 1997 [24]. LSTM is a kind of recurrent neural network that alleviates this problem of RNNs to some extent. Practice shows that this method is very suitable for processing time series data. In fact, many variations of the LSTM algorithm have been developed in recent years. Rafal Jozefowicz et al. of Google conducted a comprehensive architecture search to evaluate over 10,000 different RNN/LSTM architectures [25]; they could not find an architecture with better performance than the GRU, and, except for language modelling, GRU works better than LSTM in other application scenarios. GRU (Gated Recurrent Unit) is a variant of LSTM that has fewer parameters and is more efficient than LSTM. Hence, our model chooses the GRU structure instead of LSTM.
Cho et al. [26] proposed the Gated Recurrent Unit (GRU) to enable each recurrent unit to adaptively capture dependencies at different time scales. Like classical recurrent neural networks, a GRU is a chain of neural units. In its mathematical description, $x_t$ and $h_{t-1}$ represent the input at the current time and the output of the previous time step, and $h_t$ is the output passed on to the next time step. $r_t$ is the reset gate, which controls how much information about the previous state is forgotten: the smaller the value of the reset gate, the more past information is discarded. $z_t$ is the update gate.
The update gate controls how information from the previous moment is carried into the current state: the larger its value, the more information from the current candidate state is kept and the less information from the previous state is retained. $[\cdot\,, \cdot]$ denotes the concatenation of two vectors, and $*$ is element-wise multiplication.
$\sigma$ is the commonly used sigmoid function, which maps values into the range between 0 and 1. We follow the usual practice of using the tanh function (hyperbolic tangent function) as the activation function of the hidden update:
$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. \quad (2)$$
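To make the roles of the two gates concrete, the following is a minimal sketch of one GRU step. It assumes the standard formulation of the GRU [26] written with the concatenation and element-wise product notation above, that is, $z_t = \sigma(W_z [h_{t-1}, x_t])$, $r_t = \sigma(W_r [h_{t-1}, x_t])$, $\tilde{h}_t = \tanh(W_h [r_t * h_{t-1}, x_t])$, and $h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$. Bias terms are omitted, and the names `gru_step`, `W_z`, `W_r`, and `W_h` are illustrative assumptions, not identifiers from the paper.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W: Matrix, v: Vector) -> Vector:
    """Multiply matrix W (rows of weights) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def gru_step(x_t: Vector, h_prev: Vector, W_z: Matrix, W_r: Matrix, W_h: Matrix) -> Vector:
    """One GRU step without biases (assumed standard formulation)."""
    concat = h_prev + x_t                                   # [h_{t-1}, x_t]
    z_t = [sigmoid(a) for a in matvec(W_z, concat)]         # update gate
    r_t = [sigmoid(a) for a in matvec(W_r, concat)]         # reset gate
    reset_state = [r * h for r, h in zip(r_t, h_prev)]      # r_t * h_{t-1}
    h_cand = [math.tanh(a) for a in matvec(W_h, reset_state + x_t)]
    # A large z_t keeps more of the candidate state and less of h_{t-1}.
    return [(1.0 - z) * h + z * c for z, h, c in zip(z_t, h_prev, h_cand)]


# Hidden size 2, input size 1; all-zero weights give z_t = r_t = 0.5 and a zero candidate.
zeros = [[0.0, 0.0, 0.0] for _ in range(2)]
h_t = gru_step([1.0], [0.4, -0.2], zeros, zeros, zeros)
print(h_t)
```

With all weights set to zero both gates equal 0.5 and the candidate state is zero, so the new state is half of the previous one, which illustrates how the update gate interpolates between the previous state and the candidate.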

Forecasting Model.
The overview of the proposed model is shown in Figure 2. First, to alleviate the possible noise effects in the original data $X$, a 1D convolution operation is carried out to smooth the data. The result of the convolution, $X_{CNN}$, is then fed into three identical blocks, which are shown as green boxes. Each block consists of a temporal convolution component connected in series with a graph attention network.

Temporal Convolution Component.
The temporal convolution module captures sequential patterns of time series data in the temporal dimension through 1D convolutional filters. To obtain a module that is able both to discover temporal patterns with various ranges and to handle long sequences, we use multiscale convolution filters [27]. However, choosing the correct filter size is a challenging problem. From the viewpoint of communication theory and image processing, the convolution kernel size is generally set to an odd number [28]. The reasons are as follows: compared with even sizes, odd-sized kernels have a center point and are more sensitive to edges and lines, so they can extract edge information more effectively and avoid the deviation of position information. In addition, an odd size ensures that the padding on the two sides of the image is symmetrical, so that the size of the output image is the same as the size of the input. Therefore, as shown in Figure 3, we select filters of sizes 1 × 3, 1 × 5, 1 × 7, and 1 × 9, which make up the temporal inception layer. The combination of these filters of different sizes can cover some periodic temporal signals, such as data of period 12: the model can pass the input through the 1 × 5 filter of the first temporal convolution layer and then through the 1 × 7 filter of the second temporal convolution layer.
The selection of small convolution kernels not only reduces the number of parameters but also adds more nonlinear mappings, which improves robustness. Finally, we pad the results of the different convolutions so that each is restored to the previous data size. The input of the temporal convolution (TC) component in block 2 is the average of the GAT's output and $X_{CNN}$. The input of the TC component in block 3 is the average of block 2's input (which includes $X_{CNN}$) and block 2's output.
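The effect of odd kernel sizes with symmetric padding can be checked with a small sketch. The code below is an illustration under stated assumptions, not the model's layer: the function names are hypothetical, each branch uses a fixed averaging kernel in place of learned weights, and the way the branch outputs are merged is left out because the text does not specify it.

```python
from typing import List, Sequence


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1D convolution (cross-correlation form) with symmetric zero padding.

    Only odd kernel sizes are accepted, so the padding is the same on both
    sides and the output has exactly the length of the input.
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel size must be odd for symmetric padding")
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(signal))]


def temporal_inception(signal: Sequence[float],
                       kernel_sizes: Sequence[int] = (3, 5, 7, 9)) -> List[List[float]]:
    """Run one branch per filter size; averaging kernels stand in for learned filters."""
    return [conv1d_same(signal, [1.0 / k] * k) for k in kernel_sizes]


# Toy signal with period 12 (illustrative values only).
signal = [float(t % 12) for t in range(24)]
branches = temporal_inception(signal)

for size, branch in zip((3, 5, 7, 9), branches):
    print(size, len(branch))  # every branch keeps the input length
```

Because every branch keeps the input length, the outputs of the four scales can be combined position by position, which is what makes mixing filters of different sizes straightforward.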

Graph Attention Network Component.
Multivariate time series anomaly detection is challenging because of the increase in the number of variables and in data volume. However, more variables also bring more information, which is critical for anomaly detection. Previous models did not pay attention to the feature pattern and focused only on the temporal pattern. Therefore, we combine the temporal pattern and the feature pattern in the model. Specifically, each block has a temporal convolution component that connects to a GAT. In GAT, each node in the graph can be assigned different weights based on the characteristics of its neighbouring nodes. Moreover, GAT does not require costly matrix operations and does not rely on a predefined graph structure.
The input to the graph attention layer is a set of node vectors $\{v_1, v_2, \ldots, v_n\}$, where $v_i$ has the same dimension as $x_i$. The GAT layer computes, for each node, an output $h_i$ with the same dimension as $x_i$ by weighting the features of its neighbouring nodes. $\alpha_{ij}$ is the correlation degree between $x_i$ and $x_j$ and is calculated as in (8), where $\oplus$ denotes the concatenation of two nodes and $w$ is a vector of learned parameters. Leaky ReLU is a nonlinear activation function, as shown in (7). $L$ denotes the number of nodes adjacent to $x_i$. The outputs of each GAT and $X_{CNN}$ (the original input $X$ after 1D convolution) have the same dimensions: they are three-dimensional tensors whose dimensions are batch size, window size, and number of variables, respectively. The outputs of the GATs in the three blocks and $X$ are concatenated along the third dimension of the tensor, which enriches the temporal information of the data and is conducive to prediction by the GRU. Finally, the results of the forecasting part are obtained by passing the GRU output through three fully connected layers.
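A minimal sketch of such an attention layer is given below. It assumes the standard GAT form of [23]: scores $e_{ij} = \mathrm{LeakyReLU}(w^{\top}(v_i \oplus v_j))$, weights $\alpha_{ij}$ obtained by a softmax over the neighbours of node $i$, and an output that is a nonlinearity applied to $\sum_j \alpha_{ij} v_j$. The fully connected graph, the Leaky ReLU slope of 0.2, the sigmoid output activation, and the name `gat_layer` are all assumptions made for illustration; they are not taken from the paper.

```python
import math
from typing import List, Tuple

Vector = List[float]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    """Leaky ReLU; the negative slope 0.2 is an assumed value."""
    return x if x > 0 else slope * x


def gat_layer(nodes: List[Vector], w: Vector, slope: float = 0.2) -> Tuple[List[Vector], List[Vector]]:
    """Return (outputs, attention) for a fully connected graph over `nodes`.

    `w` must have twice the node dimension because it scores v_i (+) v_j.
    """
    dim = len(nodes[0])
    if len(w) != 2 * dim:
        raise ValueError("w must have length 2 * node dimension")
    outputs, attention = [], []
    for v_i in nodes:
        scores = [leaky_relu(sum(a * b for a, b in zip(w, v_i + v_j)), slope) for v_j in nodes]
        peak = max(scores)                          # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]           # softmax: each row sums to 1
        mixed = [sum(alpha[j] * nodes[j][d] for j in range(len(nodes))) for d in range(dim)]
        outputs.append([1.0 / (1.0 + math.exp(-u)) for u in mixed])  # assumed sigmoid output
        attention.append(alpha)
    return outputs, attention


# Three nodes (variables), each described by a length-2 vector (illustrative values only).
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attention = gat_layer(nodes, w=[0.5, -0.5, 1.0, 0.0])
for row in attention:
    print([round(a, 3) for a in row])
```

The rows of `attention` play the role of the correlation degrees $\alpha_{ij}$: they are the quantities that can later be inspected to compare feature correlations at normal and abnormal times.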

4.4. Threshold Selection Model.
The loss function of the prediction model is the root mean square error (RMSE) between $\hat{y}_{t,i}$, the predicted value of the $i$-th feature at time $t$, and $x_{t,i}$, the real value at the same time. The RMSE between them is the loss at time $t$. The test set is fed to the trained forecasting model, and the RMSE loss between the predicted value and the true value at each observation point in the test set is recorded as $\{l_1, l_2, \ldots, l_Q\} \in \mathbb{R}^Q$. The POT (peaks over threshold) model of EVT (extreme value theory) is then used to select the threshold for this sequence.
Extreme value theory is a statistical theory for finding the law of extreme values in a sequence. It is generally believed that extreme values are the outliers to be found in the problem of anomaly detection, and they are located at the tail of the distribution in most cases. The advantage of extreme value theory is that it does not need to assume the data distribution and the threshold can be set automatically through parameter selection.
The second theorem of EVT, POT, shows that the excesses of samples over a threshold follow a generalized Pareto distribution (GPD). Therefore, we select the threshold through POT, where $th$ is the initial threshold, $\gamma$ denotes the shape parameter of the GPD, $\beta$ is its scale parameter, and $L = \{l_1, l_2, \ldots, l_Q\}$. $L - th$ represents the part above the threshold. $th$ is a quantile chosen empirically. Similar to literature [10], we utilize maximum likelihood estimation (MLE) to estimate the parameters $\hat{\gamma}$ and $\hat{\beta}$, from which the final threshold $th_F$ is calculated. $q$ is the proportion of $L > th$, $Q$ is the number of observed values, and $Q_{th}$ denotes the number of $L > th$. Selecting the threshold with POT requires a process of parameter adjustment.
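The per-time-step loss can be sketched as below. The sketch assumes the usual definition of RMSE with a mean over the $m$ features, $l_t = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_{t,i} - x_{t,i})^2}$; whether the original loss normalizes by $m$ is an assumption, and the name `rmse_per_step` is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def rmse_per_step(predicted: List[Vector], actual: List[Vector]) -> List[float]:
    """Return l_t for every time step t: the RMSE over the m features at that step."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must cover the same time steps")
    losses = []
    for y_hat_t, x_t in zip(predicted, actual):
        squared = sum((p - a) ** 2 for p, a in zip(y_hat_t, x_t))
        losses.append(math.sqrt(squared / len(x_t)))
    return losses


# Two time steps with m = 2 features (illustrative values only).
predicted = [[1.0, 2.0], [0.0, 0.0]]
actual = [[1.0, 2.0], [3.0, 4.0]]
losses = rmse_per_step(predicted, actual)
print(losses)
```

The resulting list corresponds to the sequence $l_1, \ldots, l_Q$ on which the threshold is selected.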
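For orientation, the final threshold can be sketched with the commonly used POT quantile estimate, $th_F \simeq th + \frac{\hat{\beta}}{\hat{\gamma}}\left(\left(\frac{qQ}{Q_{th}}\right)^{-\hat{\gamma}} - 1\right)$. This closed form is an assumption taken from the general POT literature rather than a formula confirmed by this section, the parameter estimates are passed in directly instead of being fitted by MLE, and the function name `pot_threshold` is hypothetical.

```python
import math


def pot_threshold(th: float, gamma: float, beta: float, q: float, Q: int, Q_th: int) -> float:
    """Final threshold th_F from the assumed POT quantile estimate.

    th     initial threshold (an empirical quantile of the losses)
    gamma  estimated GPD shape parameter
    beta   estimated GPD scale parameter
    q      probability level used for the final threshold
    Q      number of observed loss values
    Q_th   number of loss values above th
    """
    if Q <= 0 or Q_th <= 0 or not 0.0 < q < 1.0 or beta <= 0.0:
        raise ValueError("expected Q > 0, Q_th > 0, 0 < q < 1 and beta > 0")
    ratio = q * Q / Q_th
    if abs(gamma) < 1e-12:
        # Limit of the formula as gamma -> 0 (exponential tail).
        return th - beta * math.log(ratio)
    return th + (beta / gamma) * (ratio ** (-gamma) - 1.0)


# Illustrative numbers only: 1000 losses, 40 of them above the initial threshold 1.0.
th_F = pot_threshold(th=1.0, gamma=0.5, beta=2.0, q=0.01, Q=1000, Q_th=40)
print(th_F)
```

A loss $l_t$ above `th_F` would then be reported as an anomaly; smaller values of `q` push the final threshold further into the tail.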

Benchmarks and Evaluation Metrics.
Regarding datasets, we use three real-world datasets to verify the effectiveness of MTAD-TF, namely, MSL (Mars Science Laboratory rover), SMAP (Soil Moisture Active Passive satellite), and SMD (Server Machine Dataset). MSL and SMAP are two public datasets from NASA spacecraft [29].
SMD [22] contains five weeks of server data from a large Internet company and has been published on GitHub. SMD is divided into two parts of the same size: the first part is the training set and the second part is the testing set. The abnormal data in the testing set has been labelled by experts in related fields. The training set and the testing set each contain 28 groups, which need to be trained and tested separately. That is, the model trained on the first group of the training set is tested on the same group of the testing set. The final score is the average over the 28 groups.
The details of the three datasets are given in Table 1, including the number of variables, the sizes of the training set and testing set, the proportion of abnormal samples in the testing set, and some of the variable names.
Regarding metrics, we follow the evaluation metrics typical of other anomaly detection models: precision, recall, and F1 score. They are defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$
Among them, TP is the number of true positives (correctly detected anomalies), FP is the number of false positives (normal points falsely detected as anomalies), and FN is the number of false negatives (anomalies falsely detected as normal). The higher the values of the above three indicators, the better the detection performance of the model.
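The three metrics can be computed from point-wise labels as in the following sketch. The function name is hypothetical, the labels are illustrative, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the paper.

```python
from typing import Sequence, Tuple


def precision_recall_f1(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[float, float, float]:
    """Point-wise precision, recall and F1, where True marks an anomaly."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Five time steps: TP = 2, FP = 1, FN = 1.
predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
precision, recall, f1 = precision_recall_f1(predicted, actual)
print(precision, recall, f1)
```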

Baselines for Comparison.
This section shows the comparison results with the other 4 baselines on 3 benchmarks.
The compared models include LSTM-NDT [5], LSTM-VAE [7], DAGMM [20], and OmniAnomaly [22]:
(i) LSTM-NDT: LSTM is used for anomaly detection of multidimensional time series, together with a dynamic and unsupervised method for determining the threshold. Besides, a "pruning strategy" is proposed to reduce the false positive rate and identify false positive data.
(ii) LSTM-VAE: the feedforward network of the VAE is replaced with LSTM, but the dependence between stochastic variables is not considered.
(iii) DAGMM: a compression network, an estimation network, and a Gaussian mixture model are combined for unsupervised anomaly detection.
(iv) OmniAnomaly: the core idea is to learn latent representations that capture the normal patterns of multivariate time series while considering both temporal dependence and stochasticity.
Table 2 summarizes the evaluation results of our model and all the baselines. Our model shows excellent generalization capability and achieves the best F1 score on 4 datasets.
LSTM-NDT has a high score on SMAP, but it performs poorly on MSL and SMD, which shows that the model is very sensitive to different scenarios. Our model is stable and performs well on the different benchmarks.
Short-term information is also very important for multivariate time series. DAGMM's performance is not ideal because it does not consider short-term information. We utilize multiscale convolution, which can better adapt to data with different periods. This article also conducts additional ablation experiments (see Section 5.3) to compare the effectiveness of the different components in our model.
OmniAnomaly applies a stochastic model that regards the variables as stochastic variables and learns their distribution, and it achieves high performance on the three datasets.
The limitation of this model is that it does not consider the relationship between the variables.

Ablation Study.
To illustrate the necessity and effectiveness of the core components in the forecasting part, we conduct an ablation study on the four datasets to validate that the multiscale convolution, GAT, and GRU contribute to the improved outcomes of our proposed model. Firstly, we name the variants of MTAD-TF without individual components as follows:
(i) w/o temporal: the multiscale convolution processing in the temporal pattern is removed, so only GAT is left in each block.
(ii) w/o GAT: the GAT processing in the feature pattern is removed, so only the temporal pattern is left in each block.
(iii) w/o GRU: the GRU is removed, which means that $X_{CNN}$ and the outputs of the three blocks are fed directly to the FC layer.
As Table 3 shows, different components have different effects on different benchmarks. For MSL and SMD, removing GAT makes the F1 score drop the most, while SMAP is most affected by the temporal convolution component. The score on EEG-EYE does not decrease much, but it is reduced to varying degrees.

Case Study.
In this part, we carry out a case analysis of a noise experiment on the EEG-EYE state data and a case analysis of GAT.
The EEG (electroencephalogram) EYE state dataset is from UCI. It consists of one continuous EEG measurement with the Emotiv EEG Neuroheadset and is used to study the relationship between 13 EEG signals at different positions of the human brain and the opening and closing of human eyes. Therefore, EEG-EYE state is a dataset that can be classified into two categories. We regard the open-eye label as the anomaly to be searched for and then perform anomaly detection on it.

Noise Experiment.
To understand the antinoise ability of the model, we carried out a case analysis with a noise-adding experiment. Five kinds of Gaussian white noise with a mean value of 0 and variances of {0.1, 0.2, 0.3, 0.4, 0.5} were added to the training set, respectively. Then the trained model was tested with the unchanged test set, and the F1 values obtained are shown as the blue broken line in Figure 4. As the variance of the Gaussian noise increases, the score shows a downward trend, which conforms to common sense. However, it also indicates that the model is still not robust enough and that the addition of noise does not, in general, play a role in data enhancement. The effect of variance 0.02 is better than that of variance 0.01. Compared with variance 0.01, the noise of variance 0.02 increases the difficulty of network training, prevents overfitting, and improves the generalization ability, which can be regarded as an effect of data enhancement.
According to the verification in literature [10], one-dimensional convolution has the effect of smoothing data. From another perspective, we illustrate the function of 1D convolution with experimental scores by adding a contrast experiment to the above pure noise experiment: noise with different variances is added to the model with the 1D convolution removed. As shown by the orange broken line in Figure 4, compared with the score in the pure noise experiment, the score without convolution drops significantly, indicating that the convolution can reduce the impact of noise during data preprocessing.
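The noise-adding step can be sketched as follows. This is only an illustration of adding zero-mean Gaussian white noise of a given variance to a training set: the function name, the seeding scheme, and the toy data are assumptions, and the standard deviation passed to the generator is the square root of the variance.

```python
import math
import random
from typing import List

Vector = List[float]


def add_gaussian_noise(series: List[Vector], variance: float, seed: int = 0) -> List[Vector]:
    """Return a noisy copy of `series` with zero-mean Gaussian noise of the given variance."""
    if variance < 0:
        raise ValueError("variance must be non-negative")
    rng = random.Random(seed)          # fixed seed so the experiment is repeatable
    std = math.sqrt(variance)
    return [[value + rng.gauss(0.0, std) for value in row] for row in series]


# Toy training set: 4 time steps, 2 variables (illustrative values only).
train = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]
variances = [0.1, 0.2, 0.3, 0.4, 0.5]
noisy_sets = {v: add_gaussian_noise(train, v, seed=i) for i, v in enumerate(variances)}

for v, noisy in noisy_sets.items():
    print(v, noisy[0])
```

One model would then be trained per noisy training set and evaluated on the unchanged test set, giving one F1 value per variance.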

GAT.
We extracted from the GAT the feature correlations at an abnormal time and at a normal time just before the abnormality, and drew the heat maps in Figure 5. The right side of Figure 5 shows the correlation between feature 1 and features 2, 3, 4, 5, 6, 16, 17, 18, 19, 20, and 21 at the normal time, while the left side shows the correlation at the abnormal time. The darker the color block, the higher the correlation between features, and vice versa. On the same horizontal line, a large color difference between the left and right sides means that, when an abnormality occurs, the correlation between features changes greatly, which can be used as a partial basis for abnormal location. Due to the lack of information about abnormal location in the dataset, further experimental verification cannot be carried out. However, it can be assumed that when an abnormality occurs, the correlation between certain features is significantly different from that under normal conditions.

Conclusions
In this paper, a new multivariate time series anomaly detection framework, MTAD-TF, is proposed. By using both the temporal pattern and the feature pattern of multiple time series to make a joint prediction, more latent information can be obtained than with a single-pattern model. The method is superior to the other four baselines on the three common datasets. In addition, the model has a good antinoise ability, and the GAT may help with abnormal location. Future work may come from two aspects. First, combining the prediction model with a reconstruction model may further improve the accuracy of the model. Second, there is too little information on abnormal location, and it is hoped that further abnormal location experiments can be carried out to improve the robustness of MTAD-TF.

Figure 4: Noise experiment. The green line is the score of the original model without any processing; the value is as high as 0.945. The blue line is the score with added noise of different variances. The orange line is the score of the model with added noise and without the 1D convolution in data preprocessing.

Figure 5: Heat map of the correlation between variables. The numbers on the color bar are color values, not the correlations between the features.

Table 2: Performance of our model and baselines.