Behavior Pattern Mining from Traffic and Its Application to Network Anomaly Detection

Accurately detecting and identifying abnormal behaviors on the Internet is a challenging task. In this work, an anomaly detection scheme is proposed that employs a behavior attribute matrix and an adjacency matrix to characterize user behavior patterns; anomaly detection is then conducted by analyzing the residual matrix. By analyzing network traffic and anomaly characteristics, we construct the behavior attribute matrix, which incorporates seven features that characterize user behavior patterns. To include the effects of the network environment, we employ the similarity between IP addresses to form the adjacency matrix. Further, we employ CUR matrix decomposition to mine the changing trends of the matrices and obtain the residual pattern characteristics used to detect anomalies. To validate the effectiveness and accuracy of the proposed scheme, two datasets are used: (1) the public MAWI dataset, collected from the WIDE backbone network, which is used to validate accuracy; (2) a campus network dataset, collected from the northwest center of the Chinese Education and Research Network (CERNET), which is used to verify practicability. The experimental results demonstrate that the proposed scheme can not only accurately detect and identify abnormal behaviors but also trace the source of anomalies.


Introduction
User behavior profiling, along with anomaly detection in network traffic, plays an important role in network management and helps keep the network under control. Users' normal behavior patterns are typically stable and routine over long periods, whereas abnormal behaviors cause unexpected changes to these patterns. Therefore, behavior pattern changes can be used for anomaly detection. Features extracted from raw traffic packets, such as the total number of packets or flows in a specific time window, are usually used to capture the dynamic changes in behavior patterns, and machine learning methods are then applied to mine the abnormal changes [1-5].
These methods are effective for detecting the obvious changes caused by abnormal behaviors. However, attack technologies are becoming more and more intelligent, and anomalies may cause only slight changes in traffic patterns. Meanwhile, traffic volume continues to increase, so characterizing user behavior and accurately identifying anomalies in massive network traffic remain challenging tasks for network security monitoring.
Accurate behavior pattern characterization is the foundation of anomaly detection. Many techniques have been proposed in the past decade, such as statistical analysis [6-8], data mining [9,10], and machine learning [11,12]. Nearly all of those methods extract attributes only from network traffic, such as the number of packets in a specific time window, without considering the effect of the network environment; however, we find that the network environment is another important factor for smart attack detection. As we know, the botnet is one of the well-known smart attacks to appear in recent years, and one method for detecting botnets is to analyze the "co-occurrence" behavior patterns of hosts in one subnet [13-16]: some hosts in one subnet exhibit similar access patterns, such as always accessing the same URL at the same time, and those hosts may be infected by bots. In this study, we employ the adjacency matrix to capture this kind of pattern.
On the other hand, identifying anomalies in massive traffic patterns is a typical needle-in-a-haystack (NIHA) problem. Matrix decomposition is an effective tool for anomaly detection [17,18]: it divides the massive patterns into two parts, one corresponding to the major pattern in the original data and the other to the abnormal changes, which makes it well suited to anomaly detection in today's networks. Thus, we can use a matrix decomposition technique to distinguish normal from abnormal traffic.
To this end, we propose an anomaly detection scheme in this study, which jointly employs the attribute matrix and the adjacency matrix to characterize user behavior patterns and network environment characteristics and uses matrix decomposition to identify anomalies. We extract seven statistical features from network traffic to construct an attribute matrix capturing user behavior characteristics in the time window ΔT. We also employ the similarity of specific IP addresses to construct an adjacency matrix capturing the behavior characteristics related to the network environment. The attribute matrix and the adjacency matrix together constitute a model that describes user behavior characteristics accurately. Then, we employ CUR matrix decomposition [19] to mine the major behavior pattern from the joint model and obtain a residual matrix, which can be used to identify anomalies.
We employ two kinds of traffic datasets to verify the effectiveness and accuracy of our method. The first is the public MAWI dataset [20], collected from the WIDE backbone network, a trans-Pacific transit link between Japan and the USA. This dataset is labeled and is used to evaluate the performance of our proposed method. The second dataset is collected from the northwest center of CERNET.
The users in the monitored network include students, faculty members, and contract personnel from service-providing companies. The behavior patterns contained are sufficiently complex and can be employed to measure the practicability of our method. The experimental results on the two datasets show that the proposed method achieves an anomaly detection accuracy higher than 90% without any prior knowledge. Furthermore, our method can trace anomalies for efficient network management.
Our contributions to this study can be summarized as follows. The remainder of this study is organized as follows: Section 2 presents the related work; Section 3 outlines the motivations and design goals; the feature definitions and framework description are presented in Section 4; Section 5 gives a detailed description of the anomaly detection model; the experimental results and analysis are presented in Sections 6 and 7; and the conclusion follows in Section 8.

Related Work
The goal of anomaly detection is to find the rare occurrences that do not conform to the patterns of the majority of the data [21,22]; it has been widely applied in many fields, including security, finance, health care, and social networks [23-27]. A variety of techniques have been proposed for identifying anomalies, which can be divided into two categories: supervised and unsupervised anomaly detection methods [28]. The works related to ours are summarized as follows.
Supervised anomaly detection techniques usually require a labeled dataset for model training. A support vector machine (SVM) can classify samples as normal or anomalous by maximizing the classification margin. Kong et al. [29] designed an abnormal traffic identification system (ATIS) based on SVM. Gu et al. [30] proposed an intrusion detection framework based on an SVM ensemble classifier with increasing feature selection. Naive Bayes is another simple and effective tool for detecting anomalies, and many algorithms based on Bayes' theorem have been proposed. Swarnkar et al. [31] proposed a naive Bayesian classifier based on packet payload analysis to detect HTTP attacks. Han et al. [32] developed a naive Bayesian model for network intrusion detection based on principal component analysis (PCA). Nie et al. [33] designed a Bayesian network to model the causal relationships between network entries. Neural networks (NNs) are also widely used for anomaly detection, as they can increase the accuracy of anomaly detection systems. Hodo et al. [34] employed packet traces to train an artificial neural network to detect DDoS attacks. Kwon et al. [35] used a convolutional neural network (CNN) to detect anomalies, which selects traffic features automatically from the raw dataset. A recurrent neural network (RNN) was proposed in [36] to learn temporal behaviors in large-scale network traffic data.
Those methods are effective for identifying anomalies given accurately labeled datasets; however, high-quality labeled datasets are very difficult to construct in today's networks.

Security and Communication Networks
Unsupervised anomaly detection techniques are widely used nowadays, as they do not require a labeled dataset for model training. K-means is one basic approach to unsupervised anomaly detection [37]. The authors in [38] used K-means to cluster network connections into normal and anomalous communities. However, it is difficult to select a suitable k, as it depends on the application and environment. Recently, Chen et al. [39] proposed a convolutional autoencoder (CAE)-based anomaly detection model. Said Elsayed et al. [40] proposed a hybrid approach based on a long short-term memory (LSTM) autoencoder and a one-class support vector machine (OC-SVM) to detect anomalies. Although these methods achieve high accuracy, it is difficult to obtain a clear explanation of their results; furthermore, it is hard to trace the anomalies and apply control policies. Principal component analysis (PCA) is another unsupervised method used for anomaly detection, which can capture the normal and abnormal behaviors of the data by projecting the data instances onto the principal components [41]. Wang and Battiti [42] proposed an intrusion detection method combining PCA with SVD, which identifies intrusions based on the error between the original data vector and its reconstruction. However, it is not efficient for interpretation, as the principal components are a linear combination of all original variables [17]. To interpret the results, the work in [43] introduced a new method, named sparse principal component analysis (SPCA), to produce modified principal components with sparse loadings. Although this method improves interpretability, a linear relationship still exists between the principal components and the original variables, whereas the variables usually do not hold a linear relationship. Sample-based matrix decomposition methods were proposed to deal with those problems; they select rows or columns from the original matrix to form the low-rank matrices. Kumar et al. [44] proposed CUR matrix decomposition to make the decomposition process interpretable. However, the decomposition process occupies a large amount of memory. Sun et al. [45] proposed a new method named compact matrix decomposition (CMD), which avoids repeated selection and, in turn, reduces the computational complexity. However, this method needs to seek a non-orthogonal basis by sampling the columns and/or rows of the original matrix, which produces over-complete bases. Tong et al. [46] proposed the Colibri method to deal with these challenges; it iteratively finds a non-redundant basis and accordingly saves space and time. However, it fails to improve the accuracy compared with CUR and CMD matrix decomposition.
Inspired by the related works, we propose an anomaly detection method based on matrix decomposition. By combining the advantages of CUR decomposition and network characteristics, the developed method can not only detect known and unknown anomalies but also trace the source of the anomalies.

Basic Assumption and Its Verification.
To capture the characteristics of the network environment, we assume that users who hold IP addresses with the same prefix have similar behavior patterns, and we verify this assumption from the following three aspects.
Firstly, we analyze the general principle of IP address assignment. Generally, IP addresses have no relationship with user behavior patterns; however, for convenient management, network administrators usually assign IP addresses with the same prefix to users in one specific area. The IP address assignment process can be summarized as follows: (1) the Internet Assigned Numbers Authority (IANA) allocates IP address pools to the five regional Internet registry (RIR) organizations. (2) The regional organizations allocate IP addresses to different Internet service providers (ISPs). (3) The IP address blocks are assigned to different countries by the ISPs. (4) Network administrators assign IP address blocks to different areas when they construct their local area networks (LANs). Based on the above analysis, we can find that IP addresses with the same prefix are often assigned to the same area. Furthermore, users in the same area often have similar behavior patterns, as they have similar network requirements. Thus, we can conclude that users who hold IP addresses with the same prefix are likely to hold similar behavior patterns.
Secondly, some researchers working on traffic pattern profiling have also found that users who hold IP addresses with the same prefix have similar behavior patterns. Jiang found that the traffic behavior of addresses with the same prefix often remains stable over time, which can be used for anomaly detection [47]. Xu found that hosts with the same network prefixes behave similarly across different Internet applications [48,49]. Jiang found that the behavior similarity captured by aggregated flows with the same network prefixes can be used to construct an anomaly identification mechanism [50]. Those works further support the assumption.
Thirdly, we analyze the behavior patterns of IP addresses with the same p-bit prefixes in the MAWI and CERNET datasets. We randomly select three IP blocks, and the results are shown in Figure 1, where (a)-(c) are the results for the MAWI dataset and (d)-(f) for the CERNET dataset; (a) and (d) correspond to p = 8, (b) and (e) to p = 16, and (c) and (f) to p = 24. From the figure, we can find that users who hold IP addresses with the same prefix have similar behavior patterns, especially in the CERNET dataset.

Design Goals.
Based on the above analysis, we mainly focus on developing a new anomaly detection method that is effective at mining the anomalies in today's networks. The design goals are listed as follows: (1) Improve the Management Efficiency: to control anomalies, tracing them is important and necessary, so the IP address should be retained during the detection process. In our method, we regard each specific IP address as a column index of the attribute matrix, which makes tracing abnormal IP addresses easy.

(2) Improve the Detection Accuracy: to develop an accurate anomaly detection model that takes the network environment into account, we construct an adjacency matrix to describe it. The adjacency matrix is made up of pairwise IP address similarity degrees, calculated from the binary representations of the IP addresses. (3) Improve the Practicality: the designed method should be deployable on most enterprise networks without new hardware components, and the features used should be easy to extract. Furthermore, the method should be sensitive to special abnormal behaviors that cause only slight changes, so as to detect new anomalies.

Firstly, we extract features from the packet headers; the information in the packet headers is shown in Table 1. Secondly, we analyze the characteristics of typical attacks; the results are shown in Table 2. From the table, we can find that different attacks lead to obvious changes in the statistics of the packet-header attributes; in turn, the extracted features change, and those changes can be used to detect anomalies.

Feature Definition and Framework Description
Finally, based on the above analysis, we can find that different network attack behaviors lead to changes in different attributes. Their definitions are shown in Table 3. We employ the port scan attack as an example to verify the efficiency of the extracted features. Port scanning is usually employed by hackers to find vulnerable hosts and ports; a simple example is shown in Figure 2. The attacking host sends connection requests to the ports of a single destination host, and if a destination port is open and the host provides the corresponding service, the hacker receives a response. This kind of behavior pattern is typical and representative of smart attacks, and it can easily be captured by the OD, NDDA, and NDDP features defined above.
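The port-scan pattern above can be sketched as a per-source count of distinct destinations. This is only an illustrative interpretation: the exact OD, NDDA, and NDDP definitions live in Table 3 (not reproduced here), and here NDDA/NDDP are assumed to mean the number of distinct destination addresses and distinct destination (address, port) pairs contacted by each source in one time window.

```python
from collections import defaultdict

def port_scan_indicators(flows):
    """Compute per-source indicators of port-scan behavior.

    `flows` is an iterable of (src_ip, dst_ip, dst_port) tuples observed
    in one time window. NDDA/NDDP are interpreted here as the number of
    distinct destination addresses / (address, port) pairs per source.
    """
    dst_addrs = defaultdict(set)
    dst_ports = defaultdict(set)
    for src, dst, dport in flows:
        dst_addrs[src].add(dst)
        dst_ports[src].add((dst, dport))
    return {src: {"NDDA": len(dst_addrs[src]),
                  "NDDP": len(dst_ports[src])}
            for src in dst_addrs}

# A vertical port scan: one source probing many ports on one host,
# producing a low NDDA but a very high NDDP.
flows = [("10.0.0.5", "10.0.0.9", p) for p in range(20, 120)]
stats = port_scan_indicators(flows)
```

A source with NDDP far above its NDDA matches the vertical scan in Figure 2; the reverse imbalance would suggest a horizontal scan across hosts.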

Network Environment Feature Definition.
We employ an adjacency matrix composed of the pairwise similarities between the IP addresses of the monitored network to capture the characteristics of the network environment. An example of the similarity calculation between two IP addresses is presented in Table 4. We apply XOR to the binary representations of the two IP addresses, locate the first 1 from left to right in the result (the first differing bit), set that bit and all bits to its right to 0, and set all preceding bits to 1, obtaining a mask m. The distance d is defined as the number of 1s in m, that is, the length of the common prefix of the two addresses. The similarity s is defined as the distance d divided by 32; its maximum value is 1, which means the two IP addresses are identical, while its minimum value is 0, which means the two IP addresses differ in their first bit. If the similarity is larger than a specific threshold, the entry for the two IP addresses in the adjacency matrix is set to 1; the construction rule is shown as follows, where i and j represent any two IP addresses and thresh denotes the predefined threshold. We use a campus network to illustrate the features extracted to characterize the network environment. A simple topology of a typical campus network is shown in Figure 3. Users in the same area often have similar behavior characteristics because they have similar roles: users in the office area, who are teachers, often engage in teaching-related activities, while users in the dormitory area, who are students, often engage in entertainment-related activities after classes.
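A minimal sketch of this similarity rule, assuming IPv4 addresses: the mask-based procedure above reduces to the length of the common binary prefix divided by 32, and the adjacency entry is 1 when the similarity exceeds the threshold (self-pairs are excluded here as an implementation choice).

```python
import ipaddress

def ip_similarity(ip1, ip2):
    """Similarity of two IPv4 addresses: length of their common binary
    prefix divided by 32, matching the XOR-based rule above."""
    a = int(ipaddress.IPv4Address(ip1))
    b = int(ipaddress.IPv4Address(ip2))
    x = a ^ b
    if x == 0:
        return 1.0                  # identical addresses
    first_diff = x.bit_length()     # position of first differing bit
    d = 32 - first_diff             # common-prefix length
    return d / 32

def adjacency(ips, thresh=0.9):
    """Binary adjacency matrix: A[i][j] = 1 iff similarity > thresh."""
    n = len(ips)
    return [[1 if i != j and ip_similarity(ips[i], ips[j]) > thresh else 0
             for j in range(n)] for i in range(n)]
```

With thresh = 0.9, two addresses must share at least a 29-bit prefix to be adjacent, so only hosts in the same small subnet are linked.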

Anomaly Detection Framework Developed.
The detailed description of the scheme is illustrated in Figure 4; it consists of four steps. The main symbols used in this study are summarized in Table 5.
Step 1. Traffic Matrix Construction. We employ IP addresses and their features to build an attribute matrix describing the traffic patterns in the time window ΔT. Let V = {sip_1, sip_2, ..., sip_n} denote a set of n IP addresses, and let {f_1, ..., f_d} denote the d-dimensional features per IP address. At the same time, we construct an adjacency matrix using IP address similarity as a constraint to capture the pattern relationships between IP addresses. We combine the attribute matrix and the adjacency matrix to design the anomaly detection model.
Step 2. Matrix Decomposition. We employ CUR matrix decomposition to select several representative features and IP addresses from the attribute matrix X to build a reconstructed attribute matrix X̂, keeping X̂ as similar as possible to X.
Step 3. Residual Matrix Calculation. The difference between the attribute matrix X and the reconstructed matrix X̂ is referred to as the residual matrix, defined as R = X − X̂.
Step 4. Anomaly Detection. The residual matrix R presents the pattern changes in each feature of the IP addresses. We calculate the sum of each column of the residual matrix, which represents the total pattern change for each IP address, and rank the columns in descending order. IP addresses with larger values are regarded as anomalies.
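Step 4 can be sketched in a few lines. One assumption is made beyond the text: the column sums are taken over absolute values, since residual entries may be negative and changes in either direction should count toward the anomaly score.

```python
import numpy as np

def rank_anomalies(R, ips):
    """Rank IP addresses by total pattern change in the residual matrix.

    R has one row per feature and one column per IP address. Column sums
    (of absolute values, an assumption here) aggregate the change over
    all features, as in Step 4 above.
    """
    scores = np.abs(R).sum(axis=0)      # total change per IP address
    order = np.argsort(-scores)         # descending order
    return [(ips[i], float(scores[i])) for i in order]

# Toy residual: 2 features, 3 IP addresses; the second column dominates.
R = np.array([[0.1, 2.0, 0.0],
              [0.0, 1.5, 0.1]])
ranked = rank_anomalies(R, ["10.0.0.1", "10.0.0.2", "10.0.0.3"])
```

The top-ranked addresses (or those whose score exceeds the threshold θ introduced later) are reported as anomalies, with the IP address preserved for tracing.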

Model for Attribute Matrix Decomposition.
The seven defined traffic features form an attribute matrix X, as the following formula shows, where the columns are the specific IP addresses, the rows are the seven features, and each element is the statistical value of a specific feature in the time window ΔT:

X = [ X(1,1)  X(1,2)  ...  X(1,n)
      X(2,1)  X(2,2)  ...  X(2,n)
      ...     ...     ...  ...
      X(7,1)  X(7,2)  ...  X(7,n) ]

We employ CUR matrix decomposition to obtain the normal user behavior patterns from the original attribute matrix X: several rows and columns are selected from X to reconstruct an attribute matrix, and the reconstructed matrix indicates the major pattern of the original one. The process can be formulated as follows, where W = CUR ∈ R^{n×d}, with C ∈ R^{n×n}, U ∈ R^{n×d}, and R ∈ R^{d×d} three low-rank matrices; R indicates the residual matrix; α and β control the row and column sparsity of the matrix W; and γ controls that of the matrix R. If C is a diagonal matrix with k elements equal to 1 and the others 0, then XC keeps k columns of X unchanged and sets the other n − k columns to zero vectors. Similarly, WX can be regarded as a coefficient matrix, and X(:, i) is chosen as a representative source IP address when the corresponding row of W is not a zero vector. The ‖W‖_{2,1} term ensures that only a few source IP addresses are chosen.
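The CUR idea can be illustrated with a generic norm-sampling sketch. Note the caveats: this is the textbook CUR scheme (sample actual columns C and rows R of X, then solve a least-squares linking matrix U), not the paper's sparse regularized model with α, β, and γ; the sampling probabilities and seed are illustrative choices.

```python
import numpy as np

def cur_decompose(X, k, seed=0):
    """Plain CUR sketch: sample k columns and up to k rows of X with
    probability proportional to their squared Euclidean norms, then
    solve for the small linking matrix U so that C @ U @ R ~ X.
    This is generic CUR, not the paper's exact sparse optimization.
    """
    rng = np.random.default_rng(seed)
    col_p = (X ** 2).sum(axis=0) / (X ** 2).sum()   # column weights
    row_p = (X ** 2).sum(axis=1) / (X ** 2).sum()   # row weights
    cols = rng.choice(X.shape[1], size=k, replace=False, p=col_p)
    rows = rng.choice(X.shape[0], size=min(k, X.shape[0]),
                      replace=False, p=row_p)
    C, R = X[:, cols], X[rows, :]                   # real columns/rows
    U = np.linalg.pinv(C) @ X @ np.linalg.pinv(R)   # least-squares link
    return C, U, R

# A 7 x 50 attribute matrix: 7 features, 50 IP addresses.
X = np.random.default_rng(1).random((7, 50))
C, U, R = cur_decompose(X, k=10)
residual = X - C @ U @ R    # pattern changes left unexplained by CUR
```

Because C and R are actual columns and rows of X, the selected IP addresses and features remain directly interpretable, which is the property the anomaly-tracing argument relies on.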

Model for Adjacency Matrix Decomposition.
We use the adjacency matrix A to capture the characteristics of the network environment; one simple example is shown as follows, where the rows and columns are the specific IP addresses. If two users hold IP addresses with high similarity, they should have similar behavior patterns in the residual matrix R. The matrix decomposition model connecting the residual matrix and the adjacency matrix is shown as follows, where A is the adjacency matrix and L is its Laplacian matrix.
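The Laplacian L mentioned above is the standard unnormalized graph Laplacian of A, sketched below; zeroing the diagonal of A first is an implementation choice to discard self-similarity.

```python
import numpy as np

def laplacian(A):
    """Unnormalized graph Laplacian L = D - A, where D is the diagonal
    degree matrix of the adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    np.fill_diagonal(A, 0.0)            # discard self-similarity
    D = np.diag(A.sum(axis=1))          # degree of each IP address
    return D - A

# Three IP addresses: the middle one is similar to both neighbors.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
L = laplacian(A)
```

Penalizing tr(R L R^T) in the decomposition smooths the residual columns of similar IP addresses toward each other, which encodes the "same prefix, similar behavior" assumption.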

Anomaly Detection Model.
Based on the attribute matrix and adjacency matrix constructed above, we develop an anomaly detection model, shown as follows, where δ is a parameter indicating the importance of the adjacency matrix. Formula (6) may be non-convex if W and R change simultaneously, but we can obtain their optimal values by fixing one of them at a time. If we fix W, we can rewrite formula (6) as formula (7) by setting the derivative with respect to R to zero.
where I, γD_R, and δL are matrices with positive diagonal entries, and D_R is a diagonal matrix whose i-th diagonal element depends on the i-th column of R. Similarly, if we fix R and set the derivative with respect to W to zero, we obtain equation (8), where C_1 and C_2 are symmetric positive semidefinite matrices, D_1 and D_2 are two diagonal matrices, and a and b are two parameters. Formula (9) can be rewritten using C_1 = CΘ_1C^T, C_2 = R^TΘ_2R, and K = C^THR^T, where C and R are two orthogonal matrices and Θ_1 and Θ_2 are diagonal matrices composed of eigenvalues. According to the two lemmas, we pre-multiply formula (8) by C^T and post-multiply it by R^T, so that equation (8) can be reformulated as equation (11), which can be solved column by column. We obtain the matrix U by calculating each column vector u_i, and the matrix W then follows from U = C^TWR^T.

Running Flowcharts.
The detailed running process of the developed model is illustrated in Algorithm 1. Firstly, the attribute matrix X ∈ R^{7×n} and the adjacency matrix A ∈ R^{n×n} are constructed and taken as input. After setting the other parameters, the algorithm iteratively selects representative IP addresses and features via the variable matrix W to reconstruct the attribute matrix. Lines 1 to 4 build the Laplacian matrix L and initialize the identity matrices D_R, D′_W, and D″_W, the residual matrix R, the orthogonal matrices C and R, and the diagonal matrices Θ_1 and Θ_2. As shown in lines 5 to 11, the optimal residual matrix R is obtained when the objective function in equation (6) converges, by iteratively updating W and R. Line 12 calculates the anomaly score of each IP address as the ℓ2 norm of its column in the residual matrix R. Lines 13 to 17 judge whether each IP address is anomalous.

MAWI Dataset.
It is important to find a public dataset with reliable ground truth to evaluate anomaly detection methods. However, some existing datasets are outdated and lack new attack trends or traffic patterns, such as the 1998/99 DARPA dataset [51] and the KDD CUP 99 dataset [52]. Other datasets, such as DDoS 2016 [53], were created in a simulated network environment, and the LBNL dataset [54] is publicly available but does not provide attack labels. Although the UNB ISCX dataset [55] provides attack labels, it is not always publicly available. Due to these drawbacks, we use the MAWI dataset to validate the performance of our proposed approach. The dataset is collected from the WIDE backbone network, a trans-Pacific transit link between Japan and the USA; it is updated daily to include new patterns of new applications or anomalies, and the packet payloads are removed for privacy protection. Furthermore, a graph-based methodology that combines different anomaly detectors is used to label the dataset [20]. Firstly, a similarity estimator uncovers the relations between the outputs of different anomaly detectors, an undirected graph is constructed from the anomalies detected by the different detectors, and communities are mined from the graph. Secondly, a confidence score and combination strategies decide whether a specific community corresponds to an anomaly or not.
The statistical information about the dataset is presented in Table 6, where #anomalies represents the number of anomalies, #flows is the total number of flows, #diffsrcIP denotes the number of different source IP addresses, #diffsrcPort is the number of different source ports, #diffdstIP denotes the number of different destination IP addresses, and #diffdstPort is the number of different destination ports. GRE denotes the generic routing encapsulation protocol, while encapsulating security payload (ESP) is a typical protocol in the IPsec architecture.
We analyze the percentage of different anomalies in the datasets, and the results are shown in Figure 5. From the figure, we can find that the top 4 anomaly types occupy more than 90% of the MAWI dataset: ntscUDPUDPPrp (network_scan_udp_udp_response), mptp (multi-point_to_point), mptmp (multipoint_to_multipoint), and ptmp (point_to_multipoint). Furthermore, these anomalies are typical and representative; for example, mptmp usually employs many controlled hosts to attack several destination hosts in coordination, similar to Botnet and Worm attacks.
Thus, we mainly employ the anomalies labeled ntscUDPUDPPrp, mptp, mptmp, and ptmp to evaluate our approach.

Evaluation Metrics.
We use precision (P), recall (R), and F1, which are widely used to evaluate the accuracy of many approaches [56]. Their definitions are provided as follows.
P is the ratio between the number of true anomalies detected and the total number of anomalies detected:

P = TP / PA,

where TP is the number of true anomalies detected and PA is the total number of anomalies detected.
R is the ratio between the number of true anomalies detected and the total number of anomalies in the dataset:

R = TP / TA,

where TA represents the total number of anomalies in the dataset.
F1 is the harmonic mean of P and R, defined as

F1 = 2 · P · R / (P + R).

Algorithm 1:
Input: attribute matrix X, adjacency matrix A, parameters α, β, γ, δ, θ
Output: top k source IP addresses satisfying the condition θ
(1) Build the Laplacian matrix L from the adjacency matrix A;
(2) Initialize D_R, D′_W, and D″_W to be identity matrices;
(3) Initialize R = X(I + γD_R + δL)^{−1};
(4) Build orthogonal matrices C and R from the eigenvectors of X^TX and XX^T, and diagonal matrices Θ_1 and Θ_2 from the eigenvalues of X^TX and XX^T, respectively;
(5) while the objective function in equation (6) does not converge do
(6)   Update W by equation (14);
(9)   Update R by equation (7);
(11) end
(12) Compute the anomaly score of the i-th source IP address as ‖R(:, i)‖_2;
(13) for i ≤ n do
(14)   if the score of the i-th source IP address > θ then
(15)     Output the source IP address;
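The three metrics can be computed directly from the counts defined above (TP, PA, TA); the zero-denominator guards are a defensive addition not discussed in the text.

```python
def precision(tp, pa):
    """P = TP / PA: true anomalies detected over all detections."""
    return tp / pa if pa else 0.0

def recall(tp, ta):
    """R = TP / TA: true anomalies detected over all true anomalies."""
    return tp / ta if ta else 0.0

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Example: 90 true detections out of 100 alarms, 120 true anomalies.
p, r = precision(90, 100), recall(90, 120)
score = f1(p, r)
```

F1 rewards a balance between the two: a detector that floods the operator with alarms gains recall but loses precision, and vice versa.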

Parameter Selection.
There are six parameters in the proposed algorithm. The parameter s represents the similarity of two IP addresses, and we use it to ensure that two IP addresses are similar enough; the similarity threshold thresh for the adjacency matrix is set to 0.9. The parameters α, β, and γ control the sparsity of the attribute matrix and the residual matrix, and the parameter δ controls the importance of the adjacency matrix; their values are important for model performance. We first increase one parameter while keeping the other three fixed and use the changing trend of F1 to select its value. The results are shown in Figure 6. F1 increases rapidly as δ grows, then decreases quickly and stabilizes, which indicates that the IP address structure plays an important role in anomaly detection. For α, β, and γ, F1 increases rapidly as the parameters increase and then stabilizes. Therefore, we set the initial values of the four parameters according to the changing trends of F1: α = 0.017, β = 0.015, γ = 0.012, and δ = 0.018. The parameter θ has a larger impact on the final accuracy. To select the optimal θ for different time windows, we calculate P, R, and F1 under different θ. Since P and R are mutually constrained, we employ F1 to make the trade-off. The results for a specific time window are presented in Table 7: R is larger than P when θ is smaller than 0.005, and R is smaller than P when θ is larger than 0.005. Therefore, the optimal θ is selected as 0.005, which yields the largest F1. Based on the above analysis, we obtain the optimal threshold for each time window, as shown in Figure 7.
To select the optimal parameter θ automatically, we use the exponentially weighted moving average (EWMA) [57] method to predict it. Assume θ_{t−1} and θ̂_{t−1} are the optimal θ and the forecast θ for the time window t − 1, respectively. The EWMA forecast is given by

θ̂_t = b · θ_{t−1} + (1 − b) · θ̂_{t−1},

where 0 ≤ b ≤ 1 is the weight factor. We employ the mean squared error (MSE) to evaluate the effectiveness of the threshold prediction:

MSE = (1/m) Σ_{i=1}^{m} (θ_i − θ̂_i)²,

where m is the number of windows, and θ_i and θ̂_i are the optimal threshold and the predicted threshold, respectively. We use the manually selected optimal θ of the first fifteen time points to obtain the weight factor b, and EWMA then predicts the threshold θ for the following time points. The analysis results are shown in Figure 7.

Table 6: Statistical information of the MAWI dataset.

Protocol      UDP       TCP        ICMP      ESP   GRE   IPv6
#flows        2984810   10431720   21318751  1273  3825  7
#anomalies    233475    3667771    42116     0     2793  0
#diffsrcIP    146969    215962     4382036   6     6     1
#diffsrcPort  63054     65468      1         1     1     1
#diffdstIP    279882    287689     16326875  4     18    1
#diffdstPort  60257     61350      27        1     1     1
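The EWMA recurrence and the MSE check can be sketched as follows; seeding the first forecast with the first observed optimum is an assumption, since the text does not specify the initialization.

```python
def ewma_thresholds(optimal, b):
    """Forecast each window's threshold from the previous optimum:
    theta_hat[t] = b * optimal[t-1] + (1 - b) * theta_hat[t-1].
    The first forecast is seeded with the first observed optimum
    (an initialization choice, not specified in the text)."""
    preds = [optimal[0]]
    for t in range(1, len(optimal)):
        preds.append(b * optimal[t - 1] + (1 - b) * preds[t - 1])
    return preds

def mse(optimal, preds):
    """Mean squared error between optimal and predicted thresholds."""
    return sum((o - p) ** 2 for o, p in zip(optimal, preds)) / len(optimal)

# Toy sequence of per-window optimal thresholds around 0.005.
thetas = [0.005, 0.006, 0.005, 0.004]
preds = ewma_thresholds(thetas, b=0.6)
err = mse(thetas, preds)
```

In practice, b would be chosen to minimize the MSE over the fifteen manually tuned windows, and the fitted recurrence then runs online for subsequent windows.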

Performance Evaluation.
Network traffic is a kind of time-series data; the traffic volume is usually massive and very difficult to process in real time. To improve practicality, we employ a time window mechanism, with the window size set to 6 seconds. We have 150 time windows for the MAWI dataset and analyze the anomalies in each window using different anomaly detection approaches. For the matrix decomposition methods, we use the proposed method and the SVD and SPCA approaches to assign an anomaly value to the traffic in each time window and employ the threshold θ to identify anomalies: if the anomaly value is larger than θ, the corresponding network traffic is regarded as anomalous. Then, we can calculate P, R, and F1 in each time window, as the dataset is labeled. Similarly, we apply the LOF, CBLOF, ROS, and COF approaches to assign anomaly values in each time window and obtain their F1 values across the different time windows.

Comparison with Matrix Decomposition Methods.
We select two popular matrix-decomposition-based anomaly detection methods to evaluate against ours. Wang employed singular value decomposition (SVD) to compute the eigenvalues and corresponding eigenvectors of the covariance matrix [42] and then selected the eigenvectors with the top k largest eigenvalues to form a new matrix. For a new vector, SVD first projects it onto the k-dimensional subspace and then calculates the squared Euclidean distance between the vector and its reconstruction. If the distance exceeds a given threshold, the vector is identified as anomalous. In this study, we apply SVD to the attribute matrix and treat the features extracted in the incoming time window as the given vector to perform anomaly detection. Erichson proposed the SPCA approach to improve the interpretability of low-rank matrix decomposition [43]. The developed method can be formulated as a regression-type optimization problem of the PCA method. Sparse loadings are obtained by imposing the lasso or elastic-net constraint on the regression coefficients, and the modified principal components are then obtained from the sparse loadings. In this study, we apply SPCA to the attribute matrix and obtain a new matrix composed of principal components. Then, the distance between the features extracted in the incoming time window and their reconstruction is calculated, and anomaly detection is performed.
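A minimal sketch of this SVD baseline, using synthetic data in place of the attribute matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))          # training attribute rows (synthetic)
X = X - X.mean(axis=0)                 # mean-center before computing the subspace

# Top-k eigenvectors of the covariance matrix, via SVD of the data matrix.
k = 3
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T                           # (7, k) projection basis

def recon_error(x):
    """Squared Euclidean distance between x and its rank-k reconstruction."""
    z = V.T @ x                        # project onto the k-dimensional subspace
    return float(np.sum((x - V @ z) ** 2))

normal = X[0]                          # a vector drawn from the training data
outlier = np.full(7, 10.0)             # a vector far from the training distribution
```

A vector whose reconstruction error exceeds a chosen threshold is flagged as anomalous; the far-away vector reconstructs much worse than an in-distribution one.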
The experimental results are presented in Figure 8. As the figure shows, our method outperforms the other two methods. The attribute matrix captures the dynamics of the traffic patterns, while the adjacency matrix characterizes the network environment; thus, our method achieves higher accuracy by incorporating more information into the matrix decomposition. Furthermore, the SVD method lacks interpretability because the eigenvectors are usually linear combinations of all columns of the original matrix. Although SPCA improves interpretability through sparse principal components, it cannot fully interpret the results. The proposed method can easily interpret the detection process by selecting representative rows and columns and, in turn, trace anomalies for efficient network management.

Comparison with Other Data Mining Methods.
Tuan proposed an anomaly detection method based on the local outlier factor (LOF) [58]. The LOF is defined as the ratio between the local reachability densities of object p's k-nearest neighbors and the local reachability density of p itself. A larger LOF means that the local reachability density of p is smaller than those of its k-nearest neighbors, so p has a higher probability of being an anomaly. In this study, we set k to the square root of the dataset size to perform anomaly detection, which has been shown to be an optimal choice [59].
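A compact LOF implementation following the definition above (brute-force distances without indexing, on synthetic data):

```python
import numpy as np

def lof_scores(X, k):
    """Plain local outlier factor; larger score => more anomalous."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                   # exclude self-distances
    knn = np.argsort(D, axis=1)[:, :k]            # indices of k nearest neighbors
    kdist = D[np.arange(n), knn[:, -1]]           # k-distance of each point
    # Reachability distance reach(p, o) = max(k-distance(o), d(p, o)).
    reach = np.maximum(kdist[knn], D[np.arange(n)[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)                # local reachability density
    return lrd[knn].mean(axis=1) / lrd            # LOF(p) = mean lrd(neighbors) / lrd(p)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(60, 2)), [[8.0, 8.0]]])  # one planted outlier
k = int(len(X) ** 0.5)                            # k = sqrt(n), as in the evaluation
scores = lof_scores(X, k)
```

The planted far-away point receives by far the largest LOF score.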
He proposed the FindCBLOF algorithm for discovering outliers, which employs a clustering algorithm to divide the dataset into large and small clusters [60]. They used the cluster-based local outlier factor (CBLOF), defined as the distance between the item under examination and its closest large cluster, to measure the significance of an outlier: the larger the CBLOF of an item, the more likely it is to be an anomaly. Tang proposed an outlier detection scheme based on chaining distances [61]. They use the connectivity-based outlier factor (COF) to indicate the probability that an object under examination is an anomaly: the COF of an object p is defined as the ratio between the chaining distance of p's k-nearest neighbors and the average chaining distance of its neighbors' k-distance neighbors.
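A simplified sketch of the CBLOF idea, assuming cluster labels are already computed and omitting the cluster-size weighting of the original FindCBLOF algorithm:

```python
import numpy as np

def cblof_scores(X, labels, large_frac=0.9):
    """Distance from each point to the centroid of its closest *large* cluster."""
    ids, counts = np.unique(labels, return_counts=True)
    # Greedily mark the biggest clusters as "large" until they cover large_frac of points.
    order = np.argsort(counts)[::-1]
    covered, large = 0, []
    for i in order:
        large.append(ids[i])
        covered += counts[i]
        if covered >= large_frac * len(X):
            break
    centroids = np.array([X[labels == c].mean(axis=0) for c in large])
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.min(axis=1)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(size=(50, 2)),                       # one large cluster
               [[9.0, 9.0]] + 0.1 * rng.normal(size=(3, 2))])  # a small remote cluster
labels = np.array([0] * 50 + [1] * 3)
scores = cblof_scores(X, labels)
```

Points in the small remote cluster lie far from every large cluster and therefore receive the largest CBLOF scores.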
Pei proposed a reference-based outlier detection approach that reduces the number of distance calculations compared with the LOF method [62]. They first calculate the distance between each reference point and the data points, then find the k reference-based nearest neighbors of each data point, and compute their average neighborhood density. The minimum neighborhood density of a data point then defines the reference-based outlier score (ROS); data points with higher ROS are considered anomalies. In this study, we select the parameter k as for LOF and choose the reference points at the grid vertices [62].
The experimental results are shown in Figure 9. As the figure shows, the F1 of the proposed method is larger than those of LOF, CBLOF, ROS, and COF. These four approaches employ only user behavior characteristics to mine anomalies, whereas the proposed method considers user behavior and the network environment simultaneously; thus, the proposed method is more practicable. Furthermore, compared with them, the matrix decomposition can mine not only the abrupt changes caused by abnormal behavior patterns but also slight changes.

Analysis of the Time Complexity.
The time complexity of the proposed scheme mainly consists of two parts: feature-matrix construction and matrix decomposition. Two matrices are used to profile the traffic patterns: the attribute matrix and the adjacency matrix. Without any data structure to optimize construction, building either matrix takes O(n²) time, where n is the total number of unique IP addresses in the monitored network. With a hash-based optimization, the attribute-matrix construction can be reduced to O(n). As for the adjacency matrix, because the IP addresses of the monitored network are fixed, the IP similarities can be computed offline. The most time-consuming operation in the proposed algorithm is the matrix inversion, whose computational complexity is O(n²·d) when updating the residual matrix R at each iteration; updating W costs O(n²). The total time complexity of the developed method is therefore m·O(n²·d), where m is the number of iterations and d is the dimension of the extracted attributes. The time complexity is thus independent of the traffic volume of the monitored network; only growth in the number of IP addresses increases it, and for an enterprise network, the total number of IP addresses is usually fixed. Furthermore, the developed method requires no prior knowledge and can therefore detect unknown anomalies. Another advantage is that it can trace anomalies easily, which greatly improves management efficiency. In conclusion, the proposed method is suitable for online security monitoring of medium-sized enterprise networks.
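The hash-based construction can be sketched as a single pass over flow records with a dict keyed by source IP; the three per-IP counters here are illustrative stand-ins for the seven attributes used in the paper:

```python
from collections import defaultdict

def build_attribute_rows(flows):
    """One pass over flow records; each IP's row is updated in O(1) via hashing."""
    rows = defaultdict(lambda: {"flows": 0, "packets": 0, "dst_ips": set()})
    for src, dst, pkts in flows:
        r = rows[src]
        r["flows"] += 1
        r["packets"] += pkts
        r["dst_ips"].add(dst)
    # Finalize: replace the destination set by its cardinality.
    return {ip: (r["flows"], r["packets"], len(r["dst_ips"]))
            for ip, r in rows.items()}

flows = [("10.0.0.1", "8.8.8.8", 10),
         ("10.0.0.1", "1.1.1.1", 5),
         ("10.0.0.2", "8.8.8.8", 2)]
```

Each flow record updates exactly one hash bucket, so the pass is linear in the number of records and the resulting matrix has one row per unique source IP.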
The monitored campus network contains thousands of users with self-governed IP addresses, including students, faculty members, and contract personnel from service-providing companies. The services used include HTTP, email, FTP, and VoIP. To evaluate our method, we collected a trace lasting seven days from the campus network, named the CERNET dataset. The statistical results of the dataset are presented in Figure 10: the x-axis represents time points (three minutes each), and the y-axis represents the byte and packet counts, respectively. As the figure shows, the user behavior patterns change dynamically and exhibit obvious routine characteristics.

Anomaly Mining Approach.
The selected monitoring network contains 1000 hosts with public IP addresses. The obtained residual matrix and the residual values of a specific time window are shown in Figure 11. Figure 11(a) shows the residual matrix: the x-axis represents IP addresses, the y-axis the seven features, and the z-axis the residual values. As the figure shows, most entries of the residual matrix are very small, and it is difficult to identify abnormal hosts from the residual values of individual features. Therefore, we compute the residual value of each host as the ℓ2 norm of its row. The results are shown in Figure 11(b), where the x-axis represents IP addresses and the y-axis the residual values. We analyze the IP addresses with larger residual values to identify anomalies, which not only determines whether an IP address is anomalous but also reveals the detailed pattern of the anomaly through its specific features.
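The per-host aggregation by ℓ2 norm might look like this sketch on a synthetic residual matrix, where host index 17 is planted as the anomaly:

```python
import numpy as np

rng = np.random.default_rng(2)
R = rng.normal(scale=0.01, size=(1000, 7))      # residual matrix: hosts x 7 features
R[17] = 0.5                                     # one host with large residuals everywhere

host_residual = np.linalg.norm(R, axis=1)       # l2 norm across the seven features
suspects = np.argsort(host_residual)[::-1][:5]  # hosts with the largest residuals
```

Ranking hosts by this norm surfaces the planted host first; its individual feature residuals can then be inspected to name the anomaly pattern.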
As an example, we analyze the host with the largest residual value to explain our method: its IP address is 115.154.XXX.XXX, and its residual value is 1.28. Its residual values for the OD and the NDDA are the largest in the residual matrix, which means that the host sends massive numbers of connections to different destination hosts. We can therefore conclude that the host may be performing network scan attacks, as an attacking host sends a great number of scan packets to many different destination hosts. The host with IP address 202.117.XXX.XXX holds the second largest residual value. Its NDR and SNF values are 0.84 and 0.61, respectively, which shows that many different hosts send a large number of packets to this host. DDoS attacks usually employ many hacker-controlled hosts to send a large number of packets to the destination host in a short period, preventing the destination host from providing normal service because the attack consumes most of its resources [63,64]. The NDR and SNF features in the residual matrix are larger than the others, meaning that many different hosts may be sending massive numbers of packets to this specific host in a short period; these characteristics are consistent with those of DDoS attacks. Therefore, we conclude that the host may be suffering a DDoS attack.
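The interpretation step can be summarized as a small rule table over the dominant residual features; the rules below are illustrative simplifications of the analysis above, not the paper's exact procedure:

```python
def diagnose(residual_row, names, top=2):
    """Name the features with the largest residuals and map them to a suspicion."""
    ranked = sorted(zip(residual_row, names), reverse=True)
    dominant = {name for _, name in ranked[:top]}
    if dominant == {"OD", "NDDA"}:
        return "possible network scan (many connections to many destinations)"
    if dominant == {"NDR", "SNF"}:
        return "possible DDoS victim (many sources sending massive traffic)"
    return "anomalous, pattern unclear"
```

Applied to the residual rows of the two example hosts, this yields a scan suspicion for the first and a DDoS suspicion for the second.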

Threshold Analysis.
User behaviors exhibit day-and-night periodicity; thus, we should select different thresholds for daytime and nighttime. To set suitable thresholds, we select five different time windows in the daytime and in the nighttime and analyze their respective residual values. The results are presented in Figure 12, where the x-axis represents IP addresses and the y-axis the residual values. As Figure 12(a) shows, the residual values of some IP addresses are larger than 0.2 in the different time windows, accounting for about 0.1% of hosts. Those behaviors are regarded as anomalies, and we set the threshold to 0.2 for daytime monitoring. As Figure 12(b) shows, the residual values of several IP addresses are larger than 0.3 in the different time windows, so we set the threshold to 0.3 for nighttime monitoring. The daytime threshold is smaller than the nighttime one because more people use the Internet in the daytime, which results in more complicated user behavior. The detection results are presented in Table 9, where top k denotes the number of anomalies detected by our proposed method, and #Anomalies denotes the number of true anomalies confirmed by analyzing the detected ones. From the table, the precision of our proposed method sits at around 90%.
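One way to derive such thresholds from the empirical residual distribution is a high quantile matching the observed ~0.1% anomaly share; this is a sketch on synthetic residuals, not the paper's exact procedure:

```python
import numpy as np

def pick_threshold(residuals, anomaly_share=0.001):
    """Threshold such that roughly `anomaly_share` of hosts exceed it."""
    return float(np.quantile(residuals, 1.0 - anomaly_share))

rng = np.random.default_rng(3)
day = np.abs(rng.normal(scale=0.05, size=10000))    # flatter daytime residuals
night = np.abs(rng.normal(scale=0.08, size=10000))  # wider nighttime residuals
theta_day, theta_night = pick_threshold(day), pick_threshold(night)
```

Computing the quantile separately on daytime and nighttime windows yields two thresholds, mirroring the day/night split described above.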

Conclusion
Detecting and controlling network anomalies is one of the most important problems in network management. In this study, we propose a novel anomaly detection method based on matrix decomposition. By analyzing the behavior characteristics of attacks, we extract seven features from the network traffic to construct an attribute matrix that characterizes the difference between normal and abnormal user behavior patterns. We combine the attribute matrix and the adjacency matrix to construct an anomaly detection model and then employ CUR matrix decomposition to mine user behavior patterns and obtain a residual matrix with which to identify anomalies. We use two datasets, MAWI and CERNET, to evaluate the performance of the proposed method. The experimental results show that it achieves a detection accuracy above 90% and outperforms the other related methods. Moreover, the developed method can not only locate anomalies and interpret the detection process but also identify new anomalies without any prior knowledge. In future work, we will focus on reducing the computational complexity and improving the practicality of the algorithm.

Data Availability
The public data are available at http://www.fukuda-lab.org/mawilab/v1.1/2019/05/05/20190505.html, and the data collected from CERNET are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.