Interaction Context-Aware Network Behavior Anomaly Detection for Discovering Unknown Attacks

Network behavior anomaly detection is an effective approach to discovering unknown attacks, where generating a high-efficacy network behavior representation is one of the most crucial parts. Nowadays, complicated network environments and advancing attack techniques make this increasingly challenging. Existing methods cannot yield satisfactory representations that express the semantics of network behaviors comprehensively. To tackle this problem, we propose XNBAD, a novel unsupervised network behavior anomaly detection framework. It integrates the timely high-order host states under the dynamic interaction context with the conversation patterns between hosts for behavior representation. High-order states can better summarize latent interaction patterns, but they are hard to obtain directly. Therefore, XNBAD utilizes a graph neural network (GNN) to automatically generate high-order features from a series of extracted base ones. We evaluated the detection performance of XNBAD on the publicly available benchmark dataset ISCX-2012. To report detailed and precise experimental results, we carefully refined the dataset before evaluation. The results show that XNBAD discovered various attack behaviors more effectively, and it significantly outperformed existing representative methods by at least a 3.8% relative improvement in terms of the overall weighted AUC.


Introduction
Network security of companies and organizations has long been threatened by various cyberattacks. Attacks targeting these institutions aim at stealing core secrets, leaking sensitive information, and tampering with important data, which cause great damage and pose serious potential threats. Cyberattacks became more severe during the COVID-19 pandemic because many companies and organizations adopted telecommuting, and their networks are more open than ever. According to a report from Zscaler, 1,200 COVID-19-related attacks involving phishing, malicious websites, and malware targeting remote users were observed and blocked in January 2020, and the number grew to 380,000 in April, an increase of about 30,000% [1]. Another report released by the World Health Organization (WHO) in April 2020 indicated that the number of cyberattacks directed at the organization was five times that in the same period of 2019, and 450 active WHO e-mail addresses and passwords were leaked along with the information of thousands of researchers in a single week [2].
Intrusion detection is one of the key steps in protecting network security and has been continuously studied. The anomaly detection-based technique builds a normal profile from large amounts of data and then detects anomalies (attacks) based on their degrees of deviation from the normal profile. Thus, the detection system is empowered to discover unknown attacks. With the evolution and innovation of techniques, novel attacks are constantly emerging. Thus, anomaly detection attracts increasing attention from both the computer security community and the machine learning community.
In recent years, more and more researchers have utilized machine learning to develop anomaly detection methods by harnessing its powerful ability to learn complex data patterns and distributions. Nevertheless, simply applying machine learning to detect malicious network traffic remains a nontrivial task due to the following challenges. (1) Labeling Difficulty. Most machine learning approaches are supervised, requiring abundant labeled training data to guide their model learning process. However, labeling network data is laborious and expensive. A huge volume of network traffic can be collected in a day, but attack traffic rarely appears in it, let alone is easily found. Besides, the manual labeling process can only annotate known attacks in overwhelming network traffic, which means supervised models may fail to retrieve unknown attacks. (2) Encryption Barrier. The payload content is helpful for discriminating attacks because attackers usually exploit the vulnerabilities of network applications through the payloads. With the popularization of encryption protocols such as the Transport Layer Security (TLS) protocol, attackers can leverage encryption to deliver payloads, which makes it very hard, if not impossible, to obtain this information. (3) Attack Diversification and Sophistication. Cyberattacks are becoming more and more diversified and sophisticated. Companies and organizations face various attacks such as phishing, scanning, brute force, denial of service (DoS), and so on. These attacks may have completely different abnormal characteristics in different aspects, so it is very challenging to perform effective detection from a single aspect. Beyond that, skilled attackers tend to customize attacks for their targets, such as advanced persistent threats (APTs), which makes the attacks subtle, stealthy, and hard to discover.
There are two kinds of detection mechanisms, i.e., content anomaly detection and behavior anomaly detection. Content detection methods rely on the payloads of application layers to discover anomalies. However, it is difficult to apply these methods directly to encrypted traffic. In contrast, behavior detection methods do not have this trouble because they do not depend on payload contents to discover anomalies. An effective data representation is the foundation of a successful behavior detection method. It must be able to describe network activities from a sufficiently high-level perspective, consider as many attack-related aspects as possible, and reflect network activities in time.
Existing behavior detection methods can be categorized into four types, i.e., packet-based methods, flow-based methods, extended flow-based methods, and static embedding-based methods, from the perspective of data representation. Packet-based methods [3,4] utilize L2-L4 header information (e.g., IP addresses and packet sizes) to represent network behaviors and infer the anomaly degrees of packets. Though the header information is easy to retrieve, it is too simple to describe high-level attack activities. The information acquired from packet headers is highly untrustworthy in the dynamic and high-speed contemporary network environments [5]. Packet-level detection in massive network traffic also generally incurs huge computation costs. Flow-based methods [6,7] extract features from the packets in flows to represent network behaviors. Flow features are able to capture the conversation patterns between hosts, which are relatively powerful for discriminating abnormal behaviors from normal ones. However, considering conversation patterns alone is not sufficient. Skilled attackers could mimic normal conversation patterns to evade the detection mechanism. Recently, researchers have also explored extracting features from other aspects. Extended flow-based methods not only utilize flow features but also consider extra information such as payload contents [5,8] and the relationships of multiple flows [9,10] to improve detection performance. Nevertheless, these existing works fail to handle encrypted traffic or require labeled data for training. Inspired by the field of natural language processing, several methods leverage embedding learning in feature extraction [11,12]. These methods also target flows but regard network data (e.g., IP addresses) as words.
They obtain numerical representations of words through embedding models like Word2Vec [13], which automatically learn relationships among categorical data in flows. These methods mainly suffer from two problems. First, they learn static embeddings. Therefore, in highly dynamic environments, they can neither deal with unseen words nor describe the changing characteristics of words over time [14]. Second, they either ignore numerical features or treat them as categorical features, which means important information in numerical features is likely to be omitted. In a word, existing methods fail to obtain satisfactory behavior representations for anomaly detection.
To address the above challenges and overcome the problems of the existing methods, an interaction context-aware network behavior anomaly detection framework, XNBAD, is proposed in this paper. The key idea of XNBAD is to integrate the conversation patterns between hosts and the timely interaction states of hosts to comprehensively characterize network behaviors for anomaly detection. The host interaction state, summarizing how a host interacts with the remaining hosts in the network, refers to one-to-many behavioral characteristics, while the conversation pattern refers to one-on-one behavioral characteristics. By taking host interaction states into consideration, network activities are viewed from a higher level, the semantics of network behaviors are expressed more comprehensively, and at the same time, the difficulty of evasion is increased since attackers need to imitate a wider behavioral pattern. Concretely, XNBAD extracts flow features to represent the conversation patterns and generates host features from a short-term interaction context to represent the timely interaction states. The interaction context is modeled as a graph structure; a series of base host features is first extracted from the graph, and then a graph neural network (GNN) is utilized to generate enhanced host features, where latent interaction patterns are mined through the high-order structural information of the graph. To get rid of the labeling difficulty, XNBAD trains the GNN and builds the normal profile on an attack-free training set. To evaluate the detection performance of XNBAD, we conducted extensive experiments on the publicly available benchmark dataset ISCX-2012, which contains multiple attack scenarios and various malicious behaviors. We summarize the contributions of this work as follows: (i) A novel unsupervised network behavior anomaly detection framework, XNBAD, is proposed for detecting unknown attacks. It estimates the anomaly scores of network behaviors under the dynamic host interaction context.
Besides, to express the semantics of network behaviors comprehensively, it considers both the conversation patterns between hosts and the interaction states of hosts.
(ii) A novel method that generates timely high-order host interaction features from network traffic is introduced. It allows extracting a series of base features from various aspects (e.g., domain knowledge and graph analysis). Further, a GNN-based feature enhancement is proposed to generate high-order features from the base ones. (iii) A further exploration of the benchmark dataset ISCX-2012 has been made, which is helpful for obtaining more accurate results and more detailed analysis in the experiments. In this exploration, mislabeled data are corrected and fine-grained malicious behavior labels are added according to the detailed materials provided with the dataset. (iv) Extensive experiments on ISCX-2012 are conducted to evaluate the detection performance of XNBAD. The results show that XNBAD effectively discovered more types and instances of malicious behaviors and significantly outperformed the best competitor with a relative improvement of 3.8% in terms of the overall weighted AUC (a more rigorous metric than the general AUC) at the 1% level. The effectiveness of the proposed GNN-based feature enhancement was validated. Besides, the effect of different hyperparameter configurations was investigated.
To the best of our knowledge, XNBAD is the first unsupervised learning framework utilizing a GNN to detect malicious network behaviors in traffic, and this is the first work to carefully study the ISCX-2012 dataset and analyze detection ability on different malicious behavior types. The rest of the paper is organized as follows. Section 2 reviews the related works. Section 3 describes the overview and details of our proposed framework XNBAD. Section 4 introduces the basic information of the ISCX-2012 dataset and our refinement of it. Then, Section 5 explains the experiments conducted on ISCX-2012, and Section 6 presents and analyzes the experimental results. Finally, Section 7 draws the conclusions and discusses future work.

Related Work
Several previous studies related to our work are reviewed in this section. As mentioned in Section 1, network behavior anomaly detection methods can be divided into four categories according to their data representations, i.e., packet-based methods, flow-based methods, extended flow-based methods, and static embedding-based methods. They are briefly summarized in Table 1 and discussed in detail as follows.

Packet-Based Methods.
This category of methods extracts features from L2-L4 packet headers and estimates the anomaly degree of each packet. PHAD [3] is one of the earliest packet-based methods, which uses 33 features from packet headers, such as IP addresses and IP packet size, for detection. It learns the normal value ranges of the features, estimates scores for the anomalous features in a packet, and sums the feature scores up as the packet's anomaly score for detection. Recently, several deep learning-empowered methods have been proposed. Kitsune [4] is a packet-based method using an ensemble of autoencoders (AEs) [15] as the anomaly detector. It applies damped incremental statistics on packet sizes and counts to represent the behaviors of the source host, channel (source and destination host pair), and socket of a given packet.
In total, 115 packet features are extracted, including 1D features like the weight, mean, and standard deviation of the source host and 2D features like the approximate covariance between the source and destination hosts. Then, it groups these features and feeds them forward to the unsupervised learning detector for anomaly detection. HELAD [16] is a novel packet-based framework integrating multiple deep learning models. Similar to Kitsune, it extracts packet features by improved damped incremental statistics and then uses a combination of an AE and long short-term memory (LSTM) [17] to obtain anomaly scores. Adding the LSTM aims to consider the relationship between consecutive attacks, but it requires labeled data for training. Though using deep learning techniques can improve the learning capability for complex behavior patterns, packet-based methods might perform poorly at detecting sophisticated attacks since packet features show a microview of network traffic which cannot reflect high-level network behaviors. Besides, the header information can become untrustworthy in contemporary dynamic network environments [5]. Moreover, this category of methods suffers from heavy computation when dealing with massive network traffic.
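The damped incremental statistics mentioned above maintain decayed counts and sums so that recent packets weigh more than old ones. A minimal sketch of the idea (the decay factor 2^(-λΔt) follows the common formulation; the exact statistics and parameters used by Kitsune and HELAD are not reproduced here):

```python
import math

class DampedIncStat:
    """Damped incremental statistics over a stream of (value, timestamp)
    pairs. A simplified sketch: real systems keep more statistics (e.g.,
    cross-stream covariances) than the mean/std shown here."""

    def __init__(self, lam=0.1):
        self.lam = lam       # decay rate lambda
        self.w = 0.0         # decayed weight (effective observation count)
        self.ls = 0.0        # decayed linear sum
        self.ss = 0.0        # decayed squared sum
        self.t_last = None   # timestamp of the last update

    def insert(self, x, t):
        if self.t_last is not None:
            d = 2.0 ** (-self.lam * (t - self.t_last))  # decay old statistics
            self.w *= d
            self.ls *= d
            self.ss *= d
        self.t_last = t
        self.w += 1.0
        self.ls += x
        self.ss += x * x

    def mean(self):
        return self.ls / self.w

    def var(self):
        return max(self.ss / self.w - self.mean() ** 2, 0.0)

    def std(self):
        return math.sqrt(self.var())
```

Because old observations decay exponentially, the statistics can be updated in constant time per packet, which is what makes this representation attractive for high-speed traffic.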

Flow-Based Methods.
This category of methods first groups packets into flows by the same flow key, e.g., the commonly used network five-tuple (IP_source, Port_source, IP_destination, Port_destination, Protocol), and then extracts flow features based on the header information of the packets in flows to estimate the flows' anomaly degrees. Over decades of research, various flow features have been proposed, such as statistical features, sequential features, and neural features generated by deep learning models. CICFlowMeter [18,19] is a popular open-source flow collection and statistical feature extraction tool. It extracts 83 statistical features in total for bidirectional flows, such as statistics of the packet sizes, the packet interarrival times, and the TCP flags. ZED-IDS [20] trains a deep AE with the flow features produced by CICFlowMeter to discover potential zero-day attacks, DoS attacks in particular. Similarly, AE-D3F [6] trains an AE with 27 of the statistical flow features proposed in [21] for DDoS attack detection. In [22], sequential features that array the payload sizes of the first 50 packets of a TLS session are used to classify whether a TLS encrypted session is malicious or not. In [23], similar sequential features are used to infer HTTP protocol semantics and improve malware detection on encrypted traffic. As deep learning gradually gains attention in the field of cyber security, researchers have started training end-to-end neural network models, where automatically learned neural features are intermediately generated from sequential features, for relevant classification and detection tasks. FS-Net [24] trains an encoder-decoder recurrent neural model to learn representative neural features from the sequences of packet sizes within flows for encrypted traffic classification.
LUCID [25] organizes the sequence of packet header information as an image and trains a convolutional neural network (CNN) on the images to learn flow representations for DDoS attack detection. STIDM [26] combines a 1D-CNN and a length-sensitive LSTM to generate discriminative flow representations from the sequences of packet sizes and time intervals for intrusion detection. Flow-based methods focus on the conversations between hosts, which establishes a moderate view on network traffic. Although flow features do not involve the contents of conversations, they can provide discriminative behavioral information of flows for detection. However, this category of methods ignores the information outside the conversation, which might be helpful for discriminating malicious network behaviors, and furthermore, attackers could imitate normal conversation patterns to bypass these detectors. Besides, the end-to-end deep learning models need to be trained with labeled data, which does not address the labeling difficulty challenge.
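Grouping packets into bidirectional flows hinges on a flow key that is invariant to direction. A minimal sketch of such a canonical five-tuple key (the tuple-ordering rule used here is an illustrative choice, not a prescribed one):

```python
def biflow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Canonical bidirectional flow key: packets traveling in either
    direction of the same conversation map to the same key, so they
    land in the same flow record."""
    a = (src_ip, src_port)
    b = (dst_ip, dst_port)
    # Order the two endpoints deterministically so direction is irrelevant.
    return (proto,) + (a + b if a <= b else b + a)
```

For unidirectional flows, the plain five-tuple itself would serve as the key; the canonical ordering is only needed to merge the two directions of a conversation.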

Extended Flow-Based Methods.
This category of methods extends flow-based detection methods by leveraging extra information. Some methods combine the features extracted from the contents of application payloads with flow features for detection. TR-IDS [8] concatenates the payload features generated by deep learning models and a series of statistical flow features as the input of a supervised random forest anomaly detector. In [5], a set of novel features is proposed to detect intrusions in IoT networks, where the contents of the application layer protocols used in IoT environments are extracted as service-based features, such as the DNS query and the HTTP method. Besides, a bunch of statistical features in a window of 100 records is also used to enhance the flow features.

Table 1: Summarization of network behavior anomaly detection methods from the aspect of data representation (the existing methods are divided into four categories; their representative references are listed in the category column).

Category: Packet-based methods [3,4,16]
Outline of data representation: (1) Directly extract features for each packet from its L2-L4 header information.
Pros: (1) Header information is easy to retrieve.
Cons: (1) Poor performance on sophisticated attacks since only a microview of network activities is provided. (2) Header information could be untrustworthy in modern network environments.

Category: Flow-based methods [6,7]
Outline of data representation: (1) Group packets into flows by the same flow key. (2) Extract features for each flow based on the series of header information of its packets.
Pros: (1) Relatively good performance since a moderate view on network activities is provided. (2) Efficient since there is no need to discriminate every packet.
Cons: (1) Have false negatives since only the conversation patterns inside a flow are considered. (2) Possibility of evasion by mimic attacks targeting conversation patterns. (3) Labeled malicious samples are required for training (only for supervised learning methods).

Category: Extended flow-based methods [8,9,10]
Outline of data representation: (1) On the basis of flow features, incorporate extra information from different aspects such as payload content and multi-flow relationship.
Pros: (1) Good performance on various attacks since different aspects of network activities are considered. (2) Free from encrypted payload (only for those not using payload content). (3) Efficient since there is no need to discriminate every packet.
Cons: (1) Existing methods fail to utilize extra information without supervised learning. (2) Poor performance on encrypted traffic (only for those using payload content). (3) Possible inevitable delay when processing extra information.

Category: Static embedding-based methods [11,12]
Outline of data representation: (1) Exploit the co-occurrence relationship of discrete values in network data. (2) Use word embedding methods like Word2Vec to generate embeddings for the "words." (3) Embedding representations or their combinations are used as features.
Pros: (1) Free from encrypted payload. (2) Very efficient since the embeddings are stored and used immediately once they are learned.
Cons: (1) Poor generalization ability in open and dynamic network environments. (2) Fail to treat numerical feature values properly.
Note that those content-related features can help detection, but they may become useless or hard (if not impossible) to obtain for encrypted traffic. Other methods attempt to mine the relations of multiple flows to improve detection performance. AMF-LSTM [9] inputs the statistical feature vectors of a flow and its previous flows into an attention-based LSTM model to find their correlations and generate the final representation of the current flow for anomaly detection. Similar methods are used in [27,28]. STDeepGraph [10] explores the spatial (structural) similarities and temporal dependencies of flows. It first associates the flows in a time interval by a graph structure called the temporal communication graph and then applies the shortest-path graph kernel, graph signal processing, and a 1D-CNN to generate spatial feature vectors for each flow. Finally, similar to AMF-LSTM, it adopts a modified LSTM model to learn the temporal dependencies and final representations of flows for detection, where the spatial feature vectors and the statistical flow feature vectors are combined as input. Unfortunately, all the above methods of this category are trained in a supervised manner, which does not address the labeling difficulty challenge. Among them, the most related and inspiring work to ours is STDeepGraph since it extracts features from graphs as well. However, the differences between STDeepGraph and our proposed XNBAD are significant. First of all, XNBAD is designed to train without labeled data, free from the labeling difficulty. Second, XNBAD improves detection performance by supplying the host states under the current network context, in which expert knowledge can be utilized, rather than the structural similarities of flows that supervised neural models learn. Finally, the form of the graphs XNBAD builds is quite different from that of STDeepGraph.

Static Embedding-Based Methods.
This category of methods, inspired by the field of natural language processing, regards the data in a network (e.g., IP addresses, port numbers, and so on) as the words (tokens) in natural language and leverages embedding methods like Word2Vec to generate numerical representations of network data for detection. The general idea of the embedding methods is to learn numerical vectors (embeddings) by modeling the co-occurrence likelihood of a target word and its context words. Therefore, the learned embeddings of frequently co-occurring words are more similar than the embeddings of infrequently co-occurring ones. In [14], IP2Vec, a Word2Vec variant adapted to flow records, is proposed to learn the embeddings of IP addresses as well as port numbers and protocols. In [11], SkipGram, one of the Word2Vec models, is used to generate the embeddings of both the categorical and numerical values in flow records, which means numbers are also viewed as categorical values. Then, for each record, the concatenation of the embeddings, replacing the original flow feature vector, is input to supervised detection models. In [12], a network connection is represented by two words, system and connection type, where system is the source IP address of the connection and connection type is the combination of the destination IP address, destination port, and protocol. Then, the word embeddings are provided by a variant of SkipGram, and the cosine similarity between the embeddings is calculated as the normal degree of the connection. It should be noted that the numerical flow features are not well considered in the above methods, being either ignored or treated as categorical. RL-IDS [29] takes care of this issue. It learns embeddings only on the categorical features with an unsupervised method, FVRL, and then integrates the embeddings and numerical features with a supervised method, NNRL.
These embedding methods mine co-occurrence relationships inside network traffic for data representation, which differs greatly from the other three kinds of methods. However, they perform poorly in terms of generalization in highly dynamic network environments because the embedding learning is static, a limitation that is twofold [14]. First, they cannot learn embeddings for words outside the training sets, such as new addresses or port numbers. Second, the learned embeddings are fixed; they only represent the historical states of words and cannot vary with the present situation of the network.
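The connection-scoring idea of [12] can be illustrated with plain cosine similarity: given learned embeddings, a high similarity between the system word and the connection-type word indicates a frequently co-occurring, hence normal, pair. The embedding values below are toy assumptions; a real system would learn them with a SkipGram variant:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def normal_degree(embeddings, system, conn_type):
    """Normal degree of a connection as in [12]: similarity of the 'system'
    word and the 'connection type' word. Raises KeyError for words unseen
    during training, which illustrates the static-embedding limitation."""
    return cosine(embeddings[system], embeddings[conn_type])
```

With toy embeddings such as `{"192.168.1.5": [1.0, 0.0], "10.0.0.2:80:TCP": [1.0, 0.0]}`, a frequently paired host and connection type would score near 1, while a never-seen pairing would score near 0.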

Methodology
In this section, we present our proposed framework XNBAD in detail. The objective of XNBAD is to generate an effective data representation of network behaviors to detect anomalous (malicious) flows. To this end, it describes the network behavior of a flow under the current host interaction context, which includes the timely host interaction states and the conversation between the hosts. Compared to the mainstream methods, which only consider the conversations, XNBAD considers both the inside and outside of a flow and is therefore able to express the semantics of network behaviors more comprehensively, thereby improving the performance of network behavior anomaly detection. The overview of the XNBAD framework is shown in Figure 1. The host interaction of the entire network changes over time, so the interaction states of the same host could be very different in two periods. Therefore, XNBAD uses a short-term window of flows to model the current interaction context. The context is modeled as a graph G (called the host interaction graph, HIG) where nodes stand for hosts and edges stand for flows. For a flow record r initiated from host s to host d, the interaction states of hosts s and d under the current context are represented by two host feature vectors z_s and z_d, respectively, which are generated based on G, and the conversation is represented by a flow feature vector q which can be directly extracted from the network traffic.
Then, the network behavior under the context is represented by a vector v_r as

v_r = ψ(z_s, z_d, q), (1)

where ψ denotes a function that integrates z_s, z_d, and q. XNBAD finally feeds the behavior vector v_r to an anomaly detector to estimate the anomaly degree of the flow record r, i.e., the anomaly score A(r). Accordingly, the detection phase of XNBAD can be divided into three stages: flow collection and feature extraction, host interaction feature generation, and network behavior anomaly detection. Inside the second stage, three steps are included, i.e., graph construction, base feature extraction, and GNN-based feature enhancement. In the following sections, we will detail the above three stages of XNBAD as well as the training procedures involved.
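As a concrete illustration of this stage, the sketch below assumes ψ is simple concatenation and pairs it with a toy centroid-distance detector built on attack-free vectors; both choices are assumptions for illustration, not the framework's prescribed components:

```python
def psi(z_s, z_d, q):
    """Integrate host and flow feature vectors into one behavior vector.
    Concatenation is an illustrative assumption for psi."""
    return list(z_s) + list(z_d) + list(q)

class CentroidDetector:
    """Toy anomaly detector: the normal profile is the centroid of the
    attack-free behavior vectors, and the anomaly score A(r) is the
    Euclidean distance to it. An assumption for illustration only."""

    def fit(self, vectors):
        n = len(vectors)
        dim = len(vectors[0])
        self.mu = [sum(v[i] for v in vectors) / n for i in range(dim)]
        return self

    def score(self, v):
        return sum((a - b) ** 2 for a, b in zip(v, self.mu)) ** 0.5
```

The important property this sketch shares with the real framework is that the profile is built only from attack-free data, so no attack labels are needed at training time.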

Flow Collection and Feature Extraction.
A flow is generally defined as a series of packets with the same network five-tuple (IP_source, Port_source, IP_destination, Port_destination, Protocol). On this basis, flows are categorized into unidirectional flows and bidirectional flows. In a unidirectional flow, all the packets move from the source to the destination, while in a bidirectional flow, the packets move between the two ends. XNBAD collects records on bidirectional flows since they contain more complete information about the conversations between hosts than unidirectional flows. A flow record consists of metadata, such as the five-tuple and the start and stop timestamps of the flow, and a flow feature vector representing the host conversation within the flow. Formally, a flow record r is represented in this work as

r = (s, d, μ, q), (2)
where s and d, respectively, stand for the source and destination IP addresses, μ stands for the other metadata of the flow, and q ∈ R^{D_flow} is the D_flow-dimensional flow feature vector. The general process of this stage is shown in Algorithm 1. During this stage, a dictionary B is used to buffer living (unfinished) flows, and a queue Q is used to collect issued (finished) flow records that wait for the next stage. For each incoming packet, a bidirectional flow key is calculated first to find the corresponding living flow, and then the packet information is used to initiate/update the metadata and flow feature vector of the flow. Besides, to avoid waiting for the records of long-inactive and long-living flows, a timeout mechanism is applied, in which the inactive timeout is five seconds and the flow-life timeout is two minutes. Therefore, once a flow times out or a TCP flow is connection-closed, the corresponding flow record is issued and collected.
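The collection stage described above (Algorithm 1) can be sketched as follows; the per-flow record fields are simplified assumptions, and a real implementation would also update the full flow feature vector per packet and handle TCP connection closure:

```python
from collections import deque

INACTIVE_TIMEOUT = 5.0   # seconds, as stated in the text
LIFE_TIMEOUT = 120.0     # two minutes, as stated in the text

def collect_flows(packets):
    """Sketch of Algorithm 1: buffer living flows in dictionary B and issue
    finished records to queue Q. A packet is (ts, src_ip, src_port, dst_ip,
    dst_port, proto, size); record fields here are simplified assumptions."""
    B, Q = {}, deque()
    for ts, sip, sport, dip, dport, proto, size in packets:
        # Issue timed-out flows before processing the new packet.
        expired = [k for k, f in B.items()
                   if ts - f["last"] > INACTIVE_TIMEOUT
                   or ts - f["start"] > LIFE_TIMEOUT]
        for k in expired:
            Q.append(B.pop(k))
        # Bidirectional flow key: both directions map to the same entry.
        a, b = (sip, sport), (dip, dport)
        key = (proto,) + (a + b if a <= b else b + a)
        f = B.get(key)
        if f is None:
            f = B[key] = {"src": sip, "dst": dip, "start": ts,
                          "last": ts, "pkts": 0, "bytes": 0}
        f["last"] = ts
        f["pkts"] += 1
        f["bytes"] += size
    Q.extend(B.values())  # flush remaining flows at end of capture
    return list(Q)
```

Note that expiry here is checked lazily on each packet arrival; a production collector would typically expire flows on a timer as well, so that idle flows are issued even when no traffic arrives.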
Various flow features have been proposed in existing works, such as statistical features [5,18,21], sequential features [22,23], and neural features that are automatically learned by deep learning models [24,25].
Though they do not access the conversation contents, flow features are able to capture conversation patterns in terms of the number, size, time interval, and direction of the packets within flows. In this work, we choose CICFlowMeter [18,19] to extract flow features since it is a flexible open-source tool that provides abundant and discriminative features for tasks like anomaly detection and attack classification. Concretely, it extracts over 80 statistical flow features. These features can be split into three groups, i.e., the forward, backward, and bidirectional features. In each directional group, several statistics of the packets are calculated as features, such as the minimum, maximum, mean, and standard deviation of the sizes, the interarrival times, the active times, and so on. More details about CICFlowMeter can be found in [18,19]. It should be noted that not only the above flow features but also other flow features that represent the conversations can be used in XNBAD because the choice of proper flow features depends on the exact network environment.
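The per-direction statistics described above can be sketched as follows; the feature names are illustrative and do not replicate CICFlowMeter's exact feature set:

```python
from statistics import mean, pstdev

def directional_stats(prefix, values):
    """Min/max/mean/std statistics of one directional group of values,
    in the spirit of CICFlowMeter's per-direction features."""
    return {
        f"{prefix}_min": min(values),
        f"{prefix}_max": max(values),
        f"{prefix}_mean": mean(values),
        f"{prefix}_std": pstdev(values),
    }

def flow_feature_vector(fwd_sizes, bwd_sizes):
    """Build a small statistical feature dict from forward and backward
    packet sizes; real extractors add interarrival times, active times,
    TCP flag counts, and so on."""
    feats = {}
    feats.update(directional_stats("fwd_size", fwd_sizes))
    feats.update(directional_stats("bwd_size", bwd_sizes))
    feats.update(directional_stats("bi_size", fwd_sizes + bwd_sizes))
    return feats
```

The same three-group pattern (forward, backward, bidirectional) applied to other packet attributes is how the full feature count grows past 80.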
Even though the flow features are able to provide discriminative information for anomaly detection, they cannot fully express the semantics of the network behaviors of flows because they only describe a series of actions between the subjects (i.e., source hosts) who initiate the actions and the objects (i.e., destination hosts) who react, but they neglect the states of these hosts. Therefore, extra information related to the source and destination hosts is needed to compensate for the deficiency of the flow features in representing network behaviors.

Figure 1: Framework of XNBAD. It detects anomalies in three stages. In the first stage, flow feature vectors are extracted and flow records are collected from network traffic. In the second stage, a host interaction graph G and its derived graph G_d are first constructed based on the flow records within the current window, then base host feature vectors X are generated from the graphs, and finally the enhanced host feature vectors Z are generated by a GNN. In the last stage, for each record r, its behavior representation v_r is generated by integrating its flow and host feature vectors, and then the anomaly score A(r) is calculated by an anomaly detector.

Host Interaction Feature Generation.
A host's interaction with the other hosts in the network reflects its network state in some aspect and is as important as the conversation between hosts for representing network behaviors in anomaly detection. For example, if a host suddenly sends short messages to many other hosts, it could be scanning the network, so the behaviors of the related flows should be regarded as anomalous. As another example, if an unauthorized host, which has never interacted with the database server before, downloads files from the server now, this download behavior probably indicates data exfiltration by a compromised host, and it should be regarded as anomalous. Therefore, XNBAD extracts features of the source and destination hosts from the interaction context of the entire network and then integrates them with the flow features to represent the network behaviors.
XNBAD captures a series of snapshots of the changing interaction context. For each snapshot, it builds a HIG and then generates host feature vectors to capture the timely interaction states of hosts. Since effective host features are hard to design and extract directly, XNBAD first extracts several types of base features and then applies a multi-layer graph neural network (GNN) for feature enhancement. The general process on a snapshot is shown in Algorithm 2. A snapshot corresponds to a two-minute tumbling window. XNBAD first constructs the HIG G and the derived HIG G_d based on a window of flow records in Q. Then, it produces several types of base feature vectors from the graphs for all the hosts in G. Finally, it feeds the concatenated base host feature vectors forward through the GNN layer by layer to generate the final host feature vectors. The three steps of this stage are detailed as follows.

Graph Construction.
To preserve as much interaction information as possible, XNBAD constructs a HIG by directly joining the flow records that share the same hosts (IP addresses). A flow's initial direction is the direction of its first packet, i.e., from the source host to the destination host. Since a pair of hosts can produce multiple flows with different initial directions, the HIG is viewed as a directed edge-attributed multi-graph, where parallel edges may exist between a pair of nodes and edges carry their own attributes. Formally, given a list of flow records R^(t) collected in the t-th window, the corresponding HIG can be represented as G^(t) = (V^(t), E^(t)), where V^(t) is the node set and E^(t) is the attributed edge list. We create V^(t) by collecting all the distinct IP addresses occurring in R^(t). We also give each IP a node index through the indexing mapping ϕ^(t): V^(t) ↦ {1, 2, ..., |V^(t)|}. Then, for each record r_i^(t) = (s_i^(t), d_i^(t), μ_i^(t), q_i^(t)), we create its corresponding directed edge (s_i^(t), d_i^(t); μ_i^(t), q_i^(t)) from node s_i^(t) to node d_i^(t), where the elements after the semicolon are viewed as the edge attributes. Furthermore, we derive a weighted directed simple graph G_d^(t) = (V^(t), E_d^(t), W_d^(t)) from the multi-graph G^(t), where E_d^(t) is the derived edge set and W_d^(t) is the weighted adjacency matrix representing the interaction strengths between nodes. We generate E_d^(t) by removing the redundant parallel edges from G^(t), which ensures that G_d^(t) is a simple graph. We measure the interaction strength between

Algorithm 1: Flow collection and feature extraction (pseudocode abridged). Globals: the source of incoming packets P, a dict B buffering the living flows, and a queue Q collecting the issued flows. While a packet p arrives from P, its flow key k is looked up in B; if no living flow matches, a new flow record r′ with the corresponding source and destination addresses is initialized and stored as B[k] ← r′; otherwise, the flow features of the matched record are updated incrementally.
the i-th and j-th nodes by the number of flow records between them, i.e., W_d[i, j] = |{r ∈ R : ϕ(s_r) = i, ϕ(d_r) = j}|, where s_r and d_r are the source and destination hosts of record r. For convenience and clarity, we sometimes omit the time notation (t) hereafter, since we mainly discuss the processes within a single time window.
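As a concrete illustration of the construction above, the following is a minimal sketch in plain Python; the record layout and helper names are our own, not XNBAD's actual implementation. It builds the multigraph edge list from flow records and derives the weighted adjacency matrix W_d by counting parallel edges.

```python
def build_hig(records):
    """Build a host interaction multigraph (HIG) from flow records.

    Each record is assumed to be (src_ip, dst_ip, attrs); the layout and
    names are illustrative only.
    """
    nodes = sorted({ip for s, d, _ in records for ip in (s, d)})
    phi = {ip: i for i, ip in enumerate(nodes)}            # indexing mapping
    edges = [(phi[s], phi[d], a) for s, d, a in records]   # parallel edges kept
    return nodes, phi, edges

def derive_weights(n, edges):
    """Collapse parallel edges: W[i][j] = number of flow records from i to j."""
    W = [[0] * n for _ in range(n)]
    for i, j, _ in edges:
        W[i][j] += 1
    return W

records = [("10.0.0.1", "10.0.0.2", {"bytes": 300}),
           ("10.0.0.1", "10.0.0.2", {"bytes": 40}),
           ("10.0.0.2", "10.0.0.1", {"bytes": 90})]
nodes, phi, edges = build_hig(records)
W = derive_weights(len(nodes), edges)
```

Here the two forward flows between the same pair collapse into one derived edge of weight 2, while all three parallel edges survive in the multigraph.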

Base Feature Extraction.
In this step, XNBAD extracts a series of base features for the hosts in the current window.
These features could be obtained from different sources, such as domain knowledge and expert experience about this network, knowledge and experience transferred from other fields, and so on. Thus, this step can be summarized as a bunch of functions of different types, f_base = (f_b^1, f_b^2, ..., f_b^m), that generate the base host interaction feature vectors from the current host interaction graph G = (V, E) (or its derived graph). During this step, the different types of base feature matrices can be generated independently, and then they are column-concatenated into the final base feature matrix.
Specifically, we extract three major types of base interaction features in this work: the domain-based features directly extracted from G, the traditional graph-based features extracted from the derived graph G_d, and the static embedding-based features generated and induced by applying a node embedding learning method to the derived graph. It should also be noted that the base host interaction features are not limited to these three types, and any other useful feature reflecting the host interaction states can be utilized in XNBAD.
(1) Domain-Based Features. Domain knowledge and experience from experts such as network administrators and security analysts are very helpful for depicting the host interaction states. The following describes the series of domain-based features extracted in this work; they cover several aspects of host interaction: host classes, involved services, and used ports.
Host Class. The hosts are classified into four classes: U, S, O, and X. U, S, and O stand for the user hosts, the servers, and the other hosts in the internal network, respectively, while X stands for the external hosts. Then a 4-dimensional one-hot feature vector is generated for each host according to its host class.
Interactive Distributions on Different Host Classes and Directions. This series of features reflects the state of a host interacting with the different host classes. Figure 2 gives an example. For each target node that we extract features for, we first count the numbers of its forward and backward edges connected with the four host classes. These counting features can be organized in a matrix M_{D·H} ∈ N^{2×4}, where the four elements of the first row record the numbers of the edges going to hosts of the classes U, S, O, and X, respectively, while the second row records those coming from the four classes. Thus, the first row of M_{D·H} is viewed as the interactive host class distribution in the forward direction, while the second row is that in the backward direction. Then, the marginal interactive host class distribution is calculated by applying an element-wise sum over the two rows.
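The counting matrix M_{D·H} and its marginal can be sketched in a few lines of Python; the edge and class encodings below are illustrative assumptions, not the paper's exact data structures.

```python
CLASSES = ("U", "S", "O", "X")

def class_distribution(target, edges, host_class):
    """M[0]: forward edges (target -> peer); M[1]: backward edges (peer -> target);
    columns indexed by the peer's host class U, S, O, X."""
    M = [[0, 0, 0, 0], [0, 0, 0, 0]]
    for s, d in edges:
        if s == target:
            M[0][CLASSES.index(host_class[d])] += 1
        if d == target:
            M[1][CLASSES.index(host_class[s])] += 1
    marginal = [M[0][c] + M[1][c] for c in range(4)]  # element-wise row sum
    return M, marginal

# toy window: user host "h1" talks to a server twice, an external host contacts it once
host_class = {"h1": "U", "srv": "S", "ext": "X"}
edges = [("h1", "srv"), ("h1", "srv"), ("ext", "h1")]
M, marginal = class_distribution("h1", edges, host_class)
```

The same counting scheme extends to bytes and packets by accumulating those quantities instead of edge counts.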

Algorithm 2: Host interaction feature generation (pseudocode abridged). Input: a window of flow records R in the collecting queue Q, and a bunch of base host feature extracting functions of different types, together with a GNN for host feature enhancement. Output: the enhanced host feature vectors Z corresponding to R. The algorithm first constructs the graphs (Section 3.2.1), then extracts the base host features, and finally returns Z.

Similarly, the interactive direction distributions on each host class and the corresponding marginal distribution can be obtained along the different columns. Besides, for each distribution, the entropy is calculated to measure its diversity. Formally, given a distribution p = (p_1, p_2, ..., p_m) of m values with Σ_{i=1}^m p_i = 1, its entropy is H(p) = −Σ_{i=1}^m p_i log p_i. Note that the above distributions and entropies are based on the edges (flows). Furthermore, we also generate similar distributions and entropies based on the bytes and packets for the target node.
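The entropy above can be computed directly from the raw counts by normalizing first; a minimal sketch (base-2 logarithm assumed, which only rescales the value):

```python
import math

def distribution_entropy(counts):
    """Shannon entropy H(p) = -sum p_i * log2(p_i) of a count vector.

    Returns 0.0 for an empty vector or one with all mass on a single value.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)
```

A uniform distribution over the four host classes yields the maximum value log2(4) = 2, while a host that only ever talks to one class yields 0.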

Interactive Distributions on Different Service Classes and Directions. This series of features reflects the state of a host accessing and offering different services. Similar to the above distributions, for each target node, we count the numbers of its forward and backward edges with different services. We roughly classify the flows into 11 service classes, such as HTTP, DNS, and Mail, according to their protocols and ports. Similarly, these features are organized as a matrix M_{D·S} ∈ N^{2×11}, and then the distributions and entropies can be obtained.
Interactive Distributions on Different Port Classes and Directions. This series of features reflects the port usage state of a host. At first, the ports are simply classified into two classes, privileged ports (< 1024) and unprivileged ports (≥ 1024); then, for each host, its flows are classified into a 2-by-2 layout based on the initial directions and port classes, similar to the above procedures. However, instead of counting the flow numbers directly, we record in each cell a port distribution that contains the distinct port numbers with their counts. Then, we calculate the sum, the entropy, and the number of unique ports for each port distribution. Therefore, the resulting features can be organized as a 3-order tensor M_{D·P} ∈ R^{2×2×3}, where the first order is indexed by the direction, the second by the port class, and the third by the result type (the sum, the entropy, or the unique number).
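The per-cell summary (sum, entropy, unique-port count) can be sketched as follows; the dict-based port distribution is our own illustrative encoding.

```python
import math

def summarize_port_dist(port_counts):
    """One cell of the 2x2 layout: {port: count} -> (flow sum, entropy, n_unique)."""
    counts = list(port_counts.values())
    total = sum(counts)
    ent = -sum(c / total * math.log2(c / total) for c in counts if c) if total else 0.0
    return total, ent, len(port_counts)
```

For example, a host whose forward privileged-port cell is {80: 3, 443: 1} would yield a sum of 4 flows over 2 unique ports, with a moderate entropy, whereas a scanner touching many distinct ports once each would show a high unique count and near-maximal entropy.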
(2) Graph-Based Features. In this work, we extract several node features based on traditional graph analysis since they have been shown effective for node anomaly detection in many applications [30][31][32][33].
These features are extracted from the derived graph G_d, as shown in Figure 3, and are described as follows.
Degree Centralities. Centralities are a kind of global feature measuring the importance of a node in a graph. Here, we compute the in-degree centrality, the out-degree centrality, and the (total) degree centrality.
Egonet Features. An egonet is the induced subgraph of the neighbors within a specific radius centered at a given node, and egonet features describe the local structure of a node. Here, several 1-hop egonet features are extracted, such as the number of nodes, the number of edges, the number of 2-step-away node pairs, the sum of the edge weights, the egonet density, and so on.
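A few of these 1-hop egonet features can be sketched over a plain adjacency structure; the undirected view, the adjacency encoding, and the exact feature subset are our own simplifying assumptions.

```python
def egonet_features(v, adj, weight):
    """1-hop egonet of v (undirected view of G_d): node count, edge count,
    total edge weight, and density.

    adj: node -> set of neighbors (symmetric); weight: directed (u, w) -> weight.
    """
    ego = {v} | adj[v]
    edges = {tuple(sorted((a, b))) for a in ego for b in adj[a] if b in ego}
    n, m = len(ego), len(edges)
    # sum weights in both directions for each undirected egonet edge
    w_sum = sum(weight.get((a, b), 0) + weight.get((b, a), 0) for a, b in edges)
    density = m / (n * (n - 1) / 2) if n > 1 else 0.0
    return n, m, w_sum, density

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
weight = {(0, 1): 2, (0, 2): 1, (1, 2): 1, (2, 3): 4}
feats = egonet_features(0, adj, weight)
```

Node 3 lies outside node 0's 1-hop egonet, so the edge (2, 3) and its weight are excluded from the summary.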
(3) Embedding-Based Features. This type of feature includes the node embeddings learned from the historical normal traffic (i.e., the training set) and the features induced from the node embeddings under the current context.
Node Embeddings. The low-dimensional node embeddings learned from a graph are useful for various graph tasks such as node classification and link prediction [34]. Here, we employ the well-known embedding model Node2Vec [35] to learn the embeddings for hosts. The learned embeddings reflect the historical interaction relationships between hosts. Generally, if two hosts have interacted frequently in the past, they have similar embeddings. Since Node2Vec is a static learning method that cannot learn the embeddings of nodes outside its training graph, we make a closed-world workaround (detailed in Section 3.4) where only the users and servers in the internal network have their own unique embeddings, while the other internal hosts and the external hosts share two further embeddings, respectively, as illustrated in Figure 4.
Induced Features from Node Embeddings. For each coming window, several features are induced from the learned embeddings on the derived graph G_d to reflect the current relationships between a target host and the hosts it is interacting with. Figure 4 gives an example. This is inspired by the works [36, 37], in which embedding similarities are used to detect lateral movements. For a target node, we first fetch its 1-hop egonet and the embeddings of its egonet nodes and then calculate four features as in [36]: the minimum and the mean of the similarities between the target node and its egonet nodes, the mean of the similarities over all node pairs in the egonet, and the minimum of the similarities between the egonet nodes and their embedding centroid. The assumption is that if a host interacts with unfamiliar hosts, for example during lateral movements, these features will be influenced. Once a target host's three types of features are obtained, we concatenate them together as its base interaction feature vector. Generally, the above base features have limited representation ability. They only capture low-order interaction patterns, since they mainly focus on the interactions within the target node's 1-hop neighborhood. On the other hand, directly extracting features that capture high-order interaction patterns is difficult and expensive, especially through feature engineering, since it usually runs into intractable problems such as data sparsity and the curse of dimensionality.
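The four embedding-induced similarity features above can be sketched as follows; cosine similarity is assumed here (the paper does not pin down the similarity measure in this passage), and the function names are illustrative.

```python
def cosine(a, b):
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def induced_features(target, neighbors):
    """Four similarity features for a target node and its 1-hop egonet:
    min/mean target-neighbor similarity, mean pairwise similarity,
    and min similarity to the embedding centroid."""
    sims = [cosine(target, e) for e in neighbors]
    nodes = [target] + neighbors
    pairs = [cosine(nodes[i], nodes[j])
             for i in range(len(nodes)) for j in range(i + 1, len(nodes))]
    centroid = [sum(e[k] for e in nodes) / len(nodes) for k in range(len(target))]
    cents = [cosine(e, centroid) for e in nodes]
    return min(sims), sum(sims) / len(sims), sum(pairs) / len(pairs), min(cents)
```

If a host only talks to its usual peers, all four values stay high; an interaction with an embedding-dissimilar host pulls the minima down, which is the lateral-movement signal.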

GNN-Based Feature Enhancement.
In order to effectively capture high-order interaction patterns, XNBAD further utilizes a GNN to enhance the host features. GNNs are a kind of powerful neural network for representation learning on graphs. They are able to consider both the structural information of the graph and the attributes of the nodes to automatically learn low-dimensional node representations, and they have recently achieved impressive performance in many graph-related tasks. Formally, this step can be summarized as a parameterized function g:

Z = g_Θ(X, G_d), (8)

where Θ denotes all the learnable parameters of the GNN. It takes the base host feature vectors X and the derived graph G_d = (V_d, E_d, W_d) as input and outputs the enhanced host feature vectors Z ∈ R^{|V|×D_gnn}. Similar to X, the enhanced vector of host v ∈ V is the ϕ(v)-th row of Z, i.e., z_v = (Z[ϕ(v), :])^⊤. In general, a GNN stacks L layers and outputs node representations by forwarding the input layer by layer recursively:

Z^ℓ = g_Θ^ℓ(Z^{ℓ−1}, G_d), ℓ = 1, 2, ..., L, (9)

where g_Θ^ℓ is the ℓ-th layer of the GNN, Z^ℓ is the output of the ℓ-th layer, Z^0 = X, and Z = Z^L. At each layer, the input representations of the nodes and their first-order neighbors are integrated to generate their output representations by two main operations: neighborhood aggregation and representation updating. By forwarding through the layers, the high-order host interaction information, including the base host feature vectors and the structural information within the L-hop neighborhood, is embedded into the enhanced host feature vectors.
Specifically, considering that the host interaction context changes over different time windows, we choose GraphSAGE [38] in this work because it is an inductive learning framework that is able to generalize across graphs. Our GraphSAGE has two layers. The aggregation operation at the ℓ-th layer is formalized in (10) and (11).
It first draws a subset N(v, ℓ) of size S_ℓ from the target node v's first-order neighbors N(v) by the sampling function S. S samples a node u ∈ N(v) with probability proportional to the bidirectional interaction strength between u and v, which can be obtained from the weight matrix W_d of G_d, i.e., P(u) ∝ W_d[ϕ(u), ϕ(v)] + W_d[ϕ(v), ϕ(u)]. (10) Then, it aggregates the sampled nodes' representations output at the (ℓ − 1)-th layer as (11). The update operation at the ℓ-th layer is performed as follows.
where η is the activation function and W^ℓ and b^ℓ are the learnable parameters of the ℓ-th layer. It first concatenates the target node's representation at the (ℓ − 1)-th layer with the aggregated neighborhood representation and feeds the result into a fully connected layer as (12). Then, it updates the representation of the target node with the L2-normalized output of the fully connected layer as (13).
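One layer of this scheme can be sketched in plain Python for a single node. The mean aggregator and the ReLU activation are assumptions for concreteness (the paper searches over several activations, and the aggregator is not pinned down here); in practice this would be batched tensor code, not per-node loops.

```python
import random

def sage_layer(v, h, neighbors, weight, W, b, sample_size=2, seed=0):
    """One GraphSAGE-style layer for node v: weight-proportional neighbor
    sampling, mean aggregation, concat + linear, ReLU, L2 normalization.

    h: node -> input feature list; weight: directed (u, w) -> edge weight.
    """
    rng = random.Random(seed)
    nbrs = neighbors[v]
    # sample proportionally to bidirectional interaction strength (eq. (10))
    strengths = [weight.get((v, u), 0) + weight.get((u, v), 0) for u in nbrs]
    sampled = rng.choices(nbrs, weights=strengths, k=sample_size)
    dim = len(h[v])
    agg = [sum(h[u][k] for u in sampled) / sample_size for k in range(dim)]
    x = h[v] + agg                                            # concatenation (eq. (12))
    out = [sum(W[i][j] * x[j] for j in range(len(x))) + b[i] for i in range(len(W))]
    out = [max(0.0, o) for o in out]                          # ReLU activation
    norm = sum(o * o for o in out) ** 0.5 or 1.0
    return [o / norm for o in out]                            # L2 normalization (eq. (13))
```

Stacking two such layers, as XNBAD does, lets a host's output vector absorb structure and features from its 2-hop neighborhood.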

Network Behavior Anomaly Detection.
During this stage, XNBAD integrates the enhanced host interaction feature vectors output by the GNN with the flow feature vectors to generate the network behavior representations, and then estimates their anomaly degrees with an anomaly detector. In this work, we directly concatenate the host and flow feature vectors as the behavior feature vector and use a clustering-based detector for detection, which is shown in Algorithm 3. Thus, we rewrite (1) as (14) for each flow record r = (s, d, μ, q) in the current window:

v_r = [z_s; z_d; q], (14)
where z_s and z_d are the enhanced feature vectors of hosts s and d obtained from the GNN, and q is the flow feature vector. The clustering-based detector profiles normal network behaviors by a set of K cluster centers C = {c_k}_{k=1}^K, where each center c_k represents a kind of normal behavior pattern gathering in the feature space. The detector measures the anomaly degree of a network behavior by the distance between the behavior feature vector and the cluster centers, as (15) and (16), i.e., A(r) = min_{1≤k≤K} ‖v_r − c_k‖.
The intuition is that if a behavior is normal, it will lie close to one of the cluster centers, showing a short distance to its cluster; otherwise, it will stay away from all the cluster centers, giving a relatively long distance.
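The distance-to-nearest-center scoring can be sketched directly; Euclidean distance is assumed here, and the function names are illustrative.

```python
def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def anomaly_score(v_r, centers):
    """A(r): distance from the behavior vector to the nearest normal cluster center."""
    return min(euclidean(v_r, c) for c in centers)
```

With centers learned at [0, 0] and [10, 10], a behavior vector near either center scores low, while one far from both scores high and is flagged once a threshold (or ranking cutoff) is applied.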

Learning Procedures.
In this section, we describe how to learn the Node2Vec embeddings, the graph neural network, and the cluster centers mentioned above. Among them, the embeddings and the centers are generated by off-the-shelf models, while the graph neural network needs to be guided by a specific loss function. Therefore, we introduce the training data organization for these three models and the loss function used for the graph neural network. These models are trained on historical normal traffic that may span hours, days, or more. Given a training set of T windows of flow records R = [R^(t)]_{t=1}^T, we first construct the corresponding HIGs G = [G^(t)]_{t=1}^T and their derived simple graphs G_d = [G_d^(t)]_{t=1}^T, and then train these models one by one as follows.
3.4.1. Node2Vec. We generate a closed-world simple graph that summarizes the historical host interaction for training Node2Vec. This can be implemented by constructing a big HIG and its derived simple graph from all the records in R and then merging the derived graph's nodes, edges, and weights in the closed-world setting. In the closed-world setting, the internal user and server nodes are kept as they are, while the other internal nodes and the external nodes are merged into two global nodes, respectively; the edges are then merged and the weights summed up accordingly. Finally, we train Node2Vec on this training graph to obtain the node embeddings.

GraphSAGE.
The learning objective of the GraphSAGE in this work is to maximize the normal host interaction context likelihood estimated from the enhanced host interaction feature vectors. We estimate the likelihood of a pair of nodes having interaction (edges) from the inner product of their enhanced feature vectors. In G^(t) = (V^(t), E^(t)), for a target node v ∈ V^(t), the remaining nodes can be split into the positive set V_+^(t)(v), in which the nodes have edges with v (i.e., the 1-hop neighbors of v), and the negative set V_−^(t)(v), in which the nodes have no edge with v. Then, the likelihood of node v interacting with the others under the context G^(t) can be estimated as

p(v | G^(t)) = ∏_{u ∈ V_+^(t)(v)} p_{u,v} · ∏_{u ∈ V_−^(t)(v)} (1 − p_{u,v}), with p_{u,v} = σ(z_u^(t)⊤ z_v^(t)), (17)

where z_v^(t) denotes the enhanced feature vector of node v output by the GraphSAGE and σ(·) is the sigmoid function, which ensures the estimated probability p_{u,v} ∈ [0, 1]. Then, we can estimate the interaction context likelihood over the nodes and graphs of the training set as

p(G) = ∏_{t=1}^T ∏_{v ∈ V^(t)} p(v | G^(t)). (18)

However, calculating the likelihood over the nodes requires |V^(t)| × (|V^(t)| − 1)/2 inner product computations, which is prohibitively expensive if V^(t) is large. To control the computation, we use predefined-size subsets instead of V_+^(t)(v) and V_−^(t)(v) themselves, and use a subset of V^(t) when calculating the likelihood over the nodes. Consequently, the negative log likelihood over the graphs is calculated as the total loss function of the GraphSAGE:

L(G) = −Σ_{t=1}^T Σ_{v ∈ Ṽ^(t)} [Σ_{u ∈ Ṽ_+^(t)(v)} log σ(z_u^(t)⊤ z_v^(t)) + Σ_{u ∈ Ṽ_−^(t)(v)} log σ(−z_u^(t)⊤ z_v^(t))], (19)

where we use the identity 1 − σ(x) = σ(−x).
where Ṽ^(t) (and, analogously, the positive and negative subsets) denotes the node subset sampled for the t-th window. We apply a gradient descent-based optimization method to minimize L(G) and update the learnable parameters Θ of the GraphSAGE in a mini-batch fashion.
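For a single window, this negative-sampling loss can be sketched as follows (the pair lists stand in for the sampled positive and negative subsets; the helper names are ours):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nll_loss(z, pos_pairs, neg_pairs):
    """Negative log-likelihood over one window: -log sigma(z_u . z_v) for
    sampled edges, -log sigma(-z_u . z_v) for sampled non-edges,
    using the identity 1 - sigma(x) = sigma(-x)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(z[u], z[v]))
    loss = -sum(math.log(sigmoid(dot(u, v))) for u, v in pos_pairs)
    loss -= sum(math.log(sigmoid(-dot(u, v))) for u, v in neg_pairs)
    return loss
```

Minimizing this loss pushes the enhanced vectors of interacting hosts toward high inner products and those of non-interacting hosts toward low ones; in XNBAD the gradients would flow back through the GraphSAGE parameters Θ.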

K-Means.
We run the K-means algorithm on a number of network behavior feature vectors to obtain the normal behavior cluster centers. To capture representative normal behavior patterns, we use behavior vectors across different windows as the training set of K-means. However, enormous numbers of flow records can be collected within hours, and using all of them for training is unnecessary and impractical. Therefore, we randomly sample a part of them from each window, in proportion, to train K-means. Suppose the total number of sampled records is predefined as N; then, for the t-th window, its proportion is p^(t) = |R^(t)| / Σ_{t′=1}^T |R^(t′)|, so we draw a subset of p^(t)·N records from R^(t). This sampling procedure is prepared at the very beginning of training all the models, and Ṽ^(t) in (19) is obtained by collecting the hosts in the sampled records of the t-th window.
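The proportional sampling can be sketched in a few lines (rounding behavior and function names are our own choices):

```python
import random

def proportional_sample(windows, n_total, seed=0):
    """Sample ~n_total records overall; window t contributes a share
    proportional to its size |R_t| / sum_t |R_t|."""
    rng = random.Random(seed)
    total = sum(len(w) for w in windows)
    return [rng.sample(w, min(round(len(w) / total * n_total), len(w)))
            for w in windows]
```

For instance, with two windows of 100 and 300 records and a budget of 40, the windows contribute 10 and 30 samples, respectively; the resulting behavior vectors then feed K-means, and their hosts define the node subsets used in the GraphSAGE loss.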

Time Complexity of Detection Phase.
This section analyzes the time complexity of XNBAD performing detection on a window of flow records, mainly considering the data scale and the feature dimensionality. The time complexities of the different stages of XNBAD are summarized in Figure 5 and analyzed as follows.

Algorithm 3: Network behavior anomaly detection (pseudocode abridged). Input: a window of flow records R in the collecting queue Q, the enhanced host feature vectors Z corresponding to R, and a set of cluster centers C profiling the normal behaviors. Output: the anomaly scores A_R for the records in R. The algorithm initializes A_R as an empty list and then, for each record r = (s, d, μ, q) in R, builds the behavior vector and appends its anomaly score to A_R.

Flow Collection and Feature Extraction.
The flow feature extraction is performed incrementally, as shown in Algorithm 1. Therefore, when each packet arrives, all the flow features of the corresponding flow record can be initialized or updated. Let |P| be the packet arrival rate; then the time complexity of collecting a window of flow records is O(|P| D_flow).

Host Interaction Feature Generation.
This stage contains three steps: graph construction, base feature extraction, and GNN-based feature enhancement. The time complexities of the first and last steps are relatively fixed, while the second is highly dependent on which base features are used in the exact environment. Let R be a window of flow records, and let G = (V, E) and G_d = (V, E_d, W_d) be the HIG and the derived HIG constructed from R, as shown in (3) and (6), respectively.

(1) Graph Construction. Constructing G and G_d needs one pass through R. Besides, auxiliary data structures such as adjacency tables and alias sampling tables [39] of the nodes need to be created for the subsequent feature extraction and enhancement, and they can be obtained in O(|E_d|) time. In total, the time complexity of this step is O(|R| + |E_d|).

(2) Base Feature Extraction. The time complexity of this step is usually linear in the dimensionality of the base features and may have a linear or higher-order relationship with the number of nodes or edges. In this work, it is linear in the numbers of nodes and edges and in the base feature dimensionality.

(3) GNN-Based Feature Enhancement. Suppose the GNN has L layers, the sampling size is S_ℓ, and the output dimensionality is D_gnn^ℓ at the ℓ-th layer. For each node, the neighborhood sampling (equation (10)) can run in O(S_ℓ) by the alias method [39]; the aggregation (equation (11)) needs O(S_ℓ D_gnn^{ℓ−1}); the fully connected layer forward (equation (12)) needs O(D_gnn^{ℓ−1} D_gnn^ℓ); and the L2-normalization (equation (13)) needs O(D_gnn^ℓ). Therefore, the time complexity of the ℓ-th GNN layer for a node is O(S_ℓ D_gnn^{ℓ−1} + D_gnn^{ℓ−1} D_gnn^ℓ). In practice, the GNN outputs the node representations for a batch of nodes at once. Therefore, the overall time complexity of the GNN on a window is

O(β Σ_{ℓ=1}^L N_out^ℓ (S_ℓ D_gnn^{ℓ−1} + D_gnn^{ℓ−1} D_gnn^ℓ)), (20)

where N_out^ℓ is the number of unique nodes output at the ℓ-th layer and β = |V| / N_out^L is the number of batches the GNN runs on G_d. Note that N_out^{ℓ−1} ≤ min(|V|, N_out^ℓ(1 + S_ℓ)) due to the sampling operation.
In the worst case, G_d is large enough that, for all ℓ < L, N_out^{ℓ−1} = N_out^ℓ(1 + S_ℓ). For clarity, we reuse S_ℓ to denote S_ℓ + 1 and set S_ℓ and D_gnn^ℓ to S_gnn and D_gnn, respectively, for all layers. Then, (20) can be simplified to O(|V| S_gnn^{L−1} D_gnn (S_gnn + D_gnn)).

Network Behavior Anomaly Detection.
For each flow record r ∈ R, getting its network behavior feature vector v_r, calculating the distances between v_r and the K cluster centers, and selecting the minimum distance respectively need O(D_bhv), O(K D_bhv), and O(K) time, where D_bhv = 2D_gnn + D_flow is the dimensionality of v_r. Therefore, the overall time complexity of this stage is O(|R| K D_bhv).

Dataset
We conducted experiments on the ISCX-2012 dataset [40]. In this section, we first introduce the basic information about this dataset and our refinement of it before detailing the experiment procedures and results. The ISCX-2012 dataset was collected in a real live testbed simulating an intranet environment with 4 user LANs (denoted l_1-l_4) and 1 server LAN (l_5). The user LANs mainly have 21 active user hosts (u_1-u_21), and the server LAN has 3 servers, i.e., the Main Server (s_22), the Secondary Server (s_23), and the NAT Server (s_24). The LAN hosts can reach the external network through the NAT Server. Besides, there are 3 attack hosts (a_1-a_3) located in the external network. The traffic of the dataset is generated by profiles. To simulate realistic network activities, the dataset creators analyzed and summarized four weeks' worth of network activities in their institution into β-profiles to generate the background (normal) traffic for the HTTP, SMTP, SSH, IMAP, POP3, and FTP protocols. At the same time, they constructed α-profiles to describe multi-stage attack scenarios and generate the malicious traffic. Concretely, four multi-stage attack scenarios were carried out on four different days: Infiltrating on Jun13, HTTP-DoS on Jun14, Botnet-DDoS on Jun15, and SSH-BruteForce on Jun17. Furthermore, these four scenarios are designed to have associations so that together they present a more sophisticated attack scenario. In total, the dataset contains about 91 GB of raw network traffic captured over a duration of 7 days, from June 11 to June 17, stored in seven PCAP files by day, and it provides flow-based ground truth XML files for the attack days. All the network interactions (not only between the LAN hosts but also between the LAN hosts and the external hosts) are preserved in the traffic.
There are several reasons for choosing the ISCX-2012 dataset. First, this work aims to detect malicious network behaviors in a relatively local environment, such as a campus or enterprise intranet, so the network environment simulated in this dataset meets the requirements of this work. Second, this dataset contains complete attack scenarios and relatively abundant attack behaviors, which makes the evaluations as comprehensive as possible. Third, it provides very detailed materials, such as the metadata of the network and the step-by-step attack details in the paper [40], the complete raw traffic, and usable label files, so that dataset users are able to deeply understand the activities in the network.
Though some newer datasets have been released, such as CIC-IDS-2017 and CIC-IDS-2018 [41], they do not provide such detailed information. Therefore, it takes more time to study these datasets, and we leave the evaluations on them to future work.
We studied the ISCX-2012 dataset by carefully cross-checking among the paper, the traffic, and the labels, during which we found two main problems that could have affected the experimental results and analysis. First, the original label files only provide binary labels, i.e., Normal or Attack. This may be sufficient for measuring the performance of detectors, but it cannot help further analyze their advantages and disadvantages. Note that the attack scenarios are multi-stage, which means there are many different types of malicious behaviors within a scenario. The "Infiltrating" scenario, for example, includes the behaviors of scanning, web-based attacking, backdoor reverse connecting, and so on. Therefore, given only the binary label, one can tell whether the corresponding flow is malicious but can hardly distinguish what type of behavior it belongs to. Second, there are several mislabeled and missing records for attack flows in the original label files because the label generation was based on an automatic analysis tool. For example, all the backdoor reverse connecting records since Jun15 are mislabeled as normal, while those before Jun15 are labeled correctly as attacks. On Jun15, about six thousand DDoS attack records from the user host u_9 are mislabeled as normal, and the phishing record is missing.
To report more accurate results and provide more detailed analysis, we corrected the mislabeled records, and summarized and added the behavior labels. We extracted the flow records from the traffic and then labeled them with their behavior types. The behavior distribution is shown in Table 2. In total, there are 14 different types of malicious network behaviors in the dataset. Four of them are high-volume malicious behaviors, i.e., scan, HTTP DoS, botnet DDoS, and SSH brute force, while the rest are subtle malicious behaviors, taking up only a very limited proportion of the total malicious behaviors. We further specify the hosts involved in the behaviors, which helps us understand the behavior semantics and analyze the detection results. For example, in Table 2, scan (u_5, l_{1,2}) means the behaviors of the user host u_5 scanning the first and second LANs, and control (u_5, u_12) means the behaviors of attackers controlling the user host u_12 through the user host u_5.

Experiments
We conducted experiments on the ISCX-2012 dataset to evaluate the detection performance of XNBAD. We compared it with several representative methods and further analyzed their abilities to discover different types of malicious behaviors. For each method, we trained a model with the normal traffic of Jun11 and tested it on the four attack days, i.e., Jun13, Jun14, Jun15, and Jun17. Since the performance may vary a lot with a model's hyperparameters, we first ran a hyperparameter search and selected the best models for comparison. Meanwhile, we studied the influence of the hyperparameters on the detection performance of XNBAD. Besides, to study the effectiveness of the GNN-based feature enhancement in XNBAD, we conducted ablation experiments comparing the models with and without the GNN. Finally, we analyzed the runtime and scalability of XNBAD. The experiments were conducted on Ubuntu 16.04 64-bit OS with an Intel Core i7-7700 3.60 GHz CPU and 32 GB RAM, and the models were implemented in Python using DPKT, Scikit-Learn, and PyTorch. In the following sections, we describe the metrics, the data preprocessing, and the models as well as their hyperparameters considered in the experiments.

Metrics.
In the experiments, we used the weighted area under the curve to measure the detection performance and select the best models. The area under the curve (AUC) is a threshold-free classification metric widely used in unsupervised anomaly detection tasks, and it is more robust than the threshold-based metric accuracy (ACC) [42]. However, we found that using the general AUC for model selection and comparison biases toward models that are only good at discovering high-volume attack behaviors. For example, on Jun15, as shown in Table 2, a model that distinguishes most of the DDoS attack behaviors from the normal ones will achieve a high general AUC value even if it fails to detect the other behaviors. Even if it can detect the others, the gain in the general AUC value is trivial. The reason is that when calculating the general AUC, all the attack samples in the test set are treated equally, implicitly weighted as 1.0, which makes the high-volume attacks count more. In fact, compared to such high-volume attacks, subtle attacks are more valuable and more difficult to discover. Therefore, the performance of discovering subtle attacks also needs to be reflected in the metric. The weighted AUC, which is calculated on explicitly weighted samples, is more suitable for anomaly detection problems with imbalanced multiple types of anomalies. To eliminate the bias of the general AUC, we weighted all anomaly types equally on each attack day. Suppose there are n samples in total on an attack day and they are divided into m types T_1, T_2, ..., T_m, where n = Σ_{i=1}^m |T_i|; then, the weight of type T_i is assigned as n/m for i = 1, 2, ..., m, and the weights of its samples are assigned as n/(m|T_i|). For each detection method, we randomly sampled 100 thousand flows on Jun11 for training and then calculated the weighted

Table 2: Behavior distribution of ISCX-2012. The table header specifies the corresponding days and scenarios of the columns.
Each cell of the table contains a behavior type and its corresponding number. To express the semantics of the behaviors, the source and destination hosts involved in a malicious behavior are also specified in brackets, where u_*, s_*, and a_* stand for the user, server, and attacker hosts, respectively, while l_* stands for the internal LANs. In particular, if multiple hosts or LANs are involved as one end of a behavior, they are separated by commas in the subscript, such as u_{1,6,13,15,17} and l_{1,2}. The behavior types are sorted in descending order of their numbers.

AUC on each attack day and averaged the weighted AUCs over the days to measure the overall detection performance. We repeated this procedure 10 times and used the mean values for comparison. Besides, the general AUCs are also reported for reference.
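The per-sample weighting scheme described above can be sketched directly (the function name is ours):

```python
from collections import Counter

def equalizing_weights(types):
    """Per-sample weights so each behavior type carries total weight n/m:
    a sample of type T_i gets n / (m * |T_i|)."""
    n = len(types)
    counts = Counter(types)
    m = len(counts)
    return [n / (m * counts[t]) for t in types]
```

With 6 DDoS samples and 2 scan samples, each DDoS sample is down-weighted to 8/12 and each scan sample is up-weighted to 2.0, so both types carry equal total weight. Such weights can then be passed to an AUC routine that supports sample weights, e.g., scikit-learn's roc_auc_score via its sample_weight argument.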

Models and Hyperparameters.
We considered three representative baselines in the experiments. They were the packet-based method Kitsune [4], the static embedding-based method SkipGramIDS [12], and a flow-based baseline named CICFlowKMS that extracts the flow features with CICFlowMeter [18] and utilizes K-means for detection. The experiment settings of XNBAD and its baselines are described as follows.

XNBAD.
XNBAD contains two learning modules: the K-means clustering-based anomaly detection module and the two-layer graph neural network GraphSAGE. The cluster number K of K-means was selected from {25, 50, 75, 100}. The hyperparameters of the two GraphSAGE layers were kept the same. The neighborhood sampling size was selected from {5, 10, 15, 20, 25}, and the output host feature size was selected from {100, 150, 200, 250, 300}. Three activation functions, linear, ReLU [43], and sigmoid, were considered. To train the GraphSAGE, we set both the positive and negative sampling sizes to 10 and used mini-batches of size 50 for loss calculation. We then used the Adam optimizer [44] with a learning rate of 0.1 for one epoch. Besides, we applied L2-regularization with weight λ = 0.1 to prevent overfitting and applied global gradient clipping at 5 to guard against gradient explosion.
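As a point of reference for the positive/negative sampling setup, the standard GraphSAGE unsupervised objective for one anchor node can be sketched in NumPy as follows (a hedged illustration; the paper does not print its loss, so we assume the original GraphSAGE formulation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graphsage_unsup_loss(z_u, z_pos, z_negs):
    """Unsupervised GraphSAGE loss for one anchor node (assumed form).

    z_u:    embedding of the anchor node, shape (d,)
    z_pos:  embeddings of co-occurring (positive) nodes, shape (p, d)
    z_negs: embeddings of negatively sampled nodes, shape (q, d)
    Positives are pulled toward the anchor, negatives pushed away.
    """
    eps = 1e-12  # numerical guard for log
    pos_term = -np.log(sigmoid(z_pos @ z_u) + eps).sum()
    neg_term = -np.log(sigmoid(-(z_negs @ z_u)) + eps).sum()
    return pos_term + neg_term
```

With the settings above, each anchor would use 10 positive and 10 negative samples, and losses would be averaged over mini-batches of 50.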

Kitsune.
This packet-based method learns from continuous packets and calculates an anomaly score for each packet. Since all the other methods give detection results on flows, we aggregated the packet scores within a flow into a flow score, where Min, Max, and Mean aggregations were considered. For a reasonable comparison, we used the first 5.4 million continuous packets on Jun11 for training (comparable to the number of training flows used by the other methods): the first 3 million for feature extraction learning and the remaining 2.4 million for autoencoder training. The other hyperparameters of Kitsune were kept at their defaults. Since the original version runs slowly even with Cython acceleration, we reimplemented it from the official code with slight modifications for speed. A preliminary test showed no decline in detection performance for our modified version.
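The packet-to-flow score aggregation can be sketched as follows (a minimal sketch; the flow key and data layout are assumptions, only the Min/Max/Mean choice comes from the text):

```python
from statistics import mean

def aggregate_flow_scores(packet_scores, agg="Mean"):
    """Aggregate per-packet anomaly scores into per-flow scores.

    packet_scores: dict mapping a flow key (e.g., a 5-tuple) to the
    list of anomaly scores of the packets belonging to that flow.
    agg: one of "Min", "Max", "Mean".
    """
    aggs = {"Min": min, "Max": max, "Mean": mean}
    f = aggs[agg]
    return {flow: f(scores) for flow, scores in packet_scores.items()}
```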

SkipGramIDS.
It estimates the abnormal degree of a flow by the similarity of the embeddings of the words IP_source and (IP_destination, Port_destination, Protocol). As discussed in Section 2, it cannot handle unseen IPs or destination port numbers. Therefore, we adopted a workaround (the closed-world setting) as Ring et al. [14] suggested. Concretely, the internal IPs unseen in the training set, all the external IPs, and the destination port numbers greater than 1024 were mapped to three default values, respectively. In this way, all the words are included in the training set and can be mapped to the corresponding learned embeddings. SkipGram [13] is used for embedding learning.
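The closed-world mapping can be sketched as below (an illustration only; the default token names and the internal-address prefix are our assumptions, not from the paper):

```python
UNSEEN_INTERNAL_IP = "INTERNAL_DEFAULT"
EXTERNAL_IP = "EXTERNAL_DEFAULT"
HIGH_PORT = "HIGH_PORT"

def to_closed_world(ip, port, internal_ips_seen, internal_prefix="192.168."):
    """Map flow attributes into a closed vocabulary.

    - internal IPs unseen during training -> one default value
    - all external IPs                    -> one default value
    - destination ports > 1024            -> one default value
    The internal prefix is a placeholder for the real subnet rule.
    """
    if ip.startswith(internal_prefix):
        word_ip = ip if ip in internal_ips_seen else UNSEEN_INTERNAL_IP
    else:
        word_ip = EXTERNAL_IP
    word_port = port if port <= 1024 else HIGH_PORT
    return word_ip, word_port
```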

CICFlowKMS.
It builds the normal profile by applying K-means clustering on the flow feature space and uses the distances from the cluster centers as anomaly scores, which is similar to XNBAD. The cluster number was also selected from {25, 50, 75, 100}.
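The distance-to-center scoring shared by CICFlowKMS and XNBAD can be sketched as follows (a minimal sketch; we assume the score is the Euclidean distance to the nearest center learned on normal traffic, which the paper does not spell out):

```python
import math

def anomaly_score(x, centers):
    """Distance from a sample to its nearest K-means cluster center.

    x:       feature vector (list of floats)
    centers: cluster centers learned on normal traffic
    Larger distances indicate more anomalous behavior.
    """
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(dist(x, c) for c in centers)
```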

Performance Comparison.
The detection performances of XNBAD and its three baselines are shown in Table 3. We mainly compared their detection performances by the weighted AUC since it appropriately reflects the ability to detect various subtle malicious network behaviors as well as the high-volume ones. Among these four methods, Kitsune was the worst, only reaching an overall weighted AUC of 0.6266; its weighted AUCs on the four attack days ranged from 0.5535 to 0.6802. Both SkipGramIDS and CICFlowKMS performed better than Kitsune. The static embedding-based method SkipGramIDS reached weighted AUCs over 0.8370 on most attack days except Jun17 (0.7625), resulting in an overall weighted AUC of 0.8333. CICFlowKMS achieved a relatively high overall performance (0.9153), making it the best baseline, and its performance on all attack days was over 0.9400 except on Jun15, where it was 0.8083. In comparison, XNBAD had the best performance among the four methods on all attack days, with the highest overall weighted AUC of 0.9499. The relative weighted AUC improvements of XNBAD over the baselines on each day were calculated and reported in Table 3. Besides, T-tests were conducted to analyze the statistical significance of the improvements. XNBAD achieved relative improvements over Kitsune and SkipGramIDS of more than 38.9% and 9.2%, respectively, on all attack days at the 1% significance level. Compared to CICFlowKMS, XNBAD achieved considerable improvements of 2.2% and 13.1% on Jun13 and Jun15 at the 1% significance level, respectively, and it had relatively small improvements of 0.8% and 0.5% on Jun14 and Jun17, respectively. Although relatively small, the improvement on Jun14 was also statistically significant at the 1% level. However, the improvement on Jun17 was not significant since the p value obtained from the T-test was 12.4% (> 5%).
Nevertheless, XNBAD achieved relative improvements of 51.6%, 14.0%, and 3.8% over Kitsune, SkipGramIDS, and CICFlowKMS, respectively, in terms of the overall weighted AUC at the 1% significance level.
We also observed some interesting results based on the general AUCs shown in Table 4. As observed, the general AUCs are larger than the weighted AUCs in most cases because the high-volume malicious behaviors are easier to detect and carry more weight in this metric.
The general AUCs of Kitsune on all attack days were much higher, with the highest value (0.9918) achieved on Jun15, resulting in an overall performance of 0.9249. This implies that Kitsune is better suited to discovering the network behaviors of high-volume attacks. SkipGramIDS, with an overall general AUC of 0.8489, however, turned out to be the worst because it reached only 0.5900 on Jun17 despite achieving over 0.9310 on the other days. Considering that it also had a relatively low weighted AUC on Jun17, we infer that it failed to discover most of the malicious behaviors on that day. Our proposed XNBAD, achieving the highest overall general AUC of 0.9557, was still the best among the four, which means that it remained effective at distinguishing the behaviors of high-volume attacks. The relative general AUC improvements of XNBAD with T-tests are also reported in Table 4. Similarly, almost all the improvements were statistically significant at the 1% level, with a few exceptions. When comparing to CICFlowKMS, the improvements on Jun13 and Jun14 were not significant, which indicates that XNBAD and CICFlowKMS had comparable performances on detecting the high-volume attacks on those days. On Jun15, although XNBAD was inferior to CICFlowKMS with an improvement of −0.5%, the p value obtained from the T-test was 4.9%, which is marginal with respect to the 5% significance level; it indicates that their performances on detecting DDoS attacks were similar. The improvement over SkipGramIDS on Jun15 was 4.0%, and the T-test yielded a p value of 1.9%, which suggests that this improvement was significant at the 5% level. Besides, the improvement over CICFlowKMS on Jun17 was significant at the 1% level, which indicates that XNBAD still had an advantage over CICFlowKMS in detecting SSH brute force attacks. Eventually, in terms of the overall general AUC, XNBAD still significantly outperformed all the baselines at the 1% level.

Performance Analysis.
To gain further insight into the detection performances of the four methods on various malicious behaviors, we studied their detected malicious behaviors in depth at a false positive rate of 0.1, using the refined behavior labels obtained during our dataset study. The average true positive numbers and rates over the ten repetitions are shown in Table 5, and they show how these methods produced the performances reported in Tables 3 and 4. Our analysis is detailed as follows.

Kitsune.
It is observed that Kitsune indeed preferred to find the behaviors of high-volume attacks such as scanning, DoS, and DDoS, but it could barely reveal the subtle attacks. For example, it detected on average 3.8 of the 144 backdoor connection behaviors initiated from u5 to a1, and it failed to report any of the web attacks. Since Kitsune uses an ensemble of autoencoders, which has quite a powerful learning ability, its poor performance is mainly due to its packet header-based features, which could not provide enough discriminative information for detection.

SkipGramIDS.
Obviously, it performed better than Kitsune on detecting the subtle malicious behaviors such as the backdoor connection and botnet C2, which means the learned embeddings are relatively informative. However, it detected far fewer high-volume attack behaviors than the other three methods, especially the SSH brute force behaviors. This is because it only considers the co-occurrence of (IP_source, IP_destination, Port_destination, Protocol) of flows but ignores the discriminative flow statistical features, and it suffers from the generalization problem in dynamic network environments. Since CICFlowKMS did not perform well on the SSH brute force behaviors either, we believe the main reason for its unsuccessful detection was lack of generalization rather than lack of discriminative flow features. As mentioned in Section 2, there are two reasons for this shortcoming. First, too many flows attempting SSH login made the current state of s22 (MainServer) anomalous, but the learned embedding could not reflect the anomalous state as XNBAD did; instead, it only represented the outdated state of s22 and its corresponding connection type (s22, 22, tcp). Second, SkipGramIDS could not directly provide the embedding of a1 because a1 is unseen in the training set. It had to use the corresponding default embedding for a1, which was also outdated. Therefore, the co-occurrence of (a1, s22, 22, tcp) was improperly estimated as normal. This result also implies the importance of considering timely host interaction states in anomaly detection.

XNBAD.
As shown in Table 5, it achieved true positive rates over 94% on 16 items, including 100% on 7 items; among the remaining items, it reached around 86% on 3 and around 73% on 2. This very satisfactory result was mainly attributed to XNBAD's effective data representation, in which flow features and high-order host interaction features are integrated.
Compared to the best baseline CICFlowKMS, XNBAD had higher true positive rates on more than half of the malicious behavior types, which was attributed to taking the timely host interaction states under the context into consideration. On the rest of the malicious behavior types, XNBAD also had detection results comparable to CICFlowKMS. Concretely, XNBAD detected 100% of the backdoor connections on Jun13 and 94.8% of the botnet C2 behaviors on Jun15, while CICFlowKMS detected no more than 50% of them. This explains the significant weighted AUC improvements on these two days in Table 3. Besides, it is worth noting that these two kinds of malicious behaviors usually take place in the early stage of the entire attack chain. Therefore, discovering these kinds of behaviors effectively is meaningful for cutting off the attack chain and protecting network security. On Jun14, CICFlowKMS had already performed well, so there was not much room for improvement. On Jun17, XNBAD's superiority showed in detecting the SSH brute force behaviors (72.4% vs. 26.3%), which was significantly reflected in the general AUC rather than the weighted AUC since the latter eliminates the bias toward high-volume attacks.
Moreover, we noticed that all the methods performed poorly on the HTTPGET download behaviors: at most an average of 2.6 of the 14 behaviors of this type were detected (by XNBAD). It is quite challenging to discover this type of malicious behavior without content information since these behaviors look like normal daily HTTP downloads. In contrast, the Slowloris download behaviors were much easier for CICFlowKMS and XNBAD to detect, owing to the anomalous conversation patterns caught in the flow features.
In a brief summary of the above two sections, we conclude that XNBAD is promising since it outperformed the representative competitors at the 1% significance level. It performed well on the various malicious network behaviors contained in the dataset since it utilized both the discriminative flow features and the timely interaction context-aware host features. Meanwhile, it had a limited ability to detect malicious behaviors like HTTPGET download, which deliver malicious payloads but appear normal in both the conversation patterns and the interaction states.

Hyperparameter Influence.
In the experiments, the best model of XNBAD was obtained when the neighborhood sampling size was 20, the output host feature size was 250, the linear activation was used for the GraphSAGE, and the cluster number of K-means was set to 75. The detection performances with the hyperparameters varying around this best configuration are shown in Figure 6. Figure 6(a) illustrates the trend of the detection performance as the cluster number varies. As observed, the weighted AUC grew from the minimum (below 0.94) to the maximum (near 0.95) as the cluster number increased from 25 to 75 and then dropped slightly when the cluster number reached 100, which indicates that the cluster number was a relatively important hyperparameter for the detection performance of XNBAD. The clusters represent the normal network behavior patterns; therefore, the cluster number controls the level of detail in modeling those patterns. Too few clusters (25 here) give an insufficient description and lead to underfitting, while too many lead to overfitting. Figure 6(b) shows only a slight oscillation of the detection performance with the neighborhood sampling size, meaning the detection performance was not very sensitive to it. The influence of the output size on the detection performance was more significant than that of the cluster number, as shown in Figure 6(c). The weighted AUCs at the two extremes (100 and 300) dropped markedly from the maximum, which was partly due to the underfitting and overfitting of the GraphSAGE. In addition, we think it was a negative effect induced by the dimensionality difference between the output host features and the flow features. Since the anomaly score is distance based, the dimensions of the host and flow features decide their contributions to the anomaly score.
For instance, the flow feature size is 194 here, and if the output host feature size is set to 100, then the flow features will provide 94 more dimensions than the source or destination host features when calculating the anomaly score. Too large a dimensionality difference leads to an improper contribution to the anomaly score and results in poor performance. Figure 6(d) presents the detection performance with the three different activation functions of the GraphSAGE in XNBAD. The activation function was also one of the hyperparameters affecting the detection performance. Surprisingly, the linear activation gave the best performance. Nevertheless, it is worth noting that using the linear activation in the GraphSAGE does not mean that the output features are linear in the input features, because the last operation of the GraphSAGE layers is L2-normalization, which introduces non-linearity into the output. When using the ReLU activation, the weighted AUC was under 0.92 and the error bar ranged from 0.91 to nearly 0.93, which indicates that the model with the ReLU activation performed poorly and unstably. Using the non-linear sigmoid activation resulted in a better performance, with a weighted AUC over 0.93.
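The dimensionality argument can be made concrete with a toy calculation (illustrative only; the assumption that each dimension contributes a comparable squared difference is ours):

```python
def part_contributions(dims, per_dim_sq_diff=1.0):
    """Squared-distance contribution of each feature part, assuming a
    comparable average squared difference per dimension.

    dims: dict mapping a feature-part name to its dimensionality.
    """
    return {name: d * per_dim_sq_diff for name, d in dims.items()}

# 194 flow dimensions vs. two 100-dimensional host parts: the flow
# part dominates the squared Euclidean distance roughly 194:100.
contrib = part_contributions({"flow": 194, "host_src": 100, "host_dst": 100})
```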
Together, the results in this section indicate that the detection performance of XNBAD was mainly affected by the output host feature size, the activation function, and the cluster number but not sensitive to the neighborhood sampling size.

Enhancement Effectiveness.
To understand the effectiveness of the GNN-based feature enhancement, we conducted ablation experiments in which we compared XNBAD with its non-GNN variant, where the GNN was removed and the base host features were directly concatenated with the flow features to represent the network behaviors. The results are shown in Figure 7.
As observed, XNBAD outperformed its non-GNN variant by a large margin on all attack days, yielding a relative improvement of 7.3% in terms of the overall weighted AUC. This means the GNN is indispensable to the strong detection performance XNBAD achieved. There are two benefits to applying the GNN. First, the GNN reduced the high-dimensional base host features to low-dimensional features, which overcame the aforementioned negative effects introduced by the dimensionality difference between the host features and the flow features. As observed in Figure 6(c), performance suffered when the dimensionality of the host features was too large (300) or too small (100), let alone in the non-GNN variant using the base host features of size 400. Therefore, with a proper output size, the GNN harmonized the integration of the flow features and the output host features. Second, the GNN captured the high-order structural information of the interaction graphs, which made the output host features discriminative beyond the base host features. To confirm this benefit alone, we additionally evaluated another variant that uses the GNN with an output size of 400 (the base feature size) and all other hyperparameters unchanged. It turned out that even without dimensionality reduction, this GNN variant, with an overall weighted AUC of 0.9069, still beat the non-GNN variant, which verifies the second benefit of the GNN.
In summary, the GNN-based feature enhancement applied in XNBAD was effective and greatly improved the detection performance. The GNN served not only as a high-order feature extractor but also as a dimensionality reducer during the enhancement.

Runtime Analysis.
To evaluate the runtime and scalability of XNBAD, we ran it sequentially over the four test days of the ISCX-2012 dataset, where Algorithm 1 was run first to collect all the flow records and then Algorithms 2 and 3 were run on each window of records. We recorded the duration of each process and then studied the accumulated runtime over all windows and the distribution of the runtime within a single window.
As shown in Table 6, the overall runtime of XNBAD on the four test days of the ISCX-2012 dataset was about 1.3 hours, where the first stage, i.e., flow collection and feature extraction, took up most of the time (about 1.2 hours), while the remaining two stages together ran for only about 9 minutes.
This is because the first stage needs to go through numerous packets for flow feature extraction, which leads to heavy computational overhead, while the latter two stages only need to deal with flows, nodes, and edges, which are far fewer than packets. Nevertheless, compared to the four-day duration, the overall runtime of XNBAD was much smaller, and thus acceptable. Since each window spans 2 minutes and each day has 720 windows, on average, XNBAD took about 0.19 seconds (≪2 minutes) to output the detection results after a window of data was ready. This indicates that XNBAD is able to finish its pipeline on one window well before the data of the next window arrive.
Among the window-level processes, the base host feature extraction and the feature normalization were the two most time-consuming, with about 4.8 minutes and 2.7 minutes, respectively. More specifically, computing the interactive distribution features, computing the embedding-induced features f3, and normalizing the flow features took about 2.5, 1.5, and 2.0 minutes, respectively, accounting for most of the time. On the other hand, the feature enhancement and the anomaly detection together took only about half a minute overall.
Furthermore, we investigated the relationships between the runtime and the data scale in a single window. The results for the main and time-consuming processes are illustrated in Figure 8. As shown, the runtime of most processes scaled linearly with the data scale (the number of flows/nodes/edges), while the runtime of computing f3 was quadratic in the number of edges, which is consistent with the time complexity analysis in Section 3.5. The total runtime on a single window was also quadratic in the number of edges, mainly due to computing f3. Nevertheless, the maximum total runtime was only about 2.7 seconds even when there were more than 1000 edges, which suggests that XNBAD remained efficient at large data scales. However, from this tendency, we can foresee that if the data scale keeps growing to some extreme extent, the total runtime could exceed the window duration, causing windows of data to back up. To mitigate this problem, we recommend using simple base host features instead of complicated ones as much as possible and letting the GNN do the work, since the GNN-based enhancement is effective and much faster. Besides, since windows are independent, multi-processing can be used to handle different windows in parallel. Moreover, several engineering optimizations could be applied in this situation, but they are beyond the scope of this paper.

Conclusions
In this paper, we proposed XNBAD, a novel unsupervised network behavior anomaly detection framework which improves detection by describing network behaviors under the dynamic host interaction context. To comprehensively express the behavior semantics, XNBAD integrates the extracted flow features with the GNN-enhanced timely high-order host interaction features as the network behavior representation. We carefully studied and refined the publicly available benchmark dataset ISCX-2012 and then conducted experiments on the refined dataset to evaluate the detection performance of XNBAD. The experimental results show that XNBAD is a promising framework: it effectively detected not only the high-volume attack behaviors but also most of the subtle attack behaviors in the dataset, and it outperformed the three representative baselines with at least 3.8% relative improvement at the 1% significance level on the overall weighted AUC. The results also demonstrate the effectiveness of the GNN-based host feature enhancement and the efficiency of XNBAD.
Future work includes developing unsupervised end-to-end XNBAD variants that learn on the dynamic interaction context via more advanced GNNs [45, 46] and jointly train the models to eliminate possible suboptimization, as well as examining XNBAD in other environments like IoT and on more advanced and complex datasets. Besides, since behavior detection cannot discover all types of malicious behaviors (as discussed in Section 6.2), making XNBAD collaborate with other detection mechanisms like content detection and host-based detection is also an interesting research direction.

Data Availability
The ISCX-2012 dataset used to support the results of this work is publicly available at https://www.unb.ca/cic/datasets/ids.html. The refined behavior labels used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.