Anomaly Detection of System Call Sequence Based on Dynamic Features and Relaxed-SVM

The system call sequences of processes are important for host-based anomaly detection. However, detection accuracy can be seriously degraded by subsequences that appear in the call sequences of both normal and abnormal processes. Furthermore, detection may be obstructed when the normal/abnormal distributions of subsequences are extremely imbalanced and contain many ambiguous samples. In this paper, the system call sequences are first divided into weighted subsequences of fixed length. Secondly, a suffix tree of each system call sequence is constructed to automatically extract variable-length subsequences from the longest repeated substring of the tree. The frequencies of the fixed- and variable-length subsequences that appear in each system call sequence constitute its feature vector. Finally, the vectors are input into a cost-sensitive and relaxed support vector machine, in which the penalty-free slack of the relaxed SVM is split independently between the two classes with different weights. The experimental results on two public datasets, ADFA-LD and UNM, showed that the AUC of the proposed method can reach 99%, while the false alarm rate is only 2.4%.


Introduction
System calls provide the interface between system functions and user applications, and call sequences can reflect the targets of process actions. Accordingly, the system call sequences stored in various auditing and logging systems are important intrusion detection objects [1,2].
Usually, system call sequences are broken into subsequences and submitted to classification models [3]. Jewell and Beaver [4] proposed that the unique sequences of system calls are the basis for discriminating normal and abnormal behaviors of processes. Helman and Bhangoo [5] defined the priority of a system call sequence based on the probability of the occurrence of system calls to get typical features. Lee and Stolfo [6] performed machine learning tasks on operating system call sequences of normal and abnormal executions of the UNIX Sendmail program. Xie et al. [7][8][9] attempted to reduce the dimension of the frequency vectors of subsequences based on PCA to enhance the computational efficiency. Haider et al. [10] proposed to take both the rarest repeating subsequence and the most frequent repeating subsequence in the system call sequence as statistical features. Kan et al. [11] proposed a novel IoT network intrusion detection approach based on an adaptive particle swarm optimization convolutional neural network (APSO-CNN). Bian et al. [12] extracted graph-based features from host authentication logs, which are then employed in the detection of APT targets in the network. Shin and Kim [13] preprocessed the collected data using n-gram to overcome the limitations of the sequence time delay embedding (STIDE) algorithm for host intrusion detection systems (HIDS).
Recently, inspired by linguistic mining in natural language processing (NLP), researchers started to analyze the semantic relations between subsequences. Forrest and Hofmeyr [1] defined the sequences as phrases composed of words (system calls), and then utilized artificial immune systems to classify the phrases. Creech and Hu [14] proposed to draw semantic features of the system call sequences (phrases) using context-free grammar and built an extreme learning machine for the classifications. Liao and Vemuri [15] calculated the TF-IDF scores of system call sequences and input them into the K-NN model for anomaly detection. Zhang and Shen [16] built an improved TF-IDF model of subsequences, which takes both the time information and the correlation between the processes into consideration. Marteau [17] proposed a similarity measurement method to evaluate the similarity between symbolic sequences. Ambusaidi et al. [18] proposed a supervised feature selection algorithm, which is able to handle both linearly and nonlinearly related data features. Shams et al. [19] designed a new context-aware feature extraction method for convolutional neural network (CNN)-based multiclass intrusion detection. Subba [20] combined a TF-IDF vectorizer and singular value decomposition (SVD) to design a novel HIDS framework for detecting anomalous system processes.
However, it is difficult to select appropriate subsequences of system calls to discover the real purpose of the calling actions. Laszka et al. [21] proved that the optimal length of the subsequence is highly dependent on data and applications and needs to be carefully fine-tuned. For example, if the subsequence is too short, we may get an incomplete calling trace. On the other hand, if the subsequence is too long, the malicious calls are mixed with many normal system calls, and the extracted features may be disturbed. Moreover, normal call sequences are abundant while abnormal ones are extremely few, and many unusual API sequences, along with incomplete sequences, may strongly influence the classification model. How to deal with imbalanced and noisy data is also a big challenge.
In view of the above problems, both the semantic information contained in short subsequences and the representative features in variable-length subsequences are considered to generate combined features. Usually, system call sequences contain repeated subsequences, which are regarded as program-specific behavior patterns, and the length of the repeated subsequences differs among call sequences. In order to generate variable-length repeating subsequences, a suffix tree is constructed for each system call sequence and the longest repeating substring is automatically extracted from the subtrees. Furthermore, to address the imbalanced and noisy subsequences of normal and abnormal calls, the widely used relaxed support vector machine (RSVM) is improved by assigning different weights and free slack amounts to the positive and negative classes, which are scaled by the sizes of the two classes to reduce the influence of data imbalance and outliers.

Feature Extraction
In order to automatically extract the call subsequences and related features that reflect the real target of the system calls, we present a dynamic feature extraction method for system call sequences. As shown in Figure 1, the features are extracted from both fixed-length and variable-length subsequences. In the first step, the training sequences are split into subsequences by n-gram [22], and each subsequence is weighted by TF-IDF [23]. Then, the first n subsequences with the largest weight values are selected to compose a fixed-length subsequence set. In the second step, suffix trees are constructed for the training sequences. The longest repeating substrings in the suffix trees are selected as variable-length subsequences. In the third step, the fixed-length and variable-length subsequences are combined into a corpus set. In step 4, the occurrence frequencies of the corpus subsequences in each system call sequence are counted to constitute feature vectors. In step 5, an autoencoder [24] is utilized to reduce the vectors' dimension before they are submitted to the classifiers in step 6.

Fixed-Length Subsequence Based on Semantics.
The TF-IDF (term frequency-inverse document frequency) [23] is commonly utilized in text mining to evaluate the importance of every single word or phrase in a document. In this paper, TF-IDF is employed to evaluate the system calls, which have been coded into word sequences. As described in [1], each system call is first represented by a unique word (or number), so the call sequences become word sequences. As defined in (1), $\overrightarrow{seq_j}$ is the $j$th element in the set.
Definition 1. Inverse ratio of sequence frequency. The inverse ratio of a sequence frequency is the inverse document frequency (IDF) defined in TF-IDF. As shown in equation (1), $N$ is the number of training sequences, and $n_{t_i}$ is the number of fixed-length subsequences $t_i$ that appeared in the training set.
Definition 2. Vocabulary frequency of single sequence. As shown in equation (2), $fre_{ij}$ represents the frequency of the fixed-length subsequence $t_i$ in a system call sequence $\overrightarrow{seq_j}$, and the vector $Fre_j$ represents the frequencies of all the fixed-length subsequences $t = t_1, t_2, \ldots, t_m$ in the system call sequence $\overrightarrow{seq_j}$.
Definition 3. Process behavior weight. It is the combination of the inverse ratio and the vocabulary frequency of every single subsequence, as shown in (4), and evaluates the importance of a single vocabulary item (subsequence) $t_i$ to one system call sequence $\overrightarrow{seq_j}$.
Definition 4. Fixed-length corpus of system call sequences. The subsequences with the three highest weights, calculated by (4), in a single process are included in the fixed-length sequence corpus. In equation (5), $t_{ji}$ represents the $i$th fixed-length subsequence in the system call sequence $\overrightarrow{seq_j}$.
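Definitions 1-4 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the use of the raw count as the frequency, and the IDF form log(N / number of sequences containing the subsequence) are not taken from the paper, and the toy traces are hypothetical.

```python
import math
from collections import Counter

def ngrams(seq, n=2):
    """Slide a window of length n over a system call sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def fixed_length_corpus(sequences, n=2, top_k=3):
    """Collect the top_k highest-weighted n-grams of every sequence."""
    grams_per_seq = [Counter(ngrams(s, n)) for s in sequences]
    # Assumed IDF: log(N / number of sequences that contain the n-gram).
    doc_freq = Counter(g for counts in grams_per_seq for g in counts)
    total = len(sequences)
    corpus = set()
    for counts in grams_per_seq:
        # Assumed weight: raw frequency in this sequence times the IDF.
        weights = {g: c * math.log(total / doc_freq[g]) for g, c in counts.items()}
        # Sort by descending weight; break ties by the n-gram itself.
        ranked = sorted(weights, key=lambda g: (-weights[g], g))
        corpus.update(ranked[:top_k])
    return corpus

# Hypothetical traces, each system call encoded as an integer.
traces = [[6, 4, 1, 4, 1, 4, 3], [6, 4, 2, 2, 2, 3], [6, 4, 1, 3]]
print(sorted(fixed_length_corpus(traces)))
```

A subsequence that occurs in every trace, such as (6, 4) here, receives zero weight and is selected only when a trace has too few distinct n-grams to fill its top three.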

Variable-Length Subsequence.
Usually, the repeated subsequences in a system call sequence can reflect the behavior patterns of processes, which are very useful for anomaly detection. In this paper, these sequence fragments are defined as the representative features with variable lengths.

Search the Longest Repeating Substring.
After the establishment of a suffix tree, the longest repeating substring $p_k$ of the tree is selected to represent the behavior patterns of a system call sequence. As shown in Figure 4, for $seq$ = "6, 4, 1, 4, 1, 4, 3", the deepest nonleaf node is node "4" and the longest repeating substring is $p$ = "1, 4", which is incorporated into the variable-length sequence set $corpus_{variable}$.
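As a rough illustration of the search, the sketch below finds a longest repeated substring by sorting suffixes and comparing neighbours, which is a simpler stand-in for the suffix tree used in the paper. It is an assumption-laden sketch: occurrences are allowed to overlap and ties are broken by sort order, so its output on a given trace need not match the substring picked from the tree in Figure 4. The function name and the example trace are hypothetical.

```python
def longest_repeated_substring(seq):
    """Return the longest run of calls that occurs at least twice (overlaps allowed)."""
    seq = tuple(seq)
    # Sorting the suffixes places suffixes with a shared prefix next to each other.
    starts = sorted(range(len(seq)), key=lambda i: seq[i:])
    best = ()
    for a, b in zip(starts, starts[1:]):
        # Length of the common prefix of two neighbouring suffixes.
        k = 0
        while a + k < len(seq) and b + k < len(seq) and seq[a + k] == seq[b + k]:
            k += 1
        if k > len(best):
            best = seq[a:a + k]
    return best

# Hypothetical trace in which the pattern 3, 5, 7 repeats.
print(longest_repeated_substring([3, 5, 7, 3, 5, 7, 9]))
```

Sorting the suffixes this way costs more than building a suffix tree, so the sketch is meant for reading, not for long traces.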

Segmenting Long System Call Sequence.
To alleviate the effect of long system call sequences on the efficiency of suffix tree generation, long sequences are segmented into subsequences. As shown in Figure 5, a system call sequence $seq_i = s_1, s_2, \ldots, s_{500}, \ldots, s_{len}$ with length $len > 500$ is divided into subsequences $seq_{i1}, seq_{i2}, \ldots, seq_{ij}$, where $seq_{ij}$ represents the $j$th subsequence of $seq_i$. The suffix trees $Tree_{i1}, Tree_{i2}, \ldots, Tree_{ij}$ are constructed for the subsequences $seq_{i1}, seq_{i2}, \ldots, seq_{ij}$, respectively.
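The segmentation step might look like the following sketch. It assumes consecutive, non-overlapping segments of at most 500 calls; the text gives 500 only as the length threshold, so the segment size and the function name are assumptions.

```python
def segment(seq, max_len=500):
    """Split a long sequence into consecutive chunks of at most max_len calls."""
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]

# A hypothetical trace of 1200 calls yields chunks of 500, 500, and 200 calls.
chunks = segment(list(range(1200)))
print([len(c) for c in chunks])
```

One consequence of this choice is that a repeated pattern straddling a segment boundary is cut in two, so each suffix tree sees only the patterns that lie inside its own segment.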

Generating Corpus.
The fixed-length subsequence set $corpus_{fixed}$ and the variable-length subsequence set $corpus_{variable}$ together constitute a combined set $Corpus$ that represents the behavior patterns of the system calls.

Feature Extraction.
Firstly, the frequency of the subsequences defined in (8) that appear in each system call sequence is counted by an AC automaton (Aho-Corasick automaton) [24]. Let $fre^{t}_{ji}$ represent the occurrence frequency of the fixed-length subsequence $t_i$ in a system call sequence $seq_j$, and let $fre^{p}_{ji}$ represent the frequency of a variable-length subsequence $p_i$ in $seq_j$. The frequency vectors of all the call sequences constitute a feature matrix.

Security and Communication Networks
$$
Vec = \begin{bmatrix}
fre^{t}_{11} & \cdots & fre^{t}_{1k} & fre^{p}_{11} & \cdots & fre^{p}_{1m} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
fre^{t}_{j1} & \cdots & fre^{t}_{jk} & fre^{p}_{j1} & \cdots & fre^{p}_{jm} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
fre^{t}_{n1} & \cdots & fre^{t}_{nk} & fre^{p}_{n1} & \cdots & fre^{p}_{nm}
\end{bmatrix}
$$
As the number of system call sequences increases, the number of subsequences in the corpus also increases dramatically. In order to control the dimension of the feature vectors and facilitate mining of the potential features in the matrix Vec, in (8), the autoencoder [26] is utilized to reduce the dimension of Vec. Finally, Vec is submitted to the weighted relaxed support vector machine described in the next section.
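A minimal sketch of how such a frequency matrix could be assembled is given below. For brevity it scans each trace naively rather than using an Aho-Corasick automaton, which gives the same counts but is slower. Overlapping occurrences are counted, which is an assumption, and the function names and toy data are hypothetical.

```python
def count_occurrences(seq, pattern):
    """Count occurrences of pattern in seq, including overlapping ones."""
    seq, pattern = tuple(seq), tuple(pattern)
    m = len(pattern)
    return sum(1 for i in range(len(seq) - m + 1) if seq[i:i + m] == pattern)

def feature_matrix(sequences, corpus):
    """One row per trace, one column per corpus subsequence (sorted order)."""
    columns = sorted(corpus)
    return [[count_occurrences(s, p) for p in columns] for s in sequences]

# Hypothetical corpus mixing fixed-length and variable-length subsequences.
corpus = {(1, 4), (6, 4), (4, 1, 4)}
traces = [[6, 4, 1, 4, 1, 4, 3], [6, 4, 2, 2, 2, 3]]
for row in feature_matrix(traces, corpus):
    print(row)
```

Each row is the feature vector of one trace; dimensionality reduction, for example with an autoencoder as in the paper, would be applied to these rows afterwards.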

Weighted Relaxed Support Vector Machines
The widely used RSVM [27] is an extension of SVM-L2 with an additional penalty-free slack variable for each sample. A restricted amount of penalty-free slack is used to relax influential support vectors and push them towards their respective classes. In this paper, based on WRSVM [28], we modify RSVM by assigning different weights and free slack amounts to the two classes, normal and abnormal call sequences, which are scaled by the positive and negative class sizes with different penalty control factors, $C_1$ and $C_2$. The enhanced weighted relaxed support vector machine (EWR-SVM) model is given in equation (10), which is an extension of the well-known SVM formulation RSVM. The constructed model differs from RSVM in that it differentiates between the positive and negative classes and uses weights inversely proportional to the class sizes.
$\min_{\omega, b, \xi, \upsilon}$
where $n_+$ and $n_-$ are the sizes of the majority (normal) and minority (abnormal) classes, $I_+$ and $I_-$, respectively. Free slack is denoted by the variable $\upsilon_i$ in the constraints for the $i$th sample. Due to the imbalance, separate amounts of total free slack are provided for the normal and the abnormal classes in the constraints, and $\upsilon_i$ is parameterized by $c$, which is the free slack provided per sample.
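One formulation consistent with this description is sketched below. It is an assumption, not the authors' exact equation (10): it takes the squared-slack (L2) objective mentioned for RSVM, weights the two class penalties by $C_1/n_+$ and $C_2/n_-$, and caps the total free slack of each class at $c$ per sample.

$$
\begin{aligned}
\min_{\omega, b, \xi, \upsilon} \quad & \frac{1}{2}\lVert \omega \rVert^{2} + \frac{C_1}{2 n_+} \sum_{i \in I_+} \xi_i^{2} + \frac{C_2}{2 n_-} \sum_{i \in I_-} \xi_i^{2} \\
\text{s.t.} \quad & y_i \left( \langle \omega \cdot x_i \rangle + b \right) \ge 1 - \xi_i - \upsilon_i, \quad i = 1, \ldots, n_+ + n_-, \\
& \sum_{i \in I_+} \upsilon_i \le c\, n_+, \qquad \sum_{i \in I_-} \upsilon_i \le c\, n_-, \qquad \upsilon_i \ge 0 .
\end{aligned}
$$

In this form, dividing by the class size makes a misclassified minority sample cost more than a majority one, while the free slack $\upsilon_i$ lets a limited number of outliers violate the margin without any penalty.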
Then, by calculating the first-order partial derivatives of the Lagrangian function of equation (10) with respect to the related Lagrangian multipliers, we can get the Wolfe dual of equation (10). Equation (12) can be efficiently solved using the sequential minimal optimization (SMO) algorithm, and the dot products $\langle x_i \cdot x_j \rangle$ in equation (12) can be replaced by a kernel $k(x_i, x_j)$ for nonlinear classification.

Datasets Description.
In this section, a group of comparisons is carried out to test the performance of the proposed method on two public system call sequence datasets, ADFA-LD and UNM. The ADFA-LD [29] was released by the Australian Defence Force Academy. It contains thousands of system call traces collected from modern Linux local servers. The UNM [30] was released by the University of New Mexico and contains system call sequences of normal applications in the Linux system and sequences from attacking processes. The detailed descriptions of the two datasets are shown in Tables 1-3, from which we can see that UNM is an obviously imbalanced dataset.
Firstly, the sequence features are extracted by counting the number of appearances of the fixed-length and variable-length subsequences in each system call sequence, and the extracted features are verified by mathematical tests including information gain and the Mann-Whitney U test [31] in the attachment. Then the features are input into classification models including the proposed EWR-SVM, naive Bayes, logistic regression, random forest, and gradient descent, and the performance is also compared with some traditional methods [8,10,32]. The results are evaluated by AUC, F1-score, false alarm rate, and the receiver operating characteristic (ROC) curve. The ROC curve takes both the false positive rate and the true positive rate into consideration: it is drawn by plotting the true positive rate against the false positive rate as the decision threshold varies. AUC is the area under the ROC curve.
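To make the ROC and AUC definitions concrete, the sketch below sweeps a decision threshold over classifier scores and integrates the resulting curve with the trapezoidal rule. The labels (1 for abnormal, 0 for normal), the scores, and the function names are hypothetical, and both classes are assumed to be present.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs over all thresholds."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))
```

The false alarm rate reported for a classifier is the false positive rate at the single threshold actually used for detection, that is, one point on this curve.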

Experimental Result.
The ADFA-LD experimental dataset consists of 833 normal system call sequences and 719 attack system call sequences. The UNM experimental dataset consists of 6 normal sample files and 16 attack sample files. The features of the system call sequences are extracted from the ADFA-LD and UNM datasets. Fixed-length features are extracted from the system call sequences by a sliding window of length 2. Variable-length features are extracted from the system call sequences by a suffix tree. The anomaly detection results of the classifiers, including EWR-SVM, naive Bayes, logistic regression, random forest, and gradient descent tree, are shown in Table 4 and Figures 6 and 7.
The performance of EWR-SVM is better than that of the others on the two datasets. The results demonstrate that the features extracted by our method contain useful information that reflects the behavior patterns behind the call sequences, and that the features help the model recognize abnormal program behaviors. In order to further demonstrate the effectiveness of our method, a feature validation is carried out on the ADFA-LD and UNM datasets.

Feature Validation Verification Experiment.
The occurrence frequencies of fixed- and variable-length subsequences in ADFA-LD and UNM are shown in Figures 8-15. From the figures, we can see that the distributions of the word (call) frequency differ between the normal and attacking traces (system call sequences). The information gain and the Mann-Whitney U test are used to demonstrate the effectiveness of the extracted features. The information gain of both the selected fixed-length sequences and the remaining ones is shown in Table 5. The table shows that the information gain of the selected fixed-length sequence features is clearly higher than that of the others. The results demonstrate that fixed-length sequences with high TF-IDF weights are useful and can be utilized to extract the behavior features of system call sequences.
According to the statistical results, the frequencies of the subsequences of normal and attacking sequences do not follow the normal distribution in either dataset, so a nonparametric method, the Mann-Whitney U test [31], is employed. The following hypotheses are proposed:
$H_0$: there is no significant difference between the subsequence frequency of the normal sequences and the subsequence frequency of the abnormal sequences.
$H_1$: the opposite hypothesis of $H_0$.
The results of the Mann-Whitney U test are shown in Table 6. When the p-value > 0.05, the null hypothesis $H_0$ cannot be rejected; when the p-value < 0.05, $H_0$ is rejected in favor of $H_1$. From Table 6, we can see that the frequency of fixed-length subsequences in the normal sequences of the ADFA-LD and UNM datasets is significantly different from that in the abnormal sequences. Similarly, the frequency of variable-length subsequences in the normal sequences of the ADFA-LD and UNM datasets is also significantly different from that in the abnormal ones.
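For reference, the information gain of a single subsequence feature can be computed as in the sketch below. It assumes the feature is reduced to presence or absence of the subsequence in a trace, which may differ from the discretization used for Table 5; the function names and toy data are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    probs = [labels.count(v) / total for v in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(present, labels):
    """Entropy reduction from splitting traces on whether a subsequence is present."""
    gain = entropy(labels)
    for flag in (True, False):
        subset = [y for f, y in zip(present, labels) if f == flag]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain

labels = [1, 1, 0, 0]  # 1 = attack trace, 0 = normal trace (hypothetical)
print(information_gain([True, True, False, False], labels))
print(information_gain([True, False, True, False], labels))
```

A subsequence that appears only in attack traces removes all uncertainty about the class, whereas one that appears equally often in both classes contributes no gain.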
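The test statistic itself is simple to compute, as the following sketch shows. The p-value uses the large-sample normal approximation without tie or continuity correction, which is an assumption and only rough for small samples; the frequencies and function names are hypothetical.

```python
import math

def mann_whitney_u(xs, ys):
    """U statistic of xs: each pair with x > y counts 1, each tie counts 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)

def two_sided_p(u, n1, n2):
    """Two-sided p-value from the normal approximation (no tie correction)."""
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = abs(u - mean) / sd
    return math.erfc(z / math.sqrt(2))

# Hypothetical per-trace frequencies of one subsequence.
normal_freq = [0, 1, 1, 2, 0, 1]
attack_freq = [3, 4, 2, 5, 3, 4]
u = mann_whitney_u(normal_freq, attack_freq)
print(u, two_sided_p(u, len(normal_freq), len(attack_freq)))
```

With these toy frequencies almost every attack trace has a higher count than every normal trace, so U is close to zero and the approximate p-value falls below 0.05.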

Comparison with Traditional Methods.
In this section, we compare EWR-SVM with other abnormal trace detection methods [8,10,32]; the results are shown in Table 7. From the table, we can see that our method outperforms the others. In [8], system call sequences are classified according to the abnormal frequencies that appear in the tested traces. In their method, the optimal window width is set by empirical tests. However, the optimal length of the subsequence varies among datasets, resulting in unstable performance. Haider [10] proposed to utilize the rarest repeating subsequence, the most frequently repeated subsequence, along with the maximum and minimum system calls in a single sequence to extract the trace features. However, the correlative contextual semantic information in the sequence is ignored. Anandapriya and Lakshmanan [32] brought the concept of contextual semantic information into trace detection with fixed-length windows. The results in Table 7 showed that the AUC of the proposed EWR-SVM is much larger than that of the others, while its false-positive rate is lower than that of the compared methods. That is because our method considers not only the semantic information of the sequences but also the behavior patterns of the processes behind the calls. Furthermore, we proposed to assign different weights and free slack amounts to the two classes (normal and attacking traces), which are scaled by the normal and abnormal class sizes to alleviate the effect of the imbalanced data distribution between the two classes. In addition, the slack factors in EWR-SVM make outliers have less influence on the optimal hyperplane and ensure a large margin between the two classes.

Conclusions
In this paper, both fixed-length and variable-length subsequences of system calls are taken into consideration for host-based intrusion detection. The semantic weight is incorporated into the selection process of fixed-length subsequences. On the other hand, the variable-length subsequence is automatically selected from the suffix tree of each call sequence. In order to deal with the imbalanced data distribution and sample outliers of the call sequences, a cost-sensitive relaxed support vector machine, EWR-SVM, is proposed, in which the restricted penalty-free slack is split independently between the two classes in proportion to the number of samples in each class with different weights. Both

Conflicts of Interest
The authors declare that there are no conflicts of interest.