Representativeness-Based Instance Selection for Intrusion Detection



Introduction
Along with the continuous development of network technology and 5G, smart systems are becoming increasingly common in all fields of human life, such as finance, agriculture, and education. However, smart systems have become the target of many new attacks, which not only cause significant financial damage and leakage of personal information but also hinder the large-scale deployment of smart systems in practice. As intrusion detection technology can effectively protect smart systems and detect attacks, its development has attracted attention worldwide [1,2]. From the perspective of classification, the main goal of building an intrusion detection system (IDS) is to train a classifier that can distinguish between normal and intrusive data in the original network data set. The IDS based on machine learning, which directly uses a large amount of network data to detect attacks, has become an important type of IDS [3]. Processing these network data can waste time and storage space. Moreover, the redundant data and noise in these data can degrade the performance of the IDS. Therefore, instance selection is used to select important data from the original data, with two goals. One is to reduce the number of instances required in the training phase, thereby saving time and reducing the amount of computation for training the classifier; the other is to improve the performance of the trained classifier by training on effective instances [4][5][6].
In recent years, many instance selection techniques have been proposed to improve the performance of IDS [7][8][9][10][11][12][13][14][15]. However, in terms of the factors and application areas of instance selection, there are mainly four problems in existing instance selection algorithms.
Firstly, only the influence of a portion of the instances is taken into account when selecting an instance [7][8][9]. For example, the instance selection algorithm based on partition and cluster center (PCIS) [7] selects representative instances by considering the K nearest neighbor instances of the same class; the binary nearest neighbor tree algorithm (BNNT) [8] and the constraint nearest neighbor-based instance reduction algorithm (CNNIR) [9] select representative instances using the K nearest neighbor instances of the selected instance. As these algorithms only consider the influence of a portion of the instances and ignore the influence of the remaining instances, some important instances are not selected.
Secondly, the influence of instances of different classes is regarded as an adverse factor in selecting an instance. For the instances of the same class, the instance selection algorithm based on a ranking procedure (ISAR) [10] and the ranking-based instance selection algorithm (RIS) [11] only select instances representing the same class and remove instances representing different classes. Some selected instances are not representative because the influence of instances of different classes is treated as an adverse factor.
Thirdly, some instance selection algorithms use sampling to select instances. The instance selection algorithm based on hierarchical data topology [12] uses hierarchical sampling to deal with large-scale data sets. This algorithm combines random subset selection (RSS) with topology-based selection (TBS) to select important instances, which form a subset of the original instances. Since sampling is used to select instances, some important instances, which carry information about the original instances, are still removed.
Finally, in the field of intrusion detection, only a few algorithms are designed to deal with imbalanced data, and most instance selection algorithms address balanced data [13][14][15]. Data imbalance is also known as instance imbalance. For the binary classification problem, under normal circumstances, the proportions of positive and negative instances should be relatively close, and many existing classification models are based on this assumption. However, in some specific scenarios, the proportions of positive and negative instances may differ greatly, which reduces the accuracy on the minority class, that is, the class with fewer instances. Therefore, instance selection algorithms that deal with imbalanced data need to be strengthened.
Given the above four problems, the factors considered for instance selection include: (1) the influence of all instances of the same class on the selected instance; (2) the influence of instances of different classes on the selected instance; (3) the influence of different classes of instances as an advantageous factor; and (4) the instance selection algorithm should be applicable to both the balanced and imbalanced domains for intrusion detection. As existing instance selection algorithms do not take all four factors into account, some selected instances are redundant and some important instances are removed, which increases storage space and reduces efficiency. Therefore, for the first three factors, we propose a new concept of instance representativeness, which expresses the importance of an instance. Considering the fourth factor, we propose two representativeness-based instance selection algorithms, named RBIS and RBIS-IM. The RBIS algorithm handles balanced data and selects the same proportion of instances from each class. The RBIS-IM algorithm deals with imbalanced data and selects important majority instances according to the number of instances of the minority class. Finally, the experimental results verify the effectiveness of the proposed algorithms. Both algorithms can reduce the size of the training set while maintaining or even increasing accuracy (ACC) and balanced accuracy (BA).
The main contributions of this paper are as follows:
(1) A new concept of instance representativeness is proposed to represent the importance of an instance. In terms of instance representativeness, we consider not only the representativeness of the instance within its class but also the representativeness of the instance within different classes. Both kinds of representativeness are treated as advantageous factors.
(2) To deal with the balanced data problem, the RBIS algorithm, which is based on instance representativeness, is designed to select the same proportion of normal instances and attack instances to improve intrusion detection efficiency. Compared with other algorithms on the benchmark data sets of intrusion detection, the RBIS algorithm can achieve a better balance between accuracy and reduction rate.
(3) To handle the imbalanced data problem, the RBIS-IM algorithm, which is based on instance representativeness, is designed to select the same number of normal instances and attack instances. Compared with other algorithms on the benchmark data sets of intrusion detection, the RBIS-IM algorithm can achieve a better balance between balanced accuracy and reduction rate.
The paper is structured as follows. In Section 2, we introduce the basic concepts of the instance selection technique. Section 3 presents the new concept of instance representativeness and two representativeness-based instance selection algorithms for balanced and imbalanced problems, respectively. Experimental results for the two representativeness-based instance selection algorithms are shown in Section 4. Finally, conclusions and a discussion of future work are presented in Section 5.

Instance Selection Technique
In this section, the basic concepts of the instance selection technique are introduced. Instance selection aims to select important instances and eliminate redundant instances from the original data, so that the selected instances retain the effective information of the original data. Suppose X represents the original data and S represents the selected instances; then S is a subset of X, i.e., S ⊂ X and |S| ≪ |X|. Using the instance subset S for IDS can improve detection efficiency and reduce storage requirements. Depending on the distribution of instances and the selection strategy, instances at different locations play different roles in the classification process. In general, instance selection algorithms are divided into three categories: condensation, edition, and hybrid.
The condensation algorithm considers that instances close to the boundary play an important role in the classification process, just as in SVM. It preserves the boundary instances by deleting the interior instances of each class [16][17][18]. In the field of intrusion detection, the nature-inspired instance selection technique (NIIS) [19] and the instance selection technique based on cuckoo search and bat algorithm (CSBAIS) [20] are proposed to improve the training speed and accuracy of the support vector machine (SVM). The NIIS algorithm applies the flower pollination algorithm and the social spider algorithm to select instances near the boundary. The CSBAIS algorithm uses cuckoo search and the bat algorithm to select instances near the boundary. However, these algorithms also remove some important internal instances.
The edition algorithm is the opposite of the condensation algorithm. It tends to smooth the class boundary by deleting the boundary instances [21][22][23]. The instance selection algorithm based on K-means and K-nearest neighbor (KMKNNIS) [24] is proposed to select important internal instances; instances near the boundary are removed. The penalty-reward-based instance selection method [25] selects instances by removing noise and boundary instances. These algorithms may discard some critical boundary instances.
Finally, the hybrid algorithm combines the condensation algorithm with the edition algorithm to obtain a smaller subset and an acceptable accuracy on the testing set [9,[26][27][28]. The PCIS algorithm [7] applies the partition and cluster center to select instances. First, the algorithm only considers the influence of k instances of the same class on the selected instances and does not consider the influence of all instances of the same class. Second, the algorithm only uses the class center instances of different classes and does not use the information of all instances of different classes. Third, the instance information of different classes is regarded as adverse information. The ISAR [10] and RIS [11] algorithms select important instances by sorting the instances. In the process of sorting, although the influence of all instances of different classes is considered, it is regarded as adverse information. The BNNT algorithm uses the binary nearest neighbor tree to select instances [8]. The algorithm only considers the k nearest neighbor instances of the selected instance and does not consider the influence of the remaining instances. Moreover, the algorithm needs to delete internal instances to select instances. The CNNIR algorithm uses the constraint nearest neighbor to select instances [9]. This algorithm also does not consider the influence of the remaining instances.
To sum up, there are mainly four factors in the instance selection process: (1) the influence of all instances of the same class on the selected instance; (2) the influence of instances of different classes on the selected instance; (3) the influence of different classes of instances as an advantageous factor; and (4) the instance selection algorithm should be applicable to both the balanced and imbalanced domains for intrusion detection. Since existing instance selection algorithms do not take all four factors into account, some selected instances are redundant and some important instances are removed, which increases storage space and reduces efficiency. Therefore, we propose two algorithms that select important instances without deleting internal instances and that can handle balanced and imbalanced data problems. The proposed algorithms consider not only the influence of all instances of the same class on the selected instances but also the influence of instances of different classes, and they treat the influence of instances of different classes as an advantageous factor.

Proposed Algorithms
In this section, we introduce the proposed representativeness-based instance selection algorithms. In the first subsection, we introduce a new instance representativeness. In the next two subsections, two representativeness-based algorithms are introduced, which are used to deal with balanced and imbalanced data problems.

Proposed Instance Representativeness.
The key factor of instance selection is to decide which instances are representative, so that the selected instance subset is representative of the original data. When selecting representative instances, we should consider not only the representativeness of an instance within its own category but also its representativeness with respect to different categories. In other words, a selected instance carries information about its own category and about different categories, and the influence of instances of different categories is seen as an advantageous factor.
Suppose that X is a training instance set containing normal and attack categories, X = {(x_1, c_1), ..., (x_n, c_2)}. X has n instances; x_i is a d-dimensional instance; c denotes the class of an instance, with c ∈ {c_1, c_2}; c_1 is the class of the normal instances X_n and c_2 is the class of the attack instances X_a; X is composed of X_n and X_a. The representativeness R(x_i, c) of any instance x_i in the training set X is given by formula (1). The first half of formula (1) represents the representativeness of instance x_i in its own category; the second half represents the representativeness of instance x_i in the different category; c_r, c_p ∈ {c_1, c_2} and r ≠ p; c_r is the category of instance x_i; c_p is the category different from that of instance x_i.
To realize Q(x_i, c_r) or Q(x_i, c_p) in formula (1), the Euclidean distance d(x_i, x_j) can be used to represent the relation between two instances. The representativeness between an instance and a class is inversely proportional to the sum of the Euclidean distances between the instance and the remaining instances of the same class. The representativeness of the instance with respect to the different category is also considered.
Security and Communication Networks
Thus, formula (1) is transformed into formula (2), where n_i is the number of instances in the same category as x_i; n_j is the number of instances in a different category from x_i. The expression c_i denotes the category of instance x_i; the expression c_i = c_j indicates that instances x_i and x_j are in the same category; the expression c_i ≠ c_j indicates that instances x_i and x_j are in different categories; if x_i and x_j are in the same class, i ≠ j.
In calculating the representativeness of an instance, R(x_i, c), three factors are considered: (1) the influence of all instances of the same class on the selected instance; (2) the influence of instances of different classes on the selected instance; and (3) the influence of different classes of instances as an advantageous factor. The proposed representativeness reflects the importance of an instance. In Section 4.3, comparisons with other algorithms on the benchmark data sets of intrusion detection verify the effectiveness of the representativeness R(x_i, c).
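As an illustration of the three factors above, the following Python sketch scores every instance against all other instances. The exact form of formula (2) is not reproduced in this text, so the scoring rule used here is an assumption: each half is taken as the number of other instances in the group divided by the sum of Euclidean distances to them, and the two halves are added. The function name `representativeness` is hypothetical.

```python
import math

def representativeness(X, y):
    """Score each instance under an ASSUMED form of formula (2):
    R(x_i) = n_same / sum_same_dist + n_diff / sum_diff_dist.
    Both halves are added, so the other class counts as an
    advantageous factor rather than an adverse one."""
    scores = []
    for i, (xi, ci) in enumerate(zip(X, y)):
        same_sum = diff_sum = 0.0
        n_same = n_diff = 0
        for j, (xj, cj) in enumerate(zip(X, y)):
            if i == j:
                continue  # an instance is not compared with itself
            d = math.dist(xi, xj)  # Euclidean distance d(x_i, x_j)
            if cj == ci:
                same_sum += d
                n_same += 1
            else:
                diff_sum += d
                n_diff += 1
        r_same = n_same / same_sum if same_sum > 0 else 0.0
        r_diff = n_diff / diff_sum if diff_sum > 0 else 0.0
        scores.append(r_same + r_diff)
    return scores
```

The double loop visits every pair of instances, which is consistent with a quadratic cost in the number of training instances.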

Representativeness-Based Instance Selection for Balanced Data.
To handle the balanced data problem, a representativeness-based instance selection algorithm, called RBIS, is proposed to select representative instances, in order to improve accuracy (ACC) and reduce the reduction rate (RR) for IDS. Through the RBIS algorithm, the same proportion of instances is selected for each class. Algorithm 1 shows the pseudo-code of the RBIS algorithm.
In Algorithm 1, the original instances X are composed of normal instances X_n and attack instances X_a. S is the set of instances selected from X; S_n is the set of selected normal instances; S_a is the set of selected attack instances; the parameter t is the ratio of selected instances, determined by cross-validation or on a validation set. Firstly, in lines 3-5 of Algorithm 1, the representativeness of each instance is calculated. S_n and S_a are initialized from the normal instances X_n and the attack instances X_a. Secondly, the representativeness values R(x_i, c) and the training set X are sorted in descending order of representativeness (lines 6 and 7). Meanwhile, S_n and S_a are sorted in descending order. Thirdly, from line 8 to line 11, 1-NN is used as the classifier on the cross-validation or validation data. The parameter t with the best accuracy is selected, and the range of t is [0, 1]. In Section 4.3, the selection process of parameter t is shown in Figures 1 and 2. According to parameter t, the first |S_n| * t instances and the first |S_a| * t instances are selected from S_n and S_a, respectively. Finally, S is determined from S_n and S_a.
Figure 3, which has two dimensions, is used to demonstrate the instance selection process of the RBIS algorithm. Figure 3(a) shows two types of original data, which are normal and attack instances. The circle is "Class One," which represents a normal instance; the square is "Class Two," which represents an attack instance. There are 10 normal instances and 10 attack instances. The instances of each class are ranked by their representativeness in Figure 3(b). The numbers around the points indicate the degree of representativeness of each instance: the smaller the number, the more representative the instance. For example, among the normal instances, number "1" is the most representative and number "10" is the least representative. In Figure 3(c), the same proportion of instances is selected in each class according to the parameter t. When the parameter t is 0.6, the first six instances of each class are selected.
The RBIS algorithm is based on the representativeness R(x_i, c) of each instance. The instances selected by the RBIS algorithm contain the information of the original data. The efficiency of the RBIS algorithm is related to the accuracy (ACC) and the reduction rate (RR). Compared with other algorithms on the benchmark data sets of intrusion detection, the experimental results in Section 4.3 show that the RBIS algorithm is effective and achieves a better balance between accuracy and reduction rate. As the same proportion of instances is selected for each class, the RBIS algorithm can handle the balanced data problem.
According to Algorithm 1 and formula (2), the time complexity of the proposed algorithm is mainly related to the calculation of instance distances within the same class and between different classes. Therefore, the time complexity of the algorithm is O(N^2).
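The selection step of RBIS (ranking each class by representativeness and keeping the first fraction t) can be sketched in Python as follows. This is a minimal sketch, assuming the representativeness scores have already been computed; the function name `rbis_select` and the choice to round the count down are assumptions, not details given in the text.

```python
import math

def rbis_select(scores, labels, t):
    """Return indices of the top fraction t of each class, ranked by
    representativeness score in descending order."""
    selected = []
    for c in sorted(set(labels)):
        # Indices of the instances that belong to class c.
        idx = [i for i, lab in enumerate(labels) if lab == c]
        # Most representative first.
        idx.sort(key=lambda i: scores[i], reverse=True)
        # Rounding down is an assumption; the small epsilon guards
        # against floating-point error in len(idx) * t.
        k = math.floor(len(idx) * t + 1e-9)
        selected.extend(idx[:k])
    return selected
```

Because the same ratio t is applied to every class, the class proportions of the original data are preserved in the selected subset.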

Representativeness-Based Instance Selection for Imbalanced Data.
To solve the imbalanced data problem, a representativeness-based instance selection algorithm, called RBIS-IM, is proposed. Through the RBIS-IM algorithm, the same number of instances is selected for each class to improve balanced accuracy (BA) and reduce the reduction rate (RR) for IDS.
Algorithm 2 shows the pseudo-code of the RBIS-IM algorithm. Like Algorithm 1, Algorithm 2 is based on the representativeness of each instance. The original instances X are composed of normal instances X_n and attack instances X_a, which are called the majority class and the minority class, respectively. The difference between the numbers of instances in X_n and X_a is large. S is the set of instances selected from X; S_n is the set of selected normal instances; S_a is the set of selected attack instances; the parameter t is the ratio of selected instances, determined by cross-validation or on a validation set.
Algorithm 1: RBIS.
Input: X: training data set; t: the ratio of selected instances, determined by cross-validation or on a validation set; X_n: the set of normal instances; X_a: the set of attack instances.
Output: S = S_n ∪ S_a; S: set of selected instances from X; S_n: set of selected normal instances from X_n; S_a: set of selected attack instances from X_a.
(1) Normalize X
(2) Initialize S, S_n, and S_a according to X, X_n, and X_a
(3) For each x_i in X
(4) Calculate R(x_i, c) by formula (2)
(5) End for
(6) [R(x_i, c), I] ⟵ sortdesc(R(x_i, c))
(7) X ⟵ sortIdx(X, I)
(8) Obtain S_n and S_a; in other words, S_n and S_a are sorted in descending order of R(x_i, c)
(9) Select the t that reaches the best accuracy using the 1-NN classifier through cross-validation or a validation set
(10) S_n ⟵ the first |S_n| * t instances of S_n; S_a ⟵ the first |S_a| * t instances of S_a
(11) Obtain S ⟵ S_n ∪ S_a

Algorithm 2: RBIS-IM.
Input: X: training data set; t: the ratio of selected instances, determined by cross-validation or on a validation set; X_n: the set of normal instances, called the majority class; X_a: the set of attack instances, called the minority class.
Output: S = S_n ∪ S_a; S: set of selected instances from X; S_n: set of selected normal instances from X_n; S_a: set of selected attack instances from X_a.
(1) Normalize X
(2) Initialize S, S_n, and S_a according to X, X_n, and X_a
(3) For each x_i in X
(4) Calculate R(x_i, c) by formula (2)
(5) End for
(6) [R(x_i, c), I] ⟵ sortdesc(R(x_i, c))
(7) X ⟵ sortIdx(X, I)
(8) Obtain S_n and S_a; in other words, S_n and S_a are sorted in descending order of R(x_i, c)
(9) Select the t that reaches the best balanced accuracy using the 1-NN classifier through cross-validation or a validation set
(10) S_n ⟵ the first |S_a| * t instances of S_n; S_a ⟵ the first |S_a| * t instances of S_a
(11) Obtain S ⟵ S_n ∪ S_a

In the process of instance selection, the number of selected majority-class instances depends on the number of minority-class instances and is equal to the number of selected minority-class instances. Firstly, in lines 3-5 of Algorithm 2, the representativeness of each instance is calculated. S_n and S_a are initialized from X_n and X_a. Secondly, the representativeness values R(x_i, c) and the training set X are sorted in descending order of representativeness (lines 6 and 7). Meanwhile, S_n and S_a are also sorted in descending order. Thirdly, from line 8 to line 11, 1-NN is used as the classifier on the cross-validation or validation data; the parameter t with the best balanced accuracy (BA) is selected, and the range of t is [0, 1]. In Section 4.3, the selection process of parameter t is shown in Figures 4-6. According to the selected parameter t, the first |S_a| * t instances are selected from each of S_n and S_a. Finally, S is determined from S_n and S_a. Figure 7, which has two dimensions, is used to explain the instance selection process of the RBIS-IM algorithm. Figure 7(a) shows the two types of original data. As in Figure 3, the smaller the number, the more representative the instance is. In Figure 7(c), when the parameter t is 1, the first four instances of the minority class are selected.
Since the number of selected majority-class instances depends on the number of minority-class instances and equals the number of selected minority-class instances, the first four instances of the majority class are also selected. Similarly, since the RBIS-IM algorithm is based on the representativeness R(x_i, c), the selected instances retain the information of the original data. The effectiveness of the RBIS-IM algorithm is evaluated by balanced accuracy (BA) and reduction rate (RR). In Section 4.3, comparisons with other algorithms on the benchmark data sets of intrusion detection show that the RBIS-IM algorithm is effective and can achieve a better balance between BA and RR. Since the same number of instances is selected for each class, the RBIS-IM algorithm can deal with the imbalanced data problem. As the time complexity of the RBIS-IM algorithm is the same as that of the RBIS algorithm, the time complexity of this algorithm is O(N^2).
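The selection rule of RBIS-IM, in which every class contributes the same number of instances and that number is set by the minority class, can be sketched in Python as follows. This is a sketch under stated assumptions: the scores are taken as given, the function name `rbis_im_select` and the `minority` argument are hypothetical, and rounding down is assumed.

```python
import math

def rbis_im_select(scores, labels, minority, t):
    """Select the same number of instances from every class.
    The count is floor(|minority class| * t)."""
    ranked = {}
    for i, lab in enumerate(labels):
        ranked.setdefault(lab, []).append(i)
    for idx in ranked.values():
        # Most representative first within each class.
        idx.sort(key=lambda i: scores[i], reverse=True)
    # The count depends only on the minority class size (assumed floor).
    k = math.floor(len(ranked[minority]) * t + 1e-9)
    selected = []
    for idx in ranked.values():
        selected.extend(idx[:k])
    return selected
```

With four minority instances and t = 1, as in the Figure 7 example, this rule keeps four instances from each class.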
The difference between the RBIS-IM and RBIS algorithms is mainly embodied in three aspects. Firstly, the problems solved by the two algorithms are different. The RBIS-IM algorithm addresses the imbalanced data problem, in which the numbers of normal instances and attack instances differ greatly; the RBIS algorithm addresses the balanced data problem, in which the numbers of normal instances and attack instances are very close or equal. Secondly, the two algorithms select instances differently. In the RBIS-IM algorithm, the selection of majority-class instances is determined by the selected minority-class instances, so the numbers of selected instances of the two classes are the same. In the RBIS algorithm, the same proportion of instances is selected for each class; since the class sizes are close, the numbers of selected normal and attack instances are also very close.
Thirdly, the evaluation criteria of the two algorithms, which are described in Section 4.2, are different. RBIS is evaluated by ACC and RR, while RBIS-IM is evaluated by BA and RR.

Experiments
In this section, experiments are designed to demonstrate the effectiveness of the proposed algorithms. The section is divided into three subsections. In the first subsection, the two experimental data sets are described. In the second subsection, the evaluation criteria are introduced. In the last subsection, the RBIS and RBIS-IM algorithms are validated on balanced and imbalanced data sets.

Experimental Data
The experiments use the Knowledge Discovery and Data Mining (KDD) Cup 1999 data set and the DDoS 2016 data set. Although the KDD 99 data set has some disadvantages, it is still widely used as a benchmark for IDS evaluation [29][30][31]. In the KDD 99 data set, the 10% KDD training data and the KDD correct data are used as training data and testing data, respectively. The distribution of these data is shown in Table 1. In the KDD Cup 99 data set, the labels include the normal class and the attack classes, which are divided into four groups: remote-to-login (R2L), denial-of-service (DoS), user-to-root (U2R), and Probe.
In the KDD Cup 99 data set, every network connection is a data record that consists of 41 features and a label specifying the status of the record. Of the 41 features, 3 are nonnumeric and 38 are numeric. During data preprocessing, the nonnumeric features, which are the protocol type, service, and flag, must be transformed into numeric data. The protocol type takes three values: tcp, udp, and icmp. Accordingly, the "protocol type" feature is transformed into three features. As the "service" feature has 70 different values and would heavily increase the dimensionality, this feature is not used in our experiments. The nonnumeric feature conversion is shown in Table 2.
The DDoS 2016 data set, published in 2016, was created using the network simulator NS2 [32,33]. There are 2.1 million data records in the data set. Each record contains 28 features: 5 nonnumeric features and 23 numeric features. The nonnumeric features need to be converted to numeric ones. The data set contains normal data and four types of DDoS attacks, which are UDP flood, smurf, HTTP flood, and SIDDOS. In this section, the subset consisting of normal data and UDP flood is used to evaluate the performance of the proposed algorithms.
According to the balanced and imbalanced domains, the Knowledge Discovery and Data Mining (KDD) Cup 1999 and DDoS 2016 data sets are divided into balanced data sets and imbalanced data sets. The description of the data sets is shown in Tables 3 and 4.
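The preprocessing of the nonnumeric KDD Cup 99 features can be sketched in Python as follows. The exact conversion in Table 2 is not reproduced in this text, so this sketch assumes a one-hot encoding for the protocol type; the output names `protocol_tcp`, `protocol_udp`, and `protocol_icmp` are hypothetical, and the "flag" feature is left unconverted because its mapping is not given here.

```python
PROTOCOLS = ("tcp", "udp", "icmp")

def encode_protocol(value):
    """One-hot encode the protocol type into three 0/1 features
    (assumed encoding; Table 2 is not reproduced here)."""
    return [1 if value == p else 0 for p in PROTOCOLS]

def preprocess(record):
    """Return a copy of a record (a dict of feature name -> value)
    with the protocol type expanded and the 'service' feature dropped."""
    out = {k: v for k, v in record.items()
           if k not in ("protocol_type", "service")}
    bits = encode_protocol(record["protocol_type"])
    for name, bit in zip(PROTOCOLS, bits):
        out["protocol_" + name] = bit
    return out
```

Dropping "service" instead of expanding it avoids adding 70 extra dimensions, which matches the design choice stated above.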

Evaluation Criteria.
To evaluate the effectiveness and performance of the proposed algorithms, the confusion matrix is used. The confusion matrix is shown in Table 5. According to the confusion matrix, four performance metrics are applied: the detection rate (DR, also known as the true positive rate), the true negative rate (TNR, also known as specificity or selectivity), balanced accuracy (BA), and accuracy (ACC). Meanwhile, the reduction rate (RR) is also applied.
In balanced data, ACC and RR are used to evaluate the performance of the proposed RBIS algorithm. To treat the minority and majority instances equally, BA is selected as the evaluation criterion of the RBIS-IM algorithm in the imbalanced problem. In the formulas below, TP, FN, TN, and FP denote the true positive, false negative, true negative, and false positive counts of the confusion matrix, with attack taken as the positive class.
The DR is the proportion of attack instances that are correctly predicted as attacks in the test data set; it is an important metric reflecting the detection model's ability to identify attack instances and is described as

$$\mathrm{DR} = \frac{TP}{TP + FN}.$$

The TNR is the proportion of normal instances that are correctly predicted as normal in the test data set; it is an important metric reflecting the detection model's ability to identify normal instances and can be written as

$$\mathrm{TNR} = \frac{TN}{TN + FP}.$$

The BA is the average of DR and TNR; it can be a leading metric for imbalanced data sets and can serve as an overall performance metric for a model:

$$\mathrm{BA} = \frac{\mathrm{DR} + \mathrm{TNR}}{2}.$$

The ACC is the ratio of the number of instances correctly predicted in the test data set to the total number of instances; it reflects the ability of the detection model to distinguish between normal and attack instances and is defined as

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}.$$

The RR is the ratio of the number of selected instances in the training data set to the total number of instances; it shows the ability of the instance selection model to select optimal instances and can be written as

$$\mathrm{RR} = \frac{|S|}{|X|}.$$
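The five criteria can be computed directly from the confusion-matrix counts, as in the following Python sketch. It follows the verbal definitions above with attack as the positive class; the function names `ids_metrics` and `reduction_rate` and the example counts in the usage note are hypothetical.

```python
def ids_metrics(tp, fn, tn, fp):
    """Compute DR, TNR, BA, and ACC from confusion-matrix counts,
    with attack as the positive class."""
    dr = tp / (tp + fn)            # attacks correctly flagged
    tnr = tn / (tn + fp)           # normal traffic correctly passed
    ba = (dr + tnr) / 2            # average of the two rates
    acc = (tp + tn) / (tp + fn + tn + fp)
    return {"DR": dr, "TNR": tnr, "BA": ba, "ACC": acc}

def reduction_rate(n_selected, n_total):
    """RR as defined above: selected instances over all training instances."""
    return n_selected / n_total
```

For example, `ids_metrics(90, 10, 950, 50)` describes a test set with 100 attacks and 1000 normal records. BA weights both classes equally, whereas ACC is dominated by the larger normal class, which is why BA is preferred for imbalanced data.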

Experimental Results and Analysis.
In this section, we use the instance subsets selected by the proposed instance selection algorithms to verify the effectiveness of the instance representativeness and of the algorithms. The experiments are conducted on balanced and imbalanced data sets. All experimental results are the average of 100 runs. The RBIS and RBIS-IM algorithms have a parameter t that determines the size of the selected instance subset. In the training phase, the parameter t is determined by grid search using cross-validation or a validation set. In the RBIS algorithm, the parameter is selected by the best ACC. In the RBIS-IM algorithm, the parameter is selected by the best BA. Figures 1 and 2 show the relation between ACC and parameter t on the balanced data sets. In each figure, panel (b) refines panel (a): Figure 1(b) is based on Figure 1(a), and Figure 2(b) is based on Figure 2(a). From Figure 1(a), the best ACC is achieved when the parameter t takes 0.1 in the interval [0.1, 1]. Therefore, the range of parameter t in Figure 1(b) lies in the interval [0, 0.1]. Through experiments, the range of parameter t in Figure 1(b) is set to the interval [0.001, 0.01]. In Figure 1(b), according to the best ACC, the parameter t is 0.3%.
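The two-stage search for t described above (a coarse grid followed by a finer grid just below the best coarse value) can be sketched in Python as follows. This is an illustrative sketch, not the authors' exact procedure: the `score` callable is a hypothetical stand-in for the validation ACC or BA of a 1-NN classifier trained on the subset selected with ratio t, and the grid sizes are assumptions.

```python
def grid(lo, hi, n):
    """Return n evenly spaced values from lo to hi inclusive."""
    return [lo + (hi - lo) * k / (n - 1) for k in range(n)]

def coarse_to_fine(score, lo=0.1, hi=1.0, n=10):
    """Pick t on a coarse grid, then refine on a finer grid that spans
    the interval just below the best coarse value."""
    coarse = grid(lo, hi, n)
    t0 = max(coarse, key=score)          # best coarse candidate
    step = (hi - lo) / (n - 1)           # coarse grid spacing
    # Finer candidates in (t0 - step, t0]; t = 0 would select nothing.
    fine = [t for t in grid(t0 - step, t0, n + 1) if t > 0]
    return max(fine + [t0], key=score)
```

Refining only below the best coarse value mirrors the intervals reported above, such as [0, 0.1] after a coarse optimum at 0.1; a search on both sides would be an equally reasonable variant.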
Like Figure 1, Figure 2(a) illustrates that the best ACC is obtained when the parameter t takes 0.1 in the interval [0.1, 1]. Therefore, through experiments, the range of parameter t in Figure 2(b) is set to the interval [0.0721, 0.0730]. In Figure 2(b), according to the best ACC, the parameter t is 7.25%.
Figure 4(a) shows the change of BA when the parameter t is in the large interval [0.1, 1]. Figure 4(b) shows the change of BA when the parameter t is in the small interval [0.71, 0.80]. Figure 4(b) is based on Figure 4(a). From Figure 4(a), the best BA is obtained when the parameter t takes 0.8 in the interval [0.1, 1]. Therefore, through experiments, the range of parameter t in Figure 4(b) is set to the interval [0.71, 0.80]. In Figure 4(b), according to the best BA, the parameter t is 0.76. From Figures 5 and 6, it is clear that BA is best on the two data sets when the parameter t is set to 1. Moreover, relevant experiments are conducted in the interval [0.9, 1].
e experimental results show that BA obtains the best when parameter t is 1. Table 6 shows that on the balanced data set, the three common classifiers, which are 1-NN, SVM, and Adaboost, use the entire training set and instance subset selected to obtain ACC, RR, and average accuracy, respectively. On the DoS data set of KDD cup 99, the accuracy of the three classifiers is greatly improved by using the instance subset selected by the RBIS algorithm. On the DDoS 2016 data set, the three classifiers also achieve good accuracy by using the instance subset. e accuracy of SVM and Adaboost using the instance subset are slightly lower than those of the whole training set, but the RBIS algorithm only uses 7.25% of instances to get good accuracy (i.e. 94.682% or 94.668%).
This shows that the RBIS algorithm can reduce RR while maintaining accuracy. On the two balanced data sets, the accuracy of 1-NN with the instance subset is higher than with the whole training set. This is because the instance subset is selected by the proposed instance selection algorithm together with 1-NN. In addition to good ACC, the RR values of the instance subsets used by the three classifiers are very small: 0.3% and 7.25% on the two data sets, respectively. This indicates that the RBIS algorithm achieves a better balance between ACC and RR. From the perspective of average ACC, the instance subset gives a much higher average ACC than the whole training set on the DoS data set, whereas on the DDoS 2016 data set the average ACC with the instance subset is only slightly higher. Overall, the RBIS algorithm can select optimal instances to improve ACC and reduce RR for IDS.
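As a rough illustration of how ACC and RR can be compared between the whole training set and a selected subset, the sketch below evaluates a plain 1-NN classifier on toy data. It assumes that RR is the fraction of training instances retained, which is how the values above read (0.3% and 7.25% coincide with the chosen t); the data, the hand-picked subset indices, and the function names are hypothetical and are not produced by RBIS.

```python
import numpy as np


def one_nn_predict(X_train, y_train, X_test):
    """Predict each test label as the label of its nearest training instance."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]


def accuracy(y_true, y_pred):
    """ACC: fraction of correctly classified test instances."""
    return float(np.mean(y_true == y_pred))


def reduction_rate(n_selected, n_original):
    """RR, assumed here to be the fraction of instances kept (smaller is better)."""
    return n_selected / n_original


# Toy two-class data: class 0 near the origin, class 1 near (5, 5).
X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                    [5, 5], [5, 6], [6, 5], [6, 6]], dtype=float)
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_test = np.array([[0.5, 0.5], [5.5, 5.5], [0.2, 0.9], [6.1, 5.2]])
y_test = np.array([0, 1, 0, 1])

# Hypothetical selected subset: one instance per class.
subset = np.array([0, 4])

acc_full = accuracy(y_test, one_nn_predict(X_train, y_train, X_test))
acc_subset = accuracy(y_test, one_nn_predict(X_train[subset], y_train[subset], X_test))
rr = reduction_rate(len(subset), len(X_train))
print(acc_full, acc_subset, rr)
```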
The results in Table 6 demonstrate that the proposed RBIS algorithm is effective and can handle balanced data. The RBIS algorithm is effective because it is based on the new instance representativeness introduced in Section 3.1. Through instance representativeness, the selected instances carry the information of the entire instance set and help improve ACC and reduce RR for IDS.

Security and Communication Networks
As shown in Table 7, on the imbalanced data sets, three common classifiers (1-NN, SVM, and Adaboost) are evaluated in terms of BA, RR, and average BA using the whole training set and the instance subset. The three imbalanced data sets are from KDD Cup 99. On the Probe data set, the three classifiers achieve good accuracy with the instance subset. Compared with the whole training set, the BA of the 1-NN classifier with the instance subset is slightly lower, while the BA of SVM and Adaboost is better. On the U2R and R2L data sets, the BA of all three classifiers is better with the instance subset than with the whole training set. The experimental results indicate that the RBIS-IM algorithm can achieve a better balance between BA and RR.
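To make clear why BA rather than ACC is reported on the imbalanced data sets, the following sketch computes both on a toy example. It assumes the standard definition of balanced accuracy as the mean of the per-class recalls; the labels and predictions are invented for illustration only.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls (standard definition, assumed here)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


# 8 normal instances (label 0) and 2 attack instances (label 1).
y_true = np.array([0] * 8 + [1] * 2)

# A classifier that labels everything as normal.
y_all_normal = np.zeros(10, dtype=int)
acc = float(np.mean(y_true == y_all_normal))
ba = balanced_accuracy(y_true, y_all_normal)

# A classifier that additionally detects one of the two attacks.
y_one_attack = np.array([0] * 8 + [0, 1])
ba_one_attack = balanced_accuracy(y_true, y_one_attack)

print(acc, ba, ba_one_attack)
```

The all-normal classifier looks acceptable under ACC but scores only chance level under BA, which is why a minority-class improvement on U2R or R2L shows up far more clearly in BA.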
Besides, from the perspective of average BA, on the Probe data set the average BA with the instance subset is slightly higher than with the whole training set. On the U2R and R2L data sets, the average BA with the instance subset is greatly improved over that with the whole training set. Therefore, the experimental results on imbalanced data sets indicate that the RBIS-IM algorithm is effective and can obtain good RR while improving BA. This is because the RBIS-IM algorithm is also based on the new instance representativeness introduced in Section 3.1. Through instance representativeness, the optimal instances are selected to improve BA and reduce RR for IDS. The experimental results also show that the RBIS-IM algorithm can handle imbalanced data.
Tables 8 and 9 report the ACC and RR of the 6 instance selection algorithms on the balanced data sets. The proposed RBIS algorithm is compared with 5 algorithms: edited nearest neighbor (ENN) [22], ISAR [10], BNNT [8], CNNIR [9], and RIS 1 [11]. For ISAR and RIS 1, only their instance selection algorithms are used. On the two balanced data sets, the proposed RBIS algorithm achieves the best ACC among the 6 algorithms in Table 8 and the second-best RR in Table 9. In terms of average performance, the RBIS algorithm achieves the best results on both ACC and RR.
This indicates that the RBIS algorithm can achieve a better balance between ACC and RR and can handle balanced data. It also shows that the RBIS algorithm is effective: the selected instances are optimal and contain the information of the whole instance set.
This is because four factors are considered in the instance selection process, as described in Section 3.1.
Table 10 shows the BA of the 6 instance selection algorithms on the imbalanced data sets. On the Probe data set, the BA values of the ENN, ISAR, RIS 1, and RBIS-IM algorithms are very close, with the largest gap between them below 1%. This shows that the RBIS-IM algorithm is able to distinguish between normal and attack instances. On the U2R and R2L data sets, the BA of the RBIS-IM algorithm is the best, exceeding every other algorithm by at least 10%. In terms of average BA, the ENN, ISAR, and RIS 1 algorithms are very close, while the RBIS-IM algorithm is the best in Table 10.
The experimental results indicate that the representative instances selected by the RBIS-IM algorithm contain the information of the whole instance set and that the RBIS-IM algorithm can select representative instances to increase the BA for IDS. Moreover, they demonstrate that the RBIS-IM algorithm can handle imbalanced data.
Table 11 presents the RR of the 6 instance selection algorithms on the imbalanced data sets. On the Probe data set, the RR values obtained by the ISAR, CNNIR, and RIS 1 algorithms are very close, whereas there is a large gap between ENN and the other algorithms. On the U2R data set, except for the ENN algorithm, the RR values of the algorithms are very close and below 1%. On the R2L data set, there is only a small difference between the RR values of the ISAR, CNNIR, and RIS 1 algorithms. In terms of average RR, the BNNT algorithm is the best, while ENN obtains a poor RR (i.e., 99.879%). Since ENN is based on the nearest neighbor rule, it only removes instances near the class boundary and deletes a limited number of majority-class instances; moreover, ENN cannot handle imbalanced data. The proposed RBIS-IM algorithm has a good RR (i.e., 13.059%). This shows that the RBIS-IM algorithm can select a small set of representative instances to reduce RR and can handle imbalanced data.
The time complexity of the 6 instance selection algorithms is presented in Table 12, where N denotes the number of original instances. According to Table 12, the algorithms fall into two groups: the ENN, BNNT, and CNNIR algorithms are O(N log N), and the ISAR, RIS 1, RBIS, and RBIS-IM algorithms are O(N^2).
Figure 8, which is based on Tables 6, 8, and 9, shows the relation between average ACC and average RR for 7 algorithms on the balanced data sets.
The 1-NN algorithm uses all training instances, and the other 6 algorithms use the instance subsets produced by their instance selection procedures. On the balanced data sets, the RBIS algorithm achieves the best average ACC and RR. Figure 8 suggests that the RBIS algorithm can select optimal instances to improve ACC and reduce RR for IDS. These optimal instances carry the information of the entire instance set.
Figure 9, which is based on Tables 7, 10, and 11, shows the relation between average BA and average RR for 7 algorithms on the imbalanced data sets. RBIS-IM is clearly the best on average BA, and Figure 9 suggests that the RBIS-IM algorithm can select optimal instances to increase BA and reduce RR for IDS. Although the average RR of the RBIS-IM algorithm is not the minimum, the algorithm achieves a good balance between average BA and average RR and can handle imbalanced data.
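The high RR reported for the ENN baseline follows from how editing works: only instances that disagree with their neighbors are dropped. The sketch below is a generic version of Wilson's edited nearest neighbor rule with k = 3, written for illustration; it is a brute-force O(N^2) variant and is not the implementation used in the experiments, and the toy data are invented.

```python
import numpy as np


def enn_keep_mask(X, y, k=3):
    """Keep an instance only if the majority label of its k nearest
    neighbors (excluding itself) matches its own label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)              # an instance is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]        # indices of the k nearest neighbors
    votes = np.array([np.bincount(y[row]).argmax() for row in nn])
    return votes == y


# Two clean clusters plus one class-0 noise point inside the class-1 cluster.
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [5, 5], [5, 6], [6, 5], [6, 6],
     [5.5, 5.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 0]

keep = enn_keep_mask(X, y)
retained = keep.sum() / len(keep)
print(keep.tolist(), retained)
```

Only the single noisy instance is removed, so almost the whole training set is retained, which matches the qualitative behavior described for ENN above.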

Conclusions
In this paper, after analyzing existing instance selection algorithms and their defects in intrusion detection, we propose a new instance representativeness to determine the importance of an instance. When calculating the representativeness of an instance, we consider not only its representativeness within its own category but also its representativeness with respect to different categories. These two kinds of representativeness are equally important. Moreover, the influence of instances of different classes on the selected instance is regarded as an advantage factor. To handle balanced and imbalanced data, we propose the RBIS and RBIS-IM algorithms, respectively. In the process of instance selection, the proposed algorithms do not need to delete internal instances or noise instances. Compared with other algorithms on benchmark intrusion detection data sets, the experimental results show that the two algorithms are effective. The RBIS algorithm achieves a better balance between accuracy (ACC) and reduction rate (RR). Similarly, the RBIS-IM algorithm achieves a better balance between balanced accuracy (BA) and reduction rate (RR). Furthermore, the results verify that the proposed instance representativeness is correct and effective.
In future work, we intend to study how to automatically obtain the appropriate parameter t of the proposed approaches, which will reduce the training time of the algorithms. Moreover, obtaining the parameter t automatically can improve and enhance the effectiveness and applicability of the algorithms.

Data Availability
In this paper, two data sets are used for intrusion detection. Both are public: the Knowledge Discovery and Data Mining (KDD) Cup 1999 data set and the DDoS 2016 data set. The corresponding URLs are, respectively, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html and https://www.researchgate.net/publication/292967044_Dataset_Detecting_Distributed_Denial_of_Service_Attacks_Using_Data_Mining_Techniques.