In the era of big data, feature selection is an essential step in machine learning. Although the class imbalance problem has recently attracted a great deal of attention, little effort has been devoted to developing feature selection techniques for it. In addition, most applications of feature selection focus on classification accuracy and ignore cost, although costs are important. To cope with imbalance problems, we developed a cost-sensitive feature selection algorithm, referred to as CSFSG, that adds a cost-based evaluation function to a filter feature selection framework using a chaos genetic algorithm. The evaluation function considers both feature-acquisition costs (test costs) and misclassification costs in the field of network security, thereby weakening the influence of the many instances from the majority classes in large-scale datasets. The CSFSG algorithm reduces the total cost of feature selection and trades off the two cost factors. The behavior of the CSFSG algorithm is tested on a large-scale network security dataset using two kinds of classifiers: C4.5 and KNN.
The class imbalance problem is found in various scientific and social arenas, such as fraud/intrusion detection, spam detection, risk management, technical diagnostics/monitoring, financial engineering, and medical diagnostics [
There are essentially two methods to address the class imbalance problem: sampling methods and cost-sensitive learning methods [
Most data mining techniques are not designed to cope with large numbers of features, a difficulty widely known as the curse of dimensionality, and feature selection is no exception. The class imbalance problem becomes more severe when data dimensionality is high. Among the many methods that exploit feature selection, the most common, efficient, and effective are those that retain only the relevant features [
In this paper, we investigate cost-sensitive feature selection in an imbalanced scenario. Specifically, before briefly introducing cost-sensitive learning and its application to feature selection, we illustrate the imbalance problem, the most relevant topic of the current research. We then propose a new feature selection method whose goal is to provide an efficient approach in the field of network security, an arena in which large, imbalanced datasets are typical. Thus, rather than improving on previous methods, our purpose is to match the performance of previous cost-sensitive feature selection approaches with a method that handles very large datasets with imbalance problems.
Different costs are associated with different misclassification errors in real world applications [
Cost-sensitive learning has two major types of costs: misclassification and test costs [
Test costs typically refer to money, time, computing, or other resources that are expended to obtain data items related to an object [
Some studies focus on misclassification costs but fail to consider the cost of the test [
In general, classification time increases with the number of features based on the computational complexity of the classification algorithm. However, it is possible to alleviate the curse of dimensionality by reducing the number of features, although this may weaken discriminating power.
A classifier can be understood as a specific function that maps a feature vector onto a class label [
Several works have addressed cost-sensitive feature selection in recent years. For example, Bosin et al. [
Mejía-Lavalle [
Wang et al. [
In a study by Lee et al. [
Chang et al. [
Zhao et al. [
Liu et al. [
However, few studies have focused on the class imbalance problem from the perspective of cost-sensitive feature selection. To the best of our knowledge, no study addresses cost-sensitive feature selection in the network security field because of the significant domain differences and dependencies.
Here, we present the common notations and an intrusion detection event that was taken from the KDD CUP’99 dataset [
Let the original feature set be
Assume that, in the instance space, we have a set of samples
The cost-sensitive feature selection problem is also called the feature selection with minimal average total cost problem [
Let MC be the misclassification cost matrix and let TC be the test cost matrix. The average total cost should be the following:
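With sample set $U$, selected feature subset $B$, true class $c(x)$, and class $\hat{c}_B(x)$ predicted from the features in $B$, one formulation consistent with these definitions (a sketch in the notation above, rather than a verbatim reproduction of the equation) is
\[
AC(B) = \frac{1}{|U|} \sum_{x \in U} \Big( \sum_{f \in B} TC(f) + MC\big(c(x), \hat{c}_B(x)\big) \Big),
\]
that is, the average over all samples of the total test cost of the selected features plus the misclassification cost incurred on each sample.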
In the real world, there are many types of costs associated with a potential instance, such as the cost of additional tests, the cost associated with expert analysis, and intervention costs [
Without loss of generality, let
The cost-sensitive classification problem can be constructed as a decision theory problem using Bayesian Decision Theory [
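In its standard form, the Bayes-optimal cost-sensitive decision assigns an instance $x$ to the class that minimizes the expected conditional cost:
\[
\hat{c}(x) = \operatorname*{arg\,min}_{j} \sum_{i} P(i \mid x)\, MC(i, j),
\]
where $P(i \mid x)$ is the posterior probability of class $i$ given $x$, and $MC(i, j)$ is the cost of predicting class $j$ when the true class is $i$.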
Although decision tree building does not need to be cost-sensitive to perform feature selection, an algorithm requires the cost-sensitive property to rank or weight features by their importance [
Borrowing ideas from credit card and cellular phone fraud detection in related fields, the Lee research group identifies three major cost factors related to intrusion detection: damage cost (DCost), response cost (RCost), and operational cost (OpCost) [
Cost metrics of intrusion categories.
| Main category (by results) | Description | DCost | RCost |
|---|---|---|---|
| U2R | Illegal root access is obtained | 100 | 60 |
| R2L | Illegal user access is obtained from outside | 50 | 40 |
| DOS | Denial-of-service of the target is accomplished | 30 | 15 |
| PROBE | Information about the target is gathered | 2 | 7 |
| Normal | Normal events | 0 | 0 |
The expected misclassification cost for a new example drawn at random from the underlying data distribution can then be computed from these cost metrics.
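A standard formulation, assuming $P(i)$ is the prior probability of class $i$ and $P(j \mid i)$ is the probability that the classifier assigns class $j$ to an example of true class $i$, is
\[
E[MC] = \sum_{i} P(i) \sum_{j} P(j \mid i)\, MC(i, j).
\]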
Based on the study by Lee et al. [
Operation cost metrics.
| Feature level | Lev. 1 features | Lev. 2 features | Lev. 3 features | Lev. 4 features |
|---|---|---|---|---|
| Relative magnitudes | 1 | 5 | 10 | 100 |
We assume that both misclassification and test costs are given in the same cost scale. Therefore, summing together the two costs to obtain the average total cost is feasible.
Unlike traditional feature selection algorithms, whose purpose is to improve classification accuracy or reduce measurement error, this paper attempts to minimize total costs and to trade off costs against classification accuracy. The final objective of the feature selection problem is to select a feature subset of minimal size that minimizes the average total cost while maintaining classification accuracy.
Let feature subset
The genetic algorithm (GA) is a popular, easily parallelized method with powerful global search ability and is widely used in feature selection [
Most studies introduce a logistic map [
By comparing ten one-dimensional chaotic maps in terms of the convergence rate, algorithm speed, and accuracy, Tavazoei and Haeri found that no single map has the best global optimization ability [
As the Tent map requires the most computational time, which seriously affects algorithm speed, we improved it by deploying the random equation based on [
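To make the iteration concrete, the following is a minimal Java sketch of a Tent-map sequence generator with a random restart; the restart condition below is an illustrative assumption standing in for the random equation of the cited work, not our exact implementation:

```java
import java.util.Random;

/** Chaotic sequence generator based on the Tent map. */
public class TentMap {
    private static final Random RNG = new Random();
    private double x;

    public TentMap(double seed) { this.x = seed; }  // seed in (0, 1)

    /** One Tent-map iteration: x -> 2x for x < 0.5, else 2(1 - x). */
    public double next() {
        x = (x < 0.5) ? 2.0 * x : 2.0 * (1.0 - x);
        // The plain Tent map can collapse onto its fixed points (0 and 2/3)
        // or short cycles; restart from a random point when that happens
        // (illustrative stand-in for the cited random equation).
        if (x < 1e-6 || Math.abs(x - 2.0 / 3.0) < 1e-6) {
            x = RNG.nextDouble();
        }
        return x;
    }
}
```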
In this section, we propose a cost-sensitive feature selection model that uses a new cost-sensitive fitness function and chaos GA to solve the class imbalance problem. The algorithm follows the filter approach, which is not associated with a particular classifier. Finding a minimal optimal cost feature subset is NP-hard, particularly in those situations with an imbalanced dataset. However, it is important to combine the feature selection procedure with the learning model [
Our algorithm consists of four main steps (see Figure
Flowchart for the CSFSG algorithm.
Convert the discrete attributes to numeric values and normalize all attribute values to the range [0, 1].
Generate the initial population using the Tent map and encode it. Set the crossover probability and the mutation probability.
Calculate the fitness value of each individual and select the optimum population based on the cost-sensitive fitness function.
Apply the GA to the population, search for the candidate feature subset using the chaos optimization algorithm, and update the current population (a minimal sketch of the cost-sensitive fitness evaluation follows this list).
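The core of step 3 is the cost-sensitive fitness used to rank individuals. The following is a minimal Java sketch assuming a binary chromosome over the candidate features; `testCost` (the per-feature acquisition costs) and `estMisclassCost` (an estimate of the average misclassification cost for the candidate subset) are illustrative inputs, not the exact implementation:

```java
/**
 * Cost-sensitive fitness of a candidate feature subset: the sum of the
 * test costs of the selected features plus the estimated average
 * misclassification cost of a classifier restricted to that subset.
 * Lower values are fitter, so the GA minimizes the average total cost.
 */
static double averageTotalCost(boolean[] chromosome,
                               double[] testCost,
                               double estMisclassCost) {
    double totalTestCost = 0.0;
    for (int i = 0; i < chromosome.length; i++) {
        if (chromosome[i]) {
            totalTestCost += testCost[i];  // cost of acquiring feature i
        }
    }
    return totalTestCost + estMisclassCost;
}
```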
We conducted a series of experiments to compare the overall performance of our approach with some existing algorithms. In this section, we introduce the implementation setting and the evaluation measurements used in our experiments. Then, we describe the relative comparison experiments. Finally, we discuss the results.
Our experiments were implemented in the Weka framework with Java, which is available at
The raw dataset comprises approximately 4 GB of data and 5 million instances, divided into labeled and unlabeled parts, with features of three types: continuous, discrete, and string. In this study, we use the ten-percent version, which consists of 494,021 connections and 24 types of attacks in 5 classes (Normal, DOS, U2R, R2L, and PROBE). The four main attack types are DOS (denial-of-service, e.g., the land attack), U2R (user-to-root, e.g., the rootkit attack), R2L (remote-to-local, e.g., the guess-password attack), and PROBE (information about the target is gathered, e.g., the nmap attack). The detailed characteristics of these datasets are shown in Table
Class distribution in the raw and experimental datasets of the KDD Cup’99 dataset.
| Type | DOS | U2R | R2L | PROBE | Normal |
|---|---|---|---|---|---|
| Number | 391458 | 52 | 1126 | 4107 | 97278 |
| Percentage | 79.23% | 0.01% | 0.23% | 0.83% | 19.69% |
Our feature selection method, being a filter approach, does not depend on classifiers. We chose two popular classifiers for the classification task: KNN and C4.5.
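To make the setup concrete, the sketch below shows how these two classifiers can be evaluated in Weka with the tenfold cross-validation used below; J48 is Weka's C4.5 implementation and IBk its KNN, while the file name `kdd10.arff` and the choice of k = 1 are assumptions for illustration:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Evaluate the two classifiers used in this study with tenfold
// cross-validation. "kdd10.arff" is a hypothetical file holding the
// ten-percent KDD Cup'99 data after feature selection.
public class ClassifierEval {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("kdd10.arff");
        data.setClassIndex(data.numAttributes() - 1);  // class is the last attribute

        Evaluation c45 = new Evaluation(data);
        c45.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println("C4.5:\n" + c45.toSummaryString());

        Evaluation knn = new Evaluation(data);
        knn.crossValidateModel(new IBk(1), data, 10, new Random(1));
        System.out.println("KNN:\n" + knn.toSummaryString());
    }
}
```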
For the sake of comparison, we designed two groups of experiments (feature selection and classification), one using our cost-sensitive feature selection method and one without it. In the feature selection stage, the average total cost (the sum of the average test and misclassification costs) is used to validate the effectiveness of the proposed method. In the classification stage, weighted accuracy is used in place of plain accuracy because it is more suitable for imbalanced datasets. Moreover, we use tenfold cross-validation to evaluate classifier performance and compare the execution time of each classifier with and without feature selection.
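A common form of weighted accuracy, which we use here for illustration, averages the per-class recalls so that each of the $k$ classes contributes equally regardless of its size:
\[
\mathrm{WA} = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{N_i},
\]
where $TP_i$ is the number of correctly classified examples of class $i$ and $N_i$ is the total number of examples of class $i$.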
The algorithm proposed in this paper focuses on the cost-sensitive fitness function rather than on parameter optimization of the GA; thus, the GA parameters themselves are not discussed here. Following the most common GA configurations, the parameters were set nearly the same as those in Weiss et al. [
Bolón-Canedo et al. studied and evaluated the behavior of the methods under the influence of the parameter
Confusion matrix.
| Actual class \ Predicted class | Positive | Negative |
|---|---|---|
| Positive | TP (true positives) | FN (false negatives) |
| Negative | FP (false positives) | TN (true negatives) |
We chose precision, recall, and the ROC area as evaluation metrics, all computed per class from the confusion matrix.
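In terms of the confusion matrix above, the first two metrics are defined per class as
\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},
\]
and the ROC area (AUC) is the area under the curve of the true-positive rate against the false-positive rate as the decision threshold varies.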
For the sake of comparison, we designed two stages of experiments. In the first (feature selection) stage, we compared the effect of our cost-sensitive feature selection method with that of traditional methods on an imbalanced dataset. In the second (classification) stage, we compared the performance of two types of classifiers (KNN, C4.5) using CSFSG via 10-fold cross-validation in the imbalanced environment.
Because there are multiple classes with deficient instance numbers and several evaluation metrics, we designed many combinations of comparison experiments for our proposed feature selection method, using two baseline feature selection algorithms: correlation-based feature selection (CFS) [
Table
Comparing the feature number and execution time (in seconds).
| Feature selection algorithm | Feature number | Execution time (with KNN) | Execution time (with C4.5) |
|---|---|---|---|
| — | 41 | 0.16 | 17.12 |
| CFS | 11 | 0.04 | 2.78 |
| CASH | 23 | 0.21 | 33.14 |
| CSFSG | 17 | 0.05 | 7.19 |
As seen in Figures
Precision values of KNN with feature selection.
Precision values of C4.5 with feature selection.
In the classification stage, 10-fold cross-validation is applied on the datasets. The results of classification with feature selection are shown in Table
Classification evaluation results for the five classes under each feature selection method.
Precision

| Feature selection | Classifier | Normal | DOS | PROBE | R2L | U2R |
|---|---|---|---|---|---|---|
| — | KNN | 0.996 | 0.998 | 0.990 | 0.936 | 0.876 |
| CFS | KNN | 0.997 | 0.998 | 0.990 | 0.938 | 0.899 |
| CASH | KNN | … | … | … | 0.936 | 0.866 |
| CSFSG | KNN | 0.997 | 0.997 | 0.989 | … | … |
| — | C4.5 | 0.997 | 0.998 | 0.988 | 0.963 | 0.966 |
| CFS | C4.5 | 0.997 | 0.998 | 0.988 | 0.967 | 0.966 |
| CASH | C4.5 | 0.997 | 0.989 | … | 0.942 | 0.963 |
| CSFSG | C4.5 | … | … | 0.953 | … | … |

Recall

| Feature selection | Classifier | Normal | DOS | PROBE | R2L | U2R |
|---|---|---|---|---|---|---|
| — | KNN | 0.996 | 0.998 | 0.986 | 0.779 | 0.6 |
| CFS | KNN | 0.997 | 0.998 | 0.78 | 0.724 | 0.8 |
| CASH | KNN | … | 0.998 | … | 0.75 | 0.4 |
| CSFSG | KNN | 0.997 | … | 0.964 | … | … |
| — | C4.5 | 0.997 | 0.998 | 0.971 | 0.735 | 0.4 |
| CFS | C4.5 | 0.997 | 0.998 | 0.788 | 0.75 | 0.2 |
| CASH | C4.5 | 0.997 | 0.989 | 0.967 | … | 0.4 |
| CSFSG | C4.5 | … | … | … | 0.738 | … |

ROC area

| Feature selection | Classifier | Normal | DOS | PROBE | R2L | U2R |
|---|---|---|---|---|---|---|
| — | KNN | 0.996 | 0.998 | 0.990 | 0.983 | 0.899 |
| CFS | KNN | 0.999 | 0.998 | 0.996 | 0.977 | 0.99 |
| CASH | KNN | … | 0.998 | 0.988 | 0.938 | … |
| CSFSG | KNN | 0.998 | … | … | … | 0.988 |
| — | C4.5 | 0.999 | 0.998 | 0.992 | 0.937 | 0.579 |
| CFS | C4.5 | 0.997 | 0.998 | 0.963 | 0.886 | 0.733 |
| CASH | C4.5 | 0.997 | 0.989 | … | 0.887 | 0.721 |
| CSFSG | C4.5 | … | … | 0.996 | … | … |
Although the results in the first two columns (Normal and DOS) do not show obvious differences, the results derived from cost-sensitive feature selection in the last three columns present higher values of precision, recall, and ROC area for the minority classes.
In addition, we can obtain better classification performance with CSFSG than with CASH in the given time, which is one of the most important factors in network security.
In conclusion, our cost-sensitive feature selection method does not save a considerable amount of time; however, it does facilitate the detection of the minority classes, particularly in a class-imbalanced environment.
In response to the rapid growth of big data, this study presents a novel cost-sensitive feature selection method using a chaotic genetic search for imbalanced datasets. We introduce cost-sensitive learning into the feature selection method, considering both the misclassification cost and test cost with respect to the field of network security.
It can be seen from the experimental results that cost-sensitive feature selection using chaotic genetic search efficiently reduces complexity in the feature selection stage. Meanwhile, it can improve classification accuracy and decrease classification time.
Future work will address problems with larger numbers of features. Furthermore, future research will focus on applying the proposed method to other fields, such as medicine and biology.
The authors declare that they have no competing interests.
This study is supported by the National Science Foundation for Young Scientists of China (Grant no. 61401300), the Outstanding Graduate Student Innovation Projects of Shanxi Province (no. 20123030), and the Scientific Research Project of the Shanxi Provincial Health Department (no. 201301006).