Imbalanced Data Set CSVM Classification Method Based on Cluster Boundary Sampling

This paper proposes a cluster boundary sampling method based on density clustering to address the problem of resampling in imbalanced data set (IDS) classification and verifies its effectiveness experimentally. We use a clustering density threshold and a boundary density threshold to determine the cluster boundaries, which then guide the resampling process. We then adopt a penalty factor to regulate the effect of data imbalance on the SVM classification algorithm. The aim of this paper is not to propose the best classifier or solution for imbalanced data sets but to verify the validity and stability of the proposed IDS resampling method. Experiments show that our method clearly improves classification on various imbalanced data sets.


Introduction
Imbalanced data sets (IDS) are a form of real observed data that is prevalent in computing, economics, biology, medicine, and many other fields. That is, the number of samples may differ by orders of magnitude between classes. This reflects the nature of the underlying phenomena, but in practice people tend to care mainly about the occurrence of the small categories. For example, in the detection of financial fraud, the vast majority of users are legitimate, but the goal is to predict the potential illegal users from the data [1,2]; in a company's bankruptcy risk prediction, bankrupt companies are a small minority, but enterprise managers are concerned about whether the current business situation carries a risk of bankruptcy [3]; in petroleum exploration, there may not be many petroleum-bearing areas, but these are exactly what exploration personnel should focus on [4]; in medical diagnosis, healthy people are the majority, but people care about whether the occurrence of disease can be predicted from the current data [5]; in the industrial field, fault detection is a typical IDS problem: most equipment is in normal operation during working hours, but detecting a small amount of abnormal equipment or parts in advance is very valuable [6]; in biology, the prediction of DNA sequences and protein types also faces the IDS problem [7,8]. Therefore, IDS classification is not only one of the most difficult challenges in classification research at the technical level but also has important practical significance at the application level. In recent years it has attracted a great deal of attention from researchers in various fields.
A large number of studies have shown that satisfactory classification results cannot be achieved if standard classification models are directly applied to the IDS classification problem [9]. Almost all of these methods have very low classification accuracy on rare categories and cannot raise the recognition level of the rare classes to a practically acceptable extent. Researchers therefore face a huge challenge, and the related research needs to go deeper. Theoretically, we can adopt two strategies to solve the problem of IDS classification. One is resampling, which can be divided into downsampling and upsampling. That is, we can appropriately screen the information content from samples of the large class or raise the error costs of samples of the small class [10]. The other strategy is to explore more suitable classification models and adapt the classification algorithm to the characteristics of IDS data to improve its classification ability.
In this paper, we propose to combine the data resampling and algorithm enhancing strategies into an integrated solution to the problem of IDS classification. We propose a cluster boundary sampling method for resampling the IDS, which not only effectively balances the data skew but also greatly reduces the number of support vectors. It improves the classification results and significantly increases the classification speed. This sampling method overcomes the shortcomings of traditional sampling methods, including lack of theoretical basis, strong randomness, interference of human subjectivity, and serious information loss. At the same time, it is a good solution to the aliasing phenomenon in data, which can greatly improve the generalization performance of the subsequent SVM classifier. In order to adapt to the imbalanced state of the samples, we have also improved the SVM classification model and use grid optimization with cross validation on the training data to determine the penalty factor of the SVM and the gamma value of the kernel function, which provides a more reasonable basis for the determination of the SVM penalty factor.

The IDS Classification Algorithm Based on CSVM
At present, the SVM classification algorithm is able to solve most problems characterized by relatively small data volume, fairly complete labels, and relatively homogeneous distribution [11]. But when facing the IDS, its performance drops significantly. The main reason is the imbalanced distribution of the training data, which makes the ratio of positive to negative samples among the support vectors obviously imbalanced as well. Negative sample information occupies a dominant position and thus submerges the positive sample information. Eventually the decision function makes the classification results lean excessively towards the negative samples. However, only the training samples near the interface can become support vectors, and the samples far from the interface are unlikely to affect the classification. In theory, therefore, the SVM should be less affected by data imbalance than other models. For cases with a smaller imbalance ratio, we can get good classification results even though little work has been done to improve the classification model itself, and the SVM learning mechanism leaves a lot of room to improve the classification model. Therefore, this paper still chooses the SVM as the basic algorithm model of IDS classification.

The Analysis of the IDS's Influence on SVM Classification.
In order to make the analysis more visual and intuitive, we analyze the impact of data imbalance and sampling on SVM classification using linearly separable data sets. As can be seen from the training process in Figure 1(a), due to the imbalance of the positive and negative samples, the actual classification hyperplane basically keeps the same direction as the ideal hyperplane, while it is far from the negative samples and close to the positive ones, which is the result of the data submerging phenomenon. As shown in Figure 1(b), in the test, this classification hyperplane has a strong tendency towards the negative samples, which makes some positive samples wrongly classified as negative samples.
We randomly select the same number of samples from the positive and negative samples so that the data reach an equilibrium state. Figure 2(a) shows the training results after resampling. Although the classification hyperplane obtained from learning basically keeps an ideal distance from the positive and negative samples, its direction deviates considerably from that of the ideal hyperplane, which results from information loss after sampling. As shown in Figure 2(b), such a classification hyperplane may also lead to incorrect classification in the test.
Therefore, how to minimize the information loss while lowering the imbalance ratio is an important issue if we want to use the resampling method to solve the IDS problem.
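The random balancing step discussed above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `random_undersample`, the toy coordinates, and the fixed seed are all hypothetical, and the sketch keeps every positive sample while drawing an equal number of negatives at random.

```python
import random

def random_undersample(positives, negatives, seed=0):
    """Keep all positives and draw the same number of negatives at random."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    k = min(len(positives), len(negatives))
    kept_negatives = rng.sample(negatives, k=k)
    return positives, kept_negatives

# Hypothetical toy data: 3 positive samples against 100 negative samples.
positives = [(0.9, 1.1), (1.0, 0.8), (1.2, 1.0)]
negatives = [(float(i % 10), float(i // 10)) for i in range(100)]

pos, neg = random_undersample(positives, negatives)
print(len(pos), len(neg))
```

Because the retained negatives are chosen blindly, 97 of the 100 negative samples are discarded regardless of where they lie, which is the information loss the text attributes to random resampling.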

SVM Penalty Factor Determination Based on Grid Optimization.
From the previous analysis we can draw the following conclusion. For the IDS, if we directly adopt the traditional SVM model for classification, the actual classification hyperplane basically keeps the same direction as the ideal hyperplane but deviates in distance: it lies far away from the negative samples and closer to the positive samples. This naturally raises the question of whether there is a way to pull the actual classification hyperplane towards the ideal hyperplane. The most commonly accepted and widely used method is to adjust the penalty factor. Adjusting the penalty factor $C$ is generally considered to be an effective way to improve the IDS classification effect [12].
Given a kernel function $K$ and a set of labeled samples $D_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{n}$, SVM finds the best $\alpha_i$ for each $x_i$ so as to maximize the classification interval $\gamma$ between the classification hyperplane and the sample nearest to it. When a new test sample $x$ arrives, it can be predicted by the following formula:
$$f(x) = \operatorname{sign}\Bigl(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\Bigr). \tag{1}$$
In the formula $b$ is the threshold. First-order soft margin SVM will minimize the initial Lagrange function:
$$L_P = \frac{\|w\|^2}{2} + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\bigl[y_i(w \cdot x_i + b) - 1 + \xi_i\bigr] - \sum_{i=1}^{n} r_i \xi_i. \tag{2}$$
In the formula, $\alpha_i \ge 0$, $r_i \ge 0$. Penalty factor $C$ represents a compromise between the empirical error and the classification interval. To satisfy the KKT conditions, the value of $\alpha_i$ should meet
$$0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0. \tag{3}$$
The penalty factor takes effect mainly through the restriction on $\alpha_i$ in formula (1). The value of $\alpha_i$ decides how big a role the vector $x_i$ plays in determining the classification plane. Generally speaking, the bigger its value is, the bigger its effect on classification prediction. According to the optimum conditions (KKT), the value of $\alpha_i$ and its relationship with the penalty factor $C$ are as follows:
$$\alpha_i = 0 \Rightarrow y_i(w \cdot x_i + b) \ge 1, \qquad 0 < \alpha_i < C \Rightarrow y_i(w \cdot x_i + b) = 1, \qquad \alpha_i = C \Rightarrow y_i(w \cdot x_i + b) \le 1. \tag{4}$$
Adjusting the penalty factor can improve the classification effect of SVM on IDS, but this method also has some drawbacks. First, the penalty factor is often determined and adjusted for each application by virtue of the researchers' experience, without theoretical basis or support. Second, some research shows that the improvement obtained by adjusting the penalty factor is very limited. It is not true that the bigger the penalty factor, the better the effect. Once the penalty factor exceeds a critical value, increasing it further plays the opposite role [13]. Finally, some studies also put forward the viewpoint that artificially changing the penalty factor is contrary to some basic principles of the kernel method and lacks theoretical support.
In this paper, we use grid optimization with cross validation on the training data to determine more reasonably the penalty factor $C$ of the SVM and the gamma parameter $g$ of the RBF kernel function. We cannot solve all the problems of the penalty factor. We only seek a reasonable basis and feasible technical means for determining the SVM penalty factor under IDS conditions, so as to avoid heavy reliance on human experience in practice. We divide the positive and the negative samples of the training data into 10 parts each; in every round, 9 parts are used as the hypothetical training data and the remaining part is used as the hypothetical test data. We estimate the ranges of $C$ and $g$ and use the pairs $(C, g)$ to set up the grid space. We then search the grid until we obtain relatively ideal classification results, and use the corresponding penalty factor $C$ and value of $g$ to train the SVM classification model.
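As an illustration of the prediction rule, the standard SVM decision function takes the sign of a kernel expansion over the training samples plus a threshold. The sketch below assumes an RBF kernel and uses made-up support vectors, multipliers, and threshold; none of these values come from the paper, and the function names are hypothetical.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel exp(-gamma * ||u - v||^2); the gamma value is an assumed example."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def decision_value(x, vectors, labels, alphas, b, kernel=rbf_kernel):
    """Kernel expansion sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * kernel(v, x) for v, y, a in zip(vectors, labels, alphas)) + b

def predict(x, vectors, labels, alphas, b):
    """Return +1 or -1 according to the sign of the decision value."""
    return 1 if decision_value(x, vectors, labels, alphas, b) >= 0 else -1

# Hypothetical toy model: one negative and one positive support vector.
# The multipliers respect 0 <= alpha_i <= C (with C = 1) and sum_i alpha_i * y_i = 0.
vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]
b = 0.0

near_positive = predict((1.9, 2.1), vectors, labels, alphas, b)
near_negative = predict((0.1, 0.0), vectors, labels, alphas, b)
print(near_positive, near_negative)
```

Since each multiplier is capped by the penalty factor, raising or lowering that cap changes how much weight an individual sample can carry in this sum, which is how the penalty factor influences the position of the hyperplane.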
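The grid optimization procedure can be sketched as follows. This is a minimal sketch under several assumptions: the exponential grid ranges are a common convention rather than the values used in the paper, `score_fn` is a hypothetical callable that would train an SVM on the training indices and return a score such as AUC on the held-out indices, and `toy_score` is a stand-in used only so that the example runs without an SVM library.

```python
import itertools
import math

def stratified_folds(labels, n_folds=10):
    """Split sample indices into n_folds parts, dividing each class separately."""
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        class_indices = [i for i, y in enumerate(labels) if y == cls]
        for position, i in enumerate(class_indices):
            folds[position % n_folds].append(i)
    return folds

def grid_search(labels, c_values, g_values, score_fn, n_folds=10):
    """Return (mean score, C, g) for the best grid point under cross validation."""
    folds = stratified_folds(labels, n_folds)
    best = None
    for c, g in itertools.product(c_values, g_values):
        scores = []
        for k in range(n_folds):
            test_idx = folds[k]
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            scores.append(score_fn(train_idx, test_idx, c, g))
        mean_score = sum(scores) / len(scores)
        if best is None or mean_score > best[0]:
            best = (mean_score, c, g)
    return best

# Hypothetical imbalanced labels: 20 positive and 200 negative training samples.
labels = [1] * 20 + [0] * 200
# Assumed exponential grids (not taken from the paper).
c_grid = [2.0 ** e for e in range(-5, 16, 2)]
g_grid = [2.0 ** e for e in range(-15, 4, 2)]

def toy_score(train_idx, test_idx, c, g):
    """Stand-in for 'train an SVM and score it'; peaks at C = 2, g = 0.125."""
    return -(math.log2(c) - 1) ** 2 - (math.log2(g) + 3) ** 2

best_score, best_c, best_g = grid_search(labels, c_grid, g_grid, toy_score)
print(best_c, best_g)
```

Dividing each class into ten parts separately keeps the positive-to-negative ratio the same in every fold, which matters when the positive class is small enough that a plain random split could leave a fold with no positive samples.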

Imbalanced Data Cluster Boundary Sampling Method Based on Density Clustering
For imbalanced data, resampling is one of the important solutions to the problem of IDS classification, and clustering-based sampling has been widely adopted in recent years with good results [14]. According to the previous analysis, the basic criterion of imbalanced data sampling is to minimize the loss of information while reducing the imbalance ratio. Reducing the number of negative samples while retaining as much information as possible involves an inherent tension. At present, clustering is one of the main techniques used for downsampling the IDS [15]. Most approaches delete the samples that carry little information after clustering, so as to reduce the imbalance ratio while retaining most of the information. In studying the IDS problem we have analyzed this sampling approach and think that removing cluster samples wholesale is not necessarily the optimal strategy. We believe that the information carried by the samples is not evenly distributed and that there should be core information which represents the essential state of the data and is the key to affecting the classification. Thus we put forward a hypothesis. In the cluster boundary samples, both most of the similarity information within the cluster and the difference information among the clusters should be preserved. We assume that these special samples (i.e., cluster boundary samples), which include both the similarity information within the cluster and the difference information among the clusters, carry the core information we are looking for. Based on this hypothesis, this paper proposes a cluster boundary sampling method based on density clustering to solve the problem of resampling in IDS classification.

Clustering Method Based on Density.
As one of the widely used families of clustering algorithms, density based clustering is not restricted by data attributes, dimension, arrangement order, or the shape of the spatial distribution, and it automatically identifies the number of clusters with a strong ability to resist interference [16]. A density based clustering algorithm regards a cluster as a high density region of objects that is separated from other such regions by low density regions in the data space; it can find clusters of any shape and can identify noise data.
The main idea of density based clustering is to select an object as a core object and query the core object's neighborhood. Once the density of the neighborhood exceeds a certain threshold, any object in it other than the previous core can be selected as a new core object to continue clustering in its own neighborhood. Eventually the relatively high density regions are separated from the relatively low density regions and form clusters.
Assume that a data object is described by $p$ attributes (also called metrics or variables), and several data objects with $p$ attributes constitute the $p$-dimensional data space. In $p$-dimensional space, data objects are called $p$-dimensional data points, and a $p$-dimensional data point $v$ can be expressed as $v = (v_1, \ldots, v_p)$, in which $v_i$ represents the value of the $i$th attribute and $p$ represents the dimension of the space (dimensionality). Consider a collection $S$ of $n$ $p$-dimensional data points (also known as a $p$-dimensional data set). $S$ could be expressed as $S = (s_1, \ldots, s_n)$, in which $s_i = (s_{i1}, \ldots, s_{ip})$ and $s_{ij}$ represents the $j$th attribute value of the $i$th data point. According to the similarity between the data points, we divide the data set $S$ into $\{C_1, C_2, \ldots, C_k\}$. This process is called clustering analysis, in which $k \le n$, $C_i \ne \emptyset$ ($i = 1, 2, \ldots, k$), and $\bigcup_{i=1}^{k} C_i = S$. Here the $C_i$ are generally called clusters.
Data matrix, also called object-variable structure: use $p$ variables (also called metrics or attributes) to express $n$ objects. For example, we could use age, height, sex, weight, and other attributes to express people. This data structure takes the form of a relational table or an $n \times p$ ($n$ objects $\times$ $p$ variables) matrix, as shown in the following:
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}. \tag{5}$$
Dissimilarity matrix, also called object-object structure: it stores the proximity of all pairs of the $n$ objects. It is often shown as an $n \times n$ matrix, in which $d(i, j)$ is the measured difference or dissimilarity between the objects $i$ and $j$. In general, $d(i, j)$ is a nonnegative number. The more similar or closer the object $i$ is to $j$, the closer its value is to 0; the more different the two objects are, the bigger the value is. As shown in formula (6), $d(i, j) = d(j, i)$ and $d(i, i) = 0$:
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}. \tag{6}$$
Similarity is usually defined by the distance between the data points. The shorter the distance is, the greater the similarity will be; the longer the distance is, the smaller the similarity will be. In an ideal situation, the distance $d_{ij}$ between data points $V_i$ and $V_j$ must meet the following conditions: (1) $d_{ij} \ge 0$ (nonnegativity); (2) $d_{ij} = 0$ if $V_i = V_j$; (3) $d_{ij} = d_{ji}$ (symmetry). The value of $d_{ij}$ should meet the above conditions and range within $[0, \infty)$. The smaller the value of $d_{ij}$ is, the greater the similarity between $V_i$ and $V_j$ is. On the contrary, the bigger the value of $d_{ij}$ is, the smaller the similarity is.
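The relation between the two structures can be illustrated by building a dissimilarity matrix from a small data matrix. This sketch assumes the Euclidean distance as the dissimilarity measure; the function names and the three sample points are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dissimilarity_matrix(data_matrix):
    """Full n x n matrix of pairwise distances between the rows of an n x p data matrix."""
    n = len(data_matrix)
    return [[euclidean(data_matrix[i], data_matrix[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical data matrix: n = 3 objects described by p = 2 variables.
data_matrix = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d = dissimilarity_matrix(data_matrix)

for row in d:
    print(row)
```

The resulting matrix has a zero diagonal and is symmetric, so storing only its lower triangle loses no information.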

Cluster Boundary Determination Method Based on Neighborhood.
The data elements obtained in the same cluster by density clustering are relatively densely distributed in the vector space and are highly similar in content. We believe that the data elements of the cluster boundary can effectively represent the characteristics of the data objects in the whole cluster. Each element in the data space corresponds to a point in $p$-dimensional space. To be more precise, an arbitrary data element $x$ can be expressed as the following feature vector, and the standard Euclidean distance is taken as the distance between two vectors:
$$x = \langle a_1(x), a_2(x), \ldots, a_p(x) \rangle, \tag{7}$$
where $a_r(x)$ represents the $r$th attribute of the instance $x$. Then the Euclidean distance between two instances $x_i$ and $x_j$ is
$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{p} \bigl(a_r(x_i) - a_r(x_j)\bigr)^2}. \tag{8}$$
In the data set $D$, the neighborhood of an instance $x$ can be defined as
$$\mathrm{EPS}(x) = \{\, y \in D \mid d(x, y) \le \mathrm{EPS} \,\}. \tag{9}$$
The method determines the boundary points of a cluster based on this definition of the neighborhood. Within a cluster, the more elements of the cluster an element's neighborhood contains, the closer this element is to the center of the cluster; the fewer such elements the neighborhood contains, the farther this element is from the center of the cluster. We use $|\mathrm{EPS}(x)|$ to represent the number of data elements in the neighborhood of $x$. In order to find the boundary of the cluster more accurately, we choose two groups of density thresholds. One group is called the clustering density threshold, which is based on the characteristics and the average distance of the whole data set. It is used to divide the whole data set into several clusters. The other is called the boundary density threshold, which is estimated from the scale of each cluster. It is used to find the boundary data objects. We use the clustering density thresholds $\mathrm{EPS}_1$ and $\mathrm{MINP}_1$ of the first group to find similar data elements in the data set and then divide these data elements into several clusters $C$. For each cluster $C_i$, we use the boundary density thresholds $\mathrm{EPS}_{C_i}$ and $\mathrm{MINP}_{C_i}$ of the second group to find the cluster boundary ring. The boundary density thresholds are determined from the scale of the cluster $C_i$. In this paper, we use $D$ to represent the whole training data set, $C_i$ to represent the $i$th cluster in $D$, and $B_i$ to represent the boundary ring of cluster $C_i$, so that $B_i \subseteq C_i \subseteq D$. Details of the implementation of the algorithm are as follows:
(1) Traverse the data elements in $D$ and calculate the distance between the elements in $D$.
(3) Use the density threshold of the first group to cluster $D$.
(4) Mark the elements in $D$ which belong to a cluster $C_i$ or to the noise $D_{\text{noise}}$.
(5) Calculate the number $N_{C_i}$ of data elements in a cluster $C_i$.
(6) Estimate the density threshold $\mathrm{MINP}_{C_i}$ of cluster $C_i$ according to $N_{C_i}$.
(7) For each data element, count the elements of the same cluster that lie in its neighborhood.
(8) Extract the boundary elements $B_i$ from cluster $C_i$ according to the density threshold $\mathrm{MINP}_{C_i}$ of the second group.
(9) Repeat from the fourth step until all the clusters containing nonnoise elements have been traversed.
(10) Get all $B_i$.
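The steps listed above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: a DBSCAN-style expansion is used as the concrete density clustering, a single boundary radius `eps_c` is shared by all clusters for simplicity, and `estimate_minp` is a hypothetical callable standing in for the rule that derives the boundary threshold from the cluster size, which this section does not spell out. Neighbor counts exclude the point itself.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def neighbors(points, i, eps, candidates):
    """Indices in `candidates` (other than i) within distance eps of point i."""
    return [j for j in candidates
            if j != i and euclidean(points[i], points[j]) <= eps]

def density_cluster(points, eps, minp):
    """DBSCAN-style clustering; returns one label per point, -1 meaning noise."""
    n = len(points)
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(points, i, eps, range(n))
        if len(seeds) < minp:
            labels[i] = -1  # provisional noise; may later join a cluster as a border point
            continue
        labels[i] = cluster_id
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # former noise reached from a core point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reach = neighbors(points, j, eps, range(n))
            if len(reach) >= minp:  # j is itself a core point: keep expanding
                queue.extend(reach)
        cluster_id += 1
    return labels

def boundary_ring(points, members, eps_c, minp_c):
    """Cluster members whose in-cluster neighborhood holds fewer than minp_c elements."""
    return [i for i in members
            if len(neighbors(points, i, eps_c, members)) < minp_c]

def cluster_boundary_sample(points, eps1, minp1, eps_c, estimate_minp):
    """Cluster with the first threshold group, then extract each cluster's boundary."""
    labels = density_cluster(points, eps1, minp1)
    boundary = []
    for cid in sorted(set(labels) - {-1}):
        members = [i for i, lab in enumerate(labels) if lab == cid]
        minp_c = estimate_minp(len(members))  # hypothetical size-based rule
        boundary.extend(boundary_ring(points, members, eps_c, minp_c))
    return labels, boundary

# Hypothetical negative samples: a 7 x 7 unit grid plus one isolated outlier.
points = [(float(x), float(y)) for x in range(7) for y in range(7)] + [(20.0, 20.0)]
labels, boundary = cluster_boundary_sample(
    points, eps1=1.5, minp1=3, eps_c=1.5,
    estimate_minp=lambda size: 8,  # assumed constant, for this toy grid only
)
print(labels.count(-1), len(boundary))
```

On this toy grid the interior points each have eight in-cluster neighbors, so only the outer ring falls below the assumed threshold and is kept, while the isolated point is marked as noise.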
The positive and negative samples in an IDS are distributed in an imbalanced manner. Data with a high imbalance ratio have a centralized distribution, and the gap between the numbers of positive and negative samples is always huge. So when extracting the cluster boundary ring of the imbalanced data, we need to ensure that the information of the minority positive samples is as complete as possible, while the information of the majority negative samples is as representative as possible. Here we retain all the positive samples and cluster only the negative samples to extract their cluster boundaries. Finally, all positive samples together with the boundary samples of all negative sample clusters are used as the learning data of the SVM classifier.

Experimental Verification and Analysis
The IDS has two internal factors: the imbalance ratio and the lack of information. The imbalance ratio refers to the ratio between the large and small categories, which represents the degree of data imbalance. The lack of information refers to the amount of data in the samples of the small class, which represents how much information about the small class the data set contains. In order to verify the performance of the method in this paper, we select 4 open data sets from the UCI public data platform as the experimental data (Table 1 lists the basic information of the 4 data sets), which represent the four possible situations of imbalanced data. These data sets reflect the characteristics of IDS from several aspects and can therefore be used to verify the validity and feasibility of the proposed method.

Evaluation Metrics of Imbalanced Data Set Classification.
IDS classification is typically a two-class problem. When doing the classification work, we usually define the smaller category with fewer data samples as the positive samples and the larger category with more data samples as the negative samples. We can describe the results of classification as four situations: positive samples correctly classified as positive (TP), negative samples wrongly classified as positive (FP), positive samples wrongly classified as negative (FN), and negative samples correctly classified as negative (TN). The total number of positive samples is $P = \mathrm{TP} + \mathrm{FN}$, while the total number of negative samples is $N = \mathrm{TN} + \mathrm{FP}$. Based on these values, we can get the traditional classification evaluation standards, accuracy, precision, recall, and the $F$ value, as follows:
$$\text{accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{P + N}, \qquad \text{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \text{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$
When it comes to the problem of IDS classification, these commonly used evaluation indicators are dominated by the majority class, so they no longer perform well and can even fail (for example, for a data set with a positive-to-negative ratio of 1 : 100, simply treating all test data as negative samples yields an accuracy of about 99%). Some studies have also used the accuracy and recall rate as the main evaluation metrics. But the blind pursuit of classification accuracy on positive samples leads to bad classification results on negative samples, which is not what we would like to see. Thus these traditional evaluation metrics cannot show the performance of IDS classification in a scientific, reasonable, and accurate way.
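The failure of accuracy on imbalanced data can be checked with a small calculation. The sketch below computes the four traditional metrics from the confusion counts, taking the balanced F-measure as the assumed form of the F value, and applies them to a hypothetical 1 : 100 data set on which every test sample is labeled negative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and balanced F-measure from confusion counts."""
    p = tp + fn  # total positive samples
    n = tn + fp  # total negative samples
    accuracy = (tp + tn) / (p + n)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / p if p else 0.0
    denominator = precision + recall
    f_value = 2 * precision * recall / denominator if denominator else 0.0
    return accuracy, precision, recall, f_value

# Hypothetical test set: 1 positive and 100 negatives, all predicted negative.
accuracy, precision, recall, f_value = classification_metrics(tp=0, fp=0, fn=1, tn=100)
print(round(accuracy, 4), precision, recall, f_value)
```

The accuracy is 100/101 even though not a single positive sample is recognized, while recall and the F value are both zero, which is why accuracy alone is misleading here.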
In recent years, the latest research shows that using ROC curve and AUC to evaluate the effect of IDS classification has obvious advantages since the ROC curve and AUC are not affected by the imbalanced distribution of data types.This means that when the number of positive and negative samples of the test data is changed, ROC curve and AUC will not change accordingly, which could evaluate the classification effect in a more scientific and intuitive way [17].
The ROC curve is a two-dimensional curve, in which the horizontal coordinate represents FPR (false positive rate) and the vertical coordinate represents TPR (true positive rate). The more test data we have, the smoother the ROC curve will be. In ROC space, the curve $A$ is better than $B$ if $A$ is always above $B$, which means that, for all possible misclassification costs and class distributions, the expected cost of the classifier corresponding to $A$ is always lower than that of $B$.
Although the ROC curve shows intuitively whether the classification results are good or bad, in practical applications a numerical measure of the classification results is still desirable. As shown in Figure 3, if the two ROC curves $A$ and $B$ intersect, we can only say that $A$ is better than $B$ when FPR is less than 0.23, and $B$ is better than $A$ when FPR is bigger than 0.23. If we only use the ROC curve, it is difficult to say whether $A$ or $B$ has the better classification effect, let alone how big the difference between $A$ and $B$ is. To solve this problem, we can calculate the area under the ROC curve (the value of AUC), which presents the quality of the classification in a more intuitive and clear way.
The experimental results of Figure 4 clearly show the advantages of this method compared to the traditional SVM classification algorithm. In order to compare further at the quantitative level, this paper calculates the values of AUC on the four data sets with the four methods, as shown in Table 2.
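For reference, the ROC points and the area under the ROC curve can be computed from classifier scores as sketched below. This is an illustration under assumptions: the scores and labels are made up, the AUC is obtained with the trapezoidal rule, and the function names are hypothetical. The second half replicates the negative samples to illustrate that the AUC does not depend on the class ratio of the test data.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by lowering the decision threshold over the scores."""
    p = sum(1 for y in labels if y == 1)
    n = len(labels) - p
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for k, (score, y) in enumerate(ranked):
        if y == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if k + 1 == len(ranked) or ranked[k + 1][0] != score:
            points.append((fp / n, tp / p))
    return points

def auc(points):
    """Area under a piecewise linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores: higher means "more likely positive"; label 1 is positive.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
auc_small = auc(roc_points(scores, labels))

# Same classifier behaviour with ten times as many negative samples.
negative_scores = [s for s, y in zip(scores, labels) if y == 0]
scores_big = scores + negative_scores * 9
labels_big = labels + [0] * (len(negative_scores) * 9)
auc_big = auc(roc_points(scores_big, labels_big))

print(round(auc_small, 4), round(auc_big, 4))
```

In this toy case 8 of the 9 positive-negative pairs are ranked correctly, so both areas equal 8/9, with or without the replicated negatives.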
According to the comparison of the experimental results, we can get some very meaningful observations.
(1) For the imbalanced data with high imbalanced ratio, if we directly use
(2) For the SVM classification of imbalanced data, a reasonable adjustment of the penalty factor can effectively improve the classification effect, and the higher the data imbalance ratio is, the more obvious the effect is. This conclusion is easy to reach by comparing the classification results of methods 1 and 2 on the four data sets, which further shows that adjusting the penalty factor is a practical way to improve the classification effect.
(3) Traditional SBC is not stable. The traditional SBC assumes that the elements in each cluster are highly similar to each other and carry the same amount of information. But the experiments show that SBC clustering cannot guarantee good results every time. Sometimes, the classification effect is reduced due to serious loss of information. This shows that the amount of information carried by the elements of a cluster is unequal and that some elements carry more of the core information which is good for classification.
(4) The cluster boundary sampling method has good stability and is an effective way to solve imbalanced data classification. By comparing the experimental results of methods 2, 3, and 4 on the four data sets, we can see that cluster boundary sampling has a stable effect on improving IDS classification in different situations. The improvement is more obvious in the case of seriously imbalanced data, which shows that this method is a feasible technique in practical applications.
(5) Samples on the cluster boundary should carry more information. Comparing methods 3 and 4, there is only a slight difference in the number of removed samples. That is to say, the two methods have almost the same ability to reduce the imbalance ratio, but method 4 is clearly more stable and effective. Thus we arrive at a hypothesis of great significance. That is, the information carried by the samples is unevenly distributed. There should be some kind of core information which is more helpful to classification, and the cluster boundary samples should be the special samples which carry this core information. We will further study and explore the theoretical basis of this hypothesis.

Conclusion
In order to address the IDS classification problem, this paper proposes a CSVM classification method based on cluster boundary sampling, which provides an integrated solution to the problem of IDS classification. There are two main contributions of this method. One is the use of grid optimization with cross validation on the training data to determine the penalty factor of the SVM and the kernel gamma value. This method does not solve all the drawbacks of adjusting the penalty factor, but it provides a more reasonable basis for determining the SVM penalty factor. At the technical level, it also provides a practical means of realization. The other is a cluster boundary sampling method based on density clustering. We resample the IDS, which not only effectively balances the data skew but also greatly reduces the number of support vectors. It improves the classification effect while significantly increasing the classification speed. This sampling method overcomes the shortcomings of traditional sampling methods, including the lack of theoretical basis, strong randomness, human subjective interference, and serious information loss. At the same time, it is a good solution to the aliasing phenomenon in data, which can improve the generalization of the subsequent SVM classifier. On representative public data sets, the proposed method is shown to be very stable in improving the classification of IDS. The improvement is especially obvious in the case of a high imbalance ratio.
In future work, the research on the classification of IDS should be further explored in the following aspects. First, resampling is still a major method for solving the imbalance problem, and we need to explore more effective ways to reduce the deviation of the data as much as possible while minimizing the information loss. Second, we need to further study the information content and explore whether the hypothetical core information exists. At the same time, a further study of the theoretical basis for the validity of cluster boundary sampling is also necessary. Finally, we need to study special kernel functions adapted to the classification of imbalanced data so that the classification algorithm can accommodate the data imbalance in theory, which could resolve the problems of regulating the penalty factor.

Figure 1: The phenomena of data flow.

Figure 4: ROC curve of four UCI data sets.

Table 1: The basic information of four UCI data sets.

Comparative Experiment and Analysis.
In this paper, four methods are used to carry out experiments on four different data sets to verify the role of the proposed two strategies in classification. In the experiment, 50% of each data set is used for training, and the remaining 50% is used for testing. This paper also uses the method of rotation test ensures that

Table 2: The AUC value of three methods in different data sets.

Shuttle and Abalone with high imbalance ratio, which are almost invalid, while in data sets Yeast and Churn with low imbalance ratio, its performance is accepted. This experimental result validates the existing conclusion. In case of serious IDS, using the SVM method directly has no good effect, but it has good adaptability if the data is not seriously imbalanced. At the same time, it is proved that the data set selected in this paper is representative and scientific.