A Novel Ensemble Method for Imbalanced Data Learning: Bagging of Extrapolation-SMOTE SVM

Class imbalance is ubiquitous in real-world data and has attracted interest from many domains. Learning directly from an imbalanced dataset may yield unsatisfactory results, because the learner overfocuses on overall accuracy and derives a suboptimal model. Various methodologies have been developed to tackle this problem, including sampling, cost-sensitive, and hybrid ones. However, the samples near the decision boundary, which contain more discriminative information, should be valued, and the skew of the boundary can be corrected by constructing synthetic samples. Inspired by this observation and by geometric intuition, we designed a new synthetic minority oversampling technique that incorporates borderline information. Moreover, ensemble models tend to capture more complicated and robust decision boundaries in practice. Taking these factors into consideration, we propose a novel ensemble method, called Bagging of Extrapolation Borderline-SMOTE SVM (BEBS), for imbalanced data learning (IDL) problems. Experiments on open access datasets showed significantly superior performance of our model, and an intuitive explanation of the method is given. As far as we know, this is the first model combining an ensemble of SVMs with borderline information for this problem.


Introduction
In machine learning, data is essential for training a model. However, the distribution of the two classes is often extremely imbalanced, and this circumstance is ubiquitous in real life [1]. In this paper, we concentrate on binary classification problems in which the two classes are extremely imbalanced; that is, one class severely outnumbers the other in the training data.
When the class imbalance is neglected, traditional algorithms for binary classification tend to perform badly [2,3], leading to an unsatisfactory, suboptimal result [4][5][6][7]: the majority class is well identified while the minority class is not. One reason is that the imbalanced distribution, acting as prior information, in many instances has a strong impact on the final discrimination [8]. Consider a special scenario in which the majority class accounts for 99% of the data. In such a case, a trivial classifier that assigns every example to the majority class still achieves an accuracy of 99% [9]. However, because of its low recall on the minority class, such an extreme result is not what we desire. The phenomenon is crucial and nontrivial in several circumstances, such as identification of network intrusion [10,11], medical diagnosis of type 2 diabetes [12], oil spill detection from satellite radar images [13], detection of fraudulent financial transactions [14], and bioinformatics [15]. Another fact that cannot be neglected is that in most binary classification problems the minority class, rather than the majority, is what we really care about [3], especially when failing to recognize minority examples is costly.
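The 99% scenario can be checked with a few lines of Python. This is only an illustrative sketch: the label counts (990 majority, 10 minority) are hypothetical values chosen to match the 99% proportion mentioned in the text.

```python
# Hypothetical test set: 990 majority examples (label 0) and 10 minority examples (label 1).
y_true = [0] * 990 + [1] * 10

# A trivial classifier that assigns every example to the majority class.
y_pred = [0] * len(y_true)

# Overall accuracy: fraction of examples whose prediction matches the label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall of the minority class: fraction of true minority examples that are found.
minority_hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
minority_recall = minority_hits / sum(1 for t in y_true if t == 1)

print(f"accuracy = {accuracy:.2f}, minority recall = {minority_recall:.2f}")
# prints: accuracy = 0.99, minority recall = 0.00
```

The accuracy looks excellent even though no minority example is ever recognized, which is why accuracy alone is a poor criterion here.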
Computational Intelligence and Neuroscience
A number of algorithms have been designed to relieve the consequences of imbalanced data. From the perspective of strategy, these methods can be categorized into three mainstream types. At the algorithm level, adjusting the weights of errors in the loss function, also called cost-sensitive learning, is a direct way to reduce the impact of imbalance. The cost matrix, which assigns different penalties to different misclassifications, is critical for improving performance. Another way to tune the penalties is rooted in adaptive boosting [16], and some algorithms, for example, Ada-cost [17] and cost-sensitive boosting [18], were implemented for the learning task. At the sampling level, the easiest way is to randomly sample the training data so that the classes are drawn in appropriate ratios to balance their proportions. Random sampling tends to overfit if the sampling ratios are not properly modulated [3]. Repeated sampling is easy to implement but hard to adjust efficiently, so Provost [9] considered undersampling proper for larger training datasets, while synthetic samples are constructed when fewer samples are available. Estabrooks et al. [2] combined different repeated sampling methods to offer a better scheme of modulation. Another popular method is the synthetic minority oversampling technique (SMOTE) proposed by Chawla et al. [19], whose core idea is to construct synthetic minority samples through interpolation between a minority training sample and its $k$-nearest neighbors. Han et al. [3] paid more attention to the samples near the decision borderline and combined this idea with SMOTE to obtain Borderline-SMOTE. Apart from the above sampling methods, a clustering-based resampling algorithm [20] and the SMOTE-Boost algorithm [21] were also designed for imbalanced cases.
Furthermore, few researchers have focused on how sampling methodologies affect learning performance, although Xue and Hall [6] explained why sampling methods boost linear discriminative performance.
In addition to the algorithm- and sampling-based methods, other researchers proposed several practical and popular hybrid methods with excellent performance. With the help of SMOTE on the boundary samples and an adjustment of the kernel matrix, Wu and Chang [22] integrated prior information about the imbalanced distribution with SVM to obtain the kernel boundary alignment algorithm. Chawla et al. [4] noticed the great influence of feature selection, and Maldonado et al. [23] studied various ways of feature selection and put forward a backward elimination feature selection process for SVMs dealing with the IDL problem. Two-stage cost-sensitive learning was applied to NASA imbalanced data, with cost-sensitive rules designed for both the feature selection and the classification stages [24]. Rather than using the former over- or undersampling methods, Sun et al. [25] utilized random partition and clustering to obtain several balanced datasets for training various classifiers and combined them according to some rules. Bhowan et al. used genetic programming to construct several types of fitness functions and exploited multiobjective optimization to combine the classifiers [5]. Easy ensemble and balance cascade [26] are two superior algorithms that use ensemble models learned in an undersampling way.
Quite different from the above-mentioned methods, we propose a novel ensemble algorithm in which an efficient sampling method is developed for IDL problems. Related works are summarized as follows. In the preprocessing stage, Batuwita and Palade [27] selected informative examples close to the class boundary captured by SVM and reduced the resampling of other samples, lowering the time complexity of training the SVM while maintaining performance. For binary imbalanced data, Wu and Chang implemented boundary alignment using kernel tricks to relieve the offset of the decision boundary [28]. From the perspective of ensembles, there exist several elaborate reviews of ensemble models for IDL [8,29]. Specifically, López et al. [8] studied how six significant problems relating to intrinsic data characteristics affect the performance of ensemble models for IDL. Several superior ensemble models are based on boosting, such as EUS-Boost [30] and an evolutionary undersampling boosting model for breast cancer malignancy classification [31]. Shi et al. applied the bagging technique to SVM to cope with the P300 detection problem [32]. Our proposed framework also makes use of the samples near the decision boundary detected by the SVM, but with a more flexible way of sampling. We apply the bagging technique to meta-SVMs trained with data obtained from the sampling process. The final results show our model's effectiveness in dealing with the IDL problem.

Materials and Methods
Before introducing our framework, we briefly summarize some basic knowledge about the models and techniques involved. In our framework, SVMs work as metaclassifiers for the ensemble, and bootstrap aggregation is the sampling technique used to obtain various training datasets. Besides, we describe SMOTE in detail and introduce our adaptive SMOTE technique. A flow chart of our framework is given in Figure 1.

Support Vector Machine: A Review.
The support vector machine (SVM), one of the most popular classifiers for binary classification, has shown state-of-the-art performance in engineering applications.
Given the labeled training set $D = \{(\mathbf{x}_i, y_i) \mid y_i = 1 \text{ or } -1,\ i = 1, \dots, N\}$, a naïve idea for learning a classifier is to find a hyperplane in the feature space, or a transformed feature space, of the input $\mathbf{x}$ that separates the two classes of training data as well as possible. Based on statistical learning theory [33], SVM is recognized as a robust adaptation of the perceptron learning model [34].
The feature transformation trick and soft-margin relaxation make SVM powerful at detecting complex decision boundaries, and they control overfitting by allowing some samples to violate the support hyperplanes.
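Here, we give an expression of SVM in the form of a quadratic programming problem as follows:

$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i\left(\mathbf{w}^{T}\Phi\,\mathbf{e}_i + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \dots, N,$$

where $\Phi = (\phi(\mathbf{x}_1), \dots, \phi(\mathbf{x}_N))$ is the corresponding matrix of transformed feature vectors, $\phi(\cdot)$ is the feature transformation, $C$ is the penalty parameter, and $\mathbf{e}_i$ is the $N$-dimensional unit vector whose $i$th coordinate equals 1. The slack vector $\boldsymbol{\xi} = (\xi_1, \xi_2, \dots, \xi_N)$ measures the extent to which samples violate the support hyperplanes. The primal problem can be converted to its dual by solving the Karush-Kuhn-Tucker optimality conditions derived from the Lagrangian [35].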
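Finally, the discriminative function is obtained in the form

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i^{*}\, y_i\, \phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}) + b^{*}\right),$$

where $\phi(\cdot)$ denotes the feature transformation and $\alpha_i^{*}$ is the Lagrange multiplier corresponding to the sample $\mathbf{x}_i$, satisfying the Karush-Kuhn-Tucker (KKT) optimality conditions. Specifically, when the inner product of the transformed feature vectors can be computed with a kernel function, the computation is efficient:

$$K(\mathbf{x}_i, \mathbf{x}) = \phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}).$$

Furthermore, only a small proportion of the training data, namely the samples with positive Lagrange multipliers, called support vectors, are useful for the final decision, so the representation of the classifier is rather sparse.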
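The sparse, kernel-based decision rule described above can be sketched as follows. This is a minimal illustration, not a trained model: the support vectors, multipliers, bias, and the choice of an RBF kernel with `gamma=0.5` are all hypothetical values picked for the example, and the function names are ours.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel K(u, v) = exp(-gamma * ||u - v||^2); gamma is an assumed value."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def decision_value(x, support_vectors, labels, alphas, bias, kernel=rbf_kernel):
    """Kernel-weighted sum over the support vectors plus the bias term."""
    total = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return total + bias

def predict(x, support_vectors, labels, alphas, bias):
    """Sign of the decision value, mapped to the labels +1 / -1."""
    return 1 if decision_value(x, support_vectors, labels, alphas, bias) >= 0 else -1

# Hypothetical support vectors and multipliers (not obtained from real training).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]
bias = 0.0

print(predict((1.8, 1.9), support_vectors, labels, alphas, bias))  # prints 1
print(predict((0.1, 0.2), support_vectors, labels, alphas, bias))  # prints -1
```

Only the samples with positive multipliers appear in the sum, so prediction cost grows with the number of support vectors rather than with the size of the training set.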

Ensemble Methods: A Review.
Apparently, a single classifier may be severely affected when the training dataset cannot well characterize the actual underlying distribution or when the presumed model is biased. An ensemble of models can avoid the one-sidedness originating from the training dataset and the hypothesis, and it achieves better generalization. In addition, weaker classifiers are easier to obtain using simple criteria such as stumps, and a strong classifier can be derived by combining multiple weaker classifiers [36]. In our proposed framework for IDL, the bagging technique is employed to develop various models. Bootstrap aggregation, abbreviated as bagging, constructs different classifiers based on the bootstrapping method. Bootstrapping samples each training example with the same probability and with replacement.
The general bagging algorithm can be described as shown in Algorithm 1.

(1) Input the whole training dataset $D$ with $|D| = N$
(2) For $k$ from 1 to $K$:
(3) Sample from $D$ by bootstrapping to obtain $\tilde{D}_k$ with $|\tilde{D}_k| = \tilde{N}$
(4) Derive the model $f_k(\mathbf{x})$ by fitting $\tilde{D}_k$
(5) Ensemble the models $\{f_k(\mathbf{x}) \mid k = 1, 2, \dots, K\}$ and obtain the final model $F(\mathbf{x})$. In binary classification cases, $F(\mathbf{x}) = \operatorname{sign}\left(\sum_{k=1}^{K} f_k(\mathbf{x})\right)$
Algorithm 1: Bagging algorithm.
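The bagging procedure can be sketched in plain Python. This is a minimal sketch under stated assumptions: the base learner is a one-dimensional decision stump rather than an SVM, the toy dataset is synthetic, and the names `fit_stump`, `bagging_fit`, and `bagging_predict` are ours, not the paper's.

```python
import random

def fit_stump(sample):
    """Fit a 1-D threshold classifier that predicts +1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum(1 for x, y in sample if (1 if x > threshold else -1) != y)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(data, n_models, sample_size, rng):
    """Train n_models stumps, each on its own bootstrap sample of the data."""
    models = []
    for _ in range(n_models):
        # Bootstrapping: equal probability for every example, with replacement.
        bootstrap = rng.choices(data, k=sample_size)
        models.append(fit_stump(bootstrap))
    return models

def bagging_predict(models, x):
    """Equal-weight vote: sign of the sum of the individual predictions."""
    votes = sum(1 if x > threshold else -1 for threshold in models)
    return 1 if votes >= 0 else -1

# Synthetic toy data: x in [0, 5) has label -1, x in [5, 10) has label +1.
data = [(i / 10.0, -1) for i in range(50)] + [(i / 10.0, 1) for i in range(50, 100)]

rng = random.Random(0)
models = bagging_fit(data, n_models=25, sample_size=len(data), rng=rng)
print(bagging_predict(models, 1.0), bagging_predict(models, 9.0))
```

An odd number of models is used here so that the vote cannot tie; with SVMs as base learners the same loop applies with `fit_stump` replaced by an SVM training call.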
One of the most famous models motivated by bagging is random forest, in which not only is the training data sampled by bootstrapping, but the features used for training are also selected at random [37]. Table 1 provides a detailed process of sampling methods.

Extrapolation Borderline-SMOTE.
For IDL, SMOTE [19] is a typical oversampling method with universal applications and the concrete process for generating the synthetic samples can be described as shown in Algorithm 2.
Generating synthetic samples for the minority by interpolation has been demonstrated to be effective in relieving the extent of imbalance and lifting performance. However, it seems that samples near the decision border outweigh the remaining ones in decision-making. Borderline-SMOTE [3] applies the SMOTE technique to samples near the decision border. Figure 2 displays the interpolation method used to generate synthetic samples.
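The interpolation step of standard SMOTE [19] can be sketched as below. This covers only the classic interpolation between a minority sample and one of its nearest minority neighbors; it is not the extrapolation variant proposed in this paper. The helper names and the toy minority points are assumptions made for the example.

```python
import math
import random

def euclidean(u, v):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def k_nearest_neighbors(x, minority, k):
    """The k minority samples closest to x, excluding x itself."""
    others = [p for p in minority if p is not x]
    return sorted(others, key=lambda p: euclidean(x, p))[:k]

def smote_sample(x, minority, k, rng):
    """Create one synthetic point on the segment between x and a random neighbor."""
    neighbor = rng.choice(k_nearest_neighbors(x, minority, k))
    lam = rng.random()  # uniform random number in [0, 1)
    return tuple(xi + lam * (ni - xi) for xi, ni in zip(x, neighbor))

# Toy minority set (hypothetical points on the unit square).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
rng = random.Random(1)
synthetic = smote_sample(minority[0], minority, k=2, rng=rng)
print(synthetic)
```

Because the random number lies in [0, 1), every synthetic point stays between existing minority samples, which is the restriction on exploring towards the actual boundary that motivates an extrapolation-based variant.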
However, the interpolation between samples used in SMOTE or Borderline-SMOTE restricts the ability to explore towards the actual boundary. As we make use of an ensemble of SVMs, samples near the decision boundary can be roughly characterized from the support hyperplane learned by the first SVM. Taking this into consideration, a novel synthetic minority oversampling method is proposed as shown in Algorithm 3, and Figure 3 illustrates the idea.
Here $1/\|\mathbf{w}\|_2$ is the distance from the support hyperplane to the decision hyperplane of the first SVM learned from the imbalanced training dataset.
The synthetic minority sample tends to correct the skew finely, and the extrapolation works to detect the decision boundary when $\tilde{\mathbf{x}}$ lies on the inner side of the support hyperplane, as Figure 3 indicates.

Bagging of Extrapolation Borderline-SMOTE SVM.
Ensemble methods can effectively enhance a model's generalization capability. Here, a novel ensemble method for solving IDL problems, called Bagging of Extrapolation Borderline-SMOTE SVMs (BEBS), is proposed.
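(1) Input the sample $\mathbf{x} \in D_{\min}$ and its $k$-nearest neighbors, represented as $\mathrm{nn}(\mathbf{x})$
(2) Select a random number $\lambda$ generated from the uniform distribution on $[0, 1]$
(3) Output a new synthetic sample for the minority as $\mathbf{x}_{\mathrm{new}} = \mathbf{x} + \lambda(\tilde{\mathbf{x}} - \mathbf{x})$, where $\tilde{\mathbf{x}}$ is randomly chosen from $\mathrm{nn}(\mathbf{x})$
Algorithm 2: SMOTE algorithm.

(1) Input a sample near the decision border $\mathbf{x} \in D_{\min}$, where $D_{\min}$ is the set of the minority, and its $k$-nearest neighbors, represented as $\mathrm{nn}(\mathbf{x})$
(2) Select a random number $\lambda$ generated from the uniform distribution on $[0, 1]$
(3) Output a new synthetic sample for the minority as
Algorithm 3: Extrapolation Borderline-SMOTE algorithm.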
For SVM, note that the support vectors, which have positive Lagrange multipliers, decide the final discriminative boundary. So we apply Extrapolation Borderline-SMOTE to the support vectors belonging to the minority to relieve the imbalance level.
The whole idea of BEBS can be elucidated as follows. The original support vectors, which contain borderline information, are roughly identified through the base SVM learned from the imbalanced dataset $D$. During the initialization process, a proper kernel and hyperparameters are chosen through cross-validation, with G-means as the selection metric. Then the original support vectors belonging to the minority are marked as $SV_0 = \{\mathbf{x}^{(0)}_{SV}\}$, and a new dataset $\tilde{D} = D \setminus SV_0$ for further bootstrapping is constructed by removing $SV_0$. Bootstrapping is performed on $\tilde{D}$ for $K$ rounds, and each sampling result $\tilde{D}_k$ has size $|\tilde{D}|$. Next, Extrapolation Borderline-SMOTE is applied to the aggregation of each $\tilde{D}_k$ with $SV_0$. After that, the merged datasets with new synthetic samples are used for training the meta-SVMs, and the original data that are not sampled serve as validation sets for tuning parameters. Finally, the SVMs are aggregated with equal weights to form the ensemble classifier (see Algorithm 4).
Specifically, the default parameters in our model were set to a sampling ratio of 0.5 and $K = 100$ metaclassifiers, and the following experiments shared the same parameters.

The Intuition behind BEBS.
The core idea of BEBS is to aggregate various SVMs that revise the initial decision boundary by constructing synthetic minority samples in the correct direction. These synthetic samples are presumed to characterize the actual decision boundary well. The variance among the SVMs originates from two sources. One is the random selection from $SV_0$ according to the sampling ratio, and the other is the difference among training sets due to bootstrapping. Besides, training data not sampled in a round of bootstrapping are exploited as the validation set for exploring better hyperparameters; see Table 1.
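The data preparation for a single BEBS round, covering both sources of randomness and the out-of-bag validation set, might look like the sketch below. It works on sample indices only and deliberately omits SVM training and the synthetic-sample formula. The function names, the toy indices, and the reading of the sampling ratio as "fraction of the minority support vectors drawn as oversampling seeds" are our assumptions, not details confirmed by the text.

```python
import random

def split_support_vectors(n_samples, minority_sv_indices):
    """Separate the minority support vectors (SV_0) from the remaining indices."""
    sv0 = sorted(set(minority_sv_indices))
    sv0_set = set(sv0)
    rest = [i for i in range(n_samples) if i not in sv0_set]
    return sv0, rest

def bootstrap_round(rest, sv0, ratio, rng):
    """Return (bootstrap sample, out-of-bag validation indices, seeds from SV_0)."""
    # Bootstrap: draw len(rest) indices with replacement, equal probability.
    sampled = rng.choices(rest, k=len(rest))
    # Indices never drawn in this round serve as the validation set.
    out_of_bag = sorted(set(rest) - set(sampled))
    # Assumption: the sampling ratio selects a random subset of SV_0 as seeds.
    n_seeds = max(1, int(ratio * len(sv0)))
    seeds = rng.sample(sv0, n_seeds)
    return sampled, out_of_bag, seeds

# Hypothetical dataset of 20 samples whose minority support vectors are 3, 7, 11, 15.
rng = random.Random(42)
sv0, rest = split_support_vectors(n_samples=20, minority_sv_indices=[3, 7, 11, 15])
sampled, out_of_bag, seeds = bootstrap_round(rest, sv0, ratio=0.5, rng=rng)
print(len(sampled), len(seeds), out_of_bag)
```

Repeating `bootstrap_round` once per metaclassifier yields differing training sets and differing seed subsets, which is where the diversity among the SVMs comes from.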

Experimental Settings and Metrics.
Datasets for the experiments are chosen from the UCI machine learning repository [38], and most of them are quite imbalanced. Here we only cope with binary classification problems, so in multiclass cases one class is labeled as the minority while the rest are merged as the majority, which is similar to the preprocessing of other researchers [25,39,40]. Table 2 shows detailed information about the datasets, including the sample size, the number of attributes, the numbers of minority and majority samples, and the imbalance ratio. The imbalance ratio is defined as the cardinality of the majority divided by the cardinality of the minority, and it may severely influence the performance of classifiers. Traditional ensemble methods such as AdaBoostM1 and random forest are chosen for comparison as well. It should be noted that both AdaBoostM1 and random forest can be considered techniques that relieve imbalance, owing to the error-driven weight-adjustment mechanism in AdaBoostM1 and the out-of-bag performance monitoring in random forest. We also validated the effect of imbalance on the original SVM. Some state-of-the-art and commonly used algorithms, including random undersampling, random oversampling, SMOTE, and SMOTE-ENN [41], were run on the above-mentioned datasets as well, to demonstrate the effectiveness of the proposed algorithm. Random undersampling, random oversampling, SMOTE, and SMOTE-ENN were each combined with SVM for the final classification.
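The one-versus-rest relabeling and the imbalance ratio can be illustrated with a short sketch. The class names and counts below are hypothetical and do not correspond to any dataset in Table 2.

```python
from collections import Counter

def binarize(labels, minority_class):
    """Label the chosen class +1 (minority) and merge all other classes into -1."""
    return [1 if y == minority_class else -1 for y in labels]

def imbalance_ratio(binary_labels):
    """Cardinality of the majority divided by the cardinality of the minority."""
    counts = Counter(binary_labels)
    return counts[-1] / counts[1]

# Hypothetical three-class label vector: 20 of class "a", 70 of "b", 50 of "c".
labels = ["a"] * 20 + ["b"] * 70 + ["c"] * 50
binary = binarize(labels, minority_class="a")
ratio = imbalance_ratio(binary)
print(ratio)  # prints 6.0
```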
In binary classification problems, the confusion matrix offers an intuitive measure for evaluating a classifier's performance. As illustrated in Table 3, FN is the number of samples mistakenly identified as negative, and the remaining entries can be understood similarly.
The accuracy of the classifier is defined as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

For the problem of IDL, accuracy is not persuasive for evaluation, as discussed before. One of the most frequently used evaluation criteria for IDL is G-means, which strictly penalizes a biased model. G-means is the geometric mean of the recall ratios of the two classes:

$$\text{G-means} = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}.$$

It is obvious that G-means attains a high value only when both recall ratios stay at a high level. So G-means can be considered a trade-off between the accuracy and the recall ratio.
The $F$-score is defined as

$$F_{\beta} = \frac{(1 + \beta^{2}) \cdot \text{Precision} \cdot \text{Recall}}{\beta^{2} \cdot \text{Precision} + \text{Recall}}.$$

A harmonic average is applied in this index, and the parameter $\beta$ controls the extent of penalization. Here $\beta$ is set to 1.
The $F$-score behaves similarly to G-means and is consistent with it in our experimental findings, but in essence it averages the precision and the recall ratio of one class.
Besides, the precision of the minority class also plays a crucial role in IDL, and it is significant in most of the cases described in the introduction. So precision was taken into consideration during evaluation. The precision for the positive class is defined as

$$\text{Precision} = \frac{TP}{TP + FP}.$$

To obtain a robust result for evaluation, we adopted risk minimization as the criterion, in which the minimum of each metric over the two classes was taken as the reported result. Taking precision as an instance, though the precisions of both classes can be computed during testing, the smaller one was selected:

$$\text{Precision} := \min_{i}\ \text{Precision}(\text{class } i).$$
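The evaluation metrics discussed here can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions of accuracy, G-means, and the F-score, and applies the "smaller of the two classes" rule to precision. The function names and the example counts are ours, chosen only for illustration.

```python
import math

def recall(tp, fn):
    """Fraction of the actual members of a class that are recovered."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    """Fraction of the predictions for a class that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

def g_means(tp, fn, fp, tn):
    """Geometric mean of the recall ratios of the positive and negative class."""
    return math.sqrt(recall(tp, fn) * recall(tn, fp))

def f_beta(tp, fn, fp, beta=1.0):
    """Weighted harmonic mean of precision and recall of the positive class."""
    p, r = precision(tp, fp), recall(tp, fn)
    denominator = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denominator if denominator else 0.0

def min_class_precision(tp, fn, fp, tn):
    """Smaller of the two per-class precisions (negative class: TN vs. FN)."""
    return min(precision(tp, fp), precision(tn, fn))

# Hypothetical confusion matrix: 10 positive (minority) and 90 negative examples.
tp, fn, fp, tn = 8, 2, 10, 80
print(round(accuracy(tp, fn, fp, tn), 4))             # 0.88
print(round(g_means(tp, fn, fp, tn), 4))              # 0.8433
print(round(f_beta(tp, fn, fp), 4))                   # 0.5714
print(round(min_class_precision(tp, fn, fp, tn), 4))  # 0.4444
```

For the degenerate classifier that labels everything negative (`tp=0, fn=10, fp=0, tn=990`), accuracy is 0.99 while G-means is 0, which shows how strictly G-means penalizes a biased model.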

Results Analysis
Performance Analysis.
We averaged the results of G-means, $F_1$-score, and precision over 10 independent runs. Table 4 gives the final results on the various datasets, with the top 3 results in each row shown in bold. A direct conclusion drawn from the table is that BEBS, random forest, and AdaBoostM1 were among the leading methods most of the time and behaved stably on all three metrics. One reason is that the other sampling algorithms require careful tuning of their parameters. The original SVM, however, obtained worse results on the Fertility, Pima, Segmentation 1, and Segmentation 3 datasets, and its $F_1$-score on Parkinson was rather low. This phenomenon verified the explanation of SVM's skew in the imbalanced case. Random oversampling, random undersampling, and SMOTE-ENN were evidently sensitive to the datasets because all of them need parameters set manually for each specific case rather than adapting automatically. SMOTE outperformed these three methods but was less effective than our proposed BEBS. BEBS, which performed well and stably on all three metrics, benefited from both the intuitive extrapolation-SMOTE method involving boundary information and the randomness of the bootstrapping technique. To offer a more direct view, we ranked the performance of the methods on the testing sets in decreasing order with respect to G-means, $F_1$-score, and precision. The average ranks of all algorithms on the 6 datasets are shown in Figure 4. Also, taking generalization and performance into consideration, random forest and AdaBoostM1 are still worth trying when no additional information is available. Specifically, with the help of SPSS [42], we carried out Student's paired $t$-test, with the confidence interval of the difference set to 95%, to check the significance of the 10 independent results in each comparison. The 10 independent results were compared in the form of BEBS versus some other algorithm.
Since seven models were chosen for comparison, seven statistical testing results were obtained on each dataset. We repeated this process for the metrics G-means, $F_1$-score, and precision on each dataset. Each of the seven paired tests has three possible outcomes: significantly weaker than BEBS, tie, or significantly stronger than BEBS. More precisely, a tie arises when the average of the 10 independent results of a model on some metric and dataset is higher or lower than that of BEBS but the difference is not significant according to the paired $t$-test; in that case we attribute the difference to randomness rather than to the mechanism of the models and label the paired comparison as a tie. The label win means that the average result of BEBS not only exceeds that of the comparison model but also passes the hypothesis test. A loss is defined analogously. Combined with Table 4, the results of significance testing were finally mapped into a 3-element tuple of the form win\tie\loss. Then we counted the frequencies of win, tie, and loss in the 7 paired comparisons. The computed results are arranged in Table 5.
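The win/tie/loss labelling can be sketched with a hand-written paired t-test. The paper used SPSS; the code below is only an illustrative equivalent under stated assumptions: a two-sided test at the 95% level with 10 paired runs (9 degrees of freedom, critical value 2.262), and made-up score lists that are not results from Table 4.

```python
import math
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 9 degrees of freedom (10 paired runs).
T_CRIT_95_DF9 = 2.262

def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    m, sd = mean(diffs), stdev(diffs)
    if sd == 0.0:
        # All differences identical: no spread to test against.
        return 0.0 if m == 0.0 else math.copysign(math.inf, m)
    return m / (sd / math.sqrt(len(diffs)))

def compare(bebs_scores, other_scores, t_crit=T_CRIT_95_DF9):
    """Label one paired comparison from the point of view of BEBS."""
    t = paired_t_statistic(bebs_scores, other_scores)
    if abs(t) < t_crit:
        return "tie"
    return "win" if t > 0 else "loss"

# Hypothetical G-means values from 10 independent runs (illustrative only).
bebs = [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.90, 0.89, 0.92, 0.91]
other = [0.85, 0.86, 0.84, 0.88, 0.86, 0.85, 0.87, 0.84, 0.86, 0.85]
close = [0.89, 0.93, 0.88, 0.93, 0.89, 0.92, 0.89, 0.91, 0.91, 0.92]

print(compare(bebs, other))  # prints win
print(compare(other, bebs))  # prints loss
print(compare(bebs, close))  # prints tie
```

Counting these labels over the seven competitors on each dataset gives the win\tie\loss tuples described above.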
From Table 5, some conclusions can be drawn as follows. From the perspective of G-means, about 76.2% of the comparison results showed that BEBS significantly outperformed the other models, computed as the number of win counts over all datasets, 32, divided by the total number of paired comparisons, 42. In $F_1$-scores, the ratio of no loss to the other models was approximately 83.3%, and at the same time 64.3% of the paired comparisons indicated significantly superior results for BEBS. For precision, only 4.8% of the total counts were significantly poorer than some other model, though the proportion of ties was about 38.1%. In all, BEBS produced better results after a series of experiments and statistical tests. The next part examines the stability of BEBS through sensitivity analysis experiments.
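As a quick arithmetic check, the 76.2% figure for G-means corresponds to 32 wins out of 42 paired comparisons, where 42 is the product of 6 datasets and 7 competing models. The variable names below are ours.

```python
n_datasets = 6       # datasets used in the experiments
n_competitors = 7    # models compared against BEBS on each dataset
total_comparisons = n_datasets * n_competitors

wins_gmeans = 32     # win count for G-means stated in the text
win_share = wins_gmeans / total_comparisons

print(total_comparisons)   # prints 42
print(f"{win_share:.1%}")  # prints 76.2%
```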

Sensitivity Analysis.
Our proposed algorithm BEBS contains two crucial hyperparameters to tune, namely, the number of metaclassifiers $K$ and the oversampling ratio for Extrapolation Borderline-SMOTE. Regardless of the variations among datasets, in the former experiments the hyperparameters were consistently fixed at $K = 100$ and a sampling ratio of 0.5. The performance ought to be influenced when these parameters are changed. To investigate the robustness of BEBS, we ran BEBS on the prepared datasets over a range of hyperparameter values. As suggested before, G-means characterizes fair results well by penalizing the consequences of imbalance. Here sensitivity analysis on the two hyperparameters was carried out, with G-means as the objective.
With the sampling ratio fixed at 0.5, we varied the number of metaclassifiers $K$ over the interval [70, 130] with a step length of 10 and averaged the 10 independent results for each setting. As Figure 5 illustrates, the six polylines run steadily as $K$ increases and the maximum   Figure 6. The points on the polylines were likewise acquired by averaging the G-means values of 10 independent results for each set of parameters. Interestingly, the tendencies on Fertility, Glass 7, and Segmentation 3 were significant, and performance was steadily enhanced as the sampling ratio increased. The phenomenon may be attributed to the imbalance ratio of the datasets: according to the statistics in Table 2, the imbalance ratios of these datasets are not less than 6. More synthetic minority samples tended to contribute to detecting the actual boundary. So a conclusion can be drawn that when the imbalance ratio stays at a rather high level, the sampling ratio should also be adapted to relieve overfitting. Results on Parkinson and Pima showed declines when the sampling ratio exceeded certain thresholds, so a higher sampling ratio on a dataset that is not extremely imbalanced may damage the final performance. In total, BEBS seems sensitive to the sampling ratio, and the imbalance ratio should be taken into account when choosing this parameter.
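The sweep over the number of metaclassifiers can be organized as in the sketch below. Only the grid (70 to 130 in steps of 10) and the averaging over 10 runs come from the text. The `run_once` callable is a hypothetical hook standing in for "train BEBS and return G-means", and `dummy_run` returns placeholder numbers that are not experimental results.

```python
from statistics import mean

def sweep_ensemble_size(run_once, k_values, sampling_ratio, n_repeats=10):
    """Average the metric over n_repeats independent runs for every ensemble size."""
    return {
        k: mean(run_once(k, sampling_ratio, seed) for seed in range(n_repeats))
        for k in k_values
    }

# Ensemble sizes 70, 80, ..., 130.
k_values = list(range(70, 131, 10))

def dummy_run(k, sampling_ratio, seed):
    # Placeholder metric, NOT a real BEBS result; replace with an actual training run.
    return 0.5 + 0.001 * (seed % 2)

curve = sweep_ensemble_size(dummy_run, k_values, sampling_ratio=0.5)
for k, value in curve.items():
    print(k, round(value, 4))
```

The same function can be reused for the sampling-ratio analysis by fixing the ensemble size and looping over ratios instead.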

Conclusions
In this paper, a novel ensemble method called BEBS was proposed for dealing with IDL in the binary case. BEBS was built by applying an adaptive sampling method, Extrapolation Borderline-SMOTE, and bootstrap aggregation to the original imbalanced dataset. This variant of SMOTE takes advantage of boundary information derived from the initial SVM, while the bagging mechanism helps relieve overfitting and promotes the model's generalization capability. The decision boundary's skew towards the minority when using SVM can be revised with the help of synthetic samples. In our experiments, each method was run ten times independently on each dataset to ensure the validity of the hypothesis tests, and the statistical records show that BEBS significantly outperforms some representative IDL algorithms most of the time. The sensitivity analysis illustrates the relation between the scale of the ensemble, the sampling ratio, and performance, suggesting that BEBS can be further enhanced by properly adapting it to the imbalance ratio of the dataset. Future research will summarize general relations between the algorithm's performance and other properties, such as the number of attributes and the sample cardinality. Multiclass imbalance cases [43] will also be considered in later work.