A Novel Selective Ensemble Algorithm for Imbalanced Data Classification Based on Exploratory Undersampling

Learning from imbalanced data is one of the emerging challenges in machine learning. Recently, ensemble learning has arisen as an effective solution to class imbalance problems. The combination of bagging and boosting with data-level resampling, in particular the simple yet accurate exploratory undersampling, has become one of the most popular approaches to imbalanced data classification. In this paper, we propose a novel selective ensemble construction method based on exploratory undersampling, RotEasy, which reduces storage requirements and improves computational efficiency through ensemble pruning. Our methodology aims to enhance the diversity between individual classifiers through feature extraction and diversity regularized ensemble pruning. We made a comprehensive comparison between our method and several state-of-the-art imbalanced learning methods. Experimental results on 20 real-world imbalanced data sets, assessed by a nonparametric statistical test and several evaluation criteria, show that RotEasy achieves a significant increase in performance.


Introduction
Recently, classification with imbalanced data sets has emerged as one of the most challenging tasks in the data mining community. Class imbalance occurs when examples of one class are severely outnumbered by those of the other classes. When data are imbalanced, traditional data mining algorithms tend to favor the overrepresented (majority or negative) class, resulting in unacceptably low recognition rates for the underrepresented (minority or positive) class. However, the underrepresented minority class usually represents the positive concept and is of greater interest than the majority class, so the classification accuracy of the minority class is preferred over that of the majority class. For instance, the recognition goal of medical diagnosis is to provide higher identification accuracy for rare diseases. Similar to most existing imbalanced learning methods in the literature, we focus on two-class imbalanced classification problems in the current study.
Class imbalance problems appear in many real-world applications, such as fraud detection [1], anomaly detection [2], medical diagnosis [3], and DNA sequence analysis [4]. On account of the prevalence of potential applications, a large number of techniques have been developed to deal with class imbalance problems. Interested readers can refer to several review papers [5][6][7]. These proposals can be divided into three categories, depending on the way they work.
(i) External approaches at the data level: these methods resample the data in order to decrease the effect of the imbalanced class distribution. They can be broadly categorized into two groups: undersampling the majority class and oversampling the minority class [8,9]. They have the advantage of being independent of the classifier used, so they are considered resampling preprocessing techniques. (ii) Internal approaches at the algorithmic level: these approaches try to adapt the decision threshold to impose a bias towards the minority class, or adjust the misclassification costs for each class in the learning process [10][11][12]. These approaches are more dependent on the problem and the classifier used.
(iii) Combined approaches based on data preprocessing and ensemble learning, most commonly boosting and bagging: they usually apply data preprocessing techniques before ensemble learning.
The third group has become popular for imbalanced data classification, mainly due to its ability to significantly improve the performance of a single classifier. In general, three kinds of ensemble patterns are integrated with data preprocessing techniques: boosting-based ensembles, bagging-based ensembles, and hybrid ensembles. Methods in the first, boosting-based category alter and bias the weight distribution towards the minority class before training the next classifier; examples include SMOTEBoost [13], RUSBoost [14], and RAMOBoost [15]. In the second, bagging-based category, the main difference lies in how each class is taken into account when instances are randomly drawn in each bootstrap sample; proposals include UnderBagging [16] and SMOTEBagging [17].
The main characteristic of the third category is hierarchical ensemble learning, combining both bagging and boosting with a resampling preprocessing technique. The simplest method in this group is exploratory undersampling, proposed by Liu et al. [18] and also known as EasyEnsemble. It uses bagging as the main ensemble learning framework, and each bag member is itself an AdaBoost ensemble classifier. Hence, it combines the merits of boosting and bagging and strengthens the diversity of the ensemble. Empirical studies confirm that EasyEnsemble is highly effective for imbalanced data classification tasks.
It is widely recognized that diversity among individual classifiers is pivotal to the success of an ensemble learning system. Rodriguez et al. [19] proposed rotation forest, a novel extension of bagging that promotes diversity within the ensemble through feature extraction based on principal component analysis (PCA). Moreover, many ensemble pruning techniques have been developed to select more diverse subensembles. For example, Li et al. [20] proposed the diversity regularized ensemble pruning (DREP) method, which greatly improves the generalization capability of ensemble classifiers.
Motivated by the above analysis, we propose a novel ensemble construction technique, RotEasy, designed to enhance the diversity between component classifiers. The main idea of RotEasy is to inherit the advantages of EasyEnsemble and rotation forest by integrating them. We conducted a comprehensive suite of experiments on 20 real-world imbalanced data sets, providing a complete perspective on the performance of the proposed algorithm. Experimental results indicate that our approach significantly outperforms the compared state-of-the-art imbalanced learning methods.
The remainder of this paper is organized as follows. Section 2 presents related learning algorithms in order to facilitate the discussion. In Section 3, we describe the proposed methodology and its rationale in detail. Section 4 introduces the experimental framework, including the experimental data sets, the compared methods, and the performance evaluation criteria. In Section 5, we show and discuss the experimental results. Finally, conclusions and future work are outlined in Section 6.

Related Work and Motivation
In order to facilitate our later discussions, we will give a brief introduction to exploratory undersampling, rotation forest, and DREP ensemble pruning method.

Exploratory Undersampling.
Undersampling is an efficient method for handling class imbalance, using only a subset of the majority class. Since many majority examples are ignored, the training set becomes more balanced and the training process becomes faster. However, potentially useful information contained in the ignored majority examples is lost. Liu et al. [18] proposed exploratory undersampling, also known as EasyEnsemble, to further exploit these ignored examples while keeping the fast training speed.
Given a minority set P and a majority set N, EasyEnsemble independently samples several subsets N_1, N_2, . . ., N_T from N, where |N_i| < |N|. For each majority subset N_i combined with the minority set P, AdaBoost [22] is used to train a base ensemble H_i. All generated base classifiers are fused by weighted voting for the final decision. The pseudocode of EasyEnsemble is shown in Algorithm 1.
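The sampling-and-boosting loop above can be sketched as follows. This is a minimal illustration using scikit-learn's AdaBoostClassifier; the function names are our own, the equal-size subsets |N_i| = |P| follow the description above, and the simple probability-averaging vote is our simplification of the paper's weighted-voting rule with thresholds.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def easy_ensemble(X, y, T=10, n_inner=10, rng=None):
    """Train T AdaBoost ensembles, each on the minority class plus a
    random majority subset of equal size (balanced undersampling)."""
    rng = np.random.default_rng(rng)
    pos = np.flatnonzero(y == 1)   # minority (positive) class
    neg = np.flatnonzero(y == 0)   # majority (negative) class
    members = []
    for _ in range(T):
        sub = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, sub])
        clf = AdaBoostClassifier(n_estimators=n_inner)
        members.append(clf.fit(X[idx], y[idx]))
    return members

def predict_vote(members, X):
    # Average the members' positive-class probabilities, threshold at 0.5.
    p = np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)
    return (p >= 0.5).astype(int)
```

Each member sees every minority example but only a different small slice of the majority class, which is how the ignored majority examples are still exploited across the ensemble.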
EasyEnsemble generates T balanced subproblems, where the i-th subproblem is to learn an AdaBoost ensemble H_i, so it is in effect an "ensemble of ensembles." It is well known that boosting mainly reduces bias, while bagging mainly reduces variance. EasyEnsemble thus benefits from the good qualities of boosting and of a bagging-like strategy with balanced class distribution.
Experimental results in [18] show that EasyEnsemble achieves higher AUC, F-measure, and G-mean values than many existing imbalanced learning methods. Moreover, EasyEnsemble has approximately the same training time as plain undersampling, which is significantly faster than the other algorithms.

Rotation Forest.
In rotation forest, the feature set is randomly split into subsets, PCA is applied to each subset, and a base classifier is trained with the transformed data set. Different splits of the feature set lead to different rotations, and thus diverse classifiers are obtained. On the other hand, the information about the scatter of the data is completely preserved in the new space of extracted features. In this way, accurate and more diverse classifiers are built. In the study of Rodriguez et al. [19], using the kappa-error diagram as an analysis tool, rotation forest was shown to have a diversity-accuracy pattern similar to that of bagging while being slightly more diverse. Hence, rotation forest promotes diversity within the ensemble through feature extraction. The pseudocode of rotation forest is listed in Algorithm 2.
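The rotation step can be illustrated as below. This is a simplified sketch: the helper name `build_rotation` and the parameter K (number of feature subsets) are our own, and we omit the per-subset bootstrap of class subsets that the full rotation forest algorithm applies before PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

def build_rotation(X, K=2, rng=None):
    """Build a block-diagonal rotation matrix: randomly split the features
    into K subsets and run a full-rank PCA on each subset, so the scatter
    of the data is preserved while the axes are rotated."""
    rng = np.random.default_rng(rng)
    n_feat = X.shape[1]
    perm = rng.permutation(n_feat)
    blocks = np.array_split(perm, K)
    R = np.zeros((n_feat, n_feat))
    for b in blocks:
        pca = PCA().fit(X[:, b])            # all components are kept
        R[np.ix_(b, b)] = pca.components_.T  # loadings form one block
    return R  # train the base classifier on the rotated data X @ R
```

Because every principal component is retained, R is orthogonal: no information is discarded, only the coordinate system changes, and a different random split yields a different rotation and hence a different classifier.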

DREP Ensemble Pruning.
With the goal of reducing storage requirements and improving computational efficiency, ensemble pruning addresses the problem of reducing ensemble size. Furthermore, theoretical and empirical studies have shown that ensemble pruning can also improve the generalization performance over that of the complete ensemble.
Guided by a theoretical analysis of the effect of diversity on generalization performance, Li et al. [20] proposed the Diversity Regularized Ensemble Pruning (DREP) method, a greedy forward ensemble pruning method with explicit diversity regularization. The pseudocode of the DREP method is presented in Algorithm 3.
In Algorithm 3, the diversity is measured by the pairwise difference between a candidate classifier h and the current subensemble H on the validation set V; for classifiers with outputs in {−1, +1} it can be written as diff(h, H) = (1/|V|) Σ_{x∈V} h(x)H(x), so that a smaller value indicates larger diversity. Starting with the classifier with the lowest error on the validation set V, the DREP method iteratively selects the best classifier based on both empirical error and diversity. Concretely, at each step it first sorts the candidate classifiers in ascending order of their differences with the current subensemble, and then, from the front part of the sorted list, it selects the classifier that most reduces the empirical error on the validation set. These two criteria are balanced by the parameter ρ, the fraction of classifiers considered when minimizing the empirical error. Obviously, a large value of ρ puts more emphasis on the empirical error, while a small ρ pays more attention to diversity. It can thus be expected that the obtained ensemble will have both large diversity and small empirical error. Experimental results show that, with the help of diversity regularization, DREP achieves significantly better generalization performance with smaller ensemble sizes than other ensemble pruning methods.
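A minimal sketch of this greedy procedure follows, assuming classifiers with outputs in {−1, +1} and a precomputed matrix of their validation-set predictions. The function name `drep_prune` and the stopping rule (halt when no candidate lowers the empirical error) are our own choices, not taken from [20].

```python
import numpy as np

def drep_prune(preds, y, rho=0.5):
    """Greedy DREP-style pruning.
    preds: (n_classifiers, n_samples) matrix of {-1,+1} predictions on the
    validation set; y: true labels; rho: fraction of the most diverse
    candidates considered when minimizing the empirical error."""
    n, m = preds.shape
    err = (preds != y).mean(axis=1)
    chosen = [int(np.argmin(err))]          # start from lowest-error member
    rest = [i for i in range(n) if i != chosen[0]]
    while rest:
        H = np.sign(preds[chosen].sum(axis=0) + 1e-12)  # current subensemble
        diff = preds[rest] @ H / m          # small value = more diverse
        k = max(1, int(np.ceil(rho * len(rest))))
        cand = [rest[i] for i in np.argsort(diff)[:k]]
        errs = [(np.sign(preds[chosen + [c]].sum(axis=0) + 1e-12) != y).mean()
                for c in cand]
        best = cand[int(np.argmin(errs))]
        cur_err = (np.sign(preds[chosen].sum(axis=0) + 1e-12) != y).mean()
        if min(errs) >= cur_err:            # stop when no improvement
            break
        chosen.append(best)
        rest.remove(best)
    return chosen
```

The two criteria appear exactly as in the text: the `diff` sort implements the diversity regularization, and the error minimization over the front fraction `rho` of the sorted list implements the empirical-error term.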

RotEasy: A New Selective Ensemble Algorithm Based on EasyEnsemble and Rotation Forest
Based on the above analysis, we propose a novel selective ensemble construction technique, RotEasy, which integrates feature extraction and ensemble pruning with EasyEnsemble to further improve ensemble diversity.
The main steps of RotEasy can be summarized as follows: firstly, a subset N_i of size |P| is undersampled from the majority class. Secondly, we construct an inner-layer ensemble H_i by integrating rotation forest and AdaBoost. Lastly, the DREP method is used to prune the learned ensemble with the aim of enhancing its diversity. The pseudocode of the RotEasy method is listed in Algorithm 4.
It should be pointed out that some parameters of RotEasy need to be specified in advance. The values of T and s_i are set in the same manner as for EasyEnsemble. As for the validation set V, we randomly split the training set into two parts of approximately the same size; one part is used to train the ensemble members and the other to prune the ensemble. The best value of the parameter ρ can be found by a line search over {0.2, 0.25, . . ., 1}. In fact, the performance of RotEasy is very robust to the variation of ρ, as confirmed in the later experimental analysis.
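The line search over ρ can be sketched as follows. The helper name `search_rho` and the `evaluate` callback, which should return a validation score (for example AUC) for a given ρ, are hypothetical names for illustration.

```python
import numpy as np

def search_rho(evaluate, grid=None):
    """Pick the rho with the best validation score over the grid
    {0.2, 0.25, ..., 1.0} used in the paper."""
    grid = np.arange(0.2, 1.0001, 0.05) if grid is None else grid
    scores = [evaluate(r) for r in grid]
    return float(grid[int(np.argmax(scores))])
```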

Experimental Framework
In this section, we present the experimental framework to examine the performance of our proposed RotEasy method and compare it with some state-of-the-art imbalanced learning methods.

Experimental Data Sets.
To evaluate the effectiveness of the proposed method, extensive experiments were carried out on 20 public imbalanced data sets from the UCI repository.In order to ensure a thorough performance assessment, the chosen data sets vary in sample size, class distribution, and imbalance ratio.
Table 1 summarizes the properties of the data sets: the number of examples, the number of attributes, the sample sizes of the minority and majority classes, and the imbalance ratio, that is, the sample size of the majority class divided by that of the minority class. The data sets are sorted by imbalance ratio in ascending order. Several multiclass data sets were modified into two-class problems by keeping one class as the positive class and joining the remainder into the negative class.
In our experiments, we use the classification and regression tree (CART) as the base classifier in all compared methods, because it is sensitive to changes in the training samples while still being accurate. We set the total number of base classifiers in each ensemble to T = 100. The benchmark methods and their parameters are described as follows.
(1) CART. It is implemented by the "classregtree" function with default parameter values in MATLAB.
(2) RUSBoost (ab. RUSB). A majority subset N_i is sampled (without replacement) from N. Then, AdaBoost is used to train an ensemble classifier using P and N_i.
(7) RAMOBoost (ab. RAMO). Following the suggestion of [15], the number of nearest neighbors used in adjusting the sampling probability of the minority class is set to k_1 = 5, the number of nearest neighbors used to generate the synthetic instances is set to k_2 = 5, and the scaling coefficient is set to 0.3.
(8) Rotation forest (ab. RotF). The feature set is randomly split into subsets and PCA is applied to each bootstrapped subset. The number of features in each subset is set to 5.
(9) EasyEnsemble (ab. Easy). It first randomly undersamples (without replacement) the majority class in each outer-layer iteration; then, AdaBoost is used to train the inner-layer ensemble classifier.
(11) Our proposed method (ab. RotEasy). The number of undersampled subsets is T = 10, and the number of inner ensemble iterations is s_i = 10. The DREP method is then applied on the validation subset V to prune the ensemble.
For RotE-un and RotEasy, we randomly split the training data into two halves: one half as the training set and the other as the validation set. The parameter ρ is selected from {0.2, 0.25, . . ., 1} with an interval of 0.05.

Evaluation Measures.
The evaluation criterion plays a crucial role both in guiding classifier modeling and in assessing classification performance. Traditionally, overall accuracy is the most commonly used empirical metric. However, accuracy is not a proper measure for class imbalance problems, since the positive class makes little contribution to the overall accuracy.
For the two-class problem we consider here, the confusion matrix records the results of correctly and incorrectly classified examples of each class.It is shown in Table 2.
Specifically, we obtain the following performance evaluation metrics from the confusion matrix. AUC: the area under the receiver operating characteristic (ROC) curve; it provides a single scalar measure for evaluating which model is better on average.
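The metrics used in this study can be computed from predictions as sketched below, using scikit-learn only for the confusion matrix and the AUC. The function name `imbalance_metrics` is our own, and the minority (positive) class is assumed to be labeled 1.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

def imbalance_metrics(y_true, y_pred, y_score):
    """AUC, G-mean and F-measure for a binary problem."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn)                      # recall (TP rate)
    tnr = tn / (tn + fp)                      # specificity (TN rate)
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * tpr / (precision + tpr)
                 if precision + tpr else 0.0)
    return {"AUC": roc_auc_score(y_true, y_score),
            "G-mean": np.sqrt(tpr * tnr),     # geometric mean of TPR, TNR
            "F-measure": f_measure}
```

Note that the G-mean collapses to zero whenever either class is entirely misclassified, which is exactly why it is preferred over accuracy under class imbalance.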

Experimental Results and Analysis
This section shows the experimental results and the associated statistical analysis for the comparison with standard imbalanced learning algorithms. All reported results are obtained by ten trials of stratified 10-fold cross-validation. That is, the data are split into 10 folds, each containing 10% of the patterns. For each fold, each algorithm is trained on the examples of the remaining folds, and its performance is measured on the held-out fold. For each data set, we report the mean of the resulting 100 values as the final result. Firstly, we investigated the sensitivity of the proposed RotEasy algorithm with respect to the hyperparameter ρ.
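The protocol of ten trials of stratified 10-fold cross-validation can be sketched with scikit-learn's RepeatedStratifiedKFold. The `fit_predict_score` callback, which trains on the training fold and returns a score on the test fold, is a hypothetical interface for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

def cv_scores(X, y, fit_predict_score, trials=10, folds=10, seed=0):
    """Mean score over trials x folds stratified splits (100 in the paper)."""
    rskf = RepeatedStratifiedKFold(n_splits=folds, n_repeats=trials,
                                   random_state=seed)
    scores = [fit_predict_score(X[tr], y[tr], X[te], y[te])
              for tr, te in rskf.split(X, y)]
    return np.mean(scores)
```

Stratification matters here: with an imbalance ratio of, say, 9:1, an unstratified fold could contain no minority examples at all, making G-mean and F-measure undefined on that fold.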

Sensitivity of the Hyperparameter 𝜌.
In the DREP ensemble pruning method, there is a trade-off parameter ρ between ensemble diversity and empirical error. We first examine the influence of ρ on the algorithm performance. To do so, we considered values of ρ in {0.2, 0.25, . . ., 1}, with increment 0.05.
Figure 1 shows the performance curves as a function of ρ on several training data sets, based on the AUC, G-mean, and F-measure evaluation metrics, respectively.
As seen in Figure 1, the performance of RotEasy varies only by a small margin with the change of ρ. Thus, the proposed RotEasy algorithm is insensitive to the variation of ρ, and we fix ρ = 0.5 in the subsequent experiments.

Performance Comparison.
In this section, we compare our proposal RotEasy against the previously presented state-of-the-art methods. Before going through further analysis, we first show the AUC, G-mean, and F-measure values of all methods on each data set in Tables 3, 4, and 5, respectively. We also draw box plots of the test results for all methods on the "Scrapie" data set in Figure 2. Moreover, we investigate the computational efficiency of the newly proposed RotEasy algorithm by computing the running time of all algorithms and the pruned ensemble size of RotEasy on all data sets. These results are listed in Table 6. From the last column of Table 6, we can see that the size of the pruned ensemble drops from 100 to around 30. This greatly improves the computational efficiency of RotEasy in the prediction stage, particularly for large-scale classification problems.
The average running time of RotEasy is among the shortest of all methods, comparable to that of EasyEnsemble and UnderBagging, while RAMOBoost has the longest running time. The other algorithms can be ranked from fast to slow as RUSB, AdaC, CART, RotF, SMOB, and SMBag.

Statistical Tests.
In order to show whether the newly proposed method offers a significant improvement over the other methods, we must give the comparison statistical support. A popular way to compare overall performances is to count the number of problems on which an algorithm is the winner. Some authors use these counts in inferential statistics with a form of two-tailed binomial test, also known as the sign test.
Here, we employ the sign test utilized by Webb [24] to compare the relative performance of all considered algorithms. In the following description, "row" denotes the mean performance of the algorithm labeling a row, and "col" that of the algorithm labeling a column. The first row reports the mean performance across all data sets. Rows labeled ṙ report the geometric mean of the performance ratio col/row. Rows labeled s report the win-tie-loss statistic, where the three values are the numbers of data sets for which col > row, col = row, and col < row, respectively. Rows labeled p report the p values of a two-tailed sign test based on the win-tie-loss record. If the p value is smaller than the given significance level, the difference between the two algorithms is significant; otherwise, it is not.
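The two-tailed sign test on a win-tie-loss record can be computed directly from the binomial distribution, as sketched below. Ties are dropped, which is the usual convention; the function name is our own.

```python
from math import comb

def sign_test_p(wins, losses):
    """Two-tailed sign test p value for a win/loss record (ties dropped).
    Under the null hypothesis, each of the n = wins + losses outcomes is a
    fair coin flip, so we sum the binomial tail up to the smaller count
    and double it."""
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, a 16-3 win-loss record over 19 decisive data sets would give a p value well below 0.05, supporting a significant difference between the two algorithms.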
Tables 7, 8, and 9 show all the pairwise comparisons of considered algorithms based on AUC, G-mean, and F-measure metrics, respectively.The results show that RotEasy obtains the best performance among the compared algorithms.RotEasy not only achieves the highest mean performance, but also always gains the largest win records in the light of the last columns in Tables 7-9.
In terms of the three evaluation measures, the top three algorithms are ranked in the same order: RotEasy, unpruned RotEasy, and EasyEnsemble. The other algorithms are approximately ranked, from better to worse, as SMOB, RUSB, UNBag, RAMO, RotF, and AdaC. Several issues deserve future study: (1) to investigate resampling schemes other than random undersampling; (2) to generalize this technique to multiclass imbalanced learning problems, since only binary imbalanced classification was considered in the current experiments [26][27][28]; (3) to extend our study to semisupervised learning of imbalanced classification problems [29,30].
True positive rate (also known as Recall): the percentage of positive instances correctly classified, TP rate = TP/(TP + FN);
True negative rate: the percentage of negative instances correctly classified, TN rate = TN/(FP + TN);
False positive rate: the percentage of negative instances misclassified, FP rate = FP/(FP + TN);
False negative rate: the percentage of positive instances misclassified, FN rate = FN/(FN + TP);
F-measure: the harmonic mean of Precision and Recall, where Precision = TP/(TP + FP) and F-measure = (2 × Precision × Recall)/(Precision + Recall);
G-mean: the geometric mean of TP rate and TN rate, G-mean = √(TP rate · TN rate).

Figure 2: The box plots of AUC, G-mean, and F-measure results for all the algorithms on the "Scrapie" data set.

Output: for a given x, calculate the class label assigned by the pruned ensemble classifier h*, h*(x) = arg max of the weighted votes over the candidate classes.
Input: a minority training set P and a majority training set N, |P| ≪ |N|; T: the number of subsets undersampled from N; s_i: the number of iterations in AdaBoost learning. Learn an ensemble classifier H_i using P and N_i; H_i is an AdaBoost ensemble with s_i weak classifiers h_{i,j}, corresponding weights α_{i,j}, and threshold θ_i. Randomly split the feature set F into K subsets F_{i,j}, j = 1, 2, . . ., K. (b) For j = 1 to K do: let X_{i,j} be the data set X for the features in F_{i,j}.

Table 1: Description of the experimental data sets. The imbalance ratio is the value of N_maj/N_min.

Table 3: Performance results of all methods based on the AUC evaluation metric. Values in boldface indicate the best result.

Table 4: Performance results of all methods based on the G-mean evaluation metric. Values in boldface indicate the best result.

Table 5: Performance results of all methods based on the F-measure evaluation metric. Values in boldface indicate the best result.
The number of sample subsets is set to T = 10, and the number of AdaBoost iterations is set to s_i = 10. (10) Unpruned RotEasy (ab. RotE-un). The number of undersampled subsets is T = 10, and an s_i = 10 inner-layer ensemble is constructed through integrating rotation forest and AdaBoost, without DREP pruning.

Table 7: Pairwise comparisons of all algorithms based on the AUC criterion.

Table 8: Pairwise comparisons of all algorithms based on the G-mean criterion.

Table 9: Pairwise comparisons of all algorithms based on the F-measure criterion.