Classifying Imbalanced Data Sets by a Novel RE-Sample and Cost-Sensitive Stacked Generalization Method

Learning from imbalanced data sets is one of the key topics in the machine learning community. Stacking ensembles are efficient on balanced data sets but have seldom been applied to imbalanced data. In this paper, we propose a novel RE-sample and Cost-Sensitive Stacked Generalization (RECSG) method based on a two-layer learning model. The first step is Level 0 model generalization, which includes data preprocessing and base-model training. The second step is Level 1 model generalization, which involves a cost-sensitive classifier and the logistic regression algorithm. In the learning phase, preprocessing techniques are embedded in the imbalanced-data learning method. In the cost-sensitive algorithm, the cost matrix is combined with both data characteristics and algorithms. In the RECSG method, the ensemble algorithm is combined with imbalanced-data techniques. Experiments on 17 public imbalanced data sets, evaluated with several metrics (AUC, GeoMean, and AGeoMean), show that the proposed method achieves better classification performance than other ensemble and single algorithms. The proposed method is especially efficient when the performance of the base classifier is low. These results demonstrate that the proposed method can be applied to the class imbalance problem.


Introduction
Classification learning becomes complicated when the class distribution of the data is imbalanced. The class imbalance problem occurs when the number of instances representing one class is much smaller than that of the other classes. Recently, the classification problem of imbalanced data has appeared frequently and received wide attention [1][2][3].
Usually, imbalanced data sets are treated as binary-class problems, with the majority and minority classes denoted as the negative and positive classes, respectively. Traditional techniques are divided into four categories. Resampling techniques increase the number of minority-class instances (oversampling) [4] or decrease the number of majority-class instances (undersampling) [5, 6]. Algorithm-level approaches improve existing algorithms, for example by increasing the weight of positive instances [7]. Classifier ensemble methods have been widely adopted to deal with the imbalance problem over the last decade [8]. In cost-sensitive algorithms, data characteristics are incorporated with misclassification costs in the classification phase [9, 10]. In general, the cost-sensitive and algorithm levels are more closely tied to the imbalance problem itself, whereas the data level and ensemble learning can be used independently of the single classifier.
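The oversampling idea above can be sketched in a few lines. The following is a deliberately minimal random-oversampling example (plain duplication, not the paper's SMOTE-based preprocessing); the function name and signature are illustrative.

```python
import numpy as np

def random_oversample(X, y, minority_label, rng=None):
    """Duplicate minority-class rows at random until the classes balance."""
    rng = np.random.default_rng(rng)
    X, y = np.asarray(X), np.asarray(y)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    deficit = len(majority) - len(minority)
    if deficit <= 0:
        return X, y  # already balanced (or minority is larger)
    extra = rng.choice(minority, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

X = [[0], [1], [2], [3], [4], [5]]
y = [1, 0, 0, 0, 0, 0]          # one positive (minority) instance
Xb, yb = random_oversample(X, y, minority_label=1, rng=0)
print((yb == 1).sum(), (yb == 0).sum())  # balanced: 5 5
```

Undersampling is the mirror image: rows of the majority class are dropped at random instead of minority rows being duplicated.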
Ensemble methods, which train base classifiers, integrate their results, and generate a single final class label, can increase classifier accuracy. The bagging algorithm [11] and the AdaBoost algorithm [12, 13] are the most common ensemble classification algorithms. Ensemble algorithms combined with the other three techniques are widely applied to the classification of imbalanced data sets. Cost-sensitive learning targets the imbalanced learning problem by using different cost matrices, which can be considered a numerical representation of the penalty of misclassifying examples from one class to another; it is thus closely related to learning from imbalanced data. To solve the imbalance problem, previous studies combined ensemble algorithms with data preprocessing, cost-sensitive methods, and related algorithms. Four ensemble families for imbalanced data sets are commonly used: boosting-based, bagging-based, cost-sensitive-boosting, and hybrid ensemble methods.
In boosting-based ensembles, a data preprocessing technique is embedded into the boosting algorithm. In each iteration, the data distribution is changed by altering the instance weights so that the next classifier is trained towards the positive class. These algorithms mainly include SMOTEBoost [14], RUSBoost [15], MSMOTEBoost [16], and DataBoost-IM [17]. In bagging-based ensembles, bagging is combined with data preprocessing techniques. These algorithms are usually simpler than their boosting counterparts because of the simplicity and good generalization ability of bagging. The family includes, but is not limited to, OverBagging, Under-OverBagging [18], UnderBagging [19], and IIVotes [20]. In cost-sensitive-boosting ensembles, the general learning framework of AdaBoost is maintained, but a misclassification cost adjustment function is introduced into the weight-updating formula. These ensembles usually differ in how they modify the weight-update rule. AdaCost [21], CSB1, CSB2 [22], RareBoost [23], AdaC1, AdaC2, and AdaC3 [24] are the most representative approaches. Unlike the previous algorithms, hybrid ensemble methods adopt double ensemble learning. For example, EasyEnsemble and BalanceCascade use bagging as the main ensemble learning method, but each base classifier is trained by AdaBoost, so the final classifier is an ensemble of ensembles.
The stacking algorithm is another ensemble method; it can use the same base classifiers as bagging and AdaBoost, but its structure is different. Stacking has a two-level structure consisting of Level 0 classifiers and Level 1 classifiers, and involves two steps. The first step is to collect the output of each model into a new data set: for each instance in the original training set, the new data set records every base model's prediction of that instance's class. In the second step, a learning algorithm is employed to train a second-layer model on the new data and the true labels of each instance in the original training set. In Wolpert's terminology, the first step is referred to as the Level 0 layer and the second-stage learning algorithm as the Level 1 layer [25].
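The two steps above can be sketched with off-the-shelf components. The example below uses scikit-learn's `StackingClassifier` with the same base learners the paper later adopts (NB, a decision tree, and k-NN) and logistic regression as the Level 1 model; the data set and parameter choices are illustrative, not from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# A synthetic imbalanced two-class data set (about 90% negatives).
X, y = make_classification(n_samples=300, weights=[0.9], random_state=0)

# Level 0: base classifiers; their cross-validated predictions become the
# meta-features.  Level 1: logistic regression combines them.
stack = StackingClassifier(
    estimators=[("nb", GaussianNB()),
                ("tree", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),
    cv=5,                          # k-fold CV builds the Level 1 inputs
    stack_method="predict_proba",
)
stack.fit(X, y)
print(stack.predict(X[:5]))
```

Note that the Level 1 inputs are out-of-fold predictions, which is what prevents the metalayer from simply memorizing the base classifiers' training-set behavior.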
Stacking ensemble is a general method in which a high-level model is combined with lower-level models and can achieve higher predictive accuracy. Chen et al. adopted ant colony optimization to configure stacking ensembles for data mining [26]. Kadkhodaei and Moghadam proposed an entropy-based approach to search for the best combination of base classifiers in ensembles based on stacked generalization [27]. Czarnowski and Jędrzejowicz focused on machine classification with data reduction based on stacked generalization [28]. Most previous studies focused on how to use or generate the stacking algorithm. However, the stacking ensemble does not consider the data distribution and is therefore suitable for common data sets rather than imbalanced data.
To solve the imbalance problem, this paper introduces cost-sensitive learning into the stacking ensemble and adds a misclassification cost adjustment function to the weights of instances and classifiers. In this way, misclassification costs can be considered in the data set as a form of data-space weighting to select the best distribution for training. On the other hand, in the combination stage, metatechniques can be integrated with cost-sensitive classifiers to replace standard cost-minimizing techniques. The weights for misclassifying positive instances are set higher and the weights for misclassifying negative instances relatively lower. The method thus provides an option for imbalanced learning domains.
In this paper, the RE-sample and Cost-Sensitive Stacked Generalization (RECSG) method is proposed to solve the imbalance problem. In this method, preprocessed imbalanced data are used to train the Level 0 layer model. Unlike common ensemble algorithms for imbalanced data, the proposed method uses a cost-sensitive algorithm as the Level 1 (meta) layer. Stacking methods combined with imbalanced-data approaches, including cost-sensitive learning, have been reported. Kotsiantis proposed a stacking variant with cost-sensitive models as base learners [29]; in that method, the model tree was replaced by MLR in the metalayer to determine the class with the highest probability associated with the true class. Lo et al. proposed a cost-sensitive stacking method for audio tag annotation and retrieval [30]. In these methods, cost-sensitive learners are adopted in the base-layer model and the metalayer model is trained by other learning algorithms such as SVM and decision trees. In this paper, the Level 0 model generalizer involves resampling the data and training the base classifiers, and the cost-sensitive algorithm is used to train the Level 1 metalayer classifier. The two layers both adopt imbalanced-data algorithms and take full advantage of mature methods. The Level 1 layer model has a bias towards the performance of the minority class. Therefore, the proposed method is more efficient than methods in which cost-sensitive algorithms are used only in the Level 0 layer.
The method was compared with common classification methods, including other ensemble algorithms. Additionally, the evaluation metrics of the algorithm were analyzed based on the results of statistical tests, which demonstrated that the proposed approach can effectively solve the imbalance problem.
The paper is structured as follows. Related ensemble approaches and cost-sensitive algorithms are introduced in Section 2. Section 3 presents the details of the proposed RECSG approach, including the Level 0 and Level 1 model generalizers. In Section 4, the experiments and the corresponding results and analysis are presented, and statistical tests of the evaluation metrics are analyzed and discussed. Finally, Section 5 discusses the advantages and disadvantages of the proposed method.

Background
2.1. Performance Evaluation in Imbalanced Domains. The evaluation metric is a vital factor in classifier modeling and performance assessment. The confusion matrix (Table 1) summarizes the correctly and incorrectly classified instances of each class in a two-class problem.
Accuracy is the most popular evaluation metric. However, it cannot effectively measure the correct rates of all the classes, so it is not an appropriate metric for imbalanced data sets. For this reason, more suitable metrics should be considered for the imbalance problem in addition to accuracy. Based on the confusion matrix (Table 1), the commonly used per-class metrics are

Precision = TP / (TP + FP),
Recall (sensitivity) = TP / (TP + FN),
Specificity = TN / (TN + FP),
F-measure = (1 + β²) · Precision · Recall / (β² · Precision + Recall),

where β is a coefficient adjusting the relative importance of precision versus recall (usually β = 1).
The combined evaluation metrics of these measures include the receiver operating characteristic (ROC) graphic [31], the area under the ROC curve (AUC) [2], the geometric mean of sensitivity and specificity (GeoMean) [32] (see (4)), and the adjusted geometric mean (AGeoMean) [33] (see (5)):

GeoMean = sqrt(sensitivity · specificity),   (4)
AGeoMean = (GeoMean + specificity · P_n) / (1 + P_n) if sensitivity > 0, and 0 otherwise,   (5)

where P_n refers to the proportion of majority samples.
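Equations (4) and (5) are straightforward to compute from the confusion-matrix counts. The sketch below follows the definitions above (the AGeoMean form is reconstructed from the cited definition, so treat the exact formula as an assumption); function names are illustrative.

```python
import math

def geo_mean(tp, fn, tn, fp):
    """GeoMean = sqrt(sensitivity * specificity), eq. (4)."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return math.sqrt(sens * spec)

def a_geo_mean(tp, fn, tn, fp, p_maj):
    """AGeoMean, eq. (5): weights specificity by the majority-class
    proportion p_maj; defined as 0 when sensitivity is 0."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    if sens == 0:
        return 0.0
    return (math.sqrt(sens * spec) + spec * p_maj) / (1 + p_maj)

# Example: 25 positives (20 found), 75 negatives (70 rejected).
print(round(geo_mean(20, 5, 70, 5), 3))  # → 0.864
```

Because both metrics multiply (or average) the per-class rates, a classifier that ignores the minority class scores near zero, unlike plain accuracy.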

2.2. Stacking.
Bagging and AdaBoost are the most common ensemble learning algorithms. In bagging, different base classifier models generate different classification results and the final decision is made by majority voting. In AdaBoost, a series of weak base classifiers are trained on the whole training set and the final decision is generated by a weighted majority voting scheme; in each training iteration, different weights are attributed to each instance. In both algorithms, the base classifiers are of the same type.
The stacking ensemble is another ensemble algorithm, in which the prediction results of the base classifiers are used as attributes to train the combination function in the metalayer classifier [25]. The algorithm has a two-level structure consisting of Level 0 classifiers and Level 1 classifiers. It was proposed by Wolpert and used by Breiman [34] and LeBlanc and Tibshirani [35].
Consider a data set D = {(x_i, y_i), i = 1, ..., n}, where x_i is a vector of attribute values of instance i and y_i is its class value. All n instances are randomly split into K equal parts, and K-fold cross-validation is used to train the model. The predictions on each held-out fold together give a vector covering all n instances. Let L_j, j = 1, ..., T, be the model obtained by the j-th base learning algorithm on the data set D; the models L_j constitute the Level 0 models. The Level 0 layer thus consists of T base classifiers, which are employed to estimate the class probability of each instance.

The resulting intermediate data are used as the training data of the Level 1 layer model: the base classifiers' predictions are treated as features and the true class of each instance as the output space. The next step is to train a combining learning algorithm on these data. This process is called the Level 1 generalizer, and the Level 1 model is denoted by M, which can be regarded as a function of (L_1, L_2, ..., L_j, ..., L_T).

Then, the model M is used to combine the base classifiers and predict the final result for new instances.
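The construction of the Level 1 training matrix from out-of-fold predictions can be sketched as follows (a minimal illustration using scikit-learn's `cross_val_predict`; the data set and the choice of three base learners are assumptions matching the paper's later setup).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, weights=[0.85], random_state=0)

# Level 0: each base learner's out-of-fold probability for the positive
# class becomes one column of the Level 1 training matrix.
bases = [GaussianNB(), DecisionTreeClassifier(random_state=0),
         KNeighborsClassifier()]
meta_X = np.column_stack([
    cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    for clf in bases
])
print(meta_X.shape)  # (200, 3): one meta-feature per base classifier
```

The Level 1 model M is then fitted on `meta_X` against the true labels `y`.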
In this paper, for imbalanced data, we propose a stacked generalization based on cost-sensitive classification. In the Level 0 model generalizer layer, we first resample the imbalanced data; resampling approaches include oversampling and undersampling.

Stacked Generalization for the Imbalance Problem
The proposed RECSG architecture includes two layers. The first layer (Level 0), consisting of the classifier ensemble, is called the base layer; the second layer (Level 1), which combines the base classifiers, is called the metalayer. The flowchart of the architecture is shown in Figure 1.

Level 0 Model Generalizers.
For the imbalance problem, the Level 0 model generalizer step of RECSG includes preprocessing the data and training the base classifiers. First, the oversampling (SMOTE) method is used to preprocess the data for each base classifier. Second, the base classifier models are trained on the new data set. At this level, we employ three base algorithms: Naïve Bayes (NB) [36], the decision tree C4.5 [37], and k-nearest neighbors (k-NN) [38].
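The core of SMOTE is interpolation between minority points rather than duplication. The following is a bare-bones sketch of that idea (not the reference SMOTE implementation, which also handles categorical features and neighbor selection more carefully); the function name and parameters are illustrative.

```python
import numpy as np

def smote_like(X_min, n_new, k=2, rng=None):
    """Generate synthetic minority samples by interpolating each seed
    point toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # never pick the point itself
    nn = np.argsort(d, axis=1)[:, :k]    # indices of k nearest neighbours
    seeds = rng.integers(0, len(X_min), n_new)
    neigh = nn[seeds, rng.integers(0, min(k, len(X_min) - 1), n_new)]
    gap = rng.random((n_new, 1))         # random position on the segment
    return X_min[seeds] + gap * (X_min[neigh] - X_min[seeds])

X_min = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [0.0, 1.0]]
synth = smote_like(X_min, n_new=6, rng=0)
print(synth.shape)  # (6, 2)
```

The synthetic points lie on segments between real minority samples, which spreads the minority region instead of stacking exact copies on it.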
(1) Naïve Bayes (NB). Given an instance x, let P(c | x) be the posterior probability of class c. By Bayes' theorem and the conditional-independence assumption,

P(c | x) ∝ P(c) · ∏_a P(x_a | c),

where NB uses a Laplacian estimate to estimate the conditional probabilities P(x_a | c). (2) Decision Tree (C4.5). Decision tree classification is a method commonly used in data mining [39]. A decision tree is a tree in which each nonleaf node is labeled with an input feature and each leaf node with a class or a probability distribution over the class labels. By recursive partitioning, a tree is "learned" until no prediction value is added or the subset has the same value. When information entropy is used as the attribute-selection criterion on the training data, the decision tree is called C4.5.
(3) k-NN. k-NN is a nonparametric algorithm used for regression or classification. An instance is classified by a majority vote of its k nearest neighbors; if k = 1, the instance is simply assigned the class label of its nearest neighbor.
All the above algorithms are simple and have low complexity, so they are suitable weak base classifiers.

Level 1 Model Generalizers.

The prediction results of the base classifiers in the Level 0 layer are used as the input space, and the true class of each instance is used as the output space. Based on the Level 0 layer, the Level 1 layer model is trained by another learning algorithm. For imbalanced data, in this paper a cost-sensitive algorithm is used to train the Level 1 metalayer classifier.

Cost-Sensitive Classifier.

Aiming at the imbalanced-data learning problem, the cost-sensitive classifier uses different cost matrices as the penalty for misclassified instances [40]. For example, a cost matrix C for a binary classification scenario has the structure shown in Table 2.
In the cost matrix, the rows indicate the predicted classes, whereas the columns indicate the actual classes. The cost of a false negative is denoted c01 and the cost of a false positive c10. Conceptually, the cost of a correctly classified instance should always be less than that of an incorrectly classified one. In the imbalance problem, c10 is always greater than c01. For the German credit data set previously reported as part of the Statlog project [41], the cost matrix is given in Table 3.
From an economic point of view, the cost of falsely predicting "good" is greater than the cost of falsely predicting "bad". Therefore, the cost-sensitive algorithm is adopted in the Level 1 classifier of stacking for the imbalance problem.
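Given class probabilities, a cost matrix turns prediction into a minimum-expected-cost decision. The sketch below uses an illustrative cost matrix (the values are assumptions, not the paper's German-credit matrix) and the standard Bayes-risk rule.

```python
import numpy as np

# Illustrative cost matrix: rows = predicted class, columns = actual
# class, with class 1 the minority ("positive") class.  Misclassifying a
# positive instance (C[0, 1]) is made more expensive than the reverse.
C = np.array([[0.0, 5.0],   # predict 0: cost 0 if actual 0, 5 if actual 1
              [1.0, 0.0]])  # predict 1: cost 1 if actual 0, 0 if actual 1

def min_cost_class(proba, cost):
    """Pick the class with the lowest expected cost given the class
    probabilities `proba` (a standard Bayes-risk rule)."""
    expected = cost @ proba          # expected cost of each prediction
    return int(np.argmin(expected))

# With P(y=1) = 0.25 a plain argmax would predict class 0, but the skewed
# costs push the decision to the minority class:
print(min_cost_class(np.array([0.75, 0.25]), C))  # → 1
```

This is exactly how a cost matrix biases a classifier toward the minority class without touching the training data.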

Logistic Regression.
Ting and Witten showed that the MLR (Multiresponse Linear Regression) algorithm has an advantage over other Level 1 generalizers in stacking [42]. In this paper, the logistic regression classifier is used as the metalayer algorithm, and the cost of misclassification is incorporated through the cost-sensitive algorithm. In the logistic regression classifier, the prediction results of the Level 0 layer are used as the attributes of the Level 1 metalayer, the true class of each instance is used as the output space, and a linear model is fitted for each class. Details of the implementation of RECSG are presented below. The metalayer model M is constructed from the data D_meta, which is first predicted with the Level 0 layer (line (3) in Pseudocode 1). Finally, the Level 1 layer classifier (cost-sensitive logistic regression) is used to predict the final labels of the test samples.
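A cost-sensitive logistic regression metalayer can be approximated by weighting the loss toward the minority class. The sketch below is an illustrative stand-in for the paper's Level 1 step (the data set, the weight values, and the use of `class_weight` in place of an explicit cost matrix are all assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Logistic regression whose loss weights minority-class errors 5x more
# heavily; class_weight plays the role of the misclassification costs.
meta = LogisticRegression(class_weight={0: 1, 1: 5})
meta.fit(X_tr, y_tr)
print(meta.score(X_te, y_te))
```

In the full RECSG pipeline the inputs to this model would be the Level 0 meta-features rather than the raw attributes used here.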

Empirical Investigation
The experiments aim to verify whether the RECSG approach can improve classification performance for the imbalance problem. The RECSG approach was compared with other algorithms involving various combinations of ensemble techniques and imbalanced-data approaches. For each method, the same training, testing, and validation sets were used.

Experimental Settings.
Experiments were implemented with 17 data sets from the UCI Machine Learning Repository [44]. These data sets cover various fields and vary in IR value (from 0.54 to 0.014), in the number of samples (from 173 to 2338), and in the amount of class overlap (see the KEEL repository [45]). Multiclass data sets were modified to obtain two-class imbalanced problems: the union of one or more classes became the positive class and the union of the remaining classes was labeled as the negative class. A brief description of the data sets is presented in Table 5, including the total number of instances (#Sam.), the number of instances in each class (#Min., #Maj.), the imbalance ratio (IR, the ratio of the number of minority-class instances to majority-class instances), and the number of features (#Fea.).
Our method was compared with other ensemble algorithms, including AdaBoost with Naïve Bayes, AdaBoost with a cost-sensitive classifier, bagging with Naïve Bayes, bagging with a cost-sensitive classifier, and stacking with cost-sensitive NB, k-NN, C4.5, and logistic regression. All experiments were performed with 10-fold cross-validation.

Experimental Results.

Tables 6, 7, and 8 show the results of the three metrics (AUC, GeoMean, and AGeoMean) for the algorithms in Table 4 obtained with the data sets in Table 5. The best result on each data set is emphasized in boldface. The results show that the performance of the proposed RECSG method was the best for 12 of the 17 data sets in terms of GeoMean and AGeoMean and for 10 of the 17 data sets in terms of AUC. Some methods are better than others on some evaluation metrics, but none is best on all metrics and most data sets.
In Table 6 (AUC), the performance of the RECSG method is better than that of the 3 single-base classification methods on 14 of 17 data sets and better than that of the other 5 ensemble algorithms on 13 of 17 data sets. In Table 7 (GeoMean), the RECSG method outperforms all 3 single-base classification methods on 15 of 17 data sets and the other 5 ensemble algorithms on 15 of 17 data sets. In Table 8 (AGeoMean), the RECSG method outperforms all 3 single-base classification methods on 14 of 17 data sets and the other 5 ensemble algorithms on 16 of 17 data sets. Across the 17 data sets, the improvements in GeoMean and AGeoMean are larger than those in AUC. A statistical testing methodology [46] is adopted to compare the different algorithms. We use two types of comparisons: pairwise comparisons (between a pair of algorithms) and multiple comparisons (among a group of algorithms).

Pair of Algorithms.
We performed statistical t-tests to explore whether the RECSG approach is significantly better than the other algorithms on the three metrics (AUC, GeoMean, and AGeoMean). Table 9 shows the results of the RECSG approach compared with the other methods in terms of AUC, GeoMean, and AGeoMean. The values in square brackets indicate the number of metrics with a statistically significant difference in the t-test at the confidence level α = 0.05. For the evaluation metric AUC, RECSG outperformed NB on 11 of 17 data sets, and 5 data sets showed a statistically significant difference at α = 0.05.
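A paired t-test of this kind compares two methods on the same data sets. The sketch below uses invented per-data-set AUC scores purely for illustration (the paper's Table 9 holds the real comparisons).

```python
from scipy import stats

# Hypothetical per-data-set AUC scores for two methods.
auc_recsg = [0.91, 0.88, 0.93, 0.85, 0.90, 0.87, 0.92]
auc_nb    = [0.86, 0.84, 0.90, 0.83, 0.85, 0.86, 0.88]

# Paired (related-samples) t-test: each pair comes from the same data set.
t_stat, p_value = stats.ttest_rel(auc_recsg, auc_nb)
print(p_value < 0.05)   # significant at the alpha = 0.05 level?
```

Pairing matters here: the per-data-set differences, not the raw scores, carry the comparison, which removes the large data-set-to-data-set variation.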

Multiple Comparisons.
We performed the Holm post hoc test [47] on multiple groups for the three metrics (AUC, GeoMean, and AGeoMean). Post hoc procedures determine whether a comparison hypothesis should be rejected at a specified confidence level α. The statistical experiments were performed on the platform at http://tec.citius.usc.es/stac/. Table 10 reports whether the RECSG approach is significantly different at the confidence level α = 0.05 in terms of AUC, GeoMean, and AGeoMean; 0 indicates no significant difference. According to the evaluation metric GeoMean, RECSG is significantly better than the other 8 methods at the confidence level α = 0.05.
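The Holm procedure itself is a short step-down correction over the pairwise p-values. The sketch below implements the standard procedure; the nine p-values are hypothetical, standing in for the paper's nine pairwise comparisons.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: sort p-values in ascending order and
    compare the i-th smallest against alpha / (m - i); stop rejecting at
    the first comparison that fails."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

# Hypothetical p-values from 9 pairwise comparisons:
result = holm_reject([0.001, 0.004, 0.019, 0.03, 0.002,
                      0.04, 0.008, 0.2, 0.5])
print(result)
```

Compared with a plain Bonferroni correction (which would test every p-value against alpha/m), Holm is uniformly at least as powerful while still controlling the family-wise error rate.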

Discussion.
According to the experiments performed on 17 different imbalanced data sets and the comparison with 9 other classification methods, the proposed RECSG method showed higher performance than the other methods in terms of AUC, GeoMean, and AGeoMean, as illustrated in Tables 6, 7, and 8. The RECSG method showed the best performance in 10 of 17 cases in terms of AUC and in 12 of 17 cases in terms of GeoMean and AGeoMean. The method outperformed the other 5 ensemble algorithms on 16 of 17 data sets in terms of GeoMean and AGeoMean and on 13 of 17 data sets in terms of AUC. Across the 17 data sets, the improvements in GeoMean and AGeoMean are larger than those in AUC. The means of the RECSG method in terms of GeoMean, AGeoMean, and AUC are all higher than those of the other methods. The Holm post hoc test shows that the RECSG method significantly outperforms 8 of the 9 methods in terms of GeoMean and AGeoMean and 3 of the 9 methods in terms of AUC at the confidence level α = 0.05. The experimental results and statistical tests show that the RECSG approach improves the classification performance on imbalanced data sets. The reasons can be explained as follows.
First, the stacking algorithm uses the model M to combine the results of the base classifiers, whereas bagging employs a majority vote. Bagging is only a simple decision-combination method that requires neither cross-validation nor Level 1 learning. In this paper, the stacked generalization M adopts logistic regression, providing the simplest linear combination for pooling the Level 0 models' confidences.
Second, the cost-sensitive algorithm affects imbalanced data sets in two ways. Cost-sensitive modifications can be applied to the probabilistic estimates, and cost-sensitive factors change the instance weights so that the weight of the minority class is higher than that of the majority class. The method therefore directly changes the influence of the common instances without discarding or duplicating any of the rare instances.
Third, the RECSG method introduces cost-sensitive learning into the stacking ensemble. The cost-sensitive function in M replaces the error-minimizing function, allowing the Level 1 learner to focus on the minority class.
The results in Tables 7 and 8 demonstrate that the RECSG method has higher performance when the evaluation metric of the base classifiers is weaker (e.g., Vehicle1, Vehicle0, and car-vgood). The reason is that the alternative methods NB, C4.5, and k-NN have shortcomings: the independence assumption hampers the performance of NB on some data sets; in C4.5 tree construction, the selection of attributes affects model performance; and the error rate of k-NN is relatively high when data sets are imbalanced. Therefore, the logistic regression adopted in M can improve performance when the base classifiers are weaker.
The performance of the RECSG method is generally better when the IR is low (e.g., Glass2, car-good, flare-F, car-vgood, and abalone-17 vs 7-8-9-10). This is probably related to the setting of the cost matrix: different data sets should use different cost matrices, but for simplicity we adopted the same cost matrix throughout, which may be more suitable for low IR values.

Conclusions
In this paper, to solve the class imbalance problem, we proposed the RECSG method based on a two-layer learning model. Experimental results and statistical tests showed that the RECSG approach improves classification performance. The proposed RECSG approach may have relatively high computational complexity in the training stage because it involves two layers of classifier models.
Pseudocode 1 presents the proposed RECSG approach. The input parameters are two data sets, the training set and the testing set, and the output is the predicted class labels of the test samples. The first steps in the process are data preprocessing, resampling new instances, and training the models L_j (j = 1, ..., T), where T is the number of base classifiers and L_j(x) is the prediction function of the j-th base classifier (lines (1)-(2) in Pseudocode 1).

Table 6: Results of the methods in terms of AUC.
Table 7: Results of the methods in terms of GeoMean.
Table 8: Results of the methods in terms of AGeoMean.

Table 1 :
Confusion matrix for performance evaluation.

Table 3 :
Cost matrix of credit data sets.

Table 5 :
Summary of imbalanced data sets.

Table 9 :
Comparison between RECSG and other algorithms in terms of wins/losses of the results.

Table 10 :
Holm post hoc test to show differences of Stacking-Cost-log (AUC) and other algorithms.