An Empirical Study on the Performance of Cost-Sensitive Boosting Algorithms with Different Levels of Class Imbalance

Cost-sensitive boosting algorithms have proven successful for solving difficult class imbalance problems. However, the influence of misclassification costs and imbalance level on algorithm performance is still not clear. The present paper conducts an empirical comparison of six representative cost-sensitive boosting algorithms: AdaCost, CSB1, CSB2, AdaC1, AdaC2, and AdaC3. These algorithms are thoroughly evaluated by a comprehensive suite of experiments, in which nearly fifty thousand classification models are trained on 17 real-world imbalanced data sets. Experimental results show that AdaC serial algorithms generally outperform AdaCost and CSB across data sets with different imbalance levels. Furthermore, AdaC2 performs best around the misclassification cost setting $C_N = 0.7$, $C_P = 1$, especially on strongly imbalanced data sets. For data sets with a low-level imbalance, there is no significant difference between the AdaC serial algorithms. In addition, the results indicate that AdaC1 is comparatively insensitive to the misclassification costs, which is consistent with the findings of preceding research.


Introduction
Classification is an important task of knowledge discovery and data mining. A large number of classification algorithms have been well developed, such as decision trees, neural networks, Bayesian networks, and support vector machines. These algorithms generally assume a relatively balanced class distribution. However, class imbalance problems are frequently encountered in many real-world applications including medical diagnosis [1], fraud detection [2], fault diagnosis [3], text categorization [4], and DNA sequence analysis [5]. The class imbalance problem has emerged as an intractable issue due to the difficulty caused by the imbalanced class distribution.
The imbalanced class distribution is characterized as having many more instances of some classes than others. Particularly for the two-class task that we consider in this paper, it occurs when samples of the majority class representing the negative concept outnumber samples of the minority class representing the positive concept. It has been reported that conventional classifiers exhibit serious performance degradation for class imbalance problems, since they show a strong bias towards the majority class. However, the correct classification of the minority class is usually more important than that of the majority class. For example, in medical diagnosis the recognition goal is to provide a higher identification rate of rare diseases.
In view of the importance of this issue, a great deal of research work has been carried out in recent years [6][7][8]. The main research can be categorized into three groups. The first group focuses on the approaches for handling the imbalance both at the data and algorithm levels. The second group explores proper evaluation metrics for imbalanced learning algorithms [9,10]. The third one is to study the nature of the class imbalance problem; that is, what data characteristics aggravate the problem, and whether there are other factors that lead to performance reduction of classifiers [11,12].
Data level techniques add a preprocessing step to rebalance the class distribution by resampling the data space, including oversampling positive instances and undersampling negative instances [13,14]. There are also some methods that involve a combination of the two sampling methods [15,16]. When discussing what is the best data level solution for this issue, Van Hulse et al. [17] suggested that the utility of each particular resampling strategy depends on various factors, including the imbalance ratio, the characteristics of data, and the nature of classifier. Recently, García et al. [18] significantly extended previous works by deeply investigating the influences of the imbalance ratio and the classifier on the effectiveness of the most popular resampling strategies. Their experimental results showed that oversampling consistently outperforms undersampling for strongly imbalanced data sets, whereas there are no significant differences for data sets with a low-level imbalance.
At the algorithm level, the objective is to adapt existing learning algorithms to bias towards the minority class. These methods require special knowledge of both the corresponding classifier and the application domain [19,20]. In addition, cost-sensitive learning algorithms fall between data and algorithm level approaches. They incorporate both data level transformations (by adding costs to instances) and algorithm level modifications (by modifying the learning process to accept costs). Interested readers can refer to the relevant literature [21][22][23][24][25][26].
In recent years, ensemble-based learning algorithms have arisen as a group of popular methods for solving class imbalance problems. The modification of the ensemble learning algorithm includes data level approaches to preprocess the data before learning each classifier [27,28]. Besides, some proposals consider the embedding of the cost-sensitive framework in the ensemble learning process, which is also known as cost-sensitive boosting [29][30][31][32][33][34]. For this kind of algorithm, proper misclassification costs are essential for good performance. When handling imbalanced classification problems, the imbalance level of the data set will undoubtedly impact the optimal misclassification costs of these algorithms. To the best of our knowledge, it is still not clear how the misclassification costs and imbalance level affect the performance of these cost-sensitive boosting algorithms.
Motivated by the previous analysis, we made a thorough empirical study to investigate the effect of both imbalance level and misclassification costs on the performance of some popular cost-sensitive boosting algorithms, including AdaCost [29], CSB1, CSB2 [30], and AdaC serial algorithms (AdaC1, AdaC2, and AdaC3) [31]. To this end, we carry out a comprehensive suite of experiments by employing 17 real-world data sets, four performance metrics, and nearly fifty thousand trained models, providing a complete perspective on the performance evaluation. The comparison results are tested for statistical significance via the paired t-test and visualized by multidimensional scaling analysis.
The rest of this paper is organized as follows. Section 2 reviews several cost-sensitive boosting algorithms based on AdaBoost. In Section 3, we describe the experimental framework, including experimental data sets, cost setups, performance measures, and experimental approaches. In Section 4, we discuss experimental results to obtain some valuable findings. Finally, conclusions and some future work are outlined in Section 5.

AdaBoost and Its Cost-Sensitive Modifications
Ensemble methods have emerged as meta-techniques for improving the generalization performance of existing learning algorithms. The basic idea is to construct several classifiers from the original data and then combine them to obtain a new classifier that outperforms each one of them. Boosting and bagging are the most widely used ensemble learning algorithms, which have led to significant improvements in many real-world applications.

AdaBoost Algorithm.
As the first applicable approach, AdaBoost [35] has been the most representative algorithm in the family of boosting. In particular, AdaBoost has been identified as one of the top ten data mining algorithms [36].
AdaBoost uses the whole data set to train base classifiers serially and gives each sample a weight reflecting its importance. At the end of each iteration, the weight vector is adjusted so that the weights of misclassified instances are increased and those of correctly classified ones are decreased. Furthermore, another weight vector is assigned to individual classifiers depending on their accuracy. When a test instance is submitted, each classifier gives a weighted vote, and the final predicted class label is selected by majority. The pseudocode for the AdaBoost algorithm is shown in Algorithm 1.
The sample weighting strategy of AdaBoost is equivalent to resampling the data space, combining both undersampling and oversampling. When dealing with imbalanced data sets, AdaBoost tends to improve the identification accuracy of the positive class since it focuses on misclassified samples. This makes AdaBoost an attractive algorithm for class imbalance problems.
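The reweighting loop described above can be sketched in a few lines. This is only an illustrative sketch, assuming labels in {-1, +1} and one-level decision stumps as base classifiers (the base learner, the helper names `train_stump`, `stump_predict`, `adaboost`, and `adaboost_predict`, and the clamping constant are assumptions made for the example, not details taken from the paper).

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                # Predict `pol` at or below the threshold and `-pol` above it.
                pred = [pol if row[f] <= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def stump_predict(stump, row):
    _, f, thr, pol = stump
    return pol if row[f] <= thr else -pol

def adaboost(X, y, T=10):
    """Labels y must be in {-1, +1}. Returns a list of (alpha_t, stump_t)."""
    m = len(X)
    D = [1.0 / m] * m  # uniform initial distribution D_1
    ensemble = []
    for _ in range(T):
        stump = train_stump(X, y, D)
        # Clamp the weighted error so the logarithm stays finite (assumption).
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weights of misclassified samples, lower the others.
        D = [d * math.exp(-alpha * yi * stump_predict(stump, xi))
             for d, xi, yi in zip(D, X, y)]
        Z = sum(D)  # normalization constant Z_t
        D = [d / Z for d in D]
    return ensemble

def adaboost_predict(ensemble, row):
    """Weighted vote of all base classifiers."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

For example, `adaboost([[0.0], [1.0], [2.0], [3.0]], [-1, -1, 1, 1], T=3)` returns three weighted stumps whose weighted vote reproduces the training labels.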

Cost-Sensitive Boosting Algorithms.
However, since AdaBoost is an accuracy-oriented algorithm, its learning strategy may bias towards the negative class, which contributes more to the overall accuracy. Moreover, reported works show that the improved identification performance on the positive class is not always satisfactory. Hence, AdaBoost needs to adapt its boosting strategy towards the cost-sensitive learning framework.
Cost-sensitive learning assigns different costs to different types of misclassification. Let $C(i, j)$ denote the cost of predicting an example from class $i$ as class $j$. For the two-class case, the cost of misclassifying a positive instance is denoted by $C_P$, and the contrary one is denoted by $C_N$. The recognition importance of positive instances is higher than that of negative instances. Hence, the cost of misclassifying the positive class is greater than that of the negative class; that is, $C_P > C_N$. Cost-sensitive learning adds the cost matrix into the model building process and generates a model that minimizes the total misclassification cost.
In the present paper, we focus on several representative algorithms in the family of cost-sensitive boosting, including AdaCost [29], CSB1, CSB2 [30], and AdaC serial algorithms (AdaC1, AdaC2, and AdaC3) [31]. They differ in how they introduce cost items into the weighted distribution $D_t$ in the AdaBoost framework.
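Algorithm 1: AdaBoost algorithm. In the weight update step, $Z_t$ is the normalization constant chosen so that $D_{t+1}$ will be a distribution, that is, $\sum_{i} D_{t+1}(i) = 1$. (iv) Output: the final hypothesis $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.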

AdaCost.
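In this algorithm, the weight update rule increases the weights of misclassified samples more aggressively but decreases the weights of correctly classified samples more conservatively. This is accomplished by introducing the cost adjustment function $\beta$ into the weight update formula as follows:
$$D_{t+1}(i) = \frac{D_t(i) \exp\left(-\alpha_t y_i h_t(x_i) \beta_{\operatorname{sgn}(h_t(x_i), y_i)}\right)}{Z_t},$$
where $\operatorname{sgn}(h_t(x_i), y_i)$ denotes "+" when $h_t(x_i)$ equals $y_i$ (that is, $x_i$ is correctly classified) and "−" otherwise. Fan et al. [29] provided the recommended setting:
$$\beta_{+} = -0.5 C_i + 0.5, \qquad \beta_{-} = 0.5 C_i + 0.5,$$
where $C_i$ is the cost of misclassifying the $i$th example and $\beta_{+}$ ($\beta_{-}$) denotes the output for correctly (incorrectly) classified samples. Writing the class of the sample as a superscript, $C_P > C_N$ gives $\beta_{+}^{+} < \beta_{+}^{-}$ and $\beta_{-}^{+} > \beta_{-}^{-}$. Hence, a false negative receives a greater weight increase than a false positive, and a true positive loses less weight than a true negative.
The weight updating parameter $\alpha_t$ is computed as
$$\alpha_t = \frac{1}{2} \ln \frac{1 + r_t}{1 - r_t},$$
where
$$r_t = \sum_{i} D_t(i)\, y_i h_t(x_i)\, \beta_{\operatorname{sgn}(h_t(x_i), y_i)}.$$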
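One AdaCost round can be sketched as follows. This is an illustrative sketch, assuming labels and base-classifier outputs in {-1, +1} and per-sample costs in [0, 1]; the function names `adacost_beta` and `adacost_round` are hypothetical, and the formulas follow the standard AdaCost definitions of Fan et al. as described above.

```python
import math

def adacost_beta(correct, cost):
    """Cost adjustment of Fan et al.; `cost` is assumed to lie in [0, 1]."""
    return -0.5 * cost + 0.5 if correct else 0.5 * cost + 0.5

def adacost_round(D, y, h, C):
    """One AdaCost round: returns (alpha_t, D_{t+1}).

    D: current sample weights, y: true labels, h: base classifier outputs,
    C: per-sample misclassification costs.
    """
    betas = [adacost_beta(yi == hi, ci) for yi, hi, ci in zip(y, h, C)]
    r = sum(d * yi * hi * b for d, yi, hi, b in zip(D, y, h, betas))
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly inside (-1, 1)")
    alpha = 0.5 * math.log((1 + r) / (1 - r))
    new_D = [d * math.exp(-alpha * yi * hi * b)
             for d, yi, hi, b in zip(D, y, h, betas)]
    Z = sum(new_D)  # normalization constant
    return alpha, [d / Z for d in new_D]
```

With two positives (one missed), eight negatives (one false alarm), uniform weights, $C_P = 1$, and $C_N = 0.2$, the updated weights are ordered false negative > false positive > true positive > true negative, which matches the behavior stated in the text.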

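CSB1 and CSB2.
CSB1 modifies the weight update formula of AdaBoost to
$$D_{t+1}(i) = \frac{D_t(i)\, C_{\operatorname{sgn}(h_t(x_i), y_i)} \exp\left(-y_i h_t(x_i)\right)}{Z_t},$$
and CSB2 changes it to
$$D_{t+1}(i) = \frac{D_t(i)\, C_{\operatorname{sgn}(h_t(x_i), y_i)} \exp\left(-\alpha_t y_i h_t(x_i)\right)}{Z_t}.$$
The difference between CSB1 and CSB2 mainly lies in the weight parameter $\alpha_t$: CSB1 does not use any $\alpha_t$ factor, and CSB2 updates $\alpha_t$ in the same way as AdaBoost. Even though the weight update formula of CSB2 is similar to that of AdaC2, CSB2 does not take cost items into consideration in the update rule of parameter $\alpha_t$ in the learning process.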
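The two CSB updates differ only in the exponent, which a single helper can make explicit. The sketch below is illustrative: the function name `csb_update` is hypothetical, labels are assumed to be in {-1, +1}, and the cost factor is assumed to be 1 for correctly classified samples and the given per-sample cost (at least 1) for misclassified ones, following the cost setup attributed to Ting [30].

```python
import math

def csb_update(D, y, h, cost, alpha=None):
    """CSB1 when alpha is None (no alpha factor); CSB2 when AdaBoost's alpha_t is passed."""
    a = 1.0 if alpha is None else alpha
    new_D = []
    for d, yi, hi, ci in zip(D, y, h, cost):
        # Cost factor: 1 for a correct prediction, the sample's cost otherwise.
        factor = 1.0 if yi == hi else ci
        new_D.append(d * factor * math.exp(-a * yi * hi))
    Z = sum(new_D)  # normalization constant
    return [d / Z for d in new_D]
```

In both variants, a misclassified positive with cost 2 ends up with twice the weight of a misclassified negative with cost 1 when they start from equal weights.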

AdaC1.
This algorithm is one of the three cost-sensitive modifications of AdaBoost proposed by Sun et al. [31]. These three algorithms derive different weighted distribution update formulas depending on where they introduce cost items. In AdaC1, cost items are embedded inside the exponent part of the weight update formula:
$$D_{t+1}(i) = \frac{D_t(i) \exp\left(-\alpha_t C_i y_i h_t(x_i)\right)}{Z_t},$$
where $C_i \in [0, +\infty)$ is the cost item associated with the $i$th sample. The weight parameter $\alpha_t$ can be induced in a similar way as in AdaBoost:
$$\alpha_t = \frac{1}{2} \ln \frac{1 + \sum_{i,\, y_i = h_t(x_i)} C_i D_t(i) - \sum_{i,\, y_i \neq h_t(x_i)} C_i D_t(i)}{1 - \sum_{i,\, y_i = h_t(x_i)} C_i D_t(i) + \sum_{i,\, y_i \neq h_t(x_i)} C_i D_t(i)}.$$
Note that AdaCost is a variation of AdaC1 that introduces the cost adjustment function, instead of cost items, inside the exponent part. All the AdaC serial algorithms can be reduced to AdaBoost when all the cost items are equally set to 1, but AdaCost cannot be reduced to AdaBoost.
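Mathematical Problems in Engineering

AdaC2.
Unlike AdaC1, AdaC2 embeds cost items outside the exponent part:
$$D_{t+1}(i) = \frac{C_i D_t(i) \exp\left(-\alpha_t y_i h_t(x_i)\right)}{Z_t}.$$
Accordingly, the computation of the parameter $\alpha_t$ is changed to
$$\alpha_t = \frac{1}{2} \ln \frac{\sum_{i,\, y_i = h_t(x_i)} C_i D_t(i)}{\sum_{i,\, y_i \neq h_t(x_i)} C_i D_t(i)}.$$

AdaC3.
This modification combines the ideas of AdaC1 and AdaC2. Namely, the weight update formula is modified by introducing cost items both inside and outside the exponent part:
$$D_{t+1}(i) = \frac{C_i D_t(i) \exp\left(-\alpha_t C_i y_i h_t(x_i)\right)}{Z_t}.$$
Thereby, the weight parameter $\alpha_t$ is computed as follows:
$$\alpha_t = \frac{1}{2} \ln \frac{\sum_{i} C_i D_t(i) + \sum_{i,\, y_i = h_t(x_i)} C_i^2 D_t(i) - \sum_{i,\, y_i \neq h_t(x_i)} C_i^2 D_t(i)}{\sum_{i} C_i D_t(i) - \sum_{i,\, y_i = h_t(x_i)} C_i^2 D_t(i) + \sum_{i,\, y_i \neq h_t(x_i)} C_i^2 D_t(i)}.$$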
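The three AdaC variants can be compared side by side in one routine that only changes where the cost item enters. The following is a sketch under stated assumptions: labels and base-classifier outputs are in {-1, +1}, the formulas are the standard AdaC1, AdaC2, and AdaC3 definitions of Sun et al. [31], and the function name `adac_round` and the `variant` argument are hypothetical.

```python
import math

def adac_round(D, y, h, C, variant):
    """One round of AdaC1, AdaC2, or AdaC3: returns (alpha_t, D_{t+1})."""
    m = len(D)
    ok = [yi == hi for yi, hi in zip(y, h)]
    # Cost-weighted mass of correctly / incorrectly classified samples.
    s_ok = sum(c * d for c, d, k in zip(C, D, ok) if k)
    s_bad = sum(c * d for c, d, k in zip(C, D, ok) if not k)
    if variant == "AdaC1":      # cost inside the exponent only
        alpha = 0.5 * math.log((1 + s_ok - s_bad) / (1 - s_ok + s_bad))
        inside, outside = C, [1.0] * m
    elif variant == "AdaC2":    # cost outside the exponent only
        alpha = 0.5 * math.log(s_ok / s_bad)
        inside, outside = [1.0] * m, C
    elif variant == "AdaC3":    # cost both inside and outside
        q_ok = sum(c * c * d for c, d, k in zip(C, D, ok) if k)
        q_bad = sum(c * c * d for c, d, k in zip(C, D, ok) if not k)
        total = s_ok + s_bad
        alpha = 0.5 * math.log((total + q_ok - q_bad) / (total - q_ok + q_bad))
        inside, outside = C, C
    else:
        raise ValueError("variant must be 'AdaC1', 'AdaC2', or 'AdaC3'")
    new_D = [o * d * math.exp(-alpha * i * yi * hi)
             for o, i, d, yi, hi in zip(outside, inside, D, y, h)]
    Z = sum(new_D)  # normalization constant
    return alpha, [d / Z for d in new_D]
```

Setting every cost item to 1 makes all three variants return the AdaBoost value $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$, which is a quick sanity check of the reduction to AdaBoost mentioned in the text.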

Experimental Framework
In this section, we present the experimental framework used to carry out the empirical study to evaluate the above cost-sensitive boosting algorithms. The aim of this study is to investigate how the algorithm performance is affected when different cost settings and different imbalance levels are considered in the experiment.

Experimental Data Sets.
In the experiment, we employed 17 public imbalanced data sets from the UCI machine learning repository, which have also been used in [18]. The chosen data sets vary in sample size, class distribution, and imbalance ratio in order to ensure a thorough performance assessment. We discuss only binary classification problems in this paper. Some data sets with multiclass labels were transformed into two-class ones by keeping one class as the positive class and joining the remaining classes into the negative class. Table 1 summarizes the properties of the data sets used: for each data set, the total number of samples, the indicia, and the sample sizes of the minority and majority classes. The last column is the imbalance ratio, which is defined as the sample size of the majority class divided by that of the minority class. The table is ordered by imbalance ratio in descending order.
A large imbalance ratio indicates a high-level imbalance, and a small one indicates a low-level imbalance. The experimental data sets were divided into two groups according to the imbalance ratio. The first group is deemed strongly imbalanced and includes LetterA, Cbands, Pendigits, Satimage, Optidigts, Mfeat kar, Mfeat zer, Segment, and Scrapie, whose imbalance ratios are larger than 4. The second group consists of the low-level imbalanced data sets: Vehicle, Haberman, Yeast, Breast, Phoneme, German, Pima, and Spambase.
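The imbalance ratio and the grouping rule above amount to a short computation. The sketch below is illustrative: the function names `imbalance_ratio` and `imbalance_group` are hypothetical, and the threshold of 4 is the one stated in the text.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Majority class size divided by minority class size (binary labels assumed)."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    return max(counts.values()) / min(counts.values())

def imbalance_group(ratio, threshold=4.0):
    """Ratios larger than the threshold are treated as strongly imbalanced."""
    return "strongly imbalanced" if ratio > threshold else "low-level imbalance"
```

For instance, a data set with 10 positive and 90 negative samples has an imbalance ratio of 9 and falls into the strongly imbalanced group.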

Cost Setups.
In the experiment, we study the influence of different cost settings on the performance of these cost-sensitive boosting algorithms. In these algorithms, misclassification costs are used to characterize the recognition importance of different samples. The proper misclassification costs are often unavailable in advance and have to be ascertained empirically.
In our experiments, the misclassification costs for samples in the same category are set to the same value: $C_P$ denotes the misclassification cost of the positive class, and $C_N$ denotes that of the negative class. The ratio between $C_P$ and $C_N$ represents the deviation of the learning importance between the two classes. The larger the ratio, the more the weighted sample size of the positive class is boosted to strengthen learning. A range of ratio values is tested to search for the most effective cost setting, and we use the cost setup $C_P = 1$, with $C_N$ varying from 0.1 to 1 in steps of 0.1.
When $C_P = C_N = 1$, AdaC1, AdaC2, and AdaC3 are all reduced to the original AdaBoost algorithm. For CSB1 and CSB2, the cost setup is suggested by Ting [30]: if a sample is correctly classified, $C_P = C_N = 1$; otherwise, $C_P > C_N \geq 1$. According to the proposal of [31], we fix the cost factor for false negatives $C_P$ as 1 and use the cost setting $C_N$ for true positives, true negatives, and false positives.
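The cost grid used for the AdaC serial algorithms can be written down directly. This is a small illustrative sketch: the names `C_P`, `COST_GRID`, and `sample_costs` are hypothetical, and labels are assumed to be +1 for the positive class and -1 for the negative class.

```python
C_P = 1.0
# C_N takes the ten values 0.1, 0.2, ..., 1.0 while C_P stays fixed at 1.
COST_GRID = [round(0.1 * k, 1) for k in range(1, 11)]

def sample_costs(y, c_n, c_p=C_P):
    """Assign the class-level cost to every sample: c_p to positives, c_n to negatives."""
    return [c_p if yi == 1 else c_n for yi in y]
```

Each value of `COST_GRID` yields one cost setting, so six algorithms evaluated over this grid give the 60 models per data set used in the experiments.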

Performance Measures.
The evaluation metric plays a crucial role in both the guidance of the classifier modeling and the assessment of the classification performance. Traditionally, the total accuracy is the most commonly used performance metric. However, accuracy is no longer a proper measure in imbalanced domains, since the positive class makes little contribution to the overall accuracy. For example, a classifier that predicts all samples to be negative in a data set with an imbalance ratio value of 9 may lead to erroneous conclusions although it achieves a high accuracy of 90%.
In the confusion matrix, all samples can be categorized into four groups, as described by Table 2.
The accuracy evaluates the effectiveness of a classifier by the percentage of correct predictions:
$$Acc = \frac{TP + TN}{TP + FN + FP + TN}.$$
When dealing with the class imbalance problem, there are other metrics more appropriate than accuracy. In particular, we can obtain four metrics from the confusion matrix to measure the classification performance of positive and negative classes independently.
True Positive Rate. The percentage of positive instances correctly classified, also known as sensitivity or recall in the information retrieval domain, is as follows:
$$TP_{rate} = \frac{TP}{TP + FN}.$$
True Negative Rate. The percentage of negative instances correctly classified, also known as specificity, is as follows:
$$TN_{rate} = \frac{TN}{TN + FP}.$$
False Positive Rate. The percentage of negative instances misclassified is as follows:
$$FP_{rate} = \frac{FP}{TN + FP}.$$
False Negative Rate. The percentage of positive instances misclassified is as follows:
$$FN_{rate} = \frac{FN}{TP + FN}.$$
On the other hand, when high classification performance on the positive class is demanded, the precision metric is often adopted:
$$Precision = \frac{TP}{TP + FP}.$$
When good performance for both classes is required, none of these metrics is adequate by itself. Hence, some more complex evaluation measures have been devised.
(1) F-Measure. If only the performance of the positive class is considered, $TP_{rate}$ and precision are important metrics. F-measure integrates these two metrics as follows:
$$F\text{-measure} = \frac{2 \cdot TP_{rate} \cdot Precision}{TP_{rate} + Precision}.$$
Evidently, F-measure represents the harmonic mean of precision and recall, and it tends to be closer to the smaller of these two measures. Hence, a higher F-measure value ensures that both recall and precision are reasonably high.
(2) G-Mean. When the performance of both classes is concerned, both $TP_{rate}$ and $TN_{rate}$ are expected to be high simultaneously. The G-mean metric is defined as
$$G\text{-mean} = \sqrt{TP_{rate} \cdot TN_{rate}}.$$
G-mean represents the geometric mean of $TP_{rate}$ and $TN_{rate}$, and so it measures the balanced performance of a learning algorithm between two classes.
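The metrics above follow directly from the four confusion matrix counts. The sketch below is illustrative; the function name `imbalance_metrics` is hypothetical, and returning 0 when a denominator vanishes is a convention chosen for the example, not one stated in the paper.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Compute accuracy, TP rate, TN rate, precision, F-measure, and G-mean."""
    tp_rate = tp / (tp + fn) if tp + fn else 0.0
    tn_rate = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = tp_rate + precision
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "tp_rate": tp_rate,
        "tn_rate": tn_rate,
        "precision": precision,
        # Harmonic mean of recall (TP rate) and precision.
        "f_measure": 2 * tp_rate * precision / denom if denom else 0.0,
        # Geometric mean of the two per-class accuracies.
        "g_mean": math.sqrt(tp_rate * tn_rate),
    }
```

Applied to the all-negative classifier from the accuracy example (10 positives, 90 negatives), `imbalance_metrics(0, 10, 0, 90)` reports an accuracy of 0.9 but an F-measure and a G-mean of 0, which illustrates why accuracy alone is misleading on imbalanced data.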

Experimental Approaches.
For each data set, we performed 5 independent runs of a stratified 10-fold cross-validation to partition the whole data and obtained a 50-dimensional score vector for each algorithm. Moreover, we had 60 models for each data set, which came from 6 cost-sensitive boosting algorithms (AdaCost, CSB1, CSB2, AdaC1, AdaC2, and AdaC3) and 10 cost settings. For comparison purposes, we also included the best results of these models and thus got a total of 61 models in the experiment. Similar to the statistical analysis method used in [18], we adopt the paired t-test to determine whether one algorithm is significantly better than another and then use multidimensional scaling to visually compare the performance of different cost-sensitive algorithms.
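The partitioning scheme that yields 50 scores per model can be sketched as follows. This is only an illustration: the helper names `stratified_folds` and `repeated_stratified_splits`, the round-robin fold assignment, and the fixed seed are assumptions made for the example and do not describe the authors' implementation.

```python
import random

def stratified_folds(y, k, rng):
    """Split sample indices into k folds while preserving class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def repeated_stratified_splits(y, runs=5, k=10, seed=0):
    """Yield (train_indices, test_indices) for runs x k stratified splits."""
    rng = random.Random(seed)
    for _ in range(runs):
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            yield train_idx, sorted(test_idx)
```

With the default arguments the generator produces 5 x 10 = 50 train/test splits, so scoring a model on each test fold gives one 50-dimensional score vector.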
(1) Paired t-Test. Given two paired sets $X$ and $Y$ of $n$ measured values, the paired t-test determines whether they differ from each other significantly under the assumption that the paired differences are independent and identically normally distributed. In light of the central limit theorem, the sampling distribution of the mean is approximately normal if the sample size $n$ is large enough. As a rough rule of thumb, a sample size of 30 is large enough. In this paper, we conduct the statistical comparison between each pair of 50-dimensional score vectors, which can be regarded as approximately normally distributed, and so it is reasonable to employ the parametric paired t-test for statistical comparisons. Based on pairwise comparisons of these algorithms, we computed the index of performance as the difference between wins and losses, where wins (losses) denotes the total number of times that an algorithm has been significantly better (worse) than the others at a significance level of $\alpha = 0.05$.
(2) Multidimensional Scaling. The second complementary analysis tool is multidimensional scaling (MDS) [37,38], which aims at giving a visual comparison of classifier performance with respect to multiple metrics. We built a $61 \times N$ table for each performance metric, where $N$ denotes the number of data sets used in the experiment. Each element $(i, j)$ represents the average score of model $i$ on data set $j$, which was calculated over 5 runs of 10-fold cross-validation. Then, we computed Euclidean distances between each pair of rows in the table and performed multidimensional scaling on the distance matrix in order to obtain a projection onto a 2-dimensional space. We can determine the effectiveness of various algorithms through the dispersal trend of their performance scores towards the optimal point in the MDS space.
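The index of performance can be sketched with the standard library alone. This is an illustrative sketch with two stated assumptions: instead of computing a p-value, the test statistic is compared with the tabulated two-sided critical value of Student's t for 49 degrees of freedom at the 0.05 level (about 2.0096), and the function names `paired_t_statistic` and `index_of_performance` are hypothetical.

```python
import math
import statistics

# Tabulated two-sided critical value for alpha = 0.05 and df = 49 (assumption:
# every score vector has 50 entries, as in 5 runs of 10-fold cross-validation).
T_CRIT_DF49 = 2.0096

def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        # Identical differences: no evidence if they are all zero.
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / (sd / math.sqrt(len(diffs)))

def index_of_performance(scores, t_crit=T_CRIT_DF49):
    """Wins minus losses over all significant pairwise comparisons."""
    index = {name: 0 for name in scores}
    names = list(scores)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            t = paired_t_statistic(scores[a], scores[b])
            if t > t_crit:        # a significantly better than b
                index[a] += 1
                index[b] -= 1
            elif t < -t_crit:     # b significantly better than a
                index[a] -= 1
                index[b] += 1
    return index
```

Here `scores` maps each model name to its 50-dimensional score vector on one data set; summing the resulting indices over data sets would give the kind of totals reported in the index-of-performance tables.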
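The projection step can be illustrated with classical (Torgerson) MDS. The paper does not say which MDS variant was used, so this choice is an assumption, as are the function names `pairwise_distances` and `classical_mds` and the use of power iteration with deflation to obtain the two leading eigenvectors without external libraries.

```python
import math

def pairwise_distances(rows):
    """Euclidean distance between every pair of rows of the score table."""
    return [[math.dist(a, b) for b in rows] for a in rows]

def _top_eigenpair(B, iters=500):
    """Leading eigenvalue and eigenvector of a symmetric matrix by power iteration."""
    n = len(B)
    v = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    Bv = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(vi * bi for vi, bi in zip(v, Bv)), v  # Rayleigh quotient

def classical_mds(D, dims=2):
    """Embed points in `dims` dimensions from a symmetric distance matrix D."""
    n = len(D)
    sq = [[d * d for d in row] for row in D]
    row_mean = [sum(r) / n for r in sq]
    total = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J.
    B = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        lam, v = _top_eigenpair(B)
        if lam <= 0:
            break
        for i in range(n):
            coords[i][k] = math.sqrt(lam) * v[i]
        # Deflate so the next pass finds the following eigenvector.
        B = [[B[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords
```

In the setting of the text, `rows` would be the rows of the $61 \times N$ table of average scores, and the distance of each projected model to the projected optimal point can then be read off with `math.dist`.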

Experimental Results
Aiming to study the influence of the imbalance ratio and misclassification costs on the performance of different cost-sensitive boosting algorithms, we performed both statistical tests and MDS analysis for data sets with high-level and low-level imbalances separately.

Results on Severely Imbalanced Data Sets.
First of all, we perform the significance test of different cost-sensitive boosting algorithms to show whether there exist significant differences among them. To this end, we use the paired t-test on the combinations of different algorithms and different cost settings.
For each combination, we obtain 50-dimensional score vectors from the results of 5 runs of 10-fold cross-validation on each of the data sets. Then, we conduct the paired t-test between each pair of these score vectors. The wins on all the data sets are added up into the final wins, and the final losses are obtained likewise. The index of performance is calculated as the difference between wins and losses. Tables 3 and 4 provide the indices of performance for severely imbalanced data sets using the F-measure and G-mean metrics, respectively. Note that the best result of each cost setting is marked with a framed box.
From the results on severely imbalanced data sets, we can observe that in most cases AdaCost is significantly the worst, and CSB1 and CSB2 are the second worst, with consistently negative index values. Among the AdaC serial algorithms, AdaC2 is preferred over the others, especially when using the G-mean metric. From the point of view of F-measure, AdaC2 is slightly better than the others except for the first three cost settings. The AdaC serial algorithms have similar performances when $C_N = 1$, since they are all reduced to AdaBoost.
Owing to the obvious advantage of AdaC serial algorithms over the other two groups, we will only include AdaC serial algorithms in the MDS analysis. For each data set, we have 30 models, which come from 3 algorithms (AdaC1, AdaC2, and AdaC3) and 10 cost settings. For each model, we obtain the average score as the mean of its 50-dimensional score vector. We build a $31 \times 9$ table whose element $(i, j)$ represents the average score of model $i$ on data set $j$. Then, we perform multidimensional scaling on the distance matrix to obtain the projection onto a 2-dimensional space.
Figure 1 illustrates the MDS plots of AdaC serial algorithms on severely imbalanced data sets. The number associated with each point denotes a cost setting. For example, the number 2 with the blue circle denotes the model of AdaC2 at the cost setting $C_N = 0.2$, $C_P = 1$. The other points are read in the same way.
As we might expect, when $TP_{rate}$ is considered, points of AdaC2 and AdaC3 move closer and closer to the optimal point as $C_N$ decreases. Moreover, points of AdaC2 are always closer to the optimal point than those of AdaC3, and points of AdaC1 are much farther away than the others. Conversely, $TN_{rate}$ presents the opposite behavior. On the other hand, AdaC1 is relatively insensitive to misclassification costs, which is consistent with the results of Sun et al. [31].
Focusing on the G-mean metric in Figure 1, we can see that AdaC2 is much better than AdaC3 and AdaC1, which is consistent with the result of Table 4. However, there are only small differences between the AdaC serial algorithms for the F-measure metric, and the most proper cost of the negative class lies in [0.7, 0.9]. These findings are further confirmed by Table 5.
Table 5 reports the Euclidean distances to the optimal point in the MDS space on the severely imbalanced data sets. As expected, the closest points, marked with framed boxes, generally appear in the AdaC2 algorithm for both F-measure and G-mean metrics. We also find that AdaC1 is relatively insensitive to the cost settings. Hence, the average results of AdaC1 tend to be lower than the others. However, this does not imply that AdaC1 is superior to the other two algorithms. Experimental results show that AdaC serial algorithms are much better than the other two groups, and AdaC2 is better suited to handling severely imbalanced classification, especially when the cost of the negative class lies in [0.7, 0.9].

Results on Data Sets with a Low-Level Imbalance.
Similarly, we perform the significance test of different cost-sensitive boosting algorithms for the data sets with a low-level imbalance.
Tables 6 and 7 provide the indices of performance for the F-measure and G-mean metrics, respectively. As can be seen, both AdaCost and CSB serial algorithms are inferior to AdaC serial algorithms, which is very similar to the case of the strongly imbalanced data sets. However, when comparing AdaC serial algorithms, the differences between them appear marginal in the sense that they generally achieve similar indices of performance, in contrast to the outstanding behavior of AdaC2 for severely imbalanced data sets.
Figure 2 illustrates the MDS plots of AdaC serial algorithms on the slightly imbalanced data sets over four evaluation measures. The results for $TP_{rate}$ and $TN_{rate}$ are very similar to the case of the strongly imbalanced data sets: AdaC2 and AdaC3 become closer to the optimal $TP_{rate}$ as $C_N$ decreases, whereas they are nearer to the optimal $TN_{rate}$ as $C_N$ increases. AdaC1 is relatively insensitive to misclassification costs, similar to the case of strongly imbalanced data sets.
When analyzing the results of G-mean and F-measure, we can observe that the trend of AdaC2 along with the change of $C_N$ is very similar to that of AdaC3. Moreover, most of the best performances are achieved by the AdaC2 algorithm around $C_N = 0.7$. These findings are also clear in Table 8.
From the previous analysis on the data sets with a low-level imbalance, we can conclude that AdaC serial algorithms consistently outperform the other two groups, but it is difficult to advise the best strategy among AdaC1, AdaC2, and AdaC3. This means that the effectiveness of a particular cost-sensitive boosting algorithm depends on the class imbalance as well as on other factors, such as the data characteristics and the algorithm itself.

Conclusions and Future Work
In this paper, we presented a thorough empirical study on the performance of the most popular cost-sensitive boosting algorithms when dealing with different levels of class imbalance. We used 17 real-world imbalanced data sets (9 severely imbalanced and 8 slightly imbalanced), 6 cost-sensitive boosting algorithms (AdaC1, AdaC2, AdaC3, AdaCost, CSB1, and CSB2), and 10 cost settings in the experiment. Besides, the performance of the algorithms has been evaluated by means of four different evaluation metrics, that is, $TP_{rate}$, $TN_{rate}$, F-measure, and G-mean.
Experimental results show that AdaC serial algorithms (AdaC1, AdaC2, and AdaC3) consistently outperform the other two groups (AdaCost and CSB), both for strongly and slightly imbalanced data sets. Moreover, AdaCost has been demonstrated to be worse than the CSB algorithms. When comparing AdaC serial algorithms, AdaC2 is observed to perform better than the other two for severely imbalanced data sets, especially when using the G-mean metric. In the case of data sets with a low-level imbalance, however, the difference between AdaC serial algorithms is negligible. A further data complexity analysis is necessary to choose a suitable algorithm for a particular imbalanced data set.
On the other hand, we have given some guidance on choosing the proper misclassification costs for these cost-sensitive boosting algorithms. In summary, the experimental results demonstrate that the most proper cost setting is located in the neighbourhood of the point $C_N = 0.7$, $C_P = 1$.
Based on the present work, there are some interesting future research directions with regard to the class imbalance problem: (1) to utilize other parameter selection techniques for the confirmation of the proper misclassification costs; (2) to take other cost-sensitive learning algorithms into consideration within the present framework, such as the proposed algorithms of [22][23][24][39]; (3) to compare these cost-sensitive algorithms in terms of other performance metrics, such as AUC [9] and IBA [40].

Figure 1: MDS plots of severely imbalanced data sets over four performance metrics.

Figure 2: MDS plots for data sets with low-level imbalance over four performance metrics.

Table 1: Summary of characteristics for the used data sets.

Table 3: Index of performance using F-measure for severely imbalanced data sets.

Table 4: Index of performance using G-mean for severely imbalanced data sets.

Table 5: Euclidean distances to the optimal point in the MDS space for data sets with a high imbalance.

Table 6: Index of performance using F-measure for data sets with a low-level imbalance.

Table 7: Index of performance using G-mean for data sets with a low-level imbalance.

Table 8: Euclidean distances to the optimal point in the MDS space for data sets with a low imbalance.