To address the two-class imbalance problem in breast cancer diagnosis, a hybrid of K-means and Boosted C5.0 (K-Boosted C5.0) based on undersampling is proposed. K-means is utilized to select the informative samples near the boundary. During the training phase, the K-means algorithm clusters the majority and minority instances and selects a similar number of instances from each cluster. Boosted C5.0 is then used as the classifier. Because instance selection via clustering introduces diversity into the training subspace of K-Boosted C5.0, it can be expected to deliver better performance. To test the performance of the new hybrid classifier, it is evaluated on 12 small-scale and 2 large-scale datasets that are commonly used in class-imbalanced learning. The extensive experimental results show that our proposed hybrid method outperforms most of the competing algorithms in terms of Matthews' correlation coefficient (MCC) and accuracy, and it can serve as a good alternative to well-known machine learning methods.
Breast cancer is one of the top ten causes of death among women worldwide [
There has been a great deal of research on breast cancer diagnosis in the literature, much of which has achieved high classification accuracy. Li et al. [
The class imbalance problem should be carefully addressed because traditional methods are designed to maximize global accuracy and therefore generalize poorly on the small class, which is usually the class of primary interest. Thus, for traditional algorithms, the rare class is more difficult to identify than the majority class. Hence, the breast cancer diagnosis problem should be approached from the perspective of class imbalance.
A popular mechanism for addressing the class imbalance problem is an ensemble of classifiers combined with a data-level approach, since the data-level resampling and the classifier training task can be performed independently [
To overcome the limitations of undersampling, a K-means clustering-based undersampling method is employed to select the samples near the boundary, since border samples are the most informative ones and play an important role in classification [
The major contributions of this paper are: (1) a strategy of using the K-means clustering technique for undersampling both the majority and minority classes is presented; (2) an efficient classifier ensemble is considered, with a boosting scheme used to leverage the strength of the base classifier; (3) an extensive experimental analysis is carried out on 12 real imbalanced data sets, showing that the proposed K-Boosted C5.0 can outperform results from the literature as well as RUSBoost and SMOTEBoost, two renowned state-of-the-art methods in this area, thereby demonstrating the inherent advantage of the proposed approach.
The remainder of this paper is organized as follows. Section
In this section, we give a detailed description of the K-means clustering-based undersampling algorithm. The process is shown in Figure
Block diagram for the proposed classification model.
Undersampling is a better choice than oversampling since oversampling increases the likelihood of overfitting; however, undersampling carries the risk of underfitting, in other words, useful data might be eliminated. To overcome this limitation, a clustering-based undersampling method is proposed. As described in the literature cited above, interior prototypes can be discarded because they have little effect on classification accuracy, whereas border prototypes are critical for classification and important for the induction process. Thus, our proposed method uses clustering-based undersampling to select the samples near the boundary region, rebalancing the class distribution without significant loss of classification accuracy. The aim of clustering is to group the objects into two clusters so that we can select the samples lying near the cluster boundary, resulting in a balanced dataset in which each cluster contributes a similar number of instances. The idea behind this implementation of clustering-based undersampling is to eliminate the examples from both classes that are distant from the cluster border, since such examples can be considered less relevant for learning. In this paper, only the K-means clustering algorithm is considered because it is simple and efficient [
The proposed clustering-based undersampling method has three stages: first, cluster all samples via the K-means algorithm; second, compute the distance from each point to its cluster centroid; finally, select the samples whose distance to the centroid is greater than the cluster's average distance and merge them, resulting in a rebalanced data set. The remaining samples are used as the testing subset. Distances are computed with the Euclidean metric, and the number of clusters is set to two for binary-class datasets. It should be pointed out that if the subset obtained after instance selection is still imbalanced, the selection condition is relaxed: samples whose distance to the cluster centroid is greater than half of the cluster's average distance are selected and added to the training set. The details of the informative-sample selection algorithm are described below, and a short R sketch follows. The pseudocode is presented in Algorithm
Input: the data set
Output: the final training data with informative samples
Step 1: randomly select k samples as the initial cluster centroids.
Step 2: the Euclidean metric is used to compute the distance between each point and each centroid, and each data point is assigned to its closest centroid. The distance between a point x = (x_1, …, x_m) and a centroid c = (c_1, …, c_m) is d(x, c) = sqrt((x_1 − c_1)^2 + … + (x_m − c_m)^2).
Step 3: recompute each cluster centroid as the mean of its assigned points, which reduces the total within-cluster Euclidean distance.
Step 4: repeat steps 2 and 3 until cluster membership stabilizes.
Step 5: compute the average distance from the members of each cluster to their centroid.
Step 6: create the final training data set from the samples whose distance to the centroid is greater than the cluster's average distance.
When the clustering-based undersampling method is employed, redundant samples are removed, the number of samples is greatly reduced, and the essential information of the original database is retained. In this way, the time and space complexity of the algorithm are reduced, and the classification accuracy can be improved.
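Putting the six steps together, the following is a minimal R sketch of the selection procedure. The function and variable names (kmeans_undersample, X, y, frac) are ours rather than the paper's, and the imbalance check that triggers the relaxed half-distance threshold is an assumed heuristic.

```r
kmeans_undersample <- function(X, k = 2, frac = 1) {
  X <- as.matrix(X)
  km <- stats::kmeans(X, centers = k, nstart = 10)  # steps 1-4: cluster all samples
  # step 5: Euclidean distance of each point to its own cluster centroid
  d <- sqrt(rowSums((X - km$centers[km$cluster, , drop = FALSE])^2))
  keep <- logical(nrow(X))
  for (cl in seq_len(k)) {
    in_cl <- km$cluster == cl
    # step 6: keep border samples, i.e., points farther from the centroid
    # than `frac` times the cluster's average distance
    keep[in_cl] <- d[in_cl] > frac * mean(d[in_cl])
  }
  keep
}

sel <- kmeans_undersample(X)                        # X: feature matrix, y: class labels
if (min(table(y[sel])) < 0.8 * max(table(y[sel])))  # subset still imbalanced? (assumed test)
  sel <- kmeans_undersample(X, frac = 0.5)          # relax to half the average distance
train_set <- data.frame(X[sel, ], class = y[sel])   # balanced training subset
test_set  <- data.frame(X[!sel, ], class = y[!sel]) # remaining samples form the test set
```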
C5.0 is an improved algorithm derived from C4.5 by Ross Quinlan, and it selects the test attributes by information gain [
Boosting is the most commonly used technique for constructing ensembles in the imbalance framework. The boosting algorithm repeatedly calls a weak learner, each time feeding it a different distribution over the training data. C5.0 readily supports boosting, and the boosting technique can improve its performance. Class imbalance must be taken seriously in medical decision making, and boosted C5.0 has been widely studied and is now a well-known method in imbalance learning. Thus, in this paper, we adopt boosted C5.0 to exploit these advantages while avoiding the shortcomings of the base learner.
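As a concrete illustration, boosting in the "C50" package is enabled through the trials argument. This is only a sketch: trials = 10 is an illustrative value, not the setting reported in the experiments, and train_set/test_set are the subsets from the undersampling sketch above.

```r
library(C50)

model <- C5.0(class ~ ., data = train_set, trials = 10)  # trials > 1 activates boosting
pred  <- predict(model, newdata = test_set)              # predicted class labels
```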
To clearly observe the impact of clustering-based undersampling on imbalanced datasets and to investigate in depth how the performance measures behave with the degree of clustering-based undersampling, we develop ensembles on imbalanced data sets with different degrees of class imbalance. In our experiments, the items investigated are: (a) the ability to maintain the classification accuracy on the majority class and (b) the ability to improve the classification accuracy on the minority class. The experiments were performed on a laptop with Windows 10, a 2.19 GHz Pentium CPU, and 4 GB RAM, using Matlab version 2016a and R version 3.4.4. The boosted C5.0, naïve Bayes, and SVM classifiers were implemented with the "C50", "e1071", and "kernlab" packages, respectively, and the "caret" package was used for the 10-fold cross-validation (10-CV). All packages were used with their default settings.
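For reference, the 10-CV wiring with the "caret" package mentioned above might look as follows; caret's built-in method "C5.0" tunes the trials, model, and winnow parameters of boosted C5.0 over its default grid. This is a sketch under the assumption that train_set is the balanced subset produced earlier.

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
fit  <- train(class ~ ., data = train_set, method = "C5.0", trControl = ctrl)
```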
In our experiments,
The work in this paper comprises four experimental studies. In the first study,
Experimental datasets.
Datasets | No. of data samples | No. of features | Imbalance ratio
---|---|---|---
Small-scale datasets | | |
(1) Abalone | 731 | 8 | 16.4
(2) Bcwo | 683 | 9 | 1.8577
(3) Pima | 336 | 8 | 2.027
(4) Redwine1 | 837 | 11 | 3.21
(5) Redwine2 | 880 | 11 | 3.42
(6) Redwine3 | 734 | 11 | 12.85
(7) Redwine4 | 691 | 11 | 12.04
(8) Wbcd | 569 | 30 | 1.8
(9) Whitewine | 1043 | 11 | 5.4
(10) Yeast1 | 707 | 8 | 1.8975
(11) Yeast2 | 626 | 8 | 2.840
(12) Yeast3 | 892 | 8 | 1.08
Large-scale datasets | | |
(1) Breast cancer | 102294 | 117 | 163.19
(2) Protein homology prediction | 145751 | 74 | 111.46
Overall accuracy becomes meaningless when the learning concern is how to find minority examples effectively [
Confusion matrix.
 | Predicted positive | Predicted negative
---|---|---
Actual positive | True positive (TP) | False negative (FN)
Actual negative | False positive (FP) | True negative (TN)
In this study, the class of interest is designated the positive class, while all others are negative. Hence, the noncancer class is labeled "negative" and the cancer class "positive."
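The evaluation indices used in this paper follow directly from the confusion-matrix counts above. The sketch below (with a function name of our choosing and toy counts in the usage line) computes accuracy, sensitivity, specificity, the geometric mean (G-mean), and MCC.

```r
imbalance_metrics <- function(TP, FN, FP, TN) {
  sens  <- TP / (TP + FN)                  # sensitivity: recall on the cancer class
  spec  <- TN / (TN + FP)                  # specificity: recall on the noncancer class
  acc   <- (TP + TN) / (TP + FN + FP + TN) # overall accuracy
  gmean <- sqrt(sens * spec)               # geometric mean of the two class recalls
  mcc   <- (TP * TN - FP * FN) /
    sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # Matthews correlation coefficient
  c(accuracy = acc, sensitivity = sens, specificity = spec,
    G_mean = gmean, MCC = mcc)
}

imbalance_metrics(TP = 50, FN = 3, FP = 7, TN = 140)  # toy counts for illustration
```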
The data sources are the Wbcd and Bcwo breast cancer datasets taken from the UCI machine learning repository. These datasets are complete and representative; thus, the testing results are reliable and valuable.
We compare the results with RUSBoost [
Table
Performance comparison based on Wbcd and Bcwo datasets.
Dataset | Method | Accuracy | Sensitivity | Specificity | G-mean | AUC | MCC
---|---|---|---|---|---|---|---
Wbcd | K-Boosted C5.0 | | 0.9375 | | | |
 | SMOTEBoost | 0.964 | | 0.978 | 0.9619 | 0.963 | 0.924
 | RUSBoost | 0.944 | 0.93 | 0.954 | 0.942 | 0.942 | 0.886
 | SMOTE-Boosted C5.0 | 0.925 | 0.939 | 0.911 | 0.9248 | 0.925 | 0.847
Bcwo | K-Boosted C5.0 | | | | | |
 | SMOTEBoost | 0.92 | 0.98 | 0.89 | 0.934 | 0.933 | 0.839
 | RUSBoost | 0.936 | 0.926 | 0.944 | 0.9350 | 0.934 | 0.8539
 | SMOTE-Boosted C5.0 | 0.937 | 0.934 | 0.941 | 0.9375 | 0.937 | 0.8756
Performance comparison based on the Wbcd dataset.
ML method | Accuracy (%) | Sensitivity (%) | Specificity (%) | G-mean (%) | MCC
---|---|---|---|---|---
QKCLDA | 97.26 | — | — | — |
K-SVM | 97.38 | — | — | — |
PSO + Boosted C5.0 | 96.38 | 97.70 | 94.28 | — |
Aisl | 98.00 | 95.9 | 98.7 | — |
PSO-KDE | 98.45 | 100 | 97.99 | — |
EC | 96.5 | — | | |
BBHA | 97.38 | 95.79 | 98.57 | — |
EM-PCA-CART-fuzzy rule-based | 93.2 | — | — | — |
FSMLP | 100 | 100 | 100 | 100 |
K-Boosted C5.0 | | | | |
Performance comparison based on the Bcwo dataset.
ML method | Accuracy (%) | Sensitivity (%) | Specificity (%) | G-mean (%) | MCC (%)
---|---|---|---|---|---
Aisl | 98.3 | 94.3 | 99.6 | 96.91 |
PSO-KDE | 98.53 | 95.79 | 100 | — |
K-Boosted C5.0 | | | | |
For comparison purposes, the performance of K-Boosted C5.0 on the Wbcd and Bcwo datasets is compared with that of other methods from the literature. Tables
From the results of Tables
For the Bcwo dataset, it is evident from Table
To illustrate the generalization performance of the K-Boosted C5.0 method, experiments were conducted on ten further data sets, which are shown in Table
Result comparison based on different datasets.
Dataset | Method | Accuracy | Sensitivity | Specificity | G-mean | AUC | MCC
---|---|---|---|---|---|---|---
Abalone | K-Boosted C5.0 | 0.960 | 0.2 | 0.992 | 0.445 | 0.596 |
 | SMOTEBoost | 0.822 | 0.628 | 0.834 | 0.724 | 0.730 | 0.264
 | RUSBoost | 0.592 | 0.802 | 0.58 | 0.682 | 0.69 | 0.15
 | SMOTE-Boosted C5.0 | 0.618 | 0.635 | 0.601 | 0.618 | 0.624 | 0.232
Pima | K-Boosted C5.0 | 0.766 | 0.640 | 0.820 | 0.725 | 0.730 |
 | SMOTEBoost | 0.75 | 0.646 | 0.8 | 0.719 | 0.723 | 0.432
 | RUSBoost | 0.714 | 0.792 | 0.676 | 0.732 | 0.733 | 0.446
 | SMOTE-Boosted C5.0 | 0.713 | 0.742 | 0.684 | 0.712 | 0.713 | 0.425
Redwine1 | K-Boosted C5.0 | 0.823 | 0.517 | 0.905 | 0.684 | 0.823 |
 | SMOTEBoost | 0.784 | 0.544 | 0.858 | 0.683 | 0.701 | 0.397
 | RUSBoost | 0.702 | 0.656 | 0.718 | 0.686 | 0.69 | 0.34
 | SMOTE-Boosted C5.0 | 0.688 | 0.715 | 0.661 | 0.687 | 0.688 | 0.377
Redwine2 | K-Boosted C5.0 | 0.902 | 0.681 | 0.969 | 0.812 | 0.825 |
 | SMOTEBoost | 0.826 | 0.85 | 0.82 | 0.835 | 0.835 | 0.59
 | RUSBoost | 0.82 | 0.852 | 0.812 | 0.832 | 0.832 | 0.577
 | SMOTE-Boosted C5.0 | 0.844 | 0.865 | 0.822 | 0.8432 | 0.843 | 0.691
Redwine3 | K-Boosted C5.0 | 0.834 | 0.410 | 0.866 | 0.596 | 0.637 |
 | SMOTEBoost | 0.89 | 0.12 | 0.948 | 0.337 | 0.545 | 0.054
 | RUSBoost | 0.764 | 0.46 | 0.788 | 0.602 | 0.634 | 0.171
 | SMOTE-Boosted C5.0 | 0.567 | 0.509 | 0.625 | 0.564 | 0.573 | 0.137
Redwine4 | K-Boosted C5.0 | 0.940 | 0.263 | 0.989 | 0.510 | 0.626 |
 | SMOTEBoost | 0.916 | 0.28 | 0.964 | 0.520 | 0.623 | 0.317
 | RUSBoost | 0.678 | 0.64 | 0.682 | 0.661 | 0.660 | 0.152
 | SMOTE-Boosted C5.0 | 0.618 | 0.635 | 0.601 | 0.618 | 0.624 | 0.232
Whitewine | K-Boosted C5.0 | 0.925 | 0.650 | 0.961 | 0.79 | 0.805 |
 | SMOTEBoost | 0.804 | 0.838 | 0.798 | 0.818 | 0.818 | 0.502
 | RUSBoost | 0.794 | 0.85 | 0.784 | 0.816 | 0.817 | 0.49
 | SMOTE-Boosted C5.0 | 0.796 | 0.801 | 0.792 | 0.796 | 0.797 | 0.593
Yeast1 | K-Boosted C5.0 | 0.952 | 0.957 | 0.949 | 0.953 | 0.957 |
 | SMOTEBoost | 0.762 | 0.722 | 0.788 | 0.754 | 0.754 | 0.497
 | RUSBoost | 0.798 | 0.694 | 0.852 | 0.769 | 0.773 | 0.552
 | SMOTE-Boosted C5.0 | 0.723 | 0.734 | 0.712 | 0.723 | 0.723 | 0.45
Yeast2 | K-Boosted C5.0 | 0.951 | 0.924 | 0.958 | 0.941 | 0.941 |
 | SMOTEBoost | 0.932 | 0.904 | 0.94 | 0.922 | 0.921 | 0.836
 | RUSBoost | 0.93 | 0.938 | 0.926 | 0.928 | 0.931 | 0.824
 | SMOTE-Boosted C5.0 | 0.9116 | 0.8934 | 0.9388 | 0.9158 | 0.916 | 0.821
Yeast3 | K-Boosted C5.0 | 0.646 | 0.575 | 0.706 | 0.637 | 0.641 |
 | SMOTEBoost | 0.618 | 0.618 | 0.612 | 0.615 | 0.616 | 0.232
 | RUSBoost | 0.64 | 0.544 | 0.728 | 0.629 | 0.636 | 0.27
 | SMOTE-Boosted C5.0 | 0.598 | 0.450 | 0.735 | 0.575 | 0.593 | 0.195
The accuracy, sensitivity, specificity,
MCC result comparison based on different datasets.
In the third experimental study, to clearly observe the impact of clustering-based undersampling on imbalanced data sets, two further classifiers were constructed, namely, the support vector machine (SVM) and naive Bayes (NB). In addition, to evaluate the performance of the proposed ensemble approach, RUS is used as the baseline for performance comparisons. As indicated by the performance results in Figure
Classification of MCC of the different classifiers over the breast cancer and protein homology datasets.
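For completeness, the two comparison classifiers can be built with the "e1071" and "kernlab" packages listed in the experimental setup. The sketch below uses default settings, in line with the description above, and assumes the train_set object from the earlier sketches.

```r
library(e1071)
library(kernlab)

nb_fit  <- naiveBayes(class ~ ., data = train_set)  # naive Bayes baseline
svm_fit <- ksvm(class ~ ., data = train_set)        # SVM baseline (RBF kernel by default)
```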
In the fourth experimental study, the CPU time of the proposed K-Boosted C5.0 was compared with that of the baseline algorithms, RUS and SMOTE, over the ten small-scale datasets. To make the observation more convincing, the CPU time of K-Boosted C5.0 was also compared with that of RUS over the two large-scale datasets. Figure
Computational efficiency of approaches on 12 datasets.
Computational efficiency of approaches on breast cancer and protein homology datasets.
To confirm whether the differences among the comparative methods are significant, the Friedman test with a 95% confidence level [
Mean rank of the Friedman test over the four classification algorithms.
 | K-Boosted C5.0 | SMOTEBoost | RUSBoost | SMOTE-Boosted C5.0
---|---|---|---|---
7.488 | 4 | 2.25 | 1.917 | 1.83
Figure
Results of the pairwise comparisons of methods using the Nemenyi post hoc test.
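A minimal sketch of this significance analysis in R follows. friedman.test is base R, while the Nemenyi post hoc test here is taken from the PMCMRplus package (our tooling choice; the paper does not name one). The scores object is an assumed datasets-by-methods matrix of MCC values.

```r
# rows of `scores` are the datasets (blocks); columns are the four methods
friedman.test(scores)

library(PMCMRplus)
frdAllPairsNemenyiTest(scores)  # pairwise Nemenyi comparisons of the methods
```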
On the basis of our experimental analysis, the following observations can be made. The MCC results indicate that the proposed K-Boosted C5.0 is the best hybrid classifier for imbalanced datasets. In terms of accuracy, the proposed algorithm maintains good classification accuracy over all class data except on the Redwine3 dataset; these improvements are mainly due to the clustering-based undersampling method. For Redwine3, the best accuracy (89%) is obtained by SMOTEBoost, which incorporates SMOTE into the boosting algorithm. In practice, SMOTE balances a dataset well, but choosing the sampling rate, which is crucial to its performance, can be time-consuming; this restricts its use, and its performance is not stable. The experimental results show that the proposed K-Boosted C5.0 achieves relatively high and stable classification performance with fewer fixed parameters in most cases, which makes it strongly desirable. A comparison between K-Boosted C5.0 and SMOTE-Boosted C5.0 over the small-scale datasets shows that the proposed clustering-based undersampling method is better than SMOTE, and Boosted C5.0 was validated by comparison with SVM and naive Bayes. These results lead us to conclude that combining the clustering-based undersampling method with Boosted C5.0 provides the highest classification MCC. To show the adaptation and generalization capability of K-Boosted C5.0, we compared it with the baseline approaches, RUSBoost and SMOTEBoost, over all the datasets. According to these results, K-Boosted C5.0 delivers the best tested performance in the least amount of time, which was observed to be significantly different from the other methods ( Notably, Lin et al. [
From the experimental results on all datasets, it is worth noting that K-Boosted C5.0 obtains the highest classification MCC but suffers from parameter setting, which is crucial to its classification performance. In the large body of literature on breast cancer diagnosis, most methods are designed only to maximize overall classification accuracy; little work has treated breast cancer prediction as a class imbalance problem. In fact, the accuracy, specificity, and sensitivity indices of the literature methods show contradictory results on breast cancer diagnosis. In addition, the accuracy of these classifiers is high yet lacks specificity, since accuracy is dominated by the instances of the majority class while the instances of the minority class are ignored. Such an imbalanced class distribution significantly hinders predictive performance, biases learning towards the majority class, and leads to poor generalization. The clustering technique groups the dataset into two clusters, and we select the informative majority- and minority-class instances from each cluster. With the help of clustering-based undersampling, the original data set is balanced. Our proposed K-Boosted C5.0 has shown promising predictive performance in breast cancer diagnosis, balancing the data while maintaining high MCC and accuracy.
In this paper, we propose a K-Boosted C5.0 algorithm based on undersampling to address breast cancer diagnosis as a class imbalance problem. Our method consists of two steps. First, K-means clustering is used to group the classes and find informative samples: we consider the instances close to the cluster border as the informative ones and set the distance parameter so that the majority and minority classes become equal in number. Second, Boosted C5.0 is applied for classification. According to the experimental results, K-Boosted C5.0 improves performance significantly without increasing algorithm complexity. Furthermore, the clustering-based undersampling method provides a new, efficient way to handle the class imbalance problem.
A balanced, informative, and diverse training subset is obtained via K-means clustering in this work, which encourages us to take this work a step further. In future work, we would like to explore the effects of
The datasets in these experiments are taken from the public UCI machine learning repository.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was supported by the National Natural Science Foundation of China under grant no. 5186650 and the Shaanxi Technology Committee Industrial Public Relation Project (no. 2018GY-145).