
As one of the most prevalent cancers among women worldwide, breast cancer has attracted considerable attention from researchers. It has been verified that accurate and early detection of breast cancer increases the chances for patients to receive the right treatment plan and survive long term. Nowadays, numerous classification methods have been utilized for breast cancer diagnosis. However, most of these classification models have concentrated on maximizing the classification accuracy and have failed to take into account the unequal misclassification costs of breast cancer diagnosis. It is well established that misclassifying a cancerous patient as non-cancerous incurs a much higher cost than misclassifying a non-cancerous patient as cancerous. Consequently, in order to tackle this deficiency and further improve the classification accuracy of breast cancer diagnosis, we propose an improved cost-sensitive support vector machine classifier (ICS-SVM) for the diagnosis of breast cancer. The proposed approach takes full account of the unequal misclassification costs of breast cancer intelligent diagnosis and provides more reasonable results than previous works and conventional classification models. To evaluate the performance of the proposed approach, the Wisconsin Breast Cancer (WBC) and Wisconsin Diagnostic Breast Cancer (WDBC) datasets obtained from the University of California at Irvine (UCI) machine learning repository have been studied. The experimental results demonstrate that the proposed hybrid algorithm outperforms all the compared methods. Promisingly, the proposed method could serve as a useful clinical tool for breast cancer diagnosis and could also be applied to the diagnosis of other illnesses.

Breast cancer is one of the most prevalent cancers among women all over the world [

In recent years, more and more machine learning techniques have been applied to medical diagnosis; they can extract useful knowledge from huge amounts of medical data and thereby assist clinical physicians in making correct and effective decisions. Breast cancer diagnosis is commonly formulated as a classification problem. However, traditional classification algorithms, such as the Naïve Bayesian [

Feature selection is the process of selecting the best subset of the input features to maximize the discrimination capability [

In this research, we propose an improved breast cancer intelligent diagnosis approach, which utilizes IG for feature selection and CS-SVM for breast cancer classification. We expect our proposed approach to deliver competitive performance for breast cancer diagnosis compared to other classification algorithms. The remainder of this manuscript is organized as follows: Section

In the literature, various models have been designed to diagnose breast cancer. In this section, we briefly review previous studies on breast cancer intelligent diagnosis, such as artificial neural networks (ANNs) and decision tree analysis, which have been utilized for breast cancer diagnosis mostly due to their efficiency and high prediction accuracy. In 2002, Hussein A. Abbass proposed an approach named Memetic Pareto Artificial Neural Network (MPANN) for breast cancer diagnosis, and the experimental results showed that MPANN has better generalization and lower computational cost than other comparative methods [

In this study, we utilized the IG algorithm to select a compact feature subset with maximal discriminating ability and applied the improved CS-SVM algorithm for classification. When applying the SVM method for classification, the critical point is how to choose the optimal input feature subset and the optimal parameters, which plays a crucial role in building a classification model with high classification accuracy and stability [

In this work, we introduce the IG algorithm to select a compact feature subset with the maximal discriminative capability [

There are

For any feature
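For illustration, the IG computation described above can be sketched in Python as follows. This is a minimal sketch for discrete-valued features (as in the WBC dataset, whose attributes take values 1~10); the function names are illustrative and not from the original implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) = -sum p*log2(p) over class frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - sum_v P(X = v) * H(Y | X = v)."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def rank_features(feature_columns, labels):
    """Rank feature names by IG in descending order (feature-selection stage)."""
    scores = {name: information_gain(col, labels)
              for name, col in feature_columns.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Features whose IG falls below a chosen cutoff are then discarded, keeping only the top-ranked subset.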

SVM was originally developed by Boser and Vapnik; it has long been regarded as an excellent classifier with high generalization ability, built on the structural risk minimization (SRM) principle, and has been utilized by many machine learning researchers [

Support vectors and decision boundaries of a linear SVM.

In our work, we take into consideration the unequal misclassification costs of breast cancer diagnosis and introduce different penalty factors, namely,

In formula (

In our study, we set benign tumors as negative samples and malignant tumors as positive samples, set the cost of a false negative (a malignant tumor misclassified as benign) much higher than that of a false positive, and then readjust the parameters of

In order to solve the problem of formula (

According to the KKT conditions, we can convert the inequality constraints into equality constraints, and the problem of breast cancer diagnosis based on CS-SVM can be transformed into solving the following minimum objective function:

In formula (

In formula (
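As a concrete illustration of unequal penalty factors, the sketch below trains an SVM whose positive (malignant) class receives a larger penalty, using scikit-learn's `class_weight` option. The 5:1 cost ratio and the synthetic data are assumptions for demonstration, not the paper's tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy imbalanced binary problem: label 1 plays the role of "malignant".
X, y = make_classification(n_samples=400, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Unequal penalty factors: misclassifying a positive (malignant) sample is
# costlier, so the positive class gets the larger penalty C+ = 5 * C-.
# The effective per-class penalty is C * class_weight[label].
clf = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight={0: 1.0, 1: 5.0})
clf.fit(X_tr, y_tr)
```

Raising the positive-class weight shifts the decision boundary toward fewer false negatives, at the price of more false positives.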

PSO is a stochastic population-based metaheuristic algorithm which is based on the simulation of the social behavior of organisms, such as birds flying within a flock [

Formula (

In formula (

In formula (
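The velocity and position updates described above can be sketched as a standard PSO loop. The acceleration factors (1.9) and velocity clamp (0.5) match the parameter settings reported later in this paper; the inertia weight, search bounds, and objective below are illustrative assumptions, and the simulated-annealing coupling of SAPSO is omitted.

```python
import random

def pso(fitness, dim, n_particles=20, iters=100, w=0.7, c1=1.9, c2=1.9,
        lo=-5.0, hi=5.0, vmax=0.5, seed=0):
    """Standard PSO minimizing `fitness` over a [lo, hi]^dim box."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                # v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x), clamped to vmax
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-vmax, min(vmax, vel[i][d]))
                pos[i][d] = max(lo, min(hi, pos[i][d] + vel[i][d]))
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Sanity check on the 2-D sphere function, whose minimum is at the origin.
best, val = pso(lambda x: sum(v * v for v in x), dim=2)
```

In the proposed approach, the particle position encodes the CS-SVM parameter pair, and the fitness is the cross-validated objective defined in the next section.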

This study proposes a novel breast cancer intelligent diagnosis approach, which employs the IG algorithm for feature selection, extracting the top

Flowchart of our proposed algorithm.

The classification accuracy and the misclassification cost are both taken into account in designing the fitness function. In this work, we take the average misclassification cost (

where

The cost matrix used by the classifiers.

True | Predicted | |
---|---|---|

Benign/majority class | Malignant/minority class | |

Benign/majority class | 0 | |

Malignant/minority class | | 0 |

In our study, in order to compare the misclassification costs for the different classification models conveniently, we set the value of the correct classification cost as

The framework of our proposed hybrid classification algorithm is presented in Figure

Delete the missing cases and normalize the input samples.

Calculate the value of IG for each feature, and rank the features according to their importance.

Randomly initialize the input samples, and then divide them into 70%/30% training/testing partitions.

During this process, we obtain the optimal input parameter pair by using the SAPSO algorithm; the details of the SAPSO algorithm are presented in Figure

Predict the test sets utilizing the optimal CS-SVM model.

Average the reported results obtained by 5-fold cross validation.

Results’ analysis and discussion.
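The steps above can be sketched end-to-end with scikit-learn, whose built-in `load_breast_cancer()` ships the WDBC data. This is a rough stand-in, not the paper's implementation: mutual information approximates IG ranking, a fixed class-weight ratio replaces the SAPSO-optimized parameter pair, and the cutoff k = 15 is an assumed value.

```python
from functools import partial

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# WDBC: in scikit-learn's encoding, label 0 is malignant and 1 is benign.
X, y = load_breast_cancer(return_X_y=True)

# Normalize -> rank features by mutual information (IG stand-in, keep top k)
# -> cost-sensitive SVM with a heavier penalty on missing malignant cases.
mi_score = partial(mutual_info_classif, random_state=0)
pipe = make_pipeline(
    MinMaxScaler(),
    SelectKBest(mi_score, k=15),                 # k = 15 is an assumption
    SVC(kernel="rbf", class_weight={0: 5.0, 1: 1.0}),
)

# 5-fold stratified cross validation, averaged as in the procedure above.
scores = cross_val_score(pipe, X, y, cv=StratifiedKFold(5, shuffle=True,
                                                        random_state=0))
```

Averaging `scores` over the folds corresponds to the final reporting step of the procedure.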

In order to examine the effectiveness and rationality of the mutual-information-based (IG) feature selection method and further verify the performance of our proposed classification model, we conducted an empirical analysis. The experiment was implemented on the MATLAB 2016a platform; the executing host ran Windows 10 (x64) on an Intel(R) Core(TM) i5-8250U 1.80 GHz CPU with 16 GB RAM.

To evaluate the performance of the proposed methods, an experiment was conducted based on WBC and WDBC datasets from the UCI repository [

Details of the two datasets.

Data set | Number of attributes | Number of cases | Class distribution (B/M) | Missing values |
---|---|---|---|---|

WBC | 10 | 699 | 458/241 | 16 |

WDBC | 32 | 569 | 357/212 | 0 |

Summary of attributes for WBC dataset.

Attribute | Domain | Mean | Standard error |
---|---|---|---|

Clump Thickness | 1~10 | 4.44 | 2.82 |

Uniformity of Cell Size | 1~10 | 3.15 | 3.07 |

Uniformity of Cell Shape | 1~10 | 3.22 | 2.99 |

Marginal Adhesion | 1~10 | 2.83 | 2.86 |

Single Epithelial Cell Size | 1~10 | 3.23 | 2.22 |

Bare Nuclei | 1~10 | 3.54 | 3.64 |

Bland Chromatin | 1~10 | 3.45 | 2.45 |

Normal Nucleoli | 1~10 | 2.87 | 3.05 |

Mitoses | 1~10 | 1.60 | 1.73 |

Summary of attributes for WDBC dataset.

Attribute | Mean (range) | Standard error (range) | Maximum (range) |
---|---|---|---|

Radius | 6.98~28.11 | 0.112~2.873 | 7.93~36.04 |

Texture | 9.71~39.28 | 0.36~4.89 | 12.02~49.54 |

Perimeter | 43.79~188.50 | 0.76~21.98 | 50.41~251.20 |

Area | 143.50~2501.00 | 6.80~542.20 | 185.20~4354.00 |

Smoothness | 0.053~0.163 | 0.002~0.031 | 0.071~0.223 |

Compactness | 0.019~0.345 | 0.002~0.135 | 0.027~1.058 |

Concavity | 0.000~0.427 | 0.000~0.396 | 0.000~1.252 |

Concave points | 0.000~0.201 | 0.000~0.053 | 0.000~0.291 |

Symmetry | 0.106~0.304 | 0.008~0.079 | 0.157~0.664 |

Fractal dimension | 0.050~0.097 | 0.001~0.030 | 0.055~0.208 |

To evaluate the performance of our proposed hybrid algorithm, the classification accuracy (ACC), average misclassification cost (AMC), and G-mean are utilized as evaluation metrics. ROC analysis is a widely utilized method for analyzing the performance of binary classifiers, and the G-mean is the geometric mean of the true positive rate (TPR) and true negative rate (TNR), which is proposed to evaluate the performance of a classifier on imbalanced data. The calculation formulas are presented as follows:

The evaluation methods are based on the confusion matrix, which is shown in Table

Confusion matrix.

Predicted positive | Predicted negative | |
---|---|---|

Actual positive | True positive (TP) | False negative (FN) |

Actual negative | False positive (FP) | True negative (TN) |
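Given the confusion matrix above, ACC, G-mean, and AMC can be computed as follows; the 5:1 false-negative to false-positive cost ratio is an illustrative assumption (correct predictions cost 0, as in the cost matrix above).

```python
import math

def metrics(tp, fn, fp, tn, cost_fn=5.0, cost_fp=1.0):
    """ACC, G-mean, and average misclassification cost (AMC) from a
    confusion matrix. The cost ratio cost_fn:cost_fp is assumed."""
    n = tp + fn + fp + tn
    acc = (tp + tn) / n
    tpr = tp / (tp + fn)                      # sensitivity
    tnr = tn / (tn + fp)                      # specificity
    gmean = math.sqrt(tpr * tnr)
    amc = (cost_fn * fn + cost_fp * fp) / n   # correct predictions cost 0
    return acc, gmean, amc
```

For example, a classifier with TP = 90, FN = 10, FP = 5, TN = 95 has ACC = 0.925 but AMC = 0.275, so the ten missed positives dominate the cost even though accuracy looks high.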

This section presents the detailed experimental procedure of our proposed approach. In the experiments, we utilized the 3 × 5-fold cross-validation method to obtain the final results. In each fold, the training dataset is first fed to the IG feature selection algorithm, generating different feature subsets. In this work, the results of IG for the two datasets are presented in Tables

The order of features based on IG for WBC dataset.

Rank | Feature name |
---|---|

| Mitoses |

| Clump Thickness |

| Marginal Adhesion |

| Normal Nucleoli |

| Single Epithelial Cell Size |

| Bland Chromatin |

| Bare Nuclei |

| Uniformity of Cell Shape |

| Uniformity of Cell Size |

The order of features based on IG for WDBC dataset.

Rank | Feature name | Rank | Feature name | Rank | Feature name |
---|---|---|---|---|---|

| Texture 04 | | symmetry 01 | | smoothness 01 |

| Radius 04 | | area 01 | | perimeter |

| Compactness 06 | | concave points | | area 03 |

| Concavity 01 | | compactness 05 | | fractal dimension 01 |

| radius 02 | | symmetry 03 | | perimeter |

| smoothness 02 | | compactness 04 | | compactness 03 |

| symmetry 02 | | concavity 02 | | concavity 03 |

| fractal dimension 02 | | texture 03 | | compactness 01 |

| radius 01 | | radius 03 | | texture 02 |

| texture 01 | | compactness 02 | | area 02 |

The main parameters of SAPSO algorithm.

Parameter | Value |
---|---|

Objective function | Fitness value obtained by formula ( |

Maximum number of evolutions | 300 |

Maximum population size | 20

Acceleration factor of c_{1} | 1.9 |

Acceleration factor of c_{2} | 1.9 |

The maximum value of individual speed | 0.5 |

The minimum value of individual speed | -0.5 |

Initial temperature | 100 |

Classification results of our proposed method with WBC test set.

Classification results of our proposed method with WDBC test set.

ROC curve of our proposed intelligent classification method based on WBC dataset.

ROC curve of our proposed intelligent classification method based on WDBC dataset.

In order to verify the superiority of our proposed method, we conducted two test sequences. First, our approach was compared with some previous works proposed by other authors; the results are presented in Table

Comparison of our proposed approach with previous works.

Author | Year | Model | Dataset | ACC(%) | AMC | G-mean(%) | Sen(%) | Spec(%) |
---|---|---|---|---|---|---|---|---|

Karabatak[ | 2009 | Neural network classification with association rules for reducing the dimension. | WBC | 95.60 | — | — | — | — |

Zheng[ | 2014 | Support vector machine algorithms with K-means for feature extraction | WBC | 97.38 | — | — | — | — |

Nahato[ | 2015 | Rough set indiscernibility relation method and the backpropagation neural network | WBC | 98.61 | — | 98.60 | 98.76 | 98.57 |

Wang[ | 2018 | SVM-based ensemble learning algorithm | WBC | 97.10 | — | 97.17 | 97.11 | 97.23 |

WDBC | 97.68 | — | 97.09 | 94.75 | 99.49 | |||

Proposed | — | Cost-sensitive SVM with IG for feature selection | WBC | 98.74 | 0.064 | 98.13 | 97.88 | 98.38 |

WDBC | 98.83 | 0.129 | 97.35 | 99.01 | 95.71 |

Note: the symbol of “

Comparison of our proposed method with conventional classification models.

Method | Dataset | ACC(%) | AMC | G-mean(%) | Sen(%) | Spec(%) |
---|---|---|---|---|---|---|

SVM(RBF) | WBC | 96.58 | 0.132 | 96.67 | 96.35 | 97.05 |

WDBC | 95.91 | 0.251 | 95.15 | 97.37 | 92.98 | |

PSO-SVM(RBF) | WBC | 95.61 | 0.307 | 94.37 | 97.82 | 91.04 |

WDBC | 97.66 | 0.234 | 97.05 | 100 | 94.20 | |

BP neural network | WBC | 94.11 | 0.324 | 92.20 | 93.30 | 91.30 |

WDBC | 94.72 | 0.526 | 92.93 | 100 | 86.41 | |

LVQ neural network | WBC | 91.56 | 0.627 | 87.88 | 96.55 | 80.00 |

WDBC | 92.75 | 0.724 | 90.26 | 100 | 81.48 | |

3-NN | WBC | 91.10 | 0.485 | 90.90 | 92.50 | 89.30 |

WDBC | 92.60 | 0.602 | 91.42 | 97.50 | 85.71 | |

Decision Tree | WBC | 96.64 | 0.162 | 96.63 | 97.84 | 95.45 |

WDBC | 95.65 | 0.304 | 95.27 | 97.50 | 93.10 | |

Random Forest | WBC | 96.49 | 0.140 | 96.49 | 96.33 | 96.66 |

WDBC | 97.53 | 0.145 | 96.82 | 97.82 | 95.83 | |

Proposed | WBC | 98.04 | 0.064 | 98.13 | 97.88 | 98.38 |

WDBC | 98.83 | 0.129 | 97.35 | 99.01 | 95.71 |

Note: the symbol of “

Comparison of running time for different classification models.

The final results are obtained by the 3 × 5-fold cross-validation method and are reported as the average of the 5-fold cross-validation results. As can be observed from the results listed, the proposed classification model is the most suitable method for breast cancer diagnosis, obtaining promising results and achieving classification accuracies of 98.04% for the WBC dataset and 98.83% for the WDBC dataset. In order to reflect the actual situation of breast cancer diagnosis, we introduce the misclassification cost as a measurement to evaluate the unequal costs of different categories in breast cancer diagnosis. As can be observed from the results listed in Table

In this section, we provide a discussion of the performance of our proposed approach. The proposed approach is a hybrid method based on IG for feature selection and ICS-SVM for classification. As mentioned above, the algorithm utilized in the feature selection stage, the classifier utilized for classification, and the meta-heuristic algorithm employed to search for the optimal parameter pair of the proposed classifier are the essential factors in building our classifier.

In this regard, an extensive experimental analysis has been implemented on WBC and WDBC breast cancer datasets. In order to evaluate the performances of our proposed approach, we implemented two test sequences. The first is to compare our proposed method with some previous works, and the results are presented in Table

To highlight the importance of utilizing IG for feature selection in our proposed approach, we also compared our proposed method with two alternative methods. One utilizes the grid-search method to tune the SVM's parameter pair, and the other utilizes the PSO algorithm to optimize the SVM's parameter pair. The results of these comparison methods are presented in Table

From the empirical results on the WBC and WDBC breast cancer datasets, we can deduce that our proposed approach is the most suitable method for breast cancer diagnosis: it produces excellent performance and requires only minimal computational cost for solving the breast cancer classification problem. Promisingly, our proposed approach may be adapted to the diagnosis of other diseases.

In this study, an improved CS-SVM classifier is proposed for breast cancer diagnosis. The proposed approach not only takes full account of the unequal misclassification costs of breast cancer diagnosis, but also employs IG for feature selection and utilizes a meta-heuristic method to optimize the classifier. In order to verify the performance of our proposed approach, the proposed improved classifier was evaluated by several experiments on the WBC and WDBC datasets, and the experimental results demonstrated that our proposed approach yields promising results for breast cancer diagnosis in comparison to some previous works and conventional classification methods (i.e., SVM (RBF), PSO-SVM (RBF), BP neural network, LVQ neural network, 3-NN, decision tree, and random forest). The main objective of this work was to construct an effective classifier for breast cancer diagnosis; we expect our research to be utilized in real clinical diagnostic systems and thereby assist clinical physicians in making correct and effective decisions in the future.

All the datasets utilized in this paper come from the UCI Machine Learning Repository.

The authors declare that there are no conflicts of interest regarding the publication of this paper.

This work was supported by the National Natural Science Foundation of China (71571105).