Cervical Cancer Diagnosis Model Using Extreme Gradient Boosting and Bioinspired Firefly Optimization

Cervical cancer is a frequently deadly disease that is common in females. However, early diagnosis of cervical cancer can reduce the mortality rate and other associated complications. Cervical cancer risk factors can aid early diagnosis. For better diagnostic accuracy, we propose a study for early diagnosis of cervical cancer using a reduced risk-feature set and three ensemble-based classification techniques, i.e., Extreme Gradient Boosting (XGBoost), AdaBoost, and Random Forest (RF), along with the Firefly algorithm for optimization. The Synthetic Minority Oversampling Technique (SMOTE) was used to alleviate the data imbalance problem. The Cervical Cancer Risk Factors data set, containing 32 risk factors and four targets (Hinselmann, Schiller, Cytology, and Biopsy), is used in the study. The four targets are widely used diagnostic tests for cervical cancer. The effectiveness of the proposed study is evaluated in terms of accuracy, sensitivity, specificity, positive predictive accuracy (PPA), and negative predictive accuracy (NPA). Moreover, the Firefly feature selection technique was used to achieve better results with a reduced number of features. Experimental results reveal the significance of the proposed model, which achieved the highest outcome for the Hinselmann test when compared with the other three diagnostic tests. Furthermore, the reduction in the number of features enhanced the outcomes. Additionally, the performance of the proposed models is notable in terms of accuracy when compared with other benchmark studies for cervical cancer diagnosis using the reduced risk factors data set.


Introduction
Cervical cancer is one of the most commonly occurring types of cancer in females and mostly develops during their midlives (35-44 years) [1]. This type of cancer can be fatal, as it does not show clear symptoms in its early stages. Symptoms usually appear in late stages, when it could have spread to other organs such as the bones, liver, lymph nodes, and lungs. One of the early signs of cervical cancer is a blockage of the tube that carries urine from the kidney. Other late symptoms that can appear are vaginal bleeding, pelvic pain, weight loss, and leg pain [2]. The risk factors that lead to the development of cervical cancer include hormone-containing medicines, birth control pills, smoking, and the number of pregnancies. However, it is believed that human papillomavirus (HPV) is the major factor in developing cervical cancer [2]. HPV is a common sexually transmitted infection; it is usually harmless, but sometimes it may lead to cancer [3]. Individuals with an HPV infection are at a higher risk of developing cervical cancer. Furthermore, the probability of developing cervical cancer increases if one possesses more than one risk factor. As the cancer does not show signs in its early stages, regular checkups are required, especially for those who have the risk factors. In developing countries, the lack of medical equipment and the cost of conducting checkups can also be a burden. With the advent and advancement of machine learning, it has become possible to find robust solutions for the early diagnosis of cancer cases using data-driven approaches.
Various studies have contributed to the field of cervical cancer diagnosis using several classification techniques and different types of data, such as clinical, image, and genetic data. In our study, we used clinical cervical risk factor data. Two similar studies were conducted by Wu and Zhou [4] and Abdoh et al. [5]; they performed a comparative analysis of two feature selection techniques, namely, recursive feature elimination (RFE) and Principal Component Analysis (PCA). The first study used a Support Vector Machine (SVM), and the other study used Random Forest (RF). Both studies used the same number of features. As the data suffered from imbalance, oversampling was applied to the data in [4] and SMOTE was used in [5]. Both studies identified two risk factors to remove, namely, time since the first and last diagnosis of STDs (sexually transmitted diseases), due to a large number of missing entries. Furthermore, the study [4] found that lower computational cost was an advantage of both SVM-PCA and SVM-RFE, whereas high computational cost is a limitation of the plain SVM model. Moreover, STDs, intrauterine device (IUD), hormonal contraceptives, and first sexual intercourse were identified as the most relevant features [5]. Overall, the outcome of both studies showed that using 30 features produced the highest results. Furthermore, it was found that the SMOTE-RF model performed well for all targets.
Similarly, Lu et al. [6] and Karim and Neehal [7] used ensemble models to estimate the risk of cervical cancer. Both studies performed a data cleaning mechanism to replace missing values. The former study used an ensemble classifier with a voting strategy on a combination of a private and a public data set. The private data set contains 472 records taken from a Chinese hospital and was collected using a questionnaire. The public data set was obtained from the UCI repository; 14 features were used. The results revealed that the voting ensemble classifier produced better results when compared to Linear Regression, Decision Tree (DT), Multilayer Perceptron (MLP), SVM, and K-NN classifiers. On the other hand, the Karim and Neehal study used DT, MLP, and SVM with Sequential Minimal Optimization (SMO) and K-nearest neighbor (KNN) techniques. Experiments showed that SMO has better performance in terms of accuracy, precision, recall, and F-measure. Similarly, Ul-Islam et al. [8] used DT, RF, Logistic Model Tree, and ANN for cervical cancer detection. The Apriori algorithm was used to identify features that strongly relate to cancer. The study found that age, number of sexual partners, hormonal contraceptives, number of pregnancies, and first sexual intercourse are significant risk factors. Results indicated that RF produced the best outcome when compared to the other models.
Al-Wesabi et al. [9] conducted a comparison between different machine learning classifiers such as Gaussian Naïve Bayes (GNB), KNN, DT, LR, and SVM. The outcome of the classifiers was not satisfactory due to the data imbalance. To resolve this problem, undersampling, oversampling, and SMOTETomek were applied; oversampling gave the best result among the three methods. Moreover, a Sequential Feature Selector was applied in both forward and backward versions. Both the Sequential Forward Feature Selector (SFS) and the Sequential Backward Feature Selector (SBS) enhanced the performance of the prediction with an accuracy of 95%. After selecting the common features between DT and KNN, the accuracy exceeded 97% for the DT. The results revealed that age, first sexual intercourse, number of pregnancies, smoking, hormonal contraceptives, and STDs: genital herpes were the main predictive features.
Similarly, several studies have used deep learning and transfer learning for cervical cancer diagnosis. Fernandes et al. [10] and Adem et al. [11] used deep learning and showed significant outcomes in terms of diagnostic accuracy. The study [10] used a loss function that provides supervised optimization of dimensionality reduction and classification models. The study indicated that the approach can be useful for examining patient records when the Biopsy and perhaps other testing results are absent, and that it is capable of successfully classifying whether patients have cervical cancer or not. On the other hand, the researchers in [11] used a deep neural network model with a softmax function to classify the data sets. The performance of the softmax function with a stacked autoencoder was compared with other machine learning methods (DT, KNN, SVM, Feed Forward NN, and Rotation Forest models). It was found that the softmax function with a stacked autoencoder model produced a better classification rate of 97.8%.
Similarly, Fernandes et al. [12] applied transfer learning with partial observability for cancer screening. A limitation of the study was that several patients resisted answering some questions due to privacy concerns. Challenges were also faced in defining quality, as there are multiple readings and the process relied on human preference. Therefore, as an alternative to an ordinal scale, a simple binary scheme was used. Nevertheless, the model performance was considerable.
Conclusively, the finding from the above-mentioned literature is that the data set found in the UCI repository has many missing values; therefore, previous studies removed at least 2 features. The missing values were due to patients' concerns regarding their privacy. After removing 2 features due to a huge number of missing values, SVM-PCA provided satisfactory performance. However, SMO and SMOTE-RF were amongst the best performing models. Another approach to deal with the imbalance in the UCI cervical risk factor data set was oversampling. Deep learning proved to be effective, especially when the Biopsy and possibly other screening results are absent. Age, first sexual intercourse, number of pregnancies, smoking, hormonal contraceptives, IUD, STDs, STDs: genital warts, and HPV infections were identified as the top key features. The significant outcomes achieved by machine learning classifiers motivate further investigation and enhancement of the outcomes for the prediction of cervical cancer.
In this study, three ensemble-based classifiers, Extreme Gradient Boosting, AdaBoost, and RF, are used to classify cervical cancer. The Cervical Cancer Risk Factors data set from the UCI machine learning repository was collected at "Hospital Universitario de Caracas" in Caracas, Venezuela [13]. In addition to the importance of correctly classifying cancerous and noncancerous cases, it is also essential to identify the key risk factors that contribute to developing cancer. The nature-inspired Firefly feature selection and optimization algorithm was applied. Furthermore, the Synthetic Minority Oversampling Technique (SMOTE) is used to balance the classes of the data, as it suffers greatly from the imbalance problem. The paper is organized as follows: Section 2 presents material and methods. Section 3 contains the experimental setup and results. The comparison of the proposed model with existing studies using the same data set is discussed in Section 4. Finally, Section 5 contains the conclusion.

Dataset Description.
The cervical cancer risk factors data set used in the study was collected at "Hospital Universitario de Caracas" in Caracas, Venezuela, and is available in the UCI Machine Learning repository [13]. It consists of 858 records, with some missing values, as several patients did not answer some of the questions due to privacy concerns. The data set contains 32 risk factors and 4 targets, i.e., the diagnostic tests used for cervical cancer. It contains different categories of features, such as habits, demographic information, history, and genomic medical records. Features such as age, Dx: Cancer, Dx: CIN, Dx: HPV, and Dx contain no missing values. Dx: CIN is a change in the walls of the cervix, commonly due to HPV infection; sometimes it may lead to cancer if not treated properly. The Dx: Cancer variable represents whether the patient has other types of cancer, since a patient may have more than one type of cancer. In the data set, some patients do not have cervical cancer but still have the Dx: Cancer value set to true; therefore, it is not used as a target variable. Table 1 presents a brief description of each feature with its type. Cervical cancer diagnosis usually requires several tests; this data set contains the widely used diagnostic tests as targets. Hinselmann, Schiller, Cytology, and Biopsy are four widely used diagnostic tests for cervical cancer. Hinselmann, or Colposcopy, is a test that examines the inside of the vagina and cervix using a tool that magnifies the tissues to detect any anomalies [3]. Schiller is a test in which a chemical substance called iodine is applied to the cervix, where it stains healthy cells brown and leaves abnormal cells uncolored. Cytology is a test that examines body cells from the uterine cervix for cancerous cells or other diseases. Biopsy refers to the test in which a small part of cervical tissue is examined under a microscope. Most Biopsy tests can make a significant diagnosis.

Dataset Preprocessing.
The data set suffers from a huge number of missing values; 24 features out of the 32 contain missing values. Initially, the features with a huge percentage of missing values were removed. The STDs: Time since first diagnosis and STDs: Time since last diagnosis features were removed since they have 787 missing values (see Table 2), which is more than half of the data. However, data imputation was performed for the features with a smaller number of missing values. The most-frequent-value technique was used to impute the remaining missing values. Additionally, the data set suffers from a huge class imbalance: the positive target labels number only 35 for Hinselmann, 74 for Schiller, 44 for Cytology, and 55 for Biopsy out of the 858 records, as shown in Figure 1. SMOTE was used to deal with the class imbalance. SMOTE oversamples the minority class by generating new synthetic data for minority instances based on nearest neighbors, using the Euclidean distance between data points [14]. Figure 1 shows the number of records per class label in the data set.
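As an illustration of these preprocessing steps, the sketch below applies most-frequent-value imputation and a minimal SMOTE-style interpolation to toy data. The toy matrix and the `smote_like` helper are illustrative assumptions only; in practice, the imbalanced-learn library's `SMOTE` class implements the full algorithm.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(42)

# Toy feature matrix with missing entries (np.nan), standing in for the
# risk-factor columns; the real data set has 858 records.
X = rng.integers(0, 3, size=(20, 4)).astype(float)
X[rng.random(X.shape) < 0.2] = np.nan

# Most-frequent-value imputation, as used in the study.
X_imp = SimpleImputer(strategy="most_frequent").fit_transform(X)
assert not np.isnan(X_imp).any()

# Minimal SMOTE-style oversampling: interpolate between a minority sample
# and one of its nearest minority neighbours (this is only the core idea,
# not the full SMOTE algorithm).
def smote_like(X_min, n_new, k=3, rng=rng):
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # k nearest neighbours (self excluded)
        j = rng.choice(nn)
        gap = rng.random()                   # random point on the line segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = X_imp[:5]                            # pretend these are the minority class
X_new = smote_like(X_min, n_new=10)
print(X_new.shape)
```

After generating the synthetic minority samples, they would be stacked onto the original data before training, which is what balances the four target labels in the study.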

Firefly Feature Selection.
Dimensionality reduction is one of the effective ways to select the features that improve the performance of a supervised learning model. In this study, we adopted the nature-inspired Firefly algorithm for selecting the features that better formulate the problem. Firefly was proposed by Yang [15] and was initially intended for optimization. The metaheuristic Firefly algorithm is inspired by the flashing behavior of fireflies. It is a population-based optimization algorithm for finding the optimal value or parameter of a target function. In this technique, each fly is attracted by the glow intensity of the nearby flies; if the intensity of the gleam is extremely low at some point, the attraction declines. Firefly uses three rules: (a) all the flies are of the same gender; (b) attractiveness depends on the intensity of the glow; (c) the target function generates the gleam of the firefly. The flies with less glow move towards the flies with a brighter glow, and the brightness can be adjusted using the objective function. The same idea is implemented in the algorithm to search for the optimal features that can better fit the training model. Firefly is more computationally economical and has produced better outcomes in feature selection when compared with other metaheuristic techniques like genetic algorithms and particle swarm optimization [16]. The time complexity of Firefly is O(n²t) [17]. It uses the light intensity to select the features: highly relevant features are represented as features with high-intensity light.
For feature selection, initially, some fireflies are generated, and each fly randomly assigns weights to all features. In our study, we generated 50 fireflies (n = 50). The dimension of the data set is 30. Furthermore, the lower bound was set to −50, while the upper bound was set to 50. The maximum number of generations was 500. Additionally, α (alpha) was initially set to 0.5, and in every subsequent iteration, equations (1) and (2) were used to update the α (alpha) value. The gamma (γ) parameter was set to 1. The number of features selected using Firefly was 15 for Hinselmann, 13 for Schiller, 11 for Cytology, and 11 for Biopsy, respectively.
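The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: continuous positions are thresholded at zero to obtain a feature subset, and the `brightness` function is a hypothetical stand-in for the cross-validated model score the study would use; the population size, bounds, and generation count are scaled down from the study's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

n_flies, n_feat, n_gen = 10, 8, 30
beta0, gamma, alpha = 1.0, 1.0, 0.5

# Hypothetical fitness: pretend features 0-2 are "informative", so brightness
# is the fraction of informative features selected minus a small size penalty.
informative = np.array([0, 1, 2])
def brightness(mask):
    return mask[informative].sum() / 3 - 0.05 * mask.sum()

def to_mask(p):
    # Threshold continuous firefly positions into a binary feature subset.
    return (p > 0).astype(int)

pos = rng.uniform(-1, 1, size=(n_flies, n_feat))   # continuous positions

for _ in range(n_gen):
    light = np.array([brightness(to_mask(p)) for p in pos])
    for i in range(n_flies):
        for j in range(n_flies):
            if light[j] > light[i]:                # move dimmer fly toward brighter
                r2 = np.sum((pos[i] - pos[j]) ** 2)
                beta = beta0 * np.exp(-gamma * r2)  # attractiveness decays with distance
                pos[i] += beta * (pos[j] - pos[i]) + alpha * rng.uniform(-0.5, 0.5, n_feat)
    alpha *= 0.97                                   # decay the randomness each generation

best = to_mask(pos[np.argmax([brightness(to_mask(p)) for p in pos])])
print("selected features:", np.flatnonzero(best))
```

The key design choice is the size penalty inside the brightness function, which is what pushes the search toward the reduced feature subsets (15, 13, 11, and 11 features) reported above.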

Ensemble-Based Classification Methods.
Three ensemble-based classification techniques, namely, Random Forest, Extreme Gradient Boosting, and AdaBoost, were used to train the model. These techniques are described in the sections below.

Random Forest.
Random Forest (RF) was first proposed by Breiman in 2001 [18]. Random Forest is an ensemble model that uses decision trees as the individual models and bagging as the ensemble method. It improves on a single decision tree by combining many trees, which reduces overfitting. RF can be used for both classification and regression. RF builds a forest of decision trees, obtains a prediction from each of them, and then selects the solution with the maximum votes [19].
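A minimal usage sketch with scikit-learn, on synthetic data standing in for the risk-factor records (the data and hyperparameters here are illustrative only, not the study's tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the risk-factor records.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# A bagged ensemble of decision trees; predictions are majority votes.
rf = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=42)
rf.fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))

# Impurity-based feature importances reflect how much each feature
# decreases Gini impurity across the forest.
print("top feature:", rf.feature_importances_.argmax())
```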
When training a tree, it is important to measure how much each feature decreases the impurity, as the decrease in impurity indicates the significance of the feature. The classification result of a tree depends on the impurity measure used. For classification, the impurity measures are either Gini impurity or information gain; for regression, the impurity measure is variance. Training a decision tree consists of iteratively splitting the data. Gini impurity decides the best split of the data using the formula

Gini = 1 − Σᵢ p(i)²,

where p(i) is the probability of selecting a data point with class i. Information gain (IG) is another measure to decide the best split of the data, depending on the gain of each feature. It is computed from the entropy H(S) = −Σᵢ p(i) log₂ p(i) as

IG(S, A) = H(S) − Σᵥ (|Sᵥ|/|S|) H(Sᵥ),

where the Sᵥ are the subsets produced by splitting S on attribute A.

Extreme Gradient Boosting.
Extreme Gradient Boosting (XGBoost) [20] can be used for classification, regression, and ranking problems. XGBoost is a type of gradient boosting. Gradient Boosting (GB) is a boosting ensemble technique that builds predictors sequentially instead of individually. GB produces a strong classifier by combining weak classifiers [21]. The goal of GB is to build an iterative model that optimizes a loss function; it pinpoints the failings of weak learners by using gradients of the loss function [21]. The model is assumed to be of the form y = F(x) + e, where e denotes the error term. The loss function measures how well the model fits the underlying data and depends on the optimization goal: for regression, it is a measure of the error between the true and predicted values, whereas, for classification, it measures how well the model classifies cases correctly [21]. This technique takes less time and fewer iterations, since predictors learn from the past mistakes of the other predictors. GB works by building a sequence of models of the form

F_{m+1}(x) = F_m(x) + h(x),

minimizing a loss function, e.g., the mean squared error (MSE). The estimator h is fitted to y − F_m(x), i.e., the residual, the difference between the true value and the predicted value. Thus, each step attempts to correct the errors of the previous model F_m [22].
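Returning to the tree-splitting criteria, Gini impurity and information gain can be computed directly. The small example below evaluates both measures for a perfect binary split; the toy label vector is illustrative only.

```python
import numpy as np

def gini(labels):
    # Gini = 1 - sum_i p(i)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # H(S) = -sum_i p(i) * log2(p(i))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(parent, left, right):
    # IG = H(parent) minus the size-weighted average of child entropies.
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]          # a perfect split
print(gini(parent))                           # 0.5
print(info_gain(parent, left, right))         # 1.0
```

A perfect split yields pure children (zero entropy), so the information gain equals the parent entropy, which is why the tree would prefer this split over any impure alternative.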
XGBoost outperforms AdaBoost in terms of speed and performance. It is highly scalable and runs up to 10 times faster than traditional single machine learning algorithms. XGBoost handles sparse data and implements several optimization and regularization techniques. Moreover, it also uses parallel and distributed computing.
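The residual-fitting loop described above can be sketched with plain regression trees; XGBoost builds on this same idea and adds regularization, sparsity handling, and parallel tree construction. The data, depth, and learning rate below are illustrative assumptions, not the study's tuned values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

# Gradient boosting with squared loss: each tree h_m is fitted to the
# residual y - F_m(x), and F_{m+1} = F_m + lr * h_m.
lr, n_rounds = 0.1, 100
F = np.full_like(y, y.mean())                 # F_0: constant prediction
trees = []
for _ in range(n_rounds):
    h = DecisionTreeRegressor(max_depth=2).fit(X, y - F)
    F += lr * h.predict(X)                    # step toward the negative gradient
    trees.append(h)

print("train MSE:", np.mean((y - F) ** 2))
```

The training MSE falls well below the initial variance of y, showing each round correcting the residual errors of the previous model; the `xgboost` package (`XGBClassifier`/`XGBRegressor`) exposes the same loop through a scikit-learn-style API.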

AdaBoost.
Adaptive Boosting (AdaBoost) is a meta-learner originally proposed for binary classification by Freund and Schapire [23]. It is an ensemble technique that builds a meta-classifier by combining several weak classifiers using progressive learning.
AdaBoost uses the boosting data sampling concept: adaptive sampling is used to assign high weights to misclassified instances. The misclassified samples are preferentially selected in the next iteration to better train the model, and the final prediction is made using weighted voting. AdaBoost has a reduced error rate, has a better effect on the prediction compared to bagging [24], and uses decision tree stumps. Initially, all the samples in the data set have equal weights. Let x be the number of samples in the data set, and let y be the target, a binary class represented by 0 and 1. The first decision tree stump uses some records from the data set, and predictions are performed. After the initial prediction, the sample weights are updated: more weight is assigned to the data samples that were misclassified. The samples with high weights are selected in the next iteration. The process continues until the error rate is sufficiently reduced or a certain target level is achieved.
AdaBoost consists of two main steps, combination and stepping forward, using a sequential iterative approach. All the instances in the training set have equal weights in the first iteration. However, in subsequent iterations, the weights are changed based on the error rates: misclassified instances receive increased weights. Consider a binary classification problem with a training set of T samples (x₁, y₁), …, (x_T, y_T). Let C be the linear combination of weak classifiers:

C(x) = Σₙ wₙ Cₙ(x),

where N is the number of weak classifiers, wₙ represents the weights, and Cₙ(x) represents the weak classifiers. In each iteration, the classifier is trained based on the performance of the classifier in the previous iteration:

C_t(x) = C_{t−1}(x) + w_t c_t(x),

where C_t(x) represents the combined classifier at iteration t and C_{t−1}(x) is the classifier from iteration t − 1. The weights can be calculated using the following equation:

wₙ = (1/2) ln((1 − ϵₙ)/ϵₙ),

where ϵₙ represents the error rate of the weak classifier.
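A compact sketch of these update rules, using decision stumps on synthetic data (the data set and the number of boosting rounds are illustrative assumptions; scikit-learn's `AdaBoostClassifier` packages the same procedure):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=1)
y_pm = 2 * y - 1                              # labels mapped to {-1, +1}

n_rounds = 20
w = np.full(len(y), 1 / len(y))               # equal sample weights initially
stumps, alphas = [], []
for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = 2 * stump.predict(X) - 1
    eps = np.sum(w[pred != y_pm])             # weighted error rate of this stump
    alpha = 0.5 * np.log((1 - eps) / (eps + 1e-10))   # classifier weight
    w *= np.exp(-alpha * y_pm * pred)         # raise weights of misclassified samples
    w /= w.sum()                              # renormalize to a distribution
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the weighted vote of the stumps.
vote = sum(a * (2 * s.predict(X) - 1) for a, s in zip(alphas, stumps))
acc = np.mean(np.sign(vote) == y_pm)
print("training accuracy:", acc)
```

Each round's `alpha` is exactly the weight formula above, so accurate stumps (small ϵ) dominate the final vote while inaccurate ones contribute little.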

Optimization Strategy.
This section discusses the optimization strategy used to find the hyperparameter combination that produces the highest targeted outcomes. The Firefly optimization algorithm was used for parameter tuning; the details of Firefly are discussed in Section 2.3. Table 3 presents the hyperparameter values of Random Forest for all four targets; for RF, the "gini" index criterion was used. Table 4 presents the parameters used for XGBoost: the gbtree booster was used with a random state of 42 and a learning rate of 0.05. Similarly, Table 5 presents the optimal hyperparameter values for AdaBoost. Furthermore, Figures 2-4 present the grid search optimization graphs for the Random Forest, Extreme Gradient Boosting, and AdaBoost classifiers.
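A grid search of the kind visualized in Figures 2-4 can be reproduced with scikit-learn's `GridSearchCV`. The parameter grid below is a hypothetical example for illustration; the study's actual Firefly-tuned values are the ones listed in Tables 3-5.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the risk-factor data.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Hypothetical grid; a real search would cover the ranges explored by Firefly.
grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(
    RandomForestClassifier(criterion="gini", random_state=42),
    grid, cv=5, scoring="accuracy",
)
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

Grid search evaluates every combination exhaustively, which is why a metaheuristic such as Firefly is attractive when the search space grows: it samples the space with a population instead of enumerating it.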

Experimental Setup and Results
The model was implemented in Python 3.8.0 using the Jupyter Notebook environment. The scikit-learn library was used for the classifiers along with other needed built-in tools, while a separate library (xgboost 1.2.0) was used for the XGBoost ensemble.
K-fold cross-validation with K = 10 was used for partitioning the data into training and testing. Five evaluation measures, namely, accuracy, sensitivity (recall), specificity, positive predictive accuracy (PPA), and negative predictive accuracy (NPA), were used. Sensitivity and specificity are emphasized in the study due to the application of the proposed model. Accuracy denotes the percentage of correctly classified cases, sensitivity measures the percentage of positive cases that were classified as positive, and specificity refers to the percentage of negative cases that were classified as negative. Moreover, the selection of the performance evaluation measures depends on the measures used in the benchmark studies. Two sets of experiments were conducted for each target: using the features selected by the Firefly feature selection algorithm and using all 30 features. The SMOTE technique was applied to generate synthetic data. The results of the model are presented in the sections below. Table 7 presents the outcomes for the Schiller test. As with the Hinselmann target, XGBoost with selected features performed best for Schiller. However, the outcomes achieved by the model for Schiller are lower when compared with the Hinselmann target class.
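The evaluation measures can be derived from a 10-fold cross-validated confusion matrix, as sketched below on synthetic imbalanced data (the data and classifier settings are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# Imbalanced synthetic data: roughly 70% negatives, 30% positives.
X, y = make_classification(n_samples=300, n_features=10, weights=[0.7], random_state=0)

# 10-fold cross-validated predictions, then the study's measures
# computed from the pooled confusion matrix.
pred = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=10)
tn, fp, fn, tp = confusion_matrix(y, pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)                  # recall on the positive class
specificity = tn / (tn + fp)                  # recall on the negative class
ppa         = tp / (tp + fp)                  # positive predictive accuracy (precision)
npa         = tn / (tn + fn)                  # negative predictive accuracy
print(round(accuracy, 3), round(sensitivity, 3), round(specificity, 3))
```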

Biopsy.
Similarly, the performance was not drastically different, yet using all the features resulted in a higher accuracy than using SMOTE with selected features for Biopsy, as shown in Table 9. XGBoost obtained the highest accuracy of 97.1% with all features. However, for the other measures, the performance of XGBoost was better with the selected features. Similar performance was achieved for all measures when classifying with RF for both feature sets, 30 and selected, respectively. The number of selected features used for the Biopsy target class was 11.
Overall, after comparing all four diagnostic tests, the Hinselmann test achieved the best outcome and can be used for the diagnosis of cervical cancer, as shown in Table 10. As per the outcomes achieved in the proposed study, the Hinselmann diagnostic test performs better than the other cervical cancer diagnostic tests, i.e., Schiller, Biopsy, and Cytology. Similar findings were made in the Abdoh et al. [5] and Wu and Zhou [4] studies.

Comparison with Existing Studies
The study used three ensemble techniques: AdaBoost, Extreme Gradient Boosting, and Random Forest. Furthermore, the proposed study is the pioneer in using a bioinspired algorithm for feature selection and optimization for cervical cancer diagnosis. To explore the significance of our proposed study, its outcome was compared with the benchmark studies. The criterion for selecting the benchmark studies was the use of the same data set for the diagnosis of cervical cancer. Table 11 contains the comparison of the proposed technique with the benchmark studies in the literature. The best outcomes in the benchmark studies were achieved using 30 features. However, some of the outcomes in the previous studies were achieved with reduced features; the number in brackets next to some of the outcomes represents the number of features. Based on Table 11, the proposed study outperforms the two benchmark studies in terms of accuracy with reduced risk factors. However, the achieved sensitivity and NPA are lower than those of Wu and Zhou [4] but higher than those of Abdoh et al. [5]. The number of features in the Wu and Zhou study is 30, while the proposed study used reduced risk factors. The specificity and PPA of the proposed study are higher than those of the benchmark studies, except for the Schiller diagnostic test.
In a nutshell, the main contributions of the current study are applying a bioinspired algorithm for feature selection and for model optimization on cervical cancer risk factors. The proposed model enhanced the outcomes when compared with previous studies on the cervical cancer risk factors data set. Despite the above-mentioned advantages, the study suffers from some limitations: the data set suffers from a huge imbalance, and augmented data was generated using SMOTE. Moreover, the current study was based on an open-source data set, and further testing is required using other real and open-source data sets. To alleviate the above-mentioned limitations, the model needs to be validated on real data sets from hospitals.

Conclusion
This study presents an investigation of several ensemble techniques, namely, Random Forest, AdaBoost, and Extreme Gradient Boosting, for diagnosing cervical cancer. The data set was obtained from the UCI machine learning repository and contains 858 records, 32 features, and 4 target variables. The target variables are the diagnostic tests used for cervical cancer. Experiments were conducted for each target class separately. Data preprocessing included imputing missing values and class balancing using SMOTE. Moreover, the bioinspired Firefly algorithm was used to optimize the models and to identify the key features. To compare the performance of the models, experiments were conducted with 30 features and with the selected features using SMOTE-balanced data. Extreme Gradient Boosting outperformed the other two models for all four target variables. For future work, the model will be validated on multiple data sets. Also, other models that handle outliers and unbalanced data differently should be investigated.