Hybrid Inception v3 XGBoost Model for Acute Lymphoblastic Leukemia Classification

Acute lymphoblastic leukemia (ALL) is the most common type of pediatric malignancy, accounting for 25% of all pediatric cancers. It is a life-threatening disease that, if left untreated, can cause death within a few weeks. Many computerized methods have been proposed for the detection of ALL from microscopic cell images. In this paper, we propose a hybrid Inception v3 XGBoost model for the classification of ALL from microscopic white blood cell images. In the proposed model, Inception v3 acts as the image feature extractor and the XGBoost model acts as the classification head. Experiments indicate that the proposed model performs better than the other methods identified in the literature. The proposed hybrid model achieves a weighted F1 score of 0.986. Through experiments, we demonstrate that using an XGBoost classification head instead of a softmax classification head improves classification performance on this dataset for several different CNN backbones (feature extractors). We also visualize the attention map of the features extracted by Inception v3 to interpret the features learnt by the proposed model.


Introduction
Leukemia is a malignancy that originates in cells that would otherwise develop into different types of blood cells. Most often, leukemia starts in early forms of white blood cells (WBCs), but some leukemias start in other blood cell types as well. The primary classification of leukemia is based on whether the leukemia is acute (fast-growing) or chronic (slower-growing) and whether it starts in myeloid cells or lymphoid cells. Knowing the specific type of leukemia helps doctors better predict each person's prognosis and select the best treatment.
Acute lymphocytic leukemia (ALL) is also called acute lymphoblastic leukemia. "Acute" means that if left untreated, leukemia can progress rapidly and cause fatality within months. "Lymphocytic" means it develops from early (immature) forms of lymphocytes, a type of WBC.
ALL starts in the bone marrow (the soft inner part of certain bones, where new blood cells are made). Most often, the leukemia cells invade the blood fairly quickly. They can also sometimes spread to other parts of the body, including the lymph nodes, liver, spleen, central nervous system (brain and spinal cord), and testicles (in males). Some cancers can also start in these organs and then spread to the bone marrow, but these cancers are not leukemia.
Acute lymphoblastic leukemia (ALL) is the most common type of childhood cancer and accounts for approximately 25% of pediatric cancers [1]. Approximately 74% of people under the age of twenty who are diagnosed with leukemia are diagnosed with ALL. Most cases occur between the ages of 2 and 5. ALL accounts for less than 1% of all new cancer cases worldwide and also accounts for less than 1% of all cancer-related deaths.
The 5-year survival rate gives us the percent (out of 100) of children and teenagers who live at least 5 years after being diagnosed with cancer. The 5-year survival rate for children between age 0 and 14 is 91%. The 5-year survival rate for people between ages 15 and 19 is 75%. It is rare for ALL to recur after 5 years; hence, children diagnosed with ALL who remain free from the disease after 5 years are generally considered cured.
98% of the children with ALL go into remission, and 85% of those with first-time ALL are expected to have long-term complications. However, the chance of recovery for adults is not high, as the percent of adults cured with current treatment is 20%-40%.
ALL is a life-threatening disease that can rapidly spread through children's bodies if left untreated and can cause death within a few weeks. During the diagnosis of leukemia, a necessary step is for the physician to classify the white blood cells in the bone marrow. This step is difficult and complex, and performing it manually increases both human error and procedure time. The process can be automated by developing computerized methods to classify the white blood cells. Not only does automation decrease the diagnosis time and error, but it is also economical, especially with the increasing trend in digitizing microscopic images.
However, this task is not trivial; there are several challenges associated with the classification of white blood cell (WBC) images, the main challenge being the morphological similarity between the normal and the immature leukemic blast cells. Another challenging aspect in distinguishing WBCs is that they are surrounded by other blood components like red blood cells and platelets.
Several methods and algorithms are used for medical imaging; however, convolutional neural networks (CNNs) have proven to be among the most effective choices. Pretrained neural networks such as VGGNet, ResNet, and Inception have been successfully utilized in various medical imaging applications. Moreover, these CNNs mitigate the lack of sufficient training data, which is a common problem in medical datasets, through transfer learning: the CNNs are first trained on massive generic datasets and then fine-tuned on a specific downstream task using smaller datasets.
Our main motivation in this study is to develop a robust and efficient model for the classification of ALL from microscopic images. Medical image datasets are small, so it is often not feasible to train a CNN from scratch; we therefore aim to leverage the transfer learning ability of pretrained CNN architectures to learn a classifier for the C-NMC 2019 dataset. To improve the performance of these CNNs, we explore the use of different classification heads instead of a conventional softmax classification head. We aim to experiment with several data preprocessing techniques to improve the generalizability and performance of the model. We also aim to investigate and justify our choice of model design through extensive experiments presented in Ablation Study.
To this end, we introduce a hybrid Inception v3 XGBoost model which uses XGBoost as a classification head on top of an Inception v3 model fine-tuned for classification on this dataset. We perform extensive experimentation with several pretrained CNNs and different augmentation techniques.
We also investigate the features learnt by the Inception v3 model by visualizing heat maps of its feature maps using Grad-CAM. We have performed experiments that indicate the effectiveness of our model and justify the design; these experiments are presented in Ablation Study.
The major contributions of this proposed model are the following:
(i) The proposed model gives a high weighted F1 score of 0.98 for the C-NMC 2019 dataset
(ii) The proposed architecture involving the use of XGBoost classification head can be utilized with several CNN backbone feature extractors and results in increased performance (refer to Table 1)
(iii) The model can be interpreted using attention maps of the feature maps extracted by the Inception v3 CNN
The paper is divided into 8 sections. Recent literature pertaining to leukemia detection is reviewed in Section 2. Section 3 briefly describes the dataset used in this study. The proposed model and methodology are discussed in Section 4. The implementation details are provided in Section 5. Section 6 discusses the experimental results. Section 7 presents an ablation study for our hybrid model. Finally, we conclude the study and discuss the future directions in Section 8.
To support reproduction of our results, we provide implementation details in Implementation Details. Moreover, the full code for the experiments conducted in this research is publicly available at https://github.com/ramaneswaran/lymphoblastic-leukemia-detection.

Literature Review
There has been a lot of research into the classification of white blood cells. Early approaches to this problem involve using traditional image processing techniques and machine learning models for classification. Jagadev and Virani [2] present an approach to classify leukemia lymphocyte images using handcrafted image features and SVM classifier. Amin et al. [3] propose yet another method involving SVM classifiers to detect acute lymphoblastic leukemia (ALL) where the geometrical and statistical features of nuclei are used to train the classifier. Rodellar et al. [4] present an approach for morphological characterization and automatic cell image recognition using handcrafted quantitative features. Mahmood et al. [5] experiment with several models including random forest, gradient-boosted machine, and CART for the detection of pediatric ALL; from their experiments, they conclude that the best fitting model for the dataset used in the research was the CART model. In recent literature, deep learning-based methods have been utilized for ALL classification and have met with significant success. Pretrained CNNs, as well as custom CNNs, have been successfully trained and tested on several cell classification tasks.
Table 1: Comparison of the proposed approach with its counterparts.
Method                      Accuracy   Dataset       Year
Yu et al. [29]              88.50%     DTH           2017
Mourya et al. [30]          89.62%     ISBI          2018
Kassani et al. [31]         96.17%     ISBI          2019
Bodzas et al. [10]          100%       Blood smear   2020
Kasani et al. [16]          96.58%     ISBI          2020
Shafique and Tehsin [12]    99.50%     ALL-IDB       2018
Proposed approach           98.50%     ISBI          2021
Computational and Mathematical Methods in Medicine
Macawile et al. [6] propose a method for white blood cell (WBC) classification and counting using pretrained CNNs. They use modified AlexNet, GoogleNet, and ResNet-101 in tandem to obtain classification results. Hegde et al. [7] provide a comparison between traditional image processing approaches and deep learning methods in the task of classifying WBCs and report that neural network architectures give a significant performance increase over traditional methods. Sharma et al. [8] present a custom CNN architecture for white blood cell classification; the proposed network consists of 2D convolution and MaxPooling layers with ReLU activations. This architecture achieves high accuracy scores for both binary classification and multiclass classification settings. Habibzadeh et al. [9] present a method for utilizing the ResNet and Inception networks for WBC classification. The proposed method also utilizes several augmentation techniques in the preprocessing stage. WBC classification is done using hierarchy topological feature extraction by the CNNs.
In [10], Bodzas et al. propose an approach to automatically identify ALL from peripheral blood smear images using conventional image processing techniques and ML algorithms. The approach uses extensive preprocessing and a three-phase filtration algorithm. Sixteen handcrafted features were extracted from the image and were used as input to SVM and ANN classifiers. Muntasa and Yusuf [11] present a model that detects ALL using principal object characteristics of a color image. There are four main stages in the proposed approach: enhancement, segmentation, feature extraction, and accuracy measurement. The proposed method achieved the maximum accuracy on the ALL-IDB dataset. Shafique and Tehsin [12] compare the different methods for the early detection of ALL. The various stages in the diagnosis procedure are comparatively analyzed in their study. They also discuss the advantages and disadvantages of each method. Shafique and Tehsin [13] present an approach that uses a pretrained AlexNet which is fine-tuned for the task of classifying ALL into its subtypes (L1, L2, L3, and normal). The last 4 layers are replaced with new linear layers, and their weights are trained from scratch. The research also employs several data augmentation techniques to improve the generalization of the model. The model achieves a high accuracy of 99.5% for detection of ALL and 96.06% for ALL subtype classification.
Bhuiyan et al. [14] propose a framework for identifying ALL from microscopic images of WBC. A total of four different statistical models are used for classification, and their performance is compared. From the experimental results, the authors conclude that the SVM model gave the best fit for their dataset. Acharya and Kumar [15] survey various methodologies in current literature that are used to segment WBCs and provide a novel method for segmenting the nucleus and the cytoplasm of the WBC. Subsequently, models are built to extract features and perform supervised classification of the microscopic images into the four subtypes of ALL. The model achieves an accuracy of 98.6% for the dataset used. Kasani et al. [16] propose to use a pretrained CNN model in an aggregated fashion to detect ALL from microscopic WBC images. The authors use several data augmentation techniques to avoid overfitting. The proposed network consists of a VGG19 and a NASNetLarge which are used together for classification. The final ensemble produced an overall accuracy of 96.58% which is higher than any of the individual networks.
An extensive survey on the current trends and approaches to the detection of leukemia from microscopic images is presented in [17][18][19].

Dataset
The dataset used in this research is called the ISBI C-NMC 2019 dataset [20]. The dataset consists of white blood cell images collected from 60 cancer subjects and 41 healthy subjects. The dataset was prepared at Laboratory Oncology, AIIMS, New Delhi. There are a total of 10661 cell images in the dataset. The train, validation, and test splits were 75%, 15%, and 15%, respectively. Figure 1 illustrates the microscopic white blood cell images from the C-NMC 2019 Challenge dataset. Figure 2 portrays the class distribution of the C-NMC 2019 dataset.
To remove the variations in illumination, a stain normalization process has been applied to the images. The normalization procedures applied to this dataset have been described in detail in [21][22][23][24][25].

Proposed Approach
In this section, we describe our proposed model. Figure 3 shows the architecture of the proposed hybrid Inception v3 XGBoost model. Figure 4 portrays the architecture of the Inception v3 model. The proposed model consists of two components, an image feature extractor and a classification head. Generally, the classification head in a pretrained CNN for image classification tasks is a softmax classifier. In the proposed model, however, we use the XGBoost classifier as a classification head. The input features used for this XGBoost classifier are provided by the fine-tuned Inception v3 model. Through experiments, we also show that this setup works for several other pretrained CNNs too.
The proposed model is trained in two stages. In the first stage of training, we fine-tune the Inception v3 model on the training data. Through experiments, we observe that using features from fine-tuned Inception v3 leads to better classification results by the XGBoost classifier as opposed to using a pretrained Inception v3 directly as a feature extractor (refer to Figure 5).

Data Preprocessing.
We also used cutout [26] augmentation that acts as a regularizer by randomly masking out square regions of input during training.
(4) Normalization. The images are normalized with ImageNet mean and standard deviation. These values are precomputed standards derived from the ImageNet database.

Image Feature Extraction.
Literature review on recent works of medical imaging suggests that deep convolutional networks pretrained on large datasets such as ImageNet provide the best results for medical image classification tasks. This is due to the fact that medical image datasets are difficult to collect and are usually small in size. Hence, it becomes difficult to train CNNs from scratch, which often results in overfitting. However, pretrained CNNs help in avoiding this problem as we can use transfer learning to fine-tune these CNNs on medical datasets. We experiment with several popular CNN architectures such as ResNet and DenseNet to select the model which performs the best. We fine-tune these CNNs for the task of classification and choose the model with the best weighted F1 score. Refer to Experimental Results and Discussion and Table 2 for more details.
We employ an Inception v3 [27] model that is initialized with ImageNet weights and fine-tuned on the train set to extract feature maps for images. Among the several pretrained CNN models we experimented with for this task, Inception v3 gave the best F1 score. Inception v3 is the third version of the CNN from the Inception family of architectures and makes several improvements over its predecessors. These improvements include factorized convolutions that reduce the number of parameters without decreasing the network efficiency. It uses label smoothing as a regularizer. Additionally, it utilizes an auxiliary classifier to propagate label information lower down the network, which further helps in regularization.
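As an illustration of the label smoothing mentioned above, the following minimal sketch builds a smoothed target distribution for a single example by mixing the one-hot label with a uniform distribution over the classes. The function name `smooth_labels` and the value `epsilon=0.1` are assumptions made for this example and are not taken from our training configuration.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Return a label-smoothed target distribution for one example.

    The one-hot target is mixed with a uniform distribution:
    (1 - epsilon) * one_hot + epsilon / num_classes.
    """
    if not 0.0 <= epsilon < 1.0:
        raise ValueError("epsilon must be in [0, 1)")
    uniform_share = epsilon / num_classes
    target = [uniform_share] * num_classes
    target[true_class] += 1.0 - epsilon
    return target


# Two classes, as in the blast-versus-normal task; epsilon is illustrative.
smoothed = smooth_labels(true_class=1, num_classes=2, epsilon=0.1)
print(smoothed)
```

The smoothed target still sums to 1 but no longer assigns full probability to the true class, which discourages overconfident predictions.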

Classification Head.
We employ an XGBoost [28] classifier to classify the cell images as leukemic blasts or normal. XGBoost is a machine learning algorithm used for both classification and regression modelling tasks. It is an ensemble of gradient-boosted decision trees. Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is a special case of boosting algorithms where errors are minimized by a gradient descent algorithm.
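The residual-fitting idea behind gradient boosting can be sketched in a few lines. The example below is not XGBoost itself (which additionally uses second-order gradient information and regularization); it is a simplified illustration for squared-error loss, where the negative gradient equals the residual. The helper names `fit_stump` and `gradient_boost`, the toy data, and the hyperparameter values are all assumptions made for this sketch.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Additively fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)          # initial constant prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


# Toy one-dimensional data with a step between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = gradient_boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
baseline_mse = sum((sum(ys) / len(ys) - y) ** 2 for y in ys) / len(ys)
print(mse, baseline_mse)
```

Each round adds a small correction in the direction that reduces the remaining error, so the training error of the ensemble shrinks well below that of the constant baseline.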

Stage 1 Training.
In the first stage of training, Inception v3 is trained on the training set. We employ the pretrained ImageNet weights for Inception v3. The last fully connected layer in Inception v3 is replaced with a 2-node softmax classifier. The parameters for this replaced layer were randomly initialized. The softmax function is used to convert the logits of the classifier into a probability distribution. Each element of the output lies in the interval [0, 1], and the output elements sum up to 1. The input image is assigned to the class with maximum probability. Equation (1) defines the softmax function for a logit vector $x$:
\[ \mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} \tag{1} \]
From Figure 2, we can observe that the dataset has a class imbalance problem. To address this problem, we use a weighted cross-entropy loss function. For a logit vector $x$ and target class $c$, this function is given by the formula
\[ \ell(x, c) = w_c \left( -x_c + \log \sum_j \exp(x_j) \right) \tag{2} \]
where $w_c$ (weight[class]) refers to the weight assigned to each class.
To minimize the effect of class imbalance, we assign larger weights for minority classes. The losses are averaged across observations for each minibatch. In this case, it is a weighted average given by
\[ L = \frac{\sum_{n=1}^{N} \ell(x_n, c_n)}{\sum_{n=1}^{N} w_{c_n}} \tag{3} \]
where $N$ is the number of observations in the minibatch, and $x_n$ and $c_n$ are the logits and the target class of the $n$-th observation.
During this stage of training, we used several augmentation techniques that were mentioned in Data Preprocessing. Using image augmentation helps the model generalize better and improves performance. Figure 6 compares the validation loss during training of two different Inception v3 models, one which uses image augmentation on the input images and the other which does not use it. Using image augmentation improves the performance of the model.

Stage 2 Training.
To extract the features using Inception v3, we remove the softmax classifier from the network and directly obtain the feature map from the penultimate layer. The feature maps obtained are of dimension 2048 × 1. We use the same training, validation, and test splits that were used in stage 1 training.
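The weighted cross-entropy loss used in stage 1 training, together with its weighted averaging over a minibatch, can be sketched in plain Python as follows. This is a minimal illustration rather than our training code: the logits, targets, and class weights are hypothetical values, and the function names are chosen for this example. The reduction shown (sum of weighted losses divided by the sum of the weights of the target classes) is intended to mirror the weighted mean used by common deep learning libraries.

```python
import math


def softmax(logits):
    """Convert logits into probabilities that lie in [0, 1] and sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, targets, class_weights):
    """Weighted mean of per-example losses w[c] * (-log softmax(x)[c])."""
    weighted_losses = []
    weights = []
    for logits, target in zip(batch_logits, targets):
        prob = softmax(logits)[target]
        weight = class_weights[target]
        weighted_losses.append(-weight * math.log(prob))
        weights.append(weight)
    # Divide by the sum of the target-class weights, not by the batch size.
    return sum(weighted_losses) / sum(weights)


# Hypothetical minibatch of three examples with two classes.
batch_logits = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
targets = [0, 1, 1]
class_weights = [2.0, 1.0]  # larger weight for the (assumed) minority class 0

loss = weighted_cross_entropy(batch_logits, targets, class_weights)
print(loss)
```

Because the minority class carries a larger weight, errors on its examples contribute more to the loss and to the gradient than errors on majority-class examples.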

Implementation Details
All the networks were trained on the Tesla K80 GPU provided by Kaggle's Machine learning kernels. We used the PyTorch library to develop the deep learning models. The models were optimized using Adam optimizer. For the XGBoost classifier, we used the XGBoost library. We used a grid search strategy to tune the model to optimize the loss. The detailed hyperparameter configuration for the proposed model is given in Table 3.
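The grid search strategy can be illustrated with the following minimal sketch, which exhaustively evaluates every combination in a parameter grid and keeps the best-scoring one. The grid values and the scoring function `toy_validation_score` are hypothetical stand-ins: in practice the score would come from training the classifier with the given parameters and evaluating it on the validation split, and the configuration actually used is the one reported in Table 3.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; these are not the values searched in our experiments.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [100, 300],
}


def toy_validation_score(params):
    """Stand-in for 'train a model and return its validation score'."""
    return (-abs(params["max_depth"] - 5)
            - abs(params["learning_rate"] - 0.1)
            + params["n_estimators"] / 1000)


best_params, best_score = grid_search(param_grid, toy_validation_score)
print(best_params, best_score)
```

The cost of grid search grows with the product of the number of candidate values per parameter, so the grid is usually kept small.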

Experimental Results and Discussion
In this section, we report the experimental results for our proposed model. The primary evaluation metric that we adopt is the weighted F1 score. We additionally report accuracy, precision, recall, and AUC score.
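For clarity, the weighted F1 score is the average of the per-class F1 scores, weighted by the number of true samples in each class. The sketch below computes it from scratch; the function name `weighted_f1` and the toy labels are assumptions made for this illustration and do not correspond to our test data.

```python
def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by class support."""
    total = len(y_true)
    score = 0.0
    for cls in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true samples of this class
        score += f1 * support / total
    return score


# Toy labels: 1 = leukemic blast, 0 = normal (hypothetical values).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
score = weighted_f1(y_true, y_pred)
print(score)
```

Weighting by support makes the metric reflect the class distribution, which matters for an imbalanced dataset such as the one used here.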
Once the model is trained, we select the best checkpoint to be used in model inference. The predicted classes are compared to the actual target classes to calculate the aforementioned metrics. We experimented with several CNN backbone feature extractors such as AlexNet and DenseNet during stage 1 of training. We experimented with these CNNs to identify which model can be used as the feature extractor for our hybrid model. Figure 7 compares the validation loss of the different CNN models during stage 1 of training. Among these, Inception v3 was the best performing model with a weighted F1 score of 0.97. Table 2 displays the evaluation metrics of the various CNN models used during stage 1 of training.
During stage 2 of training, we extracted image features using the Inception v3 model trained in stage 1. These features were used in training an XGBoost classifier. Using an XGBoost classifier on top of this Inception v3 model gave the best result on the test set with a weighted F1 score of 0.98. Figure 8 displays the confusion matrix obtained for the proposed hybrid model. We observe that there are very few misclassified samples. We observe that there is a better false positive rate when using an XGBoost classification head over a CNN; this is an essential factor when dealing with medical diagnosis since it is better to screen a person as diseased and conduct further tests to exclude the disease than to exclude a diseased person by falsely predicting a negative.
Sensitivity and specificity are two important metrics that are used to validate medical diagnosis models. Sensitivity reflects the probability that a diagnostic test will return positive for people who are diseased. Specificity, on the other hand, reflects the probability that a test will return negative for persons without the disease. Clinically, these metrics are important for confirming or excluding disease. We can interpret these metrics from the confusion matrix (Figure 8). The sensitivity is 0.9884, and the specificity is 0.9133. The TPR (true positive rate) and FPR (false positive rate) are important AUC/ROC (Area Under the Curve/Receiver Operating Characteristics) metrics that help to determine the amount of information learnt by the model and how well it is able to distinguish between the classes. In the ideal case, TPR = 1 and FPR = 0. Figure 5 depicts the ROC curve for the hybrid model on the test data. An AUC near 1 indicates that a model has excellent separability. We can observe that the model achieves a high AUC of 0.9826. This shows that the proposed model has excellent separability and correctly classifies most of the samples in the test data with very few misclassifications. Also, the FPR is close to 0 and the TPR is close to 1, from which we can deduce that the model is performing well.
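The relationship between the confusion matrix and these metrics can be made concrete with a short sketch. The counts below are hypothetical and are not the values shown in Figure 8; the function name `diagnostic_rates` is likewise chosen only for this example.

```python
def diagnostic_rates(tp, fn, tn, fp):
    """Derive sensitivity, specificity, and FPR from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (TPR)
    specificity = tn / (tn + fp)   # true negative rate
    fpr = fp / (fp + tn)           # false positive rate = 1 - specificity
    return {"sensitivity": sensitivity, "specificity": specificity, "fpr": fpr}


# Hypothetical counts: 100 diseased and 50 healthy samples.
rates = diagnostic_rates(tp=95, fn=5, tn=45, fp=5)
print(rates)
```

Since the FPR equals one minus the specificity, a point on the ROC curve is fully determined by the sensitivity and specificity at a given decision threshold.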
To benchmark and compare our hybrid model, we have selected the following models from recent studies on leukemia detection. These models are trained and validated on either the ISBI C-NMC dataset or other similar datasets of microscopic WBC images for ALL classification. Moreover, these models use CNNs for feature extraction or have some deep learning components in their model design. We describe these models in brief below.
Yu et al.: to prevent a model from fitting data noise, the authors have combined several CNNs and used their combined output to get classification results. The CNN architectures being used are ResNet50, Inception v3, VGG16, VGG19, and Xception.
Mourya et al.: this approach combines the optical density features and discrete cosine transform domain features extracted through CNN to build the classifier. They use bilinear pooling instead of average pooling after the last convolutional layer to help in fine-grained recognition.
Kassani et al.: in this approach, the image is first enhanced using several preprocessing and augmentation techniques; then, features are extracted using a hybrid VGG16 and MobileNet model. The authors have developed an integrating strategy to overcome the shortcomings of the individual models. Finally, a multilayer perceptron is trained using these features.
Bodzas et al.: in this approach, the image is segmented using a three-phase filtration; then, sixteen handcrafted features are extracted and used for classification by SVM and ANN classifiers.
Kasani et al.: the authors develop an aggregated deep learning model for ALL detection. Several data augmentation techniques were applied to overcome dataset size issues, and transfer learning was utilized to accelerate learning. The authors have used the following CNN models: Inception v3, AlexNet, DenseNet201, VGGNet-16, VGGNet19, Xception, MobileNet, ShuffleNet, and two NASNet models.
Shafique and Tehsin: the authors have used a pretrained AlexNet model in their study. They have replaced the last layers with new linear layers and learnt the weights from scratch by fine-tuning on the ALL-IDB dataset. They have employed several data augmentation techniques to overcome overfitting.
We compare the models based on the accuracy obtained since this was a common metric we found in all these studies. Table 1 compares the proposed approach with its counterparts.
A common trend we noticed in these studies is that several CNN models are being aggregated and utilized to make a classification decision; we feel that this approach makes the model unnecessarily complex. Not only does this   Another limitation we noticed is that the studies do not attempt to interpret and justify classification decisions made by the models. Interpretability of models is of prime importance in building trust and towards the successful integration of these models in everyday medical use. Since we use a single CNN backbone for feature extraction, we can demystify the CNN by visualizing their activation maps of the features extracted (refer to Figure 9).

Ablation Study
In this section, we attempt to justify our design choices in developing the proposed hybrid model.
We investigate the effectiveness of using an XGBoost classification head with a fine-tuned CNN model. We experiment with different CNN backbones such as AlexNet and ResNet18 in our proposed model. The goal of this experiment is to demonstrate the effectiveness and generalizability of using the XGBoost classification head over the softmax classification head for this dataset. Table 4 shows the weighted F1 score of hybrid models using different CNN backbones. Table 4 shows that generally there is a significant increase in the performance of the model when used in this setting.
We check whether a pretrained CNN can be used directly without fine-tuning on the train set. We conduct this experiment to check the effectiveness of, and need for, fine-tuning the feature extractor (stage 1 training). When we directly use the pretrained Inception v3 as a feature extractor, we notice that there is a significant drop in performance. We investigate the reason behind this by plotting a scatter plot of the features extracted from Inception v3 (refer to Figure 9). We use t-SNE to convert the high-dimensional feature maps to lower-dimensional embeddings. We observe that with fine-tuning, Inception v3 learns better and more discriminative feature representations for the dataset, which helps the XGBoost model in making better and more informed classifications, whereas the features from a pretrained off-the-shelf Inception v3 are not discriminative at all, which is clearly observed in Figure 9.
We also try to understand the inner workings of Inception v3 from stage 1. Being able to interpret the model can help in justifying the classification decision; this kind of interpretability will give medical practitioners and patients more confidence in the model's predictions. To do this, we would like to find out which parts of the image Inception v3 pays attention to while making a classification decision. We visualize the feature maps to understand the active areas of the image. Figure 10 displays the heat map over the image; the highlighted areas in the image are those areas that contribute most to the classification decision. We observe that the cell nucleus is the region that contributes most to the classification decision. We also observe that the model does not pay much attention to the area surrounding the cells. This observation also justifies the choice to perform center cropping while preprocessing the data as that removes