Ensemble of EfficientNets for the Diagnosis of Tuberculosis

Tuberculosis (TB) remains a life-threatening disease and is one of the leading causes of mortality in developing regions due to poverty and inadequate medical resources. Tuberculosis is medicable, but it necessitates early diagnosis through reliable screening techniques. Chest X-ray is a recommended screening procedure for identifying pulmonary abnormalities. Still, this recommendation is not enough without experienced radiologists to interpret the screening results, which forms part of the problems in rural communities. Consequently, various computer-aided diagnostic systems have been developed for the automatic detection of tuberculosis. However, their sensitivity and accuracy are still significant challenges that require constant improvement due to the severity of the disease. Hence, this study explores the application of a leading state-of-the-art convolutional neural network (EfficientNets) model for the classification of tuberculosis. Precisely, five variants of EfficientNets were fine-tuned and implemented on two prominent and publicly available chest X-ray datasets (Montgomery and Shenzhen). The experiments performed show that EfficientNet-B4 achieved the best accuracy of 92.33% and 94.35% on both datasets. These results were then improved through Ensemble learning and reached 97.44%. The performance recorded in this study portrays the efficiency of fine-tuning EfficientNets on medical imaging classification through Ensemble.


Introduction
According to the World Health Organization (WHO), tuberculosis (TB) is one of the leading causes of death [1,2]. It is most prevalent in developing countries due to the high rate of economic distress [3]. TB can be treated if detected early to avert its further spread and mortality [4]. Early diagnosis of TB requires a reliable screening procedure and accurate interpretation of the screening outcome. To this end, chest X-ray (CXR) has been recommended by the WHO for screening pulmonary (lungs) abnormalities due to its high sensitivity, wide availability, and relatively less expensive. Despite this recommendation, adequate medical resources and skilled radiologists required to efficiently screen patients and interpret results are limited in high TB burden countries [5,6]. Several algorithms varying from hand-crafted to support vector machines (SVMs) and convolutional neural networks (CNNs) have been deployed to automate TB screening in tackling these limitations. However, our focus is on the CNN algorithms.
Amongst these algorithms, CNNs have demonstrated promising performance, as evident in the literature primarily in transferring models trained on large datasets (ImageNet [7] and CIFAR-10 [8]), to medical imaging with fewer datasets. One of the challenges in the medical field is the lack of reliable annotated datasets and class imbalance; hence, transferring the weights of deep models helps address these challenges since CNN algorithms often depend on large datasets to learn various discriminating features contributing to attaining good results [9].
To date, many CNN models have been developed to address the challenges of TB detection, some of which are discussed here, ranging from the ones trained from scratch to pretrained CNNs and Ensemble of them.
e CNN architecture proposed in [10] is trained from scratch to diagnose TB. e architecture comprises five convolutional layers, each preceded by the pooling layers, and then the final output layer to obtain 86.2% accuracy and AUC of 0.925%. e authors incorporate Grad-CAMS and Saliency map visualizers to validate the output of the model classification and identify areas where TB was visible on the CXR. e importance of a saliency map is further emphasized as a valuable tool to review and interpret the model decision.
A technique based on deep CNN models is presented in [11] for screening CXR to detect TB. Artificial CXR images were generated using deep convolutional generative adversarial networks (DC-GANs).
ese images and the CXR images from popular datasets were segmented into only the lung fields using UNET. e segmented CXR images are then used to train Vgg16 and InceptionV3 and then Ensemble both for classification. e Ensemble model achieved 97.1%, while the individual model obtained 93.8% and 96.3%, respectively. e authors affirm that the performance of the proposed model is impacted by the preprocessing technique, segmentation, augmentation, image denoising, and hyperparameter optimization.
In [12], VGG16, a variant of VGGNet [13], was employed as a feature extractor to extract discriminating features from CXR, and a logistic regression classifier was trained to predict the normal and infected images. Preprocessing techniques such as augmentation and contrast limited adaptive histogram equalization were applied to boost the quality of the CXR. e authors conclude that a large database of labeled CXR would assist in building a deeper and robust computer-aided detection system from scratch. e study shows the effectiveness of transferring the weights of a pretrained model to the medical imaging domain where there are limited annotated datasets.
GoogLeNet [14] was fine-tuned in [15] to classify different manifestations of TB as consolidation, pulmonary edema, cardiomegaly, pneumothorax, or pleural effusion. Image preprocessing was applied to normalize the CXR with shorter dimensions, while the CXR with larger dimensions was padded before training the model. e performance of the model was evaluated to determine its AUC score, sensitivity, and specificity. e model obtained 0.964 AUC and 91% for both specificity and sensitivity, which are further compared with the result of 2 board-certified radiologists.
e study presented in [16] used the pretrained CNN to confirm the presence of TB on CXR. In the study, the authors conducted two experiments by applying preprocessing such as data augmentation to CXR in the first experiment, while the second experiment was performed directly on the CXR. e model achieved an accuracy of 81.25% in the first instance and 80% in the second instance. is performance is low compared to what could be acceptable due to the sensitive nature of TB. However, it has shown data augmentation to increase data samples where necessary and proves that the more data samples you have, the more features can be learned and better the performance.
Ensemble classifier is presented in [17] for detecting TB abnormalities in the CXR. Firstly, three pretrained CNN models are trained individually with different hyperparameters to classify the CXR into normal and abnormal classes. e three models are then combined through Ensemble to improve the classification. e results are then compared, and the Ensemble model performed better than each model, achieving 86.42%.
Pretrained CNN models were customized and proposed in [18] to classify CXR abnormalities. e study applied data augmentation to generate more CXR. After the image preprocessing, CapsNet [19], VGG16, and AlexNet [20] were trained to classify the CXR into normal and abnormal classes. e classifiers were evaluated using different metrics, and the CapsNet performed best above the others concerning affine transformations. Lopes and Valiati [21] proposed three approaches (Bag of features, Simple feature extractor, and Ensemble) as feature extractors and compared their performance to detect TB.
e Ensemble in the study is made up of VggNet, ResNet [22], and GoogLeNet. e CXR was first downsampled to suit the different approaches excluding Bag of features; even though downsampling could impact the performance due to vital detail loss, experiments were then performed for all the proposed methods, and the Ensemble outperformed the other two approaches, achieving 84.6%. is performance could have been impacted due to downsampling, class imbalance, and limited datasets. e study in [23] proposed an Ensemble of pretrained CNN models toward learning modality-specific features from different CXR collections. e knowledge learned through modality specific is transferred and fine-tuned for TB detection tasks on the Shenzhen CXR dataset. e predictions of the best performing models are combined using different ensemble methods to demonstrate improved performance over any individual model in classifying the CXR as infected or healthy. e evaluation of the proposed model achieved 94.1% accuracy.
Another study that makes use of Ensemble models is proposed in [24]. is study extracted distinctive features from edge detected images and raw CXR images for TB detection. Image preprocessing was applied before developing different classifiers to represent Ensemble and applied to both image categories. e performance was evaluated, and the Ensemble model performed best at 89.77% accuracy with 90.91% sensitivity.
e study presented in [25] is a proposed method that fine-tuned five pretrained CNNs as feature extractors to diagnose TB automatically. e CXR was collected with the patient's demographic details such as weight, height, weight, age, and gender. e authors conducted two experiments that compare the performance of the CXR with demographic information and the CXR without demographic details. e first experiment shows higher performance in terms of AUC and sensitivity over the second experience. e result proves that incorporating demographic information can positively impact performance.
In [26], VggNet and AlexNet architectures were modified to screen TB on CXR and classify them into healthy and unhealthy classes. e architectures were independently trained on the Montgomery and Shenzhen datasets to evaluate the performance of each model. VggNet achieved 81.6% classification accuracy, a slight superiority over the AlexNet architecture at 80.4%.
Deep CNN models proposed in [27] were integrated with the handcraft technique through Ensemble learning as feature extractors from CXR. e features extracted were then used as inputs to train a classifier for detecting infected CXRs. e model was evaluated to compare the performance of both methods, and the Ensemble model performed better at 0.99 AUC.
In [28], an automatic deep learning algorithm for the diagnosis of active lung TB was presented to streamline the process of screening and detection. e algorithm was developed using the CNN comprising about 27 layers. e CXR images were resized; then, a geometric and photometric operation was applied to them before passing them on to the CNN architecture for training. e algorithm was evaluated on two public datasets and four custom datasets compared with certified radiologists and physicians' interpretations. e algorithm performance shows significant performance compared to the radiologist and physician's performance on all the datasets.
A hybrid model (MoblieNet-AEO) was proposed in [29] to classify CXR images as TB and Not-TB. e proposed model first employed MobileNet as feature extraction and then used the Artificial Ecosystem-Based Optimization (AEO) as the feature selector. e model was trained with the Shenzhen dataset and a private dataset to achieve 90.2% and 94.1% accuracy on both datasets.
Although several computer-aided detection models have been developed to diagnose TB, low accuracy and sensitivity are still major concerns that lead to misdiagnosis. Existing models are also computationally complex and expensive, which make them unaffordable in a low-budget economy. All these serve as motivation to propose an efficient system suitable and accessible for low-income economies. e contributions of this work are summarized as follows: (1) Implementation of state-of-the-art EfficientNets to develop an effective and inexpensive TB detection system. It is the first time the EfficientNet model is being ensembled to classify CXR images for tuberculosis diagnosis. (2) e proposed model achieved high sensitivity and accuracy.
(3) e model improved accuracy through Ensemble learning (Bagging). (4) Finally, two benchmark datasets (Shenzhen and Montgomery) were used to evaluate the performance of our model. Despite the challenges of limited annotated datasets and class imbalance in medical fields, our approach produces significant improvements in TB classification accuracy over the related methods.
e rest of the work is structured as follows: Section 2 presents detailed data and methodology explored in this study. e experimental results and discussion are provided in Section 3, while Section 4 concludes and gives insight into future direction.

Dataset.
In this study, all the experiments were performed with the two public CXR datasets provided by the Department of Health and Human Service, Montgomery Country, Maryland, USA, and Shenzhen No. 3 People's Hospital in China [30]. ese datasets were deidentified and exempted from IRB review. Both datasets described below can be accessed from the website https://lhncbc.nlm.nih. gov/publication/pub9931.
(1) e Montgomery County dataset: this dataset is a TB-specific frontal CXR dataset provided by the National Library of Medicine in conjunction with the Department of Health Services, Maryland, USA, for research intent. is dataset is composed of 58 abnormal samples and 80 normal samples. e naming of each sample ends with either 0 denoting normal sample or 1 denoting abnormal sample. All samples are 4020 × 4892 pixels and are available in portable network graphic (png) file format. is dataset is accompanied by clinical readings that detail each sample regarding age, sex, and manifestations.
(2) e Shenzhen dataset: this dataset is a TB-specific frontal CXR provided by Shenzhen No. 3 Hospital in Shenzhen, Guangdong, China, and is publicly available for research. e dataset is made up of 336 abnormal samples and 326 normal samples. e naming of each sample ends with either 0 denoting normal sample or 1 denoting abnormal sample. All samples are approximately 3000 × 3000 pixels saved in portable network graphic (png) file format. is dataset is accompanied by clinical readings that detail each sample concerning sex, age, and diagnosis. Some of the CXR in both datasets are shown in Figure 1.

Preprocessing.
Image preprocessing is crucial in a deeplearning task. Since images are provided in different sizes, the input images need to be resized to conform to the different CNN models while preserving the important features. Also, the CNN model generally requires a large dataset to properly learn discriminating features suitable for making predictions and obtaining a reasonable performance [9]. Medical imaging is usually limited; hence, data augmentation [31] is applied in this study to generate additional unique CXR images. e data augmentation approach contributes toward controlling overfitting. Overfitting is a phenomenon where the model performs well on the training set but poorly on the new (unseen) test set. Contrast limited adaptive histogram equalization (CLAHE) [32] is also applied on the CXR to improve the visibility quality of the images. e augmentation types applied to the CXR images are presented in Table 1, while Table 2 shows the parameter values used to perform the CLAHE operation.

Transfer Learning.
In this study, we adopt the concept of transfer learning. is phenomenon leverages the pretrained CNN model on large datasets (millions) to learn features of the target (diagnosing TB from the CXR in this case) with a limited dataset. In other words, we transfer the rich discriminative features learned by the deep CNN models on the ImageNet [7] dataset to our CXR dataset. e number of Computational Intelligence and Neuroscience deep CNN model parameters increases as the network gets deeper to achieve improved efficiency. Hence, it requires many datasets for training, thereby making it computationally complex. So, applying these models directly on small and new datasets results in feature extraction bias, overfitting, and poor generalization. We, therefore, modified the pretrained CNN and fine-tuned its structure to suit the CXR dataset. e concept of transfer learning is computationally inexpensive, achieves less training time, overcomes limitations of the dataset as in the medical domain, improves performance, and is faster than training a model from scratch [33]. e pretrained CNN model fine-tuned in this work is the EfficientNets [34], and the structure of the proposed method is represented in Figure 2. [34] is a lightweight model based on the AutoML framework [35] to develop a baseline EfficientNet-B0 network and uniformly scaled up the depth, width, and resolutions using a simplified and effective compound coefficient to improve EfficientNet models B1-B7. e models performed efficiently and attained superiority over the existing CNN models on the ImageNet datasets, as shown in Figures 3 and 4, respectively. EfficientNets are smaller with few parameters, faster, and generalize well to obtain higher accuracy on other datasets popular for the transfer learning task. e proposed study fine-tuned EfficientNet models B0-B4 on CXR to detect TB. Due to limited annotated datasets in medical imaging, the data augmentation technique is applied to generate additional unique CXR images to control overfitting. In transferring the pretrained EfficientNets to the CXR dataset, we fine-tuned the models by adding a global average pooling (GAP) to reduce the number of parameters and handle overfitting. Two dense layers follow the GAP with a ReLU activation function and a dropout rate of 0.4 before a final dense layer that serves as output with SoftMax activation function to determine the probabilities of the input CXR representing the normal and infected classes. e SoftMax activation function is given as

EfficientNet Architecture. EfficientNet
where σ is the SoftMax activation function, q represents the input vector to the output layer i depicted from the   Computational Intelligence and Neuroscience exponential element e q i , N is the number of classes, and e q y represents the output vector of the exponential function. It is understood that too many iterations could lead to model overfitting, while too few iterations can cause model underfitting; hence, we invoked an early stopping strategy. About 120 training iterations were configured and then set up the early stopping to terminate the training once the performance stops improving on the hold-out validation set. e early stopping, L2 regularization, data augmentation, and dropout were implemented to control overfitting.
e EfficientNet B0-B4 models were trained for 40 iterations (epochs) using a stochastic gradient descent (SGD) optimizer with a 0.0001 learning rate. e batch size for each iteration was 22, and momentum equals 0.9 and L2 regularizer. All these configurations help to control overfitting. At the same time, categorical cross-entropy is the loss function used to update weights at each iteration. Hyperparameters used were carefully evaluated and found to perform optimally. e optimizer employed is defined as where Δ α J is the gradient of the loss w.r.t α, n is the defined learning rate, α is the weight vector, while x and y are the respective training sample and label, respectively.   Computational Intelligence and Neuroscience

Ensemble Modeling.
Ensemble learning is the terminology for techniques that incorporate multiple models for decision making. e ultimate goal of Ensemble is such that by combining multiple models, the errors of a single model can be corrected (compensated) by other models, thereby making the overall result (prediction and classification) of the Ensemble better than any single participating model [36]. e concept of Ensemble is mainly considered the interpretation of machine learning for the intelligence of the majority. In recent times, Ensembles are regarded as state-of-theart methods for solving a more significant number of machines and deep learning challenges [37]. In other words, weighing and aggregating many individual viewpoints will be better than deciding on the judgment of a single individual.
(1) Strength of Ensemble learning: the efficiency of Ensemble models for obtaining improved performance could be due to the following [36,38]: Avoidance of overfitting: algorithms often find ways to perfectly predict the training set in a few datasets even though poor prediction is made on the test set. So when a different hypothesis is averaged, the risk of selecting the wrong hypothesis is minimized.
Computational expediency: individual models can get stuck in a local minimum, but through (Ensemble) combining models, the risk of local minimum is decreased as other models can get past a certain local minimum. Representation: Ensemble models help improve the scenario where the optimal hypothesis is outside the scope of any individual model. Class imbalance and dimensionality: they are major problems that affect the equality of results and are well handled by Ensemble model through resampling.
(2) Building an Ensemble model: building an Ensemble model culminates in methodology selection to train the individual participating models and select a suitable process for combining the individual predictions (output). e high diversity and predictive performance of a model are essential in selecting participating models [38]. Other considerations for model selections include input and output manipulations, partitioning, and hybridization. Furthermore, the methods of integrating all individual outputs into a single final result are crucial. ere are different Ensemble types based on the generation process of different models and a combination strategy. e most prominent Ensemble types are Bagging [39], Boosting [40], and Stacking [41]. In this study, the Bagging method of Ensemble is implemented. is method is also known as Bootstrap Aggregation.
(3) Bagging [39] is an effective method for creating Ensembles of independent models. Many individual models are trained independently with a sample of data derived from the original datasets as replacements. By independently, we mean the training and output of a particular model do not affect the others. To ensure diversity, each replica sample has the same number of samples as the original dataset. us, some data samples may appear for the replica, while others may not or appear more than once for training individual models. Once the classification output of the models is obtained, bagging then combines them through majority voting to obtain the final prediction. Algorithm 1 presents bagging pseudocode. e bagging method is employed in this study because of the following: It is suitable for a task with small training datasets. It can achieve low variance without impacting the bias. It handles classification tasks well It has the ability to obtain improved accuracy.

Results and Discussion
Various EfficientNet (B0-B4) variants were fine-tuned on both Shenzhen and Montgomery TB-specific CXR datasets to detect TB. Each dataset is split into a 75% training set and 25% test set, respectively. e experiments were entirely performed using the Keras deep learning framework running on the TensorFlow backend. Each model was trained in the cloud on the Tesla graphics processing unit (GPU) and made available through the Google Collaboratory framework by Google. e Collaboratory framework provides up to 12 GB random access memory (RAM) and about 360 GB GPU in the cloud for research purposes. e models were evaluated using the popular evaluation metrics (accuracy, sensitivity, specificity, and area under the curve). e way these evaluation metrics are determined is given as follows: where TP � true positive, FP � false positive, TN � true negative, FN � false negative, TPR � true-positive rate, and FPR � false-positive rate. e experimental results showing each dataset's accuracy and loss are presented in Figures 5 and 6. e various EfficientNet models converge at the 40th iteration (epoch); then, the accuracy is determined using the test set. e models performed well at extracting and learning discriminative features from the CXR. Precisely, EfficientNetB4 attains the best accuracy (92.33% and 94.35%) for both datasets over the other variants. e performance details of 6 Computational Intelligence and Neuroscience the experiments are illustrated in Table 3, depicting accuracy, sensitivity, specificity, and AUC. Despite achieving good performance, we further build an Ensemble model comprising the best performing individual models (B2, B3, and B4) from the initial experiments to improve accuracy. e performance of the three Effi-cientNets was averaged to build an Ensemble to classify the CXR.
e Ensemble model outperformed the individual variants of EfficientNets in diagnosing TB, achieving 95.82% for the Montgomery dataset and 97.44% for the Shenzhen dataset. e accuracy and loss plots for the Ensemble model are presented in Figure 7. Also, the model obtains the highest sensitivity for TB detection.
One of the main advantages of EfficientNets is that they are smaller with fewer parameters and faster and generalize well on transfer learning datasets [34]. EfficientNet-B0 had the least performance, as shown in Table 3, despite having the fewest parameters. is low performance could result from image downsampling to conform to the model's 224 × 224 input size. A detailed number of parameters and input size for each EfficientNet are provided in Table 4. We observed that performance improves as the model gets deeper. For the Montgomery dataset, EfficientNet-B0 started poorly. It only began to converge from the 11th iteration, although with little noise, until the 25th iteration, where it began to stabilize until the 40th iteration. Overfitting starts to occur after the 40th iteration; hence, early stoppage was invoked. On the contrary, the training progressed well for the Shenzhen dataset but only began to overfit after the 40th iteration. Generally, the models performed better on the   Figure 4: Performance of EfficientNets vs state-of-the-art models on ImageNet [34].   Computational Intelligence and Neuroscience   is attribute could be due to the increased dataset and balanced class ratio of normal to abnormal CXR. One of the challenges faced during the experiments was determining the best combinations of hyperparameters suitable for these models and overfitting due to limited data samples. Hence, we explored data augmentation and other regularization techniques (dropout) to combat overfitting as much as possible. e result of the proposed method compared with the related Ensemble study is presented in Table 5.

Conclusion
is study investigated and implemented EfficientNet models for automatic diagnosis of TB on the two most prominent and publicly available CXR datasets. Effi-cientNets that achieved state-of-the-art performance over other architectures to maximize accuracy and efficiency were explored and fine-tuned on CXR images. e fine-tuning technique is a valuable way to take advantage of rich generic features learned from large dataset sources such as ImageNet to compliment the lack of annotated datasets affecting medical domains. e experimental results show the effectiveness of the EfficientNets in extracting and learning distinctive features from the CXR images and then classifying them into a healthy or infected class. Out of the five EfficientNet variants explored in this study, the EfficientNet-B4 outperformed among the others, as evident in Table 3 and depicted in Figure 8. e achieved result is improved through Ensemble of the best three (B2, B3, and B4). It is worthy of note that exploring EfficientNets for the classification of TB images helps save time while preserving accuracy. However, one of the major limitations of the proposed approach is training the model on small datasets and training on images with low resolutions. ese limitations can easily result in significant overfitting. Hence, it is essential to explore effective image preprocessing techniques to overcome these challenges. For future study, we will consider Ensembling all variants of EfficientNets and compare with other state-of-the-art models for detecting foreign objects on CXR images that could affect performance or be misclassified as TB. Although the proposed methodology is specific to TB detection, it could be extended to detect other pulmonary abnormalities, given the right set of datasets.

Data Availability
Data used in this research are publicly available (Shenzhen and Montgomery TB-specific CXR datasets). e developed model can be shared upon request.

Conflicts of Interest
e authors declare that they have no conflicts of interest.