Ensemble Framework of Deep CNNs for Diabetic Retinopathy Detection

Diabetic retinopathy (DR) is an eye disease that damages the blood vessels of the retina. It causes blurred vision and may lead to blindness if it is not detected at an early stage. DR has five stages: 0 (normal), 1 (mild), 2 (moderate), 3 (severe), and 4 (proliferative DR, PDR). Conventionally, many hand-engineered computer vision methods have been applied to detect DR, but they cannot encode the intricate underlying features and therefore classify DR stages poorly, particularly the early stages. In this research, two deep CNN ensemble models are proposed to detect all stages of DR using balanced and imbalanced datasets. The models were trained on the Kaggle dataset using a high-end graphics processing unit (GPU). A balanced dataset was used to train both models, and the models were tested on both balanced and imbalanced datasets. The results show that the proposed models detect all stages of DR, unlike current methods, and perform better than state-of-the-art methods on the same Kaggle dataset.


Introduction
Diabetes mellitus, commonly known as diabetes, causes high blood sugar. A persistently high blood sugar level leads to various complications and general vascular deterioration of the heart, eyes, kidneys, and nerves [1]. Diabetic retinopathy (DR) is one of the leading diseases caused by diabetes [2]. It damages the blood vessels of the retina in patients with type I or type II diabetes. DR is classified into two major classes: nonproliferative (NPDR) and proliferative (PDR) [3]. In NPDR, changes are detected in the retina that need to be monitored. NPDR is subdivided into three stages according to the level of damage in the retina, namely, mild, moderate, and severe. NPDR carries a high risk of turning into PDR if it is not diagnosed in time. In PDR, fragile (breakable) new blood vessels form on the surface of the retina over time.
These abnormal vessels can bleed or develop scar tissue (neovascularization, vitreous hemorrhages), causing severe loss of sight. The disease progresses from mild NPDR to PDR, as shown in Figure 1. The influence of DR can be alleviated if it is detected and treated at an early stage. Globally, the number of patients with diabetes is expected to increase from 382 million to 592 million by 2025 [4], and the number of patients with DR is expected to increase from 126.6 million to 191.0 million by 2030 [5]. The International Diabetes Federation estimates that the global incidence of adult diabetes will increase from 8.4% in 2017 to 9.9% by 2045 [6]. In the early stage, patients are asymptomatic, but in the advanced stage, DR may lead to blurred vision, floaters, and loss of visual acuity. Hence, detecting DR in its early stages is difficult but of utmost importance to avoid the worse effects of the later stages. Figure 1 shows images of all stages; the normal and mild stages clearly look similar, which makes the mild stage difficult to detect. Color fundus images are used to diagnose DR. Manual analysis can only be done by experts, which is expensive in terms of time and cost. Therefore, it is important to use computer vision techniques to analyze fundus images automatically. Many automatic techniques have been applied to detect DR. Computer vision-based methods can be divided into two categories: hand-engineered features [7] and end-to-end learning [8,9]. The hand-engineered methods were based on single selected features, such as the blood vessel outline, exudates, hemorrhages, microaneurysms, and maculopathy of the retinal fundus image, or their combinations [7]. End-to-end learning learns features automatically and hence achieves better classification. Many hand-engineered and end-to-end learning methods [8][9][10][11][12][13][14] detect DR using the Kaggle dataset, but no approach could detect the mild stage. To treat this disease effectively, early-stage detection is important.
This study focuses on detecting all stages of DR (including the mild stage) using end-to-end deep ensemble networks. The results show that the proposed approach outperforms state-of-the-art methods. The remainder of this paper is organized as follows. Section 2 reviews recent literature related to DR detection. The proposed model is described in Section 3. The performance evaluation and results are presented in Section 4. Conclusions of the research and suggestions for future work are summarized in Section 5.

Literature Review
The classification of DR has been extensively studied in the literature. Several studies have proposed methods to detect DR stages and their severity [15][16][17][18][19][20][21]. DR can be detected in many ways, such as single-stage detection and binary classification. The problem with these methods is that they cannot detect the severity of the disease, so the solution is multiclass classification. Pratt et al. [20] proposed a CNN-based method; however, their architecture did not detect the mild stage and achieved 30% sensitivity and 95% specificity, which indicates that the architecture did not classify the affected stages properly. The major cause of their low sensitivity is that they used an imbalanced (skewed) dataset. Another study [22] shows a better result when using a balanced dataset for training; however, a balanced dataset has not yet been used to detect DR when testing a model. In addition, Bravo et al. [23] used a balanced dataset to train a model and achieved 50.5% accuracy; however, the test dataset was not balanced. Further, Chandrakumar et al. [16] implemented a deep CNN with a dropout layer to detect DR and achieved 94% accuracy on the DRIVE dataset. They used spatial feature analysis for detection; however, the number of samples in the dataset was not sufficient. Furthermore, Takahashi et al. [21] proposed an AI disease-staging system that grades DR by using a retinal area.
That system directly suggests treatments and determines prognoses. However, it relies on modified Davis staging, which is not commonly employed for grading DR. In that study, among the images the network misclassified, the false-negative rate was lower than the false-positive rate.
Moreover, deep learning classification algorithms have proven to be very effective if the model is trained in a supervised manner with sufficient data. Transfer learning with different CNN models has achieved good accuracy and DR classification results [24,25]. Kori et al. [26] used an ensemble technique to detect DR stages and diabetic macular edema (DME), and, according to [27], ensemble models perform well. Similarly, Choi et al. [28] pretrained a model with transfer learning and used an ensemble voting technique that improved accuracy. Hagos et al. [29] used transfer learning to train an Inception-V3 model that classified all stages and achieved 90.9% accuracy on the test dataset. Similarly, Carson et al. [30] used transfer learning to classify all DR stages. However, although these studies used transfer learning and ensemble techniques, none used a balanced dataset to train and test a model.
The literature shows that researchers have applied or proposed various methods for detecting and classifying DR stages. As mentioned above, there are many ways to detect DR, but only multiclass classification detects the severity of the DR stage. In multiclass classification, DR is divided into the five stages discussed in the Introduction section. However, most of the methods in the literature cannot properly classify all stages of DR, especially the initial stage. It is important to detect the early stage of DR for treatment, as at a later stage it is difficult to cure and can lead to blindness. To the best of our knowledge, no other work has identified the early stages using the Kaggle dataset with a balanced dataset, as we do in this research. Our models can detect the mild stage and perform better than the current state of the art. Moreover, no study in the literature has reported results on a balanced dataset. An imbalanced dataset can bias the classification accuracy.
If the samples are equally distributed across the classes, as in a balanced dataset, the network can learn all classes correctly; with an imbalanced dataset, however, the network becomes biased toward the classes with the most samples.

Preprocessing.
The preprocessing steps performed on the input dataset before it is given to the model are shown in Figure 2.
We used the Kaggle dataset, which contains 35126 color fundus images, each of size 3888 × 2952 pixels, as shown in Figure 2(a). The images belong to five classes according to DR severity. The distribution of the sample images across the classes is shown in the first row of Table 1 and is clearly imbalanced. Training a deep network on an imbalanced dataset may bias the classification. In the first preprocessing step, we resized each image to 786 × 512, which maintains the original aspect ratio, as shown in Figure 2(b). We then randomly cropped 512 × 512 patches to reduce training overhead, as depicted in Figure 2(c). Furthermore, to speed up training and avoid feature bias, each image was mean normalized, as shown in Figure 2(d). Finally, the dataset was balanced by upsampling [22]: images of the minority classes were augmented by 90-degree rotation and flipping, as shown in Figure 2(e), which increases the size of the dataset, balances the samples in each class, and helps avoid overfitting. The distribution of the balanced sample images is shown in the second row of Table 1; the balanced dataset contains 129050 samples.
The balanced dataset is divided into three sets: training (64%), testing (20%), and validation (16%). The validation set is used during training to check and reduce overfitting. The training set is further divided into three smaller datasets, dataset_1, dataset_2, and dataset_3, each of which has an equal number of samples, i.e., 27530; these are used to train the models.
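The cropping, mean normalization, and upsampling steps described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names (`random_crop`, `mean_normalize`, `upsample_to_balance`) are hypothetical, random pixels stand in for a resized fundus image, and per-channel mean subtraction is assumed as the form of mean normalization.

```python
import numpy as np

def random_crop(image, size, rng):
    """Cut a random size x size patch out of an H x W x C image."""
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return image[top:top + size, left:left + size]

def mean_normalize(image):
    """Subtract the per-channel mean so that every channel is centred on zero."""
    image = image.astype(np.float32)
    return image - image.mean(axis=(0, 1), keepdims=True)

def upsample_to_balance(images, labels):
    """Append rotated or flipped copies of minority-class patches until
    every class has as many samples as the largest class."""
    labels = np.asarray(labels)
    target = np.bincount(labels).max()
    out_images, out_labels = list(images), labels.tolist()
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        for k in range(target - len(idx)):
            src = images[idx[k % len(idx)]]
            # Alternate between a 90-degree rotation and a horizontal flip.
            out_images.append(np.rot90(src) if k % 2 == 0 else np.fliplr(src))
            out_labels.append(int(cls))
    return out_images, np.asarray(out_labels)

# Toy usage: random pixels standing in for an image already resized to 786 x 512.
rng = np.random.default_rng(0)
resized = rng.integers(0, 256, size=(512, 786, 3), dtype=np.uint8)
patch = mean_normalize(random_crop(resized, 512, rng))

# Six small hypothetical patches with imbalanced labels (3, 1, and 2 per class).
patches = [rng.random((8, 8, 3)) for _ in range(6)]
balanced_patches, balanced_labels = upsample_to_balance(patches, [0, 0, 0, 1, 2, 2])
print(patch.shape, np.bincount(balanced_labels))
```

Because the patches are square, a 90-degree rotation keeps the 512 × 512 shape. With only two transforms, a strongly under-represented class would receive repeated copies in this sketch; a real pipeline would likely draw from a larger set of augmentations.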

Model 1.
The combination of several machine learning techniques into a single predictive model is called an ensemble method. An ensemble may aim to decrease variance (bagging), decrease bias (boosting), or improve prediction (stacking) [31]. We employ a bagging ensemble technique, where bagging stands for bootstrap aggregation. The variance of an estimate can be reduced by averaging multiple estimates. In bagging, bootstrap sampling is used to obtain the data subsets on which the base learners are trained, and the outputs of the base learners are aggregated by voting or averaging for classification. The proposed approach ensembles the results of DenseNet-121 trained on three datasets. Algorithm 1 presents the proposed model in detail. Let H = {DenseNet-121} be a pretrained model. The model is fine-tuned with three fundus image datasets (X, Y), where X is a set of N images, each of size 512 × 512, and Y contains the corresponding labels, Y = {y | y ∈ {normal, mild, moderate, severe, PDR}}. The three bags of data are divided into mini-batches (X_i, Y_i), i = 1, ..., N/n, each of size n; iteratively optimizing (fine-tuning) the CNN model h ∈ H reduces the empirical loss

$$L(w, X_i) = \frac{1}{n} \sum_{x \in X_i,\; y \in Y_i} l\big(h(x, w), y\big),$$

where x is the input, y is the class, h(x, w) is the CNN model predicting class y for input x, w denotes the weights, and l is the categorical cross-entropy penalty function. Stochastic gradient descent (SGD) is used to update the learning parameters:

$$v_{t+1} = \gamma v_t - \alpha \nabla L(w_t + \gamma v_t, X_i), \qquad w_{t+1} = w_t + v_{t+1},$$

where v_t is the accumulated update (velocity), α is the learning rate, set to 0.0001, and γ is the Nesterov momentum, set to 0.9, which helps accelerate SGD in the relevant direction and dampens oscillations. At the start, w_t, t = 0, is initialized to the learned weights of the model h ∈ H using transfer learning. The output layer of the model h ∈ H uses softmax as the activation function, which generates the probabilities that the input belongs to each of the classes {normal, mild, moderate, severe, PDR}. We train for 50 epochs, with early termination if the model starts overfitting.
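To make the training step concrete, the sketch below applies SGD with Nesterov momentum to the mean categorical cross-entropy of a small linear softmax classifier. It is an illustration under stated assumptions rather than the paper's training code: a linear model replaces the fine-tuned CNN, the weights start at zero instead of pretrained values, the data are synthetic, and the names `loss_and_grad` and `nesterov_step` are hypothetical. The defaults mirror the reported settings (learning rate 0.0001, momentum 0.9), but the toy run uses a larger learning rate so that it converges in a few hundred steps.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(w, x, y_onehot):
    """Mean categorical cross-entropy of a linear softmax model and its gradient."""
    p = softmax(x @ w)
    loss = -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))
    grad = x.T @ (p - y_onehot) / len(x)
    return loss, grad

def nesterov_step(w, v, x, y_onehot, alpha=1e-4, gamma=0.9):
    """One SGD update with Nesterov momentum: the gradient is evaluated
    at the look-ahead point w + gamma * v."""
    _, grad = loss_and_grad(w + gamma * v, x, y_onehot)
    v = gamma * v - alpha * grad
    return w + v, v

# Synthetic 5-class problem (hypothetical data, 4 features per sample).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
y = (x @ rng.normal(size=(4, 5))).argmax(axis=1)
y_onehot = np.eye(5)[y]

w = np.zeros((4, 5))
v = np.zeros_like(w)
initial_loss, _ = loss_and_grad(w, x, y_onehot)
for _ in range(200):
    # alpha=0.1 is chosen for this toy problem only, not taken from the paper.
    w, v = nesterov_step(w, v, x, y_onehot, alpha=0.1)
final_loss, _ = loss_and_grad(w, x, y_onehot)
print(initial_loss, final_loss)
```

With zero weights, every class receives probability 1/5, so the initial loss equals ln 5; the loss after training should be lower.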
During testing, unseen examples are used to evaluate the model. The outputs of all the models are combined by averaging, which provides a unified output. The ensemble approach leads to better performance by combining the strengths of the individual models. The proposed bagging ensemble is shown in Figure 3. Let X_test be a new test sample, and let h_1, ..., h_K be the K = 3 fine-tuned models with weights w_1, ..., w_K; then the ensemble output is given by

$$\hat{y} = \arg\max_{c} \frac{1}{K} \sum_{k=1}^{K} h_k(X_{test}, w_k)_c,$$

where the subscript c denotes the probability assigned to class c.
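The averaging step can be illustrated with a few lines of NumPy. This is a sketch only: the probability vectors below are made-up softmax outputs for two test images, not results from the paper, and `ensemble_predict` is a hypothetical helper name.

```python
import numpy as np

CLASSES = ["normal", "mild", "moderate", "severe", "PDR"]

def ensemble_predict(prob_list):
    """Average per-model softmax outputs (each of shape [n_samples, n_classes])
    and return the averaged probabilities with the index of the top class."""
    mean_probs = np.mean(np.stack(prob_list, axis=0), axis=0)
    return mean_probs, mean_probs.argmax(axis=1)

# Hypothetical outputs of three fine-tuned models for two test images.
model_a = np.array([[0.6, 0.3, 0.1, 0.0, 0.0], [0.0, 0.1, 0.2, 0.3, 0.4]])
model_b = np.array([[0.2, 0.5, 0.2, 0.1, 0.0], [0.0, 0.0, 0.1, 0.6, 0.3]])
model_c = np.array([[0.3, 0.5, 0.1, 0.1, 0.0], [0.0, 0.1, 0.1, 0.2, 0.6]])

mean_probs, top = ensemble_predict([model_a, model_b, model_c])
predicted = [CLASSES[i] for i in top]
print(predicted)  # ['mild', 'PDR']
```

For the first image, one model on its own would answer "normal", but the averaged probabilities favor "mild". This shows how the ensemble can correct a single model's error.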

Model 2.
This model is an improved version of Model 1 for the classification of DR stages. Algorithm 2 presents the details of the proposed model. Let H = {DenseNet-121, ResNet50, Inception-V3} be a set of pretrained models. The models are fine-tuned with three fundus image datasets (X, Y), under the same conditions as Model 1 (Section 3.2). The proposed bagging ensemble for Model 2 is illustrated in Figure 4.

Results
In this section, the results of the proposed models on the imbalanced and balanced datasets are discussed. The proposed models were trained on a high-end graphics processing unit (NVIDIA GeForce GTX 1070 Laptop) with the CUDA Deep Neural Network library (cuDNN). In addition, TensorFlow and Keras (http://keras.io/) were used, with Keras as the deep learning package and TensorFlow as the machine learning back end.

Performance Parameters.
We used the following metrics to evaluate the performance of the proposed models. The objective was to classify all DR stages properly, especially the early stages.
Accuracy. Accuracy is the fraction of correctly classified samples over the positive and negative classes:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Here, TP is true positive, TN is true negative, FP is false positive, and FN is false negative.
Recall/Sensitivity. Recall (or sensitivity) is also known as the TP rate (TPR):

$$\text{Recall} = \frac{TP}{TP + FN}.$$

Specificity. Specificity is also known as the TN rate:

$$\text{Specificity} = \frac{TN}{TN + FP}.$$

Precision. Precision is the fraction of the samples predicted as a class that actually belong to that class:

$$\text{Precision} = \frac{TP}{TP + FP}.$$

The area under the curve (AUC) [32] of the receiver operating characteristic (ROC) curve [33] measures the degree of separability of the classes. The higher the AUC score, the better the model, and vice versa. To show the effect of an imbalanced dataset, we used two test datasets: (i) an imbalanced dataset and (ii) a balanced dataset. Finally, we also present comparative results of the models. The distribution of the test dataset samples is given in Table 2.
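For a multiclass problem, these metrics are usually computed per class in a one-vs-rest fashion from the confusion matrix. The sketch below assumes that convention; the function name `per_class_metrics` and the 3 × 3 matrix are hypothetical and are not taken from the paper's figures.

```python
import numpy as np

def per_class_metrics(cm):
    """One-vs-rest metrics from a confusion matrix where cm[i, j] is the
    number of samples of true class i predicted as class j.
    A class that is never predicted gives a 0/0 precision (NaN)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp   # true class i, predicted as something else
    fp = cm.sum(axis=0) - tp   # predicted as class i, actually something else
    tn = cm.sum() - tp - fn - fp
    return {
        "accuracy": tp.sum() / cm.sum(),   # overall fraction correct
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# Hypothetical 3-class confusion matrix.
cm = [[50, 5, 5],
      [10, 20, 10],
      [0, 5, 35]]
metrics = per_class_metrics(cm)
print(metrics["accuracy"])     # 0.75
print(metrics["recall"])       # approx. [0.833, 0.5, 0.875]
print(metrics["specificity"])  # approx. [0.875, 0.9, 0.85]
```

In this example the overall accuracy is 0.75 although the recall of the middle class is only 0.5. Class-wise recall, precision, and specificity expose such weaknesses, which a single accuracy figure hides.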

Model 1.
Model 1 is similar to a bagging technique in which only a single base model is used with different bags of datasets. We use DenseNet-121 as the base model and train it on the three balanced datasets discussed in Section 3.1. The results were computed using a batch size of 5. The resulting confusion matrices are shown in Figure 5. Each class is equally represented in the balanced dataset; therefore, the classification result is better than the result obtained using the imbalanced dataset. With the balanced dataset, Class 1 (mild) is predicted accurately, whereas with the imbalanced dataset only eight images are predicted for this class. We obtained higher accuracy with the imbalanced dataset due to the unequal distribution of samples. We can also see a difference in the Class 4 (PDR) prediction, where the balanced dataset outperforms the imbalanced dataset. The overall accuracy achieved by this model was 78.13% and 60.80% on the imbalanced and balanced datasets, respectively. For a more detailed assessment, we also evaluated the ROC curves, shown in Figure 6, and class-wise results, given in Table 3. Table 3 lists class-wise results for both the balanced and imbalanced datasets in terms of recall, precision, and specificity. As can be seen, the results are significantly better on the balanced dataset than on the imbalanced dataset. The values differ because of the different numbers of samples. In the recall column, all class values for the balanced dataset are better than those for the imbalanced dataset, particularly for Class 1 and Class 4, which are predicted very well by Model 1. In the precision column, Class 1 and Class 4 are predicted more accurately on the balanced dataset than on the imbalanced dataset. In the specificity column, the values for the balanced dataset are lower than the values for the imbalanced dataset. Overall, the class-wise results on the balanced dataset are better than those on the imbalanced dataset, even though the overall accuracy is lower.
Because the accuracy results differ, we also computed the ROC curves for the balanced and imbalanced datasets, as shown in Figure 6. The ROC curves for the balanced dataset reflect the equal distribution of samples. The most accurate prediction was obtained for Class 0, whose area is 0.96, which is near 1. The areas for Classes 1-4 are also near 1, which indicates that Model 1 separates all classes well. With the imbalanced dataset, the highest curve is achieved by Class 4 because it has the lowest number of samples and a high proportion of them is predicted correctly, which is why its area is near 1. Class 0 has more samples than all other classes, and the number of correctly predicted images is also very high. However, the area decreases as the number of samples increases, which is why the area for Class 0 is 0.82.
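Why the same model scores higher accuracy on the imbalanced test set can be shown with simple arithmetic: overall accuracy is the sample-weighted mean of the per-class recalls, so a large, well-recognized Class 0 dominates the result. The class counts and recalls in the sketch below are hypothetical and are not the values from Table 2 or Table 3.

```python
def overall_accuracy(class_counts, class_recalls):
    """Overall accuracy as the sample-weighted mean of per-class recalls."""
    correct = sum(n * r for n, r in zip(class_counts, class_recalls))
    return correct / sum(class_counts)

# Hypothetical per-class recalls of one fixed classifier (Classes 0-4).
recalls = [0.95, 0.10, 0.50, 0.40, 0.60]

# Hypothetical test sets: one dominated by Class 0, one with equal class sizes.
imbalanced_counts = [700, 70, 150, 40, 40]
balanced_counts = [200, 200, 200, 200, 200]

acc_imbalanced = overall_accuracy(imbalanced_counts, recalls)
acc_balanced = overall_accuracy(balanced_counts, recalls)
print(round(acc_imbalanced, 3), round(acc_balanced, 3))  # 0.787 0.51
```

The classifier is identical in both cases, yet its accuracy drops from about 0.79 to 0.51 once every class carries equal weight. For this reason, the class-wise results and ROC curves are more informative than overall accuracy when the classes are skewed.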

Model 2.
Model 2 follows the same procedure as Model 1; however, the trained models differ. With Model 2, we trained three deep CNN models, i.e., DenseNet-121, ResNet50, and Inception-V3, with different bags of the training dataset. Note that the same test dataset was used for all three models. The models were tested with the balanced and imbalanced datasets, and the results are shown in Figure 7. The classes are distributed equally in the balanced dataset; therefore, the classification result is better than that on the imbalanced dataset. With the balanced dataset, Class 1 (mild) is predicted accurately, whereas with the imbalanced dataset only two images are predicted for this class. In addition, with the balanced dataset, approximately 2000 images are predicted accurately in each class, and 5000 images are predicted for Class 0. We obtain higher accuracy with the imbalanced dataset due to the unequal sample distribution. We can also see a difference in the Class 1 (mild) and Class 4 (PDR) predictions; for these classes, the balanced dataset outperforms the imbalanced dataset. The overall accuracy achieved by this model was 80.36% and 60.89% on the imbalanced and balanced datasets, respectively. For a more detailed assessment, we also consider class-wise results in terms of recall, precision, and specificity, as shown in Table 4. The balanced dataset shows improved results compared to the imbalanced dataset: all values in the recall and precision columns improved, and specificity also improved. Overall accuracy on the imbalanced dataset was higher than on the balanced dataset because of the unequal number of samples in the imbalanced dataset: the negative class (Class 0), which is predicted accurately, has a large number of samples, whereas the positive Classes 1-4 are misclassified.
The ROC curves for the balanced and imbalanced datasets are shown in Figure 8. The ROC curves for the balanced dataset reflect the equal distribution of samples. The most accurately predicted class is Class 0, whose area is 0.97, which is close to 1. The areas for Classes 1-4 are also close to 1, which indicates that Model 2 separates all classes well. With the imbalanced dataset, the highest curve was achieved by Class 4 because it has the lowest number of samples and a high proportion of them is predicted correctly, which is why its area is close to 1. Class 0 has more samples than the other classes, and the number of correctly predicted images is also very high. However, the area decreases as the number of samples increases, which is why the area for Class 0 is 0.84. It is vital to detect all DR stages for early treatment of the disease. Class 1 (mild) is the first stage of the disease, and detection at this stage is important to provide better treatment.

Model Comparison.
In this section, the results obtained with Model 1 and Model 2 are compared with each other and with those of other models. Model 2 returned better results than Model 1 because the stages are classified more accurately. Models 1 and 2 achieved 78.13% and 80.36% accuracy, respectively, as shown in Table 5. The models were also trained with different batch sizes; the accuracy, recall, and specificity results change with the batch size, as given in Table 5.
Various models are compared in Table 5. These models were trained with the same architecture but with different batch sizes, except for the model proposed by Pratt et al. [20]. Models batch_size_2, batch_size_3, and batch_size_4 were trained with the Model 1 architecture but with batch sizes 2, 3, and 4, respectively. Model 1 used batch size 5; its results are discussed in detail in Section 4. DenseNet-121, ResNet50, Inception-V3, Xception, and Dense169 were trained using the entire balanced dataset with batch size 8 and tested on an imbalanced dataset. Pratt et al.'s [20] model is used for comparison with the proposed models because that study [20] also used the Kaggle dataset and investigated the classification of DR stages. Model 2 was trained with batch size 5, and its results are also discussed in Section 4. Our proposed models, 1

Conclusions
Diabetes is one of the fastest-growing diseases in the world and leads to many complications. Diabetic retinopathy (DR) is one of them. DR progresses through different stages, from mild to severe and then to PDR (proliferative diabetic retinopathy). In the later stages, the disease may lead to symptoms such as floaters, blurred vision, and finally vision loss. Diagnosing this disease manually is tedious and error-prone, so computer vision-based techniques for automatic diagnosis have been discussed in the literature. In this study, we presented two deep ensemble CNN models, Models 1 and 2, to classify the stages of DR using both balanced and imbalanced datasets. The two proposed models outperform single deep learning architectures, such as DenseNet-121, ResNet50, Inception-V3, and Dense169, in terms of accuracy, which indicates that the ensemble technique can strengthen the capability of a classification model. Model 2 yields a higher accuracy (80.36%) than Model 1 (78.13%) on the imbalanced dataset, which indicates that the diversity of the base classifiers used in the ensemble framework is a key factor in the accuracy of the ensemble model. Moreover, the confusion matrices of Models 1 and 2 on the balanced and imbalanced datasets show that equalizing the training dataset makes the classification model more stable.
In the future, we intend to extend the Kaggle dataset by adding fundus images of the same patients collected over a long period, in collaboration with doctors, and, moreover, to train specific models for specific stages to increase the accuracy for the early stages.

Data Availability
The public Kaggle dataset was used; it can be accessed from https://www.kaggle.com/.

Conflicts of Interest
The authors declare that they have no conflicts of interest.