Hybrid Model Structure for Diabetic Retinopathy Classification

Diabetic retinopathy (DR) is one of the most common complications of diabetes and a leading cause of blindness. Early diagnosis of DR can prevent the disease from progressing. Because medical resources are unevenly distributed and manual screening is slow, the best time for diagnosis and treatment is often missed, which results in impaired vision. Using neural network models to classify and diagnose DR can improve efficiency and reduce costs. In this work, an improved loss function and three hybrid model structures, Hybrid-a, Hybrid-f, and Hybrid-c, are proposed to improve the performance of DR classification models. EfficientNetB4, EfficientNetB5, NASNetLarge, Xception, and InceptionResNetV2 CNNs were chosen as the basic models. Each basic model was trained once with the enhance cross-entropy (E-CE) loss and once with the cross-entropy (CE) loss. The outputs of the basic models were then used to train the hybrid model structures. Experiments showed that the E-CE loss effectively accelerates the training of the basic models and improves their performance under various evaluation metrics. The proposed hybrid model structures also improve DR classification performance. Compared with the best results among the basic models, the hybrid model structures improved the accuracy of DR classification from 85.44% to 86.34%, the sensitivity from 98.48% to 98.77%, the specificity from 71.82% to 74.76%, the precision from 90.27% to 91.37%, and the F1 score from 93.62% to 93.9%.


Introduction
Diabetic retinopathy (DR) is an eye disease in which diabetes damages the retina. The longer a person has had diabetes, the more likely they are to develop diabetic retinopathy. According to its severity, DR can be divided into the following five grades: no DR, mild, moderate, severe, and proliferative DR. Mild, moderate, and severe are classified as the nonproliferative diabetic retinopathy (NPDR) stage. In the NPDR stage, patients have no obvious symptoms. The way to detect NPDR is to have the fundus examined by a trained ophthalmologist. As the condition worsens, DR develops to the proliferative DR (PDR) stage. In the PDR stage, abnormal new blood vessels form at the back of the eye. These fragile blood vessels can burst and bleed, which blurs vision and eventually leads to blindness. So far, the most effective treatment period for DR is the NPDR stage. Therefore, regular screening of diabetic patients through fundus examination is the most effective method to detect early abnormal signs of DR. Early diagnosis and timely treatment help prevent the progression of DR in patients [1].
However, screening for diabetic retinopathy requires the professional clinical knowledge, experience, and diagnosis time of ophthalmologists. Ophthalmologists generally need to examine the patient's fundus directly and combine this with fundus retinal images taken by special equipment to diagnose the severity of the patient's diabetic retinopathy. This process takes a lot of time, and the number of professional ophthalmologists is far from enough to meet the number of patients who need diagnosis. Therefore, automatic algorithms for classifying DR severity play an important role in improving the efficiency of DR diagnosis. Fundus images, the main images used to study DR, are a current research hotspot [2,3]. Some research [4][5][6][7] uses machine learning algorithms for DR detection and classification. However, as deep learning has done well in many competitions, more and more research uses deep learning methods for DR detection and classification. This research has mainly focused on end-to-end DR severity classification of fundus images using CNNs. Li et al. [8] presented a novel cross-disease attention network (CANet) to jointly grade DR and diabetic macular edema (DME). They proposed a disease-specific attention module and a disease-dependent attention module to extract useful features. Their network achieved an AUC of 96.3% and an accuracy of 92.6% for DR classification on the Messidor database. Shanthi and Sabeenian [9] proposed a modified AlexNet architecture [10] for classifying DR fundus images according to the severity of the disease, applying suitable pooling, softmax, and rectified linear unit (ReLU) layers to obtain a high level of accuracy. They validated the proposed algorithm on the Messidor database [11], where it achieved a classification accuracy of 96.6%. In a study, Hosseinzadeh et al. 
[12] [17] and evaluated the CNN's performance for 2-class and 5-class DR classification. They found that the performance of the model was directly linked to the number of convolutional and pooling layers in the CNN. The best accuracy for 2-class DR classification was 80.40%, achieved by VGG19.
The main contributions of this work are as follows: an improved loss function, the enhance cross-entropy (E-CE) loss function, which improves the performance of basic DR classification models, and three hybrid model structures, which fuse multiple basic models for better DR classification performance. In this work, the fundus images were first preprocessed. During the training of the basic models, data enhancement methods were used to expand the number and diversity of samples in the DR fundus dataset. Each basic model was trained once with the E-CE loss and once with the cross-entropy (CE) loss. Results (see Table 1) showed that the proposed E-CE loss shortens the convergence time of the loss. Under various evaluation metrics, the basic models trained with the E-CE loss performed better than the models trained with the CE loss. Then, the final output features of the better-performing basic models were combined in different ways to train the hybrid model structures. Results showed that the hybrid model structures further improve performance compared with the basic models.

Materials and Methods
The overall algorithm proposed in this work is shown in Figure 1. It consists of three steps: fundus image preprocessing, prediction by the basic CNN models, and prediction by the hybrid model structures, which produces the DR grade output. First, the fundus images are preprocessed. Then, each basic CNN model makes a prediction on the preprocessed fundus images, and the outputs of the basic CNN models are fed into the hybrid model structures. Finally, the hybrid model structures output five predicted values, corresponding to the probabilities of the five DR grades, and the DR grade with the largest probability is taken as the result for the fundus image.

Dataset.
The dataset for this work combines three datasets: the Kaggle diabetic retinopathy detection competition dataset [18] provided by EyePACS, the APTOS 2019 Blindness Detection dataset organized by the 4th Asia Pacific Tele-Ophthalmology Society [19], and the DeepDR Diabetic Retinopathy Image Dataset provided by the IEEE International Symposium on Biomedical Imaging (ISBI) 2020 [20]. The EyePACS dataset contains 35,126 training fundus images and 53,576 test fundus images. The APTOS dataset contains 3,662 training fundus images and 1,928 test fundus images. The DeepDR dataset contains 1,200 training fundus images, 400 validation fundus images, and 400 test fundus images. All rated fundus images from the three datasets are graded for the severity of diabetic retinopathy on a scale of 0 to 4: 0 is no DR, 1 is mild DR, 2 is moderate DR, 3 is severe DR, and 4 is proliferative DR. Examples of fundus images with different DR severity are shown in Figure 2. Each fundus image from the three datasets has a high resolution. Because only the training fundus images of the three datasets are rated, the dataset for this work consists of those training images, 39,988 fundus images in total. As shown in Table 2, the class distribution of the dataset is highly imbalanced, and most of the fundus images are of the no DR grade.

Data Processing.
There are two steps for data processing. One is preprocessing the fundus images before training the basic models; the other is fundus image enhancement during training. The first step mainly removes the black border of the fundus images, because the black border carries useless information and weakens the feature extraction ability of the basic models, and resizes the images to a size suitable as model input. The details are as follows:
(1) Binary processing was performed on the fundus images to find the border between the black area and the fundus area, and the extra black border was then cut from each fundus image. The process is shown in Figure 3.
(2) Because each fundus image has a high resolution, which is not suitable as input for the basic models, all images were resized to 380 × 380 pixels for EfficientNetB4, 331 × 331 pixels for NASNetLarge, and 299 × 299 pixels for EfficientNetB5, Xception, and InceptionResNetV2.
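The border-removal step in item (1) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `crop_black_border` and the binarization cutoff `threshold=10` are assumptions for the example, not values taken from this work, and resizing to the model input size is left out.

```python
import numpy as np

def crop_black_border(image, threshold=10):
    """Crop away the near-black rows and columns around the fundus area.

    image: H x W x 3 (or H x W) uint8 array.
    threshold: hypothetical intensity cutoff used for binarization.
    """
    gray = image if image.ndim == 2 else image.mean(axis=2)
    mask = gray > threshold               # binary image: True inside the fundus
    if not mask.any():                    # fully black image: nothing to crop
        return image
    rows = np.where(mask.any(axis=1))[0]  # rows that contain fundus pixels
    cols = np.where(mask.any(axis=0))[0]  # columns that contain fundus pixels
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# Toy example: a 6 x 8 black canvas with a bright 2 x 3 "fundus" patch.
canvas = np.zeros((6, 8, 3), dtype=np.uint8)
canvas[2:4, 3:6] = 200
cropped = crop_black_border(canvas)
print(cropped.shape)
```

The cropped image would then be resized to the input resolution of the chosen basic model with any standard image library.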
In the training process, the following operations were performed on the fundus images: rotation, width shift, height shift, shear range, zoom, horizontal flip, and vertical flip.
Then, RandAugment was applied to the images. RandAugment [21] is an improved data augmentation method proposed by Cubuk et al. On the ImageNet dataset, Cubuk et al. achieved 85.0% accuracy by using RandAugment, a 0.6% increase over the previous state of the art and a 1.0% increase over baseline augmentation.

Basic Model Structures.
In this work, hybrid model structures were proposed to improve the classification ability of the basic models. EfficientNetB4, EfficientNetB5, NASNetLarge, Xception, and InceptionResNetV2 CNNs were chosen as the basic models, and three methods were used to implement the hybrid model structure. Finally, experiments were carried out to verify the performance of the basic models and of the basic models combined with the hybrid model structures. The results are shown in the Results and Discussion section. The structures of the basic models are as follows:
(1) EfficientNet: EfficientNet [22] is a family of models designed by Tan et al. They proposed a scaling method [23] that uniformly scales the depth, width, and resolution of CNNs using a simple yet highly effective compound coefficient. Then, they used neural architecture search to design a new baseline network and scaled it up with this method to obtain EfficientNet, which achieves much better accuracy and efficiency than previous ConvNets. In this work, EfficientNetB4 and EfficientNetB5 were chosen as basic models. The input size of EfficientNetB5 was changed to 299 × 299 pixels, and EfficientNetB4 kept its original input resolution. A dropout layer with a drop rate of 0.4 and a fully-connected layer with 5 units and softmax activation were added to both models.
(2) NASNetLarge: the NASNet architecture, introduced by Zoph et al. [24], is the best architecture found on CIFAR-10 by the neural architecture search (NAS) framework [25].
Xception: Xception is a convolutional neural network architecture based entirely on depthwise separable convolution layers and inspired by Inception. The Xception architecture has 36 depthwise separable convolutional layers forming the feature extraction base of the network, which makes the architecture very easy to define and modify. The 36 convolutional layers are structured into 14 modules, all of which have linear residual connections around them, except for the first and last modules. In this work, the fully-connected layers and the logistic regression layer of the Xception architecture were replaced by a dropout layer and a fully-connected layer with 5 units and softmax activation.

Hybrid Model Structures.
Three methods were proposed to implement the hybrid model structure, called Hybrid-a, Hybrid-f, and Hybrid-c. The details are as follows:
Hybrid-a: in Hybrid-a, the outputs of the basic models for each DR grade are averaged, and the average is taken as the final output of the hybrid model structure. The formula is
$$Y_{\mathrm{grade}} = \frac{1}{N} \sum_{n=1}^{N} y_{\mathrm{grade}}^{n}, \tag{1}$$
where $N$ denotes the number of basic models, $y_{\mathrm{grade}}^{n}$ denotes the DR grade output of the $n$th model, and $Y_{\mathrm{grade}}$ denotes the DR grade output of Hybrid-a.
Hybrid-f: Hybrid-f is a model composed mainly of fully-connected layers. The outputs of the basic models, each a 5 × 1 column vector, are stacked vertically to form a 25 × 1 column vector, which is the input of the Hybrid-f model structure. Figure 4 shows the structure of Hybrid-f. Hybrid-f consists of 2 fully-connected layers: the hidden layer has 2048 units, and the output layer has 5 units with softmax activation.
Hybrid-c: Hybrid-c is composed mainly of 2D convolution layers. The 5 × 1 column vector outputs of the basic models are stacked horizontally to form a 5 × 5 matrix, which is the input of the Hybrid-c model structure. The structure of Hybrid-c is shown in Figure 5, and its details are shown in Table 3. Three 2D convolution layers, acting as feature extraction layers, make up the first half of the Hybrid-c structure, and the Hybrid-f structure makes up the last part of Hybrid-c.
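The three ways of combining the basic model outputs can be illustrated with NumPy. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the weights are random rather than trained, and the ReLU activation in the hidden layer of the Hybrid-f forward pass is an assumption, since the text only specifies the softmax of the output layer.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def hybrid_a(outputs):
    """outputs: N x 5 array, one row of grade probabilities per basic model."""
    return outputs.mean(axis=0)           # average each grade over the models

def hybrid_f_input(outputs):
    """Stack the N 5 x 1 outputs vertically into a (5N) x 1 column vector."""
    return outputs.reshape(-1, 1)

def hybrid_c_input(outputs):
    """Stack the N 5 x 1 outputs horizontally into a 5 x N matrix."""
    return outputs.T

def hybrid_f_forward(x, w1, b1, w2, b2):
    """Two fully-connected layers; hidden ReLU is an assumption."""
    hidden = np.maximum(0.0, w1 @ x + b1)
    return softmax((w2 @ hidden + b2).ravel())

# Hybrid-a on two hypothetical model outputs.
two_models = np.array([[0.6, 0.4, 0.0, 0.0, 0.0],
                       [0.2, 0.8, 0.0, 0.0, 0.0]])
averaged = hybrid_a(two_models)
predicted_grade = int(np.argmax(averaged))   # grade with the largest probability

# Five hypothetical basic models, each giving 5 grade probabilities.
rng = np.random.default_rng(0)
outputs = rng.dirichlet(np.ones(5), size=5)
x_f = hybrid_f_input(outputs)                # 25 x 1 input for Hybrid-f
x_c = hybrid_c_input(outputs)                # 5 x 5 input for Hybrid-c

# Untrained random weights with the layer sizes given in the text.
w1, b1 = rng.normal(0, 0.01, (2048, 25)), np.zeros((2048, 1))
w2, b2 = rng.normal(0, 0.01, (5, 2048)), np.zeros((5, 1))
probs = hybrid_f_forward(x_f, w1, b1, w2, b2)
print(predicted_grade, x_f.shape, x_c.shape, probs.shape)
```

In Hybrid-c, the 5 × 5 matrix would first pass through the three 2D convolution layers listed in Table 3 before reaching the fully-connected part; that stage is omitted here because the layer parameters are not reproduced in the text.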

Loss Function. Different loss functions have different effects on the training process and results of network models.
In this work, an improved loss function, the E-CE loss function, was proposed for training the basic models, and comparison experiments with the CE loss function were carried out. The formula of the CE loss function is
$$L_{CE} = -\sum_{n=1}^{N} \left[ y_n \log \hat{y}_n + (1 - y_n) \log (1 - \hat{y}_n) \right], \tag{2}$$
where $y$ denotes the true value, $\hat{y}$ denotes the predicted value, and $N$ denotes the total number of DR grades. The E-CE loss function is based on the CE loss function and adds a term defined in terms of $G_y$ and $G_{\hat{y}}$, where $G_y$ denotes the true DR grade and $G_{\hat{y}}$ denotes the predicted DR grade. The DR grade is an integer in the range of 0 to 4. The added term measures the impact of misclassification by the basic models during training: the farther the output value of the model is from the true value, the greater the extra loss. Experiments (see the Results and Discussion section) showed that the E-CE loss function accelerates the training of the basic models and improves their accuracy.
The batch sizes of EfficientNetB4, EfficientNetB5, NASNetLarge, Xception, and InceptionResNetV2 are 32, 40, 64, 64, and 32, respectively. The CE loss function and the E-CE loss function were each used to train every basic model as a control experiment. Each model was trained for 50 epochs. Weights pretrained on the ImageNet dataset were used to accelerate the training of each basic model. For training the Hybrid-f and Hybrid-c model structures, the optimizer was Adam, the initial learning rate was 0.001, the loss function was the cross-entropy loss function, and training ran for 100 epochs.
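To make the idea of a grade-distance penalty concrete, the sketch below adds an extra term to a cross-entropy of the form $-\sum_n [y_n \log \hat{y}_n + (1 - y_n)\log(1 - \hat{y}_n)]$. The exact form of the term added in the E-CE loss is not reproduced in the text, so the absolute grade difference $|G_y - G_{\hat{y}}|$ and the weight `alpha` used here are assumptions chosen only to match the stated behaviour (a larger extra loss the farther the predicted grade is from the true grade). The function name `grade_distance_ce` is hypothetical.

```python
import numpy as np

def ce_loss(y_true, y_pred, eps=1e-7):
    """Cross-entropy summed over the N grades, for one-hot y_true."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred)
                         + (1.0 - y_true) * np.log(1.0 - y_pred)))

def grade_distance_ce(y_true, y_pred, alpha=1.0):
    """CE plus an ASSUMED penalty alpha * |true grade - predicted grade|."""
    g_true = int(np.argmax(y_true))            # true DR grade, 0..4
    g_pred = int(np.argmax(y_pred))            # predicted DR grade, 0..4
    return ce_loss(y_true, y_pred) + alpha * abs(g_true - g_pred)

y_true = np.array([1.0, 0.0, 0.0, 0.0, 0.0])           # true grade 0
near = np.array([0.10, 0.70, 0.10, 0.05, 0.05])        # predicts grade 1
far = np.array([0.10, 0.05, 0.10, 0.05, 0.70])         # predicts grade 4

# Plain CE cannot tell these two mistakes apart; the penalized loss can.
print(ce_loss(y_true, near), ce_loss(y_true, far))
print(grade_distance_ce(y_true, near), grade_distance_ce(y_true, far))
```

Because the predicted grade is obtained with an argmax, this particular penalty has no gradient of its own; a trainable variant would need a differentiable surrogate, which is outside the scope of this sketch.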

Performance Evaluation.
The performance of the basic models and the hybrid models is evaluated by 5 evaluation metrics: accuracy, sensitivity, specificity, precision, and F1 score. The formulas are as follows, where TP denotes the number of positive samples correctly identified as positive, TN denotes the number of negative samples correctly identified as negative, FP denotes the number of negative samples falsely identified as positive, and FN denotes the number of positive samples falsely identified as negative:
$$\begin{aligned}
\mathrm{accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\mathrm{sensitivity} &= \frac{TP}{TP + FN}, \\
\mathrm{specificity} &= \frac{TN}{TN + FP}, \\
\mathrm{precision} &= \frac{TP}{TP + FP}, \\
\mathrm{F1\ score} &= \frac{2 \cdot \mathrm{precision} \cdot \mathrm{sensitivity}}{\mathrm{precision} + \mathrm{sensitivity}}.
\end{aligned}$$
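As a quick check of how the five metrics relate to the four counts, the following sketch computes them directly from TP, TN, FP, and FN. The counts in the example are invented for illustration and are not results from this work; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute the five evaluation metrics from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # recall on positive samples
    specificity = tn / (tn + fp)          # recall on negative samples
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Invented counts, for illustration only.
m = binary_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts the sensitivity (0.90) is higher than the specificity (0.80), the same pattern the reported results show, which is what one would expect when positive samples are rarely missed but negatives are sometimes flagged.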

Results and Discussion.
The basic models were trained on 34,988 fundus images, which were selected according to the DR grade ratio from the dataset consisting of the EyePACS, APTOS, and DeepDR datasets. The remaining 5,000 images of the dataset were used as test images to evaluate the performance of the models.
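Selecting images "according to the DR grade ratio" corresponds to a stratified split, in which each grade contributes the same fraction of its images to the test set. The sketch below shows one way to do this with NumPy; the function name `stratified_split`, the seed, and the toy label counts are hypothetical (the real class distribution is given in Table 2 and is not reproduced here).

```python
import numpy as np

def stratified_split(labels, test_fraction, seed=0):
    """Split indices so that every grade keeps the same test fraction."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for grade in np.unique(labels):
        idx = rng.permutation(np.where(labels == grade)[0])  # shuffle one grade
        n_test = int(round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Toy, imbalanced label set with 1,000 images (hypothetical counts).
labels = np.repeat([0, 1, 2, 3, 4], [700, 100, 120, 40, 40])
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx))
print(np.bincount(labels[test_idx]))
```

For the split described above, the test fraction would be roughly 5,000 / 39,988, or about 12.5%.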
To verify the performance of the E-CE loss function, each basic model was trained once with the E-CE loss function and once with the CE loss function. Figure 6 shows that the basic models trained with the E-CE loss function converge faster than those trained with the CE loss function, and their accuracy also improves faster. The accuracy, sensitivity, specificity, precision, and F1 score of the obtained results are shown in Table 1. It can be seen from Table 1 that the proposed E-CE loss function improved the performance of the basic models under some classification metrics, especially accuracy and sensitivity. The models trained with the E-CE loss function show an average performance improvement of about 5% in accuracy and 3.5% in sensitivity. This may be because the extra term of the E-CE loss relative to the CE loss increases the penalty for misclassifying the DR grade during training, which drives the basic models towards the correct classification faster. The proposed hybrid model structures outperform all the basic models in all classification metrics. Referring to Table 1, Hybrid-c has the highest accuracy (0.8634) and F1 score (0.939), Hybrid-a has the highest sensitivity (0.9877), and Hybrid-f has the highest specificity (0.7476) and precision (0.9137). As shown in Table 1, the hybrid model structures improve the classification performance in all aspects. Hybrid-f and Hybrid-c, with their complex structures, have better overall performance than Hybrid-a with its simple structure. Between the two more complex structures, Hybrid-f and Hybrid-c, the difference in performance is small.
Figure 5: The structure of Hybrid-c.
For the hybrid structures proposed in this work, although higher structural complexity does not bring a linear performance improvement, a hybrid structure does improve on the performance of a single model in DR grade classification. The confusion matrix of Hybrid-c on the test fundus images is shown in Table 4. From Table 4, Hybrid-c performs best on DR grade 0, with an accuracy of 0.9706. Its performance on DR grade 2 is also relatively good, with an accuracy of 0.7181, and it performs well on DR grade 4, with an accuracy of 0.6698. DR grade 1 images tend to be misclassified as DR grade 0, and DR grade 3 images tend to be misclassified as DR grade 2. The reasons for this may be as follows:
(1) There are relatively few training fundus samples of DR grades 1 and 3 compared with DR grades 0 and 2, which leads to the poor classification ability of the model for DR grades 1 and 3.
(2) The hidden features of fundus images in DR grades 1 and 3 are close to those of DR grades 0 and 2.
We examined the images of DR grades 1 and 3 that were misclassified as DR grades 0 and 2. Some fundus images, because of the camera, are dark, blurred, or overexposed, which can also affect the judgment of experts.
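The per-grade accuracies and the dominant misclassification for each grade can both be read off a confusion matrix. The sketch below assumes that rows are true grades and columns are predicted grades; this layout and the matrix values are hypothetical, chosen for illustration, and are not the contents of Table 4.

```python
import numpy as np

def per_grade_accuracy(confusion):
    """Fraction of each true grade (row) that was predicted correctly."""
    confusion = np.asarray(confusion, dtype=float)
    return np.diag(confusion) / confusion.sum(axis=1)

def most_common_confusion(confusion):
    """For each true grade, the wrong grade it is most often predicted as."""
    off_diagonal = np.array(confusion, dtype=float)   # copy, then mask diagonal
    np.fill_diagonal(off_diagonal, 0)
    return off_diagonal.argmax(axis=1)

# Hypothetical 5 x 5 confusion matrix (rows: true grade, columns: predicted).
toy = [[95,  3,  2,  0,  0],
       [30, 10, 10,  0,  0],
       [10,  5, 70, 10,  5],
       [ 0,  0, 12,  6,  2],
       [ 0,  0,  8,  2, 20]]
acc = per_grade_accuracy(toy)
wrong = most_common_confusion(toy)
print(np.round(acc, 4))
print(wrong)
```

In this toy matrix, grade 1 is most often mistaken for grade 0 and grade 3 for grade 2, the same pattern described for Hybrid-c.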
In future work, we may improve the data enhancement method to reduce the impact of the DR grade imbalance in the dataset, and we may extract the outputs of the intermediate layers of the basic convolutional models as the input of the hybrid model structure to enrich the input feature maps of the hybrid model.

Conclusions
In this work, we proposed an improved loss function, the E-CE loss function, and three hybrid model structures, Hybrid-a, Hybrid-f, and Hybrid-c, to improve on the performance of a single model. The results show that the E-CE loss function effectively accelerates the training of a single basic model and improves its performance compared with the CE loss function. The three hybrid model structures improve the performance of the basic models in all aspects. Although increasing the complexity of the hybrid model does not bring a linear improvement in model performance, the more complex Hybrid-c and Hybrid-f perform better than the simple Hybrid-a in some evaluation metrics. Finally, for five-class DR classification, the proposed algorithm achieved an accuracy of 86.34%, a sensitivity of 98.77%, a specificity of 74.76%, a precision of 91.37%, and an F1 score of 93.9%.

Conflicts of Interest
The authors declare that they have no conflicts of interest.