Classification of Diabetic Retinopathy Severity in Fundus Images Using the Vision Transformer and Residual Attention

Diabetic retinopathy (DR) is a common retinal vascular disease that can cause severe visual impairment. Intelligent diagnosis of DR from fundus images is of great clinical significance. In this paper, an intelligent DR classification model for fundus images is proposed. The method can detect all five stages of DR: no DR, mild, moderate, severe, and proliferative. The model is composed of two key modules. The feature extraction block (FEB) extracts features from fundus images, and the grading prediction block (GPB) classifies the five stages of DR. The transformer in the FEB provides fine-grained attention that focuses on retinal hemorrhage and exudate areas. The residual attention in the GPB effectively captures the different spatial regions occupied by objects of different classes. Comprehensive experiments on the DDR dataset demonstrate the superiority of our method, which achieves competitive performance compared with benchmark methods.


Introduction
Diabetic retinopathy (DR) is an ocular complication caused by diabetes. It is a leading cause of visual impairment and even blindness and has become a major medical problem worldwide [1,2]. However, up to now, there is no effective treatment for this disease. Studies have shown that early diagnosis and timely treatment of diabetic retinopathy help prevent blindness, a goal that can be achieved through regular screening programs [3]. As a result, many national health agencies are promoting DR screening, which is effective in reducing blindness due to DR [4]. Digital fundus imaging is the most widely used modality for ophthalmologists to screen for DR and identify its severity. However, due to the lack of ophthalmologists, DR screening is a heavy burden for many underdeveloped countries. For this reason, automatic classification of DR severity has become a trend in diagnosis.
With the development and application of artificial intelligence technology, deep learning [5] is playing an increasingly important role in the field of medical image analysis. In recent years, the convolutional neural network (CNN) has been successfully applied to medical image classification [6,7], medical image segmentation [8,9], medical image registration [10,11], medical image fusion [12,13], and medical image report generation [14,15] because it can learn highly complicated representations in a data-driven way. Although the CNN shows great potential in medical image analysis, it also has limitations: the local receptive field of the convolution operation restricts the capture of long-range pixel relationships. Inspired by the success of transformers in NLP, Dosovitskiy et al. [16] proposed the vision transformer (ViT), which treats image classification as a sequence prediction task over image patch sequences in order to capture long-range correlations in the input image. In addition, recent research shows that, compared with CNNs, the prediction errors of ViT are more consistent with those of humans [17,18].
The biggest challenge of DR severity classification is that it demands finer-grained discrimination than general image classification: the differences in DR lesions between adjacent classes are very subtle and difficult to distinguish. Although the attention module in ViT plays an important role in object classification, simply stacking attention modules degrades model performance. In addition, ViT ignores the different spatial regions occupied by different kinds of objects. Motivated by these observations, we propose a deep network model to identify and classify DR, which consists of two key modules: the FEB and the GPB. The FEB extracts image features with the ViT model, whose tokens provide fine-grained attention and focus on retinal hemorrhage areas. The GPB effectively captures the different spatial regions occupied by objects of different classes and generates class-specific features for each class through a simple spatial attention score. By integrating these modules, our network can classify DR lesions of different degrees more accurately.
To sum up, our contributions are as follows: (i) extracting fundus image features via the vision transformer's excellent modeling ability; (ii) using the residual attention module to exploit the individual spatial attention of each object class so as to improve the accuracy of DR classification; (iii) showing through experiments on the DDR dataset that this method achieves good results on DR classification tasks. Specifically, our method achieves the best performance on grades 0, 2, 3, and 4. According to the international classification of DR [19], DR can be divided into five stages: class 0 (no DR), class 1 (mild DR), class 2 (moderate DR), class 3 (severe DR), and class 4 (proliferative DR). Figures 1(a)-1(e) show the five stages of DR, respectively [20]. As is well known, image quality has a great influence on deep learning models.

However, in clinical practice, low-quality images are inevitable due to exposure and other reasons. Therefore, as shown in Figure 1(f), the DDR dataset [20] assigns fundus images that do not meet the quality standard to class 5 (ungradable).

Related Works

Deep Learning in Medical Images.
With the rapid development of artificial intelligence (AI) technology, deep learning (DL) methods have been widely used in various tasks related to medical images and have achieved remarkable results. In the medical field, the image types to be processed usually include X-ray, ultrasound, computed tomography (CT), and magnetic resonance imaging (MRI) [21,22]. The processing tasks include image classification, object recognition, image segmentation, image reconstruction, and so on.
Medical image classification can assist doctors in diagnosing diseases. Esteva et al. [23] directly used 130,000 clinical images to train a model based on the Inception v3 backbone network; the results showed that it outperformed human experts. The experiment also proves that an ordinary CNN can produce good predictions given large-scale, high-quality annotated datasets. Yi et al. [24] proposed a novel graph regularized NMF algorithm called NMF-LCAG that handles the adaptive graph learning issue in NMF. Compared with other related algorithms, the accuracy of the NMF-LCAG algorithm is improved by at least 1%-3% in most cases. To achieve efficient and rapid diagnosis of patients with COVID-19, Li et al. [25] proposed a computer-aided diagnosis algorithm based on ensemble deep learning. The experimental results show that the algorithm classifies COVID-19 patients, common pneumonia patients, and normal controls well and can significantly improve the performance of deep neural networks in multiclass prediction tasks.
Accurately detecting or identifying lesions in medical images is of great significance in clinical treatment. Object recognition algorithms are divided into two-stage and one-stage approaches. The two-stage algorithms are represented by the R-CNN series [26][27][28], and the one-stage algorithms by the YOLO series [29][30][31][32]. Andrew Ng's group proposed the CheXNet algorithm [33], a 121-layer convolutional neural network that automatically detects pneumonia from chest X-rays with an accuracy even higher than that of radiologists. Aoki et al. [34] were the first to detect and predict the probability of erosion and ulcer lesions in wireless capsule endoscopy based on a CNN, reaching a detection accuracy of 88.2%. In addition, properly improving the sensitivity of the model in clinical application helps doctors reduce the missed detection rate.
The purpose of medical image segmentation is to provide a reliable basis for clinical diagnosis and pathological research. The fully convolutional network (FCN) [35] was the first applied to segmentation tasks. Although its final output layer is semantically correct, it lacks detailed spatial information. U-Net [36] borrowed the idea of the FCN and designed a more elegant image segmentation framework that produces richer and more detailed segmentation results. To achieve precise segmentation of retinal blood vessels, Guo et al. [37] proposed a lightweight network named SA-UNet that achieves state-of-the-art performance on the DRIVE and CHASE_DB1 datasets.
In addition, DL is widely used in medical image reconstruction [38,39], medical image report generation [40], and other tasks, providing an important theoretical basis and technical support for intelligent medicine. However, many problems remain in intelligent medical imaging: for example, high-quality labelled training samples are scarce, and models obtained by deep learning are poorly interpretable.

Deep Learning in DR Classification.
Accurate classification of medical images is an important means to assist clinical care and treatment. In recent years, DL has made remarkable achievements in medical image analysis, making DR-assisted diagnosis more reliable and efficient.
Bravo and Arbeláez [41] investigated the performance of different preprocessing methods, designed a classification model based on the VGG16 architecture, and achieved an average classification accuracy of 50.5% in DR classification. A multi-cell architecture [42], which gradually increases the depth of the deep neural network and the input image resolution, improves classification accuracy while reducing training time; to fully utilize images at different stages of deep learning, the authors also propose a multitask learning strategy. To address the lack of data, a deep learning architecture was proposed in [43], trained and tested on the MESSIDOR dataset. The AlexNet architecture is simply modified by adapting the convolution and max-pooling layers in the first eight layers and the fully connected layers in the last three layers. The model is suitable for smaller datasets and provides acceptable accuracy. Golub et al. [44] put forward a method to identify and classify DR that can not only segment any retinal region of the fundus image but also evaluate the quality of the original image. To simulate the diagnosis process, a double-stream binocular network is proposed in [45] to capture subtle correlations between the left and right eyes, and its advantages over monocular methods are demonstrated on the EyePACS dataset. Zhang et al. [46] designed source-free transfer learning (SFTL) for DR detection, which utilizes unannotated retinal images and only employs a source model throughout the training process; on the EyePACS dataset, it achieved 91.2% accuracy, 0.951 sensitivity, and 0.858 specificity. [47] discusses existing DR detection and classification techniques, their advantages and disadvantages, and available DR datasets, and introduces in detail the research achievements and progress in the field of DR detection.
Although all these algorithms are devoted to extracting lesion features, recognition performance remains insufficient, especially for small lesions. There are several reasons: (1) only high-resolution images can reveal small pathological tissues, so the resolution of retinal images must be very high; (2) compared with other image types, DR grading requires higher classification precision, and the tiny lesion points make the differences between adjacent classes very subtle and difficult to distinguish; (3) identifying a severe class with a large local receptive field may lead to vanishing or exploding gradients; (4) the computational cost of processing DR images is high, making the model difficult to train.

Vision Transformer in Medical Images.
Following their unprecedented success in natural language tasks, transformers [48] have recently also made great achievements in image recognition. The ViT model has become very popular in various computer vision tasks, including image classification [16], image detection [49], image segmentation [50], and so on. In the field of natural image recognition, ViT and its derived instances have achieved state-of-the-art performance on several benchmark datasets.
Recently, the ViT has been successfully applied to medical image classification. Yu et al. [51] proposed the MIL-ViT model, which is first pretrained on a large dataset of fundus images and then fine-tuned on the downstream task of retinal disease classification. The MIL-based head used in MIL-ViT can be combined with ViT in a plug-and-play way. Experiments on the APTOS2019 and RFMiD2020 datasets show that MIL-ViT performs better than CNN-based baselines. Most data-driven methods regard DR classification and lesion detection as two independent tasks, which may not be optimal because errors may propagate from one stage to the next. To handle the two tasks jointly, the lesion-aware transformer (LAT) is proposed in [52], which consists of a pixel-relationship encoder and a lesion-aware transformer decoder; in particular, the transformer decoder is used to formulate lesion detection as a weakly supervised lesion localization problem. The LAT model achieves state-of-the-art performance on the Messidor-1, Messidor-2, and EyePACS datasets. Yang et al. [53] proposed a hybrid structure consisting of convolutional layers and transformer layers to classify fundus diseases on OIA datasets. Similarly, Wu et al. [54] and AlDahoul et al. [55] also verified that the ViT model is more accurate than CNN models in DR classification. As can be seen from these references, most methods directly use the original ViT model in a plug-and-play way to improve classification performance. Based on these observations, we believe that using ViT as the backbone network and integrating domain-specific context can improve DR classification performance.
Apart from medical image classification, ViT is widely used in medical image segmentation [56], medical image detection [57], medical image reconstruction [58], medical image synthesis [59], medical image report generation [60], and other tasks. However, some studies [61] have shown that the transformer is highly dependent on massive data and surpasses CNNs only after training on large datasets. Most medical imaging domains have only small public datasets with few labels, which limits the application of transformers in this field.

Methods
In this section, we first briefly outline our proposed network and then explain its key components in detail. Finally, the loss function of the designed network is given.

Overview.
Classification is usually based on differences or distinctions between categories. Diabetic retinopathy develops from mild to severe, and there is correlation between adjacent classes: the severe DR stage follows the moderate DR stage, and the moderate DR stage follows the mild DR stage. Keeping this in mind, we propose a classification model that distinguishes the five classes of DR through the ViT network's ability to capture subtle changes and the discriminative power of CSRA's class-specific features. The model outputs five probability scores (which sum to 1), corresponding to the five classes of DR. The DR classification network architecture we designed is shown in Figure 2. It is composed of two key modules, the FEB and the GPB: the FEB is mainly used for image feature extraction, while the GPB is mainly used for classification prediction.
The proposed approach is summarized in Algorithm 1.

Feature Extraction Block.
As mentioned previously, the FEB is mainly used for extracting features from images. The standard transformer accepts 1D token embedding sequences as input. To process a 2D image, we reshape the image x of original shape [H × W × C] into a sequence of flattened 2D patches x_p ∈ R^{N×(P²·C)}, where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each patch, and N = HW/P² is the number of patches. The transformer uses the same vector dimension D in all of its layers, so we use a linear mapping layer to project the image patches to D dimensions. Similar to the [class] token in the BERT model, a learnable embedding z_0^0 = x_class is prepended to the patch embedding sequence, and the corresponding output state of the transformer encoder is treated as the image representation. This process is given by the following equation:

z_0 = [x_class; x_p^1 E; x_p^2 E; …; x_p^N E] + E_pos,  E ∈ R^{(P²·C)×D},  E_pos ∈ R^{(N+1)×D},

where E is the weight matrix of the linear mapping layer and E_pos is the positional embedding, which is added directly to the image patch embeddings. The purpose of position embedding is to preserve the position information of the different patches, and the resulting sequence of embedded vectors is used as the input of the transformer encoder. Figure 3 shows the network structure of the transformer encoder [16]. It consists of a stack of six identical encoding layers, each of which has two sublayers: the first is a multi-head attention layer, and the second is a position-wise feed-forward network. A residual connection is used around each of the two sublayers, and the output of each sublayer is normalized by layer normalization, so each sublayer can be represented as LN(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sublayer itself. To facilitate the residual connections, the output dimension of all sublayers in the model is d_model = 512.
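To make the patch embedding concrete, the following NumPy sketch reproduces the reshaping and projection described above on toy dimensions; the random matrices stand in for the learned E, x_class, and E_pos, and all sizes here are illustrative, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a 32x32 RGB image split into 8x8 patches, projected to D = 16.
H, W, C, P, D = 32, 32, 3, 8, 16
N = (H * W) // (P * P)  # number of patches: 16

x = rng.standard_normal((H, W, C))

# Reshape [H, W, C] -> [N, P*P*C]: one flattened row per patch.
patches = (x.reshape(H // P, P, W // P, P, C)
            .transpose(0, 2, 1, 3, 4)
            .reshape(N, P * P * C))

E = rng.standard_normal((P * P * C, D)) * 0.02   # linear projection weights
x_class = rng.standard_normal((1, D)) * 0.02     # learnable [class] token
E_pos = rng.standard_normal((N + 1, D)) * 0.02   # positional embeddings

# z0 = [x_class; x_p^1 E; ...; x_p^N E] + E_pos
z0 = np.concatenate([x_class, patches @ E], axis=0) + E_pos
print(z0.shape)  # (17, 16): N + 1 tokens of dimension D
```

The transpose step groups the pixels of each P × P patch together before flattening, which is the key detail of the reshape in the equation above.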
In the transformer, the attention function maps queries, keys, and values, packed into matrices Q, K, and V, respectively, to an output. Attention is described by the following equation:

Attention(Q, K, V) = softmax(QK^T/√d_k) V.

The multi-head attention mechanism applies L different learnable linear mappings to the queries, keys, and values, projecting them to dimensions d_k, d_k, and d_v, respectively. Each head and the multi-head output are described as follows:

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V),
MultiHead(Q, K, V) = Concat(head_1, …, head_L) W^O,

where the parameter matrices are W_i^Q ∈ R^{d_model×d_k}, W_i^K ∈ R^{d_model×d_k}, W_i^V ∈ R^{d_model×d_v}, and W^O ∈ R^{L·d_v×d_model}.
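The scaled dot-product attention and multi-head equations can be sketched in NumPy as follows; the inputs and projection matrices are random toy values (5 tokens, d_model = 16, L = 4 heads of width 4), chosen only to exercise the shapes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, Wq, Wk, Wv, Wo):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_L) W^O."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
n, d_model, L, d_k = 5, 16, 4, 4
X = rng.standard_normal((n, d_model))
Wq, Wk, Wv = (rng.standard_normal((L, d_model, d_k)) for _ in range(3))
Wo = rng.standard_normal((L * d_k, d_model))
out = multi_head(X, X, X, Wq, Wk, Wv, Wo)  # self-attention: Q = K = V = X
print(out.shape)  # (5, 16)
```

Each row of the softmax output sums to 1, so every output token is a convex combination of the value vectors, weighted by query-key similarity.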

Grading Prediction Block.
Fundus image classification is a challenging computer vision task for practical applications. To capture the different spatial regions occupied by objects of different classes more efficiently, we introduce a class-specific residual attention algorithm [62] in the GPB.
Using spatial attention scores, class-specific residual attention (CSRA) generates specific features for each class and then fuses them with average-pooled features. As shown in Figure 4, the feature matrix x ∈ R^{d×h×w} of the input image is extracted by the FEB, where d, h, and w are the dimension, height, and width of the feature matrix; here we assume d = 2048, h = 7, and w = 7. First, the feature matrix x is decoupled into a group of position feature vectors x_1, x_2, …, x_49 (x_j ∈ R^2048). Then, a fully connected layer (1 × 1 convolution) is used as the classifier. Note that each class has its own fully connected classifier, and the parameter vector of the classifier for the ith class is m_i ∈ R^2048.
The CSRA spatial attention score is defined in [62] by the following equation:

s_j^i = exp(T x_j^T m_i) / Σ_{k=1}^{49} exp(T x_k^T m_i),

where T (> 0) is the temperature control factor and s_j^i is the probability that class i appears at position j.
The CSRA feature f^i for class i is given by the following equation:

f^i = g + λ a^i,

where a^i = Σ_{k=1}^{49} s_k^i x_k is the class-specific feature vector, g = (1/49) Σ_{k=1}^{49} x_k is the class-agnostic global feature, and λ is a hyperparameter (set to λ = 0.3). Following [62], the final logit of the ith class is obtained as the dot product of the CSRA feature f^i and the corresponding classifier m_i:

ŷ^i = m_i^T f^i,  i = 1, …, C,

where C is the number of classification categories.
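A minimal NumPy sketch of the CSRA computation, using the dimensions assumed above (d = 2048, 7 × 7 = 49 positions) with random features and classifiers standing in for the FEB output and the learned 1 × 1 convolutions; C = 5 classes is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hw, C = 2048, 49, 5        # feature dim, 7x7 spatial positions, classes
T, lam = 1.0, 0.3             # temperature T and residual weight (lambda = 0.3)

x = rng.standard_normal((hw, d)) * 0.01   # per-position features x_1..x_49
m = rng.standard_normal((C, d)) * 0.01    # one 1x1-conv classifier per class

g = x.mean(axis=0)                        # class-agnostic global feature g
logits = np.empty(C)
for i in range(C):
    scores = T * (x @ m[i])                    # T * x_j^T m_i at each position j
    s = np.exp(scores) / np.exp(scores).sum()  # spatial attention s^i_j
    a = s @ x                                  # class-specific feature a^i
    f = g + lam * a                            # residual feature f^i = g + lam*a^i
    logits[i] = m[i] @ f                       # final logit m_i^T f^i
print(logits.shape)  # (5,)
```

Note how the residual form keeps the average-pooled feature g as the backbone of every class's representation, with the attention term only perturbing it.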

Loss Function.
In this paper, the binary cross-entropy (BCE) loss is used to measure the discrepancy between the prediction ŷ and the ground-truth label y:

L = -(1/C) Σ_{i=1}^{C} [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ],

and the stochastic gradient descent (SGD) method is used to optimize the loss function.

Datasets.
The IDRiD dataset [63] consists of typical DR and normal retinal structures and is divided into three parts: segmentation, classification, and localization. The classification part contains 516 original color fundus images, split into a train set (413 images) and a test set (103 images). In addition, this dataset provides the DR severity and diabetic macular edema grade of each image, making it ideal for developing and evaluating image analysis algorithms for the early detection of DR.
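For reference, the BCE loss defined in the Loss Function section can be sketched in NumPy; the one-vs-all target vector and predicted probabilities below are hypothetical example values, not taken from the paper.

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy averaged over classes:
    -(1/C) * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]."""
    p = np.clip(y_pred, eps, 1 - eps)  # clip to avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# One-vs-all target for a hypothetical "class 2" image over the five DR grades.
y = np.array([0., 0., 1., 0., 0.])
p = np.array([0.05, 0.10, 0.80, 0.03, 0.02])
print(round(bce_loss(y, p), 4))  # 0.0861
```

The clipping step matters in practice: a confident wrong prediction of exactly 0 or 1 would otherwise produce an infinite loss.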

Implementation Details.
To prepare more training data, we perform several operations on the original images. Pretrained backbone parameters are used, and training is fine-tuned on the used datasets. Limited by GPU memory, the large images are randomly resized to 512 × 512. In addition, we apply random horizontal flips, vertical flips, and random rotations as data augmentation to reduce overfitting. Our framework is implemented in PyTorch 1.6 and runs on an NVIDIA Quadro RTX 6000 GPU with 24 GB of memory. Table 1 lists the hyperparameters used in training.

The following metrics are computed from the confusion matrix: precision, the proportion of samples classified as positive that are truly positive; recall/sensitivity, the probability that a lesioned DR image is not missed as negative; specificity, the probability that a normal DR image is not misjudged as positive; accuracy, the proportion of samples classified correctly; and F1-score, the harmonic mean of precision and recall. These per-class metrics, computed from a multiclass confusion matrix, extend to N classes by macro-averaging. This work uses the macro-averaged accuracy, sensitivity, specificity, and F1-score to evaluate the DR classification process.

Evaluation
Accuracy = (TP + TN) / (TP + TN + FP + FN),
Sensitivity (Recall) = TP / (TP + FN),
Specificity = TN / (TN + FP),
Precision = TP / (TP + FP),
F1-score = 2 × Precision × Recall / (Precision + Recall),

where TP (true positive) is the number of positive samples predicted by the model as positive, TN (true negative) is the number of negative samples predicted as negative, FP (false positive) is the number of negative samples predicted as positive, and FN (false negative) is the number of positive samples predicted as negative. In addition, the area under the curve (AUC) of the receiver operating characteristic (ROC) curve is employed, which is also an established metric for fundus image grading in previous research. The AUC reflects the quality of the positive predictions and characterizes the effectiveness of the model: the higher the AUC value, the better the classification performance.
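The per-class definitions above, together with macro-averaging over the N classes, can be sketched in NumPy; the 3-class confusion matrix below is a toy example, not a result from the paper.

```python
import numpy as np

def macro_metrics(cm):
    """Per-class TP/FP/FN/TN from an NxN confusion matrix (rows = true class,
    columns = predicted class), macro-averaged over the N classes."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    TP = np.diag(cm)
    FP = cm.sum(axis=0) - TP          # predicted as class i but truly other
    FN = cm.sum(axis=1) - TP          # truly class i but predicted other
    TN = total - TP - FP - FN
    precision = TP / (TP + FP)
    sensitivity = TP / (TP + FN)
    metrics = dict(accuracy=(TP + TN) / total,
                   sensitivity=sensitivity,
                   specificity=TN / (TN + FP),
                   precision=precision,
                   f1=2 * precision * sensitivity / (precision + sensitivity))
    return {k: float(v.mean()) for k, v in metrics.items()}  # macro-average

# Toy 3-class confusion matrix: most mass on the diagonal.
cm = [[50, 2, 0],
      [3, 40, 5],
      [1, 4, 45]]
print(macro_metrics(cm))
```

Macro-averaging weights every class equally, which is why it is a more informative summary than plain accuracy on class-imbalanced datasets such as DDR.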

Evaluation of the Model Performance.
In this work, a DR classification method covering the 5 severity categories from 0 to 4 is proposed. An additional category (class 5), as in [20], covers images with artifacts that prevent a clear evaluation of the information they contain. A second experiment was performed excluding images with artifacts (class 5). To evaluate the performance of our model, we trained and tested it on the DDR and IDRiD datasets. Figure 5 shows the loss curves of our model during training on the DDR dataset. The loss decreases as training proceeds: in the first 6 epochs, the loss on both the test set and the train set decreases significantly; the test loss stops decreasing after the 7th epoch and the train loss after the 9th, at which point training is saturated. The small gap between the train loss and the test loss indicates that our model does not overfit.
For a classification model, we want the predictions to be as accurate as possible; that is, the larger the TP and TN entries of the confusion matrix, and the smaller the corresponding FP and FN entries, the better. However, the confusion matrix only reports raw counts, and with a large amount of data it is difficult to judge the quality of a model directly from it. Besides, accuracy alone is not a good indicator on unbalanced datasets. Therefore, on top of the basic statistics of the confusion matrix, we introduce five metrics: precision, recall, specificity, F1-score, and accuracy.
The six-class and five-class confusion matrices obtained by this method are shown in Tables 2 and 3, respectively. The two tables show that although there are misclassifications between classes, most of them fall into adjacent classes, and most of the data lie on the diagonal, which also demonstrates that this algorithm is suitable for DR image classification.
Tables 4 and 5 present the metrics obtained in this work, separated by class and dataset. In both results, for the no-DR and proliferative classes, the model has a high classification index, indicating that it can distinguish categories with distinct characteristics. For the intermediate classes (1 to 3), the classification index is not high, because there are no obvious differences between the characteristics of these categories and they are easily confused with neighboring ones. Comparing Tables 4 and 5, including the ungradable category (class 5) improves accuracy in all categories, which also reflects the important role of image quality in model classification.
The ROC curve and AUC value are used to evaluate the performance of our model. As shown in Figure 6, the AUC values of classes 0 to 5 were 0.9980, 0.6129, 0.9509, 0.9455, 0.9741, and 0.9293, respectively. Our model performed well on classes 0, 2, 3, 4, and 5, but the performance on class 1 was not satisfactory. Beyond the possible reasons analyzed previously, the severe imbalance of the data is also a very important factor: the total sample size of the DDR dataset is 13,673, while class 1 has only 630 samples. To address this, we resampled the data and retrained the model as a binary classifier, with which the AUC of class 1 can reach 0.9430. Compared with the benchmark methods (Table 6), our method ranks first on class 1, class 2, class 3, class 4, class 5, and AA. Our model achieved an accuracy of 0.9635 on class 1, the state-of-the-art performance among all models and almost 4 times the accuracy of the second-best DenseNet-121 model (0.2275). This shows that our model has made a great improvement on mild DR, the most difficult grade to identify; it also achieves the best performance on classes 2, 3, and 4. For image quality control (class 5), our model improves by 3.30% over the second-best VGG-16 model. On the AA metric, our model achieves 0.9154, which is 30.35% higher than the second-best DenseNet-121 model. However, our model does not perform as well as other models on class 0. Its decision threshold is not set manually but obtained according to the Youden index; we speculate that the model trades some class 0 accuracy to reduce the missed detection rate and improve overall performance.
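The Youden-index threshold selection mentioned above (maximizing J = sensitivity + specificity - 1 over candidate thresholds) can be sketched as follows; the labels and scores are toy values, not results from the paper.

```python
import numpy as np

def youden_threshold(y_true, y_score):
    """Pick the decision threshold maximizing J = sensitivity + specificity - 1."""
    best_t, best_j = None, -1.0
    for t in np.unique(y_score):          # each observed score is a candidate
        pred = y_score >= t
        tp = np.sum(pred & (y_true == 1))
        fn = np.sum(~pred & (y_true == 1))
        tn = np.sum(~pred & (y_true == 0))
        fp = np.sum(pred & (y_true == 0))
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

y = np.array([0, 0, 0, 1, 0, 1, 1, 1])                    # toy labels
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])  # toy scores
t, j = youden_threshold(y, scores)
print(t, j)  # 0.35 0.75
```

Because J weights sensitivity and specificity equally, the chosen threshold favors catching positives over matching a fixed 0.5 cutoff, which matches the trade-off described above for class 0.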
In conclusion, compared with other benchmark models, the model based on ViT and CSRA is highly competitive in DR severity classification.

Ablation Studies on the DDR Dataset.
In this paper, the influence of each component of the network is studied through ablation experiments. First, we replace the transformer with different backbones to verify the influence of the FEB on our model. Then, CSRA is replaced by an MLP, with the FEB module unchanged, to verify the influence of the GPB. Finally, keeping the rest of the model unchanged, we vary the number of CSRA attention heads in the GPB to verify the influence of this parameter on overall performance. Tables 7 and 8 detail the sensitivity, specificity, accuracy, and AUC values obtained in the different experiments for comparison.
(1) Effect of the FEB: first, the FEB uses ResNet50 as the backbone for extracting image features. Compared to this design, our model improves sensitivity by nearly 2%, specificity by nearly 4%, accuracy by nearly 3%, and AUC by more than 6%. Then, the FEB uses ResNet101 as the backbone. Compared with this design, the sensitivity and specificity of our model are only slightly improved, while the accuracy and average AUC increase by over 1% each. (2) Effect of the GPB: next, we keep the FEB module unchanged and replace CSRA with an MLP. Compared with this design, the sensitivity and accuracy of our model improve by more than 2%, the specificity by more than 4%, and the AUC by nearly 4%. As can be seen from Table 8, our model achieves the best performance on all evaluation indexes. (3) Effect of the attention heads in the GPB: following [62], we set the number of heads to 2 by default. To verify the influence of the attention heads while keeping the rest of the model unchanged, the number of CSRA heads is set to 1, 4, and 6, respectively. The experimental results in Table 8 show that our parameter setting (head = 2) achieves the best performance.

Conclusions
According to the International Diabetes Federation, diabetes is one of the fastest-growing global health emergencies of the 21st century; by 2030, an estimated 643 million people will have diabetes (about 11.3% of the global population) [1]. DR is one of the common chronic complications of diabetes and, according to severity, can be divided into five stages from mild to severe. In this paper, we design a new network to classify fundus images of the different DR stages using the vision transformer and residual attention. The model is trained and tested on two publicly available fundus image datasets (the DDR and IDRiD datasets). The experimental results show that the proposed model outperforms the five existing DR classification benchmark methods. However, limited by the number of labelled samples and the imbalance of the data, there is still much room for improvement in identifying and classifying mild DR, which remains a deficiency of our network. Therefore, in future work, we will continue to improve the network structure and further refine the learning strategy to achieve better classification of DR severity.

Data Availability
The DDR and IDRiD datasets used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.