Diagnosis of Diabetic Retinopathy through Retinal Fundus Images and 3D Convolutional Neural Networks with Limited Number of Samples

School of Electronics and Information, Harbin Institute of Technology, Harbin, Heilongjiang Province, China Department of Electrical and Computer Engineering, COMSATS University Islamabad, Sahiwal Campus, Sahiwal, Pakistan College of Internet of Things (IoT) Engineering, Hohai University (HHU), Changzhou Campus, Changzhou 213022, China Interdisciplinary Centre for Security, Reliability and Trust (SnT)/SigCom, University of Luxembourg, 1855 Luxembourg City, Luxembourg College of Electronics and Information Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China School of Pattern Recognition and Intelligent System, Shenzhen Institute of Advance Technology (Chinese Academy of Sciences), Shenzhen, China Department of Electrical Engineering, University of Science and Technology Bannu, Khyber Pakhtunkhwa, Pakistan Communication Research Laboratory, Department of Information and Communication Technology, Islamic University, Kushtia-7003, Bangladesh


Introduction
Diabetes weakens the blood sugar regulation process inside the human body. In the year 2017, approximately 451 million peoples were suffering from this disease. A higher level of blood sugar severely cripples human body organs leading to risky complications such as coronary episode, vision loss, cataracts, glaucoma, retinopathy, and dementia. A growing population of peoples, irrespective of age, suffering from diabetes have problems in vision termed as diabetic retinopathy (DR) [1][2][3][4].
In clinical settings, four stages are generally involved in the assessment of DR, which are mild, moderate, severe, and proliferative retinopathy, respectively. In the earliest stage, microaneurysms, which are balloon-like structures, are formed in the small veins of the retina which are obstructed in the moderate stage. In the final stages, visual deficiency can occur [2].
Problems associated with DR can be treated in initial stages. The human eye is made up of optic nerves/discs, and the images of the eye can be segmented to get a good classification accuracy of different stages involved in DR [2]. Fundus images display the existence of exudates, hemorrhages, and other eye deficits and are graded manually by a limited number of ophthalmologists whose numbers are shrinking every year [3]. Microaneurysms, hemorrhages, hard exudates, cotton wool spots, neovascularization, and macular edema are some of the characteristics of DR [5,6]. For the screening of DR, at the level of an image, "normal" category contains no lesion while "abnormal" category contains at least one lesion. A computer-aided diagnostic system can help health care specialists in alleviating variabilities. Deep learning (DL) is a popular method for analysing retinal fundus images [7,8]. It captures high-level features throughout the learning process effectively adapting to any type of noise [9] thus forming a natural solution for identifying retinal diseases.
Various algorithms have been proposed for the examination of scan reports to diagnose DR [10][11][12]. Usually, researchers have focused on automatic recognition of lesions associated with DR [13][14][15]. In [16], the authors deployed convolutional neural networks (CNNs) to obtain a precision of 75% on the validation dataset for classifying DR images in the presence of artificial data augmentation/enhancement techniques. Shanthi and Sabeenian in [17] classified the fundus images employing a modified AlexNet architecture validated using Messidor database. The authors in [18] used transfer learning architectures, such as AlexNet, VGGNet, GoogleNet, and ResNet to reach a recognition rate of 95.68% exploiting publicly available Kaggle platform. A full patch-based CNN architecture is designed in [19] using only 28 retinal images achieving a sensitivity of 0.940. Authors in [2] constructed a 3D capsule network and validated their model on the Messidor dataset to achieve an accuracy of 98.64% on the stage 3 fundus images. In [4], a deep and densely connected network was designed to classify Messidor-2 and EyePACS datasets to attain a precision of 95% on Messidor-2 and 88% on the EyePACS dataset. Researchers in [20] used CNNs to achieve a sensitivity of 90% on the EyePACS dataset and a sensitivity of 87% on Messidor-2 dataset. Sayres et al. [21] deployed DL models achieving a sensitivity of 91% on Messidor-2 and a sensitivity of 94.5% on the EyePACS datasets. The work in [22] employed transfer learning-based VGG-19 architecture to classify 9 retinal diseases and one normal retina class with limited number of samples to obtain an accuracy of 30.5% when considering all the ten categories with the deployment of translation, rotation, and brightness change augmentation methods. The researcher used a deep CNN architecture [23] on the EyePACS dataset achieving a sensitivity of 98% deploying rotation, shearing, flipping, zooming, cropping, Krizhevsky augmentation, and translation as augmentation methods. Shankar et al. [24] proposed a DL model deploying histogram-based segmentation and a synergic network achieving an accuracy of 99.28% on the classification task for the Messidor dataset. Beede et al. [25] conducted a study on the clinical characterization of eye screening workflows for the detection of diabetic eye disease. They discovered factors such as gradability, Internet speed and connectivity, nursing workflows, and patient experience to be responsible for the model's performance. Khare et al. [26] proposed a firefly algorithm for dimensionality reduction, principal component analysis for feature extraction, and a deep neural network (DNN) model for the classification of DR to achieving an accuracy of 97%, precision and recall of 96%, specificity of 95%, and a sensitivity of 92% for the binary classification task. Qureshi et al. [27] proposed a DL architecture based on active learning for multiclass (e.g., 5 classes) classification of DR images from EyePACS benchmark for achieving a sensitivity of 92.2%, specificity of 95.1%, F-measure of 93%, and an accuracy of 98% on a wide range of fundus images. Das et al. [28] utilized maximal principal curvature, adaptive histogram equalization, morphological opening, and a CNN-based classifier to achieve an accuracy of 98.7% on DIARETDB1 dataset. Li et al. [29] presented a deep ensemble algorithm for the detection of DR using retinal fundus images. They exploited Inception-v4 architecture on Messidor-2 dataset to achieve an area under the curve of 0.994, sensitivity of 0.930, and a specificity of 0.971. Limwattanayingyong et al. [30] compared the screening of DR in a longitudinal setting via DL and human grading. They achieved a prevalence rate of 5.1% using DL while they observe a reduction in prevalence rate on a two-year follow-up. Tsiknakis et al. [31] provided an overview of DR detection based on fundus images discussing several aspects such as datasets, preprocessing techniques, and DL models for the characterization of important lesions. Karakaya and Hacisoftaoglu [32] compared different smart phone-based solutions such as iExaminer, D-Eye, Peek Retina, and iNview finding that field of view is the most important parameter for the detection of DR where iNview provides the largest and iExaminer provides the smallest value for this field.
Few-shot learning has been a topic of considerable interest where very few examples are used to categorize classes especially those classes that are not presented during training [33][34][35]. Fine-tuning is a common approach for few-shot learning. These systems require complex inference mechanisms due to the processing of complex inductive bias. Meta learning, augmentation/generative met-hods, transfer learning, and semisupervised methods are some of the typical approaches for this type of learning.
Although the reported studies offer competitive solutions to the binary and multiclass classification of DR, most of them are geared towards utilizing information in the 2D domain. Higher dimensions, such as 3D, offers rich scale and geometry information, which are challenging solutions for computer vision algorithms [51][52][53]. There is a need for studies to utilize the information offered by higher dimensions for these tasks. To take advantage of these representation learning methods on a limited number of samples [54] in the presence of data augmentation, we have used a 3D-CNN architecture [55] in the spatial domain for one 2 Wireless Communications and Mobile Computing binary and one multiclass classification task on the DR datasets. We have employed random weak Gaussian blurring and random shift as data augmentation/enhancement techniques along with their combination to study the impact of these methods on both classification tasks. This work contributes to the existing literature on the classification of DR in the following ways. To the best of authors' knowledge, very few researches have been carried out in the literature to solve this problem in the 3D domain. This study is designed to achieve that which offers the advantage by considering both spatial and temporal dimensions simultaneously. Impact of different data augmentation schemes on the final classification performance is also worth investigating especially in the 3D domain. Few-shot learning is also a problem worth investigating as limited number of samples is a bottleneck in achieving good classification performances using DL methods.
The remaining sections of this paper are organized as follows. A brief description of the datasets is given in Section 2, while Section 3 provides the details of the methods used in this study. Section 4 presents the details of the conducted experiments. Section 5 provides results and a thorough discussion. Finally, conclusions are drawn in Section 6 from the work presented.

Dataset Description
We have used two datasets to carry out the experiments. The first one named TeleOphta [3] is a database of fundus images with exudates and microaneurysm lesions. Using this database, we have constructed 99 3D volumes of healthy subjects and 83 3D volumes of diseased class that show signs of exudates and microaneurysms, and these volumes are split at the subject level [56]. Random shifting and random weak Gaussian blurred augmentation techniques are deployed to enhance the dataset. Some samples of the images are shown in Figure 1. The volume size is 210 × 210 × 12.
The second dataset has Gaussian filtered retina scan images to detect DR with five categories, which are no, mild, moderate, severe, and proliferate. The size of the 3D volumes is 512 × 512 × 2. In this database, we have 262 3D volumes of each of these categories that are split at the subject level [56]. Some samples of the images that are present in the database are shown in Figures 2-4, respectively. We have implemented random shifting and random weak Gaussian blurred augmentation techniques to enhance the dataset. This dataset is taken from the Kaggle website. All the 3D volumes in both these databases are normalized to have intensity values in the range between 0 and 255.

Methodology
In this work, we have considered both binary and multiclass classification tasks using different DL-based 3D-CNN architectures. These architectures are presented visually in Figures 5 and 6, respectively. As given in Figure 5, there are small differences between the architectures deployed without augmentation, combined augmentation schemes, and with random weak Gaussian blurring and random shifted augmentation schemes. Feature maps in the convolutional layers are 10 for the combined augmentation scheme, 8 for no augmentation, and 9 for the classification tasks involving random weak Gaussian blurring and random shifted augmentation schemes. Rest of the architectures in Figure 5 are equivalent. An input layer accepts a volume of size 210 × 210 × 12 with rescale-zeroone normalization method that scales the values of the incoming input between 0 and 1 according to the minimum and maximum values per channel. After that, a block that is repeated 5 times named block-A consists of a 3D convolutional layer, batch normalization layer, rectified linear nit (ReLU) activation layer, and a max pooling 3D layer which is used for the extraction, normalization, and downsizing of feature maps. Subsequently, there is a block which is repeated a single time named block-B consisting of 3 fully connected (FC) layers with number of neurons equal to 300, 150, and 2, one dropout layer with probability 0.1, and, finally, a softmax and a classification layer to culminate the binary classification task. ReLU is equivalent to max ðx, 0Þ. Batch normalization [57] is another technique used for the improvement of training efficiency through a reduction in the statistical difference between the fundus volumes [58]. It contributes to a rapid convergence and a reduction in sensitivity during learning process [59]. Dropout [60] is effective in reducing the overfitting of models by omitting both hidden and visible units during the training process. It is a type of regularization method that prevents complex formation of adaptations on the training data. Weight and bias L2 factors are added to encourage smaller weights and biases by penalizing a network based on the size of weights and biases. Transformation of the input values of the softmax function can be interpreted as probabilities. Table 1 shows detailed 3D-CNN architecture hyperparameters for the binary classification task with 8 feature maps in the convolutional 3D layer.
As given in Figure 6, there are small architectural differences between the architectures deployed using no augmentation, combined augmentation schemes, and with random  accepts a volume of size 512 × 512 × 2 with zero centre normalization applied to the 3D volume. After that, there comes a block that is repeated 6 times named block-A consisting of a feature extracting convolutional layer, batch normalization layer, exponential linear unit (ELU) activation layer, and a max pooling 3D layer that is used for the extraction,   5 Wireless Communications and Mobile Computing normalization, and downsizing of feature maps. After that, there is another block that is repeated a single time named block-B consisting of 3 FC layers with 500, 300, and 5 neurons, one dropout layer with probability 0.1, and, finally, a softmax and a classification layer to culminate the multiclass (5 classes) classification task. ELUs [61] solve the vanishing gradient problem by having values in the negative region allowing them to push mean unit activations closer to zero but with lower computational complexity. Mathematically, Table 2 shows detailed 3D-CNN architecture hyperparameters for the multiclass classification task with 10 feature maps in the convolutional 3D layer.

Experiments
We have performed experiments in the spatial domain for both binary and multiclass classification tasks to differentiate between the different categories of DR deploying two data augmentation methods: random weak Gaussian blurring and random shifting. We set the σ value to 1.5 for the ran-dom weak Gaussian blurring and shift value to 1 or 2 pixels for the random shifting augmentations. We have also combined the training samples of both these augmentation methods. We have carried out the experiments related to the following tasks: (1) binary classification of healthy/diseased classes without augmentation, (2) binary classification of healthy/diseased classes with random weak Gaussian blurring augmentation, (3) binary classification of healthy/ diseased classes with random shifting augmentation, (4) binary classification of healthy/diseased classes with combined random shifting and random weak Gaussian blurring augmentation methods, (5) multiclass classification without augmentation, (6) multiclass classification with random weak Gaussian blurring augmentation, (7) multiclass classification with random shifting augmentation, and, finally,    Figure 6: Architectures for multiclass classification tasks without augmentation, with random shifted and random weak Gaussian blurred augmentation schemes, and by combined augmentation schemes. A tenfold cross-validation approach is used for hyperparameter selection.  For experiments on the binary classification tasks, we have used the following settings: minibatch size is set to 2, initial learning rate is set to 0.001, epochs are set to 50, learning rate schedule is set to piecewise, optimization algorithm is Adam [62], categorical cross-entropy is chosen as a loss function, total number of experiments equals 41, while time taken to perform these experiments is approximately 642 minutes or 10.7 hours.
For experiments on the multiclass classification tasks, we have considered the following settings: minibatch size is set to 2, initial learning rate is set to 0.001, epochs are set to 30, learning rate schedule is set to piecewise, optimization algorithm is Adam, loss function is categorical crossentropy, total number of experiments equals 41, while time taken to perform these experiments is approximately 8448 minutes or 140 hours.

Results and Discussion
For binary classification between healthy and diseased classes, the experimental results are presented in Tables 3 and  4, respectively. As a visual aid, Figure 7 presents the results given in Table 3 while Figure 8 presents the results given in Table 4. We have used accuracy, F 1 -score, Matthews   where TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative samples, respectively. Ranking of the methods for the binary classification tasks based on individual and collective performance metrics is given in Table 4. As given in Tables 3 and 4, validation on test split is performed with the model trained using combined augmentation methods. The best performing model is the one that is trained using combined augmentation methods, then comes the model that is trained using random weak Gaussian blurred augmentation method, followed by the model trained using random shift augmentation scheme, and, finally, the model that does not use augmentation at all performed the worst. Here, combined augmentations mean combination of both random weak Gaussian blurred augmentation and random shifted augmentation schemes. Except for specificity metric, the rankings for all the methods remained the same which shows strong correlation between these performance metrics.
For the multiclass classification task, we have considered overall accuracy, relative classifier information (RCI), confusion entropy (CEN), index of balanced accuracy (IBA), geometric mean (GM), and MCC as performance metrics. Overall accuracy is the ratio of values that are correctly predicted to the sum of total values. Tables 5-8 lists the complete statistics of the performance metrics for the multiclass classification tasks. In these tables, class-wise statistics for CEN, IBA, GM and MCC performance metrics as well as overall accuracy and RCI values for each of the four tasks, i.e., without augmentation, with random weak Gaussian blurred augmentation, with random shifted augmentation, and with combined augmentation schemes, are presented. Here, combined augmentations mean combination of both random weak Gaussian blurred augmentation and random shifted augmentation schemes. Finally, the statistics of the task involving test subset validated on the model trained without augmentation are also presented in these tables. Table 9 lists the summary statistics of the performance metrics for the multiclass classification task, while Figure 9 visually presents the results given in Table 9. In this table, averages of CEN, IBA, GM, and MCC performance metrics are calculated by summing their class-wise values and dividing with 5. The RCI and overall accuracy values remain the same as in Table 5. Table 10 presents a system of ranking based on the statistics given in Table 9 for the multiclass classification task. As a visual aid, Figure 10 visually presents the results given in Table 10. In this table, ranking based on individual performance metrics as well as an overall ranking obtained by considering the individual performance-based metrics is presented. Overall accuracy, RCI, IBA, GM, and MCC based ranking is obtained by considering the fact that higher values      10 Wireless Communications and Mobile Computing of these metrics represent better classification while CENbased ranking is obtained by considering that lower values are desirable as they represent better classification.
In Table 10, it can be observed that the model that performs the best is the one that is trained without augmentation followed by the model that is trained with combined augmentations, followed by the models that are trained with random weak Gaussian blurred and random shifted augmentation methods. Training without augmentation has the best performance considering individual and overall metric-based rankings, while combined augmentations have second best overall performance. Random shifted and random weak Gaussian blurred augmentation methods have equal performances. We can observe strong correlation among these performance metrics as depicted in their rankings where without augmentation and with combined augmentations can be completely specified by only a single performance metric alone. However, there is a clear difference between performances of methods employing random shift augmentation and random weak Gaussian blurred augmentation methods for the multiclass classification tasks when they are observed from individual metric-based performances alone. GM, MCC, and overall accuracy of methods employing random shifted augmentation are the worst while RCI, CEN, and IBA of methods employing random weak Gaussian blurred augmentation are the worst which signifies that these methods have disparities leading to difference in the opinion of these performance metrics. These differences could also be due to the way CNNs generalize to image transformations at a small scale [63].
For the multiclass classification task, we have found that the instances of proliferate DR class have the highest diagnostic performance. The performance of DL architectures is better in the case of binary classification than in the case of multiclass classification tasks, and this result is quite natural. Furthermore, architectures that combine different augmentation methods tend to perform better than those that do not. Furthermore, we have found the performance   Table 9: Summary statistics of performance metrics for the multiclass classification tasks without augmentation, with random weak Gaussian blurred augmentation, with random shift augmentation, and with combined augmentations. Wireless Communications and Mobile Computing of architectures trained using random weak Gaussian blurring augmentation to be better than those that are trained using random shifted augmentation as the global sum of the feature maps will not be invariant to translation while performing the operation of convolution.
Architecture engineering has an impact on the performances of classification tasks. It can be observed that, for the binary classification tasks, large number of feature maps in the convolutional layer helps in getting better performances when compared with small number of feature maps in this layer. We can see that combined augmentation methods whose performances are better than other methods used large number of feature maps in the convolutional feature extracting layers. However, interesting observations can be seen for the multiclass classification tasks, where architectures with small number of feature maps help in getting the best performance overall. We can see architectures that did not employ any form of data augmentation performed better than those that employed data augmentation and these architectures employed less number of feature maps in the convolutional feature extracting layers. However, more feature maps in the convolutional layers help in getting better performances on the multiclass classification task as can be seen for the combined augmentations case that outperformed single augmentations for this task by using more complex architecture. In general, we can see the advantages brought by using deeper architectures in comparison with shallower ones when both these tasks (binary and multiclass classifications) are considered.
The suboptimal performance of DL architectures could be explained by the limited number of samples that we have used during training and validation processes [22]. Modern DL architectures require a lot of samples to train without experiencing overfitting issues. Another major limitation of our study is the lack of validation on a multicentre validation set which will prove beneficial in clinical practice. Finally, we   Task Ranking

12
Wireless Communications and Mobile Computing hope that this pilot study deploying 3D CNN architectures with data augmentation schemes can be supportive to eye care specialists on the deployment of DL methods in terms of their clinical use.

Conclusions
In this research, we have utilized different DL methods to study both binary and multiclass classification problems to differentiate between different stages of DR. We have deployed 10-fold cross-validation approach to select optimal set of hyperparameters for the binary and multiclass classification tasks. For the binary classification task, we have found the performance of architecture trained using combined augmentation methods to be the best while the performance of model trained without any augmentation is found to be the worst. In contrast, in the multiclass case, we have observed the overall performance of model trained without augmentation to be the best while the performance of models trained with a single augmentation method whether random weak Gaussian blurring augmentation or random shifted augmentation to be the worst.
In the future, we will work on other retinal diseases such as retinal detachment using fundus images deploying data augmentation methods such as elastic/plastic deformations as well as other DL-based architectures such as graph convolutional networks. Eye diseases such as age-related macular degeneration, media haze, drusen, myopia, branch retinal vein occlusion, tessellation, epiretinal membrane, laser scars, macular scar, central serous retinopathy, optic disc cupping, central retinal vein occlusion, tortuous vessels, asteroid hyalosis, optic disc pallor, optic disc edema, optociliary shunt, anterior ischemic optic neuropathy, parafoveal telangiectasia, retinal traction, retinitis, chorioretinitis, exudation, retinal pigment epithelium changes, macular hole, retinitis pigmentosa, and many other eye diseases [64] are affecting a large number of people worldwide, and their accurate and early detection using DL-based methods may allow for palliative care procedures employed by clinicians and medical practitioners.

Data Availability
The data employed to support the findings of this research is publicly available from the Kaggle platform and TeleOphta database.