Accuracy of Deep Learning Algorithms for the Diagnosis of Retinopathy of Prematurity by Fundus Images: A Systematic Review and Meta-Analysis

Background Retinopathy of prematurity (ROP) occurs in preterm infants and may contribute to blindness. Deep learning (DL) models have been used for ophthalmologic diagnoses. We performed a systematic review and meta-analysis of published evidence to summarize and evaluate the diagnostic accuracy of DL algorithms for ROP by fundus images. Methods We searched PubMed, EMBASE, Web of Science, and Institute of Electrical and Electronics Engineers Xplore Digital Library on June 13, 2021, for studies using a DL algorithm to distinguish individuals with ROP of different grades, which provided accuracy measurements. The pooled sensitivity and specificity values and the area under the curve (AUC) of summary receiver operating characteristics curves (SROC) summarized overall test performance. The performances in validation and test datasets were assessed together and separately. Subgroup analyses were conducted between the definition and grades of ROP. Threshold and nonthreshold effects were tested to assess biases and evaluate accuracy factors associated with DL models. Results Nine studies with fifteen classifiers were included in our meta-analysis. A total of 521,586 objects were applied to DL models. For combined validation and test datasets in each study, the pooled sensitivity and specificity were 0.953 (95% confidence intervals (CI): 0.946–0.959) and 0.975 (0.973–0.977), respectively, and the AUC was 0.984 (0.978–0.989). For the validation dataset and test dataset, the AUC was 0.977 (0.968–0.986) and 0.987 (0.982–0.992), respectively. In the subgroup analysis of ROP vs. normal and differentiation of two ROP grades, the AUC was 0.990 (0.944–0.994) and 0.982 (0.964–0.999), respectively. Conclusions Our study shows that DL models can play an essential role in detecting and grading ROP with high sensitivity, specificity, and repeatability. The application of a DL-based automated system may improve ROP screening and diagnosis in the future.


Introduction
Retinopathy of prematurity (ROP) occurs in preterm infants on supplemental oxygen, which helps to improve survival but may contribute to blindness [1]. The ROP grades are complicated and include aggressive posterior ROP (AP-ROP), prethreshold ROP, or regression of ROP, as well as stages, zones, extent, and preplus/plus diseases of ROP [1,2]. ROP can be diagnosed by binocular direct or indirect ophthalmoscopy [3], and fundus photographs are taken by digital retinal cameras, such as RetCam. Due to difficulties associated with eye contact photography in newborns and limited technological expertise of ophthalmologists, as well as financial and ethical issues, ROP screening is not a common practice. However, early treatment can improve the structural and functional outcomes [4]. Therefore, developing a screening method for early diagnosis of ROP is essential.
Artificial intelligence has the potential to revolutionize the current disease diagnosis pattern in ophthalmology and generate a significant clinical impact [5]. Deep learning (DL), a technology of machine learning (ML), was introduced to artificial neural networks in 2000 [6]. DL can automatically grade images and has been applied to ophthalmology for signal processing and diagnostic retinal imaging [5,7]. The deep convolutional neural network/convolutional neural network (DCNN/CNN) is a DL technique that is widely used to interpret medical images through classifiers. It is a multilayer structure comprising a convolutional layer, a nonlinear processing unit, and a subsampling layer [8]. DL must be trained with high mathematical precision but can be implemented using a lower precision computer; thus, the automatic detection system can be applied in a general hospital.
In recent years, DL algorithms have been widely applied to ophthalmology for the diagnosis of various diseases, such as diabetic retinopathy (DR), age-related macular degeneration (AMD), and ROP [5,9]. However, due to the lack of a public database for ROP and hence the difficulty in obtaining a large clinical dataset, the development of DL algorithms for diagnosing ROP lags behind that for other retinal diseases. Some studies have used DL models to process retinal images by vessel segmentation or zone identification and have recommended the application of the feature-based images for further clinical diagnosis or DL model building [10-12]. Some studies followed this flow to build an entire DL model for diagnosing ROP. They applied the model to extract features relevant to ROP, such as tortuosity and dilation of vessels, and applied these images to the classifiers of the DL model [13]. Other studies have suggested that using original retinal images to build a DL model, without restricting it to selected features, may preserve more information [13,14]. Some accuracy measurements, such as accuracy, sensitivity, and specificity, were calculated to evaluate the performance of DL algorithms in detecting ROP using fundus images compared to the clinical methods. The validation dataset is used to tune the model's hyperparameters, whereas the test dataset provides an unbiased evaluation of a final model fit. Most studies have only verified the DL model through internal validation using a test dataset derived from the training dataset, rather than external validation that uses an independent test dataset. We typically diagnose diseases using a plethora of diagnostic methods, but DL limits evidence to images. The DL model is complex and acts as a "black box" whose mechanism is unknown. These reasons make the diagnostic results unstable and unreliable, hindering their use in clinical practice.
Therefore, we conducted a meta-analysis to summarize and compare the published evidence to evaluate the accuracy of DL algorithms for the diagnosis of ROP by fundus images.

Systematic Review.
We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, which are based on Cochrane's Handbook for Systematic Reviews, to conduct and report on the current study [15,16]. For our study, we searched the following public databases: PubMed, EMBASE, Web of Science, and Institute of Electrical and Electronics Engineers Xplore Digital Library (IEEE). We systematically searched using the most appropriate free-text terms ("Retinopathy of Prematurity (MeSH)," OR "Prematurity Retinopathy", OR "Retrolental Fibroplasia") AND ("Deep learning (MeSH)," OR "Hierarchical learning," OR "Convolutional Neural Network," OR "Deep Neural Network") to find relevant articles published between January 1, 2000, and June 13, 2021. In addition, the relevant articles were tracked through automatic e-mail alerts from the databases during the preparation of our manuscript.

Inclusion and Exclusion Criteria.
Two authors (Zhang and Liu) independently screened the titles and abstracts of the retrieved articles to be considered for the systematic review. We selected the articles from the initial screening and retained them for full-text review, excluding editorials, letters to the editor, reviews, commentaries, policy papers, contributions, conference papers, books, and articles using traditional methods for detecting ROP. All the included studies had to fulfill the following criteria: (1) the studies were restricted to peer-reviewed articles published in English (conference abstracts or proceedings were excluded), (2) the studies provided information on the dataset, diagnosis, and grading criteria of ROP, and the number of research objects (e.g., images, cases (eyes), or infants) in each group, (3) the studies described the DL algorithms for distinguishing ROP using a binary classifier and provided an evaluation, such as accuracy, sensitivity, and specificity, using the area under the receiver operating characteristics curve (AUC, AUROC), or precision and recall, using the area under the precision-recall curve (AUPR). All the studies meeting the inclusion criteria at this stage were additionally reviewed by the same two authors to ensure they were appropriate for the final analysis. Disagreements were resolved by discussion with another expert (Matsuo).

Data Extraction and Quality Assessment.
Two researchers (Zhang and Liu) individually retrieved all information from the selected articles and extracted the following items: author, publication year, dataset characteristics, definition and grade of ROP, DL model, training, validation and test set characteristics, and all accuracy values of validation and testing. True-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN) values were calculated for a meta-analysis. If the TP/TN/FP/FN values were not quantitatively expressed or could not be calculated from the composition of dataset and accuracy measurements, the study was excluded.
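As an illustration of how TP/TN/FP/FN values can be calculated from the composition of a dataset and its reported accuracy measurements, the following is a minimal sketch. The function name and the example numbers are hypothetical and are not taken from any included study; rounding to whole counts is an assumption needed because published rates are usually rounded.

```python
def confusion_from_rates(sensitivity, specificity, n_positive, n_negative):
    """Rebuild TP/FN/TN/FP counts from reported rates and group sizes.

    Assumes the reported rates refer to exactly these group sizes.
    """
    tp = round(sensitivity * n_positive)   # diseased objects correctly flagged
    fn = n_positive - tp                   # diseased objects missed
    tn = round(specificity * n_negative)   # normal objects correctly cleared
    fp = n_negative - tn                   # normal objects wrongly flagged
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp}


# Hypothetical dataset: 200 ROP images and 800 normal images
counts = confusion_from_rates(0.95, 0.975, 200, 800)
print(counts)
```

When a study reports neither the group sizes nor enough rates to fix all four cells, the counts cannot be recovered this way, which corresponds to the exclusion rule stated above.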
Unlike ordinary diagnostic accuracy studies, there are no generally accepted and appropriate quality criteria for evaluating the accuracy of DL algorithms. We referred to the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) [17] and PLASTER (a framework for evaluating DL performance) [18] to select some assessment factors, such as image selection and preprocessing, clear description of DL algorithms and algorithm evaluation, reference standards to label images for the classifier of the CNN, and flow and timing of ROP. Inconsistent results between authors in data extraction and quality assessment were resolved through discussion.

Statistical Analysis.
A meta-analysis was conducted using Meta-analysis of diagnostic and screening tests (Meta-DiSc, Version: 1.4, Universidad Complutense, Barcelona, Spain) [19]. Overall test performance was evaluated by combining the TP/TN/FP/FN values of the validation and test datasets within each study. Subgroup analyses were performed separately for the validation and test datasets, and separately for the definition of ROP (ROP vs. normal) and the grade of ROP (differentiation of two ROP grades).
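To make the per-study quantities concrete, the sketch below combines the validation and test counts of one study and derives sensitivity, specificity, PLR, NLR, and DOR from the combined 2x2 table. It is an illustration only, not the Meta-DiSc implementation; the function names and counts are hypothetical, and adding 0.5 to every cell when a cell is zero is a common continuity correction assumed here.

```python
def combine(*tables):
    """Sum TP/FP/FN/TN counts across the datasets of one study."""
    return {k: sum(t[k] for t in tables) for k in ("TP", "FP", "FN", "TN")}


def accuracy_measures(table):
    """Return sensitivity, specificity, PLR, NLR and DOR for a 2x2 table."""
    tp, fp, fn, tn = (table[k] for k in ("TP", "FP", "FN", "TN"))
    if 0 in (tp, fp, fn, tn):
        # Assumed continuity correction to avoid division by zero
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "PLR": sens / (1 - spec),        # positive likelihood ratio
        "NLR": (1 - sens) / spec,        # negative likelihood ratio
        "DOR": (tp * tn) / (fp * fn),    # diagnostic odds ratio
    }


# Hypothetical validation and test tables for a single study
validation = {"TP": 90, "FP": 5, "FN": 10, "TN": 195}
test = {"TP": 95, "FP": 5, "FN": 5, "TN": 195}
combined = combine(validation, test)
measures = accuracy_measures(combined)
print(combined, measures)
```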
The DerSimonian-Laird random-effects model was applied to the pooled data. Descriptive forest plots of the pooled sensitivity and specificity values, positive and negative likelihood ratios (PLR/NLR), diagnostic odds ratios (DOR), and the AUC of summary receiver operating characteristics curves (SROC) [20] were used to summarize overall test performance. Statistical significance was expressed with 95% confidence intervals (CIs). The AUC criteria are 0.90-1 (excellent), 0.80-0.90 (good), 0.70-0.80 (fair), 0.60-0.70 (poor), and 0.50-0.60 (fail). Threshold effect and nonthreshold effect testing were used to examine heterogeneity, assess potential biases, and evaluate the accuracy factors. The threshold effect exists when different cutoffs or thresholds are used to define a positive (or negative) test result in different studies [19]. The nonthreshold effect may arise from chance and from variations in the study population, index test, reference standard, and study design, as well as from partial verification bias [21]. For DL models, the nonthreshold effect may be caused by image segmentation methods, feature extraction methods, and classifiers. A Spearman correlation coefficient (r) between the logit of the true-positive rate (TPR) and the logit of the false-positive rate (FPR, 1-specificity) was calculated; a strong positive correlation with a p value less than 0.05 suggests the threshold effect [18]. If there was a threshold effect, the included studies might have used different thresholds to define positive and negative results; therefore, the sensitivity and specificity values were heterogeneous, and the pooled results should be interpreted with caution [22]. The nonthreshold effect test was implemented using the chi-square value of the pooled accuracy estimates and the Cochran-Q value of the DOR. If the heterogeneity exceeded what would be expected by chance, the test results would have a low p value [19,23].
Figure 1 shows details of the screening stage. Table 1 lists the nine studies with fifteen classifiers that were included in our systematic review and meta-analysis.
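The random-effects pooling and the Cochran-Q heterogeneity statistic described in the Statistical Analysis section can be sketched as follows for the log DOR. This is a minimal illustration of the DerSimonian-Laird estimator, not a reproduction of Meta-DiSc; the function names and the three example tables are hypothetical, and the 0.5 continuity correction for zero cells is an assumption.

```python
import math


def log_dor(tp, fp, fn, tn):
    """Log diagnostic odds ratio and its approximate variance."""
    cells = (tp, fp, fn, tn)
    if 0 in cells:
        cells = tuple(x + 0.5 for x in cells)  # assumed continuity correction
    a, b, c, d = cells
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def dersimonian_laird(effects, variances):
    """Return (pooled effect, standard error, tau^2, Cochran Q)."""
    w = [1 / v for v in variances]                      # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0     # between-study variance
    w_star = [1 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1 / sum(w_star)), tau2, q


# Hypothetical 2x2 tables (TP, FP, FN, TN) for three classifiers
tables = [(185, 10, 15, 390), (90, 20, 10, 380), (140, 5, 10, 245)]
ys, vs = zip(*(log_dor(*t) for t in tables))
pooled, se, tau2, q = dersimonian_laird(ys, vs)
low, high = pooled - 1.96 * se, pooled + 1.96 * se
print("pooled DOR:", math.exp(pooled), "95% CI:", math.exp(low), math.exp(high))
print("tau^2:", tau2, "Cochran Q:", q)
```

Under the null hypothesis of homogeneity, Q is compared with a chi-square distribution with k-1 degrees of freedom (k being the number of classifiers); a low p value indicates heterogeneity beyond chance.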

Selected Studies and General Characteristics.
Publication years ranged from 2018 to 2021. All training datasets were hospital-based rather than built as a database after quality control. A total of 521,586 objects were applied to DL models. All the included studies, except one study, reported the type of digital camera used to obtain the retinal images [33]. Nevertheless, since all the data were collected after the improvement of the neonatal fundus image quality [14], we retained this study and ruled out the potential bias due to the image quality [34]. The diagnosis of ROP and its grade is credible due to the use of similar reference standards and consistent labeling by professionals. All the included studies developed DL models to distinguish ROP from normal subjects, and five studies further distinguished two ROP grades [24-26, 28, 30]. Most studies evaluated the algorithm using k-fold cross-validation or by developing several modules and then selecting the best one to apply to the final DL model. Considering that not all the studies reported complete accuracy measurements, we only applied the datasets with available TP/TN/FP/FN values to the meta-analysis. The accuracies of the validation and testing datasets ranged from 0.785 to 0.99 and from 0.856 to 0.988, respectively.

Meta-Analysis.
In the threshold effect test for primary analyses, the Spearman correlation coefficient (r) was −0.561 (p = 0.030). Because the correlation was negative rather than strongly positive, this suggests the absence of a threshold effect. In the subgroup analyses, none of the subgroups obtained any significant r to show evidence of the threshold effect (all p values > 0.05). Table 2 provides the results of primary and subgroup analyses. Figure 2 shows the performance of the DL models in detecting and grading ROP in the primary analyses. In the pooling of sensitivity, specificity, PLR, NLR, and DOR, the nonthreshold effect tests showed high heterogeneity from the nonthreshold effect across all studies and all subgroup analyses (all the chi-square and Cochran-Q tests with p values of <0.01).
We explored the sources of heterogeneity among the included studies according to the primary and subgroup analyses, quality assessment, and testing results of threshold and nonthreshold effects.
There was no evidence of the threshold effect, possibly because the "threshold" has different meanings in clinical diagnosis and DL models. For DL models, positive (or negative) is defined based on probability rather than a certain decision; therefore, different DL models may share the same threshold of probability. According to the results of the nonthreshold effect test and quality assessment, the heterogeneity may reflect a risk of bias beyond chance. Since all the images were from infants and were obtained using RetCam, the bias of patient selection could be avoided. Additionally, the images were labeled according to a standard reference or consistent opinion; however, they were obtained by different ophthalmologists and processed by various technologies, creating a risk of bias for object (image) selection. In addition, as the images were acquired from different quadrants of the fundus, there is a possibility of misclassification of the ROP grades. Considering that the flow and timing of disease were part of the classification criteria, the risk of bias among studies cannot be avoided. For DL models, different dataset composition, CNN structure, and hyperparameter settings could also cause heterogeneity.
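The threshold-effect check used in this analysis, a Spearman correlation between logit(TPR) and logit(FPR) across classifiers, can be sketched in plain Python as below. The helper names and the example rates are hypothetical; the sketch returns only the coefficient r and does not compute the p value, which a statistics package would provide.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def threshold_effect_r(tpr, fpr):
    """Spearman r between logit(TPR) and logit(FPR)."""
    return pearson(ranks([logit(p) for p in tpr]),
                   ranks([logit(p) for p in fpr]))


# Hypothetical classifiers whose sensitivity rises together with their FPR,
# the pattern expected when studies use different cut-offs
r = threshold_effect_r([0.80, 0.90, 0.95, 0.99], [0.01, 0.03, 0.05, 0.10])
print(r)
```

A strongly positive r points to a threshold effect; a negative r, as reported in the primary analysis here, does not.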

Principal Findings.
This systematic review and meta-analysis evaluated the performance of DL algorithms in detecting and grading ROP using RetCam images compared to the clinical methods. The results showed that DL models have a promising performance in ROP screening, and their diagnoses are clinically consistent with those of ophthalmologists. The main results are that DL algorithms (DCNN/CNN) achieved high sensitivity and specificity in identifying ROP and distinguishing two grades of ROP; the PLR, NLR, and DOR values indicate good test performance. All the accuracy values based on AUC are over 0.97, which is classified as high when above 0.9 [35]. Therefore, the included studies suggested that the DL-based automatic diagnosis system for ROP was effective. Comparing primary and subgroup analyses, the specificity, PLR, NLR, and DOR of distinguishing ROP grades are less satisfactory but match expectations, possibly because ROP progression is a continuous spectrum and the definition of positive and negative is unclear. The primary and subgroup analyses had nonthreshold effects, creating considerable uncertainty around accuracy estimates in this meta-analysis; however, the AUC obtained from the SROC curve is quite robust to heterogeneity [36].
Some issues should be considered when using DL models for ROP diagnosis. (1) Early diagnosis and screening are essential for ROP, and most studies focus on higher sensitivity rather than specificity when selecting cut-offs to build DL models. However, in clinical practice, low specificity may impede adoption due to various considerations, including financial implications [37]. (2) When the membership of the test set is not balanced (e.g., the actual negatives far outnumber the actual positives), the specificity may be inflated by the large number of true negatives. Therefore, it is better to use precision with the precision-recall (P-R) curve to evaluate accuracy. Among the included studies, only two reported the P-R curve [28,38]. (3) The performance of the DL model is affected by the image quality and disease manifestation [37,39]. Therefore, the clinical performance of proposed DL models requires more external verification to avoid overestimation due to overfitting and bias [24,25,35]. (4) The ROP course is continuous, meaning that both binary and multiple classifications of images are crude in clinical practice. (5) The ROC curve can only evaluate binary classification, and although some studies developed DL models with multiclass or multinomial classification to diagnose ROP, they could not be included in the meta-analysis. (6) Contrary to the diagnosis of DR and AMD, ROP imaging mostly requires pupil dilation to avoid quality issues due to nonmydriatic fundus photography [40]. This could explain why the accuracy of ROP diagnosis is better than that of DR (the pooled AUC was 0.97) [9]. (7) Additionally, contrary to DR, the gold standard, fluorescein fundus angiography, is rarely applied to ROP diagnosis in infants. Consequently, vessel segmentation techniques may play a more critical role in the automatic diagnosis system. However, studies that have independently developed vessel segmentation techniques to label features are limited [41]. (8) Most studies did not evaluate the diagnosis of late-stage ROP because their DL models are based on retinal blood vessel morphology, which is difficult to visualize in late ROP.
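Issue (2) can be illustrated with a small numeric sketch. The counts below are hypothetical and chosen only to show how a test set dominated by negatives can yield high specificity while precision stays low.

```python
def specificity(tn, fp):
    """Share of actual negatives that are correctly classified."""
    return tn / (tn + fp)


def precision(tp, fp):
    """Share of positive calls that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Sensitivity: share of actual positives that are detected."""
    return tp / (tp + fn)


# Hypothetical imbalanced test set: 100 ROP images and 9,900 normal images
tp, fn = 90, 10
fp, tn = 198, 9702

spec = specificity(tn, fp)
prec = precision(tp, fp)
rec = recall(tp, fn)
print(f"specificity={spec:.3f} recall={rec:.3f} precision={prec:.4f}")
```

In this example fewer than one in three positive calls is a true case even though specificity is 0.98, which is why the P-R curve is more informative for imbalanced test sets.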

Opportunities and Challenges.
There are immense opportunities for applying DL algorithms to develop the automatic ROP identification system based on the fundus images. Implementation of automated systems based on DL algorithms would improve the efficiency and coverage of ROP screening and subsequently promote early treatment to reduce retinal detachment and loss of vision. However, several challenges need to be addressed. For a given DL model, specificity increases as sensitivity decreases; thus, further studies should improve the algorithms to overcome the difficulty of raising both indexes. The DL models trained on a given dataset are specific to that dataset, and generalization of the DL model is unreliable. The DL algorithm is considered a "black box." Although some studies restricted the model to selected features to make the process more transparent, its reliance on an inflexible learning method rather than on experience hinders ophthalmologists from accepting the diagnosis made by the DL model. The DL algorithm is isolated, whereas ophthalmologists have an integrated knowledge system; this may affect the patients' trust in the diagnosis. Most DL models are optimized for classification rather than diagnosis. Notably, ROP diagnosis comprises identifying, grading, defining affected zones, and identifying symptoms of preplus or plus disease, which may not be possible using DL models. Besides, the DL model could not make a differential diagnosis of ROP from other retinal diseases, such as retinal vascular dysplasia. Additionally, the quality of images used to train the CNN model affects diagnostic accuracy. High-quality images increase DL power consumption; thus, it is necessary to maximize energy efficiency [18,34]. Notably, DL tends to overfit. Most training sets of DL models for DR can involve approximately 10,000 images, but this number of images is insufficient for ROP. Some studies expanded the dataset by image augmentation, but none of the CNN models for ROP was regularized to prevent overfitting [42].
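The trade-off between sensitivity and specificity for a fixed model, noted among the challenges above, can be shown by sweeping the probability cut-off of a binary classifier. The scores and labels below are hypothetical model outputs, not data from any included study.

```python
def sens_spec_at(threshold, scores, labels):
    """Sensitivity and specificity when scores >= threshold count as positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted ROP probabilities; label 1 = ROP, 0 = normal
scores = [0.05, 0.20, 0.35, 0.55, 0.60, 0.40, 0.65, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

for cutoff in (0.3, 0.5, 0.7):
    sens, spec = sens_spec_at(cutoff, scores, labels)
    print(f"cutoff={cutoff}: sensitivity={sens:.1f}, specificity={spec:.1f}")
```

Raising the cut-off moves the same model toward higher specificity and lower sensitivity; improving both at once requires a better model rather than a different cut-off.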
Most fundus images are taken by ophthalmologists, who deliberately focus on some abnormal regions for precise diagnosis and grading of ROP. In addition, ophthalmologists rely on the patients' clinical history, such as oxygen supplementation, for accurate diagnosis. In contrast, the CNN model may not perform well for images with subtle findings that most ophthalmologists cannot identify. The ImageNet-trained CNNs are biased towards texture rather than shape, which is different from human observers [43]. In identifying DR or retinal hemorrhage [44], the focus is on changes to texture, but for ROP, diagnosis is based on alterations of the shape of vascular tissues. This may explain why the diagnostic performance of pretrained ImageNet for ROP is less satisfactory.

Strengths and Limitations.
Our study is the most comprehensive systematic review and meta-analysis to evaluate the performance of the DL model to detect and grade ROP. However, our study has several limitations. First, we could not reduce the heterogeneity from the nonthreshold effect among the studies as this difference is inherent to the imaging mode and internal features of the DL model. However, it does not affect the value of this study in providing an overview of the diagnostic accuracy of DL models. Second, accuracy measurements for some studies or some subdatasets were unavailable to us. Third, we could only evaluate the performance of DL models using the accuracy measurements provided by individual studies rather than calculating the accuracies by directly applying the images to the DL models in practice. Fourth, due to the varied DL arithmetic logics, it was difficult to conduct a subgroup analysis based on the models to assess the bias. Fifth, we only systematically analyzed the DL models with binary classifiers. Since multiple classifiers yielded a probabilistic interpretation representing each classification, the distribution of these probability outputs could be illustrated in the violin plot but could not be pooled. Sixth, some studies neither validated algorithms on external data nor compared algorithm performance against other professionals; thus, the generalization of DL algorithms could not be assessed [37]. Seventh, the objects of DL models could be infants, cases (eyes), or images, and an infant or a case may contain several images. We could not estimate the effect because the classification based on several images might be more accurate, and a small sample size might affect the diagnostic accuracy [45].

Conclusions
Our study findings show that DL models can play an essential role in detecting and grading ROP with high sensitivity, specificity, and repeatability. The application of a DL-based automated system may change the approach to ROP screening and diagnosis in the future, which may improve healthcare. Earlier detection and timely treatment might halt disease progression at an earlier stage and prevent the onset of complications.

Data Availability
The data supporting this meta-analysis are from previously reported studies and datasets, which have been cited. The processed data are available from the corresponding author upon request.