Development and Application of a Standardized Testset for an Artificial Intelligence Medical Device Intended for the Computer-Aided Diagnosis of Diabetic Retinopathy

Objective To explore a centralized approach to build test sets and assess the performance of an artificial intelligence medical device (AIMD) which is intended for computer-aided diagnosis of diabetic retinopathy (DR). Method A framework was proposed to conduct data collection, data curation, and annotation. Deidentified colour fundus photographs were collected from 11 partner hospitals with raw labels. Photographs with sensitive information or authenticity issues were excluded during vetting. A team of annotators was recruited through qualification examinations and trained. The annotation process included three steps: initial annotation, review, and arbitration. The annotated data then composed a standardized test set, which was further imported to algorithms under test (AUT) from different developers. The algorithm outputs were compared with the final annotation results (reference standard). Result The test set consists of 6327 digital colour fundus photographs. The final labels include 5 stages of DR and non-DR, as well as other ocular diseases and photographs with unacceptable quality. The Fleiss Kappa was 0.75 among the annotators. The Cohen's kappa between raw labels and final labels is 0.5. Using this test set, five AUTs were tested and compared quantitatively. The metrics include accuracy, sensitivity, and specificity. The AUTs showed inhomogeneous capabilities to classify different types of fundus photographs. Conclusions This article demonstrated a workflow to build standardized test sets and conduct algorithm testing of the AIMD for computer-aided diagnosis of diabetic retinopathy. It may provide a reference to develop technical standards that promote product verification and quality control, improving the comparability of products.

To support standard development, it would be helpful to explore approaches to building and applying standardized test sets. Although public datasets for medical AI have been reported in the literature [28,29], they are more appropriate for model training or competitions [5,8] than for testing. On the one hand, public datasets are usually designed before the research and development of the AIMD, so they may not match the application scenario of the AIMD. On the other hand, test sets have special requirements. They should be independent from manufacturers or developers in order to verify the generalizability of AI. The capacity and diversity of data samples should be similar to the intended patient population. Standard operation protocols should be followed during the lifecycle. A systematic annotation process is needed to provide the reference standard.
This article presents a case study of building test sets for computer-assisted diagnosis of DR, which is a common application of the AIMD. Deep learning algorithms have been reported to differentiate referable DR patients from nonreferable DR patients by reading colour fundus photographs [5,7,9,10,12]. Indeed, annual DR screening using digital photographs of the retina has long been recommended by several major governmental or professional organizations, including the UK National Health Service [10,30], the American Diabetes Association [31], and other international societies [32].
In this article, a standardized approach is proposed to compose test sets for DR. The major procedure is described, including data collection, curation, and annotation. The test set is applied in the testing of AUTs. The advantages and practical issues of this approach are discussed, which may provide a reference for the development of technical standards.

Framework for Dataset Construction.
The framework to build the test set is illustrated in Figure 1. It depicts a workflow, including design input, requirement specification, data collection, data curation, data annotation, and quality inspection. Risk management and personnel management are also considered and integrated into the workflow.

Design Input and Requirement Specification.
To initiate dataset construction, the design input is first clarified. The intended use of this test set is to verify algorithm performance on classification of diabetic retinopathy by comparing algorithm outputs with the reference standard. The test set represents colored fundus photographs of diabetic patients from hospitals. Common image formats such as JPEG and BMP are accepted.
Requirement specification of this test set further describes dataset composition, classification, and data inclusion/exclusion criteria. This study uses colored photographs taken by fundus cameras that are officially approved to enter the market, with a field of view no less than 45°. Photographs taken under near-infrared illumination are not included. According to the common intended use of AIMD products and the clinical guidelines for DR [33,34], the images in the test set should include 7 categories (shown in Table 1). Notably, the above categorization method is a result of justification, since many AI products in China were designed according to the Guidelines for Diabetic Retinopathy Diagnosis and Treatment in China [33], which has referenced a previous version of the guidelines published in 1985 and ICO guidelines for diabetic eye care. The current guideline [33] divides DR based on severity into 6 stages, as shown in Table 2. DR phases 0-III in Table 2 are equivalent to Classes 0-3 in Table 1. Since the treatment scheme of DR phases IV-VI is similar and the referral strategy is identical, the test set consolidates these stages into Class 4, which is compatible with ICO guidelines and practical in a clinical scenario.
Fundus diseases other than DR are classified as Class 5, which include but are not limited to hypertensive retinopathy [35], age-related macular degeneration [36], suspect glaucoma [37,38], retinal vein occlusion [39], pathologic myopia [40], and optic nerve diseases [41]. Although these ocular diseases are not necessarily claimed by AIMD products, they may be imported into AIMDs in the real world. Therefore, they serve as negative controls in the test set.
Ungradable images are classified as Class 6. Image quality is given special attention in the development of the test set. DR screening is often performed in out-patients, sometimes on patients with undilated pupils. The colour retinal photographs are obtained using low levels of illumination. In addition, human factors such as movement and positioning, as well as ocular factors such as cataracts and reflections from retinal tissues, can produce defects. In particular, without pupillary dilatation, artifacts are observed in 3-30% of retinal images to the extent that they impede annotation [42]. Therefore, ungradable images are also included in this test set, with conditions ranging from excessive darkness or saturation, defocus, wrong positioning, and lens contamination to anterior segment images.
If an image only has minor quality problems that do not disturb annotation, it will be annotated and assigned to category 0-5. Images with photocoagulation marks and other treatment marks are annotated according to their posttreatment features. The comparison between pretreatment and posttreatment images is not within the scope of the test set.

Risk Management.
Data security, patient privacy, and data bias are the major risks considered in this study. To ensure data security, all activities are conducted on the local area network with controlled user access. Data are stored in servers independent from the algorithms under test. Data annotation tools are not allowed to export images. To protect patient privacy, only deidentified images with ethical approval are accepted in this test set. To minimize data biases such as selection bias and coverage bias, the diversity of positive and negative samples is highlighted in the requirement specification.

Data Collection.
During data acquisition, deidentified fundus photographs are collected retrospectively from partner hospitals with ethical approval from local institutional review boards. The raw images are submitted in JPEG format. No modification or processing, such as filtering, smoothing, clipping, or contrast enhancement, is allowed. Additional information on image sources, including data collection sites, manufacturers of fundus cameras, and models of fundus cameras, is recommended and submitted.

Data Curation.
Data curation is the process of ensuring data safety and quality. First, the status of deidentification and the proof of ethical approval are manually confirmed. Second, data vetting is conducted to exclude problematic images, including unreadable files, incomplete images, and images that compromise privacy information. After curation, the images are stored, indexed, and submitted to the image annotation process. Additional data preprocessing is not implemented in this study.

Resource Management.
Dataset construction relies on resource management, especially personnel management and tool management. Personnel management focuses on annotator recruitment, qualification, and management. The annotation task needs both junior annotators and senior annotators. All junior annotator candidates are publicly recruited. The basic qualification is board certification as an ophthalmologist with at least 5 years of clinical experience. All candidates receive annotation instructions in advance to clarify the classification rule according to the literature on DR [33,34] and other fundus diseases [35-41]. After the training, the candidates attend an exam to classify 100 fundus photographs (18% nonreferable DR, 45% referable DR, 32% other ocular diseases, and 6% ungradable images). Those who achieve greater than 80% accuracy pass the exam. They are given an additional training session.
Senior annotators should have professional certification as image readers and receive special training to promote consistency. In this article, senior annotators all have NHS (UK National Health Service) certification.
Tool management focuses on software tools that facilitate data processing and annotation. In this study, custom-built annotation software is used. The main functions include image preview, contrast adjustment, image magnification, filter selection, task assignment, and progress monitoring. Annotators can add, edit, and submit annotation results. Reviewers and arbitrators can visit their

Data Annotation.
The reference standard is based on the combined decisions of junior annotators and arbitration experts. Image annotation is conducted in a laboratory environment. The annotation workflow is summarized in Figure 2. The annotation process includes two rounds. In the first round, each batch of images is assigned to a team of 3 annotators, who annotate the images independently in a blinded way. Images on which all three classifications agree are placed in the prequalified pool, and images with discordant classifications are placed in the arbitration pool. 10% of the prequalified pool is randomly sampled and submitted to the second round, and the annotations of the rest of the prequalified pool are accepted conditionally. The arbitration pool is also submitted to the second round.
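The first-round triage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the annotation software used in the study: the function name `triage_first_round`, the dictionary layout of the annotations, and the fixed random seed are all hypothetical and introduced here only for the example.

```python
import random


def triage_first_round(annotations, sample_rate=0.10, seed=0):
    """Split images by first-round agreement.

    annotations: dict mapping image id -> list of the 3 independent labels
                 (hypothetical layout, assumed for this sketch).
    Returns (accepted, review_sample, arbitration_pool).
    """
    prequalified, arbitration_pool = [], []
    for image_id, labels in sorted(annotations.items()):
        if len(set(labels)) == 1:       # all annotators gave the same class
            prequalified.append(image_id)
        else:                           # any disagreement goes to arbitration
            arbitration_pool.append(image_id)

    # Randomly sample a share of the prequalified pool for senior review.
    rng = random.Random(seed)
    n_sample = round(len(prequalified) * sample_rate)
    review_sample = sorted(rng.sample(prequalified, n_sample))

    # The remainder is accepted conditionally, pending the review outcome.
    sampled = set(review_sample)
    accepted = [i for i in prequalified if i not in sampled]
    return accepted, review_sample, arbitration_pool
```

Both `review_sample` and `arbitration_pool` would then be passed to the second round; how the review outcome feeds back into the conditionally accepted images is left out of this sketch.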

Second Round (Review and Arbitration).
This step is carried out by a team of three senior annotators, one of whom acts as the team leader. The team leader has served as the director of an image reading center in a top ophthalmological hospital. They review all images submitted to this round in order to determine the final annotation of images in the arbitration pool and to review the samples from the prequalified pool. If sampled annotation results in the prequalified pool cannot pass the review, more samples will be submitted to the arbitration pool. Feedback may be given to annotators in the first round. Senior experts can justify the number of samples in the prequalified pool for inspection. All images are stored, accessed, previewed, and manually classified using custom-built annotation software.

Quality Inspection.
After data annotation, quality inspection is conducted to examine the dataset's quality. The annotation records, including initial annotation, review, and arbitration, are reviewed and compared on each image to avoid inconsistencies and mistakes. Images that pass quality inspection are enrolled in the test set. The percentage of diabetic retinopathy subtypes is calculated. Usability and validity of each image are also examined manually.

Algorithm Testing.
Five algorithm models intended to classify fundus photographs are enrolled as AUTs. They are trained by different manufacturers or developers. They all claim to use deep learning, but details such as the neural network structure, weights, and training sets are beyond the scope of this article. The test set is imported into each AUT. The output of AUTs is compared with the final annotation results. The overall accuracy, sensitivity, and specificity used to differentiate referable DR from nonreferable images are reported. The performance of AUTs is further compared across the 7 subtypes separately.

Diversity of the Test Set.
The test set contains 6327 images from 11 hospitals in 10 provinces. Among them, 9 hospitals are tertiary hospitals and contribute 71.2% of the images, while the rest are secondary hospitals and contribute 28.8%. No primary hospitals or community clinics are involved. Since the images are deidentified, the location of the hospital is used to indicate the geographical distribution of patients. The provincial distribution of images is shown in Table 3, which demonstrates that representative provinces in Northeast China, North China, Central China, East China, Southeast China, and South China are involved.
The images are acquired by more than 13 types of fundus cameras made by 9 manufacturers, all in compliance with an ISO standard on fundus cameras [43]. The field of view is 45°. The optical resolution is between 80 and 120 line pairs/mm. All images are larger than 1000 pixels by 1000 pixels. The differences in image size, detector, light source, and embedded software may add more diversity to image quality and features.
In this test set, all fundus photographs are rectangular images with a pure background (either dark or white pixels) enveloping the round-shaped images of interest. The ratio between the pure background area and the whole area of each photograph is also considered an important source of image variation.

Performance of Annotators.
During the recruitment of annotators, 47 ophthalmologists registered and attended the exam to classify 120 fundus images, including 63 DR images. 15 candidates finally passed and joined the annotation. Their average professional experience is above ten years. They are from 15 different hospitals in 7 provinces. Their accuracies in the exam range from 80% to 87%. The interannotator agreement is evaluated by calculating Fleiss' kappa. The result is 0.75, which is considered substantial given the fact that annotators come from different hospitals and regions. The intraannotator agreement is evaluated by calculating intraclass correlation, which is >85% for all qualified ophthalmologists. Additional training is given before the centralized annotation to reinforce the guidelines and minimize misunderstandings.
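For readers unfamiliar with the statistic, the standard definition of Fleiss' kappa can be sketched as follows. This is an illustrative implementation only; the article does not describe the software used for the calculation, and the function name `fleiss_kappa` and the input layout (one list of labels per image, with the same number of raters for every image) are assumptions of this sketch.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-image label lists.

    Every image must carry the same number of ratings (at least 2).
    Undefined (division by zero) if all ratings fall in a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(labels) != n_raters for labels in ratings):
        raise ValueError("every image needs the same number of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for labels in ratings:
        counts = Counter(labels)
        category_totals.update(counts)
        # Share of agreeing rater pairs for this image.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    p_observed = agreement_sum / n_items
    total = n_items * n_raters
    # Chance agreement from the overall category proportions.
    p_expected = sum((c / total) ** 2 for c in category_totals.values())
    return (p_observed - p_expected) / (1 - p_expected)
```

A value of 1 indicates complete agreement, and values near 0 indicate agreement no better than chance.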

Annotation Results.
In the first round, the 15 annotators are randomly and evenly divided into 5 groups. Individual workload is between 1000 and 1500 images. 3694 images yield concordant results, and 369 of them are submitted to the second round as samples for inspection. 2356 images are graded with a majority opinion reached within each grading group and submitted to the second round for arbitration. 277 images yield completely divergent results within each group and are also sent for arbitration.
In the second round, the images are read independently and in a blinded way by two NHS-certified retinal experts and a senior expert with an NHS certificate. Then, they discuss all results and reach consensus on the final annotation results. According to the final results, 55.41% of images are directly determined by the consensus within each group in the first round. 16.02% of the images are graded according to the major opinion within each group in the first round. 26.81% of the images are graded with reference to the minor opinion in each group in the first round. Only 1.76% of the images are graded solely by the arbitrators.
Using the final annotation results as the reference standard, the accuracy of each annotator is calculated. The average accuracy is 83%. The minimum is 75%, while the maximum is 90%. 13 out of 15 annotators have accuracy higher than 80%. The performance of the 15 annotators comports with their qualification exam results and is considered satisfactory in comparison with the commonly accepted diagnostic accuracy by single-field fundus photography [42].
The composition of the annotated images is described in Table 4. The overall proportion of DR is 39.51%, comparable with the prevalence of DR in the Chinese DM population (24.7%-37.5%) [33]. The prevalence of other fundus diseases is 41.08%. This test set balances the proportion between DR and other fundus diseases that may be assessed by future AIMD products.
The classification of the current test set can be expressed in a simplified manner. Class 0 and Class 1 in Table 1 are consolidated into nonreferable DR. Class 2 to Class 4 in Table 1 are consolidated into referable DR. Class 5 and Class 6 may remain independent or be consolidated into a certain type. In the following algorithm testing, they are considered nonreferable.
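The consolidation rule can be written down as a small mapping. The sketch below is illustrative; the function name `consolidate`, the string labels, and the `keep_other_classes` switch are hypothetical names chosen here to show both options mentioned for Class 5 and Class 6.

```python
def consolidate(final_class, keep_other_classes=False):
    """Map a 7-class label (0-6, as in Table 1) to a simplified label."""
    if final_class in (0, 1):
        return "nonreferable"
    if final_class in (2, 3, 4):
        return "referable"
    if final_class in (5, 6):
        if keep_other_classes:
            # Option 1: Class 5 and Class 6 remain independent.
            return "other fundus disease" if final_class == 5 else "ungradable"
        # Option 2 (used in the algorithm testing): treat as nonreferable.
        return "nonreferable"
    raise ValueError(f"unknown class: {final_class!r}")
```

With the default setting, the seven classes collapse to the binary referable/nonreferable target used to compute overall accuracy, sensitivity, and specificity.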

Comparison with Raw Labels.
During data collection, partner hospitals submitted raw labels, which were annotated by local annotators without centralized examination or training. The number of annotators deployed in each hospital varied from 1 to 3. The requirement for annotator qualification differed among partner hospitals. The minimum requirement was graduate student level, and the maximum requirement was associate professor level. Using the final annotation results as the reference standard, the overall accuracy of raw labels is 61.64%, and Cohen's kappa is 0.5173, indicating quality problems with the raw labels.
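The two figures reported above follow from standard definitions, which can be sketched as below. The function name `accuracy_and_cohen_kappa` and the list-based inputs are assumptions for illustration; the article does not state which tool was used for the calculation.

```python
from collections import Counter


def accuracy_and_cohen_kappa(raw_labels, final_labels):
    """Overall accuracy and Cohen's kappa of raw labels against final labels.

    Kappa is undefined (division by zero) if chance agreement equals 1.
    """
    if not raw_labels or len(raw_labels) != len(final_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(raw_labels)

    # Observed agreement is the same quantity as overall accuracy here.
    p_observed = sum(r == f for r, f in zip(raw_labels, final_labels)) / n

    # Chance agreement from the marginal class frequencies of both label sets.
    raw_counts = Counter(raw_labels)
    final_counts = Counter(final_labels)
    p_expected = sum(raw_counts[c] * final_counts[c] for c in raw_counts) / (n * n)

    kappa = (p_observed - p_expected) / (1 - p_expected)
    return p_observed, kappa
```

Because kappa discounts agreement expected by chance, it is lower than the raw accuracy whenever some agreement could arise from the class distribution alone.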

Algorithm Testing Results.
The overall accuracy, sensitivity, and specificity to differentiate referable DR from nonreferable images are calculated and compared among the 5 AUTs. Table 5 shows the results of the 5 AUTs. The accuracy ranges from 0.77 to 0.88. The sensitivity ranges
The capability of the algorithm to correctly classify images of a specific class as referable or nonreferable is also calculated. For class 2-class 4, it is represented as the number of true positives over the total number of samples in this category, which is equivalent to sensitivity. For other classes, the specificity of each category is calculated instead. Table 6 compares the performance of 5 AUTs on each specific class. It provides more details to demonstrate the variation in algorithm performance. For class 0, class 3, and class 4, the capability of all AUTs is above 95% on average. For class 1, the capability of AUT1 is significantly lower than the rest (on average above 90%). For class 2, the capability ranges from 0.64 to 0.75, indicating a common weakness among all 5 AUTs. For class 5, the capabilities of AUT1 and AUT3 significantly outweigh the rest of the AUTs. For class 6, AUT1 shows the top capability among the 5 AUTs. No AUT in this experiment shows homogeneous capability to classify all 7 classes.
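The overall metrics and the per-class capability can be computed together from the reference classes and the binary AUT outputs. The following is a minimal sketch, assuming each AUT returns a referable/nonreferable decision per image; the function name `evaluate_aut` and the input layout are hypothetical and do not describe the actual test harness.

```python
REFERABLE = {2, 3, 4}  # Classes 2-4 are referable DR; all others are nonreferable.


def evaluate_aut(reference_classes, aut_referable):
    """Compare binary AUT outputs with the 7-class reference standard.

    reference_classes: final annotation class (0-6) per image.
    aut_referable:     True if the AUT flags the image as referable.
    Raises ZeroDivisionError if there are no referable or no nonreferable images.
    """
    tp = fp = tn = fn = 0
    per_class = {}  # class -> [correct decisions, total images]
    for ref, pred in zip(reference_classes, aut_referable):
        truth = ref in REFERABLE
        if truth and pred:
            tp += 1
        elif truth:
            fn += 1
        elif pred:
            fp += 1
        else:
            tn += 1
        stats = per_class.setdefault(ref, [0, 0])
        stats[0] += pred == truth
        stats[1] += 1

    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Per-class sensitivity for classes 2-4, per-class specificity otherwise.
        "per_class": {c: k / n for c, (k, n) in sorted(per_class.items())},
    }
```

Reporting the per-class values alongside the overall figures makes visible the kind of class-specific weakness described above, which a single accuracy value would hide.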

Discussion
This article demonstrates a centralized pathway to build test sets and conduct third-party testing of AIMD products. The test set is composed of 6327 images, which are annotated into 7 classes covering all stages of DR according to ICO guidelines, as well as "other fundus diseases" and "ungradable images." The diversity of the test set considers data sources (11 hospitals from 10 provinces), fundus cameras (>13 models from 9 manufacturers), and image parameters (image sizes, detectors, and light sources).
The pathway for test set construction in this article is different from that in algorithm challenges, where test sets and training sets are usually constructed under the same protocol or as subsets of a larger dataset. This pathway relies on independent data collection, curation, annotation, and storage, which decreases the possible similarity between this test set and training sets owned by developers of AUTs and promotes the verification of AI algorithm generalizability. It may be suitable for third-party testing laboratories to conduct conformity assessment.
According to the literature [5,9,10,44], the pathway to form the reference standard in other studies is based on various combinations of annotators and reviewers. In this study, a combination of prequalified annotators and arbitrators conducted data annotation. Under this scheme, the annotators' performance is estimated quantitatively (Fleiss' kappa = 0.75, individual accuracy >80%, and intraclass correlation >85%). During the annotation process, each image in the test set is reviewed by 3-6 experienced professionals, and 98.2% are determined by the major decision (3 votes out of 3 annotators or >4 votes out of 6). Only 1.76% are determined by the arbitration experts. The results show that the annotation scheme helps enhance consensus among annotators.
On the other hand, the raw labels from partner hospitals show significantly lower accuracy and consistency compared to the final annotation results. According to information provided by partner hospitals, the raw labels are annotated by a varying number of annotators, ranging from 1 to 3, including graduate students, residents, and junior and senior ophthalmologists. This suggests the importance of organizing annotation tasks systematically and the necessity of establishing consistent annotation rules among different hospitals. Otherwise, the discrepancy in data annotation may impact dataset quality and further inhibit the quality of the AIMD.
Using the annotated test set, the performance of 5 AUTs is tested quantitatively as a technical demonstration. It is straightforward to compare the overall accuracy, sensitivity, and specificity in the scenario of DR classification. Algorithm performance can be further observed on subgroups of the test set. However, no AUT in this experiment shows homogeneous capability to classify different categories of images. While public stakeholders pay attention to algorithm fairness and generalizability, this study shows the necessity to reveal and understand how the AI algorithm performs differently on subtypes of diabetic retinopathy images. It also indicates that algorithm performance may change with the proportion of these categories. A strategy to tune the composition of test sets in a flexible manner is needed to guide future testing.
This work explores practical approaches and issues in advancing the standardized testing of the AIMD. However, due to time and resource constraints, it has limitations in the following aspects. First, the test set is based on retrospective data collection. Although data are randomly sampled by partner hospitals, control measures should be taken to limit bias. Continuous sampling of data within a period may help.
Second, the proportion of mild NPDR is much smaller than that of other DR subtypes. One possible reason is that without compulsory DR screening, patients with mild NPDR are unlikely to take fundus photographs, which results in the relative scarcity of mild NPDR photographs. Increasing the number of mild NPDR images not only decreases the sampling errors of SE and SP but also improves the balance between different stages of DR. In fact, from the annotator's perspective, it is important to differentiate microaneurysms in mild NPDR from blot hemorrhages in moderate NPDR. Therefore, more cases of mild NPDR should be added to the current test set.
Third, as a colour fundus photograph dataset, it is difficult to use the test set alone to annotate important diseases among the 41.09% "other diseases" that may be assessed by AI in the near future. Colour fundus photographs are incapable of thickness measurement, which inhibits detection of certain diseases such as AMD and glaucoma. Images from additional imaging modalities such as OCT should be added to the test set, but the cost will increase significantly.
Fourth, the diversity of this test set still needs improvement. Partner hospitals in this study are mostly tertiary hospitals, without community-level hospitals. As a result, most photographs are acquired by high-end fundus cameras. Handheld fundus cameras, which may be more popular in community-level clinics and rural areas, make only a minor contribution to data collection. More data should be added to compensate for this scenario and enrich data diversity.
To promote standardization of AIMD testing, the reliability and comparability of test sets need to be addressed in future research. Test sets built by different organizations may have different data sources, data inclusion/exclusion criteria, annotation resources, and procedures, which would cause inconsistent dataset quality. Transparent description of data sets should become the norm. Consensus standards on dataset construction and annotation are needed to guide the procedure. It would be necessary to conduct sample inspection and comparison among test sets, similar to proficiency testing [45] by interlaboratory comparison.

Conclusions
This article proposes a practical approach to build test sets for third-party testing of the AIMD. It takes quality control measures during data collection, curation, and annotation. It demonstrates the benefit of centralized data annotation in comparison with individual annotators and spontaneous annotation from single hospitals. The application of such a test set reveals algorithm performance and weaknesses in a comparative and straightforward manner, providing helpful information for regulation of such medical devices.

Data Availability
The data supporting the findings of the current study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest.