Biomedical Data Annotation: An OCT Imaging Case Study

In ophthalmology, optical coherence tomography (OCT) is a widely used imaging modality, allowing visualisation of the structures of the eye with objective and quantitative cross-sectional three-dimensional (3D) volumetric scans. Due to the quantity of data generated from OCT scans and the time taken for an ophthalmologist to inspect them for various disease pathology features, automated image analysis in the form of deep neural networks has seen success in the classification and segmentation of OCT layers and the quantification of features. However, existing high-performance deep learning approaches rely on huge training datasets with high-quality annotations, which are challenging to obtain in many clinical applications. Collecting annotations from less experienced clinicians has the potential to alleviate time constraints on more senior clinicians, allowing faster collection of medical image annotations; however, with less experience comes the possibility of reduced annotation quality. In this study, we evaluate the quality of diabetic macular edema (DME) intraretinal fluid (IRF) biomarker image annotations on OCT B-scans from five clinicians with a range of experience. We also assess the effectiveness of annotating across multiple sessions following a training session led by an expert clinician. Our investigation shows a notable variance in annotation performance that correlates with the clinician's experience in OCT image interpretation of DME, and that having multiple annotation sessions has a limited effect on annotation quality.


Introduction
OCT is an interferometric imaging technique which can provide cross-sectional views of the subsurface microstructure of biological tissue [1]. Due to its noninvasive nature, simplicity of use, absence of ionising radiation, and high resolution, OCT has seen widespread use in diagnostic ophthalmology and has played a critical role in the diagnosis and monitoring of eye diseases such as age-related macular degeneration (AMD) and DME [2], as well as other retinal vascular diseases and macular disorders. The role of OCT analysis is also expanding in the prediction and monitoring of systemic neurodegenerative diseases. By providing high-resolution information on anatomic and structural changes within the eye, the technology allows clinicians to detect early disease changes and monitor response to treatments, consequently improving patients' outcomes. With the development of spectral-domain OCT, clinicians are now able to capture denser 3D images, generating large amounts of patient-specific data. While OCT imaging holds promising benefits, the challenge of interpreting an extensive array of biomarkers using both 2D and 3D perspectives, combined with the need for comparisons to prior scans, presents difficulties. This challenge is particularly pronounced within clinics already grappling with substantial workloads. Due to this, each patient has become a "big data" challenge [3].
Automated image segmentation algorithms have emerged to provide meaningful data from medical images. These algorithms use deep learning, a subfield of machine learning that has proven successful in retinal structure segmentation in fundus [4] and OCT images [5, 6]. The use of deep learning has stimulated a proliferation in automatic assessments of various segmented features of disease pathologies within an OCT volume, particularly IRF, subretinal fluid, pigment epithelial detachment, and subretinal hyper-reflective material [7-9].
Supervised deep neural networks have achieved strong performance in medical image segmentation, in part due to high-quality annotated data [10]. However, the collection of high-quality image data annotations has many hurdles, including time and monetary costs, as well as regulatory issues [11]. To appropriately model the underlying data distribution of a given task, current supervised deep learning systems rely on labelled data. The quality of the labelled data is one of the most important aspects in achieving the desired performance for computer vision algorithms driven by deep learning. Furthermore, due to obscure boundaries of morphological features pertaining to disease pathology on medical images, annotations can suffer from inter/intraobserver variance even among specialists. As deep neural networks are vulnerable to overfitting, accepting annotations as perfectly correct and reliable may result in biased models with poor generalisation [12]. This further necessitates the development of more robust annotation quality analysis approaches for the creation of more reliable and applicable datasets.
In this study, we investigated the quality of IRF annotations from multiple clinician annotators on OCT images collected from patients with DME. The main focus of this work is to explore how the quality (as evaluated through comparison of a given annotation against a ground truth) of OCT-based image annotations can vary based on clinician experience. As well as this, we explore the effect of having multiple annotation rounds on the resulting annotation quality. To evaluate the underlying variations in the annotations, image-based properties such as image noise and pixel intensity are also considered.
1.1. Related Work. Data quality and annotation techniques have been investigated in recent years in different fields of artificial intelligence due to their importance. McCulloh et al. [13] described numerous measures for assessing interannotator agreement, rater consistency, and distraction rate, applied to natural language processing tasks. They demonstrated improvement in annotator utilisation, interannotator agreement, and the rate of annotation through the management of the annotation process by actively monitoring several quality and schedule metrics.
For automated analysis of OCT imaging, there has been limited work in the literature investigating how annotation quality can be evaluated. Kurmann et al. [14] implemented an image annotation quality control for OCT B-scans in the form of majority voting and intergrader performance evaluation using kappa metrics, which improved the performance of a machine learning model for the classification of AMD biomarkers. Apostolopoulos et al. [15] investigated the use of automatic image enhancement on OCT B-scans prior to annotating, resulting in greater annotation quality between two graders for several biomarkers.
Most investigations into how the annotator's experience can affect annotation quality are related to crowd-sourcing, which describes the use of many nonexperts for large-scale labelling and the rapid generation of labelled images for algorithm development and evaluation. Irshad et al. [16] showed that crowd-sourced nonexpert-derived evaluation metrics perform at a level similar to both research fellow and automated method-derived metrics for the relatively simple task of nucleus detection and segmentation, with the research fellow annotations showing the strongest performance for detection and the crowd-sourced scores showing the strongest performance for segmentation. They concluded that crowd-sourced image annotation is a highly scalable and effective method of obtaining nucleus annotations for large-scale projects in computational pathology.
Others have also investigated the impact of annotated data quality on algorithm performance measurements. Using multiple publicly available datasets of stationary ground-based camera data of outdoor spaces and urban scenes, Joseph et al. [17] conducted several data annotation experiments to measure interannotator agreement and consistency, as well as how the selection of ground truth impacts the perceived algorithm performance. They showed that when the quality of annotations is not monitored during the annotation process, subsequent algorithm precision can suffer a significant loss, and that knowledge of the labelling process, training data, and ground truth data used for algorithm evaluation is a fundamental component of accurately assessing the trustworthiness of an AI system. Walter and Sörgel [18] identified common issues in crowd-sourced geographic data annotation, such as object misrepresentation. They demonstrated that by annotating the data numerous times and using a single shared dataset, some of these quality discrepancies can be removed.
Though there have been investigations into quality control of annotation efforts for a range of imaging modalities, to the best of our knowledge, there has been little work showing the extent to which junior clinicians can be utilised for the development of biomedical imaging annotations, and how their annotations compare to those of more experienced clinicians, especially in the case of OCT images.

OCT Imaging Data.
The images utilised for this study were captured using the SD-OCT Heidelberg Spectralis (Heidelberg, Germany) as part of routine clinical care, using standard imaging protocol at Sunderland Eye Infirmary, UK. The dataset consists of 10 randomly selected individual OCT B-scan slices from 10 patients with active DME who attended vascular clinics for treatment and follow-up. The imaging protocol used had a scan width of 15 × 30° with 19 scans (240 microns spacing) and an ART value of 9 (high speed). The individual line scans had a slight variation in size and scaling but were typically 732 by 428 pixels, with a scaling of 10.5 microns per pixel on the X (horizontal) axis and 3.8 microns per pixel on the Y (vertical) axis.

Annotation Protocol.
The annotation process took place utilising annotation software developed by Gliff [19]. The annotation tool is designed to work on any kind of imaging modality, with intuitive features including 3-step zooming, contrast adjustment of the image or the annotation, and annotation sizing with subpixel precision. The 10 OCT images were graded using the tool by 5 clinicians with different levels of experience in OCT interpretation. The clinicians whose annotations are used in this study are anonymised for confidentiality reasons as experts "Ex1" and "Ex2" and juniors "Jr1", "Jr2", and "Jr3". Clinician experience is as follows: Ex1 > 15 years and Ex2 > 15 years; both experts are senior ophthalmology clinicians and retinal specialists. All juniors are ophthalmic trainees at different levels of their specialist training: Jr1 7 years, Jr2 2 years, and Jr3 0 years. Clinician ophthalmologist Ex1 performed the annotation once using the platform. A second clinician ophthalmologist, Ex2, performed the annotations twice, once before each trainee training session (with a minimum 1-month period between rounds). Ex2 also conducted a training session for the three clinicians considered juniors. Training focused on understanding the tool features; identifying and defining various OCT landmarks; the features and characteristics of IRF cysts; and other common OCT image defects, such as shadowing from vessels and exudates or dark retinal tissue that can mimic IRF cysts. The clinicians were then presented with the 10 OCT scans of various clarities and with a wide mix of OCT features. The 3 junior clinicians were given the task of annotating all areas of IRF cysts, with no limit on size, and were instructed to shade the entire area of each IRF cyst rather than its edges only. Following the first round of annotations, the annotated images were collated and reviewed by Ex2. A training session was then held to review the results of the 3 junior clinicians, identify any wrongly defined areas of IRF cysts, evaluate over- or underestimation trends in the annotations, and identify the correct edges of the different IRF cysts, especially in images with low resolution or ill-defined edges due to poor scan quality. The clinicians were instructed to observe a one-month cooling period before the second round of annotation using the same platform, taking into consideration the training provided and the remarks made, to improve the accuracy of defining cyst edges and borders. None of the clinicians had access to the other clinicians' annotations, to prevent any influence on the results. Ground truth annotations were generated through a majority vote of clinicians Ex1's and Ex2's last rounds of annotations to create the most accurate representation of ground truth possible.
Despite only 10 OCT images being used for this study, 2399 IRF areas were annotated across all clinicians and annotation rounds. Furthermore, the 10 images represent scans that cover a large range of variation in both image noise/clarity and disease progression, therefore allowing greater insight into quality variation for the clinicians under different circumstances.

Majority Voting.
We use majority voting to generate expert-agreed ground-truth annotations, following the approach detailed by Kajino et al. [20]. Figure 1 shows an example of a majority vote with the corresponding OCT region.
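As an illustration, the following is a minimal sketch of a pixel-wise majority vote over a set of binary annotation masks (Python, with illustrative names; the study's implementation follows Kajino et al. [20]):

```python
import numpy as np

def majority_vote(masks):
    """Pixel-wise majority vote over binary annotation masks.

    masks: sequence of H x W boolean arrays, one per annotator.
    Returns an H x W boolean array in which a pixel is foreground (IRF)
    only if more than half of the annotators marked it.
    """
    stack = np.stack([np.asarray(m, dtype=bool) for m in masks], axis=0)
    votes = stack.sum(axis=0)              # number of annotators marking each pixel
    return votes > stack.shape[0] / 2.0    # strict majority

# For example, the ground truth used in this study combines the two experts'
# final-round annotations:
# ground_truth = majority_vote([ex1_mask, ex2_mask])
```

With two annotators, a strict majority reduces to the pixels on which both experts agree.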

Performance Metrics.
To assess annotator performance, annotations are evaluated against the majority vote. Quantitative metrics used include intersection over union (IoU), Dice similarity coefficient (DSC), true-negative rate (TNR), true-positive rate (TPR), precision, and Cohen's kappa. The metrics are calculated as follows:

IoU = TP / (TP + FP + FN),
DSC = 2TP / (2TP + FP + FN),
TNR = TN / (TN + FP),
TPR = TP / (TP + FN),
Precision = TP / (TP + FP),

where TP, TN, FP, and FN refer to true-positive, true-negative, false-positive, and false-negative regions, respectively. TP represents the number of pixels which are part of the region and are labelled correctly by both the annotator and the ground truth. TN represents the number of pixels which are part of the background region and are labelled correctly by both the annotator and the ground truth. FP represents the number of pixels labelled as part of the region by the annotator but labelled as part of the background by the ground truth. Finally, FN represents the number of pixels labelled as part of the background by the annotator but labelled as part of the region in the ground truth.
Cohen's kappa coefficient is used as a metric to analyse the reliability of the agreement among annotators, which has the benefit over the alternatives of accounting for the possibility of the agreement happening by chance. The coefficient ranges from 0, when the agreement is no better than chance, to 1, when there is perfect agreement [21]. Kappa values are grouped as follows: κ ≤ 0 indicating no agreement, 0.01 ≤ κ ≤ 0.20 none to slight, 0.21 ≤ κ ≤ 0.40 fair, 0.41 ≤ κ ≤ 0.60 moderate, 0.61 ≤ κ ≤ 0.80 substantial, and 0.81 ≤ κ ≤ 1.00 almost perfect agreement. κ is calculated as

κ = (p_o − p_e) / (1 − p_e),

where p_o represents the relative observed agreement among annotators and p_e is the hypothetical probability of chance agreement, calculated from the annotators' marginal label proportions as

p_o = (TP + TN) / N,
p_e = [(TP + FP)(TP + FN) + (FN + TN)(FP + TN)] / N²,

with N = TP + TN + FP + FN.
Gwet's AC1 is calculated similarly as another interrater agreement metric [22]. Gwet's AC1 tends to be less influenced by imbalanced data and prevalence than Cohen's kappa; therefore, providing values for Gwet's AC1 alongside Cohen's kappa is recommended, as detailed by Wongpakaran et al. [23]. Gwet's AC1 has the same structure as Cohen's kappa; the main difference lies in the calculation of p_e, which is based on the probabilities p_1, p_2, ..., p_n of each of the n categories occurring for both raters. In the case of binary annotations, the chance agreement is p_e = 2π(1 − π), where π is the mean proportion of pixels labelled as IRF by the two raters.
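As a minimal sketch (not the authors' implementation), the metrics above can be computed from a pair of binary masks as follows; the binary-case form of Gwet's AC1 chance agreement is assumed as described above:

```python
import numpy as np

def annotation_metrics(pred, truth):
    """Agreement metrics between a binary annotation and the ground truth.

    pred, truth: H x W boolean arrays (True = pixel labelled as IRF).
    """
    pred, truth = np.asarray(pred, dtype=bool), np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)
    tn = np.sum(~pred & ~truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    n = tp + tn + fp + fn

    # Pixel-wise overlap metrics.
    iou = tp / (tp + fp + fn)
    dsc = 2 * tp / (2 * tp + fp + fn)
    tnr = tn / (tn + fp)
    tpr = tp / (tp + fn)
    precision = tp / (tp + fp)

    # Cohen's kappa: observed agreement corrected for chance agreement
    # derived from the two raters' marginal label proportions.
    p_o = (tp + tn) / n
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (p_o - p_e) / (1 - p_e)

    # Gwet's AC1 (binary case): chance agreement 2*pi*(1 - pi), where pi is
    # the mean proportion of pixels labelled as IRF by the two raters.
    pi = ((tp + fp) / n + (tp + fn) / n) / 2
    ac1 = (p_o - 2 * pi * (1 - pi)) / (1 - 2 * pi * (1 - pi))

    return {"IoU": iou, "DSC": dsc, "TNR": tnr, "TPR": tpr,
            "precision": precision, "kappa": kappa, "AC1": ac1}
```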

Annotation Image Analysis.
To better understand the underlying variance in the annotations of IRF from the annotators, image properties of the annotations against the corresponding OCT images are evaluated. IRF appears as round, minimally reflective spaces within OCT images [24]; therefore, we expect the annotations to be present in dark, rounded regions within the image. We evaluate the greyscale pixel intensity difference between the outer and inner boundaries of each annotation, with each referring to 1 pixel outside or inside the edge of the annotated region, respectively. As IRF regions are visibly darker than the surrounding tissue, this is used to gain insight into whether clinicians are over- or undersegmenting. Where the IRF regions have normalised greyscale pixel intensity values closer to 0 (black) and surrounding tissue closer to 0.5 (grey), a larger difference in pixel intensity between the outer and inner annotation boundaries indicates a more accurate annotation. Figure 2 shows the annotation boundaries for some annotated areas, with the outer boundary highlighted in red and the inner boundary in blue. Because OCT images suffer from noise defects, annotator performance at different noise levels is evaluated to determine the impact of noise on the annotations. As the speckle noise present in OCT follows a Poisson distribution [25], the shape parameter λ for each OCT image is estimated using the approach detailed by Laligant et al. [26]. Each of the 10 OCT images is then assigned to groupings of low noise (0 ≤ λ ≤ 20), medium noise (20 < λ ≤ 40), and high noise (λ > 40). The resulting distribution had 5 OCT images designated as low noise, 3 as medium noise, and 2 as high noise. Figure 3 shows OCT images from our dataset with a low and a high noise estimate.
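A minimal sketch of the boundary-intensity measurement described above is given below, using one-pixel morphological dilation and erosion to obtain the outer and inner boundary rings (scipy is assumed; names are illustrative):

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary_intensity_difference(image, mask):
    """Mean greyscale difference between the outer and inner boundary of an annotation.

    image: H x W array of greyscale intensities normalised to [0, 1].
    mask:  H x W boolean array for a single annotated IRF region.
    Returns mean(outer ring) - mean(inner ring); a larger value indicates an
    annotation whose edge sits on the dark-fluid to brighter-tissue transition.
    """
    mask = np.asarray(mask, dtype=bool)
    outer = binary_dilation(mask) & ~mask   # the 1-pixel ring just outside the edge
    inner = mask & ~binary_erosion(mask)    # the 1-pixel ring just inside the edge
    return float(image[outer].mean() - image[inner].mean())
```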
Finally, we investigate whether the quality of the annotation varies based on the location of the perceived IRF biomarker on the OCT image. Three zones are established, moving outward in each direction from the centre of the OCT image, with vertical boundaries at 0-0.5 mm, 0.5 mm-1.5 mm, and 1.5 mm-3 mm. Figure 4 illustrates these zones on an OCT image from our dataset. Recent research has shown variable correlations between fluctuations in fluid volumes in different central zones of the OCT scans and visual outcomes after treatment, as well as differences between 2D area measurements and estimates of fluid versus 3D volumetric measurements. It is, therefore, prudent to assess and compare the annotation accuracy for these different zones for future interpretation of results.
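Using the horizontal pixel scaling reported above (approximately 10.5 microns per pixel), each annotation can be assigned to a zone by its horizontal distance from the image centre; a minimal sketch, with illustrative names and the scaling treated as an assumption:

```python
def assign_zone(centroid_x_px, image_width_px, um_per_px_x=10.5):
    """Assign an annotation to a zone by horizontal distance from the image centre.

    centroid_x_px:  horizontal pixel coordinate of the annotation centroid.
    image_width_px: width of the OCT B-scan in pixels.
    Returns 1 (0-0.5 mm), 2 (0.5-1.5 mm), 3 (1.5-3 mm), or None beyond 3 mm.
    """
    dist_mm = abs(centroid_x_px - image_width_px / 2) * um_per_px_x / 1000.0
    if dist_mm <= 0.5:
        return 1
    if dist_mm <= 1.5:
        return 2
    if dist_mm <= 3.0:
        return 3
    return None
```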

Clinician Grading of IRF.
The mean κ value for the IRF annotation agreement across both round-1 and round-2 annotations was 0.68, with minimum and maximum scores of 0.56 and 0.94, respectively. The mean κ value for round-2 was lower than that for round-1 (κ = 0.68 for round-2 vs. κ = 0.71 for round-1). The highest agreement was observed for expert clinicians Ex1 and Ex2 on the round-1 annotations (κ = 0.94), indicating almost perfect agreement, and the lowest agreement was seen for clinicians Ex2 and Jr2 on the round-1 annotations (κ = 0.56), indicating moderate agreement. Figure 5 shows the κ values between each pair of graders for the round-1 and round-2 annotations.
In comparison, the mean AC1 value for round-2 was also found to be lower than that for round-1 (AC1 = 0.60 for round-2 vs. AC1 = 0.63 for round-1). The same maximum and minimum agreement pairs were found as with Cohen's kappa. Figure 6 shows the AC1 values between each pair of graders for the round-1 and round-2 annotations. Tables 1 and 2 describe the performance of each clinician grader against the majority vote, with metrics averaged across all annotations. The "# of annotations" column in each table refers to the number of annotated fluid areas within the images. Figure 7 shows the distribution of IoU scores for each clinician's annotation rounds. Performance metrics showed limited improvement in the second round of annotations for the 4 annotators who completed one: of the 20 metric values across IoU, DSC, TNR, TPR, and precision, 9 increased, 2 remained the same, and 9 decreased.

Annotation Image Region Properties.
The variance of the IoU values for each clinician's last round of annotations, grouped into the different noise levels, can be observed in Figure 8. Image noise did not have a significantly strong correlation with annotator performance: mean IoU for expert and junior clinicians in low-noise images was 0.8648 and 0.4304, respectively, whilst in high-noise images the expert and junior mean IoU was 0.8479 and 0.4687, respectively. Figure 9 shows the variance of the IoU values across the three previously defined zones for each clinician's last round of annotations. A relationship between the location of the annotation and the performance of the annotator can be seen, particularly for juniors, with a large decrease in mean IoU farther from the centre of the OCT image. The mean IoU for expert and junior clinicians in zone 1 is 0.8857 and 0.6179, respectively, whilst in zone 3, the expert and junior mean IoU reduces to 0.8305 and 0.2782, respectively.
Figure 10 depicts the mean normalised greyscale pixel intensity values for each clinician's last round of annotations on the 10 OCT images. A larger difference between the experts' outer and inner annotation boundaries can be seen, with a mean difference of 0.1522, compared with the juniors, who have a mean difference of 0.1216. Furthermore, the greyscale pixel intensity of the entire annotated area shows that, on average, junior annotations have a higher greyscale intensity than those of the experts. Both factors suggest that when annotating the IRF areas, junior annotations tend to be slightly oversegmented into the surrounding retinal tissue.

Discussion
Our experimental results show that the quality of annotations can have a notable range based on clinician experience. Differences in annotation quality between the clinicians in this study could be attributed to human-centric limitations during the manual annotation process, such as clinical experience, lack of annotator focus, and distractions.
Inter-rater agreeability analysis found that agreement as evaluated through Gwet's AC1 is lower than that of Cohen's kappa. This could be attributed to the calculation of Gwet's AC1, which takes into account the prevalence of categories and adjusts for imbalanced data. As our data are imbalanced, with a majority of pixels in each annotation being background, this suggests the clinicians agree on the more prevalent category (background) but disagree more on the less prevalent category (IRF), which can be expected. Despite the slight disparity in the absolute values of the two metrics, they reveal comparable trends of agreeability between clinicians, where the pairs of clinicians with the highest and lowest agreement metrics remain consistent across both measures.
A unique feature detailed within our investigation was an image analysis-based method to evaluate the variance of the image properties of the annotations. This showed that expert clinicians produce precise annotations which are more contained to the IRF areas, whilst less experienced clinicians produce larger and less precise annotations which oversegment. Furthermore, we found that the noise present in the OCT image does not have a significant effect on annotation quality; however, annotations farther away from the centre of the OCT image show a substantial reduction in quality for junior clinicians. A limitation of this study is that only 10 OCT images were annotated; therefore, the results may be skewed by this small sample. As well as this, with only 2 experts used to create the majority vote, there will be some bias when evaluating the performance of the annotations used to create the majority vote. This could be prevented by including a minimum of 3 expert clinicians and using a leave-out approach in which performance metrics and intergrader agreeability are also evaluated against a majority vote that does not include the clinician's own annotations, as shown in the investigation by Kurmann et al. [14]. Finally, it would be beneficial to include annotations of other biomarkers to determine how the quality of the annotations may vary depending on the biomarker, as only IRF annotations were considered in this study.

Conclusion
In this work, we investigated the variance in annotation quality of OCT image annotations from multiple annotators with different levels of experience. Based on intergrader agreeability metrics, we observed that the expert clinicians have almost perfect agreement with each other and moderate-to-substantial agreement with the junior clinicians, whilst the juniors have substantial agreement with one another. It was found that having junior clinicians annotate over multiple training rounds produced limited improvement in quality. Through an image-based analysis, image noise was found to have little effect on annotation quality; however, annotations farther from the centre of the image have substantially reduced quality for junior clinicians. Furthermore, it was found through analysis of annotation boundaries that expert clinician annotations were contained to the darker IRF areas, whereas juniors tend to oversegment into the surrounding retinal tissue.
This study has shown the extent to which annotations from annotators with different levels of subject experience can vary. These findings suggest that the use of less experienced clinicians for image annotation may not be a reliable solution for improving the efficiency of data collection in ophthalmology. It may be necessary to invest in more comprehensive training programs or to prioritise the involvement of experienced clinicians in the annotation process to ensure the reliability and accuracy of automated image analysis using deep learning approaches that rely upon these annotations.
