Noninferiority and Equivalence Evaluation of Clinical Performance among Computed Radiography, Film, and Digitized Film for Telemammography Services

Objective. The aim of this study was to evaluate and compare the clinical performance of different alternatives to implement low-cost screening telemammography. We compared computed radiography, film printed images, and digitized films produced with a specialized film digitizer and a digital camera. Materials and Methods. The ethics committee of our institution approved this study. We assessed the equivalence of the clinical performance of observers for cancer detection. The factorial design included 70 screening patients, four technological alternatives, and cases interpreted by seven radiologists, for a total of 1,960 observations. The variables evaluated were the positive predictive value (PPV), accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curves (AUC). Results. The mean values for the observed variables were as follows: accuracy ranged from 0.77 to 0.82, the PPV ranged from 0.67 to 0.68, sensitivity ranged from 0.64 to 0.74, specificity ranged from 0.87 to 0.90, and the AUC ranged from 0.87 to 0.90. With an equivalence margin of 0.1, all alternatives were equivalent for all variables. Conclusion. Our findings suggest that telemammography screening programs may be provided to underserved populations at a low cost, using a film digitizer or a digital camera.


Introduction
Screening mammography programs, especially programs that use modern digital techniques such as computed radiography (CR) or full-field digital mammography (FFDM), have reduced the mortality rate associated with breast cancer [1,2]. However, screening programs alone are inconclusive as they yield many false positives, and the definitive diagnosis of breast cancer is verified by biopsy and a histopathological examination of palpable lesions [3]. Therefore, the positive predictive value (PPV) of specific mammographic findings has been evaluated in several studies [4][5][6] and recently by Venkatesan et al. [7]. Nevertheless, the evaluation of sensitivity is also very important in the evaluation of mortality associated with false negatives.
Telemedicine may help to provide widespread screening mammography services in underserved areas, and approaches such as CR or FFDM are useful in the implementation of telemammography. However, these technologies are still unaffordable in vulnerable areas of our country, such as jungles that have a low population density; therefore, low-cost solutions are required for effective telemammography. In our country, CR is only available in large cities and FFDM is only available in our hospital. Specialized equipment is available for digitizing mammogram films, and several studies have compared digital mammogram modalities to film-screen mammography [8][9][10], reporting no significant differences between film-screen mammography and digital mammography modalities, such as CR and FFDM [11,12]. Nevertheless, the cost of specialized digitizers is high, which is why low-cost alternative digitization equipment, such as conventional scanners and digital cameras, is being used for teleradiology services in developing countries. While such pieces of equipment can dramatically reduce costs, their clinical performance should be determined before introducing them in telemammography.
The aim of this study was to establish and compare the clinical performance of different alternatives to implement telemammography, namely CR, film printed from the CR, a specialized digitizer, and a digital camera. The variables used for evaluating clinical performance were the PPV, sensitivity, specificity, accuracy, the area under the receiver operating characteristic (ROC) curve (AUC), and the proportions of true negatives (TN), true positives (TP), false negatives (FN), and false positives (FP), all of which were based on the final assessment categories of the Breast Imaging Reporting and Data System (BI-RADS) [13].
Other studies designed to test the null hypothesis that the performances of different modalities are equal have reported no significant differences between the compared modalities, but their statistical power was not reported, so it is not clear whether these studies simply lacked the power to detect differences. In statistical hypothesis testing, failure to reject the null hypothesis does not mean that the null hypothesis is true. In contrast, the present study was designed to evaluate equivalence or noninferiority, so that equivalence or noninferiority can be concluded from significant results. To establish that the performances are equal or that one modality is noninferior to the other, the null hypothesis has to be that their performances are not equal or that one is inferior to the other. Only by rejecting such a hypothesis can we conclude that the modalities under comparison are equivalent [14][15][16][17].

Materials and Methods
The ethics committee of our institution approved this retrospective study, and informed consent was not required. A treatment-by-reader-by-case factorial design with repeated measures was used, with 70 patients, seven radiologists, three derived images, and the reference images (i.e., CR), for a total of 1,960 observations for each variable.
2.1. The Reference Standard. The standard for positive cases was a malignant lesion confirmed by biopsy within two years of the initial mammography screening, corresponding to BI-RADS final assessment categories 4A, 4B, 4C, and 5 [9,12,18]. Negative cases were defined as cases without any lesions confirmed by biopsy or cases with normal follow-up mammograms within the same two-year interval, corresponding to BI-RADS final assessment categories 2 and 3. Two radiologists with more than ten years of experience in reading mammograms who had access to the clinical history of the patients (biopsy, follow-up mammograms, etc.) established the reference standard.

Study Sample and Readers.
At most rural health centers in our country, there are no mammography services [19], and where they are available, there are no mammogram repositories, so no mammograms are available for a retrospective study. In addition, in these regions, there are not enough patients to conduct a prospective study in a short time. For these reasons, this study was undertaken using CR screening mammograms from our hospital, which is a reference hospital for mammography screening, serving patients from remote underserved areas of our country (approximately 8,000 mammograms interpreted per year). Mammography studies from patients who attended mammography screenings at the Fundación Santa Fe de Bogotá University Hospital (FSFBUH) were randomly selected without repetition from our screening database; the patients were all asymptomatic, and their lesions were impalpable and verified by pathology. The masses ranged in size from 6 mm to 23 mm, with a mean of 11 mm (SD = 4.2). Each case was required to include the following four standard mammographic views: mediolateral oblique, craniocaudal, left, and right, even if additional views were taken in the original screening mammograms. Cases of tomosynthesis or large masses were excluded.
To determine the sample size, we used the table proposed by Obuchowski [20] for comparisons of the AUC with the following criteria: (a) six observers, (b) small variability between radiologists, (c) moderate accuracy of the test (an AUC of approximately 0.75), (d) moderate differences suspected to be found among AUC (i.e., δ = 0.1), and (e) a 1 : 1 ratio between malignant and benign cases. Using these criteria, a minimum of 60 cases was required. The final sample size was set at 70 cases, and the number of radiologists was increased to seven. Patients ranged in age from 41 to 84 years, with a mean age of 62.1 years (SD = 11.5). The cases were distributed as follows: 33 patients had cancer and 37 patients had benign lesions or normal results. The distribution of cases according to the BI-RADS final assessment categories is shown in Table 1. There were 57 cases with calcifications, 26 with masses, 35 with asymmetries, and 11 with architectural distortions and associated features. Four patients with prostheses were included in the sample. The detailed lesion classification of the cases is presented in Table 2. In terms of composition, the distribution of cases was as follows: 17 of the breasts were almost entirely fatty, 32 had scattered areas of fibroglandular density, 11 of the breasts were heterogeneously dense, which may obscure small masses, and 10 of the breasts were extremely dense, which lowers the sensitivity of mammography.
Seven radiologists from FSFBUH who were experienced in mammography, including four with high levels of experience (more than 10 years) and three with intermediate levels of experience (more than two years), served as observers.

Variables Observed by the Radiologists.
Data collection was performed using a database and a digital form that was integrated into the image viewing software. At each interpretation, the radiologist selected the level of confidence in the presence of each selected condition, that is, calcifications, nodules, asymmetries, and distortions, from the following scores: 0, definitely absent; 1, most likely absent; 2, cannot decide; 3, most likely present; and 4, definitely present. For conditions with scores of 3 or 4, the radiologist was required to classify the condition according to the value in Table 2. Next, the radiologist classified the breast composition and finally at the conclusion of this process, a BI-RADS final assessment category was selected.

Generation and Digitization of the Mammograms.
The process of generating film and digital images is shown in Figure 1. The original mammograms consisted of screening CR images that were stored in the picture archiving and communication system (PACS) at FSFBUH. Routine screening digital mammograms were acquired using an Agfa CR 85-X (Agfa HealthCare NV, Belgium), hereafter referred to as CR, with a resolution of 20 pixels/mm (508 dpi), 50 μm per pixel, and a 14-bit grayscale from an 18 × 24 cm cassette and a 3,560 × 4,640-pixel matrix. The derived mammogram images were generated as follows: as we had no screen-film images, the CR images were printed under the supervision of a radiologist on an 18 × 24 cm film with a digital Agfa Drystar 5503 printing system (Agfa HealthCare NV, Belgium) with a resolution of 508 dpi, 50 μm per pixel, and 14-bit contrast. Data that could be used to identify patients were not included in the printed mammograms. Next, the films were digitized using the following two capture devices: (1) an iCR 612SL specialized digitizer (iCR Company, Torrance, CA) that had a maximum spatial resolution of 875 dpi, a pixel spot of 29 μm, 16 bits per pixel, an optical density (OD) of 3.6, and a cost of $15,000 (hereafter referred to as ICR) and (2) a Lumix DMC-FZ28 digital camera (Panasonic Corporation, Secaucus, NJ, USA) with a 10-megapixel resolution, a focal length of 4.8 to 86.4 mm, a 1/2.33-inch charge-coupled device (CCD), ISO 100-6,400, and a cost of $450 (plus $400 for support system and light box). The digital camera is hereafter referred to as LUMIX.
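The resolution figures quoted above are related by simple unit conversions, which can be checked with a short sketch. The function names below are illustrative and are not part of the study software; the only inputs are the dpi values and the pixel matrix stated in the text.

```python
MM_PER_INCH = 25.4


def pixel_pitch_um(dpi: float) -> float:
    """Pixel pitch in micrometres for a resolution given in dots per inch."""
    return MM_PER_INCH / dpi * 1000.0


def pixels_per_mm(dpi: float) -> float:
    """Sampling density in pixels per millimetre for a resolution in dpi."""
    return dpi / MM_PER_INCH


def matrix_size_cm(columns: int, rows: int, pitch_um: float) -> tuple[float, float]:
    """Physical size (width, height) in centimetres covered by a pixel matrix."""
    return columns * pitch_um / 10_000.0, rows * pitch_um / 10_000.0


if __name__ == "__main__":
    print(pixel_pitch_um(508))   # CR and printer: about 50 micrometres per pixel
    print(pixels_per_mm(508))    # about 20 pixels/mm
    print(pixel_pitch_um(875))   # ICR digitizer: about 29 micrometres per pixel
    print(matrix_size_cm(3560, 4640, 50.0))  # CR matrix, within the 18 x 24 cm format
```

Under these conversions, the 3,560 × 4,640-pixel CR matrix at 50 μm per pixel covers roughly 17.8 × 23.2 cm, which is consistent with the 18 × 24 cm format.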
For each patient (case), the following four case studies were obtained: (1) the printed film, hereafter referred to as the FILM, and three images in digital form, including (2) images from the CR (3,560 × 4,640-pixel matrix and 14-bit grayscale), (3) images digitized with the ICR (2,436 × 3,636-pixel matrix and 8-bit grayscale), and (4) images digitized with the LUMIX (2,538 × 3,463-pixel matrix and 8-bit grayscale). This procedure was completed for each of the 70 sample mammograms, producing 280 case studies. DICOM-compliant software that was developed at our institution and previously tested in several studies [21][22][23][24] was used to scan, store, and display the cases (see Figure 2).

Display.
At a cost of $8,500, a DICOM-compliant 3-MPixel MD213MG (NEC Display Solutions, Tokyo, Japan) medical-grade grayscale display, with a dot pitch of 0.21 mm, a spatial resolution of 2,048 × 1,536 pixels, maximum luminance of 1,450 cd/m², and 10-bit grayscale (i.e., 1,024 gray levels), was used as the display monitor.

Data Analysis.
To compare the AUC for the detection of patients with cancer, analyses of variance (ANOVA) of the pseudovalues of the AUC were performed using DBM-MRMC 2.3 software [21]. Using the BI-RADS final assessment category as the endpoint variable, we classified all readings as negative (BI-RADS 2 and 3) or positive (BI-RADS 4A, 4B, 4C, and 5) [9,12,18], and we calculated contingency tables for these values, that is, the total of true positives (tTP), the total of true negatives (tTN), the total of false positives (tFP), and the total of false negatives (tFN). The common diagnostic metrics were calculated from these totals as follows: PPV = tTP/(tTP + tFP), sensitivity = tTP/(tTP + tFN), accuracy = (tTP + tTN)/(total sample), and specificity = tTN/(tTN + tFP), together with the area under the receiver operating characteristic (ROC) curve (AUC). In addition, we calculated the proportion of true positives TP = tTP/(total sample), the proportion of true negatives TN = tTN/(total sample), the proportion of false positives FP = tFP/(total sample), and the proportion of false negatives FN = tFN/(total sample).
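The dichotomization of BI-RADS categories and the metric definitions above can be sketched in a few lines. This is an illustration only, not the analysis code used in the study: the function names and the toy readings are hypothetical, and the sketch ignores the clustering of readings by reader and case that the GEE analysis accounts for.

```python
# BI-RADS final assessment categories treated as positive or negative readings.
POSITIVE = {"4A", "4B", "4C", "5"}
NEGATIVE = {"2", "3"}


def contingency(readings):
    """Tally tTP, tTN, tFP, tFN from (birads_category, has_cancer) pairs."""
    counts = {"tTP": 0, "tTN": 0, "tFP": 0, "tFN": 0}
    for category, has_cancer in readings:
        if category in POSITIVE:
            counts["tTP" if has_cancer else "tFP"] += 1
        elif category in NEGATIVE:
            counts["tFN" if has_cancer else "tTN"] += 1
        else:
            raise ValueError(f"unexpected BI-RADS category: {category!r}")
    return counts


def metrics(counts):
    """Derived metrics and proportions, following the formulas in the text."""
    tp, tn, fp, fn = (counts[k] for k in ("tTP", "tTN", "tFP", "tFN"))
    total = tp + tn + fp + fn
    return {
        "PPV": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
        "TP": tp / total,
        "TN": tn / total,
        "FP": fp / total,
        "FN": fn / total,
    }


if __name__ == "__main__":
    # Made-up readings, not study data.
    toy = [("4A", True)] * 3 + [("2", False)] * 4 + [("5", False)] + [("3", True)] * 2
    print(metrics(contingency(toy)))
```

Note that the four proportions TP, TN, FP, and FN sum to 1 by construction, since each reading falls into exactly one cell of the contingency table.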
These variables and the difference between the compared modalities were evaluated using generalized estimating equations (GEE) with the IBM SPSS Statistics 19 software (IBM Corp., Armonk, NY, USA). With the purpose of evaluating noninferiority and equivalence, the mean differences and their standard errors were obtained from DBM-MRMC and SPSS software. The hypothesis test for equivalence was as follows: the null hypothesis Ho was |Mean Difference (A − B)| − δ = 0 and the alternative hypothesis Ha was |Mean Difference (A − B)| − δ < 0, where A and B are the two modalities compared and δ (delta) is the maximum allowable difference permitted to conclude equivalence or noninferiority, as suggested by several authors in recent years [14][15][16][17]. We calculated a (1 − 2α) × 100% confidence interval for all comparisons, which is a method to evaluate equivalence [16,17]. The significance level was set to 5% (i.e., α = 0.05) and δ was set to 0.1, as this was the difference established in the sample selection to evaluate the area under the ROC curves. We were interested in evaluating equivalence using lower values for δ, in particular δ = 0.05, to assess the PPV and sensitivity for screening purposes. Finally, we calculated the required value of δ to claim equivalence for each variable and comparison.
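As an illustration of the equivalence test described above, the following minimal sketch evaluates the statistic (|difference| − δ)/SE for one paired comparison, where δ is the equivalence margin and α the significance level. It assumes a standard normal reference distribution, which the text does not specify, and it takes the required δ to be the smallest margin at which the one-sided test becomes significant; both are assumptions of this sketch. The function name and the example numbers are hypothetical and are not study results.

```python
from statistics import NormalDist


def equivalence_test(diff, se, delta=0.1, alpha=0.05):
    """One-sided test of Ho: |diff| - delta = 0 against Ha: |diff| - delta < 0.

    diff  : estimated mean difference between two modalities
    se    : standard error of that difference
    delta : equivalence margin
    alpha : significance level
    Assumes a standard normal reference distribution (an assumption of this sketch).
    """
    normal = NormalDist()
    z = (abs(diff) - delta) / se
    p_value = normal.cdf(z)              # small when |diff| is well inside the margin
    z_crit = normal.inv_cdf(1 - alpha)
    # (1 - 2*alpha) confidence interval; equivalence holds when it lies within (-delta, delta)
    ci = (diff - z_crit * se, diff + z_crit * se)
    required_delta = abs(diff) + z_crit * se  # margin at which the test just reaches alpha
    return {
        "z": z,
        "p_value": p_value,
        "ci": ci,
        "equivalent": p_value < alpha,
        "required_delta": required_delta,
    }


if __name__ == "__main__":
    # Hypothetical difference and standard error, chosen only for illustration.
    for margin in (0.1, 0.05):
        print(margin, equivalence_test(diff=0.03, se=0.02, delta=margin))
```

With these made-up inputs the comparison is equivalent at δ = 0.1 but not at δ = 0.05, and the required δ falls between the two, which mirrors how a tighter margin makes equivalence harder to demonstrate for the same data.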

Procedure.
Each radiologist read each case using the following viewing methods: the film on a light box and three viewings on the medical display for the digital cases of CR, ICR, and LUMIX. Pairs of patients and devices were presented at random by the software in such a way that at least 30 different patients were read before a patient was repeated for any radiologist. At each reading, the radiologist determined the variables mentioned in the section entitled "Variables Observed by the Radiologists." Each radiologist received training in the use of the viewer software before the readings were initiated. A pilot study was conducted to determine the usefulness of the viewer software and the interpretation form. The software provides case blinding and several image manipulation tools to adjust the window/level, brightness, and contrast, as well as histogram tools (e.g., the average optical density, histogram equalization, and full-scale histogram stretching). These tools may be combined with the overall zoom and the magnifying glass. These tools were available for all images and could be used at the observer's discretion to improve image quality, especially for patients with dense breasts and amorphous calcifications. The readings were performed over the course of ten months in two- or four-hour sessions by each radiologist, with no time limitations for each reading.
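One way to generate a random presentation order with the spacing property described above is sketched below. This is a hypothetical construction, not the algorithm of the viewer software, which the text does not describe: the readings are organized into one round per device, and the patients read last in one round are kept out of the first positions of the next round. The function name, parameters, and seed are illustrative.

```python
import random


def reading_schedule(n_patients=70, devices=("FILM", "CR", "ICR", "LUMIX"),
                     min_gap=30, seed=0):
    """Random order of (patient, device) readings for one radiologist.

    Guarantees that at least `min_gap` different patients are read between
    two readings of the same patient. Hypothetical construction for illustration.
    """
    if n_patients < 2 * min_gap:
        raise ValueError("this construction needs n_patients >= 2 * min_gap")
    rng = random.Random(seed)
    patients = list(range(n_patients))
    # Each patient gets its own random device order, one device per round.
    device_order = {p: rng.sample(devices, len(devices)) for p in patients}

    schedule, previous = [], None
    for round_index in range(len(devices)):
        if previous is None:
            order = patients[:]
            rng.shuffle(order)
        else:
            # Patients read in the last `min_gap` slots of the previous round
            # may only appear after the first `min_gap` slots of this round.
            head, tail = previous[:-min_gap], previous[-min_gap:]
            rng.shuffle(head)
            early, late = head[:min_gap], head[min_gap:] + tail
            rng.shuffle(late)
            order = early + late
        schedule.extend((p, device_order[p][round_index]) for p in order)
        previous = order
    return schedule


if __name__ == "__main__":
    plan = reading_schedule()
    print(len(plan), plan[:5])
```

The round structure is a design choice of this sketch: it makes the spacing guarantee easy to prove, at the price of being less random than an unconstrained shuffle with rejection of invalid orders.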

Mean Values by Device.
The mean values, standard error of the mean, and the 95% confidence interval for each device and each calculated variable presented in the data analysis section are shown in Table 3. Each of these means was calculated from 490 observations (70 cases and seven radiologists). The TN ranged from 0.46 to 0.48, the TP ranged from 0.30 to 0.35, the FN ranged from 0.12 to 0.17, and the FP ranged from 0.05 to 0.07. The mean values for the derived variables were as follows: accuracy ranged from 0.77 to 0.82, the PPV ranged from 0.67 to 0.68, sensitivity ranged from 0.64 to 0.74, specificity ranged from 0.87 to 0.90, and the AUC ranged from 0.87 to 0.90.
The mean differences and the equivalence tests for the TN, TP, FN, and FP by paired devices are shown in Table 4 (for δ = 0.1) and Table 5 (for δ = 0.05), and the equivalence tests for accuracy, the PPV, sensitivity, specificity, and the AUC by paired devices are shown in Table 6 (for δ = 0.1) and Table 7 (for δ = 0.05). For both Tables 4 and 6, the equivalence test was performed using δ = 0.1, the original setting of this study in terms of the AUC; in addition, a value of δ = 0.05 was included, as explained previously. In the last column of both Tables 5 and 7, the calculated value of δ required for each variable and comparison to conclude equivalence between the compared devices is presented. The absolute differences for the calculated variables were as follows: the TN differences ranged from 0.000 to 0.016, the TP differences ranged from 0.008 to 0.047, the FN differences ranged from 0.008 to 0.047, and the FP differences ranged from 0.000 to 0.016. For δ = 0.1, all the comparisons in Table 4 showed equivalence (P < 0.001); for δ = 0.05, most comparisons (20) showed equivalence (P values ranged from 0.0001 to 0.0347), while the tests for the TP and the FN in LUMIX versus FILM and ICR versus FILM were not significant, so equivalence was not shown; nevertheless, the required δ to achieve equivalence was near 0.05 (0.051 and 0.069).

Mean Difference Values and the Equivalence Test by Paired Devices for the Derived Variables.
The absolute differences for the derived variables were as follows: the accuracy differences ranged from 0.010 to 0.057, the PPV differences ranged from 0.002 to 0.009, the sensitivity differences ranged from 0.017 to 0.100, the specificity differences ranged from 0.000 to 0.031, and the AUC differences ranged from 0.009 to 0.034. For δ = 0.1, all the comparisons for accuracy, the PPV, and specificity showed equivalence (P values ranged from 0.0001 to 0.004), while for sensitivity, again, the tests for the LUMIX-FILM and ICR-FILM comparisons were not significant, so equivalence was not shown. For δ = 0.1 in the AUC tests, the comparisons showed statistical equivalence for the following pairs: LUMIX versus CR (P = 0.008), LUMIX versus FILM (P = 0.04), and FILM versus CR (P = 0.03); in the LUMIX versus ICR comparison, equivalence was not found, but the noninferiority of LUMIX was observed (P = 0.046); for ICR versus CR and ICR versus FILM, neither equivalence nor noninferiority was noted, and the required values of δ to achieve equivalence were 0.133 and 0.118, respectively. However, for δ = 0.05, less consistency was observed. Only the paired comparisons for the PPV were all equivalent (P < 0.001); for specificity, the LUMIX versus FILM comparison failed to show equivalence (P = 0.15). For δ = 0.05, in paired comparisons for the other derived variables, few tests confirmed equivalence: three showed equivalent accuracy, three showed equivalent sensitivity, and only one showed an equivalent AUC.
In general, the required values of δ to confirm equivalence ranged from 3.4% to 8.4% for accuracy, 0.7% to 1.5% for the PPV, 6.8% to 14.2% for sensitivity, 3.8% to 6.1% for specificity, and 7.3% to 13.3% for the AUC.

Evaluations of Dense Breasts.
We ran the GEE analysis using only the readings of cases with heterogeneously dense and extremely dense breasts (21 patients by 7 radiologists: 147 interpretations) for the TP, TN, FP, FN, PPV, sensitivity, specificity, and accuracy evaluations (see Table 8). The best values of these variables were observed for FILM; nevertheless, the values for the digital images were very similar regardless of whether the device had the highest or the lowest resolution, that is, CR or LUMIX, respectively. In pairwise comparisons between the high-resolution device (CR) and the low-resolution devices (ICR and LUMIX), the results were as follows: between CR and ICR, no significant differences were observed for the TP, TN, FP, FN, PPV, negative predictive value (NPV), sensitivity, specificity, and accuracy; between CR and LUMIX, no significant differences were observed for the TP, TN, FP, FN, sensitivity, specificity, and accuracy. In pairwise comparisons between printed film (FILM) and the three digital devices (CR, ICR, and LUMIX), the results were as follows: no significant differences were found for the TP and TN, nor for the FP, FN, NPV, specificity, and accuracy, while differences were noted for the sensitivity and PPV between FILM and CR. In comparisons between FILM and ICR and LUMIX, which are digital images with lower resolutions, differences were noted in the TP, FN, PPV, NPV, sensitivity, and accuracy, and for the specificity between FILM and LUMIX, while no differences were observed for the TP, FP, and NPV. High values for the AUC (ranging from 0.86 to 0.90), with no significant differences, were found among the four devices (P = 0.186).
Note to the equivalence tables: α, significance of the test (0.05); δ, difference of the means allowed to achieve equivalence; test statistic for the difference of the compared devices, that is, (|Difference (A − B)| − δ)/SE; retained hypothesis of equivalence at level α ("Ha" indicates equivalence achieved and "Ho" indicates failing to reject the null hypothesis).
As we found many nonsignificant differences (P > 0.05), we performed equivalence analyses, finding δ (delta) values for which equivalence may be claimed with significant values. In this analysis, LUMIX and CR achieved TP equivalent at 4%, while ICR and CR achieved TP equivalent at 2.2%. The TN were equivalent at 7.3% for CR-LUMIX and 6.1% for CR-ICR. Sensitivities were equivalent at 6.5% for CR-LUMIX and 3.6% for CR-ICR. The PPV values were equivalent at 7.5% for CR-LUMIX and 1.8% for CR-ICR. Only for specificity comparisons were the equivalence values larger than 10%. In this analysis, LUMIX and CR achieved AUC values equivalent at 4.9%, while ICR and CR achieved AUC values equivalent at

Discussion
The values observed for the AUC for each device ranged from 0.87 to 0.90. These accuracies were higher than the assumed accuracy used in the sample size calculation for this study (i.e., 0.75). In the paired comparisons, small differences were observed for most derived variables; for the PPV, which is one of the most important variables in mammography [4][5][6], all differences were at most 0.9% (0.009). In contrast, the largest difference identified among the paired comparisons was 10.0% (0.1) for sensitivity in the comparison of ICR and FILM. Readings from the LUMIX, which was the lowest-cost device in this study, were equivalent to CR in terms of accuracy, the PPV, sensitivity, specificity, and the AUC for δ = 0.1. This is important because the LUMIX images were obtained after printing CR images on film and digitizing them with the camera, which may deteriorate the quality of these images. Comparing LUMIX with ICR (which is approximately 30 times more expensive than LUMIX), equivalence was observed in terms of accuracy, the PPV, sensitivity, and specificity, and noninferiority was observed in terms of the AUC, for δ = 0.1.
In this study, we used a value of δ = 0.1 (10%) to evaluate equivalence, which was the value used in this study and in our previous studies to calculate sample size [22,25,26]. With this value, global equivalence was observed. As a post hoc evaluation, δ = 0.05 was used to be more conservative with respect to sensitivity. With this value, fewer comparisons showed equivalence or noninferiority at a cutoff significance level of 0.05. The value of δ required to achieve equivalence may be useful in further calculations of the required sample size for similar studies.
Our results regarding dense breasts suggest that the lower quality digital images of the digital camera (LUMIX) and especially of the ICR are still good-quality, low-cost alternatives, even for heterogeneously dense and extremely dense breasts, with better performance observed for ICR than for LUMIX. The results provide support for the hypothesis that there are no significant differences between the interpretations of CR mammography examinations and soft copy examinations produced by a specialized film digitizer or a digital camera. In the same sense, our results suggest that the lower quality digital images of the LUMIX digital camera are still of adequate quality even for amorphous calcifications.
A limitation of our study, as explained before, is that all of the mammography images in this study were obtained from a referral hospital with high standards and quality equipment. Therefore, the results of this study should be revisited using film-screen mammography images obtained at rural hospitals with equipment and technical standards of varying quality. Another limitation of this study is the variability between radiologists. Consequently, it was more difficult to obtain significant results when noninferiority or equivalence margins of less than 10% were selected. A third limitation was the selection procedure to establish this margin, which must be a predetermined clinically meaningful limit. The researchers of this study did not agree on whether to set the noninferiority or equivalence margin at 5%, 10%, or another more appropriate value, and of course, this value may be different for each calculated variable (e.g., sensitivity, specificity, the PPV, and the AUC). This disagreement stems from a lack of knowledge of the actual values that these variables take when our radiologists interpret routine mammograms. In this sense, this study is a first estimation of these values and can be used to improve sample size calculations in further studies at our hospital.
In our analysis, the specificity and AUC values were high, whereas the accuracy and PPV were moderate, and the sensitivity values were relatively low. Other studies have compared film-screen mammography with digitized film [8][9][10]12] and reported no significant differences in their diagnostic accuracy, but these studies were not equivalence or noninferiority evaluations, and their statistical power was not reported. In our study, we measured mean AUC values that were similar to or higher than those reported by Powell et al. [8], Gitlin et al. [9], and Pisano et al. [10] and by Lewin et al. using FFDM [18]. To our knowledge, no previous study has evaluated the equivalence or noninferiority of the performance of observers reading mammograms that were captured with a digital camera.
Screening may have side effects associated with false positives. Previous studies have shown that one in three test results leads to biopsy, which often turns out to be negative for cancer. Even when cancer is ruled out by the pathology results, high rates of testing generate a 33% cost overrun for screening [27] and cause permanent anxiety in patients [1]. Moreover, 50% of cancer patients survive regardless of whether they were enrolled in a screening program [2]. The risk of false positives should be maintained below 10% by comparing successive screening mammograms at intervals of 12 to 18 months until the patient's life expectancy is less than 10 years [22]. In our study, low FP were noted (<6.7%) for all devices, which is important for reducing stress in patients and the health system costs.
The principal difference between our study and previous studies is that we performed an equivalence or noninferiority evaluation instead of a conventional two-sided test of the null hypothesis of no difference. Many previously published articles used the latter and reported no statistical differences without reporting the statistical power.

Conclusion
In conclusion, our findings suggest that telemammography screening programs may be provided to underserved populations at a low cost, using a film digitizer or a digital camera, within an equivalence margin of 10% in terms of the sensitivity, specificity, positive predictive value, accuracy, and the area under the receiver operating characteristic curve. To increase the power of equivalence or noninferiority tests for margins of 5%, more images or more observers must be included in the study.