External Validation of an Artificial Neural Network and Two Nomograms for Prostate Cancer Detection

Background. Multivariate models are used to increase prostate cancer (PCa) detection rate and to reduce unnecessary biopsies. An external validation of the artificial neural network (ANN) “ProstataClass” (ANN-Charité) was performed with daily routine data. Materials and Methods. The individual ANN predictions were generated with the use of the ANN application for PSA and free PSA assays, which rely on age, tPSA, %fPSA, prostate volume, and DRE (ANN-Charité). Diagnostic validity of tPSA, %fPSA, and the ANN was evaluated by ROC curve analysis and comparisons of observed versus predicted probabilities. Results. Overall, 101 (35.8%) PCa were detected. The areas under the ROC curve (AUCs) were 0.501 for tPSA, 0.669 for %fPSA, 0.694 for ANN-Charité, 0.713 for nomogram I, and 0.742 for nomogram II, showing a significant advantage for nomogram II (P = 0.009) compared with %fPSA, while the other models did not differ from %fPSA (P = 0.15 and P = 0.41). All models overestimated the PCa probability. Conclusions. Besides ROC analysis, calibration is an important tool to determine the true value of using a model in clinical practice. The worth of multivariate models is limited when external validations are performed without knowledge of the circumstances of the model's development.


Introduction
Prostate-specific antigen (PSA) is the most valuable tool for prostate cancer (PCa) detection [1]. Digital rectal examination (DRE) remains important, but, especially for PCa screening, it is less important than PSA [2]. Transrectal ultrasound (TRUS)-guided needle biopsy of the prostate is currently the simplest and most accurate method to obtain prostatic tissue for histological evaluation [3]. Although PSA is regarded as the best biochemical marker for PCa [4], an important limitation regarding its use in cancer detection is the considerable overlap between patients with PCa and those with benign prostate hyperplasia (BPH), specifically in the serum PSA range 4.0-10.0 ng/mL [4]. Percent-free PSA (%fPSA) has been proposed as a primary decision tool for first-time biopsy in men with a nonsuspicious DRE within the tPSA range 4-10 ng/mL, as well as at lower PSA values [5,6].

Patient Population.
From May 2005 to June 2008 a total of 282 patients (101 with PCa and 181 with no evidence of malignancy (NEM)) were included in the trial (median age 66 years) because of either a suspicious DRE or a PSA value between 4 and 10 ng/mL. All patients were referred by urologists for PCa screening. None of the included patients had previously undergone a TRUS-guided biopsy or a transurethral resection of the prostate.

Clinical and Pathologic Evaluation.
The Beckman Access PSA assay was used for 195 patients and the Roche Elecsys 2010 for 87 patients. Clinical stage was defined according to the sixth edition of the American Joint Committee on Cancer Staging Manual [23]. Blood samples were taken before prostate manipulation and centrifuged within 2-3 h after venipuncture. Serum was analyzed on the same day. Twelve-core systematic TRUS-guided biopsies were performed in all subjects as described elsewhere [24]. All biopsy specimens were histologically graded according to the Gleason grading system by two pathologists. Total prostate volume was calculated with the prolate ellipsoid formula (volume = 0.52 × length × width × height). A DRE finding nonsuspicious for cancer was defined as negative, and a finding suspicious for cancer as positive.
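As a brief illustration of the volume calculation described above, the prolate ellipsoid formula can be sketched in Python. The function names are hypothetical, and the assumption that the three dimensions are given in centimeters (so that the result is in cc) is ours rather than taken from the study protocol. %fPSA is computed with its usual definition, free PSA divided by total PSA, expressed as a percentage.

```python
def prostate_volume_cc(length_cm, width_cm, height_cm):
    """Prolate ellipsoid approximation: volume = 0.52 * length * width * height.

    Assumes all three dimensions are in centimeters, giving a volume in cc.
    """
    return 0.52 * length_cm * width_cm * height_cm


def percent_free_psa(fpsa_ng_ml, tpsa_ng_ml):
    """Percent-free PSA: free PSA as a percentage of total PSA."""
    if tpsa_ng_ml <= 0:
        raise ValueError("total PSA must be positive")
    return 100.0 * fpsa_ng_ml / tpsa_ng_ml


if __name__ == "__main__":
    # Illustrative values only; these are not patient data from the study.
    print(round(prostate_volume_cc(5.0, 4.0, 4.0), 1))
    print(round(percent_free_psa(1.5, 6.0), 1))
```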

Data Analysis.
Data from all 282 patients were applied to the online available ANN "ProstataClass" (named ANN-Charité) using both the Beckman Access and the Roche Elecsys tPSA and fPSA assays [19]. This ANN was built on 798 samples (468 PCa and 330 NEM) investigated retrospectively from archival sera collected between 2001 and 2004 [19].
The ANN model was constructed with the MATLAB Neural Network Toolbox (The Mathworks, Natick, Mass, USA). Feed-forward back-propagation networks were built in which the input layer consisted of five neurons for the variables tPSA, %fPSA, age, prostate volume, and DRE, with three neurons in the hidden layer and one output neuron whose output ranged from 0 (low PCa risk) to 1 (high PCa risk). To get the best generalization of the ANN, we used Bayesian regularization. To avoid overfitting, the number of epochs used to train the network over the entire set of input patterns was limited to 5. To compare possible population effects on model differences, two other LR-based models [7,17] built on external cohorts were applied to our cohort. Calibration, which serves to compare the predicted and observed probabilities, was performed as described before [19]. However, here the 282 patients were subdivided into 10 groups of 28 men each, ordered by their respective predicted probability. For each group the observed and mean predicted probabilities were computed.
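The grouping step of the calibration analysis can be sketched as follows. This is a minimal illustration, not the software used in the study: the function name and the input layout (one predicted probability and one 0/1 biopsy outcome per patient) are assumptions. Because 282 is not divisible by 10, the sketch assigns the leftover patients to the first groups; how the original analysis handled the remainder is not stated.

```python
def calibration_groups(predicted, observed, n_groups=10):
    """Sort patients by predicted probability and split them into n_groups.

    predicted: predicted PCa probabilities (0..1), one per patient
    observed:  biopsy outcomes, 1 for PCa and 0 for NEM, same order
    Returns a list of (mean_predicted, observed_rate, group_size) tuples.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    pairs = sorted(zip(predicted, observed), key=lambda pair: pair[0])
    base, extra = divmod(len(pairs), n_groups)
    groups = []
    start = 0
    for g in range(n_groups):
        # Assumption: the first `extra` groups take one additional patient.
        size = base + (1 if g < extra else 0)
        chunk = pairs[start:start + size]
        start += size
        if not chunk:
            continue
        mean_predicted = sum(p for p, _ in chunk) / size
        observed_rate = sum(o for _, o in chunk) / size
        groups.append((mean_predicted, observed_rate, size))
    return groups


if __name__ == "__main__":
    # Toy data for illustration only.
    for row in calibration_groups([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1], n_groups=2):
        print(row)
```

Plotting the observed rate against the mean predicted probability for each group gives the calibration curve; points below the 45° line indicate that the model overestimates risk.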

Selection of Two Other Nomograms.
Nomogram I was developed by Karakiewicz et al. [8], and nomogram II was published by Kawakami et al. [18]. Both nomograms were developed in cohorts that were very similar with regard to the number of included patients. Karakiewicz's nomograms are among the most widely used nomograms on the internet. For these reasons, we selected these nomograms for comparison with our population.

Statistical Analyses.
All 282 observations were used to assess the predictive accuracy and the performance characteristics of the ANN [19]. The individual ANN predictions were generated with the use of the web-based ANN application, which relies on age, DRE, PSA, %fPSA, and prostate volume.
We used the statistical software SPSS 17.0 for Windows (SPSS, Chicago, USA) and SigmaPlot 2001 for Windows. The nonparametric Kruskal-Wallis test, the Mann-Whitney U test, logistic regression analysis with forward variable selection, and Spearman rank correlation were carried out. The diagnostic validity of tPSA, %fPSA, and the ANN was evaluated by ROC curve analysis with calculations of the AUC and specificities at 90% and 95% sensitivity by using Graph ROC for Windows [25] and MedCalc 11.2.1 (MedCalc Software, Mariakerke, Belgium). Significance was defined at P < 0.05.
Table 1 shows the characteristics of the cohort of 282 patients used in the external validation of the ANN. Age ranged between 46 and 83 years (median: 66). In the Beckman group, PSA and %fPSA ranged from 4.01 to 9.91 ng/mL (median: 6.77) and 5% to 48% (median: 15.69%), respectively. In the Roche group, PSA and %fPSA ranged from 4.01 to 9.99 ng/mL (median: 6.98) and 4% to 31% (median: 15.63%), respectively. Of all men, 67 (23.8%) demonstrated suspicious DRE findings. Total prostate volume ranged from 7.1 to 171.0 cc (median: 42.6). Overall, 101 (35.8%) PCa were detected. Of men with suspicious DRE, 37 (55.2%) had PCa on biopsy. Table 2 shows median and mean values for age, tPSA, %fPSA, prostate volume, and DRE status for the validation cohort and the training cohort for the ANN-Charité. The percentage of PCa patients (35.8%) is much lower in our external validation cohort than in the "ProstataClass" cohort (58.6%). Comparisons between PCa and NEM within our external validation cohort and the "ProstataClass" cohort revealed significant differences for age, %fPSA, PSAD, and number of positive DREs (P always <0.05), with the exception of tPSA (P = 0.387) and volume (P = 0.900) in the external validation cohort. The ANN, which is based on age, DRE, PSA, %fPSA, and prostate volume, was 78% accurate in the original report [10].
As shown in Figure 1, ROC curve analyses for tPSA, %fPSA, and the ANN were performed for our cohort. The AUCs were 0.501 for tPSA, 0.669 for %fPSA, 0.694 for ANN-Charité, 0.713 for nomogram I, and 0.742 for nomogram II, showing a significant advantage for nomogram II (P = 0.009) compared with %fPSA, while the other models did not differ from %fPSA (P = 0.15 and P = 0.41). The ROC analyses also demonstrated a higher specificity at 95% sensitivity for nomogram I (specificity 30.4%) compared with %fPSA (specificity 12.9%), tPSA (specificity 3.96%), nomogram II (specificity 18.2%), or ANN-Charité (specificity 18.8%). At 90% sensitivity the ROC analyses demonstrated a higher specificity for nomogram I (specificity 40.9%) compared with %fPSA (specificity 25.7%), tPSA (specificity 6.93%), nomogram II (specificity 27.1%), or ANN-Charité (specificity 33.7%). These data at 90% and 95% sensitivity confirm the similarities between ANN models and nomograms.
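The two ROC summary measures reported above, the AUC and the specificity at a fixed sensitivity, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure implemented in Graph ROC or MedCalc: the function names are hypothetical, the AUC is computed through its equivalence with the Mann-Whitney statistic, and higher scores are assumed to indicate higher PCa risk.

```python
import math


def auc(case_scores, control_scores):
    """AUC as the probability that a random PCa case scores higher than a
    random NEM control (ties count as one half)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def specificity_at_sensitivity(case_scores, control_scores, target_sensitivity):
    """Specificity at the highest cutoff that still classifies at least
    target_sensitivity of the cases as positive (score >= cutoff)."""
    if not case_scores or not control_scores:
        raise ValueError("both groups must be non-empty")
    ranked = sorted(case_scores, reverse=True)
    # Small tolerance guards against floating-point error in the product.
    needed = max(1, math.ceil(target_sensitivity * len(ranked) - 1e-9))
    cutoff = ranked[needed - 1]
    true_negatives = sum(1 for score in control_scores if score < cutoff)
    return true_negatives / len(control_scores)


if __name__ == "__main__":
    # Toy scores for illustration only.
    cases = [0.9, 0.8, 0.7, 0.4]
    controls = [0.6, 0.5, 0.3, 0.2]
    print(auc(cases, controls))
    print(specificity_at_sensitivity(cases, controls, 1.0))
```

For a marker such as %fPSA, where lower values indicate higher risk, the scores would be negated before calling these functions.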

Results
Besides ROC analysis, the concordance between the predicted and observed PCa probability is a good measure of a multivariate model's quality. In Figure 2, the predicted PCa probabilities are shown in relation to the observed PCa rate for the ANN model and nomograms. In the case of total concordance, there is no difference between predicted and observed probabilities, and all points lie on the 45° line. Here the intraclass correlation coefficient (ICC) is a measure of the consistency of the observed and predicted values, and a value of 1 would be ideal. To suppress random fluctuations in the graphical representation, a cubic smoothing spline was computed to expose the relationship between predicted and observed probabilities. The intraclass correlation coefficients for the observed versus predicted probabilities were 0.802 for nomogram I, 0.611 for the ANN-Charité, and 0.657 for nomogram II. We further performed a decision curve analysis and found only marginal differences between the 3 models.
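To make the ICC concrete, the following sketch computes one common variant for paired observed and predicted values. The text does not state which ICC form was used, so the choice of the two-way, single-measure consistency ICC here is an assumption, as are the function name and the input layout (one observed rate and one mean predicted probability per calibration group).

```python
def icc_consistency(observed, predicted):
    """Two-way, single-measure consistency ICC for two paired series.

    ICC = (MSR - MSE) / (MSR + (k - 1) * MSE) with k = 2 measurements
    per group, where MSR is the between-group mean square and MSE the
    residual mean square from a two-way layout without replication.
    """
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    k = 2
    rows = list(zip(observed, predicted))
    grand = sum(a + b for a, b in rows) / (n * k)
    row_means = [(a + b) / k for a, b in rows]
    col_means = [sum(observed) / n, sum(predicted) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((a - grand) ** 2 + (b - grand) ** 2 for a, b in rows)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    denominator = ms_rows + (k - 1) * ms_error
    if denominator == 0:
        raise ValueError("ICC is undefined when all values are identical")
    return (ms_rows - ms_error) / denominator


if __name__ == "__main__":
    # Toy values for illustration only.
    print(icc_consistency([0.1, 0.2, 0.3, 0.4], [0.15, 0.2, 0.35, 0.5]))
```

Note that a consistency ICC is unaffected by a constant offset between the two series, so it would not by itself reveal a uniform overestimation; an absolute-agreement ICC would penalize such an offset.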

Discussion
In the "ProstataClass" cohort the indications for referral were increased PSA values, lower urinary tract symptoms, abnormal DRE, or biopsy-confirmed PCa, which explains the higher number of PCa patients [10,19]. Our population is a screening population for PCa, and only suspicious DRE and/or a PSA value between 4.01 and 9.99 ng/mL were indications for biopsy. This could be a reason why our detection rate is lower than in the original cohort from Charité Universitätsmedizin Berlin [19]. The ANN-Charité was created for a PSA range of 0-27 ng/mL, so it can also be applied to the PSA range of 4-10 ng/mL used in our cohort.
Different molecular forms of PSA, PSA density and velocity, or age-adjusted cutoffs improve the detection rates in screening for PCa [4]. It has been shown that the use of %fPSA significantly improves specificity by ∼15-20% compared with tPSA [10,11,26]. The AUC for %fPSA in our cohort (0.669) is significantly higher than the AUC for tPSA (0.501). Our data confirm the improved diagnostic accuracy of %fPSA. The AUCs for ANN-Charité (0.694), nomogram I (0.713), and nomogram II (0.742) were all above the %fPSA AUC, but only nomogram II reached significance. When evaluating the specificities at the clinically important cutoff of 95% sensitivity, nomogram I was, surprisingly, superior to %fPSA, tPSA, nomogram II, and ANN-Charité. These results show the clinical importance of cutoffs when using the ANN model or a nomogram instead of a single %fPSA or tPSA cutoff for the biopsy decision. It should be mentioned that published ANN models mostly provide cutoffs for a biopsy decision [11,19], while published nomograms usually estimate a PCa probability only [8,16,18]. For external users, a given cutoff for the biopsy decision is easier to handle and should therefore be preferred.
Data on PSA-assay-specific comparisons of different ANN models and nomograms regarding retrospective and prospective data generation are rare. As seen in Tables 2 and 3, one of the main aims of this study was reached only partially, since the ANN-Charité could not repeat its significantly better performance compared with %fPSA in our cohort. Possible reasons for this relatively weak performance of the ANN-Charité become apparent when comparing our cohort and the "ProstataClass" cohort.
In 2007, Stephan et al. were able to show that different ANN and LR models perform similarly when applied to the same cohort [27]. This hypothesis was confirmed in this study, where ANN-Charité and the nomograms performed similarly, but not identically, when tested in the same cohort.
We observed differences between the ANN-Charité (AUC = 0.694) and the nomograms. These could be caused by differences in PCa detection rate, age, %fPSA, PSA, and number of positive DREs between the training and test cohorts. While the overall ANN performance in the "ProstataClass" cohort was superior compared with the other cohort, the AUC difference between tPSA and the ANN models is smaller (<0.2) in the "ProstataClass" cohort compared with our cohort (0.18 to 0.25). This is mainly due to the large AUC of tPSA in the "ProstataClass" cohort, which already reaches 0.7. However, several other points showed a good comparability between the original cohort and ours. The percentage of tPSA and prostate volume did not differ between the two cohorts. The prostate volume is an important variable for this ANN. In our study, we used two systematic sextant patterns to take the biopsies in all cases. Other studies have shown that in patients with larger prostate volume a higher number of biopsy cores is useful. This should be considered as a factor affecting the prostate cancer detection rate [26]. In the cohort of Kawakami et al. in particular, the number of biopsy cores was much higher, with a mean of 20 cores [18]. Furthermore, the typical significant differences between PCa and NEM patients were visible in both cohorts.
When analyzing Figures 1 and 2, the AUC differences appear small, but the calibration curves and ICC differences are larger. The results from analyzing the Saarow cohort with the ANN-Charité failed to show an improved performance, with an AUC of only 0.694 and an ICC of only 0.611. While the two nomograms showed smaller differences in their AUC values, the differences in their performance were large when comparing the calibration curves and ICC. Thus, when only AUC values are analyzed in validation studies, differences between predicted and observed PCa detection rates may not be detected [20].
Stephan et al. [10] showed in the first multicenter evaluation, in almost 1200 men within a broad PSA range of 2-20 ng/mL, that the combination of age, DRE, PSA, %fPSA, and prostate volume clearly enhances the specificity of %fPSA by 20% at 95% sensitivity. However, this ANN was built with only one PSA and fPSA assay (Immulite 2000 systems, Siemens Healthcare Diagnostics). By using a new model of this ANN built on 5 different tPSA and fPSA assays [19], we could show that this ANN confirms the diagnostic improvements when the Beckman Coulter Access PSA assay is used. Using multivariable models has several advantages over using a single parameter for important clinical decisions and is seen as one of the future ways to maximize specificity for PCa detection [15]. We believe that paper versions of models like nomograms may not be as practical as internet- or computer-based nomogram models or ANN programs like "ProstataClass" [19] or the ANN by Finne et al. [11]. Web- or computer-based software is needed to integrate such models into clinical practice.
Regardless of the method used, nomograms and especially ANNs help to assess the patient's risk of PCa better than single parameters like %fPSA, complexed PSA, or PSA alone. Using this recently introduced ANN [19], the number of unnecessary biopsies can be reduced.

Conclusion
Our results showed limitations of multivariate models when external validations are performed without keeping in mind the circumstances of the model development, especially the population characteristics. However, because they provide usable cutoffs, models like the ANN used here are more helpful in daily routine than nomograms for increasing the PCa detection rate and reducing unnecessary biopsies.