We investigate the performance of different classification models and their ability to recognize prostate cancer in an
early stage. We build ensembles of classification models in order to increase the classification performance. We measure
the performance of our models in an extensive cross-validation procedure and compare different classification
models. The datasets come from clinical examinations and some of the classification models are already in use to support
the urologists in their clinical work.
1. Introduction
Prostate cancer is one of the most common types of cancer among male patients in the western world. The number of expected
new cases in the USA for the year 2006 was 235,000 with 27,000 expected deaths [1]. Early detection of
prostate cancer improves the chances of a curative treatment and a lot of
progress has been made in this field during the last decade. The early
detection is considerably enhanced by the measurement of prostate-specific
antigen (PSA) in conjunction with other clinically available data like age,
digital rectal examination (DRE), and transrectal ultrasonography (TRUS)
variables like prostate volume. We compared several classification models and
analyzed their performance on the clinical dataset with an extended
cross-validation procedure. The models were linear discriminant analysis (LDA),
penalized discriminant analysis (PDA) [2], logistic regression [3], classification and
regression trees (CARTs) [4], multilayer perceptron (MLP) [5], support vector machines
(SVMs) [6, 7], and nearest neighbour
classifiers [8]. All
these models are implemented in an open-source Matlab-toolbox that is available
on the internet [9].
This study will help to improve the software package ProstataClass [10] which was
developed at Charité and currently uses an artificial neural network as
classification engine. This program is successfully used in clinical practice
for several years.
2. Data
We had access to the clinically available data of 506 patients with 313 cases of prostate
cancer (PCa) and 193 non-PCa. The data were selected from a group of 780
patients randomly. The data entry for each patient included age, PSA, the ratio
of free to total prostate-specific antigen (PSA-Ratio), TRUS, and the
diagnostic finding from the DRE which was a binary variable (suspicious or
nonsuspicious). Blood sampling and handling were performed as described in
Stephan et al. [11].
The samples were taken before any diagnostic or therapeutic procedures, and
sera were stored at 80°C until analyzed.
After thawing at room temperature, samples were processed within 3 hours.
Prostate volume was determined by transrectal ultrasound using the prolate
ellipse formula. The scatter plot of the variables under investigation is shown
in Figure 1. PCa and non-PCa patients were histologically confirmed by 6–8
core prostate biopsy.
A scatterplot matrix of the data. Each box shows a pair of variables and the cases are
color-coded, a red cross marks PCa, and a blue circle non-PCa. The DRE is a binary variable (suspicious or nonsuspicious).
3. Ensembles
The average output of several different models fi(x) is called an
ensemble model:f^(x)=∑i=1Kωifi(x),where we assume that the model
weights ωi sum to one ∑i=1Kωi=1. There are several suggestions concerning the choice
of the model weights (see Perrone and Cooper [12]) but we decided to use uniform weights with ωi=1/K for the sake of
simplicity and not to run into overfitting problems as reported by Krogh and Sollich
[13].
The central feature of the ensemble approach is the
generalization ability of the resulting model. In the case of regression models
(with continuous output values), it was shown that the generalization error of
the ensemble is in the average case lower than the mean of the generalization
error of the single-ensemble members (see Krogh and Vedelsby 1995 [14]). This holds in general,
independent of the model class, as long as the models constituting the ensemble
are diverse with respect to the hypothesis of the unknown function. In the case
of (binary) classification models, the situation was not so clear because the
classical bias-variance decomposition of the squared error loss in regression
problems (Geman et al. [15]) had to be extended to the zero-one loss function.
There are several approaches dealing with this problem, see Kong and Dietterich
[16], Kohavi and Wolpert [17], or Domingos [18].
The zero-one loss function is not the only possible
choice for classification problems. If we are interested in a likelihood
whether a sample belongs to one class or not, we can use the error loss from
regression and consider the binary classification problem as a regression
problem that works on two possible outcomes. In practice, many classifiers are
trained in that way.
Our ensemble approach is based on the observation that
the generalization error of an ensemble model could be improved if the models
on which averaging is done disagree and if their residual errors are
uncorrelated [13]. To
see this, we have to investigate the contribution of the single model in the
ensemble to the generalization error. We consider the case where we have a
given dataset D={(x1,y1),…,(xN,yN)} and we want to
find a function f(x) that
approximates y at new
observations of x. These observations are assumed to come from the same
source that generated the training set D, that is, from the same (unknown) probability
distribution P. It should be noted that f depends also on D. The expected generalization error Err(x,D) given a
particular x and a training
set D isErr(x,D)=E[(y−f(x))2∣x,D],where the expectation E[⋅] is taken with
respect to the probability distribution P. We are now interested inErr(x)=ED[Err(x,D)],where the expectation ED[⋅] is taken with
respect to all possible realizations of training sets D with fixed
sample size N. According to Geman et al. [15], the bias/variance
decomposition of Err(x) isErr(x)=σ2+(ED[f(x)]−E[y∣x])2+ED[(f(x)−ED[f(x)])2]=σ2+Bias(f(x))2+Var(f(x)),where E[y∣x] is the
deterministic part of the data and σ2 is the variance
of y given x. Balancing between the bias and the variance
terms
is a crucial problem in model building. If we try to
decrease the bias term on a specific training set, we usually increase the
variance term and vice versa. We now consider the case of an ensemble average f^(x), consisting of K individual
models as defined in (1). If we put this into (4), we getErr(x)=σ2+Bias(f^(x))2+Var(f^(x)),and we can have a look at the
effects concerning bias and variance. The bias term in (5) is just the average
of the biases of the individual models in the ensemble. So we should not expect
a reduction in the bias term compared to single models. According to Naftaly et
al. [19], the variance
term of the ensemble could be decomposed in the following way:Var(f^)=E[(f^−E[f^])2]=E[(∑i=1Kωifi)2]−(E[∑i=1Kωifi])2=∑i=1Kωi2(E[fi2]−E2[fi])+2∑i<jωiωj(E[fifj]−E[fi]E[fj]),where the expectation is taken
with respect to D. The first sum in (6) marks the lower bound of the
ensemble variance and is the weighted mean of the variances of the ensemble
members. The second sum contains the cross terms of the ensemble members and
disappears if the models are completely uncorrelated [13]. So the reduction in the
variance of the ensemble is related to the degree of independence of the single
models [19].
4. Cross-Validation and Model Selection
Our model selection scheme is a mixture of bagging [20] and cross-validation. Bagging or Bootstrap
aggregating was proposed by Breiman [20] in order to improve the classification by combining
classifiers trained on randomly generated subsets of the entire training sets.
We extended this approach by applying a cross-validation scheme for model
selection on each subset and after that we combine the selected models to an
ensemble. In contrast to classical cross-validation, we use random subsets as
cross-validation folds. In K-fold
cross-validation, the dataset is partitioned into K subsets. Of
these K subsets, a
single subset is retained as the validation data for testing the model, and the
remaining K−1 subsets are
used for model training. The cross-validation process is then repeated K times with each
of the K subsets used
only once as the validation data. The K results from
the folds then can be averaged to produce a single estimation.
If we lack relevant problem-specific knowledge,
cross-validation methods could be used to select a classification method
empirically [21]. This
is a common approach because it seems to be obvious that no classification method
is uniformly superior, see, for example, Quinlan [22] for a detailed study. It is
also a common approach to select the model parameters with cross-validation
[23]. The idea to
combine the models from the K cross-validation
folds (stacking) was described by Wolpert [24].
We suggest to train several models on each CV-fold, to
select the best performing model on the validation set, and to combine the
selected models from the K-folds. If we
train models of one type but with different initial conditions (e.g., neural
networks with different numbers of hidden neurons), then we find proper values
for the free parameters of the model. We could extend that
by combining models from different classes in order
to increase the model diversity. We call this a heterogeneous ensemble or mixed ensemble and applied this method effectively to regression
problems [25] and
classification tasks [26].
Our model selection scheme works as follows: for the K-fold CV, the
data is divided K-times into a training
set and a test set, both sets containing randomly drawn subsets of
the data without replications. The size of each test set was 25% of the entire
dataset.
In every CV-fold, we train several different models
with a variety of model parameters (see Section 5 for an overview of the models
and the related model parameters). In each fold, we select only one model to
become a member of the final ensemble (namely, the best model with respect to
the test set). This means that all models have to compete with each other in a
fair tournament because they are trained and validated on the same dataset. The
models with the lowest classification error in each CV-fold are taken out and
added to the final ensemble, receiving the weight ωi=1/k (see (1)). All
other models in this CV-fold are deleted.
We can use this model selection scheme in two ways. If
we have no idea or prior knowledge, where
classification or regression method should be used to cope with a specific
problem, we could use this scheme in order to look for an empirical answer and
to compare the performance of the different model classes. The other way is the
estimation of model parameters for the different model classes described in
Section 5.
5. Classification Models
In this section, we give a short overview of the model classes that we used for
ensemble building. All models belong to the well-established collection of
machine-learning algorithms for classification and regression tasks, so details
can be found in the textbooks like, for instance, Hastie et al. [2]. The implementation of these
models in an open-source toolbox together with a more detailed description can
be found in [9]. The
toolbox is an open-source MATLAB Toolbox which allows the integration of
existing implementations of classification algorithms and it contains more then
the few model classes described in the text.
5.1. Linear Discriminant Analysis
The LDA is a simple but useful classifier. If we assume that the two classes k={0,1} have a Gaussian
distribution with mean μk and they share
the same covariance matrix Σ, then the linear discriminant function δk(x), k={0,1} is given by
δk(x)=xTΣ−1μk−12μkTΣ−1μk+log(πk),
where πk denotes the frequency of occurrence of the class labels. The predicted class labels are given by
f(x)=argmaxk=(0,1){δk(x)}.
We also implemented two modifications: the quadratic discriminant analysis (QDA) and the PDA, as
described in detail in Hastie et al. [2]. Linear method are usually conceptually simple,
robust, fast, and, in particular in high-dimensional problems, they could be
very powerful.
5.2. Logistic Regression Model
Logistic regression (Log.Reg.) is a model for binomial distributed dependent variables
and is used extensively in the medical and social sciences. Hastie et al.
[2] pointed out that
the Logistic Regression model has the same form as the LDA, the only difference
lies in the way, the linear coefficients are estimated. See Hosmer and Lemeshow for a
detailed introduction [27]. We used the binary Log.Reg. to compute the probability
of the dichotomic variable y (PCa or non-PCa) from the m independent
variables x:y=11+exp(z)
with
z=a0+∑i=1maixi,
wherein the model coefficients
are estimated with a second-order gradient decent (quadratic approximation to
likelihood function). This could be a critical issue in high-dimensional problems because these calculations are time
and memory consuming.
5.3. Multilayer Perceptron
We train a multilayer feed-forward neural network “MLP” with
a sigmoid activation function. The weights are initialized with Gaussian-distributed
random numbers having zero mean and scaled variances. The weights are trained
with a gradient descend based on the Rprop algorithm [28] with the improvements given
in [29]. The MLP works
with a common weight decay with the penalty termP(w→)=λ∑i=1N−wi21+wi2,
where w→ denotes the N-dimensional
weight vector of the MLP and a small regularization parameter λ. The number of hidden layers, the number of neurons,
and the number of regularization parameter are
adjusted during the CV-training. We further applied the concept of an ϵ-insensitive
error loss that we introduced in the context of cellular neural networks (CNNs)
[30].
5.4. Support Vector Machines
Over the last decade, SVMs have become very powerful tools in machine learning. An SVM
creates a hyperplane in a feature space that separates the data into two
classes with the maximum margin. The feature space can be a mapping of the
original features (x,x′) into a higher-dimensional space using a positive semidefinite function:
(x,x′)⟼k(x,x′).
The function k(⋅,⋅) is called the kernel function and the so-called kernel trick uses Mercer's condition,
which states that any positive semidefinite kernel k(x,x′) can be
expressed as a dot product in a highdimensional space (see [31] for a detailed
introduction). The theoretical foundations of this approach were given by
Vapnik's statistical learning theory [6, 32] and later extended to the nonlinear case [7]. We use an implementation of
SVMs that is based on the libsvm provided by Chang and Lin [33] with the standard
kernels:
k(x,x′)=(x⋅x′)linear=(x⋅x′+1)d polynomial=exp(−∥x−x′∥σ2)rbf
The parameters of the model are, with respect to the kernel-type, the polynomial degree d, the width of
the rbf σ2 and the value
concerning the cost of constrain violation during the SVM training.
5.5. Trees
Trees are conceptually simple but powerful tools for classification and regression. For
our purpose, we use the classification and regression trees (CARTs) as
described in Breiman et al. [4]. The main feature of the CART algorithm is the binary
decision role that is introduced at each tree node with respect to the
information content of the split. In this way, the most discriminating binary
splits are near the tree root building an hierarchical decision scheme. A
sketch of a decision tree is shown in Figure 3. It is known
that trees have a high variance, so they benefit from the ensemble approach
[20]. These trees
ensembles are also know as random forests. The parameters of the tree
models are related to splitting the tree nodes (the impurity measure and the
split criterion, see [2] for a detailed description).
5.6. Nearest-Neighbor Classifier
A K-nearest-neighbor
classifier (KNN) takes a weighted average over the labels zi of those
observations zi in the training
set that are closest to the query point x. This denotes asf(x)=1∑wi∑zi∈Nk(x)wizi,
where Nk(x) denotes the k-element
neighborhood of x, defined in a given metric, and wi is the related
distance. Common choices are the L1, L2, and the L∞ metrics. The
parameters of the model are the number of neighbors and the choice of the
metric. KNNs offer a very intuitive approach to classification problems because
they are based on the concept of similarity. This works fine in lower
dimensions but leads to problems in higher dimensions, known as the curse of
dimensionality [34].
6. Application to the Clinical Data
We compared the model classes described above in a unified framework under fair conditions.
Thus, we trained an ensemble of each model class consisting of 11 ensembles
members (11 CV-folds in the training scheme described in Section 4). The
performance of each ensemble model was assessed on the 20% of data (validation
set), which was initially removed and never included in model training (see
Figure 2). This procedure was independently repeated 20 times. This means that
all model-building processes, that is, the random removal of 20% of the data,
the construction of a classification model ensemble on the remaining 80% of the
data as outlined in Section 4, and the final test on the unseen validation data
were performed each time. Finally, the mean average prediction values with
respect to the validation set were calculated and are listed in Table 2. In some cases,
it is useful to apply a kind of data preprocessing
like balancing. If the distribution of the two classes differ in the sense,
that one class is only represented with a small number of examples, we can
balance the data in the training set. This can improve the convergence of
several training algorithms and has also an impact to the classification error
[35]. We apply
balancing in the way that we reduce the number of samples in the one class
until we have an balanced ratio of the class labels. The ratio of the class
labels in the validation set was never changed because it reflects the real
data distribution. Balancing was only applied to the training data. We used
three different performance measures in order to compare the different
classification models. Therefore, we have to define the four possible outcomes
of a classification that can be formulated in a 2×2 confusion
matrix, as shown in Table 1. The
accuracy,
Accuracy=tp+tntp+tn+fp+fn,seems to be the canonical error
measure for almost all classification problems if the dataset is balanced.
Other important measures are the specificity that quantifies how well a binary
classification model correctly identifies the negative cases (non-PCa
patients),Specificity=tntn+fp,and the sensitivity, which is
the proportion of true positives of all diseased cases (PCa patients) in the
population,Sensitivity=tptp+fn.A high sensitivity is required
when early diagnosis and treatment are beneficial,
which is the case in PCa.
The confusion matrix for a binary classification problem.
predicted class + 1
predicted class − 1
Real class + 1
True positive (tp)
False negative (fn)
Real class − 1
False positive (fp)
True negative (tn)
The average performance of several classifier
ensembles with respect to the validation set which was initially removed and
never included in model training. We show the mean and the standard deviation
values from 20 independent validation runs, no preprocessing was used.
Accuracy
F-score
AUC
SPS95
PDA
0.776 ± 0.026
0.823 ± 0.026
0.863
0.454
Log.Reg.
0.778 ± 0.038
0.823 ± 0.036
0.868
0.484
MLP
0.791 ± 0.045
0.823 ± 0.04
0.863
0.453
SVM
0.795 ± 0.023
0.833 ± 0.02
0.825
0.142
CART
0.757 ± 0.03
0.809 ± 0.026
0.843
0.394
KNN
0.756 ± 0.036
0.813 ± 0.032
0.809
0.309
Mixed
0.783 ± 0.03
0.828 ± 0.026
0.860
0.457
For every partition of the cross-validation, the data is divided in a training and a test set. The performance of
each ensemble model was assessed on validation set which was initially removed and never included in model training.
A sketch of a classification tree, wherein the leaves represent classes and the branches represent conjunctions
of features that lead to those classes.
The precision or positive predictive value (PPV) is
given byPPV=tptp+fp,and is the proportion of
patients with positive test results who are correctly diagnosed. The F-Score is
the harmonic mean of precision and sensitivity,F-Score=2⋅Sensitivity⋅PPVSensitivity+PPV,and it is
useful if the classes in the classification problem
are not equally distributed. Another measure is the area under curve (AUC)
wherein the curve is the receiver operating characteristic
(ROC-curve)curve. The ROC-curve is the graphical
plot of the sensitivity versus the (1-specificity) for a binary classifier as
its discrimination threshold is varied.
The ROC-curve offers the opportunity to calculate the
specificity at a fixed sensitivity level and vice versa. This is important
because, from the clinical point of view, a high sensitivity 95% is wanted to
detect all patients with PCa first. To avoid a high false-positive rate, we
computed the specificity at the level of 95% sensitivity
(SPS95) from the ROC-curve as another important performance measure.
To have an impression about the correct classified
non-PCa patients in this case, we computed the specificity at the level of 95% sensitivity
(SPS95) from the ROC-curve. If we compare
the outcome of the statistical analysis of the model performance as listed in
Table 1 for the unbalanced case and in Table 2 for the balanced case, we can
state that the differences between the different classifiers are marginal. Even
the more sophisticated classification models (SVMs or Mixed Ensembles) could
not outperform the robust linear candidates (LDA/PDA).
Tables 2 and 3 present the main results with only
small differences between the classifiers. The standard deviations of the
performance measures are given except for the ROC-curve-based measures (AUC and
SPS95). Most papers in this field do not discuss this really complex problem
and it cannot be solved in the scope of this paper, but it should be mentioned.
As an example of a special solution of this problem, see the paper of Hilgers
[36].
The average performance of several classifier
ensembles with respect to the validation set which was initially removed and
never included in model training. We show the mean and the standard deviation
values from 20 independent validation runs wherein the training data was
balanced.
Accuracy
F-score
AUC
SPS95
PDA
0.772 ± 0.034
0.809 ± 0.035
0.861
0.414
Log.Reg.
0.792 ± 0.03
0.834 ± 0.027
0.868
0.458
MLP
0.766 ± 0.027
0.787 ± 0.029
0.858
0.451
SVM
0.786 ± 0.038
0.816 ± 0.042
0.821
0.051
CART
0.755 ± 0.031
0.792 ± 0.029
0.841
0.376
KNN
0.726 ± 0.032
0.766 ± 0.034
0.801
0.297
Mixed
0.789 ± 0.033
0.830 ± 0.026
0.867
0.445
7. Conclusions
We compared several
classification models with respect to their ability to recognize prostate
cancer in an early stage. This was done in an
ensemble framework in order to estimate proper model parameters and to increase
classification performance. It turned out that all models under investigation
are performing very well with only marginal differences and are compareable
with similar studies, like, for example, Finne et al. [37], Remzi et al. [38], or Zlotta et al. [39]. In future research, it
should be investigated whether these results are valid for other populations of
patients (e.g., screening data) and other PSA test assays and whether the
performance of classification could be increased by including new variables or
by splitting the groups of patients into different PSA ranges.
JemalA.SiegelR.WardE. M.ThunM. J.Cancer facts & figures2006Atlanta, Ga, USADepartment of Epidemiology and Surveillance Research, American Cancer
SocietyHastieT.TibshiraniR. FriedmanJ.2001New York, NY, USASpringerSpringer Series in StatisticsBardY. 1974New York, NY, USAAcademic PressBreimanL.FriedmanJ.OlshenR.StoneC.1993Monterey, Calif, USAWadsworth and BrooksBishopC. M.1995Oxford, UKOxford University PressVapnikV. N.1999New York, NY, USASpringerBoserB. E.GuyonI.VapnikV. N.A training algorithm for optimal margin classifiersProceedings of the 5th Annual ACM Conference on Computational Learning Theory (COLT '92)July 1992Pittsburgh, Pa, USA14415210.1145/130385.130401MerkwirthC.ParlitzU.LauterbornW.Fast nearest-neighbor searching for nonlinear signal processing20006222089209710.1103/PhysRevE.62.2089WichardJ. D.MerkwirthC.ENTOOL—A Matlab toolbox for ensemble modeling2007http://www.j-wichard.de/entool/StephanC.CammannH.SemjonowA.Multicenter evaluation of an artificial neural network to increase the prostate cancer detection rate and reduce unnecessary biopsies200248812791287StephanC.KlaasM.MüllerC.SchnorrD.LoeningS. A.JungK.klaus.jung@charite.deInterchangeability of measurements of total and free prostate-specific antigen in serum with 5 frequently used assay combinations: an update2006521596410.1373/clinchem.2005.059170PerroneM. P.CooperL. N.MammoneR. J.When networks disagree: ensemble methods for hybrid neural networks1993New York, NY, USAChapman-Hall126142KroghA.krogh@cbs.dtu.dkSollichP.P.Sollich@ed.ac.ukStatistical mechanics of ensemble learning199755181182510.1103/PhysRevE.55.811KroghA.VedelsbyJ.TesauroG.TouretzkyD.LeenT.Neural network ensembles, cross validation, and active learning19957Cambridge, Mass, USAMIT Press231238GemanS.BienenstockE.DoursatR.Neural networks and the bias/variance dilemma199241158KongE. B.DietterichT. G.Error-correcting output coding corrects bias and varianceProceedings of the 12th International Conference on Machine Learning (ICML '95)July 1995Tahoe City, Calif, USA313321KohaviR.WolpertD.SaittaL.Bias plus variance decomposition for zero-one loss functionsProceedings of the 13th International Conference on Machine Learning (ICML '96)July 1996Bari, ItalyMorgan Kaufmann275283DomingosP.A unified bias-variance decomposition for zero-one and squared lossProceedings of the 17th National Conference on Artificial IntelligenceJuly-August 2000Austin, Tex, USA564569NaftalyU.ury@tarazan.tau.ac.ilIntratorN.nin@math.tau.ac.ilHornD.horn@vm.tau.ac.ilOptimal ensemble averaging of neural networks19978328329610.1088/0954-898X/8/3/004BreimanL.leo@stat.berkeley.eduBagging predictors1996242123140SchafferC.Selecting a classification method by cross-validationProceedings of the 4th International Workshop on Artificial Intelligence and StatisticsJanuary 1993Fort Lauderdale, Fla, USA1525QuinlanJ. R.Comparing connectionist and symbolic learning methods19941Cambridge, Mass, USAMIT Press445456GuyonI.ElisseeffA.An introduction to variable and feature selection200331157118210.1162/153244303322753616WolpertD. H.Stacked generalization1992524125910.1016/S0893-6080(05)80023-1WichardJ. D.MerkwirthC.Ogorzałek M.Detecting correlation in stockmarkets20043441-230831110.1016/j.physa.2004.06.140RothfussA.andreas.rothfuss@schering.deSteger-HartmannT.HeinrichN.WichardJ. D.Computational prediction of the chromosome-damaging potential of chemicals200619101313131910.1021/tx060136wHosmerD. W.LemeshowS.1989New York, NY, USAJohn Wiley & SonsRiedmillerM.BraunH.A direct adaptive method for faster backpropagation learning: the RPROP algorithm1Proceedings of the IEEE International Conference on Neural NetworksMarch-April 1993San Francisco, Calif, USA58659110.1109/ICNN.1993.298623IgelC.HüskenM.BotheH.RojasR.Improving the Rprop learning algorithmProceedings of the 2nd International ICSC Symposium on Neural Computation (NC '02)May 2000Berlin, GermanyAcademic Press115121MerkwirthC.WichardJ. D.OgorzałekM.Stochastic gradient descent training of ensembles of DT-CNN
classifiers for digit recognition2Proceedings of the 16th European Conference on Circuit Theory and Design (ECCTD '03)September 2003Kraków, Poland337341CristianiniN.Shawe-TaylorJ.2000Cambridge, UKCambridge University PressVapnikV. N.TscherwonenkisA. J.1979BerlinAkademieChangC. C.LinC. J.Libsvm—Alibrary for support vector machines2001BellmanR. E.1961Princeton, NJ, USAPrinceton University PressMerkwirthC.OgorzałekM.WichardJ. D.Stochastic gradient descent training of ensembles of
DT-CNN classifiers for digit recognition2Proceedings of the 16th European Conference on Circuit Theory and Design (ECCTD '03)September 2003Kraków, Poland337341HilgersR. A.Distribution-free confidence bounds for ROC curves199130296101FinneP.FinneR.AuvinenA.Predicting the outcome of prostate biopsy in screen-positive men by a multilayer perceptron network200056341842210.1016/S0090-4295(00)00672-5RemziM.AnagnostouT.RaveryV.An artificial neural network to predict the outcome of repeat prostate biopsies200362345646010.1016/S0090-4295(03)00409-6ZlottaA. R.RemziM.SnowP. B.SchulmanC. C.MarbergerM.DjavanB.An artificial neural network for prostate cancer staging when serum prostate specific antigen is 10 ng./ml. or less200316951724172810.1097/01.ju.0000062548.28015.f6