
We investigate the performance of different classification models and their ability to recognize prostate cancer at an early stage. We build ensembles of classification models in order to increase the classification performance. We measure the performance of our models in an extensive cross-validation procedure and compare the different classification models. The datasets come from clinical examinations, and some of the classification models are already in use to support urologists in their clinical work.

Prostate cancer is one of the most common types of cancer among male patients in the Western world. The number of expected
new cases in the USA for the year 2006 was 235,000, with 27,000 expected deaths [

This study will help to improve the software package

We had access to the clinically available data of 506 patients, with 313 cases of prostate
cancer (PCa) and 193 non-PCa. The data were selected randomly from a group of 780
patients. The data entry for each patient included age, PSA, the ratio
of free to total prostate-specific antigen (PSA-Ratio), TRUS, and the
diagnostic finding from the DRE, which was a binary variable (suspicious or
nonsuspicious). Blood sampling and handling were performed as described in
Stephan et al. [ ], and the samples were kept frozen until analyzed.
After thawing at room temperature, samples were processed within 3 hours.
Prostate volume was determined by transrectal ultrasound using the prolate
ellipse formula. The scatter plot of the variables under investigation is shown
in Figure

A scatterplot matrix of the data. Each box shows a pair of variables and the cases are color-coded: a red cross marks PCa and a blue circle non-PCa. The DRE is a binary variable (suspicious or nonsuspicious).

The average output of several different models is used as the ensemble prediction.

The central feature of the ensemble approach is the improved
generalization ability of the resulting model. In the case of regression models
(with continuous output values), it was shown that the generalization error of
the ensemble is on average lower than the mean of the generalization
errors of the single ensemble members (see Krogh and Vedelsby 1995 [

The zero-one loss function is not the only possible choice for classification problems. If we are interested in the likelihood that a sample belongs to one class or the other, we can use the error loss from regression and treat the binary classification problem as a regression problem on the two possible outcomes. In practice, many classifiers are trained in that way.
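Both points can be illustrated with a small sketch. The probability outputs and targets below are invented for illustration (not data from this study); the code averages the continuous outputs of three hypothetical classifiers, treats the binary labels as regression targets in {0, 1}, and checks the Krogh-Vedelsby property that the squared-error loss of the ensemble never exceeds the mean loss of its members:

```python
# binary targets treated as regression targets 0 / 1 (toy example)
targets = [1, 0, 1, 1, 0]
member_preds = [
    [0.9, 0.3, 0.6, 0.7, 0.2],   # hypothetical model A
    [0.7, 0.1, 0.9, 0.5, 0.4],   # hypothetical model B
    [0.8, 0.4, 0.4, 0.9, 0.1],   # hypothetical model C
]

def mse(pred, target):
    """Mean squared error between continuous outputs and 0/1 targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

# the ensemble output is the per-sample average of the member outputs
ens = [sum(col) / len(col) for col in zip(*member_preds)]

avg_member_err = sum(mse(m, targets) for m in member_preds) / len(member_preds)
ens_err = mse(ens, targets)

# Krogh-Vedelsby: for squared loss, the ensemble error is never worse
# than the average error of the single members
assert ens_err <= avg_member_err

# thresholding the averaged output at 0.5 yields class labels
labels = [int(p >= 0.5) for p in ens]
```

The inequality holds for any member outputs (it follows from the convexity of the squared loss), which is why averaging is a safe default combination rule.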

Our ensemble approach is based on the observation that
the generalization error of an ensemble model can be improved if the models
over which averaging is done disagree and if their residual errors are
uncorrelated [

Our model selection scheme is a mixture of bagging [

If we lack relevant problem-specific knowledge,
cross-validation methods could be used to select a classification method
empirically [

We suggest training several models on each CV-fold,
selecting the best-performing model on the validation set, and combining the
selected models from the
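The scheme can be sketched as follows. This is a minimal toy illustration in which the candidate "models" are simple threshold rules that need no training, so only the per-fold selection and the final averaging are shown; the data, fold layout, and thresholds are invented for illustration:

```python
# toy data: the true rule is "label 1 iff x >= 6"
data = [(x, int(x >= 6)) for x in range(10)]
k = 5
# interleaved CV-folds; fold i holds the points with x % k == i
folds = [[p for p in data if p[0] % k == i] for i in range(k)]
thresholds = [3, 6, 9]  # candidate "models": predict 1 iff x >= threshold

def predict(thr, x):
    return int(x >= thr)

selected = []
for i in range(k):
    val = folds[i]                     # validation part of this CV-fold
    best_thr, best_acc = None, -1.0
    for thr in thresholds:             # pick the best candidate on the fold
        acc = sum(predict(thr, x) == y for x, y in val) / len(val)
        if acc > best_acc:             # first best wins on ties
            best_thr, best_acc = thr, acc
    selected.append(best_thr)          # one selected model per CV-fold

def ensemble_predict(x):
    # the ensemble averages the outputs of the k selected models
    return int(sum(predict(thr, x) for thr in selected) / k >= 0.5)
```

In the real scheme each candidate is trained on the training part of the fold before being scored on the validation part; the selection and averaging steps are the same.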

Our model selection scheme works as follows: for the

In every CV-fold, we train several different models
with a variety of model parameters (see Section

We can use this model selection scheme in two ways. If
we have no prior knowledge about which
classification or regression method should be used to cope with a specific
problem, we can use this scheme in order to look for an empirical answer and
to compare the performance of the different model classes. The other way is the
estimation of model parameters for the different model classes described in
Section

In this section, we give a short overview of the model classes that we used for
ensemble building. All models belong to the well-established collection of
machine-learning algorithms for classification and regression tasks, so details
can be found in textbooks such as Hastie et al. [

The LDA is a simple but useful classifier. If we assume that the two classes are Gaussian distributed with a common covariance matrix, the resulting decision boundary between them is linear.

Logistic regression (Log.Reg.) is a model for binomially distributed dependent variables
and is used extensively in the medical and social sciences. Hastie et al.
[
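For reference, the logistic model links the class probability to a linear function of the inputs through the sigmoid; here \(\beta_0\) and \(\beta\) denote the fitted intercept and coefficient vector:

```latex
p(y = 1 \mid x) \;=\; \frac{1}{1 + e^{-(\beta_0 + \beta^{\top} x)}},
\qquad
\operatorname{logit}\, p(y = 1 \mid x) \;=\; \beta_0 + \beta^{\top} x .
```

The continuous probability output fits naturally into the averaging ensemble described above.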

We train a multilayer feed-forward neural network (MLP) with
a sigmoid activation function. The weights are initialized with Gaussian-distributed
random numbers having zero mean and scaled variances. The weights are trained
with a gradient descent scheme based on the Rprop algorithm [
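The core of Rprop is that only the sign of each partial derivative is used, while a per-weight step size adapts multiplicatively. A minimal sketch of this update rule (the iRprop- variant, demonstrated on a toy quadratic objective rather than the network used in this study; all parameter values are common illustrative defaults):

```python
def rprop_minimize(grad, w, n_iter=200,
                   eta_plus=1.2, eta_minus=0.5,
                   step_max=1.0, step_min=1e-6):
    """Minimize an objective given its gradient function via Rprop.

    Only the SIGN of each partial derivative is used; each weight
    keeps its own step size, grown when the sign is stable and
    shrunk when the sign flips (an overshoot).
    """
    step = [0.1] * len(w)            # initial per-weight step sizes
    prev_g = [0.0] * len(w)
    for _ in range(n_iter):
        g = grad(w)
        for i in range(len(w)):
            if g[i] * prev_g[i] > 0:       # same sign: accelerate
                step[i] = min(step[i] * eta_plus, step_max)
            elif g[i] * prev_g[i] < 0:     # sign flip: slow down,
                step[i] = max(step[i] * eta_minus, step_min)
                g[i] = 0.0                 # iRprop-: skip this update
            if g[i] > 0:
                w[i] -= step[i]            # move against the gradient
            elif g[i] < 0:
                w[i] += step[i]
        prev_g = g
    return w

# toy objective f(w) = (w0 - 3)^2 + (w1 + 1)^2, minimum at (3, -1)
w_opt = rprop_minimize(lambda w: [2 * (w[0] - 3), 2 * (w[1] + 1)],
                       [0.0, 0.0])
```

Because the update ignores the gradient magnitude, Rprop is robust to the badly scaled gradients that sigmoid networks tend to produce.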

Over the last decade, SVMs have become very powerful tools in machine learning. An SVM
creates a hyperplane in a feature space that separates the data into two
classes with the maximum margin. The feature space can be a mapping of the
original features
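The maximum-margin idea can be made concrete by computing the geometric margin of a fixed, hand-picked hyperplane on toy 2-D data; an SVM would search for the `w` and `b` that maximize this quantity (the data and hyperplane below are illustrative only, not related to this study):

```python
import math

def geometric_margin(w, b, points):
    """Distance from the hyperplane w.x + b = 0 to the closest sample.

    An SVM chooses w and b so that this margin is as large as possible
    while the samples stay on the correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
               for x in points) / norm

# toy 2-D data and a hand-picked (not optimal) separating hyperplane
pos = [(2.0, 2.0), (3.0, 3.0)]       # class +1
neg = [(0.0, 0.0), (1.0, 0.0)]       # class -1
margin = geometric_margin((1.0, 1.0), -3.0, pos + neg)
```

Replacing the plain dot product with a kernel evaluates the same construction in a (possibly high-dimensional) feature space.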

Trees are conceptually simple but powerful tools for classification and regression. For
our purpose, we use the CART algorithm.

A k-nearest-neighbor (KNN) classifier assigns a new sample to the class that is most common among its k nearest neighbors in the training data.

We compared the model classes described above in a unified framework under fair conditions.
Thus, we trained an ensemble of each model class consisting of 11 ensemble
members (11 CV-folds in the training scheme described in Section

The confusion matrix for a binary classification problem.

| | Predicted class +1 | Predicted class −1 |
|---|---|---|
| Real class +1 | True positive (tp) | False negative (fn) |
| Real class −1 | False positive (fp) | True negative (tn) |
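All performance measures used below derive directly from these four entries. A short sketch of the standard definitions (the counts in the usage line are made up for illustration):

```python
def metrics(tp, fn, fp, tn):
    """Standard performance measures derived from the confusion matrix."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)        # positive predictive value (PPV)
    recall = tp / (tp + fn)           # sensitivity
    f_score = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f_score, specificity

# illustrative counts only
acc, prec, rec, f1, spec = metrics(tp=40, fn=10, fp=20, tn=30)
```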

The average performance of several classifier ensembles with respect to the validation set which was initially removed and never included in model training. We show the mean and the standard deviation values from 20 independent validation runs; no preprocessing was used.

| | Accuracy | F-score | AUC | SPS95 |
|---|---|---|---|---|
| PDA | 0.776 ± 0.026 | 0.823 ± 0.026 | 0.863 | 0.454 |
| Log.Reg. | 0.778 ± 0.038 | 0.823 ± 0.036 | 0.868 | 0.484 |
| MLP | 0.791 ± 0.045 | 0.823 ± 0.04 | 0.863 | 0.453 |
| SVM | 0.795 ± 0.023 | 0.833 ± 0.02 | 0.825 | 0.142 |
| CART | 0.757 ± 0.03 | 0.809 ± 0.026 | 0.843 | 0.394 |
| KNN | 0.756 ± 0.036 | 0.813 ± 0.032 | 0.809 | 0.309 |
| Mixed | 0.783 ± 0.03 | 0.828 ± 0.026 | 0.860 | 0.457 |

For every partition of the cross-validation, the data are divided into a training and a test set. The performance of each ensemble model was assessed on a validation set which was initially removed and never included in model training.

A sketch of a classification tree, wherein the leaves represent classes and the branches represent conjunctions of features that lead to those classes.

The precision or positive predictive value (PPV) is
given by PPV = tp/(tp + fp).

The ROC-curve offers the opportunity to calculate the specificity at a fixed sensitivity level and vice versa. This is important because, from the clinical point of view, a high sensitivity of 95% is required so that nearly all patients with PCa are detected. To avoid a high false-positive rate, we computed the specificity at the level of 95% sensitivity (SPS95) from the ROC-curve as another important performance measure.
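A minimal sketch of the SPS95 computation, under the simplifying assumption that the classifier scores themselves are scanned as ROC thresholds from high to low (the labels and scores below are invented for illustration):

```python
def specificity_at_sensitivity(labels, scores, target_sens=0.95):
    """Specificity at the first ROC threshold reaching the target sensitivity.

    Thresholds are walked from high to low; at each one, sensitivity is
    the fraction of positives scored at or above the threshold, and
    specificity is the fraction of negatives scored below it.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    for t in sorted(set(scores), reverse=True):
        sensitivity = sum(s >= t for s in pos) / len(pos)
        if sensitivity >= target_sens:
            return sum(s < t for s in neg) / len(neg)
    return 0.0

# illustrative data: 4 PCa (label 1) and 4 non-PCa (label 0) samples
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.65, 0.5, 0.3, 0.2]
sps95 = specificity_at_sensitivity(labels, scores)
```

Here the threshold must drop to 0.6 before all four positives are caught, at which point three of the four negatives still fall below it.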

To gain an impression of the correctly classified
non-PCa patients in this case, we computed the specificity at the level of 95% sensitivity
(SPS95) from the ROC-curve. If we compare
the outcome of the statistical analysis of the model performance as listed in
Table

Tables

The average performance of several classifier ensembles with respect to the validation set which was initially removed and never included in model training. We show the mean and the standard deviation values from 20 independent validation runs wherein the training data was balanced.

| | Accuracy | F-score | AUC | SPS95 |
|---|---|---|---|---|
| PDA | 0.772 ± 0.034 | 0.809 ± 0.035 | 0.861 | 0.414 |
| Log.Reg. | 0.792 ± 0.03 | 0.834 ± 0.027 | 0.868 | 0.458 |
| MLP | 0.766 ± 0.027 | 0.787 ± 0.029 | 0.858 | 0.451 |
| SVM | 0.786 ± 0.038 | 0.816 ± 0.042 | 0.821 | 0.051 |
| CART | 0.755 ± 0.031 | 0.792 ± 0.029 | 0.841 | 0.376 |
| KNN | 0.726 ± 0.032 | 0.766 ± 0.034 | 0.801 | 0.297 |
| Mixed | 0.789 ± 0.033 | 0.830 ± 0.026 | 0.867 | 0.445 |

We compared several
classification models with respect to their ability to recognize prostate
cancer at an early stage. This was done in an
ensemble framework in order to estimate proper model parameters and to increase
classification performance. It turned out that all models under investigation
perform very well, with only marginal differences, and are comparable
with similar studies, such as Finne et al. [