Simpler Evaluation of Predictions and Signature Stability for Gene Expression Data

Scientific advances are raising expectations that patient-tailored treatment will soon be available. The development of resulting clinical approaches needs to be based on well-designed experimental and observational procedures that provide data to which proper biostatistical analyses are applied. Gene expression microarray and related technology are rapidly evolving. It is providing extremely large gene expression profiles containing many thousands of measurements. Choosing a subset from these gene expression measurements to include in a gene expression signature is one of the many challenges needing to be met. Choice of this signature depends on many factors, including the selection of patients in the training set. So the reliability and reproducibility of the resultant prognostic gene signature needs to be evaluated, in such a way as to be relevant to the clinical setting. A relatively straightforward approach is based on cross validation, with separate selection of genes at each iteration to avoid selection bias. Within this approach we developed two different methods, one based on forward selection, the other on genes that were statistically significant in all training blocks of data. We demonstrate our approach to gene signature evaluation with a well-known breast cancer data set.


Introduction
The era of personalised medicine is being widely heralded, with claims including "all human illness can be studied by microarray analysis, and the ultimate goal [] is to develop effective treatments or cures for every human disease by 2050" [1]. The first step is the selection of predictors from the many (often tens of) thousands of measurements that can be made using modern microarray technology.
As is well known amongst statisticians, genes identified from a microarray analysis as predictors of outcome, and so included in a molecular signature, depend greatly on the selection of patients in the training set. Hence it has been advocated that validation by repeated random sampling (RRS) should be used [2]. RRS is advocated elsewhere, for example for selecting genes for classification [3]. The RRS approach can quickly become extremely computer intensive. Alternatively, the far less computer intensive method of 10-fold cross-validation has been shown empirically, for two microarray data sets, to perform very well when compared with repeated random sampling [4]. K-fold cross-validation is a resampling method that is widely used in practice for estimating performance of a prediction rule ("signature") when there is insufficient data to split the data into the ideal 3 parts: training, validation, and testing. For microarray experiments where the number, M, of measurements (say gene-expression values on an Affymetrix chip) is extremely large compared with the number, s, of samples (chips), it is starting to become clear that some traditional data statistical practices may not always be appropriate.
There is considerable confusion about the application of cross-validation and many investigators have not understood the importance of signature selection at each of the K separate steps [4]. If the signal-to-noise ratio of the underlying function is relatively strong then the signature may not vary 2 Journal of Biomedicine and Biotechnology very much. However, if it is not strong, and this is very likely when the number of measurements is very large compared with the number of samples, then the signature may vary appreciably; see [5,6]. Further, it has been noted that "the most striking finding when comparing the significant lists [from different studies] is the virtually complete lack of agreement in the included genes . . . the present lack of coherence [] warrants further examination" [7].
The reason for the differences could be attributed to a number of factors, including methodology (model, algorithm, gene selection, preprocessing steps such as normalisation, transformation, etc.), data (different samples, heterogeneous samples, poor measurements, poor experimental designs leading to the presence of confounders, etc.), and the underlying biology (different genes from the same pathway, considered later). The fragility of a prediction model has been demonstrated [8] by showing that published results could not be replicated using the same data, when training and tests sets were interchanged and the same algorithms applied.
Here we consider a relatively large study [9] that was used to predict distant metastasis of lymph-nodenegative primary breast cancer based on determination of a single signature. The data are available in series GSE2034 at the NCBI Genebank GEO. These data consist of gene expression values from 286 Affymetrix U133a chips each with gene expression values for over 22 000 "genes". Of these patients, 209 were oestrogen receptorpositive (ER + ) and 77 patients were oestrogen receptornegative (ER − ). Briefly, the samples were a subset of frozen tumour samples from patients with lymph-node negative breast cancer which had been submitted for steroid-hormone receptor measurement from an intake of 25 hospitals; further details in [9]. Metastasis status was determined from follow-up examinations or confirmed following patient report. The (statistical) sample selection was nonrandom and the data show considerable variation in many known breast cancer prognostic factors, such as therapy type, age, menopausal status, tumour grade, and stage for example. Data on these factors were not available on the web.
In the following, the ER + patient data are chosen for illustrative purposes, noting that this selection also controls some of the overall patient heterogeneity. Use of 10-fold cross-validation is explored to determine those genes that are included in molecular signatures (lists). In other words, signature stability is evaluated, as is the variability of prediction measures.

Methods
First, gene expression values were log transformed (base 2) after zero values were set to 0.1.
Two methods, both of which correct for selection bias [4], were developed and applied to these breast cancer data. Before detailing the methods we outline the basic steps of Kfold cross-validation: (a) randomly divide the data into K separate and approximately equal parts called validation sets; (b) leave out one of the validation sets, then (b 1 ) perform the analysis using the combined data of the remaining K-1 parts (referred to as the "training set") and, usually, (b 2 ) validate the analysis (that is often making prediction/s) on the "validation set"; (c) repeat (b) K times, so that following the (usual) b 2 step each one of the (separate) K validation sets is used once (e.g., for prediction/s).
For method (i), within each training set at step (b 1 ), we first did an initial screening by applying a univariate Cox proportional hazards regression model (Cox PHM) to select those genes in each of the training sets that were statistically significant for metastasis-free survival. A (partial) likelihood ratio test, comparing the partial likelihood estimated for the gene to the null model, was used. Genes were retained when the likelihood ratio test was statistically significant at level α 1 . Next, restricting attention to just these retained genes, we used a Cox PHM with forward selection to select amongst the genes available following the initial gene selection. The selection was terminated when the statistical significance for inclusion into the Cox PHM, using a log likelihood ratio test, was greater than level α 2 . For step (b 2 ), the resulting model was used to predict relapse-free survival at 5 years on the validation data set. To evaluate predictive performance, a binary outcome was defined [2], namely whether the predicted probability of a patient was relapse-free, was greater than, or less than, 0.5. Then, three measures of predictive performance (also called accuracy of classification) were estimated, namely the true positive fraction (TPF), also known as sensitivity, the false positive fraction (FPF) and the proportion of correctly predicted samples. Specificity is 1-FPF. In a clinical setting, sensitivity and specificity are more relevant than the proportion correctly predicted (i.e., 1 -classification error) that is often the measure reported in the bioinformatics literature. It is straightforward to show that the proportion of correctly predicted samples is a weighted average of sensitivity and specificity. Note that the proportion correctly predicted can be misleading if the outcome is rare or very common.
Method (ii) followed method (i) except that instead of forward selection we selected those genes which were significant at level α 1 in all training sets. Then all these genes were used in the signature with a multivariate Cox PHM being fitted at each iteration. The validation at step (b 2 ) in the second iteration was as described for method (i). This approach has been used successfully on data from proteomic profiling to distinguish malignant pancreatic cancer from benign disease [10].
There is no general rule for the value of K, and K = 1, 5, 10, and √ s (where s is the sample size) have all been suggested; the trade-off is between bias and variance [5]. Following Ambroise & McLachlan [4], we chose K = 10. This choice seems to balance the trade-off between bias with higher values of K and variability as observed for low K values, particularly K = 1 (the so-called leave-one-one cross validation). Note that any two 10-fold cross-validation training sets have approximately 80% of the total sample in common. Here the cross-validation was applied in stratified form, so that approximately equal numbers of distant (i.e., beyond 5 years) metastases patients were included in the validation sets. We found that our substantive conclusions changed little if this restriction was relaxed.
The Cox proportional hazards model was chosen for modelling metastasis free survival, t, because of censoring in the data. The Cox PHM models the conditional hazard as the product of a baseline hazard, λ 0 (t), and an exponential form of a linear function of covariates, z, here log 2 gene expression values, as follows: The assumption of proportional hazards was evaluated and found adequate for a small number of significant genes. Known gene function was examined using a text search on annotation available from the Affymetrix web site. The R software [11] with the package "survival" was used for analysis.

Method (i).
The initial screening of genes at step (b 1 ), using a univariate Cox regression analyses, with α 1 ≤ .001 identified from 210 to 342 genes in the 10 training sets. Altogether across the 10 training sets, 675 unique genes were found.
The 10 signatures built using a Cox PHM with forward selection in each training set and with α 2 ≤ .001, consisted of 11 to 18 genes. Genes were generally uncorrelated within each signature.
The 10 signatures have very few genes in common; that is, the signatures are very unstable. For example, one gene occurred in 6 signatures (the most common), 4 genes occurred in 4 signatures and 8 genes in 3 signatures. Seventy genes occurred only once. Also very few of the 60 genes in Wang et al.'s [9] signature for the subset of ER + patients occurred in the 10 signatures. Three of the signatures included no genes in common with the Wang signature and the signature with the largest number of genes in common had only four genes in common.
The discrepancies between the ten signatures might be thought to be occurring because correlated genes in the same pathway/s are being selected. Table 1 shows the number of genes found in each signature with different functional classes. These classes are based on those in Wang et al. (Table  4) [9,12]. Functional classes that were found in none or only one signature are not shown. This analysis is not meant to be definitive but to illustrate the lack of consistency in apparent function between the signatures. We note that the variation demonstrated here may reflect the complexity of signalling processes in possibly many pathways, and/or the presence of tumour subtypes [13] and/or the heterogeneity in the sample. Table 2 shows signature performance indices estimated on the training sets. As is well known now, when the performance is estimated on the same set as used to train the signature, the performance is over optimistic (i.e., biased [4]). Table 2 shows that each signature predicts distant metastasis very well for the data on which they were trained, even though the signatures were composed of a relatively small number of genes (average 15). However, Table 3 shows that the performance is considerably worse when the signature is used to predict metastasis-free status on the validation set. The test error, estimated as the proportion misclassified, using distant metastasis within 5 years as the defining mark, varies between 0.19 (= 1 − 0.810) and 0.55 (= 1 − 0.450). The average unbiased estimate of error rate is 32.6%, while the biased estimate of error rate is only 11.5%. The variance of the unbiased test error is 122.7.

Method (ii).
We demonstrate the second method using 2 values of α 1 for the initial screening. With α 1 ≤ .001, there were 59 genes that were statistically significant in all training sets for metastasis-free survival, using a univariate Cox PHM.
To assess performance of this method we estimated the Cox PHM using these 59 genes and predicted relapse-free survival at 5 years in each corresponding validation set. The true positive fractions (TPF) and false positive fractions (FPF) are plotted on the left in Figure 1. The location of the average pair (TPF, FPF) = (0.793 0.581) is shown. The average TPF is slightly lower than for method (i) but so also is the FPF which is desirable. Given the variability of the estimates, these averages can be considered approximately the same. The lower range of the FPF is less here than for method (i), whereas the comparison of the FPF values in the figure and with corresponding column in Table 3 indicates an improvement over method (i). The average error rate from the validation sets is 33.3%, which is similar to method (i), but with variance 74.4 which is much lower. When α 1 ≤ .0001 was used for the initial gene selection, 14 genes were identified for inclusion in the signature. The

Discussion
Cross-validation is shown to be a useful tool to assess performance and stability of molecular signatures used for prediction from microarray data. Such an evaluation is necessary since the number of samples is small relative to the number of measurements. Alternatives to cross-validation for estimating predictive performance include the bootstrap and repeated random sampling [3,4] for example; but for clarity in demonstrating instability of predictors, and because it is much less computer intensive, we concentrated here on cross-validation. Within the cross-validation framework for developing gene signatures, there are many choices, including which initial screening approach to use and which method to build the signature. Each choice may lead to different genes being included in a signature. The choices made in the initial gene selection step here are somewhat arbitrary. Our initial choice, to select genes with significant (α 1 ≤ 0.001) association with breast cancer survival, was a strategy to reduce the number of genes so that the potential number was of the same order of magnitude as the number of samples. This initial gene selection step is not required for K-fold cross-validation in applications where the number of variables is closer to sample size. This screening of genes has interesting connections to the recent, independently derived, proposal of iterative sure independence screening (ISIS) for the linear model in ultrahigh dimensional feature space. ISIS was motivated by the need to deal with problems of large dimensionality for which accuracy of estimation and computational cost are two concerns. For the linear model, Fan and Lv [14] proposed reducing the dimensionality from high to a moderate scale that is below the sample size, by iteratively implementing a variable screening procedure and fitting the proposed model.
The presence of censored data indicates the use of a survival model for building the signatures. Since the Cox PHM has been found to be reasonably robust under misspecification [15], is well known in medical fields, and was used by Wang et al. [9], it was chosen for developing the signatures. This is preferable to the approach used by Ein-Dor et al. [16] that was based on genes correlated with survival, as "correlation" is not statistically appropriate in this context.
In method (i), to train the signature on the training data we used a stepwise procedure. It is well known that such methods find only a fraction of the models that fit the data well and can have the undesirable effect of over fitting the data. However, efficient screening of all models is not computationally possible given the number of genes (variables). Again, the use of a significance level of α 1 ≤ 0.01 is somewhat arbitrary. In further analyses (not reported) increasing α increased the number of genes and the lack of agreement between the signatures. The use of cross-validation helps to ascertain the extent of any over fitting but there is no known solution yet to finding the "optimal" model in this setting. It is good practice to use a number of approaches and to be satisfied only when substantive conclusions are unaltered by the algorithm, and basic assumptions are satisfied in so far as they can be tested. The importance of implementing the same gene selection rule/s for each of the training sets, in order to obtain an unbiased estimate of the prediction error, is stressed. Here there are two gene selection steps, and both need to be implemented at each training step. This is necessary because neither gene selection method is independent of the prediction method.
Method (ii) is novel, with its incorporation of selection based on variables that are common to all blocks using a cross-validation type of approach. This would appear to overcome the highly unstable list of genes identified as predictors of prognosis that is found using either repeated random sampling [3] or method (i), but needs to be further evaluated on other data sets. Note that method (ii) is comparatively simple, and computationally fast. Current research concerns evaluation of the performance and stability of signatures of both our variants of cross-validation with the SIS-style approaches to screening before fitting of the model/s. Since the data used for the selection of genes are not fully independent of the data used for validation, some bias may exist in estimates of performance [17] and alternative methods, including the use of 2-external cross-validation will be investigated.
There are many ways in which cross-validation can be developed so as to correct for selection bias [4], and we have used two. Another approach that has many similarities to our methods is to first consider a range of the numbers of genes to be used in the signature, say d 0 to d 1 . Then within each of the K training blocks, perform K-1-fold cross-validation, and evaluate the signature for each value of d, selecting the value that gives the best classification rate for that block. The value of d selected may vary between the K training blocks; see [18] for further details. With the additional layer of crossvalidation, as well as consideration of a range of values of d within each training block, this is more computer-intensive than our methods, although the associated variability may well be less. Comparative performance of these approaches is a future research project.
It is important to distinguish between the use of statistical models for prediction and for explanation [19]. In much of the literature this distinction is blurred. Here we have been concerned with choosing signatures for prediction. Some investigators attempt to use a signature to build pathway/s for explanatory purposes but such explanations are dubious when the signatures can be very unstable.
Sample design is an important, but often overlooked, consideration in microarray studies. It has been identified as a key issue [20], and we emphasize that poor sample design cannot be overcome with increased sample size. Ideally the sample should be representative of the population on which the signature is to be used. Again resampling methods can help to highlight problems. Note that most of the theoretical work underpinning justification for methodologies assumes that the samples are representative. There are many unknowns about how non-homogeneity in the sample affects the performance of the predictors.
In this data set there were confounders, including age, menopausal status, T stage, Grade, and PR status, that may be important predictors, including their possible interaction/s with particular genes, have not been considered here as they are not publicly available. Although Wang et al. [9] showed that these and other confounding factors were not statistically significant after adjusting for their gene signature in the training set, their inclusion may be useful in predicting "new" samples. Their usefulness will arguably depend on how well these-factors are "accounted for" by the genes in the signature. The "best" signatures will in the end, most probably, be developed by a combination of biological expertise and computational algorithms.
Here we have demonstrated also the usefulness of using false positive and false negative fractions for assessing the performance of signatures. Particularly for medical applications, these measures of performance are more appropriate for assessing relevance of test performance than the overall classification (or misclassification) rate.
A promising approach has been to apply a proteinnetwork-based method [21] that, instead of identifying markers as individual genes, considers them as subnetworks as extracted from protein interaction databases. Unfortunately the results suffer from selection bias. Two data sets were used [9,22], although one of the data sets [22] included both node negative and node positive patients; just node negative patients could have been selected for inclusion in the analyses. It is important to correct for selection bias when estimating classification accuracy from use of the subnetwork markers. As outlined above, to remove selection bias, the subnetwork features should be identified at each step of the cross-validation, not "identified using all microarray samples before classification" [21, page 4].

Conclusion
We demonstrate the usefulness of the 10-fold crossvalidation framework for assessing performance of signatures on a well-known Breast Cancer data set. Within this approach we developed two different methods, one based on forward selection, the other on genes that were statistically significant in all training blocks of data. This paper demonstrates a lack of agreement between signatures estimated using a forward selection method on randomly selected subsets of the same data. Our novel method overcomes the instability of the forward selection method. It is computationally fast and is shown to have lower test error variance and FPF.
The 10-fold cross-validation framework is a very useful and less computer intensive method than some other resampling methods. It allows estimation of unbiased misclassification error as well as an assessment of the signature's sensitivity and specificity. Such evaluations are of utmost importance especially when the number of samples is small relative to the number of measurements.
The use of the pair of measurements, true positive fraction (sensitivity) and false positive fraction (1-specificity), provides a more clinically relevant evaluation of overall performance than the single measure given by the correctly predicted fraction (or its complement).
Although the hope of finding a single diagnostic tool is laudable, it is still too early to be making any strong claims. Given the complexity of human breast tumours, we agree that it is unlikely that "robust and internationally agreed signature gene lists will be accumulated in the near future" [7].
Finally, we have as yet to fully understand the enormous quantities of data that the rapidly evolving microarray technologies are giving us. Moreover, "although the use of gene-expression profiles in clinical practice is very appealing, we should be very cautious . . ." [23].