Different from significant gene expression analysis, which looks for genes that are differentially regulated, feature selection in microarray-based prognostic gene expression analysis aims at finding a subset of marker genes that are not only differentially expressed but also informative for prediction. Unfortunately, feature selection in the microarray literature is dominated by the simple heuristic univariate gene filter paradigm, which selects differentially expressed genes according to their statistical significance. We introduce a combinatory feature selection strategy that integrates differential gene expression analysis with the Gram-Schmidt process to identify prognostic genes that are both statistically significant and highly informative for predicting tumour survival outcomes. Empirical application to leukemia and ovarian cancer survival data, through within-study and cross-study validations, shows that the feature space can be largely reduced while achieving improved testing performance.
Similar to significant gene expression analysis, one demanding challenge in prognostic microarray experiments of tumour outcomes is the development of a powerful prognostic profile based on informative genes or features selected from a large pool of candidate genes measured on a relatively small number of arrays or tumour samples. Among the thousands of genes measured in an experiment, it is anticipated that only a limited number are informative for prognostic purposes, while a large number are redundant or irrelevant and thus can be ignored. Including uninformative genes in tumour outcome prediction only introduces unnecessary noise; it inevitably complicates model building and introduces computational difficulties. Obtaining a smaller subset of representative genes while retaining the prognostic characteristics of the original data should lead to a more accurate and efficient learning system with improved classification performance [
Different from significant gene expression analysis which looks for genes that are differentially regulated, feature selection in prognostic microarray studies aims at finding a subset of informative marker genes that are discriminative for prediction, ideally without redundancy. Ein-Dor et al. [
In the literature on prognostic microarray studies, feature selection is dominated by the simple heuristic univariate gene filtering paradigm [
We start by identifying genes that are differentially expressed in a microarray experiment, testing the marginal association between gene expression and survival time with the popular Cox regression model, in which both censored and uncensored observations are used. In the Cox regression model, we assume that the hazard of death at time point
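The gene-by-gene filtering idea can be illustrated with a minimal sketch. It is not the fitting procedure used in this study: instead of maximizing the Cox partial likelihood, it computes the score test of a zero regression coefficient in a one-covariate Cox model, which needs no iteration. The function names `cox_score_test` and `filter_genes`, the threshold `alpha`, and the Breslow-style handling of tied times are assumptions made for illustration.

```python
import math

def cox_score_test(times, events, x):
    """Score test of beta = 0 in a one-covariate Cox model (illustrative).

    times  : follow-up time per sample
    events : 1 if death was observed, 0 if censored
    x      : expression level of one gene per sample
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    score = 0.0   # sum over deaths of (x_i - mean of x in the risk set)
    info = 0.0    # sum over deaths of the variance of x in the risk set
    for t_i, e_i, x_i in zip(times, events, x):
        if e_i != 1:
            continue  # censored samples contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
        mean = sum(risk) / len(risk)
        info += sum((x_j - mean) ** 2 for x_j in risk) / len(risk)
        score += x_i - mean
    if info <= 0.0:
        return 0.0, 1.0  # gene carries no variation among samples at risk
    chi2 = score ** 2 / info
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

def filter_genes(times, events, expression, alpha=0.05):
    """Indices of genes whose marginal score-test p-value is below alpha."""
    return [g for g, x in enumerate(expression)
            if cox_score_test(times, events, x)[1] < alpha]
```

Here `expression` is assumed to be a list with one expression vector per gene. In practice a dedicated survival analysis package would normally be preferred, since it also returns hazard ratios and handles ties more carefully.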
Suppose there are
To select the second gene, each of the unselected genes indicated as
Now we repeat the procedure in selecting the first gene by calculating the squared-correlation coefficient between each of the unselected genes
Likewise, in order to select the
With the above procedure, a subset of the most representative genes, accounting for a high percentage of the variation in the overall feature set, can be selected. The data vector for each gene or feature can be approximated by a linear combination of the selected subset of features of size
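The ranking procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of the formulas in this paper: the selection criterion is taken to be the average squared correlation between a candidate gene (after orthogonalization against the genes already selected) and all genes in the candidate set, and the names `gram_schmidt_rank`, `squared_corr` and the tolerance `tol` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def squared_corr(u, v, tol=1e-12):
    """Squared correlation of two centered vectors (0 if either is ~zero)."""
    uu, vv = dot(u, u), dot(v, v)
    if uu < tol or vv < tol:
        return 0.0
    return dot(u, v) ** 2 / (uu * vv)

def gram_schmidt_rank(genes, n_select, tol=1e-12):
    """Rank genes by forward selection with Gram-Schmidt orthogonalization.

    genes    : list of expression vectors, one per gene, equal length
    n_select : maximum number of genes to rank
    Returns gene indices in order of selection.
    """
    X = [center(g) for g in genes]   # original (centered) profiles
    Q = [list(x) for x in X]         # working copies, orthogonalized stepwise
    remaining = list(range(len(X)))
    ranked = []
    for _ in range(min(n_select, len(X))):
        # Pick the candidate that best represents the whole gene set.
        best, best_score = None, -1.0
        for j in remaining:
            score = sum(squared_corr(x, Q[j], tol) for x in X) / len(X)
            if score > best_score:
                best, best_score = j, score
        if best_score <= tol:
            break  # the rest is explained by genes already selected
        ranked.append(best)
        remaining.remove(best)
        # Remove the selected direction from every unselected gene.
        q = Q[best]
        qq = dot(q, q)
        for j in remaining:
            coef = dot(Q[j], q) / qq
            Q[j] = [a - coef * b for a, b in zip(Q[j], q)]
    return ranked
```

With three genes where the second is an exact multiple of the first, the sketch ranks one of the pair, then the unrelated third gene, and stops without ranking the redundant copy, because its orthogonalized profile is zero.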
After ranking the significant genes, a subset of the most representative and informative marker genes is selected through optimization in the training set. Forward selection cumulatively adds each of the ranked genes (starting from the highest-ranked genes) to the prediction model and assesses model performance on the training set by calculating prediction accuracy (sensitivity, specificity) together with the chi-squared statistic from the log-rank test, which compares survival between the predicted favourable and unfavourable groups. The support vector machine (SVM) is used as the prediction model because of its popularity in machine learning. The simple linear kernel is chosen in model fitting. The free
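The two ingredients of this optimization step can be sketched in a few lines. The first function computes the standard two-group log-rank chi-squared statistic; the second walks through the ranked genes and keeps the prefix with the best score. The classifier is deliberately left out: `evaluate` is a hypothetical callback that would train the model on a gene subset and return a performance measure, and how accuracy and the log-rank statistic are combined into one score is an assumption left to the caller.

```python
def logrank_chi2(times, events, groups):
    """Two-group log-rank chi-squared statistic (1 degree of freedom).

    times  : follow-up time per sample
    events : 1 if death was observed, 0 if censored
    groups : predicted group per sample, coded 0 or 1
    """
    obs_minus_exp = 0.0
    variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n = sum(1 for ti in times if ti >= t)                 # at risk, total
        n1 = sum(1 for ti, g in zip(times, groups)
                 if ti >= t and g == 1)                       # at risk, group 1
        d = sum(1 for ti, e in zip(times, events)
                if ti == t and e == 1)                        # deaths at t
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == 1)            # deaths in group 1
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0

def forward_select(ranked_genes, evaluate):
    """Add ranked genes one at a time; keep the best-scoring prefix.

    evaluate : hypothetical callable taking a list of genes and returning
               a score where larger is better.
    """
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranked_genes) + 1):
        score = evaluate(ranked_genes[:k])
        if score > best_score:
            best_k, best_score = k, score
    return ranked_genes[:best_k], best_score
```

Because only strict improvements replace the current best, ties are resolved in favour of the smaller gene subset.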
The procedure described above is first applied to a microarray data set (containing 6,283 genes) from Bullinger et al. [
Bild et al. [
To show how the method can deal with data from small studies, we applied it to our in-house microarray data collected by Jochumsen et al. [
As the above two data sets are both on ovarian cancer, we additionally conducted a cross-study validation to show the performance of the features selected using our method and compare it with that of genes selected only according to their statistical significance. The analysis also takes advantage of the fact that the two studies used the same microarray platform, the Affymetrix GeneChip Human Genome U133 Arrays, although in different versions. Since the array for our in-house data (U133 plus 2.0, 55,000 probe-sets) is inclusive of the published array (U133a, 22,000 probe-sets), cross-study validation is only possible for validating genes selected from the published array data. Note here we used the whole published data set of 132 tumour samples for gene filtering (
We have shown that our combinatory approach can be used for selecting statistically significant and highly informative genes for predicting tumour survival outcomes in microarray studies. The method removes redundant genes that, although statistically significant, have low impact on prediction, so that improved prediction on an independent testing set is expected. Our results indicate that “significant” features selected using genewise approaches can contain irrelevant or redundant genes that serve only to complicate model building for a classifier. Our empirical results further emphasize the difference between significant and prognostic gene expression analyses: the former only looks for genes that are statistically significantly regulated (including correlated genes coexpressed in a biological pathway), while the latter tries to extract prognostic genes that are not only statistically significant but also highly informative in characterizing tumour outcomes.
Our combinatory approach consists of a supervised univariate differential gene expression analysis for gene filtering, followed by an unsupervised multivariate algorithm for ranking the significant genes. The ranking of genes subsequently assists the forward optimization step in determining the final subset of informative genes for building the final classifier. Given the large number of genes measured in an experiment, it is important to ensure that statistically highly significant genes are picked up after gene filtering in order to form a meaningful candidate feature space for subsequent ranking and optimization. This is necessary because (1) the gene ranking step works only if the candidate feature space contains genes that are highly correlated with tumour survival outcomes, although some of them may have only a minor impact on prediction; (2) picking up highly significant genes helps to reduce the number of false positive genes included in the candidate feature space; and (3) a good candidate feature space can help to increase computational efficiency, because the computational load goes up exponentially with the number of genes in the feature space.
Feature redundancy reduction not only helps to improve the performance and generalization of a classifier but is also advantageous for clinical applications. With the confirmed subset of highly prognostic genes, routine bioinstrumentation for gene expression level measurement such as qRT-PCR [
It should be noted that the survival analysis in the gene filtering step makes full use of both censored and uncensored samples in identifying differentially expressed genes. The gene filtering step using a survival analysis model can be generalized to binary or categorical clinical outcomes, such as tumour metastasis status, where corresponding statistical models (e.g.,
This work was partially supported by the US National Institute on Ageing research Grant NIAP01AG08761.