Random Subspace Aggregation for Cancer Prediction with Gene Expression Profiles

Background. Precisely predicting cancer is crucial for cancer treatment. Gene expression profiles make it possible to analyze patterns between genes and cancers on the genome-wide scale. Gene expression data analysis, however, is confronted with enormous challenges owing to its characteristics, such as high dimensionality, small sample size, and low signal-to-noise ratio. Results. This paper proposes a method, termed RS_SVM, to classify gene expression profiles by aggregating SVMs trained on random subspaces. After choosing gene features through statistical analysis, RS_SVM randomly selects feature subsets to yield random subspaces, trains an SVM classifier on each subspace, and then aggregates the SVM classifiers to capture the advantage of ensemble learning. Experiments on eight real gene expression datasets are performed to validate the RS_SVM method. Experimental results show that RS_SVM achieved better classification accuracy and generalization performance compared with single SVM, K-nearest neighbor, decision tree, Bagging, AdaBoost, and the state-of-the-art methods. Experiments also explored the effect of subspace size on prediction performance. Conclusions. The proposed RS_SVM method yielded superior performance in analyzing gene expression profiles, which demonstrates that RS_SVM is a good approach for such biological data.


Introduction
Cancer is usually associated with genes, which carry human hereditary information. The completion of human genome sequencing makes genetic analysis on the genome-wide scale possible and provides a deeper understanding of the underlying mechanisms of cancers [1][2][3][4]. Biological technology can now simultaneously monitor tens of thousands of gene expression levels [5,6]. It is meaningful to design novel methods that precisely and efficiently distinguish tumor samples from normal samples or recognize subclasses of a disease from gene expression profiles. Classification of gene expression data, however, faces enormous difficulties. Firstly, the data have up to tens of thousands of dimensions. Traditional classification methods become intractable, since high dimensionality makes the sample distribution sparse and the distances between samples ambiguous. Secondly, the sample size is small because of high expense or ethical considerations, so there are not enough data to train a classical learner. Thirdly, the low signal-to-noise ratio (SNR) of gene expression data means that noise may significantly degrade performance.
To tackle the high dimensionality issue, some studies attempt to select important gene features by exploiting the association among genes and eliminating redundant and irrelevant information. Based on Recursive Feature Elimination (RFE), Guyon et al. used an SVM method to select genes and showed that the genes selected by the SVM method perform better [7]. By feature extraction and by defining a "correlation feature space" for samples built on gene expression profiles through iterative use of Pearson's correlation coefficient, Ren et al. proposed an original method to further propel gene expression profiling technologies from bench to bedside [8]. Considering the possible interactions among genes, Zhang et al. proposed a binary matrix shuffling filter to overcome the difficulties associated with the search schemes of conventional wrapper methods and with overfitting [9].
Ensemble techniques have also been introduced in recent studies. Bolón-Canedo et al. provided a novel framework for feature selection by an ensemble of filters and classifiers [10]. Combining classifiers from different classification families into an ensemble based on an evaluation of the performance of each classifier, Nagi and Bhattacharyya proposed an ensemble method named SD-EnClass [11]. To ensure high classification accuracy, Ghorai et al. presented an ensemble of nonparallel plane proximal classifiers based on the genetic algorithm through a simultaneous feature and model selection scheme [12]. Given that the forward feature selection (FFS) method can obtain an expected feature subset with fewer iterations than the backward feature selection (BFS) method, Luo et al. proposed two FFS methods based on the pruning of classifier ensembles generated by a single gene feature [13].
The "blessing of nonuniformity" effect, which means that samples are concentrated in a relatively low-dimensional region rather than spread uniformly throughout the whole instance space, inspired some novel methods that perform classification in subspaces [14]. Constructing subspaces by a random process was first proposed by Ho for decision forests to overcome the dilemma between avoiding overfitting and achieving maximum accuracy [15].
Recently, researchers have done much work on cancer classification based on gene expression data. Daxa et al. proposed a framework to find informative gene combinations and to classify gene combinations belonging to their relevant subtype by using fuzzy logic, although they focused only on identifying 2-gene and 3-gene combinations [16]. Kim et al. presented a genetic filter to identify gene subsets for cancer-type classification on gene expression profiles, which was tested on only one dataset, the Leukemia dataset [17]. Vosooghifard and Ebrahimpour proposed a hybrid method using GWO and C4.5 for gene selection and cancer classification. In essence, GWO is a group optimization method, so its computation time is a factor that should be considered [18]. Buza summarized the classification of gene expression data in reference [19], where he indicated that the robustness of SVM in classifying gene expression data relies on the strong fundamentals of statistical learning theory.
This paper attempts to classify gene expression data by aggregating SVMs trained on random subspaces (RS). The RS method shows great potential in scenarios where the number of features is much larger than the number of samples [20][21][22][23]. In addition, the RS method performs well in coping with correlation and redundancy between features. The risk of bias is relatively small in RS because it does not depend on a specific hypothesis about the data. SVM is often used to cope with gene expression data, since only support vectors take part in the classification process, and the number of support vectors is usually much smaller than the number of training samples. We explored in detail how to choose parameters and how the size of the subspaces affects classification performance. A possible reason for the unsatisfactory outcomes on some datasets is also discussed.

Gene Expression Datasets.
Eight real gene expression datasets are used. They are provided by Kent Ridge Biomedical Dataset Repository and collected by Li and Liu from Nanyang Technological University, Singapore [24]. Detailed information is listed in Table 1.
Breast Cancer dataset labels the patients who developed distant metastases within five years as "relapse" and labels the patients who remained healthy for an interval of at least five years after the initial diagnosis as "nonrelapse." Missing values are replaced by 100 [25].
Leukemia dataset was originally published in reference [26]. Dataset used in this work is an extended and more heterogeneous version than the initial one. Prostate dataset has an independent testing set, which is from a different experiment and has a nearly tenfold difference in overall microarray intensity from the training data [40].
Colon Tumor dataset was introduced in reference [41]. Rather than elaborating time-course data, this dataset consists of snapshots of the expression pattern of distinct cell types. Raw dataset, based on 22 normal colon tissue samples (positive) and 40 colon tumor samples (negative) of colon adenocarcinoma specimens, was from an Affymetrix oligonucleotide array complementary to more than 6,500 genes and expressed sequence tags (ESTs). Two thousand genes were selected to generate the dataset used here, with the highest minimal intensity across 62 samples.
CNS (central nervous system) dataset was originally published in reference [42]; only dataset C, which is used to analyze the outcome of the treatment, is used here. The dataset contains 60 patient samples, with 21 medulloblastoma survivors (labelled as "Class 1") and 39 treatment failures (labelled as "Class 0"). There are 7129 genes in the dataset.
Ovarian dataset was originally published in reference [43], in which experiments aim to identify proteomic patterns in serum that distinguish ovarian cancer from noncancer. The proteomic spectra were generated by mass spectroscopy, and the dataset used in this work includes 91 "Normal" samples and 162 "Cancer" samples without separate training and testing sets. The raw spectral data of each sample contain the relative amplitude of the intensity at each molecular mass/charge (m/z) identity. There are 15154 m/z identities in total. The intensity values were normalized according to the formula NV = (V − Min)/(Max − Min), where NV is the normalized value, V the raw value, Min the minimum intensity, and Max the maximum intensity. The normalization is done over all the 253 samples for all 15154 m/z identities. Thus, each intensity value falls into the range of 0 to 1.
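The min-max formula above can be sketched in a few lines of Python. This is only an illustration, not the code used to prepare the dataset: the function names are hypothetical, and the sketch assumes that Min and Max are taken per m/z identity (per column) across samples, which the text does not state explicitly.

```python
def min_max_normalize(values):
    """Scale raw intensities to [0, 1] via NV = (V - Min) / (Max - Min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_columns(matrix):
    """Normalize each column (one m/z identity) of a samples-by-features matrix."""
    columns = [min_max_normalize(list(col)) for col in zip(*matrix)]
    # Transpose back so that each row is again one sample.
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    raw = [[2.0, 10.0], [4.0, 10.0], [6.0, 10.0]]
    print(normalize_columns(raw))
```

If Min and Max were instead meant to be global over the whole matrix, `min_max_normalize` could be applied once to the flattened values.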
As the most common subtype of non-Hodgkin's lymphoma, DLBCL (diffuse large B cell lymphoma) is an aggressive malignancy of mature B lymphocytes. DLBCL consists of two molecularly different subclasses [44]. One subclass is "germinal centre B-like DLBCL," which expresses gene characteristics of germinal centre B cells, and the other is "activated B-like DLBCL," which expresses genes normally induced during in vitro activation of peripheral blood B cells. DLBCL dataset contains 47 mRNA samples consisting of 24 germinal centre B-like DLBCL and 23 activated B-like DLBCL. Each of the 4026 columns, corresponding to cDNA clones, indicates a gene expression level. Log-transformation was applied to the raw dataset to produce the dataset used in this work.

Method
Description. SVM has an advantage in small sample cases, and the RS method performs well in coping with high-dimensional data. Algorithm 1 presents a description of the RS_SVM method used in this paper, which aggregates SVMs trained on random subspaces. Figure 1 shows the framework of RS_SVM.

Gene Selection.
A gene expression profile usually contains a large number of genes with constant or nearly constant expression levels across samples. These genes are redundant for classification and can even weaken the distinction between normal and tumor samples, since they sharply increase the dimensionality of the space. To address this problem, gene selection based on statistical analysis is adopted to yield a new gene set from the original one. Since the t-test was the first method used for feature selection when microarray technology came into being, it is used in this work. Firstly, we compute the t-test value of each gene across all samples and rank the genes accordingly; then, the top genes are filtered at the 0.95 significance level. The number of top genes and the optimal subspace size on the eight datasets are presented in Table 2.
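The ranking step can be illustrated with a small Python sketch. It is a hedged example rather than the authors' R code: it assumes a two-class problem, uses Welch's unequal-variance form of the t statistic (the text does not say which variant was used), and keeps a fixed number of top genes instead of applying the 0.95 significance filter. The names `welch_t` and `rank_genes` are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic; each list needs at least two values."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # Assumption: a gene that is constant in both classes scores 0;
        # zero within-class variance with different means is treated as infinite.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def rank_genes(X, y, top_k):
    """Return the indices of the top_k genes ranked by |t|.

    X: list of samples, each a list of gene expression values.
    y: list of binary labels (0 or 1), one per sample.
    """
    scores = []
    for g in range(len(X[0])):
        pos = [row[g] for row, label in zip(X, y) if label == 1]
        neg = [row[g] for row, label in zip(X, y) if label == 0]
        scores.append((abs(welch_t(pos, neg)), g))
    # Largest |t| first; ties are broken by gene index for reproducibility.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [g for _, g in scores[:top_k]]


if __name__ == "__main__":
    X = [[1.0, 5.0, 0.0], [1.1, 5.0, 1.0], [0.9, 5.0, 0.0],
         [3.0, 5.0, 1.0], [3.1, 5.0, 0.0], [2.9, 5.0, 1.0]]
    y = [0, 0, 0, 1, 1, 1]
    print(rank_genes(X, y, top_k=2))
```

In this toy input the second gene is constant across all samples, so it receives a score of zero and is ranked last, which mirrors the motivation given above for removing near-constant genes.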

Size and Number of Random Subspaces.
The random subspace size has an enormous influence on RS_SVM. If the subspace size is relatively small, some important gene features may not be selected into the feature subsets used to train the SVMs; thus, underfitting easily occurs. In contrast, if the subspace size is extremely large, diversity among the SVM classifiers may be reduced, leading to a useless aggregation. In the following experiments, the subspace size defaults to the square root of the number of features retained by the t-test, as recommended by Breiman [45], and is then adjusted until the optimal testing error is achieved. We analyze the influence of random subspace size on classification performance by illustrating the variation of training error and testing error with different subspace sizes in Figure 3. An appropriate number of random subspaces can guarantee that each feature has enough chance to be selected. Because prior knowledge about the number of subspaces is lacking, it is set to 1000 experimentally.
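The interplay between subspace size and number of subspaces may be easier to see in a minimal Python sketch of random subspace aggregation. Several parts are assumptions made for illustration only: the base learner is a simple nearest-centroid classifier standing in for an SVM so that the example needs no external library, the aggregation rule is majority voting (the text says only that the classifiers are aggregated), and all names are hypothetical.

```python
import math
import random
from collections import Counter


class NearestCentroid:
    """Stand-in base learner (NOT an SVM): predicts the class with the closest mean."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))
        return min(sorted(self.centroids), key=sq_dist)


def train_random_subspace(X, y, n_subspaces, subspace_size=None,
                          make_learner=NearestCentroid, seed=0):
    """Train one base learner per randomly drawn feature subset."""
    n_features = len(X[0])
    if subspace_size is None:
        # Default described in the text: square root of the number of features.
        subspace_size = max(1, round(math.sqrt(n_features)))
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_subspaces):
        idx = rng.sample(range(n_features), subspace_size)  # without replacement
        X_sub = [[row[j] for j in idx] for row in X]
        ensemble.append((idx, make_learner().fit(X_sub, y)))
    return ensemble


def predict_majority(ensemble, x):
    """Aggregate by majority vote (assumed rule); ties go to the first label seen."""
    votes = Counter(model.predict_one([x[j] for j in idx]) for idx, model in ensemble)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    X = [[0.0, 0.1, 0.0, 0.2], [0.1, 0.0, 0.2, 0.0],
         [1.0, 0.9, 1.0, 0.8], [0.9, 1.0, 0.8, 1.0]]
    y = [0, 0, 1, 1]
    ensemble = train_random_subspace(X, y, n_subspaces=25, seed=1)
    print(predict_majority(ensemble, [0.95, 0.95, 0.9, 0.9]))
```

Any object with `fit` and `predict_one` methods can be passed as `make_learner`, so an SVM wrapper could replace the stand-in without changing the sampling or voting code.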

Results and Discussion
To validate the effectiveness of RS_SVM, we perform experiments on the eight real gene expression datasets mentioned above. Three experiments are designed to validate the proposed method. In the first experiment, we computed the testing error of RS_SVM and peer methods, including single SVM, KNN (K-nearest neighbor), CART (classification and regression tree), Bagging, and AdaBoost, on the eight datasets. A comparison of RS_SVM with the state-of-the-art methods in the related literature is also given. The second experiment explored the influence of subspace size by presenting the fluctuation of training error and testing error. In addition, sensitivity and specificity are also obtained at different subspace sizes. The last experiment shows the effectiveness of gene selection based on the t-test. The code is written in R-2.15.2, and all the packages are downloaded from the official site (https://www.r-project.org/). Table 3 gives a detailed description of the functions, the relevant parameters, and the packages used in the experiments. Note that, since there is no training set and testing set partition on Colon Tumor, CNS, Ovarian, and DLBCL, we perform leave-one-out cross validation on these datasets.
Table 4 shows the testing error of RS_SVM and the other peer methods on the eight datasets. The testing error of each method is computed on the same dataset. To reduce the influence of randomness, values in Table 4 are the average of 50 iterations. RS_SVM performs best on five datasets, that is, Breast Cancer, Lung Cancer, Prostate, Ovarian, and DLBCL. It also achieves good results on the Leukemia dataset. The effect of aggregation is evident when comparing RS_SVM with single SVM, since the testing error of RS_SVM is lower on six datasets, and RS_SVM obtains the same result as single SVM on Colon Tumor. The only exception is CNS. For CNS, none of the methods performs well, which is probably due to the special distribution of the data. Table 5 shows the testing error of RS_SVM and the state-of-the-art methods in the literature.
In Table 5, the state-of-the-art methods are indexed by the first author in the literature, and "-" means that there are no corresponding results in the given literature. None of these methods is always the winner, since the distribution of, or correlation between, gene features differs among datasets. Each method has its own perspective suited to certain gene patterns. RS_SVM achieved the lowest testing error on Breast Cancer and Prostate and also a relatively low testing error on the Leukemia, Lung Cancer, Ovarian, and DLBCL datasets, which implies good generalization performance. In spite of the good performance shown in Tables 4 and 5, an unsatisfactory outcome is revealed on Colon Tumor and CNS. A possible reason might be the heterogeneity phenomenon appearing in the two datasets [37], which means that the variability in gene expression level differs greatly between the categories. To visually describe the distribution, Figure 2 projects the high-dimensional data onto a two-dimensional space by Principal Component Analysis (PCA). The heterogeneity phenomenon is obvious in the Colon Tumor and CNS data. For CNS, the distribution of "Class 1" is relatively concentrated and that of "Class 0" is more dispersed. A similar case occurs on Colon Tumor. This suggests that RS_SVM is not suitable for heterogeneous data.

Influence of Subspace Size.
Figure 3 shows training error and testing error with respect to subspace size. Breast Cancer, Leukemia, Lung Cancer, Ovarian, and DLBCL share a similar curve trend. Initially, both training error and testing error are high when the subspace size is small, which indicates that underfitting exists. As the subspace size increases, both errors converge to nearly zero and underfitting fades away. However, the convergence rate differs among datasets. The Ovarian data converge much more slowly than the other four datasets: the errors on Ovarian are not near zero until the subspace size is almost 800.
For Colon Tumor, when the training error is near zero, there is a small gap between training and testing errors. This indicates that slight overfitting exists. More severe overfitting exists on CNS, because there is an obviously large gap between training error and testing error when the training error is converging to zero. This severe overfitting may explain the high testing error of RS_SVM on CNS in Tables 4 and 5.
For the Prostate dataset, there is little variation in training error as the subspace size increases. Testing error, however, fluctuates dramatically, especially when the subspace size changes from 90 to 116. Within this interval, the testing error first drops, reaching its minimum when the subspace size is set to 100, then rises sharply, and finally tends to be steady. This phenomenon may be due to the great difference between the distributions of the training and testing sets. As shown in Figure 4, tumor samples mainly concentrate in the bottom left in the training set, while they are dispersed along the left in the testing set. This indicates that the model generated on the training set may not fit the testing set well.
Figure 5 presents sensitivity and specificity with respect to subspace size. Sensitivity shows the ability to detect positives, while specificity is the ability to reject negatives. To some extent, there is a trade-off between sensitivity and specificity, and the best subspace size is a compromise between the two. For Breast Cancer, Leukemia, Lung Cancer, Ovarian, and DLBCL, both sensitivity and specificity are high, which coincides with the low testing errors in Tables 4 and 5. Even though the two curves of Colon Tumor are relatively steady, their overall level is not high. The CNS dataset cannot achieve both high sensitivity and high specificity, since when one rises, the other drops. The characteristic of the Prostate dataset is also reflected in Figure 5. The sensitivity curve of Prostate rises rapidly and then remains steady, but the specificity curve drops sharply when the subspace size passes the optimal value, which indicates that, as the subspace size increases, more and more tumor samples are predicted falsely.
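For reference, the two measures plotted in Figure 5 can be computed from true and predicted labels as in the Python sketch below. It uses the standard definitions, sensitivity = TP/(TP + FN) and specificity = TN/(TN + FP); the function name and the choice of which label counts as positive are illustrative assumptions, since the labelling convention differs between datasets.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    Any label other than `positive` is treated as negative.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # If a class is absent from y_true, the corresponding measure is undefined.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
    print(sensitivity_specificity(y_true, y_pred))
```

Evaluating this function on the predictions obtained at each candidate subspace size would give one point per curve of the kind described above.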

Validation of Gene Selection by t-Test.
The above experiments are performed on the datasets after gene selection via t-test, which is designed to reduce dimensionality and eliminate noise. In order to validate the effect of gene selection, we carry out experiments on the datasets both with and without gene selection. Table 6 gives the testing error of RS_SVM on the eight datasets. For the sake of comparison, the parameters in the two cases are identical. The subspace size is set to the optimal value given in Table 2. The results show that gene selection clearly improves classification performance by reducing testing errors.

Conclusions
This work proposed a cancer classification method, termed RS_SVM, to analyze gene expression profiles. The robustness of SVM relies on the strong fundamentals of statistical learning theory, and the technique can be extended to nonlinear discrimination by embedding the data in a nonlinear space using kernel functions. In pattern recognition systems, no single model exists for all pattern recognition problems and no single technique is applicable to all problems. Ensemble learning integrates several models for the same problem. Random subspace is one of the ensemble learning methods and is suitable for high-dimensional data. For high-dimensional gene expression data, only a small fraction of all genes is effective in performing a certain diagnostic test. Hence, gene expression data analysis is confronted with enormous challenges owing to its characteristics, such as high dimensionality, small sample size, and low signal-to-noise ratio. RS_SVM takes advantage of both random subspaces and SVM to handle the high-dimension and small-sample problem in gene expression data, after obtaining the significant features through t-test, which can be regarded as prior knowledge that reduces the computational burden. Experimental results on eight real gene expression profiles show that RS_SVM outperforms single SVM, KNN, CART, Bagging, AdaBoost, and 16 state-of-the-art methods in the literature. We also applied PCA to the two gene expression profiles on which the experimental results were unsatisfactory, in order to probe the cause. The analysis suggests that RS_SVM is not suitable for heterogeneous data.
In RS_SVM, the optimal values of subspace size and subspace number were obtained empirically, which was arduous and time-consuming. How to address this problem is still an open issue. We have collected next-generation sequencing gene expression data from TCGA and will continue this research on the new data.