Random Forest in Clinical Metabolomics for Phenotypic Discrimination and Biomarker Selection

Metabolomic data analysis becomes increasingly challenging when dealing with clinical samples with diverse demographic and genetic backgrounds and various pathological conditions or treatments. Although many classification tools, such as projection to latent structures (PLS), support vector machine (SVM), linear discriminant analysis (LDA), and random forest (RF), have been successfully used in metabolomics, their performance including strengths and limitations in clinical data analysis has not been clear to researchers due to the lack of systematic evaluation of these tools. In this paper we comparatively evaluated the four classifiers, PLS, SVM, LDA, and RF, in the analysis of clinical metabolomic data derived from gas chromatography mass spectrometry platform of healthy subjects and patients diagnosed with colorectal cancer, where cross-validation, R 2/Q 2 plot, receiver operating characteristic curve, variable reduction, and Pearson correlation were performed. RF outperforms the other three classifiers in the given clinical data sets, highlighting its comparative advantages as a suitable classification and biomarker selection tool for clinical metabolomic data analysis.


Introduction
Metabolomics [1] or metabonomics [2,3] is an emergingomics approach using nuclear magnetic resonance (NMR) spectroscopy or gas chromatography/liquid chromatography-mass spectrometry (GC-MS or LC-MS) technologies. It constitutes a �eld of science that deals with the measurement of metabolite variations in a biological compartment for the study of the physiological processes in response to xenobiotic interventions that is complementary to organ-speci�c biochemical and histological �ndings. rough the analysis of one or several kinds of bio�uids including serum, urine, saliva, and tissue samples, the global and dynamic alterations in metabolism can be deciphered [4]. erefore, metabolomics has been increasingly used in many applications such as identifying metabolite markers for clinical diagnosis and prognosis [5], monitoring the chemical-induced toxicity [6], exploring the potential mechanism of diverse diseases [7], and assessing therapeutic effects of treatment modalities [8,9]. Univariate and/or multivariate statistical methods are routinely used in metabolomics studies, aiming at successful classi�cation of samples with metabolic phenotypic variations and identi�cation of potential biomarkers while minimizing the technical variations.
To date, the most widely used classi�cation methods in metabolomic data processing include principal component analysis (PCA), projection to latent structures (PLS) analysis, support vector machine (SVM), Linear discriminant analysis (LDA), and univariate statistical analysis such as Student's t-test and analysis of variance (ANOVA) test [10,11]. We recently applied some of these methods in combination to identify metabolite-based biomarkers in hepatocellular carcinoma [5], gastric cardia cancer [12], knee osteoarthritis [13], oral cancer [14], and schizophrenia [7]. Nevertheless, more effective and robust bioinformatics tools are in critical need for metabolomic data analysis especially when dealing with clinical samples with large individual variability due to diverse demographic and genetic background of patients and various pathological conditions or treatments.
A machine learning method, random forest (RF), is reported as an excellent classi�er with the following advantages: simple theory, fast speed, stable and insensitive to noise, little or no over�tting, and automatic compensation mechanism on biased sample numbers of groups [15]. RF has been widely used in microarray [16][17][18] and single nucleotide polymorphism (SNP) [19] data analysis achieving good performance. However, in the �eld of clinical metabolomic data analysis, it has not got enough attention and concern. In addition, no comprehensive performance evaluation about this classi�er is reported.
In this research, RF was used in the analysis of a GC-MS derived clinical metabolomic dataset. Its classi�cation and biomarker selection performances were compared with PLS, LDA, and SVM comprehensively. e score plot based on cross validation was used for classi�cation accuracy evaluation. e cross-validation and ROC (receiver operating characteristic) curve were carried out to test their prediction ability and stability. e 2 / 2 plot was adopted for over�tting measurement. Variable number dependence of the � classi�ers was explored by eliminating variables step by step. Besides these classi�cation performances, the variable ranking and putative biomarker selection power of RF was examined as well by Pearson correlation.

Metabolomic Data Set.
Colorectal cancer (CRC) is one of the common types of cancer and the leading causes of cancer death in the world [20]. Urinary samples of 67 CRC patients (67 preoperation samples and 63 matched postoperation ones) and 65 healthy volunteers were collected from the Cancer Hospital affiliated to Fudan University (Shanghai, China). Patients enrolled in this study were not on any medication before preoperative sample collection. e postoperative samples were collected on the 7 day aer surgery. Sample collection protocol was approved by the Cancer Hospital Institutional Review Board and written consents were signed by all participants prior to the study. e healthy volunteers were selected by a routine physical examination, and any subjects with in�ammatory conditions or gastrointestinal tract disorders were excluded. Other background information such as diet and alcohol consumption was considered during sample selection to minimize the dietinduced metabolic variations. All the samples were collected in the morning before breakfast, immediately centrifuged, and stored at −80 ∘ C until analysis. Clinical characteristics of all the samples in this study are provided in Table 1. All the samples were chemically derivatized and subsequently analyzed by GC-MS following our previously published procedures [21].
e acquired MS data were pretreated and processed according to our previously published protocols [5,7]. A total of 187 variables (areas of peaks, denoting concentrations of metabolites), 35 metabolites were obtained from the spectral data analysis. Normalization (to the total intensity to compensate for the overall variability during sample extraction, injection, detection, and disparity of urine volume), mean centering, and unit variance scaling of the data sets were performed prior to statistical analysis. Finally, the data set contains 187 variables and 195 samples. Two cases: (a) Normal versus CRC patients (preoperative) and (b) Preoperative versus postoperative patients were involved for analysis.

Random
Forest. Random forest (RF), developed by Breiman [22], is a combination of tree-structured predictors (decision trees). Each tree is constructed via a tree classi�cation algorithm and casts a unit vote for the most popular class based on a bootstrap sampling (random sampling with replacement) of the data. e simplest random forest with random features is formed by selecting randomly, at each node, a small group of input variables to split on. e size of the group is �xed throughout the process of growing the forest. Each tree is grown by using the CART (classi�cation and regression tree) methodology without pruning. e tree number of the forest in this study is set to be 200, the number of input variables tried for each node is the square root of the number of total variables, and the minimum size of the terminal nodes is set to be 2. e "score" of RF is the scaled sum of votes derived from the trained trees for out-of-bag samples.
RF includes two methods for measuring the importance of a variable or how much it contributes to predictive accuracy. e default method is the Gini score (the method of this study). For any variable, the measure is the increase in prediction error if the values of that variable are permuted across the out-of-bag observations. is measure is computed for every tree, then averaged over the entire ensemble, and divided by the standard deviation over the entire ensemble. erefore, the larger the Gini score is (ranges from 1 to 100), the more important a variable is.
Please refer to the appendices for the introduction of other classi�ers (PLS, SVM, and LDA).

��al�ation o� Classi�cation
Per�ormance. e classi�cation performance of RF as well as PLS, LDA, and SVM can be evaluated and compared using several approaches: cross-validation, 2 / 2 plot, ROC, and reduction of variable number.

Cross-Validation: Prediction
Ability and Stability. Two types of cross-validations: k-fold and hold out were employed to estimate the prediction ability with low bias and low variance. (1) In the k-fold cross-validation, the training set was �rst divided into k subsets (the folds) of approximately equal size. Sequentially each subset was tested using the classi�er trained on the remaining k−1 subsets, where k was set to be 7 and 10 in this study. (2) Holdout cross-validation is similar to k-fold cross-validation except for the repeatedly (100 times) random selection of the two mutually exclusive training and testing (holdout) subsets in accordance with a given ratio. is method was used with an understanding that the more instances le for the holdout set, the higher the bias of the estimate would be. On the other hand, fewer holdout set instances mean that the con�dence interval for the accuracy would be wider. Besides accuracy and prediction ability, the repeated holdout cross-validation was used to test the stability of a classi�er. e holdout ratios were set as 10%, 15%, and 33%, respectively, on all the classi�ers.

R
(1) In the equations, represents total number of samples, is the predicted class (0 or 1) of the th sample when all the samples are used for model building, is the actual class, is the average of the predicted class, and ( is the predicted class when all the samples except the th sample are used for model building (leave one out cross-validation).
e criteria for classi�er validity are as follows. (1) All the 2 and 2 values on the permuted data set are lower than those on the actual data set. (2) e regression line of 2 (line joining the actual 2 point to the centroid of the cluster of permuted 2 values) has a negative intercept value on the vertical axis.

Receiver Operating Characteristic (ROC): Diagnosis
Potential. ROC analysis is a classic method from signal detection theory and is now commonly used in clinical research [23]. ROC of a classi�er shows its performance as a tradeoff between speci�city and sensitivity. Sensitivity is de�ned as the proportion of subjects with disease whose tests is positive, and calculated by the formula, TruePositive/(TruePositive+FalseNegative). Speci�city, on the other hand, is de�ned as the proportion of subjects without disease whose tests is negative, and calculated in the formula, TrueNegative/(TrueNegative+FalsePositive). Typically, 1-speci�city is plotted on the -axis and sensitivity is plotted on the vertical axis. All the predictive behavior of a classi�er can be represented by the points in the ROC curve independent of class distributions or error cost [23]. e area under the ROC curve (AUC) is a statistic summary of its diagnostic potential.

Variable Number
Dependence. Generally, too many irrelevant variables are liable to result in over�tting decisions, whereas differences between groups cannot be extracted and depicted completely if crucial variables are not concerned [24]. Variable number dependence is therefore a necessary factor for classi�er performance evaluation.
To avoid bias, it is advisable to rank and eliminate variables one by one. Initially, the whole dataset is taken when a classi�er is computed. en, a list of variables in descending order relative to classi�cation importance is established and the variable in the end is eliminated for subsequent analysis. is process is repeated until only one variable is le for classi�er building. e last few variables are of great potential to be biomarkers for separating the groups.

Evaluation of Biomarker Selection
Performance. Prediction ability and stability, over�tting, diagnosis potential, and variable number dependence are important aspects for a classi�er. Variable ranking and biomarker selection is of the same importance in metabolomics study.
For RF, variables are ranked by Gini score, a measurement of average accuracy of all trees containing a particular variable [22]. For PLS, the conventional VIP (variable importance in projection) value is used for variable ranking. For LDA, the coefficients of variables in the discriminant function indicate their importance. As to SVM, the importance of variables is evaluated by the SVM recursive feature elimination (SVM-RFE) algorithm [25].
As each classi�er possesses its own algorithm for variable importance ranking with its own strength and weakness, the Pearson correlation coefficient of every two ranks was used to evaluate their consistency and the rank of t-test (by ascending order of variable values) was taken as an unbiased reference. e consistency comparison was conducted on two levels: ranks of all variables and ranks of identi�ed metabolites.
All the metabolites were identi�ed and veri�ed by public libraries such as HMDB, KEGG, and/or reference standards available in our laboratory.
All the classi�ers andevaluation methods were carried out using Matlab toolbox (Version 2009a, Mathworks).  cases (Figures 1(a) and 1(b)). In Figure 1(a), red and blue dots represent the normal and CRC patients, respectively. -axis is the sample index and -axis is the corresponding "score" of every classi�er. RF achieved the best separation with no miss-classi�ed samples (the accuracy is 100%). e performance of SVM, LDA, and PLS was good, with descending accuracy of 90.9%, 87.1%, and 82.6%, respectively. Similarly, Figure 1(b) shows the separation between CRC preoperative patients (red dots) and CRC postoperative patients (blue dots). RF yielded 95.4%, a higher classi�cation accuracy than that of LDA, SVM, and PLS, which achieved 90.8%, 80.8%, and 80.8%, respectively.

Results and Discussion
�.�. ����i�tion Ability� �tability� �v���ttin�� an� �ia�no�ti� Ability Evaluation. e accuracy of classi�cation is crucial for a classi�er, while other classi�cation behaviors such as prediction ability, stability, degree of over�tting and diagnostic ability are of equal signi�cance as well. e holdout cross-validation results (33% holdout samples, 100 times) of RF (purple), PLS (blue), LDA (red), and SVM (green) on the two cases are presented as box plots (Figure 2). e -axis denotes the error rate (the smaller, the better). e purple box of RF is always the lowest and shortest in both cases. As to the other three classi�ers, their performances are similar showing no signi�cant di�erence. ese results were validated further by more cross-validation results listed in Table 2. As expected, the average error rates and standard deviations (S.D.s) of (1) holdout crossvalidations on 10% and 15% samples; and (2) 7-and 10-fold cross-validations by PLS, SVM, and LDA are at almost the same level and are all greater than that of RF. erefore, RF is the one with highest prediction ability and best stability among all the classi�ers. Figures 3(a) and 3(b) display the correlation between the actual -variable and the permuted -variable ( -axis) versus the 2 / 2 values ( -axis) of RF (purple), PLS (blue), LDA (red), and SVM (green) on the two cases, respectively. A dot denotes 2 and a circle represents 2 . e straight line and dash dot line are the regression lines (lines linking the value from the actual data set and the 100 permuted ones) of 2 and 2 , respectively. Obviously, the 2 / 2 values of RF on the permuted data sets are all under zero and are well lower than those on the actual data set. Consequently, RF does not over �t the data and so will give out reliable result on new samples. As to SVM, its 2 values on the permuted data sets are slightly Evidence-Based Complementary and Alternative Medicine 5 T 2: Averaged error rates and their standard deviations of RF, PLS, LDA, and SVM on 2 cases by 7-and 10-fold cross-validation as well as 10% and 15% hold out cross-validation (100 times).  lower than or close to that on actual data set, although most of its values on the permuted data sets are under zero and lower than those on the actual data set. Hence, SVM is over�tted and false positive result is prone to appear. is probably caused by its dependence on kernel functions and support vectors.
e ROC curve coupled with its area under the curve (AUC) is a common method used to estimate the diagnosis potential of a classi�er in clinical applications. A larger AUC indicates higher prediction ability. e ROC curves and AUC values of all the classi�ers in the two cases are plotted in Figure 4. RF outperforms the others once more with the greatest AUC values (AUC > .97). Figures 5(a) and 5(b) show the classi�cation error rates ( -axis) against the number of variables ( -axis) involved in the two cases, respectively. It can be seen that with the decrease of variable number used for classi�er building, all the curves keep stable initially, and then rise gradually. Further reduction of variables degrades the classi�er performance heavily because of the shortage of useful information. e point (or short section) where the curve begins to rise correlates to the optimum number of variables for classi�er building. Additionally, RF usually needs fewer variables to achieve the same error rate as the other three classi�ers. In case (B), for example, when the error rate is restricted to be less than 0.18 (the red line), RF needs minimum 10 variables whereas PLS, LDA, and SVM require about 150, 45, and 125 ones at least.

Putative Biomarker Selection.
Variable number dependence section is to evaluate whether and how much the performance of RF depends on the number of variables involved. is section is to evaluate its capability on important variable (putative biomarker) selection. e Pearson correlation matrixes of ranks from every two classi�ers (including t-Test) based on all variables (A-B) and identi�ed metabolites (C-D) in the two cases are listed in Table 3. On the whole, RF, PLS and t-Test have good consistency with each other (high Pearson correlation coefficients) regardless of whether all variables (Figure 6(a)) or identi�ed metabolites (Figure 6(b)) are involved.
Interestingly, in Table 3, the highest and second highest correlation coefficients are 0.794 for PLS and t-Test (case A) and 0.756 for RF and PLS (case B) indicating the consistency and mutual complementarity of classi�ers. All the important metabolites selected by both t-Test ( . 5) and PLS (VIP > 1) could be �ltered by RF (�ini > 5 ) as well. Consistent with previous metabolomics study, dysregulated metabolic pathways, such as glycolysis, TCA cycle, urea cycle, pyrimidine metabolism, polyamine metabolism as well as gut microbial-host cometabolism were observed [26]. Additionally, more signi�cant metabolites in the above pathways could be found by RF (case A) providing complementary information for CRC study. Figure 7 presents some box plots of intensity in corresponding groups.

Conclusion
In this study, RF was applied successfully in metabolomic data analysis for clinical phenotype discrimination and biomarker selection. Its various performances were evaluated and compared with the other three classi�ers PLS, SVM, and LDA by two types of cross-validations, 2 / 2 plot, ROC curve, variable elimination, and Pearson correlation. RF demonstrated the best overall performance including accuracy, prediction ability, over�tting, diagnosis potential, stability, and putative biomarker selection, highlighting its potential as an excellent classi�cation and biomarker selection tool for clinical metabolomic data analysis. PLS outperforms the others in variable ranking and putative biomarker selection whereas its classi�cation ability is not satisfactory enough. LDA demonstrates reasonably good performance in classi�cation but its biomarker selection ability is open to �uestion. SVM may be slightly over�tting regardless of its good classi�cation accuracy and diagnosis potential.
e combinational usage of multiple methods, RF, t-Test, and PLS, for example, may provide more comprehensive information for a "global" understanding of the metabolomics or other "omics" data.

Appendices
A. Projection to Latent Structures (PLS) e basic object of PLS is to �nd the linear (or polynomial) relationship between the superior variable (a vector indicating sample groups) and the dataset (variables). e modeling consists of simultaneous projections of both the and spaces on low dimensional hyper planes. e coordinates of the points (projection of and ) on these hyper planes constitute the elements of the matrix ( scores), ( scores), (loadings), and (weights). e analysis has the following objectives. e model will iteratively compute one component at a time, that is: one vector derived from -scores ,scores , weights (or ), and loadings . e component extraction process will stop when the predictive accuracy obtained in 7-fold cross-validation ( 2 value) begin to descend. All the components are calculated in descending order of importance. e "score" of PLS is the score of the �rst component which contains most information of dataset.
e formula to calculate VIP (variable importance) can be expressed as VIP = = 2 * * , where is the number of components extracted and is the number of variables in dataset . is the correlation   B. Support Vector Machine (SVM) e key to the success of SVM is the kernel function which maps the data from the original space into a high dimensional (possibly in�nite dimensional) feature space. By constructing a linear boundary in the feature space, the SVM produces nonlinear boundaries in the original space. Given a training sample, a maximum-margin hyper plane splits a given training sample in such a way that the distance from the closest cases (support vectors) to the hyper plane ( ) is maximized where is the weight matrix, is the dataset, and is a constant term. e "score" of SVM is computed by Score . SVM Recursive Feature elimination (SVM-RFE) is a wrapper approach which uses the norm of the weights to rank the variables (the larger the norm of is, the more important the corresponding variable is). Initially whole of the data is taken and a classi�er is computed. e norm of is computed for each of the features and the feature with the smallest norm is eliminated. is process is repeated till all the features are ranked.
Linear kernel was used for SVM classi�cation and feature selection. is kernel was chosen to reduce the computational complexity and eliminate the need for retuning kernel parameters for every new subset of variables. Another important advantage of choosing a linear kernel is that the norm of the weight can be directly used as a ranking criterion; however this is not possible in other kernels such as the radial basis function kernel or a sigmoid kernel.