Metabolomic data analysis becomes increasingly challenging when dealing with clinical samples of diverse demographic and genetic backgrounds and various pathological conditions or treatments. Although many classification tools, such as projection to latent structures (PLS), support vector machine (SVM), linear discriminant analysis (LDA), and random forest (RF), have been used successfully in metabolomics, their strengths and limitations in clinical data analysis remain unclear to researchers because these tools have not been evaluated systematically. In this paper we comparatively evaluated the four classifiers, PLS, SVM, LDA, and RF, in the analysis of clinical metabolomic data, acquired on a gas chromatography mass spectrometry platform, of healthy subjects and patients diagnosed with colorectal cancer, where cross-validation,
Metabolomics [
To date, the most widely used classification methods in metabolomic data processing include principal component analysis (PCA), projection to latent structures (PLS) analysis, support vector machine (SVM), linear discriminant analysis (LDA), and univariate statistical analyses such as Student's t-test
A machine learning method, random forest (RF), is reported to be an excellent classifier with the following advantages: a simple underlying theory, fast speed, stability and insensitivity to noise, little or no overfitting, and an automatic compensation mechanism for unbalanced group sizes [
In this research, RF was used in the analysis of a GC-MS-derived clinical metabolomic dataset. Its classification and biomarker selection performances were compared comprehensively with those of PLS, LDA, and SVM. The score plot based on cross-validation was used to evaluate classification accuracy. Cross-validation and receiver operating characteristic (ROC) curve analysis were carried out to test prediction ability and stability. The
Colorectal cancer (CRC) is one of the most common types of cancer and among the leading causes of cancer death in the world [
Sample information.
| Data set | CRC | | |
|---|---|---|---|
| Sample type | urine | | |
| Group | Normal | CRC (preoperation) | Postoperation |
| Number | 65 | 67 | 63 |
| Age (mean (minimum, maximum)) | 55 (38, 74) | 59 (40, 76) | 60 (40, 77) |
| Gender (male : female) | 23 : 40 | 35 : 28 | 36 : 24 |
| Dimension (sample × variable) | Case A (Normal versus CRC): 132 × 187 | Case B (Pre versus Post): 130 × 187 | |
The acquired MS data were pretreated and processed according to our previously published protocols [
Random forest (RF), developed by Breiman [
RF includes two methods for measuring the importance of a variable, that is, how much it contributes to predictive accuracy. The default, and the method used in this study, is the Gini score, which sums the decrease in Gini impurity produced by splits on a given variable over all trees in the forest. The alternative is permutation importance: for any variable, the measure is the increase in prediction error when the values of that variable are permuted across the out-of-bag observations; this measure is computed for every tree, averaged over the entire ensemble, and divided by the standard deviation over the ensemble. In either case, the larger the score (here scaled to range from 1 to 100), the more important the variable.
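As a minimal sketch of Gini-based importance (scikit-learn and the synthetic data below are assumptions, not the authors' MATLAB implementation), the `feature_importances_` attribute of a random forest is the mean decrease in Gini impurity per variable, which can be rescaled to the 1–100 range used in this paper:

```python
# Sketch: Gini-based variable importance from a random forest (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(130, 20))                    # 130 samples x 20 "metabolites"
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)     # classes driven by variables 0 and 1

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# feature_importances_ is the mean decrease in Gini impurity per variable;
# rescale so the most important variable scores 100 and the least scores 1.
imp = rf.feature_importances_
gini_score = 1 + 99 * (imp - imp.min()) / (imp.max() - imp.min())
ranking = np.argsort(imp)[::-1]                   # variables, most important first
```

On these synthetic data the top-ranked variables should be the two that actually drive the class labels.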
Please refer to the appendices for the introduction of other classifiers (PLS, SVM, and LDA).
The classification performance of RF as well as PLS, LDA, and SVM can be evaluated and compared using several approaches: cross-validation,
Two types of cross-validations:
Consider
In the equations,
The criteria for classifier validity are as follows. (1) All the
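The two cross-validation schemes can be sketched as follows (a hedged illustration using scikit-learn utilities and a logistic-regression stand-in for any of the four classifiers; all data are synthetic):

```python
# Sketch: k-fold CV and repeated holdout CV error rates for a classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(132, 10))
y = (X[:, 0] > 0).astype(int)
clf = LogisticRegression()

# 7-fold cross-validation: error rate = 1 - mean accuracy over the folds.
cv7 = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)
err_kfold = 1 - cross_val_score(clf, X, y, cv=cv7).mean()

# 15% holdout, repeated 100 times: refit on 85%, test on the held-out 15%.
errs = []
for i in range(100):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.15,
                                          stratify=y, random_state=i)
    errs.append(1 - clf.fit(Xtr, ytr).score(Xte, yte))
err_holdout = float(np.mean(errs))
```

The spread of the 100 holdout error rates is what the box plots in the Results section summarize.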
ROC analysis is a classic method from signal detection theory and is now commonly used in clinical research [
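An ROC curve and its AUC can be computed directly from class labels and classifier scores; the snippet below is an illustrative sketch with hand-picked scores (an assumption, not data from this study):

```python
# Sketch: ROC curve and AUC from predicted class scores.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 1])                   # true classes
y_score = np.array([0.1, 0.3, 0.4, 0.6, 0.5, 0.7, 0.8, 0.9])   # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under the curve
```

Here one positive (score 0.5) is outscored by one negative (0.6), so 15 of the 16 positive/negative pairs are ordered correctly and the AUC is 15/16 = 0.9375.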
Generally, too many irrelevant variables are liable to cause overfitting, whereas differences between groups cannot be extracted and depicted completely if crucial variables are excluded [
To avoid bias, it is advisable to rank variables and eliminate them one by one. Initially, the classifier is computed on the whole dataset. Then a list of variables in descending order of classification importance is established, and the variable at the end of the list is eliminated before the subsequent analysis. This process is repeated until only one variable is left for classifier building. The last few surviving variables have great potential to be biomarkers for separating the groups.
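The one-by-one elimination loop above can be sketched as follows (a hedged illustration with a random forest and synthetic data; the authors' MATLAB code is not reproduced here):

```python
# Sketch: backward elimination, dropping the least important variable each round.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

remaining = list(range(X.shape[1]))
elimination_order = []                       # first eliminated = least important
while len(remaining) > 1:
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X[:, remaining], y)
    worst = remaining[int(np.argmin(rf.feature_importances_))]
    elimination_order.append(worst)
    remaining.remove(worst)                  # refit on the reduced set next round

survivor = remaining[0]  # strongest biomarker candidate on this synthetic data
```

Refitting after each removal is what distinguishes this recursive procedure from a single static ranking.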
Prediction ability and stability, overfitting, diagnostic potential, and variable number dependence are important aspects of a classifier. Variable ranking and biomarker selection are of equal importance in metabolomics studies.
For RF, variables are ranked by Gini score, a measurement of average accuracy of all trees containing a particular variable [
As each classifier possesses its own algorithm for variable importance ranking with its own strength and weakness, the Pearson correlation coefficient of every two ranks was used to evaluate their consistency and the rank of
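The consistency of two importance rankings can be quantified as the Pearson correlation of the rank positions; a minimal sketch with two hypothetical rank lists (the values below are illustrative, not the study's data):

```python
# Sketch: Pearson correlation between two variable-importance rank lists.
import numpy as np

# Hypothetical rank positions of the same 6 variables under two classifiers.
rank_pls = np.array([1, 2, 3, 4, 5, 6])
rank_rf  = np.array([2, 1, 3, 5, 4, 6])

r = np.corrcoef(rank_pls, rank_rf)[0, 1]   # Pearson correlation of the ranks
```

A coefficient near 1 means the two classifiers largely agree on which variables matter; values near 0 mean the rankings are unrelated.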
All the metabolites were identified and verified by public libraries such as HMDB, KEGG, and/or reference standards available in our laboratory.
All the classifiers and evaluation methods were implemented using MATLAB (Version 2009a, MathWorks).
RF as well as PLS, LDA, and SVM were applied on the dataset for the two comparative cases (Figures
Classification score plots of RF, PLS, LDA, and SVM on cases (a) and (b) based on urinary metabolomic data derived from GC-MS.
The accuracy of classification is crucial for a classifier, while other behaviors such as prediction ability, stability, degree of overfitting, and diagnostic ability are equally significant.
The holdout cross-validation results (33% holdout samples, 100 times) of RF (purple), PLS (blue), LDA (red), and SVM (green) on the two cases are presented as box plots (Figure
Averaged error rates and their standard deviations of RF, PLS, LDA, and SVM on 2 cases by 7- and 10-fold cross-validation as well as 10% and 15% hold out cross-validation (100 times).
| Case | Evaluation item | RF error rate (mean) | PLS error rate (mean) | SVM error rate (mean) | LDA error rate (mean) |
|---|---|---|---|---|---|
| (A) Normal versus CRC | 7-fold CV | 0.071 | 0.134 | 0.148 | 0.227 |
| | 10-fold CV | 0.069 | 0.094 | 0.126 | 0.188 |
| | 15% holdout CV | 0.065 | 0.132 | 0.117 | 0.189 |
| | 10% holdout CV | 0.065 | 0.121 | 0.113 | 0.181 |
| (B) Pre versus post | 7-fold CV | 0.102 | 0.130 | 0.170 | 0.108 |
| | 10-fold CV | 0.096 | 0.169 | 0.163 | 0.096 |
| | 15% holdout CV | 0.088 | 0.137 | 0.186 | 0.127 |
| | 10% holdout CV | 0.083 | 0.145 | 0.161 | 0.114 |
Box plots of holdout cross-validation error rates (
Figures
The ROC curve coupled with its area under the curve (AUC) is a common method used to estimate the diagnostic potential of a classifier in clinical applications. A larger AUC indicates higher prediction ability. The ROC curves and AUC values of all the classifiers in the two cases are plotted in Figure
Receiver operating characteristic curves of 4 classifiers on 2 cases. Receiver operating characteristic curve and area under curve (AUC) of PLS (blue), LDA (brown), SVM (green), and RF (purple) on (a) Normal versus CRC and (b) pre versus post.
Figures
Variable dependence plots of 4 classifiers on 2 cases. Error rate (
Normal versus CRC
Pre versus post
The variable number dependence analysis evaluates whether, and to what extent, the performance of RF depends on the number of variables involved; it also evaluates RF's capability for important variable (putative biomarker) selection. The Pearson correlation matrixes of the ranks from every two classifiers (including
Pearson correlation coefficient matrixes of rank lists by
| Method | | PLSRank^b | RFRank^c | SVMRank^d | LDARank^e |
|---|---|---|---|---|---|
| *Pearson correlation coefficient matrix based on all variables* | | | | | |
| (A) Normal versus CRC | | | | | |
| | 1.000 | 0.794^f | 0.575 | 0.327 | 0.342 |
| PLSRank | 0.794 | 1.000 | 0.574 | 0.328 | 0.342 |
| RFRank | 0.575 | 0.574 | 1.000 | 0.210 | 0.256 |
| SVMRank | 0.327 | 0.328 | 0.210 | 1.000 | 0.167 |
| LDARank | 0.342 | 0.342 | 0.256 | 0.167 | 1.000 |
| (B) Pre versus post | | | | | |
| | 1.000 | 0.232 | 0.217 | 0.021 | 0.032 |
| PLSRank | 0.232 | 1.000 | 0.652 | 0.066 | 0.066 |
| RFRank | 0.217 | 0.652 | 1.000 | 0.086 | 0.057 |
| SVMRank | 0.021 | 0.066 | 0.086 | 1.000 | 0.007 |
| LDARank | 0.032 | 0.066 | 0.057 | 0.007 | 1.000 |
| *Pearson correlation coefficient matrix based on identified metabolites* | | | | | |
| (C) Normal versus CRC | | | | | |
| | 1.000 | 0.753 | 0.754 | 0.364 | 0.340 |
| PLSRank | 0.753 | 1.000 | 0.756 | 0.267 | 0.340 |
| RFRank | 0.754 | 0.756 | 1.000 | 0.495 | 0.308 |
| SVMRank | 0.364 | 0.267 | 0.495 | 1.000 | 0.190 |
| LDARank | 0.340 | 0.340 | 0.308 | 0.190 | 1.000 |
| (D) Pre versus post | | | | | |
| | 1.000 | 0.272 | 0.258 | 0.194 | 0.187 |
| PLSRank | 0.272 | 1.000 | 0.733 | 0.048 | 0.044 |
| RFRank | 0.258 | 0.733 | 1.000 | 0.034 | 0.041 |
| SVMRank | 0.194 | 0.048 | 0.034 | 1.000 | 0.187 |
| LDARank | 0.187 | 0.044 | 0.041 | 0.187 | 1.000 |
bvariable rank by the VIP value of PLS.
cvariable rank by the Gini value of RF.
dvariable rank by the SVM-RFE.
evariable rank by the LDA coefficient.
fPearson correlation coefficient of PLS and
Pearson correlation values between ranks of
Pearson correlation
Pearson correlation
Interestingly, in Table
Box plots of significant metabolites selected by RF (case A) only.
In this study, RF was applied successfully to metabolomic data analysis for clinical phenotype discrimination and biomarker selection. Its various performances were evaluated and compared with those of the other three classifiers, PLS, SVM, and LDA, by two types of cross-validation,
The combined use of multiple methods, RF,
The basic objective of PLS is to find the linear (or polynomial) relationship between the predictor block X and the response block Y that serves two goals at once: to approximate the X and Y data spaces well, and to maximize the correlation between X and Y.
The PLS model accomplishing these objectives can be expressed as
The model will iteratively compute one component at a time, that is: one vector derived from
The formula to calculate
The
The key to the success of SVM is the kernel function, which maps the data from the original space into a high-dimensional (possibly infinite-dimensional) feature space. By constructing a linear boundary in the feature space, the SVM produces nonlinear boundaries in the original space. Given a training sample, a maximum-margin hyperplane splits the sample in such a way that the distance from the closest cases (the support vectors) to the hyperplane (the margin) is maximized.
SVM recursive feature elimination (SVM-RFE) is a wrapper approach that uses the norm of the weights as the criterion for ranking variables.
A linear kernel was used for SVM classification and feature selection. This kernel was chosen to reduce computational complexity and to eliminate the need to retune kernel parameters for every new subset of variables. Another important advantage of a linear kernel is that the weight vector can be computed explicitly, so variables can be ranked directly by the magnitudes of their weights.
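The weight-based ranking at the heart of SVM-RFE can be sketched as follows (scikit-learn and synthetic data are assumptions; a full SVM-RFE would repeat this ranking while eliminating the lowest-weighted variable each round):

```python
# Sketch: ranking variables by the weights of a linear-kernel SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 6))
y = (2 * X[:, 0] + X[:, 1] > 0).astype(int)   # boundary driven by variables 0 and 1

svm = SVC(kernel="linear").fit(X, y)
w = svm.coef_.ravel()                  # weight vector of the separating hyperplane
ranking = np.argsort(np.abs(w))[::-1]  # largest |w_j| = most influential variable
```

With a linear kernel, `coef_` is available in closed form, which is exactly the advantage noted above; for nonlinear kernels no such explicit weight vector exists.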
LDA adopts a linear combination of variables (
Unlike PLS, which looks for linear combinations of variables that best explain both the data sets
The “score” of
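A minimal LDA sketch (scikit-learn and synthetic data are assumptions, not the authors' implementation), showing both classification and the coefficient-based variable ranking referred to in the footnotes of the correlation table:

```python
# Sketch: LDA classification and coefficient-based variable ranking.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
X = rng.normal(size=(90, 5))
y = (X[:, 0] > 0).astype(int)              # classes separated along variable 0

lda = LinearDiscriminantAnalysis().fit(X, y)
coef = lda.coef_.ravel()                   # linear discriminant coefficients
ranking = np.argsort(np.abs(coef))[::-1]   # rank variables by |coefficient|
accuracy = float(lda.score(X, y))
```

Projecting samples onto the discriminant direction defined by `coef` yields the one-dimensional LDA scores used in the score plots.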
GC-MS: Gas chromatography mass spectrometry
RF: Random forest
LDA: Linear discriminant analysis
SVM: Support vector machine
PCA: Principal component analysis
PLS: Projection to latent structures
ROC: Receiver operating characteristic
CRC: Colorectal cancer
NMR: Nuclear magnetic resonance
MS: Mass spectrometry.
This work was financially supported by the National Basic Research Program of China (2007CB914700), National Natural Science Foundation of China Program (81170760), and the Natural Science Foundation of Shanghai, China (10ZR1414800).