Metabolomic Biomarker Identification in Presence of Outliers and Missing Values

Metabolomics is the sophisticated and high-throughput technology based on the entire set of metabolites which is known as the connector between genotypes and phenotypes. For any phenotypic changes, potential metabolite (biomarker) identification is very important because it provides diagnostic as well as prognostic markers and can help to develop new biomolecular therapy. Biomarker identification from metabolomics data analysis is hampered by the use of high-throughput technology that provides high dimensional data matrix which contains missing values as well as outliers. However, missing value imputation and outliers handling techniques play important role in identifying biomarker correctly. Although several missing value imputation techniques are available, outliers deteriorate the accuracy of imputation as well as the accuracy of biomarker identification. Therefore, in this paper we have proposed a new biomarker identification technique combining the groupwise robust singular value decomposition, t-test, and fold-change approach that can identify biomarkers more correctly from metabolomics dataset. We have also compared the performance of the proposed technique with those of other traditional techniques for biomarker identification using both simulated and real data analysis in absence and presence of outliers. Using our proposed method in hepatocellular carcinoma (HCC) dataset, we have also identified the four upregulated and two downregulated metabolites as potential metabolomic biomarkers for HCC disease.


Introduction
Biomarker discovery is comparatively new and one of the most dominating fields of biological research. Different disciplines are involved with the study of biomarkers: clinical and environmental chemistry, molecular biology, toxicology, food research, plant and animal biology, and so on. In general, biomarkers identification research is needed whenever a whole set/pool set of features are differentiated into two or more groups of samples. In the area of metabolomics, biomarkers are small molecules (metabolites) that can be used for early warning indicators of disease, observing disease progression, and predicting receptivity to treatment. Thus, biomarkers may be categorized into three most important groups: diagnostic, prognostic, and predictive markers [1]. Diagnostic markers are required for early and/or accurate diagnosis of disease to allow best possible treatment choices. On the other hand, prognostic markers give information about future route of a disease that would influence the treatment choices. Predictive markers provide sagacity on the potential responses of an individual to the different treatment options. Therefore, biomarker discovery is very important for the development of predictive, preventive, and personalized medicine [2].
For every disease or phenotypic changes, some metabolites are upregulated and/or some are downregulated from standard within a cell. The metabolites which are upregulated or downregulated between disease and control groups are 2 BioMed Research International known as differentially expressed (DE) metabolites. The classic approach to identify differentially expressed metabolites is Student's -test (for two classes of samples). If the pooled variance of two groups is small, then Student'stest often increases the false discovery (FDR) for metabolic biomarker identification. However, biological fold-change (FC) approach is used to control the FDR [3]. Recently, volcano plot [4] is using to identify differentially expressed metabolites based on both values from -test and foldchange (FC) values [5]. In most of the cases, we get a large number of DE metabolites; therefore, we need to identify the most influential feature/metabolites from the set of DE metabolites. The conventional approaches for ranking the influential metabolites are value and fold-change approach. However, Gevrey et al. in 2003 [6] computed the contribution of features using artificial neural network model and also Rakotomamonjy (2003) [7] and Ishak and Ghattas (2005) [8] suggested a new method for variable/feature selection on ranking score derived from support vector machine (SVM). Thus, in this paper, we have used the state of the art supervised learning technique SVM for ranking the most influential metabolites (biomarkers) according to the importance. However, the performance of true biomarker discovery is very much influenced by missing value imputation techniques [9] as well as outliers handling [10]. It is well known that mass spectrometry (MS) based metabolomics dataset frequently contains missing values [11][12][13] and often contains outliers [14] due to several reasons, like experimental, analytical, computational, and biological hazard. Although several missing imputation techniques are available for metabolomics data analysis like Zero imputation [11], mean imputation [11], median imputation [11], kNN imputation [15], RF imputation [16], missing value replaced by half of the minimum value found in each metabolite [17], replacing missing values by probabilistic principal component analysis (PPCA) [18], Bayesian PCA (BPCA) [9,19], multiple imputation with expectation maximization (EM) algorithm, and Monte Carlo Markov chain (MCMC) method [20], and so on, all the above methods can only solve the missing imputation problem. However, these missing imputation techniques are sensitive to outliers and cannot handle the outliers problem simultaneously because these methods did not implement any robust function or outliers detection and replacement algorithms directly. Moreover, the published outliers handling method does not deal with missing values. Therefore, in this paper we have developed a new robust technique for biomarker identification using the groupwise RSVD [21], -test, FC analysis, and SVM based feature selection approaches that can identify biomarker correctly by imputing missing values and solving outliers problem simultaneously.
To compare the performance of the proposed biomarker identification technique with those of the conventional techniques, we considered three well known traditional and recently used biomarker identification techniques (missing imputation techniques with -test and FC values): Zero imputation with -test and FC (Zero+ +FC), kNN imputation with with -test and FC (kNN + + FC), and RF imputation with -test and FC (RF + + FC).

Materials and Methods
In this paper, we have proposed a robust technique to identify metabolomic biomarker that can handle missing values as well as outliers problem at the same time. To investigate the performance of the proposed method, we considered existing three popular missing value imputation techniques, Zero imputation, kNN imputation, and RF imputation, and also used -test and fold-change approach for biomarker identification. In Zero imputation, all the missing values of a dataset were replaced by Zero. In kNN imputation [15], missing value was computed by averaging of nonmissing values of its first number of nearest neighbours. In R platform kNN imputation method can be found from the library "impute." Missing imputation algorithm through random forest [16] is a tree based regression and classification technique which is suitable for both parametric and nonparametric dataset [22]. In R-language this method can be found from the library "missForest." The short description of the proposed biomarker identification technique is given below.

Biomarker Identification Technique in Presence of Outliers and Missing Values (Proposed).
Let us consider a metabolomics data matrix = ( ), = 1, 2, . . . , and = 1, 2, . . . , , with metabolites and samples that contains missing values and outliers, where the rows of represent the different metabolites and the columns of represent the different samples. In metabolomics data analysis, we can talk about biomarker identification whenever a metabolite or a set of metabolites are differentially expressed between two groups (disease and control) of samples in a metabolomics dataset. In that case, the metabolomics dataset can be partitioned according to the groups of samples as where 1 is the number of subjects of group-1 and ( − 1 ) is the number of subjects of group-2. In this paper, we have proposed a new RSVD based metabolomic biomarker identification technique in presence of missing values and outliers. The RSVD of a data matrix can be written as = Λ , where is a column orthonormal matrix, is a row orthogonal matrix, and Λ is a diagonal matrix that contains the singular values. and contain the eigen vectors of and , respectively. If we want to approximate bỹusing rank then we can writẽ as̃= 1 1 V 1 + 2 2 V 2 + ⋅ ⋅ ⋅ + V . The number of is selected in such a way that first number of 's can explain at least (1 − )100% of total variation of data, where the value of depends on user interest. The steps of the metabolomic biomarker identification algorithm are given below.
Step 1. Partition the matrix as = ( 1 2 ) according to disease and control group of samples, where . (2) Step 2. Apply the RSVD for each partitioned matrix 1 , 2 and approximate those as̃1,̃2 by rank , where is selected in such a way that first r number of 's can explain at least (1 − )100% of total variation of data, where the value of depends on user interest; in our case we choose = 0.05.
Step 3. Reconstruct each partitioned data matrix aŝ1,̂2 by replacing the missing values and outlying cells of 1 , 2 using the corresponding values of the approximated data matrices 1 ,̃2, respectively, where outlier is detected by using outlier detection rules like interquartile range (IQR) rule [23].
Step 5. Compute the differentially expressed metabolite from the reconstructed metabolomics data matrix̂using value and FC value. The value can be calculated using -test. To declare the metabolite as differentially expressed, choose the threshold value using Bonferroni correction and absolute fold-change (FC) cut-offs of >2.
Step 6. Rank the differentially expressed metabolites according to their influence or importance using support vector machine (SVM) [24].
Step 7. List the first few top upregulated and downregulated metabolites from Step 6 that classify the samples with higher accuracy and declare those metabolites as biomarker. Upregulated and downregulated metabolites can be identified by the sign of fold-change values.
The R-code of the proposed method can be found in the following URL: http://www.statru.org/wp-content/uploads/ 2010/12/MetabBio.zip

Dataset Description.
Both simulated and real metabolomics datasets have been used to demonstrate the performance of the proposed method in a comparison of the other methods.

Simulated Dataset.
We have simulated metabolomics datasets using the one-way ANOVA model, defined by = + + ∈ , where is the intensity of th metabolite, th group, and th sample; denote the average intensity of metabolite ; is the th group effect for th metabolite and ∈ is the random error term. In this linear model, we have taken ∼uniform (4,8) and ∈ ∼ (0, 1). We have also taken two types of metabolites: (i) differentially expressed (DE) metabolites and (ii) equally expressed (EE) metabolites. DE metabolites are divided into two groups: upregulated metabolites and downregulated metabolites. Disease and control group effects for upregulated metabolites are = (2, 1) and = (0, 1); for downregulated metabolites disease and control group effects are = (0, 1) and = (2, 1) and in case of equally expressed metabolites the group effect for both cases is = (0, 1). To make a simulated metabolomics dataset, we have taken 500 metabolites (30 DE metabolites and 470 EE metabolites) and 80 samples (45 control and 35 disease). From 500 metabolites, the first 15 metabolites have been included as upregulated metabolites and 121 to 135 metabolites have been incorporated as downregulated metabolites for disease samples. We have also included different rates of missing values (10%, 15%, and 20%) in this dataset, where 50% missing were given at random and the rest of the missing were given for lower values. To measure the performance of the proposed method in different condition, we have incorporated different rates (3%, 5%, 7%, 10%, and 15%) of outliers in simulated dataset. The outliers in the th metabolite were taken from normal distribution with mean = 5 * and variance = 2 , where and 2 are the mean and variance of the th metabolite. We have distributed the outliers randomly by different rates (3%, 5%, 7%, 10%, and 15%) in the data matrix cell; therefore, outliers may fall anywhere in the data matrix. We simulated 100 datasets using the above conditions and also measured the performance of the proposed method using these simulated dataset.

Real Metabolomics Dataset.
In this paper, we have used a publicly available real metabolomics dataset on the metabolomic effect of hepatocellular carcinoma (HCC) that contains the abundance level measurements of metabolites from different subjects. This dataset was originally produced by Patterson et al. [25]. To extract the metabolomic profile from the sample, ultra-performance liquid chromatography coupled with electrospray ionization/quadrupole-time-offlight mass spectrometry (UPLC-ESI-Q-TOF-MS) was used and this dataset was normalized by Pareto scaling. This data matrix contains 1388 rows and 57 columns. Each row is a metabolite detected by the retention time (rt) and mass to charge ( / ) ratio that were included in first column and the second column, respectively. The remaining 55 columns were different subjects that came from two groups. 20 subjects were from the hepatocellular carcinoma (HCC) group and 35 subjects were from the control group. There were 26.52% missing values/cells in this dataset. More details about the data can be found in the article of Patterson et al. [25]. To measure the performance of the proposed method, we calculated the error between original and reconstructed data that we can understand how well the proposed method reconstructs the data. We also modified the real dataset by replacing 5% existing values as missing and changing 5%, 10%, and 15% real values by (5 * , 2 ), where and 2 are the mean and variance of the th metabolite in the HCC data matrix.

Performance Measurement Criteria.
To investigate the performance of the proposed method, we calculated the root mean square error (RMSE) between the original dataset and reconstructed dataset that we can easily identify better method to reconstruct the data. The formula of RMSE is where and̂are the original and reconstructed value of the th row and th column of the data matrix, respectively.
For simulated dataset, we also assess the performance of the proposed biomarker identification technique compared to the other techniques where we consider three performance indices: (i) ROC curve, (ii) area under the ROC curve (AUC), and (iii) misclassification error rate (MER). To calculate the above measures, first we identify the DE metabolites using four different techniques including the proposed technique. Using the above results, we draw ROC curve using true positive rate (TPR) and false positive rate (FPR). In a binary classification system if a positive instance is correctly classified as positive, it is called true positive (TP); however, if it is wrongly classified as negative then it is called false negative (FN). If a negative instance is correctly classified as negative, it is called true negative ( ROC curve plots contain the TPR in -axis and FPR in -axis.
In real metabolomics dataset, we do not know the class of metabolites; therefore, we measure the performance of the proposed biomarker identification method in a comparison of the other considering methods by sample classification using SVM classifier on the basis of the computed DE metabolites. Using the sample classification results, we calculate the performance indices: ROC curve, accuracy, sensitivity, specificity, positive predicted value, negative predicted value, and balanced accuracy.
A better method would produce smaller values for FPR and MER and larger values for TPR and area under the ROC curve (AUC) by any DE metabolite detector.

Results and Discussion
We have evaluated the performance of the proposed method using both simulated and real data (hepatocellular carcinoma) analysis results. The detailed description of the simulated and real data analysis results is discussed in Sections 3.1 and 3.2.

Simulated Data Analysis Results.
To investigate the performance of the proposed method for imputing missing values, we calculated the average RMSE between original and reconstructed data matrix using 100 simulated datasets with different rates (10%, 15%, and 20%) of missing values in both absence and presence of outliers. The average RMSE for different methods in both different rates (10%, 15%, and 20%) of missing values and presence of different rates (0%, 3%, 5%, 7%, 10%, and 15%) of outliers have been given in Table 1. Table 1 showed that in absence of outliers all the  methods except Zero imputation produce similar as well as smaller average RMSE between original and reconstructed data matrices; however, in presence of outliers, the proposed method gave the smaller average RMSE between original and reconstructed data matrices compared to other (Zero, kNN, and Zero imputation) techniques. Therefore, we could say that the proposed method reconstructs the simulated data very closely in presence of both different rates (0%, 3%, 7%, 10%, and 15%) of outliers and different rates (10%, 15%, and 20%) of missing values.
We also performed simulation studies to measure the performance of the four different techniques including our proposed technique for identifying DE and EE metabolites. To measure the performance, first we identified the DE and EE metabolites from the simulated dataset using four different techniques (Zero + + FC, kNN + + FC, RF + + FC, and Proposed) in both presence of 10%, 15%, and 20% missing values and different rates (0%, 5%, 10%, and 15%) of outliers. In case of simulated dataset the metabolites classes were known; therefore, using the given metabolites classes and predicted metabolites classes, we calculated the three different performance indices, ROC curve, AUC values, and MER, for each biomarker identification technique. We repeated the above computations for 100 simulated datasets and calculated the average of each performance measure and also presented them in Figures 1, 2, and 3 and Table 2. From Figures 1, 2, and 3, it is seen that the proposed biomarker identification technique produced larger average TPR regarding any point of average FPR compared to the other techniques. Also from Table 2, we can observe that the proposed biomarker identification technique gave larger average AUC values and smaller average MER compared to other (Zero + + FC, kNN + + FC, and RF + + FC) techniques in both absence and presence of outliers for different percentage (10%, 15%, and 20%) of missing datasets. Therefore, we could conclude that the proposed biomarker identification technique is better among the four techniques in both presence of missing values and different rates of outliers.

Real HCC Data Analysis Results.
To measure the performance of the proposed method in a comparison of the other  methods for imputing missing values in presence of outliers, we modified 100 real datasets by replacing 5% existing cell as missing and changing 5%, 10%, and 15% existing cell by (5 * , 2 ), where and 2 are the mean and variance of the th metabolite in the HCC data matrix and also calculated the average RMSE between real dataset and reconstructed datasets. The average RMSE for different missing imputation techniques including our proposed one have been given in Table 3. We can observe from Table 3 that the proposed technique has produced smaller average RMSE in all the cases; that is, the proposed technique could reconstruct the real data more closely compared to the other methods in presence of different rates of outliers.
To identify the robust method for biomarker identification technique using real dataset, we also computed the different performance indices, ROC curve, accuracy, sensitivity, specificity, positive predicted value, negative predicted value, and balanced accuracy for sample classification using the set of DE metabolites. To compute the performance indices, we identified the set of DE metabolites by four different biomarker identification techniques including the proposed one and using the identified DE metabolites, we classified the sample by SVM classifier. We used the fivefold cross validation to compute the performance indices whose results have been given in Figure 4 and Table 4.   All performance indices of Figure 4 and Table 4 showed that the proposed biomarker identification technique is better among the four techniques. Therefore, we identified the differentially expressed metabolites using the proposed biomarker identification technique and categorized the upregulated and downregulated metabolites using the value of the FC as well as mining the DE metabolites using heatmap plot and hierarchical clustering technique in Figure 5. We also ranked the 43 DE metabolites using SVM technique according to their importance. The ranked upregulated and downregulated metabolites have been showed in Figure 6(a). From Figure 5, it was clear that the identified 43 DE metabolites can correctly and clearly cluster the samples into two groups: control group and HCC group. We could also observe that, from the 43 DE metabolites, 21 metabolites were downregulated and the remaining 22 metabolites were upregulated in HCC patients. In Figure 6(a) red circles indicated upregulated metabolites and blue circles indicated downregulated metabolites. From Figure 6(a), we could say that the metabolite with / = 288.2899, rt = 3.7607 is the most influential upregulated metabolite for HCC disease. We got 7 upregulated and 3 downregulated metabolites from the top ten ranked DE metabolites and we also drew the correlation network of the top ten ranked DE metabolites (see Figure 6(b)). Thus on the basis of Figure 6, we could take the top four upregulated metabolites (" / = 288.2899, rt = 3.7607," " / = 272.2947, rt = 4.1998," " / = 332.3161, rt = 4.2309," and " / = 360.3472, rt = 4.2449") and the top two downregulated metabolites (" / = 544.3426, rt = 4.5071" and " / = 590.3222, rt = 4.4899") as a set of biomarkers for HCC disease. We also observed that, for sample classification, this set of biomarkers (six metabolites) gave 100% classification accuracy in fivefold cross validation. However, for taking a final decision to declare as a set of biomarkers, further wet laboratory experimental validation is required.

Conclusion
We have shown that missing values and outliers play important role in different biomarker identification techniques to identify biomarkers from GC-MS metabolomics data. We have also shown that performance of existing traditional biomarker identification procedure is very much influenced Nurul Haque Mollah coordinated and supervised the project. All authors read and approved the final manuscript.