Tools and Databases of the KOMICS Web Portal for Preprocessing, Mining, and Dissemination of Metabolomics Data

A metabolome—the collection of comprehensive quantitative data on metabolites in an organism—has been increasingly utilized for applications such as data-intensive systems biology, disease diagnostics, biomarker discovery, and assessment of food quality. A considerable number of tools and databases have been developed to date for the analysis of data generated by various combinations of chromatography and mass spectrometry. We report here a web portal named KOMICS (The Kazusa Metabolomics Portal), where the tools and databases that we developed are available for free to academic users. KOMICS includes the tools and databases for preprocessing, mining, visualization, and publication of metabolomics data. Improvements in the annotation of unknown metabolites and dissemination of comprehensive metabolomic data are the primary aims behind the development of this portal. For this purpose, PowerGet and FragmentAlign include a manual curation function for the results of metabolite feature alignments. A metadata-specific wiki-based database, Metabolonote, functions as a hub of web resources related to the submitters' work. This feature is expected to increase citation of the submitters' work, thereby promoting data publication. As an example of the practical use of KOMICS, a workflow for a study on Jatropha curcas is presented. The tools and databases available at KOMICS should contribute to enhanced production, interpretation, and utilization of metabolomic Big Data.

Estimation of mass values using PowerFT was performed as follows. The mass chromatogram data in the raw file were extracted into a text file using the MSGet software. The text file was opened in PowerFT and analyzed with the default settings.
Peaks from the standard compounds were identified essentially as in Xcalibur. Because a major peak from one dataset (Daidzein) was split into 2 parts at its apex, the "Dif. margin" parameter at the "Ion Group Setting" of the "Peak Detection" module was changed from 10 ppm to 15 ppm.
The accuracy of mass values was evaluated using the mass difference between the theoretical mass calculated from the elemental composition of the compound and the mass estimated by the software. Two peaks (Acacetin-7-rutinoside and Apigenin-7-rutinoside) were omitted because they showed mass differences greater than 5 ppm in the analysis with Xcalibur. Therefore, data from the remaining 143 compounds were used for comparison (Table S1).
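The mass-accuracy evaluation described above can be sketched as follows. This is a minimal illustration, assuming that the mass difference is expressed in ppm relative to the theoretical mass. The function names and the numeric values in the usage note are hypothetical and are not taken from Table S1.

```python
def ppm_error(theoretical_mass: float, estimated_mass: float) -> float:
    """Signed mass difference in ppm, relative to the theoretical mass."""
    return (estimated_mass - theoretical_mass) / theoretical_mass * 1e6


def within_tolerance(theoretical_mass: float, estimated_mass: float,
                     tolerance_ppm: float = 5.0) -> bool:
    """True when the absolute mass difference does not exceed tolerance_ppm.

    The 5 ppm default mirrors the omission threshold mentioned in the text;
    applying it as an inclusive bound is an assumption of this sketch.
    """
    return abs(ppm_error(theoretical_mass, estimated_mass)) <= tolerance_ppm
```

For example, with an illustrative theoretical mass of 500.0 and an estimated mass of 500.001, `ppm_error` returns approximately 2 ppm, so the peak would be retained under a 5 ppm threshold.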

METHOD S2: Evaluation of a data matrix resulting from peak alignment in FragmentAlign
Three biological sources (Arabidopsis leaves, Lotus japonicus leaves, and transgenic Arabidopsis cell lines (T87) transformed with a binary vector, pGWB2) were used for GC-TOF-MS analysis according to Ogawa [1]. The data obtained from 5 biological replicates for each source, namely 15 samples, were analyzed here. The raw data (the .smp file from the Pegasus III software, LECO, St. Joseph, MI) and peak data deconvoluted using Pegasus III in text format (.mst file) were obtained from MassBase.
The IDs of the data files are shown in Table S2.
The peak data deconvoluted by means of Pegasus III software in .mst files were imported into FragmentAlign, and peak alignment was performed with default settings.
The step of prealignment (to find the internal standard peaks for fine matching of retention times) was omitted. The alignment results were saved in a text file without any manual curation.
Metabolite peaks detected in at least 3 of the 15 samples were used for PCA. Peak intensities were transformed from the linear scale to the log scale with base 10. Missing values were filled with a small value, 1/10 of the minimum peak intensity among the 15 samples, as a tentative background. PCA was performed in R (http://www.r-project.org/, version 3.0.1) using the prcomp function without scaling.
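The preprocessing that precedes PCA can be sketched as below. This is an illustration, not the original analysis script. It assumes that missing values are filled on the linear scale before the log transformation, and that the minimum intensity is taken over all retained peaks. The function name and the representation of missing values as `None` are hypothetical. The PCA step itself is not reproduced here.

```python
import math


def preprocess_peak_matrix(matrix, min_detected=3):
    """Filter, fill, and log10-transform an aligned peak matrix.

    matrix: list of rows, one row per aligned peak, one positive intensity
    per sample, with None marking a peak that was not detected in a sample.
    """
    # Keep only peaks detected in at least min_detected samples.
    kept = [row for row in matrix
            if sum(v is not None for v in row) >= min_detected]
    if not kept:
        return []

    # Tentative background: 1/10 of the minimum observed intensity
    # (assumption: the minimum is taken over the retained peaks).
    observed = [v for row in kept for v in row if v is not None]
    background = min(observed) / 10.0

    # Fill missing values on the linear scale, then take log base 10.
    return [[math.log10(v if v is not None else background) for v in row]
            for row in kept]
```

The resulting matrix would then be passed to a PCA routine that centers the variables but does not scale them to unit variance, which is the default behavior of `prcomp` in R.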
Pearson's correlation coefficients of peak intensities between 2 replicate samples were calculated with the CORREL function of Excel (Microsoft Corporation), using all peak data detected in each sample.
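For readers who do not use a spreadsheet, the correlation between two replicates can be sketched as follows. This is a minimal example, assuming that only peaks detected in both replicates enter the calculation; the text does not state how peaks missing from one replicate were handled. The function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def replicate_correlation(rep_a, rep_b):
    """Correlate two replicates over peaks present in both (None = not detected)."""
    pairs = [(a, b) for a, b in zip(rep_a, rep_b)
             if a is not None and b is not None]
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    return pearson_r(xs, ys)
```

The function raises a `ZeroDivisionError` when either replicate has constant intensities or when no peak is shared, which makes degenerate inputs visible rather than returning a silent value.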

METHOD S3: CE-MS analysis
We prepared 2 types of samples for the evaluation: mixtures of amino acids as authentic compound samples, and Arabidopsis cell extracts as biological samples. As authentic compounds, a series of amino acid mixtures containing 10, 50, 100, and 1000 µM of each amino acid (Gly, Ser, Pro, Val, Thr, Cys, Ile, Leu, Asn, Asp, Lys, Glu, Met, His, Phe, Arg, Tyr, Trp, and cystine) in Milli-Q water was prepared. As biological samples, soluble extracts from Arabidopsis thaliana suspension-cultured T87 cells [2] were prepared. This cell line was obtained from RIKEN BioResource Center (Tsukuba, Japan) and maintained as described by Ogawa [1]. Cells were collected on filter paper 3, 10, and 14 days after subculturing, washed once with distilled water, immediately frozen in liquid nitrogen, and stored at -80 °C until use. Extraction of metabolites from the cells and pretreatment were performed as described previously [3].
Methionine sulfone was added to all samples as an internal standard. Capillary electrophoresis with electrospray ionization and positive-mode detection was performed on an Agilent CE-Capillary Electrophoresis System as described by Urano [3].
The m/z values of the positively ionized amino acids and of the internal standard were set for the SIM scan. Triplicate and single sample injections were performed for each concentration of an amino acid mixture and for each biological sample, respectively. Chemicals were purchased from Sigma-Aldrich Co. (Tokyo, Japan).
In the data analysis using ChemStation, peak areas were calculated by means of the integration function of the software. The parameters for automatic peak detection