MarVis-Filter: Ranking, Filtering, Adduct and Isotope Correction of Mass Spectrometry Data

Statistical ranking, filtering, adduct detection, isotope correction, and molecular formula calculation are essential tasks in processing mass spectrometry data in metabolomics studies. In order to obtain high-quality data sets, a framework which incorporates all these methods is required. We present the MarVis-Filter software, which provides well-established and specialized methods for processing mass spectrometry data. For the task of ranking and filtering multivariate intensity profiles, MarVis-Filter provides the ANOVA and Kruskal-Wallis tests with adjustment for multiple hypothesis testing. Adduct and isotope correction are based on a novel algorithm which takes the similarity of intensity profiles into account and allows user-defined ionization rules. The molecular formula calculation utilizes the results of the adduct and isotope correction. For a comprehensive analysis, MarVis-Filter provides an interactive interface to combine data sets deriving from positive and negative ionization mode. The software is exemplarily applied in a metabolic case study, where octadecanoids could be identified as markers for wounding in plants.


Introduction
A central aim of untargeted metabolomics and metabonomics studies is the identification of marker metabolites which play a crucial role in the experimental context [1,2]. Mass spectrometry combined with either gas chromatography (GC/MS) or liquid chromatography (LC/MS) has become a key technology for metabolome analysis under different experimental conditions [3,4]. A typical data set after peak detection and sample alignment [5][6][7] consists of several thousand marker candidates, each characterized by a retention time (RT), a mass-to-charge value (m/z), and a multivariate intensity profile of abundance levels per condition [8]. The experimental conditions are represented by replicate samples and may correspond to environmental, disease, or genetic perturbations [9][10][11]. In order to obtain a high-quality data set of experiment-related marker candidates, the raw data set is usually ranked and filtered using supervised machine learning techniques such as Random Forest classification [12,13] or statistical analysis based on ANOVA or Kruskal-Wallis tests [14][15][16]. The filtered marker candidates are then annotated according to known metabolites from public biological and biomedical compound databases [17][18][19][20][21]. A central task of annotation is the calculation of the actual molecular mass corresponding to each marker candidate by correcting the m/z ratio according to the ionization mode, potential adduct formation, and included natural isotopes [22]. This problem can be addressed by applying ionization rules of the form $[xm + y]^{z[+/-]}$ [23], where m denotes the mass of the target molecule, x the number of combined target molecules, y the mass of attached molecules (adduct formation), and z the degree of ionization (e.g., single or double). Additionally, the number of included isotopes has to be estimated in order to query databases which contain monoisotopic compound masses.
Based on a potential ionization rule with parameters x, y, z and the number of included isotopes, the corresponding compound mass can be calculated.
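The mass correction described above can be illustrated with a minimal sketch. It assumes that an observed m/z value relates to the monoisotopic mass M as m/z = (xM + y + n·Δ)/z, where n is the number of included 13C-isotopes and Δ is the mass difference between 13C and 12C. The names `IonizationRule` and `corrected_mass` are hypothetical and not part of MarVis-Filter, and y is assumed to already carry its sign and any electron-mass adjustment.

```python
from dataclasses import dataclass

# Mass difference between 13C and 12C in Da (standard isotope masses).
C13_C12_DELTA = 1.0033548


@dataclass(frozen=True)
class IonizationRule:
    """Hypothetical container for a rule of the form [xM + y]^(z+/-)."""
    x: int    # number of combined target molecules
    y: float  # net mass of attached or lost species in Da (sign included)
    z: int    # degree of ionization as an absolute value (1 = single)


def corrected_mass(mz: float, rule: IonizationRule, n_c13: int = 0) -> float:
    """Invert m/z = (x*M + y + n_c13*delta) / z for the monoisotopic mass M."""
    return (mz * rule.z - rule.y - n_c13 * C13_C12_DELTA) / rule.x


# Illustrative rules: a protonated molecule and a protonated dimer.
m_plus_h = IonizationRule(x=1, y=1.007276, z=1)     # [M+H]+
dimer_plus_h = IonizationRule(x=2, y=1.007276, z=1)  # [2M+H]+

mass_mono = corrected_mass(211.132900, m_plus_h)            # monoisotopic ion
mass_iso = corrected_mass(212.1362548, m_plus_h, n_c13=1)   # ion with one 13C
mass_dimer = corrected_mass(421.258524, dimer_plus_h)       # dimer ion
```

All three illustrative m/z values resolve to the same compound mass of about 210.1256 Da, which is the situation the correction is meant to detect: several ions of one metabolite collapse onto a single mass.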
For corrected masses which cannot be assigned to particular compounds, the identification can be supported by calculating possible molecular formulas. The number of considered formulas can be significantly reduced by incorporating information from preprocessing as well as rules for heuristic filtering of molecular formulas [24]. A major step in this process is the estimation of the number of included carbon atoms based on the intensity profiles of previously detected isotopologues.
A large number of software packages provide tools for statistical analysis of multivariate experimental data [25,26]. A number of tools for peak detection and sample alignment of mass spectrometry data, such as MetAlign or OpenMS, also support the deconvolution of isotopologues and statistical analysis [27,28]. For the XCMS platform [7], a package for the annotation of LC/ESI-MS mass signals based on adduct rules has been implemented [23]. The calculation of possible ionization products and the rule-based heuristic filtering of molecular formulas are provided by several software packages [22,24]. However, to the best of our knowledge, there is no software available which incorporates all of these methods in a single user-friendly tool as offered by MarVis-Filter.

Materials and Methods
In the following sections, the algorithm for adduct/isotope correction and the implementation of MarVis-Filter are described in detail.

Algorithm for Adduct and Isotope Correction.
The algorithm is based on the input of the retention times, m/z ratios, and raw intensity profiles of all marker candidates in a data set and calculates as output the potential monoisotopic mass, ionization rule, and number of included 13C-isotopes for every candidate. The approach is based on a greedy strategy which minimizes the number of potential molecular masses and simultaneously maximizes the similarity of intensity profiles between candidates with a similar retention time and actual mass. This concept follows the paradigm that in mass spectrometry analysis a metabolite is usually represented by several marker candidates with a similar retention time and intensity profile, but different m/z ratios according to the various possibilities of ionization and number of included isotopes. As parameters, the algorithm expects a list of ionization/adduct rules sorted according to their relevance, the assumed maximal number of 13C-isotopes per marker candidate, a mass tolerance, an RT tolerance, and a minimal cosine similarity of intensity profiles. The isotopologue correction is restricted to the detection of 13C.
For storage of pairwise cosine similarities between candidate profiles, the algorithm utilizes a five-dimensional matrix $M$. Each entry $M_{(m,a_1,i_1,a_2,i_2)}$ corresponds to the maximal cosine similarity between the intensity profile of candidate m, assuming ionization rule $a_1$ and $i_1$ 13C-isotopes, and another candidate, which has a similar retention time (within tolerance) and corrected mass (within tolerance) assuming ionization rule $a_2$ and $i_2$ 13C-isotopes. For each candidate m, the algorithm then chooses the ionization rule and number of 13C-isotopes which is supported by the highest sum of cosine similarities. In the following, the algorithm is described in detail.
(2) Calculate all possible masses by applying all ionization rules and numbers of 13C-isotopes to all candidate m/z ratios.
(3) Consider all pairs of potential masses under the constraints introduced above (retention times and corrected masses within the respective tolerances, cosine similarity of at least the specified minimum) and fill $M$ with the pairwise cosine similarities of the corresponding candidate profiles.
(4) Calculate the reduced three-dimensional matrix $M^{\mathrm{red}}$ with summed entries:

$$M^{\mathrm{red}}_{(m,a_1,i_1)} = \sum_{a_2,i_2} M_{(m,a_1,i_1,a_2,i_2)}$$

(5) Choose for each candidate m the adduct rule and isotope number with the maximal sum of similarities $c_{\max} = \max_{a_1,i_1} M^{\mathrm{red}}_{(m,a_1,i_1)}$. If $c_{\max} = 0$, use the first ionization rule and zero 13C-isotopes as default.
(6) Calculate the masses according to the chosen rules and isotope numbers.
In order to avoid apparently false associations between marker candidates, negative cosine similarities are disregarded. If for a given candidate different selections of the ionization rule and the number of isotopes maximize the sum of cosine similarities, the ionization rule with the highest relevance and the minimal number of 13C-isotopes are selected.
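The selection procedure can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the MarVis-Filter implementation: the function names, the tuple layout of candidates and rules, the mass relation m/z = (xM + y + n·Δ)/z, and the illustrative ion masses are all hypothetical. The sums of the reduced matrix are accumulated directly instead of storing the five-dimensional matrix, and ties are resolved by the order of the hypotheses (rule relevance first, then isotope number).

```python
import math
from itertools import product

C13_DELTA = 1.0033548  # mass difference 13C - 12C in Da


def cosine(u, v):
    """Cosine similarity of two intensity profiles (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def annotate(cands, rules, max_iso=2, rt_tol=0.04, mass_tol=0.005, min_cos=0.75):
    """cands: list of (rt, mz, profile); rules: (x, y, z) tuples sorted by relevance.
    Returns one (mass, rule_index, n_13c) tuple per candidate."""
    # Hypotheses (rule index, isotope number), most relevant rule first.
    hyps = list(product(range(len(rules)), range(max_iso + 1)))

    def mass(mz, a, i):
        x, y, z = rules[a]
        return (mz * z - y - i * C13_DELTA) / x

    # Potential masses for every candidate under every hypothesis.
    masses = [{h: mass(mz, *h) for h in hyps} for _, mz, _ in cands]

    result = []
    for m, (rt_m, _, prof_m) in enumerate(cands):
        m_red = dict.fromkeys(hyps, 0.0)  # summed similarities per hypothesis
        for h1, h2 in product(hyps, repeat=2):
            best = 0.0  # starting at zero disregards negative similarities
            for n, (rt_n, _, prof_n) in enumerate(cands):
                if n == m or abs(rt_n - rt_m) > rt_tol:
                    continue
                if abs(masses[m][h1] - masses[n][h2]) > mass_tol:
                    continue
                c = cosine(prof_m, prof_n)
                if c >= min_cos:
                    best = max(best, c)
            m_red[h1] += best
        # max() keeps the first maximum, so ties (and the all-zero case)
        # fall back to the most relevant rule and the fewest isotopes.
        a, i = max(hyps, key=lambda h: round(m_red[h], 9))
        result.append((masses[m][(a, i)], a, i))
    return result


rules = [(1, 1.007276, 1), (1, 22.989221, 1)]  # [M+H]+, [M+Na]+ (illustrative)
cands = [
    (5.00, 211.132900, [1.0, 2.0, 3.0, 4.0]),    # protonated ion
    (5.01, 233.114845, [2.0, 4.0, 6.0, 8.5]),    # sodium adduct
    (5.00, 212.136255, [0.1, 0.25, 0.4, 0.5]),   # protonated ion with one 13C
    (9.00, 300.000000, [4.0, 3.0, 2.0, 1.0]),    # unrelated candidate
]
res = annotate(cands, rules)
```

In this toy data set the first three candidates are assigned the same corrected mass, while the isolated fourth candidate receives the default annotation. Note that the shifted hypotheses (every ion carrying one additional 13C) reach the same similarity sum, so the tie-breaking rule described above is what selects the monoisotopic interpretation.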
Journal of Biomedicine and Biotechnology
Following the annotation of the ionization rules and 13C-isotopes, the number of carbon atoms per candidate is estimated by comparing the raw intensities of marker candidates with zero predicted 13C-isotopes ($I_M$) and the respective marker candidates including one 13C-isotope ($I_{M+1}$) according to the following formula:

$$n_C = \frac{I_{M+1}}{I_M} \cdot \frac{0.9893}{0.0107},$$

where the factor 0.9893/0.0107 corresponds to the natural abundances of the carbon isotopes 12C and 13C. Given a pair of candidates, annotated as isotopologues (M and M + 1) and with the same ionization rule, a robust estimation of the number of carbon atoms is obtained by calculating the median $n_C$ over all samples included in both intensity profiles.
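As a small worked example, the carbon number estimate can be sketched as follows, assuming it takes the form n_C ≈ (I_{M+1}/I_M)·(0.9893/0.0107) per sample, followed by the median over samples. The function name `estimate_carbons` and the intensity values are hypothetical; skipping samples with zero monoisotopic intensity is an assumption of this sketch, not a documented behavior of the software.

```python
from statistics import median

# Natural abundances of 12C and 13C.
P_C12, P_C13 = 0.9893, 0.0107


def estimate_carbons(i_m, i_m1):
    """Median over samples of (I_{M+1} / I_M) * (P_C12 / P_C13).

    i_m and i_m1 are the raw intensity profiles of the M and M+1
    isotopologues, sample by sample. Samples with zero M intensity
    are skipped (assumption of this sketch).
    """
    ratios = [b / a * P_C12 / P_C13 for a, b in zip(i_m, i_m1) if a > 0]
    return median(ratios)


# Illustrative intensities for a compound with 18 carbon atoms:
# the expected ratio I_{M+1}/I_M is about 18 * 0.0107 / 0.9893 = 0.195.
i_m = [1000.0, 2000.0, 4000.0]
i_m1 = [195.0, 389.0, 800.0]
n_c = estimate_carbons(i_m, i_m1)
```

Using the median rather than the mean keeps a single noisy sample (here the third one) from shifting the estimate, which rounds to 18 carbon atoms in this example.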

Implementation.
MarVis-Filter is implemented in the Matlab and C programming languages and has been compiled together with the MarVis-Cluster tool [29] for Microsoft Windows XP/Vista/7. Execution of the software requires installation of the Matlab Compiler Runtime, which is provided with the software. The installation packages, the documentation, and example data sets can be downloaded from the project home page http://marvis.gobics.de/.
For data import and export, MarVis-Filter uses the CSV (Comma Separated Values) file format, which can easily be processed by statistical analysis software and spreadsheet applications. MarVis-Filter also supports the direct import of aligned mass spectrometry samples from the MarkerLynx Application Manager of MassLynx (Waters Corporation, Milford). For interactive analysis, ranking, and filtering of multivariate intensity profiles, MarVis-Filter provides the well-known one-way ANOVA and Kruskal-Wallis tests [14] combined with methods for P value adjustment for multiple hypothesis testing [30,31]. Based on customizable lists of ionization rules, the adduct/isotope correction can be performed on raw or filtered data sets. The ionization rules are imported as text files and can easily be adapted or extended. Figure 1 shows the main window of MarVis-Filter after import and ranking. The "Ranking plot" (1) displays the adjusted P values (y-axis) of all candidate intensity profiles in the current data set sorted in ascending order. The data set can interactively be filtered according to a user-defined significance level by selecting a marker, sliding the red separator line, or jumping to a predefined level. The "Profile plot" (2) shows the raw intensity profile of the currently selected marker candidate. Intensity values of replicate samples belonging to the same experimental condition are marked in the same color. The "Marker information box" (3) displays information about all marker candidates of the data set arranged according to the P values and characterized by the m/z ratio, RT, and additional user-defined scores, which can be imported along with the data set. After adduct and isotope correction, the additional annotations are displayed in this listbox as well. The "Data set clipboard listbox" (4) shows data sets which are currently held in the MarVis clipboard. The current (filtered or unfiltered) data set can simply be added to or removed from this list.
The data set clipboard supports an adduct and isotope correction of selected data sets in a batch mode. Data sets which were corrected based on different sets of ionization rules (e.g., positive and negative ionization) may be combined into one single data set.
For selected candidate profiles, bar plots, standard error plots, and boxplots can easily be inspected and exported in various image formats. For detailed analysis, the user can zoom into all plots. Additionally, MarVis-Filter provides a convenient interface for quick candidate search based on the ID, RT, m/z, or mass value.
MarVis-Filter also provides a molecular formula calculator, which is based on the Seven Golden Rules [24] and utilizes the estimated number of carbon atoms per marker candidate obtained after adduct and isotope correction.
MarVis-Filter and MarVis-Cluster [29] are combined in the MarVis-Suite which features the direct data exchange between preprocessing in MarVis-Filter and convenient visualization of multivariate intensity profiles and high-level cluster analysis in MarVis-Cluster.

Results and Discussion
The functionality of MarVis-Filter is demonstrated using two data sets of a metabolomic case study for plant wounding experiments [8]. The data sets are available on the project homepage http://marvis.gobics.de/ together with a detailed description of the extraction and UPLC-TOF method. Additionally, the data sets are available for import in MarVis-Filter after installation of the MarVis-Suite (wound neg raw.csv and wound pos raw.csv in the examples directory).

Case Study and Data Sets.
The case study reflects a wounding time course of Arabidopsis thaliana wild-type (WT) plants as well as of mutant plants (dde 2-2), which are deficient in the biosynthesis of the plant wound hormone jasmonic acid and its derivatives [32]. The wounding time course represents eight experimental conditions. The first four conditions reflect the metabolic situation within a wounding time course of WT plants, starting with the unwounded control plants (abbreviation wt 0) followed by the plants harvested 0.5 (wt 30), 2 (wt 2), and 5 hours after wounding (wt 5). Conditions 5 to 8 represent the analogous time course for the jasmonate-deficient mutant plant dde 2-2 (aos 0, aos 30, aos 2, aos 5). Each condition contains nine replicate samples.
After data import, the marker candidates are sorted and ranked according to the P values of a Kruskal-Wallis test and the Bonferroni-Holm adjustment for multiple hypothesis testing [30] by selecting the corresponding checkboxes in the "Filter dialog" and the "Adjustment for multiple testing" dialog.
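The Bonferroni-Holm adjustment used for ranking can be sketched as a step-down procedure on a list of raw P values. The sketch assumes that the Kruskal-Wallis P values have already been computed for each marker candidate; the function name `holm_adjust` and the example values are hypothetical and do not stem from the case study.

```python
def holm_adjust(p_values):
    """Return Bonferroni-Holm adjusted P values in the input order.

    The k-th smallest raw P value (k = 0, 1, ...) is multiplied by
    (m - k), capped at 1, and forced to be non-decreasing along the
    sorted order by taking a running maximum.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, k in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[k]))
        adjusted[k] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # illustrative raw P values
adj = holm_adjust(raw)

# Filtering at a user-defined significance level for adjusted P values.
level = 0.05
selected = [k for k, p in enumerate(adj) if p <= level]
```

For the four illustrative values, the adjusted P values are approximately 0.03, 0.06, 0.06, and 0.02, so the first and the last candidate pass a level of 0.05. The running maximum is what keeps the second candidate (raw 0.04) from receiving a smaller adjusted value than the third (raw 0.03).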
Adduct and isotope correction are performed on the full data sets separately using predefined sets of adduct rules for the negative (Table 2) and positive ionization mode (Table 3), an RT tolerance of 0.04 minutes, a mass tolerance of 0.005 Da, a minimal cosine similarity of 0.75, and a maximum number of two 13C-isotopes per candidate. The adduct rules had been determined in previous targeted UPLC-TOF-MS experiments. After correction, the data sets are filtered according to a significance level for adjusted P values of 0.01 ("Goto level" entry in "Selection" menu) and added to the MarVis data set clipboard. Table 1 shows the initial number of imported marker candidates and the number of high-quality marker candidates after filtering. Finally, the two data sets in the MarVis clipboard are concatenated using the "combine" button. The combined data set can be sorted according to a user-defined method once again and is then presented in a new MarVis-Filter window. After selecting the whole data set, the combined subset of 3504 high-quality marker candidates can be exported as a CSV file, and clustered as well as visualized using MarVis-Cluster ("Goto MarVis-Cluster" entry in the "MarVis-Suite" menu). Figure 2 shows the results from clustering of the filtered and combined data in MarVis-Cluster.

Identification of Metabolites.
The corrected, filtered, and combined data sets were used to identify metabolites which show a significant change of abundance in the wound time course in WT and/or jasmonate-deficient mutant plants. First, the corrected masses of marker candidates were matched to molecular masses of all compounds recorded in the KEGG [17] and AraCyc [18] databases or in the literature [33] based on a tolerance of 0.005 Da. The identity of marker candidates was confirmed based on the isotopic pattern and coelution with identical standards or MS/MS fragmentation [34]. Thus, a number of oxylipins could be identified as wound-induced metabolite markers (see Table 4). Oxylipins are metabolites deriving from lipid peroxidation and are involved in regulating developmental processes as well as environmental responses, like the inflammatory or wound response, in nearly every organism. Among these bioactive lipids, the mammalian and plant oxylipins are the best characterized ones. Mammals use predominantly C20 fatty acids (eicosanoids), while in plants C18 fatty acids are most abundantly used for the biosynthesis of oxylipins or so-called octadecanoids [35].

Table 4: Identified metabolites in the combined and filtered data set. The retention time is measured in minutes and the exact compound mass is stated in Dalton. The columns "Negative" and "Positive" contain the number of associated marker candidates/ions obtained in the negative or positive ionization mode. The column "Ions" contains the sum of associated marker candidates/ions per compound. The column "P value" contains the minimal adjusted P value of the Kruskal-Wallis test over all associated marker candidates. The column "Mass error" contains the absolute difference between the corrected mass of the marker candidate with the minimal adjusted P value and the exact compound mass in Dalton.
The identified oxylipins (see Table 4) are part of the α-linolenic acid metabolism or members of the compound class of mono- and digalactosyldiacylglycerols. They are described in the context of plant wounding [33,34,36]. Thirteen of the fifteen identified oxylipins could only be detected in either the negative or the positive ionization mode. On average, five ions/marker candidates could be assigned per compound. The findings are supported by very low adjusted P values from the Kruskal-Wallis test of the intensity profiles (see previous section and Table 4).
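The database matching step at the beginning of this section (corrected masses compared with compound masses within a tolerance of 0.005 Da) amounts to a simple tolerance search. The following sketch assumes a small in-memory compound table; the function name `match_compounds` is hypothetical, and the two entries are illustrative monoisotopic masses calculated from the molecular formulas rather than values taken from Table 4 or from a database export.

```python
def match_compounds(mass, compounds, tol=0.005):
    """Return (mass error, name) pairs for all compounds within tol Da,
    sorted by increasing absolute mass error."""
    hits = [
        (abs(mass - ref), name)
        for name, ref in compounds.items()
        if abs(mass - ref) <= tol
    ]
    return sorted(hits)


# Illustrative compound table (monoisotopic masses in Da).
compounds = {
    "jasmonic acid (C12H18O3)": 210.125594,
    "12-oxo-phytodienoic acid (C18H28O3)": 292.203845,
}

hits = match_compounds(210.1262, compounds)   # one candidate mass
no_hits = match_compounds(250.0000, compounds)
```

A corrected mass may match several database entries within the tolerance, which is why the confirmation by isotopic pattern, coelution with standards, or MS/MS fragmentation described above remains necessary.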

Conclusions
MarVis-Filter combines essential preprocessing tools for mass spectrometry data analysis within a single user-friendly tool. Large data sets from the negative and positive ionization mode can easily be imported, corrected, filtered, and combined. Lists of ionization rules for adduct correction can be customized, extended, and commented in a convenient way using a standard text editor. Within the MarVis-Suite, filtered and combined data sets can directly be clustered, visualized, and analyzed in detail using the MarVis-Cluster tool. In a case study, 75 high-quality marker candidates could be clearly assigned to fifteen compounds of the oxylipin class based on the adduct and isotope correction in MarVis-Filter. The combination of data sets deriving from the negative and positive ionization mode is an important step for further data analysis. In the case study, most of the identified metabolites could only be detected in either the negative or the positive mode. The significance of the selected wound markers is supported by a high number of annotated and assigned ions/marker candidates and by very low adjusted P values from the Kruskal-Wallis test. The statistical filtering of marker candidates reduced the complexity of the data sets from about 48000 to 3500 significant candidates (about 7 percent).