Comparative Analysis of Mass Spectral Similarity Measures on Peak Alignment for Comprehensive Two-Dimensional Gas Chromatography Mass Spectrometry

Peak alignment is a critical procedure in mass spectrometry-based biomarker discovery in metabolomics. One of peak alignment approaches to comprehensive two-dimensional gas chromatography mass spectrometry (GC×GC-MS) data is peak matching-based alignment. A key to the peak matching-based alignment is the calculation of mass spectral similarity scores. Various mass spectral similarity measures have been developed mainly for compound identification, but the effect of these spectral similarity measures on the performance of peak matching-based alignment still remains unknown. Therefore, we selected five mass spectral similarity measures, cosine correlation, Pearson's correlation, Spearman's correlation, partial correlation, and part correlation, and examined their effects on peak alignment using two sets of experimental GC×GC-MS data. The results show that the spectral similarity measure does not affect the alignment accuracy significantly in analysis of data from less complex samples, while the partial correlation performs much better than other spectral similarity measures when analyzing experimental data acquired from complex biological samples.


Introduction
Metabolomics is the systematic study of metabolites found within cells and biological systems. It has emerged as the latest of the "omics" disciplines to decipher the complex timerelated concentration, activity, and flux of metabolites in biological or clinical samples, offering a path to a wealth of information about a person's health.
Multiple analytical platforms such as liquid chromatography-mass spectrometry (LC-MS), gas chromatographymass spectrometry (GC-MS), and nuclear magnetic resonance spectroscopy (NMR) have been used in metabolomics. Of these analytical platforms, the comprehensive twodimensional gas chromatography coupled with mass spectrometry (GC×GC-MS) is a promising analytical platform in metabolomics for disease biomarker discovery [1][2][3]. This approach uses a short column as the second dimension GC column after the first dimension GC column which is the main analytical column. In general, these two columns have different stationary phases, and the first dimension column is operated at a lower temperature than the second dimension column. The difference of column temperature and the chromatography matrix enables the compounds coeluted from the first dimension column to be further separated in the second dimnsion column. The compounds separated in the second dimension column are directed to a mass spectrometry system for detection. The GC×GC-MS platform offers several advantages for analysis of complex samples, such as an order-of-magnitude increase in separation capacity, significant increase in signal-to-noise ratio and dynamic range, and improvement of mass spectral deconvolution and similarity matches [4,5], providing more and accurate information about metabolite retention times and mass spectra.
In disease biomarker discovery, multiple samples from each biological cohort (disease or control) are usually collected to increase the statistical power, and each of these samples is preprocessed and analyzed on a high throughput analytical platform such as GC×GC-MS. Metabolic profiles obtained from these samples must then be aligned to compare the difference of abundance level of each compound between/among sample cohorts. The purpose of peak alignment is to recognize molecular features of the same metabolite occurring in different samples. Two alignment approaches have been developed: profile alignment and peak matching. The profile alignment uses the entire chromatographic data, that is, the raw instrumental data [6][7][8][9]. However, this approach aligns the GC×GC-MS data based on retention time alone, although the mass spectrum of fragment ions is readily available in the raw instrument data. Aligning metabolic profiles based on both retention time and mass spectrum can decrease the rate of falsepositive alignment. In order to account for this fact, the peak matching approach was introduced. The raw instrument data, in this case, are first reduced into compound peak list, and the peak lists of multiple samples are then employed for alignment [10][11][12][13][14][15]. In this study, we examined the effects of mass spectral similarity measures on the performance of the peak matching-based alignment approach.
Several peak matching-based alignment algorithms have been developed, such as MSort [10], DISCO [11], mSPA [12], SWPA [13], and MbPA [14]. MSort is a two-step peak alignment using a distance window, while DISCO is a twostep peak alignment using a mass spectral similarity window. The algorithm mSPA employs a mixture similarity score to simultaneously evaluate both the retention time distance and the mass spectral similarity. SWPA performs peak alignment using Smith-Waterman local alignment algorithm. Of these methods, MbPA is the only model-based approach, which uses an empirical Bayes model and the posterior distribution for peak alignment. DISCO, SWPA, and MbPA can be applied to both homogeneous and heterogeneous data, while MSort and mSPA are able to align only for homogeneous data. The homogeneous data mean that all samples were analyzed under the identical GC×GC-MS experiment conditions, while the heterogeneous data refer to that experiment data were acquired under different experiment conditions. Most recently, Jeong et al. [15] proposed a post hoc analysis for peak alignment by incorporating the results of compound identification.
The retention time distance measure and the mass spectral similarity measure play a critical role in peak matchingbased alignment. As for the retention time distance measure, MSort and DISCO use the Euclidean distance, while SWPA and MbPA use the rank of the Euclidean distance. In particular, mSPA investigated the effect of the four different distance measures, including Euclidean distance, Maximum (also known as Chebyshev) distance, Manhattan distance, and Canberra distance, on peak alignment and concluded that the Canberra distance is a promising distance measure for peak alignment. In case of the mass spectral similarity measure, MSort, DISCO, and SWPA use Pearson's correlation, while mSPA and MbPA use the cosine correlation (also known as dot product).
The mass spectral similarity measure is the key to compound identification in metabolomics, and is fulfilled by matching experimental mass spectra to mass spectra stored in a reference library. Various mass spectral similarity measures have been developed including cosine correlation [16], composite similarity [16], probability-based matching system [17], Hertz et al. similarity index [18], normalized Euclidean distance [19], absolute value distance [19], and wavelet and Fourier transforms-based composite measures [20]. Later, Kim et al. [21] developed partial and semipartial correlation-based similarity measures and showed that their similarity measures perform better than the dot product and its composite versions, including wavelet and Fourier transforms-based composite measures.
Although both the compound identification and the peak alignment use mass spectra and the effect of mass spectral similarity measures on compound identification has been studied, the effect of the different mass spectral similarity measures on the performance of peak alignment still remains unknown. Therefore, the objective of this work was to compare the effects of five mass spectral similarity measures, cosine correlation, Pearson's correlation, Spearman's correlation, partial correlation, and part (also known as semipartial) correlation, on peak alignment. For ease of comparison, we selected the peak alignment algorithm mSPA since it includes various peak alignment approaches and the homogeneous data are more practically applicable.
The remaining of the paper is organized as follows. Section 2 contains a review of mSPA and a detailed description of five mass spectral similarity measures. In Section 3, the selected mass spectral similarity measures were applied to experimental GC×GC-MS data to investigate the effect of the mass spectral similarity measures on peak alignment using mSPA. Finally, Section 4 provides some discussion and is closed with conclusions.

Let
= { 1 , 2 , . . . , } be the peak list of a reference chromatogram and = { 1 , 2 , . . . , } the peak list of a target chromatogram, where and (1 ≤ ≤ ,1 ≤ ≤ ) are composed of its first and second dimension retention times, ( ,1 , ,2 ) and ( ,1 , ,2 ), respectively, as well as its mass spectrum, and , respectively. Note that the mass spectrum is a vector of intensities for the peak , such as = ( 1 , 2 , . . . , ), where is the total number of massto-charge ratio (m/z). We call each peak in the reference peak list a reference peak and a peak in the target peak list a target peak. The distance and the similarity refer to the retention times and the mass spectral information, respectively. All the statistical analyses and simulations were performed using a statistical package R (R Development Core Team).
2.1. Review of mSPA. The peak alignment R package mSPA [12] provides five peak alignment algorithms for users (http://mrr.sourceforge.net/). The five peak alignment algorithms are PAD, PAS, SW-PAD, DW-PAS, and PAM. Here PAD is a peak alignment procedure using solely the peak Computational and Mathematical Methods in Medicine 3 distance without window, and PAS performs the peak alignment based on the spectral similarity without window. SW-PAD and DW-PAS are window-based peak alignments. SW-PAD stands for the peak alignment with a similarity-based window, and DW-PAS aligns peaks using a distance-based window. Kim et al. [12] further developed a mixture similarity measure ( ). That is, the mixture similarity score between a target peak and a reference peak ℎ is defined by where (0 ≤ ≤ 1) is a mixture weight factor, ( , ) and ( , ) are a spectral similarity score and a distance measure between two peaks and , respectively. PAM is the peak alignment method using this mixture similarity without any window. The main difference of PAM over other approaches is the ability to use both the retention time distance and the mass spectral similarity at the same time without window. In addition, an optimization-based peak alignment, OP-PAM, is also incorporated in mSPA. OP-PAM is the optimal version of PAM and optimizes the mixture weight w and the distance measure. For further details refer to Kim et al. [12].
mSPA uses the cosine correlation as the main mass spectral similarity measure, although a user can choose Pearson's correlation coefficient as an option. mSPA also includes four distance measures, such as Euclidean ( 1 ), Maximum ( 2 ), Manhattan ( 3 ), and Canberra ( 4 ). Kim et al. [12] showed that Canberra distance performs the best among them. However, it still remains unknown which similarity measure performs better for peak alignment.

Similarity Measures.
In this study, we selected five similarity measures, cosine correlation, Pearson's correlation, Spearman's correlation, partial correlation, and part correlation. Since all the existing peak matching-based approaches use either the cosine correlation or Pearson's correlation, we chose these two mass spectral similarity measures. Spearman's correlation was considered because it is a nonparametric measure. The partial and the part correlations were selected because of their best performance in compound identification [21].

Cosine Correlation (Dot Product).
The cosine correlation [16], which is also known as the dot product, is used to obtain the cosine of the angle between two sequences of intensities, = ( ) =1,..., and = ( ) =1,..., , where is the total number of m/z values. It is defined as where ∘ = ∑ =1 and ‖ ‖ = √∑ =1 2 . Note that ranges between −1 and 1, and it is always nonnegative if and are nonnegative intensities.

Pearson's and Spearman's Correlations.
Pearson's correlation between two sequences of intensities, = ( ) =1,..., and = ( ) =1,..., , is the covariance of the two sequences divided by the product of the standard deviations and is defined by where Cov( , ) is the covariance between and and Var( ) is the variance of . Spearman's correlation between and , , is a nonparametric version of Pearson's correlation and is defined as Pearson correlation coefficient between the ranks of two sequences of intensities.

Partial and Part (Semipartial) Correlations.
The partial correlation is the association between two random variables after removing the effect of other random variables, while the part correlation removes the effect of other random variables only for one random variable [21]. Consider a partitioned random vector ( , ) where and = ( 1 , 2 , . . . , ℎ ) are one-dimensional random variables and an ℎ-dimensional random vector, respectively. Then the partial correlation | ( ) between and i given ( ) = ( 1 , . . . , −1 , +1 , . . . , ℎ ) is defined by the correlation between the residuals | ( ) and | ( ) and is represented by where | ( ) and | ( ) are the residuals of the linear regression of and on ( ) , respectively.
The semipartial correlation ( | ( ) ) between and i with ( ) is the correlation between the random variable and | ( ) and is represented by In general, ( | ( ) ) ̸ = ( | ( ) ) and ( | ( ) ) = ( | ( ) ) if and are independent of ( ) . If and are independent of ( ) , all the three correlations, Pearson's, partial, and part correlations, are theoretically exactly similar to each other, that is, In the context of peak alignment, is the mass spectrum of a target peak and is the vector of all the mass spectra of the reference peak list.
It is known that the partial correlation can be derived by the inverse of the covariance matrix [22], so does the part correlation. In the context of partial and part correlations, each peak represents a random variable and the intensities of each m/z value correspond to the observed samples, resulting in the number of peaks being equal to the number of variables and the number of m/z values being equal to the sample size. However, the number of peaks often exceeds the number of m/z values in case of real biological data, resulting in a highdimensionality problem. This causes the singularity of the inverse covariance matrices between two peak lists. To avoid the singularity problem, we adopted the two-step approach developed by Kim et al. [21]. Namely, we first reduced the number of peaks for the calculation of the partial and the part correlations by considering only the peaks that have the first highest similarity scores obtained by Pearson's correlation. Then the partial and the part correlations were computed only for these peaks. Given the rank , the two-step partial and part correlations are defined by, respectively, and Rank( ) is the rank of the similarity score in descending order. In this study, (4) and (5) were applied to a mixture of 76 compound standards, and a biological data set employed (6) to avoid the singularity of the covariance matrix. We used 10 different ranks between 3 and 100 for , which are 3, 5, 7, 10, 15, 20, 30, 50, 70, and 100. The R package ppcor was used to compute the partial and the part correlations.

GC×GC Data Sets.
For a fair comparison with mSPA, we used the same data as those of mSPA, which are a mixture of 76 compound standards and a set of real biological samples extracted from rat plasma. A mixture of 76 compound standards is composed of 10 GC×GC-MS data sets (S1-S10), and the rat plasma sample consists of five GC×GC-MS data sets (P1-P5). For a more detailed description of the data, please refer to Wang et al. [11]. We call the mixture of 76 compound standards Data I, which has 10 data sets, and the rat plasma data set Data II, which has 5 data sets. Theoretically, one peak should be generated for each compound after peak picking. Multiple peaks, however, are usually detected for one compound by the spectral deconvolution software such as ChromaTOF, which will generate a set of peak lists. Therefore, we merged the multiple peaks by peak area. In other words, we selected the peak with the largest peak area among the multiple peaks having the same compound name. The number of peaks before and after peak merging is summarized in Table 1. The chromatograms and the densities of the first and the second dimension retention times of Data I and Data II are depicted in Figure 1. Note that the data and source code are available at http://mrr.sourceforge.net/.
The area under ROC curve (AUC) was further calculated after ROC was created by plotting between TPR and FPR according to given cut-off values using the methods in [23,24].

Results
We evaluated the effect of the five spectral similarity measures on peak alignment using mSPA. As mentioned before, mSPA provides five different peak alignment methods including an optimal version. In this study, we focused only on the following four methods: PAS, DW-PAS, SW-PAD, and PAM, since we were interested in the effect of the mass spectral similarity measures. Therefore, these four peak matching alignment approaches were applied to Data I and Data II using mSPA, with the five different similarity measures, the cosine correlation, Pearson's correlation, Spearman's correlation, the partial correlation, and the part correlation. Figure 2 displays the plots of PPV versus TPR and FPR versus TPR when PAS was applied to Data I and II. In the PPV versus TPR plot, a method is better as it is closer to the point (1,1), while the FPR versus TPR plot represents that a method is better as it is close to the point (0,1). It is worth reminding that PAS is a peak alignment solely based on the mass spectral similarity score without using the retention time distance. Spearman's correlation performs the worst for both Data I (75.52%) and Data II (49.31%) in terms of F1 scores, while the partial correlation performs the best (97.24% and 61.58% for Data I and II, resp.), as can be seen in Tables 2 and 3. Interestingly, the partial correlation performs better than the part correlation.
The method DW-PAS is a peak alignment method with a distance-based window. In this case, a user is required to set a threshold for the distance-based window, which is the rank of the retention time distance. The five different ranks, 3, 5, 10, 15, and 20, were used. The plots of PPV versus TPR, F1 scores, and FPR versus TPR (ROC) are shown in Figure 3. Likewise, Spearman's correlation performs the worst regardless of the rank and the dataset. As the rank increases, the F1 scores of the partial and the part correlations generally increase, while the F1 scores of the cosine, Pearson's, and Spearman's correlations decrease, in case of Data I (Figure 3(b)). On the other hand, in case of Data II, the F1 scores of the cosine, Pearson's, the partial, and the part correlations increase as the rank increases, while Spearman's correlation decreases. Overall, the partial correlation performs the best for both Computational and Mathematical Methods in Medicine 5 Table 1: The summary of GC×GC/TOF-MS datasets. The numbers of peaks before and after peak merging are calculated for each dataset. Data I  S1  S2  S3  S4  S5  S6  S7  S8  S9  S10   The number of peaks  Before  180  186  161  151  151  145  172  163  168  174  After  78  76  76  75  74  73  74  76  77  75  Run ID  Data II  P1  P2  P3 P4 P5   Data I (97.59%) and Data II (59.52%) in terms of F1 scores, as shown in Tables 2 and 3. Figure 4 shows the results of SW-PAD. This method requires a mass spectral similarity-based window as well as a cut-off value of the similarity (0 ≤ ≤ 1). In this study, we used 13 values between 0.1 and 0.99 for . The F1 scores of the part and Spearman's correlations are much sensitive to the cut-off value than these of other correlations in Figures  4(b) and 4(e). In case of Data I, Pearson's correlation (97.68%) with Canberra distance performs the best among them in terms of F1 score, while the F1 score (66.65%) of the partial correlation with Manhattan distance is the highest when Data II is applied, as can be seen in Tables 2 and 3.

Run ID
The PAM aligns peak lists using a mixture similarity score of the retention time distance and the mass spectral similarity. In this case, a user needs to set up the mixture weight (0 ≤ ≤ 1). If is close to zero, the mass spectral similarity plays a much more important role in peak alignment than the retention time distance does, while the retention time distance drives the peak alignment if close to one, as can be seen in (1). We used 13 values between 0.01 and 0.99 for . Similar to the other peak alignment approaches, Spearman's correlation performs the worst among them in terms of F1 scores, as shown in Figures  5(b) and 5(e). As for Data I, all the correlations except for Spearman's correlation are less sensitive to the mixture weight , while the F1 scores of all the correlations are more sensitive to the weight in case of Data II. In Data I, the highest F1 score (98.12%) occurred when Pearson's correlation used, and, as for Data II, the partial correlation had the highest F1 score (61.78%), as shown in Tables 2  and 3. Overall, Pearson's correlation with PAM performs the best in terms of F1 score for Data I (98.12%), and the partial correlation with SW-PAD performs the best for Data II (66.65%), as can be seen in Tables 2 and 3

Discussion and Conclusions
When the less dense data such as Data I are applied, the effect of the mass spectral similarity measures on the performance of peak alignment is small since the retention time distance dominates the performance of peak alignment. In fact, F1 scores of all the mass spectral similarity measures except for the Spearman's correlation are not significantly different from each other when PAM is applied to Data I, as shown in Table 2. On the other hand, when analyzing more complicated data such as Data II, the mass spectral similarity measures play a critical role in obtaining a better performance of peak alignment. As can be seen in Table 3, the F1 score of the partial correlation with SW-PAD is significantly different from those of other methods. Furthermore, all the peak alignment approaches perform the best when the partial correlation is employed, indicating that the effect of the mass spectral similarity measures on alignment is critical and we should consider the partial correlation to achieve a better performance.
In case that the mass spectral similarity measures were compared to each other in terms of accuracy of compound identification, the part correlation performed the best although its performance was comparable to that of the partial correlation [21]. Different from compound identification, the partial correlation performs significantly better than the part correlation in peak alignment. For example, we can see this from the results of PAS listed in Tables 2  and 3. Interestingly, when the more dense data are used, the performance of the part correlation with PAS becomes  ; # standard error (%); $ the cut-off rank; * * the rank for the two-step partial and part correlations; ## the distance measure E, Mx, Mh, and C stand for Euclidean, Maximum, Manhattan, and Canberra distances, respectively; $$ the cut-off similarity score. * * * The weight factor of the mixture similarity score; the numbers in bold and italic indicate the maximum for each of the peak alignment methods. worse than those of the cosine and Pearson's correlations. This may be because the characteristics of the experimental data are different between compound identification and peak alignment. Namely, in compound identification, the query mass spectra are generated from the experimental conditions typically different from that of the reference library mass spectra. Therefore, the effect of the reference library mass spectra is ignorable so that the part correlation performs the best. On the other hand, the peak alignment here uses the homogeneous data which are generated from the similar experimental conditions, resulting in that the partial correlation performs the best.
To further investigate this difference of the five similarity measures, we plotted the distributions of the five similarity  Figure 6: The distributions of similarity scores from the same and the different peaks. The blue solid lines and the red dotted lines represent the distributions of the similarity scores between the same peaks and between the different peaks, respectively. The left -axis is scaled for the different peaks and the right -axis is scaled for the same peaks. (a) and (b) are for Data I and Data II, respectively. scores from the same peaks as well as from the different peaks for Data I and Data II, as shown in Figure 6. In an ideal case, the distribution of the same peaks (the blue solid line) should be close to 1, and the distribution of the different peaks (the red dotted line) should be close to either 0 (for the cosine correlation) or −1 (for other similarity measures). We can see that the distributions of the partial correlation are clearly separated among the five mass spectral similarity measures (including the part correlation), explaining why the partial correlation with PAS performs the best in terms of F1 scores. In addition, the distributions of the cosine and Pearson's correlations have the very similar trends to each other for both Data I and Data II. In fact, this is consistent with the result of the comparison analysis of Liu et al. [25], in which Pearson's correlation coefficient is most robust, but the difference between the dot product and Pearson's correlation coefficient is subtle.
Another point to consider is that SW-PAD with the partial correlation is the best approach in case of Data II, while PAM is the best approach with Data I. In fact, the F1 score of SW-PAD with the partial correlation is improved up to 5%, compared to that of PAM with the partial correlation in case of Data II. This may be because more peaks in Data II have similar mass spectral information although they are generated from the different compounds. For example, the cut-off value of SW-PAD with the partial correlation is much larger in Data II than that in Data I (Tables 2 and  3).
In conclusion, as for the less dense data such as Data I, PAM with any one of the cosine, the Pearson's, and the partial correlations will give us a better performance of peak alignment, while SW-PAD with the partial correlation will perform the best in case of the more dense data, such as the data acquired from real biological samples.