Correctness of Protein Identifications of Bacillus subtilis Proteome with the Indication on Potential False Positive Peptides Supported by Predictions of Their Retention Times

The predictive capability of a retention time prediction model based on quantitative structure-retention relationships (QSRR) was tested. The QSRR model was derived using a set of peptides identified with the highest scores and originating from 8 known proteins annotated as model ones. The predictive ability of the QSRR model was verified using a Bacillus subtilis proteome digest after separation and identification of the peptides by LC-ESI-MS/MS. That ability was tested with three sets of testing peptides assigned to proteins identified with different levels of confidence. First, the set of peptides identified with the highest scores achieved in the search was considered; here, proteins identified on the basis of more than one peptide were taken into account. Furthermore, proteins identified on the basis of just one peptide were also considered and, depending on whether their scores were above or below the assumed threshold, were analyzed in two separate sets. The QSRR approach was applied as an additional constraint in proteomic research, verifying the results of the MS/MS ion search and confirming the correctness of the peptide identifications while indicating potential false positives.


Introduction
Liquid chromatography (LC) combined with tandem mass spectrometry (MS/MS) plays an essential role in the field of protein research. In this technique, proteins and peptides are separated by liquid chromatography and then identified by tandem mass spectrometry analysis. Thanks to the high resolution, accuracy, and sensitivity of LC-MS/MS systems equipped with sophisticated fragmentation techniques, not only can simple proteins be directly investigated, but research at the level of whole proteomes has also become possible [1]. However, the identification of proteins and peptides from biological matrices is still an analytical challenge because of the great complexity of the samples, the enormous concentration ranges of the occurring proteins, and the lack of proper standards. Together, these factors limit exact and precise peptide or protein identification and, consequently, proteome coverage [2].
Proteomic research also requires higher throughput of protein identification in LC-MS/MS. Peptide identification in MS/MS is based on matching the parent ion m/z and the m/z values of the daughter ions. This procedure makes it possible to assign an identification confidence to each particular peptide, which contributes independently to the overall confidence of the protein identification. One of the most commonly applied methods for protein identification in complex samples relies on the correlation algorithm Sequest proposed by Yates and coworkers [3][4][5][6]. This algorithm matches the tandem mass spectrometry data of the investigated peptide with the corresponding data from a protein database. To increase the reliability of the identification, several statistical parameters have been considered. First, the difference between the normalized cross-correlation functions for the first and second ranked results (ΔCn) is applied to indicate a correctly selected peptide sequence. The other criteria are the cross-correlation score between the observed peptide fragment mass spectrum and the theoretically predicted one (Xcorr), the preliminary score based on the number of ions in the MS/MS spectrum that match the experimental data (Sp), the rank of the particular match during the preliminary scoring (RSp), and the ions value (I) describing how many of the observed ions match the theoretical ions for the listed peptide. Currently, the criteria most often applied in protein studies are Xcorr and ΔCn. Washburn et al. [7] applied the following criteria of correctness of peptide identification: Xcorr above 1.9 for singly charged fully tryptic peptides, over 2.2 and 3.75 for fully or partially tryptic doubly and triply charged peptides, respectively, and ΔCn values higher than 0.08.
On the other hand, in the studies performed by Peng et al. [8] the peptides were classified as properly identified when Xcorr was, in the case of fully tryptic peptides, higher than 2.0, 1.5, or 3.3 for the charge states of 1+, 2+, and 3+, respectively, and over 3.0 (2+ charged) or 4.0 (3+ charged) for partially tryptic peptides, with a ΔCn score above 0.08. The relationship between the application of different filtering criteria and the degree of false positive identifications has also been demonstrated by Qian et al. [9]. There it was shown that all previously applied filtering criteria were derived using either relatively simple proteomes (e.g., the yeast proteome) or standard proteins. The degree of false positive identifications when these criteria are extended to considerably more complex mammalian proteomes, especially the human proteome, is still problematic and requires improved strategies to distinguish correct from incorrect identifications. Therefore, to decrease the probability of a random match, which grows with the size of the protein database, two new sets of filtering criteria were independently developed for human cell line and human plasma samples [9]. For human cell line samples, the new criteria were as follows: Xcorr ≥ 1.5 for fully tryptic peptides and Xcorr ≥ 3.1 for partially tryptic peptides for the 1+ charge state, Xcorr ≥ 1.9 for fully tryptic peptides and Xcorr ≥ 3.8 for partially tryptic peptides for the 2+ charge state, and Xcorr ≥ 2.9 for fully tryptic peptides and Xcorr ≥ 4.5 for partially tryptic peptides for the 3+ charge state. All the criteria required a ΔCn value of ≥0.1.
The new criteria for peptides from human plasma samples were as follows: for the 1+ charge state, Xcorr ≥ 2.0 and ≥ 3.0 for fully and partially tryptic peptides, respectively; for the 2+ charge state, Xcorr ≥ 2.4 for fully and ≥ 3.5 for partially tryptic peptides; and for the 3+ charge state, Xcorr ≥ 3.7 for fully and ≥ 4.5 for partially tryptic peptides. The ΔCn values were in all cases ≥0.1 as well.
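The two threshold sets of Qian et al. [9] quoted above can be read as a lookup table keyed by charge state and tryptic state. The following is a minimal sketch of such a filter, assuming a simple record per peptide hit; the names `PeptideHit`, `CRITERIA`, and `passes`, as well as the example hit, are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass

# Minimum Xcorr per charge state as quoted in the text:
# charge -> (fully tryptic threshold, partially tryptic threshold).
# The dictionary layout and names are illustrative assumptions.
CRITERIA = {
    "qian_cell_line": {"xcorr": {1: (1.5, 3.1), 2: (1.9, 3.8), 3: (2.9, 4.5)},
                       "delta_cn": 0.1},
    "qian_plasma": {"xcorr": {1: (2.0, 3.0), 2: (2.4, 3.5), 3: (3.7, 4.5)},
                    "delta_cn": 0.1},
}


@dataclass
class PeptideHit:
    sequence: str        # peptide sequence reported by the search
    charge: int          # precursor charge state (1, 2, or 3)
    xcorr: float         # cross-correlation score
    delta_cn: float      # normalized difference to the second-ranked match
    fully_tryptic: bool  # True for fully tryptic, False for partially tryptic


def passes(hit: PeptideHit, criteria_name: str) -> bool:
    """Return True if the hit meets both the Xcorr and the delta-Cn threshold."""
    rules = CRITERIA[criteria_name]
    thresholds = rules["xcorr"].get(hit.charge)
    if thresholds is None:  # charge state not covered by the criteria
        return False
    min_xcorr = thresholds[0] if hit.fully_tryptic else thresholds[1]
    return hit.xcorr >= min_xcorr and hit.delta_cn >= rules["delta_cn"]


# Hypothetical doubly charged, fully tryptic hit.
example = PeptideHit("HYPOTHETICALK", charge=2, xcorr=2.0, delta_cn=0.12,
                     fully_tryptic=True)
print(passes(example, "qian_cell_line"), passes(example, "qian_plasma"))
```

The same hit can pass one set and fail the other, which illustrates why the choice of filtering criteria affects the proportion of false positives.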
Nevertheless, considering the variety and dynamic range of the proteins occurring in different organisms, there is still a possibility of false positive or false negative identification. Growing concerns about the quality of MS data have resulted in various ideas for strengthening protein identification by bioinformatics methods, for example, decoy search strategies [10], or by additional information obtained during the analysis, for example, peptide pI or retention time [11]. The retention time is a very practical parameter in proteomics, as it is easy to obtain from LC-MS data and does not require much instrumental effort [2,12]. Comparison of the experimental and predicted retention times of the occurring peptides can be used to examine the correctness of the identification and then to exclude the incorrectly identified ones. However, to predict peptide retention properly, highly accurate models should be developed. Recently, some models have been proposed which characterize quantitatively the structure of a peptide and predict its gradient RP-LC retention at given separation conditions [13,14].
Liquid chromatography (LC) is an analytical technique which can provide a great amount of quantitative, comparable, and reproducible retention data for large sets of structurally diversified compounds (analytes). Moreover, chromatographic retention time can be considered a parameter dependent on chemical structure, which is constant for given separation conditions (mobile phase composition, stationary phase, temperature, pH). For that reason, quantitative structure-(chromatographic) retention relationships (QSRR) have been considered a model approach for establishing a strategy of retention prediction. As noted above, proper prediction of peptide retention requires highly accurate models [15][16][17]. In particular, in proteomics, the structural descriptors obtained from QSRR studies can contribute to better predictions of retention times and therefore strengthen peptide identification.
Several previous reports [18][19][20][21] show that the retention of peptides in reversed-phase liquid chromatography (RP-LC) depends on their amino acid composition. There, regression analysis was used to derive the regression coefficients, which represented the contribution of each amino acid in the peptide's sequence to its retention. This approach was applied in proteomic analysis to predict the retention times of peptides in tryptic digests [22]. It was also employed to increase the reliability of peptide identification and to check the predictive capability of artificial neural networks (ANNs) by Petritis et al. [23] and by Shinoda et al. [24], where the created ANN was then applied to predict the retention times of peptides from the Escherichia coli proteome. The correlation between amino acid composition and peptide retention time was also used to support the identity information given by tandem mass spectrometry for peptides from the Drosophila melanogaster proteome and to exclude false positive identifications [25].
Recently, a QSRR model based on multiple linear regression has been proposed [26] to quantitatively characterize the structure of a peptide and to predict its gradient RP-LC retention at established separation conditions. The logarithm of the sum of gradient retention times of the amino acids composing the individual peptide, log SumAA, the logarithm of the peptide van der Waals volume, log VDWVol, and the logarithm of its calculated n-octanol-water partition coefficient, clog P, were employed [26][27][28][29].
The aim of the study was to derive a retention time prediction model based on quantitative structure-retention relationships (QSRRs) and to check its predictive capability. The newly modified QSRR model was derived using a set of peptides identified with the highest scores and originating from eight model proteins [13,24,30,31,32]. Therefore, no synthesized peptides with known amino acid sequences were used to derive and check the model [14,31]. Moreover, the descriptors applied in the new QSRR model were obtained in a new manner that is easier from a practical point of view. Finally, its predictive ability was supported by further investigation using a Bacillus subtilis proteome digest (rather than, as previously, only synthesized peptides with known amino acid sequences). To demonstrate that ability, three sets of testing peptides from proteins identified with different levels of confidence were used. Moreover, additional attempts were made to demonstrate the utility of the QSRR approach as an additional constraint confirming the correctness of the peptide identifications.

Standards.
The standard amino acid solutions were prepared by dissolving seven of the twenty naturally occurring amino acids (isoleucine, leucine, methionine, phenylalanine, tryptophan, tyrosine, and valine, all from Fluka BioChemika, Buchs, Switzerland) in a 0.1% aqueous solution of trifluoroacetic acid (TFA). Water was deionized by passing through a Direct-Q system (Millipore, Bedford, MA, USA). The concentrations of the samples were approximately 0.6 mg/mL.
The solutions of standard proteins annotated as the eight model proteins (about 3 mg/mL) were as follows: bovine serum albumin (BSA), chicken egg ovalbumin (CEO), bovine milk lactoglobulin (BML), bovine milk β-casein (BMC), bovine myoglobin (BM), human serum albumin (HSA), and ribonuclease B (RibB) from Sigma-Aldrich (Steinheim, Germany), and insulin-like growth factor-binding protein 1 (IGFBP-1), which was purified from human amniotic fluid following a previously reported procedure [33]. They were obtained by dissolving the lyophilized standard proteins in deionized water and were then treated as described below in the digestion protocol.

Spore Purification.
As described before [33], forty-eight-hour cultures in nutrient broth were pelleted (10,000 ×g, 10 min) and washed three times with 1/4 volume of cold water. The pellet was resuspended in 1/5 of the initial volume of cold MQ water and incubated overnight at 4 °C. On subsequent days the suspension was centrifuged (20,000 ×g, 20 min, 4 °C) and the pellet was resuspended in fresh cold MQ water. This procedure was repeated for 5 to 10 days. Purified spores were kept in water suspension at 4 °C in the dark. Once per week the spores were centrifuged and suspended in fresh water to avoid spontaneous germination.

Protein Extraction.
The spore pellet (approximately 20 mg of spores) was resuspended in 1 mL of extraction buffer (50 mM Tris-HCl, pH 7.8; 2% SDS; 10% glycerol; 0.2 M DTT), boiled for 5 min, and vortexed for 30 seconds. These steps were repeated twice. Unlysed spores and spore debris were removed by centrifugation at 12,000 ×g for 5 min at 4 °C. The supernatant was precipitated with an acidified acetone/methanol mixture: four volumes of cold precipitation reagent were added to one volume of protein solution and the mixture was kept at −20 °C. The precipitate was spun down at 15,000 ×g at 4 °C, the supernatant was discarded, and the samples were drained, resuspended in water, and stored at −80 °C. The protein concentration was determined with the Bradford assay kit (Bio-Rad Laboratories) and equalled 1.2-1.5 mg/mL.

Digestion Protocol.
To 1 mL of each protein (BSA, CEO, BML, BMC, BM, HSA, RibB, and IGFBP-1) sample (∼3 mg/mL), 300 μL of DTT (dithiothreitol) (Sigma-Aldrich, Steinheim, Germany) (100 mM, freshly prepared in 100 mM ammonium bicarbonate buffer, pH 8.5) was added. The samples were kept at 60 °C for 30 min to allow reduction of the disulfide bridges. Then 50 μg of trypsin was added (ratio 1 : 50 E/S) to each sample. Samples were digested for 12 hours (overnight digestion) at 37 °C. After that, 0.1 mL of TFA was added to each sample to stop the digestion. The concentrations of the obtained standard solutions were about 50 pmol/μL. To 1 mL of Bacillus subtilis spore cell lysates (1.2-1.5 mg/mL), 150 μL of DTT (Sigma-Aldrich, Steinheim, Germany) (100 mM, freshly prepared in 100 mM ammonium bicarbonate buffer, pH 8.5) was added. The samples were kept at 60 °C for 30 min to allow reduction of the disulfide bridges. Then 25 μg of trypsin was added (ratio 1 : 50 E/S) to each sample. Samples were digested for 12 hours (overnight digestion) at 37 °C. After that, 0.05 mL of TFA was added to each sample to stop the digestion. The concentrations of the obtained solutions were about 50 pmol/μL. Tryptic digests were stored at −20 °C (if frozen in this reaction mixture, the disulfide bonds would not reoxidize). The LC-ESI-MS/MS analyses were performed within three weeks at the latest (the shelf life of such a frozen solution is a couple of months) (http://www.thermo.com/).
The mobile phase consisted of two solvents (A and B) mixed on-line. Solvent A was a 0.1% aqueous solution (MS-grade water) of trifluoroacetic acid (TFA) (Sigma-Aldrich, Steinheim, Germany) and solvent B was acetonitrile (ACN) (MS-grade, Sigma-Aldrich, Steinheim, Germany) containing 0.1% TFA. A linear gradient from 0% B to 60% B over 90 min was applied. The flow rate was 200 μL/min. The injection volume was 10 μL. The LC-MS apparatus was equipped with a thermostated column oven and a Surveyor autosampler controlled at 20 °C (Thermo Finnigan, San Jose, CA, USA), a quaternary gradient Surveyor MS pump (Thermo Finnigan, San Jose, CA, USA) with a diode array detection (DAD) system, and an LTQ linear ion trap MS system with an ESI ion source controlled by Xcalibur software 1.4 (Thermo Finnigan, San Jose, CA, USA). (m/z); the activation amplitude was 35% of ejection RF amplitude that corresponds to 1.58 V.
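For orientation, the programmed composition of the linear gradient described above (0% B to 60% B in 90 min) can be computed for any time point. The following is a minimal sketch; the function name `percent_b` is hypothetical, and the calculation ignores the system dwell volume and column dead time, so it gives the programmed composition at the pump rather than the composition actually reaching a peptide on the column.

```python
def percent_b(t_min: float, gradient_time: float = 90.0,
              start_b: float = 0.0, end_b: float = 60.0) -> float:
    """Programmed %B at time t_min (minutes) for a linear gradient.

    Values are clamped to the start and end composition outside the gradient
    window. Dwell volume and dead time are deliberately ignored (assumption).
    """
    if t_min <= 0.0:
        return start_b
    if t_min >= gradient_time:
        return end_b
    return start_b + (end_b - start_b) * t_min / gradient_time


# Programmed composition at the midpoint of the 90 min gradient.
print(percent_b(45.0))  # 30.0
```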

Protein Identification.
The experimental retention times of the peptides (tR exp) were determined at the peak intensity maximum. The m/z values measured manually for the most intense peaks in the acquired MS/MS spectra were automatically searched against the protein database (*.fasta) using the Sequest algorithm incorporated into Bioworks 3.0 (Thermo Finnigan, San Jose, CA, USA). The *.fasta file for each protein was downloaded from Expasy (http://www.expasy.org/sprot/). In interpreting the results of the correlation analysis of the experimental and predicted retention times of the peptides, the exemplary filtering criteria applied were the same as those discussed previously, proposed by Washburn et al. [7]. The spectra for singly charged peptides with a cross-correlation score to a tryptic peptide (Xcorr) greater than 1.9, the spectra for doubly charged tryptic peptides with Xcorr of at least 2.2, and the spectra for triply charged tryptic peptides with Xcorr above 3.75 were accepted as correctly identified according to the Sequest software. For all the spectra analyzed, ΔCn values were above 0.08. Regression coefficients (± standard deviations), multiple correlation coefficients, R, standard errors of estimate, s, significance levels of each term and of the whole equations, p, and values of the F-test of significance, F, were calculated.
The structural descriptors of the analyzed standard amino acids and of the peptides from the investigated standard proteins and Bacillus subtilis cells were calculated. First of all, in contrast to the previous models [26][27][28][29], where log SumAA was calculated by simple addition of the retention of the component amino acids (taking into account all 20 naturally occurring amino acids), the novel QSRR peptide descriptor log Sum(k+1)AA was used. The retention factor (k) was introduced because it is more similar across related systems than tR, as it compensates for some physical differences between columns. The descriptor log Sum(k+1)AA was calculated using retention data for only the 7 most retained amino acids (isoleucine, leucine, methionine, phenylalanine, tryptophan, tyrosine, and valine). The other 13 amino acids are hardly retained; therefore, their presence in a peptide's sequence does not significantly influence its retention. For these 13 amino acids a fixed value was ascribed (k = 0), and one was added to avoid zero in the calculation of the logarithm, according to the procedure elaborated and evaluated elsewhere [34]. In addition, in search of the most accurate values of the calculated logarithm of the n-octanol-water partition coefficient, clog P, different calculation methods were tested (data not shown). Briefly, to obtain clog P values, HyperChem 7.5 professional software for personal computers (HyperCube, Waterloo, Canada) with the extension ChemPlus, Dragon professional 5.0 software (Milano Chemometrics and QSAR Research Group-Talete, Milano, Italy), and the on-line available ALOGPS 2.1 software (http://www.vcclab.org/) were used. Finally, to derive the appropriate QSRR model, the average clog P value given by the log P module of the ALOGPS 2.1 software was used as that QSRR descriptor.
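The construction of the log Sum(k+1)AA descriptor described above can be sketched in a few lines. This is a hedged illustration: it assumes that the descriptor is the base-10 logarithm of the sum of (k + 1) over all residues of the peptide, with k = 0 for the 13 weakly retained amino acids, and it uses the standard definition of the retention factor, k = (tR − t0)/t0. The numerical k values in `K_RETAINED` are placeholders and are not the values measured in the study.

```python
import math

# Placeholder retention factors for the seven retained amino acids
# (one-letter codes I, L, M, F, W, Y, V). These numbers are hypothetical
# and only illustrate the calculation; they are NOT the measured values.
K_RETAINED = {"I": 4.0, "L": 4.5, "M": 2.0, "F": 7.0, "W": 9.0, "Y": 3.0, "V": 1.5}


def retention_factor(t_r: float, t_0: float) -> float:
    """Standard retention factor k = (tR - t0) / t0, with t0 the dead time."""
    return (t_r - t_0) / t_0


def log_sum_k1_aa(sequence: str, k_values=K_RETAINED) -> float:
    """Assumed form of the descriptor: log10 of the sum of (k + 1) over residues.

    Residues missing from k_values are treated as unretained (k = 0), so each
    of them contributes exactly 1 to the sum.
    """
    total = sum(k_values.get(residue, 0.0) + 1.0 for residue in sequence.upper())
    return math.log10(total)


# A peptide made only of unretained residues contributes 1 per residue.
print(log_sum_k1_aa("GGG"))  # log10(3)
```

Because every residue contributes at least 1, the sum is never zero and the logarithm is always defined, which is the stated purpose of adding one.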
The general QSRR equation has the following form:

$$t_R = k_1 + k_2 \log \mathrm{Sum}_{(k+1)AA} + k_3 \log \mathrm{VDW}_{Vol} + k_4\, \mathrm{clog}\,P \qquad (1)$$

where tR is the gradient HPLC retention time and k1-k4 are regression coefficients.
The peptides listed in Table 2 were used to check the general validity of the proposed QSRR model. In view of the main objective of this work, three other sets of testing peptides originating from B. subtilis proteome digestion were used. One set includes 54 peptides belonging to proteins identified on the basis of more than one peptide with Xcorr above 1.5 (Table 3). A second set comprises 41 peptides belonging to proteins identified again with Xcorr above 1.5, but on the basis of just one peptide (Table 4). The third set comprises 40 peptides belonging to proteins identified on the basis of just one peptide, but with Xcorr below 1.5 (Table 5).
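A model with four regression coefficients, k1-k4, and three structural descriptors can be fitted by ordinary least squares. The sketch below assumes that the general equation is linear, with k1 as the intercept, and computes the statistics named in the text (multiple correlation coefficient R, standard error of estimate s, and F). It is a plain-Python illustration with hypothetical function names, not the software used in the study, and the descriptor values in the usage example are synthetic.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        partial = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - partial) / m[i][i]
    return x


def fit_qsrr(descriptors, t_r):
    """Least-squares fit of t_R = k1 + k2*x1 + k3*x2 + k4*x3.

    descriptors: rows [x1, x2, x3], one row per peptide.
    Returns the coefficients [k1, k2, k3, k4] and a dict with R, s, and F.
    Requires more peptides than coefficients and non-constant t_R values.
    """
    rows = [[1.0] + list(d) for d in descriptors]  # prepend intercept column
    n, p = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, t_r)) for i in range(p)]
    coef = solve_linear(xtx, xty)  # normal equations
    pred = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(t_r) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(t_r, pred))
    ss_tot = sum((y - mean_y) ** 2 for y in t_r)
    dof = n - p
    r_mult = math.sqrt(max(0.0, 1.0 - ss_res / ss_tot))
    s = math.sqrt(ss_res / dof)
    f_stat = ((ss_tot - ss_res) / (p - 1)) / (ss_res / dof) if ss_res > 0 else math.inf
    return coef, {"R": r_mult, "s": s, "F": f_stat}


# Synthetic descriptors and retention times generated from known coefficients.
X = [[0.5, 2.0, -1.0], [0.8, 2.2, -0.5], [1.0, 2.5, 0.3],
     [1.2, 2.1, 1.0], [0.6, 2.8, -2.0], [0.9, 2.4, 0.0]]
y = [2.0 + 3.0 * a - 1.0 * b + 0.5 * c for a, b, c in X]
coefficients, stats = fit_qsrr(X, y)
print([round(c, 6) for c in coefficients])
```

Solving the normal equations directly is adequate for three descriptors; for larger or nearly collinear descriptor sets, a QR-based solver would be numerically safer.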

Results and Discussion
The model set, consisting of the 50 peptides with the highest values of Xcorr, was used to create a model for predicting the retention times of the peptides from the proteome of Bacillus subtilis cells. Within this group, the differences between experimental and predicted retention times ranged from 0.01 to 2.81 min. For 42% of the peptides (21 peptides) the differences between experimental and predicted retention times were lower than 1 min, and for the remaining 58% (29 peptides) these values ranged from 1 to 3 min (Table 1). Taking into account the retention times and the values of the descriptors for those 50 model peptides, a specific equation, (2), was derived. The description of tR by (2) was good, as documented by the following criteria of statistical quality. All the regression coefficients were highly statistically significant, as was the whole equation. The multiple correlation coefficient, R, the standard error of estimate, s, and the value of the F-test of significance, F, were all also satisfactory.
Equation (2) provides a predictive model based on an experimentally obtained descriptor (log Sum(k+1)AA) and improved by the implementation of a molecular-modeling-based descriptor (clog P). The experimentally obtained descriptor (log Sum(k+1)AA) appeared to contribute significantly to the retention of the peptides. However, log Sum(k+1)AA has little in common with the n-octanol/water partition coefficient, neither for individual amino acids nor for the peptide. The considered analytes are highly ionizable, and only a minute fraction of the molecules can exist in nonionized form in solution. Only for that fraction does log P (clog P) properly reflect the ability to partition between the aqueous and hydrophobic phases. Therefore, the log Sum(k+1)AA parameter was not considered to mimic clog P; it actually reflects differences in peptide polarity. Instead, clog P was an auxiliary peptide structure descriptor: a correction for log Sum(k+1)AA.
In order to check the correctness of the model, the set of 21 peptides (Table 2) derived from the 8 model proteins was used as the validation set. The predicted retention times, calculated from (2), were compared to the experimental retention times, and the differences between them were calculated. The differences varied from 0.09 to 3.08 min (mean value 1.29 min, Table 2). For 9 peptides (42.86%) the differences between experimental and predicted retention times ranged from 0.09 to 0.46 min; for 11 peptides (52.38%) the range was 1.07-2.99 min; and for 1 peptide (4.76%) the difference was over 3 min. The correlation (R = 0.979) between experimental and predicted retention times additionally confirmed the validity of the model (Figure 1), indicating that similar values of predicted and experimental retention times of the analyzed peptides also correlate with a higher probability of correct identification by the Sequest algorithm (Figure 5).
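The validation quantities reported above (range and mean of the absolute differences, and the correlation coefficient between experimental and predicted retention times) can be computed as in the following sketch. The function name `validation_summary` and the retention times in the usage example are hypothetical; the correlation is the ordinary Pearson coefficient, which is assumed here to be what R = 0.979 denotes.

```python
import math


def validation_summary(t_exp, t_pred):
    """Summarize agreement between experimental and predicted retention times.

    Returns the minimum, maximum, and mean absolute difference (minutes)
    and the Pearson correlation coefficient between the two series.
    """
    diffs = [abs(e - p) for e, p in zip(t_exp, t_pred)]
    n = len(diffs)
    mean_e = sum(t_exp) / n
    mean_p = sum(t_pred) / n
    cov = sum((e - mean_e) * (p - mean_p) for e, p in zip(t_exp, t_pred))
    var_e = sum((e - mean_e) ** 2 for e in t_exp)
    var_p = sum((p - mean_p) ** 2 for p in t_pred)
    return {
        "min_diff": min(diffs),
        "max_diff": max(diffs),
        "mean_diff": sum(diffs) / n,
        "r": cov / math.sqrt(var_e * var_p),
    }


# Hypothetical retention times in minutes (not data from the study).
summary = validation_summary([10.0, 20.0, 30.0], [11.0, 19.0, 31.0])
print(summary["mean_diff"], round(summary["r"], 3))
```

Reporting both the absolute differences and the correlation is useful because a high correlation alone does not rule out a systematic offset between predicted and experimental values.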

QSRR-Based Analysis of Peptides from Bacillus subtilis Proteome.
Using (1), the predicted retention times for the peptides identified for the proteome of Bacillus subtilis cells were further calculated (Tables 3-5). The experimental retention times for these peptides were obtained in the LC-MS/MS analysis and compared to the calculated ones. Here, special attention was paid to peptides with low Xcorr (around 1.5) in order to check the applicability of the proposed model and to indicate potential false positives. In this case, the most important goal was to provide a QSRR-based tool for distinguishing true from false positive peptide identifications.
The derived model, whose accuracy was confirmed in Figure 1, was also applied to calculate the retention times of peptides from the real proteome sample of Bacillus subtilis cells. Its correctness was first checked by calculating the predicted retention times of peptides belonging to proteins identified on the basis of more than one peptide with Xcorr above 1.5, that is, those assumed to be the most confident true positives. The correlation plot in Figure 2 shows that the predicted and experimental retention times do not vary significantly, so it can be concluded that those peptides, and the proteins to which they are assigned, are correctly identified and really present in the analyzed sample. The detailed accuracy of the peptide identification can be further examined in Table 3. In the set of 54 peptides obtained from digestion of the Bacillus subtilis proteome and belonging to proteins identified on the basis of more than one peptide with Xcorr above 1.5, the differences between experimental and predicted retention times varied from 0.08 to 18.07 min (mean value 5.13 min). For 8 peptides (14.81% of the set), the difference between experimental and predicted retention times was lower than 1 min. There were 6 peptides (11.11%) whose retention time differences ranged between 1 and 3 min. In most cases, the differences between experimental and predicted retention times were from 3 to 5 min or from 5 to 10 min, for 18 (33.33%) and 16 (29.63%) peptides, respectively. For 4 peptides (7.41%) the difference between experimental and predicted retention times ranged from 10 to 15 min. There were also 2 cases for which these values were between 15 and 20 min. The correlation between experimental and predicted retention times can be considered good, with a correlation coefficient of 0.936 (Figure 2).
However, some peptides in this set could probably be considered false positives (e.g., ESIAQVAAISAADEEVGSLIAEAMER or MSGWLAHILEQYDNNRLIRPR). Generally, at that point, it was again shown that it is possible to predict the retention times of unknown peptides of the Bacillus subtilis proteome, based on retention data obtained experimentally only for the limited
Journal of Biomedicine and Biotechnology
Among the 41 Bacillus subtilis peptides belonging to proteins identified on the basis of just one peptide with Xcorr above 1.5 (Table 4), the difference between experimental and predicted retention times varied from 0.35 to 11.7 min and the mean value was 4.92 min. The predicted retention times of 5 peptides differed from the experimental ones by less than 1 min, which corresponds to 12.20% of the investigated set. For another 8 peptides (19.51%) the difference between experimental and predicted retention times was higher than 1 min but lower than 3 min. A retention time difference in the range from 3 to 5 min was characteristic of 11 peptides, constituting 26.83% of the studied set. The highest number of peptides (13, or 31.71%) was characterized by a 5 to 10 min difference in retention times. On the other hand, the highest values of the difference between predicted and experimental retention times, over 10 min, were characteristic of 4 peptides (9.76%), and the largest difference was 11.7 min (Table 4). The correlation between experimental and predicted retention times is still reasonable, with a correlation coefficient of 0.8405 (Figure 3). Some peptides in this set also seem to be false positives (e.g., DQDISGEKATADQLLKDVK or IQNGDPIAGLFDEFTQTVQR), even though they fulfill the established Xcorr criterion for proper peptide identification. The differences between predicted and experimental retention times (here 11.49 and 11.70 min, respectively) suggest that these peptides, and the proteins from which they originate, may not really be present in the analyzed sample.
Finally, in the group of 40 Bacillus subtilis peptides belonging to proteins identified again on the basis of just one peptide, but with Xcorr below 1.5 (Table 5), the differences between experimental and predicted retention times ranged from 1.27 to 78.80 min (mean value 29.41 min). There were only 4 peptides (10%) whose predicted and experimental retention times differed by less than 3 min. In the next 5 cases (12.5%) this difference was over 3 but lower than 5 min. There were 3 peptides (7.5%) in the range between 10 and 15 min of difference between predicted and experimental retention times. For another 5 peptides (12.5%), the difference was from 15 to 20 min. The next 4 peptides (10%) in this group were characterized by a 20 to 30 min difference between predicted and experimental retention times. There was 1 case (2.5%) where the difference in retention times ranged between 30 and 50 min. For the last 13 peptides (32.5%) in this set the experimental and predicted retention times differed by more than 50 min: there were 4 cases (10%) where these values differed by between 50 and 60 min, 3 peptides (7.5%) in the 60 to 70 min range, and 6 (15%) differing by more than 70 min (Table 5). It must be stated that for peptides belonging to proteins identified on the basis of one peptide with Xcorr below 1.5, no correlation between experimental and predicted retention times can be observed (Figure 4).
Therefore, it may be concluded that a large number of peptides in this set should be classified as false positives, especially those with an extremely high difference between experimental and predicted retention times (e.g., HGGSLSAPAIH, DGITDVL, IDFPTNITMD, or LAAGISTI, where these differences are 78.80, 77.54, 73.73, and 73.26 min, respectively).
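The interval breakdowns reported for the three B. subtilis peptide sets can be produced from a list of absolute retention time differences as in the sketch below. The function name `bin_differences`, the half-open interval convention [lower, upper), and the example values are assumptions made for illustration only.

```python
from bisect import bisect_right


def bin_differences(diffs, edges=(1, 3, 5, 10, 15, 20)):
    """Count absolute differences per interval and express them as percentages.

    Intervals are [0, e1), [e1, e2), ..., [e_last, infinity), so a value equal
    to an edge falls into the interval that starts at that edge.
    Returns a list of (count, percentage) tuples, one per interval.
    """
    counts = [0] * (len(edges) + 1)
    for d in diffs:
        counts[bisect_right(edges, abs(d))] += 1
    total = len(diffs)
    return [(c, round(100.0 * c / total, 2)) for c in counts]


# Hypothetical differences in minutes (not data from the study).
example = [0.5, 1.0, 2.9, 4.0, 7.5, 12.0, 18.0, 25.0]
for (count, pct), label in zip(bin_differences(example),
                               ["<1", "1-3", "3-5", "5-10", "10-15", "15-20", ">=20"]):
    print(label, count, pct)
```

Passing a different `edges` tuple, for example one extending to 30, 50, 60, and 70 min, would reproduce the finer breakdown used for the set with Xcorr below 1.5.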
Generally, it can be noticed that lower values of Xcorr correlate with a higher percentage of peptides characterized by a larger difference between experimental and predicted retention times (Figure 5). In particular, this is observed when comparing the percentage of cases in which the differences between predicted and experimental retention times are higher than 15 min across the groups of Bacillus subtilis peptides belonging to proteins identified on the basis of the following: one peptide with Xcorr below 1.5 (Table 5), one peptide with Xcorr over 1.5 (Table 4), and more than one peptide with Xcorr over 1.5 (Table 3). The percentages of peptides characterized by a difference of more than 15 min between experimental and predicted retention times in these groups are 57.5%, 0%, and 3.7%, respectively. On the other hand, in the model and testing sets of peptides obtained from the model proteins, all differences between predicted and experimental retention times were lower than 15 min (Tables 1 and 2). It is noticeable that a high percentage of peptides with low values of Xcorr was characterized by differences between predicted and experimental retention times larger than 15 min, which provides an additional indication that they could be considered potential false positives that were not in fact present in the analyzed sample. Therefore, a QSRR equation for predicting peptide retention times might be a useful tool to increase the throughput of protein identification in LC-MS/MS.
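The comparison above amounts to flagging peptides whose absolute difference between experimental and predicted retention time exceeds 15 min and expressing the flagged fraction per group. The sketch below illustrates this; treating 15 min as a hard cutoff is an assumption made for illustration, the function names are hypothetical, and the peptide entries in the usage example are invented. The counts passed to `percent_flagged` (23 of 40, 0 of 41, and 2 of 54) are those implied by the interval breakdowns given in the text.

```python
def flag_potential_false_positives(peptides, threshold_min=15.0):
    """Return sequences whose |t_exp - t_pred| exceeds the threshold.

    peptides: iterable of (sequence, t_exp, t_pred) tuples, times in minutes.
    The 15 min default mirrors the comparison in the text; using it as a
    decision cutoff is an illustrative assumption.
    """
    return [seq for seq, t_exp, t_pred in peptides
            if abs(t_exp - t_pred) > threshold_min]


def percent_flagged(n_flagged, n_total):
    """Percentage of flagged peptides in a group, rounded to one decimal."""
    return round(100.0 * n_flagged / n_total, 1)


# Invented peptides and retention times, for illustration only.
hits = [("AAAK", 20.0, 21.0), ("BBBK", 10.0, 60.0)]
print(flag_potential_false_positives(hits))

# Counts implied by the text for the three B. subtilis groups.
print(percent_flagged(23, 40), percent_flagged(0, 41), percent_flagged(2, 54))
# 57.5 0.0 3.7
```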

Conclusions
A quantitative structure-retention relationships (QSRR) model, derived using a set of peptides identified with the highest scores and originating from 8 known proteins, was tested with regard to its capability to predict retention times. A Bacillus subtilis proteome digest was used to check the predictive ability of the novel QSRR model proposed in the study. It was found that the QSRR approach can be applied as an additional constraint in proteomic research, verifying the results of the MS/MS ion search and confirming the correctness of the peptide identifications while indicating potential false positives. The results suggest that, thanks to the QSRR used for the prediction of peptide retention, the liquid chromatography separation stage of proteomic research could be useful in the final identification of peptides, especially for the most uncertain protein identifications based on findings for just one peptide.
Figure 5: Percentage of the difference between predicted and experimental retention times (ΔtR) of Bacillus subtilis proteins identified on the basis of (1) more than one peptide with Xcorr over 1.5 (n = 54), (2) one peptide with Xcorr over 1.5 (n = 41), and (3) one peptide with Xcorr below 1.5 (n = 40).