Improved Label-Free LC-MS Analysis by Wavelet-Based Noise Rejection

Label-free LC-MS analysis allows determining the differential expression level of proteins in multiple samples, without the use of stable isotopes. This technique is based on the direct comparison of multiple runs, obtained by continuous detection in MS mode. Only differentially expressed peptides are selected for further fragmentation, thus avoiding the bias toward abundant peptides typical of data-dependent tandem MS. The computational framework includes detection, alignment, normalization and matching of peaks across multiple sets, and several software packages are available to address these processing steps. Yet, more care should be taken to improve the quality of the LC-MS maps entering the pipeline, as this parameter severely affects the results of all downstream analyses. In this paper we show how the inclusion of a preprocessing step of background subtraction in a common laboratory pipeline can lead to an enhanced inclusion list of peptides selected for fragmentation and consequently to better protein identification.


Introduction
In recent years, clinical proteomics has seen growing interest in mass spectrometry-based methods for quantitative differential analysis of protein content in biological samples (e.g., biological fluids from drug-treated versus untreated subjects, or from healthy versus ill patients).
MS-based proteomics approaches for comparative analysis include both methods based on the use of stable isotopes [1], such as iTRAQ (Isobaric tags for relative and absolute quantitation) and SILAC (Stable isotope labeling with amino acids in cell culture), and so-called label-free approaches [2].
Label-free liquid chromatography-mass spectrometry (LC-MS) differential analysis allows determining the differential expression level of proteins in multiple samples without limiting the number of samples being compared and without increasing the complexity of mass spectra. It is based on the direct comparison of peak intensities between multiple runs obtained by continuous detection in MS mode, followed by MS/MS fragmentation of only differentially expressed peptides. This procedure avoids the bias toward abundant peptides, typical of data-dependent tandem MS, and allows an increased identification of low-abundance peptides.
In a typical label-free LC-MS experiment each analysis is performed independently and is followed by comparison of the multiple LC-MS images. The computational framework includes the steps of peak detection, map alignment and normalization, peak matching across multiple sets, and a statistical analysis of the detected features for the evaluation of the differentially expressed peptides.
Several open-source, commercial, and custom software packages that address one or more of these processing steps have been described in the literature [3]. Nevertheless, most of the available tools show little or no care in assessing a minimum quality standard for the LC-MS maps entering the pipeline. Baseline subtraction and denoising, for instance, are still often neglected, despite their strong impact on all downstream analyses [4,5].
In order to show the importance of noise rejection, in this study we report a label-free LC-MS differential analysis of protein abundance in tear samples performed with and without inclusion of a preprocessing step [6] into an established analytical and computational strategy [7,8].
The preprocessing is performed on a whole LC-MS map. The algorithm works iteratively by first extracting all Single Ion Chromatograms (SICs) and then processing each SIC independently by means of a wavelet decomposition to identify and remove the chemical and random noise components.
Several other papers have introduced algorithms that exploit the two-dimensional nature of the data to minimize the noise in the mass domain by signal processing in the chromatographic time domain. The advantage of our denoising strategy over other algorithms, though, mainly comes from characterising and subtracting the noise features from all SICs independently. The limit of other common approaches like CODA [9] or MEND [10], in fact, is that they often process only a selection of SICs. CODA, for instance, automatically retains only chromatograms with high S/N ratio and combines them to form a reduced total ion chromatogram (TIC) trace. Similarly, MEND divides the whole mass range into consecutive regions and for each region it determines a global model of the noise by combining a fixed number of "vacant" SICs that contain no chromatographic peaks.
In the first step of the present work, tear proteins from healthy (H) subjects and patients affected by hyperevaporative dry eye (HDE) were subjected to tryptic digestion and analyzed by reverse-phase chromatography nano-LC ESI-QTOF MS in order to evaluate the protein changes related to the disease. The msInspect software [11] was used for alignment and normalization of the LC-MS maps while the open-source Proteios Software Environment (ProSE) [12] was used for statistical analysis and for the creation of the list of peptides to be identified by RP nano-LC ESI/QTOF MS/MS analysis followed by database search.
In a second phase of the work the same LC-MS raw data files were first filtered to remove chemical and random noise and then reprocessed by the same computational pipeline. The results are compared with those obtained by means of the standard procedure, and the influence of noise rejection on the selection of peptides for MS/MS fragmentation is discussed in light of previously obtained outcomes [13].
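The SIC extraction step described above can be pictured with a minimal sketch. It assumes an LC-MS map held in memory as a list of scans, each a retention time plus a list of (m/z, intensity) pairs, and a fixed m/z bin width; the function name `extract_sics`, the data layout, and the bin width are illustrative assumptions, not details of the published algorithm [6].

```python
from collections import defaultdict

def extract_sics(scans, bin_width=1.0):
    """Group scans into single ion chromatograms (SICs).

    scans: list of (retention_time, [(mz, intensity), ...]) tuples.
    Returns {bin_index: [summed intensity per scan]}, one trace per m/z bin.
    The binning scheme is an assumption made for illustration only.
    """
    n_scans = len(scans)
    sics = defaultdict(lambda: [0.0] * n_scans)
    for scan_idx, (_rt, peaks) in enumerate(scans):
        for mz, intensity in peaks:
            sics[int(mz // bin_width)][scan_idx] += intensity
    return dict(sics)

# Hypothetical three-scan map used only to exercise the function.
scans = [
    (0.0, [(500.2, 10.0), (501.7, 3.0)]),
    (2.0, [(500.4, 12.0)]),
    (4.0, [(500.1, 9.0), (501.3, 4.0)]),
]
sics = extract_sics(scans)
```

Each value in `sics` is one chromatographic trace that could then be denoised independently of the others, which is the property the text emphasizes.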

Materials and Reagents.
All the analytical grade reagents, the Myoglobin, and the solvents were purchased from Sigma Aldrich (St. Louis, MO, USA).

Subjects Studied.
A total of 4 subjects, including 2 healthy volunteers (H; 1 M and 1 F) and 2 hyperevaporative dry eye patients (HDE; 2 M), were admitted to this study.
Inclusion criteria for patients were Schirmer test I value ≥ 10 mm/5 min, Tear Film Break Up Time (T.F.B.U.T.) < 10 seconds, and symptoms of ocular discomfort for at least two months. Inclusion criteria for healthy control subjects were Schirmer test I value ≥ 10 mm/5 min, T.F.B.U.T. ≥ 10 seconds, and no ocular discomfort symptoms. In both groups the exclusion criteria were the presence of punctate keratopathy and/or autoimmune diseases, the use of contact lenses, and any ocular surgery in the last 6 months.
All the tear samples were provided by the Ophthalmology Unit at the University of Bologna (Italy) after obtaining informed consent from the subjects studied and according to DEWS guidelines [14]. A minimum of 5 μL of tears was collected using a micropipette with a sterile tip, centrifuged, and stored as previously described [15].

Tear Samples Preparation.
Total protein quantification of each tear sample was performed by Bradford protein assay using bovine serum albumin (BSA) as a standard according to the manufacturer's instructions (Bio-Rad Laboratories Inc., CA, USA).
For each sample, 10 μg of proteins were diluted to 10 μL with 6 M urea in 100 mM ammonium bicarbonate pH 8.2, and 2 picomoles of myoglobin (Myo, P68082) were added as internal standard. The protein mixtures were reduced by adding 1 μL of 100 mM dithiothreitol (DTT, Sigma) in 100 mM ammonium bicarbonate for 1 hour at 37 °C and alkylated by addition of 3 μL of 100 mM iodoacetamide (IAA, Sigma) in 100 mM ammonium bicarbonate for 1 hour at room temperature in the dark. The resulting samples were incubated overnight at 37 °C with trypsin 12 ng/μL (Promega) in a 50 : 1 (w : w) ratio. The tryptic digestions were blocked after 15 hours of incubation with 1 μL of formic acid, and afterwards the samples were lyophilized to dryness and resolubilized with 20 μL of 0.1% formic acid (FA).

Liquid Chromatography and Mass Spectrometry.
For each sample 4 μL were analyzed by LC-MS using a CapLC (Waters, Manchester, U.K.) with flow splitting from 5 μL/min to 200 nL/min, connected through a nanoelectrospray interface to a QTOF Ultima (Waters), with MassLynx v4.0 as operating software. The peptide separation was performed on an Atlantis dC18 NanoEase column (150 × 0.075 mm, 3 μm) (Waters) with an Atlantis dC18 NanoEase precolumn (0.3 × 5 mm, 5 μm particle size) (Waters). Mobile phase A was H2O/acetonitrile (95 : 5) with 0.1% FA, while mobile phase B was acetonitrile/H2O (95 : 5) with 0.1% FA. A 90-minute chromatographic gradient was used to give a linear increase after 3 minutes from 2% B to 35% B in 70 minutes and from 35% B to 80% B in 2 minutes; after 3 minutes at 80% B the column was conditioned again at 2% B for 15 minutes. One blank injection with a 30-minute gradient was run between samples to reduce sample carryover, and every six samples 2 pmol of Myo tryptic digest were analyzed as quality control using the same 90-minute gradient to evaluate the experimental variation. During MS analysis the QTOF was set to scan in profile mode over m/z 400-1800 with 1.9 seconds per scan and 0.1 seconds of scan delay. The samples were analyzed in triplicate. Four microliters of sample were injected for targeted MS/MS and the same LC gradient was used. The survey scan time was set to 1 second, with a peak limit of 15 counts to switch to MS/MS mode. For inclusion lists the time tolerance was set to 300 seconds.

MS Data Analysis.
The pipeline of MS data analysis has already been described elsewhere by our research group [7,8] and is shown in Figure 1(a). Briefly, massWolf (v2.0, http://sashimi.sourceforge.net) was used for the conversion of Micromass LC-MS raw data files to mzXML, while the peptide feature detection was performed using msInspect v2.1. Two alignments were performed for the analysis of all the LC-MS features: one with an H sample and one with an HDE sample as master. The normalization of the LC-MS maps and their alignments were performed by means of the peptide Array tool in msInspect [11], using a mass window of 0.2 m/z and a time window of 250 scans. After alignment, significantly upregulated features were scheduled for targeted MS/MS in inclusion lists generated using the ProSE 2.1 platform [12]. The inclusion limit was a fold change of at least 1.5 and a P-value of 0.05 in a Student's t-test. For the t-test the total intensities, which represent the integrated peak volumes, were used. For features where peaks could not be found in the healthy control samples, a value of 50 ion counts was used, which was an estimate of the detection level in the present setup. Selected features were sorted according to intensity and put into include lists with a maximum of 300 peaks per include list. The retention times of the second technical replicate were used in the include list. Targeted MS/MS analysis was finally performed to identify the peptides contained in the include lists.
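The inclusion rule described above (fold change of at least 1.5, P-value threshold of 0.05 in a Student's t-test, 50 ion counts substituted for missing control peaks, at most 300 peaks per list) can be sketched as follows. This is not the ProSE implementation: the function names, the data layout, and the numerical integration of the t density are assumptions chosen to keep the example dependency-free, and case intensities are assumed to be always measured.

```python
import math
from statistics import mean, variance

DETECTION_FLOOR = 50.0  # ion counts used when no control peak is found (from the text)

def t_pdf(x, df):
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df):
    """Two-sided p-value via Simpson integration of the t density on [0, |t|]."""
    t = abs(t)
    if t == 0:
        return 1.0
    steps = max(200, 2 * math.ceil(t / 0.01))  # even number of intervals
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)

def student_t_test(a, b):
    """Pooled-variance two-sample t-test; returns (t, p)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    if pooled == 0:
        return (0.0, 1.0) if mean(a) == mean(b) else (math.inf, 0.0)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, two_sided_p(t, df)

def select_upregulated(features, min_fold=1.5, alpha=0.05, max_per_list=300):
    """features: {id: (case_intensities, control_intensities)}; None = no control peak."""
    selected = []
    for fid, (case, control) in features.items():
        control = [DETECTION_FLOOR if v is None else v for v in control]
        fold = mean(case) / mean(control)
        _t, p = student_t_test(case, control)
        if fold >= min_fold and p < alpha:
            selected.append((mean(case), fid))
    selected.sort(reverse=True)  # most intense features first
    ids = [fid for _, fid in selected]
    return [ids[i:i + max_per_list] for i in range(0, len(ids), max_per_list)]

# Hypothetical intensities, for illustration only.
features = {
    "up": ([1000, 1100, 1050], [None, None, None]),
    "flat": ([500, 520, 480], [510, 490, 505]),
    "weak": ([160, 40, 100], [50, 60, 70]),
}
include_lists = select_upregulated(features)
```

In this toy input only `up` passes both criteria: `flat` fails the fold-change limit, and `weak` reaches the fold change but not the significance threshold.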

Tandem MS Data Analysis.
To generate peak lists for peptide identification, ProteinLynx Global Server 2.2 (Waters) was used. The XML format peak lists were converted to mzData using ProSE. Mascot version 2.2 (www.matrixscience.com) was used for peptide identification.
The Sprot human database, version 57.3 (468851 sequences in total), was used. The search settings were 0.2 Da precursor and 0.6 Da fragment tolerances, carbamidomethylation of cysteine as fixed modification, methionine oxidation as variable modification, and semiTrypsin with one missed cleavage as enzymatic digestion. The search results were exported as XML and matched with MS features using a ProSE plug-in, with a retention time tolerance of 100 seconds and a mass tolerance of 0.12 Da. Proteins were considered correctly identified when at least two different peptides (with significant individual score, i.e., P < 0.05) were present.
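The two acceptance rules in this paragraph (tolerance matching of identifications to MS features, and the requirement of at least two different significant peptides per protein) can be illustrated with a short sketch. The function names and tuple layouts are hypothetical and do not reflect the ProSE plug-in or the Mascot export format; the protein names and sequences are placeholders.

```python
from collections import defaultdict

def within_tolerance(feature, identification, rt_tol=100.0, mass_tol=0.12):
    """Both arguments are (retention_time_s, mass_Da) tuples."""
    return (abs(feature[0] - identification[0]) <= rt_tol
            and abs(feature[1] - identification[1]) <= mass_tol)

def accepted_proteins(peptide_hits, alpha=0.05, min_peptides=2):
    """peptide_hits: iterable of (protein, peptide_sequence, p_value).

    A protein is kept when it has at least `min_peptides` different
    peptide sequences with an individual p-value below `alpha`.
    """
    significant = defaultdict(set)
    for protein, sequence, p_value in peptide_hits:
        if p_value < alpha:
            significant[protein].add(sequence)
    return {prot: sorted(seqs)
            for prot, seqs in significant.items()
            if len(seqs) >= min_peptides}

# Placeholder hits, for illustration only.
hits = [
    ("PROT_A", "AAAK", 0.01),
    ("PROT_A", "CCCR", 0.03),
    ("PROT_A", "AAAK", 0.02),  # repeated sequence counts once
    ("PROT_B", "DDDK", 0.01),
    ("PROT_B", "EEER", 0.20),  # not significant
]
accepted = accepted_proteins(hits)
```

Using a set per protein is what enforces "different peptides": repeated identifications of the same sequence do not help a protein pass the rule.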

Noise Rejection.
In the second phase of this work, the original mzXML files entering the pipeline described in Figure 1(a) were cleaned from extraneous noise by a wavelet-based algorithm already described by our research group [6]. Briefly, the algorithm works on a whole LC-MS map by first extracting all Single Ion Chromatograms (SICs) from the spectrographic data and by then decomposing each SIC to identify and remove the noise components. This cleaning step is simply added to the standard pipeline (Figures 1(a) and 1(b)) to selectively remove random and chemical noise while leaving the peptide peaks unaffected. All data were processed by a stand-alone Java application on a 2.66 GHz iMac running Mac OS X with 1 GB of RAM allocated for the JVM heap. The decomposition was performed on 6 scales and by means of the Coifman wavelet of degree 1.

Table 1 (H1):
              Total  Only I  Only II  Only III  I-II  I-III  II-III  I-II-III
UNPROCESSED    4167     767      568       564   315    288     208      1457
PROCESSED      5736    1061      915      1018   379    361     338
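To give a feel for what a 6-scale wavelet decomposition of a single SIC involves, the sketch below applies a multi-level transform, soft-thresholds the detail coefficients, and reconstructs the trace. It is not the published algorithm [6]: the Haar wavelet is swapped in for the Coifman wavelet to keep the code short, the universal threshold with a median-based noise estimate is an assumed choice, and only random noise is addressed (the chemical-noise handling is not reproduced). All function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_forward(signal, levels):
    """Multi-level orthonormal Haar transform; returns (approx, details)."""
    approx, details = list(signal), []
    for _ in range(levels):
        pairs = range(0, len(approx), 2)
        details.append([(approx[i] - approx[i + 1]) / SQRT2 for i in pairs])
        approx = [(approx[i] + approx[i + 1]) / SQRT2 for i in pairs]
    return approx, details

def haar_inverse(approx, details):
    """Invert haar_forward, from the coarsest level back to the finest."""
    for level in reversed(details):
        rebuilt = []
        for a, d in zip(approx, level):
            rebuilt.extend(((a + d) / SQRT2, (a - d) / SQRT2))
        approx = rebuilt
    return approx

def soft(x, t):
    """Soft thresholding: shrink x toward zero by t."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def denoise_sic(sic, levels=6, threshold=None):
    """Soft-threshold the detail coefficients of one SIC (random noise only)."""
    if not sic:
        return []
    n, block = len(sic), 2 ** levels
    padded_len = ((n + block - 1) // block) * block
    padded = list(sic) + [sic[-1]] * (padded_len - n)  # repeat last value as padding
    approx, details = haar_forward(padded, levels)
    if threshold is None:
        # Assumed rule: universal threshold, sigma from the finest-scale median.
        finest = sorted(abs(d) for d in details[0])
        sigma = finest[len(finest) // 2] / 0.6745
        threshold = sigma * math.sqrt(2.0 * math.log(padded_len))
    details = [[soft(d, threshold) for d in level] for level in details]
    cleaned = haar_inverse(approx, details)
    return [max(v, 0.0) for v in cleaned[:n]]  # intensities cannot be negative
```

With `threshold=0.0` the function returns the input unchanged, which is a convenient check that the forward and inverse transforms are consistent before any thresholding rule is tuned.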

Effects of Denoising on Peptide Feature Detection.
The peptide Array tool of msInspect was run on the technical replicates of all subjects and the results are shown in Table 1. The first column of the table shows that the number of peptide features detected by msInspect increases on average by about 35%. In order to assess the sensitivity of the cleaning, only peptides with a charge ≥2 were included in the table. Uncharged features were excluded to avoid spurious peaks, while singly charged peptides were excluded to avoid false positive identifications caused by chemical noise, whose regular pattern of peaks occurring at every Th can easily be mistaken for the isotopic distribution of a 1+ peptide. Furthermore, by only considering peptides that are aligned through at least two of the replicates or through all of the replicates, the average increase in detection can be estimated at around 20% or 13%, respectively.
A second alignment was performed between the original and the filtered files of each LC-MS run. In this case, since each file was practically aligned to itself, stricter parameters of Mass Window = 0.05 Da and Scan Window = 5 scans were imposed. Figure 2 shows that on average half of the peptides are found both before and after the cleaning. The figure also shows that about one sixth of the peptides have disappeared from the original file because of the cleaning, while one third have emerged in the processed files after background subtraction. The biggest region of the pie chart relates to the peaks that are left unaffected by the cleaning, typically high-intensity peptides. A look into the quality of these peptide features shows a 30% average increase in the number of consecutive scans in which a peptide is detected (data not shown). This improvement is usually obtained by unveiling the lowest peaks of its isotopic distribution. The general improvement of the quality of the common peptides is also evident in Figure 3.
The x-axis gives the average intensity of two aligned peaks, while the y-axis shows the logarithm of their KL ratio. KL is the Kullback-Leibler score and it is used in msInspect as a measure of how closely the observed isotopic distribution of a peptide feature matches its own theoretical distribution at a given mass. This score is always nonnegative and approaches zero for better matches. Therefore all peptides which benefit from a correct denoising will gain a lower KL, will have a ratio KL_cl/KL_or < 1 (cleaned over original), and will be located below zero on the logarithmic axis. Conversely, the peptides located above zero are those negatively affected by noise rejection, while points close to the horizontal axis indicate unaffected scores.
The figure thus shows that most of the peptides gained a better (lower) score and that the best improvements were obtained at very low intensities, therefore by those peptides originally masked or hidden by the chemical noise.
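The quantity plotted in Figure 3 can be made concrete with a small sketch. It assumes the standard Kullback-Leibler divergence of the normalized observed isotope pattern from the theoretical one, and a base-10 logarithm for the ratio; the exact formula and logarithm base used by msInspect and by the figure are not stated here, so both are assumptions, and the isotope patterns below are invented for illustration.

```python
import math

def kl_score(observed, theoretical):
    """KL divergence D(observed || theoretical) of two isotope patterns.

    Both patterns are normalized to sum to 1 first. Theoretical
    intensities are assumed to be strictly positive.
    """
    obs_sum, theo_sum = sum(observed), sum(theoretical)
    score = 0.0
    for o, t in zip(observed, theoretical):
        p, q = o / obs_sum, t / theo_sum
        if p > 0:
            score += p * math.log(p / q)
    return score

def log_kl_ratio(kl_cleaned, kl_original):
    """Negative when cleaning improved the match, positive when it hurt."""
    return math.log10(kl_cleaned / kl_original)

# Invented isotope patterns, for illustration only.
theoretical = [0.55, 0.30, 0.15]
original = [0.40, 0.35, 0.25]   # pattern distorted by background
cleaned = [0.52, 0.31, 0.17]    # pattern after hypothetical denoising

ratio = log_kl_ratio(kl_score(cleaned, theoretical),
                     kl_score(original, theoretical))
```

Because only the sign of the logarithm matters for reading the figure, the choice of base does not change which side of zero a peptide falls on.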
The second and the third sectors of the chart show peptides which have appeared or disappeared in the processed maps as a consequence of the preprocessing step. This result is more difficult to interpret because of the high complexity of the samples, which makes it almost impossible to infer which of the found peptides were true positives and which of the lost peptides were true negatives. Nevertheless, visual comparison of original versus clean maps confirms the detection of new low-intensity peaks [6], belonging to sector 2. It also shows that most of the peptides lost in sector 3 are uncharged features which are assigned a charge only after cleaning, like the peptide highlighted in Figure 4.
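The three sectors of Figure 2 (common, lost, and newly emerged peptides) correspond to a tolerance-based comparison of the feature lists from the original and the processed file of the same run. The sketch below uses the stricter windows quoted in the text (0.05 Da and 5 scans) with a simple greedy one-to-one matching; the matching strategy, the function name, and the (m/z, scan) feature layout are assumptions and do not describe how msInspect aligns features.

```python
def classify_features(original, processed, mass_window=0.05, scan_window=5):
    """Compare two feature lists of (mz, scan) tuples.

    Returns (common, lost, emerged): features found in both files,
    only in the original, and only in the processed file.
    Greedy first-match pairing is an illustrative simplification.
    """
    unmatched = list(processed)
    common = lost = 0
    for mz, scan in original:
        hit = next((p for p in unmatched
                    if abs(p[0] - mz) <= mass_window
                    and abs(p[1] - scan) <= scan_window), None)
        if hit is None:
            lost += 1
        else:
            common += 1
            unmatched.remove(hit)  # each processed feature is used once
    return common, lost, len(unmatched)

# Hypothetical feature lists, for illustration only.
original = [(500.00, 100), (620.30, 240), (710.10, 300)]
processed = [(500.02, 102), (710.10, 320), (455.25, 80)]
common, lost, emerged = classify_features(original, processed)
```

In this toy case the feature at m/z 710.10 counts as both lost and emerged because its scan position moved by more than the window, which illustrates why such sectors need visual confirmation on the maps.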

Effects of Denoising on Protein Identification.
The ProSE toolkit was used to identify over-expressed and under-expressed peptides in the maps of the HDE subjects compared to the H controls. The same statistical analysis was then repeated for the processed data, thus producing a total of 4 include lists (IL), subsequently used for peptide identification by tandem MS (Figure 1(b)).
The results of the MS/MS analysis are summarized in Table 2. Since no processing is performed on tandem MS spectra, the quality of protein identification can be directly ascribed to the quality of the include lists. Despite an expected variability of the identified peptides, which is mostly related to experimental conditions of the targeted acquisitions, the general trend shows a clear improvement of protein identification obtained from the processed data.
Among the proteins under-expressed in HDE patients, the denoising strategy allowed the identification of 4 new proteins and achieved a higher sequence coverage and a better protein score for three proteins already found with the unprocessed data. In particular, Secretoglobin (SG1D1) had 0 peptides in the raw IL and 5 in the clean one, Mammaglobin-B (SG2A1) had 0 versus 3, Ig kappa chain C region (IGKC) had 0 versus 2, Cystatin (CYTS) had 0 versus 4, while Lipocalin-1 (LCN1) had 6 versus 12, Proline-rich protein 4 (PROL4) had 1 versus 5, and Lactotransferrin (TRFL) had 4 versus 8. One protein was found with a better sequence coverage but a lower protein score (Polymeric Immunoglobulin Receptor (PIGR): 3 versus 4) and a last one was found only before filtering (Serum Albumin (ALBU): 2 versus 0). Among the proteins over-expressed in HDE patients, 1 protein was found only after noise rejection (Hemoglobin subunit alpha (HBA): 0 versus 4) and 2 proteins achieved a higher sequence coverage and a better score (Hemoglobin subunit beta (HBB): 5 versus 13 and ALBU: 3 versus 6). The obtained results have been compared with the outcomes of a previous study performed by our research group [13], in which the differential expression of proteins in HDE patients over H controls was monitored by mono-dimensional gel electrophoresis and western blot analysis. Under-expression of LCN1, SG1D1, SG2A1, and TRFL, and over-expression of ALBU, are in perfect agreement with the previous study and these proteins are shown in bold italic in Table 2. In particular, noise filtering allowed ALBU to be identified as over-expressed in HDE patients, whereas the standard pipeline wrongly identified the protein as both under- and over-expressed.
Table 2: List of the peptides identified using the include lists from unprocessed and processed data. Bold italic proteins indicate abundance variations in agreement with previous outcomes [13].
The variations in the abundances of HBA, HBB, IGKC, PROL4, PIGR, and CYTS associated with HDE could not be validated by comparison with previously published data. Nevertheless, a clear proof of their proper identification can be observed in Figure 4, where the correct charge assignment of the CYTS peptide QLCSFEIYEVPWEDR allowed its inclusion in the list of peptides under-expressed in tear samples from HDE patients.

Conclusions
We have previously shown that wavelet denoising in the RT domain achieves selective rejection of chemical and random noise while preserving peptide features and morphology. The approach has proved to unveil low-intensity peptides originally masked by the chemical noise and to reduce false positive identification, by filtering noise peaks originally mimicking the peptide morphology.
In this work we have applied our noise filtering strategy to a label-free LC-MS differential analysis of protein abundance in tear samples. The mzXML files have been simply intercepted, processed by our algorithm, and reinserted in the standard workflow just before the analyses by msInspect.
The results show that noise rejection increases the sensitivity of msInspect to real peptides and yields more accurate include lists for further targeted MS/MS analysis. These results are validated by comparison to previous outcomes which confirm an improvement in terms of number of identified proteins, higher sequence coverage, and better protein scores.

Journal of Biomedicine and Biotechnology