Development of an MS Workflow Based on Combining Database Search Engines for Accurate Protein Identification and Its Validation to Identify the Serum Proteomic Profile in Female Stress Urinary Incontinence

A critical stage of shotgun proteomics is database search, a process that attempts to match experimental spectra to theoretical ones. Given the considerable time and effort spent on analysis, researchers naturally aspire to rigorous computational analysis and to more confident and accurate peptide/protein identification. Mass spectrometry (MS) has been applied across several clinical disciplines. Stress urinary incontinence (SUI), caused by a damaged pelvic floor, has become a widespread disease that impairs quality of life worldwide. Although some studies have pointed to markers that could be bioindicators for SUI, these findings raise the issue of sensitivity and specificity. Therefore, it is critical to have a sensitive and specific analytical approach to identify markers that have protective or deleterious associations with disease. Here, we describe the workflow we designed and developed for protein identification from tandem mass spectrometry data using multiple search engines. We apply our workflow to an existing study addressing the pathophysiology of SUI. We demonstrate how the combined approach, together with high-performance computing techniques, can surmount the challenges of complex analyses and extended computing time. We also compare the relative performance of each combination. Our results suggest that a combination of MS-GF+ and COMET represents the best sensitivity-specificity trade-off, outperforming all other tested combinations. The approach was also sensitive and accurately identified a set of proteins shown to be markers for categories of diseases associated with the pathophysiology of SUI. This workflow was developed to encourage proteomic researchers to adopt MS-based techniques for accurate analysis and to promote MS as a routine tool for clinical cohorts.


Introduction
Stress Urinary Incontinence (SUI) is caused by weakened pelvic floor muscles or a weakened urethral sphincter leading to urine leaks whenever there is sudden physical pressure applied to the abdomen or bladder. This type of urinary incontinence causes sudden spurts of leaked urine when someone coughs, laughs, or sneezes, or with straining and exertion. In women, physical changes can also contribute to stress incontinence, including pregnancy, vaginal deliveries, and menopause. One of the major contributors to stress incontinence is estrogen deficiency, which is thought to increase the chances of leakage by lowering muscle pressure around the urethra. Age, parity, and family history are other risk factors that have been previously noted [1,2]. Other risk factors influencing the occurrence of SUI include age, delivery mode, concomitant diabetes mellitus, hereditary factors, ethnicity, neurological illnesses, obesity, Parkinson's disease, parity, recurrent urinary tract infections, and pregnancy [3].
SUI has become a widespread disease that profoundly affects quality of life worldwide [4]. The etiology of SUI is still not completely known, although several studies have suggested that serum and tissue proteins can be bioindicators for SUI. Some of these markers, however, were found to be nonspecific or did not affect the pathophysiology, while others showed no association with SUI [5,6].
While earlier studies have identified the serum proteomic profile in patients with SUI using database search engines, these findings raise a number of questions, including the issue of sensitivity and specificity.
The present research revisits this question and implements a workflow for identifying proteins from tandem mass spectrometry data to complement findings from an existing study addressing the pathophysiology of SUI [7]. In general, a shotgun proteomics workflow is the whole process of mass spectrometry data analysis aimed at addressing a biological question. Over recent years, shotgun proteomics has made tremendous progress and has become the most comprehensive and versatile tool for studying proteins on a large scale [8]. This technique is aimed at identifying proteins in complex mixtures using HPLC in combination with MS/MS.
A shotgun proteomics workflow starts with the extraction of the proteins to be studied from a tissue or cell, followed by digestion with a proteolytic enzyme, commonly trypsin [9], which generates a set of peptides [10]. These peptides are then analyzed by liquid chromatography coupled to mass spectrometry: they are first separated by C18 chromatography and then electrosprayed into the mass spectrometer, where ions are sorted according to their mass-to-charge ratio. After acquisition and signal processing, the fragmentation spectra can be visualized. The data are then analyzed to identify and quantify the specific proteins [11]. The final step of the workflow is the functional analysis, where the relevant proteins are placed in the context of the biological question of interest. One of the most important steps in the process is the identification of proteins. The two most common strategies for peptide/protein identification are database search engines, which attempt to match the experimental spectra to spectra derived from known peptide sequences, and de novo search engines, which attempt to predict the amino acid sequence directly from the tandem mass spectrum without the assistance of a database.
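To make the digestion step concrete, the following minimal Python sketch performs an in silico tryptic digestion of the kind a database search engine applies to each database protein before computing theoretical spectra. It assumes the usual trypsin rule (cleavage after Lys or Arg unless the next residue is Pro); the function name and the example sequence are hypothetical, and the Glu-C digestion also used in this study is not modeled.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """Return peptides from an in silico tryptic digestion.

    Assumed rule: cleave after K or R, except when followed by P.
    `missed_cleavages` allows peptides spanning that many uncut sites.
    """
    # Collect cleavage positions, including both ends of the protein.
    sites = [0]
    for i in range(len(protein) - 1):
        if protein[i] in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))

    peptides = []
    for start_idx in range(len(sites) - 1):
        for extra in range(missed_cleavages + 1):
            end_idx = start_idx + 1 + extra
            if end_idx < len(sites):
                peptides.append(protein[sites[start_idx]:sites[end_idx]])
    return peptides


# Hypothetical toy sequence, not taken from the study.
example_protein = "MKRPAAKGGR"
fully_cleaved = tryptic_digest(example_protein)
with_one_missed = tryptic_digest(example_protein, missed_cleavages=1)
```

Allowing missed cleavages enlarges the search space, which is one reason search time grows quickly when several enzymes and variable modifications are combined.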
In the aforementioned study of SUI, Marianne Koch et al. identified the serum proteomic profile in patients with SUI. The database search algorithms X!Tandem [12] and Mascot [13] were used to perform peptide identification. The database search strategy is a crucial element of a shotgun proteomics study [14]. After the acquisition of the experimental spectra, database search algorithms are used to determine the best sequence match for each spectrum.
Various algorithms have been developed to search MS/MS data and estimate the probability of a match, but they differ in sensitivity and specificity. Sensitivity measures how accurately the analysis can detect the smallest number of targeted proteins, while specificity measures how accurately the analysis singles out target proteins from other (noncontributing) proteins that may be present in the sample [15]. Database searching can produce a significant number of false positives. The overall estimate of false positives in database search engine results is given as the false discovery rate (FDR), a measure of the proportion of false peptide-spectrum matches (PSMs) among the accepted ones [16]. Although various strategies exist to estimate the FDR, the target-decoy (TD) database search remains the most commonly used one in shotgun proteomics. In this strategy, the search is performed on both the true database, known as the target, and a null database, known as the decoy. The null model is required to estimate the FDR. The decoy database is generated from the target database by randomization, permutation, or reversal. The TD approach assumes that, above a given score threshold, the number of false PSMs in the decoy search equals the number of false PSMs in the target search. The TD search can be carried out in two ways: concatenated or separated. The concatenated search is performed on a single database combining the target and decoy databases, whereas the separated search queries the target and decoy databases separately.
It is commonly assumed that the accuracy and dynamic range of the analysis increase as the number of PSMs is maximized [17]. Thus, since search engines identify different subsets of PSMs, combining the capacity of various search engines is a natural way to obtain a better and more accurate result. When the tandem mass spectra have consistent fragmentation and a good signal-to-noise ratio, the identification of the correct sequence match remains fairly straightforward. However, when the tandem mass spectra are of poor quality or show irregular fragmentation, multiple database search algorithms can be combined to perform the analysis [18]. The results of different search engines can be widely divergent. The disagreement between algorithms is due to the flexibility allowed by some programs to use different types of tandem mass spectrometry data and/or modification patterns.
Handling this large number of spectra while minimizing computation time and memory expenses has become a crucial issue in proteomics research. Fortunately, the recent emergence of cluster computing and high-performance computing (HPC) provides an opportunity to reduce the computation time of high-throughput analyses, partly by using parallelization to apply more processor power.
The central tenet of our work is to combine the results of various search engines to profit from their varying selectivity and to assess the likely advance over the previous study. High-performance computing and clustering are used to manage the complexity of running several search engines and to reduce computational time.
In the following sections, we describe our deployed analysis pipeline for protein identification using multiple search engines together with HPC. We also discuss the unique contributions of the search engines by comparing their individual performance on the SUI dataset. We provide a comparison of the various search engine combinations and highlight those with better sensitivity-specificity trade-offs. Finally, we present and discuss the accuracy of the identified set of proteins.
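The target-decoy estimate described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scoring of any particular search engine: decoys are built by sequence reversal, the search is concatenated, higher scores are better, and the FDR at a threshold is estimated as the number of accepted decoy PSMs divided by the number of accepted target PSMs. All names and scores below are hypothetical.

```python
def reverse_decoy(sequence):
    """Build a decoy sequence by reversing a target protein sequence."""
    return sequence[::-1]


def estimate_fdr(psms, threshold):
    """Estimate the FDR among target PSMs scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) pairs from a concatenated
    target-decoy search. Assumption: higher scores indicate better matches,
    and decoy hits above the threshold mirror false target hits.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


# Hypothetical PSM scores for illustration only.
example_psms = [
    (9.1, False), (8.7, False), (8.2, True), (7.9, False),
    (7.5, False), (6.0, True), (5.5, False),
]
fdr_at_7 = estimate_fdr(example_psms, 7.0)  # 1 decoy / 4 targets
```

In practice the threshold is lowered until the estimated FDR reaches the chosen cut-off, and only target PSMs above that threshold are reported.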

Material and Methods
The data used in this follow-up study are those of the previous study (PXD008553). The samples are blood serum samples collected from 19 SUI patients and 19 controls. Of the 38 samples, only 32 were publicly available in PRIDE. However, four samples were not case-control matched; therefore, a total of 24 samples (12 samples of patients with SUI and 12 controls) were available for the analysis.
The main steps considered during sample preparation are serum albumin depletion, digestion using a combination of trypsin and Glu-C, and peptide separation using nano-HPLC. The identification was performed by searching the human Swiss-Prot database on April 15, 2019, and the search was carried out including a decoy database, with an FDR cut-off set to 0.1%. The software used in this analysis is the following: X!Tandem version Vengeance 2015.12 [20], and MyriMatch version 2.2.10165 [21] as search engines; PeptideShaker for interpretation of identification results [22]; and SearchGUI version 3.3.10 for configuring and running the searches [23]. Figure 1 summarizes the main steps of our deployed workflow.
Pipeline scripting was performed using Nextflow, a tool enabling scalable and reproducible computational workflows through software containers. It also simplifies the deployment of complex distributed pipelines [24]. The searches were executed in parallel on a Linux cluster running Slurm for job scheduling and task management [25]. SearchGUI was used to configure the search parameters. The main search settings were the selection of both Glu-C and trypsin as digestion enzymes; carbamidomethylation of Cys as a fixed modification; and phosphorylation of Ser, Thr, and Tyr and oxidation of Met as variable modifications.
Once the search results were generated, in-house scripts developed in R were used to combine the results of the different search engines and to validate peptides/proteins by readjusting PTM localization scores and redesigning the protein inference. Once these tasks were completed, we downloaded and processed the results. Because the output formats of MyriMatch and MS-GF+ differ from those of X!Tandem and COMET, scripts for parsing and processing the results, and for performing decoy-based FDR estimation, were written in R.
Error rate calculations were performed on both protein and peptide levels. First, FDR was estimated for each engine per sample; then, a new FDR (1%) was set up taking into account hits found in all engines.
Combining results from multiple tools requires overcoming some obvious obstacles. When merging search engine results, in addition to the different output file formats generated by each engine, the problem of PSM quality matching between engines needs to be addressed.
The quality of a PSM is expressed by a different score parameter in each search engine, which can make it difficult to match a set of PSMs corresponding to the same spectrum [26]. After overcoming the problem of different output file formats by converting the original search engine output to a common format, we used a decoy-based approach, adding decoy proteins at the database search step, to surmount the PSM quality matching issue.
The main outcome measures were the proteins detected per sample through the combination of search engine results, the proteins detected in SUI samples and not detected in controls, and the proteins detected in control samples and not in SUI samples. For the statistical analysis, only proteins present at least 6 times in the same group were used. Those proteins were analyzed and matched against DisGeNET (https://www.disgenet.org/), a platform hosting one of the largest publicly available collections of genes and variants associated with human diseases, and KEGG (Kyoto Encyclopedia of Genes and Genomes; https://www.genome.jp/tools/kaas/).
The unique contributions of the search engines were also evaluated. This task was accomplished by comparing the number of proteins accepted at the FDR threshold for each search engine per sample. An UpSet plot of control vs. case samples by search engine was also produced.
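One common way to make engine-specific scores comparable is to convert each engine's scores into decoy-based q-values and merge on that shared scale. The Python sketch below illustrates this idea only; it is not the authors' R implementation, and the engine names, spectrum identifiers, and the rule of keeping the best q-value per spectrum are assumptions made for the example.

```python
def q_values(psms):
    """Return {spectrum_id: q_value} for the target PSMs of one engine.

    `psms` is a list of (spectrum_id, score, is_decoy) tuples;
    higher scores are assumed to be better.
    """
    ranked = sorted(psms, key=lambda p: p[1], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at this rank or at any lower-scoring rank.
    result = {}
    running_min = float("inf")
    for (spectrum_id, _, is_decoy), fdr in zip(reversed(ranked), reversed(fdrs)):
        running_min = min(running_min, fdr)
        if not is_decoy:
            result[spectrum_id] = running_min
    return result


def combine_engines(per_engine_qvalues, cutoff=0.01):
    """Keep, per spectrum, the engine with the lowest q-value (assumed rule)."""
    best = {}
    for engine, qvals in per_engine_qvalues.items():
        for spectrum_id, q in qvals.items():
            if spectrum_id not in best or q < best[spectrum_id][1]:
                best[spectrum_id] = (engine, q)
    return {s: eq for s, eq in best.items() if eq[1] <= cutoff}


# Hypothetical PSMs from two engines with incompatible score scales.
engine_a = [("s1", 50, False), ("s2", 40, False), ("d1", 30, True), ("s3", 20, False)]
engine_b = [("s3", 9.0, False), ("s1", 8.0, False), ("d2", 1.0, True)]
q_a = q_values(engine_a)
q_b = q_values(engine_b)
combined = combine_engines({"engine_a": q_a, "engine_b": q_b})
```

Taking the best q-value across engines is optimistic, so the error rate of the merged list should be re-estimated after combination rather than inherited from the individual engines.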
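The group-level filter described above can be sketched as follows. This Python example is illustrative only (the study's scripts were written in R): it assumes each sample is represented as a set of accepted protein accessions, and it interprets "not detected" in the other group as absent from every sample of that group. The accessions in the toy data are placeholders.

```python
from collections import Counter


def group_exclusive_proteins(sui_samples, control_samples, min_count=6):
    """Return (sui_only, control_only) protein sets.

    A protein is kept when it occurs in at least `min_count` samples of one
    group and in no sample of the other group (assumed interpretation).
    """
    sui_counts = Counter(p for sample in sui_samples for p in set(sample))
    ctrl_counts = Counter(p for sample in control_samples for p in set(sample))
    sui_only = {p for p, n in sui_counts.items()
                if n >= min_count and p not in ctrl_counts}
    ctrl_only = {p for p, n in ctrl_counts.items()
                 if n >= min_count and p not in sui_counts}
    return sui_only, ctrl_only


# Toy data with placeholder accessions; min_count lowered to fit three samples.
toy_sui = [{"A", "B"}, {"A", "C"}, {"A", "B"}]
toy_controls = [{"B", "D"}, {"D"}, {"D", "E"}]
toy_sui_only, toy_ctrl_only = group_exclusive_proteins(
    toy_sui, toy_controls, min_count=2
)
```

With 12 samples per group, the default `min_count=6` corresponds to a protein being present in at least half of the samples of its group.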
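The counts behind an UpSet-style comparison can be computed by recording, for each protein, the exact set of engines that reported it. The following Python sketch shows that bookkeeping under the assumption that each engine's accepted proteins are available as a set; the protein identifiers are placeholders, and plotting is left out.

```python
from collections import Counter


def upset_counts(engine_hits):
    """Count proteins for each exact combination of engines reporting them.

    `engine_hits` maps an engine name to the set of proteins it accepted.
    Keys of the returned Counter are frozensets of engine names.
    """
    membership = {}
    for engine, proteins in engine_hits.items():
        for protein in proteins:
            membership.setdefault(protein, set()).add(engine)
    return Counter(frozenset(engines) for engines in membership.values())


# Placeholder identifiers for illustration only.
toy_hits = {
    "MS-GF+": {"P1", "P2", "P3"},
    "COMET": {"P2", "P3", "P4"},
    "X!Tandem": {"P3"},
}
toy_counts = upset_counts(toy_hits)
```

Singleton keys give each engine's unique contribution, while multi-engine keys give the exclusive intersections shown as bars in an UpSet plot.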

Results and Discussion
When comparing serum samples of SUI patients and controls, we identified 13 induced proteins (abundantly found only in SUI samples) and 26 depleted proteins (abundant only in control samples), as illustrated in Tables 1 and 2. The four most abundant proteins identified in SUI patients were Q9NQB0, Q93074, A8MU46, and P28566. All four proteins were found to be involved in prenatal exposure delayed effects and body weight change [27][28][29]. Smoothelin-like protein 1 plays a role in smooth muscle fibers, mediating vascular adaptation to exercise and regulating contraction and relaxation of skeletal muscle. It is furthermore implicated in fetal growth retardation and memory disorders [30]. Transcription factor 7-like 2 is implicated in blood glucose homeostasis. Genetic variants of its gene have been found to be associated with an increased risk of type 2 diabetes [31]. A mediator protein, MED12, was also found in SUI samples. This protein is crucial for activating the CDK8 kinase, which plays a role in modulating mediator-polymerase II interactions to regulate transcription initiation and reinitiation rates [32]. Phenotypes linked to MED12 are Lujan-Fryns syndrome, Ohdo syndrome, and X-linked Opitz-Kaveggia syndrome, also known as FG syndrome.
A G-protein-coupled receptor was also found in significantly higher abundance in SUI patients. 5HT1E (G-protein-coupled receptor for serotonin) is also known to act as a receptor for different psychoactive substances and alkaloids [33]. This protein is shown to be involved in mental disorders and substance-related disorders [34].
Five proteins were found to be fairly abundant in control samples. They were shown to be involved in two main disease classes: neurodegenerative diseases, and female urogenital diseases and pregnancy complications.
A member of the fibroblast growth factor family, P31371, is involved in a variety of biological processes, such as cell growth, tissue repair, and embryonic development. It is also thought to have a role in brain tissue regeneration [35]. This protein was associated with prenatal injuries and fetal death [36,37].
An enzyme was singled out in control samples; it is a member of the protein family ubiquinone oxidoreductase subunit NDUFB4. O95168 was shown to be implicated in the mitochondrial membrane respiratory chain complex I assembly, mitochondrial electron transport, and response to oxidative stress [38,39]. Links with nerve degeneration and nervous system diseases were determined for this protein [40,41].
A further enzyme, belonging to the peptidase C19 family, was found abundantly in controls. It is known to catalyze various reactions, such as the hydrolysis of peptide and isopeptide bonds formed by the C-terminal Gly of ubiquitin [42]. It was also revealed that Q70CQ1 deubiquitinates histone H2B at "Lys-120" and acts as a regulator of pre-mRNA splicing [43]. Implications in urogenital abnormalities, female infertility, and female genital diseases were demonstrated for USP49 [44][45][46].
THAP3 is a component of the THAP1/THAP3-HCFC1-OGT complex. The protein is implicated in the regulation of the transcriptional activity of the cell cycle-specific gene PRM1 [47]. THAP3 also participates in molecular functions such as DNA binding and metal ion binding [48]. Besides its implication in urogenital diseases and pregnancy complications, THAP3 has been shown to be a biomarker for polycystic ovary syndrome [49].
The last protein found in high abundance in controls is myeloid-associated differentiation marker-like protein 2. MYADML2 is an integral component of the membrane and is predicted to localize to the cytoplasm. It is found to be involved in congenital, hereditary, and neonatal diseases and abnormalities and uterine diseases [50,51].

BioMed Research International
The unique contribution of each search engine was also assessed. We evaluated the number of proteins accepted at the FDR threshold for each of the four search engines. Figure 2 shows that MS-GF+ performed best, identifying the largest number of accepted proteins, followed by COMET, which also performed well. The weak performance of X!Tandem could be attributed to its scoring function. Figure 3 compares the number of proteins identified by each engine per group, as well as the intersections of the search engine results across different combinations. The total number of proteins detected in controls is slightly higher than that detected in SUI samples: MS-GF+ reached 4000 proteins in controls against 3000 proteins in SUI samples. This difference is due to the biological variance of the samples. MS-GF+ outperformed MyriMatch, COMET, and X!Tandem for both groups.
One can also notice that the combination of MS-GF+ and COMET performs better than the others. Such a combination is expected to produce more correct results than other combinations. It is logical to assume that the dissimilarity of the scoring functions used by the two engines leads to a better separation between correct and incorrect identifications. MS-GF+ uses a robust probabilistic model, while COMET uses a descriptive model. It is commonly assumed that algorithms based on the descriptive approach show better sensitivity, whereas algorithms based on the probabilistic approach show better specificity. This is the probable reason that the MS-GF+ and COMET combination outperformed the other combinations. Finally, we compared the unique contributions of the four search engines (MS-GF+, COMET, MyriMatch, and X!Tandem) with the combined approach, which merges the results of all four search engines. As illustrated in Figure 4, the combined approach shows accurate selectivity within the usable range of FDR (from 0 to almost 1%). Beyond this range, the slow increase of the other curves is explained by the fact that all the hits have already been reported. The combined approach continues to increase simply because it sums the identified hits from each engine.
The combined approach merges the results of four search engines, which implies four different scoring functions. Accordingly, the approach benefits from the selectivity and sensitivity of each search engine.

Conclusion
In this work, we have utilized an existing study of SUI to assess and validate our developed workflow for protein identification using a combined approach of database searching together with HPC.
Thirteen proteins were found exclusively in SUI samples and are known to be associated with prenatal exposure delayed effects and body weight change. The twenty-six proteins found exclusively in control samples belonged to two main classes of female urogenital diseases related to risk factors influencing SUI. Our analysis is agnostic to whether these proteins are causal in the development process of SUI or simply the result of SUI. Further studies of the identified proteins are required to answer this question.
We have designed and implemented a workflow for protein identification by combining multiple search engine results. We have opted for high-performance computing and cloud computing for managing the multiple searches and reducing the computational time expense.
By examining different search engine combinations, we showed that MS-GF+ and COMET led to a dramatic improvement in protein identification accuracy. Thus, the selection of search engines should also consider the complementarity of their scoring models. Beyond the usable range of FDR (higher FDR values), almost all the hits of each single engine have already been accepted, which explains the slow increase. The combined approach, by contrast, continues to increase simply because it sums the reported hits from all engines, and its number of accepted proteins is found to be highly correlated with the FDR.

By processing the same dataset with our developed approach, we were able to identify a more accurate set of proteins shown to be involved in diseases associated with risk factors affecting the pathology. In fact, the KEGG and enrichment analyses showed that the top four most significantly induced proteins were involved mainly in prenatal exposure delayed effects and body weight change, and that the five most significantly depleted proteins were implicated mostly in two main disease classes: neurodegenerative diseases, and female urogenital diseases and pregnancy complications. Given that the aforementioned risk factors influencing SUI include neurological illnesses, obesity, Parkinson's disease, parity, recurrent urinary tract infections, and pregnancy, our results indicate that the developed approach achieved a more accurate identification. In general, combining search engines makes it possible to benefit from the strengths of each engine and thus to complement the peptide/protein identification.
We have also demonstrated that the combined approach improves the specificity and sensitivity of the analysis by increasing the confidence of identified proteins.

Data Availability
The datasets generated and analyzed during the current study as well as the whole developed workflow are fully reproducible and available in the GitHub repository, https://github.com/taoufik-elpho/Serum-Analysis-Elpho.

Conflicts of Interest
The authors have declared no conflict of interest.