ProbSeq—A Fragmentation Model for Interpretation of Electrospray Tandem Mass Spectrometry Data

We describe a probabilistic peptide fragmentation model for use in protein databank searching and de novo sequencing of electrospray tandem mass spectrometry data. A probabilistic framework for tuning of the model using a range of well-characterized samples are introduced. We present preliminary results of our tuning efforts.


Introduction
Sequencing of peptides from their collisioninduced fragments is a well-established technique (Papayannopoulos, 1995). A biological sample, which may contain an unknown protein or a mixture of proteins, is subjected to an enzymatic digest (usually tryptic) resulting in a mixture of peptides. This mixture is analysed using liquid chromatography and electrospray quadrupole timeof-flight (Q-TOF) mass spectrometry.
A Q-TOF (Morris et al., 1996) mass spectrometer consists of a quadrupole mass analyser, a collision cell filled with an inert gas, and a time-of-flight analyser. The quadrupole allows the mass:charge range of peptides which enter the collision cell to be selected accurately. The Q-TOF can operate in two modes. In wide-bandpass, low-energy mode, all peptides are allowed to pass into the collision cell and the voltage applied to the collision cell is low, allowing the peptides to pass through intact into the TOF stage. In narrow-bandpass, high-energy mode, the quadrupole is tuned to pass only a small mass:charge range around a single peptide. In this mode, the collision cell voltage is relatively high, causing the peptide to fragment as it is pulled through the gas. The final, TOF stage of the instrument allows the peptide ion or fragment ion masses to be determined to an accuracy of around five parts per million on a well-calibrated instrument. The decision to switch mode is made in real time using a number of possible strategies, and the collision energy can be adjusted to make useful fragmentation of each peptide more likely.
Peptides fragment in complex ways, and their fragments in turn have complex representations in mass spectra. Manual interpretation of these spectra is a skilled and laborious job, and a modern experiment can produce many hundreds. Around 1998 (see Skilling, 2000aSkilling, , 2000b, we began to use the ProbSeq fragmentation model in combination with Markov chain methods to interpret this data. Commonly, we are required to answer two questions: 'How much can we infer about the sequence of the peptide which gave rise to a single spectrum?' and 'Given many spectra, what can we infer about the mixture of proteins that may be present?'. To address these questions in a robust, probabilistic way, we appeal to Bayes' theorem: (1) The 'Likelihood' Pr(Spectrum | Sequence, I) on the RHS of the above equation encodes our incomplete knowledge about peptide fragmentation. It is the probability that the peptide under consideration could give rise to the observed spectrum. The 'Prior' Pr(Sequence | I) is a probability that we assign to each candidate sequence before we look at the data. At this level, the 'Evidence' is a normalization factor, but it can be used to objectively compare different likelihood functions and other background assumptions. I contains information about the context in which this calculation is being used, and it happens that it is different for each of the above questions. Our likelihood function is called 'ProbSeq' and it is, in itself, a Bayesian calculation which involves a good deal of prior information about peptide fragmentation. The rest of this paper introduces a new framework which we can use to improve the model. We regard it to be a strength of the Bayesian approach that we can treat this as an inference problem at every level (Skilling, 1998).

Materials and methods
We decided to base the tuning of the likelihood function on human assignments of sequences to spectra. The samples chosen consisted of tryptic digests of pyruvate kinase, glyceraldehyde 3phosphate dehydrogenase, alcohol dehydrogenase and β-lactoglobulin. The solvent was 0.1% formic acid in acetonitrile:water, 1:1. The samples were infused directly into a nanolockspray source. The instrument used was a Waters Q-TOF ULTIMA  operating in positive ion mode. The collision energy was 20-45 eV.
The acquired spectra were analysed manually to provide a list of verified sequences. The data was processed using the MaxEnt3  deconvolution algorithm to remove isotope series and to resolve overlapping isotope clusters and multiple charge states. The data was split into two sets. The first, 'tuning', set was used to tune the fragmentation model, while the second, 'validation', set was reserved for assessment of the resulting parameters.
The chemical structure and fragmentation modes for a typical peptide are shown in Figure 1. The types of fragments that are observed depends on the collision energy used. The calculation of the likelihood is based on a probabilistic summation over all of the possible ways that a peptide could fragment and give rise to trial masses: where is a database containing probabilistic information about peptide fragmentation and 'Frag' is a particular fragmentation pattern. Experience tells us that some patterns of fragmentation are more likely than others, e.g. y ion series are correlated: if y n is present, there are better than even odds that y n+1 is also present. This is the information that is encoded in the last term in equation (2). For a peptide of length n amino acids, the sum in equation (2) contains 2 n terms for y ions alone and 2 6n terms if a,b,c,x ,y,z -ions are all considered with extra factors for various losses. It would be impractical to do this summation explicitly for each trial sequence. Remarkably, using a Markov chain allows the summation to be performed in O(n) steps. Considering only y ions, the chain is structured as follows: Pr(y 1 on) = p 1 and Pr(y r on) = p r Pr(y r−1 on) + q r Pr(y r−1 off). (3) As mentioned above, the probability p that the y series stays on is expected to be greater than 0.5, and the probability q that it switches on is likely to  Figure 1. The chemical structure of a peptide. The boldface letters are one-letter amino acid codes. Also indicated are the three backbone fragmentation modes. At the collision energies used in this study, the b, y mode dominates. a and z ions are also commonly observed. Losses can also occur from terminal and side-chain groups be less than 0.5. It is also to be expected that the values of p and q at a particular point in the chain should depend on the amino acid sequence of the peptide under consideration. At each step along the chain, the probability that a y ion will be observed is influenced by whether or not its predecessor was observed and we do as many summations as are locally available. When we reach the end of the chain, we have accumulated Pr(Spectrum | Sequence, , I), with a computational cost of only O(n).
As well as the Markov chain parameters which describe the expected appearance of series of b and y ions, each amino acid may undergo a number of losses, may exhibit a propensity for cleavage to occur on the C-or N-terminal side and may appear isolated from the rest of the sequence as an immonium ion. The maximum number of probabilities required to encode this description is seven per amino acid. This leads to our fragmentation model, , having over 100 tuneable parameters. Some examples are given in Table 1. Reasonable values for some of these can be arrived at by consulting the literature (Papayannopoulos, 1995), and others can be estimated by manually inspecting many fragmentation spectra.
We decided to treat the tuning of the ProbSeq parameters as another exercise in Bayesian inference. In order to do this we needed to define prior distributions for our parameters and a likelihood function for comparison against the data. Having a Table 1. Some examples of tuneable likelihood parameters. The first column contains probabilities that apply globally, while the second column contains probabilities that may be different for each amino acid

Global Amino Acid
Pr(y n+1 on | y n on) Pr(break left) Pr(y n+1 off | y n on) Pr(break right) list of sequences validated against known spectra, the tuning likelihood becomes: We take the prior distribution for each of the probabilities in the ProbSeq model, , to be uniform on (0, 1). We are left with the problem of how to explore the >100-dimensional parameter space. We validate the result by checking whether a chosen set of parameters, , does or does not enable a de novo sequence determination to recover the correct, known, sequence. This has the advantage of tuning our parameters directly upon our own implementation software.
De novo sequencing involves the exploration of a large space of peptide sequences that are consistent with the intact mass of an unknown peptide. At least some of the candidate sequences must be compared with the associated fragmentation spectrum. The number of possible trial sequences grows exponentially with precursor mass with: where n is the number of trial sequences and M is the nominal mass of the precursor. The implementation used in the Waters proteomics product, ProteinLynx Global Server  , does not use an exhaustive search but simulates this by sampling from the space of possible peptide sequences through a terminated Markov Chain Monte Carlo (MCMC) algorithm. Initially, an ensemble of trial solutions is constructed by sampling from a prior distribution. The prior probability of a trial peptide is based on the natural abundances of its constituent amino acids and the preference for C-terminal residues, if appropriate, for a particular digestion reagent. In order to proceed, new trial sequences must be generated. We employ transition engines that change the state of an ensemble member, with a probability of transition, T , defined by: where i and j represent two possible solutions in the space. Transitions of this type will eventually converge on occupancies of states which match the prior distribution. The various transition schemes employed by the de novo algorithm are outlined in Table 2. The data must be introduced via the likelihood function (ProbSeq in this case), as we wish to progress from the prior to the posterior. The transition probabilities are therefore combined with acceptance probabilities, A, where: so that:  Hastings, 1970). In order to improve convergence, the likelihood is introduced in a modified form, by raising it to some power 0 ≤ λ ≤ 1. Here, λ is analogous to an inverse temperature and by raising it slowly from 0 to 1 (i.e. by slow cooling), we implement 'simulated annealing' (Kirkpatrick et al., 1983). We use simulated annealing for both de novo exploration and exploring the ProbSeq parameter space, (see Table 3 for an outline of the commonalities).
The result of a de novo exploration is a number of candidate peptide sequences that may account for the fragmentation spectrum and precursor mass, accompanied by a posterior probability. In order to assess whether the results of tuning the ProbSeq likelihood parameters afford greater discrimination in favour of the correct sequences, we can perform de novo searches and inspect the ranks and posterior probabilities of the correct sequences. Indeed, we could have replaced equation (4) in favour of a likelihood for the tuning parameters which incorporated de novo searches in order to involve this discrimination directly in the tuning process. However, the exploration of the parameter space would have become prohibitively time-consuming.

Results and Discussion
Although cysteine and tryptophan are under-represented in the tuning dataset, the probabilities for immonium production allow us to make general comparisons with observations described in the literature. Papayannopoulos (1995) indicates that arginine, lysine, leucine/isoleucine, cysteine, histidine, phenylalanine, tyrosine and tryptophan may give diagnostically important immonium ions (see Table 4). Our results indicate that arginine, leucine/ isoleucine, cysteine, histidine, phenylalanine, tyrosine and tryptophan are likely to give stronger signals, although the strengths reported may be sensitive to the configuration of the instrument. The results of performing de novo searches on the tuning and validation datasets are summarized in Tables 5 and 6. There is a noticeable improvement in the results of de novo sequencing in the tuning dataset: the correct sequence is ranked first on eight occasions with the tuned prior as opposed to six with the original prior out of a total of 39 sequences. With the tuning dataset, 17 correct sequences appeared in the top 10 with the tuned prior, against 10 for the original prior.
The improvement is less clear in the validation dataset but is still present, with 11 correct for the tuned prior and nine for the original prior out of 31 sequences; 17 correct sequences appeared in the top 10 for both the tuned and original priors.
The improvements gained by tuning the prior probabilities of the ProbSeq model are slight. This is to be expected as, when accurately massmeasured data are available, the information in the data dominates the prior. The quality of our results is not critically dependent on specific parameter settings, which is reassuring. We may have gained some improvement for marginal data but this remains to be seen, perhaps through tuning and validation with a more extensive library of data.
Overall, tuning has resulted in the softening of some probabilities, e.g. those for immonium Table 4. Listed for each amino acid are occurrences in the validated sequences throughout the tuning dataset, occurrences in different spectra, mean probability of immonium ion production and its standard deviation as sampled from the posterior distribution during the MCMC exploration  production, and confirmation for others in the original set. However, the results are not markedly different and we conclude that a single set of reasonable parameters will suffice to give good results over a range of instruments and the details of their configurations. We have put in place a flexible, automated scheme for tuning the prior used in the calculation of the likelihood Pr(Spectrum | Sequence), used in both de novo sequencing and databank searching applications, to which new data can easily be added and the fragmentation characteristics observed from new instruments and experiments accommodated.