Signal-BNF: A Bayesian Network Fusing Approach to Predict Signal Peptides

A signal peptide is a short peptide chain that directs the transport of a protein and has become a crucial vehicle for finding new drugs or reprogramming cells for gene therapy. With the avalanche of new protein sequences generated in the postgenomic era, the challenge of identifying new signal sequences has become even more urgent and critical in biomedical engineering. In this paper, we propose a novel predictor called Signal-BNF to predict the N-terminal signal peptide as well as its cleavage site based on a Bayesian reasoning network. Signal-BNF is formed by fusing, through a weighted voting system, the results of different Bayesian classifiers that use different feature datasets as input. Experimental results show that Signal-BNF is superior to popular online predictors such as Signal-3L and PrediSi. Signal-BNF features high prediction accuracy and may serve as a useful tool for further investigating many unclear details of the molecular mechanism of the zip code protein-sorting system in cells.


Introduction
Signal peptides, which are usually N-terminal extensions 3-60 amino acids long, direct proteins to their corresponding cellular and extracellular localizations. Their function can be regarded as that of an "address tag" or "zip code." If the signal sequence in a nascent protein is changed, the protein can end up in the wrong cellular location, which may cause various diseases.
The advent of signal peptide predictors has had a significant impact on developing novel strategies for drug discovery, as well as on revealing the molecular mechanisms of some genetic diseases (see the review [1]). Faced with the avalanche of new protein sequences emerging in the postgenomic era, and in order to use them in a timely manner for basic research and drug discovery [2,3], it is highly desirable to develop fast and accurate algorithms to identify signal sequences and predict their cleavage sites. Many efforts have been made in this regard [4][5][6][7][8][9][10][11][12][13][14][15][16][17]. Based on different kinds of characteristics, several machine learning approaches have been proposed for this task, such as neural networks [8][9][10][11], hidden Markov models [12], and support vector machines [13][14][15]. Recently, Shen and Chou developed two algorithms based on evidence theory to predict signal sequences and achieved favorable results [4,5].
In this paper, we propose a novel predictor based on a Bayesian learning algorithm to predict N-terminal signal peptides and their cleavage sites. Bayesian learning algorithms have previously been applied to a number of other bioinformatics problems [18][19][20][21][22][23], such as protein-protein interactions. However, those approaches were not designed to deal with N-terminal signal peptide sequences and the amino acid preferences at the cleavage sites [24]. In contrast, we apply the Bayesian network directly to this task. It is a method of statistical inference in which evidence or observations are used to calculate the probability that a hypothesis is true, which makes it particularly suited to this task. Its advantage lies in the fact that there are significant statistical preferences for different amino acids along signal peptides, as reported in previous studies [4,5].
An integrated system built from multiple base classifiers has a stronger generalization ability than a single good classifier. We therefore use an integrated system to improve the prediction accuracy. First, base classifiers are built by using different feature datasets as the Bayesian network input. Then, the final result is obtained by fusing the results of the different Bayesian classifiers through a weighted voting system. The experimental results show that Signal-BNF is superior to two other popular signal peptide predictors, Signal-3L [4] and PrediSi [25]. The proposed approach is therefore quite promising.
Table 1: Number of secretory and non-secretory proteins in each species dataset.
Species         Number of secretory proteins   Number of non-secretory proteins   Total
Human           894                            1129                               2203
Plant           338                            559                                897
Animal          1435                           1762                               3197
Eukaryotic      635                            785                                1420
Gram-positive   269                            356                                625
Gram-negative   613                            721                                1334

Materials and Methods
The datasets constructed in [4] were adopted in this paper. They contain the secretory proteins and the nonsecretory proteins from six different species: human, plant, animal, eukaryotic, Gram-positive, and Gram-negative (refer to Table 1).
Signal peptide sequences are usually N-terminal extensions, although they can also be located within a protein or at its C-terminal end. A signal peptide is cleaved off by a signal peptidase when the protein goes through a membrane. The cleavage site is the position between the last residue of the signal peptide sequence and the first residue of the mature protein. It is symbolized by (−1, +1) (Figure 1). The signal peptide sequences of different secretory proteins differ considerably in sequence composition and order, and they also differ in length. Figure 2 shows the length distribution of the signal peptides in the secretory proteins of the six species.
Since different proteins differ in the length of the signal peptide, we introduce the concept of a scaled window, which allows a general algorithm to predict signal peptides of varying length. The scaled window approach has been adopted previously for this task [6].
The scaled window, symbolized as $[-\xi_1, +\xi_2]$, is marked consecutively with $-\xi_1, \ldots, -2, -1, +1, +2, \ldots, +\xi_2$ to define the corresponding positions of the amino acids of a protein sequence within the window. In this way, a segment can be used as a "benchmark window" to search for the secretion-cleavable site along a protein sequence, from which its signal peptide can be deduced accordingly. Only the segment in which the residue at scale −1 is the very last residue of the signal sequence and the residue at scale +1 is the first residue of the mature sequence is regarded as the secretion-cleavable segment (refer to Figure 3(a)), while all the other segments are regarded as nonsecretion cleavable (refer to Figures 3(b) and 3(c)). A peptide segment within the window can be expressed as
$$R_{-\xi_1} \cdots R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+\xi_2},$$
where $R_{-\xi_1}$ represents the amino acid residue at position $-\xi_1$, $R_{-1}$ represents the amino acid residue at position −1, $R_{+\xi_2}$ represents the amino acid residue at position $+\xi_2$, and so forth.
The whole prediction task is composed of two steps: (1) identifying whether a protein is secretory or not and (2) determining the signal peptide cleavage site for a secretory protein. In this study, we choose $\xi_1 = 13$ and $\xi_2 = 2$ as the size of the scaled window for predicting the cleavage site, which was shown to be optimal in previous studies [6]. By sliding such a window along each of the protein sequences, we obtained 6 corresponding training datasets for the 6 species.
It is important to point out that, for a secretory protein sequence of length L1, we obtain $L1 - (\xi_1 + \xi_2) + 1$ different sequence segments. Among these segments only one is secretion cleavable; the others are nonsecretion cleavable. A nonsecretory protein sequence of length L2 yields $L2 - (\xi_1 + \xi_2) + 1$ different sequence segments, all of which are nonsecretion cleavable. The single secretion-cleavable segment is called the positive sample, and the nonsecretion cleavable segments are called negative samples.
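The sliding of the scaled window described above can be sketched as follows. This is a minimal illustration, not the implementation used in this study: the function name `scaled_window_segments`, the 0-based `cleavage_index` argument, and the example sequence are assumptions introduced only for the example.

```python
def scaled_window_segments(sequence, cleavage_index=None, xi1=13, xi2=2):
    """Slide a window of width xi1 + xi2 along `sequence`.

    `cleavage_index` is the 0-based index of the first residue of the mature
    protein (scale +1), or None for a nonsecretory protein (assumed encoding).
    Returns (segment, label) pairs; label 1 marks the single
    secretion-cleavable segment, label 0 every other segment.
    """
    width = xi1 + xi2
    samples = []
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        # Scale +1 sits xi1 residues after the start of the window.
        cleavable = cleavage_index is not None and start + xi1 == cleavage_index
        samples.append((segment, int(cleavable)))
    return samples


# Made-up 20-residue sequence whose mature part starts at index 15.
secretory = scaled_window_segments("MKTLLLAVAVVAFACSQDEW", cleavage_index=15)
nonsecretory = scaled_window_segments("MKTLLLAVAVVAFACSQDEW")
```

For a sequence of length L, the loop produces L − (ξ1 + ξ2) + 1 segments, of which at most one carries the positive label.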
All the secretion-cleavable segments, namely, the positive samples, are denoted by $S^+$, and all the nonsecretion cleavable segments, namely, the negative samples, are denoted by $S^-$. Obviously, the scaled window approach makes the samples extremely imbalanced. Hence, we randomly sample the negative subset, which reduces the imbalance. The sampling proportions of $S^-$ are given in Table 2.
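Random sampling of the negative subset might look like the sketch below. The parameter `keep_fraction` stands in for the sampling proportion listed in Table 2, and the value 0.1 used in the example is purely hypothetical.

```python
import random


def sample_negatives(negatives, keep_fraction, seed=0):
    """Keep a random fraction of the negative samples (without replacement)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    keep = round(keep_fraction * len(negatives))
    return rng.sample(negatives, keep)


# 100 placeholder negative segments, of which 10% are kept (assumed proportion).
all_negatives = [f"neg{i}" for i in range(100)]
kept_negatives = sample_negatives(all_negatives, keep_fraction=0.1)
```

The positive samples are left untouched; only the size of the negative subset changes.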
Figure 1: A schematic drawing showing the signal sequence of a protein and how it is cleaved by the signal peptidase. An amino acid in the signal part is depicted as a white circle with a black number indicating its sequential position, while an amino acid in the mature protein is depicted as a black circle with a white number. The signal sequence contains Ls residues and the mature protein Lm residues. The cleavage site is at the position (−1, +1), that is, between the last residue of the signal peptide sequence and the first residue of the mature protein.
Figure 3: Illustration of the sequence segments highlighted by sliding the scaled window $[-\xi_1, +\xi_2]$ along a protein sequence. During the sliding process, the scales on the window are aligned with different amino acids so as to define different peptide segments. When scale −1 is aligned with the tail residue of the signal sequence and scale +1 with the head residue of the mature protein, as shown in (a), the peptide segment seen within the window is regarded as secretion cleavable. Peptide segments seen within the window in all other cases, such as those shown in (b) and (c), are regarded as nonsecretion cleavable.
As is known, most data classification techniques require numeric discrete feature vectors as input. This means that each amino acid symbol should be replaced by a decimal integer, for example one derived from its local physicochemical properties. Because the different classifiers need different feature datasets as input, we generate the datasets through different coding schemes. In this paper, three different coding schemes (subsystems [26]) are adopted.
The first subsystem considers that each position in the scaled window has 21 possible values (20 amino acids and a null input). Hence, each amino acid is represented by an integer indicator ranging from 1 to 21 (refer to Table 3), which is taken as the input of Signal-BNF. The second subsystem associates each amino acid with 10-bit binary (i.e., value 0 or 1) indicators to represent its multiview properties. Each row in Table 4 shows that an amino acid can have multiple properties, where "y" means that the amino acid has the property. If there is a "y," the value is 1; otherwise it is 0. The binary number is then converted to a decimal integer to represent each amino acid.
The last subsystem represents the relative hydrophobic value of amino acids with 3-bit binary indicators. In Table 5, each amino acid has been encoded as a decimal integer. In this way, we obtained three different feature datasets according to the above subsystems.
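The coding schemes can be illustrated with a short sketch. The residue ordering in `AMINO_ACIDS` and the example property flags are assumptions made for illustration; the actual assignments are those of Tables 3, 4, and 5, which may differ.

```python
# Assumed ordering (alphabetical one-letter codes); Table 3 may use another one.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
NULL_CODE = 21  # the "null input" of the first subsystem


def encode_index(segment):
    """First subsystem: one integer in 1..21 per window position."""
    return [AMINO_ACIDS.index(r) + 1 if r in AMINO_ACIDS else NULL_CODE
            for r in segment]


def flags_to_int(flags):
    """Second and third subsystems: read binary indicators as one integer.

    `flags` lists the 0/1 indicators from the most significant bit down.
    """
    value = 0
    for flag in flags:
        value = value * 2 + (1 if flag else 0)
    return value


codes = encode_index("AC-Y")              # "-" stands for a null position
ten_bit = flags_to_int([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])  # hypothetical flags
three_bit = flags_to_int([1, 0, 1])       # hypothetical hydrophobicity bits
```

A 10-bit indicator therefore maps to an integer between 0 and 1023, and a 3-bit indicator to an integer between 0 and 7.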

Bayesian Networks.
The term "Bayesian networks" was coined by Judea Pearl [18] in 1985; the theory, algorithms, and applications can be found in [19][20][21][22][23][24]. A Bayesian network [27,28], which is a kind of learning machine, encodes the joint probability distribution of a set of variables $\{x_1, \ldots, x_v\}$ as a directed acyclic graph and a set of conditional probability tables (CPTs). The probability of an arbitrary event $X = (x_1, \ldots, x_v)$ can be computed as
$$P(X) = \prod_{i=1}^{v} P(x_i \mid \mathrm{Pa}(x_i)),$$
where $\mathrm{Pa}(x_i)$ denotes the parents of $x_i$ in the graph. Given training samples $X_d = (x_{d,1}, \ldots, x_{d,v})$, the goal of learning is to find the Bayesian network that best represents the joint distribution $P(x_{d,1}, \ldots, x_{d,v})$. In other words, when the Bayesian network is unknown, we need to learn it by estimating the network structure and the parameters of the joint probability distribution from the training data and prior information. We assume no missing data and focus on the problem of learning the network structure.
At present, there are mainly two kinds of Bayesian network learning methods [27]: conditional-independence-test-based methods and search-based methods. The conditional independence test is very sensitive to errors, and in some cases the number of conditional independence tests grows exponentially with the number of variables. A search-based algorithm can find an accurate and complete network structure, but the structure space is very large. Searching for the best Bayesian network structure in the space of all possible network structures is an NP-hard problem, so heuristic algorithms are commonly used. The most widely used and most representative heuristic algorithm is the K2 algorithm, a well-known score-based algorithm.
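To make the role of the directed acyclic graph and the CPTs concrete, the sketch below computes the joint probability of an event as the product of each variable's conditional probability given its parents. The two-variable network and its probabilities are invented for illustration only.

```python
def joint_probability(assignment, parents, cpts):
    """Multiply P(x_i | parents of x_i) over all variables in the network."""
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        prob *= cpts[var][parent_values][value]
    return prob


# Hypothetical network a -> b with binary variables.
parents = {"a": (), "b": ("a",)}
cpts = {
    "a": {(): {0: 0.7, 1: 0.3}},
    "b": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}},
}
p_11 = joint_probability({"a": 1, "b": 1}, parents, cpts)  # 0.3 * 0.8
```

Summing the joint probability over all four assignments of (a, b) gives 1, as expected of a probability distribution.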
Learning model structures from data is important for the construction of Signal-BNF. We have empirically compared the behavior of several Bayesian network classifiers based on Bayes Net in Weka [29] and on the Bayes Net Toolbox (BNT) in Matlab [30] over the six datasets. In this paper, we use the K2 structure learning algorithm, which performs relatively better than the others. It maximizes a scoring measure based on the marginal likelihood. K2 is a greedy search algorithm that uses a known ordering of the nodes and a maximum limit on the number of parents of any node to constrain the search over network structures.
After the network structure has been learned, parameter learning is another important step, and we use the Bayesian estimation method to determine the related parameters. In this way, we obtain a Bayesian network that can be used to make inferences.
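The greedy search performed by K2 can be sketched as follows. This is a simplified illustration of the general algorithm under the assumptions of discrete, fully observed data; it is not the Weka or BNT implementation used in this study, and the names `k2_log_score` and `k2_search` as well as the toy data are hypothetical.

```python
from collections import Counter
from math import lgamma


def k2_log_score(data, child, parents, arity):
    """Log of the K2 (marginal likelihood) score of `child` given `parents`."""
    r = arity[child]
    cell = Counter()    # N_ijk: count of (parent configuration j, child value k)
    config = Counter()  # N_ij: count of parent configuration j
    for row in data:
        key = tuple(row[p] for p in parents)
        cell[(key, row[child])] += 1
        config[key] += 1
    score = 0.0
    for key, n_ij in config.items():
        # log[(r - 1)! / (N_ij + r - 1)!]
        score += lgamma(r) - lgamma(n_ij + r)
        # sum over k of log(N_ijk!)
        score += sum(lgamma(cell[(key, k)] + 1) for k in range(r))
    return score


def k2_search(data, order, arity, max_parents):
    """Greedily add the best-scoring predecessor as a parent of each node."""
    parents = {node: [] for node in order}
    for position, child in enumerate(order):
        best = k2_log_score(data, child, parents[child], arity)
        while len(parents[child]) < max_parents:
            candidates = [c for c in order[:position] if c not in parents[child]]
            if not candidates:
                break
            scored = [(k2_log_score(data, child, parents[child] + [c], arity), c)
                      for c in candidates]
            score, choice = max(scored)
            if score <= best:  # no candidate improves the score: stop
                break
            best = score
            parents[child].append(choice)
    return parents


# Toy data: column 1 copies column 0, column 2 is unrelated to both.
toy = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 5
structure = k2_search(toy, order=[0, 1, 2], arity=[2, 2, 2], max_parents=2)
```

On this toy data the search makes column 0 the parent of column 1 and leaves column 2 without parents, which shows how the node ordering and the parent limit constrain the search.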
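One common form of Bayesian parameter estimation is the posterior mean under a symmetric Dirichlet prior, sketched below. The prior strength `alpha = 1` and the function name `estimate_cpt` are assumptions for illustration; the priors actually used are not specified here.

```python
from collections import Counter


def estimate_cpt(data, child, parents, arity, alpha=1.0):
    """Posterior-mean CPT of `child` given `parents` with a Dirichlet(alpha) prior.

    Only parent configurations observed in `data` are returned.
    """
    joint = Counter()     # counts of (parent configuration, child value)
    marginal = Counter()  # counts of parent configuration
    for row in data:
        key = tuple(row[p] for p in parents)
        joint[(key, row[child])] += 1
        marginal[key] += 1
    r = arity[child]
    return {
        key: [(joint[(key, k)] + alpha) / (n + r * alpha) for k in range(r)]
        for key, n in marginal.items()
    }


# Four toy samples over two binary variables; variable 0 is the parent of variable 1.
samples = [(0, 0), (0, 0), (0, 1), (1, 1)]
cpt = estimate_cpt(samples, child=1, parents=[0], arity=[2, 2])
```

The pseudo-count keeps every probability strictly positive, so a residue never seen at a given position in training does not force a zero probability at prediction time.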

Classify the Secretory-Cleavable Peptides from Non-Secretory Cleavable Peptides by Base Classifier.
Suppose a training set S of N samples $(S_1, S_2, \ldots, S_N)$ that can be separated into two subsets: $S^+$, which consists of the secretion-cleavable peptides only, and $S^-$, which consists of the nonsecretion cleavable peptides only. We used Signal-BNF to distinguish secretion-cleavable peptides from nonsecretion cleavable ones. Through the Signal-BNF classifier the CPTs can be obtained, from which we get $\rho(S_i, S^\theta)$, $\theta \in \{+, -\}$, the probability that the sample $S_i$ belongs to the class $S^\theta$. The criterion for predicting the secretion cleavability of a given peptide sequence can be formulated as follows:
$$\gamma(S_i) = \begin{cases} 1, & \text{if } \rho(S_i, S^+) > \rho(S_i, S^-), \\ 0, & \text{otherwise.} \end{cases}$$
The sample is a secretion-cleavable peptide if $\gamma(S_i) = 1$; otherwise it is nonsecretion cleavable. If the sample is identified as secretion cleavable, the prediction continues with the cleavage site.

Classify the Secretory-Cleavable Peptides from Non-Secretory Cleavable Peptides and Identify the Signal Peptide Cleavage Site of Secretory Proteins by Fusion Classifier System.
The cleavage site is the position between the last residue of the signal peptide sequence and the first residue of the mature protein. Once the cleavage site is identified, the signal peptide is determined automatically. We use a Bayesian classifier to predict the cleavage site, but the output of a single Bayesian classifier may contain false results. To compensate for this error as much as possible, we combine the base classifiers. Each of the three coding schemes above yields a different feature dataset, and each feature dataset together with a Bayesian network constitutes a base classifier. The base classifiers are then fused as Signal-BNF to predict the cleavage site. A composite approach for classifying proteins has been used in a previous study [31]. According to the literature [32], multiple classifier systems can be divided into three structures: cascade, parallel, and hierarchical (refer to Figure 4). In a cascade system, the result of a base classifier depends directly on the successful classification of the previous base classifier. The overall error of such a system is the accumulation of the errors of the individual base classifiers; in other words, an error produced by an earlier classifier is unrecoverable. In a parallel system, each base classifier produces its result independently, and the results are integrated by decision logic. As long as the decision logic is well designed, satisfactory results can be obtained. A hierarchical system is a combination of the cascade and parallel systems. We therefore use the integrated classification model shown in Figure 5 in the fusion stage.
Furthermore, we use voting as the decision-making method for integrating the outputs of the multiple classifiers. Generally, voting includes weighted voting and majority voting, the latter of which has three decision methods: unanimity, simple majority, and plurality. In this paper, we use weighted voting, which can obtain better accuracy, to decide which candidate wins.
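The fused decision is the weighted sum of the base classifier outputs:
$$\sigma(S_i, S^\theta) = \sum_{u=1}^{3} w_u \gamma_u(S_i).$$
Here, $u \in \{1, 2, 3\}$ indexes the different base classifiers, $\gamma_u(S_i)$ is the output of the $u$-th base classifier, and $w_u$ is its weight. If $\gamma_2(S_i)$ equals $\gamma_3(S_i)$, the weights are $w_1 = 0$, $w_2 = 1$, $w_3 = 0$; otherwise they are $w_1 = 1$, $w_2 = 0$, $w_3 = 0$. If $\sigma(S_i, S^\theta) = 1$, the sample is a secretion-cleavable peptide; otherwise it is a nonsecretion cleavable peptide. We can then continue to predict the cleavage site. Because each protein has been cut into many segments, the starting position of each secretion-cleavable peptide within its protein is known for all $N^+$ segments, where $N^+$ represents the size of the set $S^+$, namely, the number of secretory proteins. The cleavage site of a secretory protein follows from the window definition: for a secretion-cleavable segment starting at residue $p$, the cleavage site lies between residues $p + \xi_1 - 1$ and $p + \xi_1$.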
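The weighting rule stated above and the mapping from a segment back to a cleavage site can be sketched as follows. The function names are hypothetical, and the position formula is an assumption derived from the scaled window definition with 1-based residue numbering rather than a formula quoted from this paper.

```python
def fuse_votes(g1, g2, g3):
    """Fuse three 0/1 base classifier outputs with the stated weights."""
    if g2 == g3:
        w1, w2, w3 = 0, 1, 0  # classifiers 2 and 3 agree: follow classifier 2
    else:
        w1, w2, w3 = 1, 0, 0  # they disagree: classifier 1 decides
    return w1 * g1 + w2 * g2 + w3 * g3


def cleavage_site(start, xi1=13):
    """Residues flanking the cleavage site for a window starting at `start`.

    Assumes 1-based positions: scale -1 is the xi1-th residue of the window.
    """
    return (start + xi1 - 1, start + xi1)


fused = fuse_votes(0, 1, 1)
site = cleavage_site(3)
```

With these weights the fused output always coincides with the simple majority of the three base classifiers, since classifier 1 acts only as a tie-breaker.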

Results and Discussion
The methods frequently used for cross-validating the accuracy of a classifier in statistical prediction include the single independent dataset test, the subsampling test, and the jackknife test. In this study, the 5-fold cross-validation test was performed on Signal-BNF. Table 6 compares the accuracy of several Bayesian network classifiers in Weka. Table 7 compares the accuracy of several Bayesian network classifiers in Matlab. From them we can conclude that the K2 structure learning algorithm performs relatively better than the others. Table 8 lists the prediction accuracy for distinguishing secretory proteins from nonsecretory proteins by the three subsystems and the fusion system. Table 9 compares the prediction accuracy of our approach for distinguishing secretory proteins from nonsecretory proteins with that of other approaches. Table 10 lists the prediction accuracy for the cleavage sites by the three subsystems and the fusion system. Table 11 compares the prediction accuracy of our approach for the cleavage sites with that of other approaches. From Table 8, we can clearly conclude that the fusion system compensates for the shortcomings of each base classifier and improves the prediction accuracy. Similar results can also be observed in Table 10.
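A 5-fold cross-validation split can be sketched as below. The shuffling seed and the way samples are dealt into folds are assumptions for illustration and do not describe the exact partition used in these experiments.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # deal indices into k folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(23, k=5))
```

Each sample appears in exactly one test fold, and the reported accuracy is typically the average over the k test folds.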
The performances of the two other popular predictors, Signal-3L [4] and PrediSi [25], are listed for comparison in Tables 9 and 11. Table 9 shows that the success rates of Signal-BNF are 1.63-6.91% higher than those of PrediSi [25] and 1.66-5.43% higher than those of Signal-3L, except on the Gram-positive dataset. Signal-BNF also achieves the best prediction accuracy in discriminating the cleavage sites, as can be observed in Table 11: its success rates are 11.67-22.9% higher than those of PrediSi [25] and 3.84-17.5% higher than those of Signal-3L. These results indicate that Signal-BNF achieves better prediction accuracy for signal peptide sequences and their cleavage sites.

Conclusion
Efficient prediction of N-terminal signal peptides and their cleavage sites is important to both basic research and drug discovery. In this paper, we have proposed a novel Bayesian learning network approach named Signal-BNF to reach this goal. The experimental results show that Signal-BNF achieves better prediction accuracy than other popular predictors. Bayesian networks can therefore be a powerful computational tool for predicting signal peptide cleavage sites. The experiments also show that fusing multiple predictors provides effective complementarity among them for predicting N-terminal signal peptides, since different algorithms have their own merits and shortcomings.