TOPPER: Topology Prediction of Transmembrane Protein Based on Evidential Reasoning

The topology prediction of transmembrane proteins is an active research field in bioinformatics and molecular biology, and it is a typical pattern recognition problem. Because experimental techniques are restricted by many stringent conditions, various algorithms have been developed to predict transmembrane protein topology. These individual prediction algorithms usually depend on different principles, such as the hydrophobicity or charges of residues. In this paper, an evidential topology prediction method for transmembrane proteins is proposed, called TOPPER (topology prediction of transmembrane protein based on evidential reasoning). In the proposed method, the prediction results of multiple individual prediction algorithms are transformed into BPAs (basic probability assignments) according to their confusion matrices. The final prediction result is then obtained by combining the individual predictions based on Dempster's rule of combination. The experimental results show that the proposed method is superior to the individual prediction algorithms, which illustrates its effectiveness.


Introduction
According to present genome data, roughly 20-30% of the genes in a typical organism code for α-helical transmembrane (TM) proteins [1][2][3]. Transmembrane proteins are the principal executors of the biomembrane's functions and play many important roles in the cell, such as substance transportation and energy conversion. In order to explore the structure, function, and transmembrane mechanism of transmembrane proteins, the topology prediction of transmembrane proteins has become an active field in bioinformatics and molecular biology [1,2,4].
The topology of a transmembrane protein [5], that is, the number and position of the transmembrane helices and the in/out location of the N and C termini of the protein sequence, is an important issue for research on transmembrane proteins. For a protein sequence, if both the transmembrane helices and the locations of the N and C termini have been predicted correctly, the topology of the protein sequence is said to be predicted correctly. Recently, information science and technology have been widely used in biology and medicine [6][7][8]. In essence, the topology prediction of transmembrane proteins is a typical pattern recognition problem. As shown in Figure 1, given a protein sequence, the task is to determine the class label of each residue among the three classes "i" (intracellular), "M" (transmembrane), and "o" (extracellular). At present, the most accurate methods for determining the topology of transmembrane proteins are experimental techniques, such as nuclear magnetic resonance (NMR) and X-ray crystal diffraction. However, these experimental techniques usually require strict conditions, so they cannot be applied on a large scale and cannot meet the needs of the increasing number of protein sequences. Therefore, various computational methods have been developed to predict the topology of transmembrane proteins [9][10][11].
The Scientific World Journal
Generally speaking, previous studies offer three primary kinds of algorithms for predicting the topology of transmembrane proteins. The first kind is based on the chemical or physical properties of amino acids, for example, the hydrophobicity of residues or the charges of residues in different locations. Classical prediction algorithms of this kind include TopPred [2] and others [12,13]. The second kind is based on statistical analysis of a large number of transmembrane proteins of known structure, such as MEMSAT [14], TMAP [10], and PRED-TMR [15]. In the third kind, machine learning techniques such as the hidden Markov model (HMM) and the support vector machine (SVM) have been introduced to the prediction of transmembrane protein topology. A series of algorithms have been developed, for example, HMMTOP [11], PHDhtm [16,17], and others [18][19][20][21].
As described above, many algorithms exist for the prediction of transmembrane protein topology; however, different algorithms depend on different principles, and their applicable scopes differ. For a prediction system, if more information is taken into consideration, its prediction ability is expected to be stronger. Essentially, this is the viewpoint of ensemble learning [22][23][24][25]. Applying this idea to the topology prediction of transmembrane proteins, the various prediction algorithms are treated as basic predictors, and the task is to combine multiple predictors into a combination predictor that performs better than the basic predictors. Within this process, there are two critical problems, namely, the representation of each predictor's prediction results and the method of combining multiple predictors. With regard to the representation of a predictor's prediction results, as Xu et al. [23] pointed out, three types of output information can be utilized for different prediction algorithms, namely, information at the abstract level, the rank level, and the measurement level. As to the combination method, traditional methodologies are usually based on the framework of probability theory. To some degree, this is very effective, especially for randomness. However, in the real world there are various uncertainties, including not only randomness but also fuzziness, incompleteness, and so forth [26,27].
As a theory of evidential reasoning under uncertainty, the Dempster-Shafer theory of evidence [28,29] has the advantage of directly expressing various uncertainties and has been widely used in many fields [30][31][32][33][34][35][36][37]. It provides a general and effective framework for the representation and combination of multiple individual algorithms. In this paper, a new topology prediction method for transmembrane proteins based on the evidential reasoning approach, called TOPPER, is proposed. In the proposed TOPPER method, the prediction results of each basic predictor are represented by basic probability assignments (BPAs), which are constructed in terms of the confusion matrix of the predictor. Then, the basic predictors are combined by using Dempster's rule of combination. Finally, the topology of a transmembrane protein sequence is determined according to the combined prediction results. An experiment demonstrates the effectiveness of the proposed prediction method. The rest of this paper is organized as follows. Section 2 introduces some basic concepts of the Dempster-Shafer theory of evidence. In Section 3 the proposed method is presented. Section 4 gives experimental verification to demonstrate the effectiveness of the proposed method. Conclusions are given in Section 5.

Preliminaries
In this section, a few concepts commonly used in the Dempster-Shafer theory of evidence are introduced.
The Dempster-Shafer theory of evidence [28,29], also called the Dempster-Shafer theory or evidence theory, is used to deal with uncertain information. As an effective theory of evidential reasoning, the Dempster-Shafer theory has the advantage of directly expressing various uncertainties. This theory needs weaker conditions than the Bayesian theory of probability, so it is often regarded as an extension of the Bayesian theory. For completeness of the explanation, a few basic concepts are introduced as follows.
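Let $\Omega$ be a set of mutually exclusive and collectively exhaustive elements, called the frame of discernment, and let $2^{\Omega}$ denote its power set. If $A \in 2^{\Omega}$, $A$ is called a proposition.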
For a frame of discernment $\Omega$, a mass function is a mapping $m$ from $2^{\Omega}$ to $[0, 1]$, formally defined by
$$m: 2^{\Omega} \to [0, 1],$$
which satisfies the following condition:
$$m(\emptyset) = 0, \qquad \sum_{A \in 2^{\Omega}} m(A) = 1.$$
In the Dempster-Shafer theory, a mass function is also called a basic probability assignment (BPA). If $m(A) > 0$, $A$ is called a focal element; the union of all focal elements is called the core of the mass function.
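For a proposition $A \subseteq \Omega$, the belief function $\mathrm{Bel}: 2^{\Omega} \to [0, 1]$ is defined as
$$\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B),$$
and the plausibility function $\mathrm{Pl}: 2^{\Omega} \to [0, 1]$ is defined as
$$\mathrm{Pl}(A) = 1 - \mathrm{Bel}(\bar{A}) = \sum_{B \cap A \neq \emptyset} m(B),$$
where $\bar{A} = \Omega - A$. Obviously, $\mathrm{Pl}(A) \geq \mathrm{Bel}(A)$; the functions $\mathrm{Bel}$ and $\mathrm{Pl}$ are the lower limit function and upper limit function of proposition $A$, respectively.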
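As a small illustration of the belief and plausibility functions, the following Python sketch evaluates both for a single proposition. The BPA values are hypothetical and chosen only for illustration; representing propositions as `frozenset` objects is likewise an assumption of this sketch, not something prescribed by the theory.

```python
# Frame of discernment for the three residue classes used in this paper.
OMEGA = frozenset({"i", "M", "o"})

def belief(m, a):
    """Bel(A): sum of the masses of all focal elements B contained in A."""
    return sum(mass for b, mass in m.items() if b <= a)

def plausibility(m, a):
    """Pl(A): sum of the masses of all focal elements B that intersect A."""
    return sum(mass for b, mass in m.items() if b & a)

# Hypothetical BPA; the masses are illustrative only and sum to 1.
m = {
    frozenset({"M"}): 0.6,
    frozenset({"i", "o"}): 0.3,
    OMEGA: 0.1,
}

a = frozenset({"M"})
bel_a = belief(m, a)        # only the focal element {M} lies inside A
pl_a = plausibility(m, a)   # {M} and Omega both intersect A
print(f"Bel = {bel_a:.2f}, Pl = {pl_a:.2f}")
```

For this example the belief in class "M" is 0.6 and its plausibility is 0.7, so the interval between the two reflects the mass that is not committed to any single class.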
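Consider two pieces of evidence indicated by two BPAs $m_1$ and $m_2$ on the frame of discernment $\Omega$; Dempster's rule of combination is used to combine them. This rule assumes that these BPAs are independent.
Definition. Dempster's rule of combination, also called the orthogonal sum, denoted by $m = m_1 \oplus m_2$, is defined as follows:
$$m(A) = \begin{cases} \dfrac{1}{1 - K} \displaystyle\sum_{B \cap C = A} m_1(B)\, m_2(C), & A \neq \emptyset, \\ 0, & A = \emptyset, \end{cases}$$
with
$$K = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C).$$
Note that Dempster's rule of combination is only applicable to two BPAs which satisfy the condition $K < 1$.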
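Dempster's rule can be sketched in a few lines of Python. The sketch below is a minimal illustration under the assumption that a BPA is stored as a dictionary from `frozenset` propositions to masses; the function names `dempster` and `combine_all` and the example masses are hypothetical.

```python
from functools import reduce

def dempster(m1, m2):
    """Combine two BPAs (dicts: frozenset -> mass) by Dempster's rule."""
    combined = {}
    conflict = 0.0  # K: total mass assigned to empty intersections
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            inter = b & c
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mass_b * mass_c
            else:
                conflict += mass_b * mass_c
    if 1.0 - conflict < 1e-12:
        raise ValueError("K = 1: the two BPAs are totally conflicting")
    # Normalise by 1 - K so that the combined masses sum to 1.
    return {a: mass / (1.0 - conflict) for a, mass in combined.items()}

def combine_all(bpas):
    """Fold Dempster's rule over a non-empty sequence of BPAs."""
    return reduce(dempster, bpas)

OMEGA = frozenset({"i", "M", "o"})
# Hypothetical BPAs from two sources of evidence.
m1 = {frozenset({"M"}): 0.8, OMEGA: 0.2}
m2 = {frozenset({"M"}): 0.6, frozenset({"i"}): 0.3, OMEGA: 0.1}
m12 = combine_all([m1, m2])
```

In this example the conflict is $K = 0.8 \times 0.3 = 0.24$, and the combined mass of {"M"} is $0.68 / 0.76$. Because the rule is associative, folding it over a list of BPAs gives the same result regardless of how the pairwise combinations are grouped.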

Proposed Method
In this section, a new transmembrane protein topology prediction method is proposed based on evidential reasoning. For the sake of convenience, it is briefly referred to as TOPPER (topology prediction of transmembrane protein based on evidential reasoning). TOPPER is based on the combination of multiple individual prediction algorithms. The process of obtaining the combination predictor is presented step by step as follows.

3.1. The Selection of Basic Predictors.
Because the proposed topology prediction method is the combination of multiple individual prediction methods, the basic predictors should be constructed first. Here, five individual prediction algorithms, OCTOPUS [3], PRO-TMHMM and PRODIV-TMHMM [38], SCAMPI-msa, and SCAMPI-seq [13], have been selected as the basic predictors. In pattern recognition, the prediction performance of a predictor is expressed by its confusion matrix. In the topology prediction of transmembrane proteins, since there are only three classes, "i" (intracellular), "M" (transmembrane), and "o" (extracellular), the confusion matrix of the basic predictor $e_k$ is formulated by
$$C^{(k)} = \begin{pmatrix} n^{(k)}_{\mathrm{ii}} & n^{(k)}_{\mathrm{iM}} & n^{(k)}_{\mathrm{io}} \\ n^{(k)}_{\mathrm{Mi}} & n^{(k)}_{\mathrm{MM}} & n^{(k)}_{\mathrm{Mo}} \\ n^{(k)}_{\mathrm{oi}} & n^{(k)}_{\mathrm{oM}} & n^{(k)}_{\mathrm{oo}} \end{pmatrix},$$
where each item $n^{(k)}_{ab}$ is the number of residues belonging to the class $a$ but predicted as the class $b$ according to the basic predictor $e_k$.
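To make the construction of such a confusion matrix concrete, the following sketch counts residue labels for one predictor. It assumes that the observed and predicted topologies are available as equal-length strings over the alphabet {"i", "M", "o"}; the two example strings are hypothetical and do not come from any of the predictors named above.

```python
CLASSES = ("i", "M", "o")

def confusion_matrix(observed, predicted):
    """Return counts[a][b]: residues of true class a predicted as class b."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted labels differ in length")
    counts = {a: {b: 0 for b in CLASSES} for a in CLASSES}
    for a, b in zip(observed, predicted):
        counts[a][b] += 1
    return counts

# Hypothetical per-residue labels for one short sequence.
observed = "iiiiMMMMMMoooo"
predicted = "iiiMMMMMMMMooo"
cm = confusion_matrix(observed, predicted)
```

In practice the counts would be accumulated over all residues of all training sequences rather than over a single sequence.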

3.2. The Representation of the Basic Predictor's Prediction Results.
In the combination of multiple predictors, the representation of each basic predictor's prediction results is a critical problem. In this paper, BPAs are used to represent these prediction results, so the next question is how to construct them. For example, suppose a residue in a protein sequence has been predicted by a basic predictor to belong to a transmembrane helix (i.e., class "M"). Because the prediction is not 100% correct, how can this uncertainty be represented? Here, a classical and effective method proposed by Xu et al. [23] is adopted to construct BPAs. In Xu et al.'s method, the outputs are treated as single class labels, and the source of evidence for the propositions of interest is defined on the basis of the performance of the predictors in terms of recognition, substitution, and rejection rates, which are generated from the confusion matrix. Briefly speaking, it is a BPA construction method based on the confusion matrix.
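For a predictor of transmembrane protein topology with a given confusion matrix, according to Xu et al.'s method [23], a BPA can be constructed for each class, with a mass $m(A)$ assigned to every proposition $A \subseteq \Omega$ from the entries of the confusion matrix.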
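The following sketch shows one way such a construction could look. It is an assumption made for illustration and not necessarily the exact assignment of [23]: for each predicted class, the class-wise recognition rate (the fraction of residues predicted as that class that truly belong to it) is assigned to the singleton, and the substitution rate is assigned to its complement. Since the three-class confusion matrix has no reject option, no mass is left for $\Omega$ in this variant. The function name and the counts are hypothetical.

```python
CLASSES = ("i", "M", "o")
OMEGA = frozenset(CLASSES)

def bpas_from_confusion(cm):
    """Build one BPA per predicted class from cm[true][predicted] counts.

    ASSUMPTION: recognition rate -> {b}, substitution rate -> Omega - {b}.
    This is one plausible reading of a confusion-matrix-based construction.
    """
    bpas = {}
    for b in CLASSES:
        column_total = sum(cm[a][b] for a in CLASSES)
        if column_total == 0:
            bpas[b] = {OMEGA: 1.0}  # class never predicted: total ignorance
            continue
        recognition = cm[b][b] / column_total
        bpa = {}
        if recognition > 0:
            bpa[frozenset({b})] = recognition
        if recognition < 1:
            bpa[OMEGA - {b}] = 1.0 - recognition
        bpas[b] = bpa
    return bpas

# Hypothetical confusion matrix (rows: true class, columns: predicted class).
cm = {
    "i": {"i": 80, "M": 10, "o": 10},
    "M": {"i": 5, "M": 90, "o": 5},
    "o": {"i": 15, "M": 0, "o": 85},
}
bpas = bpas_from_confusion(cm)
```

With these counts, a residue predicted as "M" receives the BPA $m(\{\mathrm{M}\}) = 0.9$ and $m(\{\mathrm{i}, \mathrm{o}\}) = 0.1$. Only propositions with positive mass are stored, in line with the definition of focal elements.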

3.3. The Combination of Multiple Predictors.
Once the BPAs of each predictor have been constructed, the prediction results of multiple predictors can be combined. In this paper, the prediction results of the basic predictors are treated as pieces of evidence coming from different sources, which can be combined by using Dempster's rule of combination, as shown in Figure 2.
Assume there are $N$ basic predictors $e_1, \ldots, e_N$ in the evidential prediction system, $M_k$ is the set of constructed BPAs for all classes from basic predictor $e_k$, and $M_k = \{m_k^{\mathrm{i}}, m_k^{\mathrm{M}}, m_k^{\mathrm{o}}\}$.
Let $f(M_k, x)$ be an operation used to obtain the matched BPA for a residue $x$ predicted by $e_k$. The combination of multiple predictors to predict the class of residue $x$ can be expressed by
$$m_x = f(M_1, x) \oplus f(M_2, x) \oplus \cdots \oplus f(M_N, x).$$

3.4. The Determination of Topology.
Through the above steps, the combination prediction result has been derived for each residue in a transmembrane protein sequence. It is indicated by a BPA $m_x$. In order to get the final class that the residue belongs to, the BPA is translated into a probability distribution by using the so-called pignistic probability transformation (PPT) function, proposed by Smets and Kennes in the transferable belief model (TBM) [39]. The PPT function [39] is defined as follows. Let $m$ be a BPA on a frame of discernment $\Omega$; the pignistic probability transformation function $\mathrm{Bet}_m: \Omega \to [0, 1]$ corresponding to $m$ is
$$\mathrm{Bet}_m(\omega) = \sum_{A \subseteq \Omega,\ \omega \in A} \frac{m(A)}{|A|},$$
where $|A|$ is the cardinality of proposition $A$. By using the PPT function, the BPA $m_x$ is translated into a probability distribution $\mathrm{Bet}_{m_x}$. Then the class of the residue is determined according to the maximum value of this probability distribution. Finally, the topology of a transmembrane protein is determined once the classes of all residues in the protein sequence have been determined. For each protein, the transmembrane orientation is determined by the location of the first residue, and each transmembrane region consists of a run of residues labelled as class "M" whose length exceeds a threshold. According to the topology, all transmembrane helices and the orientation of each transmembrane helix can be derived.

Experimental Verification
In this paper, a data set of 125 transmembrane protein sequences with known topology is collected from MPtopo [40] to verify the effectiveness of the proposed method TOPPER. In order to reflect the performance of the combination predictor faithfully and to avoid overfitting, the experiment is performed using tenfold cross-validation. Each fold contains roughly 12-13 transmembrane proteins, and their homology has been reduced to below 30% by using the cd-hit program [41].
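When such an experiment is run, the combined BPA of each residue has to be turned into a class label, and the labels into transmembrane regions, as described for the determination of topology. The sketch below illustrates that decision step under stated assumptions: BPAs are dictionaries from `frozenset` propositions to masses with no mass on the empty set, the function names are hypothetical, and the threshold value 3 is chosen only for the example because the paper does not state the value used.

```python
def pignistic(m):
    """Bet(w): each focal element A shares its mass equally among its members."""
    bet = {}
    for a, mass in m.items():
        for w in a:
            bet[w] = bet.get(w, 0.0) + mass / len(a)
    return bet

def decide(m):
    """Return the class with the largest pignistic probability."""
    bet = pignistic(m)
    return max(bet, key=bet.get)

def tm_segments(labels, threshold):
    """Return (start, end) pairs, end exclusive, of 'M' runs longer than threshold."""
    segments, start = [], None
    for pos, lab in enumerate(labels + "#"):  # sentinel closes a trailing run
        if lab == "M" and start is None:
            start = pos
        elif lab != "M" and start is not None:
            if pos - start > threshold:
                segments.append((start, pos))
            start = None
    return segments

OMEGA = frozenset({"i", "M", "o"})
# Hypothetical combined BPA for one residue.
m_x = {frozenset({"M"}): 0.5, frozenset({"i", "o"}): 0.2, OMEGA: 0.3}
label = decide(m_x)

# Hypothetical per-residue labels for a whole sequence.
labels = "iiiMMMMMMooMMi"
regions = tm_segments(labels, threshold=3)
n_terminus = labels[0]  # orientation is taken from the first residue
```

For the example BPA the pignistic probabilities are 0.6 for "M" and 0.2 each for "i" and "o", so the residue is labelled "M". In the label string, the run of six "M" residues is kept as a region, while the run of two is discarded because it does not exceed the threshold.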
In order to assess the prediction performance for transmembrane regions (i.e., transmembrane helices without considering orientation) of the different algorithms, the evaluation method developed by Tusnády and Simon [11] is adopted in this paper. For a transmembrane region, the prediction is considered successful when the overlap between the predicted and observed transmembrane regions contains at least 9 amino acids. The total numbers of predicted and observed transmembrane regions are indicated by $N_{\mathrm{prd}}$ and $N_{\mathrm{obs}}$, respectively. The number of overlapping predicted and observed transmembrane regions is indicated by $N_{\mathrm{cor}}$. The efficiency of the transmembrane region prediction is measured by $M = N_{\mathrm{cor}} / N_{\mathrm{obs}}$ and $C = N_{\mathrm{cor}} / N_{\mathrm{prd}}$. The overall prediction power is defined by
$$Q_p = 100 \times \sqrt{M \times C}.$$
Besides, if all transmembrane regions and the orientation of a transmembrane protein sequence have been predicted correctly, the topology of the transmembrane protein is said to be predicted correctly.
In the rest of this section, the prediction algorithms are compared at three levels, namely, the residue level, the transmembrane region level, and the topology level. At the residue level, the confusion matrix of residue prediction for each algorithm is shown in Table 1. According to these confusion matrices, Table 2 shows several indexes that measure the performance of residue prediction, including the recall rate, precision rate, and F score of each class, and the prediction accuracy of residues. The residue prediction accuracy of TOPPER is 80.00%, while those of the other algorithms are 78.69%, 77.91%, 77.63%, 78.69%, and 77.66%, respectively. The proposed method has the highest residue prediction accuracy, as shown in Figure 3. In addition, TOPPER has the highest F score for each of the classes "i", "M", and "o", as shown in Figure 4. Hence, at the residue level, the proposed TOPPER outperforms the other algorithms on this data set.
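The residue-level indexes mentioned above can be derived directly from a confusion matrix. The following sketch is a minimal illustration; it assumes that rows hold the true class and columns the predicted class, and that the F score is the usual harmonic mean of precision and recall. The counts are hypothetical and are not the values of Table 1.

```python
CLASSES = ("i", "M", "o")

def residue_metrics(cm):
    """Return (accuracy, {class: (recall, precision, f_score)}) from cm[true][pred]."""
    total = sum(cm[a][b] for a in CLASSES for b in CLASSES)
    accuracy = sum(cm[c][c] for c in CLASSES) / total
    per_class = {}
    for c in CLASSES:
        hits = cm[c][c]
        n_true = sum(cm[c][b] for b in CLASSES)  # residues truly in class c
        n_pred = sum(cm[a][c] for a in CLASSES)  # residues predicted as c
        recall = hits / n_true if n_true else 0.0
        precision = hits / n_pred if n_pred else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        per_class[c] = (recall, precision, f_score)
    return accuracy, per_class

# Hypothetical confusion matrix (rows: true class, columns: predicted class).
cm = {
    "i": {"i": 80, "M": 10, "o": 10},
    "M": {"i": 5, "M": 90, "o": 5},
    "o": {"i": 15, "M": 0, "o": 85},
}
accuracy, per_class = residue_metrics(cm)
```

For these counts the residue accuracy is 255/300 = 0.85, and class "M" has recall, precision, and F score all equal to 0.9.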
At the level of transmembrane region prediction, Table 3 shows the prediction performance of the various algorithms. According to the overall prediction power defined in [11], the value of TOPPER is 97.85%, while the values of the other algorithms are 97.37%, 96.98%, 96.83%, 97.37%, and 96.68%, respectively. The value of TOPPER is the highest, as shown in Figure 5, so TOPPER is superior to the other algorithms at this level. At the level of topology prediction, Table 4 shows the topology prediction accuracy of each algorithm. The topology prediction accuracy of TOPPER is 74.4%, which is the highest among these algorithms, as shown in Figure 6. Therefore, the proposed TOPPER is superior to the other algorithms at this level as well.
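The region-level measure used in this comparison can be sketched as follows. The sketch rests on several assumptions: regions are given as (start, end) index pairs with exclusive end, each predicted region may be matched to at most one observed region, and the overall prediction power is taken as 100 times the geometric mean of the two ratios $N_{\mathrm{cor}} / N_{\mathrm{obs}}$ and $N_{\mathrm{cor}} / N_{\mathrm{prd}}$, following the evaluation method of [11]. The function names and the region coordinates are hypothetical.

```python
from math import sqrt

def overlap(a, b):
    """Number of positions shared by two (start, end) regions, end exclusive."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def region_scores(observed, predicted, min_overlap=9):
    """Return (M, C, Q_p) for lists of observed and predicted regions."""
    used = set()
    n_cor = 0
    for obs in observed:
        for idx, prd in enumerate(predicted):
            if idx not in used and overlap(obs, prd) >= min_overlap:
                used.add(idx)  # ASSUMPTION: one-to-one matching of regions
                n_cor += 1
                break
    m = n_cor / len(observed) if observed else 0.0
    c = n_cor / len(predicted) if predicted else 0.0
    return m, c, 100.0 * sqrt(m * c)

# Hypothetical regions for one protein.
observed = [(10, 30), (45, 66), (80, 101)]
predicted = [(12, 31), (50, 70), (120, 140), (150, 170)]
m_val, c_val, q_p = region_scores(observed, predicted)
```

In this example two of the three observed regions are matched, so the first ratio is 2/3, and two of the four predicted regions are correct, so the second ratio is 1/2. The geometric mean penalises both missed regions and spurious predictions.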
As discussed above, the proposed TOPPER outperforms the other algorithms at all three levels: residue prediction, transmembrane region prediction, and topology prediction. Hence, the effectiveness of the proposed method has been demonstrated.

Conclusions
Transmembrane proteins are a special and important class of proteins in cells. The topology prediction of transmembrane proteins is a foundation of research on transmembrane proteins. In this paper, a new topology prediction method for transmembrane proteins is proposed based on evidential reasoning. The proposed method is a combination of multiple individual prediction algorithms, in which the Dempster-Shafer theory is used to represent and combine the results of the basic predictors. Experimental results show that the proposed method is superior to the individual prediction algorithms, which demonstrates its effectiveness.