Identification of Protein Pupylation Sites Using Bi-Profile Bayes Feature Extraction and Ensemble Learning

Pupylation, one of the most important posttranslational modifications of proteins, typically takes place when prokaryotic ubiquitin-like protein (Pup) is attached to specific lysine residues on a target protein. Identification of pupylation substrates and their corresponding sites will facilitate the understanding of the molecular mechanism of pupylation. Compared with labor-intensive and time-consuming experimental approaches, computational prediction of pupylation sites is highly desirable for its convenience and speed. In this study, a new bioinformatics tool named EnsemblePup was developed that used an ensemble of support vector machine classifiers to predict pupylation sites. The highlight of EnsemblePup was the use of Bi-profile Bayes feature extraction as the encoding scheme. The performance of EnsemblePup was measured with a sensitivity of 79.49%, a specificity of 82.35%, an accuracy of 85.43%, and a Matthews correlation coefficient of 0.617 using 5-fold cross validation on the training dataset. When compared with other existing methods on a benchmark dataset, EnsemblePup provided better predictive performance, with a sensitivity of 80.00%, a specificity of 83.33%, an accuracy of 82.00%, and a Matthews correlation coefficient of 0.629. The experimental results suggested that EnsemblePup might be useful for identifying and annotating potential pupylation sites in proteins of interest. A web server for predicting pupylation sites was developed.


Introduction
As the first posttranslational small protein modifier identified in prokaryotes, prokaryotic ubiquitin-like protein (Pup) in Mycobacterium tuberculosis (Mtb) is an important signal for the selective degradation of proteins [1]. Pup attaches to substrate lysine via isopeptide bonds in a manner reminiscent of ubiquitin (Ub) and ubiquitin-like modifier (Ubl) conjugation to proteins in eukaryotes [2]. Although pupylation and ubiquitylation are functionally similar, their enzymology is different [3]. Generally, eukaryotic ubiquitylation is a three-step reaction involving three kinds of enzymes (ubiquitin-activating enzymes, ubiquitin-conjugating enzymes, and ubiquitin ligases) [4,5], whereas prokaryotic pupylation is a two-step reaction involving only two kinds of enzymes. Firstly, the Pup-GGQ C-terminal is deamidated to -GGE by deamidase of Pup [6], and then the proteasome accessory factor A (PafA) attaches the deamidated Pup to specific lysine residues of substrates [7].
Since identification of protein pupylation sites is of fundamental importance for understanding the molecular mechanism of pupylation in biological systems, much interest has focused on this field, and large-scale proteomics technology has been applied to identify pupylated proteins and pupylation sites [8][9][10]. However, the experimental determination of the exact modified sites of pupylated substrates is labor intensive and time consuming, especially for large-scale data sets. In this regard, computational approaches that can effectively and accurately identify pupylation sites are urgently needed. Liu et al. constructed the first online predictor, GPS-PUP, for the prediction of pupylation sites [11]. In their method, 127 experimentally identified pupylation sites in 109 prokaryotic proteins were utilized as the training dataset, with an accuracy of 0.789 and an MCC of 0.286. However, there is significant room for improvement in prediction performance. In this study, the prediction performance for pupylation sites has been improved by using a new encoding scheme, Bi-profile Bayes feature extraction (BPB), which has been widely used for diverse prediction tasks in bioinformatics [12][13][14][15]. Since the newly constructed pupylation sites dataset was highly imbalanced (the number of pupylation sites was much smaller than the number of nonpupylation sites), the ensemble learning method was adopted here to deal with the imbalanced data classification problem. The performance of EnsemblePup was measured with a sensitivity of 79.49%, a specificity of 82.35%, an accuracy of 85.43%, and a Matthews correlation coefficient of 0.617 using 5-fold cross validation on the training dataset. When compared with other existing methods on a benchmark dataset, EnsemblePup provided better predictive performance, with a sensitivity of 80.00%, a specificity of 83.33%, an accuracy of 82.00%, and a Matthews correlation coefficient of 0.629. The experimental results suggested that EnsemblePup might be useful for identifying and annotating potential pupylation sites in proteins of interest. A web server for predicting pupylation sites was developed and made available at http://210.47.24.217:8080/EnsemblePup/.

Mathematical Problems in Engineering
The organization of this paper is as follows. Section 2 introduces the dataset for establishing the predictor, the vector encoding schemes, and the proposed prediction model. Section 3 shows the experimental results, discusses the performance of the proposed predictor, and compares the proposed predictor with other methods. Finally, Section 4 gives the conclusions.

Dataset.
The pupylated proteins used in this study were extracted from PupDB [3]. Protein sequences with fewer than 50 amino acids were excluded because they may be just fragments [16,17]. Protein sequences including nonstandard amino acids, such as "B," "J," "O," "U," "X," and "Z," were excluded as well. As a result, there were 182 pupylated proteins with 215 known pupylation sites. After a homology-reducing screening procedure using CD-HIT [18,19] to remove those proteins that had 40% or higher sequence identity to any other, we finally obtained 153 pupylated proteins with 183 positive sites, which constituted the nonredundant training dataset named Dataset 1 in this study (see Supporting Information Text S1 available online at http://dx.doi.org/10.1155/2013/283129). In order to fairly compare our proposed method EnsemblePup with the previously developed method GPS-PUP, the dataset collected by Liu et al. [11] was also adopted here. We named it Dataset 2 in this work, and its details are listed in Table 1.
Subsequently, similar to the development of other PTM site predictors [20,21], the sliding window strategy was utilized to extract positive and negative samples. To ensure that all peptides (sequence fragments) had a uniform length, a nonexistent residue coded by " " was used to fill any position extending beyond the ends of the sequence. Peptides with a pupylated lysine as the middle residue were regarded as positive samples, and the remaining peptides with a nonpupylated lysine as the middle residue were regarded as negative samples.
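The sliding window extraction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `extract_windows`, the padding symbol `-`, and the default half-width of 8 (giving a 17-residue window) are assumptions made for the example.

```python
from typing import List, Set, Tuple

PAD = "-"  # assumed placeholder symbol for positions beyond the sequence ends


def extract_windows(seq: str, positive_sites: Set[int],
                    half_width: int = 8) -> List[Tuple[int, str, int]]:
    """Return (position, peptide, label) for every lysine in seq.

    Positions are 1-based. The label is 1 for a pupylated lysine
    (position listed in positive_sites) and -1 otherwise.
    """
    padded = PAD * half_width + seq + PAD * half_width
    samples = []
    for i, residue in enumerate(seq):
        if residue != "K":
            continue
        # Residue i of seq sits at index i + half_width of padded,
        # so this slice is centred on the lysine.
        peptide = padded[i:i + 2 * half_width + 1]
        label = 1 if (i + 1) in positive_sites else -1
        samples.append((i + 1, peptide, label))
    return samples
```

Padding both ends once, instead of testing boundaries inside the loop, keeps every peptide the same length with the lysine in the middle.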

Vector Encoding Schemes.
In this study, the Bi-profile Bayes feature extraction (BPB) based encoding scheme was used. For details on this encoding scheme, readers are advised to refer to Shao et al. [14]. Briefly, let $S = s_1, s_2, \ldots, s_n$ represent a sequence fragment, where $s_i$ denotes one amino acid and $n$ stands for the length of the sequence fragment. $S$ belongs to one of two categories, $C_1$ and $C_{-1}$, where $C_1$ and $C_{-1}$ represent pupylation sites and nonpupylation sites, respectively. Then, a feature vector can be described as

$$P = (p_1, p_2, \ldots, p_n, p_{n+1}, \ldots, p_{2n})^{T},$$

where $p_1, p_2, \ldots, p_n$ represent the posterior probability of each amino acid at each position for the sequence fragment of pupylation sites (category $C_1$) and $p_{n+1}, \ldots, p_{2n}$ represent the posterior probability of each amino acid at each position for the sequence fragment of nonpupylation sites (category $C_{-1}$), which is the so-called Bi-profile. In this paper, the posterior probability is estimated by the occurrence of each amino acid at each position in the training datasets [14].
The binary encoding scheme was also implemented here for comparison with the BPB encoding scheme. There are 20 types of amino acids in protein sequences, ordered here as ACDEFGHIKLMNPQRSTVWY. Each amino acid is therefore represented by a 20-dimensional binary vector; that is, A corresponds to (10000000000000000000), C corresponds to (01000000000000000000), and Y corresponds to (00000000000000000001). For each sequence fragment of length $n$, the total dimension of the binary feature vector is $20 \times (n - 1)$, since the central amino acid is always K and need not be encoded.
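As a rough illustration of the Bi-profile idea, the sketch below estimates position-specific frequencies separately from positive and negative training peptides and concatenates the two profiles into one feature vector. The function names are hypothetical, and assigning 0.0 to a residue never seen at a position is an assumption; Shao et al. [14] should be consulted for the exact estimation procedure.

```python
from collections import Counter
from typing import Dict, List


def position_profiles(peptides: List[str]) -> List[Dict[str, float]]:
    """Relative frequency of each symbol at each position of equal-length peptides."""
    n = len(peptides[0])
    profiles = []
    for j in range(n):
        counts = Counter(p[j] for p in peptides)
        profiles.append({aa: c / len(peptides) for aa, c in counts.items()})
    return profiles


def bpb_encode(peptide: str,
               pos_profile: List[Dict[str, float]],
               neg_profile: List[Dict[str, float]]) -> List[float]:
    """Concatenate positive-profile and negative-profile values (length 2n)."""
    # Unseen residues get 0.0 here; this is an assumption of the sketch.
    pos_part = [pos_profile[j].get(aa, 0.0) for j, aa in enumerate(peptide)]
    neg_part = [neg_profile[j].get(aa, 0.0) for j, aa in enumerate(peptide)]
    return pos_part + neg_part
```

For a window of 17 residues this yields a 34-dimensional vector, considerably smaller than a one-hot representation of the same window.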
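The binary scheme can be sketched in a few lines. The function name `binary_encode` is hypothetical, and mapping any symbol outside the 20 standard amino acids (such as a padding placeholder) to an all-zero block is an assumption, since the text does not say how padded positions are handled.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordering given in the text


def binary_encode(peptide: str) -> List[int]:
    """One-hot encode every position except the central lysine."""
    center = len(peptide) // 2
    vector: List[int] = []
    for j, aa in enumerate(peptide):
        if j == center:
            continue  # the central residue is always K and carries no information
        one_hot = [0] * len(AMINO_ACIDS)
        idx = AMINO_ACIDS.find(aa)
        if idx >= 0:  # non-standard symbols stay all-zero (assumption)
            one_hot[idx] = 1
        vector.extend(one_hot)
    return vector
```

A 17-residue window therefore produces 20 × 16 = 320 binary features.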

Support Vector Machine Learning and Imbalanced Data.
Support vector machine (SVM) is a popular machine learning algorithm mainly used for binary classification problems. SVM looks for a rule that best maps each member of the training set to the correct classification [22,23], and it has been widely used in the bioinformatics community. In this paper, the LIBSVM package [24] with the radial basis function (RBF) kernel is used, where the kernel width parameter $\gamma$ controls how the samples are transformed to a high-dimensional space. A grid search strategy based on 5-fold cross-validation is utilized to find the optimal parameters $C$ and $\gamma$, each taken from $\{2^{-7}, 2^{-6}, \ldots, 2^{8}\}$, so that a total of 256 grid points are evaluated.
Since the training dataset was imbalanced, in that the number of pupylation sites was much smaller than the number of nonpupylation sites, the bootstrap procedure was used to deal with this situation. As shown in Figure 1, we obtained $N$ training subsets using the bootstrap procedure, where $N$ represented the number of times the data were sampled. In this study, the bootstrap procedure was implemented with the WEKA package [25], and the parameter $N$ was set as the ratio of the number of positive samples divided by the number of negative samples.
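The grid described above has 16 candidate values per parameter, hence 16 × 16 = 256 combinations. The sketch below enumerates them in plain Python; the function `grid_search` and its `cv_score` callback are hypothetical names, and the callback is assumed to return the mean 5-fold cross-validation score of an RBF-kernel SVM trained with the given parameters.

```python
from itertools import product
from typing import Callable, Tuple

EXPONENTS = range(-7, 9)  # 2^-7 ... 2^8, i.e. 16 values per parameter


def grid_search(cv_score: Callable[[float, float], float]
                ) -> Tuple[float, float, float]:
    """Return (best_C, best_gamma, best_score) over the 16 x 16 grid."""
    best = None
    for c_exp, g_exp in product(EXPONENTS, EXPONENTS):
        C, gamma = 2.0 ** c_exp, 2.0 ** g_exp
        score = cv_score(C, gamma)  # supplied by the caller
        if best is None or score > best[2]:
            best = (C, gamma, score)
    return best
```

One possible callback, if scikit-learn is available, would wrap `cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=5).mean()`; the paper itself used the LIBSVM package directly.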
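One way to build balanced training subsets from an imbalanced dataset is sketched below. The paper relied on WEKA and does not spell out the sampling details, so this sketch makes its own assumptions: every subset keeps all positive samples and draws an equal number of negatives with replacement. The function name and the `seed` parameter are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def balanced_bootstrap(positives: Sequence[str], negatives: Sequence[str],
                       n_subsets: int, seed: int = 0
                       ) -> List[List[Tuple[str, int]]]:
    """Build n_subsets balanced training subsets of (sample, label) pairs."""
    rng = random.Random(seed)  # fixed seed for reproducible subsets
    subsets = []
    for _ in range(n_subsets):
        # Draw as many negatives as there are positives, with replacement.
        sampled_neg = rng.choices(negatives, k=len(positives))
        subset = [(x, 1) for x in positives] + [(x, -1) for x in sampled_neg]
        subsets.append(subset)
    return subsets
```

Each subset can then be used to train one base classifier, so that across subsets most negative samples are seen at least once.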

The Ensemble Model for Pupylation Sites Identification.
Since ensemble learning methods have unique advantages in dealing with high-dimensional and complicated data, they are increasingly used in the field of bioinformatics [26][27][28][29][30]. In this study, the ensemble model was established from a collection of SVM classifiers, each trained on a subset of the original training dataset (obtained by the bootstrap procedure in Figure 1). Figure 2 shows the entire schematic diagram for the prediction of pupylation sites. As shown in Figure 2, the final result was computed from the prediction results of the individual SVM classifiers. For example, given a new unlabeled test sample $x$, the $i$th SVM classifier returned a probability $P_i$ that $x$ belonged to the positive class, where $i = 1, 2, \ldots, N$. The ensemble probability estimate was obtained by $P_{\mathrm{Ensemble}} = (1/N) \sum_{i=1}^{N} P_i$.
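The averaging step can be illustrated with a short sketch. Here each trained classifier is represented simply as a callable returning a positive-class probability; the function names and the default threshold of 0.5 are assumptions made for the example.

```python
from typing import Callable, Sequence


def ensemble_probability(classifiers: Sequence[Callable[[object], float]],
                         x: object) -> float:
    """Average the positive-class probabilities of all base classifiers."""
    probs = [clf(x) for clf in classifiers]
    return sum(probs) / len(probs)


def predict(classifiers: Sequence[Callable[[object], float]],
            x: object, threshold: float = 0.5) -> bool:
    """Call a site positive when the averaged probability reaches the threshold."""
    return ensemble_probability(classifiers, x) >= threshold
```

Raising the threshold trades sensitivity for fewer false positives, which is the same trade-off the web server exposes through its threshold option.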

Performance Assessment.
In this study, 5-fold cross validation and jackknife cross validation tests were chosen for evaluating the proposed predictor. More details about these two methods can be found in two recent papers [31,32]. In order to evaluate the proposed predictor, four measurements are used: sensitivity (Sn), specificity (Sp), accuracy (Ac), and Matthews correlation coefficient (MCC). They are defined by the following formulas:

$$\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{Ac} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP, TN, FP, and FN stand for the number of true positives, true negatives, false positives, and false negatives, respectively. In addition, the receiver operating characteristic (ROC) curves and the area under the curve (AUC) values are also reported.
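The four measures can be computed from confusion-matrix counts using their standard definitions, as in the sketch below. The function name and the counts in the usage note are illustrative only; the paper does not report its confusion matrices.

```python
import math
from typing import Dict


def metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    ac = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a marginal total is zero; return 0.0 by convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Ac": ac, "MCC": mcc}
```

For instance, hypothetical counts of TP = 16, TN = 25, FP = 5, and FN = 4 give Sn = 0.80, Sp ≈ 0.833, Ac = 0.82, and MCC ≈ 0.629.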

Determination of the Best Window Size.
We first analyzed the position-specific propensities of the residues surrounding pupylation sites and nonpupylation sites using Two-Sample-Logo, which generates a graphical sequence logo of the relative frequency of each amino acid at each position around pupylation sites and nonpupylation sites. As shown in Figure 3, the characteristics of the residues differed significantly between pupylation sites and nonpupylation sites. To capture the position-specific propensities of residues for computational prediction, we established SVM prediction models with different window lengths (denoted BPB-SVM15, BPB-SVM17, BPB-SVM19, BPB-SVM21, and BPB-SVM23), trained on a balanced training dataset (constructed by sampling a number of nonpupylation sites equal to the number of pupylation sites) using the Bi-profile Bayes feature extraction (BPB) method. As shown in Table 2, after a preliminary evaluation, the optimal window size was 17 (BPB-SVM17), with 8 residues upstream and 8 residues downstream of the pupylation site in the protein sequence. However, when the BPB encoding scheme was replaced by the binary encoding scheme with a window size of 17 (denoted Binary-SVM17), the prediction performance was mediocre: the prediction accuracy was 20.55% lower than that of the BPB encoding scheme. This indicated that the BPB encoding scheme has an advantage over the binary encoding scheme in predicting pupylation sites. Therefore, we adopted the BPB encoding scheme in this study.

Comparison of EnsemblePup with a Single SVM Classifier.
In order to enhance the prediction performance of the pupylation site predictor, ensemble learning was used, and the final results were obtained by combining the outputs of the individual SVM classifiers. Here, we compared the performance of the ensemble of SVM classifiers with that of a single SVM classifier. Sn, Sp, Ac, and MCC were reported for all experiments. The comparison results of the two prediction models by the 5-fold cross validation test on Dataset 1 are shown in Table 3. The ensemble predictor achieved an accuracy of 80.82%, higher than the 73.97% obtained with a single SVM classifier, and the AUC value was 0.55 higher than that of the single SVM classifier. In summary, ensemble learning had an advantage in predicting pupylation sites.

Comparison of EnsemblePup with Other Methods.
We have demonstrated that EnsemblePup could achieve a promising prediction performance in the 5-fold cross validation on Dataset 1. To objectively evaluate our proposed predictor, we further compared the EnsemblePup predictor with GPS-PUP [11]. Liu et al. searched PubMed with the keywords "pupylation" and "prokaryotic ubiquitin" and collected 127 experimentally identified pupylation sites in 109 prokaryotic proteins; we named the data from Liu et al. Dataset 2 in this work (the details are listed in Table 1). The comparison results are shown in Table 4. As can be seen from the table, the EnsemblePup predictor proposed in this study obtained an accuracy of 82.00%, higher than the 80.21% accuracy of the GPS-PUP predictor, and the MCC of EnsemblePup was 0.343 greater than that of GPS-PUP.
Note that the input protein sequence must be in the FASTA format. A FASTA-format sequence consists of a single initial line beginning with a greater-than symbol (">"), followed by lines of amino acid sequence. You can click on the "example and note" button to see the example protein sequence.
(iii) Choose a threshold value in the drop-down list. For prediction with high confidence (a lower probability of false positive predictions), a high threshold should be chosen.
(iv) Click on the submit button to see the predicted result. For example, if you use the first sequence in the example page, the prediction result will be ">A0QNF6 K147 0.7450251 yes," which means that the lysine at position 147 is a pupylation site with a probability of 0.7450251. Generally, it takes about 50 seconds for the predicted result to appear for a protein sequence shorter than 1000 amino acids.
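For readers preparing input programmatically, the FASTA layout described above can be parsed with a few lines of code. This is a minimal sketch, not part of the web server: the function names are hypothetical, and using the first whitespace-delimited token of the header line as the identifier is an assumption.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each record identifier to its concatenated amino acid sequence."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            header = parts[0] if parts else ""  # identifier = first token (assumption)
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def lysine_positions(seq: str) -> List[int]:
    """1-based positions of all lysines, i.e. the candidate pupylation sites."""
    return [i + 1 for i, aa in enumerate(seq) if aa == "K"]
```

The 1-based numbering matches the way the server reports sites, such as K147 in the example result.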

Conclusion
Prediction of pupylation sites is important for understanding the molecular mechanism of pupylation in biological systems. Though some researchers have focused on this problem, the accuracy of prediction is still unsatisfactory. In this study, we have presented a new predictor, EnsemblePup, for the prediction of pupylation sites based on the Bi-profile Bayes feature extraction encoding scheme. Since the newly constructed pupylation sites dataset was highly imbalanced (the number of pupylation sites was much smaller than the number of nonpupylation sites), the ensemble learning method was adopted here to deal with the imbalanced data classification problem. The performance of EnsemblePup was measured with a sensitivity of 79.49%, a specificity of 82.35%, an accuracy of 85.43%, and a Matthews correlation coefficient of 0.617 using 5-fold cross-validation on the training dataset. When compared with other existing methods on a benchmark dataset, EnsemblePup provided better predictive performance, with a sensitivity of 80.00%, a specificity of 83.33%, an accuracy of 82.00%, and a Matthews correlation coefficient of 0.629. Experimental results have shown that our method is promising and may be a useful supplement to existing methods. Given this performance, we have made EnsemblePup freely available as a web server. Although the results obtained here are promising, further investigation is needed to clarify the mechanism of the pupylation process.

Figure 1: The bootstrap procedure for the imbalanced dataset.

Figure 2: The entire schematic diagram for the prediction of pupylation sites.

Figure 3: The Two-Sample-Logo of the position-specific residue composition surrounding the pupylation sites and nonpupylation sites. This logo was generated using the web server http://www.twosamplelogo.org/ and only residues significantly enriched or depleted around pupylation sites ($t$-test, $P < 0.1$) are shown.

Table 1: Number of pupylation and nonpupylation sites in each dataset.

Table 2: Results of the SVM prediction on Dataset 1.

Table 3: The comparison of predictive performance between a single SVM and an ensemble of SVMs using the 5-fold cross validation on Dataset 1.

Table 4: The comparison of predictive performance between our method and GPS-PUP using the leave-one-out cross validation on Dataset 2.