Positive-Unlabeled Learning for Pupylation Sites Prediction

Pupylation, a crucial posttranslational modification in prokaryotes, plays a key role in regulating various protein functions. To understand the molecular mechanism of pupylation, it is important to identify pupylation substrates and sites accurately. Because traditional experimental methods are time-consuming and labor-intensive, several computational methods have been developed to identify pupylation sites. In the existing computational methods, the experimentally annotated pupylation sites are used as the positive training set and the remaining nonannotated lysine residues as the negative training set to build classifiers that predict new pupylation sites in unknown proteins. However, the remaining nonannotated lysine residues may contain pupylation sites that have not been experimentally validated yet. Unlike previous methods, in this study the experimentally annotated pupylation sites were used as the positive training set, whereas the remaining nonannotated lysine residues were used as the unlabeled training set. A novel method named PUL-PUP was proposed to predict pupylation sites by using the positive-unlabeled learning technique. Our experimental results indicated that PUL-PUP significantly outperforms the other methods in predicting pupylation sites. As an application, PUL-PUP was also used to predict the most likely pupylation sites among nonannotated lysine sites.


Introduction
Recently, a prokaryotic ubiquitin-like protein (Pup) has been identified in prokaryotes [1,2]. Pup is an intrinsically disordered protein of 64 amino acids that tags target proteins for degradation [3,4]. The attachment of Pup to substrate lysine residues through isopeptide bonds is termed pupylation, which plays an important role in regulating protein degradation and signal transduction in prokaryotic cells [5]. Although pupylation and ubiquitylation are functional analogues, the enzymology involved in them is different [6]. In contrast to ubiquitylation, which requires three enzymes, E1 (activating enzyme), E2 (conjugating enzyme), and E3 (protein ligase), pupylation requires only two enzymes: the deamidase of Pup (DOP) and the proteasome accessory factor A (PafA) [7].
To understand the molecular mechanisms of pupylation, it is important to identify pupylation substrates and sites accurately. As large-scale proteomics methods [8][9][10][11] are usually time-consuming and labor-intensive, several computational methods have recently been developed to predict pupylation sites. Liu et al. developed the first predictor, GPS-PUP, on the basis of the group-based prediction system (GPS) 2.2 algorithm [12]; Tung developed a predictor, iPUP, by using the SVM algorithm and the composition of k-spaced amino acid pairs (CKSAAP) feature [13]; Chen et al. proposed an SVM-based predictor named PupPred, in which an amino acid pair feature was employed to encode lysine-centered peptides [14]. Recently, Hasan et al. introduced a profile-based composition of k-spaced amino acid pairs for the prediction of protein pupylation sites and built a web server named pbPUP [15].
Note that in the aforementioned computational methods, the experimentally annotated pupylation sites are used as the positive training set and the remaining nonannotated lysine residues are used as the negative training set to build classifiers that predict new pupylation sites in unknown proteins. However, owing to the limitations of experimental conditions and techniques, the remaining nonannotated lysine residues may contain some pupylation sites that have not been experimentally validated yet [13,14]. Thus, the classifiers are actually trained on a noisy negative set, and their performance may be lower than it would be with a clean negative set.
In contrast to existing prediction methods, in this study the experimentally annotated pupylation sites were used as the positive training set and the remaining nonannotated lysine residues were used as the unlabeled training set. We developed a novel method to predict pupylation sites by using the positive-unlabeled (PU) learning technique. This method was called PUL-PUP (PU learning for pupylation sites prediction). Experimental results show that our method significantly outperforms the other methods on both the training and test sets. As an application, the most likely pupylation sites among nonannotated lysine sites were predicted by the proposed method. The PUL-PUP Matlab software package is freely accessible at https://pul-pup.github.io/.

Dataset.
Tung's training set and independent test set [13] were used in this study. The training set consisted of 162 proteins with 183 experimentally annotated pupylation sites and 2258 nonannotated lysine sites; the independent test set consisted of 20 proteins with 29 experimentally annotated pupylation sites and 408 nonannotated lysine sites. Because pupylation occurs only at lysine residues (K), a sliding window method was used to encode every lysine residue in the dataset as a peptide fragment centered on that lysine. Following [13], the window size was set to 21 in our study.
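The sliding window step can be illustrated with a minimal Python sketch. It is not taken from the PUL-PUP package: the function name `lysine_windows` and the padding symbol `X` for positions beyond the protein termini are assumptions made here for illustration.

```python
def lysine_windows(sequence, window_size=21, pad="X"):
    """Return (position, fragment) pairs for every lysine in a sequence.

    position is 1-based; fragment has window_size residues with the lysine
    at its center. Positions beyond either terminus are filled with `pad`
    (an assumed gap symbol, not specified in the text).
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so that K sits in the center")
    half = window_size // 2
    sequence = sequence.upper()
    padded = pad * half + sequence + pad * half
    fragments = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i sits at index i + half in `padded`,
            # so its window starts at index i.
            fragments.append((i + 1, padded[i:i + window_size]))
    return fragments


# Small window used only to keep the example readable.
example = lysine_windows("MKTAYIAKQR", window_size=5)
```

With the default `window_size=21`, every fragment has ten residues on each side of the central lysine.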

Feature Extraction and Feature Selection.
The CKSAAP encoding has been widely applied to the prediction of various posttranslational modification sites [16][17][18]. In this study, the CKSAAP features [13,19] with k = 0, 1, 2, 3, and 4 were used to encode each lysine-centered fragment. Each sample was thus represented by 2205 features (21 × 21 = 441 residue pairs for each of the five values of k). As in Tung's paper [13], the chi-square test and a backward feature elimination algorithm were used to remove irrelevant and redundant features. First, the chi-square test was employed to rank the importance of the 2205 features. Then, the backward feature elimination algorithm removed the 50 lowest-ranked features in each iteration. The top 150 CKSAAP features were selected as the optimal feature set, the same as in Tung's paper [13].
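A minimal Python sketch of the CKSAAP encoding is given below. It assumes an alphabet of the 20 standard amino acids plus one gap symbol (written `X` here, a hypothetical choice) and normalizes each pair count by the number of k-spaced pairs in the fragment, as is common for CKSAAP; the exact conventions of [13] may differ.

```python
from itertools import product

# 20 standard amino acids plus an assumed gap symbol "X" (21 symbols).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 441 ordered pairs


def cksaap(fragment, k_values=(0, 1, 2, 3, 4)):
    """Encode a peptide fragment as k-spaced amino acid pair frequencies."""
    features = []
    for k in k_values:
        counts = dict.fromkeys(PAIRS, 0)
        # Number of residue pairs separated by exactly k other residues.
        n_pairs = len(fragment) - k - 1
        for i in range(n_pairs):
            pair = fragment[i] + fragment[i + k + 1]
            if pair in counts:  # ignore pairs with symbols outside ALPHABET
                counts[pair] += 1
        features.extend(counts[p] / n_pairs for p in PAIRS)
    return features


# A 21-residue fragment with lysine at the center (made-up sequence).
vector = cksaap("XXXXMTAYIAKQRQISFVHSE")
```

For a 21-residue fragment this yields 5 × 441 = 2205 values, matching the feature count stated above. The fragment must be longer than the largest k plus one, otherwise `n_pairs` is not positive.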

Development of PUL-PUP.
The experimentally annotated pupylation sites were used as the positive training set, and the remaining nonannotated lysine residues were used as the unlabeled training set to build the classifier in this study. The training set therefore consisted of two subsets: (1) the positive dataset $P$ and (2) the unlabeled dataset $U$. Thus our problem became learning from positive and unlabeled samples. We proposed a novel PU learning algorithm named PUL-PUP to predict pupylation sites. The core learning algorithm of PUL-PUP is the support vector machine (SVM), which has been widely used in various biological problems [20][21][22]. There are three stages in the PUL-PUP algorithm, as follows.
[Figure: flowchart of the PUL-PUP algorithm]
Stage 1 (selection of initial reliable negatives). PUL-PUP selected the initial reliable negative set $RN_0$ from the unlabeled set $U$ by the maximum distance rule. $RN_0$ should be located as far away from $P$ as possible, so that the reliable negative set is the most dissimilar from the positive set $P$. Distance was measured by the Euclidean distance $d(x, y)$ between samples $x$ and $y$:
$$d(x, y) = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2},$$
where $n$ is the number of features.
Stage 2 (expansion of reliable negative example set). After the selection of the initial reliable negative set, the PUL-PUP algorithm iteratively trained a series of two-class SVM classifiers and gradually extended the reliable negative set. Specifically, at the $i$th iteration, an SVM classifier $f_i$ was first trained on the positive set $P$ and the current reliable negative training set $RN_i$; then $f_i$ was used to classify the current unlabeled set $U_i$ and calculate the decision value of each sample. To guarantee the reliability of the negative set, samples with a decision value less than a threshold $T$ were selected as newly predicted negatives $U_{pred}$; here $T$ was set to −0.25. To overcome the class imbalance that arises during the iteration, the negative support vectors $N_{sv}$ and their surrounding points in $RN_i$, named $\tilde{N}_{sv}$, were used to represent the existing negative set $RN_i$, and the size of $U_{pred}$ was kept below $2|P|$. At the $(i+1)$th iteration, $U_{i+1} = U_i \setminus U_{pred}$ and $RN_{i+1} = U_{pred} \cup N_{sv} \cup \tilde{N}_{sv}$. Classifier $f_{i+1}$ was then trained on the positive set $P$ and the current reliable negative training set $RN_{i+1}$. As this process continues, $RN_i$ may contain more and more false negative examples; therefore, the iteration should be terminated at some point. The iteration was repeated until the size of $U_i$ fell below a threshold $\lambda |P|$; here $\lambda$ was set to 4.

Stage 3 (acquisition of final classifier). After the extraction of the representative reliable negative set, a final SVM classifier was trained on the positive set P and the representative reliable negative set RN.
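The three stages can be sketched in Python as follows. This is an illustrative skeleton under stated assumptions, not the PUL-PUP Matlab implementation. Stage 1 scores each unlabeled sample by its mean Euclidean distance to the positives, which is only one possible reading of the maximum distance rule. The SVM is abstracted behind a hypothetical `train` callback, and `centroid_train` is a nearest-centroid stand-in used only so that the example runs without an SVM library. Stopping early when no sample falls below the threshold is a safeguard added here. The defaults mirror the values given in the text: decision threshold −0.25, at most about 2|P| new negatives per iteration, and stopping once fewer than 4|P| unlabeled samples remain.

```python
import math


def initial_reliable_negatives(P, U, n):
    """Stage 1 (assumed scoring): take the n samples of U farthest from P."""
    def mean_dist(x):
        return sum(math.dist(x, p) for p in P) / len(P)
    order = sorted(range(len(U)), key=lambda i: mean_dist(U[i]), reverse=True)
    chosen = set(order[:n])
    RN0 = [U[i] for i in order[:n]]
    rest = [x for i, x in enumerate(U) if i not in chosen]
    return RN0, rest


def expand_reliable_negatives(P, RN0, U, train, threshold=-0.25, stop_ratio=4):
    """Stage 2: iteratively move confidently negative samples from U to RN.

    train(P, RN) must return (decision, representatives): a scoring function
    and the subset of RN kept to represent it (for an SVM, the negative
    support vectors and their surrounding points).
    """
    RN, U = list(RN0), list(U)
    max_new = 2 * len(P)                    # cap on new negatives per round
    while len(U) >= stop_ratio * len(P):    # stop once U is small enough
        decision, representatives = train(P, RN)
        scored = sorted((decision(x), i) for i, x in enumerate(U))
        picked = [i for score, i in scored if score < threshold][:max_new]
        if not picked:                      # safeguard added in this sketch
            break
        chosen = set(picked)
        RN = [U[i] for i in picked] + list(representatives)
        U = [x for i, x in enumerate(U) if i not in chosen]
    return RN, U


def centroid_train(P, RN):
    """Hypothetical stand-in for SVM training (nearest-centroid scoring)."""
    def centroid(points):
        return tuple(sum(c) / len(points) for c in zip(*points))
    cp, cn = centroid(P), centroid(RN)

    def decision(x):
        # Negative when x is closer to the negative centroid.
        return math.dist(x, cn) - math.dist(x, cp)
    return decision, list(RN)   # keeps all of RN; an SVM would keep a subset


# Toy 2-D data: two positives, seven far points, six near points.
positives = [(0, 0), (1, 0)]
far = [(9, 9), (10, 9), (9, 10), (11, 10), (10, 11), (8, 9), (10, 10)]
near = [(0, 1), (1, 1), (0.5, 0.5), (2, 0), (1, 2), (2, 2)]
RN0, U0 = initial_reliable_negatives(positives, far + near, n=1)
RN, U_left = expand_reliable_negatives(positives, RN0, U0, centroid_train)
final_decision, _ = centroid_train(positives, RN)   # stage 3
```

On this toy data the far points end up in the reliable negative set, while the points near the positives remain unlabeled; these are the candidates that may be unannotated positives.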

SVM Parameter Selection.
The core learning algorithm of PUL-PUP is support vector machine (SVM) with radial basis function (RBF) kernel. Libsvm [23] was used for training SVM models, and the grid search method was applied to tune the parameters in cross-validation.
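Performance was measured by sensitivity (Sn), specificity (Sp), accuracy (ACC), and the Matthews correlation coefficient (MCC):
$$\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$
where TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively.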
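As a worked illustration, the standard definitions of Sn, Sp, ACC, and MCC can be computed from the four counts as in the Python sketch below. The function name `evaluate` and the example counts are hypothetical and are not results from this study.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Compute Sn, Sp, ACC, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                     # sensitivity (recall on positives)
    sp = tn / (tn + fp)                     # specificity (recall on negatives)
    acc = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Made-up counts for illustration only.
scores = evaluate(tp=8, tn=7, fp=3, fn=2)
```

The sketch assumes that both classes are present, so that the denominators of Sn and Sp are nonzero.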

Performance of 10-Fold Cross-Validation on Training Set.
To evaluate the effectiveness of the selected representative reliable negative samples for pupylation site prediction, we compared our method with two other methods, SVM balance and PSoL [24], on the training set, because the core learning algorithm of our method was SVM and our method was inspired by PSoL. For the PUL-PUP and PSoL algorithms, the nonannotated lysine sites were used as the unlabeled training samples, and 10-fold cross-validation was performed on the positive set P and the representative reliable negative set RN. For SVM balance, a balanced negative training set of the same size as the positive training set was randomly selected from the nonannotated lysine sites, and 10-fold cross-validation was performed on the positive training set and the balanced negative training set to find the best SVM parameters. The 10-fold cross-validation results of the compared methods are shown in Table 1. As shown in Table 1, PUL-PUP reached the highest Sn, Sp, ACC, MCC, and AUC values of 82.24%, 91.57%, 88.92%, 0.74, and 0.92, respectively, on the training dataset. Owing to the selected representative reliable negative samples, PUL-PUP achieved an excellent performance on the training set.
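The sampling and fold assignment used for the SVM balance baseline can be sketched in Python as follows. The function names, the fixed random seed, and the use of integer identifiers in place of encoded samples are assumptions for illustration; the set sizes are those of the training set described earlier.

```python
import random


def balanced_negatives(unlabeled, n_positive, seed=0):
    """Randomly draw as many unlabeled samples as there are positives."""
    rng = random.Random(seed)
    return rng.sample(unlabeled, n_positive)


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


# Integer ids stand in for the 2258 nonannotated lysine sites.
unlabeled_ids = list(range(2258))
negative_ids = balanced_negatives(unlabeled_ids, n_positive=183)
folds = k_fold_indices(183 + len(negative_ids))
```

Each fold serves once as the validation part while the other nine are used for training. This sketch does not stratify the folds by class, which a real experiment may prefer.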

Comparison of PUL-PUP with Other Methods on Independent Test Set.
To further evaluate the performance of PUL-PUP in pupylation site prediction, we first compared it with PSoL and SVM balance on the independent test set. The results of the different methods are shown in Table 2. Although SVM balance avoids the class imbalance problem, it does not perform as well as PUL-PUP, because its negative training set is randomly selected and does not reflect the distribution of the negative set well. It should be pointed out that stage 2 of PUL-PUP is similar to the negative set expansion in PSoL. In PUL-PUP, however, RN was represented by the negative support vectors together with their surrounding points ($N_{sv} \cup \tilde{N}_{sv}$) rather than by the negative support vectors $N_{sv}$ alone. Thus, more of the information in RN is retained, which makes our algorithm more effective than PSoL.
We also compared our method with three existing pupylation site predictors, GPS-PUP [12], iPUP [13], and pbPUP [15], on the independent test set. Three thresholds, "High," "Medium," and "Low," were defined for PUL-PUP, corresponding to SVM scores higher than 0.9672, 0.4032, and 0.1088, respectively. The performances of PUL-PUP and the three existing predictors are shown in Table 3. As can be seen from Table 3, our algorithm significantly outperformed the three existing predictors. Taking the "Medium" threshold as an example, the MCC of PUL-PUP (0.24) was higher than that of GPS-PUP (0.14), iPUP (0.16), and pbPUP (0.07). Moreover, PUL-PUP achieved the highest AUC value (0.77). We attribute this improvement to the iterative training of our classifier on the positive set and the reliable negative set. These results indicate that PUL-PUP is more suitable for predicting pupylation sites than the other methods.
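The three thresholds can be applied to an SVM score as in the short Python sketch below. The cutoff values are those given above; the function names and the strict "greater than" comparison are assumptions about how the thresholds are used.

```python
# Cutoffs from the text, ordered from most to least stringent.
THRESHOLDS = (("High", 0.9672), ("Medium", 0.4032), ("Low", 0.1088))


def confidence_level(score):
    """Return the most stringent threshold that the score exceeds, or None."""
    for name, cutoff in THRESHOLDS:
        if score > cutoff:
            return name
    return None


def is_predicted_site(score, level="Medium"):
    """Report whether a lysine is predicted as pupylated at a given level."""
    return score > dict(THRESHOLDS)[level]
```

A site that passes the "High" threshold also passes "Medium" and "Low", so choosing a lower threshold trades specificity for sensitivity.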

Prediction of the Most Likely Pupylation Sites in Nonannotated Lysine Sites.
For the 183 pupylated proteins in PupDB [6], there are 212 experimentally annotated pupylation sites and 2666 nonannotated lysine sites. As mentioned earlier, those nonannotated lysine sites may contain some pupylation sites that have not been experimentally validated yet. To predict the most likely pupylation sites among the nonannotated lysine sites, we ran the PUL-PUP algorithm on all data in PupDB. The top 20 most likely pupylation sites among the nonannotated lysine sites are listed in Supplementary S1 (see Supplementary Material available online at http://dx.doi.org/10.1155/2016/4525786). These predictions are only hypotheses; whether those sites are actually pupylated remains to be experimentally verified.

Conclusions
In this study, we have developed a novel pupylation site prediction method, PUL-PUP, by using PU learning. To the best of our knowledge, this is the first time PU learning has been applied to predict pupylation sites. Experimental results have shown that our method significantly outperformed the existing pupylation site predictors. Moreover, the most likely pupylation sites among nonannotated lysine sites were predicted by using PUL-PUP. We believe that our method can also be applied to predict other types of posttranslational modification sites. In future research, we will develop a web server for PUL-PUP.