Prediction of "Aggregation-Prone" Peptides with Hybrid Classification Approach

Protein aggregation is a biological phenomenon caused by the aggregation of misfolded proteins and is associated with a wide variety of diseases, such as Alzheimer's, Parkinson's, and prion diseases. Many studies indicate that protein aggregation is mediated by short "aggregation-prone" peptide segments. Thus, the prediction of aggregation-prone sites plays a crucial role in the research of drug targets. Compared with labor-intensive and time-consuming experimental approaches, the computational prediction of aggregation-prone sites is much more desirable due to its convenience and high efficiency. In this study, we introduce two computational approaches, Aggre_Easy and Aggre_Balance, for predicting aggregation residues from sequence information; here, the protein samples are represented by the composition of k-spaced amino acid pairs (CKSAAP). We then use a hybrid classification approach to predict aggregation-prone residues, which integrates the naïve Bayes classification to reduce the number of features and two undersampling approaches, EasyEnsemble and BalanceCascade, to deal with the sample imbalance problem. Aggre_Easy achieves a promising performance with a sensitivity of 79.47%, a specificity of 74.43%, and an MCC of 0.42; the sensitivity, specificity, and MCC of Aggre_Balance reach 70.32%, 80.70%, and 0.42, respectively. Experimental results show that the performance of the Aggre_Easy and Aggre_Balance predictors is better than that of several other state-of-the-art predictors. A user-friendly web server is built for the prediction of aggregation-prone residues and is freely accessible to the public.


Introduction
Protein aggregation is a phenomenon caused by the aggregation of misfolded proteins. Many studies indicate that protein aggregation can produce amyloid fibrils, which are associated with a wide variety of diseases, such as Alzheimer's, Parkinson's, and prion diseases [1]. Although amyloidogenic proteins share neither sequence homology nor common native fold patterns, they are remarkably similar in β-sheet structure [1]. Experiments demonstrate that protein aggregation is mediated by short "aggregation-prone" peptide segments, so the identification of aggregation-prone regions in protein sequences is the key to understanding the protein aggregation phenomenon. As we know, traditional experimental identification and characterization of aggregation-prone regions are labor-intensive and expensive. Therefore, the computational prediction of aggregation residues has attracted more and more attention in the past few years.
Over the past ten years, a large number of computational approaches have been developed to analyze and predict aggregation-prone regions. Broadly, from the perspective of feature extraction, these approaches can be divided into three categories: experiment-based methods, structure-based methods, and physical-chemical attribute-based methods. For example, Aggrescan [2] proposed by Conchillo-Solé et al. and the saturation mutagenesis analysis [3] performed by López de la Paz and Serrano were both validated by experiments. Among structure-based methods, Galzitskaya et al. [4] used a new parameter, "mean packing density," to detect both amyloidogenic and disordered regions in a protein sequence; SALSA [5], Hexapeptide Conf. Energy [1], and SecStr [6] were proposed based on β-sheet structure analysis; NetCSSP [7], developed by Kim et al., used the CSSP algorithm and 3D structure to predict amyloid fibril formation. On the other hand, physical-chemical attribute-based methods include Pafig [8] proposed by Tian et al.
and Tango [9]. In addition, consensus methods have been developed, such as AMYLPRED [10] and AMYLPRED2 [11], which integrate 5 and 11 individual predictors, respectively. However, the above methods did not consider that the dataset for aggregation-prone prediction is imbalanced, and some of them rely on structural information, which has high computational complexity. For these reasons, we develop two approaches, Aggre_Easy and Aggre_Balance, based on sequence information to predict aggregation residues. In this study, the protein samples are represented by the composition of k-spaced amino acid pairs (CKSAAP) [12-14]. Then, we use a hybrid classification approach to solve the sample imbalance problem. The hybrid classification approach integrates the naïve Bayes classification to reduce the number of features and an undersampling strategy to deal with the class-imbalance problem; two undersampling algorithms, EasyEnsemble and BalanceCascade, are both utilized in this paper. Aggre_Easy achieves a promising performance with a sensitivity of 79.47%, a specificity of 74.43%, and an MCC of 0.42; the sensitivity, specificity, and MCC of Aggre_Balance reach 70.32%, 80.70%, and 0.42, respectively. Experimental results show that the performance of the Aggre_Easy and Aggre_Balance predictors is better than that of several other state-of-the-art predictors. A user-friendly web server is built for the prediction of aggregation-prone residues and is freely accessible to the public at the following website: http://202.198.129.220:8080/AggrePrediction/.

Materials and Method
2.1. Datasets. In this paper, we select 33 amyloidogenic proteins to predict "aggregation-prone" peptides. All the proteins are extracted from UniProt/Swiss-Prot (March 20, 2013). Moreover, in order to facilitate comparison with AMYLPRED2, we select the same dataset. For aggregation-prone peptide prediction, 25 proteins are used for training and the remaining 8 proteins for testing. Similar to [11], all experimentally verified aggregation sites in this paper are regarded as positive samples, and the other nonaggregation sites in the same proteins are taken as negative samples (see Supporting Information Text S1 in the Supplementary Material available online at http://dx.doi.org/10.1155/2015/857325). The number of protein samples in each dataset is shown in Table 1.
We define a possible aggregation-prone peptide of length (2×w+1), in which the aggregation site is flanked by "w" residues upstream and "w" residues downstream. In this paper, we select four different values of w (2, 3, 4, and 5), so the window sizes are 5, 7, 9, and 11. If the aggregation site is located at the N- or C-terminus of the protein and the length of the peptide is smaller than (2×w+1), one or multiple "O" characters are added to pad the peptide to length (2×w+1).
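The window extraction described above can be sketched as a short Python helper; this is an illustrative reimplementation (function and variable names are ours, not from the paper), assuming 0-based residue indexing.

```python
def extract_window(sequence, site, w):
    """Extract the (2*w+1)-residue peptide centered on `site` (0-based).

    Positions that fall outside the N- or C-terminus are padded with
    the placeholder character 'O', as described in the text.
    """
    peptide = []
    for pos in range(site - w, site + w + 1):
        if 0 <= pos < len(sequence):
            peptide.append(sequence[pos])
        else:
            peptide.append("O")  # pad beyond the terminus
    return "".join(peptide)

# w = 2 gives a window size of 2*2 + 1 = 5
print(extract_window("MKVLAA", 0, 2))  # site at the N-terminus -> "OOMKV"
print(extract_window("MKVLAA", 3, 2))  # interior site -> "KVLAA"
```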

Protein Encoding Schema.
To develop a powerful predictor, an effective mathematical expression to formulate the protein sequences plays an important role, as it should truly reflect their intrinsic correlation with the attribute to be predicted [15, 16]. In this research, we use the encoding scheme based on the composition of k-spaced amino acid pairs (CKSAAP) [12-14], which has been successfully used for predicting several types of posttranslational modification (PTM) sites (e.g., palmitoylation sites [13], ubiquitination sites [12], and phosphorylation sites [17]). We describe the detailed procedures as follows.
Generally, we define an aggregation-prone region with a sequence fragment of (2×w+1) amino acids. Over the 21-character alphabet (the 20 amino acids plus the padding character "O"), there are 441 possible amino acid pair types (i.e., AA, AC, AD, ..., OO). The pairs are extended to k-spaced amino acid pairs, that is, pairs that are separated by k other amino acids. A fragment can then be described by the feature vector (N_AA, N_AC, N_AD, ..., N_OO)_441, where, for instance, N_AA denotes the number of times the pair A-A occurs, amino acid A being separated from the second amino acid A by k other residues in the sequence fragment. In this study, based on previous experience, the amino acid pairs for all spacings up to k are considered jointly, with candidate values k = 3, 4, and 5, so the total dimension of the proposed feature vector is 441 × (k + 1).
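A minimal sketch of this CKSAAP encoding follows; it is our own illustrative implementation (the function name and the assumption that spacings 0..k_max are concatenated are ours), consistent with the 441 × (k + 1) dimension stated above.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + "O"  # 20 amino acids + padding character
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 21*21 = 441 pair types

def cksaap(peptide, k_max):
    """Encode a peptide as CKSAAP counts for spacings k = 0..k_max.

    Returns a vector of length 441 * (k_max + 1): for each spacing k,
    the number of occurrences of every residue pair separated by k
    intervening residues.
    """
    vector = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        for i in range(len(peptide) - k - 1):
            counts[peptide[i] + peptide[i + k + 1]] += 1
        vector.extend(counts[p] for p in PAIRS)
    return vector

vec = cksaap("OOMKVAA", k_max=4)
print(len(vec))  # 441 * 5 = 2205
```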

Hybrid Classification Approach.
From Section 2.1, we can see that the negative samples are about five times as numerous as the positive samples, so a traditional learning algorithm, such as SVM, cannot achieve good performance on this kind of imbalanced dataset. For the huge number of features produced by the CKSAAP encoding, many feature selection methods have been proposed to overcome this problem by reducing the dimension of the features. In this paper, we design a hybrid classification approach integrating naïve Bayes classification and two undersampling methods, EasyEnsemble and BalanceCascade, to predict aggregation sites; a similar scheme has been successfully used for text document classification [18, 19]. It takes advantage of both the simplicity of the Bayes technique and the efficient strategy of undersampling to deal with the class-imbalance problem. In Figure 1, the black frame illustrates the process of the hybrid classification approach. Firstly, all training proteins or peptides are represented by the CKSAAP encoding schema. Secondly, the compositions of k-spaced amino acid pairs are used as the input data for the Bayes classification, whose output dimension is based on the number of available classes in the classification task [20]. Finally, we use the undersampling approaches to build the predictors. For a query protein, the trained model is used to predict whether or not it contains aggregation-prone residues. The details are given in the following sections.

The Bayes Classification Approach.
The naïve Bayes classifier [21] starts with the initial step of encoding the sample by extracting the composition of k-spaced amino acid pairs (CKSAAP). The list of AAPs (amino acid pairs) is constructed with the assumption that the input data contains aap_1, aap_2, ..., aap_{n-1}, aap_n, where n is the CKSAAP encoding schema length. It is used to create a table containing the probabilities of each amino acid pair (AAP) in each class; Table 2 shows the details.
Based on the list of AAP numbers, the trained probabilistic classifier calculates the posterior probability of a particular AAP of the sample being annotated to a particular class by Bayes' formula, Pr(Class_j, aap_i) = Pr(aap_i, Class_j) Pr(Class_j) / Pr(aap_i), (1) since each AAP in the input sample contributes to the sample's class probability. The total occurrence of a particular AAP in every class can be calculated by searching the training database, which is composed of the lists of AAP occurrences for every class. As previously mentioned, the list of AAP numbers for a class is generated from the analysis of all training samples in that class during the initial training stage. The same method can be used to retrieve the sum of the numbers of all samples in every class in the training database.
To calculate the likelihood of a particular Class_j with respect to a particular aap_i, the lists of AAP numbers in the training database are searched to retrieve the number of aap_i in Class_j and the sum of all AAPs in Class_j. This information contributes to the value of Pr(aap_i, Class_j), given by Pr(aap_i, Class_j) = numbers of aap_i in Class_j / Σ numbers of all AAPs in Class_j.
Based on the derived Bayes' formula for classification, together with the value of the prior probability Pr(Class), the likelihood Pr(AAP, Class), and the evidence Pr(AAP), the posterior probability Pr(Class, AAP) of each AAP in the input data being annotated to each class can be measured.
The probability for an input sample to be annotated to a particular Class_j is calculated by dividing the sum of the "Probability" column by the length of the query, n: Pr(Class_j, Sample) = (Σ_{i=1..n} Pr(Class_j, aap_i)) / n, where aap_1, aap_2, aap_3, ..., aap_{n-1}, aap_n are the AAPs extracted from the input sample. Pr(Class, Sample) is the probability value for a sample to be annotated to a class. If the class list is Class_1, Class_2, Class_3, ..., Class_m, each sample has m associated probability values: Pr(Class_1, Sample), Pr(Class_2, Sample), Pr(Class_3, Sample), ..., Pr(Class_m, Sample). All the probability values of a sample are combined to construct a multidimensional array, which represents the probability distribution of the sample in the vector space. In this way, all the training samples are vectorized into their probability distributions in vector space, in the format of numerical multidimensional arrays, with the number of dimensions depending on the number of classes.

EasyEnsemble and BalanceCascade.
Liu et al. proposed the EasyEnsemble and BalanceCascade algorithms [22], two undersampling algorithms widely used for classification tasks. The EasyEnsemble algorithm independently samples several subsets from the majority class; for each subset, a classifier is built with AdaBoost [23], and all generated classifiers are combined into an ensemble learning system for the final decision. The BalanceCascade algorithm, which depends on supervised learning, sequentially extracts examples from the majority class and then creates ensemble classifiers with the resulting training datasets [24]. The pseudocode for EasyEnsemble and BalanceCascade is shown in Algorithms 1 and 2.

Evaluation.
In this study, we adopt 10-fold cross-validation. The dataset is randomly divided into ten equal sets, of which nine are used for training and the remaining one for testing. This procedure is repeated ten times, and the final prediction result is the average accuracy over the ten testing sets [25-32].
Four metrics, sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC), are used to measure prediction performance: Sn = TP / (TP + FN), Sp = TN / (TN + FP), Acc = (TP + TN) / (TP + TN + FP + FN), and MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
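These standard confusion-matrix metrics can be computed directly; the snippet below is a straightforward helper (the function name is ours) implementing the usual definitions.

```python
import math

def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and Matthews correlation
    coefficient from confusion-matrix counts (standard definitions)."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc

sn, sp, acc, mcc = metrics(tp=80, tn=70, fp=30, fn=20)
print(round(sn, 2), round(sp, 2), round(acc, 2), round(mcc, 2))
```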

The Performance in the Testing Dataset.
In this research, we select 33 amyloidogenic proteins for the prediction of "aggregation-prone" peptides: 25 amyloidogenic proteins are selected for training (923 positive samples and 5074 negative samples), and the remaining 8 amyloidogenic proteins are selected for testing (335 positive samples and 1499 negative samples). The details are shown in Table 1. We define a possible aggregation-prone peptide of length (2×w+1) centered on the aggregation site; "w" is 3, 4, or 5, so the window size is 7, 9, or 11. Next, we use the encoding scheme based on the composition of k-spaced amino acid pairs (CKSAAP) to formulate the aggregation-prone peptide, with "k" equal to 3, 4, or 5. In Tables 3 and 4, we compare the values of MCC to determine the best values of w and k. We use the hybrid classification approach (the naïve Bayes vectorizer and the two undersampling algorithms EasyEnsemble and BalanceCascade) to improve the classification accuracy and performance on the imbalanced dataset. For the EasyEnsemble approach, CART is used to train the weak classifiers; the number of subsets is 4, and the number of iterations in each AdaBoost ensemble is 10; the same parameters are used for the BalanceCascade approach. Meanwhile, we perform 10-fold stratified cross-validation. Within each fold, the classification method is repeated 10 times, considering that the sampling of subsets introduces randomness. The whole cross-validation process is repeated 10 times, and the averages of these 10 cross-validations give the final performance of the method.
The average performance for the different parameters is summarized in Tables 3 and 4. When the window size is 7 and the k value is 4, the value of MCC is the highest: 0.0827 for the EasyEnsemble learning algorithm and 0.0738 for the BalanceCascade learning algorithm on the testing dataset. Thus, we select 7 (window size) and 4 (the k value) as the final parameters of the classifier, which is used to compare with other predictors by 10-fold cross-validation on all datasets.
The average Sn of the EasyEnsemble and BalanceCascade learning algorithms is shown in Figures 2 and 4. When the window size is smaller, the value of Sn is higher; for example, with window sizes of 5 and 7, the Sn is about 0.39~0.41 for EasyEnsemble and 0.27~0.32 for BalanceCascade; on the contrary, with window sizes of 9 and 11, the Sn is about 0.34~0.38 for EasyEnsemble and 0.24~0.31 for BalanceCascade. In Figures 3 and 5, the average Sp of the EasyEnsemble and BalanceCascade learning algorithms is summarized: it is about 0.66~0.70 for EasyEnsemble and 0.73~0.77 for BalanceCascade when the window size is 5 or 7, and about 0.69~0.75 for EasyEnsemble and 0.76~0.80 for BalanceCascade when the window size is 9 or 11. This indicates that a smaller window size is beneficial for predicting positive samples, whereas a larger window size brings in more redundant information. What is more, the value of Sn is about 10% higher for EasyEnsemble than for BalanceCascade, while the value of Sp is about 7% lower; this illustrates that EasyEnsemble improves the prediction sensitivity, while BalanceCascade improves the prediction specificity.

Comparison with Other Predictors.
As the results in Table 5 show, the prediction sensitivity and MCC of Aggre_Easy and Aggre_Balance are the highest compared with the other predictors: the Sn is 79.47% and the MCC is 0.42 for Aggre_Easy, and the Sn is 70.32% and the MCC is 0.42 for Aggre_Balance. This indicates that our predictors perform well at identifying positive samples in the imbalanced dataset. However, the value of specificity is lower than that of some other methods. For Aggre_Easy, the specificity (Sp = 74.43%) is lower than that of Amyloidogenic Pattern, Average Packing Density, Beta-strand contiguity, SecStr, Tango, AMYLPRED, and AMYLPRED2, slightly lower than that of Aggrescan, AmyloidMutants, and Hexapeptide Conf. Energy, and higher than that of NetCSSP, Pafig, and Waltz. For Aggre_Balance, the specificity (Sp = 80.70%) is lower than that of Amyloidogenic Pattern, Average Packing Density, Beta-strand contiguity, SecStr, Tango, AMYLPRED, and AMYLPRED2 and higher than that of the other methods. More importantly, the reasonably good performance of Aggre_Easy and Aggre_Balance reflects that the method effectively captures the information of aggregation sites, and we propose that the hybrid classification approach can take advantage of both the simplicity of the Bayes technique and the sensitivity of the undersampling ensemble learning algorithms. In Table 5, the number of false positives (FP) is large; the main reason is that only a relatively small portion of the residues have been studied and confirmed experimentally to be amyloidogenic [11]. On the other hand, we propose to introduce a window redirection operator to improve the prediction performance in the future. In the web server, the models trained on the datasets with the optimal parameters are used to predict sites in submitted sequences. As displayed in Figures 6 and 7, users can submit uncharacterized sequences in FASTA format, and the system returns the prediction results. A region in the polypeptide sequence is considered aggregation-prone if 5 or more sequentially continuous residues are predicted to be aggregation-prone.

Conclusion
Accurate identification of aggregation residues could help fully decipher the molecular mechanisms of protein aggregation. Though some researchers have focused on this problem, the overall prediction performance is still not satisfactory. In this paper, we develop two approaches, Aggre_Easy and Aggre_Balance, to predict aggregation-prone residues from the primary sequence information. Aggre_Easy achieves a promising performance with a sensitivity of 79.47%, a specificity of 74.43%, and an MCC of 0.42; the sensitivity, specificity, and MCC of Aggre_Balance reach 70.32%, 80.70%, and 0.42, respectively. Experimental results show that the performances of the Aggre_Easy and Aggre_Balance predictors are better than those of several other state-of-the-art predictors, and our methods are helpful for the prediction of aggregation-prone residues.

Figure 2: The value of Sn for the EasyEnsemble learning algorithm in the testing dataset.
Figure 3: The value of Sp for the EasyEnsemble learning algorithm in the testing dataset.

Figure 6: The interface of user input.

Figure 7: An example of a prediction result from Aggre_Easy.

Table 1: The number of aggregation sites and nonaggregation sites in the training and testing datasets.
The prior probability Pr(Class_j) can be computed from Pr(Class_j) = Total of AAPs in Class_j / Total of AAPs in Training Dataset. (2) Meanwhile, we calculate Pr(aap_i), the probability of each aap_i over all classes, expressed as Pr(aap_i) = Σ numbers of aap_i in all Classes / Σ numbers of all AAPs in all Classes.

Table 3: The performance of the EasyEnsemble learning algorithm on the testing dataset.

Table 4: The performance of the BalanceCascade learning algorithm on the testing dataset.
Aggregation-Prone Prediction. Two effective prediction servers, Aggre_Easy and Aggre_Balance,

Table 5: The performance comparison of Aggre_Easy and Aggre_Balance with 12 existing predictors.