Protein aggregation is a biological phenomenon caused by the aggregation of misfolded proteins and is associated with a wide variety of diseases, such as Alzheimer's, Parkinson's, and prion diseases. Many studies indicate that protein aggregation is mediated by short "aggregation-prone" peptide segments. Thus, the prediction of aggregation-prone sites plays a crucial role in the research of drug targets. Compared with labor-intensive and time-consuming experimental approaches, the computational prediction of aggregation-prone sites is much more desirable due to its convenience and high efficiency. In this study, we introduce two computational approaches, Aggre_Easy and Aggre_Balance, for predicting aggregation residues from sequence information; here, the protein samples are represented by the composition of amino acid pairs (AAPs).
Protein aggregation is a phenomenon caused by the aggregation of misfolded proteins. Many studies indicate that protein aggregation can produce amyloid fibrils, which are associated with a wide variety of diseases, such as Alzheimer's, Parkinson's, and prion diseases [
Over the past ten years, a large number of computational approaches have been developed to analyze and predict aggregation propensity. Broadly, from the perspective of feature extraction, these approaches can be divided into three categories: experiment-based methods, structure-based methods, and physical-chemical attribute-based methods. For example, Aggrescan [
However, the above methods did not consider that the dataset for aggregation-prone prediction is imbalanced, and some of them rely on structural information, which entails high computational complexity. For these reasons, we develop two sequence-based approaches, Aggre_Easy and Aggre_Balance, to predict aggregation residues. In this study, the protein samples are represented by the composition of amino acid pairs (AAPs).
In this paper, we select 33 amyloidogenic proteins for the prediction of "aggregation-prone" peptides. All of the proteins are extracted from UniProt/Swiss-Prot (March 20, 2013). Moreover, in order to facilitate comparison with AMYLPRED2, we use the same dataset. For aggregation-prone peptide prediction, 25 proteins are used for training and the remaining 8 proteins for testing. Similar to [
The number of aggregation sites and nonaggregation sites in the training and testing datasets.

| Dataset | Number of proteins | Number of aggregation sites | Number of nonaggregation sites |
|---|---|---|---|
| Training dataset | 25 | 923 | 5074 |
| Testing dataset | 8 | 335 | 1499 |
| All datasets | 33 | 1258 | 6573 |
We define a possible aggregation-prone peptide
To develop a powerful predictor, it is important to formulate the protein sequences with an effective mathematical expression, one that can truly reflect their intrinsic correlation with the attribute to be predicted [
Generally, we define an aggregation-prone sample as a sequence fragment of a fixed window size centered on the target residue.
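As an illustration of the windowing and AAP representation described here, the following sketch builds a fixed-size fragment around every residue and lists the adjacent amino acid pairs in a fragment. The function names, the padding character, and the end handling are illustrative assumptions, since the exact conventions are not fully specified in the text.

```python
def window_fragments(sequence, window=7, pad="X"):
    """Extract a fixed-size fragment centered on every residue.

    Residues near the sequence ends are padded with a placeholder
    character (an illustrative choice, not the paper's stated scheme).
    """
    half = window // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + window] for i in range(len(sequence))]

def aap_list(fragment):
    """List the adjacent amino acid pairs (AAPs) in a fragment."""
    return [fragment[i:i + 2] for i in range(len(fragment) - 1)]
```

For example, `window_fragments("MKVLAT", window=5)` yields one fragment per residue, the first being `"XXMKV"`, and `aap_list("ABC")` yields `["AB", "BC"]`.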
From Section
Proposed hybrid classification approach block diagram.
The naïve Bayes classifier [
Table of AAP numbers and probabilities, listing for each AAP (aap1, aap2, aap3, …) its occurrence number and probability in each class (Class1, Class2, Class3), together with the per-class totals and the probability of the input sample.
Based on the list of AAP numbers, the trained probabilistic classifier calculates the posterior probability that a particular AAP of the sample is annotated to a particular class by using the formula (
The prior probability, Pr(Class), is estimated as the fraction of training samples belonging to the class.
Meanwhile, we calculate the evidence, Pr(AAP), from the total occurrences of the AAP across all classes.
The total occurrence of a particular AAP in every class can be calculated by searching the training database, which is composed from the lists of AAP occurrences for every class. As previously mentioned, the list of AAP numbers for a class is generated from the analysis of all training samples in the particular class during the initial training stage. The same method can be used to retrieve the sum of numbers of all samples in every class in the training database.
To calculate the likelihood of a particular AAP given a class, Pr(AAP | Class), the number of occurrences of that AAP in the class is divided by the total number of AAP occurrences in the class.
Based on the derived Bayes' formula for classification, the prior probability Pr(Class), the likelihood Pr(AAP | Class), and the evidence Pr(AAP) together yield the posterior probability Pr(Class | AAP) of each AAP in the input data being annotated to each class.
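The training and posterior computation described above can be sketched as a multinomial naïve Bayes over AAP counts. The add-one smoothing and the helper names are assumptions for illustration, not the paper's exact formulation.

```python
from collections import Counter
from math import exp, log

def train_nb(samples):
    """samples: list of (aap_list, class_label) pairs.
    Estimates the prior Pr(Class) and the per-class AAP occurrence
    counts used for the likelihood Pr(AAP | Class)."""
    class_counts = Counter()
    aap_counts = {}
    vocab = set()
    for aaps, label in samples:
        class_counts[label] += 1
        aap_counts.setdefault(label, Counter()).update(aaps)
        vocab.update(aaps)
    total = sum(class_counts.values())
    prior = {c: class_counts[c] / total for c in class_counts}
    return prior, aap_counts, vocab

def posterior(prior, aap_counts, vocab, aaps):
    """Pr(Class | sample) via Bayes' rule in log space, combining the
    likelihoods of every AAP in the sample; add-one smoothing is an
    assumption (the paper does not state its smoothing scheme)."""
    scores = {}
    for c in prior:
        denom = sum(aap_counts[c].values()) + len(vocab)
        s = log(prior[c])
        for a in aaps:
            s += log((aap_counts[c][a] + 1) / denom)
        scores[c] = s
    # convert log scores back to normalized probabilities
    m = max(scores.values())
    weights = {c: exp(s - m) for c, s in scores.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}
```

Working in log space avoids numerical underflow when a sample contains many AAPs.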
The probability for an input sample to be annotated to a particular class is then obtained by combining the posterior probabilities of all AAPs contained in the sample.
The
Liu et al. proposed the EasyEnsemble algorithm and the BalanceCascade algorithm [
Input: Training dataset D = P ∪ N (minority class set P, majority class set N); the number of subsets T
(1) Begin
(2) For i = 1 to T
(3) Create a subset N_i by randomly undersampling N such that |N_i| = |P|
(4) Use AdaBoost with the weak classifiers to train an ensemble H_i on P ∪ N_i
(5) End For
(6) Output: an ensemble like H(x) = sgn(sum_{i=1}^{T} H_i(x))
(7) End
Input: Training dataset D = P ∪ N (minority class set P, majority class set N); the number of subsets T
(1) Begin
(2) Set f = (|P|/|N|)^(1/(T-1)), the false positive rate (the rate at which a majority class example is misclassified to the minority class) that every H_i should achieve
(3) For i = 1 to T
(4) Create a subset N_i by randomly undersampling N such that |N_i| = |P|
(5) Use AdaBoost with the weak classifiers to train an ensemble H_i on P ∪ N_i
(6) Adjust the decision threshold of H_i so that its false positive rate equals f
(7) Remove from N all majority class examples correctly classified by H_i
(8) End For
(9) Output: a single ensemble like H(x) = sgn(sum_{i=1}^{T} H_i(x))
(10) End
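The balanced-undersampling step shared by the two algorithms can be sketched as follows; the AdaBoost training of each H_i is omitted, and the function name is illustrative.

```python
import random

def balanced_subsets(majority_idx, minority_idx, n_subsets, seed=0):
    """EasyEnsemble-style sampling: each training subset keeps every
    minority-class sample plus an equal-sized random draw (without
    replacement) from the majority class.  One AdaBoost ensemble would
    then be trained per subset and their outputs combined by summation."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        drawn = rng.sample(majority_idx, len(minority_idx))
        subsets.append(sorted(drawn) + list(minority_idx))
    return subsets
```

BalanceCascade differs only in that, after each iteration, the majority examples already classified correctly are removed before the next draw.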
In this study, we adopt the 10-fold cross-validation. The dataset is randomly divided into ten equal sets, out of which nine sets are used for training and the remaining one for testing. This procedure is repeated ten times and the final prediction result is the average accuracy of the ten testing sets [
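The 10-fold splitting described above can be sketched as follows; the shuffling and seeding choices are illustrative.

```python
import random

def kfold_splits(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation:
    the indices are shuffled once, split into k near-equal folds, and
    each fold serves as the test set exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```

The final reported result is then the average accuracy over the k test folds.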
Four parameters, sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC), are used to evaluate the prediction performance, where Sn = TP/(TP + FN), Sp = TN/(TN + FP), Acc is taken as the average of Sn and Sp, and MCC = (TP·TN − FP·FN)/sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)).
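As a check of the four measures, the following sketch computes Sn, Sp, Acc (taken here as the mean of Sn and Sp, which reproduces the tabulated values), and MCC from the confusion-matrix counts:

```python
from math import sqrt

def evaluate(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy (mean of Sn and Sp), and
    Matthews correlation coefficient from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (sn + sp) / 2
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, acc, mcc
```

For the Aggrescan row of the comparison table (TP = 445, TN = 5210, FP = 1363, FN = 813), this gives Sn 35.37%, Sp 79.26%, Acc 57.32%, and MCC 0.13, matching the tabulated values.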
In this research, we select 33 amyloidogenic proteins for the prediction of "aggregation-prone" peptides: 25 proteins are used for training (923 positive and 5074 negative samples), and the remaining 8 proteins are used for testing (335 positive and 1499 negative samples). The details are shown in Table
The performance of the EasyEnsemble learning algorithm on the testing dataset.

| Window size | Number of subsets T | Acc | MCC | TP | FN | TN | FP |
|---|---|---|---|---|---|---|---|
| 5 | 3 | 0.5318 | 0.0514 | 136 | 199 | 986 | 513 |
| 7 | 3 | 0.5438 | 0.0725 | 131 | 204 | 1044 | 455 |
| 9 | 3 | 0.5426 | 0.0706 | 130 | 205 | 1045 | 454 |
| 11 | 3 | 0.5457 | 0.0779 | 122 | 213 | 1092 | 407 |
| 5 | 4 | 0.5443 | 0.0720 | 140 | 195 | 1006 | 493 |
| 7 | 4 | … | … | … | … | … | … |
| 9 | 4 | 0.5347 | 0.0595 | 115 | 220 | 1090 | 409 |
| 11 | 4 | 0.5347 | 0.0598 | 112 | 223 | 1101 | 398 |
| 5 | 5 | 0.5417 | 0.0677 | 139 | 196 | 1002 | 497 |
| 7 | 5 | 0.5188 | 0.0313 | 117 | 218 | 1034 | 465 |
| 9 | 5 | 0.5303 | 0.0512 | 116 | 219 | 1069 | 430 |
| 11 | 5 | 0.5308 | 0.0542 | 104 | 231 | 1128 | 371 |
The performance of the BalanceCascade learning algorithm on the testing dataset.

| Window size | Number of subsets T | Acc | MCC | TP | FN | TN | FP |
|---|---|---|---|---|---|---|---|
| 5 | 3 | 0.5292 | 0.0506 | 107 | 228 | 1107 | 392 |
| 7 | 3 | 0.5382 | 0.0694 | 100 | 235 | 1166 | 333 |
| 9 | 3 | 0.5197 | 0.0660 | 104 | 231 | 1142 | 356 |
| 11 | 3 | 0.5357 | 0.0641 | 101 | 234 | 1152 | 347 |
| 5 | 4 | 0.5366 | 0.0649 | 106 | 229 | 1135 | 367 |
| 7 | 4 | … | … | … | … | … | … |
| 9 | 4 | 0.5319 | 0.0580 | 97 | 238 | 1163 | 336 |
| 11 | 4 | 0.5253 | 0.0489 | 81 | 254 | 1213 | 286 |
| 5 | 5 | 0.5379 | 0.0667 | 108 | 227 | 1130 | 369 |
| 7 | 5 | 0.5197 | 0.0353 | 94 | 241 | 1139 | 360 |
| 9 | 5 | 0.5293 | 0.0543 | 92 | 243 | 1175 | 324 |
| 11 | 5 | 0.5253 | 0.0470 | 81 | 254 | 1208 | 291 |
We use the hybrid classification approach (the naïve Bayes vectorizer combined with the two undersampling algorithms EasyEnsemble and BalanceCascade) to improve the classification accuracy and performance on the imbalanced dataset. For the EasyEnsemble approach, CART is used to train the weak classifiers; the number of subsets T is set to 3, 4, and 5, and the window size to 5, 7, 9, and 11.
The average performance for the different parameter settings is summarized in Tables
The average Sn of the EasyEnsemble and BalanceCascade learning algorithms is shown in Figures
The value of Sn for EasyEnsemble learning algorithm in testing dataset.
The value of Sp for EasyEnsemble learning algorithm in testing dataset.
The value of Sn for BalanceCascade learning algorithm in testing dataset.
The value of Sp for BalanceCascade learning algorithm in testing dataset.
As shown by the results in Table
The performance comparison of Aggre_Easy and Aggre_Balance with the existing predictors.

| Method | Sn (%) | Sp (%) | Acc (%) | MCC | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|
| Aggrescan | 35.37 | 79.26 | 57.32 | 0.13 | 445 | 5210 | 1363 | 813 |
| AmyloidMutants | 41.65 | 74.91 | 58.28 | 0.14 | 524 | 4924 | 1649 | 734 |
| Amyloidogenic Pattern | 13.99 | 94.95 | 54.22 | 0.12 | 176 | 6208 | 365 | 1082 |
| Average Packing Density | 28.70 | 84.12 | 56.41 | 0.12 | 361 | 5529 | 1044 | 897 |
| Beta-strand contiguity | 33.15 | 85.62 | 59.39 | 0.18 | 417 | 5628 | 945 | 841 |
| Hexapeptide Conf. Energy | 39.27 | 78.69 | 58.98 | 0.15 | 494 | 5172 | 1401 | 764 |
| NetCSSP | 51.27 | 65.22 | 58.25 | 0.12 | 645 | 4287 | 2286 | 613 |
| Pafig | 51.75 | 71.43 | 61.59 | 0.18 | 651 | 4695 | 1878 | 607 |
| SecStr | 11.37 | 94.40 | 52.88 | 0.09 | 143 | 6205 | 368 | 1115 |
| Tango | 13.67 | 95.57 | 54.62 | 0.14 | 172 | 6282 | 291 | 1086 |
| Waltz | 56.44 | 65.42 | 60.93 | 0.16 | 710 | 4300 | 2273 | 548 |
| AMYLPRED | 32.99 | 86.23 | 59.61 | 0.19 | 415 | 5668 | 905 | 843 |
| AMYLPRED2 | 39.27 | 84.48 | 61.88 | 0.22 | 494 | 5553 | 1020 | 764 |
| Aggre_Easy | 79.46 | 74.43 | 76.95 | 0.42 | 1000 | 4892 | 1681 | 258 |
| Aggre_Balance | 70.32 | 80.70 | 75.51 | 0.42 | 885 | 5304 | 1269 | 373 |
In Table
Two effective prediction servers, Aggre_Easy and Aggre_Balance, are available at
The user input interface.
An example prediction result from Aggre_Easy.
Accurate identification of aggregation residues could help to fully decipher the underlying molecular mechanisms. Though some researchers have focused on this problem, the overall prediction performance is still not satisfactory. In this paper, we develop the approaches Aggre_Easy and Aggre_Balance to predict aggregation-prone regions from the primary sequence information. Aggre_Easy achieves a promising performance with a sensitivity of 79.46%, a specificity of 74.43%, and an MCC of 0.42; the sensitivity, specificity, and MCC of Aggre_Balance reach 70.32%, 80.70%, and 0.42, respectively. Experimental results show that the Aggre_Easy and Aggre_Balance predictors perform better than several other state-of-the-art predictors and that our methods are helpful for the prediction of aggregation-prone regions.
Text S1: The dataset consists of the 33 proteins and their site information.
Text S2: The prediction results of aggregation-prone regions for Aggre_Easy, Aggre_Balance, AMYLPRED, and AMYLPRED2, for comparison. For simplicity, we remove isolated single positive prediction sites; in the future, we will propose a window redirection operator to improve the prediction performance.
The authors declare that there are no conflicts of interest.
This research is partially supported by the National Natural Science Foundation of China (61403077 and 61403076), the Fundamental Research Funds for the Central Universities (14QNJJ029), and the China Postdoctoral Science Foundation (2014M550166).