QuaBingo: A Prediction System for Protein Quaternary Structure Attributes Using Block Composition

Background. Quaternary structures of proteins are closely related to gene regulation, signal transduction, and many other biological functions of proteins. In the current study, a new feature extraction method based on the composition of conserved protein motifs in block format, termed block composition, is proposed. Results. QuaBingo, a prediction system for protein quaternary assembly states that combines block composition with functional domain composition, is built from three layers of classifiers and categorizes the quaternary structural attributes of monomers, homooligomers, and heterooligomers. The first layer uses support vector machines (SVM) trained on the blocks and functional domains of proteins, and a second-layer SVM processes the outputs of the first layer. The final result is determined by a Random Forest in the third layer. We compared the effectiveness of combinations of block composition, functional domain composition, and pseudoamino acid composition. Across 11 kinds of functional protein families, QuaBingo achieved a Matthews Correlation Coefficient (MCC) 23% higher than that of the existing prediction system. The results also revealed the biological characterization of the top five block compositions. Conclusions. QuaBingo provides better predictive ability for the quaternary structural attributes of proteins.


Background
Proteins are responsible for a vast amount of biological synthesis, enzyme catalysis, transport of molecules, and other functions in cells, and their specific functions are closely associated with molecular structure. Protein structure can be divided into four levels, from primary to quaternary structure. Many important biological functions are achieved only when protein monomers polymerize to form oligomeric proteins or higher order multimeric proteins. The concept of protein quaternary structure was first presented by Bernal in 1958 [1,2], who found that some protein compositions and structures were more complicated than others. These proteins were shown to be composed of several protein subunits that form biological macromolecules. The subunits of a quaternary structure are held together by noncovalent bonds, and the structure can be classified according to the type of subunit. If the protein complex consists of identical subunits, it is called a homooligomer; otherwise it is referred to as a heterooligomer. Classification based on the number of subunits gives dimers, trimers, tetramers, and so forth [3]. Examples include (1) insulin, which can form a homodimer; (2) tumor necrosis factor-α (tumor necrosis factor-alpha), which forms a tight trimer; and (3) human hemoglobin, a heterotetramer with two identical α subunits and two identical β subunits. An excellent review summarized what is known about the biological functions of nonhomologous homodimer and heterodimeric complexes [4]. For example, thymidylate synthase, a homodimeric protein, is highly conserved among distant species. The structure of the tertiary complex of thymidylate synthase revealed an asymmetrical conformation of two homodimers (PDB ID: 4EB4). The closed and open forms of a molecule of the complex dimer may affect the ligand binding strength [5].
In addition, HIV-1 reverse transcriptase is a well-known drug target for treating HIV infections (PDB ID: 3HVT) [6]. Heterodimerization of the P66 and P51 subunits of HIV-1 reverse transcriptase is required for DNA polymerase activity.
Although there has been significant progress in the analysis of protein structure with various experimental approaches, experiments to determine protein structure are typically expensive and time-consuming. Consequently, it is necessary to develop a prediction system for protein quaternary assembly states that enables the analysis of protein structure and function using the current and rapidly increasing amount of sequence data. In previous studies, Garian predicted homodimers and nonhomodimers using a decision tree and an amino acid composition method that integrated AAindex. Zhang utilized support vector machines (SVM) and a weighted autocorrelation function in an attempt to identify the key features from the amino acid composition. These studies demonstrated that the primary structure does carry the information needed for quaternary structure formation [7,8]. However, the general feature encoding method of amino acid composition loses much important protein sequence information, such as the physical and chemical properties of amino acids. Therefore, pseudoamino acid composition (PseAAC) was used to predict quaternary structure. This feature not only incorporates the sequence order effect but also reflects hydrophobic and hydrophilic properties [9]. Zhang et al. used PseAAC to develop sequence-segmented PseAAC and combined segments of the protein sequence and domain relationships in an effort to improve prediction results [10]. In recent years, functional domain composition was presented from an evolutionary and functional perspective, because proteins that share similar domain structures often have similar functions [11][12][13]. This method is suitable for multiclass quaternary structure classification problems and can greatly improve prediction performance. However, a disadvantage is that some proteins may not contain any known functional domains.
In fact, the corresponding known functional domains are too few to represent proteins, which results in a classifier being unable to learn effectively. These problems arise because the current databases are still incomplete.
The objective of this study is to construct an accurate prediction system for protein quaternary structure attributes. Although previous studies have shown that functional domain composition achieves high prediction accuracy, the functional domain method has problems that need to be overcome. Accordingly, we attempt to improve this feature extraction method with a concept based on homologous regions of protein sequences, that is, block composition, which is proposed to represent protein characteristics. Since protein interaction binding sites usually have a larger exposed hydrophobic area and higher solvent accessibility, we also combine amino acid solvent accessibility information with pseudoamino acid composition to calculate the sequence order effect. The system is a three-layer prediction classifier framework. The first layer classifier identifies whether the unknown protein sequence is a monomer, dimer, trimer, tetramer, pentamer, hexamer, octamer, decamer, or dodecamer. Then, the first layer result for each class serves as input for the second layer classifier, which integrates the different features, weighing the predictive strengths and weaknesses of each protein feature, to enhance prediction accuracy. Finally, the third layer classifier determines the structure type of the query protein.
Cross-validation results show that block composition gives the best predictions. Specifically, the overall average prediction accuracy is more than 90% at 60% sequence similarity for each class. The accuracies of functional domain composition and PseASA are lower by about 10% and 20%, respectively. The results show that block composition is able to effectively identify quaternary structure assembly states. In addition, performance analysis of different types of functional proteins revealed that QuaBingo exhibits superior predictive ability for proteins involved in enzyme activity, gene regulation, signal transduction, molecular binding, and other important functions. An online web server is freely available at http://predictor.nchu.edu.tw/QuaBingo/.

Compilation of Datasets.
The protein oligomer sequences used in this study come from 3D Complex [14], a protein quaternary structure classification database. This database provides protein structures, structure types, symmetry patterns, and other relevant information. We searched 3D Complex for homo- and heterooligomers of each class, and information regarding the corrected number of subunits was utilized to construct the database. The following processing steps were performed: (1) remove oligomer sequences shorter than 30 amino acids; (2) remove sequences containing three or more unknown amino acids; and (3) use CD-HIT [15,16] to remove redundant sequences at a sequence identity threshold of 60%, to avoid prediction bias. However, the pentamer, octamer, decamer, and dodecamer classes were processed with CD-HIT at 90% so that enough sequences remained for statistical significance. The final database had 8,444 sequences and was named Oli8444. It was employed as the training dataset of the first and second layer classifiers. Specifically, there were 3,273 monomers, 3,658 homooligomers, and 1,513 heterooligomers. Apart from monomers, the homo- and heterooligomers have eight individual subcategories, that is, dimer, trimer, tetramer, pentamer, hexamer, octamer, decamer, and dodecamer. Heptamer and undecamer sequences were not used because too little information is available. The serial numbers of each category are listed in Supplementary Table S1 in Supplementary Material available online at http://dx.doi.org/10.1155/2016/9480276. To obtain more representative sequences, the training data for the third layer classifier were derived from the Oli8444 training set by processing with CD-HIT at 40% and further removing sequences that belong to more than one type of oligomer; the resulting set was named Oli6926. However, categories with too few sequences, such as pentamer, octamer, decamer, and dodecamer, were not subjected to the CD-HIT 40% treatment.
The independent test set was collected from the sequences of Oli6926 that were not used for training.
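The first two processing steps can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, an "unknown amino acid" is assumed to mean any character outside the 20 standard one-letter codes, and the CD-HIT redundancy removal of step (3) is an external tool that is not reproduced here.

```python
# Hypothetical helpers; only the two thresholds (length 30, three unknowns)
# come from the text.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def count_unknown(sequence):
    """Count residues that are not one of the 20 standard amino acids."""
    return sum(1 for residue in sequence.upper()
               if residue not in STANDARD_AMINO_ACIDS)


def passes_filters(sequence, min_length=30, max_unknown=2):
    """Step (1): drop sequences shorter than min_length.
    Step (2): drop sequences with three or more unknown residues."""
    return (len(sequence) >= min_length
            and count_unknown(sequence) <= max_unknown)


def filter_sequences(records):
    """records: dict mapping a sequence identifier to its sequence string."""
    return {name: seq for name, seq in records.items() if passes_filters(seq)}
```

Calling `filter_sequences` on a dictionary of raw sequences returns only the entries that would be passed on to the redundancy-removal step.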

Block Composition (Block).
A motif is a small and highly conserved sequence in the secondary structure, which is usually associated with protein function; a protein can contain multiple motifs. The Blocks database [17][18][19] is a protein motif database, based on SWISS-PROT and Prosite, that computes ungapped multiple alignments of protein sequences and stores the short segments of high sequence similarity as blocks. Because this feature extraction method is based on searching a sequence against the Blocks database, it is termed block composition. The Blocks database currently contains 29,068 protein blocks. A protein $P$ can therefore be represented as a vector in a 29,068-dimensional space by (1):

$$\mathrm{Block}(P) = \left[ b_1, b_2, \ldots, b_i, \ldots, b_{29068} \right]^{T}. \tag{1}$$

If the protein $P$ matches the $i$th block in the Blocks database, $b_i$ is 1; otherwise it is 0. The rule is defined by (2):

$$b_i = \begin{cases} 1, & \text{if } P \text{ matches the } i\text{th block}, \\ 0, & \text{otherwise}. \end{cases} \tag{2}$$
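As an illustration of this encoding, the sketch below builds the 0/1 vector from a set of matched block identifiers. It assumes the block search has already been run elsewhere and that its matches are available as a set; the function name and the two `HYPOTHETICAL_*` identifiers are invented for the example, while IPB006052A is a block mentioned later in the paper.

```python
def block_composition(matched_blocks, block_index):
    """Return a 0/1 list with one entry per block.

    matched_blocks: identifiers of blocks found in the query protein.
    block_index: dict mapping each block identifier to its vector position.
    """
    vector = [0] * len(block_index)
    for block_id in matched_blocks:
        position = block_index.get(block_id)
        if position is not None:  # ignore identifiers outside the index
            vector[position] = 1
    return vector


# Toy index with three entries; the real database has 29,068 blocks.
block_index = {"IPB006052A": 0, "HYPOTHETICAL_1": 1, "HYPOTHETICAL_2": 2}
vector = block_composition({"IPB006052A"}, block_index)
```

Because most proteins match only a handful of blocks, a real implementation would probably store these vectors in a sparse format.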

Functional Domain Composition (FunD).
Proteins usually consist of one or more functional domains. When the same functional domains are discovered in different proteins, this indicates that they may have the same evolutionary origin and function. Version v3.10 of CDD [20] contains 44,354 protein domains and families and includes several external source databases (Pfam [21], SMART [22], KOG [23], COG [23], PRK [24], and TIGR [25]). We use a conservative threshold of $E$-value < 0.01 to identify which functional domains are found in a query protein $P$. The protein can then be expressed as a feature vector FunD in a 44,354-dimensional space by (3):

$$\mathrm{FunD}(P) = \left[ d_1, d_2, \ldots, d_i, \ldots, d_{44354} \right]^{T}. \tag{3}$$

If $d_i$ is 1, the $i$th domain in CDD is found in $P$; otherwise it is 0. The rule is defined by (4):

$$d_i = \begin{cases} 1, & \text{if the } i\text{th domain is found in } P, \\ 0, & \text{otherwise}. \end{cases} \tag{4}$$
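The thresholding step can be sketched as follows. The 0.01 cutoff comes from the text and is assumed here to be an E-value cutoff; the domain search itself is not reproduced, and the hit format (pairs of domain identifier and E-value), the function name, and the example identifiers are all hypothetical.

```python
def functional_domain_composition(hits, domain_index, evalue_cutoff=0.01):
    """Return a 0/1 list marking the domains found below the cutoff.

    hits: iterable of (domain_id, evalue) pairs from a domain search.
    domain_index: dict mapping each domain identifier to its vector position.
    """
    vector = [0] * len(domain_index)
    for domain_id, evalue in hits:
        # Strict inequality: a hit exactly at the cutoff is not counted.
        if evalue < evalue_cutoff and domain_id in domain_index:
            vector[domain_index[domain_id]] = 1
    return vector


# Toy index of three domains; CDD v3.10 has 44,354 entries.
domain_index = {"domA": 0, "domB": 1, "domC": 2}
hits = [("domA", 1e-5), ("domB", 0.5), ("domC", 0.01)]
fund_vector = functional_domain_composition(hits, domain_index)
```

A protein with no hit below the cutoff yields an all-zero vector, which is the situation the paper identifies as a weakness of this feature.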

Pseudoamino Acid Composition Based on Solvent Accessibility of Amino Acid (PseASA).
Protein quaternary structure is formed by interactions between two or more polypeptide chains, and these interactions depend on the surfaces of amino acids [26,27]. Protein binding sites usually have a more exposed hydrophobic area and higher solvent accessibility. Therefore, we apply this feature in encoding pseudoamino acid composition [9], named PseASA, to investigate how the relationship between protein interactions and structure affects the prediction system. First, the amino acid solvent accessibility information is derived from NetSurfP version 1.1 [28] prediction data and divided into "exposed" and "buried" states. The discontinuous exposed and buried amino acids are then concatenated into an exposed protein sequence $E_1 E_2 E_3 E_4 E_5 \cdots E_m$ and a buried protein sequence $B_1 B_2 B_3 B_4 B_5 \cdots B_n$ ($m$ and $n$ are the sequence lengths and may change with the prediction data of different proteins). PseAAC-Builder [29] was used for feature encoding of pseudoamino acids. However, in consideration of the overall accuracy of the protein features on the Oli8444 dataset, QuaBingo did not use the PseASA feature (see Section 3; Tables 1 and 2).
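The concatenation of exposed and buried residues can be sketched as below. This assumes the per-residue accessibility states are already available as a string of `E` (exposed) and `B` (buried) characters, which is an encoding chosen for this example rather than the actual NetSurfP output format; the function name is hypothetical.

```python
def split_by_accessibility(sequence, states):
    """Join exposed residues into one sequence and buried residues into another.

    sequence: amino acid string.
    states: string of 'E' (exposed) or 'B' (buried), one character per residue.
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have the same length")
    exposed = "".join(res for res, state in zip(sequence, states) if state == "E")
    buried = "".join(res for res, state in zip(sequence, states) if state == "B")
    return exposed, buried


exposed_seq, buried_seq = split_by_accessibility("MKTAY", "EBBEE")
```

Each of the two resulting sequences would then be passed to a pseudoamino acid composition encoder.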

The Three-Layer Architecture of Classifiers.
SVM is generally used as a binary classifier and was initially applied to pattern recognition and other fields [30]. SVM has been successfully applied to classification problems in various fields, and it has also achieved good results in the prediction of quaternary structure [8,10,31]. LibSVM, developed by Chang and Lin [32], is used in this study.
The prediction system in the current study employs a three-layer architecture. The first layer uses SVM to create binary classification prediction models for the different feature types. Feature selection is performed with fselect.py [33], a Python script from the LibSVM package, which assigns each feature an F-score according to its importance; the trained models are then sorted by F-score. To avoid poor recognition and enormous computational time, the trained models are divided into four equal parts according to F-score from high to low, and the models in the 25%-or-less and more-than-75% parts are removed. Finally, the first layer classification model is completed by choosing the models with better sensitivity, specificity, and Matthews Correlation Coefficient (MCC) under 10-fold cross-validation. The 10-fold cross-validation results of the first layer also show the predictive power of the three feature types for the different classes of oligomers.
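To make the ranking step concrete, the sketch below computes a per-feature score and sorts features by it. The score produced by fselect.py is assumed here to be the F-score commonly used for SVM feature selection (between-class separation of the feature means divided by the sum of the within-class sample variances). This is an independent illustration, not the fselect.py code; the handling of a zero denominator is a choice made for this sketch, and each class needs at least two samples.

```python
def f_score(positive_values, negative_values):
    """F-score of one feature from its values in the two classes."""
    n_pos, n_neg = len(positive_values), len(negative_values)
    mean_pos = sum(positive_values) / n_pos
    mean_neg = sum(negative_values) / n_neg
    mean_all = (sum(positive_values) + sum(negative_values)) / (n_pos + n_neg)
    numerator = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    var_pos = sum((x - mean_pos) ** 2 for x in positive_values) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in negative_values) / (n_neg - 1)
    denominator = var_pos + var_neg
    if denominator == 0:
        # Sketch-specific choice: a feature that is constant within both
        # classes is treated as maximally (or not at all) discriminative.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def rank_features(X, y):
    """X: list of feature rows; y: list of +1/-1 labels.
    Returns (feature indices sorted by descending score, list of scores)."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == -1]
        scores.append(f_score(pos, neg))
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


X = [[1.0, 0.5], [0.8, 0.4], [0.1, 0.5], [0.3, 0.6]]
y = [1, 1, -1, -1]
order, scores = rank_features(X, y)
```

In this toy data the first feature separates the classes far better than the second, so it is ranked first.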
The second layer is an SVM that takes the predictions of the optimized first layer models as input; its purpose is to combine the outputs of the individual feature models for each oligomer class into one. The second layer integrated model is trained on the 10-fold cross-validation test predictions of the first layer, so that the strengths and weaknesses of the different protein features are taken into account and prediction accuracy is improved.
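The idea of training the second layer on cross-validation test predictions can be sketched as follows. Every sample is scored by a first-layer model that did not see it during training, and the scores from the different feature models form the columns of the second-layer input. All names are hypothetical, the fold assignment is a simple round-robin, and a nearest-centroid scorer stands in for the SVM so that the example stays dependency-free.

```python
def k_fold_indices(n_samples, k):
    """Round-robin folds: sample i goes to fold i % k."""
    return [[i for i in range(n_samples) if i % k == fold] for fold in range(k)]


def out_of_fold_scores(X, y, fit, score, k=10):
    """Score each sample with a model trained on the other folds."""
    scores = [None] * len(X)
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [label for i, label in enumerate(y) if i not in held_out]
        model = fit(train_X, train_y)
        for i in test_idx:
            scores[i] = score(model, X[i])
    return scores


def second_layer_inputs(feature_sets, y, fit, score, k=10):
    """One column per first-layer feature model (for example Block and FunD)."""
    columns = [out_of_fold_scores(X, y, fit, score, k) for X in feature_sets]
    return [list(row) for row in zip(*columns)]


# Stand-in learner (NOT an SVM): class centroids and a distance difference.
def fit_centroids(X, y):
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    pos = centroid([x for x, label in zip(X, y) if label == 1])
    neg = centroid([x for x, label in zip(X, y) if label == -1])
    return pos, neg


def centroid_score(model, x):
    pos, neg = model
    d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
    return d_neg - d_pos  # positive when x is closer to the positive centroid


y = [1, 1, -1, -1, 1, 1, -1, -1]
block_X = [[1, 0] if label == 1 else [0, 1] for label in y]
fund_X = [[1] if label == 1 else [0] for label in y]
meta = second_layer_inputs([block_X, fund_X], y, fit_centroids, centroid_score, k=2)
```

The rows of `meta` would be the training inputs of the second-layer classifier. Using held-out predictions rather than predictions on the training data keeps the second layer from learning on overly optimistic first-layer outputs.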
After comparing the data analysis ability of different machine learning algorithms, we selected Random Forest to construct the third layer classifier, which integrates these recognition results and determines the quaternary structure type of the protein oligomer. Figure 1 is a flowchart of the prediction system.

Evaluation Measures.
To assess the predictive performance of the classifier, we use the following formulas. TP, FP, FN, and TN are true positives, false positives, false negatives, and true negatives, respectively. Sensitivity (Sn) is the percentage of protein oligomers of a given type that are correctly predicted as that type. Specificity (Sp) is the percentage of proteins not of that type that are correctly predicted as not belonging to it. Accuracy (ACC) assesses the overall predictive power. Matthews Correlation Coefficient (MCC) values range from −1 to 1, in which the value of 1 represents a completely correct prediction, the value of 0 represents random prediction, and the value of −1 represents exactly the opposite prediction:

$$\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$

To evaluate the classification results of the third layer classifier, we used Kappa statistics and the F-measure. Kappa statistics [34] judge how far the classifier results agree beyond what a random assignment would give. The value $\kappa$ is in the range of −1 to 1. When $\kappa = 1$, the predictions are completely different from those of a random classifier; $\kappa = 0$ means the predictions are the same as random prediction; $\kappa = -1$ means the classification has no effect and no credibility. We also use the F-measure as a standard for evaluating the classification results. The F-measure is a combination of precision and recall, with values from 0 to 1.
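The measures named above can be computed as in the following sketch. It assumes the standard definitions of sensitivity, specificity, accuracy, MCC, and the F-measure (taken here as the harmonic mean of precision and recall), and it assumes that the Kappa statistic is Cohen's kappa computed from a confusion matrix. The function names are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen for this example.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the ratio is undefined."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, accuracy, MCC, and F-measure for one class."""
    sn = _ratio(tp, tp + fn)
    sp = _ratio(tn, tn + fp)
    acc = _ratio(tp + tn, tp + fp + fn + tn)
    mcc = _ratio(tp * tn - fp * fn,
                 math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    precision = _ratio(tp, tp + fp)
    f_measure = _ratio(2 * precision * sn, precision + sn)
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc, "F": f_measure}


def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / (total * total)
    return _ratio(observed - expected, 1 - expected)


metrics = binary_metrics(tp=40, fp=10, fn=10, tn=40)
kappa = cohen_kappa([[40, 10], [10, 40]])
```

With these toy counts, sensitivity, specificity, accuracy, and F-measure are all 0.8, while MCC and kappa are both 0.6.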

Performance of Using Different Protein Features in the First Layer.
To understand how the different types of feature encoding affect the accuracy of structure prediction, we trained the SVM classification models and evaluated their validity with 10-fold cross-validation. Tables 1 and 2 show the 10-fold cross-validation sensitivity, specificity, accuracy, and MCC for the monomer, homooligomer, and heterooligomer classes.
As can be seen from the cross-validation results, block composition achieved overall accuracies of 78.91%, 92.27%, and 91.13% for monomer, homooligomer, and heterooligomer, respectively, with MCC values of 0.579, 0.848, and 0.827. Since most sensitivities exceed 80%, block composition is indeed suitable for representing protein characteristics and effective for classifying structure type. With the functional domain (FunD) feature, the overall accuracies for monomer, homooligomer, and heterooligomer were 75.75%, 80.26%, and 79.93%, respectively. The FunD results for homooligomer and heterooligomer were about 10% lower than those of block composition, and the sensitivities for homooctamer, heterooctamer, and heterodecamer were below 50%. These results indicate that FunD cannot capture the characteristics associated with these classes. The overall prediction accuracy of PseASA is relatively low, that is, 68.40% and 73.05%, respectively. However, compared with FunD, PseASA predicts the heterooligomer pentamer, octamer, and decamer better, with sensitivities of 86.67%, 86%, and 90%, respectively. In addition, the MCC of PseASA is generally lower, suggesting that the homology between whole sequences is not high or that the growing number and complexity of sequences within a category make correct prediction difficult. Even without the pentamer, octamer, decamer, and dodecamer classes, which have high sequence homology, the overall accuracy of homo- and heterooligomers still reached 90.72% and 90.48%, respectively. To further enhance prediction accuracy, we used the second layer SVM to integrate the outputs of the various feature models.

Performance of Model Combination to Enhance Oligomer Type Prediction Accuracy.
The purpose of the second layer is to integrate the predictions of the different feature models in each category. We utilized different combinations of the feature models built from the three features Block, FunD, and PseASA. Table 3 compares the performance of the model combinations in 10-fold cross-validation for oligomer classification in the second layer. For the monomer class, the Block + FunD combination improved accuracy from the 78.91% obtained with Block alone to 82.64%, a difference of about 3%. However, the combinations that include PseASA, that is, a two-feature combination and the three-feature combination Block + FunD + PseASA, exhibited lower accuracy. The same pattern also appears in the feature model combinations for homo- and heterooligomers. Overall, the Block + FunD model combination performs better than the single Block model, and most categories improved by 1 to 6%. Therefore, this study uses the Block + FunD combination to construct the first and second layers of the classification model.

Performance Comparison of Classification Algorithms in the Third Layer.
To obtain a single result that determines the quaternary structure type of an unknown protein, we use a further classifier layer to process the output of the second layer. We compared the data analysis and problem-solving ability of different types of algorithms and selected the better algorithm for constructing the third layer classifier. Algorithms from six typical categories were tested, that is, Bayes, Functions, Lazy, Rules, Trees, and Meta. The Oli6926 dataset was used for this training. We also used two validation methods, 10-fold cross-validation and self-consistency, to assess the learning effectiveness of the classifier.
In 10-fold cross-validation, the Correctly Classified Instances (CCI) of LibSVM and Logistic were 67.40% and 67.28%, respectively (Table 4). Their Kappa statistics were 0.5288 and 0.5285, and their F-measures were 0.616 and 0.615, respectively. These two algorithms have the best predicted results. However, we found that the predictive accuracy and statistical values of LibSVM and Logistic are higher because most correct predictions occur in the large quaternary categories, while minor categories such as pentamer, hexamer, and octamer are completely ignored. Other algorithms, such as decision table and Bagging, show a similar pattern. Conversely, the accuracies of Random Forest, Random Tree, and IBk were 58.91%, 54.65%, and 58.45%, respectively. Kappa was 0.4306, 0.3817, and 0.4126, and the F-measure was 0.566, 0.537, and 0.551, respectively. Although the results of these three algorithms are not perfect, they are not susceptible to imbalance in the amount of data.
The results of LibSVM and Logistic in the self-consistency test were not significantly higher than their 10-fold cross-validation results. In contrast, under self-consistency verification, the correct prediction ratios of Random Forest, Random Tree, and IBk reached about 90%, since they have good recognition capability for known information. The prediction performance of Random Forest and IBk was similar in self-consistency, reaching the highest MCC value of 0.856. Since both the cross-validation results and the predictions of the Random Forest algorithm for minor categories were good, we finally chose Random Forest as the third layer classifier in QuaBingo.

Performance Analysis.
To understand the prediction capabilities of QuaBingo for different functional protein structures in the cell, we compared it with a known quaternary structure prediction tool, QuatIdent [12], using an independent test. As shown in Table 5, the average sensitivity of QuaBingo was 51.95%. For the enzyme, gene regulation, membrane protein, signal transduction, and molecular binding protein categories, prediction was better, with ACC from 77% to 80%. For QuatIdent, the average sensitivity was 20.74%. These results illustrate that a prediction method composed of functional domain and PsePSSM cannot correctly identify most protein quaternary structures.

The Top Five Features of Block Composition of Oligomer on Oli8444.
The feature extraction method of block composition is simple, which implies that a lot of useful information can be gained to help discover mechanisms of protein aggregation and serial modes. We optimized block composition by feature selection, assigning each feature an F-score according to its degree of importance. The top five features are shown in Table 6. For example, the block IPB006052A, one of the top five features, is a conserved sequence of the TNF (Tumor Necrosis Factor) family; it is found in trimeric CD40 ligand (PDB ID: 1ALY) in the training data and also in human Collagen X sequences (PDB ID: 1GR3). Human Collagen X relies on the C1q domain to form a stable homotrimer. In existing data annotation, C1q and TNF-like domains overlap, and a number of important sequence positions have highly conserved amino acids and similar topology [35]. Much literature has confirmed that these amino acids play an important role in forming the hydrophobic core that stabilizes the trimeric structure and in forming biologically active protein complexes [27,35,36]. In addition, many other features of block composition are associated with a particular protein function. Thus, feature selection not only reduces the number of features in block composition but also effectively identifies characteristic patterns clearly related to protein aggregation, and hence distinguishes the quaternary structures of different oligomers.

Conclusions
In this study, we propose a feature extraction method based on blocks of conserved protein sequence for the classification of protein quaternary structure. This method can overcome the problems of feature extraction encountered by functional domain composition: (1) some proteins may not contain any known functional domains; and (2) corresponding known functional domains are too few to represent proteins. It is worth noting that the first problem has not yet been encountered in our proposed method, and the second problem was comprehensively solved using QuaBingo. The 10-fold cross-validation results showed that the overall accuracy of block composition for homo- and heterooligomers is 92.27% and 91.13%, respectively, both about 10% higher than that of functional domain composition. These results demonstrate that block composition can extract important and biologically meaningful features and thus enhance the prediction of protein quaternary structure.
Although many proteins exist as monomers, they may interact with another protein to form polymers or may further assemble to become a biologically relevant tetramer or octamer. Currently, most of these problems have not been solved through scientific research or verified by adequate information. In the future, as more and more data are added to pertinent databases, an accurate prediction system could be established that would greatly assist relevant research development. An online web server is freely available at http://predictor.nchu.edu.tw/QuaBingo/.