Novel Numerical Characterization of Protein Sequences Based on Individual Amino Acid and Its Application

The hydrophobicity and hydrophilicity of amino acids play a very important role in protein folding, in the protein's interaction with the environment and other molecules, and in its catalytic mechanism. Based on these two physicochemical indexes, a 2D graphical representation of protein sequences is introduced, and on the basis of this representation a new numerical characteristic is proposed to compute the distance between sequences for the analysis of sequence similarity/dissimilarity. We apply the new distance to the similarity/dissimilarity analysis of the ND5 proteins of nine species and to the prediction of the four major structural classes on a dataset containing 639 domains. The results show that the method is simple and effective.


Introduction
As the number of collected protein sequences grows, it is becoming increasingly important to predict the structure and function of proteins accurately. Many methods have been proposed to extract additional information or knowledge from a sequence. Graphical representations have become an effective aid in understanding numerical characterizations of biological sequences. One way to create a graphical representation of a biological sequence is to map its amino acids or bases, in sequence order, to a numeric characterization of a property of each amino acid or base. These numerical characterizations then support further analysis and study of biological sequences.
To gain a more intuitive understanding of the biological characteristics implied in a sequence and to analyze the similarity/dissimilarity of protein sequences, Randić and others [22-26] proposed many numerical characterizations based on matrices derived from the curve. For example, one such matrix holds the quotient of the Euclidean distance and the graph distance between points on the curve; another holds the quotient of the Euclidean distance and the sum of distances between a pair of points on the curve. These characteristic invariants were then applied to compare the similarities of biological sequences. However, such numerical characterization methods require a great amount of calculation and lose some information about the sequences. Many simpler and more direct methods were therefore proposed to handle complex problems in sequence alignment. For instance, Randić et al. [27,28] and He et al. [19] directly use the generated graphical representation of protein sequences to compare the similarities/dissimilarities of the protein sequences of different species.
In this paper, a 2D graphical representation of protein sequences is introduced based on the hydrophobicity and hydropathy indexes. From this graphical representation, a new numerical characteristic is proposed to compute the distance between sequences for the analysis of sequence similarity/dissimilarity. We then use the new numerical characteristic to analyze the similarities/dissimilarities of the ND5 proteins of nine species. To illustrate the utility of our method, a correlation analysis compares our results, and the results of other graphical representations, against the results of ClustalW. Furthermore, we use our method to predict protein structural class: the prediction accuracy for one of the two single-type classes (all-α or all-β), for the α+β class, and overall improves noticeably. The result indicates that the EH and Hp indexes play an important role when the primary sequence folds into secondary structure; it also indicates that our method is simple and effective.

The Graphical Representation of Protein Sequences
The hydrophobicity and hydrophilicity of the amino acids (AAs) in a protein play an important role in its folding, in its interaction with the environment and other molecules, and in its catalytic mechanism [29]. Based on the hydrophobicity (EH) [30] and hydropathy (Hp) [31] indexes, which were considered by Kurgan and Chen [32], we introduce a graphical representation of proteins to analyze the evolutionary relationships of protein sequences and to predict the structural class from the primary sequence. First, each AA is mapped to a point by (1), where $EH_i^0$ and $Hp_i^0$ ($i = 1, 2, \ldots, 20$) are the original EH and Hp values of the 20 AAs, listed in columns 3 and 4 of Table 1, respectively. Based on (1), the 2D Cartesian coordinates of the 20 AAs are listed in columns 5 and 6 of Table 1, respectively. Because the slope decides the direction of a curve, we use it to construct a 2D graphical representation of each protein sequence, as follows. For a protein sequence $S = s_1 s_2 \cdots s_N$, inspect it by stepping one AA at a time. For step $i$ ($i = 1, 2, \ldots, N$), a 2D point $P_i(x_i, y_i)$ is constructed by (2). Let $P_0(x_0, y_0) = (0, 0)$. As $i$ runs from 1 to $N$, we obtain a series of points $P_1, P_2, \ldots, P_N$; connecting adjacent points in turn gives a 2D zigzag curve that contains $N + 1$ points.
As an example, the 2D graphical representations of two short protein segments of Saccharomyces cerevisiae [27] are plotted in Figure 1 to illustrate our approach.
In the curve, the $x$- and $y$-coordinate values represent the positions of the AAs in the sequence and the direction of the curve, respectively. We find that protein sequences I and II are generally similar, except for four mismatched AAs.
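The construction described in this section can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reproduction of equations (1) and (2): it assumes that the $x$-coordinate is the residue position and that each residue adds a fixed increment to the $y$-coordinate. The increments used here are the Kyte-Doolittle hydropathy values, chosen only as a stand-in for the coordinates of Table 1; the names `STEP` and `zigzag_curve` are hypothetical.

```python
from itertools import accumulate

# Stand-in per-residue increments (Kyte-Doolittle hydropathy values).
# ASSUMPTION: the paper derives its own values from EH and Hp via (1);
# these numbers only make the sketch runnable.
STEP = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def zigzag_curve(sequence, step=STEP):
    """Return the N + 1 points P_0..P_N for an N-residue sequence.

    P_0 is (0, 0); x_i is the residue position i, and y_i is the running
    sum of the per-residue increments (an assumed update rule).
    Raises KeyError for a residue code missing from `step`.
    """
    ys = accumulate((step[aa] for aa in sequence), initial=0.0)
    return [(i, y) for i, y in enumerate(ys)]


if __name__ == "__main__":
    # Arbitrary five-residue example, not one of the segments in Figure 1.
    for point in zigzag_curve("MKVLA"):
        print(point)
```

Swapping in a different increment table changes only the `step` argument, so the same function can be reused once the actual Table 1 values are available.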

The New Distance Metrics of Two Sequences
To gain a more intuitive understanding of the biological characteristics implied in a sequence and to analyze the similarity/dissimilarity of different protein sequences, many authors proposed characteristic invariants of various matrices derived from the curve [22-26]. However, these numerical characterization methods require a great amount of calculation and may lose some information about the sequences. Therefore, some researchers used the cumulative distance over all points to represent the distance between sequences [20,27,28]. Such numerical characterizations avoid losing some of the information in the protein sequences.
We define the distance between sequences $S_1$ and $S_2$ by (3) to compute the similarity of sequences, where $N_1$ and $N_2$ denote the lengths of the two sequences $S_1$ and $S_2$, and $y^1$ and $y^2$ are their $y$-coordinate values, respectively. This distance removes the effect of unequal sequence lengths, so the numerical characterization is more effective.
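To make the idea of a length-normalised curve distance concrete, the following minimal sketch compares two curves through their $y$-values. It is not equation (3): the form used here, the mean absolute difference of the $y$-values over the positions both curves share, is only an assumed stand-in, and the function name `curve_distance` is hypothetical.

```python
def curve_distance(y1, y2):
    """Length-normalised distance between two curves given as y-value lists.

    ASSUMPTION: mean absolute difference of the y-values over the
    positions both curves share; this is not the paper's equation (3).
    """
    shared = min(len(y1), len(y2))
    if shared == 0:
        raise ValueError("both curves need at least one point")
    # zip stops at the shorter list, so only shared positions are compared.
    return sum(abs(a - b) for a, b in zip(y1, y2)) / shared


if __name__ == "__main__":
    print(curve_distance([0.0, 1.0, 2.0], [0.0, 1.5, 2.5, 4.0]))
```

Dividing by the number of shared positions keeps the value comparable between pairs of sequences of different lengths, which is the property the text attributes to (3).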
The distances among the ND5 proteins of nine species are computed based on (3), and the resulting similarities/dissimilarities are listed in Table 2. A smaller distance indicates that two species are more similar. From Table 2, the fin whale and blue whale form the most similar pair. Human, gorilla, pygmy chimpanzee, and common chimpanzee are also similar to one another, as are rat and mouse. The opossum is dissimilar to the other eight species. In addition, human is more similar to pygmy chimpanzee and common chimpanzee than to gorilla. These similarity results are consistent with the known facts of evolution, and the method reduces the computational complexity.
To illustrate the effectiveness of our method, ClustalW is used to compute the similarity of the sequences and to construct the phylogenetic tree [34]. ClustalW is a multiple sequence alignment program for biological sequences; it attempts to calculate the best match for the selected sequences and lines them up so that identities, similarities, and differences can be observed. The distance matrix for the ND5 proteins of nine species calculated by ClustalW is listed in Table 3. We then give a scatter plot for the correlation analysis, built element by element from Tables 2 and 3. If the points all lie close to the trend line, the correlation between our method and ClustalW is strong. Scatter plots are obtained in the same way for the results of the Yao et al. method [15], the Wen and Zhang method [17], the Abo El Maaty et al. method [35], and the Wu et al. method [36] against the distance matrix of Table 3. Figure 2 shows that our method agrees with ClustalW better than the other graphical representation approaches for proteins.
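The element-by-element comparison of two distance matrices can be sketched as follows. This is a minimal illustration that assumes both matrices are symmetric, list the species in the same order, and are compared through the Pearson correlation coefficient of their upper-triangle entries; the text does not state which correlation statistic was used, and the function names are hypothetical.

```python
from math import sqrt


def upper_triangle(matrix):
    """Return the entries above the diagonal, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def matrix_correlation(a, b):
    """Correlate two symmetric distance matrices element by element."""
    return pearson(upper_triangle(a), upper_triangle(b))


if __name__ == "__main__":
    ours = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
    reference = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
    print(matrix_correlation(ours, reference))
```

Only the upper triangle is used so that each species pair is counted once and the zero diagonal does not inflate the correlation.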

The Prediction of Structural Class Using the K-NN Algorithm
Protein function, regulation, and interactions can be learned from protein structure [37,38], which motivates the development of novel methods for protein structure prediction. Knowledge of protein structure also plays an important role in molecular biology, cell biology, pharmacology, and medical science. Protein structures are generally classified into four structural classes: all-α, all-β, α/β, and α+β. The all-α and all-β classes represent structures that contain mainly α-helices and β-strands, respectively. The α/β and α+β classes include both α-helices and β-strands, where the α/β class consists mainly of parallel β-strands and the α+β class of antiparallel β-strands. The dataset, available at http://biomine.ece.ualberta.ca/Structural Class/SCEC.html, includes 640 domains that share sequence identity below 25% [33]. In this paper, we use 639 of these protein domains, after removing one erroneous domain.
In this work, the K-Nearest Neighbor (K-NN) classification algorithm is used to predict the structural class. The K-NN algorithm is among the simplest used in machine learning; it determines the attribute of a query point by taking the weighted average of the K nearest neighbors of the point, and as such it is a highly effective inductive inference method [39]. Given a sequence $S$, we calculate the distance between $S$ and every other sequence and select the K nearest sequences. The distance between two sequences $S_1$ and $S_2$ is calculated using (3). Among the K sequences, let $n_1$, $n_2$, $n_3$, and $n_4$ denote the numbers of sequences that belong to the all-α, all-β, α/β, and α+β classes, respectively. The sequence $S$ is assigned to the class with the largest count: if $n_1$ (or $n_2$, $n_3$, $n_4$) is the maximum, $S$ is predicted to belong to the all-α (or all-β, α/β, α+β) class. Following this procedure, Tables 4 and 5 list the performance of our method under the jackknife test with K = 29 (that is, $n_1 + n_2 + n_3 + n_4 = 29$).
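The voting procedure and the jackknife test described above can be sketched as follows. This is a minimal illustration that assumes a precomputed square distance matrix (for example, one filled with the distances from (3)) and a list of class labels; the function names are hypothetical, and the tie-breaking behaviour is an arbitrary choice because the text states no tie rule.

```python
from collections import Counter


def knn_predict(dist_row, labels, k, skip):
    """Predict the class of one query from its distances to all sequences.

    dist_row[j] is the distance from the query to sequence j; the
    sequence at index `skip` (the query itself) is left out.
    """
    candidates = (j for j in range(len(labels)) if j != skip)
    neighbours = sorted(candidates, key=lambda j: dist_row[j])[:k]
    votes = Counter(labels[j] for j in neighbours)
    # The class with the largest count wins. On a tie, Counter returns
    # the class met first among the neighbours (an arbitrary choice).
    return votes.most_common(1)[0][0]


def jackknife_accuracy(dist, labels, k):
    """Leave-one-out accuracy: each sequence is predicted from the rest."""
    hits = sum(
        knn_predict(dist[i], labels, k, skip=i) == labels[i]
        for i in range(len(labels))
    )
    return hits / len(labels)


if __name__ == "__main__":
    # Toy data: six points on a line forming two well separated groups.
    positions = [0, 1, 2, 10, 11, 12]
    toy_labels = ["all-alpha"] * 3 + ["all-beta"] * 3
    toy_dist = [[abs(p - q) for q in positions] for p in positions]
    print(jackknife_accuracy(toy_dist, toy_labels, k=2))
```

With the 639-domain dataset, `k` would be set to 29 to match the setting reported in the text.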
The predicted results are evaluated with several quality measures, including the prediction accuracy (ACC), sensitivity, specificity, and Matthews correlation coefficient (MCC). In this section, the ACC is used to compare the results of our method with those of other published approaches. These measures are defined in terms of TP, TN, FP, and FN, where TP and TN are the numbers of correctly classified positive and negative samples, respectively, and FP and FN are the numbers of incorrectly classified negative and positive samples, respectively. The ROC curve is simple and intuitive; it reflects sensitivity and specificity together and gives a comprehensive representation of test accuracy. The area under the ROC curve (AUC) is also given to evaluate the predicted probabilities. Table 4 shows that the overall prediction accuracy of our method reaches 60.82% on the 639 domains, which is the highest among the compared methods, including IB1, C4.5, Naive Bayes, logistic regression [33], and Liao's method [20]. Chen's article [33] reported that the α+β class is more difficult to predict than the other three structural classes. However, the prediction accuracy for the α+β class improves evidently with our method. The accuracy for one of the two single-type classes (all-α or all-β) and the overall accuracy are also higher. The remaining quality measures and the ROC curves are given in Table 5 and Figure 3, respectively. Table 5 shows that the predictions for the α/β class have the higher quality, with 65.25% sensitivity, 91.58% specificity, and 50.36% MCC. In Figure 3, the AUC values for each of the four classes are above 0.5, the value expected for random predictions. Although the overall prediction accuracy of our method is lower than that of the SVM method [33], our approach is simpler and less time consuming.
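The four quality measures can be computed from the confusion counts as sketched below. The formulas are the standard textbook definitions and are assumed, not confirmed, to match the ones used in the paper; for the four-class problem they would be applied per class in a one-versus-rest fashion. The function name `binary_metrics` is hypothetical.

```python
from math import sqrt


def binary_metrics(tp, tn, fp, fn):
    """Return ACC, sensitivity, specificity, and MCC from confusion counts.

    Assumes at least one positive and one negative sample; otherwise
    sensitivity or specificity raises ZeroDivisionError.
    """
    acc = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # fraction of positives recovered
    specificity = tn / (tn + fp)   # fraction of negatives recovered
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "ACC": acc,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "MCC": mcc,
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    for name, value in binary_metrics(tp=40, tn=45, fp=5, fn=10).items():
        print(f"{name}: {value:.4f}")
```

Unlike ACC, the MCC stays informative when the classes are of unequal size, which is why it is often reported alongside sensitivity and specificity.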

Conclusions
The hydrophobicity and hydrophilicity of AAs play an important role in the folding of a protein into its secondary structure. Based on these two physicochemical indexes, a 2D graphical representation of protein sequences is proposed in this paper. The representation is easy to visualize and reflects more of the information in a protein sequence. To obtain an intuitive understanding of the biological characteristics implied in sequences and to make similarity comparison convenient, a new distance is proposed on the basis of the graphical representation. We first apply the new distance to analyze the similarities/dissimilarities of the ND5 proteins of nine species, and a correlation analysis compares our results, and those of other graphical representations, with the results of ClustalW. Furthermore, using the new distance, the four major structural classes are predicted on a dataset containing 639 domains that share sequence identity below 25%. The prediction results show that the method improves the prediction accuracy for one of the two single-type classes (all-α or all-β), for the α+β class, and overall. In particular, our method evidently improves the prediction accuracy for the α+β class. The result demonstrates that the EH and Hp indexes play an important role when the primary sequence folds into secondary structure. The calculation is simple, convenient, and fast. In addition, the method can be extended to other physicochemical properties of amino acids and should be useful for studying and solving other bioinformatics problems.