ADLD: A Novel Graphical Representation of Protein Sequences and Its Application

To facilitate the intuitive analysis of protein sequences, this paper first introduces a novel graphical representation of protein sequences called ADLD (Alignment Diagonal Line Diagram) and then proposes an ADLD-based method for analyzing the similarity/dissimilarity of protein sequences. Compared with existing methods, the ADLD-based method is shown to be effective in the similarity/dissimilarity analysis of protein sequences and has the merits of being intuitive, visual, and simple. Examinations of the similarities/dissimilarities of 16 different ND5 proteins and 29 different spike proteins illustrate the utility of the ADLD-based approach.


Introduction
Homology analysis is one of the hot topics in protein sequence analysis. Many methods have been proposed for the homology analysis of protein sequences [1][2][3]; among them, the graphical representation of protein sequences has proved to be a powerful tool for the visual comparison of protein sequences.
Graphical representation methods were first introduced to represent DNA sequences in multidimensional space [4][5][6][7]. After obtaining sequence invariants from the graphs, one can compare sequences by comparing their invariants. These methods were proposed as an alternative to the direct comparison of DNA sequences, which is computationally intensive even for sequences of restricted length [8]. Protein sequences are to some degree similar to DNA sequences in that both are composed of different units. Thus graphical representation methods can naturally be extended to describe protein sequences.
Currently, many researchers have proposed different methods for the graphical representation of protein sequences [9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24]. For example, Feng and Zhang [25] suggested Zp-curve based on the hydrophobicity and charged properties of amino acid residues along the primary sequence. Randić et al. [26] introduced a graphical representation of protein sequences based on a graphical representation of triplets of DNA in which the interior of a square or a tetrahedron is utilized to accommodate 64 sites for the 64 codons. Bai and Wang [27] derived a 2D graphical representation of protein sequences based on nucleotide triplet codons. Yao et al. [28] outlined a 2D graphical representation of protein sequences based on two classifications of amino acids. Abo el Maaty et al. [29] proposed a novel unique 3D graphical representation of protein sequences based on three physicochemical properties of amino acid side chains. Abo-Elkhier introduced a 3D graphical representation of protein sequence based on a right cone of a unit base and unit height on protein sequences interfaces [30]. El-Lakkani and El-Sherif [31] proposed a graphical representation of protein sequence to help similarity analysis of protein sequences based on 2D and 3D amino acid adjacency matrices. Ma et al. [32] introduced a family of Iterated Function Systems (IFS) to outline a 2D graphical representation of protein sequences.
In most of these existing methods, the main drawback is that the higher the dimension of the protein sequence graphs, the higher the computational complexity of the method or the lower the recognizability of the graphs. For example, in the methods proposed in [26,28], the lines cross each other, which decreases the visibility of the graphics. In the methods proposed in [29][30][31], the 3D graphics are more complex and less visible than 2D graphics; in addition, obtaining the sequence invariants from the graphics requires constructing complex matrices, which demands considerable computation and storage.
Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences [33]. Many algorithms have been implemented for sequence alignment [34][35][36][37]. These methods are usually effective but complex and time consuming. Compared with alignment methods, existing graphical representation methods can also display the inner structure of protein sequences and make their similarity/dissimilarity more visible through their graphics. In this paper, we propose a novel method for analyzing similarity/dissimilarity that combines the ideas of sequence alignment and graphical representation and thereby avoids, to some degree, the weaknesses of both.
Principal components analysis (PCA) is a standard tool in multivariate data analysis for reducing the number of dimensions, and it has proved effective in protein sequence analysis [38][39][40]. Therefore, to overcome the main drawbacks of existing methods, this paper introduces a novel PCA-based graphical representation of protein sequences called ADLD (Alignment Diagonal Line Diagram) and then proposes an ADLD-based method for analyzing the similarity/dissimilarity of protein sequences. In addition, to validate the effectiveness of the ADLD-based method, we apply it to the 16 different ND5 proteins and the 29 different spike proteins, which are widely used as test data [16][17][18][19][20][21][22][23][24][25][26]. The results show that our method is not only visual, intuitive, and effective in the similarity/dissimilarity analysis of protein sequences but also quite simple, since no high-dimensional matrices need to be constructed.

Procedure of Our Method for Analysis of Protein Sequences.
In this section, we first outline the overall procedure of our method for analyzing protein sequences.
(1) Select the same 9 properties for each amino acid and, on the basis of the 20 different amino acids, construct a 20 × 9 matrix as the input data of the PCA algorithm.
(2) According to the PCA algorithm, obtain a unique feature for each amino acid.
(3) For each protein sequence in the test data, replace each amino acid in the sequence with its corresponding unique feature, thereby transforming the protein sequence into a numerical sequence.
(4) For any two numerical sequences, draw a graph, named ADLD, and extract some of its numerical characteristics, which can be utilized to analyze the similarity/dissimilarity of the two sequences.
Next, in Sections 2.2-2.6, we introduce the details of constructing the ADLDs and obtaining some of their numerical characteristics. In Section 3.1, we give the method for constructing the similarity/dissimilarity matrix of our test sequence groups.

Amino Acids and Their Properties.
Proteins are composed of 20 different amino acids, and these amino acids have many different physicochemical and biological properties such as the molecular weight (mW), hydropathy index (hI), the pKa value for terminal amino acid groups COOH (pK1), the pKa value for terminal amino acid groups NH3+ (pK2), isoelectric point (pI), solubility ( ), the number of triplet codons (cN), frequency of human proteins ( ), and van der Waals radius of side chains (vR). The names and symbols of the 20 amino acids and the values of their 9 major properties are illustrated in Table 1.

Principal Components Analysis.
Principal components analysis (PCA) is a common technique for dimensionality reduction and pattern recognition in datasets of high dimension [41]. The main purposes of PCA are to identify patterns in the data and to use them to reduce the dimensions of the dataset with minimal loss of information. The general steps of conducting PCA are as follows.
Step 1. For $n$ samples with $p$ components each, construct the $n \times p$ data matrix $X$ and standardize it to obtain the matrix $X^*$.
Step 2. From $X^*$, compute the $p \times p$ correlation matrix $R$.
Step 3. From the correlation matrix $R$, obtain its $p$ eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$. From these, we can obtain the $p$ principal components $F_i$ for $i \in \{1, 2, \ldots, p\}$.
Step 4. For each principal component $F_i$, $i \in \{1, 2, \ldots, p\}$, obtain its contribution rate $CR_i$ and accumulated contribution rate $ACR_i$ according to the following formulas, respectively:
$$CR_i = \frac{\lambda_i}{\sum_{k=1}^{p} \lambda_k}, \qquad ACR_i = \frac{\sum_{k=1}^{i} \lambda_k}{\sum_{k=1}^{p} \lambda_k}.$$
Generally, in order to lower the computational complexity, we can keep only the first $m$ ($m \leq p$) principal components $\{F_1, F_2, \ldots, F_m\}$, where the accumulated contribution rate of the $m$th principal component satisfies $ACR_m \geq 85\%$.
Step 5. Then, for each $i \in \{1, 2, \ldots, n\}$, we can obtain the total score of the $i$th sample from its values on the $m$ retained principal components.
From Table 1, if we consider the 20 amino acids as 20 different samples and the 9 properties of each amino acid as its 9 components, then, according to the general steps of conducting PCA illustrated in Section 2.3, we can obtain a 20 × 9 matrix $X$ and its standardized matrix $X^*$, a 9 × 9 correlation matrix $R$, and 9 principal components $\{F_1, F_2, \ldots, F_9\}$. Therefore, as illustrated in Table 2, we can obtain the 9 eigenvalues of $R$ and the contribution rates and accumulated contribution rates of the 9 principal components $\{F_1, F_2, \ldots, F_9\}$, respectively.
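The contribution-rate computation in Step 4 can be sketched in a few lines of Python. This is only an illustrative sketch: the eigenvalues below are hypothetical placeholders (chosen to sum to 9, as the eigenvalues of a 9 × 9 correlation matrix must), not the values reported in Table 2, and the function names `contribution_rates` and `components_to_keep` are introduced here for illustration.

```python
def contribution_rates(eigenvalues):
    """Return (CR, ACR) lists for the eigenvalues, sorted in descending order."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cr = [value / total for value in ordered]
    acr = []
    running = 0.0
    for rate in cr:
        running += rate
        acr.append(running)
    return cr, acr


def components_to_keep(acr, threshold=0.85):
    """Smallest number of leading components whose ACR reaches the threshold."""
    for count, value in enumerate(acr, start=1):
        if value >= threshold:
            return count
    return len(acr)


# Hypothetical eigenvalues of a 9 x 9 correlation matrix (NOT the Table 2 values).
example_eigenvalues = [3.6, 2.25, 1.35, 0.9, 0.45, 0.225, 0.135, 0.0675, 0.0225]
cr, acr = contribution_rates(example_eigenvalues)
kept = components_to_keep(acr)  # number of components needed to reach ACR >= 85%
```

With these placeholder eigenvalues, the accumulated contribution rate first reaches 85% at the fourth component, so only four principal components would be retained.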

Table: symbols of amino acids and their total scores.
Observing the above 4 formulas, it is easy to find that there are three large coefficients in the first formula, which are 0.5036 (corresponding to mW), 0.4377 (corresponding to ), and 0.4349 (corresponding to vR), respectively. Therefore, these three properties play a major role in the first principal component.
Suppose that $\Psi = a_1 a_2 \ldots a_N$ represents a protein sequence with $N$ amino acids, where $a_i \in \Omega$ for $i \in \{1, 2, \ldots, N\}$; then we can obtain a numerical sequence $\Psi' = (t_1, t_2, \ldots, t_N)$ corresponding to the protein sequence $\Psi$ by replacing each amino acid $a_i$ in $\Psi$ with its corresponding value TotalScore($a_i$) for $i \in \{1, 2, \ldots, N\}$.
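The replacement of amino acids by their total scores amounts to a table lookup, which can be sketched as follows. The scores in `TOTAL_SCORE` are hypothetical placeholders for a handful of amino acids, not the values of Table 4, and the helper name `to_numeric` is an assumption made for this sketch.

```python
# Hypothetical total scores for a few amino acids; placeholders, NOT Table 4 values.
TOTAL_SCORE = {"A": -1.2, "G": -1.5, "L": 0.3, "K": 0.9, "W": 2.1}


def to_numeric(sequence, scores=TOTAL_SCORE):
    """Replace each one-letter amino acid code with its total score."""
    missing = sorted(set(sequence) - set(scores))
    if missing:
        raise KeyError(f"no total score for: {', '.join(missing)}")
    return [scores[residue] for residue in sequence]


numeric = to_numeric("GALKW")
```

A full implementation would hold one entry for each of the 20 amino acids; the lookup itself would stay the same.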
For example, consider the following 3 abbreviated protein sequences. According to the above descriptions and Table 4, we can obtain their corresponding numerical sequences.

2.6. ASDs and ADLDs of Protein Sequence Pairs.
For a given protein sequence pair $(\Psi_1, \Psi_2)$, suppose that the protein sequence $\Psi_1$ includes $N_1$ amino acids, $\Psi_2$ includes $N_2$ amino acids, and $N_1 \geq N_2$; then, in order to measure the similarity/dissimilarity between them, in this section we first present a new method called the Alignment Scatter Diagram (ASD), which plots the two sequences as a scatter diagram. For convenience, we call the points in the ASD the alignment-plots (APs). The ASD of the protein sequence pair $(\Psi_1, \Psi_2)$ can be obtained through the following steps.
Step 1. According to the method given in Section 2.5, translate the protein sequence pair $(\Psi_1, \Psi_2)$ into two numerical sequences with the same length.
Step 2. Let $w$ be the alignment width (AW) of the protein sequence pair $(\Psi_1, \Psi_2)$; that is, let $\Psi_1 = a_1, a_2, a_3, \ldots, a_{N_1}$ and $\Psi_2 = b_1, b_2, b_3, \ldots, b_{N_2}$; then, for any amino acid $a_i$ in the protein sequence $\Psi_1$, we compare it with the $2w + 1$ amino acids $\{b_{i-w}, \ldots, b_{i-1}, b_i, b_{i+1}, \ldots, b_{i+w}\}$ in the protein sequence $\Psi_2$. The value of $w$ is defined using a given positive threshold, which guarantees that the AW of the protein sequence pair $(\Psi_1, \Psi_2)$ will not be too small to expose the association of the inner structures of the two sequences. In actual applications, we suggest that this threshold be no less than 10.
Step 3. Let $\delta \geq 0$ be the dissimilarity degree (DD) of two amino acids; that is, if $\delta = 0$, the two amino acids are the same; otherwise, the two amino acids are different from each other to some degree. The APs in the ASD of the protein sequence pair $(\Psi_1, \Psi_2)$ are then defined, for $i \in \{1, 2, \ldots, N_1\}$ and $j \in \{1, 2, \ldots, N_1\}$, by means of a Heaviside function $\Theta$, which yields an $N_1 \times N_1$ alignment matrix (AM).
Step 4. For the $N_1 \times N_1$ elements in the alignment matrix AM, we plot a point at $(i, j)$ on the plane for each element with $AM_{ij} = 1$ and $|i - j| \leq w$. For convenience, we call the obtained graph the Alignment Scatter Diagram (ASD) of the protein sequence pair $(\Psi_1, \Psi_2)$.
From Figure 1, it is easy to see that there are many disordered points in these ASDs, which remarkably lowers their visual clarity and prevents us from intuitively distinguishing the similarity/dissimilarity between the protein sequence pairs. Therefore, to make the ASD more intuitive, we propose a simplified variant diagram of the ASD, which is called the Alignment Diagonal Line Diagram (ADLD).
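The construction of the alignment-plots can be sketched in Python as follows. Because the exact formulas are not reproduced here, the sketch rests on stated assumptions: an AP is placed at (i, j) when the two numerical values differ by no more than a dissimilarity tolerance `dd` and the positions lie within the alignment width `width`; the shorter sequence is padded with `None`, which never matches. The function name `alignment_points` and both parameter names are hypothetical.

```python
def alignment_points(seq1, seq2, width, dd=0.0):
    """Return the set of 1-based (i, j) alignment-plots of two numerical sequences."""
    n = max(len(seq1), len(seq2))
    # Assumption: pad the shorter sequence with None so both have length n.
    a = list(seq1) + [None] * (n - len(seq1))
    b = list(seq2) + [None] * (n - len(seq2))
    points = set()
    for i in range(n):
        # Only positions within the alignment width of i are compared.
        for j in range(max(0, i - width), min(n, i + width + 1)):
            if a[i] is None or b[j] is None:
                continue
            if abs(a[i] - b[j]) <= dd:
                points.add((i + 1, j + 1))
    return points


first = [1.0, 2.0, 3.0, 4.0, 5.0]
second = [1.0, 2.0, 9.0, 4.0, 5.0]
points = alignment_points(first, second, width=2)
```

In this toy pair, the single substitution at position 3 leaves a gap on the main diagonal, while the other four positions produce alignment-plots.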
For convenience, in an ASD, we call its main diagonal line the artery track (AT) and the lines parallel to its main diagonal line the by-path tracks (BTs). In addition, we define a set consisting of no fewer than $k$ consecutive APs on the AT or on a BT as a CAPS, where $k \geq 1$ is a given threshold (the CAPS threshold).
For a given CAPS caps_1, if there is no CAPS caps_2 satisfying caps_1 ⊂ caps_2, then we call caps_1 a maximum CAPS. For convenience, we call the line formed by connecting all of the APs in a maximum CAPS a similar fragment (SF), and we call all of the APs on the AT but not on any SF the free points (FPs).

Computational and Mathematical Methods in Medicine
Obviously, if we keep only the SFs and FPs in an ASD and omit all other APs, we obtain a simplified variant diagram of the ASD, which we call the Alignment Diagonal Line Diagram (ADLD). If the CAPS threshold equals 1, an ADLD degenerates into an ASD. Therefore, in actual applications, we suggest that the CAPS threshold be no less than 2. In particular, to find more accurate SFs in the ADLD of a protein sequence pair, the longer the protein sequences in the pair are, the larger the CAPS threshold should be.
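The reduction from an ASD to an ADLD can be sketched as a search for maximal runs of consecutive alignment-plots along each diagonal. The sketch below is illustrative only: it assumes the APs are given as a set of 1-based (i, j) pairs, measures the length of an SF as the number of APs it contains, and uses the hypothetical names `build_adld` and `min_run` (the CAPS threshold).

```python
from collections import defaultdict


def build_adld(points, min_run=2):
    """Split alignment-plots into similar fragments (SFs) and free points (FPs).

    Returns (fragments, free_points). Each fragment is (offset, start_i, length),
    where offset = j - i is 0 on the main diagonal (AT) and nonzero on a BT.
    """
    by_offset = defaultdict(list)
    for i, j in points:
        by_offset[j - i].append(i)

    fragments, free_points = [], []
    for offset, rows in by_offset.items():
        rows.sort()
        # Collect maximal runs of consecutive row indices on this diagonal.
        runs = []
        start = previous = rows[0]
        for i in rows[1:]:
            if i != previous + 1:
                runs.append((start, previous - start + 1))
                start = i
            previous = i
        runs.append((start, previous - start + 1))

        for run_start, length in runs:
            if length >= min_run:
                fragments.append((offset, run_start, length))
            elif offset == 0:
                # Short runs survive only on the AT, as free points.
                free_points.extend((i, i) for i in range(run_start, run_start + length))
            # Short runs on BTs are omitted from the ADLD.
    return sorted(fragments), sorted(free_points)


example_points = {(1, 1), (2, 2), (3, 3), (5, 5), (7, 8), (8, 9), (9, 10), (2, 6)}
fragments, free_points = build_adld(example_points, min_run=3)
```

In this toy example, the run of three APs on the main diagonal and the run of three APs one track above it are kept as SFs, the isolated AP at (5, 5) becomes a free point, and the isolated AP at (2, 6) is dropped. Setting `min_run=1` keeps every AP, which matches the remark that the ADLD then degenerates into the ASD.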
For convenience of analysis, in an ADLD, suppose that there are $n_1$ different SFs and $n_2$ different FPs on its AT, $u$ different BTs located above its AT, and $v$ different BTs located below its AT; then we get the following.
(1) For the $n_1$ different SFs and $n_2$ different FPs on the AT of the ADLD, we number them from left to right and use $\{ASF_1, ASF_2, \ldots, ASF_{n_1}\}$ and $\{FP_1, FP_2, \ldots, FP_{n_2}\}$ to represent the $n_1$ SFs and $n_2$ FPs, respectively. In addition, we also call the SFs on the AT of the ADLD the ASFs.
(2) For the $u$ different BTs located above the AT, we number them from bottom to top and use $\{BT_1, BT_2, \ldots, BT_u\}$ to represent them; for the $v$ different BTs located below the AT, we number them from top to bottom and use $\{BT_{-1}, BT_{-2}, \ldots, BT_{-v}\}$ to represent them.
(3) For each $BT_i$, where $i \in \{1, 2, \ldots, u\}$, suppose that there are $n_3$ different SFs on $BT_i$; then we number these $n_3$ SFs from left to right and use $\{BSF_1, BSF_2, \ldots, BSF_{n_3}\}$ to represent them. In addition, we also call the SFs on the BTs of the ADLD the BSFs.
According to the above assumptions, in Figure 2, we show the two ADLDs corresponding to the ASDs illustrated in Figures 1(a) and 1(b), with the CAPS threshold set to 3. In addition, to make the ADLDs more visual and intuitive, in Figure 2 we use the red "*" to represent the FPs on the AT and the blue lines to represent the SFs on the AT or BTs.
Observing Figure 2(b), we can easily find that there are also two SFs in the ADLD of the sequence pair (human, gorilla). But, different from Figure 2(a), the two SFs in Figure 2(b) are both ASFs: one is ASF_1, the line segment from the point (1, 1) to the point (104, 104), and the other is ASF_2, the line segment from the point (106, 106) to the point (121, 121). In addition, the two ASFs in Figure 2(b) are separated by one gap, and there exist no FPs or BSFs on the AT or BTs.
Through analysis, we find that, for a given protein sequence pair, if there are deletions or insertions of amino acid segments between the two protein sequences, then there will be misalignments of SFs in their ADLD; that is, some ASFs on the AT will be shifted onto BTs and become BSFs. In addition, if there are substitutions of amino acids between the two protein sequences, then in their ADLD there will be gaps between neighboring SFs or FPs on the AT. Furthermore, if there are insertions, deletions, or substitutions of amino acid segments at the end of the two protein sequences, then the corresponding end region of their ADLD will contain no SFs or FPs on the AT or BTs.
From the above descriptions, the ADLD of a given protein sequence pair obtained by our method reflects specific inner differences between the two protein sequences, which may be useful in the similarity/dissimilarity analysis of protein sequence pairs.

Method for Similarity/Dissimilarity Analysis of Protein Sequences Based on the ADLDs.
According to the above analysis, the ADLDs may be useful in analyzing the differences of the inner structures of protein sequence pairs. In this section, we show how to utilize the ADLDs to analyze the similarity/dissimilarity of a group of protein sequences.
Generally, suppose that there are $n$ protein sequences $\{\Psi_1, \Psi_2, \ldots, \Psi_n\}$; then, when applying the ADLDs to analyze the similarity/dissimilarity of these sequences, the similarity/dissimilarity matrix of these sequences can be obtained through the following steps.
Step 2. For a given protein sequence pair $\{\Psi_i, \Psi_j\}$, $i \in \{1, 2, \ldots, n\}$, $j \in \{1, 2, \ldots, n\}$, we can obtain their ADLD by adopting the method proposed in Section 2.6, and then we can obtain all of the SFs (including ASFs and BSFs) and FPs in the ADLD. Hence, we can obtain the lengths of these ASFs, the lengths of these BSFs, and the number of these FPs, respectively.
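The quantities gathered in Step 2 can be collected with a small helper such as the one below. This is a sketch under stated assumptions: each SF is represented as a hypothetical tuple (offset, start, length) with offset 0 on the AT, and the length of an SF is taken to be the number of APs it spans. The example values follow the description of Figure 2(b), where the two ASFs run from (1, 1) to (104, 104) and from (106, 106) to (121, 121) and there are no FPs or BSFs. How these quantities are combined into a distance is not shown here.

```python
def summarize_adld(fragments, free_points):
    """Total ASF length, total BSF length, and FP count of one ADLD."""
    asf_length = sum(length for offset, _, length in fragments if offset == 0)
    bsf_length = sum(length for offset, _, length in fragments if offset != 0)
    return {
        "asf_length": asf_length,
        "bsf_length": bsf_length,
        "fp_count": len(free_points),
    }


# ADLD of (human, gorilla) as described for Figure 2(b): two ASFs, no BSFs, no FPs.
human_gorilla = [(0, 1, 104), (0, 106, 16)]
summary = summarize_adld(human_gorilla, free_points=[])
```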
Observing Table 6, it is easy to find that there are some similar pairs such as (c-chim, pi-chim) with the distance 0.0510, (human, c-chim) with the distance 0.0814, (human, pi-chim) with the distance 0.0720, (gorilla, c-chim) with the distance 0.0865, (gorilla, pi-chim) with the distance 0.0833, and (fin-whale, blue-whale) with the distance 0.0324. Among the mammals, the opossum seems to be peculiar, since no distance between it and the remaining mammals is below 0.4023. This result is consistent with the fact that the opossum is the most remote species from the remaining mammals.
Additionally, gallus seems to be more peculiar than the opossum, since no distance between it and the remaining animals is below 0.4423, which is larger than 0.4023 (the shortest distance between the opossum and the remaining mammals). This result is consistent with the fact that gallus is not a mammal.
Therefore, the results illustrated in Table 6 are consistent with the known facts of evolution. That is to say, our ADLD-based method can be utilized as an effective way to analyze the similarities/dissimilarities of protein sequences.

The Phylogenetic Tree of the Protein Sequences Based on the ADLDs.
A phylogenetic tree is a diagram that is used to represent the evolutionary relationships of organisms that are thought to have a common ancestry, and it is a commonly used tool for researchers in some fields to help them analyze the clustering of different species.
However, it is not very convenient to distinguish the similarity/dissimilarity of protein sequences by inspecting only the similarity/dissimilarity matrix illustrated in Table 6. Therefore, in order to show the similarity/dissimilarity of the protein sequences more vividly and intuitively, we construct the phylogenetic tree of the above 16 ND5 proteins from the matrix in Table 6 using the software MEGA 6.06 provided by Tamura et al. [41], and the result is illustrated in Figure 3.
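To illustrate how a tree can be derived from a distance matrix, the following Python sketch performs average-linkage (UPGMA) clustering. This is an assumption made for illustration only: the tree-building option used in MEGA is not stated here, and the function name `upgma` is hypothetical. The three distances are the values quoted from Table 6 for human, c-chim, and pi-chim.

```python
def upgma(names, dist):
    """Average-linkage clustering of a small distance table.

    dist maps frozenset({a, b}) to the distance between species a and b.
    Returns the merges in order as (cluster_x, cluster_y, average_distance),
    where each cluster is a tuple of species names.
    """
    clusters = [(name,) for name in names]

    def average_distance(x, y):
        total = sum(dist[frozenset((a, b))] for a in x for b in y)
        return total / (len(x) * len(y))

    merges = []
    while len(clusters) > 1:
        candidates = [
            (average_distance(x, y), x, y)
            for index, x in enumerate(clusters)
            for y in clusters[index + 1:]
        ]
        best, x, y = min(candidates, key=lambda item: item[0])
        merges.append((x, y, best))
        clusters = [c for c in clusters if c not in (x, y)] + [x + y]
    return merges


# Distances quoted from Table 6 for three of the 16 ND5 proteins.
table6_subset = {
    frozenset(("c-chim", "pi-chim")): 0.0510,
    frozenset(("human", "c-chim")): 0.0814,
    frozenset(("human", "pi-chim")): 0.0720,
}
merges = upgma(["human", "c-chim", "pi-chim"], table6_subset)
```

On this subset, the two chimpanzees are joined first at distance 0.0510, and human then joins them at the average of 0.0814 and 0.0720.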
From Figure 3, it is obvious that we can not only find out the evolutionary relationships of these 16 ND5 protein sequences visually and intuitively but also know easily that the constructed phylogenetic tree is consistent with the results of the known fact of evolution to some degree.
To further validate the performance of our ADLDs based method, we applied our method to analyze the similarity/dissimilarity of another group of proteins including 29 spike proteins of coronavirus and compared our method with the method proposed by Wen and Zhang [17] based on the above given 16 ND5 proteins and the following 29 spike proteins, respectively. The basic information of the 29 spike proteins is illustrated in Table 7.
For the 29 spike proteins illustrated in Table 7, we construct the phylogenetic tree in Figure 4. Since the spike protein sequences are very long (with more than 1100 amino acids), during simulation we set the CAPS threshold to 5 to avoid the effect of noise points.
Generally, coronaviruses can be classified into four classes: Group I, Group II, Group III, and the SARS-CoVs (Severe Acute Respiratory Syndrome Coronaviruses). Among these four classes, Group I includes, among others, the Canine coronavirus (CCoV).
From Figure 4, it is easy to see that the 29 spike proteins of coronavirus can be perfectly classified into the above four classes by our ADLD-based method.
Finally, for the convenience of comparison, we illustrate the phylogenetic trees of the above given 29 spike proteins of coronavirus and 16 ND5 proteins, constructed by adopting the method proposed by Wen and Zhang [17], in Figures 5 and 6, respectively.
Comparing Figure 3 with Figure 6 and Figure 4 with Figure 5, respectively, it is obvious that the phylogenetic trees obtained by the method proposed by Wen and Zhang are quite unreasonable and not consistent with the known facts of evolution at all. But, on the contrary, the phylogenetic trees obtained by our ADLDs based method are not only quite reasonable but also consistent with the known facts of evolution to some degree. Therefore, there is no doubt that the performance of our method is much better than that of the method proposed by Wen and Zhang.

The Analysis of Intuition and Visuality of the ADLDs.
In Section 2.6, we stated that the ADLDs of protein sequence pairs are intuitive and visual. In this section, we further discuss the intuition and visuality of the ADLDs in detail.
From Table 6, we can obtain some similar pairs such as (fin-whale, blue-whale), (pi-chim, c-chim), (human, c-chim), (sheep, goat), (human, pi-chim), and (hare, rabbit) and some dissimilar pairs such as (human, opossum) and (human, gallus) among the above given 16 ND5 proteins. We choose three pairs, (human, gorilla), (human, opossum), and (human, gallus), as examples to further show the intuition and visuality of the ADLDs. The ADLDs of these three pairs are illustrated in Figure 7, with the CAPS threshold set to 3.
Observing Figure 7, we can clearly see that the total lengths of all of the SFs in these three ADLDs satisfy Figure 7(a) > Figure 7(b) > Figure 7(c). Therefore, we can intuitively identify that the similarities of the three protein sequence pairs satisfy (human, gorilla) > (human, opossum) > (human, gallus).
Moreover, from Figure 7, we can also intuitively identify that the two protein sequences in the pair (human, gorilla) are very similar to each other, since the total length of all of the SFs in the ADLD of Figure 7(a) looks very long. On the contrary, the two protein sequences in either the pair (human, opossum) or the pair (human, gallus) are apparently dissimilar to each other, since the total lengths of all of the SFs in the ADLDs of Figures 7(b) and 7(c) both look very short.
By counting, the actual total lengths of all of the SFs in the ADLDs of the three protein sequence pairs (human, gorilla), (human, opossum), and (human, gallus) are 556, 288, and 248, respectively.
Additionally, observing Figures 2(a) and 2(b), we can hardly distinguish the total length of all of the SFs (including ASFs and BSFs) in the ADLD of Figure 2(a) from that in the ADLD of Figure 2(b); therefore, we can intuitively identify that the two protein sequences in the pair (chimpanzee, human) are about as similar to each other as the two protein sequences in the pair (human, gorilla).
Hence, from the above descriptions, the ADLDs obtained by our newly proposed method are quite visual and intuitive and may be a powerful and effective tool for the visual comparison of protein sequences and of numerical sequences in other research fields.

Conclusions
In this paper, a novel ADLD-based graphical representation of protein sequences is proposed and utilized to analyze the similarity/dissimilarity of protein sequences. To validate the performance of the new method, we select two groups of well-known protein sequences as examples and, in order to observe the similarity/dissimilarity of protein sequences more intuitively, construct their phylogenetic trees. The results show that our ADLD-based method not only performs well in the similarity/dissimilarity analysis of protein sequences but also does not require complex computation, since no high-dimensional matrices are required. Therefore, our ADLD-based method can work well in the analysis of protein sequences.