Measuring Similarity among Protein Sequences Using a New Descriptor

The comparison of protein sequences according to similarity is a fundamental aspect of today's biomedical research. With the development of sequencing technologies, the number of protein sequences in public databases is increasing exponentially. The best-known sequence comparison methods are alignment based. They generally give excellent results when the sequences under study are closely related, but they are time consuming. Herein, a new alignment-free method is introduced. Our technique depends on a new graphical representation and descriptor. The graphical representation is a simple way to visualize protein sequences. The descriptor compresses the primary sequence into a single vector composed of only two values. Our approach gives good results with both short and long sequences within a short computation time. It is applied to nine beta globin, nine ND5 (NADH dehydrogenase subunit 5), and 24 spike protein sequences. Correlation and significance analyses are also introduced to compare our similarity/dissimilarity results with the results of other approaches and with sequence homology.


Introduction
Information encoded in the genome of any organism plays a central role in defining the life of that organism. The nucleotide sequence that forms any gene is translated into its corresponding amino acid sequence. This sequence of amino acids becomes functional only when it adopts its tertiary structure. Experimental methods such as X-ray diffraction and nuclear magnetic resonance are considered authoritative ways of obtaining protein structure and function. These experimental methods are very expensive and time consuming. Therefore, computational methods for predicting protein structure have become very useful. Proteins with similar sequences are usually homologous, typically displaying similar 3D structure and function.
Sequence alignment is the first step of 3D structure prediction for protein sequences. Sequence comparison approaches are classified into alignment-based and alignment-free methods. BLAST (basic local alignment search tool) and ClustalW are the most widely used computer programs for alignment-based approaches [1][2][3]. These programs provide an approximate solution to the protein alignment problem. On the other hand, many alignment-free approaches have been proposed for sequence comparison. Most biological sequence analysis methods still have weaknesses, including low precision and long running times [4,5].
Similarity/dissimilarity analysis of biological sequences is used to extract information stored in the protein sequence. Many mathematical schemes have been proposed to this end. Graphical representations of biological sequences identify the information content of any sequence to help biologists choose another complex theoretical or experimental method. Graphical representation provides not only visual qualitative inspection of gene data but also mathematical characterizations through objects such as matrices.
Some 2D and 3D graphical representations are created by selecting a geometrical object that is used to describe nucleic acid bases or residues [6][7][8][9][10]. Others are based on assigning vectors of two or three components to nucleic acid bases or amino acids [11][12][13][14][15][16][17]. Adjacency matrices are also introduced in some articles [18][19][20][21], where an exact solution is obtained to the protein alignment problem. Additional methods use the discrete Fourier transform (DFT), in which DNA sequences are mapped into four binary indicator sequences, followed by the application of the DFT to these indicator sequences to transform them into the frequency domain [22,23]. Dynamic representation is used to remove degeneracies in the previously mentioned approaches [24][25][26][27][28][29][30][31]. Another method is based on the simplified pulse-coupled neural network (S-PCNN) and Huffman coding, where the triplet code is used as a code bit to transform a DNA sequence into a numerical sequence [32].
In this study, we introduce a new alignment-free method for protein sequences. Each amino acid in the protein sequence is represented by a number, and a new 2D graphical representation is suggested. A new descriptor is introduced, comprising a vector composed of the mean and standard deviation of the combined intensity level of each protein sequence, ($\bar{A}_t$, $S_{A_t}$). Our graphical representation eliminates degeneracy and has no loss of information. It is suitable for both short and long sequences. As a proof of concept, our approach is applied to nine beta globin protein sequences and nine ND5 (NADH dehydrogenase subunit 5) protein sequences. It can be applied to sequences of any length with the same efficiency. Correlation and significance analyses are introduced to compare our results with PID% [15] and ClustalW [33] and so demonstrate the utility of our approach.

Dataset, Technology, and Tools
All the protein sequences used in this study were downloaded from the National Center for Biotechnology

2D Graphical Representation
A new 2D graphical representation is introduced. Each amino acid in a protein sequence is represented by a suggested intensity $Y_x(i)$ and intensity level $A_x(i)$. The intensity $Y_x(i)$ of each amino acid in the sequence depends on its abundance and its location in the sequence. It is calculated using equation (1), where $f_x$ is the frequency of amino acid $x$ in the sequence (the number of occurrences of $x$ divided by $N$), $N$ is the protein sequence length (the number of residues in the protein sequence), and $i$ is the position of amino acid $x$ in the sequence. Then, the intensity level $A_x(i)$ of each amino acid $x$ in the sequence is calculated by using the natural logarithm function, as in equation (2). Therefore, each amino acid has its own intensity level, which is a vector of $N$ elements according to equation (2). Finally, the combined intensity level of the protein sequence, $A_t(i)$, is obtained by summing the 20 intensity-level vectors $A_x(i)$ of the protein sequence, as in equation (3):
$$A_t(i) = \sum_{x} A_x(i), \qquad i = 1, \dots, N, \tag{3}$$
where the sum runs over the 20 amino acids. The combined intensity level $A_t(i)$ is also a vector of $N$ elements. Each amino acid has its own graph, so twenty graphs are obtained for each sequence, one per amino acid. The combined graph is obtained by merging these 20 graphs into a single graph. This combined intensity level is our new 2D graphical representation.
Our approach is first applied to two short protein segments from the yeast Saccharomyces cerevisiae:
Protein I: "WTFESRNDPAKDPVILWLNGGPGCSSLTGL"
Protein II: "WFFESRNDPANDPIILWLNGGPGCSSFTGL"
These two short proteins consist of 30 amino acids each. The two sequences differ at positions 2, 11, 14, and 27. The values $Y_x(i)$ and $A_x(i)$ for each amino acid in the two sequences are calculated. For protein I, the amino acid G occurs four times in the sequence, at positions 20, 21, 23, and 29, so its frequency $f_G$ equals 4/30. By substituting in equations (1) and (2), the results of $Y_G(i)$ and $A_G(i)$ are obtained, as presented in Table 4.
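The counts quoted for the two yeast segments can be checked with a few lines of Python. This is only a sketch, not the authors' code: the helper names `positions`, `frequency`, and `differing_positions` are illustrative, positions are 1-based as in the text, and the sequences are entered as 30 residues each, on the assumption that the hyphen in the printed strings is a line-break artifact.

```python
from collections import Counter

# The two 30-residue segments quoted in the text (hyphen assumed to be an artifact).
PROTEIN_I = "WTFESRNDPAKDPVILWLNGGPGCSSLTGL"
PROTEIN_II = "WFFESRNDPANDPIILWLNGGPGCSSFTGL"


def positions(seq, residue):
    """Return the 1-based positions at which `residue` occurs in `seq`."""
    return [i for i, aa in enumerate(seq, start=1) if aa == residue]


def frequency(seq, residue):
    """Return f_x: the number of occurrences of `residue` divided by N."""
    return Counter(seq)[residue] / len(seq)


def differing_positions(a, b):
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


if __name__ == "__main__":
    print(len(PROTEIN_I), len(PROTEIN_II))             # 30 30
    print(positions(PROTEIN_I, "G"))                   # [20, 21, 23, 29]
    print(frequency(PROTEIN_I, "G"))                   # 4/30
    print(differing_positions(PROTEIN_I, PROTEIN_II))  # [2, 11, 14, 27]
```

The frequency $f_x$ computed this way is the only input to equation (1) that depends on the whole sequence; the position $i$ comes directly from the enumeration.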
By summing the values of $A_x(i)$ for all amino acids in protein I, the total value of $A_t(i)$ is obtained, as shown in Figure 1(a). The position $i$ of each amino acid is plotted on the x-axis, and the total intensity level $A_t(i)$ is plotted on the y-axis. We next apply our approach to nine beta globin and nine ND5 (NADH dehydrogenase subunit 5) protein sequences, which are listed in Tables 1 and 2. The 2D graphical representation for human, chimpanzee, and opossum beta globin protein sequences is illustrated in Figure 2, and the representations for fin whale and rat ND5 protein sequences are illustrated in Figures 3(a) and 3(b), respectively. We finally apply our approach to 24 coronavirus protein sequences, which are listed in Table 3. The 2D graphical representations of TGEVG from class I and GD03T0013 from SARS-CoV are illustrated in Figures 4(a) and 4(b), respectively.

Protein Sequence Descriptor
Mathematical descriptors help in recognizing major differences among similar protein sequences quantitatively. A new descriptor for protein sequences is suggested, which is a vector composed of the arithmetic mean $\bar{A}_t$ and standard deviation $S_{A_t}$ of the combined intensity level $A_t(i)$ of the protein sequence. Both are evaluated over the $N$ positions of the sequence: the mean is $\bar{A}_t = \frac{1}{N}\sum_{i=1}^{N} A_t(i)$, and $S_{A_t}$ is the standard deviation of the values $A_t(i)$ about that mean. This descriptor compresses the information from the primary protein sequence into a single vector composed of only two values. The beta globin, ND5, and coronavirus protein sequence descriptors are listed in Tables 5-7, respectively.
Table 7 shows that the mean for all 24 coronaviruses is around 38.7, ranging from 38.601 to 38.838, while the standard deviation varies according to class. The coronaviruses are divided into four classes. The first four viruses belong to class I. The fifth to the ninth belong to class II. Class III contains the tenth and eleventh viruses. The remaining viruses, from the 12th to the 24th, belong to SARS-CoV. According to our approach, the standard deviation of class I ranges from 10.94 to 11.17, that of class II from 10.68 to 10.77, and that of class III from 10.6271 to 10.6458, while the standard deviation of SARS-CoV is approximately 10.58. Because these ranges do not overlap, the standard deviation values of the 24 coronaviruses classify them correctly into the four classes. The class ranges according to our approach are shown in Figure 5.
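The two-value descriptor can be sketched as follows, assuming the combined intensity level is already available as a list of $N$ numbers. The function name `descriptor` and the `sample` switch are illustrative choices, not part of the original method. Whether the standard deviation uses the population ($1/N$) or the sample ($1/(N-1)$) form is an assumption here, so both are offered. The input values are made up purely to exercise the function and are not taken from any table.

```python
import statistics


def descriptor(combined_levels, sample=False):
    """Return (mean, standard deviation) of a combined intensity-level vector.

    `sample=False` uses the population form (divide by N); `sample=True`
    uses the sample form (divide by N - 1). Which form the original
    descriptor uses is an assumption of this sketch.
    """
    values = list(combined_levels)
    if len(values) < 2:
        raise ValueError("need at least two values")
    mean = statistics.fmean(values)
    spread = statistics.stdev(values) if sample else statistics.pstdev(values)
    return mean, spread


if __name__ == "__main__":
    toy_levels = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
    print(descriptor(toy_levels))               # population form
    print(descriptor(toy_levels, sample=True))  # sample form
```

For sequences as long as those studied here (more than a hundred residues), the two forms differ only slightly, so the choice should not change the ranking of distances much.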

Similarity/Dissimilarity Analysis
To compare the species' protein sequences, the Euclidean distance between species' descriptors is evaluated. For example, the descriptor of the human beta globin protein sequence is (37.145, 11.505), and that of the chimpanzee beta globin protein sequence is (36.912, 11.586). To measure the degree of similarity between human and chimpanzee, the Euclidean distance between these two vectors is evaluated; a smaller distance indicates more similar sequences. The similarity/dissimilarity matrices of beta globin and ND5 protein sequences are given in Tables 8 and 9, respectively. Table 8 shows that human and chimpanzee sequences are similar. There is also a striking similarity between mouse and rat sequences, while human and opossum sequences are clearly dissimilar.
Beta globin protein sequences (Table 1):
No.  Species  ID        Length
1    Gorilla  CAA43421  121
2    Chimp    CAA26204  125
3    Human    AAA16334  147
4    Rat      CAA29887  147
5    Mouse    CAA24101  147
6    Gutta    ACH46399  147
7    Duck     CAA33756  147
8    Gallus   CAA23700  147
9    Opossum  AAA30976  147
Table 9 shows that pigmy chimpanzee, common chimpanzee, human, and gorilla ND5 protein sequences are similar, while the blue whale is similar to the fin whale, and mouse is similar to rat. As with beta globin, human and opossum remain dissimilar. However, our algorithm does not measure the degree of similarity well for pigmy chimpanzee. The distance between human and pigmy chimpanzee is 0.1826, while the distance between human and gorilla is 0.0575, as shown in Table 9. The results of both Tables 8 and 9 are broadly comparable to previous reports [13, 15, 21, 33-39].
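The distance between the two beta globin descriptors quoted in this section can be worked through directly. The sketch below uses only the human and chimpanzee values given in the text. The helper `distance_matrix` is an illustrative name and shows how a full similarity/dissimilarity matrix would be assembled from any set of descriptors.

```python
import math

# Descriptors (mean, standard deviation) quoted in the text.
HUMAN = (37.145, 11.505)
CHIMP = (36.912, 11.586)


def euclidean(u, v):
    """Euclidean distance between two descriptor vectors."""
    return math.dist(u, v)


def distance_matrix(descriptors):
    """Return {(name_a, name_b): distance} for every unordered pair of species."""
    names = sorted(descriptors)
    return {
        (a, b): euclidean(descriptors[a], descriptors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


if __name__ == "__main__":
    # sqrt(0.233**2 + 0.081**2), about 0.2467
    print(round(euclidean(HUMAN, CHIMP), 4))
    print(distance_matrix({"human": HUMAN, "chimp": CHIMP}))
```

Because the descriptor has only two components, each pairwise comparison costs a constant amount of work regardless of sequence length.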

The Phylogenetic Tree of the Protein Sequences Based on Our Method
We obtained the phylogenetic trees of beta globin and ND5 protein sequences by applying UPGMA (Unweighted Pair Group Method with Arithmetic Mean). The phylogenetic trees based on Tables 8 and 9 of our method are presented in Figures 6 and 7, respectively. Figure 6 illustrates the utility of our similarity/dissimilarity analysis for beta globin protein sequences. Figure 7 shows our similarity/dissimilarity analysis of ND5. As mentioned above, our algorithm does not measure the degree of similarity well between pigmy chimpanzee and human. This is evident in Figure 7, where the P. chimp branch should be close to C. chimp. Despite this error, the tree shows that human, common chimpanzee, pigmy chimpanzee, and gorilla belong to the same cluster. To check the effect of this error on our algorithm, the results of our algorithm are compared with sequence homology. A correlation and significance analysis is also provided.
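For readers unfamiliar with UPGMA, the following is a compact pure-Python sketch of the clustering step. It is not the software used to draw Figures 6 and 7. The function name `upgma`, the nested-tuple tree format, and the four-taxon distance matrix are all illustrative, and the distances are made up and do not come from Tables 8 or 9. At each step the two closest clusters are merged, and distances to the new cluster are the size-weighted average of the distances to its two parts.

```python
def upgma(labels, dist):
    """Cluster with UPGMA and return the tree as nested tuples.

    `labels` is a list of taxon names; `dist` is a symmetric matrix
    (list of lists) of pairwise distances in the same order.
    """
    n = len(labels)
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (labels[i], 1) for i in range(n)}
    # pairwise distances keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        del d[(a, b)]
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            # size-weighted average keeps every leaf equally weighted
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


if __name__ == "__main__":
    labels = ["A", "B", "C", "D"]  # hypothetical taxa
    dist = [
        [0.0, 0.1, 1.0, 1.0],
        [0.1, 0.0, 1.0, 1.0],
        [1.0, 1.0, 0.0, 0.2],
        [1.0, 1.0, 0.2, 0.0],
    ]
    print(upgma(labels, dist))  # (('A', 'B'), ('C', 'D'))
```

In this toy matrix A and B are closest and merge first, then C and D, and the two pairs join last, which is the same kind of grouping the text describes for the primate cluster in Figure 7.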

Our Method Compared to PID% and ClustalW Results
The results of our algorithm are compared with sequence homology by two methods. First, we use the Smith-Waterman algorithm to calculate the number of identical residues in each pair of protein sequences [15]. The PID% results for the nine beta globin sequences are presented as a similarity/dissimilarity matrix in Table 10. A larger PID% indicates more similar protein sequences. A correlation and significance analysis is provided to compare our approach in Table 8 with PID% in Table 10. The correlation between two sets of data is considered sufficiently strong when the absolute value of the correlation coefficient (r) is greater than 0.7. A negative sign of r indicates that when the first data set increases, the second data set decreases. We then assess statistical significance for correlation coefficients stronger than 0.7 to ensure that they are unlikely to occur by chance. Our sample set is composed of nine protein sequences. Therefore, we use 7 degrees of freedom. A t-value of 2.385 or greater indicates a probability of less than 0.05 that the result occurred by chance. The correlation coefficients and t-values for our approach are given in Table 11.
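The sign convention described above can be illustrated with a small Pearson correlation routine. This sketch assumes r is the ordinary Pearson coefficient, which the text does not state explicitly. The function name `pearson_r` is illustrative, and the three distance/PID% pairs are invented for the example and are not values from Tables 8 or 10.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up row: descriptor distances rise while percent identity falls.
    distances = [0.1, 0.5, 0.9]
    pid = [95.0, 80.0, 60.0]
    r = pearson_r(distances, pid)
    print(r < 0, abs(r) > 0.7)  # negative and strong
```

A distance grows as sequences diverge while PID% shrinks, so a strong negative r between the two is the expected outcome when the methods agree.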
Second, ClustalW is a widely used system for aligning any number of homologous nucleotides or protein sequences [33]. e ClustalW program's distance matrix of nine ND5 protein sequences is illustrated in Table 12. Correlation and significance analyses are also provided to compare our approach in Table 9 with ClustalW results in Table 12.
The results of the correlation and significance analyses of our approach and other approaches [15,33] are given in Table 13. Our sample set of ND5 is also composed of nine protein sequences. Therefore, we use 7 degrees of freedom and a t-value of 2.385 or greater. Despite the unusual result for pigmy chimpanzee that appeared in Table 9, the correlation coefficient for pigmy chimpanzee between our similarity matrix and the ClustalW matrix is 0.8811. This value is unlikely to occur by chance, as the t-value equals 4.928, as shown in Table 13. The comparison of our results with PID%, ClustalW, and the results of other approaches indicates the utility of our approach.
Table 11: The correlation and significance analysis between our similarity analysis results of beta globin protein sequences in Table 8 and the PID% similarity matrix in Table 10.
Table 13: The correlation and significance analysis of the ND5 results in Table 9, Table 7 in [33], and Table 3 in [15] against the ClustalW similarity matrix in Table 12. Columns: correlation coefficient (r) and t-value of our approach, of [33], and of [15] (Table 3).
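The t-value reported for pigmy chimpanzee can be approximately reproduced from its correlation coefficient. The sketch assumes the standard test statistic for a Pearson correlation, $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with $n-2$ degrees of freedom. The paper does not spell this formula out, but it gives about 4.93 for r = 0.8811 and n = 9, in line with the reported 4.928. The names `t_value` and `is_significant` are illustrative, and the threshold 2.385 is the one quoted in the text.

```python
import math


def t_value(r, n):
    """t statistic for a correlation coefficient r from n paired samples."""
    if n < 3 or not -1.0 < r < 1.0:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def is_significant(r, n, threshold=2.385):
    """Compare |t| with the threshold quoted in the text for 7 degrees of freedom."""
    return abs(t_value(r, n)) >= threshold


if __name__ == "__main__":
    t = t_value(0.8811, 9)
    print(round(t, 2))                # about 4.93
    print(is_significant(0.8811, 9))  # True
```

Since the statistic depends only on r and the sample size, each row of Tables 11 and 13 can be checked the same way.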

Conclusions
A new graphical representation of protein sequences is introduced. It is the combined intensity level of the 20 amino acids composing any protein sequence. Each amino acid in a given protein sequence has its own intensity and intensity level. They are vectors of $N$ elements, where $N$ is the protein sequence length. The combined intensity level is then computed and graphed to represent the protein sequence. Our 2D graphical representation effectively displays differences between protein sequences without degeneracies. The graph does not overlap or intersect with itself. Our new descriptor is a vector of two elements, the mean and standard deviation of the combined intensity level ($\bar{A}_t$ and $S_{A_t}$). A similarity/dissimilarity analysis is performed by computing the Euclidean distance between each pair of species' descriptors. Examination of similarity/dissimilarity among nine beta globin, nine ND5, and 24 coronavirus protein sequences provided good results compared to previous approaches. The suggested approach is effective for both short and long sequences, and the computations are very simple. Furthermore, loss of sequence information is avoided. Correlation and significance analyses with PID% and ClustalW are also introduced to show the utility of our approach.