Phylogenetic Analysis of H7N9 Avian Influenza Virus Based on a Novel Mathematical Descriptor

A new mathematical descriptor was proposed based on 3D graphical representation. Using the method, we construct the phylogenetic trees of nine proteins of H7N9 influenza virus to analyze the originated source of H7N9. The results show that the evolution route of H7N9 avian influenza is from America through Europe to Asia. Furthermore, two samples collected from environment in Nanjing and Zhejiang and one sample collected from chicken are the sources of H7N9 influenza virus that infected human in China.


Introduction
In February 2013, two patients with severe pneumonia were admitted to Shanghai Fifth Hospital affiliated with Fudan University [1]. Both patients died in one week, respectively. Chinese experts concluded that the patients were infected with novel avian influenza A H7N9 [2,3]. 131 people were diagnosed with novel avian influenza A H7N9 in china according to National Population and Family Planning Commission of China, updated to May 7, 2013. This particular A (H7N9) virus had not previously been seen in either animals or people until 2013 in China [4].
The influenza virus consists of eight negative-strand RNA molecules surrounded by an envelope. The envelope contains the HA and NA proteins. There are 17 known HA subtypes and 10 known NA subtypes. Many different combinations of HA and NA proteins are possible. Six of the eight mRNAs code for single proteins, while the remaining two code for two proteins by differential splicing of the RNA. Each mRNA segment is associated with multiple copies of the nucleocapsid protein (NP) and an RNA polymerase (made from the viral proteins PB1, PB2, and PA). H7N9 avian influenza virus is previously only isolated from birds. No case of human infection has been reported before. The Lancet published a research report, in which the H7N9 avian influenza virus has some types of mutations associated with circulation in humans [5].
Graphical representation of biological sequences is one of the most commonly used models to analyze protein sequences . Since Hamori and Ruskin proposed Hcurve for studying genomic data in 1983, more and more improved graphical representations of DNA sequences were introduced to analyze gene data [6][7][8][9]. Then, a number of the graphical representations of DNA sequences were expanded into protein sequences to describe and analyze protein sequences [9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26]. For instance, analogously to the scheme of Jeffrey for graphical representation of DNA, Randić et al. [11,12], Yang et al. [13], Rasouli et al. [14], and He et al. [15,16] have suggested several 2D graphical representations of protein sequences. According to the indices of some physicochemical properties of the twenty amino acids, some graphical representations of protein sequences have been proposed to compare the similarities/dissimilarities of proteins [17][18][19][20][21][22][23][24][25][26]. Among these graphical representations, 20 amino acids are usually first represented by 20 pregiven vectors. Based on these vectors, an iterated function system is given to generate a curve representing protein sequences. The numerical characterizations of the curves are used to describe corresponding protein sequences. The representation above provides a simple way of viewing, sorting, and comparing various protein sequences.
In this paper, we divided 20 amino acids into five groups based on the two kinds of physicochemical properties of amino acids: charge and polarity. According to the classification, we proposed a novel 3D graphical representation to describe protein sequences. A novel numerical characterization method was suggested, in which each protein sequence can be transformed into eleven vectors. The phylogenetic tree of nine ND5 proteins was constructed to illustrate our approach. The result shows that our method has better representation accuracy and improvement on generation of phylogenetic tree for protein. And then, the phylogenetic trees of nine kinds of protein sequence of H7N9 avian influenza virus were constructed based on this method. The results show that the evolution route of H7N9 avian influenza virus is from America through Europe to Asia. Moreover, the H7N9 influenza virus that infected humans is originated from three strains of which one extracted from chicken in Zhejiang and the others extracted from environment in Nanjing and Zhejiang. Furthermore, all these three strains are originated from the wild bird in Korea which is the earlier source.

Materials.
For our analysis, we collected complete protein sequences (HA, M1, M2, NP, NS1, NS2, PA, PB1, and PB2) of the H7N9 influenza virus from NCBI-Flu database with release date updates to June 14, 2013. The number of NA, PA-X, NEP, PB1-N40, and PB1-F2 protein sequences of the H7N9 influenza virus is too small to analyze phylogeny of H7N9 influenza virus. Thus, we took 26 HA protein sequences, 27 M1 protein sequences, 25 M2 protein sequences, 27 NP protein sequences, 28 NS1 protein sequences, 25 NS2 protein sequences, 24 PA protein sequences, 21 PB1 protein sequences, and 21 PB2 protein sequences to construct the phylogenetic tree for H7N9 avian influenza virus. These proteins come from 30 H7N9 avian influenza virus strains around the world.

3D Graphical Representation.
Amino acids are the structural units that make up proteins. The physicochemical properties of amino acids in a protein are very important factors for three-dimensional structure and chemical reactivity of protein. In the section, we consider two physicochemical properties of amino acids: polarity and charge.
In the work of Huang et al. [23], 20 amino acids were divided into five groups according to polarity and charge: Both the first sort and second sort of amino acids are nonpolar amino acids, in which the first sort of amino acids is nonpolar aliphatic amino acids and the second sort is nonpolar aromatic amino acids. The third sort and the fourth sort of amino acids belong to uncharged amino acids. The fifth sort of amino acids is composed of positive charged amino acids (lysine and arginine) and negative charged amino acids (aspartic and glutamic).
In this paper, we also adopt the above classification for 20 amino acids. Given a protein sequence = (1) (2) ⋅ ⋅ ⋅ ( ) with the length , ( ) means the th amino acids of the protein sequence; inspect it by stepping one amino acid at a time. Using the classification above, the protein sequence is transformed into points in 2D space by (1) For (1), the parameters , , and are arbitrary constants. In the paper, we take = 3, = 10, and = −30. We call the graph as the PC graph of , which is denoted as .
For , we take the points on the line = 20 as a subsequence (1) (1) by stepping one amino acid at a time, for step ( = 1, 2, . . . , 1 ), the vertex can be constructed according to iterated function where ⃗ denotes the vector corresponding to the th point. Connecting the adjacent points, a curve can be obtained in 2D space for protein sequence (1) , called (1) . Then, we construct the L/L matrix to describe numerically the curve (1) . The elements of the L/L matrix are defined as the quotient of the Euclidean distance between a pair of vertexes of the curve (1) and the sum of distances between the same pair of vertexes measured along the curve (1) . The leading eigenvalue of the L/L matrix is computed, which is denoted as (1) .
Finally, we construct a 3D graphical representation of protein sequence based on the graph and ( ) ( = 1, 2, 3, 4 and 5) as follows: where ( ) is the quotient of the length of subsequence (1) and protein sequence with the length . Connecting the adjacent points, we can obtain a curve in 3D space for protein sequence . The 3D graphical representation can be proved to be acyclic and nondegenerate in mathematics. Therefore, the correspondence between protein sequences and graphical curves is one to one.

Numerical Characterization
In order to numerically characterize a protein, a novel sequence descriptor is introduced as follows.   ( 5 ), respectively, which is denoted as ( , , ) ( = 1, 2, 3, 4, 5) correspondingly to five lines. For the coordinate of the center point of all points on five lines 1 -5 in , we denote it as ( 6 , 6 , 6 ). Similarly, ( 7 , 7 , 7 ) denotes the center point of all points on four lines 1 -4 , ( 8 , 8 , 8 ) the center point of all points on three lines 3 -5 , and ( 9 , 9 , 9 ) the center point of all points on two lines 1 -2 , respectively. For the points on the line 5 , we divide them into two parts according to the charge, positive or negative. One contains {K, R}; the other contains {D, E}. Using the same calculation above, two center points are obtained, ( 10 , 10 , 10 ) and ( 11 , 11 , 11 ).
Thus, protein sequences can be described by 11 vectors, ( , , ) ( = 1-11). For two protein sequences and , we define their distance as the following equation: The smaller the distance is, the closer the two protein sequences are.
3.1. Similarities/Dissimilarities of 9 ND5 Proteins. As discussed above, the similarity of sequences can be compared with the distance among them. To illustrate our method, we consider the numerical characterization of mutations and analyze the similarities among sequences belonging to nine ND5 proteins: human (Homo sapiens, P03915), gorilla (Gorilla gorilla, P03917), common chimpanzee (Pan troglodytes, Q35648), pygmy chimpanzee (Pan paniscus, P03916), fin whale (Balaenoptera physalus, P24978), blue whale (Balaenoptera musculus, P41299), rat (Rattus norvegicus, P11661), mouse (Mus musculus, P03921), and opossum (Didelphis virginiana, P41309), whose sequence data were all downloaded from UniProtKB. The distances among proteins were calculated using (3). If the total number of proteins is , a real symmetric × distance matrix is constructed, whose element is used to reveal the evolutionary distance between protein sequences and .
Using the UPGMA method [27], the phylogenetic tree is obtained based on the distances between each pair of ND5 proteins, shown in Figure 1, which is consistent with the results obtained with ClustalW methods and some other methods proposed recently [19][20][21][22][23]. The results show that our approach has better representation accuracy and improvement on generation of phylogenetic tree for protein.

Result and Discussion
The avian influenza A virus genome is composed of eight single (nonpaired) RNA strands that can code for up to 14 proteins. For H7N9 influenza virus, we can compare the similarities of nine kinds of protein sequences based on their distances in the section. Using UPGMA method, the phylogenetic tree was obtained, shown in Figures 2-10. The phylogenetic trees of H7N9 genes are consistent with other published studies [4,5,28,29]. Figure 2 shows that the virus strains isolated from the humans are grouped together in one clade. The virus strains, A/Nanjing/1/2013, A/Zhejiang/HZ1/2013, and In Figure 3, 9 virus strains isolated from the humans are also grouped together in a clade which contains  In Figure 5, the evolution route of 28 strains of viruses suggests that the strain A/environment/Nanjing/2913/2013 is the source of NS1 protein isolated from humans, whose earlier source is the strain A/emperor goose/Alaska/44063-061/2006 or A/wild bird/Korea/A9/2011. China, Korea, and Alaska are about thousands of kilometers of distance away from each other and human beings could be infected and some of patients contacted poultry before [28]. It may indicate that the strains are sourced probably through poultry trade between the above places.
In Figure 6, the phylogenetic tree of NS2 protein is similar to that of NS1. The strain A/Shanghai/4664T/2013 originates from A/chicken/Zhejiang/DTID-ZJU01/2013 and the strain A/environment/Nanjing/2913/2013 is the source of another NS2 protein of H7N9 virus extracted from human.
Observing the group of virus strains extracted from human in Figure 7, we can see that the virus strain A/chicken/Zhejiang/DTID-ZJU01/2013 is most probably the source of PB1 protein of H7N9 virus extracted from human.
In Figure 8, the PB2 protein sequences of three H7N9 virus strains A/chicken/Zhejiang/DTID-ZJU01/2013, A/environment/Nanjing/2913/2013, and A/environment/Hangzhou/ 34/2013 are identical, which implies their source should be the same virus. Observing Figure 8, we can see that A/Anas crecca/Spain/1400/2008 is the source of PB2 protein of H7N9 virus extracted from human.  All M1 protein sequences of H7N9 virus extracted from China are identical except A/Zhejiang/HZ1/2013, while the other avian H7N9 viruses also possess identical M1 protein sequences except A/Wild bird/Korea/A9/2011 as shown in Figure 9.
There are also two main branches roughly for Figure  In summary, three strains, A/environment/Nanjing/ 2913/2013, A/chicken/Zhejiang/DTID-ZJU01/2013, and A/ environment/Hangzhou/34/2013, are three main sources of H7N9 virus that infected humans. The phylogenetic trees of PB2 and M1 show that they have identical amino acid sequences. We conclude that three strains originated from the same source.
In addition, the H7N9 outbreak at the years 1988, 1995, 1999, 2000, 2006, 2008, 2009, 2011, and 2013 is shown in Figure 11 according to records available in the NCBI-Flu database. The time interval between two outbreaks is 7, 4, 1, and 6 years before the year 2006 and every two years afterwards. The outbreak of H7N9 is more and more frequent, which implies that H7N9 evolution is speeding up.
Observing each distance matrix , we can see the same protein sequences are identical. For instance, we find that the strains A/Fujian/1/2013, A/Shanghai/02/2013, and A/Zhejiang/DTID-ZJU01/2013 have identical HA protein sequences, and the HA protein of strain A/Nanjing/1/ 2013 duplicated in strains A/Zhejiang/HZ1/2013 and A/environment/Nanjing/2913/2013. All identical protein sequences are listed in Table 1.
From the perspective of geography [30], we show identical sequences (NS1, NS2, M1, and M2) occurring in different locations connected by arrow lines in Figure 12. Observing the date and location of occurrence for each identical protein sequence above, we find that the earliest H7N9 avian influenza virus comes from America, spreading to Europe and finally arriving in Asia. For example, M1 protein sequence of H7N9 avian influenza started from Minnesota (1988), passing Delaware Bay (1995), Delaware (2000), Alaska (2006), and Ohio (2006), then Guatemala (2008), Spain (2008), and Czech Republic (2009), and finally arrived at Korea in Figure 11 (2011). In another instance, the evolution route of NS1 protein sequences of H7N9 avian influenza virus is also from America (Minnesota, 1988;Guatemala, 2008) through Europe (Czech Republic, 2009) to Asia (Korea, 2011).
Comparing the results of phylogenetic trees of nine kinds of H7N9 protein, we can see that the evolution route of H7N9 avian influenza virus is from America through Europe to Asia.

Conclusion
The physicochemical properties of amino acids determine three-dimensional structure and the biological activity of the protein. In the paper, according to charge and polarity of 20 amino acids, we divided them into five sorts. Then a novel 3D graphical representation was proposed, called Curve, for a protein sequence. Using the defined distance, we obtain similarity matrix by computing the distance of nine kinds of H7N9 protein sequences. By the single linkage method, their phylogenetic trees are constructed based on similarity matrix . To show the utility of the approach, the phylogenetic tree of nine ND5 proteins was constructed.
Furthermore, the phylogenetic trees of nine kinds of protein sequences of H7N9 avian influenza virus were constructed based on this method. From the phylogenetic trees of nine kinds of H7N9 protein sequences, we deduce that two samples collected from environment in Zhejiang and Nanjing and one sample collected from chicken in Zhejiang are sources of strains that infected humans in 2013. The phylogenetic tree of PB2 and M1 shows that the three strains may originate from the same source. The result is supported by the geographic analysis of virus outbreak record.