Natural Selection Determines Synonymous Codon Usage Patterns of Neuraminidase ( NA ) Gene of the Different Subtypes of Influenza A Virus in Canada

Synonymous codon usage patterns of neuraminidase (NA) gene of 64 subtypes (one is a mixed subtype) of influenza A virus found in Canada were analyzed. In total, 1422 NA sequences were analyzed. Among the subtypes, H1N1 is the prevailing one with 516 NCBI accession records, followed by H3N2, H3N8, and H4N6. The year of 2009 has the highest report records for the NA sequences in Canada, corresponding to the 2009 pandemic event. Correspondence analysis on the RSCU values of the four major subtypes showed that they had distinct clustering patterns in the two-dimensional scatter plot, indicating that different subtypes of IAV utilized different preferential codons. This subtype clustering pattern implied the important influence of natural selection, which could be further evidenced by an extremely flattened regression line in the neutrality plot (GC12 versus G3s plot) and a significant phylogenetic signal on the distribution of different subtypes in the clades of the phylogenetic tree (λ statistic). In conclusion, different subtypes of IAV showed an evolutionary differentiation on choosing different optimal codons. Natural selection played a deterministic role to structure IAV codon usage patterns in Canada.


Introduction
Codon usage is not a random event [1].Codon usage bias has been broadly observed, and different mechanisms have been proposed to explain the bias patterns, for example, mutation pressure, translational efficiency, gene length [2], dinucleotide bias [3], tRNA abundance [4], organ specificity [5], and so on.Codon usage bias patterns have been broadly studied in recent years, especially for virus genomes [3,6,7].
In recent years, codon usage patterns have been widely explored for influenza viruses [6,[8][9][10][11][12].Among the three influenza viruses, influenza A virus (IAV) is the major concern since it has a lot of subtypes.IAV is a genus of Orthomyxoviridae family of viruses, which caused influenza in birds and mammals [6,13].Among the eight RNA segments of IAV, hemagglutinin and neuraminidase (NA) genes are the principal concerns.
Currently, most of modeling efforts on IAV are focused on Asian regions [6,8]; little attention is paid on the evolutionary patterns of IAV in local areas of North America.To fill such a knowledge gap, in the present study, I analyzed all the available 1436 NA ORFs for IAV found in Canada to reveal the codon usage patterns of IAV different subtypes in Canada.

Materials and Methods
2.1.Sequence Data.1436 NA sequences found in IAV strains of Canada were extracted from NCBI GenBank database (http://www.ncbi.nlm.nih.gov/).Only the open reading frames (ORFs) are considered for codon usage bias analysis.The alignment of the large number of NA sequences is done using MAFFT software [14,15].The dataset and the alignment will be available at the online digital repository website: http://datadryad.org/.

Measure of Codon Usage Patterns.
Relative synonymous codon usage values of each codon in a gene are calculated to investigate the characteristics of synonymous codon usage.The RSCU index is calculated as follows [16]: where   is the observed number of the th codon for the jth amino acid which has   kinds of synonymous codons.Higher (or lower) RSCU values, higher (or lower) frequencies of the codons being chosen.When the corresponding RSCU values of a codon are close to 1, it is used randomly and evenly.

Effective Number of Codons.
The effective number of codons (ENCs) is a measure of bias from equal codon usage in a gene [17].The calculation formula is where   (k = 2, 3, 4, 6) is the mean of   values for the k-fold degenerate amino acids, which is estimated using the formula as follows: where  is the total number of occurrences of the codons for that amino acid and where   is the total number of occurrences of the th codon for that amino acid.ENC ranges from 20 for the strongest bias (where only one codon is used for each amino acid) to 61 for no bias (where all synonymous codons are used equally).
For revealing the relationship between GC3s and ENC values, the expected ENC values for different GC3s were calculated as follows: where  denotes the value of GC3s [6].These expected ENC values can form a unimodal curve.As such, the observed and expected ENC values can be compared to determine the influence of nucleotide compositional constraint on structuring synonymous codon usage bias.If the observed data points are on or nearby the null expected ENC curve, then compositional constraint plays the major role in structuring codon usage patterns.If the observed points are fallen below the null curve, then the role of natural selection emerges.

Correspondence Analysis and Correlation Analysis.
Correspondence analysis (COA) was conducted to investigate the major trend involved in the codon usage patterns of the genomes from different virus strains, which were measured by corresponding RSCU values for the 59 codons (after excluding Met, Trp, and three termination codons) [7].COA is performed using "ca" package [18] under R environment [19].
The first two axes of COA (CA1 and CA2) were then subjected to correlation analysis.A Spearman's rank correlation analysis was used to infer the relationships between different codon usage indices and the first two axes (CA1 and CA2) of IAV RSCU values.

Phylogenetic Signals of Different Subtypes of IAV for
NA Gene.Phylogenetic signals revealed the clustering and overdispersion (or randomness) of the different subtypes in the phylogenetic tree.If different subtypes tend to group in the same clade, then the clustering pattern could be found and the phylogenetic signal should be detected.In contrast, if different subtypes tend to disperse across different clades, then the overdispersion pattern could be found and the phylogenetic signal should be minor.I implemented two metrics for testing phylogenetic signals for the major subtypes of IAV found in Canada: Blomberg's K statistic [20] and Pagel's  statistic [21,22].Both tests were implemented in R [19] package "phytools" [23].
For inferring phylogenetic signals, a phylogenetic tree is required.Given that we have over 1400 NA sequences, we used the fastest software, FastTree [24], to construct a maximum likelihood (ML) tree, which is based on the sequences derived from the major subtypes (H1N1, H3N2, H3N8, and H4N6).The reconstructed ML tree will be available from the online digital repository website: http://datadryad.org/.The multichotomies of the tree are resolved using R [19] package "ape" [25], and the dating of divergence time is done using the mean path length method [26].Negative branch lengths have occurred after molecular dating but do influence the inference of phylogenetic signal test.

Count and Temporal Distribution of Subtypes of Influenza
A Viruses in Canada.As shown in Figure 1(a), the most prevailing subtype of IAV in Canada is H1N1 (with 516 records), followed by H3N2 (with 132 records), H3N8 (with 152 records), and H4N6 (with 113 records).
Based on strain records (Figure 1(b)), it was found that the year of 2009 has the highest number of IAV NA genes being sequenced with a number of 531, followed by the year of 2007 (with a number of 204) and the year of 1968 (66 records).

ENC-GC3s and Neutrality Plots.
As shown in Figure 2, most points lay below the null curve considerably, while some points from H1N1 were very far from the null curve, indicating the importance of selection.
Neutrality plot of Figure 3(a) showed that natural selection should play a very crucial role in structuring the overall codon usage patterns across different subtypes of IAV in Canada.The relative influence of mutation pressure is very small on the first and second codon positions (GC12) with respect to that on the third codon position (GC3), as indicated by the slope of the regression line (0.0158).Such a conclusion is enforced since this regression slope is not significant ( > 0.05), which suggests that there is no clear trend between GC12 and GC3 across different subtypes.Such a decoupling is argued to be structured by natural selection [27,28].Furthermore, when the neutrality plots are applied to each of the major subtypes (Figures 3(b), 3(c), 3(e), and 3(f)), most of them are found to have near-zero or negative regression slopes (indicated by correlation coefficients and associated  values).As such, natural selection is exclusively important to structure codon usage pattern of NA gene within a single subtype and between different subtypes of IAV in Canada.

Clustering Patterns of Codon Usage Patterns of Major Subtypes in Canada.
As shown in the COA plot of the first two axes in Figure 4, apparently there are distinct clustering patterns among the first four subtypes of IAV.However, the mixed subtype group showed a dispersing pattern, which could be inferred that these mixed sequences are principally generated from subtypes H3N8 and H4N6 because most mixed sequences were well overlapped with the positions of these two known subtype groups.
With respect to the codon usage bias patterns of the four major subtypes (H1N1, H3N2, H3N8, and H4N6), it is found that they did show distinct preferences on choosing optimal codons (Table 1).For subtype H1N1, the most preferred codons (standard: RSCU > 1.2 being the highest in that subtype) were GCU, CUA, UUA, AGA, and AGU in comparison to other subtypes.For subtype H3N2, the most preferred codons were CAU, CCU, AGC, and UAU, respectively.For subtype H3N8, the most preferred codons were GCA, AUU, AAA, AAU, AGG, and ACU.For subtype H4N6, the most important codons were GAA, UUA, GGA, AUA, UUG, CCA, UCA, GUA, and GUG.Apparently, different subtypes tended to congruently use A-end codons in most cases.Such an observation is consistent with many previous studies working on specific subtypes of IAV [6,8].Interestingly, no most preferred codons were found for mixed subtype group (Table 1).This is completely consistent to the clustering patterns in Figure 4. 2), it is evidenced that the C3s and G3s individually are the principal drivers for determining the first major trend CA1 of codon usage bias patterns of IAV in Canada.Interestingly, GC3s, the combined G3s and C3s, is not a strong determinant of the first major trend CA1.Comparatively, A3s and A contents are the major predictors   of the second major trend of codon usage patterns of different subtypes of IAV.

Phylogenetic Clustering Patterns of Different Subtypes of IAV in
Canada.Blomberg's K statistic failed to return a result because the phylogenetic covariance matrix based on the inferred ML tree is singular when the inverse of the matrix is taken.However, Pagel's  statistic could be implemented.The randomization test (1000 runs) on the  statistic for testing phylogenetic signals showed a significant result ( = 0.886, log-likelihood = −1480.885,and  = 0).The significant phylogenetic structure implied that NA sequences from the same subtype would tend to group in the same clade in the phylogenetic tree.Our study, therefore, is different from previous studies on codon usage patterns of IAV because the previous studies showed that mutation selection should be prevailing [6,8,9].Therefore, it is imperative to compare different subtypes found in different regions (either regional or continental levels) to better quantify the influence of environmental

Conclusions
Because (1) different subtypes of IAV have utilized different optimal codons, (2) flattened regression lines are always found in the neutrality plots, and (3) phylogenetic signal is significant for the distribution of subtypes over the phylogenetic tree.It is concluded that different subtypes of IAV in Canada show evolutionary differentiation patterns, and natural selection plays a central role in structuring codon usage patterns for IAV in the region.

Figure 1 :Figure 2 :
Figure 1: (a) Summary of the number of NA gene sequences for different subtypes of IAV found in Canada up to the year of 2013.(b) Summary of the number of NA gene sequences for different subtypes of IAV sequenced at each year in Canada up to the year of 2013.

Figure 3 :
Figure 3: Neutrality plot between GC3 and GC12 for the NA gene of all and each of the major subtypes in Canada.(a) Neutrality plot for all major subtypes.The fitted regression line has the equation GC12 = 0.0158 × GC3 + 0.87 ( 2 = 0.0002,  = 0.592).(b) Neutrality plot for subtype H1N1.(c) Neutrality plot for subtype H3N2.(d) Neutrality plot for subtype H3N8.(e) Neutrality plot for subtype H4N6.(f) Neutrality plot for mixed subtype.

Figure 4 :
Figure 4: CA plot of RSCU values of NA sequences for different subtypes of IAV in Canada.Different colors indicated different subtypes.Clearly, sequences from the same subtypes tended to group in the same clusters, except for those for the mixed group.
Selection on IAV.Natural selection played a dominant role in structuring different subtypes of NA genes for IAV in Canada.The significant clustering patterns supported by correspondence analysis on the RSCU values and the phylogenetic signaling randomization test clearly support the evolutionary differentiation of NA genes of different subtypes of IAV in Canada.Neutrality plot further evidenced the dominance of natural selection on NA gene in Canada.

Table 1 :
Codon usage comparison between the four major subtypes found in Canada using RSCU values.Only the codons with some RSCU values >1.2 across the four subtypes are considered for comparison of codon usage preference.The largest RSCU value for each of these codons is marked in boldface to see the most preferred subtype.

Table 2 :
The correlation between different codon usage indices and the first two axes of COA analysis.The most correlated variables for each axis of COA are marked in boldface.Significant correlations at the level of  < 0.05 are marked with asterisks.