Compositional Constraint Is the Key Force in Shaping Codon Usage Bias in Hemagglutinin Gene in H1N1 Subtype of Influenza A Virus

Unravelling codon usage bias is vital for gaining insight into the evolutionary forces that drive viral evolution. Influenza A virus has attracted the attention of many investigators over the years because of its high mutation rate and the cross-specific shifts that operate in the viral genome. Several authors have reported that codon usage bias is low in influenza A viruses and have cited mutational pressure as the decisive force shaping codon usage in these viruses. In this study, complete coding sequences of hemagglutinin genes of the H1N1 subtype of influenza A virus were examined for codon usage bias. The results indicate an overall low bias, with high ENC values. The GC content is substantially lower than the AT content at the silent codon sites. Significant correlations were observed between the compositional parameters and AT3, implying a possible role of the latter in shaping the codon usage profile of the viral hemagglutinin. The data clearly showed that the sequences are A-rich, with most codons preferring nucleotide A over the others at the third synonymous codon site. The results indicate a pivotal role of compositional pressure in shaping codon usage in this virus.


Background
Influenza A virus (IAV), a member of Orthomyxoviridae, remains a serious global health concern, having caused a number of epidemics from the early 19th century to date. With several variants of varying pathogenic profile, IAV causes significant mortality every year throughout the globe. In 2009, the world saw only its second global pandemic, an H1N1 pandemic that the World Health Organization (WHO) declared a phase 6 alert level. It was the first of its kind since 1968, when Hong Kong flu was declared a global pandemic by the WHO. Reports say that about 214 countries were affected by the pandemic influenza H1N1 of 2009, which took 18,138 lives, as updated in May 2010 (http://www.who.int/csr/don/2010 06 04/en/index.html).
What makes influenza A such a deadly virus? Generally, upon exposure to a pathogen, the host develops specific immunity against it, thus preventing the same pathogen from infecting it a second time. IAV escapes the specific immunity of the host by a process termed antigenic drift. This is achieved by frequent mutation in the hemagglutinin (HA) and neuraminidase (NA) genes, which encode the main antigenic determinant proteins of the virus; as a result, immunogenically distinct strains develop and cause the seasonal outbreaks [1]. Another process, termed cross-specific shift [2] or reassortment [3] by different authors, is responsible for the frequent changes in the antigenic region of the virus, as happened in the 2009 H1N1 pandemic. The HA, NA, or other gene segments of different IAV subtypes are exchanged, resulting in a novel subtype of IAV. These two genes, HA in particular, confer virulence on the virus, making them potential drug targets for preventing the spread of influenza infection [1].
The degeneracy of the genetic code allows more than one codon to code for the same amino acid. The phenomenon is called synonymous use of codons. The use of synonymous codons, however, is not uniform in different species, ranging from prokaryotes to complex organisms as well as viruses; certain synonymous codons are used preferentially. This skewed use of codons is termed codon usage bias (CUB). With the rapidly growing stockpile of sequences in public databases after whole genome sequencing of a large number of species, investigators have studied codon usage bias in specific genes as well as in whole genomes of a vast range of organisms [4][5][6][7].
International Journal of Genomics
The preferential use of synonymous codons is governed by different evolutionary forces [8]. Over the years many authors have reported a number of measures to assess codon usage bias across genes and genomes. Among these measures, GC content, relative synonymous codon usage (RSCU), and effective number of codons (ENC) are some of the most widely used parameters for studying codon bias. The reasons for the selection of optimal codons in genes have been much debated; many authors have advocated increased efficiency of the translation process as the main reason behind it [9]. However, the exact mechanisms behind synonymous codon variation are yet to be understood clearly.
Several workers have reported that the overall codon usage bias in RNA viruses is low, which is attributed to the GC compositional properties and dinucleotide content of these viruses [5,10,11,12]. Mutational bias has been proposed as the main factor driving codon usage variation among the influenza A viruses, which are phylogenetically conserved [10,12,13].

Materials and Methods
Relative synonymous codon usage (RSCU) [14] is one of the most widely used parameters for querying the pattern of synonymous codon usage across genes and genomes without the confounding influence of amino acid composition. To examine the synonymous codon usage in the genes, RSCU values were calculated. RSCU is defined as the ratio of the observed frequency of a codon to the frequency expected if all the synonymous codons for that amino acid were used equally. If the RSCU value of a codon is more than 1.0, it is said to have a positive codon usage bias, while a value of less than 1.0 means a negative codon usage bias. When the RSCU value is close to 1.0, the codon is chosen randomly and equally with the other synonymous codons.
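The RSCU definition (observed codon count divided by the count expected under equal usage of all synonymous codons) can be illustrated with a minimal Python sketch. This is not the in-house Perl program used in the study; the function name `rscu` and the toy sequence are assumptions made purely for illustration, and the standard genetic code is assumed.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for codons of degenerate amino acids present in cds."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group sense codons into synonymous families, one per amino acid.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if len(family) < 2 or total == 0:
            continue  # Met, Trp, or an amino acid absent from the sequence
        expected = total / len(family)  # count per codon under equal usage
        for codon in family:
            values[codon] = counts[codon] / expected
    return values


# Hypothetical toy sequence (not real hemagglutinin data): 3x GCA, 1x GCG,
# 3x AGA, 1x AGG.
toy = "GCAGCAGCAGCGAGAAGAAGAAGG"
for codon, value in sorted(rscu(toy).items()):
    print(codon, round(value, 2))
```

Because the expected count depends on the size of the synonymous family, the maximum possible RSCU equals the degeneracy of the amino acid (for example, 6 for the six-fold degenerate Arg, Leu, and Ser).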
The effective number of codons (ENC) estimates the magnitude of codon usage bias in a gene [15]. ENC quantifies the synonymous codon usage across the target sequence and is calculated as

$$\mathrm{ENC} = 2 + \frac{9}{\bar{F}_2} + \frac{1}{\bar{F}_3} + \frac{5}{\bar{F}_4} + \frac{3}{\bar{F}_6},$$

where $\bar{F}_k$ ($k$ = 2, 3, 4, or 6) is the average of the $F_k$ values for $k$-fold degenerate amino acids. The $F$ value denotes the probability that two randomly chosen codons for an amino acid are identical. The values of ENC range from 20 (when only one codon is used per amino acid) to 61 (when all synonymous codons are used equally for each amino acid) [15][16][17]. The codon bias is considered low if the ENC value is greater than 40.
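A minimal Python sketch of the ENC calculation follows, using the widely used formulation ENC = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6. Several details are assumptions not spelled out in the text: the per-amino-acid homozygosity is taken as F = (n Σp² − 1)/(n − 1), a missing three-fold class (Ile) is replaced by the mean of the two-fold and four-fold averages, and the result is capped at 61. The function name `enc` is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def enc(cds):
    """Effective number of codons for one coding sequence (sketch)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    # Collect the homozygosity F of each amino acid by degeneracy class.
    f_by_class = defaultdict(list)
    for family in families.values():
        k = len(family)
        n = sum(counts[c] for c in family)
        if k == 1 or n < 2:
            continue  # Met/Trp, or too few codons to estimate F
        sum_p2 = sum((counts[c] / n) ** 2 for c in family)
        f_by_class[k].append((n * sum_p2 - 1) / (n - 1))

    mean_f = {k: sum(v) / len(v) for k, v in f_by_class.items()}
    # Assumed fallback when Ile (the only 3-fold amino acid) is unusable.
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2
    if any(mean_f.get(k, 0) <= 0 for k in (2, 3, 4, 6)):
        raise ValueError("sequence too short or too sparse to estimate ENC")

    value = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(value, 61.0)  # finite-sample estimates can exceed 61
```

A sequence that uses exactly one codon per amino acid gives the lower bound of 20, whereas perfectly even usage in a short sequence overshoots and is capped at 61, which is why the cap is applied.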
Nucleotide composition plays a crucial role in the codon usage pattern of genes because most of the indices of codon usage bias are based on the base composition of the genes. GC3 is the frequency of the nucleotides G+C at the synonymous 3rd positions of the codons, excluding Met, Trp, and the termination codons. Similarly, GC1s and GC2s represent the G+C frequency at the 1st and 2nd codon positions. GC3s is a good indicator of the extent of base composition bias.
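The G+C frequency at synonymous third positions can be sketched in a few lines of Python. The function name `gc3s` and the toy sequence are assumptions for illustration; codons for Met, Trp, and termination are skipped as described, and codons with ambiguous bases are also skipped (an added assumption).

```python
def gc3s(cds):
    """Fraction of G or C at the third position of synonymous codons."""
    cds = cds.upper().replace("U", "T")
    excluded = {"ATG", "TGG", "TAA", "TAG", "TGA"}  # Met, Trp, stop codons
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    thirds = [
        cds[i + 2]
        for i in range(0, usable, 3)
        if cds[i:i + 3] not in excluded and set(cds[i:i + 3]) <= set("ACGT")
    ]
    if not thirds:
        raise ValueError("no synonymous third positions found")
    return sum(base in "GC" for base in thirds) / len(thirds)


# Hypothetical toy sequence: ATG GCA GCG TGG AAA TAA.
# Only GCA, GCG, and AAA are counted, and one of the three ends in G or C.
toy = "ATGGCAGCGTGGAAATAA"
print(round(gc3s(toy), 3))      # GC3s
print(round(1 - gc3s(toy), 3))  # AT3s is the complement
```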
Gene expressivity was measured by the codon adaptation index (CAI) as given by Sharp and Li [14]. CAI has been used as a simple and effective parameter to measure the adaptiveness of the synonymous codon usage of a gene towards the codon usage of highly expressed genes. CAI, with boundary values of 0 and 1, was originally proposed to provide a normalized estimate that can be used across genes and species. A gene that uses only the most frequent codons has CAI = 1, while values approaching 0 indicate use of the least frequent codons [18,19]. CAI is estimated as

$$\mathrm{CAI} = \exp\left(\frac{1}{L}\sum_{k=1}^{L}\ln w_{c(k)}\right),$$

where $L$ is the number of codons in the gene and $w_{c(k)}$ is the $w$ value for the $k$th codon in the gene.
Frequency of optimal codons (Fop), originally proposed by Ikemura in 1981, is one of the first estimators used in the study of codon usage bias. As an index, Fop shows the level to which synonymous codon choice in each gene is optimized for the translation process [8]. Fop is defined as the ratio of the total number of optimal codons in a gene to the total number of optimal and non-optimal codons in that gene.
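Both indices can be sketched in Python under stated assumptions. CAI is computed as the geometric mean of per-codon relative adaptiveness weights, and Fop as the fraction of scored codons that are optimal. The weight table and the set of optimal codons below are hypothetical placeholders, not values from this study; in practice they would be derived from a reference set of highly expressed genes. Codons for Met, Trp, and termination are skipped, and codons with no positive weight are ignored in CAI, which is a simplification.

```python
import math

NON_DEGENERATE = {"ATG", "TGG", "TAA", "TAG", "TGA"}  # Met, Trp, stops


def split_codons(cds):
    """Split a coding sequence into codons, dropping a trailing partial codon."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def cai(cds, weights):
    """Geometric mean of the relative adaptiveness weights of the codons."""
    logs = [
        math.log(weights[c])
        for c in split_codons(cds)
        if c not in NON_DEGENERATE and weights.get(c, 0) > 0
    ]
    if not logs:
        raise ValueError("no codons with a usable weight")
    return math.exp(sum(logs) / len(logs))


def fop(cds, optimal_codons):
    """Optimal codons divided by all scored (optimal + non-optimal) codons."""
    scored = [c for c in split_codons(cds) if c not in NON_DEGENERATE]
    if not scored:
        raise ValueError("no codons to score")
    return sum(c in optimal_codons for c in scored) / len(scored)


# Hypothetical weights and optimal codons, for illustration only.
toy_weights = {"GCA": 1.0, "GCG": 0.25, "AAA": 1.0, "AAG": 0.5}
toy_optimal = {"GCA", "AAA"}
toy = "GCAGCGAAAAAG"
print(round(cai(toy, toy_weights), 4))
print(fop(toy, toy_optimal))
```

Working in log space keeps the geometric mean numerically stable for long genes, where a direct product of hundreds of weights below 1 would underflow.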
The codon usage bias measures, namely, RSCU, ENC, GCs, Fop, and CAI for each coding sequence, were estimated in our study by using an in-house Perl program developed by SC.

Nucleotide Compositional Properties.
The coding sequences were analyzed thoroughly for their nucleotide composition. Individual nucleotides as well as GC and AT content in three synonymous codon positions were estimated. The nucleotide composition in the analyzed genes is summarised in Table 1. The results reveal that the viral hemagglutinin is A redundant with overall A content of 35.3% with a range of 34.9% to 35.6% and standard deviation (SD) of 0.167. On the other hand, the content in all the accessions is consistently low ranging from 18.2% to 18.8% with average and SD of 18.5 and 0.145, respectively.
The frequency of codons containing the dinucleotide TpA is much higher than that of codons containing the dinucleotide CpG. Four codons, that is, CGA, CGC, CGG, and CGT, out of the eight possible codons containing CpG, are absent in the analyzed gene; the frequencies of the remaining codons are also very low, with the highest value of 9 for GCC. In contrast, most of the codons (5 out of 6) containing TpA showed higher frequency, with the highest value of 17 for GTA and the lowest of 6 for TTA. While three codons containing TpA are preferred, there are no preferred codons containing CpG.
The overall GC content in the dataset was found to be much lower than the overall AT content (40.7% and 59.3%, resp.). The suppression of GC content relative to AT content is also evident from the GC/AT content at the silent position. The overall GC3 was found to be low (39.0%) as against AT3 (60.7%) (Figure 1).
To detect any possible relation between base composition at different synonymous codon positions, the estimated values of the four nucleotides A, T, G, and C and the AT and GC content were compared with the values of the nucleotides at the third synonymous positions (i.e., A3, T3, G3, and C3). The results indicate a highly significant and complex pattern of correlations, which is presented in Table 2. The correlation coefficients were highly significant for the majority of the parameters, taking both positive and negative values, with only a few showing insignificant correlation. A negative correlation was also observed between GC1+2 and GC3 (r = −0.478, P < 0.001). The correlation results indicate the possible role of mutational pressure acting on these genes. The base composition was most likely influenced by AT3, as revealed by the highly significant correlation coefficients.
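As a quick check on which sense codons can carry each dinucleotide within the codon itself, the sets can be enumerated directly. By this count there are eight CpG-containing and six TpA-containing sense codons (TAA and TAG are termination codons). The helper name `codons_containing` is an assumption for illustration, and dinucleotides formed across codon boundaries are not considered.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons_containing(dinucleotide):
    """Sense codons that contain the dinucleotide at positions 1-2 or 2-3."""
    all_codons = ("".join(c) for c in product("ACGT", repeat=3))
    return sorted(
        codon
        for codon in all_codons
        if dinucleotide in codon and codon not in STOP_CODONS
    )


print(codons_containing("CG"))  # CpG-containing sense codons
print(codons_containing("TA"))  # TpA-containing sense codons
```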
Previous studies have revealed that CpG underrepresentation is attributable to immunologic escape, since the host immune system uses unmethylated CpGs as a pathogen marker [20,21]. CpG deficiency has also been reported in some other RNA viruses [10,20,22]. Thus, combating the host immune response may constitute a selection pressure in these viruses.
The general trend of the ENC values suggests the absence of strong codon bias in the hemagglutinin gene. The ENC values were consistently high, with an average value of 58 ± 0.363. Based upon these observations, it appears that the extent of codon usage bias in these genes is generally constant. The ENC values were analyzed for possible correlations with the nucleotide compositional parameters, particularly GC3 content, which has been shown previously to correlate with ENC [12]. In accordance with this, our analyses showed significant positive correlations between ENC and GC3 (r = 0.431, P = 0.014) as well as between ENC and overall GC content (r = 0.724, P = 0.0001).
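Correlations of this kind, between a per-sequence index such as ENC and a compositional parameter such as GC3, are typically summarized with a correlation coefficient. The sketch below assumes a Pearson coefficient, which the text does not state explicitly; the function name `pearson_r` and the numbers are hypothetical, not data from this study, and the significance value is not computed here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-sequence values, for illustration only.
toy_gc3 = [38.6, 38.9, 39.0, 39.2, 39.4]
toy_enc = [57.6, 57.9, 58.1, 58.0, 58.4]
print(round(pearson_r(toy_gc3, toy_enc), 3))
```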

Characteristics of Synonymous Codon Usage.
In an attempt to find out the nature of codon usage bias in the genes under study, the RSCU values of the 59 codons were analyzed (Table 3). Interestingly, most of the preferred codons ended with nucleotide A. Among the preferred codons, dinucleotide CpG is markedly suppressed while dinucleotides TpA and CpA were found to be abundant in most of them.
In a quest for possible under- and over-representation of codons, RSCU values were sorted from lowest to highest. We observed that the majority of the codons, both preferred and non-preferred, fall into the unbiased or randomly used category (0.6 < RSCU < 1.6). Seven codons (GCA, AGA, CTA, TCA, ACA and GTA) showed very high RSCU values (RSCU > 1.6) and hence were considered to be "over-represented". Similarly, there were ten under-represented codons (RSCU < 0.6) (Figure 2).
All the amino acids showed a preference for a particular codon except Asp, for which both codons were used equally (Figure 4). Surprisingly, in all the accessions, only two of the six possible codons for Arg, namely, AGA and AGG, were used, and the remaining four were omitted. Between these two codons, there was a high bias towards AGA, with an RSCU value of 4.61 as compared to 1.32 for AGG. Ser and Leu were the most frequently used amino acids, while Cys, Gln, and His were used least frequently. The frequencies of the amino acids Lys, Gly, Asn, Thr, Val, etc. were also on the higher side (Figure 3).
Highly expressed genes tend to be strongly biased towards some codons and to use those codons frequently. To detect such bias and predict the expression of the genes, CAI values, which range from 0 to 1, were estimated. The CAI values for the hemagglutinin genes were found in the range of 0.3143-0.3447 with an average of 0.3829 and standard deviation of 0.0391, indicating that the codons are not translationally optimized for expression of these genes.
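The three RSCU categories used here can be expressed as a small Python classifier. The thresholds 0.6 and 1.6 come from the text; how to treat values exactly equal to a threshold is not specified, so assigning them to the unbiased category is an assumption, as is the function name `classify_rscu`.

```python
def classify_rscu(value, low=0.6, high=1.6):
    """Label an RSCU value using the 0.6 and 1.6 cut-offs."""
    if value > high:
        return "over-represented"
    if value < low:
        return "under-represented"
    return "unbiased"  # boundary values are placed here by assumption


# The two Arg codons reported in this section, plus a hypothetical low value.
for codon, value in [("AGA", 4.61), ("AGG", 1.32), ("hypothetical", 0.30)]:
    print(codon, value, classify_rscu(value))
```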
The frequency of optimal codons (Fop) in a gene can be used as an indicative measure to check whether the codons are optimized for efficient translation [23].
[Codon labels from a table or figure not recovered: GCA, GCG, CGT, CGA, AGA, AAT, GAT, TGT, CAA, GAA, GGT, GGA, CAT, ATT, ATA, TTG, CTC, CTG, AAG, TTC, CCC, CCG, TCC, TCG, AGC, ACC, ACG, TAC]
[...] showed a significant positive correlation, with a correlation coefficient of r = 0.710 (P = 0.0001).

Conclusion
Amidst much debate, mutational pressure and natural selection have been cited as the major forces shaping the codon usage profiles of different viruses [5,20,24]. As in most RNA viruses, the mutation rate of IAV is very high, and the effects of codon usage bias are too small for natural selection to act on effectively [25]. One possible explanation for the weak codon preferences is that they help the virus to replicate readily in alternate hosts with different codon choices [5]. Hemagglutinin constitutes one of the most important sites for the human immune system to act on, thus making it a potential drug target against this virus. Untangling the mechanisms underlying the synonymous codon usage profile of the virus may open new avenues in research on the development of antiviral drugs against this hazardous virus.