Synonymous Codon Usage Analysis of Thirty Two Mycobacteriophage Genomes

Synonymous codon usage of protein coding genes of thirty two completely sequenced mycobacteriophage genomes was studied using multivariate statistical analysis. One of the major factors influencing codon usage is identified to be compositional bias. Codons ending with either C or G are preferred in highly expressed genes among which C ending codons are highly preferred over G ending codons. A strong negative correlation between effective number of codons (Nc) and GC3s content was also observed, showing that the codon usage was effected by gene nucleotide composition. Translational selection is also identified to play a role in shaping the codon usage operative at the level of translational accuracy. High level of heterogeneity is seen among and between the genomes. Length of genes is also identified to influence the codon usage in 11 out of 32 phage genomes. Mycobacteriophage Cooper is identified to be the highly biased genome with better translation efficiency comparing well with the host specific tRNA genes.


Introduction
The genetic code uses 64 codons to represent the 20 standard amino acids and the translation termination signal. Each codon is recognized by a subset of a cell's transfer ribonucleotide acid molecules (tRNAs), and with the exception of a few codons that have been reassigned in some lineages [1,2] the genetic code is remarkably conserved, although it is still in a state of evolution [3].
Codons can be grouped into 20 disjoint families and each family in the universal genetic code contains between 1 and 6 codons. The usage of alternate synonymous codons in an organism is understood to be nonrandom. Grantham et al. [4,5] proposed that each genome has a particular codon usage signature that reflects particular evolutionary forces acting within that genome. The synonymous codons and amino acids are not used at equal frequencies both within and between organisms [5][6][7]; the patterns of codon usage vary considerably among organisms, and also among genes from the same genome [8].
Bacteriophages generally use the translational machinery of their hosts to synthesize both their structural and regulatory proteins. This indicates that the amount of codon usage in the protein coding genes in the phages and their bacterial hosts should be similar.
Mycobacteriophages have the potential to be used in diagnosis of tuberculosis and as molecular tools to study mycobacteria. Understanding the codon usage pattern of 2 Advances in Bioinformatics these phages should guide in the selection of appropriate ones for such purposes. The present venture is to study and understand codon usage patterns of all the mycobacteriophages so far sequenced.
Codon usage analysis was previously done for fourteen phages by Sau et al. [40]. Eighteen more mycobacteriophage genomes were subsequently sequenced and became available in Genbank. In the present work we have analysed all these 32 phage genomes to study and compare their codon usage pattern. Other codon usage indices that affect the genomes of these phages are also studied. (a) Nc, the "effective number of codons" used in a gene measures the bias away from equal usage of codons within synonymous groups [19]. Nc can take values from 20 to 61, when only one codon or all synonyms in equal frequencies were used per amino acid, respectively. Nc appears to be a good measure of general codon usage bias [19,42]. The sequences in which Nc values are <30 are highly expressed while those with >55 are poorly expressed genes [12,43].

Materials and Methods
(b) Relative Synonymous Codon Usage. Relative synonymous codon usage (RSCU) is defined as the ratio of the observed frequency of codons to the expected frequency if all the synonymous codons for those amino acids are used equally [21]. RSCU is used to observe the synonymous codon usage variation among the genes.
(c) Base composition. The frequency of A, T, G, C, and GC at first, second, and third positions of synonymously variable sense codons which can potentially vary from 0 to 1 was calculated. The variation of GC3s among genes was characterized by its standard deviation.

Statistical Methods.
Correspondence analysis (CA) is used to study the codon usage variation between genes in different organisms in which the data are plotted in a multidimensional space of 59 axes excluding those of Met, Trp, and stop codons [19]. For understanding the codon usage variation of mycobacteriophages chosen for the current study, RSCU values are used for CA in order to minimize the amino acid composition. To investigate the difference between high and low expressed genes, we have compared the codon usage variation between 10% of the genes located at the extreme right of axis 1 and 10% of the genes located at the extreme left of the axis 1 produced by CA using RSCU. To estimate the codon usage variation between these two sets of genes we have performed Chi square tests taking P < .01 as significant criterion. The Pearson correlation coefficient and linear regression were calculated to identify the indices that influence the codon usage variation in mycobacteriophages using SPSS version 10.0. The levels of statistical significance were defined as P < .01 or P < .05.

Overall Codon Usage Analysis in Mycobacteriophages.
The RSCU values of 32 mycobacteriophage genomes show that G-and/or C-ending codons are predominantly used (Table 1), in which 13 are C-ending and 6 are G-ending codons. This was expected, as these phages have a high genomic content. However, from the overall RSCU values, it can be assumed that compositional constraint is the only factor responsible for shaping the codon usage variation among the genes in these genomes. Although the overall RSCU values could unveil the codon usage pattern for the genomes, it may hide the codon usage variation among different genes in a genome.  Table 2). The average GC3s values for individual genomes varied from 65.84 to 89.35 in mycobacteriophage Barnyard and mycobacteriophage Cooper, respectively. In addition, there are marked intragenomic variations in Nc and GC3s values with standard deviation of >3.5 in both the indices. There seems to be a considerable heterogeneity in compositional bias and codon usage pattern within and among the genome of these phages. Of the 32 mycobacteriophages, the genome of Cooper is identified to have the lowest Nc and the highest GC3s values while Barnyard has the highest Nc and the lowest GC3s values indicating that highly GC rich genomes are more biased than poor GC rich genomes.

Codon Usage
In unicellular organisms, a strong correlation between gene expressivity and the extent of codon usage bias is reported for Escherichia coli and Saccharomyces cerevisiae and phages of Staphycoccus aureus and mycobacteria [13,40,[44][45][46][47][48]. Our analysis reveals that the genome of mycobacteriophage Cooper is highly biased than other 31 mycobacteriophage genomes. Based on the comparison of the highly represented codons of cooper and the copy number of host specific tRNA, the data indicate that the putatively highly expressed genes of this phage have better translational efficiency comparatively and may be expressed rapidly by the host translation machinery. Table 3 represents the base composition for the 32 completely sequenced mycobacteriophages. GC composition at the third codon position is always higher than the second and first + second codon positions observed in other GCrich genomes [9][10][11]. It is also identified that there is major variation in GC3s content among the genomes studied with no major variation in GC1s and GC2s. This suggests that GC3s has a major role to play than GC1s and GC2s and is tightly associated with the codon usage bias of these genomes.

Synonymous Codon Usage in Different Mycobacteriophages.
A plot of Nc versus GC3s (Nc plot) has been widely used to study the codon usage variation among genes in different organisms [19]. It was demonstrated that the comparison of the actual distribution of the genes, with the expected distribution under no selection, indicates that the codon usage bias of the genes has influences other than the compositional bias. In contrast, if GC3s were the only determinants of the codon usage variation among the genes, then the values of Nc would fall on the continuous curve between Nc and GC3s. It is evident from Nc plot for the mycobacteriophages studied that most of the genes fall within a restricted cloud, at GC3s between 0.65 and 0.93, and Nc values 28 and 47 ( Figure 1). Nc values for these genes lie just below the expected curve, indicating that these genes have additional codon usage bias apart from compositional bias. The rest of the genes have higher Nc values and lower GC3s values, mostly lying on and close to the expected curve. Consequently, the Nc values of these genes are substantially higher relative to expected values. However, strong influence of compositional constraints on codon usages bias in all the phages analyzed could be understood from the presence of significant negative correlation between GC3s and Nc (r = −0.969; P < .01).   mycobacteriophages has been estimated ( Table 4). As there is no information of gene expression level of mycobacteriophages so far, we have considered the highly biased genes having low Nc values as those highly expressed and vice versa. In all mycobacteriophages analyzed, except Che9d, the frequency of C at the third codon position increases with decreasing Nc values, whereas frequencies of T and A increase with Nc. However, the frequency of G is not influenced in the third codon position excluding for few phages such as Bxb1, Barnyard, Corndog, Plot, PBI1, and Bethelhem. Thus the influence of mutational bias of these phages is reflected in the choice of bases in the third position. However, this is expected since the optimal codons are, in general, chosen in accordance with the mutational bias of these phages. In other words, it is due to the translational selection that the mutational bias appears to be more prominent in the third codon position of highly expressed genes [18].

Effect of Translational Selection on the Synonymous Codon
Usage in Mycobacteriophage Genomes. Some of the earlier reports showed that synonymous codon usage in the highly expressed genes of diverse array of organisms is influenced by cellular tRNA abundance [44,[49][50][51][52]. Kanaya et al. [52,53] have reported that the cellular tRNA abundance in several organisms are directly proportional to their tRNA copy number.
Of the 32 mycobacteriophages analyzed, 10 phages (244 (2 tRNA genes), Bxz1 (26 tRNA genes), Omega (2 tRNA genes), Cjw1 (1 tRNA gene), Wildcat (21 tRNA genes), D29 (5 tRNA genes), L5 (3 tRNA genes), Che12 (3 tRNA genes), Catera (26 tRNA genes)) encode tRNA genes for few of the amino acids. Both Bxz1 and Catera are identified to encode large number of tRNA genes (26 tRNA genes) and Wildcat encoding for 21 tRNA genes. Of the 10 phage genomes encoding tRNA genes, excluding 244 and Omega, eight carry tRNA genes for the overrepresented codons in highly and lowly expressed genes. To see whether the synonymous codon usage of putatively highly expressed genes of these mycobacteriophage genomes is positively correlated with the host tRNA abundance, the number of over represented synonymous codons in such genes was determined by comparing with that of the putatively lowly expressed genes. It was found that among the 22 overrepresented synonymous codons in highly expressed genes, 21 codons are recognized by the M.tuberculosis specific tRNAs (data not shown). Based on the above analysis, the data indicate that the putatively highly expressed genes of these phages have translational efficiency.

Relationship between Codon Bias and Gene Length.
Selection for translational accuracy is predicted to have a positive correlation between codon bias and gene length [20]. Previously, the relationship between gene length and synonymous codon usage bias has been reported for Drosophila melanogaster, Escherichia coli, Saccharomyces cerevisiae, Pseudomonas aeruginosa, and Yersinia pestis [11,15]. From the plot drawn with gene length against Nc (Figure 2), it is understood that shorter genes have a much wider variance in Nc values, and vice versa for longer genes. Lower Nc values in longer genes may be due to the direct effect of translation time on fitness or to the extra energy cost of proofreading associated with longer translating time. A significant correlation was identified in 11 phages revealing that gene length influences codon usage of these genomes (Table 5). Similar results were also reported for S. pneumoniae, P. aeruginosa and SARS coronavirus [11,16]. Eyre-Walker [20] has reported that the selection for fidelity in protein translation is likely to be greater in longer genes because the cost of producing a protein is proportional to its 6 Advances in Bioinformatics length. Therefore selection of translational accuracy predicts a positive correlation between codon usage bias and gene length. And this selection may be stronger at constrained codons coding for evolutionarily conserved amino acids than the nonconserved amino acid. As the codon bias is lower in longer genes than shorter ones, further analysis in finding these constrained codons will help us in understanding whether it is the same in all genes irrespective of their length.

Correspondence Analysis Using RSCU Values.
In order to determine the factors that influence variations in codon usage among the genes of mycobacteriophage genomes, correspondence analysis was conducted on the RSCU values of its genes. Only the distributions of the genes along the first two major axes were shown, as these accounted for 13.63% and 6.89% of the total variation ( Figure 3).
The first major axis is negatively correlated with G3s (r = −.235, P < .01) and C3s (r = −.778, P < .01) but correlated positively with A3s (r = .687, P < .01) and T3s (r = .827, P < .01). Interestingly, high degree of positive correlation exists between position of genes along the first axis with Nc (r = .863, P < .01) (Figure 4) and high degree of negative correlation with GC3s (r = −.934, P < .01) ( Figure 5). These findings suggest that highly biased genes, those with G-and C-ending codons, are clustered on the negative side, whereas the codons ending in A and T predominate on the positive side of the first major axis.
Additionally, significant negative correlation is observed with Nc against GC3s and GC. Highly expressed genes tend to use "C" or "G" at the synonymous positions compared with lowly expressed genes. It is also studied that C-ending codons are preferred over G-ending codons in highly expressed genes. Preference of C-ending codons in the  highly expressed genes might be related to the translational efficiency of the genes as it has been reported that RNY (Rpurine, N-any nucleotide base, and Y-pyrimidine) codons are more advantageous for translation [54]. Thus, compositional mutation bias possibly plays an important role in shaping the genome of these phages. The genomes of Llij, PMC, Wildcat, TM4, Che8, Tweety, and Che9d phages showed no significant correlation between first major axis and GC3s. Whereas, phages such as 244, Bxb1, Bxz2, Che9c, Rosebush, Omega, Halo, Barnyard, Bxz1, Cjw1, Corndog, Orion, Plot, Qyrzula, and Giles show strong negative correlation with GC3s. The primary trend in codon usage variation in these phages can be attributed to the presence of putatively foreign genes acquired through horizontal gene transfer with unusually A + T rich codon usage. However, in phages such as D29, L5, PBI1, PG1, Cooper, Che12, Catera, Bethelhem, and U2, axis 1 coordinates are significantly positively correlated with GC3s values (Table 6). Moreover, when G3s and C3s are considered separately, the correlation coefficient exhibited by the positions of genes along the first axis with C3s is significantly larger than that with G3s (Table 6), indicating that the contribution of C3s to the interspecies variation in overall GC3s content is greater than that of G3s.  Table 7 shows RSCU values for each codon for the two groups of genes. The asterisk represents the codons whose occurrences are significantly higher in the genes situated on the extreme left side of axis 1, compared to the genes present on the extreme right of the first major axis. It is important to note that out of 22 codons that are statistically overrepresented in genes located on the extreme left side of axis 1 there is 16 C-ending codons and 5-G ending codons. UGA is the most frequent stop codon among highly and lowly expressed genes. Similar pattern is also seen in Mycobacterium tuberculosis genome, where the highly expressed genes prefer codons ending with "C" and "G" [18].

Conclusion
Compositional bias and translational forces had been reported to play a major role in shaping the codon usage of 14 mycobacteriophages. Our observations corroborate with the earlier report with respect to all the 32 mycobacteriophages. Gene length has a minor role in the selection of codon usage of 11 out of 32 mycobacteriophage genes analyzed. High level of heterogeneity is seen within and among the mycobacteriophage genomes. Cooper is identified to be the highly biased genome with better translation efficiency comparing well with the host specific tRNA genes. (341) * Codons whose occurrences are significantly higher (P < .01) in the extreme left side of axis 1 than the genes present on the extreme right of the first major axis. Each group contains 10% of sequences at either extreme of the major axis generated by correspondence analysis. AA: amino acid; N: number of codon; a genes on extreme left of axis 1; b genes on extreme right of axis 1.