A Comparison of Synonymous Codon Usage Bias Patterns in DNA and RNA Virus Genomes: Quantifying the Relative Importance of Mutational Pressure and Natural Selection

Codon usage bias patterns have been broadly explored for many viruses. However, the relative importance of mutation pressure and natural selection is still under debate. In the present study, I tried to resolve controversial issues on determining the principal factors of codon usage patterns for DNA and RNA viruses, respectively, by examining over 38000 ORFs. By utilizing variation partitioning technique, the results showed that 27% and 21% of total variation could be attributed to mutational pressure, while 5% and 6% of total variation could be explained by natural selection for DNA and RNA viruses, respectively, in codon usage patterns. Furthermore, the combined effect of mutational pressure and natural selection on influencing codon usage patterns of viruses is substantial (explaining 10% and 8% of total variation of codon usage patterns). With respect to GC variation, GC content is always negatively and significantly correlated with aromaticity. Interestingly, the signs for the significant correlations between GC, gene lengths, and hydrophobicity are completely opposite between DNA and RNA viruses, being positive for DNA viruses while being negative for RNA viruses. At last, GC12 versus G3s plot suggests that natural selection is more important than mutational pressure on influencing the GC content in the first and second codon positions.


Introduction
Codon usage is not a random event [1]. Codon usage bias has been broadly observed, and different mechanisms have been proposed to explain the bias patterns, for example, mutation pressure, translational efficiency, gene length [2], dinulcoetide bias [3], tRNA abundance [4], organ specificity [5], and so on. Codon usage bias patterns have been broadly studied in recent years, especially for virus genomes [3,6,7]. However, most of these previous studies only consider a specific virus or a specific virus clade [8][9][10], a global comparison of virus codon usage bias patterns is still largely lacking, even though some literature had worked on many RNA and DNA viruses as whole [11][12][13]. A holistic observation and comparison of codon usage patterns over different clades of viruses would throw new insights into virus genome explicitly. To cope with such a knowledge gap, in the present study, I analyzed codon usage patterns for the available 2317 virus genomes for the purpose of providing a more robust and integrated understanding of synonymous codon usage patterns.
Given the accumulation of genome sequences from different viruses in GenBank database, another purpose of the present study is to quantify the relative contribution of mutation pressure and natural selection on influencing codon usage patterns of virus genomes. I could achieve such an objective by introducing a new statistical method called variance partitioning to quantitatively examine the separated role of different mechanisms on synonymous codon usage patterns of viruses.

Sequence Data.
The complete genome sequences for 2317 different virus species were originally obtained from GenBank database (http://www.ncbi.nlm.nih.gov/genomes/ VIRUSES/viruses.html). Because some viruses have been sequenced for multiple times using different strains, for avoiding sampling bias, only one from these multiple genomic sequences for the same virus is used. Furthermore, because RNA and DNA viruses are very different on their codon usage biase patterns [12], RNA

Measures of Relative Synonymous Codon Usage (RSCU).
Relative synonymous codon usage values of each codon in a gene are calculated to investigate the characteristics of synonymous codon usage. The RSCU index is calculated as follows [14]: where is the observed number of the th codon for the th amino acid which has kinds of synonymous codons. Codons with higher (or lower) selected frequencies have higher (or lower) RSCU values. When the corresponding RSCU values of a codon are close to 1, it is used randomly and evenly.

Effective Number of Codons.
The effective number of codons (ENC) is a measure of bias from equal codon usage in a gene [15]. The calculation formula is where ( = 2, 3, 4, 6) is the mean of values for the -fold degenerate amino acids, which is estimated using the formula as follows: where is the total number of occurrences of the codons for that amino acid and where is the total number of occurrences of the th codon for that amino acid. ranges from 20 for the strongest bias (where only one codon is used for each amino acid) to 61 for no bias (where all synonymous codons are used equally).
For elucidating the relationship between GC3s and ENC values, the expected ENC values for different GC3s are calculated as follows: where denotes the value of GC3s [6]. The observed and expected ENC values are compared to determine the influence of nucleotide compositional constraint on structuring synonymous codon usage bias.

Codon Adaptation
Index. The codon adaptation index (CAI) estimates the extent of bias toward codons that are known to be favored in highly expressed genes [16]. In the present study, for simplicity, the Escherichia coli optimal codons are used as the reference.

Indices for Measuring Chemical Properties of Amino Acids.
Hydrophobicity (GRAVY) and aromaticity (AROMO) of conceptually translated gene product may be factors influencing codon usage bias patterns [17]. As such, I quantify both indices to reveal the evidence of natural selection on codon usage bias. For hydrophobicity index [17], it is calculated as where is the number of amino acids and is the hydrophobic index of the th amino acid.
For aromaticity index [17], it is calculated as where V is either 1 (for an aromatic amino acid) or 0 (for a nonaromatic amino acid) and is the number of amino acids.

Correspondence Analysis and Canonical Correspondence
Analysis. In addition to utilizing conventional correspondence analysis (CA) [17], in the present study, I introduce a new method, namely, canonical correspondence analysis (CCA) [18], which could help reveal the principal trends of codon usage bias patterns and identify the most correlated variables simultaneously. CCA method has been broadly applied in ecological studies [18,19]. However, it might be the first time to be applied to study synonymous codon usage patterns for viruses in the present study. The mathematical formulation for CCA method [18,19] is a bit complicated in comparison to its linear analogue redundancy analysis (RDA) [19,20] because it requires data transformation. As such, the calculation steps for RDA are present here for demonstrating the calculation core steps of CCA.
Assuming that one has the codon usage matrix and the matrix of explanatory variables (codon usage indices) (both have the same rows), then the RDA procedure is to predict the elements (codon usage values) in the matrix aŝ where the subscript denotes the transpose of the matrix and −1 denotes the inverse of the matrix.
Thus the covariance matrix for the predicted codon usage matrix̂is ( denotes the row number) The RDA or CCA method is to decompose the above matrix into normalized eigenvalues and normalized eigenvector matrix . Elements from ranked from high to low represent the explained proportion of total variation in the codon usage patterns, while the corresponding eigenvectors can be used to obtain sample scores and biplots when generating the 2-dimensional plots.

Quantifying the Influence of Mutation Pressure and Selection Pressure Using Variation
Partitioning. Variation partitioning is a relatively new method for helping elucidate the influence of each group of explanatory variables in multivariate statistics [21]. Variation partitioning has been broadly applied in ecological and evolutionary studies [19]. For quantifying the influence of mutation selection, I consider the metrics related to codon contents, like GC, GC3s, A3s, T3s, C3s, and G3s contents, as the factors reflecting mutational pressure. In contrast, the indices CAI, all kinds of protein properties, including hydrophobicity and aromaticity, are regarded as the representative of natural selection [3,17,22]. For simplicity, the mathematical formulation for variation partitioning technique is as follows [21,23,24].
Supposing that there are two groups of explanatory variables in two matrices 1 and 2 , the total variation in the codon usage table matrix with rows is written as where a hyphen above the variable(s) denotes the mean(s). Then the proportion of variation 1 only explained by the group of explanatory variables 1 is obtained as The percentage of total variation 2 attributed to the explanatory variable group 2 is calculated following the same procedure as above.
Finally, the percentage of total variation explained by the interaction of the two variable groups 1 and 2 requires the determination of the percentage of variation ( 12 ) explained by all the variables , the matrix of which combines matrices 1 and 2 together: Thus, the proportion of variation that cannot be explained by any current explanatory variables is determined by Then, the percentage of total variation explained by the interaction of the two variable groups is given by, In a summary, 1 , 2 , 1∩2 and 0 are the focused explained variation for the present study.

ENC-GC3s
Plot. As seen in Figure 1, the ENC-GC3s plot showed that most DNA or RNA virus genes lay on or slightly under the expected curve, indicating the extreme importance of mutational pressure for both groups of viruses. However, a great amount of points was laid under the curve as well for DNA (Figure 1(a)) and RNA (Figure 1(b)) viruses, suggesting that other factors, especially the influence of natural selection, were nontrivial.

Influence of Gene Lengths and Protein Properties on GC
Variation. Based on the present observation, there was a significant and positive correlation between GC content and gene lengths for DNA viruses (Figure 2(a)). The log-transformation further enhanced the positive trend (Figure 2(b)). However, for RNA viruses, the patterns became reverse: GC was significantly and negatively correlated with gene lengths for either original (Figure 3(a)) and log-transformed data points (Figure 3(b)). These results thus were incongruent with many previous studies working on a single of virus or a clade of viruses, which suggested that there were no clear trends between GC and gene lengths, for example, influenza viruses [28], polioviruses [10], parvoviridae [29], and so on. Similar to the relationship between GC and gene lengths, there was an opposite relationship between GC and hydrophobicity for different types of viruses as well (Figures 2(c)  and 3(c)). For DNA viruses (Figure 2(c)), the correlation between the two quantities was positive and significant, while for RNA viruses (Figure 3(c)), the correlation became positive and significant (Figure 3(c)).  Finally, the relationship between GC and aromaticity was always negatively and significantly correlated for either DNA (Figure 2(d)) or RNA (Figure 3(d)) viruses. These significant correlations should suggest the signature of the influence of natural selection on codon usage patterns of viruses. Figure 4(a), for DNA viruses, the correlation of GC3s and GC12 was best fitted by a linear function as GC12 = 0.202 × GC3s + 0.203( 2 = 0.461, < 0.0001) for DNA viruses. For RNA viruses, the linear regression model was as similar as GC12 = 0.225 × GC3s + 0.206( 2 = 0.461, < 0.0001) (Figure 4(b)). The slope of the GC12-GC3s regression line indicated the relative mutaion pressure functioned on the first and second codon positions in relation to that on the third codon position [30][31][32]. As seen, GC12 was influenced by mutation pressure and natural selection with a ratio being 0.202/0.798 = 0.253 for DNA viruses and 0.225/0.775 = 0.29 for RNA viruses correspondingly. These results indicated that the natural selection was more important on structuring the first and second codon positions and had similar influences for both groups of viruses.

CA and CCA Analyses for Characterizing the Major Trends in Codon Usage Patterns of Viruses.
For the ORFs of DNA viruses, the first (CA1) and second (CA2) axes of CA explained 34.5% and 5.6% of total variation in the codon usage patterns (Figure 5(a)). For RNA viruses, the first and second axes of CA explained 27.3% and 9.2% of the total variation, respectively, in synonymous codon usage patterns ( Figure 5(b)). Thus, CA1 reflected the major trends for both DNA and RNA virus ORFs.
For RNA viruses, the first two axes explained 50.8% and 16.2% of total variation ( Figure 5(d)). The most important variables correlated with CCA1 were identical as those for DNA viruses, while A3s, T3s, and CAI were the most important variables for CCA2 (Table 1).
When comparing both CA and CCA results, it was consistently found that the following factors are repeatedly identified as most correlated ones for the principal axes for both DNA and RNA viruses: GC3s, GC, and A3s (Table 1). Thus, these variables should be of great importance to influence codon usage bias patterns for viruses.

Quantifying the Relative Importance of Mutation Pressure and Natural Selection in Overall Codon Usage Patterns of Viruses.
Based on the results of variation partitioning (Figure 6(a)), for DNA viruses, it was found that 27% of total variation could be attributed to mutational pressure, while only 5% of total variation of codon usage patterns attributed to selection pressure. The interaction between mutation and selection further explained 10% of total variation. Very similarly, for RNA viruses (Figure 6(b)), mutational pressure explained 21% of the total variation in codon usage patterns, while natural selection explained 6% of the total variation. The interaction of both mechanisms further explained 8% of the total variation.

The Relationship between GC Variation and Codon
Usage Factors. Interestingly, it is found that the correlation between GC content and hydrophobicity and gene lengths is positive and significant for DNA viruses (Figures 2(a)-2(c)), while being negative and significant of RNA viruses (Figures 3(a)-3(c)). In contrast, the tendency between GC versus aromaticity is always negative (Figures 2(d) and 3(d)). The positive correlation between GC and hydrophobicity for RNA viruses is contradictory to some previous studies working on specific RNA virus species or clades, which argued that the correlation should be positive [33]. Moreover, it is still controversial whether there is a clear correlation for a specific virus or a clade of viruses. Some previous studies [3,34] concluded that there was no clear relationship between these two quantities. I do not observe a congruent relationship between GC content and gene lengths for DNA and RNA viruses ). Based on some predictions, GC content should be correlated with gene length since selection should be stronger in longer genes, causing the directional change of GC content [35][36][37][38]. Indeed, GC has been thought to relate to gene lengths in prokaryotes, plants, nematodes, or insects [2,22,35,36,39], although the relationship among these taxa is still debatable [40]. On the basis of the results for DNA and RNA virus genomes at the present study, I argue that there is no consistent relationship between GC profiling and gene lengths for viruses. As shown in Figures 2(a) and 2(b), for DNA viruses, the relationship between GC and gene lengths is positive, implying the imprint of natural selection. However, for RNA viruses (Figures 3(a) and 3(b)), the relationship becomes negative, being opposite to the prediction of natural selection. This is not surprising, because RNA viruses are believed to have much higher mutation rates than DNA viruses [12,28,41]. At this perspective, viruses are different to other life forms from other kingdoms, in which natural selection plays differential roles to influence the stability of longer genes of DNA and RNA viruses. It is found that the prescreening and removal of shortlength ORFs are very crucial to obtain accurate trends between GC and codon usage indices. For example, the negative relationship between GC and hydrophobicity may become obscured when more short-length ORFs (less than 350 bp) are included in the study for DNA viruses. The correlation will become nonsignificant (results not showed here).

Virus Genomes Are Profoundly Influenced by Mutation
Pressure. Based on the results of variation partitioning, the present study identifies that mutational pressure is the most prevailing mechanism driving the codon usage bias patterns for both DNA and RNA viruses (Figure 6), because it can explain 27% and 21% of total variation, respectively, in codon usage patterns of both groups of viruses. In contrast, the influence of natural selection is very minor, only explaining 5% and 6% of the total variation, respectively, for both groups of viruses. Previous studies on a single or a clade of viruses largely have confirmed the dominating influence of mutational pressure [6,10,42,43], but many studies also mentioned the considerable importance of selection [29,44]. Thus, through the present intergenomic analysis, I have a chance to quantify the relative importance of mutation versus selection on structuring codon usage patterns of viruses, and the similar conclusion is enforced: natural selection is not so important in comparison to mutational pressure in synonymous codon usage patterns of viruses.
Through correlation analysis between codon usage indices and major axes from CA and CCA analyses, the present study identifies that the three most important indices are GC, GC3s, and A3s. Thus, the present results are contradictory with a previous study [13] which showed that genomic nucleotide content was the most important factor predicting synonymous codon usage patterns in RNA viruses using randomization techniques. The difference raised may be partially due to the studied data size. In the previous study [13], only 29 RNA virus species were examined. This number is a very small number in comparison to the present study which works on 725 RNA viruses. Thus, the conclusion from the previous study [13] stating that GC content was a poor predictor of codon usage patterns of RNA viruses may be challenging if the authors can work on a large number of RNA viruses.
Finally, as implied by a large fraction of the unexplained variation (over 50%) for both groups of viruses, my quantification of mutational pressure and natural selection by utilizing the present ten codon usage variables might not be sufficient to quantify the relative importance of mutational pressure and natural selection. Other more important variables, especially those for characterizing natural selection  (e.g., the frequency of usage of optimal codons [45]), might increase the explanatory power of natural selection on codon usage patterns of virus genomes.

Limitations of the Present Study.
I have to acknowledge that the recombination events happened at either gene or genome levels can influence the codon usage bias patterns to some extent, as evidenced by some previous studies [46][47][48]. Virus genomes have been broadly observed to process some degrees of homologous recombination [49][50][51][52]. As such, it would be a contribution when ones eliminate the influence of homologous recombination in the virus genomes before analyzing codon usage patterns to accurately disentangle the relative importance of mutation and selection.