Retroviruses infect a wide range of organisms, including humans. Among them, HIV-1, which causes AIDS, has become a major threat to world health. Some of these viruses are also potential gene transfer vectors. In this study, the patterns of synonymous codon usage in retroviruses were studied through multivariate statistical methods on ORF sequences from the 56 available retroviruses. The principal determinant of the evolution of the codon usage pattern in retroviruses appeared to be compositional constraints, while selection for translation of the viral genes played a secondary role. This was further supported by multivariate analysis of relative synonymous codon usage. Thus, it seems that mutational bias has played a dominant role over translational selection in shaping the codon usage of retroviruses. The codon adaptation index was used to identify translationally optimal codons among genes from retroviruses. The comparative analysis of the preferred and optimal codons among different retroviral groups revealed that four codons, GAA, AAA, AGA, and GGA, were significantly more frequent in most of the retroviral genes in spite of some differences. Cluster analysis also revealed that phylogenetically related groups of retroviruses have probably evolved their codon usage in a concerted manner under the influence of their nucleotide composition.
The retroviruses are a diverse family of enveloped, single-stranded, reverse-transcribing RNA viruses, unique for their use of reverse transcription of the viral RNA into linear double-stranded DNA during replication and the subsequent integration of that DNA into the host genome. Members of this family cause diseases in a wide range of organisms, including humans [
The analysis of codon usage of whole organisms and/or of groups of closely related organisms reveals trends and anomalies in the choice and frequency of codons and in the related nucleotide composition, including evolutionary features. Synonymous codons do not occur at equal frequencies in genes and genomes. The relative frequency of these synonymous codons varies significantly and nonrandomly between species, even between species from the same taxon, owing to a complex balance between mutational bias, various selective forces (e.g., translational selection), and drift acting on the genes or genomes [
In this study, the codon usage patterns of all 56 available sequenced retroviral genomes (from GenBank), containing 246 ORFs (longer than 150 bp), were analyzed. Results from this study should be useful for revealing retroviral gene composition and evolution and may additionally be useful in selecting appropriate host expression systems to improve the expression of target genes
Fifty-six completely sequenced retroviral genomes were available from NCBI GenBank (February 2010). These belonged to two major subfamilies, Orthoretrovirinae and Spumaretrovirinae, and the rest of the viruses were unclassified. Six viruses belonged to Spumaretrovirinae, while 3 viruses were unclassified. The remaining 47 viruses belonged to the Orthoretrovirinae subfamily. Six genera are present within Orthoretrovirinae, namely:
The various statistical parameters characterizing synonymous codon frequency, codon bias, base composition of whole genes, base composition at 3rd codon positions, relative gene expression levels, preferred and optimal codons, correspondence and cluster analysis on codon usage, and the associated means, standard deviations (SD), correlation coefficients, and chi-square statistics
The “Effective Number of Codons” (ENc) of a gene sequence measures the degree of bias in codon usage in the gene [
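As an illustration of how such a measure can be computed, the following is a minimal sketch assuming the commonly used formulation of ENc by Wright (1990), in which a codon homozygosity F is averaged within each degeneracy class of the standard genetic code. The function name `effective_number_of_codons` and the handling of edge cases (skipping amino acids with fewer than two occurrences, capping the result at 61) are choices made for this sketch and are not taken from the study.

```python
from collections import Counter
from itertools import product

# Standard genetic code in UCAG order (first, second, third base).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Synonymous families, excluding stop codons, Met and Trp (59 codons in total).
FAMILIES = {}
for codon, aa in CODON_TO_AA.items():
    if aa not in "*MW":
        FAMILIES.setdefault(aa, []).append(codon)


def effective_number_of_codons(seq):
    """ENc of an in-frame coding sequence (Wright-style sketch, assumption)."""
    seq = seq.upper().replace("T", "U")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    homozygosity = {2: [], 3: [], 4: [], 6: []}
    for codons in FAMILIES.values():
        n = sum(counts[c] for c in codons)
        if n > 1:
            # F = (n * sum(p_i^2) - 1) / (n - 1) for one amino acid
            f = (n * sum((counts[c] / n) ** 2 for c in codons) - 1) / (n - 1)
            if f > 0:
                homozygosity[len(codons)].append(f)
    mean_f = {k: sum(v) / len(v) for k, v in homozygosity.items() if v}
    # If isoleucine is unusable, approximate F3 from the 2- and 4-fold classes.
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2
    if set(mean_f) != {2, 3, 4, 6}:
        raise ValueError("sequence too short to estimate ENc")
    enc = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(enc, 61.0)  # the measure is conventionally capped at 61
```

Under this formulation a gene that uses a single codon for every amino acid scores 20, and a gene that uses all synonymous codons evenly scores 61.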
Characteristics and codon usage pattern and AU distribution of retroviral genomes (shown in clades).
| Abbrev. name | Accn. number | Mean ENc | SD (ENc) | Mean AU % | SD (AU %) | Mean AU3 % | SD (AU3 %) |
|---|---|---|---|---|---|---|---|
| Orthoretrovirinae | | | | | | | |
| ACMHV-2 | NC_001402 | 40.51 | 0.00 | 36.20 | 0.00 | 18.50 | 0.00 |
| ALV-A | NC_001408 | 57.09 | 3.32 | 44.80 | 7.35 | 45.23 | 9.32 |
| AMCV | NC_001866 | 52.79 | 0.00 | 39.60 | 0.00 | 32.00 | 0.00 |
| FuSV | NC_001403 | 42.61 | 0.00 | 37.90 | 0.00 | 22.20 | 0.00 |
| RSV | NC_001407 | 53.50 | 9.93 | 44.90 | 5.28 | 41.90 | 15.25 |
| UR2SV | NC_001618 | 55.68 | 0.59 | 53.40 | 0.00 | 56.65 | 1.77 |
| Y73SV | NC_008094 | 39.41 | 0.00 | 38.20 | 0.00 | 21.20 | 0.00 |
| | | | | | | | |
| ENTV-2 | NC_004994 | 49.86 | 1.36 | 57.63 | 2.33 | 70.20 | 1.53 |
| JSRV | NC_001494 | 49.25 | 4.26 | 58.58 | 2.38 | 66.08 | 9.57 |
| MPMV | NC_001550 | 50.10 | 1.31 | 57.58 | 1.27 | 68.13 | 1.60 |
| MMTV | NC_001503 | 51.84 | 2.16 | 55.98 | 0.62 | 62.80 | 2.42 |
| ENTV-1 | NC_007015 | 49.23 | 2.80 | 58.28 | 2.72 | 71.45 | 2.87 |
| SMRV-HLB | NC_001514 | 53.01 | 4.61 | 51.68 | 2.01 | 54.78 | 3.39 |
| | | | | | | | |
| BLV | NC_001414 | 52.05 | 2.54 | 44.98 | 3.49 | 47.35 | 4.83 |
| | | | | | | | |
| HTLV-1 | NC_001436 | 50.88 | 1.62 | 46.27 | 2.05 | 45.35 | 2.82 |
| STLV-1 | NC_000858 | 51.09 | 2.06 | 46.73 | 2.29 | 45.80 | 3.45 |
| | | | | | | | |
| HTLV-2 | NC_001488 | 50.04 | 1.86 | 45.58 | 3.31 | 44.46 | 3.11 |
| STLV-2 | NC_001815 | 51.19 | 4.24 | 43.08 | 3.82 | 41.97 | 5.16 |
| HTLV-4 | NC_011800 | 50.86 | 2.71 | 42.85 | 2.62 | 40.68 | 2.23 |
| STLV-6 | NC_011546 | 54.01 | 3.83 | 47.00 | 2.78 | 48.73 | 4.68 |
| STLV-3 | NC_003323 | 55.39 | 3.72 | 44.92 | 2.73 | 44.52 | 2.31 |
| | | | | | | | |
| SnRV | NC_001724 | 51.88 | 7.91 | 50.31 | 4.73 | 58.46 | 4.80 |
| WDSV | NC_001867 | 53.51 | 2.45 | 57.82 | 3.39 | 65.50 | 3.56 |
| | | | | | | | |
| AbMLV | NC_001499 | 55.03 | 6.45 | 46.73 | 8.81 | 46.50 | 11.23 |
| FeLV | NC_001940 | 53.84 | 4.16 | 50.30 | 1.70 | 53.60 | 0.99 |
| | | | | | | | |
| FrMLV | NC_001362 | 54.95 | 1.03 | 46.70 | 1.41 | 47.70 | 1.82 |
| MoMLV | NC_001501 | 54.72 | 0.10 | 47.00 | 1.84 | 48.77 | 2.80 |
| MTCR | NC_001702 | 52.33 | 3.00 | 45.90 | 0.99 | 45.50 | 2.44 |
| R-MuLV | NC_001819 | 55.42 | 0.62 | 46.87 | 1.40 | 47.60 | 1.32 |
| GALV | NC_001885 | 55.83 | 1.34 | 47.57 | 1.16 | 48.83 | 0.67 |
| MOMSV | NC_001502 | 56.40 | 3.24 | 47.60 | 6.26 | 46.08 | 8.42 |
| MuSV | NC_001506 | 49.97 | 0.98 | 42.65 | 1.77 | 37.35 | 7.14 |
| RD-114 | NC_009889 | 54.31 | 1.93 | 49.70 | 3.68 | 51.60 | 3.25 |
| REV | NC_006934 | 57.09 | 0.20 | 46.90 | 0.80 | 46.50 | 1.44 |
| SFFV | NC_001500 | 51.03 | 5.85 | 45.27 | 2.28 | 47.33 | 3.37 |
| WMSV | NC_009424 | 51.39 | 8.65 | 42.60 | 4.04 | 37.97 | 13.23 |
| XMRV-VP62 | NC_007815 | 52.79 | 2.41 | 46.43 | 1.29 | 46.80 | 2.18 |
| | | | | | | | |
| BIV | NC_001413 | 53.22 | 4.43 | 53.26 | 3.60 | 57.90 | 5.35 |
| CAEV | NC_001463 | 45.93 | 6.96 | 57.10 | 3.45 | 67.63 | 4.46 |
| EIAV | NC_001450 | 47.05 | 7.93 | 59.25 | 4.56 | 67.43 | 1.18 |
| FIV | NC_001482 | 43.87 | 6.58 | 62.01 | 3.83 | 71.39 | 9.91 |
| HIV-1 | NC_001802 | 45.05 | 4.01 | 55.49 | 4.66 | 64.78 | 7.24 |
| HIV-2 | NC_001722 | 52.43 | 5.73 | 51.49 | 3.02 | 56.51 | 4.77 |
| OLV | NC_001511 | 44.59 | 4.47 | 57.72 | 3.78 | 65.73 | 6.90 |
| | | | | | | | |
| SIV | NC_001549 | 48.80 | 4.12 | 54.73 | 3.32 | 60.62 | 7.82 |
| SIV-mnd-2 | NC_004455 | 51.44 | 5.83 | 54.91 | 2.63 | 58.94 | 6.43 |
| VISNA | NC_001452 | 44.47 | 5.34 | 57.83 | 2.70 | 68.25 | 6.51 |
| Spumaretrovirinae | | | | | | | |
| BFV | NC_001831 | 55.93 | 2.36 | 52.14 | 3.66 | 58.76 | 5.22 |
| EFV | NC_002201 | 45.53 | 2.69 | 58.00 | 4.83 | 71.16 | 5.47 |
| FFV | NC_001871 | 48.58 | 3.78 | 60.53 | 2.95 | 70.40 | 3.59 |
| SFVmac | NC_010819 | 46.02 | 3.01 | 58.52 | 4.92 | 73.60 | 5.31 |
| SFV | NC_001364 | 47.94 | 3.91 | 58.03 | 4.38 | 70.58 | 8.48 |
| SFV-3 | NC_010820 | 44.33 | 3.23 | 59.50 | 5.31 | 75.20 | 4.35 |
| Unclassified retroviruses | | | | | | | |
| SSSV | NC_007654 | 54.69 | 2.28 | 51.70 | 3.54 | 57.90 | 6.65 |
| EAV-HP | NC_005947 | 59.38 | 0.00 | 47.60 | 0.00 | 47.80 | 0.00 |
| Xen-1 | NC_010955 | 59.60 | 1.80 | 53.35 | 2.76 | 58.00 | 4.11 |
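The AU % and AU3 % columns of the table can be reproduced per gene with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and AU3 is computed here over all third codon positions, whereas some tools restrict it to synonymous third positions (an assumption that may differ from the procedure used in the study).

```python
def au_percent(seq):
    """Percentage of A + U (or A + T for DNA input) over the whole sequence."""
    seq = seq.upper().replace("T", "U")
    return 100.0 * (seq.count("A") + seq.count("U")) / len(seq)


def au3_percent(seq):
    """Percentage of A + U at third codon positions of an in-frame ORF."""
    seq = seq.upper().replace("T", "U")
    third = seq[2::3]  # every third base, starting from the third position
    return 100.0 * (third.count("A") + third.count("U")) / len(third)


# Toy ORF (hypothetical, not from the dataset): AUG GCC AAA UUU
orf = "AUGGCCAAAUUU"
overall = au_percent(orf)   # 8 of 12 bases are A or U
third = au3_percent(orf)    # third positions are G, C, A, U
```

Per-virus means and standard deviations such as those in the table would then be obtained by applying these functions to each ORF of a genome and summarizing the results.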
Nucleotide preferences are usually an indication of the nature of mutational bias in genes or genomes. Here, in retroviruses, explicit differences are observed in nucleotide preferences. The AU content (overall A + U) of genes in individual retroviruses ranged from 35% to about 60% (Table
When ENc is plotted against AU3 content for the whole dataset, only a small number of genes lie on the expected curve (the curve representing the variation of codon bias when it is determined by base composition only), while the majority of the genes with low ENc values lie well below it (Figure
(a) ENc versus GC3 plot of all the genes. The reference viruses are shown in different colors. (b) The values of the first axis and the second axis of each gene in CoA. Genes from reference retroviruses are shown in different colors; genes from other viruses are plotted in blue.
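The expected curve in such a plot is usually drawn from a closed-form approximation. The sketch below assumes the standard approximation used in Wright's Nc-plot, ENc = 2 + s + 29 / (s² + (1 − s)²), where s is the GC fraction at third codon positions (so s = 1 − AU3/100). The function names are hypothetical and the study may have used a different implementation.

```python
def expected_enc(gc3):
    """Expected ENc when codon bias depends on base composition alone.

    gc3 is the G + C fraction at third codon positions, between 0 and 1.
    """
    return 2.0 + gc3 + 29.0 / (gc3 ** 2 + (1.0 - gc3) ** 2)


def expected_enc_from_au3(au3_percent):
    """Same curve, parameterized by AU3 in percent (as reported in the table)."""
    return expected_enc(1.0 - au3_percent / 100.0)


# Points for drawing the curve over the full GC3 range.
curve = [(s / 100.0, expected_enc(s / 100.0)) for s in range(0, 101)]
```

As a rough illustration using mean values from the table (not gene-level data), the mean AU3 of 64.78% for HIV-1 corresponds to an expected ENc of roughly 55.7, which lies above its observed mean ENc of 45.05.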
Codons occurring at high frequencies in the total codon usage data of an organism are called preferred codons. Here, in retroviruses, significant differences (using
The codon adaptation index (CAI) is one measure used to estimate the extent of bias towards codons that are preferred in highly expressed genes. The CAI value for a gene ranges from 0 to 1.0, where a higher value is likely to indicate stronger codon usage bias and a potentially higher expression level. A higher CAI for a large set of genes may also indicate that selection for translation is active over that set of genes. Codons whose frequencies of usage were significantly higher in the genes with higher CAI than in the genes with lower CAI are considered the optimal codons. In this study, the codon usage of retroviruses was compared (with a chi-squared contingency test) between two groups of genes. One group consisted of the 5% of all genes with the highest CAI values. The other group was similarly constructed from the genes with the lowest CAI values. In all, 26 codons, UUU (Phe), UUA, UUG, CUA (Leu), AUA (Ile), GUA (Val), UAU (Tyr), CAU (His), CAA (Gln), AAU (Asn), AAA (Lys), GAU (Asp), GAA (Glu), UCU, UCA, AGU (Ser), CCU, CCA (Pro), ACU, ACA (Thr), GCU, GCA (Ala), UGU (Cys), AGA, AGG (Arg), and GGA (Gly), were identified as the optimal codons (
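To make the CAI definition concrete, the following minimal sketch computes it as the geometric mean of relative adaptiveness values, where each codon's weight is its count in a reference gene set divided by the count of the most frequent synonymous codon. The reference counts, the two-family codon table, and the 0.5 pseudocount for absent codons are all hypothetical choices for illustration; the reference set actually used in the study is not specified here.

```python
import math


def relative_adaptiveness(reference_counts, families):
    """Weight of each codon relative to the most used synonym in its family."""
    weights = {}
    for codons in families:
        # 0.5 pseudocount avoids log(0) for absent codons (assumption).
        top = max(max(reference_counts.get(c, 0), 0.5) for c in codons)
        for c in codons:
            weights[c] = max(reference_counts.get(c, 0), 0.5) / top
    return weights


def codon_adaptation_index(seq, weights):
    """Geometric mean of codon weights over an in-frame coding sequence."""
    seq = seq.upper().replace("T", "U")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference counts for two synonymous families (Phe, Lys).
families = [("UUU", "UUC"), ("AAA", "AAG")]
reference = {"UUU": 80, "UUC": 20, "AAA": 60, "AAG": 30}
w = relative_adaptiveness(reference, families)
best = codon_adaptation_index("UUUAAA", w)   # only the top codons
worst = codon_adaptation_index("UUCAAG", w)  # only the less used codons
```

A gene built solely from the most frequent reference codons reaches the maximum CAI of 1.0, and genes using rarer synonyms score lower. A full implementation would cover all synonymous families of the genetic code.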
Translational optimal codons.
| Amino acid | Codon# | High: RSCU | High: number | Low: RSCU | Low: number |
|---|---|---|---|---|---|
| Phe | UUU* | 1.64 | 184 | 0.7 | 61 |
| | UUC | 0.36 | 40 | 1.3 | 113 |
| Leu | UUA* | 2.67 | 323 | 0.34 | 35 |
| | UUG* | 1.21 | 146 | 0.5 | 51 |
| | CUU | 0.47 | 57 | 0.63 | 65 |
| | CUC | 0.29 | 35 | 1.83 | 189 |
| | CUA* | 0.94 | 114 | 0.53 | 55 |
| | CUG | 0.41 | 50 | 2.17 | 223 |
| Ile | AUU | 0.85 | 193 | 0.77 | 55 |
| | AUC | 0.29 | 66 | 1.72 | 122 |
| | AUA* | 1.86 | 425 | 0.51 | 36 |
| Val | GUU | 0.65 | 80 | 0.63 | 44 |
| | GUC | 0.36 | 44 | 1.38 | 96 |
| | GUA* | 2.17 | 267 | 0.32 | 22 |
| | GUG | 0.82 | 101 | 1.68 | 117 |
| Tyr | UAU* | 1.74 | 270 | 0.44 | 34 |
| | UAC | 0.26 | 41 | 1.56 | 121 |
| His | CAU* | 1.5 | 144 | 0.64 | 63 |
| | CAC | 0.5 | 48 | 1.36 | 134 |
| Gln | CAA* | 1.49 | 383 | 0.47 | 71 |
| | CAG | 0.51 | 131 | 1.53 | 228 |
| Asn | AAU* | 1.67 | 317 | 0.54 | 38 |
| | AAC | 0.33 | 62 | 1.46 | 104 |
| Lys | AAA* | 1.38 | 481 | 0.68 | 93 |
| | AAG | 0.62 | 218 | 1.32 | 182 |
| Asp | GAU* | 1.51 | 259 | 0.53 | 54 |
| | GAC | 0.49 | 85 | 1.47 | 149 |
| Glu | GAA* | 1.5 | 474 | 0.61 | 99 |
| | GAG | 0.5 | 159 | 1.39 | 228 |
| Ser | UCU* | 1.18 | 77 | 0.76 | 49 |
| | UCC | 0.56 | 37 | 2.1 | 136 |
| | UCA* | 1.88 | 123 | 0.51 | 33 |
| | UCG | 0.2 | 13 | 0.71 | 46 |
| | AGU* | 1.53 | 100 | 0.37 | 24 |
| | AGC | 0.66 | 43 | 1.56 | 101 |
| Pro | CCU* | 1.47 | 162 | 0.68 | 81 |
| | CCC | 0.52 | 57 | 2.04 | 243 |
| | CCA* | 1.78 | 196 | 0.75 | 89 |
| | CCG | 0.24 | 26 | 0.54 | 64 |
| Thr | ACU* | 1.29 | 156 | 0.67 | 57 |
| | ACC | 0.42 | 51 | 2.1 | 178 |
| | ACA* | 2.09 | 252 | 0.67 | 57 |
| | ACG | 0.2 | 24 | 0.55 | 47 |
| Ala | GCU* | 1.09 | 130 | 0.7 | 69 |
| | GCC | 0.58 | 69 | 2.11 | 207 |
| | GCA* | 2.07 | 246 | 0.67 | 66 |
| | GCG | 0.26 | 31 | 0.51 | 50 |
| Cys | UGU* | 1.69 | 133 | 0.5 | 27 |
| | UGC | 0.31 | 24 | 1.5 | 81 |
| Arg | CGU | 0.03 | 2 | 0.53 | 26 |
| | CGC | 0.09 | 6 | 1.64 | 81 |
| | CGA | 0.42 | 28 | 0.65 | 32 |
| | CGG | 0.12 | 8 | 2.03 | 100 |
| | AGA* | 3.71 | 250 | 0.45 | 22 |
| | AGG* | 1.63 | 110 | 0.71 | 35 |
| Gly | GGU | 0.49 | 73 | 0.42 | 37 |
| | GGC | 0.32 | 48 | 1.52 | 134 |
| | GGA* | 2.22 | 333 | 0.74 | 65 |
| | GGG | 0.97 | 145 | 1.33 | 117 |
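The RSCU values in the table follow directly from the codon counts: each count is divided by the mean count of the synonymous codons for that amino acid. The sketch below reproduces the Phe row from the counts in the table and then applies a 2 x 2 chi-square statistic to the same counts. The exact layout of the contingency table used in the study is not stated, so the 2 x 2 form (one codon versus its synonyms, high versus low group) is an assumption, and the function names are hypothetical.

```python
def rscu(counts):
    """Relative synonymous codon usage for one amino acid's codon family.

    counts maps each synonymous codon to its observed count.
    """
    mean = sum(counts.values()) / len(counts)
    return {codon: n / mean for codon, n in counts.items()}


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Phe codon counts taken from the table (high-CAI and low-CAI gene groups).
high_phe = {"UUU": 184, "UUC": 40}
low_phe = {"UUU": 61, "UUC": 113}

high_rscu = rscu(high_phe)  # UUU rounds to 1.64, UUC to 0.36
low_rscu = rscu(low_phe)    # UUU rounds to 0.70, UUC to 1.30

# UUU versus UUC, high group versus low group.
stat = chi_square_2x2(high_phe["UUU"], high_phe["UUC"],
                      low_phe["UUU"], low_phe["UUC"])
significant = stat > 3.841  # critical value for p = 0.05 with 1 d.f.
```

Under these assumptions UUU is used significantly more often in the high-CAI group, which is consistent with its asterisk in the table.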
Correspondence analysis (CoA) on relative synonymous codon usage (RSCU) is a method for identifying the major trends or factors (as orthogonal axes) responsible for the variation in codon usage among genes, each represented as a 59-dimensional vector (59 being the number of sense codons that have synonymous alternatives). In the correspondence analysis on codon usage of retroviral genes, the two axes that accounted for the largest share of the variation explained about 25% (major axis) and 10% of the variation of the whole data set. Each of the remaining axes accounted for less than 5% of the variation. The retroviral genes were widely distributed along the first major axis. Genes belonging to differently biased viruses were distinctly separated on the first major axis. The AU-rich retroviruses, for example, FIV, SFV-3, VISNA, OLV, and HIV-1, were on the extreme right, while the GC-rich viruses were at the other end (Figure
(a) Correlation between the AU content of each retroviral gene and its position on the first axis of CoA. (b) The distribution of synonymous codons along the first and second axes of the CoA. Codons ending with G or C are shown in blue, and codons ending with A or U are shown in orange.
Cluster analysis based on codon usage reveals the grouping within and across organisms based on the similarities and differences in their codon usage. The organisms are grouped using a distance measure that reflects how dissimilar the codon usage of each pair of organisms is. Cluster analysis on retroviral codon usage revealed that the retroviruses are grouped into two major clusters (Figure
Dendrogram representing the extent of divergence in relative synonymous codon usage of 56 retroviruses, built using unweighted pair group average clustering with Euclidean distances. Different clades are in different colors. To the extreme right, mean ENc, mean AU%, and AU3% are added from Table
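The clustering described in the caption, unweighted pair group average linkage on Euclidean distances between RSCU profiles, can be sketched in a few lines. The example below is a minimal illustration on made-up two-dimensional profiles (real RSCU vectors have 59 components); the function names and the toy data are hypothetical, and the software actually used in the study is not known from this text.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def upgma(profiles):
    """Average-linkage clustering; returns the merges as (left, right, distance)."""
    clusters = {name: [name] for name in profiles}  # label -> member names
    merges = []
    while len(clusters) > 1:
        labels = list(clusters)
        best = None
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                # Mean distance over all cross-cluster pairs of members.
                total = sum(euclidean(profiles[x], profiles[y])
                            for x in clusters[a] for y in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((a, b, d))
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
    return merges


# Toy profiles (hypothetical values, not RSCU data from the study).
toy = {"A": [0.0, 0.0], "B": [0.0, 1.0], "C": [5.0, 5.0]}
merges = upgma(toy)
```

The sequence of merges, together with the distances at which they occur, is what a dendrogram displays: the two closest profiles join first, and more divergent ones join at greater heights.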
Retroviruses are an extremely important system for study, especially because of their potential to adversely affect the quality of life and life span of a large fraction of the world population, particularly in developing countries. These viruses are a potential threat to mankind because of their complex biological mechanisms and evolution. This study aims to reveal the nature of some important genetic, genomic, and evolutionary features of these viruses, which may be used to better understand the retroviral system. It was designed to elucidate the general complexity and preferences of codon usage across all the retroviruses, based on certain well-established parameters. The analysis of codon usage and base composition of retroviral genes documented here has revealed some useful facts. Furthermore, the results obtained through the various analyses were consistent with each other, which supports their validity.
The large majority of the 56 completely sequenced retroviruses belonged to the Orthoretrovirinae subfamily. Within the Orthoretrovirinae, different genera contained almost equal numbers of viruses. Several features of retroviruses were revealed through the computation and analysis of well-established parameters characterizing their composition and codon usage: RSCU, codon bias (ENc), base content, preferred and optimal codons, the major factors of CoA, and grouping by cluster analysis based on codon usage.
It is found that retroviral genes do not possess significantly high codon bias. The genes are almost equally distributed between weak bias and moderate bias. This observation is very similar to the findings of Jenkins and Holmes (2003), who also observed moderate bias in 50 human RNA viruses [
Some retroviruses were significantly similar in their overall codon usage, while the majority were not. Four preferred codons were identified, all of which belonged to the separately identified set of 26 optimal codons. It was observed that phylogenetically closer retroviruses possess relatively similar codon usage and almost the same sets of preferred and optimal codons, with A or U in their synonymous positions. But
In light of the general fact that selective constraints are greater in the first two positions of codons, whereas mutational bias is greater in the third position, all the observations indicate that codon bias in retroviruses in general is strongly dependent on base composition and mutational bias. This observation is also supported by earlier studies, which have shown that the main factor explaining codon usage in viruses is mutation bias [
There is a good possibility that the compositional bias detected in retroviruses in this study is the result of a directional mutational pressure imposed by one of the two enzymes that copy the retroviral genome, namely the retrovirus-specific reverse transcriptase (RT), which converts the viral RNA into DNA. It is a distinct possibility that the absence of any strong codon bias in retroviruses is due to the combined effect of misincorporations by the error-prone RT and the activity of another class of enzymes, cytidine deaminases such as those of the APOBEC3 superfamily [
Observations from the comparative analysis of codon usage bias reveal a lack of strong translational selection in a considerable number of retroviruses, and this could be a problem when using retroviruses as expression vectors for gene therapy and immunization. Instead, the use of retroviruses with AU-rich nucleotide composition, utilizing the optimal set of codons, is recommended. Information on optimal codons obtained from this study is expected to be useful for codon optimization, especially for designing retroviral vectors with higher translational efficiency and for producing simple and safe retroviral vectors for gene therapy and immunization.
Overall, the results indicate that mutational bias is a dominant factor, relative to translational selection, in shaping the codon usage of retroviruses. In these viruses, where codon usage bias is not strong, it is primarily determined by base composition, that is, the AU (or GC) content of the genes, while selection for efficient expression of genes is probably another important factor affecting their codon usage. The intricate character of codon usage in these viral systems is probably maintained by the incorporation of errors during molecular processing of the retroviral genomes, which helps avoid a strong immune response from the infected host while still striking a balance with adequate execution of the basic life cycle mechanisms of these viruses. In spite of inter- and intra-genomic differences in base and codon usage, it is possible that the extant retroviruses, in general, have emerged through a complex but concerted process of evolution.
The authors declare that they have no conflict of interests and that they did not receive financial support for this study.
The authors are thankful to Arpita Mukherjee (Scientist, Department of Electronics, Central Mechanical Engineering Research Institute, Durgapur-713209, India) for her technical help on statistics and critical reading of the paper.