Analyzing the Evolution and Host Adaptation of the Rabies Virus from the Perspective of Codon Usage Bias

,


Introduction
Rabies is a fatal zoonotic disease of almost all warmblooded animals, including humans.It is caused by the rabies virus (RABV), which is responsible for the deaths of more than 59,000 people annually [1,2].The disease is particularly prevalent in developing countries, primarily in Asia and Africa, where they account for approximately 99% of human cases [3][4][5][6].This virus poses a persistent and significant health threat to the public.To this end, international organizations such as the World Health Organization (WHO), the World Organization for Animal Health (WOAH), and the Global Alliance for Rabies Control (GARC) are working together to support countries to eliminate dog-mediated rabies by 2030 (https://www.who.int/news-room/commentaries/detail/new-global-strategic-plan-to-eliminate-dog-me diated-rabies-by-2030).
Codon usage bias is the nonrandom usage among the synonymous codons [13][14][15][16].Under environmental pressure, viruses prefer synonymous codons in order to adapt to change, given the degeneracy of the codon [17,18].While viruses replicate inside a host organism, they rely on the cellular resources and energy of the host to carry out the processes of replication, translation, and expression [19,20].Thus, as viruses switch between different hosts, their codon usage pattern would change to increasing their translation efficiency and overall replication rate for new cellular environment.Meanwhile, evasion of the host immune response and spillover events could result from the change in viral codon usage bias [21,22].In the study of SARS-CoV-2 codons, it was observed that the expression levels of host genes with a similar pattern of synonymous codon usage to the virus significantly decreased.This could be attributed to competition for resources between the virus and the host, leading to the reduction in the expression of host genes that share similar codons, potentially facilitating immune evasion [23].Furthermore, through the analysis of codon usage patterns in virus-infected yeast cells and human cells, it was found that the similarity in codon usage between the virus and the host influences their coevolution [24].Therefore, it is essential to understand into the molecular evolution and host adaptability of viruses by studying the viral codon usage bias [24][25][26][27].
RABV is an ancient virus that has been extensively studied to elucidate its genomic features and transmission, evolutionary dynamics, and epidemiology [8][9][10]28].However, analysis on codon usage and adaptation to hosts of RABVs were scattered.To fill the gap, we used the latest and most comprehensive RABV genomes to analyze viral base composition, factors affecting codon usage, and adaptation to hosts.This study provides novel insights into the evolution, spread, and spillover of RABV from the perspective of the viral codon usage.

Relative Synonymous Codon Usage (RSCU) Analysis.
RSCU is a measure of the degree of codon usage bias in different genes or species [41].RSCU values were calculated using the Seqinr package (version 4.2-23) in R. The RSCU value is calculated as follows: If the RSCU values >1.6, the codon is considered to be overrepresented, while values <0.6 indicate underrepresentation.A value of 1.0 suggests no codon usage bias.

Effective Number of Codons (ENC)
Analysis.ENC refers to the number of valid codons used in a gene, indicating that the random selection of codon usage deviation [42].The ENC values range from 20 to 61, with values closer to 20 representing stronger codon preference.The following equation was applied to calculate the ENC: If the codons are only under the pressure from G + C mutational bias, the ENC-GC3s points will fall on or around the expected curve, whereas, the points away from the expected curve indicate that the codons are also under the pressure from natural selection.
PR2 analysis is a method for evaluating the effects of random mutational pressure and natural selection on codon usage patterns.The plot consists of the AU deviation (A3/(A3 + U3)) as the vertical coordinate and the GC deviation (G3/(G3 + C3)) as the horizontal coordinate, with a plotting origin of 0.5.The direction and extent of codon usage bias is inferred from the quadrant in which the points fall and the distance from the origin point (0.5, 0.5), with the origin point implying equal contributions from random mutation pressure and natural selection [45,46].
Neutrality plot refers to the assessment of the correlation between the GC3s as the horizontal coordinate and GC12s as the vertical coordinate [47].Mutations occurring at the third position of the codon are mostly synonymous, which are the most neutral bases, whereas mutations in the bases at the first and second positions of the codon cause amino acid changes.Therefore, the ratio of GC3s-GC12s, also known as the regression coefficient, can discern the dominant role of random mutation and natural selection on the codon.A regression coefficient closer to 1 indicates that there is little or no external selective pressure and that random mutation plays a dominant role.The closer the regression coefficient is to zero, or the absence of a significant correlation in the regression curve, the more natural selection prevails.

Correspondence Analysis (COA) and Correlation Analysis.
COA is a multivariate statistical analysis that determines variable and sample relationships by reducing the dimensionality of the data and filtering out the main factors.COA was performed from the RSCU values and visualized in the ggplot2 package of R.
The corrplot package of R and GraphPad Prism were applied to measure the correlation between variables.

Relative Dinucleotide Abundance Analysis.
Codon usage bias is restrained by the relative abundance of 16 dinucleotides, possibly due to the intrinsic properties of the viruses or to the mutational pressure of the innate immune system of the host.The relative dinucleotides abundances are defined as the ratio of observed to expected dinucleotide frequency, computed by SSE (version 1.4, http://www.virus-evolution.org).The contents of the 16 dinucleotides calculated as the following formula [48]: In the formula, f x and f y represent the frequency of nucleotide X and Y, respectively, and the f xy represents the frequency of the dinucleotide XY.When P xy < 0.78 or P xy > 1.23, the dinucleotides were inferred as underrepresented or overrepresented, respectively [49].
2.9.Adaption Index Analysis between RABV and Hosts.The codon adaption index (CAI) is supposed to investigate the codon usage bias of viral genome or gene across different species, with a range from 0 to 1. Theoretically, a high-CAI value means that the codon usage pattern of a virus is adapted to that of its host [50,51].The relative codon deoptimization index (RCDI) is used to compare the similarity of codon usage between virus and host.If the RCDI = 1, the virus is considered to be fully adapted to its host, while the RCDI much higher than 1 indicates poor adaptability [52,53].Moreover, the similarity index (SiD) value is used to assess the influence of host codon usage on pathogen codon usage, with the range from 0 to 1. Higher SiD values indicate greater impact of the hosts on the codon usage of the pathogen [54].The calculation of the SiD value was as follows: To balance the sample bias between the three phylogenetic clades, we used a python script to randomly select 50 sequences from each of the 2,018 RABV coding sequences from bat-related, dog-related, and RAC&SK-related, for a total of 150 sequences for the fitness analysis.All of the CAI, RCDI, and SiD were computed by vhcub package of R [55].

Distribution and Phylogenetic Tree of Full-Length Genomes.
According to the data from the GenBank database, the RABV full-length genomes have been reported in 79 countries, and the virus can infect virtually at all warmblooded animals, including bat, dog, fox, wolf, skunk, racoon, and even human.There were 1,187 sequences, with the largest number in Nouth America, most of which were sampled from skunks or raccoons.In contrast, South America reported the fewest sequences, with only 50, most of which came from bats.Two hundred twenty-one sequences were reported in Europe, and their hosts were mainly wild Canis.Of note, the sequences infecting with Canis-familiaris and human are concentrated in Asia and Africa (Figure 1(a) and Table S1 (Supplementary 1)).
The ML tree indicated that the RABV genomes might be divided into eight clades, corresponding to the previous study.Based on the outer ring (host species) of the ML tree, the sequences could roughly divide into three major group, including bat-related, dog-related, and RAC&SK-related (Figure 1(b), https://doi.org/10.6084/m9.figshare.24077211).2(a)).

d d R A A A A A A A A A A A
To further investigate the codon usage pattern of RABV coding sequences, we used PR2 plot analysis to confirm the effect of random mutation and natural selection on the codon usage bias.Our study found that random mutation and natural selection play unequal roles in the viral codon usage bias, due to the fact that the points are deviated from the origin and concentrated in Quadrants 1 and 3 (Figure 2(b)).
We then used neutral analysis to show that, compared with random mutations, natural selection pressures play a major role in shaping the codon usage bias of RABV coding sequences.Based on the ML tree analysis, these viral genomes were divided into three phylogenetic groups, and their correlation coefficients between GC3 and GC12 were calculated separately for the bat-related (0.1030), the dog-related (0.0135), and the RAC&SK-related (0.00514).Thus, their percentages of natural selective pressure on codon usage patterns were 89.70%, 98.65%, and 99.49%, respectively.Similarly, natural selection also played an important role in all the ungrouped genomes, contributing 93.38% of the constrain (Figure 3).
Our results suggest that the host species of RABV is highly correlated with phylogenetic clade and is a potentially critical factor in viral evolution and codon usage bias generation.We applied the first 20 axes to the relative and cumulative inertia analyses and found that the first axis explained 30.96% of the data inertia and was the main driver of variation, with each subsequent axis explaining a decreasing amount of variation.The first two axes accounted for almost half of the total variance, so we restricted our analysis to these two axes.In the COA analysis, these three groups clustered according to their viral host species, suggesting that viral host species can affect codon usage patterns (Figures 4(a) and 4(b)).
In addition, multifactorial correlations were calculated using corrplot package of R. Almost all indices, with the exception of the relationship between T3s and Axis2, were significantly correlated with the main axis, further confirming the above findings.Altogether, the codon usage bias of RABV coding sequences was under the pressure of both natural selection and random mutation (Figure 4(c)).

Comparison of Relative Dinucleotide Abundance of
RABV and Its Hosts.The relative abundance of 16 dinucleotides were calculated and showed that the frequencies of occurrence of each dinucleotide in the RABV and its host coding region were not evenly distributed.We found that CpU (1.238 AE 0.017), GpA (1.309 AE 0.015) and UpC dinucleotides (1.33 AE 0.014) were overrepresented, which were consistent with that GCU, AGA for Ala, CUG for Leu, CCU for Pro, UCU, UCA, UCC for Ser, ACU for Thr, GAG for Glu, GGA for Gly, and AUC for Ile were preferred in RSCU with the value > 1.While CpG (0.468 AE 0.014), GpC (0.665 AE 0.012), and UpA dinucleotides (0.663 AE 0.013) were underrepresented, which were consistent with that the GCG for Ala, CGC, CGG, CGU for Arg, CCG for Pro, UGG, AGC for Ser, ACG for Thr, UGC for Gly, and UAG for Ter were underrepresented in RSCU with the value < 0.6.Meantime, we noted that the relative dinucleotide abundances of viral hosts were similar.The GpA, TpC, GpC, and TpG showed different degrees between RABVA and its hosts (Table 1 and Figure 5).

Most Preferred Codons in RABV and
Isoacceptor tRNAs in Its Hosts.We compared the preferred codons (for each amino acid) in RABV with the most abundant isoacceptor tRNAs in its hosts and found that 9, 8, 9, and 10 of the 18 codons in the virus best matched the most abundant isoacceptor tRNAs in Myotis lucifugus, Mustela putorius furo, Canis familiaris, and Homo sapiens, respectively.Among the matched tRNAs, six tRNAs (GTT, CTG, CTT, GAA, AGG, and GTA) corresponding to Asn, Gln, Lys, Phe, Pro, and Tyr were the most abundant in all four hosts (Table 2).
The CAI values of viral coding genes, calculated from host usage patterns, are commonly used to predict the efficiency of gene expression in different hosts.The average CAI values for RABV with bat, RAC&SK, dog, and human were found to be 0.780 AE 0.002, 0.785 AE 0.002, 0.7793 AE 0.002, and 0.774 AE 0.002, respectively.Based on the CAI analysis, our findings imply that there is a relatively higher adaptation of the RABV, to RAC&SK cellular systems in contrast to bats, dogs, and humans (Figure 6).
The RCDI value is a measure of the similarity between viral coding genes and its hosts.We calculated the average RCDI values for RABV with bat (1.073 AE 0.004), RAC&SK (1.083 AE 0.004), dog (1.089 AE 0.005), and human (1.072 AE 0.004), respectively.Of note, the human counterpart had the lowest RCDI values compared to the other three hosts, Transboundary and Emerging Diseases  Transboundary and Emerging Diseases indicating that the genes encoding RABV are the least deoptimised and more adaptive to the human host, which is consistent with the analysis of isoacceptor tRNAs (Table 2 and Figure 7).The SiD value is applied to assess the effect of the overall codon usage pattern of the viral host on viral codon usage.The average CAI values for RABV with bat, RAC&SK, dog, and human were found to be 0.4916 AE 0.0001, 0.4917 AE 0.0001, 0.4916 AE 0.0001, and 0.4915 AE 0.0001, respectively.This indicates that during RABV evolution, RAC&SK had a greater impact on the virus than other three hosts (Figure 8).

Discussion
Rabies has a long history, almost accompanying the history of human civilization [1,2].It is widely distributed throughout the world and switches repeatedly between the different species [8,9,11].Of note, understanding the ability of RABV to spread across species, particularly across orders, is important for studying viral patterns of cross-species evolution, for predicting the spread of zoonotic infections and thus for their ultimate control.Therefore, our study provides new insights into the evolution, spread, spillover, and adaptation to hosts of RABV from the perspective of the viral codon usage.
The latest and the largest RABV data were used to apply phylogenetic analysis, suggesting the RABV genomes could be divided into three clades, based on the host species, which were similar to the previous reports [8][9][10].Subsequently, we searched the GenBank for reference genomes, Canis lupus familiaris, Mustela erminea, and Myotis lucifugus genome as representative hosts for each of the three clades, plus the Homo sapiens genome.The base composition analysis of RABV were applied to comprehensive analyze and showed a weak codon usage bias with a preference for the codons ending in A or U, consistent with the RSCU results.
The ENC-plot, PR2-plot, and neutrality-plot analysis were applied to assess the main effect on codon usage bias.For the all sequences, the mean ENC value is 53.83, with a narrow distribution range from 52.98 to 55.01, indicating a low-codon usage bias [56].Compared to the bat-related and dog-related, the points of RAC&SK-related are more clustered.Furthermore, the points in PR2 plot are concentrated to the left of the origin suggesting that all the sequences have a preference for the codons ending in C rather than G.It is consistent with the ENC plot that he points of RAC&SKrelated are more clustered in the Quadrant 3. In term of the neutrality-plot analysis, natural selection pressures play a major role in shaping the codon usage bias of RABV coding sequences, contributing about 90% of the constrain.Of the three clades, the RAC&SK-related has the smallest value of correlation coefficient and R 2 , and nonsignificant P value, indicate that the genome is highly conserved in GC% and under greater pressure from natural selection [24,27].
In the COA analysis, we find that the first two axes account for almost half of the total variance and the Axis 1 and Axis 2 can divide the RSCU of RABV into three groups, roughly corresponding to the three host species, suggesting viral host species may affect codon usage patterns.Subsequently, a total of 17 nucleotide-relevant indices are used for the correlation analysis.Almost all indices were significantly correlated with the main axis, to further confirming the above results [19,56].order to compare the differences of RABV adaptation to various hosts, we applied the values of CAI, RCDI, and SiD for assessment.Based on the values of CAI, the adaptation of RABV to each host in descending order is RAC&SK (M.P.F), bat (M.L), dog (C.F), and human (H.S), indicating that the RABV coding sequences have a higher gene expression potential in the RAC&SK and greater adaptation to the RAC&SK.There is a similar trend to the CAI analysis in SiD analysis, suggesting that the host of RAC&SK apply stronger pressure to shape viral codon usage pattern, consistent with the neutrality-plot analysis [59,60].Therefore, the RAC&SK might an ideal switching host from the order pteropods to carnivores, with high adaptation to RABV.However, in term of the RCDI analysis, the value in descending order is dog (C.F), RAC&SK (M.P.F), bat (M.L), and human (H.S), suggesting that RABV is predicted to have a higher translation rate to the host of human, due to the highest number (10/18) of human isoacceptor tRNAs that best match viral preferred codons [57].
RABV employs diverse different strategies to thrive within various host species [61].When carnivorous animals such as dogs, skunks, and raccoons become infected with the RABV, it often leads to acute death.However, in the case of 12 Transboundary and Emerging Diseases bats, it may result in long-term carriers of the virus [62,63].Additionally, when a host is infected with RABV transmitted by the same species, it typically manifests acute clinical symptoms, eventually leading to fatality.In contrast, infections with viruses from other species tend to yield milder symptoms and may not be lethal [64,65].This indicates that the RABV needs to evolve to overcome the host barrier.It also demonstrates that the virus has different adaptation patterns in the hosts of different species, which may be related to the usage of codons.

Transboundary and Emerging Diseases
Although our study provides an overall insight into the RABV codon usage, frankly, this study has several limitations.First, to avoid sampling bias, which is also limited by computational power, we used a Python script to randomly select 50 sequences in each clade, 150 in total, to assess the adaptation of RABV to the hosts.The random selective sequences might not adequately represent the adaptation of the entire clade.Second, our study focuses on only four viral  Transboundary and Emerging Diseases host, bats, dogs, human, and RAC&SK, because these hosts were distinctly clustered and representative on the phylogenetic tree and some hosts' genomes are not available in GenBank.In summary, our findings provide a novel and comprehensive insight into codon usage bias of RABV and its adaptation to hosts.We reveal natural selection pressure shaping the RABV codon usage pattern, account for more than 90%, and in brief analyzed why RABV has the wide range of hosts and the relation between viral transmission and host adaptation.Overall, the data in the study will help to RABV surveillance and further research to control and eliminate this deadly virus.
lated d d d d d d d d d d d d d d d d d d d d d d d d d

FIGURE 1 :
FIGURE 1: Distribution and phylogenetic tree of full-length genomes.(a) Geographical distribution of RABV genomes around the world.Different continents are denoted by different colors of circle; the number of sequences is denoted by sizes of circle for different country.(b) Phylogenetic tree of RABV genomes.Inner ring stripe represents the lineages in previous study; outer ring stripes represent the different host of RABV.

ðbÞFIGURE 2 :FIGURE 4 :
FIGURE 2: ENC-plot (ENC value plot against GC3s) analysis and parity rule-2 plot analysis of RABV.Different clades are denoted by different colors of circle.(a) ENC-plot (ENC value plot against GC3s) in relation to three major clades and (b) parity rule 2-plot in relation to three major clades.

FIGURE 7 :
FIGURE 7: The relative codon deoptimization index (RCDI) analysis.Different hosts are denoted by different colors.(a) All sequences of RABV genome; (b) the genomic sequences of the bat-related; (c) the genomic sequences of the dog-related; (d) the genomic sequences of the RAC&SK-related.Wilcoxon test was performed to compare the mean of the RCDI values pertaining to the different sets of hosts.* P <0:05; * * P <0:01; * * * P <0:001.NS, not significant.

FIGURE 8 :
FIGURE 8: The similarity index (SiD) analysis.Different hosts are denoted by different colors.(a) All sequences of RABV genome; (b) the genomic sequences of the bat-related; (c) the genomic sequences of the dog-related; (d) the genomic sequences of the RAC&SK-related.Wilcoxon test was performed to compare the mean of the SiD values pertaining to the different sets of hosts.* P <0:05; * * P <0:01; and * * * P <0:001.NS, not significant.

Table 1 ,
3.2.Base Composition Analysis of RABV Full-LengthGenomes.A clear trend of AU richness was observed in the RABV genomes by considerable analysis of the base composition of the viral coding sequences (Supplementary 2).The mean compositions (%) of the nucleotides A (28.10 AE 0.01) and U (26.43 AE 0.02) were significantly higher than C (21.91 AE Supplementary 2).3.3.Codon Usage Bias under Natural SelectionPressure.The ENC-plot revealed that both of random mutational and natural selective pressure on the RABV codon usage bias.We found that ENC value of coding sequences was 53.83 AE 0.28 higher than 35, with a low-codon usage bias, and all the points regardless of group fall under the expected curve and gathered on the left.Furthermore, there was a clear positive correlation between the ENC and GC3s (Figure

TABLE 1 :
Relative synonymous codon usage patterns of RABV in comparison with its host.

TABLE 1 :
Continued. in coding each amino acid are marked in bold and italic, respectively.Most preferred or dispreferred codons in both of virus and host are underline.
<0:01; and * * * P <0:001.NS, not significant.Aromo represents aromaticity of the viral gene product; GRAVY represents grand average hydropathicity score.L_sym represents the number of synonymous amino acids; L_aa represents the number of amino acids.FIGURE 5: The relative dinucleotide abundance values of 16 dinucleotides of the RABV and its hosts' coding region.Different coding sequences are denoted by different colors.

TABLE 2 :
The most preferred codon of RABV for each amino acid and isoacceptor tRNAs in its hosts.
Note: most abundant isoacceptor tRNAs in its hosts matching the most preferred codons of RABV are marked in bold.
The codon adaption index (CAI) analysis.Different hosts are denoted by different colors: (a) all sequences of RABV genome; (b) the genomic sequences of the bat-related; (c) the genomic sequences of the dog-related; (d) the genomic sequences of the RAC&SK-related.Wilcoxon test was performed to compare the mean of the CAI values pertaining to the different sets of hosts.