Randomness for Nucleotide Sequences of SARS-CoV-2 and Its Related Subfamilies

The origin and evolution of SARS-CoV-2 have been important issues in tackling COVID-19. Research on these topics would enhance our knowledge of this virus and help us develop vaccines or predict its paths of mutation. There is a large body of theoretical and clinical research in this area. In this article, we devise a structural metric which directly measures the structural differences between any two nucleotide sequences. To explore how the evolution works, we associate the nucleotide sequences of SARS-CoV-2 and its related families with degrees of randomness. Since the distances between randomly generated nucleotide sequences are tightly concentrated around their mean with low variance, such sequences serve as a good fundamental reference. This reference can then be applied to measure the randomness of other Coronaviridae sequences. Our findings show that the relative randomness ratios are very consistent and concentrated, which indicates that the randomness of these sequences is stable and predictable. The findings also reveal the evolutionary behaviour of the Coronaviridae and its subfamilies.


Introduction
COVID-19 has had a huge impact on all walks of life. To develop stable and trustworthy vaccines [1,2], one needs to track and analyse the properties of SARS-CoV-2, which, together with MERS-CoV [3] and SARS-CoV, belongs to the betacoronavirus subfamily. One also needs to compare the properties of its related subfamilies: alphacoronavirus, deltacoronavirus, and gammacoronavirus [4]. Within the Coronaviridae, betacoronavirus is the deadliest subfamily; in it, SARS-CoV, MERS-CoV, and SARS-CoV-2 emerged in 2003, 2012, and 2019, respectively. Many genomic, clinical, statistical, and analytical tools are available to evaluate and analyse their properties. Among the theoretical and clinical approaches, genetic analysis provides a straightforward way to delve into the structures of the Coronaviridae [5,6]. Some researchers focus on geographic, demographic, and genomic analysis to extract patterns of the viruses [7,8]. Though the origin and evolution of these viruses have been studied previously (for example, MERS-CoV [9] and SARS [10,11]), there is still a long way to go in mapping out the interaction of these viruses. Currently, there are many theories and much evidence about the mechanisms regulating the evolution and mutation of SARS-CoV-2 [12-14]. Nonetheless, a decisive account of such mechanisms still depends on further research and findings. In this article, we analyse their properties from the viewpoint of randomness, i.e., the degree of randomness of their nucleotide sequences. We devise a structural metric to measure the distances between the various Coronaviridae nucleotide sequences and randomly generated nucleotide sequences. These distances indicate how far the Coronaviridae sequences are from random nucleotide sequences.
We utilise the coronavirus genome data from NCBI datasets [15] and measure the distances for each individual subfamily of the Coronaviridae. Our results show that this structural metric is well suited to revealing the properties of randomness: the distances between random sequences are fairly stable and concentrated, and this feature makes the concept of randomness feasible. From these settings, we then calculate the relative randomness ratios (RRR) and derive our findings from them. The method is described in Section 3, the results of the implementation are listed in Section 4, and the conclusions are given in Section 5.

Theoretical Settings
In order to clearly measure the distances between structures, we devise a structural metric in this section, which will be applied in the later sections.
For any vector $\vec{v}$, we use $\vec{v}(j)$ or $\vec{v}_j$ to denote its $j$th element and $|\vec{v}|$ to denote its length. We also use $\|\vec{v}\|_E$ to denote its Euclidean norm.

Common Finite Interval (CFI).
Let AFS denote the set of all the ascending finite sequences. Let $\vec{v}, \vec{w} \in \mathrm{AFS}$ be arbitrary.
Define the greatest lower bound lb by lb($\vec{v}$, […]
Let finite $K \subseteq \mathbb{R}$ be arbitrary. Let $\mathrm{Sort}(K) \in \mathrm{FINI}$ denote the vector obtained by sorting all the elements in $K$. Define a difference operator Diff over finite vectors by Diff […]
This serves as the common structure between two structures.
Definition 3. (ascending finite sequences). Let $[a, b]_{<}$ (where $a < b$) denote the set of all the ascending real vectors whose first element is $a$ and last element is $b$. Let FINI be the union of all $[a, b]_{<}$ for any $a < b$, i.e., $\mathrm{FINI} = \bigcup \{[a, b]_{<} : a < b,\ a, b \in \mathbb{R}\}$.

Definition 4. (structural metric).
Define a distance function $\delta$ over FINI by $\delta$($\vec{v}$, […]
Proof. It can be proved, according to Definition 4, by considering all the possible cases of how the intervals relate to each other.
Claim 6. If $d_1, d_2, \ldots, d_n$ are metrics over a set $K$, then $d(a, b) = \sum_{j=1}^{n} \alpha_j \cdot d_j(a, b)$ is also a metric on $K$.
Proof. It follows immediately from the definition of a metric.
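The argument behind Claim 6 can be written out as a short derivation. The sketch below assumes that every weight $\alpha_j$ is strictly positive; this condition is an assumption added here, since the sign of the weights is not stated in the claim. Non-negativity and symmetry of $d$ are inherited term by term from the $d_j$, and $d(a, b) = 0$ holds exactly when every $d_j(a, b) = 0$, i.e., when $a = b$. The triangle inequality follows by applying the triangle inequality of each $d_j$ inside the sum:

```latex
\begin{align*}
d(a, c) &= \sum_{j=1}^{n} \alpha_j \cdot d_j(a, c) \\
        &\le \sum_{j=1}^{n} \alpha_j \cdot \bigl( d_j(a, b) + d_j(b, c) \bigr) \\
        &= \sum_{j=1}^{n} \alpha_j \cdot d_j(a, b) + \sum_{j=1}^{n} \alpha_j \cdot d_j(b, c) \\
        &= d(a, b) + d(b, c).
\end{align*}
```

With equal weights $\alpha_j = 1/4$ over the four nitrogenous bases, this is the form of weighted metric used in Example 1.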
Example 1. Suppose nucleotide sequences $N_1$ and $N_2$ are given above.
Let $p_{iQ}$ denote the positions of nitrogenous base $Q$ in sequence $i$. Let $p_{12Q}$ denote the positions of the common sequence of $p_{1Q}$ and $p_{2Q}$. The results are presented in Table 1. Let BASES = {"A", "C", "G", "T"}. Now we define $\delta$($N_1$, […]
The weights are all predetermined to be 1/4 for each nitrogenous base. These values could also be adjusted according to professional judgement; for example, the weights could be determined by the relative frequencies of the bases. Example 1 lays the foundation for our later arithmetical calculations.
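The bookkeeping in Example 1 can be illustrated with a minimal Python sketch (the analysis itself was carried out in R). It extracts the ascending positions of each base and combines four per-base distances with weights of 1/4. The function names are hypothetical, and `placeholder_base_distance` is only a stand-in (the size of the symmetric difference of two position sets); it is not the structural metric of Definition 4, which would be passed in through the `base_distance` parameter.

```python
from typing import Callable, Dict, List, Optional

BASES = ("A", "C", "G", "T")


def base_positions(seq: str) -> Dict[str, List[int]]:
    """Return, for each base Q, the ascending 1-based positions of Q in seq."""
    positions: Dict[str, List[int]] = {q: [] for q in BASES}
    for index, base in enumerate(seq, start=1):
        if base in positions:
            positions[base].append(index)
    return positions


def placeholder_base_distance(p1: List[int], p2: List[int]) -> float:
    """Hypothetical stand-in for the per-base metric (NOT Definition 4).

    Counts the positions that occur in exactly one of the two lists.
    """
    return float(len(set(p1) ^ set(p2)))


def weighted_distance(
    n1: str,
    n2: str,
    weights: Optional[Dict[str, float]] = None,
    base_distance: Callable[[List[int], List[int]], float] = placeholder_base_distance,
) -> float:
    """Weighted sum of per-base distances; weights default to 1/4 per base."""
    if weights is None:
        weights = {q: 0.25 for q in BASES}
    pos1, pos2 = base_positions(n1), base_positions(n2)
    return sum(weights[q] * base_distance(pos1[q], pos2[q]) for q in BASES)


print(base_positions("ACGTA"))
print(weighted_distance("ACGT", "ACGT"))
```

Weights based on the relative frequencies of the bases could be supplied through the `weights` argument instead of the default.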

Methods
There are several steps for calculating the relative randomness ratios (RRR).
Computational and Mathematical Methods in Medicine
(i) Generate a set of 1000 random nucleotide sequences whose lengths are all fixed at 30000. The generated random (nucleotide) sequences are presented in Table 2.
(ii) Each sequence is regarded as a node. We then calculate the distance matrix for these nodes. The metric is a weighted metric consisting of 4 metrics, each of which measures the structural distance with respect to one nitrogenous base. A concrete computation is shown in Example 1.
(iii) Some patterned nucleotide sequences are created, and their distances to the random sequences are calculated. These sequences are nonessential; they are generated only for comparative purposes. The created (rule-based) nucleotide sequences and their distances are presented in Table 3.
(iv) The structural distances between SARS-CoV-2 nucleotide sequences and random ones are calculated. The results are presented in Table 4.
(v) The structural distances between MERS-CoV nucleotide sequences and random ones are calculated. The results are presented in Table 5.
(vi) The structural distances between SARS-CoV nucleotide sequences and random ones are calculated. The results are presented in Table 6.
(vii) The structural distances between alphacoronavirus nucleotide sequences and random ones are calculated. The results are presented in Table 7.
(viii) The structural distances between deltacoronavirus nucleotide sequences and random ones are calculated. The results are presented in Table 8.
(ix) The structural distances between gammacoronavirus nucleotide sequences and random ones are calculated. The results are presented in Table 9.
(x) The RRR for each subfamily is calculated; the way to calculate it is explained in Section 4.2.
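Steps (i) and (ii) can be sketched in Python as follows. This is an illustration under stated assumptions, not the R implementation used for the results: bases are assumed to be drawn uniformly and independently, the sizes are reduced (10 sequences of length 300 rather than 1000 of length 30000) so that the example runs quickly, and `hamming_placeholder` is a hypothetical stand-in for the structural metric.

```python
import random
from typing import Callable, List


def random_sequence(length: int, rng: random.Random) -> str:
    """Draw a nucleotide sequence with bases chosen uniformly (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def hamming_placeholder(s1: str, s2: str) -> float:
    """Hypothetical stand-in distance (NOT the structural metric).

    Counts mismatching positions of two equal-length sequences.
    """
    return float(sum(a != b for a, b in zip(s1, s2)))


def distance_matrix(
    seqs: List[str], distance: Callable[[str, str], float]
) -> List[List[float]]:
    """Symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(seqs[i], seqs[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


rng = random.Random(0)  # fixed seed so the run is reproducible
sequences = [random_sequence(300, rng) for _ in range(10)]
matrix = distance_matrix(sequences, hamming_placeholder)
print(len(matrix), len(matrix[0]))
```

Only the upper triangle is computed, because a metric is symmetric; for 1000 sequences this roughly halves the number of distance evaluations.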

Results
We use R (version 4.0.2), in particular the package "Biostrings", to implement the theoretical setting. Following the procedure described in Section 3, we present the results in this section. We set the length of each random nucleotide sequence to 30000, which is approximately the genome length of the SARS-CoV virus family. We also use R to generate 1000 sample sequences for our experiment (a number limited by the capacity of our computers).
After removing the diagonal, we calculate descriptive statistics for the remaining 1000 × 999 elements of the distance matrix: the minimum, maximum, mean, and standard deviation. The minimum is 127.1 and the maximum is 134.7. The mean is 130.88 and the standard deviation is 0.83. Since the standard deviation is very small, the structural distance between any pair of random nucleotide sequences is highly concentrated around the mean; this is a good referential property for our further analysis. Now, let us present the distances between some patterned sequences and the random sequences.
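The descriptive statistics of the off-diagonal entries can be computed as in the following Python sketch. The function name and the toy matrix are illustrative only, and the sample standard deviation (divisor n - 1, which is also what R's `sd` computes) is an assumption about how the reported value was obtained.

```python
import statistics
from typing import Dict, List


def off_diagonal_summary(matrix: List[List[float]]) -> Dict[str, float]:
    """Min, max, mean and sample standard deviation of off-diagonal entries."""
    n = len(matrix)
    values = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


# Toy symmetric matrix, for illustration only (not data from the tables).
toy = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 3.0, 0.0],
]
print(off_diagonal_summary(toy))
```

Because the matrix is symmetric, each pairwise distance is counted twice; this leaves the minimum, maximum, and mean unchanged.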

Distance for Nucleotide Sequences.
We import the SARS-CoV-2 genomic sequences and save them in S4DSC2 [15]. Since S4DSC2, i.e., $\{s_1, s_2, \ldots, s_{4617}\}$, is too large (4617 sequences) to be handled by our computer, we sample only 20 of them. The results are presented in Table 4, where column "Sequence" is the index of the sampled sequence in the data set; "Min" and "Max" are the minimal and maximal distances between the given sequence and the random sequences, respectively; "Mean" is the average distance between the given sequence and the random sequences; "Sd" is the standard deviation of this set of distances; "Mean rand" is the mean of the distance matrix of the random sequences; and "RRR" is the relative randomness ratio, i.e., "Mean" divided by "Mean rand." The columns of the later tables have the same meanings, so we do not repeat the explanation.
For MERS-CoV, 530 sequences were downloaded, and we randomly sample 20 of them; the results are presented in Table 5. For SARS-CoV, 10647 sequences were downloaded, and we randomly sample 20 of them; the results are presented in Table 6. For alphacoronavirus, 1002 sequences remained after downloading and filtering, and we randomly sample 20 of them; the results are presented in Table 7. For deltacoronavirus, 149 sequences remained after downloading and filtering, and we randomly sample 20 of them; the results are presented in Table 8. For gammacoronavirus, 427 sequences remained after downloading and filtering, and we randomly sample 20 of them; the results are presented in Table 9.
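One row of such a table can be assembled as in the Python sketch below. The function name `rrr_row` and the three input distances are hypothetical and chosen only for illustration; the value 130.88 is the mean distance between random sequences reported earlier in this section. The sample standard deviation is again an assumption.

```python
from statistics import mean, stdev
from typing import Dict, List


def rrr_row(distances_to_random: List[float], mean_rand: float) -> Dict[str, float]:
    """Summarise one sequence's distances to the random sequences.

    RRR is the mean distance divided by the mean distance among
    the random sequences themselves ("Mean" over "Mean rand").
    """
    m = mean(distances_to_random)
    return {
        "Min": min(distances_to_random),
        "Max": max(distances_to_random),
        "Mean": m,
        "Sd": stdev(distances_to_random),  # sample standard deviation
        "Mean rand": mean_rand,
        "RRR": m / mean_rand,
    }


# Illustrative distances only; these are not values taken from the tables.
row = rrr_row([135.0, 136.0, 137.0], mean_rand=130.88)
print(round(row["RRR"], 4))
```

An RRR close to 1 means that the sequence is, on average, about as far from the random sequences as the random sequences are from one another.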

Conclusion
By observing all the results presented in the tables, we reach the following statements:
(i) The structural distances between random (nucleotide) sequences are highly concentrated, with a low standard deviation. This feature justifies their referential role under the structural metric.
(ii) The patterned nucleotide sequences have lower means and lower standard deviations in their distances to the random sequences.
(iii) The relative randomness ratios (RRR) for the Coronaviridae, which lie between 1.01 and 1.08, are much closer to the complete randomness ratio (i.e., 1) than the ones for the patterned nucleotide sequences, which lie around 0.84 in our examples.
(iv) Overall, the randomness of betacoronavirus is higher than that of alphacoronavirus or deltacoronavirus, which in turn are higher than the structural distances between SARS-CoV-2 and random sequences. This could explain why betacoronavirus shows more mutations than the other subfamilies.
(v) Within betacoronavirus, the RRR of SARS-CoV-2 is almost fixed at 1.04. This indicates that the mutations of SARS-CoV-2 are stabilized at this moment.
These findings provide some insight into the degree of structural randomness of SARS-CoV-2 and its related families. Linking this knowledge to other research results and findings would help us map out the dynamical structures and evolution of these viruses.

Conflicts of Interest
The author declares that there are no conflicts of interest regarding the publication of this paper.