Characteristics and Prediction of RNA Structure

RNA secondary structures with pseudoknots are often predicted by minimizing free energy, which is NP-hard. Most RNAs fold during transcription from DNA into RNA through a hierarchical pathway wherein secondary structures form prior to tertiary structures. Real RNA secondary structures often have local instead of global optimization because of kinetic reasons. The performance of RNA structure prediction may be improved by considering dynamic and hierarchical folding mechanisms. This study is a novel report on RNA folding that accords with the golden mean characteristic based on the statistical analysis of the real RNA secondary structures of all 480 sequences from RNA STRAND, which are validated by NMR or X-ray. The length ratios of domains in these sequences are approximately 0.382L, 0.5L, 0.618L, and L, where L is the sequence length. These points are just the important golden sections of sequence. With this characteristic, an algorithm is designed to predict RNA hierarchical structures and simulate RNA folding by dynamically folding RNA structures according to the above golden section points. The sensitivity and number of predicted pseudoknots of our algorithm are better than those of the Mfold, HotKnots, McQfold, ProbKnot, and Lhw-Zhu algorithms. Experimental results reflect the folding rules of RNA from a new angle that is close to natural folding.


Introduction
RNAs are versatile molecules. Messenger RNAs carry genetic information and act as the intermediary agent between DNAs and proteins; ribosomal RNAs, transfer RNAs, and other noncoding RNAs also have important structural, regulatory, and catalytic functions in cells. To completely understand the various functions of RNAs, we need to first understand their structures. The primary structure of RNA is the sequence of nucleotides (i.e., four bases A, C, G, and U) in the singlestranded polymer of RNA. However, these sequences are not simply long strands of nucleotides. In RNA, complementary bases of guanine and cytosine pair can form three hydrogen bonds, those of adenine and uracil pair can form two hydrogen bonds, and those of guanine and uracil pair can form two hydrogen bonds. RNA folds into a 3D structure through hydrogen bonding and base stacking, which are nonconsecutive in the sequence. Noncanonical pairing and base-to-backbone hydrogen bonding also stabilize folding. The 3D arrangement of atoms in a folded RNA molecule is the tertiary structure; the collection of base pairs in the tertiary structure is the secondary structure. Experimental determination of RNA tertiary structures is too expensive and time consuming to meet practical needs; thus, predicting RNA structure by computer becomes a basic method and issue in computational biology [1].
The secondary structures of RNA include the scaffold of the tertiary structures. Predicting RNA secondary structures is the first step in predicting RNA tertiary structures from RNA sequences. Computational approaches for predicting RNA secondary structures can be classified into three families: thermodynamic, comparative, and hybrid. Thermodynamic approaches use dynamic programming to compute the optimal secondary structure for a single RNA sequence with globally minimal free energy [2], based on a set of experimentally determined energy parameters [3]. Such 2 BioMed Research International methods have been successful for relatively short RNAs. Manually comparative approaches are more reliable than thermodynamic approaches when many homologous sequences are available. Manually comparative approaches have been used to establish the structures of known RNA families. These approaches compute a consensus structure on a set of aligned RNA sequences by searching for covariance evidence between each of the base pairs. Quantitative measures of covariance have been implemented in 2 statistics and mutual information. Akmaev et al. [4] also extended these approaches to explicitly consider sequence phylogeny and showed positive results. Hybrid approaches, which have recently emerged, combine the advantages of thermodynamic and comparative approaches [5]. Hybrid approaches consider both thermodynamic stability and sequence covariance and produce positive results on as few as three homologous sequences. Other methods cannot be classified into any of these three families. A few of these methods attempt to simultaneously align and fold homologous sequences [6]. Eddy and Durbin [7] introduced stochastic context-free grammars to align homologous sequences iteratively and found a consensus structure for them.
A pseudoknot motif is a prevalent RNA structure. Pseudoknots serve various functions in biology [1]. Plausible pseudoknotted structures have been proposed and confirmed for the 3 -end of several plant viral RNAs, where pseudoknots are apparently used to mimic tRNA structures. Pseudoknots have been recently confirmed in some RNAs of humans and other species [6].
Current studies on RNA secondary structure prediction have not considered pseudoknots. Optimizing secondary structures, including arbitrary pseudoknots, is NP-hard [8].
Most RNA folding methods that can fold pseudoknots adopt heuristic search procedures and sacrifice optimality. Examples of these methods include quasi-Monte Carlo searches and genetic algorithms. These methods cannot guarantee the most optimal structure and cannot determine the accuracy of a given prediction toward optimality [9][10][11][12][13].
A different approach to pseudoknot prediction adopts dynamic programming to predict the tractable subclass of pseudoknots based on complex thermodynamic models in O( 4 )-O( 6 ) time [14][15][16], making them impractical even for sequences of a few hundred bases long.
Comparative approaches can also be applied to predict pseudoknots and are more reliable than thermodynamic approaches. For example, comparative analysis has revealed the existence of pseudoknots in several RNAs [17]. However, comparative analysis has typically been conducted in an ad hoc manner from an algorithmic point of view.
RNAs fold during transcription from DNA into RNA. Current RNA structure prediction by calculating the global optimal structure does not reflect the dynamic folding mechanism of RNA [18].
Although DP can accurately predict a minimum energy structure within a given thermodynamic model, the native fold is often in a suboptimal energy state that significantly varies from the predicted one [19]. A case may be made that the natural folding process of RNA and the simulated folding of RNA using an evolutionary algorithm, which includes intermediate folds, have much in common [20,21].
The current study provides a novel report that RNA folding accords with the golden mean characteristic based on the statistical analysis of real RNA secondary structures. The golden mean is also called the golden section or golden ratio. Adolf Zeising found the golden ratio expressed in the arrangement of branches along the stems of plants and of veins in leaves [22]. He extended his research to the skeletons of animals and the branches of their veins and nerves, to the proportions of chemical compounds and geometry of crystals, and even to the use of proportion in artistic endeavors. In these phenomena, he found the golden ratio operating as a universal law. In 2010, the journal Science reported that the golden ratio is present at the atomic scale in the magnetic resonance of spins in cobalt niobate crystals [23]. Several researchers have proposed connections between the golden ratio and human genome DNA [24,25].
Applying this characteristic, we design a golden mean (GM) algorithm by dynamically folding RNA secondary structures according to the golden section points and by forming pseudoknots subsequently folded between nucleotides that did not pair in previous steps.
We implement the method using thermodynamic data and test the performance on PKNOTS and TT2NE data set. For PKNOTS data set, the sensitivity and PPV of the Lhw-Zhu (LZ) and PKNOTS algorithms are increased by 2% to 3% via the preprocessing of the GM method. For TT2NE data set, the GM method indicates good performance in predicting secondary and pseudoknotted structures. The experimental results reflect the folding rules of RNA from a new angle that is close to natural folding.
For first approximation, the secondary structure is modeled as follows. If and are complementary bases (A&U, C&G, U&G), then and may constitute a base pair ( , ). Each base can occur in one base pair, or the set of base pairs can form a matching. The secondary structures are also noncrossing.
Concretely, a secondary structure on is a set of base pairs = {( , )}, where , ∈ {1, 2, . . . , }, that satisfies the following conditions. is a matching: no base appears in more than one pair.
The concept of the domain was first proposed in 1973 by Wetlaufer after X-ray crystallographic studies on hen lysozyme [26] and papain [27] and after limited proteolysis studies on immunoglobulins [28,29]. Wetlaufer defined domains as stable units of protein structure that can fold autonomously. Domains have been previously described as units of compact structure [30], function and evolution [31], and folding [32].
Each domain forms a compact 3D structure and is often independently stable and folded. Most domains have less than 200 residues with an average of approximately 100 residues [33,34].
A domain is closed by a helix or pseudoknot ( Figure 3). A subdomain is an independently stable part of a domain. If the closed helix or pseudoknot of a domain is deleted, its subdomain will become a domain.
By convention, single strands of DNA and RNA sequences are written in 5 -to-3 direction. RNAs fold during transcription from DNA into RNA. The subsequence , begins to transcribe from the 5 -end, that is, , and terminates transcription at the 3 -end, that is, (Figure 3). The helix ( , + : − , ) is completely folded after the transcription of . We can determine that the 3 -end of the helix ( , + : − , ) and the domain ( , ) is the 3 -end of subsequence , . Let be the sequence length. The length ratio of the 3 -end of the helix ( , + : − , ) and the domain ( , ) to the sequence 1, is the ratio of to . This study determines the characteristic of the 3 -end and the length of domain in the sequences.

Characteristic of Golden Section.
We compare the structures of the test set of all 480 sequences (nonfragment and nonredundant) from RNA STRAND with secondary and pseudoknotted structures, which are validated by NMR or Xray.
The results of statistical analysis on these real secondary structures are shown in Figures 1 and 2 and Tables 1, 2, 3, and 4. In Tables 1 and 2, Num represents the number of domains, group is the 3 -end of group, Ratio 1 is the ratio of Group 1 to Group 2, and Ratio 2 is the ratio of Group 2 to Group 3. In     For long 5S rRNA, 23S rRNA, GI Intron, rRNA, and tRNA in RNA STRAND, the statistical results are shown in Table 2. The number of domains varies from 1 to 6. The length ratio of the first subdomain to the first domain for 5S rRNA is 0.58 and that of GI Intron is close to 0.37. The length ratio of the second domain to the sequence for tRNA and other rRNA is close to 0.38. The length ratio of the third domain to the sequence for tRNA is close to 0. For short RNAs, sequences with lengths less than 50 have generally only one domain and one or two simple helices. For sequences with lengths between 50 and 90, except for synthetic RNA, the folding 3 -end of domains and subdomains is centered on three regions (Figure 1).
The sequences have one to three domains and that with only one domain has two or three subdomains. The sequences fold on three regions. The first folds in the region of 0.35L to 0.38L, the second in the region of 0.6L to 0.618L, and the third in the region of 0.9L to 1.0L. Among 33 sequences, 25, 28, and 49 have helix 3 -ends located in points 0.382L, 0.618L, and L, respectively.
For tRNAs with lengths more than 50, the folding 3end of domains and subdomains is centered on four regions ( Figure 2). The sequences have one to three domains and that with only one domain has two or three subdomains. Among 35 sequences, 15, 14, 14, and 33 have domain or subdomain 3ends located at points 0.382L, 0.5L, 0.618L, and L, respectively.
The data corresponding to Figure 1 are shown in Table 3.
The data corresponding to Figure 2 are shown in Table 4.
The sequences tend to fold twice or thrice. For the sequences that fold twice, the first folds in the region of 0.5L and the second in the region of . For the sequences that fold thrice, the first folds in the region of 0.35L to 0.38L, the second  in the region of 0.6L to 0.618L, and the third in the region of L.
In mathematics and the arts, two quantities belong to the golden ratio if their ratio is the same as the ratio of their sum to their maximum. Expressed algebraically, for quantities and with > , / = ( + )/ = phi = 1.618, and the quantities are 0.382 ( + ) and 0.618 ( + ). These quantities increase several unique ratios, including 0.618, 0.382, and 1.618, that is, the golden ratio. These ratios exist throughout nature, from population growth to the physical structure within the human brain, the DNA helix, many plants, and even the cosmos itself. The golden ratio is also called the golden section or golden mean.
The golden mean and the numbers of the Fibonacci series (0, 1, 1, 2, 3, 5, 8,. . .) have been used with significant success in analyzing and predicting stock market motion. Elliott presented wave theory, in which the frequent wave relationships are golden ratios (19.1%, 23.6%, 38.2%, 50%, 61.8%, and 80.9%). It has a striking similarity to RNA folding. We can determine that 0.382, 0.5, and 0.618 are the only important RNA folding points (Tables 1 to 4). The above results of statistical analysis also confirm this view. Almost all sequences are folded at L, and approximately half of the sequences are folded at points 0.382L, 0.5L, and 0.618L. For long sequences, the other two key fold points 0.236L and 0.809L may be found, which are 0.382L × 0.618 and 0.5L × 1.618.
Therefore, these points would be closer to natural folding and obtain higher accuracy than before if RNA is dynamically folded according to the above golden points.

Dynamic Algorithm.
RNAs fold through a hierarchical pathway, in which the helices and loops are rapidly formed as secondary structures and the subsequent slow folding of the 3D tertiary structures would consolidate the secondary structures [20,21]. RNAs also fold during transcription from DNA into RNA. Therefore, we first compute the secondary structure and then predict pseudoknotted structures. We fold RNA secondary structures as DNA is transcribed into RNA. The length of RNA sequences is gradually increased according to the above golden points, and only reliable helices are accepted.
The sketch of the GM algorithm is as in Algorithm 1. The method uses a dynamic programming strategy to compute the secondary structure values of ( , ) for all and in the start subsequence. The subsequence is then iteratively lengthened to the next golden point to compute the secondary structure values of ( , ) for all and in the new subsequence. Finally, the pseudoknots are computed by the crossing of two subsequences, and the values of ( , ) are provided for all and in the entire sequence. In this study, we use our previous LZ algorithm to compute pseudoknotted structures [16].
For the sequence of TMVup, the startLen is 0.618L, that is, 52. We fold the subsequence 1,52 and select the helix H (11, 15: 21, 25) with the maximum value; assign as another golden point 1.618 × 52, that is, L, and select the helix H (37, 42: 49, 54). The LZ algorithm computes the residual subsequence and forms the last structure. The ratio of the latter golden point to the former one is 1.618, and the length of the start subsequence should be between 40 and 70.
The time complexity of steps 1.1 to 1.5 is O( 3 ), and the number of iterative computations is less than 10 for sequences with lengths less than 1000. Therefore, the first step takes O( 3 ) time. In this study, the time complexity of step 2 is less than O( 5 ), which is similar to that of the LZ algorithm. Therefore, the time complexity of the GM algorithm is maintained as O( 5 ).

Results and Discussion
To illustrate the effect of our algorithm, tests are divided into two parts: one for pseudoknotted sequences and another for mixed data of pseudoknot-free and pseudoknotted sequences. We select two data sets. One is TT2NE data set to test pseudoknotted structures. This data set contains 47 pseudoknotted sequences from PseudoBase and PDB. Another is PKNOTS data set to test secondary and pseudoknotted structures. This data set includes 116 sequences, including 25 tRNA sequences randomly selected from the Sprinzl tRNA database, HIV-1-RT-ligand RNA pseudoknots, and some viral RNAs.
The accuracy of an algorithm is measured by both sensitivity and PPV. Let RP (real pair) be the number of base pairs in the real RNA structure, TP (true positive) the number of correctly predicted base pairs, and FP (false positive) the number of wrongly predicted base pairs. Burset and Guigo [35] defined SE (sensitivity) as TP/RP and PPV (positive predictive value) as TP/(TP + FP) in 1996.

Results of PKONTS Data Set.
In this section, we test the PKNOTS data set, which has become the benchmark dataset to predict RNA structures and present the prediction results of our method compared with the PKNOTS algorithm [14] and our previous LZ algorithm [16].
To explore the effect of the GM method in different models, we test two models and compare the difference before and after GM processing. First, we run PKONTS and LZ algorithm on the PKNOTS data set and obtain the output of the results. We then run the first step of the GM method to form the frame of secondary structures by dynamically folding sequences at the golden points and select one stable helix with the minimum energy at each fold. Subsequently, we obtain the partially folded sequences as the input of PKONTS and LZ algorithm and run them with the same energy model and parameters as above. We fold all sequences of the test set and obtain the results. The test results are shown in Table 5.
For each algorithm, the percentages of sensitivity and PPV are shown in Table 5 The improved PKNOTS algorithm is compared with the PKNOTS algorithm in both sensitivity and PPV ( Table 5). The improved PKNOTS algorithm increases the sensitivity from 82.8% to 85.5% and improves the PPV from 78.9% to 80.8%.
The improved LZ algorithm is compared with the LZ algorithm in both sensitivity and PPV ( Table 5). The improved LZ algorithm increases the sensitivity from 84.8% to 87.7% and improves the PPV from 80.7% to 82.8%.
The improved LZ and PKNOTS algorithms both have 2% to 3% higher sensitivity than the LZ and PKNOTS algorithms, respectively. The improved LZ algorithm outperforms the PKNOTS algorithm by 4.9%. The tests of improved PKNOTS and improved LZ indicate that the GM method may also be applied to other algorithms of RNA structure prediction to improve the prediction sensitivity and reduce the predicted redundant base pairs. Both of the improved LZ and PKNOTS algorithms increase the accuracy of many sequences (e.g., Bioton, DF0660, DG7740, DI1140, DP1780, DV3200, and DY4840), from which we can determine the influence of the golden mean characteristic.
In the PKNOTS algorithm, sequence TMVup is folded to form four helices, three of which are equal to those in GM, but one helix (29, 33: 56, 59) is redundant (Figure 3(f)). This sequence also missed all three pseudoknots. Only the form of the helix (29, 33: 56, 59) prevents the form of pseudoknots.
Further statistical analysis shows that the selected helices by the first step of GM basically belong to the real structures and control the folding pathway. This finding is precisely ascribed to the GM processing before PKONTS and LZ improve the accuracy.

Results of TT2NE Data Set.
In this section, we test the TT2NE data set, which includes 47 sequences from PseudoBase and PDB, which is also a subset of those used in the HotKnots.
We present the prediction results of our method compared with those of HotKnots [9], McQfold [10], ProbKnot [11], TT2NE [13], Mfold [36], and LZ [16]. MFold is restricted to secondary structures that are free of pseudoknots, whereas others can result in any topology of pseudoknot. 74 This set includes most of the sequences where HotKnots has been tested and shown to perform better than ILM and PKnots-rg. Thus, we will not compare GM to these latter algorithms.
The sensitivity of GM is better than that of the other algorithms, except for TT2NE, either on the average 1 or on the average 2 ( Table 6). The PPV of GM is better than that of the other algorithms, except for TT2NE and McQfold, either on the average 1 or on the average 2.
However, TT2NE has shown the most feasible value after several times folding with different parameters. Thus, TT2NE is not the autocomputing value. When given new sequences, we do not know which result should be selected.
The part of computing pseudoknots in GM is the same as that in the LZ algorithm, and the difference between them lies in the preprocessing of secondary structures. The performance of GM is better than that of the LZ algorithm, either on the average 1 or on the average 2. GM improves the sensitivity of sequences Bs glmS, EC S15, and HDV antigenomic from 36, 59, and 44 to 51, 100, and 84, respectively. GM also improves the PPV of these sequences from 57, 63, and 34 to 67, 74, and 64, respectively. However, for sequences AMV3 and BVDV, the performance of GM is only below LZ.
The number of predicted genera by GM is better than that by other algorithms, except for TT2NE (Table 6). All Mfolds predictions have genus 0 because Mfold generates only structures without pseudoknots.
However, TT2NE predicted more redundancy genera than other algorithms. For example, GLV IRES, R2 retro PK, and 1y0q sequences have only one native pseudoknot, but two pseudoknots are predicted by TT2NE. Bs glmS has two native pseudoknots, but three are predicted by TT2NE.
The percentages of sensitivity and PPV of each algorithm are shown. The column Gen indicates the predicted number of genera, and GR is the redundancy number in the predicted genus, which is more than the number of native genera. Average 1 is the total number of base pairs that are correctly predicted in the entire database divided by the corresponding total number of native base pairs (average sensitivity) and the total number of predicted base pairs (average PPV). Average 2 is the usual average of sensitivity and PPV values of all