Triplet Analysis That Identifies Unpaired Regions of Functional RNAs

We developed a novel method for analyzing RNA sequences, termed triplet analysis, and applied it to an in vitro RNA selection experiment in which HIV-1 Tat was the target. Aptamers are nucleic acids that bind a desired target (bait). To date, many aptamers have been identified by in vitro selection from sufficiently concentrated libraries, in which many RNAs share an obvious consensus primary sequence after sufficient cycles of selection. The higher-order structural features of these aptamers that are indispensable for interaction with the bait must therefore be determined by additional investigation. In contrast, our triplet analysis enabled us to extract important information on functional primary and secondary structure from minimally concentrated RNA libraries. Using our method, an important unpaired region similar to the bulge of TAR was readily predicted from a partially concentrated library in which conventional sequence analysis revealed no consensus sequence. Moreover, our analysis method may be used to assess a variety of structural motifs with desired functions.


Introduction
In vitro selection of nucleic acids is a powerful method used to isolate novel functional molecules [1][2][3][4][5][6]. Generally, in vitro selection involves many rounds of selection followed by sequencing and analysis of several recovered clones and determination of a consensus sequence. These consensus sequences are often highly functional molecules that have promise as pharmaceutical and chemical agents [7][8][9]. However, several inherent biases often affect in vitro selection procedures, for example, the abundance of particular motifs in the library and the differential efficiency of reverse transcription or PCR amplification caused by stable secondary structures. Therefore, confirming that a particular consensus sequence is optimal is difficult and usually requires identification of a naturally occurring, functional counterpart. For the selection of aptamers, which interact with desired targets, a consensus motif may be a sequence in a local minimum rather than an absolute minimum in an energy diagram that represents interactions between all possible aptamers and the bait or target. Furthermore, it is hard to identify general rules of intermolecular interactions from such "highly evolved" motifs alone. To overcome this problem, the sequence changes in the library during the early stage of the selection should be analyzed. Evaluation of the sequences that are gradually concentrated from the starting library should reveal a general rule for interactions, that is, information that is indispensable for designing new artificial functional molecules.
Here, we developed a method, designated triplet analysis, to identify important sequence characteristics in the primary sequences recovered from an RNA library after only several rounds of selection with a target molecule. The first step of triplet analysis is extraction of all nucleotide triplets from the primary sequence (Figure 1). For example, the sequence AGCGUCCA would be separated into six triplets: AGC, GCG, CGU, GUC, UCC, and CCA. The next step is reconstruction of the parent sequence of these triplets. If these six triplets were extracted, one could easily reconstruct the parent sequence, AGCGUCCA. Thus, a frequent successive sequence should be revealed from an analysis of frequent triplets. We chose the HIV-1 Tat protein as the bait for our in vitro RNA selection because a naturally occurring binding RNA motif, TAR, and artificially selected binding motifs have been identified [10,11]. Therefore, we could compare our findings with these previously identified binding motifs to judge the validity of our analysis method.
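The two steps described above can be illustrated with a minimal Python sketch. The function names `extract_triplets` and `reconstruct` are hypothetical, and the reconstruction assumes the triplets are supplied in their original order with a two-nucleotide overlap between neighbors; it is an illustration of the idea, not the procedure used in this study.

```python
def extract_triplets(seq):
    """Return all overlapping nucleotide triplets of seq, in order."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def reconstruct(ordered_triplets):
    """Rebuild a sequence from ordered triplets that overlap by two bases.

    Assumption: the triplets are given in their original order.
    """
    if not ordered_triplets:
        return ""
    seq = ordered_triplets[0]
    for triplet in ordered_triplets[1:]:
        # Each new triplet must start with the last two bases seen so far.
        if seq[-2:] != triplet[:2]:
            raise ValueError(f"{triplet} does not overlap with {seq[-2:]}")
        seq += triplet[2]
    return seq


triplets = extract_triplets("AGCGUCCA")
print(triplets)               # ['AGC', 'GCG', 'CGU', 'GUC', 'UCC', 'CCA']
print(reconstruct(triplets))  # AGCGUCCA
```

A sequence of length n yields n - 2 triplets, so the eight-nucleotide example gives six.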

In Vitro Selection of RNA Aptamers.
A library composed of 78-nucleotide RNAs, each containing a region of 30 randomized nucleotides, was prepared by in vitro transcription using chemically synthesized 97-bp DNAs, each carrying a T7 promoter sequence, as templates. Approximately 1 nmol of RNA was generated from every 100 pmol of template DNA using the AmpliScribe T7 transcription kit (EPICENTRE). The transcripts were purified from 8% denaturing polyacrylamide gels containing 7 M urea by crush and soak extraction and subsequent ethanol precipitation. The bait, Tat peptide (RKKRRGRRR), was synthesized using an Fmoc strategy on a solid support of Fmoc-NH-SAL resin (N-α-9-fluorenylmethoxycarbonyl-super acid labile polystyrene resin, Watanabe Chem.); this synthesis method produces an amide at the C-terminus of the bait peptide, as described previously.
The purified RNA was incubated in a binding buffer containing 2.5 mM Tris·HCl (pH 7.6), 100 mM NaCl, and 2.0 mM MgCl2 and then passed through a nitrocellulose filter (HAWP, MILLIPORE) three times to remove RNAs with filter-binding properties. A sample of the RNA library (100 pmol) was incubated with 5 pmol of Tat peptide in 50 μL of the binding buffer for 60 min at 25 °C, and then the mixture was passed through another nitrocellulose filter. After washing with 1.5 mL of the binding buffer, RNAs bound to Tat on the filter were collected and amplified by subsequent reverse transcription, PCR, and transcription (RNA PCR kit, PE Applied Biosystems).

Dot Blot Analysis.
To analyze the binding ability of the RNA library, RNAs were labeled internally with [α-32P]UTP (Amersham) during transcription in vitro. Labeled RNA (10 pmol) was mixed with 2.5 pmol of Tat peptide in 50 μL of the binding buffer. After a 60 min incubation, the mixture was passed through a nitrocellulose filter. The filter was washed with the binding buffer, and radioactive RNAs on the filter were visualized and quantified with the BIO-Imaging Analyzer, BAS 2000 (Fuji Film).

Cloning and DNA Sequencing.
After six rounds of selection and amplification, the ends of the double-stranded PCR products were blunted by T4 DNA polymerase (TAKARA). The blunted DNAs were purified and concentrated by ethanol precipitation and phosphorylated at their 5′ end with T4 polynucleotide kinase (TOYOBO). The resulting DNAs were ligated into the Hinc II site of separate pUC118 plasmids using T4 DNA ligase (TAKARA). The ligated plasmid DNAs were transformed into E. coli MV1184, which were then cultured on LB plates containing ampicillin, IPTG (isopropyl-β-D-thiogalactoside), and X-gal (5-bromo-4-chloro-3-indolyl-β-D-galactoside). After blue-white selection, 20 white colonies were randomly selected and cultured in LB medium containing ampicillin. The recombinant plasmid DNA was recovered from E. coli clones using the alkaline lysis procedure. Nucleotide sequences were determined using the BigDye-termination method and an ABI PRISM 310 Genetic Analyzer (PE). Of the 20 clones analyzed, 17 contained a library sequence as the insert.

Simple Triplet Analysis to Identify Important Primary Structure.
From the library shown in Figure 2(a), RNAs that bound Tat were concentrated as described in the previous section. The binding ability of the library RNAs in each round was analyzed by dot blot analysis, and after six rounds of selection, accumulation of Tat-binding RNAs on the filter increased as shown in Figure 2. The RNAs recovered after six rounds of selection were cloned using the E. coli MV1184/pUC118 system and sequenced. The primary sequences of 17 clones are shown in Figure 2(c). There were no regions of obvious primary sequence similarity shared among these sequences, and a conventional sequence alignment method [12] did not reveal any consensus sequence (data not shown). All triplets in the primary sequences of the 30-nucleotide randomized regions were extracted. Of all 64 possible triplets, 23 had frequencies (the number of occurrences of a triplet divided by the number of clones) above 0.5 (Table 1). In other words, each of these 23 triplets occurred, on average, more than once in every two clones. Notably, eight triplets, UUG, UGC, UGG, CGC, AGU, GCC, GGC, and GGG, had frequencies over 0.75 (bold values in Table 1). The successive sequences of five or more nucleotides that could be assembled from these triplets were UUGCC, UUGGC, and UUGGG. However, these pentanucleotide sequences were found in only one or two of the 17 clones, even though the eight triplets are potentially important components of a Tat-binding RNA. This finding indicated that a longer consensus sequence had not been concentrated within six rounds of selection and that the process was in a relatively early stage of selection.
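The frequency calculation used here (occurrences of a triplet divided by the number of clones) can be sketched in Python as follows. The function name `triplet_frequencies` and the three toy sequences are hypothetical and are not the clones in Figure 2(c). As a point of reference, and assuming all four nucleotides are equally likely, a 30-nucleotide random region contains 28 overlapping triplets, so each of the 64 triplets would be expected about 28/64 ≈ 0.44 times per clone.

```python
from collections import Counter


def triplet_frequencies(clones):
    """Map each triplet to (total occurrences in all clones) / (number of clones)."""
    counts = Counter()
    for seq in clones:
        counts.update(seq[i:i + 3] for i in range(len(seq) - 2))
    n_clones = len(clones)
    return {triplet: count / n_clones for triplet, count in counts.items()}


# Hypothetical toy sequences, for illustration only.
clones = ["AUUGGC", "CUUGGA", "GGCAUU"]
freqs = triplet_frequencies(clones)

# Triplets above the 0.5 threshold used in the text.
frequent = sorted(t for t, f in freqs.items() if f > 0.5)
print(frequent)  # ['AUU', 'GGC', 'UGG', 'UUG']

# Expected per-clone frequency under a uniform random model (assumption).
expected_random = (30 - 2) / 4 ** 3
print(expected_random)  # 0.4375
```

Because a triplet can occur more than once in a single clone, a frequency defined this way can exceed 1.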

Triplet Subtraction Analysis to Identify the Important Secondary Structure.
Further analysis, called triplet subtraction, was carried out as shown in Figure 3 to obtain detailed information, including RNA secondary structural features. In this analysis, the surplus of a triplet is defined as the number of occurrences of that triplet minus the number of occurrences of its complementary triplet. For example, the numbers of GCU, CGC, and ACG were subtracted from the numbers of AGC, GCG, and CGU, respectively, and the results were listed. If a consensus sequence is in a fully matched stem region, no triplet remains after the subtraction. On the other hand, if a consensus region is unpaired (i.e., in a bulge or loop), only the triplets that involve the unpaired region are revealed after the subtraction. For example, six triplets (GCG, CGU, GUC, UCC, GGG, and GGC) should be revealed from a paired region with a GU bulge, AGCGUCCA/UGGGCU, as shown in Figure 3. Thus, one can easily predict a frequent sequence around an unpaired region that might contribute to the function of an RNA.
Figure 3: The scheme for the triplet subtraction from a library. Only the triplets from unpaired regions could be determined.
Figure 4: The frequency of the triplets before (a) and after (b) subtraction. The changes in gray scale (from light to dark) indicate changes in frequency (from 0-25%, 26-50%, 51-75%, to 76-100%).
The results of the triplet subtraction analysis are summarized in Table 2 and Figure 4. As listed in Table 2, only two triplets, UUG and UGG, had a frequency over 0.5 after the subtraction. This value of 0.5 was surprisingly high because all triplet frequencies should be zero for a truly random library. Though UUU and GGG were the most frequent of all NUU and GGN triplets (where N indicates U, C, A, or G), respectively, the frequencies of these triplets were lower than those of UUG or UGG. Therefore, the most probable consensus sequence predicted from these triplets is UUGG. As listed in Table 2, many NNU and GNN triplets might exist in unpaired regions, and the pattern of these triplets was widely spread, whereas UUG and UGG stood out among the NUN and NGN sequences, respectively. These findings indicated that the second U and/or the third G of UUGG would be unpaired bases. The frequency of GGN triplets was moderate compared with that of UUG triplets, and, therefore, it was not clear whether the third G was unpaired. Nevertheless, this G might be an unpaired base because CCN triplets did not seem to be associated with an unpaired region. Additionally, neither AAN nor ANN triplets seemed to occur in unpaired regions of secondary structure. Similar features were not found for NNC. These findings strongly indicated that an unpaired region closed by a UA base pair at its 5′ end would not occur in motifs that bound to Tat, that is, the first UU of UUGG should be unpaired bases. Ultimately, we could determine that the sequence NUUGG/CN with an unpaired UUG was an important structural motif that should have a significant role in the RNA-Tat interaction (Figure 5(a), left).
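The triplet subtraction described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the complementary triplet is taken to be the Watson-Crick reverse complement (consistent with the AGC/GCU, GCG/CGC, and CGU/ACG pairs given above, so GU wobble pairs are ignored), counts are pooled over all input sequences, and only positive surpluses are reported. The function names are hypothetical.

```python
from collections import Counter

# Watson-Crick complements only (assumption: GU wobble pairs are ignored).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(triplet):
    """Return the triplet that would pair with `triplet` in a stem."""
    return triplet.translate(COMPLEMENT)[::-1]


def count_triplets(sequences):
    """Count all overlapping triplets pooled over the given sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + 3] for i in range(len(seq) - 2))
    return counts


def triplet_surplus(sequences):
    """Return triplets whose count exceeds that of their reverse complement."""
    counts = count_triplets(sequences)
    surplus = {}
    for triplet, count in counts.items():
        # A Counter returns 0 for a missing key without inserting it.
        diff = count - counts[reverse_complement(triplet)]
        if diff > 0:  # assumption: negative values are discarded
            surplus[triplet] = diff
    return surplus


# The two strands of the GU-bulge example, AGCGUCCA/UGGGCU.
strands = ["AGCGUCCA", "UGGGCU"]
print(sorted(triplet_surplus(strands)))
# ['CGU', 'GCG', 'GGC', 'GGG', 'GUC', 'UCC']
```

In this example the fully paired triplets AGC/GCU and CCA/UGG cancel, and the six remaining triplets are those that overlap the GU bulge. Dividing each surplus by the number of clones would give a post-subtraction frequency comparable to the values discussed for Table 2.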
Possibly, a small portion of Tat-binding RNAs may contain an NUUGG/CCN sequence in which the UU doublet is unpaired, as shown in Figure 5(a), right. This conclusion did not conflict with the results from a prediction of the secondary structure of RNA based on our thermodynamic parameters (data not shown).

Validation of the Triplet Analysis.
TAR, the naturally occurring Tat-binding RNA motif, has a pyrimidine bulge, UCU, UUU, or UU, closed by a GC base pair at its 3′ end as shown in Figure 5(b) [14][15][16][17]. The arginine side chains in Tat stack with the first U of the bulge and make hydrogen bonds to the GC base pair next to the 3′ end of the bulge. These interactions, along with electrostatic interactions with the phosphate backbone, are indispensable for the specific Tat-TAR interaction. Therefore, the essential structural motif of TAR that mediates binding with Tat is NNU(Y)UG(A)/(U)CNN with a U(Y)U bulge, where Y indicates a pyrimidine nucleotide [11,18]. Similar structural features were seen in RNA aptamers selected with arginine as a target. A general motif of these arginine aptamers was GUNGA/UCC with a UN bulge as shown in Figure 5(c) [13,19]. Our UUGG motif closely resembled these indispensable structural features of TAR and arginine aptamers; therefore, we were able to confirm the validity of our triplet analysis method. The UUGG motif was revealed at an early stage of Tat-binding RNA selection; therefore, it might be the ancestor of other Tat-binding RNAs and have fundamental properties as a binding motif.
In summary, we developed the triplet analysis method to analyze a library. From our analysis of a small number of clones that were recovered from a library after only a few rounds of selection, we could determine an important sequence motif even though no consensus sequence was revealed by conventional analysis. Furthermore, unpaired regions in the selected RNAs, which may play important roles in many functional nucleic acids, could be readily predicted using this method. The resulting UUGG motif, NUUGG/CN with an unpaired UUG, resembled the essential features of Tat-binding RNAs. This motif may be a fundamental sequence of Tat- and arginine-binding motifs, and researchers may be able to construct a Tat-binding RNA library based on this information to obtain "more highly evolved" motifs.