RNA-Mediated Gene Duplication and Retroposons: Retrogenes, LINEs, SINEs, and Sequence Specificity

A substantial number of “retrogenes” that are derived from the mRNA of various intron-containing genes have been reported. A class of mammalian retroposons, long interspersed element-1 (LINE1, L1), has been shown to be involved in the reverse transcription of retrogenes (or processed pseudogenes) and non-autonomous short interspersed elements (SINEs). The 3′-end sequences of various SINEs originated from a corresponding LINE. As the 3′-untranslated regions of several LINEs are essential for retroposition, these LINEs presumably require “stringent” recognition of the 3′-end sequence of the RNA template. However, the 3′-ends of mammalian L1s do not exhibit any similarity to SINEs, except for the presence of 3′-poly(A) repeats. Since the 3′-poly(A) repeats of L1 and Alu SINE are critical for their retroposition, L1 probably recognizes the poly(A) repeats, thereby mobilizing not only Alu SINE but also cytosolic mRNA. Many flowering plants harbor only L1-clade LINEs, together with a significant number of SINEs that have poly(A) repeats but no homology to the LINEs. Moreover, processed pseudogenes have also been found in flowering plants. I propose that the ancestral L1-clade LINE in the common ancestor of green plants may have recognized a specific RNA template, with stringent recognition then becoming relaxed during the course of plant evolution.


Retrogenes and Processed Pseudogenes.
Gene duplication is a fundamental process of gene evolution [1]. There are two types of gene duplication: direct duplication of genomic DNA and retropositional events [2][3][4]. Processed pseudogenes (PPs) are reverse-transcribed intronless cDNA copies of mRNA that have been reinserted into the genome (Figure 1) [5,6]; they are especially abundant in mammalian genomes [7,8]. PPs are not usually transcribed because they lack an external promoter; therefore, they have long been viewed as evolutionary dead ends with little biological relevance. However, recent studies have unveiled a substantial number of "processed genes" or "retrogenes" with novel functions that are derived from the mRNA of various intron-containing genes [9][10][11][12]. Molecular biological studies showed that a class of mammalian retroposons, long interspersed element-1 (LINE1, L1), is involved in the reverse transcription of nonautonomous retroposons, such as PPs (retrogenes) and short interspersed elements (SINEs) [13].

Retroposons.
Eukaryotic genomes generally contain an extraordinary number of retroposons such as long terminal repeat (LTR) retrotransposons, LINEs or non-LTR retrotransposons, and SINEs [6,14,15]. LINEs have been characterized as autonomous retroposons bearing either one or two open reading frames (ORFs); all LINEs encode a reverse transcriptase (RT), and some, but not all, encode an apurinic/apyrimidinic endonuclease, a ribonuclease H, and/or putative nucleic-acid-binding motifs (Figure 2). Most members of a LINE family are truncated at various positions in their 5′ regions, constituting defective members of the family, the lengths of which range from 100 to 1,000 bp [13]. The Bombyx R2 LINE protein, which has sequence-specific endonucleolytic and RT activity, makes a specific nick in one of the DNA strands at the insertion site and uses the 3′ hydroxyl group that is exposed by this nick to prime the reverse transcription of its RNA transcript [16]. This mechanism is referred to as target DNA-primed reverse transcription (TPRT). The last 250 nucleotides that correspond to the 3′-untranslated region (UTR) of the R2 transcript are critical for this reaction [17]. Other LINEs, such as L1, are also believed to retrotranspose by TPRT [18]. The human L1 TPRT machinery has been reconstructed in vitro [19]. SINEs are non-autonomous retroposons, the 5′-end sequences of which are derived from tRNA, 5S rRNA, or 7SL RNA with promoter activity for RNA polymerase III (Figure 2) [20][21][22]. On the other hand, the 3′-end sequences of SINEs generally originated from a corresponding LINE [23]. A small nucleolar RNA-derived short retroposon, which lacks internal promoters for RNA polymerase III and has therefore not been subject to multiple rounds of retroposition, was recently discovered in the platypus [24].
Figure 2: Schematic representation of a SINE and a LINE that have the same 3′-end sequence. Three-dimensional protein structures are taken from the L1-encoded ORF1 protein [94] and the reverse transcriptase of human immunodeficiency virus type 1 [95].

Evolutionary Relationships of Various LINEs.
Eickbush's group conducted a comprehensive phylogenetic analysis of LINEs using an extended sequence alignment of their RT domains [25]. All identified LINEs were grouped into 11 distinct clades. Assuming vertical descent, the phylogeny suggests that LINEs are as old as eukaryotes, with each of the 11 clades dating back approximately 2 billion years [25]. Currently, almost 30 clades have been recognized [26]. Mammalian L1s belong to the L1 clade, which includes numerous LINEs from vertebrates, slime mold, plants, and algae [25,27,28]. Analyses of L1-encoded endonucleases from zebrafish and mammals revealed that they are divided into 3 groups: M, F, and Tx1 [29]. Kordiš et al. showed that the genomes of deuterostomes possess three highly divergent groups of L1-clade LINEs, which are distinct from the Tx group [28]. The Tx group, with a target-specific insertion, consists of 2 branches, one of which includes frog Tx1 [30].

SINEs and LINEs.
The 3′-end sequences of various SINEs originated from a corresponding LINE (Figure 2) [31]; for reviews, see also [23,32,33]. A systematic database and literature survey identified 58 SINEs, each possessing a common 3′-end sequence with its partner LINE (Table 1) [34]. For example, Figure 3 shows the alignment of tobacco TS SINE [35] with its partner LINE. This LINE, which was recently identified in the genome of potato, a member of the same family as tobacco, belongs to the RTE clade. The 3′-end sequence of the SINE, approximately 100 bases, is nearly identical to that of the LINE, and they both end in TTG repeats [34]. SINE/LINE pairs have been observed in a wide variety of species, from eumetazoans to green plants, confirming the generality of this phenomenon (Table 1). Although various LINEs appear in the list, those from clades CR1 and RTE were particularly predominant.
Since the R2 LINE protein specifically recognizes the sequence near the 3′-end of the RNA transcript for the initiation of first-strand synthesis [16,17], the homology between the 3′-ends of SINEs and LINEs suggests that each SINE family recruits the enzymatic machinery for retroposition from the corresponding LINE through this common "tail" sequence [31]. This hypothesis was strongly supported by experiments with SINE sequences in the eel [36]. As the 3′-UTRs of several LINEs have been shown to be essential for retroposition [17,36–39], these LINEs presumably require "stringent" recognition of the 3′-end sequence of the RNA template [32,36]. Figure 4 illustrates the relationship between the number of SINE/LINE pairs and the number of LINEs in each clade [34]. Although Spearman's rank correlation is not significant (P = 0.25), the number of SINEs with a LINE tail is positively correlated with the number of LINEs belonging to each clade (R² = 0.83); that is, more LINEs tend to lead to more SINE/LINE pairs. Therefore, although a few LINE clades are the predominant source of SINE/LINE pairs, it is plausible that this simply reflects the large number of LINEs in these clades. However, L1-clade LINEs are the only prominent exception to this. Although over 800 L1-clade LINEs appeared in the database, only 3 SINEs with L1 tails were found [34], suggesting that, in general, L1-clade LINEs are different from other LINEs with regard to 3′-end recognition.
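The contrast drawn above between a rank-based correlation and a squared linear correlation can be illustrated with a minimal sketch. The per-clade counts below are hypothetical and are not the data of [34]; the function names are likewise illustrative. The sketch only shows how a single clade with very large counts can produce a high squared Pearson coefficient while the rank correlation stays weak.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical counts per LINE clade (NOT the values reported in [34]).
lines_per_clade = [10, 20, 30, 40, 400]
pairs_per_clade = [3, 1, 2, 0, 30]

r_squared = pearson_r(lines_per_clade, pairs_per_clade) ** 2
rho = spearman_rho(lines_per_clade, pairs_per_clade)
print(f"R^2 = {r_squared:.3f}, Spearman rho = {rho:.3f}")
```

With these toy counts the linear fit is dominated by the largest clade, whereas the ranks of the remaining clades are nearly unrelated, which is one way the two statistics can disagree.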

Mechanism of RNA-Mediated Gene Duplication in Mammals.
Mammalian PPs and retrogenes were probably mobilized by L1s because they end in poly(A), and have L1-type target site duplications; they are inserted in L1-type endonuclease cleavage sites [40][41][42]. Molecular biological studies have shown that mammalian L1-encoded proteins have been involved in the reverse transcription of PPs [43,44]. In the same assay, another class of autonomous retroposons, LTR retrotransposons (retroviral-like elements), were unable to produce similar PP-like structures [43].
The 3′-end sequences of mammalian L1 LINEs do not exhibit any similarity to SINEs, except for the presence of 3′-poly(A) repeats, although these L1s are thought to have mediated the retroposition of mammalian SINEs such as primate Alu and rodent B1 families [45][46][47]. Since the 3′-poly(A) repeats of L1 and Alu are critical for their retroposition in the HeLa cell line [46,48,49], L1 probably recognizes the 3′-poly(A) repeats. Therefore, while mammalian L1s do not require stringent recognition of the 3′-end sequence of the RNA templates, they are able to initiate reverse transcription in a more "relaxed" manner [32].
L1-encoded proteins are cis-acting; that is, L1 proteins preferentially mobilize or interact with the RNA molecule that encoded them [43,44]. However, L1 is also thought to mobilize SINE RNAs and cytosolic mRNAs by recognizing the 3′-poly(A) tail of the template RNAs in trans, resulting in enormous SINE amplification and PP formation [43,50]. Given that the L1 retropositional machinery acts in a cis-manner, Boeke [51] proposed the poly(A) connection hypothesis to explain why Alu RNA is mobilized by L1 at such a high frequency.
Schmitz et al. discovered a novel class of retroposons that lack poly(A) repeats in mammals. Termed tailless retropseudogenes, they are derived from truncated tRNAs and tRNA-related SINE RNAs [52]. To explain this phenomenon, they proposed a novel variant mechanism, probably guided by the L1 RT, in which neither the presence of a poly(A) tail on the RNA template nor its length is important for retroposition.

Retroposition Burst in Ancestral Primates
Abundant PPs are a feature of mammalian genomes [7,8]. Previously, my collaborators and I performed the first comprehensive analysis of human PPs using all known human genes as queries [50]. The human genome was queried and 3,664 candidate PPs were identified, the most abundant of which were copies of genes encoding keratin 18, glyceraldehyde-3-phosphate dehydrogenase, and ribosomal protein L21. A simple method was developed to estimate the level of nucleotide substitutions (and therefore the age) of the PPs. A Poisson-like age distribution was obtained with a mean age close to that of the Alu repeats. These data suggested a nearly simultaneous burst of PP and Alu formation in the genomes of ancestral primates. Similar results have been reported by other groups [53][54][55]. The peak period of amplification of these 2 distinct retroposons was estimated to be 40-50 million years ago (mya) [50]; moreover, concordant amplification of certain L1 subfamilies with PPs and Alus was observed. We proposed a possible mechanism to explain these observations in which the proteins encoded by members of particular L1 subfamilies acquired an enhanced ability to recognize cytosolic RNAs in trans.
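How a substitution level can be converted into an approximate age may be sketched as follows. This is not the method of [50], which is not described here; it is a generic illustration that assumes a gap-aware p-distance between a PP and its parent sequence, the Jukes-Cantor (JC69) correction, and a hypothetical neutral substitution rate. It also makes the simplifying assumption that all observed substitutions accumulated on the PP after insertion. The sequences, the rate, and all function names are illustrative.

```python
from math import log


def p_distance(seq_a, seq_b):
    """Proportion of differing sites in two aligned sequences (gap columns skipped)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def jukes_cantor(p):
    """JC69 correction for multiple hits: d = -(3/4) * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the JC69 correction")
    return -0.75 * log(1 - 4 * p / 3)


def age_mya(distance, rate_per_site_per_myr):
    """Age in million years, assuming a constant substitution rate."""
    return distance / rate_per_site_per_myr


# Toy aligned sequences (hypothetical): 2 differences over 20 sites.
parent = "ACGTACGTACGTACGTACGT"
pseudogene = "ACGTACGTACCTACGTACGA"

# Hypothetical neutral rate: 0.002 substitutions per site per million years.
ASSUMED_RATE = 0.002

p = p_distance(parent, pseudogene)
d = jukes_cantor(p)
print(f"p = {p:.3f}, JC69 d = {d:.4f}, age = {age_mya(d, ASSUMED_RATE):.1f} mya")
```

Applying such an estimate to every candidate PP and plotting the resulting ages would give an age distribution of the kind described above; the absolute scale depends entirely on the assumed rate.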
Roy-Engel's group recently recreated and evaluated the retroposition capabilities of two ancestral L1 elements, L1PA4 and L1PA8, which were active ∼18 and ∼40 mya, respectively [56]. Relative to the modern L1PA1 subfamily, they found that both elements were similarly active in a cell culture retroposition assay in the HeLa cell line, and both were able to efficiently trans-mobilize Alu elements from several subfamilies. They found limited evidence of differential associations between Alu and L1 subfamilies, suggesting that other factors are likely the primary mediators of their changing interactions over evolutionary time. Population dynamics and stochastic variation in the number of active source elements likely played an important role in individual LINE or SINE subfamily amplification [56]. If coevolution also contributed to changing retroposition rates and the progression of subfamilies, cell factors were likely to play an important mediating role in changing LINE-SINE interactions over evolutionary time.

Gene Creation by the Coupling of Gene Duplication and Domain Assembly.
Most new genes arise by the duplication of existing gene structures, after which relaxed selection on the new copy frequently leads to mutational inactivation of the duplicate; only rarely will a new gene with a modified function emerge. My collaborators and I described a unique mechanism of gene creation, whereby new combinations of functional domains are assembled at the RNA level from distinct genes, and the resulting chimera is then reverse-transcribed and integrated into the genome by the L1 retrotransposon [59]. We characterized a novel gene, which we termed PIP5K1A and PSMD4-like (PIPSL), created by this mechanism from an intergenic transcript between the phosphatidylinositol-4-phosphate 5-kinase (PIP5K1A) and the 26S proteasome subunit (PSMD4) genes in a hominoid ancestor. PIPSL is transcribed specifically in the testis of humans and chimpanzees and is posttranscriptionally repressed by independent mechanisms in these primate lineages. The PIPSL gene encodes a chimeric protein combining the lipid kinase domain of PIP5K1A and the ubiquitin-binding motifs of PSMD4. Strong positive selection on PIPSL led to its rapid divergence from the parental genes, forming a chimeric protein with distinct cellular localization and minimal lipid kinase activity, but significant affinity for cellular ubiquitinated proteins [59]. PIPSL is a tightly regulated, testis-specific novel ubiquitin-binding protein formed by an unusual exon-shuffling mechanism in hominoid primates and represents a key example of the rapid evolution of a testis-specific gene. We determined the evolutionary fate of PIPSL domains created by domain shuffling [61].
During hominoid diversification, the S5a/PSMD4-derived domain was retained in all lineages, whereas ubiquitin-interacting motif (UIM) 1 in the domain experienced critical amino acid replacements at an early stage, being conserved under subsequent high levels of nonsynonymous substitutions to UIM2 and other domains, suggesting that adaptive evolution diversified these functional compartments (Figure 5) [61]. Conversely, the PIP5K1A-derived domain is degenerated in gibbons and gorillas. These observations provide a possible scheme of domain shuffling in which the combined parental domains are not tightly linked in the novel chimeric protein, allowing for changes in their functional roles, leading to their fine-tuning. Selective pressure toward a novel function initially acted on one domain, whereas the other experienced a nearly neutral state. Over time, the latter also gained a new function or was degenerated.

RNA-Mediated Gene Duplication in Land Plants
The SINE/LINE relationship in land plants is controversial. The first SINE/LINE pair of land plants was reported recently in maize [66]. However, the three tRNA-derived SINE families in Arabidopsis thaliana do not exhibit any similarity to the only LINE family (ATLN) in its genome [67][68][69]. Deragon's group proposed that the SINE-LINE relationship in Arabidopsis is not based on primary sequence identity but on the presence of a common poly(A) region [68]. I systematically analyzed the increasing wealth of genomic data to elucidate the SINE/LINE relationships in eukaryotic genomes, especially plants [34]. I proposed that the ancestral L1-clade LINE in the common ancestor of green plants may have used stringent RNA recognition to initiate reverse transcription. During the course of plant evolution, specific recognition of the RNA template may have been lost in a plant L1 lineage, as in mammals. Figure 6 represents the number of LINEs belonging to each LINE clade according to biological taxa [34]. The L1 clade is the largest of all the clades, with L1-clade LINEs being predominant in mammals and land plants (mainly flowering plants). The genomes of flowering plants harbor almost exclusively L1-clade LINEs (RTE-clade LINEs are also found in several species).

L1-Clade LINEs Are Predominant in the Genomes of Flowering Plants.
While a significant number of SINEs, more than half of which end in poly(A) repeats, have been identified in the genomes of flowering plants (Table 2) [34], only three SINE/LINE pairs have been discovered in their genomes, that is, maize ZmSINE2 and ZmSINE3 [66] and tobacco TS SINE [34]. Interestingly, many PPs have been reported in flowering plants [11,70–73]. Since mammalian L1s are thought to recognize the 3′-poly(A) tail of RNA when forming PPs [43], it is possible that the plant LINE machinery is similar to that of mammalian L1s [68]; that is, plant L1-clade LINEs presumably recognize the 3′-poly(A) tail of RNA, thereby mobilizing SINEs with a poly(A) tail and mRNA.
In accordance with this hypothesis, almost all L1-clade LINEs in flowering plants end in poly(A) repeats, while all RTE-clade LINEs end in (TTG)n or (TTGATG)n (Table 3) [34]. As for the exceptional cases of p-SINEs [74,75] and Au-like SINEs [76][77][78][79], which end in poly(T) tracts (or a short stretch of T), it is possible that they are mobilized by unidentified partner LINEs that recognize a poly(U) repeat of RNA at the 3′-terminus.
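The tail categories discussed above, poly(A), (TTG)n, (TTGATG)n, and poly(T), can be assigned to a sequence with a simple pattern match at its 3′-terminus. The following is a minimal sketch and not the procedure used in [34]; the length thresholds (at least 6 nt for a homopolymer, at least 2 tandem units for longer motifs), the function name, and the example sequences are assumptions chosen for illustration. It also assumes that the repeat ends in phase at the last base, which real elements do not always satisfy.

```python
import re

# Illustrative thresholds (assumptions): >= 6 nt for homopolymer tails,
# >= 2 tandem copies for the longer repeat units. Order matters only for
# readability here, because the patterns are mutually exclusive at the 3' end.
TAIL_PATTERNS = [
    ("poly(A)", re.compile(r"A{6,}$")),
    ("(TTGATG)n", re.compile(r"(?:TTGATG){2,}$")),
    ("(TTG)n", re.compile(r"(?:TTG){2,}$")),
    ("poly(T)", re.compile(r"T{6,}$")),
]


def classify_tail(sequence):
    """Return a label for the repeat found at the 3' end of a DNA or RNA sequence."""
    seq = sequence.upper().replace("U", "T")  # treat RNA input as DNA
    for label, pattern in TAIL_PATTERNS:
        if pattern.search(seq):
            return label
    return "other"


# Hypothetical 3'-terminal fragments, not taken from any database entry.
examples = {
    "toy_L1_like": "GATTACAGGCAAAAAAAAAA",
    "toy_RTE_like": "GGCCTTGTTGTTG",
    "toy_RTE_like_2": "CCTTGATGTTGATG",
    "toy_pSINE_like": "ACGTTTTTTTT",
}
for name, fragment in examples.items():
    print(name, classify_tail(fragment))
```

Tabulating such labels per LINE clade would give a summary of the same kind as Table 3, although curated annotations would be needed for elements whose tails are interrupted or out of phase.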

Plant L1-Clade LINEs Consist of 3 Deeply Branching Lineages That Have Descended from the Common Ancestor of Monocots and Eudicots.
Comprehensive phylogenetic analysis of L1-clade LINEs revealed three important points [34]. First, L1-clade LINEs from distinct taxa (i.e., land plants, green algae, and vertebrates) formed monophyletic groups. Statistical support for the monophyly of land plants and green algae was high, with bootstrap values of 100/82 and 97/83 (NJ/ML methods), respectively. The monophyly of vertebrate F and M lineages was not supported by the ML method. Second, the L1 lineages from these three taxa formed a monophyletic group (55/45; NJ/ML methods) among diverged LINE clades such as RTE and CR1. The Tx1 LINE, with a target-specific insertion, was also found in this clade, as observed in previous studies [26,29,30]. The Tx1 and vertebrate F lineage formed a monophyletic group with high confidence (94/85). Third, comparison with species phylogeny revealed that plant L1-clade LINEs consist of at least three deeply branching lineages that have descended from the common ancestor of monocots and eudicots (ME1-3). These 3 lineages must have arisen more than 130 mya, which is the approximate divergence of monocots and eudicots [80]. The history of plant L1 lineages is therefore reminiscent of that of vertebrate L1-clade LINEs, which are divided into several ancestral lineages (M and F/Tx1), one of which leads to mammalian L1s [28,29].

A Conserved 3′-End Sequence with a Solid RNA Structure, as in Maize and Sorghum SINEs, Observed in One Plant L1 Lineage.
One monocot L1 lineage (monocot 1a in ME1) consisted of a large number of L1-clade LINEs that were identified mainly in the recently released maize and sorghum genomes. Moreover, one group of LINEs in this lineage retained a conserved 3′-end sequence [34]. The average pairwise divergence of this region (the last 45 nucleotides) among the LINEs was only 0.144 (standard error (SE), 0.043), whereas that for the entire sequence was 0.570 (SE, 0.012). Interestingly, maize SINEs (ZmSINE2 and ZmSINE3) with 3′-end sequences very similar to that of a LINE belonging to this group, LINE1-1_ZM, were reported recently [66]. I further revealed that several sorghum SINEs also possess similar 3′-end sequences [34]. Comparisons of the 3′-end sequences from these SINEs and LINEs revealed that part of the sequence (∼50 nucleotides) is apparently related; presumably they were derived from a common ancestral L1 sequence (Figure 7) [34].
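The comparison above, between the divergence of a 3′-terminal window and that of the whole element, can be sketched as follows. This is an illustration only: it uses an uncorrected p-distance on a toy alignment, whereas the distance measure behind the values in [34] is not stated here, and it does not compute a standard error. The sequences and function names are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    sites = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in sites) / len(sites)


def mean_pairwise_divergence(aligned, last_n=None):
    """Mean p-distance over all sequence pairs.

    If last_n is given, only the last last_n alignment columns
    (the 3'-terminal window) are compared.
    """
    if len(aligned) < 2:
        raise ValueError("need at least two sequences")
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    if last_n is not None and last_n <= 0:
        raise ValueError("last_n must be positive")
    region = [s if last_n is None else s[-last_n:] for s in aligned]
    distances = [p_distance(a, b) for a, b in combinations(region, 2)]
    return sum(distances) / len(distances)


# Toy alignment (hypothetical): variable 5' part, conserved 4-nt 3' end.
toy_alignment = [
    "ACGTACGTTTGA",
    "ATGTCCGTTTGA",
    "ACCTAAGTTTGA",
]

whole = mean_pairwise_divergence(toy_alignment)
tail = mean_pairwise_divergence(toy_alignment, last_n=4)
print(f"whole = {whole:.3f}, 3' tail = {tail:.3f}")
```

For real elements the window would be the last 45 alignment columns, and a markedly lower value for the window than for the whole sequence is the signature of a conserved 3′-end.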

Origin of Stringent and Relaxed 3′-End Recognition of
in the L1-clade was found in a green alga. The 3′-end sequence (∼80 nucleotides) of Chlamydomonas SINEX-3_CR [83] was very similar to that of L1-1_CR, both ending in poly(A) repeats [34]. Since land plants emerged from green algae [84], the following mechanism is proposed for the 3′-end recognition of plant L1-clade LINEs (Figure 9).
It is possible that the ancestral L1-clade LINE in the genome of the common ancestor of green plants possessed stringent, nonmammalian-type RNA recognition properties. During the course of plant evolution, an L1 lineage then lost the ability to specifically recognize the RNA template for reverse transcription, thereby introducing relaxed 3′-end recognition in land (flowering) plants as well as in mammals. This model assumes that rigid sequence specificity was an ancestral state, although the timing of its loss might be subject to debate. Since horizontal transfer of LINEs between eukaryotes is rare [25,85–87], the discontinuous distribution of L1-clade LINEs with low specificity (i.e., mammalian L1s and plant ME2/ME3) suggests a type of parallel evolution.
The ancestral L1-clade LINE might have required the 3′-end sequence and the terminal poly(A) repeats. A few L1 lineages might then have lost their specific interaction with the 3′-UTR of the template RNA, retaining some role for the 3′-repeats. As shown in Table 3, most plant L1-clade LINEs, as well as mammalian L1s, have poly(A) repeats at their 3′-termini; however, 3′-poly(A) repeats are not necessarily a hallmark of relaxed 3′-end recognition. For example, although silkworm SART1, an R1-clade LINE, uses stringent-type recognition (its 3′-UTR is essential for retroposition), it ends in poly(A) repeats [37,38], which are necessary for efficient and accurate retroposition [38]. Other LINEs end in repeating units other than poly(A); for example, the I element (I clade) ends in TAA repeats [88], while UnaL2 (L2) ends in TGTAA repeats, which are likely involved in template slippage during reverse transcription [36].
Alternatively, the ancestral L1-clade LINE may have possessed relaxed, mammalian-type RNA recognition properties. During the course of plant evolution, the L1 lineages of land plants (ME1) and green algae might then have gained specific stringent-type recognition of the RNA template. However, it is difficult to imagine that the molecular machinery for rigid sequence specificity, such as the particular conformation of the RNA-binding domain, could have arisen independently under reduced constraints.
In vivo retroposition assays have been developed for several LINEs [36,37,39,48]. Using such systems, it will be possible to verify these 2 models by evaluating the dispensability of the 3 -end sequence or poly(A) repeats in newly characterized L1 lineages such as plant ME1 and fish F.

Concluding Remarks
L1 LINEs have contributed significantly to the architecture and evolution of mammalian genomes, whereas LTR retrotransposons are overwhelmingly found in certain flowering plants. Understanding the independent origins of flexible 3′-end recognition may help us to determine what distinguishes the fate of a retroposon in the eukaryotic genome and why it has succeeded so well in certain genomes [89][90][91][92][93].