LINE-1 Endonuclease-Dependent Retrotranspositional Events Causing Human Genetic Disease: Mutation Detection Bias and Multiple Mechanisms of Target Gene Disruption

LINE-1 (L1) elements are the most abundant autonomous non-LTR retrotransposons in the human genome. Having recently performed a meta-analysis of L1 endonuclease-mediated retrotranspositional events causing human genetic disease, we have extended this study by focusing on two key issues, namely, mutation detection bias and the multiplicity of mechanisms of target gene disruption. Our analysis suggests that whereas an ascertainment bias may have generally militated against the detection of autosomal L1-mediated insertions, autosomal L1 direct insertions could have been disproportionately overlooked owing to their unusually large size. Our analysis has also indicated that the mechanisms underlying the functional disruption of target genes by L1-mediated retrotranspositional events are likely to be dependent on several different factors such as the type of insertion (L1 direct, L1 trans-driven Alu, or SVA), the precise locations of the inserted sequences within the target gene regions, the length of the inserted sequences, and possibly also their orientation.

Recently, we performed a comprehensive meta-analysis of 48 L1 endonuclease-mediated retrotranspositional events that cause human genetic disease. This analysis explored the sequence features associated with the different L1-mediated human retrotransposons (ie, L1 direct insertions, L1 trans-driven Alu insertions, and L1 trans-driven SVA (short interspersed nuclear elements-R, variable number of tandem repeats, and Alu) insertions), the frequency of genomic deletions created upon L1-mediated retrotransposition, and the process of L1-mediated insertion [6]. Here, we have extended this analysis by focusing on two key issues, namely, mutation detection bias and the multiple mechanisms of target gene disruption. Note that during the preparation of this review, three further examples of simple Alu insertions causing human disease were reported; these have also been included in the analysis (Table 1).

MUTATION DETECTION DISPLAYS A SIGNIFICANT BIAS
Since the first report that de novo L1 insertions into the factor VIII gene (F8) had caused severe haemophilia A [7], numerous examples of simple L1-mediated retrotranspositional events (ie, those involving no loss of target gene material; n = 42) have been identified as a cause of human genetic disease (Table 1). Based upon results from in vitro studies [8,9], we have systematically annotated disease-causing L1-mediated retrotranspositional events that have been associated with genomic deletions (n = 9; Table 1).
Journal of Biomedicine and Biotechnology
Table 1 legend: [...] Table 2 from Chen et al [6] for easy comparison, except for the addition of three simple Alu insertions (BRCA1 [21]; BRCA2 [21]; and HESX1 [27]) that have been reported during the preparation of this review. Data on chromosomal location, inserted element and orientation, insertion size, and length of poly (A) tail were derived from Table 2 in Chen et al [6].
a With respect to the sense strand of the disrupted gene. S, sense; AS, antisense. The lengths of the genomic deletions associated with L1-mediated retrotransposons and simple poly (A) insertions are indicated in parentheses.
b I, intron; E, exon. When an insertion occurred into an intron/exon and accompanying RNA analysis data were available, the position of the insertion's integration site was indicated in parentheses (+, relative to the first nucleotide of the intron/exon; −, relative to the last nucleotide of the intron/exon).
c Only the effect on the target gene's pre-mRNA splicing and/or mRNA expression was evaluated.
d The method that initially suggested/identified the mutation at the nucleotide level. PCR indicates all PCR-based techniques using genomic DNA as templates.
e Data not available.
f Poly (A) tail present but number of residues not specified.
g 97 bp in the affected mother and 31 bp in the affected daughter, respectively.
h Not applicable.
All these events probably resulted from L1 endonuclease-dependent retrotranspositional activity because not only have all the inserts integrated at typical L1-endonuclease cleavage sites, but they also possess poly (A) tails (see [6, Tables 1 and 2 and Figure 3]). By contrast, the three L1-derived extra-short inserts (termed "hyphen elements" by Audrézet et al [10]) identified at the junctions of large genomic deletions [10][11][12] did not share the above two hallmark characteristics of L1 endonuclease-dependent retrotranspositional events. These three mutations have therefore been proposed to have arisen via a "repair" process for existing DNA lesions, an L1 endonuclease-independent mechanism [13] that is likely to be qualitatively different from L1 endonuclease-based insertional mutagenesis (see [6, Table 1]).
The above 51 L1 endonuclease-mediated retrotranspositional events account for ∼0.1% of known mutations (∼52,000 as of April 2005) causing human genetic disease, based upon the data collated in the Human Gene Mutation Database (http://www.hgmd.org/; [14]). The occurrence of L1-mediated simple retrotranspositional events has, however, long been thought to have been underestimated since large insertions may often be overlooked by routinely used PCR-based mutation detection techniques (eg, [15,16]). In this review, we have sought to explore how this mutation detection bias could have operated. To this end, we first manually evaluated the original publications that reported the 51 L1 endonuclease-mediated retrotranspositional events with respect to the mutation detection method(s) that initially suggested/identified the presence of an insertion or deletion at the nucleotide (ie, DNA or RNA) level. The locations of these lesions within the target genes (ie, in the 5′-untranslated regions (UTRs), exons, introns, or 3′-UTRs, respectively) were also systematically annotated. Then, in order to assess the likelihood of having underestimated the occurrence of this type of mutational event, we attempted to relate the chromosomal location of the affected genes, as well as the types, sizes, and precise locations of the inserted sequence within the genes, to the mutation detection methods employed (Table 1).
In the context of the analysis of possible mutation detection bias, we excluded, for reasons of simplicity, the following entries from further consideration: (i) the three large genomic deletions that were associated with only simple poly (A) insertions, since the type of L1-mediated retrotransposon involved is unknown (Table 1) and (ii) the SVA simple insertions, owing to their limited number (only 4; Table 1). Our primary focus has therefore been the L1 and Alu insertions, both of which have been frequently found to cause human genetic disease. In addition, we did not consider the 42 simple insertions separately from the 6 genomic deletions associated with L1-mediated retrotransposons, on the basis that all were considered to have resulted from the same L1 endonuclease-mediated insertional mechanism. However, it is important to emphasize that, of the latter 6 cases, three [...] (Table 1).
Since a typical Alu sequence is invariably < 290 bp in length and the poly (A) tails associated with L1-mediated retrotransposons are usually < 100 bp (see also [6, Table 4]), the length of an Alu insert plus its poly (A) tail should be < 400 bp (Table 1). At first sight, it would appear unlikely that those Alu inserts which cause sex-linked disease have been significantly underestimated, both because X-linked diseases readily come to clinical attention in males and because inserts of < 400 bp into male X chromosomes are readily identifiable by routine PCR-based methods. Indeed, as is evident from Table 1, only in rare cases have the simple Alu inserts that have become integrated into the X chromosome been detected by Southern blotting or RT-PCR, and these would also have been amenable to detection by routine PCR-based methods.
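The size argument above can be made concrete with a short sketch. It is illustrative only: the 300 bp wild-type amplicon and the 2000 bp ceiling for routine genomic PCR are assumed values chosen for this example and are not taken from Table 1 or from the cited studies. The 6017 bp figure is the reported size of the L1 insert in the CHM gene [22].

```python
# Illustrative sketch only. WILD_TYPE_BP and the ceiling_bp default are
# assumed values for this example, not data from the review.

ALU_MAX_BP = 290      # upper bound on a typical Alu sequence (from the text)
POLY_A_MAX_BP = 100   # usual upper bound on the poly(A) tail (from the text)


def mutant_amplicon_bp(wild_type_bp: int, insert_bp: int, poly_a_bp: int = 0) -> int:
    """Size of the PCR product from an allele carrying a simple insertion."""
    return wild_type_bp + insert_bp + poly_a_bp


def amplifiable(amplicon_bp: int, ceiling_bp: int = 2000) -> bool:
    """True if the product is within the assumed routine-PCR size ceiling."""
    return amplicon_bp <= ceiling_bp


WILD_TYPE_BP = 300  # hypothetical exon amplicon

# Largest Alu insert plus poly(A) tail allowed by the bounds quoted in the text.
alu_allele = mutant_amplicon_bp(WILD_TYPE_BP, ALU_MAX_BP, POLY_A_MAX_BP)
# Reported CHM L1 insert size; no separate poly(A) length is added here.
l1_allele = mutant_amplicon_bp(WILD_TYPE_BP, 6017)

print(alu_allele, amplifiable(alu_allele))  # 690 True
print(l1_allele, amplifiable(l1_allele))    # 6317 False
```

Under these assumptions, an Alu-bearing allele still yields a product (larger than the wild-type band), whereas an allele carrying a full-length L1 insert would be expected to fail to amplify.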
Whilst an electrophoretic band larger than the expected size was observed when PCR products were examined in the cases of Alu insertions, failure to PCR amplify several exons was encountered in the case of the 4726 bp deletion involving the X-linked ABCD1 gene [17].
To date, whereas 11 Alu insertions have been identified in X-linked genes as a cause of human genetic disease in male patients, the comparable figure for the autosomes is only 18 (Table 1). Although the X chromosome has been claimed to be a preferred target for retrotransposition [18,19], it is difficult to accept that the observed chromosomal distribution of retrotranspositional mutations reflects the actual distribution since the X chromosome comprises only ∼5% of the human genome [20]. Consequently, it would appear likely that at least a proportion of Alu insertions causing human autosomal disease have been overlooked by routine PCR-based techniques. This could have been due to preferential PCR amplification of the wild-type allele which would have "masked" the Alu insertion mutant allele, an example being the failure to detect two Alu insertions by routinely used methods [21].
L1 direct inserts are usually much longer than Alu inserts (Table 1). Although, in principle, the presence of large inserts in X-linked genes in males might be initially suggested by the failure to PCR amplify the exon(s) under investigation (eg, as in the case of the 6017 bp L1 insertion in the CHM gene [22]), most of the L1 direct insertions listed in Table 1 were reported to have been initially identified by RT-PCR or Southern blotting. Given the extensive efforts devoted to screening for X-linked disease (note particularly the identification of two inserts that had become integrated into deep intronic regions (CYBB [23]; RP2 [24])), we surmise that the current figure (n = 12) of L1 direct inserts into the X chromosome may approach complete ascertainment. In this context, it is noteworthy that with respect to the insertions causing human X-linked disease, the number of reported L1 direct insertions (n = 12) is approximately the same as that of reported Alu insertions into X-linked genes (n = 11).
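The mismatch between the Alu counts quoted above (11 X-linked versus 18 autosomal) and the ∼5% share of the genome represented by the X chromosome can be illustrated with a simple binomial calculation. This is a minimal sketch and not an analysis performed in this review: it assumes that each insertion is an independent event and that, absent any bias, insertions would be distributed in proportion to chromosome length, ignoring gene density and disease-gene content.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


x_linked_alu = 11    # disease-causing Alu insertions in X-linked genes (from the text)
autosomal_alu = 18   # disease-causing Alu insertions in autosomal genes (from the text)
n = x_linked_alu + autosomal_alu
x_fraction = 0.05    # approximate share of the genome on the X chromosome [20]

# Expected number of X-linked insertions under the proportional-length assumption.
expected = n * x_fraction
# Probability of seeing at least the observed number on the X chromosome.
p_value = binomial_upper_tail(x_linked_alu, n, x_fraction)

print(f"expected X-linked insertions: {expected:.2f}")
print(f"P(at least {x_linked_alu} of {n} on X) = {p_value:.2e}")
```

A very small tail probability under this simplified model says only that the observed distribution is not explained by chromosome size alone; it cannot distinguish ascertainment bias from a genuine targeting preference.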
However, by comparison with disease-causing Alu insertions that have become integrated within autosomal genes (n = 18), an apparent paucity of disease-causing autosomal L1 direct inserts (n = 3) is evident (Table 1). The reason for this finding may be quite simple: the longer the inserts are, the more easily will they be missed by routine PCR-based techniques in the presence of a wild-type allele. It is therefore not unreasonable to conclude that the occurrence of L1 direct insertions causing autosomal disease has probably been significantly underestimated.
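The contrast between insert types can be set out as a 2 × 2 table using the counts quoted in the text (L1 direct: 12 X-linked, 3 autosomal; Alu: 11 X-linked, 18 autosomal). The sketch below computes an odds ratio and a one-sided Fisher exact probability from the hypergeometric distribution. It is offered as an illustration of the reasoning and is not necessarily the comparison reported in Table 2; the function names are hypothetical.

```python
from math import comb


def hypergeom_pmf(k: int, row1: int, col1: int, total: int) -> float:
    """P(top-left cell = k) for a 2 x 2 table with fixed margins."""
    return comb(col1, k) * comb(total - col1, row1 - k) / comb(total, row1)


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """P(top-left cell >= a) under independence, for the table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    k_max = min(row1, col1)
    return sum(hypergeom_pmf(k, row1, col1, total) for k in range(a, k_max + 1))


# Rows: insert type; columns: X-linked, autosomal (counts from the text).
l1_x, l1_auto = 12, 3
alu_x, alu_auto = 11, 18

odds_ratio = (l1_x * alu_auto) / (l1_auto * alu_x)
p_one_sided = fisher_one_sided(l1_x, l1_auto, alu_x, alu_auto)

print(f"odds ratio (L1 vs Alu, X-linked vs autosomal): {odds_ratio:.2f}")
print(f"one-sided Fisher exact p = {p_one_sided:.4f}")
```

An excess of X-linked cases among L1 direct inserts relative to Alu inserts is what would be expected if longer inserts are preferentially missed on the autosomes, although the test itself does not identify the cause.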
To obtain further insights into this issue, we examined the above finding in the context of a multiple pairwise comparison (Table 2). This revealed that, in general, mutations in X-linked genes are significantly over-represented in HGMD by comparison with both the proportion of X-linked to non-X-linked genes in HGMD (4-fold; p < 0.0001), and the proportion of X-linked to non-X-linked genes in the genome as a whole (8-fold; p < 0.0001). This could be due to a number of different factors: (i) the X chromosome may bear a slightly higher proportion of genes that are "disease genes" than other chromosomes, (ii) X-linked disease may come to clinical attention more readily than autosomal disease since recessive mutations will become manifest in hemizygous males, (iii) hemizygous insertional mutations on the X chromosome may, with currently used mutation detection techniques, be more readily detectable than heterozygous/compound heterozygous insertional mutations on the autosomes (due to the inherent limitations of PCR/"masking" of the mutant allele by the wild-type allele), (iv) greater effort may have been expended, historically, in identifying the genes and characterizing the mutational spectra underlying X-linked disease, and (v) the X chromosome may represent a preferred retrotranspositional target as compared to other chromosomes. In reality, a combination of all these different factors has probably been operating. These considerations are also likely to apply to retrotranspositional insertions and may together account for the discrepancy in the observed prevalence of insertions into the X chromosome as compared with the autosomes.

MULTIPLE MECHANISMS OF TARGET GENE DISRUPTION
We also systematically surveyed the original publications that reported the 51 L1 endonuclease-mediated retrotranspositional events with respect to the evidence presented for functional disruption of the target genes at the RNA level (ie, aberrant splicing and/or decreased mRNA expression). The information obtained was further evaluated in the context of the size, orientation, and integration sites of the inserts wherever possible and appropriate (Table 1).

Alu insertions
Of the 18 simple Alu insertions that integrated within coding regions, only five were informative with respect to the functional disruption of the target genes at the RNA level (BRCA2 [25]; BRCA2 [21]; CLCN5 [26]; HESX1 [27]; HMBS [28]). This was in sharp contrast to the 7 simple Alu insertions that are known to have become integrated into intronic regions, 5 of which were informative. The probable reason for this phenomenon is that Alu insertions into coding regions will invariably lead to the loss of a functional protein product, irrespective of the precise point at which the gene expression pathway has been disrupted.
The Alu insertion into exon 22 of the BRCA2 gene resulted in the skipping of the exon involved through "some unknown mechanism" [25]. With hindsight, this insertion, which integrated fairly deeply into the exon involved (36 bp after the first nucleotide and 163 bp before the last nucleotide of exon 22; Table 1), could have disrupted cis-splicing elements such as an exon splicing enhancer and/or could have interacted with trans-acting cellular splicing factors, resulting in the "silencing" of the upstream constitutional splice acceptor site (for reviews, see [29][30][31]). Consistent with this postulate, the Alu insertion in the CLCN5 gene [32] was recently suggested to interfere with splicing regulatory elements, resulting in exon 11 skipping [26]. However, this is certainly not the case for the Alu insertion into exon 5 of the HMBS gene: both in vitro expression studies and in vivo RT-PCR analyses demonstrated that the mutant HMBS allele was not expressed at the RNA level [28]. Of the various possible mechanisms proposed by the original authors, we favour nonsense-mediated mRNA decay [33,34].
All 5 informative Alu intronic insertions are located nearer to the downstream exons than to the preceding exons. Consequently, most of them (n = 4) were found to cause skipping of the downstream exons: whilst two most likely affect the correct recognition of the splice acceptor sites (F8 [35]; FGFR2 [36]), the other two may affect the branch site that is usually located very close to the end of the intron (NF1 [37]; TNFRSF6 [38]). The remaining intronic insertion (GK [39]) was, however, reported not to "cause any deletions, duplications, premature stop codons, or frameshifts in the individual with benign glycerol kinase deficiency, as determined by RT-PCR (data not shown)." This notwithstanding, since no other mutations were present within the coding regions and intron-exon boundaries of the gene, and since the Alu insertion does not represent a polymorphism, this insertion was concluded to be indeed disease-causing [39]. Although we concur with this conclusion, we nevertheless feel that the functional consequence(s) of the Alu insertion may have been overlooked. In this regard, it is worth pointing out that the patient's radiochemically measured GK activity was 32% of the mean normal control (ie, not a complete loss) [39]. It is therefore possible that the Alu insertion did not completely disrupt normal pre-mRNA splicing. However, in the RT-PCR analysis, the aberrantly spliced transcripts may have been unstable and could thus have been "masked" by correctly spliced stable transcripts.

L1 insertions
As with the Alu simple insertions, only one of the 8 L1 simple insertions in coding regions was informative with respect to target gene disruption; it caused the skipping of the exon involved [22], probably through a similar mechanism to the above-discussed Alu insertion into the BRCA2 gene [25]. By contrast, all 4 intronic insertions were informative: whilst two insertions were associated with either the skipping of a single exon (RPS6KA3 [40]) or an extremely complex splicing pattern (CYBB [23]), the other two insertions resulted in a significant, or even complete, loss of the mRNA transcript (HBB [41]; RP2 [24]). The latter two examples will now be discussed in detail in the light of a recent report [42].
Both L1 RNA and open-reading-frame-2 (ORF2) protein are very difficult to detect in mammalian cells, suggesting a mammalian-specific mechanism for negatively regulating L1 expression (see [42] and references therein). Indeed, the A-rich sense strand of an active human L1 element (ie, LINE-1.3; [43]), containing many canonical (n = 19) and noncanonical (n = 141) polyadenylation signals, has been noted to be prone to generate truncated transcripts by premature polyadenylation, at least under in vitro conditions [44]. However, using a different cell culture assay, Han et al [42] have shown that poor expression of the ORF2 protein is mainly due to the inability of RNA polymerase to elongate efficiently through L1 coding sequences (despite a minor contribution from premature polyadenylation). Moreover, these authors have demonstrated that an ORF2 sequence, when placed in the antisense orientation, inhibits transcription primarily by promoting premature polyadenylation. Based upon these observations, Han et al [42] predicted that L1 elements which have become inserted into introns could attenuate the expression of target genes either by premature truncation of RNA (in the antisense orientation) or by inhibiting transcriptional elongation (in the sense orientation), both mechanisms resulting in the decreased production of full-length pre-mRNA. Consistent with this postulate, highly expressed genes were found to contain relatively small amounts of L1 sequence, whereas poorly expressed genes contained large amounts [42].
In particular, the full-length de novo L1 insertion into intron 1 of the RP2 gene that is associated with the complete loss of RP2 mRNA synthesis [24] was cited by Han et al [42] as an example to support their thesis. As is evident from Table 1, the L1 insert in the HBB gene [41] shares remarkable similarities with that found in the RP2 gene [24]: both are full-length and both became integrated within introns. However, whereas the full-length RP2 L1 insertion was in the sense orientation and resulted in the complete loss of gene expression, the full-length HBB L1 insertion was in the antisense orientation and the amount of mRNA transcribed from the affected allele was reduced to 30% of normal (the mRNA transcripts from the affected and unaffected alleles were distinguishable by a codon 2 polymorphism and no splicing variants were detected [41]). This concurs with the in vitro finding that "inserting ORF2 in the antisense orientation produced a similar, but less potent, decrease in full-length RNA" [42]. Thus, the HBB insertion may serve as an additional example of an insertion that is consistent with the proposal that the insertion of L1 elements into a target gene's introns can significantly alter the expression of that gene [42].
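The comparison of the two full-length intronic L1 insertions can be summarised in a small data structure. This is a minimal sketch: the class and function names are hypothetical, and assigning a single mechanism to each orientation is a simplification of the findings of Han et al [42] as summarised in this review (in the sense orientation, premature polyadenylation is described as a minor additional contributor). The residual mRNA values are those quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntronicL1Insertion:
    gene: str
    orientation: str               # "sense" or "antisense" relative to the gene
    residual_mrna_fraction: float  # mRNA from the affected allele; 1.0 = normal


def proposed_primary_mechanism(orientation: str) -> str:
    """Primary mechanism proposed for each orientation (simplified from [42])."""
    mechanisms = {
        "sense": "inhibition of transcriptional elongation",
        "antisense": "premature polyadenylation",
    }
    return mechanisms[orientation]


cases = [
    IntronicL1Insertion("RP2", "sense", 0.0),       # complete loss of mRNA [24]
    IntronicL1Insertion("HBB", "antisense", 0.30),  # reduced to 30% of normal [41]
]

for case in cases:
    print(
        f"{case.gene}: {case.orientation}, "
        f"{case.residual_mrna_fraction:.0%} residual mRNA, "
        f"proposed mechanism: {proposed_primary_mechanism(case.orientation)}"
    )
```

With only two cases, the ordering (sense insertion more disruptive than antisense) is consistent with the in vitro observations but cannot be taken as confirmation of them.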
The above notwithstanding, it would appear unlikely that the significantly 5′-truncated L1 insert (only 530 bp) in the DMD gene [45] caused the complete loss of the muscle (M) isoform of dystrophin through inhibition of transcriptional elongation and/or premature polyadenylation; this conclusion is based upon the in vitro observation that the level of reporter RNA expression was inversely correlated with the length of transfected L1 ORF2 (see [42, Figure 3]). Indeed, this short insert, which had integrated just 28 bp upstream of the ATG codon initiating translation of the M isoform encoded by the dystrophin (DMD) gene, must have affected transcriptional initiation and/or regulation. Although the expression of the M isoform was completely abolished, there were compensatory increases in the expression of the nonmuscle B (brain) and CP (cerebellar Purkinje) isoforms in the patient's skeletal muscle [45,46]. (The M, B, and CP isoforms are generally considered to be functionally homologous. However, the transcripts encoding these isoforms contain a unique first exon and are expressed from different, tissue-specific promoters; see [47] and references therein.)

SVA insertions
Of the four SVA insertions (Table 1), two were inserted into exons, causing the skipping of the exons involved (BTK [48]; SPTA1 [49]), whereas the other two were reported to be associated with virtually undetectable mRNA expression (ARH [50]; FCMD [51]). In the case of the ARH mutation, "although no mRNA was detectable by Northern blotting, small amounts of cDNA could be amplified using RT-PCR" [50]. Similarly, "the transcript of this (FCMD) gene was nearly undetectable in FCMD patients who carried the insertion homozygously, and significantly lower than normal in patients heterozygous for the insertion and another mutation haplotype" [51]. As previously discussed [6], although SVA elements are relatively poorly characterized, they are composed of highly repetitive sequences (for a detailed sequence description, see [51]; refer also to [50, Figure 2]). Importantly, both SVA insertions are rather long (2600 and 3062 bp, respectively). Moreover, the SVA insertion in ARH [50] is very similar to the L1 insertion in RP2 [24] in the following respects: both were in the sense orientation and both had been inserted into the first introns of their respective genes in comparable locations (Table 1). Thus, it is tempting to speculate that the 2600 bp SVA insert may also compromise transcriptional elongation, resulting in an undetectable level of mRNA (even though it is C-rich, cf L1 which is A-rich).
That the 3062 bp SVA element had been inserted into the 3′-UTR of the FCMD gene [51] effectively serves to exclude a possible effect on transcriptional initiation. It is also pertinent to note that the normal FCMD transcript comprises a long 3′-UTR of 5952 bp; the SVA integration site is 4375 bp downstream of the TGA translational termination codon and 1454 bp upstream of the poly (A) addition signal sequence. Thus, it is very likely that the 3062 bp SVA insertion (in sense orientation) either inhibits transcriptional elongation or causes abnormal polyadenylation, resulting in the complete loss of gene expression.
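The coordinates quoted above can be checked with simple arithmetic. The sketch below is illustrative and rests on assumptions that go beyond the text: that both distances are measured along the same normal transcript without overlap, and that the mutant 3′-UTR length is the length that would result if transcription read through the entire insert (which is precisely what the proposed mechanisms call into question). The variable names are hypothetical.

```python
# Values quoted in the text for the FCMD gene [51].
UTR_LENGTH_BP = 5952            # normal 3'-UTR length
SITE_FROM_STOP_BP = 4375        # integration site, downstream of the TGA codon
SITE_TO_POLYA_SIGNAL_BP = 1454  # integration site, upstream of the poly(A) signal
SVA_INSERT_BP = 3062            # reported SVA insert length

# Remainder of the 3'-UTR from the start of the poly(A) signal to its 3' end
# (assumes the two quoted distances do not overlap).
signal_to_utr_end_bp = UTR_LENGTH_BP - SITE_FROM_STOP_BP - SITE_TO_POLYA_SIGNAL_BP

# Relative position of the integration site within the normal 3'-UTR.
relative_position = SITE_FROM_STOP_BP / UTR_LENGTH_BP

# Hypothetical read-through 3'-UTR length (assumes the full insert is transcribed).
readthrough_utr_bp = UTR_LENGTH_BP + SVA_INSERT_BP

print(signal_to_utr_end_bp)        # 123
print(f"{relative_position:.1%}")  # 73.5%
print(readthrough_utr_bp)          # 9014
```

On these assumptions the insert lies roughly three quarters of the way into the 3′-UTR, and RNA polymerase would have to traverse more than 4.5 kb of additional sequence (the insert plus 1454 bp) before reaching the normal poly (A) addition signal.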

Genomic deletions associated with L1-mediated retrotranspositional events
In the 6 cases associated with large target gene deletions (ie, the 3 events associated with L1-mediated retrotransposons (ABCD1 [17]; APC [52]; SERPINC1 [53]) plus the 3 events associated with only a simple poly (A) tail (Table 1)), the role played by L1-mediated short insertions in the functional disruption of the target genes cannot be independently assessed. Of the three events associated with extremely short genomic deletions, only two are informative: whilst the 608 bp L1 insertion in exon 44 of the DMD gene caused the skipping of the exon involved [54], the 1200 bp L1 insertion into intron 7 of the FCMD gene yielded a complex splicing pattern including the skipping of exons 7 and 8, the skipping of only exon 7, and the skipping of exons 7, 8, and 9, respectively [55].

CONCLUSIONS
Mutation detection bias is a complex issue. This notwithstanding, our analysis has suggested that at least two factors (namely, clinical selection and the choice of mutation detection techniques) may have contributed to a significant bias in detecting L1-mediated retrotranspositional events that cause human genetic disease. Although there is a general tendency for autosomal L1-mediated insertions to be overlooked, autosomal L1 direct insertions appear likely to be the most seriously underestimated owing to their unusually large size. In particular, given the two examples of L1 direct inserts that have integrated within deep intronic regions (CYBB [23]; RP2 [24]), it would appear that methods other than PCR-based techniques (eg, RT-PCR and Southern blotting) should be employed whenever necessary and possible, with a view to maximizing the mutation detection rate.
Our analysis has also demonstrated that the mechanisms underlying the functional disruption of target genes by L1-mediated retrotranspositional events are dependent on several factors such as the type of insertion, the precise locations of the inserted sequences within the target gene regions, the length of the inserted sequences, and perhaps also their orientation. Thus, an Alu insert might not be capable of efficiently inhibiting transcriptional elongation owing to its small size. Moreover, inserts that have integrated within 5′- or 3′-UTRs would be likely to affect the target genes differently from those that have integrated within coding or intronic regions. Further, the unique examples of full-length L1 inserts integrated into intronic regions (HBB [41]; RP2 [24]) suggest that both the length and orientation of L1 inserts may be important in the context of transcriptional inhibition. This notwithstanding, the precise mechanisms underlying certain insertions, for example, the large SVA insert in the deep intronic region in the ARH gene [50], still remain to be clarified.