Overlapping Antisense Transcription in the Human Genome

Accumulating evidence indicates an important role for non-coding RNA molecules in eukaryotic cell regulation. A small number of coding and non-coding overlapping antisense transcripts (OATs) in eukaryotes have been reported, some of which regulate expression of the corresponding sense transcript. The prevalence of this phenomenon is unknown, but there may be an enrichment of such transcripts at imprinted gene loci. Taking a bioinformatics approach, we systematically searched a human mRNA database (RefSeq) for complementary regions that might facilitate pairing with other transcripts. We report 56 pairs of overlapping transcripts, in which each member of the pair is transcribed from the same locus. This allows us to make an estimate of 1000 for the minimum number of such transcript pairs in the entire human genome. This is a surprisingly large number of overlapping gene pairs and, clearly, some of the overlaps may not be functionally significant. Nonetheless, this may indicate an important general role for overlapping antisense control in gene regulation. EST databases were also investigated in order to address the prevalence of cases of imprinted genes with associated non-coding overlapping, antisense transcripts. However, EST databases were found to be completely inappropriate for this purpose.


Introduction
There is accumulating evidence that a large number of structurally and functionally diverse non-coding RNA (ncRNA) molecules are produced in the eukaryotic cell (Eddy, 2001; Mattick, 2001). The hitherto unsuspected complexity of RNA-based gene regulatory mechanisms presents a considerable technical challenge to both bioinformaticists and molecular biologists. Most of the transcripts do not currently have well-defined structural features and may not be represented, or may be dismissed as cloning artifacts, in gene expression libraries. For example, functional ncRNA occurs in sizes ranging from 21 nucleotides for 'small temporal RNA' (stRNA) and double-stranded silencing RNA (siRNA) (Pasquinelli et al., 2000; Harborth et al., 2001), to greater than 40 kb for overlapping antisense or intergenic transcripts associated with chromatin remodeling at imprinted gene loci such as IGF2R and XIST (Wutz et al., 1997; Lee et al., 1999), and at developmentally regulated, non-imprinted loci such as a globin (Gribnau et al., 2000). Moreover, it is unclear whether a low level of transcription of certain ncRNAs is functionally significant, or whether it merely represents 'illegitimate' or 'leaky' transcription from cryptic promoters.
These difficulties notwithstanding, evidence that ncRNAs make a significant contribution to eukaryotic cell function comes from a variety of sources. Established work indicates that, in addition to functional intronic RNA, ribosomal RNA (rRNA), transfer RNA (tRNA), and the 5′ and 3′ untranslated regions (UTRs) of messenger RNA (mRNA), short ncRNAs are also integral components of major nuclear catalytic complexes, for example, the small nuclear RNAs (snRNAs) of the spliceosome (Valadkhan and Manley, 2001), and telomerase RNA (Lukowiak et al., 2001). In addition, a wide variety of transcriptional and translational regulatory mechanisms have been either described or proposed that involve the base-pairing of complementary RNA molecules produced either in cis or in trans. These include the small nucleolar RNAs (snoRNAs), which modify rRNA and snRNAs (Kiss, 2001), the 'microRNAs' (miRNAs) including stRNA and siRNA (Eddy, 2001), and overlapping antisense transcripts (OATs) produced in cis at protein-coding gene loci in mammals (Kumar and Carmichael, 1998; Vanhée-Brossollet and Vaquero, 1998).
Imprinted genes, which are expressed from only one of the parental alleles during mammalian development, comprise a functionally diverse family of developmentally regulated genes with unusual genomic features, such as associated tandem repeats (Neumann et al., 1995) and reduced intronic content (McVean et al., 1996). In addition, there may be an enrichment of OATs at imprinted loci (Moore, 2001). Such imprinted antisense transcripts may be functionally significant because many are expressed at high levels and are associated with genomic regions implicated in regulating the imprinting mechanism (Moore et al., 1997; Sleutels et al., 2000). However, there are also examples of OATs at non-imprinted loci (Vanhée-Brossollet and Vaquero, 1998). Moreover, the apparent enrichment of such transcripts at imprinted loci may reflect an ascertainment bias, because the genomic organization and allele-specific expression patterns of these genes have been studied more intensively than those of non-imprinted genes.
In order to address the question of the functional significance of OATs in the human genome, we sought to estimate the frequency of their occurrence and to delineate their genomic structures through bioinformatics, by using BLASTN to search for sequence complementarities between transcribed gene sequences in the public databases. We were able to place a lower boundary on this estimate of approximately one thousand OATs in the human genome.

Materials and methods
The RefSeq database (Pruitt et al., 2001), which contains annotated mRNA sequences for 11 015 different human genes (at January 2001), was used. These are high quality gene predictions that use a combination of the scientific literature, expressed sequence tag (EST) sequences and automatic predictions of the locations of introns and exons. We downloaded the complete processed mRNA sequences for all genes (ftp://ftp.ncbi.nlm.nih.gov/refseq/). These sequences include the coding regions as well as the 5′ and 3′ untranslated regions (UTRs) of each gene. The BLASTN program (version 2.0.13, Altschul et al., 1990) was then used to compare each sequence in RefSeq to the complementary strand of all the remaining genes. This locates pairs of genes that have, in principle, the ability to form stretches of double-stranded RNA. The threshold E-value was set to 10⁻⁸ to exclude weak matches. This yielded a collection of 1221 high scoring pairs (HSPs). These included matches due to the presence of repeated sequences (e.g. ALU repeats in the UTRs), which were filtered manually using RepeatMasker (A.F.A. Smith and P. Green, RepeatMasker at http://ftp.genome.washington.edu/RM/RepeatMasker.html). The remaining pairs of sequences were then checked using the LocusLink records from RefSeq and the UCSC human genome browser (http://genome.ucsc.edu/) to locate pairs of overlapping genes that map to the same chromosomal location.
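The screening logic described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the `Locus` and `Hsp` records, the function names and the toy transcript identifiers are all hypothetical, the BLASTN step itself is assumed to have been run already, and the manual repeat filtering is omitted. The sketch only shows the three ideas in the text: taking the complementary strand, applying the 10⁻⁸ E-value threshold, and retaining pairs that map to opposite strands of the same chromosomal location.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable

E_VALUE_CUTOFF = 1e-8  # threshold quoted in the text

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand of a DNA sequence, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass(frozen=True)
class Locus:
    """Hypothetical genomic placement of a transcript (1-based, inclusive)."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


@dataclass(frozen=True)
class Hsp:
    """Hypothetical record for one sense/antisense BLASTN match."""
    query: str
    subject: str
    evalue: float


def genomic_overlap(a: Locus, b: Locus) -> int:
    """Overlap in nucleotides for loci on opposite strands of one chromosome."""
    if a.chrom != b.chrom or a.strand == b.strand:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def same_locus_pairs(
    hsps: Iterable[Hsp], loci: Dict[str, Locus]
) -> Dict[FrozenSet[str], int]:
    """Keep strong matches whose two transcripts overlap at the same locus."""
    pairs: Dict[FrozenSet[str], int] = {}
    for hsp in hsps:
        if hsp.query == hsp.subject or hsp.evalue > E_VALUE_CUTOFF:
            continue  # self-match or weak match
        if hsp.query not in loci or hsp.subject not in loci:
            continue  # transcript not placed on the genome
        overlap = genomic_overlap(loci[hsp.query], loci[hsp.subject])
        if overlap > 0:
            # frozenset makes (A, B) and (B, A) count as one pair
            pairs[frozenset((hsp.query, hsp.subject))] = overlap
    return pairs


# Toy data with invented identifiers and coordinates.
loci = {
    "TX_A": Locus("chr1", 100, 500, "+"),
    "TX_B": Locus("chr1", 450, 900, "-"),
    "TX_C": Locus("chr2", 100, 500, "-"),
}
hsps = [
    Hsp("TX_A", "TX_B", 1e-30),  # same locus, opposite strands: kept
    Hsp("TX_B", "TX_A", 1e-30),  # reciprocal hit: merged with the above
    Hsp("TX_A", "TX_C", 1e-30),  # complementary in trans only: dropped
    Hsp("TX_B", "TX_C", 1e-3),   # weaker than the cutoff: dropped
]
pairs = same_locus_pairs(hsps, loci)
print(pairs)
```

In this toy example only the TX_A/TX_B pair survives, which mirrors how matches caused by sequence similarity elsewhere in the genome are discarded by the same-locus check.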
In a second series of experiments, the sequences of known imprinted genes from human and mouse were examined for complementary matches against corresponding databases of EST sequences using the Gene2est server at http://www.woody.embl-heidelberg.de/gene2est (Gemund et al., 2001). The list of mouse and human imprinted genes was taken from the Genomic Imprinting Website (http://www.geneimprint.com/). The Gene2est server produces a BLAST output, which was imported into Artemis for visualization of results (Rutherford et al., 2000). In order to check the validity of EST 'hits' to the complementary strand, mouse RefSeq was compared by BLAST against mouse EST sequences from GenBank 124 (June 2001).

Results
The initial 1221 HSPs from the BLASTN searches were reduced to 56 pairs of overlapping genes as described in Materials and Methods (Table 1). As expected under an assumption of random distribution, a large proportion of the transcripts map to the two largest chromosomes, 1 and 2. The majority of overlaps are between the 3′ UTRs of the transcripts (Table 2), with a smaller number located in the 5′ UTRs, or between the 5′ UTR of one transcript and the 3′ UTR of another. The overlaps typically extend over 50-200 nucleotides, in some cases involving the coding region of one transcript (Table 1 and Figure 1). The transcripts identified in our search encode proteins with heterogeneous functions in DNA synthesis, cell cycle control and developmental regulation. This diversity might suggest that the occurrence of a DNA sequence overlap between pairs of protein-coding genes is incidental to their genomic location and structure, and of no mechanistic significance.
Table 1. A list of overlapping transcripts reported in chromosomal order. The length of the overlap between each pair of transcripts is denoted in nucleotides. The type of overlap is denoted as: between the 3′ UTRs (3′), between the 5′ UTRs (5′), or the 5′ UTR of one transcript and 3′ UTR of the other (3′/5′). If a coding region is involved, it is marked cds. ** denotes a reviewed RefSeq sequence which has been manually processed; * indicates a provisional sequence on which some initial quality checking has been done; the remaining cases correspond to automatically generated predicted records which are validated by cDNA or EST data, and/or closely related homologous sequences.
However, some of the overlaps detected by our search have previously been reported in the literature, in either human or other species, and include functional studies that support their mechanistic significance. For example, a 1.5 kb OAT to basic fibroblast growth factor (bFGF, FGF2) has been reported in the oocytes of Xenopus laevis, and the human homologue has been cloned and mapped to the long arm of chromosome 4. In X. laevis, the region of complementarity extends through both the coding region and the 3′ UTR of FGF2, whereas, in the human and rat homologues, complementarity extends only to the 3′ UTR (Figure 1). Expression levels of both the sense and antisense transcripts have been studied to investigate the possibility of antisense regulation of the sense transcript (Li et al., 1996). The developmental pattern of expression of the OAT was found to be inversely correlated to that of the sense transcript in developing rat brain. Expression was also found to be age-dependent, with sense expression increasing postnatally and antisense expression decreasing (Li et al., 1996). Subsequently, it was shown that FGF2 protein levels are directly influenced by the level of the OAT in mammalian cells (Li et al., 2000), suggesting post-transcriptional regulation of FGF2 by the OAT. It has also been shown that this OAT encodes a functional protein with MutT-related enzymatic activity in the rat, and it was noted that the human homologue also contains an open reading frame (Li et al., 1997).
Intron 3 of the mouse thymidine kinase (tk) gene has been reported to contain an antisense promoter, and the associated OAT is thought to regulate expression of the TK protein-encoding sense transcript in mouse fibroblasts (Sutterluety et al., 1998). This salvage pathway enzyme is expressed at low levels in resting mammalian cells, but levels increase dramatically when cells enter S phase. A well-characterized transcriptional regulatory mechanism is involved, and a post-transcriptional mechanism is also suspected. The correlation of TK protein repression with OAT expression supports a role for the OAT in regulating TK expression. The 5′ UTR and part of the coding sequence of the human TK homologue found in RefSeq are complementary to a predicted gene of unknown function, indicating the existence of a human homologue of the mouse OAT.
We also found an overlap of 177 nucleotides between the 3′ UTRs of MSH6 and VIT1 at 2p16 (Figure 1), as previously reported. It was suggested that the overlap allows regulation of MSH6 by VIT1 (Le Poole et al., 2000).
Imprinted gene transcripts coding for functional proteins are found in the RefSeq database, but non-coding OATs associated with them are not. For example, the human COPG2 gene has a non-coding OAT at the 3′ end, which was not found in our search. However, COPG2 also overlaps with the imprinted, protein-coding MEST gene over 52 nucleotides at their 3′ ends, as previously reported (Blagitko et al., 1999), and as successfully identified by our search.
In an attempt to identify novel non-coding OATs at imprinted gene loci, a second set of experiments involving a BLASTN search of all known mouse imprinted genes against mouse EST databases found that the majority of imprinted genes had ESTs aligned to both strands. The reverse complement of all gene transcripts in the mouse RefSeq database was also used in a BLASTN search against the same database of mouse ESTs to determine whether the high number of 'hits' at imprinted loci occurred as an artefact of the EST database, due to submission of DNA sequence from both strands of cloned, double-stranded cDNA. Out of 7340 entries in mouse RefSeq, 6489 transcripts received hits to the complementary strand. The correct transcriptional orientation of the ESTs aligned to their respective genomic regions could not be assigned unambiguously and therefore the use of EST databases to search for non-coding OATs is unreliable. However, it may be possible in the future to use EST databases consisting exclusively of directionally cloned and sequenced cDNAs to produce an accurate estimate of the frequency of non-coding OATs.
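The scale of the problem is easier to see as a proportion. The short Python sketch below simply restates the two counts reported above as a fraction; the function name is hypothetical and nothing beyond those two counts is assumed.

```python
def complementary_hit_fraction(transcripts_with_hits: int, total: int) -> float:
    """Fraction of transcripts with at least one EST hit on the complementary strand."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= transcripts_with_hits <= total:
        raise ValueError("hit count must lie between 0 and total")
    return transcripts_with_hits / total


# Counts reported in the text for mouse RefSeq versus mouse ESTs.
fraction = complementary_hit_fraction(6489, 7340)
print(f"{fraction:.1%}")  # prints 88.4%
```

A figure this close to the whole database is more plausibly explained by ESTs of uncertain orientation than by genuine antisense transcription at nearly nine in ten genes, which is the reasoning behind treating such searches as unreliable.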
In a further experiment, we assessed the representation of previously confirmed OATs at imprinted mouse and human gene loci in the public EST databases. We found, unsurprisingly, that the databases are biased towards highly expressed transcripts, which is problematic because some OATs are expressed at low levels and may be tissue-specific (Moore et al., 1997). The mouse insulin-like growth factor 2 (Igf2) gene is an extensively studied imprinted gene with an OAT (Igf2as) at the 5′ end (Moore et al., 1997; Okutsu et al., 2000). This OAT was not detected in our searches, probably due to its low expression level (Figure 2). Moreover, the 5′ ends of genes are underrepresented in EST databases because reverse transcription of mRNA is frequently initiated from the 3′ end using a poly(T) primer. Therefore, in BLASTN searches against ESTs, more 'hits' are expected at the 3′ end of the gene (Figure 2). It is also evident that there are many 'hits' on the opposite DNA strand to that predicted from the structure of the Igf2as gene, further undermining the reliability of such EST-based searches.
M. E. Fahey et al.

Discussion
We found 56 pairs of overlapping transcripts among the 11 015 protein-coding transcripts in RefSeq. On the conservative assumption that RefSeq contains one quarter of all protein-coding transcripts, and given that both members of a pair must be present in RefSeq for the pair to be detected, we can estimate that there are 4 × 4 × 56 = 896 OAT pairs in the human genome. However, this is likely to be an underestimate because RefSeq does not contain non-coding transcripts, which occur frequently at imprinted loci, and also at non-imprinted loci, but at an unknown frequency. In this study, we show that EST data are unsuitable for investigating non-coding OATs due to the biased nature of the current databases. An accurate estimation of OAT pairs consisting of one or two non-coding transcripts will require either laboratory-based approaches or customized gene expression databases that circumvent the problems associated with the current EST databases.
Figure 2. Schematic of mouse Igf2 transcripts. Exons are shown as boxes and ESTs aligning to both strands are marked by arrows. EST datasets are biased towards 3′ ends of genes and their orientation with respect to the genomic locus is uncertain. Such bias inhibited the unambiguous validation of non-coding antisense transcripts. Igf2 is one of the most extensively studied imprinted genes with a well-characterised antisense transcript at the 5′ end which was not detected by a search of EST databases. More than 300 ESTs matched both strands at the 3′ end of the gene, as indicated by the thickness of the arrows, with slightly more aligning to the top strand (coding for Igf2) than the lower strand. Relatively few aligned to the 5′ end of the gene.
During the preparation of this manuscript, a list of potential antisense transcripts in the human genome was reported by Lehner et al. (2002). They used RefSeq, as we did, but also used a compilation of vertebrate mRNAs extracted from the EMBL nucleotide sequence database. They reported a total of 87 pairs of genes, 45 of which are in common with our list. Of the 42 gene pairs reported by Lehner et al. (2002) that we did not find, 18 include a sequence from the EMBL compilation that is not represented in RefSeq, and which we did not include in our analysis. These overlapping pairs are of variable and unknown validity but do include some biologically interesting genes. The remaining gene pairs were from a more recent version of RefSeq than that used in our analysis. We report the following 11 unpublished OATs, excluded by Lehner et al. (2002) due to the presence of repeat sequences: TPR/PRG4, LOC51611/FLJ20139, KIAA0764/FLJ10624, CRIPT/LOC51088, VRK2/FLJ10335, HT009/IDI1, APAF1/LOC56899, MDDX28/FLJ20399, LOC51031/FLJ10581, COL9A3/TCFL5, FLJ10508/MCM3AP. However, in all of these pairs, the repeats are not the basis of the complementary pairing between the transcripts. Therefore, as the pairs are transcribed from the same locus, their inclusion is valid.
The functional significance of the OATs described herein is largely unknown. However, some of the pairs that we found have been described previously and have been studied functionally (Le Poole et al., 2000; Li et al., 1996, 1997; Sutterluety et al., 1998). Twenty-three of the 56 OAT pairs that we describe involve transcripts containing an open reading frame encoding a protein of unknown function. Further characterization of the transcriptome and proteome is required to test the functionality of such pairs. Expression levels of the two members of an OAT pair might be expected to be inversely related to one another, as is the case for the FGF2 locus. Such further studies may clarify the involvement of such overlapping transcripts in gene regulation.
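The counts quoted earlier in this Discussion can be checked with a few lines of Python. This is a sketch of one reading of the extrapolation, not a statement of how the estimate was originally computed: it assumes that each member of a pair is present in RefSeq independently with probability one quarter, so that a pair is detected with probability 1/16. The variable and function names are hypothetical; every number comes from the text.

```python
from fractions import Fraction

OBSERVED_PAIRS = 56               # overlapping pairs found in RefSeq
REFSEQ_FRACTION = Fraction(1, 4)  # assumed share of all protein-coding transcripts


def extrapolate_pairs(observed: int, sampled_fraction: Fraction) -> Fraction:
    """Scale an observed pair count up to the full transcript set.

    A pair is only detectable when both members are sampled, so under the
    independence assumption the detection probability is the fraction squared.
    """
    detection_probability = sampled_fraction ** 2
    return observed / detection_probability


genome_wide_minimum = extrapolate_pairs(OBSERVED_PAIRS, REFSEQ_FRACTION)
print(genome_wide_minimum)  # prints 896

# Comparison with the list of Lehner et al. (2002), using counts from the text.
LEHNER_TOTAL = 87
SHARED = 45
LEHNER_ONLY_WITH_EMBL_SEQUENCE = 18

lehner_only = LEHNER_TOTAL - SHARED                                   # 42
lehner_only_newer_refseq = lehner_only - LEHNER_ONLY_WITH_EMBL_SEQUENCE  # 24
unique_to_this_study = OBSERVED_PAIRS - SHARED                        # 11
print(lehner_only, lehner_only_newer_refseq, unique_to_this_study)
```

The arithmetic is internally consistent: the 11 pairs unique to this study match the 11 previously unpublished OATs listed above, and the 42 pairs found only by Lehner et al. split into 18 involving EMBL-only sequences and 24 from a later RefSeq release.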
Although we cannot exclude the possibility that some of the overlaps may be incidental and of no functional significance, the existence of proteins specific for double-stranded RNA (dsRNA) supports the possibility that OATs constitute part of a significant gene regulatory mechanism. For example, DRADA, a member of the dsRNA-specific adenosine deaminase family of modifying enzymes, is a ubiquitously expressed nuclear enzyme capable of converting adenosine residues in dsRNA molecules to inosines, thereby destabilising the molecule (Kim and Nishikura, 1993). OATs forming dsRNA molecules could also be targets for dsRNA-specific RNases, leading to mRNA degradation. Functionally significant overlapping antisense transcripts have been reported in prokaryotic cells and are implicated in post-transcriptional regulatory mechanisms (Wagner et al., 2002). Regulatory OATs are also present in eukaryotes, indicating a widespread role for antisense-mediated gene regulation (Vanhée-Brossollet and Vaquero, 1998). With the emergence of complete genome sequence databases, a comparative analysis to test for interspecies conservation of OAT pairs could offer further insights into the prevalence and functional significance of antisense transcription. For example, the structure of the FGF2 gene coding transcript and its corresponding OAT are conserved between human, rat, chicken and frog. This example provides a starting point upon which to build a comprehensive database of OATs. Moreover, as the annotation of genomes becomes more complete, and methods to detect and characterize non-coding transcripts improve, a more complete database of OATs comprising both coding and validated non-coding OATs may be compiled.