Maternally and Paternally Silenced Imprinted Genes Differ in Their Intron Content

Imprinted genes exhibit silencing of one of the parental alleles during embryonic development. In a previous study imprinted genes were found to have reduced intron content relative to a non-imprinted control set (Hurst et al., 1996). However, due to the small sample size, it was not possible to analyse the source of this effect. Here, we re-investigate this observation using larger datasets of imprinted and control (non-imprinted) genes that allow us to consider mouse and human, and maternally and paternally silenced, imprinted genes separately. We find that, in the human and mouse, there is reduced intron content in the maternally silenced imprinted genes relative to a non-imprinted control set. Among imprinted genes, a strong bias is also observed in the distribution of intronless genes, which are found exclusively in the maternally silenced dataset. The paternally silenced dataset in the human is not different to the control set; however, the mouse paternally silenced dataset has more introns than the control group. A direct comparison of mouse maternally and paternally silenced imprinted gene datasets shows that they differ significantly with respect to a variety of intron-related parameters. We discuss a variety of possible explanations for our observations.


Introduction
The intron content of a gene or genome is composed of the combined effects of intron length and intron number. Intron length varies greatly between species and between regions of a specific genome, e.g. the average length of an intron in the rat is greater than 1000 bp, whereas in the worm (C. elegans) it is less than 500 bp (Deutsch and Long, 1999). Within a specific genome, intron length is correlated with various genomic parameters, e.g. there is, on average, three times more intron DNA in regions of low, compared to high, GC content. This statistic varies depending on the species, ranging from 2.1 in the mouse to 4.0 in the ox (Duret et al., 1995). Recombination rate (which is related to GC content) also correlates with intron length, with longer introns found in regions of low recombination (Carvalho and Clarke, 1999). In addition to genomic location, the functional characteristics of genes may influence intron content, e.g. highly expressed genes in the human have introns that are up to 14 times shorter than those found in genes expressed at a lower level, a finding that may be related to selection for increased transcriptional efficiency of highly expressed genes (Castillo-Davis et al., 2002).
However, there is relatively little understanding of the factors that influence intron number. This is due largely to ignorance of the mechanisms by which new introns are generated in individual genomes, and the selective forces that govern their spread within, or removal from, the population (Rogozin et al., 2003). Introns that have been fixed may evolve a variety of secondary functions associated with aspects of gene regulation, e.g. alternative splicing expands the repertoire of protein isoforms that can be produced from a single gene locus. However, new introns may initially impose a cost due to the occurrence of mutations that disrupt protein structure and function. Recently, Lynch took a population genetics approach and proposed a model of intron fixation in which the only relevant evolutionary forces considered were mutation and random genetic drift (Lynch, 2002). In this scheme, intron-containing alleles initially are subject to weak negative selection because they are more likely than intronless alleles to mutate to harmful variants, e.g. elimination of a splice recognition site may disrupt an open reading frame. If a species has a very large population size, e.g. prokaryotes, such a weak mutation pressure may be sufficient to prevent fixation of an intron. However, species with a low population size, such as large multicellular organisms, may allow fixation of an intron at the neutral rate, i.e. due to genetic drift. Such arguments may explain the increased 'intron load' observed in large organisms with small effective population sizes, relative to prokaryotes or simple eukaryotes (Lynch, 2002).
Genomic imprinting is a mechanism that causes monoallelic expression of a small number of genes with important functions during mammalian development (John and Surani, 1996). It has been proposed that the evolution of imprinting can be explained by the widespread occurrence of polyandry, which results in reduced relatedness of paternally, relative to maternally, derived alleles at embryonic loci that influence maternal investment, and can be understood as a form of intragenomic conflict (Moore and Haig, 1991). Previously, it was shown that, compared to a non-imprinted control set, imprinted genes have reduced intron content, having both fewer and shorter introns (Hurst et al., 1996). This observation was interpreted in the context of the parental conflict theory as evidence for either selection to increase the transcription rate at imprinted loci (see subsequent work by Castillo-Davis et al., 2002) or, more speculatively, to prevent degradation of mRNA due to putative mechanisms of 'splicing interference' (which would provide an additional negative selection pressure against intron fixation at imprinted loci; see Lynch, 2002). However, the small number of imprinted genes available to Hurst et al. (1996) did not allow firm conclusions regarding the source of the observed effect. We have assembled larger datasets of imprinted genes in both the mouse and human that allow us to examine their intron composition in detail.

Control and imprinted gene datasets
Human and mouse RefSeq datasets were used as non-imprinted controls (Pruitt and Maglott, 2001). To avoid subjective bias in imprinted gene selection, we used the imprinted gene list provided by the on-line Catalogue of Imprinted Genes and Parent-of-Origin Effects Database (hereafter called 'Human Imprinted Gene List') (Morison et al., 2001; http://cancer.otago.ac.nz/IGC/Web/home.html) as an independent dataset of human imprinted genes. For the mouse, we used the online resource at the MRC Mammalian Genetics Unit (hereafter called 'Mouse Imprinted Gene List') (http://www.mgu.har.mrc.ac.uk). Genes with well-characterized intron structure were extracted from the lists to produce our human and mouse Modified Imprinted Gene Lists. We removed non-protein-coding genes from the modified dataset and considered protein-coding imprinted genes separately because, apart from human LIT1 and H19, and mouse Igf2as, Peg13, Mirg, Air and H19, many non-protein-coding imprinted genes have poorly characterized gene structures. Also, the RefSeq control genes consist entirely of protein-coding genes, and we cannot exclude a systematic difference in intron content between coding and non-coding genes. Human IGF2as was retained because a short open reading frame has been reported (Okutsu et al., 2000). In addition, the following variations of the Modified Imprinted Gene Lists were also analysed:
1. Removal of genes with controversial or polymorphic imprinting status, i.e. human GRB10 (Blagitko et al., 2000), IGF2R (Kalscheuer et al., 1993) and COPG2 (Yamasaki et al., 2000).
2. Removal of intronless genes from both the RefSeq control and the Modified Imprinted Gene Lists.
3. Removal of SNRPN from the Modified Imprinted Gene List, as multiple small nucleolar RNA (snoRNA) molecules are encoded within some of its introns (Runte et al., 2001). Clearly, the evolutionary forces that influence the intronic structure of this gene are likely to be very different to those affecting other genes, particularly under the 'nearly neutral' model of intron evolution proposed by Lynch (2002).

Maternally silenced imprinted genes have less intron content than non-imprinted control genes
We used the assembled datasets of human and mouse maternally and paternally silenced imprinted genes (Table 1) and compared them to human and mouse RefSeq control datasets for a variety of parameters related to intron content, i.e. intron size and intron number. Using our Modified Imprinted Gene Lists, we find that both maternally and paternally silenced imprinted genes differ significantly from the control set with respect to intron content, but in different ways. Analysis of the human datasets, when non-protein-coding genes are included, shows that the total intron size of maternally silenced genes is significantly less than that of controls when SNRPN is removed (Table 2). Controlling for exon content, maternally silenced genes also contain significantly less intron sequence than the control dataset, when intronless genes are included. This finding is retained when non-coding genes and SNRPN are removed (Table 3). Removal of SNRPN also has a strong effect on average intron number. Similar results were obtained for maternally silenced genes in the mouse; however, in general, they are less sensitive to Snrpn inclusion and significance is lost upon removal of intronless genes. Specifically, when controlled for exon content, maternally silenced genes have less intron content than controls. Also, average intron size is half that of the control set, but only when intronless genes are included. A similar trend of reduced intron size was observed in the human maternally silenced dataset, but was not statistically significant. Intron size co-varies with gene expression level (Castillo-Davis et al., 2002). However, we found no difference between mouse or human maternally and paternally silenced imprinted datasets with respect to EST frequency in the public databases (mean EST count: human, paternally silenced, 162 ± 92; maternally silenced, 161 ± 103; Mann-Whitney, p = 0.845; mouse, paternally silenced, 92 ± 87; maternally silenced, 135 ± 99; Mann-Whitney, p = 0.07).

Mouse paternally silenced imprinted genes have higher intron content than non-imprinted control genes
Solely in the mouse, there is a significantly increased number of introns in the paternally silenced dataset, which has almost three times as many introns per kilobase of exon as the control set. This is a highly robust result, unaffected by removal of non-coding genes, intronless genes, Snrpn (Tables 2 and 3) or imprinted genes with controversial imprinting status. Also, total intron size and average intron size of mouse paternally silenced genes are higher than those of control genes, but this effect is lost when intronless genes are removed from the control dataset. A similar trend is observed in the human dataset, but fails to reach significance (Tables 3 and 4).

Imprinted intronless genes are found exclusively in the maternally silenced dataset in the human and mouse
In the human and mouse, the maternally silenced gene datasets have a significantly higher proportion of intronless genes than the control dataset, whereas there are no intronless genes in the mouse and human paternally silenced datasets. Analyses were repeated following the removal of intronless genes from all datasets. Following this data manipulation, statistical difference with the control set was retained for the number of introns per kilobase of exon in the mouse paternally silenced dataset, and total intron size per kilobase of exon in the human maternally silenced dataset (Table 3). The majority of imprinted intronless genes are clustered at human chromosome 15q11-q13, and mouse chromosome 7, 28-29 cM (Table 1). There is a high degree of conservation of the mouse and human intronless imprinted gene orthologues with respect to chromosome map position, gene structure and imprinting status (Table 5). The exceptions are Frat3, which is a recent addition to the mouse Prader-Willi syndrome region adjacent to Mkrn3 (Chai et al., 2001), and U2af1-rs1, which is inserted in an intron of the biallelically expressed Murr1 (Nabetani et al., 1997). Inspection of Table 1 suggests that there is no correlation between the distribution of intronless genes and local recombination rate, neither is there an apparent correlation between maternal or paternal silencing of intron-containing imprinted genes and chromosome map position or local recombination rate.

Maternally and paternally silenced imprinted genes have different intron contents
To determine whether maternally and paternally silenced imprinted genes differ significantly from one another, the two datasets were compared directly for all parameters relating to intron content. None of the comparisons reached statistical significance in the human; however, in the mouse, all parameters, except those incorporating exon size, were significantly different (Table 6).

Discussion
Using larger datasets, we have confirmed the finding of Hurst et al. (1996) that imprinted genes are unusual with respect to their intron content. Moreover, because of the relatively large number of new imprinted genes that were available for the current study, we were able to analyse mouse and human and maternally and paternally silenced datasets separately. Our major finding is that oppositely imprinted genes differ significantly in their intron content, and that this difference is directional: maternally silenced genes tend to have reduced intron content compared to controls, whereas paternally silenced genes tend to have increased intron content.
The average values for each parameter are shown for both human and mouse imprinted and non-imprinted genes. Non-protein-coding imprinted genes are included in this analysis. (a) indicates analyses from which intronless genes were excluded. SNRPN was removed from the analysis, alone and in combination with intronless genes. Average intron sizes and average exon sizes were calculated based on the average of averages for each gene within a dataset. Imprinted genes are divided into those silenced on either the maternally or paternally inherited chromosome, and a Mann-Whitney U test was employed to test the null hypothesis that imprinted and non-imprinted genes are similar with respect to gene structure. A Fisher's exact test was used to compare the number of intronless genes observed in the groups.
Intron content of a gene is influenced both by intron size and intron number. Can we specify whether our observations of reduced and increased intron content, respectively, of maternally and paternally silenced genes are due to effects on intron size, intron number, or both? Comparison of the maternally or paternally silenced genes with the control dataset detected differences in both of these parameters in the mouse and human, but the differences were not always statistically significant in both species. Indeed, there is evidence for species-specific effects because (following removal of SNRPN), intron number in maternally silenced genes is reduced in both human and mouse, whereas in the paternally silenced datasets, intron number is increased in mouse but not in human. In both species, average intron length is reduced in the maternally silenced datasets; however, the difference is significant only in the mouse, and depends on the inclusion of intronless genes.

No. introns/kb exon: 3.6 2.7 3.7 3.4 3.6 9.5 0.000
No. introns/kb exon (a): 3.9 3.2 3.7 3.9 5.3 9.5 0.001
Remove SNRPN: 3.6 2.4 0.023 3.7 3.4 3.5 9.5 0.000
Remove SNRPN (a): 3.9 2.9 3.7 3.9 5.2 9.5 0.001
Data reported as in Table 2. Non-protein-coding genes (human LIT1 and H19, and mouse Igf2as, Peg13, Mirg and H19) were excluded from the analysis.

Therefore, we cannot determine unambiguously, from the current datasets, the source of altered intron content of imprinted genes relative to controls. However, the data in Table 4 which, unlike those in Tables 2 and 3, are not compiled on a 'per gene' basis, show more clearly that (following exclusion of intronless genes and SNRPN) there are systematic differences across species between maternally and paternally silenced genes for both intron number and intron size. Direct comparison of mouse maternally and paternally silenced datasets detected significant differences between them for a range of intron-related parameters. A similar analysis of human imprinted genes, however, did not. This may be due to the lower number of imprinted genes in our Modified Human Imprinted Gene List. Trends in the data are observed more clearly in Table 4. A marked reduction is observed in both intron number and intron length, in both species and across all parameters. These differences may achieve statistical significance in future studies, subject to the discovery or full structural characterization of further imprinted genes. For example, total intron size of the human paternally silenced dataset is larger than that of the maternally silenced and control datasets, but the differences may not have reached significance because of the relatively small number (14) of genes in that dataset.

Average value for each parameter is reported. Data exclude SNRPN, intronless and non-coding genes. Average intron length and average exon length are not averages of averages but one average taken from all individual introns and exons in each group. Total intron length/total exon length is calculated from the two average values reported for both parameters previously. Number of introns per kilobase of exon is a ratio of number of introns and total exon length.
Consistent trends of reduced intron content (both size and number) of maternally silenced genes are observed in both species.
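
The distinction drawn in the table legends between an 'average of averages' (per-gene basis) and a single pooled average can be made concrete with a small sketch. The function names and the intron lengths below are hypothetical and serve only to illustrate how the two summaries can diverge when genes differ in intron number.

```python
from statistics import mean

def average_of_averages(introns_by_gene):
    """Per-gene basis: average each gene's introns first, then average the gene means.

    Genes without introns are skipped here; whether intronless genes should
    instead contribute a zero is an analysis choice (assumption of this sketch).
    """
    per_gene_means = [mean(lengths) for lengths in introns_by_gene if lengths]
    return mean(per_gene_means)

def pooled_average(introns_by_gene):
    """Pooled basis: one average over every individual intron in the group."""
    all_introns = [length for lengths in introns_by_gene for length in lengths]
    return mean(all_introns)

# Hypothetical intron lengths (bp) for two genes.
example = [[100, 300], [1000, 1000, 1000, 1000]]
per_gene = average_of_averages(example)  # each gene counts once
pooled = pooled_average(example)         # each intron counts once
```

On a per-gene basis every gene carries equal weight, whereas the pooled average is dominated by genes with many introns, which is why the two kinds of table need not agree.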

Table 5. Conservation of imprinted intronless genes between mouse and human (columns: Human, Mouse)

The choice of control and imprinted genes for inclusion in our datasets is not trivial. Known imprinted genes were removed from the RefSeq databases, but these datasets may nevertheless contain unidentified imprinted genes. However, the number of imprinted genes in the genome is probably quite small (Moore, 2001), and unlikely to significantly bias the large control datasets used in this study. Moreover, any such bias would tend to produce a false negative, but not a false positive, result, i.e. it would increase the similarity between the imprinted and control datasets. The selection of imprinted gene datasets is also problematic because, given the relatively small number of imprinted genes, the inclusion or exclusion of a single gene might have a significant effect on the mean parameter values of the dataset. For example, the maternally silenced SNRPN gene has a significant effect on several parameters because it contains a large number of introns that encode snoRNAs. This gene is also problematic because the mouse orthologue is not fully characterised. We therefore re-analysed all parameters following the exclusion of SNRPN, and find that while it has a significant effect on some parameters in the human, the observation of reduced intron content of maternally silenced genes compared to the control set is retained irrespective of SNRPN inclusion or exclusion. The inclusion of a single non-coding gene in the human maternally silenced dataset, with exclusion of SNRPN, results in a further two parameters becoming significantly different from controls: total intron size and the number of introns per kilobase of exon. As the human imprinted dataset becomes more complete, these findings may become more robust. Mouse data in general are less sensitive to the exclusion or inclusion of one or more genes.
In addition, there are a variety of methods of determining whether a gene is subject to imprinting, not all of which lead to similar conclusions. Therefore, the imprinted status of some genes is controversial. Our datasets contain three such genes (human GRB10, IGF2R and COPG2); however, their removal does not affect our conclusions.
We set out to confirm the previously reported finding that imprinted genes are unusual with respect to intron content compared to non-imprinted genes (Hurst et al., 1996). Initially, in the present study, we found that neither human nor mouse datasets comprising both maternally and paternally silenced imprinted genes were different to our control gene datasets. Because the imprinted gene dataset used in the original study of Hurst et al. (1996) was small and somewhat biased towards maternally silenced genes (9 out of 14; not including mouse Mas and human CG, which are now thought not to be imprinted), we hypothesised that the observation of reduced intron content of imprinted genes was due primarily to the relatively high maternally silenced gene content. We therefore analysed maternally and paternally silenced genes separately. We applied a Bonferroni correction, using an α-level of 0.012 instead of 0.05, to control for these two separate tests and also to account for testing multiple parameters (intron size and intron number) (Perneger, 1998). We note that in the direct comparison of mouse maternally and paternally silenced gene datasets, five of eight parameters would withstand a severe Bonferroni correction.
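
The corrected threshold quoted above can be checked with simple arithmetic. The sketch below assumes that the correction divides 0.05 by four comparisons (two imprinted datasets multiplied by two parameter classes); that count is a reading of the wording above rather than something stated explicitly, and the function names are illustrative.

```python
def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests

def is_significant(p_value: float, family_alpha: float, n_tests: int) -> bool:
    """True if p_value falls below the Bonferroni-adjusted threshold."""
    return p_value < bonferroni_alpha(family_alpha, n_tests)

# Assumption: 2 datasets (maternally, paternally silenced) x 2 parameter
# classes (intron size, intron number) = 4 comparisons.
threshold = bonferroni_alpha(0.05, 4)
```

Under this assumption the threshold evaluates to 0.0125, which is consistent with the quoted α-level of 0.012 if that figure was truncated rather than rounded.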
We considered and excluded a number of factors that might, in principle, explain our observations. Castillo-Davis et al. (2002) showed that, in the human and in C. elegans, introns of highly expressed genes are 14-fold and two-fold shorter, respectively, than in genes with low expression, although they detected no difference in intron density. However, our observation of reduced intron size in maternally silenced genes of the mouse, with a similar trend in the human, is not explained by differences in imprinted gene expression levels, because EST counts were similar for both maternally and paternally silenced gene datasets.
There are complex, and probably overlapping, effects of a variety of parameters, such as GC and transposon content, and genetic recombination rate, on intron content (Hurst et al., 1999; Duret, 2001). Human maternally and paternally imprinted gene regions differ in GC and transposon content (Greally, 2002), and some imprinted regions exhibit higher levels of recombination in the male, relative to the female, germline (Paldi et al., 1995; Robinson and Lalande, 1995). However, none of these parameters appears to provide a convincing explanation for our observations. For example, the paternally silenced subgroup has a relatively high GC content (Greally, 2002), which is an expected correlate of short introns, contrary to our finding of a trend towards reduced intron length in both mouse and human maternally silenced datasets. Moreover, we found no correlation between recombination rate and intron content in the human imprinted gene dataset. Indeed, inspection of Table 1 indicates that, contrary to expectation, among imprinted genes, intronless genes map to regions of relatively low recombination. An additional, recent analysis of recombination in imprinted regions found that imprinted regions have high recombination rates compared to non-imprinted regions of the genome; however, there was no evidence that maternally and paternally silenced imprinted genes are different with respect to local recombination rate (Lercher and Hurst, 2003).
We could not explain our finding of reduced intron content of maternally silenced genes in terms of gene function or genomic location; we therefore considered population genetic arguments that might, in principle, explain our observations. Lynch (2002) has proposed that the phylogenetic distribution of intron density may be explained by considering introns as weakly deleterious, and therefore subject to purifying selection or random genetic drift, depending on species population size. However, weak purifying selection against introns may be countered by genetic hitch-hiking. The rate at which a linked, beneficial mutation approaches fixation during a selective sweep would influence the probability of recombination between an intron-containing allele and an intronless variant. We expect rapidly evolving loci to fix introns more frequently than relatively slowly evolving loci, because there is less chance of recombination with an intronless variant during a selective sweep and less time for purifying selection to select between intronless and intron-containing variants. From these arguments, we tentatively propose that systematically different rates of evolution of maternally and paternally silenced imprinted genes (Mills and Moore, 2004) may provide an explanation for our observations.

Gene structure analysis
Mouse and human control gene datasets were obtained from the UCSC genome site (http://genome.ucsc.edu) (October 2003) and contained 20 248 and 16 883 full-length human and mouse transcript sequences, respectively, from the Refseq database (Pruitt and Maglott, 2001). Tables outlining the gene structure of the transcripts are available from an alignment of the mRNA sequence to the human draft sequence of June 2003 and mouse of February 2003. The RefSeq database is a curated, nonredundant database at the NCBI consisting of full-length sequences as currently described. The database aims to have one reference sequence for each transcript in the genome. Our control sets therefore represent a global, unbiased sample of mouse and human genes.
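
The intron-related parameters used throughout can be derived directly from exon coordinates of the kind found in such gene structure tables. The following is a minimal sketch, not the procedure actually used here: the function name, the dictionary keys and the half-open coordinate convention are assumptions, as is the choice to report an average intron size of zero for intronless transcripts.

```python
from typing import Dict, List, Tuple

def intron_metrics(exons: List[Tuple[int, int]]) -> Dict[str, float]:
    """Summarise the intron content of one transcript.

    `exons` holds half-open (start, end) genomic coordinates, one pair per
    exon; this coordinate convention is an assumption of the sketch.
    """
    ordered = sorted(exons)
    exon_lengths = [end - start for start, end in ordered]
    # An intron is the gap between the end of one exon and the start of the next.
    intron_lengths = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    total_exon = sum(exon_lengths)
    total_intron = sum(intron_lengths)
    n_introns = len(intron_lengths)
    return {
        "intron_number": n_introns,
        "total_intron_size": total_intron,
        # Assumption: an intronless transcript contributes an average of zero.
        "average_intron_size": total_intron / n_introns if n_introns else 0.0,
        "intron_to_exon_ratio": total_intron / total_exon,
        "introns_per_kb_exon": n_introns / (total_exon / 1000),
    }

# Hypothetical three-exon transcript: exons of 500 bp, introns of 1000 and 600 bp.
example = intron_metrics([(0, 500), (1500, 2000), (2600, 3100)])
```

Normalising by exon length, as in the last two entries, is what allows genes of very different coding size to be compared.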
A list of mouse and human imprinted genes was obtained from the MRC Mammalian Genetics Unit (http://www.mgu.har.mrc.ac.uk) and the Catalogue of Imprinted Genes and Parent-of-Origin Effects databases (Morison et al., 2001), respectively, and was supplemented by searching for new imprinted genes in PubMed. Imprinted genes were removed from the Refseq control datasets. Maternally and paternally silenced gene datasets were compared to non-imprinted genes using a nonparametric statistical test. The mean value of each parameter was used and, in cases where different transcript variants exist for an imprinted gene, the average values were calculated for each transcript and subsequently an overall average was taken. Analyses were carried out including and excluding intronless genes, genes whose imprinted status is controversial, and the SNURF-SNRPN snoRNA-containing transcript (Runte et al., 2001). A chi-squared test with Yates' correction was used to ascertain the significance of the number of imprinted genes without introns compared to the control gene dataset. Conservation of imprinting between mouse and human orthologues was also investigated. In most cases orthologue pairs are annotated. However, in cases where one is missing, sequence identity in the corresponding genome was searched using the BLAT program at UCSC, which is designed to find regions of high sequence identity.
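
For readers unfamiliar with the test applied to intronless gene counts, the following sketch shows a chi-squared test with Yates' continuity correction on a 2 × 2 table, using only the standard library. The function name and the counts in the example are hypothetical; they are not the counts from this study.

```python
import math
from typing import Tuple

def chi2_yates_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Chi-squared test with Yates' correction for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column must have a non-zero total")
    # Yates' correction shrinks |ad - bc| by n/2 (never below zero).
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    statistic = n * corrected ** 2 / denominator
    # For one degree of freedom the chi-squared tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: rows = imprinted / control, columns = intronless / with introns.
statistic, p_value = chi2_yates_2x2(10, 20, 30, 40)
```

With very small expected counts, as is typical of imprinted gene datasets, an exact test is generally the safer choice than this approximation.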
GNAS is a highly complex locus with respect to both transcript-specific transcription and imprinted expression patterns. The locus maps to chromosome 20 in the human and chromosome 2 in the mouse. It is composed of both sense and antisense transcripts, which are associated with alternative first exons. Nesp is a paternally silenced gene; Gnasxl, Nespas and Gnas exon 1a are maternally silenced; Gnas exon 1 is expressed from both alleles; however, there is evidence for tissue-specific imprinting (Yu et al., 1998; Liu et al., 2003). Holmes et al. (2003) further characterized this locus using the FANTOM mouse transcriptome (Okazaki et al., 2002) and identified new alternative transcripts, which they labelled F1-F12. Both spliced and unspliced variants were published with alternative 3′ untranslated regions. Using the FANTOM database (http://fantom.gsc.riken.go.jp/db) we took the genomic structure for each variant and averaged transcript length for Gnas exon 1a, Nesp and Gnasxl (clones D930047C10 and A930027G11; A230089C09 and D930020N02; C130027O20 and 533 041BM12).

Gene expression analysis
Expression levels of imprinted genes were estimated from expressed sequence tag (EST) abundance in the public databases. BLASTN (version 2.2.4) was used to compare imprinted gene transcript sequences to a database of 4 533 427 human EST sequences downloaded from NCBI [31] (August 2002). Threshold values were set to allow EST hits of > 400 nucleotides with > 95% identity to be accepted as matches. If identity exceeded 98%, sequence alignment of 100-400 nucleotides was also accepted. Non-coding genes and SNRPN were excluded from the analysis.
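
The acceptance thresholds described above can be written as a small filter over alignment results. This is a sketch only: the function names and the (aligned length, percent identity) input format are assumptions, and treating the 100-400 nucleotide range as inclusive at both ends is one reading of the text.

```python
from typing import Iterable, Tuple

def accept_est_hit(aligned_length: int, percent_identity: float) -> bool:
    """Apply the two-tier EST matching rule to a single alignment."""
    # Tier 1: long alignments (> 400 nt) need > 95% identity.
    if aligned_length > 400 and percent_identity > 95.0:
        return True
    # Tier 2: shorter alignments (100-400 nt, assumed inclusive) need > 98% identity.
    if 100 <= aligned_length <= 400 and percent_identity > 98.0:
        return True
    return False

def count_est_hits(hits: Iterable[Tuple[int, float]]) -> int:
    """Count accepted alignments for one transcript."""
    return sum(1 for length, identity in hits if accept_est_hit(length, identity))

# Hypothetical alignments as (aligned length in nt, percent identity).
example_hits = [(450, 96.0), (450, 95.0), (200, 99.0), (200, 97.0), (50, 100.0)]
n_accepted = count_est_hits(example_hits)
```

The stricter identity requirement for short alignments reduces the chance of counting ESTs from paralogous genes, at the cost of missing some genuine short matches.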