Metazoan gene expression is regulated in part by the splicing of pre-mRNA into mature RNAs. The signals affecting the efficiency and specificity with which introns are removed have not been completely elucidated. Splicing likely occurs cotranscriptionally, with chromatin structure playing a key regulatory role. We calculated DNA-encoded nucleosome occupancy likelihood (NOL) scores at the boundaries between introns and exons across five metazoan species. We found that (i) NOL scores reveal a sequence-based feature at the introns on both sides of the intron-exon boundary; (ii) this feature is not part of any recognizable consensus sequence; (iii) this feature is conserved throughout metazoa; (iv) this feature is enriched in genes sharing similar functions: ATPase activity, ATP binding, helicase activity, and motor activity; (v) genes with these functions exhibit different genomic characteristics;
(vi)
Eukaryotic gene expression is controlled at multiple levels, and splicing of mRNA is an important regulatory step in the production of functional proteins. During mRNA splicing, portions of the RNA, introns, are removed by the spliceosome complex, and the remaining protein-coding RNAs, exons, are joined together [
The packaging of eukaryotic DNA into chromatin is expected to affect all DNA-templated processes. The fundamental subunit of chromatin is the nucleosome, 150 base pairs (bp) of DNA wrapped around a histone octamer. The position and density of nucleosomes play key regulatory roles and are controlled both by chromatin regulatory complexes and by features intrinsic to the DNA sequence [
Nucleosome forming and nucleosome inhibitory properties were derived from first principles more than three decades ago [
Nucleosome occupancy has recently been shown to play a regulatory role at exon boundaries. These exonic nucleosomes have been proposed to act as “speed bumps” that allow for cotranscriptional splicing [
We have previously described a computational model of nucleosomal occupancy trained on DNA sequence content [
The DNA sequences for the current builds of all organisms (human, hg19; rat, rn4; zebrafish, danRer7; fly, dm3; worm, ce10; yeast, sacCer3) in this analysis were downloaded from the UCSC Genome Bioinformatics website (
Nucleosome occupancy likelihood (NOL) scores were generated using the support vector machine (SVM) model derived from the Ozsolak A375 dataset (described in [
Gene ontology analysis was completed with the GOrilla software [
For calculations of significance across genomic statistics in enriched ontological categories, outliers were first excluded. Outliers were defined as those values less than the first quartile minus the interquartile range (IQR) or greater than the third quartile plus the IQR.
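The outlier rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `exclude_outliers`, the sample values, and the quartile convention (`method="inclusive"` in `statistics.quantiles`) are all hypothetical, since the text does not say how the quartiles were computed. The fences use one IQR, as stated in the text, rather than the more common 1.5 × IQR.

```python
import statistics


def exclude_outliers(values):
    """Keep values inside [Q1 - IQR, Q3 + IQR], the fences described in the text.

    The quartile method is an assumption; other conventions shift the
    fences slightly.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - iqr, q3 + iqr
    # Values strictly below the low fence or strictly above the high fence
    # are treated as outliers and dropped.
    return [v for v in values if low <= v <= high]


# Hypothetical intron sizes (bp) with one extreme value.
intron_sizes = [90, 100, 110, 120, 130, 140, 150, 5000]
kept = exclude_outliers(intron_sizes)
print(kept)  # [90, 100, 110, 120, 130, 140, 150]
```

With these toy values the quartiles are 107.5 and 142.5, so the fences fall at 72.5 and 177.5 and only the value 5000 is removed.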
We were first interested in discovering if there was any intrinsic nucleosome occupancy information in the regions flanking intron-exon boundaries. We reasoned that if chromatin structure plays a role in cotranscriptional splicing, then a robust location to store that chromatin structural information would be within the DNA sequence itself. To this end, we retrieved all intron-exon boundaries from the RefSeq annotation of the human genome [
Mean nucleosome occupancy likelihood (NOL) scores for aligned sequences at the boundaries between introns and exons. (a) Mean NOL scores for the regions +/−500 bp from the annotated upstream end and downstream end of the exons. For the central region +/−50 bp, nucleotide representation at each position is indicated by the size of the letters. (b) Mean NOL scores and associated nucleotide representations for the regions centered on the minimum value found within +/−50 bp of the annotated boundary between intron and exon.
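The two alignments described in this caption, averaging scores position by position and re-centering each region on its local minimum, can be sketched as follows. This is an illustrative sketch only: the function names, the representation of NOL scores as a plain list per locus, and the tie-breaking rule (first minimum wins) are assumptions, and the toy example uses much smaller windows than the +/−50 bp search and +/−500 bp flank named in the caption.

```python
def mean_profile(windows):
    """Position-wise mean across equally long score windows."""
    n = len(windows)
    return [sum(column) / n for column in zip(*windows)]


def recenter_on_minimum(scores, boundary, search=50, flank=500):
    """Return the window of +/- flank positions centered on the minimum
    score found within +/- search positions of the annotated boundary.

    Returns None when either window would run past the ends of `scores`.
    Ties are resolved in favor of the leftmost minimum (an assumption).
    """
    lo, hi = boundary - search, boundary + search
    if lo < 0 or hi >= len(scores):
        return None
    center = min(range(lo, hi + 1), key=scores.__getitem__)
    if center - flank < 0 or center + flank >= len(scores):
        return None
    return scores[center - flank:center + flank + 1]


# Toy score tracks; index 3 plays the role of the annotated boundary.
locus_a = [0.5, 0.4, 0.1, 0.6, 0.7, 0.3, 0.8]
locus_b = [0.9, 0.8, 0.7, 0.2, 0.6, 0.5, 0.4]
aligned = [
    recenter_on_minimum(track, boundary=3, search=2, flank=1)
    for track in (locus_a, locus_b)
]
profile = mean_profile(aligned)  # the dip now sits at the middle position
```

Re-centering before averaging keeps a feature that sits at slightly different offsets in different loci from being smeared out in the mean profile.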
We hypothesized that the dip in the NOL scores may represent a functional DNA-encoded chromatin-regulatory structural element, as this is what the NOL plots measure. To investigate this possibility, we aligned the entire dataset to this putative regulatory feature by centering each region on the minimum found in the 100 base pairs centered on the boundary (Figure
Different biophysical properties emerge with different DNA sequence combinations (e.g., DNA wedge angles [
Counts of exons and genes grouped by location of minimum NOL score in the region +/−50 bp from the annotated boundary between intron and exon. ((a), (c), (e), (g), and (i)) Histograms for human, rat, zebrafish, fly, and worm showing the counts of minimum values at each position in the region +/−50 bp from the annotated intron/exon (exon starts) and exon/intron (exon ends) boundaries. ((b), (d), (f), (h), and (j)) Venn diagrams indicating the numbers of genes represented in the −26 peak for exon starts and +26 peak for exon ends and the overlap between the two sets.
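The counts behind the histograms and Venn diagrams in this caption could be tallied along the following lines. This is a hedged sketch rather than the pipeline used in the study: the record layout `(gene, scores, boundary)`, the function names, and the toy data are hypothetical, and the toy offsets −1 and +1 stand in for the −26 and +26 peaks.

```python
from collections import Counter


def min_offset(scores, boundary, search=50):
    """Offset (relative to the boundary) of the minimum score within
    +/- search positions; the leftmost minimum wins on ties."""
    window = range(boundary - search, boundary + search + 1)
    return min(window, key=scores.__getitem__) - boundary


def offset_histogram(records, search=50):
    """Count how many boundaries have their minimum at each offset."""
    return Counter(min_offset(s, b, search) for _gene, s, b in records)


def genes_at_offset(records, offset, search=50):
    """Set of genes with at least one boundary whose minimum is at `offset`."""
    return {g for g, s, b in records if min_offset(s, b, search) == offset}


# Toy records: (gene name, score track, index of the annotated boundary).
exon_starts = [
    ("geneA", [0.5, 0.1, 0.6, 0.7, 0.8], 2),
    ("geneB", [0.5, 0.2, 0.6, 0.7, 0.9], 2),
    ("geneC", [0.9, 0.8, 0.7, 0.1, 0.6], 2),
]
exon_ends = [
    ("geneA", [0.9, 0.8, 0.7, 0.1, 0.6], 2),
    ("geneC", [0.9, 0.8, 0.7, 0.1, 0.6], 2),
]

start_hist = offset_histogram(exon_starts, search=2)
upstream = genes_at_offset(exon_starts, -1, search=2)
downstream = genes_at_offset(exon_ends, +1, search=2)
shared = upstream & downstream  # the overlap region of the Venn diagram
```

Using sets of gene names means a gene is counted once even when several of its exons carry the feature, which matches counting genes rather than exons.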
To understand the numbers and types of genes associated with the U − 26 and D + 26 features, we selected the exons represented in each of these groups for further analysis. There are 9578 genes represented in the U − 26 group and 7360 genes represented in the D + 26 group, representing 24.1% and 18.5% of all open reading frames tested, respectively. We next asked whether the U − 26 and D + 26 features are found in the same or different sets of genes. A total of 3369 genes, or 19%, overlap between the U − 26 and D + 26 groups (Figure
As positioned nucleosomes flanking exons are phylogenetically conserved, we next asked whether the prominent U − 26 and D + 26 signatures are conserved in other metazoan species. Conservation of these features would suggest that they play a role important enough to be maintained by evolution. We compared a wide range of species: rat, zebrafish, fly, and worm. For each, we identified the nucleotide position of the minimum at the boundary between intron and exon. A conspicuous U − 26 and D + 26 signature exists for all of these species (Figures
The conservation of these features suggested a role in genomic regulation. We next asked whether the feature is present in groups of genes with related function. To test whether the U − 26 or the D + 26 signatures identify groups of genes that share a common function, we searched for ontological enrichment [
Organism-region | ATPase activity | ATP binding | Helicase activity | Motor activity |
---|---|---|---|---|
Human U − 26 | 3.84 | <1 | 2.98 | 1.38 |
Human D + 26 | 2.64 | 2.32 | 3.0 | 1.53 |
Rat U − 26 | 4.67 | 7.92 | 4.43 | 1.22 |
Rat D + 26 | 7.84 | 1.2 | 2.8 | 1.17 |
Zebrafish U − 26 | 8.83 | 2.85 | 9.61 | 2.82 |
Zebrafish D + 26 | 4.77 | 2.79 | 1.14 | 1.04 |
Fly U − 26 | 3.72 | 8.3 | N/A | 3.99 |
Fly D + 26 | N/A | 3.83 | N/A | 4.27 |
Worm U − 26 | 2.07 | 4.83 | 3.7 | 1.89 |
Worm D + 26 | 7.81 | 9.28 | 3.36 | 3.25 |
Measures of significance of enrichment as indicated by GOrilla [
To test whether genes in the enriched functional categories containing the U − 26 and the D + 26 signature had genomic characteristics that differed significantly from the rest of the genome, we compared exon size, intron size, and number of exons for each functional category to the same values calculated for the genome as a whole (Figure
Boxplots of genomic characteristics for several ontological categories in comparison to the entire genome. Boxplots of intron sizes and numbers of exons are shown for the whole genome (WG), ATP binding (AB), ATPase activity (AA), helicase activity (HA), and motor activity (MA) across the 5 species of interest. Values that differ significantly from the whole genome are indicated with an asterisk.
As our analyses to this point had been based purely on in silico methods and genome sequence data, we wanted to see how our results compared to published
GO category | K U − 26 | G U − 26 | K D + 26 | G D + 26 |
---|---|---|---|---|
ATP binding | 8.86 | N/A | 4.04 | 8.55 |
ATPase activity | 9.32 | N/A | N/A | N/A |
Helicase activity | N/A | N/A | N/A | N/A |
Motor activity | N/A | 6.76 | N/A | 2.47 |
Number of genes | 587 | 655 | 647 | 636 |
The upstream intron contains a region between the branchpoint sequence and the 3′ splice site that is generally depleted of AG dinucleotides. This region is generally within 40 nucleotides of the AG splice site and encompasses the U − 26 feature. We were interested in determining if the loci containing the U − 26 feature were enriched or depleted for any dinucleotide occurrences relative to the rest of the genome. We calculated dinucleotide frequency for the ten dinucleotides for the loci containing the U − 26 feature and compared that to the dinucleotide frequency for an equal number of other intron-exon boundaries in the genome (Figure
Dinucleotide counts at intron-exon boundary. The dinucleotide counts for the region −40 to +10 nucleotides from the upstream boundary between intron and exon as indicated at the bottom of the figure. Solid black bar indicates the counts for exons containing the U − 26 signal. Grey bars represent 5 random samples of similar size for comparison.
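The comparison in this caption, dinucleotide counts for the signal-bearing loci against several random samples of the same size, can be sketched as below. The function names, the toy sequences, and the fixed random seed are assumptions made for illustration. The sketch counts all 16 strand-specific dinucleotides, because the text does not state how it arrives at ten dinucleotide classes, so no grouping is attempted here.

```python
import random
from collections import Counter

VALID_BASES = set("ACGT")


def dinucleotide_counts(sequences):
    """Count overlapping dinucleotides across sequences, skipping any
    pair that contains a character other than A, C, G, or T."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - 1):
            pair = seq[i:i + 2]
            if set(pair) <= VALID_BASES:
                counts[pair] += 1
    return counts


def random_background_counts(pool, sample_size, n_samples=5, seed=0):
    """Dinucleotide counts for n_samples random draws (without replacement
    within each draw) of sample_size sequences from the pool."""
    rng = random.Random(seed)
    return [
        dinucleotide_counts(rng.sample(pool, sample_size))
        for _ in range(n_samples)
    ]


# Toy stand-ins for the -40 to +10 regions; real input would be one
# sequence per intron-exon boundary.
signal_regions = ["TTTCAG", "CTTTAG"]
other_regions = ["ACGTAG", "GGCCAG", "TTAACG", "CAGTAG"]

signal = dinucleotide_counts(signal_regions)
background = random_background_counts(other_regions, sample_size=len(signal_regions))
```

Drawing several background samples, as the figure does, gives a rough sense of how much the counts vary by chance before any enrichment or depletion is read into the signal set.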
Recent work has shown that differential G/C content plays a role in intron-exon definition and splice site selection [
We have identified a set of conserved genes sharing a common function using a nucleosome positioning signature. We have further characterized the set of genes as having increased numbers of exons while having average number and length of introns. This feature has been validated using
Our results indicate that cryptic sequence features may drive DNA-templated regulatory events. Our observations and the classification of a particular subset of genes could not have been accomplished through alignment of nucleotide content. The NOL scores point toward a physical property of DNA related to the ability of a particular DNA sequence to form a nucleosome. The organization and architecture of DNA around the nucleosome likely play a role in the mechanism of pre-mRNA splicing. We anticipate that many more functional DNA elements may be discovered using similar methodologies.
The authors declare that there is no conflict of interests regarding the publication of this paper.
Justin A. Fincher and Jonathan H. Dennis conceived the study and participated in the study design and in data analysis. Justin A. Fincher performed the computational analysis and data preparation. Jonathan H. Dennis drafted the paper. All authors read, revised, and approved the final paper.
The authors would like to thank the Florida State University CompuStat group for helpful discussions. This work was partially supported by Florida State University (Jonathan H. Dennis and Justin A. Fincher), National Institutes of Health R01 DA033773 (Jonathan H. Dennis), National Science Foundation CNS-0964413 and CNS-0915926 (Gary S. Tyson), and American Heart Association 12POST12070101 (Justin A. Fincher).