With the development of new sequencing technology, the entire N6-methyl-adenosine (m^{6}A) RNA methylome can now be unbiased profiled with methylated RNA immune-precipitation sequencing technique (MeRIP-Seq), making it possible to detect differential methylation states of RNA between two conditions, for example, between normal and cancerous tissue. However, as an affinity-based method, MeRIP-Seq has yet provided base-pair resolution; that is, a single methylation site determined from MeRIP-Seq data can in practice contain multiple RNA methylation residuals, some of which can be regulated by different enzymes and thus differentially methylated between two conditions. Since existing peak-based methods could not effectively differentiate multiple methylation residuals located within a single methylation site, we propose a hidden Markov model (HMM) based approach to address this issue. Specifically, the detected RNA methylation site is further divided into multiple adjacent small bins and then scanned with higher resolution using a hidden Markov model to model the dependency between spatially adjacent bins for improved accuracy. We tested the proposed algorithm on both simulated data and real data. Result suggests that the proposed algorithm clearly outperforms existing peak-based approach on simulated systems and detects differential methylation regions with higher statistical significance on real dataset.
1. Introduction
Although the presence of posttranscriptional biochemical modifications to RNA has been established in 1960s [1], due to historical limitations, RNA epigenetics is largely uncharted territory until recently [2–4]. In 2012, a powerful sequencing protocol methylated RNA immune-precipitation sequencing (MeRIP-Seq or m^{6}A-Seq) was developed [5, 6], in which the fragmented mRNA fragments with N6-methyl-adnosine (m^{6}A) are pulled down with anti-m^{6}A antibody and then purified and passed to subsequent sequencing to generate the so-called “IP sample” for profiling the transcriptome-wide RNA m^{6}A methylome. Very often, a paired “input sample” is generated as well using all the RNA for measuring the entire transcriptome background (please refer to [7] for a more comprehensive protocol of this approach). This technique facilitates a number of research findings recently which includes the following: the role of RNA methylation in controlling the circadian clock [8], addiction [9], and stem cell [10], and [2, 3, 5, 6, 8–16]. It also enabled the construction of mammalian RNA methylation database [17] and systems biology approaches for decomposing the RNA methylome to unveil the latent enzymatic regulators of epitranscriptome [18]. Software tools for RNA methylation site detection [19, 20] and for differential RNA methylation analysis [21] from MeRIP-Seq data are now available in a rather user friendly manner. Nevertheless, as a newly arising technique, MeRIP-Seq still poses computational challenges that call for novel and sophisticated approaches.
Differential methylation analysis is of crucial importance for epigenetics research. Differentially methylated regions (DMRs), that is, regions that exhibit different methylation levels between two experimental conditions, for example, normal and cancerous, can be as small as a single base or as large as an entire gene locus, depending on the biological question of interest and the bioinformatics methods used for their identification [22]. Differential methylation analysis from MeRIP-Seq seeks to identify the differences in RNA methylome in a case-control study (e.g., cancerous and normal), which usually involves at least four high-throughput sequencing (HTS) samples, including the IP and input samples under both the case and control conditions. For affinity-based methods developed for DNA epigenetics (such as MeDIP-Seq and ChIP-Seq), since the absolute amount of DNA is most likely to stay unchanged between two conditions, the percentage of modified DNA molecule is linearly correlated with the absolute amount; thus the difference in methylation is consistent when measured in relative (percentage) and absolute amount. However, in MeRIP-Seq, due to the change in transcriptional expression level between two conditions, it is possible that while the absolute amount of methylated RNA increases, the relative amount (percentage of methylated RNA) decreases as shown in Figure 1. From computational perspective, the differential methylation analysis of RNA is quite different from that of DNA, and DNA differential methylation approaches [23], such as MOABS [24] and DMAP [25], may not be directly applicable to RNA. Until now, methods aiming at the differential analysis of MeRIP-Seq data do not extensively appear in literature. exomePeak [19, 21] is dedicatedly developed for differential RNA methylation analysis from MeRIP-Seq data. The detection of DMRs is based on rhtest [26], which is an extended version of hypergeometric test, computing the statistical significance of the difference in the percentages of methylated fragments between the two conditions, which directly indicates the difference in enzymatic regulation. Before the detection of DMRs, peaks (methylated regions) are called firstly from the transcriptome by comparing the IP with input sample by relative enrichment [7, 19, 27]. Only with the detected methylation sites can we effectively estimate the methylation level.
Comparison of the differential methylation analysis in DNA and RNA. The first column shows the DNA related differential analysis in ChIP-Seq or MeDIP-Seq, where the total DNA is often considered the same under two experimental conditions, so the differential analysis can be performed by directly comparing the absolute amount of methylated RNAs in the two IP samples. In contrast, for RNA (the second column), the background is total RNA, which can vary significantly under different conditions, and therefore, the absolute amount of methylated RNA for a specific site does not necessarily correlate with the degree of methylation. For the example shown in the above figure, while amount of methylated RNA increases under the cancer condition, the relative amount (percentage of methylated RNA) decreases, indicating a hypomethylation at RNA level. As a result, the differential analysis of RNA methylome in MeRIP-Seq should be performed by comparing the percentages of methylated RNA to reflect the influence of methylation enzymatic regulation.
Affinity-based approaches cannot provide single-base resolution. Since multiple RNA methylation residuals may locate in proximity and cannot be effectively differentiated with peak calling procedure, they can appear as a single broad methylation site in the peak calling result from MACS [27] or exomePeak [19]. In many cases, this discrepancy can be trivial and does not significantly affect relevant study; however, it can be disastrous in differential methylation analysis, because multiple RNA methylation residuals can be regulated by different enzyme complexes and thus may be differentially methylated. Failing to identify the precise location of each methylation residual can lead to large bias in the estimation of its methylation level and in the comparison to a different condition. Currently, all existing methods for RNA differential methylation from MeRIP-Seq data are peak-based. In this paper, based on the rhtest method developed in exomePeak package [21], we proposed FET-HMM, a novel strategy for spatially enhanced differential RNA methylation analysis using hidden Markov model (HMM). When applying to the RNA methylation site detected from a peak calling algorithm, FET-HMM breaks a single site into multiple adjacent small bins and evaluates whether a specific bin is differentially methylated or not between two experimental conditions with spatial dependency incorporated by HMM. Figure 2 shows the comparison between existing and our methods.
Comparison of differential methylation analysis methods. This figure shows the difference between existing peak-based differential analysis method and the proposed method. Started from aligned reads, the left part of this figure shows how exomePeak conducts differential analysis. It firstly identifies a single methylation site and then decides whether the methylation site as a whole is differentially methylated or not. However, the newly proposed method will split the testing region into multiple adjacent small bins and then will integrate their dependency with HMM for more accurate identification of differential methylation site. In the above example, the RNA methylation site detected using exomePeak method may consist of two methylation residuals, and only the one on the right side is differentially methylated in this case-control study. The proposed FET-HMM method is likely to work better than peak-based exomePeak method under this scenario.
HMM is a statistical model that integrates multiple random processes and has been widely used in DNA-templated epigenetic analysis and in RNA methylation sites detection (or peak calling) [28–30], but so far it has not been applied for RNA differential methylation analysis. We applied the newly developed approach FET-HMM on both simulated and real datasets. The results on simulated data showed that FET-HMM can effectively improve the performance of rhtest in terms of the area under the curve (AUC) when detecting differential methylation sites. When applied to human MeRIP-Seq datasets, FET-HMM method returns more biological meaningful results than exomePeak method. The FET-HMM algorithm has been implemented in an open source R package for differential methylation analysis from MeRIP-Seq data and is freely available from GitHub. The method is detailed in the following section.
2. Methods
In this section, we firstly review the usage of rhtest, a modified version of Fisher’s exact test (FET), for differential RNA methylation analysis and then introduce spatially enhanced approach FET-HMM.
2.1. Peak-Based Differential RNA Methylation Analysis with Rhtest
To conduct differential RNA methylation analysis in a case-control study, we should get four samples, that is, the IP and input samples from both groups. Consider that there are a number of RNA methylation sites detected with peak calling approaches [19, 20, 27] from MeRIP-Seq. Then we can assume that the number of reads within the gth RNA methylation sites follows the Poisson distribution, with(1)X0,g~PoissonN0λ0,g,X1,g~PoissonN1λ1,g,Y0,g~PoissonM0λ-0,g,Y1,g~PoissonM1λ-1,g,where X0,g and X1,g are the reads counts of the input samples for untreated and treated condition and consistently, Y0,g and Y1,g are the reads counts of the IP samples for untreated and treated samples. Here, g=1,2,…,G indicates the gth RNA methylation site. N0,N1,M0,M1 are the size (or the sequencing depth) of library, respectively; and the parameters λ0,g,λ1,g,λ-0,g,λ-1,g are the normalized Poisson means in a standard library, indicating the expectation of the reads counts within a bin. Following the formulation from previous study [26], we assume that λ-0,g and λ-1,g satisfy the following relationship with λ-0,g=λ0,gη0,g/f0 and λ-1,g=λ1,gη1,g/f1, where f0 and f1 indicate the percentage of the expressed RNA fragments that are modified in the untreated and treated samples, respectively. η0,g and η1,g indicate the percentage of RNA fragments mapped inside the RNA methylation site that carry the methylation mark. We would like to test whether η1,g=η0,g. According to the properties of the Poisson distributions [31, 32], given X0,g+Y0,g=t0,g, X1,g+Y1,g=t1,g, we should have X0,g~Binomialp0,g,t0,g and X1,g~Binomialp1,g,t1,g, where p1,g=N1f1/N1f1+M1η1,g and p0,g=N0f0/N0f0+M0η0,g. For different experimental conditions, if we assume that the total amount of modifications remains the same, only its distribution may change, then we can have f0=f1=f. We also notice that if N1M0=N0M1, then η1,g=η0,g⇔p1,g=p0,g, and testing whether the two Binomial distributions have the same successful rate is equivalent to the classical problem of testing the independence in a 2 × 2 contingency table. In order to establish N1M0=N0M1, only one of the 4 samples needs to be rescaled. When N1M0=N0M1 is achieved after rescaling, under the null hypothesis p1,g=p0,g, X0,n follows a hypergeometric distribution as in (2), and we may use Fisher’s exact test [33–36] with two tails to evaluate its significance. Consider(2)pX0,g=k~HyperX0,g∣K,n,N=KkN-Kn-kNn,where N=t0,g+t1,g=x0,g+x1,g+y0,g+y1,g, n=t0,g=x0,g+y0,g, and K=x0,g+x1,g. The smaller the p value is, the more likely the gth RNA methylation site is differentially methylated between two conditions.
2.2. Spatially Enhanced Differential RNA Methylation Analysis with FET-HMM
The method developed in the previous section could not effectively discriminate multiple RNA methylation residuals located within a single RNA methylation site (as shown in Figure 1). We seek to enhance the spatial resolution with hidden Markov model. Similar to various formulation, for a particular RNA methylation site, we firstly divided it into N mutually connected bins of length L. Then we can still assume that the number of reads within the nth bin follows the Poisson distribution, with(3)X0,n~PoissonN0λ0,n,X1,n~PoissonN1λ1,n,Y0,n~PoissonM0λ-0,n,Y1,n~PoissonM1λ-1,n,where X0,n and X1,n are the reads counts of the input samples for untreated and treated condition and consistently, Y0,n and Y1,n are the reads counts of the IP samples for untreated and treated samples. Here, n=1,2,…,N indicates the nth bin. The parameters λ0,n,λ1,n,λ-0,n,λ-1,n are the normalized Poisson means in a standard library, indicating the expectation of the reads counts within a bin. Following the formulation from previous study [26], we assume that λ-0,n and λ-1,n satisfy the following relationship with λ-0,n=λ0,nη0,n/f0 and λ-1,n=λ1,nη1,n/f1, where f0 and f1 indicate the percentage of the expressed RNA fragments that are modified in the untreated and treated samples, respectively. η0,n and η1,n indicate the percentage of RNA fragments mapped inside the bin that carry the methylation mark. We can easily test whether η1,n=η0,n (whether differential methylation is observed) for a specific bin; however, we should not neglect the dependencies between the reads counts of adjacent bins within an RNA methylation site; that is, if differential methylation is observed on a specific bin, it is likely that differential methylation can also be observed on bins adjacent to it and vice versa. The dependency can be effectively incorporated with an HMM formulation, and we thus developed a new strategy for the identification of differential methylation regions (DMRs) with improved spatial resolution.
To begin with, with respect to nth bin, the hidden true states of differential methylation are denoted as S={s1,s2,…,sN}, where sn∈{0,1} with 1 representing differential methylation state (DMS) and 0 otherwise. Considering that a differential methylation region may span multiple adjacent bins, we assume that the true hidden DMS S follows a first order Markov chain, whose transition matrix A contains entries defined as (4)Aij=Psn+1=j∣sn=i,i,j∈0,1,where Aij denotes the probability for the hidden variable switching from DMS i at the nth bin to the DMS j at the (n+1)th bin. In addition, the initial probability p(S1=0)=u and p(S1=1)=1-u, which can be denoted as π=(u,1-u). Next, the result of rhtest [21, 26] was used as the observed variable of the HMM. However, the information acquired from rhtest is a statistical significance of differential methylation in terms of p values and FDRs (False Discovery Rates). We seek to enhance the differential methylation results by incorporating spatial dependency. Specifically, 3 different strategies are developed for this purpose with their own advantages and disadvantages, which are detailed in the following.
2.3. FHB Strategy: Combine Fisher’s Exact Test and HMM with Binary Observation
In FHB strategy, we use the binary decisions received from FET as the observation of hidden Markov model. The model essentially evaluates how likely a true differential methylation state can be detected by FET, or if FET reports a DMS with a significance level, how likely it is true after incorporating spatial dependency. We assume that a state can be correctly observed with probability p; and a mistake happens with probability 1-p. Since the observation from FET is considered as binary, a cut-off threshold should be used to switch the FDR (False Discovery Rate) value to generate the “observed” set of observed variable O=(o1,o2,…,on) with on∈{0,1}. Then according to the standard HMM definition, these probabilities consist of an emission matrix B, whose entries are defined as(5)Bij=Pon=j∣sn=i=p,i,j∈0,1,i=j,1-p,i,j∈0,1,i≠j.The detailed structure of HMM is shown in Figure 3.
Hidden Markov model. In FHB strategy, the “observation” is a binary status reported from FET, and the emission probability is Bernoulli distribution.
Finally, we applied the widely used Baum-Welch algorithm [37–39] to estimate the unknown parameters of the HMM. Baum-Welch algorithm applies the well-known Expectation and MSaximization (EM) strategy to conduct the process of estimation. The implementation steps of Baum-Welch algorithm are as follows.
The Proposed Algorithm
(1) Initialization. Given the initial value of Aij, πi, and Bij randomly according to the conditions of probability, we hence get the initial model parameters λ(0)=(π(0),A(0),B(0)).
(2) EM Steps
E Step. Let γn(i) denote the probability of the hidden DMS being at i at the nth bin, and let ξn(i,j) denote the probability of the hidden DMS being at i at the nth bin and the DMS being at j at the (n+1)th bin. Also, we denote tsik, k∈{0,1}, to represent the times of the transition from DMS i to any DMS k and tsij to represent the times of the transition from DMS i to the DMSj. γn(i) and ξn(i,j) can be computed through (6) and (7), and the expectation of tsik and tsij can be calculated by (8) and (9). λ(m)=(π(m),A(m),B(m)) represents the parameters of HMM after the mth iteration. Consider(6)γni=Psn=i∣O,λm=Psn=i,O∣λmPO∣λm,(7)ξni,j=Psn=i,sn+1=j∣O,λm=Psn=i,sn+1=j,O∣λmPO∣λm,(8)Etsik=∑n=1N-1γn(i),(9)Etsij=∑n=1N-1ξn(i,j).M Step. After using (10), (11), and (12) to estimate πi, Aij, and Bij, we get λ(m+1). One has(10)πi(m+1)=γ1i,(11)aij(m+1)=EtsijEtsik=∑n=1N-1ξn(i,j)∑n=1N-1γn(i),(12)bim+1k=∑n=1NγniIon=k∑n=1Nγni.In (12),(13)I{on=k}=1on=k0on≠kis the indicative function.
(3) Loop. Repeat the EM steps until the convergence of Aij, πi, and Bij. After the procedures above, optimal model parameter λ(op) could be obtained. Let unk=1 if we are absolutely sure sn=k and unk=0 otherwise. What we focused on is the final expectation of unk, k∈{0,1}, which can be calculated as(14)Eunk∣O,λop=Psn=k∣O,λop.
Then we could obtain the posterior probability of a bin being at a specific state, and the performance of FET-HMM can be compared with that of exomePeak on simulated dataset when the true state is available.
2.4. FHC Strategy: Combine Fisher’s Exact Test and HMM with Continuous Observation
In FHB strategy, we adopt a switching cut-off threshold to convert the statistical significance (p value from differential analysis with rhtest) into binary states as the observation of HMM. This strategy has two limitations. Firstly, we could hardly find the most reasonable threshold for a dataset, and different threshold can lead to different results. Secondly, some information gets lost in the conversion from p value to binary states; for example, both p values 0.01 and 0.001 are converted as DMS state 1 after a binary conversion with significance level 0.05; however, the former is less confident. In addition, Bernoulli distribution may not be the most suitable distribution for the emission probability of observed variable. Therefore, a strategy seeking to directly smooth the continuous statistical significance without binary conversion may be superior. For this purpose, we use the p values from FET to approximate the likelihood of a bin with DMS state 0 and (1-pvalue) for its likelihood with DMS state 1. The p values generated from FET can be used to estimate the emission probability of HMM directly and then passed to HMM for smoothing purposes. It should be denoted as(15)BII=pvalue11-pvalue1pvalue21-pvalue2⋮⋮pvalueN1-pvalueN.After getting the matrix BII of size N by 2 constructed from FET p values, the Baum-Welch algorithm introduced in FHB can be applied to spatially enhance the local result, with formula (12) omitted because matrix BII does not need to be reestimated every iteration. Please note that using p values to approximate directly the probability matrix BII helps to avoid the binary conversion and information loss, and we will show in the Result section that this trick indeed improves the performance of algorithm.
2.5. FastFH Strategy: A High-Efficiency Strategy for Applying FET-HMM on Big Omics Data
When the proposed method is used in real MeRIP-Seq dataset, two problems would emerge. What comes first was some reads would be mapped into very short genes; thus the number of the bins would be quite small. In other words, the length of some Markov chains would be too short for accurate estimation of parameters and finally affects the results of DMRs detection. In addition, computational time was another important factor that we should take into consideration. Take the human hg19 data we were going to test as an example. If there were more than 30000 detected RNA methylation sites in total, the Baum-Welch algorithm would be performed more than 30000 times and the execution time might be too long. In order to solve these two limitations, we could combine the two strategies together. Firstly, the threshold used in FHB was used here again to switch the FDR into binary DMS. Then we could estimate transition matrix AIII directly from this DMS information as shown in(16)πIII=1-∑i=1NDMSiN,∑i=1NDMSiN,AIII=PSn+1=0∣Sn=0PSn+1=1∣Sn=0PSn+1=0∣Sn=1PSn+1=1∣Sn=1,where PSn+1∣Sn denotes the conditional probability for the transition from Sn to Sn+1, which can be conveniently estimated by scanning all the states of differential methylation S={s1,s2,…,sN} on all RNA methylation sites. For every single gene, the emission probability BIII has the same form as BII in FHC strategy. By doing this, the AIII matrix can be estimated in a single step instead of an iterative manner so as to save computation load. This result should be also more robust on short RNA methylation sites with less number of bins than previous strategy. Secondly, we chose the Estep in FHB strategy to compute the final expectation defined in formula (14) for every single bin on every RNA methylation sites of real RNA epigenetics data. FastFHC strategy applied Estep after estimating transition matrix and initial probability for all genes. πIII and AIII are considered the same on different RNA methylation sites and are estimated like FHB with binary converted observation. Although some information can be lost in the conversion step, since tens of thousands of RNA methylation sites are pooled together for estimation of πIII and AIII, it should be still relatively accurate. The 3 strategies are summarized in Figure 4.
Comparison of different strategies. FHB strategy is the most naïve and straightforward; FHC is the most time consuming and performs better than FHB but is less robust. With FastFHC, the algorithm can now be applied to genome scale dataset in a timely and robust manner.
3. Result3.1. Test on Simulated Data
For MeRIP-Seq, as the ground truth is not available for the differential RNA methylation status in real data, the performance of our proposed method (FHB and FHC strategy) was first validated on simulated datasets. Specifically, the reads counts for the IP and input samples under two experimental conditions were generated from model assumptions, respectively. In every set of data, 100 RNA methylation sites are generated, each with 1000 adjacent bins. The sequencing depths were all set 10^{8}, and the normalized Poisson mean λ0 of untreated input was set to 10^{−6}, unless otherwise clarified. To simulate differential expression, reads counts of each gene in both the IP and the input control sample also vary in a certain range compared with the untreated condition, respectively; and we assume its log2 fold change follows a uniform distribution between [-3,3]. To mimic differential methylation, the methylation reads counts log2 odds ratio follows a uniform distribution between [-3,3] for differential methylation bins and 0 for nondifferential bins. In order to impose dependency of adjacent bins on the simulated data, we applied a definite HMM to generate the labels used as the hidden DMS of the 1000 adjacent bins to indicate whether a bin is differential methylated or not. Then the label was used to generate the data and also used as the ground truth for evaluating the performance of the proposed FET-HMM approach. The transition matrix Asim was set as(17)Asim=0.90.10.10.9unless otherwise stated, and the initial probability π=(0.5,0.5) due to the lack of prior information. We considered three factors that may affect the performance of the algorithm, that is, the cut-off threshold applied to FET result for switching FDR (or p values) to the binary observed state (only for FHB), the transition matrix (degree of spatial dependency) used to generate the ground truth, and the sequencing depth (library size) of the data. The area under receiver operating characteristics curve (AUC) is calculated to evaluate the performance of the proposed algorithms under different settings of the 3 key factors to be tested.
In the first experiment, we tested the impact of cut-off threshold on the FHB strategy. As shown in Figure 5, although the choice of threshold does affect the performance of the algorithm, by incorporating spatial dependency, the proposed FHB strategy effectively improves the DMRs detection performance under all cut-off thresholds tested.
Boxplot of AUCs for different thresholds applied to switch FDR to the binary state. This figure shows that with the variation of thresholds, the performance of FHB outperforms exomePeak in AUC on 100 datasets. exomePeak does not use the cut-off threshold so its performance remains the same. The performance is evaluated at bin level rather than peak level in all experiments.
In the second experiment, we tested the impact of transition matrix, which indicates the degree of dependency between adjacent observations (bins). As shown in Figure 6, the performance of FHB and FHC strategies heavily relies on the transition matrix setting, which reflects the degree of dependence between adjacent bins; and FHC strategy outperforms FHB and exomePeak under different settings tested.
Boxplot of AUCs for different transition matrices used to generate the ground truth. The performance of FHB and FHC strategies heavily relies on the transition matrix setting, which reflects the degree of dependence between adjacent bins; and FHC strategy outperforms FHB and exomePeak under different settings tested.
The last factor that may affect the simulation results is the sequencing depth (the total number of reads). In our simulation, the sequencing depths (SD) of the four samples varied from 10^{9} to 10^{6}. From Figure 7, we can see that the performances of FHB, FHC, and exomePeak are all satisfactory when sequencing depth is high enough (SD = 10^{9}); their performance all decreases together with the sequencing depth. Among the 3 methods tested, FHC gives the best performance and the advantage of FET-HMM over exomePeak is the most prominent when the sequencing depth is low. When the sequencing depth is very low, none of the 3 approaches can identify DMRs effectively.
Boxplot of AUCs for different sequencing depths. The performance of all 3 approaches decreases together with the sequencing depth. FHC strategy gives the best performance and the advantage of FET-HMM over exomePeak is the most prominent when the data is of mediocre sequencing depth.
We also consider here another scenario of unbalanced sequencing depth; that is, only one of the 4 samples has very large or small sequencing depth, and the results are highly consistent with previous result. As shown in Figure 8, the performance of all 3 approaches decreases as the sequencing depth decreases and FHC strategy outperforms FHB and exomePeak on most settings.
Boxplot of AUCs for different unbalanced sequencing depths. The performance of all 3 approaches decreases as the sequencing depth decreases and FHC strategy outperforms FHB and exomePeak on most settings. In this test, the sequencing depth of IP sample under treated condition varies with that of the other 3 samples unchanged.
In general, the computational complexity of the proposed approaches increases together with the number of the genes, the length of the genes, and the resolution of the analysis (the size of the bin); and since FHB and FHC require iterative refinement, their computational complexity is also proportional to the number of iterations required to research convergence. To further evaluate the computational complexity of the 3 strategies, we conducted one additional experiment. In this experiment, we simulated a dataset of 7 genes, each with a different length (50, 100, 150, 200, 250, 300, and 350) and the methylation state transition probability is set to be 0.95. A total of 10 datasets are generated for evaluation purposes and the average performance and time consumption are calculated. As it can be seen from Table 1, on the simulated setting, FastFHC is comparable to FHB and FHC in performance, but much faster, making it a reasonable choice for genome-scale data with more than a few thousands of genes.
Comparison of different approaches.
Method
AUC
Time
FHB
0.960
4.39 s
FHC
0.987
0.85 s
FastFHC
0.962
0.12 s
exomePeak (rhtest)
0.924
0.02 s
3.2. Test on MeRIP-Seq Data
In order to test our proposed method in real applications, we chose the human MeRIP-Seq data from Hela cells and from METTL3/METTL14 knockout conditions [40] as shown in Table 2. Previous study shows that METTL3 and METTL14 are components of RNA methyltransferase complex [40, 41], and we would like to identify their respective targeted RNA methylation sites from the following analysis. The original raw data in SRA format was downloaded directly from Gene Expression Omnibus (GEO) GSE46705, which consists of 8 IP and 8 Input MeRIP-Seq replicates obtained under wild type condition and after METTL3 or METTL14 knockout, respectively (a total of 16 libraries). The short sequencing reads are firstly aligned to human genome assembly hg19 with Tophat2 [42], and then the same types of samples obtained under the same condition are merged together for differential RNA methylation analysis.
MeRIP-Seq data used.
Dataset
Cell
Treatment
Replicates (IP/input)
Reference
1
Hela
Control
4 & 4
[40]
2
Hela
METTL3 K/O
2 & 2
[40]
3
Hela
METTL14 K/O
2 & 2
[40]
Differential RNA methylation is predicted using exomePeak R/Bioconductor package [21] with UCSC gene annotation database [43] and with FastFHC strategy for comparison. Since METTL3 and METTL14 are methyltransferase, their target sites should exhibit hypomethylation under knockout condition. The hypomethylation sites under knockout condition (targeted RNA methylation sites) are then extracted and their sequences are submitted to MEME-ChIP for motif discovery. The identified motifs are summarized in Table 3. The enriched motifs are quite different in both datasets, indicating that there are multiple regulatory avenues to regulate the RNA methylome through sequence specificity.
Motifs for target sites of METTL3 and METTL14.
Rank
exomePeak
FET-HMM
Motif
E-value
Motif
E-value
1
2.3 × 10^{−27}
2.4 × 10^{−33}
METTL3 K/O
2
4.7 × 10^{−13}
7.1 × 10^{−24}
3
1.5 × 10^{−11}
1.5 × 10^{−15}
4
2.6 × 10^{−6}
4.4 × 10^{−5}
1
3.3 × 10^{−13}
1.4 × 10^{−19}
METTL14 K/O
2
2.5 × 10^{−12}
2.0 × 10^{−21}
3
3.3 × 10^{−10}
3.4 × 10^{−12}
4
4.8 × 10^{−7}
6.0 × 10^{−11}
Despite the difference in sequences, as shown in Figure 9, the motifs identified by FastFHC results are more statistically significant than that from exomePeak, indicating higher sequence specificity, which is achieved by spatial enhancement with HMM in FET-HMM approach. The increased sequence specificity will be invaluable for decoding the structure of RNA methylation/demethylation enzymes.
E values of motifs identified from differential methylation regions. The figure shows the motif E values from exomePeak and FastFHC strategy. With spatially enhanced differential methylation analysis, FastFHC identifies RNA methylation sites that are more biologically meaningful, indicating higher specificity compared with the exomePeak result.
We then checked the distribution of METTL3 and METTL14 targeted RNA methylation sites on mRNA and lncRNA. As shown in Figure 10, the targeted RNA methylation sites of METTL3 and METTL14 are relatively enriched near stop codon of mRNA. Interestingly, compared with METTL14 targets, METTL3 targets are relatively enriched on untranslated regions (5′ and 3′UTR), which is never reported before. Although existing studies suggest METTL3 and METTL14 function as an RNA methylation complex together with WTAP, our observation suggests that they may have their own respective functions as well. On lncRNA, their targets are almost uniformly distributed on the entire RNA with slight enrichment on 5′ end, whose reason is not yet clear.
Distribution of METTL3 and METTL14 targeted RNA methylation sites. For both METTL3 and METTL14, their targeted RNA methylation sites are relatively enriched near stop codon of mRNA; however, compared with METTL14 targets, METTL3 targets are relatively enriched on untranslated regions (5′ and 3′UTR). On lncRNA, their targets are unfirmly distributed with slightly enriched on 5′ end.
4. Conclusion
In this paper, we developed an HMM-based method, FET-HMM, for spatially enhanced detection of differentially methylated region from MeRIP-Seq data. Compared with existing peak-based approaches which perform differential analysis on the entire methylation site, FET-HMM seeks to increase the resolution of detection to some extent by dividing the single RNA methylation site into multiple adjacent bins (as shown in Figure 1), resulting in the improved detection performance. We developed 3 different strategies for this purpose, each with different advantage and disadvantages, and the FastFHC strategy can be directly applied to genome scale dataset. We show on the simulated and real datasets that the proposed approaches outperform original approach in detection performance and report more statistically significant DMRs on real MeRIP-Seq data.
It is important to note that exomePeak, which adopts a hypothesis testing scheme, relies on a cut-off threshold to report differential methylation sites, while FET-HMM, which assumes a hidden Markov model, needs a cut-off threshold for posterior probability. Although their performances can be compared under AUC, the two approaches are fundamentally different. It is suggested that both exomePeak and FET-HMM are used when analyzing specific datasets rather than using one approach only.
The proposed approach still has a number of limitations, many of which are shared by other existing MeRIP-Seq data analysis software. Firstly, the proposed approach could not model the within-group variation and thus cannot effectively take advantage of biological replicates. Currently, replicates are merged together which loses the biological variability. Secondly, the proposed approach cannot discriminate different isoforms of the same genes. MeRIP-Seq intrinsically poses very limited information regarding the methylation states of different isoform transcripts. Thirdly, even with the proposed approach, the spatial resolution is still not base-pair resolution. To obtain true base-pair solution, a more advanced computational approach needs to be developed to further combine the nucleotide sequence information (motif).
Disclosure
The open source R package implementing the proposed algorithm on MeRIP-Seq data is freely available from GitHub: https://github.com/lzcyzm/RHHMM.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The authors thank the support from National Natural Science Foundation of China (61473232, 61401370, 91430111, 61170134, and 61201408) to Shao-Wu Zhang, Jia Meng, and Hui Liu; Jiangsu Science and Technology Program (BK20140403) to Jia Meng; Fundamental Research Funds for the Central Universities (2014QNB47, 2014QNA84) to Lin Zhang and Hui Liu. The authors also thank computational support from the UTSA Computational System Biology Core, funded by the National Institute on Minority Health and Health Disparities (G12MD007591) from the National Institutes of Health.
BrownleeG. G.SangerF.BarrellB. G.Nucleotide sequence of 5S-ribosomal RNA from Escherichia coliFuY.DominissiniD.RechaviG.HeC.Gene expression regulation mediated through reversible m^{6}A RNA methylationMeyerK. D.JaffreyS. R.The dynamic epitranscriptome: N6-methyladenosine and gene expression controlKönigJ.ZarnackK.LuscombeN. M.UleJ.Protein-RNA interactions: new genomic technologies and perspectivesDominissiniD.Moshitch-MoshkovitzS.SchwartzS.Salmon-DivonM.UngarL.OsenbergS.CesarkasK.Jacob-HirschJ.AmariglioN.KupiecM.SorekR.RechaviG.Topology of the human and mouse m^{6}A RNA methylomes revealed by m^{6}A-seqMeyerK. D.SaletoreY.ZumboP.ElementoO.MasonC. E.JaffreyS. R.Comprehensive analysis of mRNA methylation reveals enrichment in 3′ UTRs and near stop codonsDominissiniD.Moshitch-MoshkovitzS.Salmon-DivonM.AmariglioN.RechaviG.Transcriptome-wide mapping of N^{6}-methyladenosine by m^{6}A-seq based on immunocapturing and massively parallel sequencingFustinJ. M.DoiM.YamaguchiY.HidaH.NishimuraS.YoshidaM.IsagawaT.MoriokaM. S.KakeyaH.ManabeI.OkamuraH.XRNA-methylation-dependent RNA processing controls the speed of the circadian clockHessM. E.HessS.MeyerK. D.VerhagenL. A. W.KochL.BrönnekeH. S.DietrichM. O.JordanS. D.SaletoreY.ElementoO.BelgardtB. F.FranzT.HorvathT. L.RütherU.JaffreyS. R.KloppenburgP.BrüningJ. C.The fat mass and obesity associated gene (Fto) regulates activity of the dopaminergic midbrain circuitryWangY.LiY.TothJ. I.PetroskiM. D.ZhangZ.ZhaoJ. C.N6 -methyladenosine modification destabilizes developmental regulators in embryonic stem cellsLeeM.KimB.KimV. N.Emerging roles of RNA modification: m(6)A and U-tailSchwartzS.MumbachM. R.JovanovicM.WangT.MaciagK.BushkinG.MertinsP.Ter-OvanesyanD.HabibN.CacchiarelliD.SanjanaN.FreinkmanE.PacoldM.SatijaR.MikkelsenT.HacohenN.ZhangF.CarrS.LanderE.RegevA.Perturbation of m6A writers reveals two distinct classes of mRNA methylation at internal and 5′ sitesLiuJ.YueY.HanD.WangX.FuY.ZhangL.JiaG.YuM.LuZ.DengX.DaiQ.ChenW.HeC.A METTL3-METTL14 complex mediates mammalian nuclear RNA N6-adenosine methylationWangX.LuZ.GomezA.HonG. C.YueY.HanD.FuY.ParisienM.DaiQ.JiaG.RenB.PanT.HeC.N 6-methyladenosine-dependent regulation of messenger RNA stabilityHeC.Grand Challenge Commentary: RNA epigenetics?DominissiniD.Roadmap to the epitranscriptomeLiuH.FloresM. A.MengJ.ZhangL.ZhaoX.RaoM. K.ChenY.HuangY.MeT-DB: a database of transcriptome methylation in mammalian cellsLiuL.ZhangS.ZhangY.LiuH.ZhangL.ChenR.HuangY.MengJ.Decomposition of RNA methylome reveals co-methylation patterns induced by latent enzymatic regulators of the epitranscriptomeMengJ.CuiX.RaoM. K.ChenY.HuangY.Exome-based analysis for RNA epigenome sequencing dataLiY.SongS.LiC.YuJ.MeRIP-PF: an Easy-to-use Pipeline for High-resolution Peak-finding in MeRIP-Seq DataMengJ.LuZ.LiuH.A protocol for RNA methylation differential analysis with MeRIP-Seq data and exomePeak R/Bioconductor packageBockC.Analysing and interpreting DNA methylation dataRobinsonM. D.KahramanA.LawC. W.LindsayH.NowickaM.WeberL. M.ZhouX.Statistical methods for detecting differentially methylated loci and regionsSunD.XiY.RodriguezB.ParkH. J.TongP.MeongM.GoodellM. A.LiW.MOABS: model based analysis of bisulfite sequencing dataStockwellP. A.ChatterjeeA.RodgerE. J.MorisonI. M.DMAP: differential methylation analysis package for RRBS and WGBS dataMengJ.CuiX.LiuH.ZhangL.ZhangS.RaoM. K.ChenY.HuangY.Unveiling the dynamics in RNA epigenetic regulationsProceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM '13)December 2013Shanghai, China13914410.1109/BIBM.2013.6732477ZhangY.LiuT.MeyerC. A.EeckhouteJ.JohnsonD. S.BernsteinB. E.NussbaumC.MyersR. M.BrownM.LiW.ShirleyX. S.Model-based analysis of ChIP-Seq (MACS)SeifertM.CortijoS.Colomé-TatchéM.JohannesF.RoudierF.ColotV.MeDIP-HMM: genome-wide identification of distinct DNA methylation states from high-density tiling arraysSeifertM.KeilwagenJ.StrickertM.GrosseI.Utilizing gene pair orientations for HMM-based analysis of promoter array ChIP-chip dataXuH.WeiC.-L.LinF.SungW.-K.An HMM approach to genome-wide identification of differential histone modification sites from ChIP-seq dataKrishnamoorthyK.ThomsonJ.A more powerful test for comparing two Poisson meansPrzyborowskiJ.WilenskiH.Homogeneity of results in testing samples from Poisson series with an application to testing clover seed for dodderBeckerC.HagmannJ.MüllerJ.KoenigD.StegleO.BorgwardtK.WeigelD.Spontaneous epigenetic variation in the Arabidopsis thaliana methylomeLiY.ZhuJ.TianG.The DNA methylome of human peripheral blood mononuclear cellsListerR.PelizzolaM.DowenR. H.HawkinsR. D.HonG.Tonti-FilippiniJ.NeryJ. R.LeeL.YeZ.NgoQ.-M.EdsallL.Antosiewicz-BourgetJ.StewartR.RuottiV.MillarA. H.ThomsonJ. A.RenB.EckerJ. R.Human DNA methylomes at base resolution show widespread epigenomic differencesListerR.PelizzolaM.KidaY. S.HawkinsR. D.NeryJ. R.HonG.Antosiewicz-BourgetJ.OgmalleyR.CastanonR.KlugmanS.DownesM.YuR.StewartR.RenB.ThomsonJ. A.EvansR. M.EckerJ. R.Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cellsBaldiP.ChauvinY.Smooth on-line learning algorithms for hidden Markov modelsDempsterA. P.LairdN. M.RubinD. B.Maximum likelihood from incomplete data via the EM algorithmPoritzA.Hidden Markov models: a guided tour1Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '88)April 1988New York, NY, USA71310.1109/ICASSP.1988.196495WangX.LuZ.GomezA.HonG. C.YueY.HanD.FuY.ParisienM.DaiQ.JiaG.RenB.PanT.HeC.N^{6}-methyladenosine-dependent regulation of messenger RNA stabilityPingX.-L.SunB.-F.WangL.XiaoW.YangX.WangW.-J.AdhikariS.ShiY.LvY.ChenY.-S.ZhaoX.LiA.YangY.DahalU.LouX.-M.LiuX.HuangJ.YuanW.-P.ZhuX.-F.ChengT.ZhaoY.-L.WangX.DanielsenJ. M. R.LiuF.YangY.-G.Mammalian WTAP is a regulatory subunit of the RNA N6-methyladenosine methyltransferaseKimD.PerteaG.TrapnellC.PimentelH.KelleyR.SalzbergS. L.TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusionsKarolchikD.BarberG. P.CasperJ.ClawsonH.ClineM. S.DiekhansM.DreszerT. R.FujitaP. A.GuruvadooL.HaeusslerM.HarteR. A.HeitnerS.HinrichsA. S.LearnedK.LeeB. T.LiC. H.RaneyB. J.RheadB.RosenbloomK. R.SloanC. A.SpeirM. L.ZweigA. S.HausslerD.KuhnR. M.KentW. J.The UCSC genome browser database: 2014 update