BioMed Research International (BMRI)

ISSN: 2314-6141, 2314-6133

Hindawi Publishing Corporation

DOI: 10.1155/2015/852070

Article ID 852070

Research Article

Spatially Enhanced Differential RNA Methylation Analysis from Affinity-Based Sequencing Data with Hidden Markov Model

Yu-Chen Zhang ¹, Shao-Wu Zhang ¹, Lian Liu ¹, Hui Liu ², Lin Zhang ², Xiaodong Cui ³, Yufei Huang ³, Jia Meng ⁴

Academic Editor: Fang-Xiang Wu

¹ Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi’an 710072, China (nwpu.edu.cn)

² School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou 221116, China (cumt.edu.cn)

³ Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, TX 78249, USA (utsa.edu)

⁴ XJTLU-WTNC Research Institute, Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou 215123, China (xjtlu.edu.cn)

2015; dates on record: 12 02 2015, 25 03 2015, 282015

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

With the development of new sequencing technology, the entire N6-methyl-adenosine (m⁶A) RNA methylome can now be profiled in an unbiased manner with the methylated RNA immunoprecipitation sequencing technique (MeRIP-Seq), making it possible to detect differential methylation states of RNA between two conditions, for example, between normal and cancerous tissue. However, as an affinity-based method, MeRIP-Seq does not yet provide base-pair resolution; that is, a single methylation site determined from MeRIP-Seq data can in practice contain multiple RNA methylation residues, some of which can be regulated by different enzymes and thus be differentially methylated between two conditions. Since existing peak-based methods cannot effectively differentiate multiple methylation residues located within a single methylation site, we propose a hidden Markov model (HMM) based approach to address this issue. Specifically, the detected RNA methylation site is further divided into multiple adjacent small bins and then scanned at higher resolution, using a hidden Markov model to capture the dependency between spatially adjacent bins for improved accuracy. We tested the proposed algorithm on both simulated and real data. The results suggest that the proposed algorithm clearly outperforms the existing peak-based approach on simulated systems and detects differential methylation regions with higher statistical significance on a real dataset.

1. Introduction

Although the presence of posttranscriptional biochemical modifications to RNA was established in the 1960s [1], due to historical limitations, RNA epigenetics remained largely uncharted territory until recently [2–4]. In 2012, a powerful sequencing protocol, methylated RNA immunoprecipitation sequencing (MeRIP-Seq or m⁶A-Seq), was developed [5, 6], in which fragmented mRNA carrying N6-methyl-adenosine (m⁶A) is pulled down with an anti-m⁶A antibody and then purified and sequenced to generate the so-called “IP sample” for profiling the transcriptome-wide RNA m⁶A methylome. Very often, a paired “input sample” is generated as well from all the RNA to measure the entire transcriptome background (please refer to [7] for a more comprehensive protocol of this approach). This technique has recently facilitated a number of research findings, including the role of RNA methylation in controlling the circadian clock [8], addiction [9], and stem cells [10], among others [2, 3, 5, 6, 8–16]. It has also enabled the construction of a mammalian RNA methylation database [17] and systems biology approaches for decomposing the RNA methylome to unveil the latent enzymatic regulators of the epitranscriptome [18]. User-friendly software tools for RNA methylation site detection [19, 20] and for differential RNA methylation analysis [21] from MeRIP-Seq data are now available. Nevertheless, as a newly arising technique, MeRIP-Seq still poses computational challenges that call for novel and sophisticated approaches.

Differential methylation analysis is of crucial importance for epigenetics research. Differentially methylated regions (DMRs), that is, regions that exhibit different methylation levels between two experimental conditions, for example, normal and cancerous, can be as small as a single base or as large as an entire gene locus, depending on the biological question of interest and the bioinformatics methods used for their identification [22]. Differential methylation analysis from MeRIP-Seq seeks to identify the differences in the RNA methylome in a case-control study (e.g., cancerous versus normal), which usually involves at least four high-throughput sequencing (HTS) samples: the IP and input samples under both the case and control conditions. For affinity-based methods developed for DNA epigenetics (such as MeDIP-Seq and ChIP-Seq), since the absolute amount of DNA most likely stays unchanged between two conditions, the percentage of modified DNA molecules is linearly correlated with the absolute amount; thus the difference in methylation is consistent whether measured in relative (percentage) or absolute terms. However, in MeRIP-Seq, because the transcriptional expression level changes between two conditions, it is possible that while the absolute amount of methylated RNA increases, the relative amount (percentage of methylated RNA) decreases, as shown in Figure 1. From a computational perspective, the differential methylation analysis of RNA is therefore quite different from that of DNA, and DNA differential methylation approaches [23], such as MOABS [24] and DMAP [25], may not be directly applicable to RNA. To date, few methods for the differential analysis of MeRIP-Seq data have appeared in the literature. exomePeak [19, 21] was developed specifically for differential RNA methylation analysis from MeRIP-Seq data. Its detection of DMRs is based on rhtest [26], an extended version of the hypergeometric test that computes the statistical significance of the difference in the percentages of methylated fragments between the two conditions, which directly indicates the difference in enzymatic regulation. Before the detection of DMRs, peaks (methylated regions) are first called from the transcriptome by comparing the IP with the input sample for relative enrichment [7, 19, 27]. Only with the detected methylation sites can we effectively estimate the methylation level.
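The difference between an absolute and a relative change can be made concrete with a toy calculation. The following is a minimal sketch in Python: the read counts are hypothetical numbers chosen only for illustration (they are not data from this study), equal library sizes are assumed, and `methylated_fraction` is an illustrative helper, not a function of exomePeak.

```python
def methylated_fraction(ip_reads, input_reads):
    """Crude relative methylation level: IP reads as a share of IP + input reads.

    Assumption: all libraries have the same sequencing depth, so raw counts
    can be compared without normalization.
    """
    return ip_reads / (ip_reads + input_reads)


# Hypothetical counts at one site (illustrative only).
control = {"ip": 100, "input": 100}
cancer = {"ip": 150, "input": 450}  # expression rises 4.5-fold, IP signal only 1.5-fold

absolute_change = cancer["ip"] - control["ip"]
relative_control = methylated_fraction(control["ip"], control["input"])
relative_cancer = methylated_fraction(cancer["ip"], cancer["input"])

print("absolute change in IP reads:", absolute_change)
print("relative level, control:", relative_control)
print("relative level, cancer:", relative_cancer)
```

In this toy case the IP signal grows while the methylated share shrinks, which is the hypomethylation scenario that a comparison of raw IP counts alone would miss.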

Figure 1

Comparison of the differential methylation analysis in DNA and RNA. The first column shows the DNA related differential analysis in ChIP-Seq or MeDIP-Seq, where the total DNA is often considered the same under two experimental conditions, so the differential analysis can be performed by directly comparing the absolute amount of methylated DNA in the two IP samples. In contrast, for RNA (the second column), the background is total RNA, which can vary significantly under different conditions, and therefore, the absolute amount of methylated RNA for a specific site does not necessarily correlate with the degree of methylation. In the example shown in the figure, while the amount of methylated RNA increases under the cancer condition, the relative amount (percentage of methylated RNA) decreases, indicating hypomethylation at the RNA level. As a result, the differential analysis of the RNA methylome in MeRIP-Seq should be performed by comparing the percentages of methylated RNA to reflect the influence of methylation enzymatic regulation.

Affinity-based approaches cannot provide single-base resolution. Since multiple RNA methylation residues may be located in proximity and cannot be effectively differentiated by the peak calling procedure, they can appear as a single broad methylation site in the peak calling result from MACS [27] or exomePeak [19]. In many cases, this discrepancy is trivial and does not significantly affect the relevant study; however, it can be disastrous in differential methylation analysis, because multiple RNA methylation residues can be regulated by different enzyme complexes and thus may be differentially methylated. Failing to identify the precise location of each methylation residue can lead to large bias in the estimation of its methylation level and in the comparison with a different condition. Currently, all existing methods for RNA differential methylation analysis from MeRIP-Seq data are peak-based. In this paper, based on the rhtest method developed in the exomePeak package [21], we propose FET-HMM, a novel strategy for spatially enhanced differential RNA methylation analysis using a hidden Markov model (HMM). When applied to an RNA methylation site detected by a peak calling algorithm, FET-HMM breaks the site into multiple adjacent small bins and evaluates whether each bin is differentially methylated between two experimental conditions, with spatial dependency incorporated by the HMM. Figure 2 compares the existing method with ours.

Figure 2

Comparison of differential methylation analysis methods. This figure shows the difference between the existing peak-based differential analysis method and the proposed method. Starting from aligned reads, the left part of the figure shows how exomePeak conducts differential analysis: it first identifies a single methylation site and then decides whether the methylation site as a whole is differentially methylated or not. In contrast, the proposed method splits the testing region into multiple adjacent small bins and then integrates their dependency with an HMM for more accurate identification of differential methylation sites. In the example shown, the RNA methylation site detected using the exomePeak method may consist of two methylation residues, and only the one on the right side is differentially methylated in this case-control study. The proposed FET-HMM method is likely to work better than the peak-based exomePeak method under this scenario.

HMM is a statistical model that integrates multiple random processes and has been widely used in DNA-templated epigenetic analysis and in RNA methylation site detection (or peak calling) [28–30], but so far it has not been applied to RNA differential methylation analysis. We applied the newly developed approach FET-HMM to both simulated and real datasets. The results on simulated data showed that FET-HMM can effectively improve the performance of rhtest in terms of the area under the curve (AUC) when detecting differential methylation sites. When applied to human MeRIP-Seq datasets, the FET-HMM method returns more biologically meaningful results than the exomePeak method. The FET-HMM algorithm has been implemented in an open source R package for differential methylation analysis from MeRIP-Seq data and is freely available from GitHub. The method is detailed in the following section.

2. Methods

In this section, we first review the usage of rhtest, a modified version of Fisher’s exact test (FET), for differential RNA methylation analysis and then introduce the spatially enhanced approach FET-HMM.

2.1. Peak-Based Differential RNA Methylation Analysis with Rhtest

To conduct differential RNA methylation analysis in a case-control study, four samples are needed, that is, the IP and input samples from both groups. Consider a number of RNA methylation sites detected with peak calling approaches [19, 20, 27] from MeRIP-Seq. We can assume that the number of reads within the $g$th RNA methylation site follows the Poisson distribution, with

$$X_{0,g} \sim \mathrm{Poisson}(N_0 \lambda_{0,g}), \quad X_{1,g} \sim \mathrm{Poisson}(N_1 \lambda_{1,g}), \quad Y_{0,g} \sim \mathrm{Poisson}(M_0 \bar{\lambda}_{0,g}), \quad Y_{1,g} \sim \mathrm{Poisson}(M_1 \bar{\lambda}_{1,g}), \tag{1}$$

where $X_{0,g}$ and $X_{1,g}$ are the reads counts of the input samples for the untreated and treated conditions and, correspondingly, $Y_{0,g}$ and $Y_{1,g}$ are the reads counts of the IP samples for the untreated and treated conditions. Here, $g = 1, 2, \ldots, G$ indicates the $g$th RNA methylation site. $N_0, N_1, M_0, M_1$ are the sizes (or the sequencing depths) of the respective libraries, and the parameters $\lambda_{0,g}, \lambda_{1,g}, \bar{\lambda}_{0,g}, \bar{\lambda}_{1,g}$ are the normalized Poisson means in a standard library, indicating the expectation of the reads counts within a site. Following the formulation from a previous study [26], we assume that $\bar{\lambda}_{0,g}$ and $\bar{\lambda}_{1,g}$ satisfy $\bar{\lambda}_{0,g} = \lambda_{0,g}\eta_{0,g}/f_0$ and $\bar{\lambda}_{1,g} = \lambda_{1,g}\eta_{1,g}/f_1$, where $f_0$ and $f_1$ indicate the percentage of the expressed RNA fragments that are modified in the untreated and treated samples, respectively, and $\eta_{0,g}$ and $\eta_{1,g}$ indicate the percentage of RNA fragments mapped inside the RNA methylation site that carry the methylation mark. We would like to test whether $\eta_{1,g} = \eta_{0,g}$. According to the properties of the Poisson distribution [31, 32], given $X_{0,g} + Y_{0,g} = t_{0,g}$ and $X_{1,g} + Y_{1,g} = t_{1,g}$, we have $X_{0,g} \sim \mathrm{Binomial}(p_{0,g}, t_{0,g})$ and $X_{1,g} \sim \mathrm{Binomial}(p_{1,g}, t_{1,g})$, where $p_{1,g} = N_1 f_1/(N_1 f_1 + M_1 \eta_{1,g})$ and $p_{0,g} = N_0 f_0/(N_0 f_0 + M_0 \eta_{0,g})$. If we assume that, across experimental conditions, the total amount of modification remains the same and only its distribution may change, then $f_0 = f_1 = f$. We also notice that if $N_1 M_0 = N_0 M_1$, then $\eta_{1,g} = \eta_{0,g} \Leftrightarrow p_{1,g} = p_{0,g}$, and testing whether the two Binomial distributions have the same success rate is equivalent to the classical problem of testing independence in a 2 × 2 contingency table. In order to establish $N_1 M_0 = N_0 M_1$, only one of the 4 samples needs to be rescaled. When $N_1 M_0 = N_0 M_1$ is achieved after rescaling, under the null hypothesis $p_{1,g} = p_{0,g}$, $X_{0,g}$ follows a hypergeometric distribution as in (2), and we may use the two-tailed Fisher’s exact test [33–36] to evaluate its significance. Consider

$$P(X_{0,g} = k) = \mathrm{Hyper}(k \mid K, n, N) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}, \tag{2}$$

where $N = t_{0,g} + t_{1,g} = x_{0,g} + x_{1,g} + y_{0,g} + y_{1,g}$, $n = t_{0,g} = x_{0,g} + y_{0,g}$, and $K = x_{0,g} + x_{1,g}$. The smaller the p value is, the more likely the $g$th RNA methylation site is differentially methylated between the two conditions.
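The test in (2) can be sketched in a few lines of Python. This is a minimal illustration and not the rhtest implementation from exomePeak: the function names are hypothetical, the two-sided p value uses the common convention of summing all table probabilities no larger than that of the observed table, and the rescaling rule in `rescale_count` is an assumption about how one sample could be adjusted so that the depth condition holds.

```python
from math import comb


def hypergeom_pmf(k, K, n, N):
    """P(X = k) as in Eq. (2): n draws from N reads, K of which are input reads."""
    if k < max(0, n - (N - K)) or k > min(K, n):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def rescale_count(y1, N0, N1, M0, M1):
    """Rescale the treated IP count so that N1*M0 == N0*M1 holds afterwards.

    Assumption: the count is multiplied by the ratio of the target depth
    (N1*M0/N0) to the actual depth M1 and then rounded.
    """
    return round(y1 * (N1 * M0) / (N0 * M1))


def fisher_two_sided(x0, y0, x1, y1):
    """Two-sided Fisher's exact test for the 2x2 table [[x0, y0], [x1, y1]]."""
    N = x0 + y0 + x1 + y1  # all reads at the site
    n = x0 + y0            # reads from the untreated condition
    K = x0 + x1            # reads from the two input samples
    p_obs = hypergeom_pmf(x0, K, n, N)
    lo, hi = max(0, n - (N - K)), min(K, n)
    total = 0.0
    for k in range(lo, hi + 1):
        p = hypergeom_pmf(k, K, n, N)
        # Keep every table that is at least as unlikely as the observed one.
        if p <= p_obs * (1 + 1e-9):
            total += p
    return min(1.0, total)


# Hypothetical counts: input/IP reads for untreated (x0, y0) and treated (x1, y1).
p_value = fisher_two_sided(3, 1, 1, 3)
print(p_value)
```

Exact integer binomial coefficients keep the sketch simple; for sites with very large counts a log-space formulation would be the more practical choice.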

2.2. Spatially Enhanced Differential RNA Methylation Analysis with FET-HMM

The method described in the previous section cannot effectively discriminate multiple RNA methylation residues located within a single RNA methylation site (as shown in Figure 2). We seek to enhance the spatial resolution with a hidden Markov model. Similar to the formulation above, for a particular RNA methylation site, we first divide it into $N$ contiguous bins of length $L$. We can then still assume that the number of reads within the $n$th bin follows the Poisson distribution, with

$$X_{0,n} \sim \mathrm{Poisson}(N_0 \lambda_{0,n}), \quad X_{1,n} \sim \mathrm{Poisson}(N_1 \lambda_{1,n}), \quad Y_{0,n} \sim \mathrm{Poisson}(M_0 \bar{\lambda}_{0,n}), \quad Y_{1,n} \sim \mathrm{Poisson}(M_1 \bar{\lambda}_{1,n}), \tag{3}$$

where $X_{0,n}$ and $X_{1,n}$ are the reads counts of the input samples for the untreated and treated conditions and, correspondingly, $Y_{0,n}$ and $Y_{1,n}$ are the reads counts of the IP samples for the untreated and treated conditions. Here, $n = 1, 2, \ldots, N$ indicates the $n$th bin. The parameters $\lambda_{0,n}, \lambda_{1,n}, \bar{\lambda}_{0,n}, \bar{\lambda}_{1,n}$ are the normalized Poisson means in a standard library, indicating the expectation of the reads counts within a bin. Following the formulation from a previous study [26], we assume that $\bar{\lambda}_{0,n}$ and $\bar{\lambda}_{1,n}$ satisfy $\bar{\lambda}_{0,n} = \lambda_{0,n}\eta_{0,n}/f_0$ and $\bar{\lambda}_{1,n} = \lambda_{1,n}\eta_{1,n}/f_1$, where $f_0$ and $f_1$ indicate the percentage of the expressed RNA fragments that are modified in the untreated and treated samples, respectively, and $\eta_{0,n}$ and $\eta_{1,n}$ indicate the percentage of RNA fragments mapped inside the bin that carry the methylation mark. We can easily test whether $\eta_{1,n} = \eta_{0,n}$ (that is, whether differential methylation is observed) for a specific bin; however, we should not neglect the dependencies between the reads counts of adjacent bins within an RNA methylation site; that is, if differential methylation is observed on a specific bin, it is likely that differential methylation can also be observed on bins adjacent to it, and vice versa. This dependency can be effectively incorporated with an HMM formulation, and we thus developed a new strategy for the identification of differential methylation regions (DMRs) with improved spatial resolution.

To begin with, the hidden true states of differential methylation over the bins are denoted as $S = \{s_1, s_2, \ldots, s_N\}$, where $s_n \in \{0, 1\}$, with 1 representing the differential methylation state (DMS) and 0 otherwise. Considering that a differential methylation region may span multiple adjacent bins, we assume that the true hidden DMS $S$ follows a first-order Markov chain, whose transition matrix $A$ contains entries defined as

$$A_{ij} = P(s_{n+1} = j \mid s_n = i), \quad i, j \in \{0, 1\}, \tag{4}$$

where $A_{ij}$ denotes the probability of the hidden variable switching from DMS $i$ at the $n$th bin to DMS $j$ at the $(n+1)$th bin. In addition, the initial probabilities are $P(s_1 = 0) = u$ and $P(s_1 = 1) = 1 - u$, which can be denoted as $\pi = (u, 1 - u)$. Next, the result of rhtest [21, 26] is used as the observed variable of the HMM. However, the information acquired from rhtest is the statistical significance of differential methylation in terms of p values and FDRs (false discovery rates). We seek to enhance the differential methylation results by incorporating spatial dependency. Specifically, 3 different strategies are developed for this purpose, each with its own advantages and disadvantages, which are detailed in the following.

2.3. FHB Strategy: Combine Fisher’s Exact Test and HMM with Binary Observation

In the FHB strategy, we use the binary decisions received from FET as the observations of the hidden Markov model. The model essentially evaluates how likely a true differential methylation state is to be detected by FET or, if FET reports a DMS at a given significance level, how likely it is to be true after incorporating spatial dependency. We assume that a state is correctly observed with probability $p$ and that a mistake happens with probability $1 - p$. Since the observation from FET is treated as binary, a cut-off threshold is applied to the FDR (false discovery rate) value to generate the set of observed variables $O = (o_1, o_2, \ldots, o_N)$ with $o_n \in \{0, 1\}$. Then, according to the standard HMM definition, these probabilities form an emission matrix $B$, whose entries are defined as

$$B_{ij} = P(o_n = j \mid s_n = i) = \begin{cases} p, & i = j, \\ 1 - p, & i \neq j, \end{cases} \quad i, j \in \{0, 1\}. \tag{5}$$

The detailed structure of the HMM is shown in Figure 3.
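The conversion from FDR values to binary observations and the emission matrix in (5) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the cut-off of 0.05 and the correct-observation probability of 0.9 are example values only, and treating an FDR equal to the cut-off as significant is a choice made here rather than something specified in the text.

```python
def binarize_fdr(fdr_values, cutoff=0.05):
    """Map per-bin FDR values to binary observations o_n (1 = called differential).

    Assumption: an FDR exactly equal to the cut-off counts as significant.
    """
    return [1 if q <= cutoff else 0 for q in fdr_values]


def emission_matrix(p_correct):
    """Eq. (5): B[i][j] = P(o_n = j | s_n = i) for hidden state i and observation j."""
    return [[p_correct, 1.0 - p_correct],
            [1.0 - p_correct, p_correct]]


# Hypothetical FDR values for six adjacent bins of one methylation site.
fdr = [0.40, 0.03, 0.01, 0.20, 0.02, 0.60]
observations = binarize_fdr(fdr)
B = emission_matrix(0.9)
print(observations)
print(B)
```

The fourth bin illustrates why spatial smoothing may help: it is not called on its own, yet it sits between bins that are.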

Figure 3

Hidden Markov model. In FHB strategy, the “observation” is a binary status reported from FET, and the emission probability is Bernoulli distribution.

Finally, we applied the widely used Baum-Welch algorithm [37–39] to estimate the unknown parameters of the HMM. The Baum-Welch algorithm applies the well-known Expectation-Maximization (EM) strategy to carry out the estimation. The implementation steps of the Baum-Welch algorithm are as follows.

The Proposed Algorithm

(1) Initialization. Initialize $A_{ij}$, $\pi_i$, and $B_{ij}$ randomly, subject to the constraints of probability, to obtain the initial model parameters $\lambda^{(0)} = (\pi^{(0)}, A^{(0)}, B^{(0)})$.

(2) EM Steps

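E Step. Let $\gamma_n(i)$ denote the probability of the hidden DMS being $i$ at the $n$th bin, and let $\xi_n(i,j)$ denote the probability of the hidden DMS being $i$ at the $n$th bin and $j$ at the $(n+1)$th bin. Also, let $ts_{ik}$, $k \in \{0, 1\}$, denote the number of transitions from DMS $i$ to any DMS $k$, and let $ts_{ij}$ denote the number of transitions from DMS $i$ to DMS $j$. $\gamma_n(i)$ and $\xi_n(i,j)$ can be computed through (6) and (7), and the expectations of $ts_{ik}$ and $ts_{ij}$ can be calculated by (8) and (9). $\lambda^{(m)} = (\pi^{(m)}, A^{(m)}, B^{(m)})$ represents the parameters of the HMM after the $m$th iteration. Consider

$$\gamma_n(i) = P(s_n = i \mid O, \lambda^{(m)}) = \frac{P(s_n = i, O \mid \lambda^{(m)})}{P(O \mid \lambda^{(m)})}, \tag{6}$$

$$\xi_n(i,j) = P(s_n = i, s_{n+1} = j \mid O, \lambda^{(m)}) = \frac{P(s_n = i, s_{n+1} = j, O \mid \lambda^{(m)})}{P(O \mid \lambda^{(m)})}, \tag{7}$$

$$E(ts_{ik}) = \sum_{n=1}^{N-1} \gamma_n(i), \tag{8}$$

$$E(ts_{ij}) = \sum_{n=1}^{N-1} \xi_n(i,j). \tag{9}$$

M Step. After using (10), (11), and (12) to estimate $\pi_i$, $A_{ij}$, and $B_{ik}$, we get $\lambda^{(m+1)}$. One has

$$\pi_i^{(m+1)} = \gamma_1(i), \tag{10}$$

$$A_{ij}^{(m+1)} = \frac{E(ts_{ij})}{E(ts_{ik})} = \frac{\sum_{n=1}^{N-1} \xi_n(i,j)}{\sum_{n=1}^{N-1} \gamma_n(i)}, \tag{11}$$

$$B_{ik}^{(m+1)} = \frac{\sum_{n=1}^{N} \gamma_n(i)\, I\{o_n = k\}}{\sum_{n=1}^{N} \gamma_n(i)}. \tag{12}$$

In (12),

$$I\{o_n = k\} = \begin{cases} 1, & o_n = k, \\ 0, & o_n \neq k \end{cases} \tag{13}$$

is the indicator function.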

(3) Loop. Repeat the EM steps until $A_{ij}$, $\pi_i$, and $B_{ij}$ converge. After the procedures above, the optimal model parameters $\lambda^{(op)}$ are obtained. Let $u_{nk} = 1$ if we are absolutely sure that $s_n = k$ and $u_{nk} = 0$ otherwise. What we focus on is the final expectation of $u_{nk}$, $k \in \{0, 1\}$, which can be calculated as

$$E(u_{nk} \mid O, \lambda^{(op)}) = P(s_n = k \mid O, \lambda^{(op)}). \tag{14}$$

Then we could obtain the posterior probability of a bin being at a specific state, and the performance of FET-HMM can be compared with that of exomePeak on simulated dataset when the true state is available.
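The EM steps above can be sketched for a two-state HMM with binary observations. This is a minimal sketch and not the code of the released R package: it uses a standard scaled forward-backward pass to obtain the quantities in (6) and (7), runs a fixed number of iterations instead of a convergence check, and starts from hand-picked rather than random parameters; the function names and the toy observation sequence are illustrative assumptions.

```python
def forward_backward(lik, pi, A):
    """Scaled forward-backward pass; lik[n][i] = P(o_n | s_n = i).

    Returns gamma (Eq. 6) and xi (Eq. 7).
    """
    N = len(lik)
    alpha, scale = [], []
    cur = [pi[i] * lik[0][i] for i in range(2)]
    for n in range(N):
        if n > 0:
            cur = [lik[n][j] * sum(alpha[n - 1][i] * A[i][j] for i in range(2))
                   for j in range(2)]
        c = sum(cur)
        scale.append(c)
        alpha.append([a / c for a in cur])
    beta = [[1.0, 1.0] for _ in range(N)]
    for n in range(N - 2, -1, -1):
        beta[n] = [sum(A[i][j] * lik[n + 1][j] * beta[n + 1][j] for j in range(2))
                   / scale[n + 1] for i in range(2)]
    gamma = [[alpha[n][i] * beta[n][i] for i in range(2)] for n in range(N)]
    xi = [[[alpha[n][i] * A[i][j] * lik[n + 1][j] * beta[n + 1][j] / scale[n + 1]
            for j in range(2)] for i in range(2)] for n in range(N - 1)]
    return gamma, xi


def baum_welch(obs, pi, A, B, n_iter=20):
    """Apply updates (10)-(12) n_iter times; returns parameters and final gamma."""
    for _ in range(n_iter):
        lik = [[B[i][o] for i in range(2)] for o in obs]
        gamma, xi = forward_backward(lik, pi, A)
        pi = gamma[0][:]                                            # Eq. (10)
        A = [[sum(x[i][j] for x in xi) / sum(g[i] for g in gamma[:-1])
              for j in range(2)] for i in range(2)]                 # Eq. (11)
        B = [[sum(g[i] for g, o in zip(gamma, obs) if o == k)
              / sum(g[i] for g in gamma)
              for k in range(2)] for i in range(2)]                 # Eq. (12)
    lik = [[B[i][o] for i in range(2)] for o in obs]
    gamma, _ = forward_backward(lik, pi, A)
    return pi, A, B, gamma


# Toy binary FET calls: a differential region with one missed bin (index 30).
obs = [0] * 20 + [1] * 20 + [0] * 20
obs[30] = 0
start = [[0.9, 0.1], [0.1, 0.9]]
pi, A, B, gamma = baum_welch(obs, [0.5, 0.5], start, start)
posterior_dms = [g[1] for g in gamma]  # Eq. (14) with k = 1
```

Here `posterior_dms[n]` plays the role of the expectation in (14) for state 1; initializing `B` close to the identity is one simple way to keep state 1 aligned with the differential methylation state.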

2.4. FHC Strategy: Combine Fisher’s Exact Test and HMM with Continuous Observation

In the FHB strategy, we adopt a cut-off threshold to convert the statistical significance (the p value from differential analysis with rhtest) into binary states that serve as the observations of the HMM. This strategy has limitations. Firstly, it is hard to find the most reasonable threshold for a dataset, and different thresholds can lead to different results. Secondly, some information is lost in the conversion from p value to binary state; for example, both p values 0.01 and 0.001 are converted to DMS state 1 after a binary conversion with significance level 0.05; however, the former is less confident. In addition, the Bernoulli distribution may not be the most suitable distribution for the emission probability of the observed variable. Therefore, a strategy that directly smooths the continuous statistical significance without binary conversion may be superior. For this purpose, we use the p value from FET to approximate the likelihood of a bin under DMS state 0 and (1 - p value) for its likelihood under DMS state 1. The p values generated from FET can thus be used to estimate the emission probability of the HMM directly and are then passed to the HMM for smoothing. The emission matrix can be written as

$$B^{II} = \begin{pmatrix} \text{p value}_1 & 1 - \text{p value}_1 \\ \text{p value}_2 & 1 - \text{p value}_2 \\ \vdots & \vdots \\ \text{p value}_N & 1 - \text{p value}_N \end{pmatrix}. \tag{15}$$

After obtaining the matrix $B^{II}$ of size $N$ by 2 constructed from the FET p values, the Baum-Welch algorithm introduced for FHB can be applied to spatially enhance the local result, with formula (12) omitted because the matrix $B^{II}$ does not need to be reestimated at every iteration. Please note that using p values to approximate the probability matrix $B^{II}$ directly helps to avoid the binary conversion and the associated information loss, and we will show in the Result section that this indeed improves the performance of the algorithm.
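The FHC idea can be sketched by feeding the rows of $B^{II}$ directly into a forward-backward pass and reestimating only $\pi$ and $A$. This is a minimal sketch under stated assumptions, not the released implementation: p values are clamped away from 0 and 1 to avoid zero likelihoods (the `eps` parameter is an addition made here), a fixed number of iterations replaces a convergence check, and the function names and p values are hypothetical.

```python
def forward_backward(lik, pi, A):
    """Scaled forward-backward pass; lik[n][i] is the likelihood of bin n in state i."""
    N = len(lik)
    alpha, scale = [], []
    cur = [pi[i] * lik[0][i] for i in range(2)]
    for n in range(N):
        if n > 0:
            cur = [lik[n][j] * sum(alpha[n - 1][i] * A[i][j] for i in range(2))
                   for j in range(2)]
        c = sum(cur)
        scale.append(c)
        alpha.append([a / c for a in cur])
    beta = [[1.0, 1.0] for _ in range(N)]
    for n in range(N - 2, -1, -1):
        beta[n] = [sum(A[i][j] * lik[n + 1][j] * beta[n + 1][j] for j in range(2))
                   / scale[n + 1] for i in range(2)]
    gamma = [[alpha[n][i] * beta[n][i] for i in range(2)] for n in range(N)]
    xi = [[[alpha[n][i] * A[i][j] * lik[n + 1][j] * beta[n + 1][j] / scale[n + 1]
            for j in range(2)] for i in range(2)] for n in range(N - 1)]
    return gamma, xi


def fhc_posterior(p_values, pi, A, n_iter=20, eps=1e-6):
    """Smooth FET p values with an HMM whose emissions are fixed to B^II (Eq. 15)."""
    clamped = [min(max(p, eps), 1.0 - eps) for p in p_values]
    lik = [[p, 1.0 - p] for p in clamped]       # row n of B^II
    for _ in range(n_iter):
        gamma, xi = forward_backward(lik, pi, A)
        pi = gamma[0][:]                        # Eq. (10)
        A = [[sum(x[i][j] for x in xi) / sum(g[i] for g in gamma[:-1])
              for j in range(2)] for i in range(2)]   # Eq. (11); Eq. (12) is skipped
    gamma, _ = forward_backward(lik, pi, A)
    return [g[1] for g in gamma], pi, A


# Hypothetical p values: a differential stretch with one weak bin (index 12).
p_values = [0.8] * 10 + [0.01, 0.02, 0.6, 0.01, 0.03] + [0.9] * 10
posterior_dms, pi_hat, A_hat = fhc_posterior(
    p_values, [0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]])
```

Because no threshold is applied, a bin with p value 0.001 contributes stronger evidence than one with 0.01, which is the information that the binary conversion discards.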

2.5. FastFHC Strategy: A High-Efficiency Strategy for Applying FET-HMM on Big Omics Data

When the proposed method is applied to a real MeRIP-Seq dataset, two problems emerge. First, some reads are mapped to very short genes, so the number of bins can be quite small. In other words, some Markov chains are too short for accurate parameter estimation, which in turn affects the DMR detection results. Second, computation time is another important factor to take into consideration. Take the human hg19 data we test as an example: if more than 30000 RNA methylation sites are detected in total, the Baum-Welch algorithm has to be run more than 30000 times, and the execution time may be too long. To address these two limitations, we combine the two strategies. Firstly, the threshold used in FHB is used again to convert the FDR into binary DMS. Then we estimate the initial probability $\pi^{III}$ and the transition matrix $A^{III}$ directly from this DMS information as shown in

$$\pi^{III} = \left(1 - \frac{\sum_{i=1}^{N} \mathrm{DMS}_i}{N},\ \frac{\sum_{i=1}^{N} \mathrm{DMS}_i}{N}\right), \qquad A^{III} = \begin{pmatrix} P(S_{n+1} = 0 \mid S_n = 0) & P(S_{n+1} = 1 \mid S_n = 0) \\ P(S_{n+1} = 0 \mid S_n = 1) & P(S_{n+1} = 1 \mid S_n = 1) \end{pmatrix}, \tag{16}$$

where $P(S_{n+1} \mid S_n)$ denotes the conditional probability of the transition from $S_n$ to $S_{n+1}$, which can be conveniently estimated by scanning all the states of differential methylation $S = \{s_1, s_2, \ldots, s_N\}$ on all RNA methylation sites. For every single gene, the emission probability $B^{III}$ has the same form as $B^{II}$ in the FHC strategy. In this way, the $A^{III}$ matrix is estimated in a single step instead of iteratively, which saves computation. The result should also be more robust than the previous strategies on short RNA methylation sites with fewer bins. Secondly, we use the E step of the FHB strategy to compute the final expectation defined in formula (14) for every bin on every RNA methylation site of the real RNA epigenetics data. That is, the FastFHC strategy applies the E step after estimating the transition matrix and initial probability for all genes. $\pi^{III}$ and $A^{III}$ are considered the same on different RNA methylation sites and are estimated from binary-converted observations, as in FHB. Although some information can be lost in the conversion step, since tens of thousands of RNA methylation sites are pooled together for the estimation of $\pi^{III}$ and $A^{III}$, the estimates should still be relatively accurate. The 3 strategies are summarized in Figure 4.
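The one-pass estimation in (16) amounts to counting. The following is a minimal sketch in Python with a hypothetical function name and toy input; pooling all bins of all sites into $\pi^{III}$ and falling back to a uniform row when a state has no outgoing transition are assumptions made here for illustration.

```python
def estimate_pi_and_A(dms_per_site):
    """Estimate pi^III and A^III (Eq. 16) from binary DMS calls pooled over sites.

    dms_per_site: one list of 0/1 calls per RNA methylation site, in bin order.
    """
    n_bins = sum(len(states) for states in dms_per_site)
    n_dms = sum(sum(states) for states in dms_per_site)
    pi = [1.0 - n_dms / n_bins, n_dms / n_bins]

    counts = [[0, 0], [0, 0]]
    for states in dms_per_site:
        # Transitions are counted within a site only, never across site boundaries.
        for prev, nxt in zip(states, states[1:]):
            counts[prev][nxt] += 1

    A = []
    for row in counts:
        total = sum(row)
        # Assumption: use a uniform row if a state has no observed transitions.
        A.append([c / total for c in row] if total else [0.5, 0.5])
    return pi, A


# Toy binary calls for two sites (hypothetical).
sites = [[0, 0, 1, 1], [1, 1, 0]]
pi_iii, A_iii = estimate_pi_and_A(sites)
print(pi_iii)
print(A_iii)
```

With $\pi^{III}$ and $A^{III}$ fixed in this way, a single forward-backward pass per site, using the p-value based emissions, would yield the per-bin posteriors without any iteration.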

Figure 4

Comparison of different strategies. FHB strategy is the most naïve and straightforward; FHC is the most time consuming and performs better than FHB but is less robust. With FastFHC, the algorithm can now be applied to genome scale dataset in a timely and robust manner.

3. Result

3.1. Test on Simulated Data

For MeRIP-Seq, as the ground truth is not available for the differential RNA methylation status in real data, the performance of our proposed method (FHB and FHC strategies) was first validated on simulated datasets. Specifically, the reads counts for the IP and input samples under the two experimental conditions were generated from the model assumptions. In every dataset, 100 RNA methylation sites are generated, each with 1000 adjacent bins. The sequencing depths were all set to 10⁸, and the normalized Poisson mean $\lambda_0$ of the untreated input was set to 10⁻⁶, unless otherwise stated. To simulate differential expression, the reads counts of each gene in both the IP and the input samples also vary within a certain range relative to the untreated condition; we assume that the log2 fold change follows a uniform distribution on [-3,3]. To mimic differential methylation, the log2 odds ratio of the methylation reads counts follows a uniform distribution on [-3,3] for differential methylation bins and is 0 for nondifferential bins. In order to impose dependency between adjacent bins on the simulated data, we applied a fixed HMM to generate the labels used as the hidden DMS of the 1000 adjacent bins, indicating whether a bin is differentially methylated or not. The labels were then used to generate the data and also served as the ground truth for evaluating the performance of the proposed FET-HMM approach. The transition matrix $A_{sim}$ was set as

$$A_{sim} = \begin{pmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{pmatrix} \tag{17}$$

unless otherwise stated, and the initial probability was set to $\pi = (0.5, 0.5)$ due to the lack of prior information. We considered three factors that may affect the performance of the algorithm, that is, the cut-off threshold applied to the FET result for converting FDR (or p values) to the binary observed state (only for FHB), the transition matrix (degree of spatial dependency) used to generate the ground truth, and the sequencing depth (library size) of the data. The area under the receiver operating characteristic curve (AUC) is calculated to evaluate the performance of the proposed algorithms under different settings of the 3 key factors to be tested.
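The label generation and the bin-level AUC evaluation can be sketched as follows. This is a minimal sketch in Python and not the simulation code used for the experiments: read counts are not simulated, the scores are hypothetical noisy values standing in for per-bin posteriors or 1 - p values, and the function names and the random seed are illustrative choices.

```python
import random


def simulate_states(n_bins, A, pi, rng):
    """Draw a hidden DMS path of length n_bins from a two-state Markov chain."""
    states = [0 if rng.random() < pi[0] else 1]
    for _ in range(n_bins - 1):
        prev = states[-1]
        stay = rng.random() < A[prev][prev]
        states.append(prev if stay else 1 - prev)
    return states


def auc(scores, labels):
    """AUC as the probability that a positive bin outranks a negative one.

    Ties count as 0.5; the pairwise form is quadratic but fine for 1000 bins.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


A_sim = [[0.9, 0.1], [0.1, 0.9]]   # Eq. (17)
rng = random.Random(0)             # fixed seed, for reproducibility only
truth = simulate_states(1000, A_sim, [0.5, 0.5], rng)

# Hypothetical noisy scores: differential bins tend to score higher.
scores = [0.5 * s + rng.random() for s in truth]
print(auc(scores, truth))
```

Because neighboring labels are correlated under $A_{sim}$, a method that borrows information from adjacent bins has something to exploit, which is the property this simulation design is meant to provide.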

In the first experiment, we tested the impact of cut-off threshold on the FHB strategy. As shown in Figure 5, although the choice of threshold does affect the performance of the algorithm, by incorporating spatial dependency, the proposed FHB strategy effectively improves the DMRs detection performance under all cut-off thresholds tested.

Figure 5

Boxplot of AUCs for different thresholds applied to convert FDR to the binary state. This figure shows that, across the thresholds tested, FHB outperforms exomePeak in AUC on 100 datasets. exomePeak does not use the cut-off threshold, so its performance remains the same. The performance is evaluated at bin level rather than peak level in all experiments.

In the second experiment, we tested the impact of transition matrix, which indicates the degree of dependency between adjacent observations (bins). As shown in Figure 6, the performance of FHB and FHC strategies heavily relies on the transition matrix setting, which reflects the degree of dependence between adjacent bins; and FHC strategy outperforms FHB and exomePeak under different settings tested.

Figure 6

Boxplot of AUCs for different transition matrices used to generate the ground truth. The performance of FHB and FHC strategies heavily relies on the transition matrix setting, which reflects the degree of dependence between adjacent bins; and FHC strategy outperforms FHB and exomePeak under different settings tested.

The last factor that may affect the simulation results is the sequencing depth (the total number of reads). In our simulation, the sequencing depths (SD) of the four samples varied from 10⁹ to 10⁶. From Figure 7, we can see that the performances of FHB, FHC, and exomePeak are all satisfactory when sequencing depth is high enough (SD = 10⁹); their performance all decreases together with the sequencing depth. Among the 3 methods tested, FHC gives the best performance and the advantage of FET-HMM over exomePeak is the most prominent when the sequencing depth is low. When the sequencing depth is very low, none of the 3 approaches can identify DMRs effectively.

Figure 7

Boxplot of AUCs for different sequencing depths. The performance of all 3 approaches decreases together with the sequencing depth. FHC strategy gives the best performance and the advantage of FET-HMM over exomePeak is the most prominent when the data is of mediocre sequencing depth.

We also consider here another scenario of unbalanced sequencing depth; that is, only one of the 4 samples has very large or small sequencing depth, and the results are highly consistent with previous result. As shown in Figure 8, the performance of all 3 approaches decreases as the sequencing depth decreases and FHC strategy outperforms FHB and exomePeak on most settings.

Figure 8

Boxplot of AUCs for different unbalanced sequencing depths. The performance of all 3 approaches decreases as the sequencing depth decreases and FHC strategy outperforms FHB and exomePeak on most settings. In this test, the sequencing depth of IP sample under treated condition varies with that of the other 3 samples unchanged.

In general, the computational complexity of the proposed approaches increases with the number of genes, the length of the genes, and the resolution of the analysis (the size of the bin); and since FHB and FHC require iterative refinement, their computational complexity is also proportional to the number of iterations required to reach convergence. To further evaluate the computational complexity of the 3 strategies, we conducted one additional experiment. In this experiment, we simulated a dataset of 7 genes, each with a different length (50, 100, 150, 200, 250, 300, and 350), with the methylation state transition probability set to 0.95. A total of 10 datasets were generated for evaluation purposes, and the average performance and time consumption were calculated. As can be seen from Table 1, in the simulated setting, FastFHC is comparable to FHB and FHC in performance but much faster, making it a reasonable choice for genome-scale data with more than a few thousand genes.

Table 1

Comparison of different approaches.

Method	AUC	Time
FHB	0.960	4.39 s
FHC	0.987	0.85 s
FastFHC	0.962	0.12 s
exomePeak (rhtest)	0.924	0.02 s

3.2. Test on MeRIP-Seq Data

In order to test our proposed method in real applications, we chose the human MeRIP-Seq data from Hela cells under control and METTL3/METTL14 knockout conditions [40], as shown in Table 2. A previous study shows that METTL3 and METTL14 are components of the RNA methyltransferase complex [40, 41], and we would like to identify their respective targeted RNA methylation sites in the following analysis. The original raw data in SRA format was downloaded directly from Gene Expression Omnibus (GEO) under accession GSE46705; it consists of 8 IP and 8 input MeRIP-Seq replicates obtained under the wild type condition and after METTL3 or METTL14 knockout, respectively (a total of 16 libraries). The short sequencing reads are first aligned to the human genome assembly hg19 with TopHat2 [42], and then samples of the same type obtained under the same condition are merged for differential RNA methylation analysis.

Table 2

MeRIP-Seq data used.

Dataset	Cell	Treatment	Replicates (IP & input)	Reference
1	HeLa	Control	4 & 4	[40]
2	HeLa	METTL3 K/O	2 & 2	[40]
3	HeLa	METTL14 K/O	2 & 2	[40]
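The merging of replicates mentioned above can be pictured as pooling per-bin read counts. The sketch below is a minimal illustration, assuming each replicate has already been summarized as a list of read counts over the same bins; the function name `merge_replicates` and the example counts are hypothetical and do not come from the paper or its software.

```python
def merge_replicates(replicates):
    """Pool replicates of one sample type by summing read counts per bin."""
    if not replicates:
        raise ValueError("need at least one replicate")
    n_bins = len(replicates[0])
    if any(len(rep) != n_bins for rep in replicates):
        raise ValueError("all replicates must cover the same bins")
    # zip(*replicates) groups the counts of each bin across replicates.
    return [sum(counts) for counts in zip(*replicates)]


# Hypothetical per-bin counts for two IP replicates of one condition.
ip_rep1 = [3, 10, 25, 8]
ip_rep2 = [5, 12, 20, 6]
merged_ip = merge_replicates([ip_rep1, ip_rep2])
print(merged_ip)  # [8, 22, 45, 14]
```

Pooling in this way increases the read depth per bin, but the pooled counts no longer carry any information about how much the replicates differ from one another.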

Differential RNA methylation is predicted using the exomePeak R/Bioconductor package [21] with the UCSC gene annotation database [43], and using the FastFHC strategy for comparison. Since METTL3 and METTL14 are methyltransferases, their target sites should exhibit hypomethylation under the knockout condition. The sites that are hypomethylated under the knockout condition (the targeted RNA methylation sites) are then extracted, and their sequences are submitted to MEME-ChIP for motif discovery. The identified motifs are summarized in Table 3. The enriched motifs differ considerably between the two datasets, indicating that there are multiple avenues for regulating the RNA methylome through sequence specificity.
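One generic way to test a single bin or site for hypomethylation under knockout is a one-sided Fisher's exact test on the IP and input read counts of the two conditions. The sketch below is an illustration only: it assumes that the "FET" in FET-HMM refers to Fisher's exact test (this section does not spell it out), it is not the code of exomePeak or of the authors' package, and the function name `hypo_pvalue` and the read counts are hypothetical.

```python
from math import comb


def hypo_pvalue(ip_ko, input_ko, ip_ctrl, input_ctrl):
    """One-sided Fisher's exact test for depletion of IP reads in knockout.

    2 x 2 table (rows: knockout, control; columns: IP, input). Returns the
    hypergeometric probability of seeing ip_ko or fewer IP reads in the
    knockout row when all margins are held fixed.
    """
    n_total = ip_ko + input_ko + ip_ctrl + input_ctrl
    row_ko = ip_ko + input_ko   # all knockout reads
    col_ip = ip_ko + ip_ctrl    # all IP reads
    denom = comb(n_total, row_ko)
    lowest = max(0, row_ko - (n_total - col_ip))  # smallest feasible count
    tail = sum(
        comb(col_ip, x) * comb(n_total - col_ip, row_ko - x)
        for x in range(lowest, ip_ko + 1)
    )
    return tail / denom


# Hypothetical counts: the IP fraction drops from 30/70 to 5/45 in knockout.
p = hypo_pvalue(ip_ko=5, input_ko=40, ip_ctrl=30, input_ctrl=40)
print(f"one-sided p-value for hypomethylation: {p:.2e}")
```

A small p-value indicates that the knockout sample has fewer IP reads, relative to input, than expected under no change, which is the direction expected for a target site of a methyltransferase.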

Table 3

Motifs for target sites of METTL3 and METTL14.

Dataset	Rank	exomePeak motif E-value	FET-HMM motif E-value
METTL3 K/O	1	2.3 × 10⁻²⁷	2.4 × 10⁻³³
METTL3 K/O	2	4.7 × 10⁻¹³	7.1 × 10⁻²⁴
METTL3 K/O	3	1.5 × 10⁻¹¹	1.5 × 10⁻¹⁵
METTL3 K/O	4	2.6 × 10⁻⁶	4.4 × 10⁻⁵
METTL14 K/O	1	3.3 × 10⁻¹³	1.4 × 10⁻¹⁹
METTL14 K/O	2	2.5 × 10⁻¹²	2.0 × 10⁻²¹
METTL14 K/O	3	3.3 × 10⁻¹⁰	3.4 × 10⁻¹²
METTL14 K/O	4	4.8 × 10⁻⁷	6.0 × 10⁻¹¹

Despite the difference in sequences, as shown in Figure 9, the motifs identified from the FastFHC results are generally more statistically significant than those from exomePeak (lower E-values in seven of the eight rank-matched comparisons in Table 3), indicating higher sequence specificity, which is achieved by the spatial enhancement with HMM in the FET-HMM approach. The increased sequence specificity should be valuable for decoding the structure of RNA methylation/demethylation enzymes.

Figure 9

E values of motifs identified from differential methylation regions. The figure shows the motif E values from the exomePeak and FastFHC strategies. With spatially enhanced differential methylation analysis, FastFHC identifies RNA methylation sites whose motifs are more statistically significant, suggesting higher sequence specificity than the exomePeak result.
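The rank-by-rank comparison of motif E-values can be tallied with a few lines of code. The sketch below simply transcribes the E-values from Table 3 and counts, for each dataset, how many rank-matched motifs have a lower (more significant) E-value under FET-HMM; the variable names are illustrative.

```python
from math import log10

# E-values transcribed from Table 3 as (exomePeak, FET-HMM) for ranks 1 to 4.
table3 = {
    "METTL3 K/O": [(2.3e-27, 2.4e-33), (4.7e-13, 7.1e-24),
                   (1.5e-11, 1.5e-15), (2.6e-6, 4.4e-5)],
    "METTL14 K/O": [(3.3e-13, 1.4e-19), (2.5e-12, 2.0e-21),
                    (3.3e-10, 3.4e-12), (4.8e-7, 6.0e-11)],
}

wins = {}
for name, pairs in table3.items():
    # A lower E-value means a more statistically significant motif.
    wins[name] = sum(fet < exo for exo, fet in pairs)
    # Positive values: FET-HMM is more significant by that many orders.
    gains = [round(log10(exo) - log10(fet), 1) for exo, fet in pairs]
    print(name, f"{wins[name]} of {len(pairs)}", gains)

# METTL3 K/O: 3 of 4; METTL14 K/O: 4 of 4
```

The only comparison in which exomePeak reports the lower E-value is the rank-4 motif of the METTL3 knockout dataset.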

We then checked the distribution of METTL3 and METTL14 targeted RNA methylation sites on mRNA and lncRNA. As shown in Figure 10, the targeted RNA methylation sites of METTL3 and METTL14 are relatively enriched near the stop codon of mRNA. Interestingly, compared with METTL14 targets, METTL3 targets are relatively enriched in untranslated regions (5′ and 3′UTR), which has not been reported before. Although existing studies suggest that METTL3 and METTL14 function as an RNA methylation complex together with WTAP, our observation suggests that they may have their own respective functions as well. On lncRNA, their targets are almost uniformly distributed along the entire RNA with slight enrichment at the 5′ end, the reason for which is not yet clear.

Figure 10

Distribution of METTL3 and METTL14 targeted RNA methylation sites. For both METTL3 and METTL14, the targeted RNA methylation sites are relatively enriched near the stop codon of mRNA; however, compared with METTL14 targets, METTL3 targets are relatively enriched in untranslated regions (5′ and 3′UTR). On lncRNA, their targets are almost uniformly distributed, with slight enrichment at the 5′ end.

4. Conclusion

In this paper, we developed an HMM-based method, FET-HMM, for spatially enhanced detection of differentially methylated regions from MeRIP-Seq data. Compared with existing peak-based approaches, which perform differential analysis on the entire methylation site, FET-HMM seeks to increase the resolution of detection by dividing a single RNA methylation site into multiple adjacent bins (as shown in Figure 1), resulting in improved detection performance. We developed 3 different strategies for this purpose, each with different advantages and disadvantages; the FastFHC strategy can be directly applied to genome-scale datasets. We show that the proposed approaches outperform the original peak-based approach in detection performance on simulated data and report more statistically significant DMRs on real MeRIP-Seq data.

It is important to note that exomePeak, which adopts a hypothesis testing scheme, relies on a cut-off threshold to report differential methylation sites, while FET-HMM, which assumes a hidden Markov model, needs a cut-off threshold on the posterior probability. Although their performance can be compared using AUC, the two approaches are fundamentally different. We therefore suggest using both exomePeak and FET-HMM when analyzing a specific dataset, rather than relying on one approach only.
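To make the role of the posterior cut-off concrete, the sketch below runs a scaled forward-backward pass on a two-state HMM over adjacent bins and thresholds the resulting posterior. It is a generic textbook illustration rather than the FET-HMM implementation: the per-bin emission likelihoods, the self-transition probability of 0.95, the uniform prior, and the cut-off of 0.5 are all assumed values, and the function name `posterior_state1` is hypothetical.

```python
def posterior_state1(emis, stay=0.95, prior1=0.5):
    """Posterior probability of state 1 for each bin of a two-state HMM.

    emis: list of (likelihood under state 0, likelihood under state 1),
    one pair per bin. stay is the assumed self-transition probability.
    """
    trans = [[stay, 1 - stay], [1 - stay, stay]]
    n = len(emis)

    def normalise(v):
        total = sum(v)
        return [x / total for x in v]

    # Forward pass, rescaled at every bin to avoid numerical underflow.
    fwd = [normalise([(1 - prior1) * emis[0][0], prior1 * emis[0][1]])]
    for t in range(1, n):
        prev = fwd[-1]
        fwd.append(normalise([
            emis[t][j] * sum(prev[i] * trans[i][j] for i in (0, 1))
            for j in (0, 1)
        ]))

    # Backward pass, rescaled in the same way.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = normalise([
            sum(trans[i][j] * emis[t + 1][j] * bwd[t + 1][j] for j in (0, 1))
            for i in (0, 1)
        ])

    # Combine both passes; the scaling constants cancel within each bin.
    return [f[1] * b[1] / (f[0] * b[0] + f[1] * b[1])
            for f, b in zip(fwd, bwd)]


# Hypothetical likelihoods: bin 2 is ambiguous on its own but is flanked
# by bins that favour state 1 (differentially methylated).
emis = [(0.2, 0.8), (0.2, 0.8), (0.55, 0.45),
        (0.2, 0.8), (0.9, 0.1), (0.9, 0.1)]
post = posterior_state1(emis)
calls = [p >= 0.5 for p in post]  # 0.5 is an arbitrary example cut-off
print([round(p, 3) for p in post], calls)
```

In this toy example the ambiguous bin is called differentially methylated because its neighbours are, which is the kind of spatial borrowing of information that a bin-by-bin hypothesis test does not provide.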

The proposed approach still has a number of limitations, many of which are shared by other existing MeRIP-Seq data analysis software. Firstly, the proposed approach cannot model within-group variation and thus cannot effectively take advantage of biological replicates. Currently, replicates are merged together, which discards the biological variability. Secondly, the proposed approach cannot discriminate between different isoforms of the same gene; MeRIP-Seq intrinsically provides very limited information regarding the methylation states of different isoform transcripts. Thirdly, even with the proposed approach, the spatial resolution is still not base-pair resolution. To obtain true base-pair resolution, a more advanced computational approach needs to be developed that further incorporates nucleotide sequence information (motifs).

Disclosure

The open source R package implementing the proposed algorithm on MeRIP-Seq data is freely available from GitHub: https://github.com/lzcyzm/RHHMM.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors acknowledge support from the National Natural Science Foundation of China (61473232, 61401370, 91430111, 61170134, and 61201408) to Shao-Wu Zhang, Jia Meng, and Hui Liu; the Jiangsu Science and Technology Program (BK20140403) to Jia Meng; and the Fundamental Research Funds for the Central Universities (2014QNB47, 2014QNA84) to Lin Zhang and Hui Liu. The authors also acknowledge computational support from the UTSA Computational System Biology Core, funded by the National Institute on Minority Health and Health Disparities (G12MD007591) from the National Institutes of Health.

References

[1] Brownlee G. G., Sanger, Barrell B. G. Nucleotide sequence of 5S-ribosomal RNA from Escherichia coli. Nature, 1967, 215(5102), 735-736. doi: 10.1038/215735a0. Scopus: 2-s2.0-0014200967.

[2] Dominissini, Rechavi. Gene expression regulation mediated through reversible m⁶A RNA methylation. Nature Reviews Genetics, 2014, 15(5), 293-306. doi: 10.1038/nrg3724. Scopus: 2-s2.0-84898814417.

[3] Meyer K. D., Jaffrey S. R. The dynamic epitranscriptome: N6-methyladenosine and gene expression control. Nature Reviews Molecular Cell Biology, 2014, 15(5), 313-326. doi: 10.1038/nrm3785. Scopus: 2-s2.0-84899586607.

[4] König, Zarnack, Luscombe N. M., Ule. Protein-RNA interactions: new genomic technologies and perspectives. Nature Reviews Genetics, 2012, 13(2), 77-83. doi: 10.1038/nrg3141. Scopus: 2-s2.0-84855916545.

[5] Dominissini, Moshitch-Moshkovitz, Schwartz, Salmon-Divon, Ungar, Osenberg, Cesarkas, Jacob-Hirsch, Amariglio, Kupiec, Sorek, Rechavi. Topology of the human and mouse m⁶A RNA methylomes revealed by m⁶A-seq. Nature, 2012, 484(7397), 201-206. doi: 10.1038/nature11112. Scopus: 2-s2.0-84860779086.

[6] Meyer K. D., Saletore, Zumbo, Elemento, Mason C. E., Jaffrey S. R. Comprehensive analysis of mRNA methylation reveals enrichment in 3′ UTRs and near stop codons. Cell, 2012, 149(7), 1635-1646. doi: 10.1016/j.cell.2012.05.003. Scopus: 2-s2.0-84862649489.

[7] Dominissini, Moshitch-Moshkovitz, Salmon-Divon, Amariglio, Rechavi. Transcriptome-wide mapping of N⁶-methyladenosine by m⁶A-seq based on immunocapturing and massively parallel sequencing. Nature Protocols, 2013, 8(1), 176-189. doi: 10.1038/nprot.2012.148. Scopus: 2-s2.0-84872032385.

[8] Fustin J. M., Doi, Yamaguchi, Hida, Nishimura, Yoshida, Isagawa, Morioka M. S., Kakeya, Manabe, Okamura. RNA-methylation-dependent RNA processing controls the speed of the circadian clock. Cell, 2013, 155(4), 793-806. doi: 10.1016/j.cell.2013.10.026. Scopus: 2-s2.0-84887875528.

[9] Hess M. E., Hess, Meyer K. D., Verhagen L. A. W., Koch, Brönneke H. S., Dietrich M. O., Jordan S. D., Saletore, Elemento, Belgardt B. F., Franz, Horvath T. L., Rüther, Jaffrey S. R., Kloppenburg, Brüning J. C. The fat mass and obesity associated gene (Fto) regulates activity of the dopaminergic midbrain circuitry. Nature Neuroscience, 2013, 16(8), 1042-1048. doi: 10.1038/nn.3449. Scopus: 2-s2.0-84880916638.

[10] Wang, Toth J. I., Petroski M. D., Zhang, Zhao J. C. N6-methyladenosine modification destabilizes developmental regulators in embryonic stem cells. Nature Cell Biology, 2014, 16(2), 191-198. doi: 10.1038/ncb2902. Scopus: 2-s2.0-84893310526.

[11] Lee, Kim V. N. Emerging roles of RNA modification: m(6)A and U-tail. Cell, 2014, 158(5), 980-987. doi: 10.1016/j.cell.2014.08.005.

[12] Schwartz, Mumbach M. R., Jovanovic, Wang, Maciag, Bushkin, Mertins, Ter-Ovanesyan, Habib, Cacchiarelli, Sanjana, Freinkman, Pacold, Satija, Mikkelsen, Hacohen, Zhang, Carr, Lander, Regev. Perturbation of m6A writers reveals two distinct classes of mRNA methylation at internal and 5′ sites. Cell Reports, 2014, 8(1), 284-296. doi: 10.1016/j.celrep.2014.05.048.

[13] Liu, Yue, Han, Wang, Zhang, Jia, Deng, Dai, Chen. A METTL3-METTL14 complex mediates mammalian nuclear RNA N6-adenosine methylation. Nature Chemical Biology, 2014, 10(2), 93-95. doi: 10.1038/nchembio.1432. Scopus: 2-s2.0-84897110592.

[14] Wang, Gomez, Hon G. C., Yue, Han, Parisien, Dai, Jia, Ren, Pan. N6-methyladenosine-dependent regulation of messenger RNA stability. Nature, 2014, 505(7481), 117-120. doi: 10.1038/nature12730. Scopus: 2-s2.0-84892372347.

[15] Grand Challenge Commentary: RNA epigenetics? Nature Chemical Biology, 2010, 6(12), 863-865. doi: 10.1038/nchembio.482. Scopus: 2-s2.0-78649267733.

[16] Dominissini. Roadmap to the epitranscriptome. Science, 2014, 346(6214), 1192. doi: 10.1126/science.aaa1807.

[17] Liu, Flores M. A., Meng, Zhang, Zhao, Rao M. K., Chen, Huang. MeT-DB: a database of transcriptome methylation in mammalian cells. Nucleic Acids Research, 2015, 43(1), D197-D203. doi: 10.1093/nar/gku1024.

[18] Liu, Zhang, Liu, Zhang, Chen, Huang, Meng. Decomposition of RNA methylome reveals co-methylation patterns induced by latent enzymatic regulators of the epitranscriptome. Molecular BioSystems, 2015, 11(1), 262-274. doi: 10.1039/c4mb00604f.

[19] Meng, Cui, Rao M. K., Chen, Huang. Exome-based analysis for RNA epigenome sequencing data. Bioinformatics, 2013, 29(12), 1565-1567. doi: 10.1093/bioinformatics/btt171. Scopus: 2-s2.0-84878874512.

[20] Song. MeRIP-PF: an Easy-to-use Pipeline for High-resolution Peak-finding in MeRIP-Seq Data. Genomics, Proteomics & Bioinformatics, 2013, 11(1), 72-75. doi: 10.1016/j.gpb.2013.01.002. Scopus: 2-s2.0-84875375724.

[21] Meng, Liu. A protocol for RNA methylation differential analysis with MeRIP-Seq data and exomePeak R/Bioconductor package. Methods, 2014, 69, 274-281.

[22] Bock. Analysing and interpreting DNA methylation data. Nature Reviews Genetics, 2012, 13(10), 705-719. doi: 10.1038/nrg3273. Scopus: 2-s2.0-84871688621.

[23] Robinson M. D., Kahraman, Law C. W., Lindsay, Nowicka, Weber L. M., Zhou. Statistical methods for detecting differentially methylated loci and regions. Frontiers in Genetics, 2014, 5, article 324. doi: 10.3389/fgene.2014.00324.

[24] Sun, Rodriguez, Park H. J., Tong, Meong, Goodell M. A. MOABS: model based analysis of bisulfite sequencing data. Genome Biology, 2014, 15, article R38. doi: 10.1186/gb-2014-15-2-r38. Scopus: 2-s2.0-84899049695.

[25] Stockwell P. A., Chatterjee, Rodger E. J., Morison I. M. DMAP: differential methylation analysis package for RRBS and WGBS data. Bioinformatics, 2014, 30(13), 1814-1822. doi: 10.1093/bioinformatics/btu126.

[26] Meng, Cui, Liu, Zhang, Rao M. K., Chen, Huang. Unveiling the dynamics in RNA epigenetic regulations. Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM '13), December 2013, Shanghai, China, 139-144. doi: 10.1109/BIBM.2013.6732477.

[27] Zhang, Liu, Meyer C. A., Eeckhoute, Johnson D. S., Bernstein B. E., Nussbaum, Myers R. M., Brown, Shirley X. S. Model-based analysis of ChIP-Seq (MACS). Genome Biology, 2008, 9, article R137. doi: 10.1186/gb-2008-9-9-r137. Scopus: 2-s2.0-53849146020.

[28] Seifert, Cortijo, Colomé-Tatché, Johannes, Roudier, Colot. MeDIP-HMM: genome-wide identification of distinct DNA methylation states from high-density tiling arrays. Bioinformatics, 2012, 28(22), 2930-2939. doi: 10.1093/bioinformatics/bts562. Scopus: 2-s2.0-84869397620.

[29] Seifert, Keilwagen, Strickert, Grosse. Utilizing gene pair orientations for HMM-based analysis of promoter array ChIP-chip data. Bioinformatics, 2009, 25(16), 2118-2125. doi: 10.1093/bioinformatics/btp276. Scopus: 2-s2.0-68649092780.

[30] Wei C.-L., Lin, Sung W.-K. An HMM approach to genome-wide identification of differential histone modification sites from ChIP-seq data. Bioinformatics, 2008, 24(20), 2344-2349. doi: 10.1093/bioinformatics/btn402. Scopus: 2-s2.0-53749092745.

[31] Krishnamoorthy, Thomson. A more powerful test for comparing two Poisson means. Journal of Statistical Planning and Inference, 2004, 119(1), 23-35. MR2018448. Scopus: 2-s2.0-0242635914.

[32] Przyborowski, Wilenski. Homogeneity of results in testing samples from Poisson series with an application to testing clover seed for dodder. Biometrika, 1940, 31, 313-323. MR0002070.

[33] Becker, Hagmann, Müller, Koenig, Stegle, Borgwardt, Weigel. Spontaneous epigenetic variation in the Arabidopsis thaliana methylome. Nature, 2011, 480(7376), 245-249. doi: 10.1038/nature10555. Scopus: 2-s2.0-80054718602.

[34] Zhu, Tian. The DNA methylome of human peripheral blood mononuclear cells. PLoS Biology, 2010, 8(11), e1000533. doi: 10.1371/journal.pbio.1000533.

[35] Lister, Pelizzola, Dowen R. H., Hawkins R. D., Hon, Tonti-Filippini, Nery J. R., Lee, Ngo Q.-M., Edsall, Antosiewicz-Bourget, Stewart, Ruotti, Millar A. H., Thomson J. A., Ren, Ecker J. R. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature, 2009, 462(7271), 315-322. doi: 10.1038/nature08514. Scopus: 2-s2.0-70450217879.

[36] Lister, Pelizzola, Kida Y. S., Hawkins R. D., Nery J. R., Hon, Antosiewicz-Bourget, O'Malley, Castanon, Klugman, Downes, Stewart, Ren, Thomson J. A., Evans R. M., Ecker J. R. Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cells. Nature, 2011, 471(7336), 68-73. doi: 10.1038/nature09798. Scopus: 2-s2.0-79952264847.

[37] Baldi, Chauvin. Smooth on-line learning algorithms for hidden Markov models. Neural Computation, 1994, 6(2), 307-318. doi: 10.1162/neco.1994.6.2.307.

[38] Dempster A. P., Laird N. M., Rubin D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B: Methodological, 1977, 39(1), 1-38. MR0501537.

[39] Poritz. Hidden Markov models: a guided tour. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '88), April 1988, New York, NY, USA, 7-13. doi: 10.1109/ICASSP.1988.196495.

[40] Wang, Gomez, Hon G. C., Yue, Han, Parisien, Dai, Jia, Ren, Pan. N⁶-methyladenosine-dependent regulation of messenger RNA stability. Nature, 2014, 505(7481), 117-120. doi: 10.1038/nature12730. Scopus: 2-s2.0-84892372347.

[41] Ping X.-L., Sun B.-F., Wang, Xiao, Yang, Wang W.-J., Adhikari, Shi, Chen Y.-S., Zhao, Yang, Dahal, Lou X.-M., Liu, Huang, Yuan W.-P., Zhu X.-F., Cheng, Zhao Y.-L., Wang, Danielsen J. M. R., Liu, Yang Y.-G. Mammalian WTAP is a regulatory subunit of the RNA N6-methyladenosine methyltransferase. Cell Research, 2014, 24(2), 177-189. doi: 10.1038/cr.2014.3. Scopus: 2-s2.0-84893746230.

[42] Kim, Pertea, Trapnell, Pimentel, Kelley, Salzberg S. L. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology, 2013, 14, article R36. doi: 10.1186/gb-2013-14-4-r36. Scopus: 2-s2.0-84876996918.

[43] Karolchik, Barber G. P., Casper, Clawson, Cline M. S., Diekhans, Dreszer T. R., Fujita P. A., Guruvadoo, Haeussler, Harte R. A., Heitner, Hinrichs A. S., Learned, Lee B. T., C. H., Raney B. J., Rhead, Rosenbloom K. R., Sloan C. A., Speir M. L., Zweig A. S., Haussler, Kuhn R. M., Kent W. J. The UCSC genome browser database: 2014 update. Nucleic Acids Research, 2014, 42(1), D764-D770. doi: 10.1093/nar/gkt1168. Scopus: 2-s2.0-84891771466.