A number of empirical Bayes models (each with different statistical distribution assumptions) have now been developed to analyze differential DNA methylation using high-density oligonucleotide tiling arrays. However, it remains unclear which model performs best. For example, when differentially methylated regions are analyzed for conservative and functional sequence characteristics (e.g., enrichment of transcription factor-binding sites (TFBSs)), the sensitivity of such analyses under the various empirical Bayes models is unknown. In this paper, five empirical Bayes models were constructed, based on either a gamma distribution or a log-normal distribution, for the identification of differentially methylated loci and their cell-division-dependent (1, 3, and 5 divisions) and drug-treatment-dependent (cisplatin) methylation patterns. While differential methylation patterns generated by log-normal models were enriched with numerous TFBSs, we observed almost no TFBS-enriched sequences using gamma assumption models. Statistical and biological results suggest that the log-normal, rather than the gamma, distribution assumption gives the more accurate and precise empirical Bayes model for differential methylation microarray analysis. In addition, we present one of the log-normal models for differential methylation analysis and test its reproducibility in a simulation study. We believe this research to be the first extensive comparison of statistical modeling for the analysis of differential DNA methylation, an important biological phenomenon that precisely regulates gene transcription.
High-density oligonucleotide tiling arrays have been widely utilized to globally analyze chromatin modifications across entire genomes, including assessments of DNA methylation, in addition to the identification of transcription factor binding sites [
To date, there have been numerous statistical inference frameworks developed for microarray differential analysis, including empirical [
The key element of an empirical Bayes model for characterizing microarray data is its statistical distribution assumption, of which two types are currently common: the log-normal and the gamma distribution. Our group was one of the first to use an empirical Bayes model for the analysis of differential methylation microarray data, by developing a log-normal empirical Bayes model for microarray analysis of not only differential DNA methylation but also histone acetylation and differential gene expression, in a “triple array” system for the simultaneous assessment of these phenomena in ovarian cancer cells [
It was recently shown that specific sequence characteristics of methylated regions exist in cancerous [
In this paper, we constructed a number of empirical Bayes models, based on log-normal and gamma distributions, and compared their performance in differential methylation analysis on real data. Finally, we assessed the impact of these models on a common biological application, TFBS enrichment within DNA sequences differentially methylated by cell division and treatment with a DNA-damaging agent.
Genomic DNA from ovarian cancer A2780 cells (ATCC, Manassas, VA, Calbiochem, Billerica, MA, USA) and total genomic DNA purified (DNeasy purification kits, Qiagen, Valencia, CA) following 1, 3, and 5 cell divisions were exposed or unexposed to the DNA adduct-forming agent cisplatin. Differential methylation hybridization (DMH) was then performed as previously described [
A numerical methylation signal for each probe,
For our use of the BGG model, first proposed by Newton et al. [
Consequently, the M-step was
For our microarray differential methylation analysis, we slightly revised the BGG model, in which the between-replicate variation is modeled by truncated normal distributions, as follows:
Our BNNGG model was a further revision from the BNGG model, with the background variation (at the pixel level) added as an additional source of variation
The BLNN model was first proposed by Kendziorski et al. [
Our BLNNN model was revised from the BLNN model (described above), in which the background variation at the pixel level was added as an additional source of variation:
Our previous study of the fidelity of DNA methylation inheritance [
Five empirical Bayes models parameter list.
Empirical Bayes model | Parameters | Observed data | Missing data |
---|---|---|---|
BGG | | | |
BNGG | | | |
BNNGG | | | |
BLNN | | | |
BLNNN | | | |
Note:
Five empirical Bayes model frameworks.
Empirical Bayes model | | | Likelihood |
---|---|---|---|
BGG | | | |
BNGG | | | |
BNNGG | | | |
BLNN | | | |
BLNNN | | | |
Time-dependent methylation pattern definitions. Between the parent A2780 cell and its cisplatin-treated 1st, 3rd, and 5th generation daughter cells, a probe with increased methylation (probability ≥ 0.8) is defined as hypermethylation (i.e., up), a probe with decreased methylation (probability ≥ 0.8) is defined as hypomethylation (i.e., down), and otherwise, the methylation change is even. Probes showing decreased methylation from generations 1 to 3 to 5 were defined as having “stochastic hypomethylation.” Analogously, probes showing increased methylation from generations 1 to 3 to 5 were considered to exhibit “stochastic hypermethylation.” Finally, probes showing mixed increased and decreased methylation from generations 1 to 3 to 5 were defined as having “random differential methylation.”
Categories | Parental versus generation 1 | Parental versus generation 3 | Parental versus generation 5 |
---|---|---|---|
Stochastic hypomethylation | Down | Down | Down |
| Even | Down | Down |
| Even | Even | Down |
Stochastic hypermethylation | Up | Up | Up |
| Even | Up | Up |
| Even | Even | Up |
Random differential methylation | Down | Up | Down |
| Down | Up | Even |
| Down | Even | Down |
| Even | Up | Down |
| Even | Up | Even |
| Even | Down | Even |
| Even | Down | Up |
| Up | Down | Up |
| Up | Down | Even |
| Up | Even | Up |
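The category definitions in the table above can be sketched as a simple lookup. This is a minimal illustration, not the pipeline used in the study: the function names `call_change` and `classify_pattern` are hypothetical, and we assume that each comparison yields a posterior probability of differential methylation plus a log ratio whose sign gives the direction. Only the 0.8 cutoff and the listed patterns come from the table and its caption.

```python
from typing import Optional, Tuple

THRESHOLD = 0.8  # probability cutoff stated in the table caption


def call_change(prob_differential: float, log_ratio: float,
                threshold: float = THRESHOLD) -> str:
    """Label one parental-versus-generation comparison as Up, Down, or Even.

    Assumption: the model reports a posterior probability of differential
    methylation, and the sign of the log ratio gives the direction.
    """
    if prob_differential < threshold or log_ratio == 0:
        return "Even"
    return "Up" if log_ratio > 0 else "Down"


# Patterns copied row by row from the table (generation 1, 3, 5).
STOCHASTIC_HYPO = {
    ("Down", "Down", "Down"), ("Even", "Down", "Down"), ("Even", "Even", "Down"),
}
STOCHASTIC_HYPER = {
    ("Up", "Up", "Up"), ("Even", "Up", "Up"), ("Even", "Even", "Up"),
}
RANDOM = {
    ("Down", "Up", "Down"), ("Down", "Up", "Even"), ("Down", "Even", "Down"),
    ("Even", "Up", "Down"), ("Even", "Up", "Even"), ("Even", "Down", "Even"),
    ("Even", "Down", "Up"), ("Up", "Down", "Up"), ("Up", "Down", "Even"),
    ("Up", "Even", "Up"),
}


def classify_pattern(calls: Tuple[str, str, str]) -> Optional[str]:
    """Map three Up/Down/Even calls to a category, or None if not listed."""
    if calls in STOCHASTIC_HYPO:
        return "Stochastic hypomethylation"
    if calls in STOCHASTIC_HYPER:
        return "Stochastic hypermethylation"
    if calls in RANDOM:
        return "Random differential methylation"
    return None  # pattern does not appear in the table


if __name__ == "__main__":
    calls = (call_change(0.55, -0.2), call_change(0.91, -1.1), call_change(0.97, -1.6))
    print(calls, classify_pattern(calls))
```

Patterns that are absent from the table (for example, three Even calls) return `None` rather than being forced into a category.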
As we mentioned in Section
For identifying DNA sequences differentially methylated over 1, 3, or 5 cell divisions and/or treatment with the DNA-damaging agent cisplatin, we used a customized two-color 60-mer oligonucleotide microarray containing over 40,000 CpG-rich fragments from 12,000 promoters. Methylated and unmethylated DNA fragments were separated by subjecting DNA isolated from drug-treated daughter cells (Cy5 labeled, cell generations 1, 3, and 5) and untreated parental cells (Cy3 labeled) to methylation-sensitive restriction enzyme cleavage. The raw values of each scanned fluorescent probe were then preprocessed for foreground/background signal normalization, pixel number, and signal standard deviations. The raw data were first statistically normalized using the common Lowess method (see Section
Each of the five empirical Bayes models was then compared for its performance, as determined by the minimized negative after-convergence log-likelihoods for the EM iterations (Figure
Model performance comparisons in differential methylation data analysis. Five empirical Bayes models were compared: (1) binary-gamma-gamma (BGG); (2) binary-normal-gamma-gamma (BNGG); (3) binary-normal-normal-gamma-gamma (BNNGG); (4) binary-log-normal-normal (BLNN); (5) binary-log-normal-normal-normal (BLNNN). Negative log-likelihoods (a) and the number of identified differentially methylated CpG islands (b) of the five Bayesian models as applied for comparing methylation differences between A2780 parental cells and their cisplatin-treated 1st, 3rd, and 5th generation daughter cells.
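To make the negative log-likelihood criterion concrete, the following sketch compares a plain log-normal fit and a plain gamma fit on the same positive-valued sample. It is a deliberately simplified stand-in, not any of the five hierarchical empirical Bayes models: there is no mixture component and no EM iteration, the function names are hypothetical, and the gamma shape is obtained from a standard closed-form approximation to the maximum-likelihood estimate rather than an exact fit.

```python
import math
import random
from typing import Sequence


def lognormal_nll(x: Sequence[float]) -> float:
    """Negative log-likelihood of x under a log-normal fitted by maximum likelihood."""
    logs = [math.log(v) for v in x]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((l - mu) ** 2 for l in logs) / n  # ML estimate of the log-scale variance
    # At the ML estimates the quadratic term sums to n / 2.
    return sum(logs) + 0.5 * n * math.log(2 * math.pi * var) + 0.5 * n


def gamma_nll(x: Sequence[float]) -> float:
    """Negative log-likelihood of x under a gamma with an approximate ML shape."""
    n = len(x)
    mean = sum(x) / n
    sum_log = sum(math.log(v) for v in x)
    s = math.log(mean) - sum_log / n  # positive unless all values are equal
    # Closed-form approximation to the ML shape parameter (not exact).
    k = (3 - s + math.sqrt((s - 3) ** 2 + 24 * s)) / (12 * s)
    theta = mean / k  # ML scale for a given shape
    return -(k - 1) * sum_log + n * k + n * math.lgamma(k) + n * k * math.log(theta)


if __name__ == "__main__":
    random.seed(0)
    # Simulated intensities, used only to exercise the two functions.
    sample = [random.lognormvariate(0.0, 1.0) for _ in range(2000)]
    print("log-normal NLL:", lognormal_nll(sample))
    print("gamma NLL:     ", gamma_nll(sample))
```

The distribution with the smaller negative log-likelihood describes the sample better, which is the sense in which the models are ranked in panel (a).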
Differentially methylated CpG islands before and after cisplatin treatment identified by empirical Bayes models. Scatter plots of the logarithmically transformed DNA methylation intensities before and after 1, 3, and 5 cell divisions of cisplatin-treated A2780 cells, in which the
The differential methylation analysis described above compares DNA methylation signals before and after A2780 cells divided and were treated with cisplatin at a given time point. Our previous study of the heritable fidelity of DNA methylation during DNA replication [
Numbers of CpG islands, as identified by empirical Bayes models, segregating into our three previously defined methylation heritability categories [
Overlaps of stochastically hypo- and hypermethylated CpG islands identified by empirical Bayes models.
To assess a possible systems biological application for this work, we compared the degree of TFBS enrichment among stochastically hypomethylated, stochastically hypermethylated, and randomly differentially methylated loci, as compared to the predicted TFBS frequencies calculated from the GC content-matched background sequences (see Section
Number of significantly enriched TFBSs in time-dependent methylation patterns.
Empirical Bayes model | Stochastic hypo-methylation | Stochastic hyper-methylation | Random differential methylation |
---|---|---|---|
BGG | 0 | 0 | 4 |
BNGG | 0 | 0 | 0 |
BNNGG | 0 | 0 | 0 |
BLNN | 71 | 51 | 19 |
BLNNN | 36 | 58 | 0 |
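As an illustration of how a single TFBS motif might be scored for enrichment against a GC content-matched background, the sketch below uses a one-sided binomial test. This is an assumption made for illustration: the test actually used to produce the counts in the table is not described in this section, and the function name and the example numbers are hypothetical.

```python
import math


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    k: number of foreground sequences containing the motif
    n: total number of foreground sequences
    p: motif frequency estimated from GC content-matched background sequences
    """
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        # Work in log space so large binomial coefficients do not overflow.
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


if __name__ == "__main__":
    # Hypothetical counts: 30 of 100 loci carry a motif expected in 10% of background.
    print(binom_upper_tail(30, 100, 0.10))
```

A small tail probability indicates that the motif occurs in the foreground loci more often than the background frequency would predict. When many motifs are tested, a multiple-testing correction would also be needed before counting a motif as significantly enriched.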
The log-normal models gave the smallest negative log-likelihoods, consistently increasing numbers of differentially methylated loci, and reasonable numbers of time-dependent differentially methylated loci. All these features suggest rigorous statistical performance in differential methylation analysis. Moreover, we recently found that hypermethylated gene promoters had enriched transcription factor-binding sites (TFBSs) in ovarian cancer drug-resistant cells [
As discussed previously, BLNN performed worse than BLNNN on low-signal probes, which led it to call more differentially methylated loci. Consequently, BLNN generated more stochastically hypomethylated, stochastically hypermethylated, and randomly differentially methylated loci. TFBS enrichment showed similar patterns for stochastic hypo- and hypermethylation under both models but differed dramatically for random differential methylation, which gives us a chance to compare the two models biologically. With 0 versus 19 enriched TFBSs, BLNNN appears to have placed only nonmonotone methylation loci into the random methylation pattern, suggesting better performance than BLNN.
To illustrate that the log-normal distribution assumption and the BLNNN model are applicable to differential methylation analysis beyond the real microarray experiments presented in this paper, we further performed simulation studies on the BLNNN model. The parameter estimates
We believe this is the first comparison of empirical Bayes models for analyzing differential methylation microarray data, demonstrating that the log-normal distribution is statistically superior to the gamma distribution. We also showed that probe-level background noise can markedly confound the identification of differentially methylated loci and, in particular, affects BLNN detection of loci having small methylation signals, as compared to BLNNN. In a similar study, Kendziorski et al
In this paper, we compared all five empirical Bayes models for their ability to reveal enrichment of TFBS motifs in three distinct methylation heritability categories. While both log-normal models provided similar numbers of enriched TFBSs in stochastically hypermethylated and hypomethylated loci, all gamma models yielded few or no enriched TFBSs. In the field of epigenetics, it has been hypothesized that there exist methylation-prone and methylation-resistant sequences in cancerous [
TFBS: Transcription factor-binding site
DMH: Differential methylation hybridization
BGG: Binary-gamma-gamma model
BNGG: Binary-normal-gamma-gamma model
BNNGG: Binary-normal-normal-gamma-gamma model
BLNN: Binary-log-normal-normal model
BLNNN: Binary-log-normal-normal-normal model
EM: Expectation-maximization.
The authors declare that they have no competing interests.
This work was supported by National Natural Science Foundation of China 60973078 (Y. Wang), 61173085 (Y. Wang), and 60901075 (G. Wang), Natural Science Foundation of Heilongjiang Province of China LC2009C35 (G. Wang), United States National Cancer Institute grants CA113001 (T. H. M. Huang and K. P. Nephew) and CA85289 (K. P. Nephew and C. Balch).