The statistical application considered here arose in epigenomics, linking the DNA methylation proportions measured at specific genomic sites to characteristics such as phenotype or birth order. It was found that the distribution of errors in the proportions of chemical modification (methylation) on DNA, measured at CpG sites, may be successfully modelled by a Laplace distribution which is perturbed by a Hermite polynomial. We use a linear model with such a response function. Hence, the response function is known, or assumed well estimated, but fails to be differentiable in the classical sense due to the modulus function. Our problem was to estimate coefficients for the linear model and the corresponding covariance matrix and to compare models with varying numbers of coefficients. The linear model coefficients may be found using the (derivative-free) simplex method, as in quantile regression. However, this theory does not yield a simple expression for the covariance matrix of the coefficients of the linear model. Assuming response functions which are
This work arose in a biological context, in epigenomics, namely, the modelling of the distribution of errors in the proportions of chemical modification (methylation) on DNA, measured at specific genomic sites (CpG sites). It was observed that this error distribution may be suitably modelled by a truncated Laplace distribution perturbed by a Hermite polynomial.
This error distribution was first noticed in Sequenom measurements but has wider application. A survey of data generated by measurements on the Infinium, Illumina, Affymetrix, and MeDIP2 machines showed characteristics similar to those of the Sequenom data, in that such an amended Laplace distribution was required to properly describe the probability density function. It is thought that variation in the scattering angle of light in the measurement processes common to all of these platforms is responsible for the frequencies in the tails of the measurement distributions not conforming to a simple Laplace density and requiring our proposed amendment. Without the amendment the Laplace density gives tail probabilities for the deviations that are too high, potentially leading to an incorrect failure to reject a null hypothesis. Because this observed frequency distribution appears to be a common feature of epigenomic and gene expression measurements in molecular biology, it is important that the process of estimation and inference under the amended Laplace probability density be studied. This paper reports results from such a study.
We extend the theory of linear models as given in [
The theory of generalized linear models as described in [
The modified Laplace probability density functions considered here have a sharp peak at the maximum. Maximum likelihood estimation (MLE) of coefficients may be done by non-gradient methods, such as the simplex method. However, the usual classical expressions for the standard errors of the coefficients, the information matrix, and the log-likelihood ratio statistic do not apply due to lack of differentiability. We derive expressions for generalized versions of these quantities using generalized functions. Consequently, we show that our MLE is asymptotically normal.
The method we present to estimate these statistics could in principle be applied to other probability density functions exhibiting abrupt changes in gradient. Response function parameters are assumed known or previously estimated. The theory is applied to find the standard errors for coefficients of a linear model, assuming the response function has a truncated Laplace distribution with added kurtosis, due to perturbation by a Hermite polynomial. To illustrate the application, we show how birth order can be linked to methylation status at two CpG sites in the promoter of the H19 gene.
Let
Let
The method of MLE for the response function (
Now consider the case of bounded support. For finite
More generally, consider perturbations of the truncated Laplace probability density function of the following form. Let
We could allow unbounded support if
Now consider our motivating example, a truncated Laplace distribution with bounded support
Consider
The functions
We restrict to symmetric distributions satisfying
Let
Our aim is to find a maximum likelihood estimator (MLE) denoted
If
The theory of LAE regression (corresponding to MLE using Laplace distributions without modification as response functions) may be found in various texts, for example, [
If
In Section
If
Although
Consider the probability density function
If
Let
Let
Note that Corollary
To begin, assume that
Let
Next, consider
Now, Assuming that A MLE is not necessarily unique. If Since at the MLE the absolute values of the deviations
Lemma
For
Now
We need to be aware of the case where
The question is, given a nontrivial perturbing function
Assume that
It is possible that there exist orthants
Assume bounded support and let
Choose
In the limiting case, where
Now,
We may apply Lemma
The non-linear part of the log-likelihood function,
The first derivative of
The second derivative of
In certain situations we might find the criteria for both Lemmas
For
The inclusion of the modulus (absolute value) function in the Laplace probability density function (
The fact that
The following generalized functions are required to determine the first and second partial derivatives of the log-likelihood function
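As a reminder of the standard distributional identities that make such derivatives meaningful, the following block states them for a scalar argument $u$ and a smooth test function $\phi$. The symbols $u$ and $\phi$ are introduced here for illustration only and are not taken from the paper's own notation.

$$
\frac{d}{du}\,|u| = \operatorname{sgn}(u), \qquad
\frac{d}{du}\,\operatorname{sgn}(u) = 2\,\delta(u), \qquad
\int_{-\infty}^{\infty} \delta(u)\,\phi(u)\,du = \phi(0).
$$

The first identity holds classically for $u \neq 0$; the second and third hold in the sense of generalized functions, with $\delta$ the Dirac delta.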
In Section
Let
Let
In Sections
The mean and the variance of
If
We calculate the information matrix
If
In order to calculate the expected value of the generalized Hessian we require
We use a generalized Taylor series expansion (in the coefficients
Next we consider the score function
We have shown that the expected value of the generalized Hessian
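To make the role of the generalized Hessian concrete, here is a worked sketch for the simplest special case only: an unperturbed Laplace response with unbounded support and known scale $b$. The notation ($y_i$, $x_i$, $\beta$, $X$, $b$, $\ell$) is assumed for this illustration and may differ from the paper's; the truncated and perturbed cases treated in the paper involve additional terms that are not reproduced here.

$$
\begin{aligned}
\ell(\beta) &= -n\log(2b) - \frac{1}{b}\sum_{i=1}^{n} \bigl|y_i - x_i^{T}\beta\bigr|, \\
\frac{\partial \ell}{\partial \beta} &= \frac{1}{b}\sum_{i=1}^{n} \operatorname{sgn}\bigl(y_i - x_i^{T}\beta\bigr)\,x_i, \\
\frac{\partial^{2} \ell}{\partial \beta\,\partial \beta^{T}} &= -\frac{2}{b}\sum_{i=1}^{n} \delta\bigl(y_i - x_i^{T}\beta\bigr)\,x_i x_i^{T}, \\
\mathrm{E}\!\left[\delta\bigl(y_i - x_i^{T}\beta\bigr)\right] &= f(0) = \frac{1}{2b}
\quad\Longrightarrow\quad
\mathrm{E}\!\left[-\frac{\partial^{2} \ell}{\partial \beta\,\partial \beta^{T}}\right] = \frac{1}{b^{2}}\,X^{T}X .
\end{aligned}
$$

The same matrix is obtained from the expected outer product of the score, since $\operatorname{sgn}^{2} = 1$ almost surely. Under these assumptions the approximate covariance of the coefficient estimates is $b^{2}\,(X^{T}X)^{-1}$, which for a two-group design with $m$ observations per group gives a standard error of $b/\sqrt{m}$ for each group location.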
If
Now consider our motivating example, a Laplace distribution with bounded support
The log-likelihood ratio statistic enables us to assess the adequacy of a model. It enables us to compare a model with
Let
The derivation of the log-likelihood ratio statistic (for smooth functions) may be found in the textbooks. For example, for generalized linear models (see [
So, by (
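For orientation, the classical form of the statistic can be sketched in the simplest setting: an unperturbed Laplace response with a known scale, comparing a one-location model against a model with one location per group. This is a sketch under those assumptions, not the generalized statistic derived in this paper. The data, the scale value, and the function names are hypothetical, and the chi-square reference distribution with one degree of freedom is the usual large-sample approximation, assumed here rather than proved.

```python
import math
from statistics import median


def sum_abs_dev(values, centre):
    """Sum of absolute deviations of the values from a centre."""
    return sum(abs(v - centre) for v in values)


def laplace_lr_statistic(groups, scale):
    """2 * (l_full - l_reduced) for Laplace errors with a known scale.

    The Laplace log-likelihood is -n*log(2*scale) - sum|residual|/scale,
    so the constant terms cancel and only the absolute deviations remain.
    """
    pooled = [v for g in groups for v in g]
    s_reduced = sum_abs_dev(pooled, median(pooled))          # one common location
    s_full = sum(sum_abs_dev(g, median(g)) for g in groups)  # one location per group
    # Guard against a tiny negative value caused by floating-point rounding.
    return max(0.0, 2.0 * (s_reduced - s_full) / scale)


def chi2_sf_1df(d):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(d / 2.0))


# Hypothetical data and scale, for illustration only.
group_a = [0.44, 0.45, 0.46, 0.47, 0.43]
group_b = [0.48, 0.49, 0.47, 0.50, 0.48]
stat = laplace_lr_statistic([group_a, group_b], scale=0.02)
p_value = chi2_sf_1df(stat)
print(stat, p_value)
```

With only five observations per group the asymptotic approximation is rough; the example is meant to show the mechanics of comparing nested models, not to support inference.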
The generalized expressions derived in Section Our model is linear (see ( The response function There exists a unique true vector of coefficients The matrix The Assuming fixed
We make the following further assumptions.
The ML estimates
Since a continuous function on a compact set attains its maximum, the existence of a maximum of the log-likelihood function
The random variable
The random variable
Consider the random variable
Although for finite
Using a generalized Taylor series expansion about
In the Proof of Lemma
Consider the first degree approximation
An alternative Proof of Theorem
Quantitative analysis of DNA methylation at specific genomic sites (known as CpG sites) was carried out with the Sequenom MassARRAY Compact System (
Deviations in methylation proportion.
In order to illustrate the application of the theory developed, a sample of 40 deviations was chosen at random from the total pool of 1440 available CpG methylation proportion deviations. A constant value of 0.48 was added to 20 of these samples and designated treatment H, while a constant value of 0.45 was added to the other 20 samples and designated treatment L. A uniform random variable sampled between −0.01 and 0.01 was added to each value to simulate the additional differences expected to occur between individuals.
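The construction just described can be sketched as follows. The pool of 1440 measured deviations is not reproduced in this text, so the sketch uses a hypothetical stand-in pool of Laplace draws with an arbitrary scale; sampling without replacement, the seeds, and all function names are likewise assumptions made for illustration.

```python
import random


def simulate_treatments(pool, n_per_group=20, high=0.48, low=0.45,
                        jitter=0.01, seed=0):
    """Sample deviations, add a treatment constant, then add uniform jitter.

    Returns a list of (treatment_label, value) pairs, H first and then L.
    """
    rng = random.Random(seed)
    # Assumption: the deviations are drawn without replacement from the pool.
    sample = rng.sample(pool, 2 * n_per_group)
    rows = []
    for i, deviation in enumerate(sample):
        label, shift = ("H", high) if i < n_per_group else ("L", low)
        noise = rng.uniform(-jitter, jitter)  # between-individual differences
        rows.append((label, deviation + shift + noise))
    return rows


def laplace_draw(rng, scale):
    """One Laplace(0, scale) draw as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


# Hypothetical stand-in for the 1440 measured deviations (scale is arbitrary).
pool_rng = random.Random(1)
stand_in_pool = [laplace_draw(pool_rng, 0.02) for _ in range(1440)]

data = simulate_treatments(stand_in_pool)
print(data[:3])
```

Replacing `stand_in_pool` with the measured deviations would reproduce the procedure on real data, although not the particular random draws used in the paper.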
We analysed the data using our amended Laplace distribution (
We coded a low value treatment (L) by setting
Two simulation results, adding high (H) or low (L) treatments (T) to DNA methylation proportions.
Simulation | T | Amended Laplace: mean | Amended Laplace: std. err. | | Laplace: mean | Laplace: std. err. | | Normal (MATLAB glmfit): mean | Normal (MATLAB glmfit): std. err. | |
---|---|---|---|---|---|---|---|---|---|---|
1 | H | 0.4817 | 0.0042 | | 0.4812 | 0.0062 | | 0.4658 | 0.0094 | |
1 | L | 0.4532 | 0.0042 | | 0.4519 | 0.0062 | | 0.4540 | 0.0094 | |
2 | H | 0.4803 | 0.0042 | | 0.4809 | 0.0062 | | 0.4887 | 0.0119 | |
2 | L | 0.4592 | 0.0042 | | 0.4586 | 0.0062 | | 0.4673 | 0.0119 | |
For both simulations
For comparison, the results of a standard analysis of variance, assuming the deviations have a normal distribution, are also given in Table
For comparison we also included LAE regression, estimating the model coefficients assuming the response function is the Laplace probability density function truncated to
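As a quick check on the location estimates for the normal and plain Laplace comparisons, the following sketch computes group means and medians for the first simulation, using the values listed in the first simulation data table of this paper. It does not implement the truncated or amended Laplace fit. It relies on the standard fact that, without truncation or perturbation, the least absolute error estimate of a group location is any value between the two middle order statistics when the group size is even.

```python
from statistics import mean, median

# First simulation data, copied from the data table in this paper.
high = [0.4841, 0.4735, 0.4878, 0.4823, 0.4462, 0.4664, 0.4845, 0.4817,
        0.4861, 0.4668, 0.4873, 0.4761, 0.2391, 0.4805, 0.4877, 0.4779,
        0.4929, 0.4863, 0.4751, 0.4543]
low = [0.4579, 0.4243, 0.4993, 0.4131, 0.4463, 0.4317, 0.4473, 0.4347,
       0.4760, 0.4776, 0.4467, 0.4610, 0.4609, 0.4851, 0.4340, 0.4573,
       0.4360, 0.4584, 0.4420, 0.4914]


def sum_abs_dev(values, centre):
    """Sum of absolute deviations of the values from a centre."""
    return sum(abs(v - centre) for v in values)


def median_interval(values):
    """Two middle order statistics of an even-sized sample."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return ordered[half - 1], ordered[half]


for label, values in (("H", high), ("L", low)):
    lo_mid, hi_mid = median_interval(values)
    print(label, "mean:", mean(values), "median:", median(values),
          "LAE interval:", (lo_mid, hi_mid))
```

The group means agree with the normal-distribution means reported in the results table for the first simulation (0.4658 and 0.4540), and the reported Laplace estimates (0.4812 and 0.4519) fall inside the corresponding median intervals. The mean for treatment H is pulled down by the single low value 0.2391, while the median is not.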
The CpG methylation at two CpG sites in the promoter of the H19 gene was measured by Sequenom in umbilical cord tissues collected as part of an ongoing prospective birth cohort study. Phenotype variables in this population include birth order or parity, defined as first born child (primiparous) or later born (multiparous). We analysed the relationship between H19 gene methylation status and birth order in this study, using our amended Laplace distribution (
The problem was coded by substituting
Primiparous (p) versus multiparous (m) effects on DNA methylation proportion at the promoter of the H19 gene.
Site | Parity | Amended Laplace: mean | Amended Laplace: std. err. | | Normal: mean | Normal: std. err. | | Mann-Whitney |
---|---|---|---|---|---|---|---|---|
CpG9 | p | 0.180 | 0.0029 | <1 | 0.300 | 0.059 | | |
CpG9 | m | 0.450 | 0.0029 | | 0.441 | 0.059 | | |
CpG13 | p | 0.230 | 0.0029 | <1 | 0.326 | 0.061 | | |
CpG13 | m | 0.560 | 0.0029 | | 0.523 | 0.061 | | |
First simulation data, treatments either H (
T | | T | | T | | T | |
---|---|---|---|---|---|---|---|
L | 0.4579 | L | 0.4467 | H | 0.4841 | H | 0.4873 |
L | 0.4243 | L | 0.4610 | H | 0.4735 | H | 0.4761 |
L | 0.4993 | L | 0.4609 | H | 0.4878 | H | 0.2391 |
L | 0.4131 | L | 0.4851 | H | 0.4823 | H | 0.4805 |
L | 0.4463 | L | 0.4340 | H | 0.4462 | H | 0.4877 |
L | 0.4317 | L | 0.4573 | H | 0.4664 | H | 0.4779 |
L | 0.4473 | L | 0.4360 | H | 0.4845 | H | 0.4929 |
L | 0.4347 | L | 0.4584 | H | 0.4817 | H | 0.4863 |
L | 0.4760 | L | 0.4420 | H | 0.4861 | H | 0.4751 |
L | 0.4776 | L | 0.4914 | H | 0.4668 | H | 0.4543 |
Second simulation data, treatments either H (
T | | T | | T | | T | |
---|---|---|---|---|---|---|---|
L | 0.5287 | L | 0.4416 | H | 0.4881 | H | 0.4829 |
L | 0.5224 | L | 0.4547 | H | 0.4803 | H | 0.4246 |
L | 0.5162 | L | 0.4496 | H | 0.4789 | H | 0.4609 |
L | 0.4564 | L | 0.4568 | H | 0.4790 | H | 0.4233 |
L | 0.4628 | L | 0.4574 | H | 0.4739 | H | 0.5412 |
L | 0.5230 | L | 0.4599 | H | 0.4974 | H | 0.5193 |
L | 0.3731 | L | 0.5124 | H | 0.7010 | H | 0.4921 |
L | 0.4389 | L | 0.4592 | H | 0.4725 | H | 0.4702 |
L | 0.4519 | L | 0.4458 | H | 0.4871 | H | 0.5520 |
L | 0.4675 | L | 0.4685 | H | 0.4885 | H | 0.3600 |
CpG methylation measurements at sites 9 and 13 on the promoter of the H19 gene versus primiparous (p) or multiparous (m).
CpG9 | p/m | CpG9 | p/m | CpG13 | p/m | CpG13 | p/m |
---|---|---|---|---|---|---|---|
1.00 | p | 0.16 | p | 0.30 | p | 0.16 | p |
0.08 | p | 0.19 | p | 0.00 | p | 0.36 | p |
0.04 | p | 0.15 | p | 0.03 | p | 0.02 | p |
0.17 | p | 0.35 | p | 0.25 | p | 0.60 | p |
0.46 | p | 0.04 | p | 0.80 | p | 0.01 | p |
1.00 | p | 0.27 | p | 0.71 | p | 0.70 | p |
0.18 | p | 0.32 | p | 0.17 | p | 0.56 | p |
0.33 | m | 0.37 | p | 0.56 | m | 0.70 | p |
0.28 | m | 0.05 | p | 0.40 | m | 0.00 | p |
0.82 | m | 0.39 | p | 0.57 | m | 0.61 | p |
0.20 | p | 0.07 | p | 0.18 | p | 0.02 | p |
0.08 | p | 0.17 | p | 0.03 | p | 0.23 | p |
0.15 | p | 0.14 | m | 0.09 | p | 0.99 | m |
1.00 | p | 0.61 | m | 0.96 | p | 0.60 | m |
0.10 | m | 0.53 | m | 0.79 | m | 0.35 | m |
0.89 | m | 0.45 | m | 0.83 | m | 0.63 | m |
0.07 | m | 0.09 | m | 0.02 | m | 0.07 | m |
0.62 | m | 0.57 | m | 0.38 | m | 0.53 | m |
0.48 | m | 0.73 | m | 0.68 | m | 0.72 | m |
0.31 | m | 0.30 | m | 0.27 | m | 0.22 | m |
0.62 | m | 0.80 | m |
In this example, setting
The original MLE theory and methods in this paper were developed assuming the response function is a modified version of the Laplace probability density function, that is, assuming nontrivial perturbation and/or truncation to compact support
In the absence of perturbation or truncation of the response function, the results in this paper correspond to the theory of LAE (or median) regression as found in [
We present an original and practical method of obtaining the covariance matrix for the model coefficients. This involves evaluating
For LAE regression, other methods of determining approximations to this covariance matrix may be found in the literature. In particular, in the method of quantile regression [
We prove that, even for truncated and perturbed Laplace response functions, subject to certain restrictions, the maximum of the log-likelihood function occurs at a data point. This result is well-known in the case of LAE regression. A proof that the LAE estimator passes through at least
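The well-known LAE case of this property can be illustrated with a one-coefficient model through the origin, where the sum of absolute residuals is convex and piecewise linear in the slope, with kinks exactly at the ratios y/x of the data points. The sketch below covers only this unperturbed, untruncated case; the data and function names are hypothetical.

```python
def sum_abs_residuals(beta, xs, ys):
    """LAE objective for the model y = beta * x."""
    return sum(abs(y - beta * x) for x, y in zip(xs, ys))


def lae_slope(xs, ys):
    """Search only the slopes for which the fitted line passes through a data point."""
    candidates = [y / x for x, y in zip(xs, ys) if x != 0]
    return min(candidates, key=lambda b: sum_abs_residuals(b, xs, ys))


# Hypothetical data; the last observation is a deliberate outlier.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.9, 9.0]

beta_hat = lae_slope(xs, ys)

# Compare against a fine grid of slopes: none of them does better.
grid = [i / 1000 for i in range(0, 3001)]
best_on_grid = min(sum_abs_residuals(b, xs, ys) for b in grid)
print(beta_hat, sum_abs_residuals(beta_hat, xs, ys), best_on_grid)
```

Because the minimum is attained at a kink, restricting the search to data-point candidates loses nothing here. With more coefficients the candidate set grows combinatorially, which is one reason simplex-type methods are used in practice.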
Three asymptotically equivalent test statistics for LAE regression may be found in [
When working with a model for which the response function is assumed to be a truncated Laplace probability density function, we could ignore the truncation to
The original formulae derived in Section
Preliminary results indicate the use of an amended Laplace distribution enables a clearer separation of means than that given by other more standard procedures, for example, beta regression, in cases where independent evidence suggests that the means are different [
The Laplace distribution is the basis of many mathematical models (see [
Molecular biology deals with complex interactions, both in the physiology of the processes of interest and in the instrumentation required to measure these effects. The non-linearity of these processes can result in frequency distributions that are far from normal, so that application of “standard” methods of statistical inference based on least squares may be inadequate. Methods which deal directly with the form of the frequency distribution, such as maximum likelihood, are necessary for adequate inference to be made.
The Laplace or double exponential distribution considered here has been observed in molecular biology studies, where a significant proportion of high deviations appears to occur regularly [
We prove Lemma
Let
Theorem
Rockafellar [
Let
For our purposes, the effective domain of
Let
Let
Let
The simulated high (H) and low (L) treatment data analysed in Section
The authors wish to acknowledge funding support provided by the National Research Centre for Growth and Development, New Zealand (G. Wake, A. Pleasants, A. Sheppard), and the Foundation of Research Science and Technology, New Zealand (UOAX0808, A. Sheppard). Further, the authors acknowledge their collaborative link with the GUSTO birth cohort, led by Professors P. D. Gluckman, University of Auckland, and Yap-Seng Chong, National University of Singapore.