Recovering Decay Rates from Noisy Measurements with Maximum Entropy in the Mean

We present a new method, based on the method of maximum entropy in the mean, which builds upon the standard method of maximum entropy, to improve the parametric estimation of a decay rate when the measurements are corrupted by a large level of noise and, more importantly, when the number of measurements is small. The method is developed in the context of a concrete example: the estimation of the parameter of an exponential distribution. We show how to obtain an estimator with the noise filtered out and, using simulated data, we compare the performance of our method with the Bayesian and maximum likelihood approaches.


Introduction
Suppose that you want to measure the half-life of a decaying nucleus, the lifetime of some elementary particle, or some other random quantity modeled by an exponential distribution describing, say, a decay time or the lifetime of a process. Assume as well that the noise in the measurement process can be modeled by a centered Gaussian random variable whose variance may be of the same order of magnitude as that of the decay rate to be measured. We assume that the variance δ² of the noise is known. To make things worse, assume that you can collect only very few measurements.
We want to emphasize that the method developed is tailored, on the one hand, to this one model, important in applications, and, on the other hand, to the fact that the samples are small and contaminated by observational noise on top of their inherent randomness. What the method provides in general is a technique for filtering noise out.
Thus, if x_i denotes a realized value of the variable, one can only measure y_i = x_i + e_i, for i = 1, 2, ..., n, where n is a small number, say 2 or 3, and e_i denotes the additive measurement noise. When one knows that the distribution of X is exponential, the parameter should be estimated by (1/n)Σ y_i = (1/n)Σ x_i + (1/n)Σ e_i. In other words, if you know that the sample comes from a specific parametric distribution but is contaminated by additive noise, then your estimator of the relevant parameters is contaminated by the measurement noise. What to do? One possible approach is to apply to the small sample a standard statistical estimation procedure such as maximum likelihood. But these work well when the sample size is larger than what concerns us here. In our particular example, the MLE is ȳ = (1/n)Σ y_i, in which the noise may be important unless n is large. Thus, apart from the issues arising from the smallness of the sample, we have the issue of the presence of the observational noise. We direct the reader to the work of Rousseeuw and Verboven [1], in which issues relating to estimation in small samples are discussed.
Still another possibility, the one we want to explore here, is to apply a maxentropic filtering method to estimate both the unknown variable and the noise level. For this we recast the problem as a typical inverse problem consisting of solving for x in

y = Ax + e, x ∈ K,   (1.1)
where K is a convex set in R^d, y ∈ R^k for some d and k, and A is a k × d matrix which depends on how we rephrase our problem. We could, for example, consider the following problem: find x ∈ [0, ∞) such that

ȳ = x + e.   (1.2)

In our case K = [0, ∞), and we set ȳ = (1/n)Σ_j y_j. Or we could consider a collection of n such problems, one for every measurement, and then proceed to carry out the estimation. Once we have solved the generic problem (1.1), the variations on the theme are easy to write down. What is important to keep in mind here is that the output of the method is a filtered estimator x* of x, which itself is an estimator of the unknown parameter. The novelty then is to filter out the noise in (1.2).
The method of maximum entropy in the mean (MEM) is rather well suited for solving problems like (1.1). See Navaza [2] for an early development and Dacunha-Castelle and Gamboa [3] for a full mathematical treatment. We shall briefly review what the method is about and then apply it to obtain an estimator x* from (1.2). In Section 3 we obtain the maxentropic estimator, and in Section 4 we examine some of its properties; in particular, we examine what the results would be if either the noise level were small or the number of measurements were large. We devote Section 5 to some simulations in which the method is compared with the Bayesian and maximum likelihood approaches.

The Basics of MEM
MEM is a technique for transforming a possibly ill-posed linear problem with convex constraints into a simpler, usually unconstrained but nonlinear, minimization problem. The number of variables in the auxiliary problem is equal to the number of equations in the original problem, k in the case of (1.1). To carry out the transformation, one thinks of the x there as the expected value of a random variable X with respect to some measure P to be determined. The basic datum is a sample space (Ω_s, F_s) on which X is to be defined. In our setup the natural choice is to take Ω_s = K, F_s = B(K), the Borel subsets of K, and X = id_K, the identity map. Similarly, we think of e as the expected value of a random variable V taking values in R^k. The natural choice of sample space here is Ω_n = R^k with F_n = B(R^k), the Borel subsets.
To continue we need to select prior measures dQ_s(ξ) and dQ_n(v) on (Ω_s, F_s) and (Ω_n, F_n). The only restriction that we impose on them is that the closure of the convex hull of supp(Q_s) (resp., of supp(Q_n)) is K (resp., R^k). These prior measures embody knowledge that we may have about x and e, but they are not priors in the Bayesian sense. Actually, the model for the noise component describes the characteristics of the measurement device or process, and it is a datum. The two pieces are put together by setting Ω = Ω_s × Ω_n, F = F_s ⊗ F_n, and dQ(ξ, v) = dQ_s(ξ) dQ_n(v). To get going, we define the class

P = { P ≪ Q : E_P[AX + V] = y }.

Note that any P ∈ P having a strictly positive density ρ = dP/dQ satisfies E_P[X] ∈ int(K). For this standard result in analysis, check Rudin's book [4]. The procedure to explicitly produce such P's is known as the maximum entropy method. The first step is to assume that P ≠ ∅, which amounts to saying that our inverse problem (1.1) has a solution, and to define the entropy by the rule

S_Q(P) = −∫_Ω ln(dP/dQ) dP

whenever the function ln(dP/dQ) is P-integrable, and S_Q(P) = −∞ otherwise. This entropy functional is concave on the convex set P. To guess the form of the density of the measure P* that maximizes S_Q, one considers the class of exponential measures on Ω defined by

dP_λ = e^{−⟨λ, Aξ + v⟩} dQ / Z(λ),

where the normalization factor is

Z(λ) = ∫_Ω e^{−⟨λ, Aξ + v⟩} dQ.

Here λ ∈ R^k, and on D(Q) = {λ : Z(λ) < ∞} we define the dual entropy function by the rule

Σ(λ) = ln Z(λ) + ⟨λ, y⟩.

It is easy to prove that Σ(λ) ≥ S_Q(P) for any λ ∈ D(Q) and any P ∈ P. Thus, if we were able to find a λ* ∈ D(Q) such that P_{λ*} ∈ P, we are done. To find such a λ*, it suffices to minimize the convex function Σ(λ) over the convex set D(Q). We leave it to the reader to verify that if the minimum is reached in the interior of D(Q), then P_{λ*} ∈ P. We direct the reader to [4, 5] for all about this and much more. For a review of the use of maximum entropy in the mean for solving linear inverse problems, the reader might want to look at [6].
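The dual minimization just described can be illustrated numerically. The sketch below anticipates the concrete Gamma and Gaussian priors of Section 3; the particular values of n, α, δ, and ȳ are arbitrary choices for the demonstration, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative one-dimensional MEM instance (values are arbitrary):
# signal prior Q_s = Gamma(shape n, rate n*alpha),
# noise prior  Q_n = N(0, delta^2/n), datum y_bar.
n, alpha, delta, y_bar = 3, 0.8, 0.5, 1.2

def log_Z(lam):
    # log normalization factor log E_Q[exp(-lam*(xi + v))], finite for lam > -n*alpha
    return n * np.log(n * alpha) - n * np.log(lam + n * alpha) \
        + delta**2 * lam**2 / (2 * n)

def Sigma(lam):
    # dual entropy: log Z(lambda) + lambda * y
    return log_Z(lam) + lam * y_bar

res = minimize_scalar(Sigma, bounds=(-n * alpha + 1e-9, 50.0), method="bounded")
lam_star = res.x
x_star = n / (lam_star + n * alpha)   # E_{P_lambda*}[X], the filtered estimator
e_star = -delta**2 * lam_star / n     # E_{P_lambda*}[V], the residual noise
# at the minimizer the data constraint y = x* + e* is reproduced
assert abs(x_star + e_star - y_bar) < 1e-3
```

The numerical minimizer agrees with the closed-form estimator derived in Section 3, which is the point of the transformation: one unconstrained scalar minimization replaces the constrained problem.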

Entropic Estimators
Let us now turn our attention to (1.2). Since our estimator is a sample mean of exponentials of unknown parameter, it is natural to assume, for the method described in Section 2, that the prior Q_s for X is a Γ(n, nα) distribution (shape n and rate nα, hence mean 1/α), where α > 0 is our best or prior guess of the unknown parameter. Further below we propose a criterion for the best choice of α. Similarly, we shall choose as prior for the noise component Q_n the distribution of an N(0, δ²/n) random variable. Things are rather easy under these assumptions. To begin with, note that

Z(λ) = (nα/(λ + nα))^n e^{δ²λ²/(2n)}, λ > −nα,   (3.1)

and the typical member dP_λ(ξ, v) of the exponential family is now

[(λ + nα)^n / Γ(n)] ξ^{n−1} e^{−(λ+nα)ξ} (2πδ²/n)^{−1/2} e^{−n(v + δ²λ/n)²/(2δ²)} dξ dv.   (3.2)
It is also easy to verify that the dual entropy function Σ(λ) is given by

Σ(λ) = n ln(nα) − n ln(λ + nα) + δ²λ²/(2n) + λȳ, λ > −nα.   (3.3)

Writing x* = n/(λ* + nα) and e* = −δ²λ*/n for the means of X and V under P_{λ*}, the first-order condition Σ'(λ*) = 0 reads

ȳ = x* + e*,   (3.4)

which leads to the quadratic equation

(x*)² − (ȳ − αδ²) x* − δ² = 0.   (3.5)

Discarding one of the solutions (because it leads to a negative estimator of a positive quantity), we are left with

x* = (1/2)[(ȳ − αδ²) + √((ȳ − αδ²)² + 4δ²)],   (3.6)

from which we obtain as well

e* = ȳ − x* = (1/2)[(ȳ + αδ²) − √((ȳ − αδ²)² + 4δ²)].   (3.7)
Comment 1. Clearly, from (3.4) it follows that ȳ = x* + e*. Thus it makes sense to think of x* as the estimator with the noise filtered out, and of e* as the residual noise.

Properties of x *
Let us now spell out some of the notation underlying the probabilistic model behind (1.1). We shall assume that the x_i and the e_i in the first section are values of random variables X_i and ε_i defined on a sample space (W, 𝒲). For each θ > 0, we assume to be given a probability law P_θ on (W, 𝒲) with respect to which the sequences {X_k : k = 1, 2, ...} and {ε_k : k = 1, 2, ...} are both i.i.d. and independent of each other, and with respect to which X_k ~ exp(θ) and ε_k ~ N(0, δ²). That is, we take the underlying model for the noise as our prior model for it; this minimal consistency is all we require. From (3.6) and (3.7), the following basic results are easy to obtain.

Lemma 4.1. If we take α = 1/ȳ, then λ* = 0, x* = ȳ, and e* = 0.
Comment 2. Actually it is easy to verify that the solution to x*(α) = 1/α is α = 1/ȳ.
To examine the case in which large data sets are available, let us add a superscript n and write ȳ^(n) to emphasize the size of the sample. Since ȳ^(n) is the arithmetic mean of an i.i.d. sequence y_k = x_k + e_k, where the x_k have exp(θ) as common law, the following results follow from the LLN.

Lemma 4.2. As n → ∞, one has P_θ-almost surely

x*^(n) → x̄(α) = (1/2)[(1/θ − αδ²) + √((1/θ − αδ²)² + 4δ²)].   (4.1)

Proof. Start from (3.6), invoke the LLN to conclude that ȳ^(n) tends to 1/θ, and obtain (4.1).
Lemma 4.3. With x̄(α) as in (4.1), one has x̄(θ) = 1/θ.

Proof. Just look at the right-hand side of (4.1) with α = θ to conclude that x̄(θ) = 1/θ.
Comment 3. What this asserts is that when the number of measurements is large, to find the right value of the parameter it suffices to solve x̄(α) − 1/α = 0. And when the noise level goes to zero, we have the following.
Lemma 4.4. As δ → 0, x* → ȳ and e* → 0.

Proof. When δ → 0, dQ_n(v) → δ_0(dv), the Dirac point mass at 0. In this case, we just set δ = 0 in (3.4) and the conclusion follows.
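Returning to Comment 3, the large-sample recipe can be checked numerically: solving x̄(α) = 1/α, with x̄ as in (4.1), recovers the true parameter. The values of θ and δ below are arbitrary illustration choices.

```python
import numpy as np
from scipy.optimize import brentq

theta_true, delta = 1.3, 0.5          # arbitrary illustration values
ybar_limit = 1.0 / theta_true          # LLN limit of the sample mean

def x_bar(a):
    # large-sample limit of x* as a function of the prior guess a, eq. (4.1)
    t = ybar_limit - a * delta**2
    return 0.5 * (t + np.sqrt(t**2 + 4 * delta**2))

# solve x_bar(alpha) = 1/alpha; the root is the true parameter theta
alpha_hat = brentq(lambda a: x_bar(a) - 1.0 / a, 1e-6, 50.0)
assert abs(alpha_hat - theta_true) < 1e-8
```

The bracket [1e-6, 50] is a pragmatic choice: the objective is negative near 0 (where 1/α blows up) and positive for large α, so a sign change is guaranteed on that interval.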
When we choose α = 1/ȳ, the estimator x* happens to be unbiased.
Lemma 4.5. Let θ denote the true but unknown parameter of the exponential, and let P_θ(dy) have density f_θ(y) for y > 0 (and 0 otherwise), given by the convolution of the laws of x̄^(n) and ē^(n). With the notations introduced above, one has E_{P_θ}[x*^(n)] = 1/θ whenever the prior guess for the maxent is α = 1/ȳ.
Proof. It drops out easily from Lemma 4.1, from (1.2), and the fact that the joint density f_θ of ȳ is a convolution.
But the right choice of the parameter α is still a pending issue. To settle it, we consider once more the identity |ȳ − x*| = |e*|. In our particular case we shall see that α = 0 minimizes the right-hand side of this identity. Thus, we propose to choose α to minimize the residual or reconstruction error.

Lemma 4.6. With the same notations as above, e* is a monotone (increasing) function of α, with

e*(α = 0) = (1/2)(ȳ − √(ȳ² + 4δ²)), e*(α → ∞) = ȳ.

In the first case x*(α = 0) = (1/2)(ȳ + √(ȳ² + 4δ²)), whereas in the second x*(α → ∞) = 0.
Proof. Recall from the first lemma that when αȳ = 1, then e* = 0. A simple algebraic manipulation shows that when αȳ > 1 then e* > 0, and that when αȳ < 1 then e* < 0. To compute the limit of e* as α → ∞, note that for large α we can neglect the term 4δ² under the square root sign, and then the result drops out. It is also easy to check the positivity of the derivative of e* with respect to α. Clearly, also, |e*(0)| < |e*(∞)|.
To sum up, with the choice α = 0, the entropic estimator and residual error are

x* = (1/2)(ȳ + √(ȳ² + 4δ²)), e* = (1/2)(ȳ − √(ȳ² + 4δ²)).   (4.3)
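The pair in (4.3) satisfies two built-in consistency checks: ȳ = x* + e*, and x*·(−e*) = δ², since x* and e* are the two roots of the underlying quadratic and their product is −δ². A minimal numerical check, with arbitrary values of ȳ and δ:

```python
import math

def mem_alpha0(y_bar, delta):
    # eq. (4.3): entropic estimator and residual error for alpha = 0
    x_star = 0.5 * (y_bar + math.sqrt(y_bar**2 + 4 * delta**2))
    return x_star, y_bar - x_star

x_star, e_star = mem_alpha0(1.0, 0.5)          # arbitrary y_bar and delta
assert abs(x_star + e_star - 1.0) < 1e-12      # y = x* + e*
assert abs(x_star * (-e_star) - 0.25) < 1e-12  # product of roots is delta^2
assert x_star > 0 > e_star                     # filtered estimate stays positive
```

Note that x* > 0 for every ȳ and δ, so the positivity constraint x ∈ [0, ∞) is automatically respected by the closed form.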

Simulation and Comparison with the Bayesian and Maximum Likelihood Approaches
In this section we compare the proposed maximum-entropy-in-the-mean procedure with the Bayesian and maximum likelihood estimation procedures. We do that by simulating data, carrying out the three procedures, and plotting the histograms of the corresponding estimators. First, we generate histograms that describe the statistical nature of x* as a function of the parameter α. For that we generate a data set of 5000 samples of 1, 3, 5, and 10 measurements, and for each of them we obtain x* from (4.3). Also, for each data set we apply a Bayesian estimation method, a simple-minded maximum likelihood estimation, and a maximum likelihood method, and plot the resulting histograms.

The Simple-Minded MLE
This consists of an application of the MLE method as if there were no measurement noise. We carried this out for the sake of comparison, to verify that when the sample size becomes larger, the effect of the measurement noise is washed away on the average. The plot of the results for n = 1 is too scattered, and we do not display it. The result of the simulations is displayed in Figure 1.
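A sketch of this comparison (using θ = 1 and δ² = 0.25, matching the simulation settings stated below; for simplicity this sketch does not reject negative measurements) shows the washing-out effect: the mean of the naive estimator ȳ stays near 1/θ while its spread shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def naive_mle(n, theta=1.0, delta=0.5, n_rep=5000):
    # sample means ybar of n noisy measurements, ignoring the noise entirely
    y = rng.exponential(1.0 / theta, size=(n_rep, n)) \
        + rng.normal(0.0, delta, size=(n_rep, n))
    return y.mean(axis=1)

small, large = naive_mle(3), naive_mle(100)
# the mean stays near 1/theta = 1, while the spread shrinks roughly like 1/sqrt(n)
```

Histogramming `small` and `large` reproduces the qualitative behavior of Figure 1: both are centered near 1, but the n = 3 histogram is much wider.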

The Maxentropic Estimator
The simulated data process goes as follows. For n = 3, the data points y_1, y_2, y_3 are obtained in the following way.
(i) Simulate a value x_i from an exponential distribution with parameter θ = 1.
(ii) Simulate a value e_i from a normal distribution N(0, δ² = 0.25).
(iii) Sum x_i and e_i to get y_i; if y_i < 0, repeat the first two steps until y_i > 0.
(iv) Compute the maximum entropy estimator given by (4.3).
We then display the resulting histogram in Figure 2.
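The steps above can be sketched as follows (θ = 1 and δ² = 0.25 as stated; the function names are choices of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def mem_estimate(y_bar, delta):
    # entropic estimator with alpha = 0, eq. (4.3)
    return 0.5 * (y_bar + np.sqrt(y_bar**2 + 4 * delta**2))

def simulate_mem(n, theta=1.0, delta=0.5, n_rep=5000):
    estimates = np.empty(n_rep)
    for r in range(n_rep):
        ys = []
        while len(ys) < n:
            y = rng.exponential(1.0 / theta) + rng.normal(0.0, delta)
            if y > 0:                     # step (iii): reject negative measurements
                ys.append(y)
        estimates[r] = mem_estimate(np.mean(ys), delta)
    return estimates

hist_data = simulate_mem(3)   # histogram this, as in Figure 2
```

Repeating the call with n = 1, 5, and 10 gives the remaining panels of the figure.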

The Bayesian Estimator
In this section we derive the algorithm for a Bayesian inference of the model given by y_i = x + e_i, for i = 1, 2, ..., n. The classical likelihood estimator of x is given by ȳ = (1/n)Σ_{i=1}^n y_i. As we know that the unknown mean x has an exponential probability distribution with parameter θ (x ~ E(θ)), the joint density of the y_i and x is proportional to

exp(−Σ_i (y_i − x)²/(2δ²)) θ e^{−θx} π(θ),

where θ e^{−θx} is the density of the unknown mean x and where π(θ) ∝ θ^{−1} is Jeffreys' noninformative prior distribution for the parameter θ [5].
In order to derive the Bayesian estimator, we need the posterior probability distribution of θ, which we sample with the following Gibbs scheme [7].

(i) Draw x from its full conditional given θ and the data, an N(ȳ − θδ²/n, δ²/n) distribution truncated to (0, ∞).

(ii) Draw θ ~ E(x).
Repeat this algorithm many times in order to obtain a large sample from the posterior distribution of θ, and hence the posterior distribution of E[x] = 1/θ. For our application, we simulate data with θ = 1, which gives an expected value for x equal to E[x] = 1.
We get the histogram displayed in Figure 3 for the estimates of E[x] after 5000 iterations when simulating data with θ = 1. The results are summarized in Tables 1, 2, and 3. When simulating data for θ = 1, the MEM, maximum likelihood, and Bayesian histograms are all skewed to the right and yield comparable means.

Figure 1: The simple MLE for different n.

Figure 2: Histogram for E[x] with MEM for different n.

Figure 3: Histogram for E[x] with the Bayes method for different n.

Figure 4: Histogram for E[x] with the MLE for different n.

Table 1: Means and standard deviations for the different methods when n = 3.

Table 2: Means and standard deviations for the different methods when n = 5.