Sample dependence in the maximum entropy solution to the generalized moment problem

The method of maximum entropy is a powerful tool for solving the generalized moment problem, which consists of determining the probability density of a random variable X from the knowledge of the expected values of a few functions of the variable. In practice, such expected values are determined from empirical samples, leaving open the question of the dependence of the solution upon the sample. It is the purpose of this note to take a few steps towards the analysis of such dependence.


Introduction and preliminaries
To state what the generalized moment problem is about, let $(\Omega, \mathcal{F}, P)$ be a probability space and let $(S, \mathcal{B}, m)$ be a measure space, with $m$ a finite or $\sigma$-finite measure.
Let $X$ be an $S$-valued random variable whose distribution has a density with respect to the measure $m$. The generalized moment problem consists in determining a density $f(x)$ such that
$$\int_S h_k(x) f(x)\,dm(x) = d_k, \qquad k = 0, 1, \ldots, M,$$
where the $h_k : S \to \mathbb{R}$ are a collection of measurable functions, the $d_k$ are given real numbers, and we set $h_0 \equiv 1$ and $d_0 = 1$ to take care of the natural requirement on $f$. A typical example is the following. $X$ stands for a positive random variable (a stopping time, or perhaps a total risk severity) and we can compute $E[\exp(-\alpha_k X)] = d_k$ by some Monte Carlo procedure at a finite number of points $\alpha_k$. The problem that we need to solve amounts to inverting the Laplace transform from such a finite collection of values of the transform parameter $\alpha$.
Actually this last problem is of much interest in the banking and insurance industries, where the density is necessary to compute risk premia and regulatory capital of various types. There, samples may be small, and the estimation of the $d_k$ reflects that.
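To make the role of the sample concrete, the following minimal sketch estimates the $d_k$ by plain Monte Carlo and compares them with their exact values. The Gamma(2, 1) law for $X$, the grid of $\alpha_k$ values and the helper name `sample_moments` are assumptions made for illustration only; that law is chosen because its Laplace transform, $(1+\alpha)^{-2}$, is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption for illustration: X ~ Gamma(shape=2, scale=1), whose Laplace
# transform E[exp(-a X)] = (1 + a)^(-2) is known in closed form.
alphas = np.array([0.5, 1.0, 1.5, 2.0])
exact_d = (1.0 + alphas) ** (-2.0)


def sample_moments(sample, alphas):
    """Empirical estimates of d_k = E[exp(-alpha_k X)] from a sample of X."""
    return np.exp(-np.outer(sample, alphas)).mean(axis=0)


# The estimation error shrinks as the sample size grows.
for n in (50, 500, 5000):
    d_hat = sample_moments(rng.gamma(2.0, 1.0, size=n), alphas)
    print(n, np.abs(d_hat - exact_d).max())
```

For small samples the estimated moments can differ visibly from the exact ones, which is the source of variability studied in the rest of the note.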
We mention the unpublished work by Leitner and Temnov (2009). We want to understand the variability in the density $\hat f^*_N$ obtained by applying the maximum entropy method, due to the variability of the sample $X_1, \ldots, X_N$. For that, in the next section we recall, in a (short) historical survey, the notion of entropy of a density, and in the following section we present the basics of the maximum entropy method.
In Section 4 we take up the main theme of this work: the variability of $\hat f^*_N$ that comes in through the $\hat d_k$. There we prove that $\hat f^*_N$ converges pointwise and in $L_1$ to the maxentropic density $f^*$ obtained from the exact data, and we shall examine how $\hat f^*_N$ deviates from $f^*$ in terms of the difference between the true and the estimated (sample) moments.

The entropy of a density
As there seem to be several notions of entropy, it is the aim of this section to point out that they are all variations on the theme of a single definition. Let us begin by spelling out what it is that we call the entropy of a density. Let $P$ be a measure on $(S, \mathcal{B})$. Suppose that $P \ll m$ and let $f$ denote its density. The entropy $S_m(P)$ is defined by
$$S_m(P) = -\int_S f(x) \ln f(x)\,dm(x). \qquad (2.1)$$
When $P$ is not a probability measure, (2.1) has to be modified accordingly. When $m$ is a probability measure, call it $Q$, and both $P$ and $Q$ are equivalent to a measure $n$, with densities, respectively, $f = dP/dn$ and $g = dQ/dn$, then (2.1) can be written as
$$S_n(f, g) = -\int_S f(x) \ln\frac{f(x)}{g(x)}\,dn(x).$$
Comment For the applications that we shall be dealing with, $S$ will stand for a closed, convex subset of some $\mathbb{R}^n$, and $m$ will be the usual Lebesgue measure. We also mention that if $m$ is a discrete measure, then the integrals become sums.
The expression (2.1) seems to have made its first appearance in the work of Boltzmann in the last quarter of the XIX-th century. There it was defined on the space of positions and velocities of a single particle, where $f(x, v)\,dx\,dv$ was to be interpreted as the number of particles with position within $dx$ and velocity within $dv$. The function happened to be a Lyapunov functional for the dynamics that Boltzmann proposed for the evolution of a gas, which grew as the gas evolved towards equilibrium. Not much later Gibbs used the same function, but now defined on $\mathbb{R}^{3N} \times \mathbb{R}^{3N}$, whose points $(q, p)$ denote the joint positions and momenta of a system of $N$ particles. This time $dm = dq\,dp$, and $f(q, p)\,dq\,dp$ is the probability of finding the system within the specified "volume" element. Motivated by earlier work in thermodynamics, it was postulated that in equilibrium the density of the system yielded a maximum value to the entropy $S_m(f)$. These remarks explain the name of the method.
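The expression (2.1) (with a reversed sign) made its appearance in the field of information theory. The properties of $S_n(f, g)$ that we use below are the following.
Theorem 2.1
(ii) For any two densities $f$ and $g$, $S_n(f, g) \le 0$, and $S_n(f, g) = 0$ if and only if $f = g$ a.e. $n$.
(iii) For any two densities $f, g$ such that $S_n(f, g)$ is finite, we have (Kullback's inequality)
$$\frac{1}{4}\Big(\int_S |f - g|\,dn\Big)^2 \le -S_n(f, g).$$
The reader is directed to either Cover and Thomas (2003) or to Kullback (1968) for proofs.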
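The two properties of the relative entropy used later can be checked numerically. The sketch below assumes that $S(f, g) = -\int f \ln(f/g)\,dx$ on $[0, 1]$ with Lebesgue measure, and that Kullback's inequality is taken in the form $\frac{1}{4}\|f - g\|_1^2 \le -S(f, g)$. The two densities and the helper names are hypothetical choices, and the integrals are replaced by a trapezoidal rule.

```python
import numpy as np

# Fixed quadrature grid on [0, 1] (Lebesgue measure, trapezoidal rule).
x = np.linspace(0.0, 1.0, 2001)
dx = x[1] - x[0]


def integrate(values):
    """Trapezoidal rule on the fixed grid x."""
    return float(np.sum((values[1:] + values[:-1]) * 0.5) * dx)


def normalize(values):
    """Rescale a positive function so that it integrates to one."""
    return values / integrate(values)


def relative_entropy(f, g):
    """S(f, g) = -int f ln(f/g) dx for strictly positive f and g."""
    return -integrate(f * np.log(f / g))


# Two hypothetical densities on [0, 1].
f = normalize(np.exp(-2.0 * x))
g = normalize(1.0 + x ** 2)

s_fg = relative_entropy(f, g)
l1 = integrate(np.abs(f - g))
print("S(f, g)        =", s_fg)
print("||f - g||_1    =", l1)
print("bound holds    :", 0.25 * l1 ** 2 <= -s_fg)
```

The same check with `g` replaced by `f` returns zero relative entropy, in line with the equality case of item (ii).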

The standard maximum entropy method
Here we recall some well known results about the standard maximum entropy method (SME) along with some historical remarks. Even though the core idea seems to have first appeared in the work of Esscher (1932), where he introduced what nowadays is called the Esscher transform, it was not until the mid 1950's that it became part of the methods used in statistics, through the work of Kullback (1968). It seems to have been first formulated as a variational procedure by Jaynes (1957) to solve the (inverse) problem consisting of finding a probability density $f(x)$ (on the phase space of a mechanical system) satisfying the following integral constraints:
$$\int_S h_k(x) f(x)\,dm(x) = d_k, \qquad k = 0, 1, \ldots, M,$$
where the $d_k$ are observed (measured) expected values of some functions ("observables", or random variables in the probabilistic terminology) of the system. This problem appears in many fields; see Kapur (1989) and Jaynes (2003) for example.
Usually, we set $h_0 \equiv 1$ and $d_0 = 1$ to take care of the natural requirement on $f(x)$. It takes a standard computation to see that when the problem has a solution, it is of the type
$$f^*(x) = \exp\Big(-\sum_{k=0}^{M} \lambda^*_k h_k(x)\Big),$$
in which the number of moments $M$ appears explicitly. It is customary to write $e^{-\lambda^*_0} = Z(\lambda^*)^{-1}$, where $\lambda^* = (\lambda^*_1, \ldots, \lambda^*_M)$ is an $M$-dimensional vector. Clearly, the generic form of the normalization factor is given by
$$Z(\lambda) = \int_S e^{-\langle \lambda, h(x)\rangle}\,dm(x).$$
With this notation the generic form of the solution can be rewritten as
$$f^*(x) = \frac{e^{-\langle \lambda^*, h(x)\rangle}}{Z(\lambda^*)}. \qquad (3.4)$$
Here $\langle a, b\rangle$ denotes the standard Euclidean scalar product in $\mathbb{R}^M$, and $h(x)$ is the vector with components $h_k(x)$. At this point we mention that the simple-minded proof appearing in many physics textbooks is not really correct. That is because the set of densities is not open in $L_1(dm)$. There are many alternative proofs. Consider for example the work by Csiszar (1975) and Cherny and Maslov (2003).
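The heuristics behind (3.4) and what comes next is the following. If in statement (ii) of Theorem (2.1) we take $g(x)$ to be any member of the exponential family $\lambda \mapsto g_\lambda(x) = e^{-\langle \lambda, h(x)\rangle}/Z(\lambda)$, then for any density $f$ satisfying the constraints we obtain
$$S_m(f) \le \ln Z(\lambda) + \langle \lambda, d\rangle,$$
which suggests that if we find a minimizer $\lambda^*$ such that the inequality becomes an equality, by Theorem (2.1) we conclude that (3.4) is the desired solution. This dualization argument seems to have been first proposed in Mead and Papanicolau (1974), and is expounded in full rigor in Borwein and Lewis (2000). The vector $\lambda^*$ can be found by minimizing the dual entropy
$$\Sigma(\lambda, d) = \ln Z(\lambda) + \langle \lambda, d\rangle,$$
where $d$ is the $M$-vector with components $d_k$, and obviously, the dependence on $\alpha$ is through $d$. We add that, technically speaking, the minimization of $\Sigma(\lambda, d)$ is over the domain of $Z(\lambda)$, which is a convex set. In most applications it is just $\mathbb{R}^M$. For the record, the outcome of the duality argument is the identity
$$S_m(f^*) = \Sigma(\lambda^*, d) = \ln Z(\lambda^*) + \langle \lambda^*, d\rangle.$$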
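The dual minimization can be sketched numerically as follows. The block assumes the exponential form $e^{-\langle\lambda, h(x)\rangle}/Z(\lambda)$ for the solution and the dual function $\ln Z(\lambda) + \langle\lambda, d\rangle$. Everything else is a hypothetical choice made for illustration: $S = [0, 1]$ with Lebesgue measure, $h_k(x) = x^k$ for $k = 1, 2$, a trapezoidal quadrature, Newton's method with step halving as the minimizer, and the names `density`, `dual_entropy` and `solve_maxent`. It is a sketch under those assumptions, not the authors' numerical procedure.

```python
import numpy as np

# Quadrature grid on S = [0, 1] with trapezoidal weights (dm = Lebesgue measure).
x = np.linspace(0.0, 1.0, 2001)
w = np.full_like(x, x[1] - x[0])
w[[0, -1]] *= 0.5

# Hypothetical constraint functions h_k(x) = x^k, k = 1, 2 (rows of H).
H = np.vstack([x, x ** 2])


def density(lam):
    """Exponential-family density exp(-<lam, h(x)>) / Z(lam) on the grid."""
    u = np.exp(-lam @ H)
    return u / np.sum(w * u)


def dual_entropy(lam, d):
    """Dual function ln Z(lam) + <lam, d>."""
    return float(np.log(np.sum(w * np.exp(-lam @ H))) + lam @ d)


def solve_maxent(d, tol=1e-10, max_iter=100):
    """Minimize the dual function by Newton's method with step halving."""
    lam = np.zeros(H.shape[0])
    for _ in range(max_iter):
        p = w * density(lam)                 # probability weights on the grid
        mean = H @ p                         # expected value of h under lam
        grad = d - mean                      # gradient of the dual function
        if np.linalg.norm(grad) < tol:
            break
        centered = H - mean[:, None]
        hess = (centered * p) @ centered.T   # covariance of h = Hessian of ln Z
        step = np.linalg.solve(hess, grad)
        t, current = 1.0, dual_entropy(lam, d)
        # Halve the step until the dual function decreases sufficiently
        # (the 1e-12 slack absorbs rounding noise close to the minimum).
        while t > 1e-12 and dual_entropy(lam - t * step, d) > (
            current - 1e-4 * t * (grad @ step) + 1e-12
        ):
            t *= 0.5
        lam = lam - t * step
    return lam


# Moments generated from known multipliers, then recovered from the moments alone.
lam_true = np.array([1.0, -2.0])
d = H @ (w * density(lam_true))
lam_star = solve_maxent(d)
print("recovered multipliers:", lam_star)
print("moment residuals     :", H @ (w * density(lam_star)) - d)
```

Newton's method is used here because the gradient and Hessian of the dual function are the moment residual and the covariance of $h$, both cheap to evaluate on a grid; any convex minimizer could be substituted.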

Mathematical complement
In this section we gather some results about $Z(\lambda)$ that we need below. The following is well known. See Kullback (1968) for example.
Proposition 3.1
1 The function $Z(\lambda)$ defined above is log-convex, that is, $\ln Z(\lambda)$ is convex.
2 $Z(\lambda)$ is continuously differentiable as many times as we need.
The first two assertions are proved in Kullback's book. The third follows from the inverse function theorem of calculus; see Fleming (1987). The last one follows from the fact that the Jacobian of $\phi^{-1}$ equals the negative of the inverse of the Hessian matrix of $\ln Z(\lambda)$, which is (minus) the covariance matrix $C$.
As a simple consequence of item (4) in Proposition (3.1) we have a simple result that is relevant for the next section: $\hat f^*_N$ converges pointwise to $f^*$ as $N \to \infty$.
The next result concerns the convergence of $\hat f^*_N$ to $f^*$ in $L_1(dm)$.
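The relation between the derivatives of $\ln Z(\lambda)$ and the moments of $h$ can be checked with finite differences. The sketch below assumes $Z(\lambda) = \int e^{-\langle\lambda, h(x)\rangle}\,dm(x)$, so that the gradient of $\ln Z$ is minus the mean of $h$ and its Hessian is the covariance matrix of $h$ under the corresponding exponential density. The grid, the choice $h(x) = (x, x^2)$, the test point and the helper names are assumptions for illustration.

```python
import numpy as np

# Quadrature grid on [0, 1] with trapezoidal weights.
x = np.linspace(0.0, 1.0, 2001)
w = np.full_like(x, x[1] - x[0])
w[[0, -1]] *= 0.5
H = np.vstack([x, x ** 2])   # hypothetical h_1(x) = x, h_2(x) = x^2


def log_Z(lam):
    """ln Z(lam) computed by quadrature."""
    return float(np.log(np.sum(w * np.exp(-lam @ H))))


def mean_and_cov(lam):
    """Mean vector and covariance matrix of h under exp(-<lam, h>) / Z(lam)."""
    u = w * np.exp(-lam @ H)
    p = u / u.sum()
    mean = H @ p
    centered = H - mean[:, None]
    return mean, (centered * p) @ centered.T


def fd_gradient(func, lam, eps=1e-5):
    """Central finite-difference gradient."""
    basis = np.eye(lam.size) * eps
    return np.array([(func(lam + e) - func(lam - e)) / (2 * eps) for e in basis])


def fd_hessian(func, lam, eps=1e-4):
    """Central finite-difference Hessian."""
    basis = np.eye(lam.size) * eps
    m = lam.size
    out = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            out[i, j] = (
                func(lam + basis[i] + basis[j]) - func(lam + basis[i] - basis[j])
                - func(lam - basis[i] + basis[j]) + func(lam - basis[i] - basis[j])
            ) / (4 * eps ** 2)
    return out


lam = np.array([0.5, -1.0])
mean, cov = mean_and_cov(lam)
print("max |grad ln Z + mean| :", np.abs(fd_gradient(log_Z, lam) + mean).max())
print("max |Hess ln Z - cov|  :", np.abs(fd_hessian(log_Z, lam) - cov).max())
print("eigenvalues of cov     :", np.linalg.eigvalsh(cov))
```

Positive eigenvalues of the covariance matrix are what makes $\ln Z$ strictly convex at that point and the map from multipliers to moments locally invertible.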
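Theorem 4.2 With the notations introduced above, we have
$$\|\hat f^*_N - f^*\|_{L_1(dm)} \to 0 \quad \text{as } N \to \infty.$$
Proof The proof is an easy consequence of the continuous dependence of $\Sigma(\lambda, d)$ on its arguments, of the identity (3.1) and item (iii) in Theorem (2.1) with $\hat f^*_N$ playing the role of $f$ and $f^*$ playing the role of $g$. In this case $-S_m(\hat f^*_N, f^*)$ happens to be
$$-S_m(\hat f^*_N, f^*) = \Sigma(\lambda^*, \hat d) - \Sigma(\hat\lambda^*, \hat d),$$
where $\hat\lambda^*$ denotes the minimizer of $\Sigma(\lambda, \hat d)$, and which, as mentioned, tends to 0 as $N \to \infty$.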
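The convergence in $L_1$ can be illustrated by simulation. The sketch below draws samples from a known maxentropic density, estimates the moments, solves the dual problem for the estimated moments, and reports the $L_1$ distance together with the relative entropy $\int \hat f \ln(\hat f / f^*)\,dm$ and the dual gap $\Sigma(\lambda^*, \hat d) - \Sigma(\hat\lambda, \hat d)$, which should agree. It assumes the exponential form of the solution and the dual function $\ln Z(\lambda) + \langle\lambda, d\rangle$. The grid, the functions $h(x) = (x, x^2)$, the "exact" multipliers, the Newton solver and all names are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadrature grid on [0, 1] with trapezoidal weights; hypothetical h(x) = (x, x^2).
x = np.linspace(0.0, 1.0, 2001)
w = np.full_like(x, x[1] - x[0])
w[[0, -1]] *= 0.5
H = np.vstack([x, x ** 2])


def density(lam):
    u = np.exp(-lam @ H)
    return u / np.sum(w * u)


def dual_entropy(lam, d):
    return float(np.log(np.sum(w * np.exp(-lam @ H))) + lam @ d)


def solve_maxent(d, tol=1e-10, max_iter=100):
    """Newton's method with step halving on the dual function."""
    lam = np.zeros(H.shape[0])
    for _ in range(max_iter):
        p = w * density(lam)
        mean = H @ p
        grad = d - mean
        if np.linalg.norm(grad) < tol:
            break
        centered = H - mean[:, None]
        step = np.linalg.solve((centered * p) @ centered.T, grad)
        t, current = 1.0, dual_entropy(lam, d)
        while t > 1e-12 and dual_entropy(lam - t * step, d) > (
            current - 1e-4 * t * (grad @ step) + 1e-12
        ):
            t *= 0.5
        lam = lam - t * step
    return lam


lam_star = np.array([1.0, -2.0])   # hypothetical "exact" multipliers
f_star = density(lam_star)

results = {}
for n in (100, 1000, 10000):
    # Draw X_1, ..., X_n from f* (approximated by sampling grid points).
    sample = rng.choice(x, size=n, p=w * f_star)
    d_hat = np.array([sample.mean(), (sample ** 2).mean()])
    lam_hat = solve_maxent(d_hat)
    f_hat = density(lam_hat)
    l1 = float(np.sum(w * np.abs(f_hat - f_star)))
    kl = float(np.sum(w * f_hat * np.log(f_hat / f_star)))
    gap = dual_entropy(lam_star, d_hat) - dual_entropy(lam_hat, d_hat)
    results[n] = (l1, kl, gap)
    print(n, l1, kl, gap)
```

In runs of this kind the $L_1$ distance is expected to shrink as the sample size grows, although for a single sample the decrease need not be monotone.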
To continue, consider
The proof follows easily from (3.6). In that inequality, the Cauchy-Schwarz inequality in $\mathbb{R}^K$ is used. Also, taking limits as $N \to \infty$ in (3.6), we obtain another proof of the convergence of $\hat f^*_N$ to $f^*$. What is interesting about (2) in Theorem (4.3) is the possibility of combining it with Chebyshev's inequality to obtain rates of convergence. It is not hard to verify that
where $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^K$, and
Corollary 4.1 With the notations introduced in Theorem (4.3) and two lines above,
in law as $N \to \infty$. Also, for any bounded, Borel measurable $g(x)$, $\sqrt{N}(\delta d) \overset{D}{\sim} N(0, \sigma^2(g))$.
The proof of the assertions is standard. It involves applying the central limit theorem to the vector variable $\sqrt{N}\,(\hat h_N - d)$.
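The central limit behaviour of the sample moments can be illustrated by simulation. The sketch below assumes that $\hat h_N$ is the sample mean of $h(X_1), \ldots, h(X_N)$, and takes $X$ uniform on $[0, 1]$ with $h(x) = (x, x^2)$, hypothetical choices for which $d$ and the covariance matrix of $h(X)$ are known exactly. It compares the empirical covariance of $\sqrt{N}(\hat h_N - d)$ over many replications with the exact covariance.

```python
import numpy as np

rng = np.random.default_rng(0)


def h(x):
    """Hypothetical constraint functions h(x) = (x, x^2), stacked on the last axis."""
    return np.stack([x, x ** 2], axis=-1)


# Assumption: X ~ Uniform(0, 1), so the exact moments and covariance are known.
d = np.array([1 / 2, 1 / 3])
cov_h = np.array([[1 / 12, 1 / 12],
                  [1 / 12, 4 / 45]])

n, replications = 400, 4000
samples = rng.random((replications, n))
h_bar = h(samples).mean(axis=1)          # one sample mean per replication
z = np.sqrt(n) * (h_bar - d)             # rescaled deviations, shape (replications, 2)

emp_cov = np.cov(z, rowvar=False)
# Fraction of replications whose first component lies within 1.96 standard deviations.
coverage = float(np.mean(np.abs(z[:, 0]) <= 1.96 * np.sqrt(cov_h[0, 0])))

print("empirical covariance:\n", emp_cov)
print("exact covariance:\n", cov_h)
print("coverage of the 1.96-sigma interval:", coverage)
```

A coverage close to 0.95 is what the Gaussian limit predicts for the first component.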