Estimation of Poisson-Dirichlet Parameters with Monotone Missing Data

This article considers the estimation of the unknown numerical parameters and of the density of the base measure in a Poisson-Dirichlet process prior with grouped, monotone missing data. The numerical parameters are estimated by the method of maximum likelihood, and the density function is estimated by the kernel method. A set of simulations shows that the estimates perform well.


Introduction
As a young but fast-growing field of statistics, Bayesian nonparametrics (abbreviated BNP below) focuses on Bayesian solutions of nonparametric and other infinite-dimensional statistical models. Compared with frequentist statistics and classical Bayesian statistics, BNP provides highly flexible and robust models for infinite-dimensional parameter spaces. The most extensively investigated priors in BNP include the famous Dirichlet process prior [1] and the Polya tree process prior [2-5], which have played fundamental roles in the development of Bayesian nonparametrics.
Dirichlet processes are also referred to as one-parameter Poisson-Dirichlet processes by Kingman [6]. As the first and foremost generalization of Dirichlet processes, two-parameter Poisson-Dirichlet processes (abbreviated Poisson-Dirichlet processes below) were first discussed by Pitman and Yor [7] and have since enjoyed huge success in Bayesian nonparametric modeling of language, images, ecology, biology, genomics, and so on. Remarkable examples include Goldwater et al. [8], who used a Poisson-Dirichlet process as an adaptor to justify the appearance of type frequencies in formal analyses of natural language and improved the performance of an earlier model for unsupervised learning of morphology; Sudderth and Jordan [9], who modeled object frequencies and segment sizes by Poisson-Dirichlet processes and developed a statistical framework for the simultaneous, unsupervised segmentation and discovery of visual object categories from image databases; Favaro et al. [10], who used a Poisson-Dirichlet model to deal with the issue of prediction within species sampling problems; and Hoshino [11], who studied microdata disclosure risk assessment with Pitman's sampling formula, clarified some of its theoretical implications, and compared various models based on the Akaike Information Criterion by applying them to real data sets. For more references on the application of Poisson-Dirichlet processes in language learning, the reader is referred to Johnson et al. [12], Wood and Teh [13], and Wallach et al. [14].
While exact Bayesian methods assume that prior distributions are completely specified, empirical Bayesian methods deal with situations where prior distributions are at most partially specified and thus need to be estimated. Empirical Bayesian methods for parametric and semiparametric models have been investigated in a huge volume of literature. However, the study of empirical Bayesian methods in the framework of Bayesian nonparametrics is quite limited. A recently published paper is that by Yang and Wu [15], who studied the estimation of the prior with monotonically missing data when the prior is a Dirichlet process.
In this paper, we aim at estimating the unknown numerical parameters and the density of the base measure from independent and identically distributed (i.i.d.) groups of observations under a Poisson-Dirichlet process prior. Because the Dirichlet process prior is a special case of the Poisson-Dirichlet process prior, we have in fact extended the methodology of Yang and Wu [15] to a bigger model. The estimation of the unknown parameters is carried out by two different methods, the maximum likelihood method and a naive method proposed by Carlton [16], whose performances are compared in a simulation study.
Because there are two numerical parameters in Poisson-Dirichlet process priors, the maximum likelihood estimates (MLEs) of the unknown parameters are discussed under three different settings (see the next section for the definition of the parameters α and θ and of the density function): (i) the discount parameter α is unknown but the concentration parameter θ is known; (ii) the concentration parameter θ is unknown but the discount parameter α is known; and (iii) both α and θ are unknown. Favaro et al. [10] gave empirical Bayes estimates when both α and θ are unknown on complete data without missing values. Comparisons between the estimates of Favaro et al. [10] and ours are presented for the same sample size in terms of bias, standard deviation (SD), and mean squared error (MSE).
The remainder of this paper is structured as follows. In Section 2, we review the basic model and the definition of Poisson-Dirichlet process priors. The data structure and model assumptions are also described in this section. In Section 3, the MLEs of the prior parameters are discussed in detail under the three aforementioned situations. A naive estimate of the discount parameter α is also discussed. Section 4 discusses the estimation of the base distribution density by the kernel method. Section 5 presents a small simulation study to show the performance of the estimates discussed in Section 3.

The Data and Model
The data are observed in k time periods and accordingly organized in groups i = 1, 2, ..., k, with i representing the calendar time point at which the individuals in this group begin to be observed. Each group i contains n_i individuals. Denote by (i, j) the jth individual in group i, which is represented by a random vector X_ij with X_ijt being the tth coordinate of individual (i, j). The observations are p-dimensional (p ≤ k) vectors whose coordinates are observed sequentially in time, so the observations are subject to monotone missingness. A clearer picture of the data structure is exhibited in Table 1.
Hence, the observed data up to time k are monotone missing; that is, the components of an individual (i, j) are ordered in such a way that if the observation of a variable X_ijt for individual (i, j) is missing, then so are the observations of all subsequent variables X_ijs, s ≥ t (if any), for the same individual. Clearly, Table 1 indicates that the variables for individuals in the first k − p + 1 groups are completely observed, and only the first k − i + 1 components are observable for individuals in groups i = k − p + 2, ..., k.
Data structured as in Table 1 are frequently encountered in the real world. A typical example is loss data for claims used for the purpose of loss reserving in the non-life insurance industry (see, e.g., [15, 17, 18]). Assume that the evaluation time of loss reserving is accident year k; each claim made in accident year i (i ≤ k) will be paid at the end of each of the subsequent p years after the claim (the first payment being the one made at the end of the year the claim is made), so that the payments of an individual (i, j) correspond to a p-vector (X_ij1, ..., X_ijt, ..., X_ijp). Then, observing at the end of the evaluation year k, the payments for a claim in accident year i have been paid only for those t such that t ≤ k − i + 1; that is, the observable payments of claim (i, j) are just the subvector X^o_ij = (X_ij1, X_ij2, ..., X_ij,k−i+1), where the superscript "o" means "observable." Other examples of data with this structure can be found in Marini et al. [19], who first discussed maximum likelihood estimation in panel studies; Hao and Krishnamoorthy [20], who considered testing and estimation problems; and Raats et al. [21], who discussed estimation and testing for the multivariate regression model, among a large number of others with monotone missing data.
Before presenting the probabilistic characteristics of the data under a BNP framework in Assumption 2, we recall the definition of Poisson-Dirichlet processes in terms of stick-breaking. Let (α, θ) be a pair of real numbers satisfying α ∈ [0, 1) and θ > −α; this constraint is always assumed and hence will not be restated everywhere.
(1) Let (V_j)_{j≥1} be independent random variables with V_j ∼ Beta(1 − α, θ + jα), and define the stick-breaking weights p_1 = V_1 and p_j = V_j ∏_{i<j} (1 − V_i) for j ≥ 2; the law of p = (p_j)_{j≥1} is denoted PD(α, θ).
(2) Let (ξ_j)_{j≥1} be a sequence of random variables independent and identically distributed according to a probability measure H (called the base distribution below), taking values in a measurable space (X, B) and independent of p = (p_j)_{j≥1} ∼ PD(α, θ). Then the random probability measure G = Σ_{j≥1} p_j δ_{ξ_j} on (X, B) is referred to as a Poisson-Dirichlet process (indexed by B) with parameters α, θ and base distribution H(⋅); in symbols, G ∼ PD(H; α, θ).
Clearly, for given p and (ξ_j)_{j≥1}, a realization of the Poisson-Dirichlet process G is a discrete distribution with mass p_j at ξ_j, j = 1, 2, ..., on the domain X. Pitman and Yor [7] discussed Poisson-Dirichlet processes in detail, generalizing a number of well-known properties of Dirichlet processes. Carlton [16] gave methods to estimate the parameters (α, θ) from completely observed data.
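For concreteness, the stick-breaking recipe above can be sketched as follows. This is a minimal illustration (the function name and truncation level are ours), assuming a standard normal base distribution as in the simulation section:

```python
import random

def pd_stick_breaking(alpha, theta, n_atoms, seed=0):
    """Truncated stick-breaking draw from PD(alpha, theta):
    V_j ~ Beta(1 - alpha, theta + j*alpha), p_j = V_j * prod_{i<j}(1 - V_i),
    with atoms xi_j drawn i.i.d. from the base distribution H (here N(0, 1))."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for j in range(1, n_atoms + 1):
        v = rng.betavariate(1.0 - alpha, theta + j * alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    atoms = [rng.gauss(0.0, 1.0) for _ in range(n_atoms)]
    return atoms, weights
```

A realization of G then places mass p_j at ξ_j; the truncation error is the leftover mass 1 − Σ_j p_j.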
(c) The probability measure H has a continuous density function, denoted also by h(x), with respect to the p-dimensional Lebesgue measure on R^p.
The intuition behind the assumption is as follows. In many situations, the data are recorded chronologically, so the first subscript of X_ijt reflects the calendar time when the data vector begins to be recorded and the third subscript represents the calendar time at which that component is recorded; hence dependence among data beginning at the same calendar time, together with independence across calendar times, is reasonable. This structure is typical of loss reserving in general insurance with individual data (see, e.g., Huang et al. [17, 18]).
Given the above data structure and model assumption, the objective of this paper is to estimate the unknown parameters α and θ and the density h(x) of the base measure.

Estimation of Parameters (𝛼, 𝜃)
The unknown parameters (α, θ) of the distribution of a Poisson-Dirichlet process can generally be estimated by maximum likelihood, as shown in Section 3.2. To do so under the data structure of Table 1, we first need a way to identify coincident observations. The next lemma shows that, under a Poisson-Dirichlet process structure, the observations of any two individuals are identical with probability 1 if and only if so are their corresponding components. This provides an easy and convenient way to judge whether any two observations are the same from any single (e.g., the first) component.
Proof. We verify the case p = 2; the proof applies for any p > 2. By the definition of the Poisson-Dirichlet process, Corollary 3 of Pitman and Yor [7] implies that Pr(X_11 = X_21) = (1 − α)/(1 + θ), where V ∼ Beta(1 − α, θ + α). On the other hand, because X_111 and X_211 are two observations from a PD structure with parameters α and θ and base density h_1(x_1) = ∫ h(x_1, x_2, ..., x_p) dx_2 ⋯ dx_p, a similar discussion as above implies that Pr(X_111 = X_211) = (1 − α)/(1 + θ). Therefore, the two coincidence events agree almost surely. The proof is thus completed.
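The matching probability (1 − α)/(1 + θ) in the proof can be checked numerically: given a realization of G, two draws coincide with probability Σ_j p_j², whose expectation under PD(α, θ) is (1 − α)/(1 + θ). A Monte Carlo sketch over truncated stick-breaking weights (all names and tuning constants are ours):

```python
import random

def match_probability_mc(alpha, theta, reps=1500, trunc=300, seed=1):
    """Estimate E[sum_j p_j^2] under PD(alpha, theta), i.e. the probability
    that two draws from a realization of the process coincide."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        remaining, s = 1.0, 0.0
        for j in range(1, trunc + 1):
            v = rng.betavariate(1.0 - alpha, theta + j * alpha)
            w = remaining * v
            s += w * w
            remaining -= w
        total += s
    return total / reps
```

For alpha = 0.5 and theta = 1, the estimate comes out close to (1 − 0.5)/(1 + 1) = 0.25.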

Maximum Likelihood Estimation
Write A_i for the number of distinct observations in group i, with observed value a_i. Similar to Carlton [16], who used the analogous count to estimate the unknown parameters by the maximum likelihood method when individuals are completely observed, we here discuss the MLE of (α, θ) in terms of the random variables A_i, i = 1, 2, ..., k. Under Assumption 2, the log-likelihood function ℓ(α, θ) given the observations A_i = a_i in all k groups is given in (8). The MLE is analyzed in three different situations, as in Carlton [16]: (i) α is unknown and θ is known.
(ii) α is known and θ is unknown.
(iii) Both α and θ are unknown.
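In all three cases, the likelihood of the distinct-value counts in a group is, up to the grouping, Pitman's sampling formula. For a single group whose n observations fall into K distinct values with multiplicities (m_1, ..., m_K), a direct transcription (our notation; the full log-likelihood in (8) sums this over the k independent groups) is:

```python
import math

def pd_loglik(alpha, theta, block_sizes):
    """Pitman's sampling formula: log-probability that n = sum(block_sizes)
    observations from PD(alpha, theta) fall into len(block_sizes) distinct
    values with the given multiplicities."""
    K, n = len(block_sizes), sum(block_sizes)
    ll = sum(math.log(theta + i * alpha) for i in range(1, K))
    ll += sum(math.log(t - alpha) for m in block_sizes for t in range(1, m))
    ll -= sum(math.log(theta + t) for t in range(1, n))
    return ll
```

As a sanity check, for n = 2 the two possible partitions have probabilities (1 − α)/(1 + θ) and (θ + α)/(1 + θ), which sum to 1.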

3.2.1. Estimate α with Known θ. In this case, note that the score ℓ_α(α, θ) is a decreasing function of α and that ℓ(α, θ) → −∞ as α → 1.
Theorem 6. The MLE α̂(θ) of α is unique, and it satisfies an equation involving Euler's constant γ.
Proof. Note that ℓ_α(α, θ) is strictly decreasing in α while ℓ(α, θ) → −∞ as α → 1, so the score equation has exactly one root. The proof is thus completed.
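Because the score in α is decreasing, the MLE with θ known can be found by simple bisection. A sketch based on Pitman's sampling formula for a single group (names ours; `block_sizes` holds the multiplicities of the distinct values):

```python
def alpha_mle(theta, block_sizes, lo=1e-6, hi=1.0 - 1e-6, iters=100):
    """Bisection on the alpha-score of Pitman's sampling formula,
    which is strictly decreasing in alpha on (0, 1)."""
    K = len(block_sizes)
    def score(a):
        s = sum(i / (theta + i * a) for i in range(1, K))
        s -= sum(1.0 / (t - a) for m in block_sizes for t in range(1, m))
        return s
    if score(lo) <= 0.0:   # likelihood decreasing everywhere: boundary case
        return lo
    if score(hi) >= 0.0:
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With many singletons and a few large blocks, the score changes sign in (0, 1) and the bisection returns an interior root.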

Estimate θ with Known α
If α = 0, then the Poisson-Dirichlet process reduces to a Dirichlet process, for which the estimation of θ is treated by Yang and Wu [15]. We thus consider only the case α ∈ (0, 1). From the log-likelihood function ℓ(α, θ) given in (8), we obtain the score ℓ_θ(α, θ). A condition under which the MLE is unique is presented in the following theorem. Proof. For given α ∈ (0, 1), the only variable of ℓ_θ(α, θ) is θ; we write g(θ) = ℓ_θ(α, θ). Further, recalling that a_i > 1 for i = 1, 2, ..., k, it follows that g(θ) → ∞ as θ → −α, because the second term is bounded there. With some algebraic transformation of (13), we obtain an expression implying that g(θ) < 0 for sufficiently large θ. Then, by the continuity of g(θ), the Intermediate Value Theorem implies that there exists θ ∈ (−α, ∞) such that g(θ) = 0.
For the uniqueness, it suffices to show that g′(θ) < 0 whenever g(θ) ≥ 0. By (13), we obtain a bound which, by the definition of a_max and a_min, combines with (15) and (20) to give g′(θ) < 0 whenever a_max − 2a_min ≤ 0. In particular, there is a unique solution when k = 1, for then a_max = a_min = a_1.
The solution to ℓ_θ(α, θ) = 0 is the MLE of θ, which we denote by θ̂_mle. When k = 1, by Pitman and Yor [7], the PD(α, θ) distributions are mutually absolutely continuous when α ∈ (0, 1) is fixed and θ varies; hence θ̂_mle is not a consistent estimator in that case. Some simulation results for θ̂_mle with k > 1 are reported in Section 5; theoretically, k can be ∞.
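The existence argument above translates directly into a numerical sketch: the θ-score of Pitman's sampling formula is positive near −α (when there is more than one distinct value) and negative for large θ, so a sign change can be bracketed and bisected (names ours; a sketch, not the paper's exact procedure):

```python
def theta_mle(alpha, block_sizes, iters=100):
    """Bisection on the theta-score of Pitman's sampling formula over
    (-alpha, infinity); assumes more than one distinct value and n > K."""
    K, n = len(block_sizes), sum(block_sizes)
    def score(t):
        return (sum(1.0 / (t + i * alpha) for i in range(1, K))
                - sum(1.0 / (t + m) for m in range(1, n)))
    lo, hi = -alpha + 1e-8, 1.0
    while score(hi) > 0.0 and hi < 1e12:   # expand until the score turns negative
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

When the uniqueness condition of the theorem fails, the bisection still returns a root of the score, but it need not be the global maximizer.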

Estimate Both α and θ
As discussed above, the MLEs of α and θ when both are unknown can be obtained by solving the equations ℓ_α(α, θ) = ℓ_θ(α, θ) = 0. To guarantee that a solution (α̂, θ̂) of ℓ_α(α, θ) = ℓ_θ(α, θ) = 0 is indeed a maximum, we need the Hessian matrix H(α, θ) := (ℓ_αα, ℓ_αθ; ℓ_θα, ℓ_θθ) to be negative definite at (α̂, θ̂). 3.2.4. Remarks. While we have thus proved the existence of the MLEs in the three cases, analytical expressions for the estimates are available in none of them; thus the MLEs can only be obtained by numerical methods, for example, the Newton-Raphson iteration or the bisection method.
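Absent closed forms, any standard optimizer will do; the crude grid-search sketch below (all names and grid ranges are ours) maximizes Pitman's sampling formula jointly in (α, θ) for a single group and could be replaced by the Newton-Raphson iteration mentioned above:

```python
import math

def pd_loglik(alpha, theta, block_sizes):
    """Pitman's sampling formula for one group of observations."""
    K, n = len(block_sizes), sum(block_sizes)
    ll = sum(math.log(theta + i * alpha) for i in range(1, K))
    ll += sum(math.log(t - alpha) for m in block_sizes for t in range(1, m))
    ll -= sum(math.log(theta + t) for t in range(1, n))
    return ll

def joint_mle_grid(block_sizes):
    """Joint maximizer of the log-likelihood over a coarse (alpha, theta) grid."""
    alphas = [i / 100.0 for i in range(1, 100)]
    thetas = [0.1 * i for i in range(1, 101)]
    return max(((a, t) for a in alphas for t in thetas),
               key=lambda p: pd_loglik(p[0], p[1], block_sizes))
```

A grid search sidesteps the Hessian check at the cost of resolution; a Newton-Raphson refinement started from the grid maximizer would combine both.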
In addition, by Carlton [16], a naive estimate of α in terms of merely the observations in group i is α̂_i = log a_i / log n_i, which is consistent as n_i tends to ∞. Now, with the k groups of data, we can estimate α by the weighted average α̃ = Σ_{i=1}^k w_i α̂_i, where w_i = log n_i / Σ_{j=1}^k log n_j. It is then straightforward from Lemma 5.2 in Carlton [16] that both α̂_i and α̃ are strongly consistent for the true value of α as max_{1≤i≤k} n_i → ∞.
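With weights proportional to log n_i, the pooled naive estimate collapses to a ratio of sums, α̃ = Σ_i log a_i / Σ_i log n_i; a one-line sketch (names ours):

```python
import math

def naive_alpha(groups):
    """Pooled Carlton-style naive estimate of the discount parameter.
    `groups` is a list of (a_i, n_i) pairs, where a_i is the number of
    distinct observations among the n_i individuals in group i.
    Returns sum_i log a_i / sum_i log n_i."""
    num = sum(math.log(a) for a, n in groups)
    den = sum(math.log(n) for a, n in groups)
    return num / den
```

For instance, two groups each with 10 distinct values among 100 observations give α̃ = (2 log 10)/(2 log 100) = 0.5.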

The Estimation of Base Distribution
In this section, we estimate the base distribution H of PD(H; α, θ) under the monotone missing data structure. In Section 4.1, we pick out a set of i.i.d. observations from H and then estimate the density of H by the kernel method. With the i.i.d. observations, we can estimate the base density h by a multivariate kernel density method similar to the one used by Yang and Wu [15], who studied multivariate density estimation under a Dirichlet process prior with the same missing mechanism and showed that their estimate is superior to that of Titterington and Mill [22] under the asymptotic mean integrated squared error (AMISE) criterion (the classical versions of [22] for complete data can be found in, e.g., [23, 24]).
Proof. The proof can be found in Theorem 2.5 of Korwar and Hollander [25].
By Theorem 8, there are a_i i.i.d. observations in group i, and hence the number of i.i.d. observations over the whole k groups is a = Σ_{i=1}^k a_i. Note that the observed part of the a i.i.d. observations still has a monotone missing structure. For every t = 1, 2, ..., p, write Z^t_ij = (Z_ij1, Z_ij2, ..., Z_ijt) for the subvector of the first t components of Z_ij from individual (i, j), j = 1, 2, ..., a_i, and let S_t denote the maximal set of the subvectors Z^t_ij, j = 1, 2, ..., a_i, i = 1, 2, ..., k, which are completely observed. Obviously, S_p consists of all the observations with no missing components, and S_1 is the set of the first components of all Z_ij in the k groups. Similarly, further subsets are defined, and obviously S ⊂ S_t for all t.
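Picking out the i.i.d. observations amounts to keeping one representative per distinct individual in each group; by the lemma of Section 3, rows can be matched on their first coordinate alone. A sketch (names ours):

```python
def distinct_rows(group):
    """One representative per distinct individual in a group. Under the PD
    structure, two rows are identical iff their first coordinates are,
    so deduplication keys on the first coordinate only."""
    seen, reps = set(), []
    for row in group:
        if row[0] not in seen:
            seen.add(row[0])
            reps.append(row)
    return reps
```

Applying this to each group and pooling the results yields the a i.i.d. (partially observed) draws from H.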
Mathematical Problems in Engineering

Kernel Density Estimation.
There are two methods to estimate the joint density of the random vector Z_ij, proposed by Yang and Wu [15] and by Titterington and Mill [22], respectively. They are described below.
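Both are variants of multivariate kernel density estimation. As a baseline for comparison, a product-Gaussian kernel estimate from completely observed d-dimensional rows looks as follows (a generic sketch with names ours; the estimators of [15] and [22] additionally exploit the partially observed rows):

```python
import math

def product_gaussian_kde(data, x, bandwidths):
    """Product-Gaussian kernel density estimate at point x, using the
    completely observed d-dimensional rows in `data`."""
    d = len(x)
    norm = math.prod(bandwidths) * (2.0 * math.pi) ** (d / 2.0)
    total = 0.0
    for z in data:
        q = sum(((x[j] - z[j]) / bandwidths[j]) ** 2 for j in range(d))
        total += math.exp(-0.5 * q)
    return total / (len(data) * norm)
```

With a single observation at the evaluation point and unit bandwidth, the estimate equals the kernel's value at zero, 1/√(2π) in one dimension.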

Simulation
In this section, we report a small numerical simulation regarding the performance of the estimates of α and θ given in Section 3. The settings were as follows. (i) k = 10, and, because the estimates depend only on the observations of the A_i's, which can be determined from the first component, we set p = 1 for convenience.
(ii) The standard normal distribution N(0, 1) was taken as the base measure H.
(iii) The sizes of the groups (n_1, n_2, ..., n_10) took three levels of values. The simulation procedure under these settings is briefly described as follows.
Step 1. Generate the k groups of observations from the Poisson-Dirichlet process prior under the settings above.
Step 2. Compute the values of a_i and n_i from the data generated in Step 1.
Step 3. Compute the estimates of α and θ with the Newton-Raphson method.
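The data-generation step can be sketched via the two-parameter Chinese restaurant process, which produces the block sizes of a PD(α, θ) sample directly, so that the counts needed by the estimators are read off immediately. A minimal sketch (names ours):

```python
import random

def crp_block_sizes(alpha, theta, n, seed=0):
    """Block sizes of a PD(alpha, theta) partition of n observations via the
    two-parameter Chinese restaurant process: customer i+1 starts a new block
    with probability (theta + K*alpha)/(i + theta), K the current number of
    blocks, and joins an existing block of size s with probability
    (s - alpha)/(i + theta)."""
    rng = random.Random(seed)
    sizes = []
    for i in range(n):
        u = rng.random() * (i + theta)
        if not sizes or u < theta + alpha * len(sizes):
            sizes.append(1)
            continue
        u -= theta + alpha * len(sizes)
        for j, s in enumerate(sizes):
            u -= s - alpha
            if u < 0.0:
                sizes[j] += 1
                break
        else:                      # numerical slack: fall back to the last block
            sizes[-1] += 1
    return sizes
```

The number of distinct values in the group is len(sizes); with these counts in hand, the naive and maximum likelihood estimates of Section 3 can be evaluated.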
The simulation first compared the estimates α̂_mle and α̃ of α for known θ, in which we set θ = 10 and α took three values: 0.1, 0.3, and 0.5; the case α > 0.5 can be obtained simply by replacing α with 1 − α. For each combination of the parameters, the simulation ran 1000 replications. The performance of the two estimates is shown in Table 2, which shows clearly that the simulated biases of α̂_mle are all smaller than those of α̃. Although the simulated standard deviations of α̃ are stable and smaller at the three sample-size levels under the three values of α, the MSEs of α̂_mle are smaller than those of α̃; that is, α̂_mle performs better than α̃ under the MSE criterion. Next, the simulation studied the performance of the estimates of θ with α = 0.5 at four true values: θ = 0.1, 1, 5, and 10. The simulation again ran 1000 replications for each combination of the parameters. The results are summarized in Table 3, from which we can see that the estimate θ̂_mle generally performed better as the sample size increased.
Finally, we considered the MLEs of α and θ when neither parameter is known. The true values of the parameters and the performance of the estimates over 1000 replications are shown in Table 4. The simulation used the Newton-Raphson iteration method.
Favaro et al. [10] adopted an empirical Bayes procedure for estimating (α, θ) when the data are not divided into groups; we write (α̂, θ̂) for the resulting estimates. They investigated prediction within species sampling problems and applications to genomics based on these estimates when both parameters are unknown. Favaro et al. [10] also studied the asymptotic behavior of the number of new species conditional on the observed sample and derived asymptotic highest posterior density intervals for the estimates of interest. Comparisons of simulation results between their estimates and the MLEs under the multigroup data structure are presented in Table 4 for the same total number of individuals, in terms of bias, SD, and MSE. From Table 4, we can see that the MSE of α̂ is very small and the MSE of θ̂ is somewhat larger. The simulation shows that the MLEs are generally better than the single-group estimates of Favaro et al. [10] under the criteria of bias, SD, and MSE.

Conclusion
In this paper, we studied the estimation of the unknown parameters as well as the density of the base measure in Poisson-Dirichlet process priors under monotonically missing data. The parameters are estimated by maximum likelihood, and a set of simulations shows that the estimates perform well.
Titterington and Mill [22] used ∫ K(⋅ | (x^o, x^m), h) f̂(x^m | x^o) dx^m instead of the usual kernel K(⋅ | x^o) when estimating the density of X^o, where X^o and X^m are, respectively, the observed part and the missing part of X, and f̂(x^m | x^o) is a probability density function estimating the true f(x^m | x^o).

Table 2: Performance of the estimates of α with θ = 10.

Table 3: Performance of the estimate of θ with α = 0.5.