MCMC Sampling Estimation of Poisson-Dirichlet Process Mixture Models

In this article, we aim to estimate the parameters of the Poisson-Dirichlet mixture model with a multigroup data structure by empirical Bayes. The number of mixture components under Bayesian nonparametric process priors is not fixed in advance and can grow as data accumulate. Empirical Bayes is a useful method for estimating the mixture components when no prior information on them is available. We give a procedure to construct smooth estimates of the base distribution G_0 and estimates of the two parameters (α, θ). Numerical simulation shows that the estimates obtained from multigroup data outperform those from single-group data with the same total number of individuals in terms of bias, standard deviation, and mean squared error. We also apply Poisson-Dirichlet mixture models to well-known real datasets.


Introduction
Mixture models are typically used to model data in which each observation belongs to one of a set of distributions, finite or infinite, with certain probabilities. In many situations, some or all of the mixing distributions are unknown. The accuracy of prior information plays an important role in statistical inference, and deviations in the prior inevitably result in biased inference, which may cause serious problems in subsequent decisions. A possible solution to inference on the parameters of the prior is the empirical Bayes (EB) method, at the price of using more data sources than pure Bayesian inference. EB is a modern statistical methodology that blends frequentist inference into Bayesian ideas, with the parameters in the prior distributions (or the Bayes decisions) estimated from the observed data. EB has been widely applied in many fields over its development of more than half a century. Relevant contributions include, among others, Seidel [1] on estimating partial prior information by EB, George and Foster [2] on EB variable selection in regression analysis, Clyde and George [3] on EB estimation for wavelets, Yang and Wu [4] on EB estimates of the precision parameter and baseline distribution with monotone missing data subject to Dirichlet process priors, Bi and Davuluri [5] on a novel nonparametric empirical Bayesian approach to modeling RNA-seq data, and Zhou et al. [6] on careful comparisons among several empirical Bayes estimates of the precision parameter of the Dirichlet process prior. For EB inference in nonparametric problems where priors are modeled by mixture models, see, for example, the works of Liu [7] and McAuliffe et al. [8].
A Dirichlet process (DP) [9] is a distribution over distributions and, thanks to its characteristics, can act as a prior distribution in nonparametric problems with possibly tied observations. Therefore, a DP mixture model allows common distributions to be shared by subsets of observations and thus generalizes finite mixture models (FMM). In an FMM, the number of component distributions is supposed to be known and fixed, and every individual is sampled from one of those distributions with some probability, whereas, in DP mixture models, the number of clusters need not be given in advance and is inferred from the data. The number of clusters in a DP mixture model is random and grows logarithmically in the sample size. Research efforts on the DP mixture model in Bayesian nonparametrics are extensive; see, for example, the works of Ferguson [9], MacEachern and Müller [10], Neal [11], Ishwaran and James [12], Quintana et al. [13], and so on.
Proposed by Pitman and Yor [14], the Poisson-Dirichlet process (PDP), also known as the Pitman-Yor process, can be seen as an extension of the DP. In a PDP model, a new observation either equals one of the previous values in some cluster, with a probability that is an increasing function of the cluster size, or is a new draw from the base distribution.
As an effective and widely applied density estimation method, mixture modeling covers a great deal of phenomena in real-world practice. But it is usually hard to determine the number of mixture components. There are many methods that treat the number of components as an unknown constant determined from the data beforehand, for example, model selection methods. However, that assumption can be too rigid, since a new individual may come either from a previous cluster or from a new cluster, as mentioned above. There is a large literature on the DP mixture model, such as the works of Berry and Christensen [15], McAuliffe et al. [8], Mostofi and Kharrati-Kopaei [16], and Hu et al. [17], but few works address the PDP mixture model and the estimation of its parameters.
In this article, we are interested in the estimation of the base distribution under multigroup data, where the density of the base distribution is continuous. Unlike the DP mixture model, the PDP mixture model has two parameters in addition to the base distribution. One difficulty is the choice of initial values of (α, θ) and their impact on the estimation of the base distribution, where we suppose throughout that α ∈ [0, 1) and θ > −α. Another is that the base distribution, an infinite-dimensional parameter, cannot be directly observed: the individuals of every group follow a random probability distribution built on the base. We handle this by nonparametric empirical Bayes; for the estimation of α and θ, we refer to, among others, the works of Carlton [18] and Favaro et al. [19]. The estimates of α and θ under multigroup data outperform those under single-group data with the same total number of individuals in terms of bias, standard deviation, and mean squared error. The remainder of this paper is structured as follows. In Section 2, we review PDP mixture models. In Section 3, under the framework of multigroup data, the procedure for estimating the base density is described and the algorithm with the nonconjugate Gibbs sampler is given. In Section 4, simulations with different base distributions and pairs of (α, θ) are shown, and we compare the results between single-group and multigroup data. In Section 5, the PDP mixture model is applied to well-known real-world datasets.

Poisson-Dirichlet Process Mixture Model
This section formulates the Poisson-Dirichlet process mixture models we are going to discuss. The fundamental setting is a set of independent random couples (X_1, η_1), (X_2, η_2), …, (X_n, η_n) such that, for every i = 1, 2, …, n, X_i | η_i ∼ f(x | η_i) for a certain known density function f, and η_1, η_2, …, η_n are identically distributed as an unknown probability measure G, as discussed, for example, in the works of Neal [11] and McAuliffe et al. [8].
There are quite a few real-world applications that fit this framework. Examples include general insurance practice, where the observations X_i are the claims of an insured at n periods, identically distributed as f(x | η_i), with η_i reflecting the overall economic, political, and other conditions of period i, such as the presence of extreme weather, earthquakes, war, or epidemics. In contrast to the most extensively discussed model, under which X_1, X_2, …, X_n | η ∼ iid f(x | η) with η ∼ G, the parameters η_i evolve from one period to another [20]. In many real scenarios, the values of the evolving η_i, i = 1, 2, …, n, exhibit clustering, and the number and profiles of the clusters are unknown, so it is appropriate to model η_i, i = 1, 2, …, n, with stick-breaking distributions.
Let (ξ_n)_{n≥1} be a sequence of i.i.d. random variables following a common probability measure G_0, taking values in a measurable space (X, B), say, and independent of the stick-breaking weights (V_n)_{n≥1}, defined by V_n = W_n ∏_{l=1}^{n−1} (1 − W_l) with independent W_l ∼ Beta(1 − α, θ + lα). Let δ(ξ, ·) stand for the Dirac measure degenerate at the point ξ.
Then the stick-breaking distribution

G(·) = Σ_{n=1}^∞ V_n δ(ξ_n, ·)

on (X, B) is referred to as a Poisson-Dirichlet process (PDP) with parameters α, θ and base distribution G_0, written G ∼ PD(G_0; α, θ). Note that the parameters α and θ determine the power-law behavior of a PDP, and the case α = 0 corresponds to a Dirichlet process.
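As a concrete numerical illustration, the stick-breaking construction can be simulated by truncation. The sketch below is not part of the paper's procedure; the standard normal base distribution, the truncation level, and the function names are illustrative assumptions.

```python
import numpy as np

def sample_pdp_stick_breaking(alpha, theta, base_sampler, n_atoms=1000, rng=None):
    """Draw a truncated realization of G ~ PD(G0; alpha, theta).

    Stick fractions follow W_j ~ Beta(1 - alpha, theta + j*alpha) and
    weights are V_j = W_j * prod_{l<j} (1 - W_l); alpha = 0 recovers
    a Dirichlet process DP(theta, G0).
    """
    rng = np.random.default_rng(rng)
    j = np.arange(1, n_atoms + 1)
    w = rng.beta(1.0 - alpha, theta + j * alpha)               # stick fractions W_j
    v = w * np.concatenate(([1.0], np.cumprod(1.0 - w[:-1])))  # weights V_j
    atoms = base_sampler(n_atoms, rng)                         # xi_j ~ G0, i.i.d.
    return atoms, v

# illustrative call with a standard normal base distribution
atoms, weights = sample_pdp_stick_breaking(
    alpha=0.3, theta=1.0,
    base_sampler=lambda n, rng: rng.standard_normal(n), rng=0)
```

With 1000 atoms the leftover stick mass is negligible, so the truncated weights sum to nearly one.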
Let η = (η_1, η_2, …, η_n) be a sequence of i.i.d. observations drawn from G, so as to reflect the clustering of distributions. Write K for the number of clusters of η, η*_1, η*_2, …, η*_K for the distinct observations among η_1, η_2, …, η_n, and m_k for the number of elements in cluster k. Then, by Pitman [21], given η_1, η_2, …, η_n, the predictive distribution of a new η_{n+1} is

P(η_{n+1} ∈ · | η_1, …, η_n) = Σ_{k=1}^K ((m_k − α)/(n + θ)) δ(η*_k, ·) + ((Kα + θ)/(n + θ)) G_0(·), (1)

that is, η_{n+1} takes the value η*_k with probability (m_k − α)/(n + θ) and is a new draw from G_0 with probability (Kα + θ)/(n + θ). This is the generalized Pólya urn scheme producing η_i, i = 1, 2, …. A PDP mixture model is defined by

X_i | η_i ∼ f(x | η_i), η_i | G ∼ iid G, G ∼ PD(G_0; α, θ), i = 1, 2, …, n, (2)

where f(· | η) is a known density function characterized by the parameter η, η = (η_1, η_2, …, η_n) consists of the latent parameters, and G_0 is absolutely continuous with respect to a certain σ-finite measure with an entirely unknown density function g_0, as depicted in Figure 1. Thanks to the clustering of the elements of η drawn from PD processes, under this model the observations X_1, X_2, …, X_n are randomly divided into a random number of clusters so that all elements of every cluster share a common parameter among η*_1, η*_2, …, η*_K, and a new observation X_{n+1} either is generated from a distribution with a new parameter, meaning that X_{n+1} starts a new cluster, or falls into one of the existing clusters with parameter among η*_1, η*_2, …, η*_K. Note that α = 0 corresponds to the special case of DP mixture models, on which some recent efforts can be found in, for example, the work of Neal [11], who discussed methods of sampling from the posterior distribution of a DP mixture model, and the work of McAuliffe et al. [8], who studied the estimation of the base distribution G_0 in a DP mixture model with a single group of data X_1, X_2, …, X_n.
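The predictive rule of the generalized Pólya urn translates directly into a forward sampler. The following sketch is for illustration only; the parameter values and function names are assumptions, not the paper's.

```python
import numpy as np

def polya_urn_pdp(n, alpha, theta, base_sampler, rng=None):
    """Generate eta_1, ..., eta_n via the generalized Polya urn of PD(G0; alpha, theta).

    Given i previous draws, the next draw joins existing cluster k with
    probability (m_k - alpha)/(i + theta) or starts a new cluster with a
    fresh draw from G0 with probability (K*alpha + theta)/(i + theta).
    """
    rng = np.random.default_rng(rng)
    values, counts = [], []        # distinct values eta*_k and cluster sizes m_k
    draws = []
    for i in range(n):             # i previous draws so far
        K = len(values)
        probs = np.array([m - alpha for m in counts] + [K * alpha + theta])
        probs /= i + theta         # probabilities sum to exactly 1
        k = rng.choice(K + 1, p=probs)
        if k == K:                 # open a new cluster from the base distribution
            values.append(base_sampler(rng))
            counts.append(0)
        counts[k] += 1
        draws.append(values[k])
    return np.array(draws), np.array(counts)
```

Larger α and θ both increase the chance of opening new clusters, which is the power-law behavior the text refers to.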
In this paper, however, we consider the estimation of the parameters of G when the data are organized in independent and identically distributed groups. To be specific, we consider the following PDP mixture model: the data are divided into I groups, and the parameters η_ij, j = 1, 2, …, n_i, of each group form an i.i.d. sample from G_i. A graphical representation of the model is shown in Figure 2, and its hierarchical structure in terms of conditional distributions is expressed by

X_ij | η_ij ∼ f(x | η_ij), η_ij | G_i ∼ iid G_i, G_i ∼ iid PD(G_0; α, θ), j = 1, 2, …, n_i, i = 1, 2, …, I. (3)

Under this setting, the individuals from different groups are mutually independent, and the random probabilities G_i, i = 1, 2, …, I, are independent and identically distributed as PD(G_0; α, θ) with unknown α, θ, and G_0 to be estimated.
On the one hand, the structure defined above arises quite frequently in real-world applications; in the context of general insurance, it is the typical data structure extensively investigated by so-called credibility theory [20]. On the other hand, it is commonly recognized that, for the majority of EB problems, it is impossible to estimate the unknown elements of the prior distribution efficiently from only a single group of conditionally i.i.d. observations given the random parameters following the prior. The only exception arises when the prior can be characterized by a so-called stick-breaking distribution such as the Dirichlet process (see [4, 22]). Even for stick-breaking priors, existing research has exhibited the possibly low efficiency of estimating the parameters of the prior distribution with a single group of data; see, for example, the works of Zehnwirth [22], Yang and Wu [4], and Zhang et al. [23].
Write K_i for the number of clusters in group i, η*_{i1}, η*_{i2}, …, η*_{iK_i} for those distinct values, and m_ij, j = 1, 2, …, K_i, for the number of elements in cluster j of group i. Then, similar to Theorem 2.5 of Korwar and Hollander [24], given K_i, the distinct values are i.i.d. draws from G_0 and are also independent of m_{i1}, m_{i2}, …, m_{iK_i}, so that we can retrieve information on the base distribution G_0 from the distinct values η*_{ik}, k = 1, …, K_i, i = 1, …, I, and on the parameters θ and α from the cluster numbers and sizes (K_i, m_{i1}, …, m_{iK_i}), i = 1, …, I.

Figure 1: PDP mixture model with single-group data.

Mathematical Problems in Engineering
Given the cluster structure, the observations X_ijk, j = 1, 2, …, m_ik, share the common random parameter η*_ik of cluster k in group i. Then, by marginalization, the likelihood under framework (3) with data X = {X_ij, j = 1, 2, …, n_i; i = 1, 2, …, I} can be expressed as

L(g_0, α, θ) = ∏_{i=1}^I E_{η_i}[∏_{j=1}^{n_i} f(X_ij | η_ij)] = ∏_{i=1}^I E_{K_i, m_i}[∏_{k=1}^{K_i} (∫ ∏_{j=1}^{m_ik} f(X_ijk | u) g_0(u) du)], (4)

where the expectation E_{η_i} is taken with respect to the marginal distribution of η_i with the X_ij held fixed, E_{K_i, m_i} is understood in the same way, and g_0 is the unknown density of G_0. The likelihood function (4) of (g_0, α, θ) is a product of I summations, every summand of which is, in turn, a product of the form inside the parentheses in (4). It is therefore too intricate to be maximized, even numerically, and indirect alternatives to the maximum likelihood estimates of (g_0, α, θ) need to be developed; the following is constructed with the help of ideas from empirical Bayes.

Estimation of Poisson-Dirichlet Mixture Models
This section concentrates on developing reasonable estimates of the parameters g_0, α, and θ and elucidates in detail the algorithms deriving those estimates. We proceed in two subsections, one for the idea and the other for the algorithms realizing it.

Constructing Estimates
If the latent variables η_ij, j = 1, 2, …, n_i, were observable for every group i = 1, 2, …, I, a natural choice would be to estimate g_0 by some maturely developed density estimator. We use the kernel method, which yields

ĝ_0(u; η) = (1/(h Σ_{i=1}^I K_i)) Σ_{i=1}^I Σ_{k=1}^{K_i} ω((u − η*_ik)/h), (5)

defined by a certain kernel function ω(·) together with a bandwidth h > 0. For α and θ, similar to Carlton [18], who studied the estimation of PDP priors, and Favaro et al. [19], who established the PDP model to predict species in the species-sampling problem, we can use the maximum likelihood estimate derived from the distribution of (K_i, m_{i1}, …, m_{iK_i}), i = 1, 2, …, I; that is,

(α̂(η), θ̂(η)) = argmax_{α,θ} Σ_{i=1}^I log p(K_i, m_{i1}, …, m_{iK_i}; α, θ), (6)

where

p(K_i, m_{i1}, …, m_{iK_i}; α, θ) = (∏_{l=1}^{K_i−1} (θ + lα) / ∏_{t=1}^{n_i−1} (θ + t)) ∏_{j≥2} (∏_{s=1}^{j−1} (s − α))^{a_ij}, (7)

and a_ij is the cardinality of {m_ik : k = 1, 2, …, K_i} such that m_ik = j, that is, the number of clusters of group i with exactly j items. Now that the latent variables {η_ij, j = 1, 2, …, n_i, i = 1, 2, …, I} are unobservable, we construct unbiased pseudoestimates

(g̃_0(·; g_0, α, θ), α̃(g_0, α, θ), θ̃(g_0, α, θ)) = E[(ĝ_0(·; η), α̂(η), θ̂(η)) | X],

where the arguments (g_0, α, θ) indicate that the "estimates" nonetheless depend on the unknown parameters g_0, α, and θ. Note that these pseudoestimates can be computed by means of the posterior distribution of η given the observations X. By these pseudoestimates, an algorithm can be motivated which, starting from appropriately selected initial values of g_0, α, and θ, iteratively computes, for r = 0, 1, …,

(g_0^(r+1), α^(r+1), θ^(r+1)) = E_{(g_0^(r), α^(r), θ^(r))}[(ĝ_0(·; η), α̂(η), θ̂(η)) | X], (8)
indicating that the posterior expectations of (ĝ_0(·; η), α̂(η), θ̂(η)) given X are computed with the unknown parameters (g_0, α, θ) replaced by (g_0^(r), α^(r), θ^(r)). Consequently, g_0, α, and θ are estimated by the limit

(ĝ_0, α̂, θ̂) = lim_{r→∞} (g_0^(r), α^(r), θ^(r)).

In every iteration computing (g_0^(r+1), α^(r+1), θ^(r+1)), numerical solutions are necessary because it is not easy to work out closed forms of the posterior expectations in (8). A natural way is to employ an MCMC technique, described in detail later on.
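To make the two ingredients concrete, the sketch below implements a Gaussian-kernel density estimate of g_0 from the distinct cluster values and a crude grid-search maximization of the log-likelihood built from the Pitman-Yor cluster-size distribution (the exchangeable partition probability function). The bandwidth, grid ranges, and function names are illustrative assumptions, not the paper's; a numerical optimizer could replace the grid.

```python
import numpy as np

def kernel_density(u, distinct_values, h):
    """Gaussian-kernel estimate of g0 at points u, built from the distinct
    cluster values eta*_{ik} pooled over groups, with bandwidth h."""
    u = np.atleast_1d(np.asarray(u, dtype=float))[:, None]
    z = (u - np.asarray(distinct_values, dtype=float)[None, :]) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def pdp_loglik(alpha, theta, cluster_sizes_by_group):
    """Log-likelihood of (alpha, theta) given the cluster sizes
    (m_i1, ..., m_iK_i) of each group, via the Pitman-Yor EPPF.
    The per-cluster product over (s - alpha) is equivalent to the
    a_ij (cluster-size count) formulation."""
    ll = 0.0
    for sizes in cluster_sizes_by_group:
        n, k = sum(sizes), len(sizes)
        ll += sum(np.log(theta + l * alpha) for l in range(1, k))
        ll -= sum(np.log(theta + t) for t in range(1, n))
        ll += sum(np.log(s - alpha) for m in sizes for s in range(1, m))
    return ll

def fit_alpha_theta(cluster_sizes_by_group, alphas=None, thetas=None):
    """Crude grid-search MLE over alpha in [0, 1) and theta > 0."""
    alphas = np.linspace(0.0, 0.95, 20) if alphas is None else alphas
    thetas = np.linspace(0.05, 20.0, 100) if thetas is None else thetas
    grid = [(pdp_loglik(a, t, cluster_sizes_by_group), a, t)
            for a in alphas for t in thetas]
    _, a_hat, t_hat = max(grid)
    return a_hat, t_hat
```

As a sanity check, with α = 0 and θ = 1 the EPPF of the partition {2, 1} of n = 3 reduces to the Dirichlet process value 1/6.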
To sum up, the estimates are approximated by the following workable procedure.

Algorithms.
This subsection details Algorithm 1: the selection of initial values, the steps to perform the MCMC sampling, and the computation of the proposed estimates.

Initial Values.
To fix an initial estimate g_0^(0), first compute the maximum likelihood estimates η̂_ij by f(X_ij | η̂_ij) = max_u f(X_ij | u) and then compute a kernel estimate of g_0, similar to the formula in (5), by

g_0^(0)(u) = (1/(h Σ_{i=1}^I n_i)) Σ_{i=1}^I Σ_{j=1}^{n_i} ω((u − η̂_ij)/h).

In many important cases, f is a location translation of some standard density with mode zero (e.g., a standard normal distribution), so that η̂_ij = X_ij. In these cases, the initial values of g_0 are given by

g_0^(0)(u) = (1/(h Σ_{i=1}^I n_i)) Σ_{i=1}^I Σ_{j=1}^{n_i} ω((u − X_ij)/h).

For the parameters α and θ, the initial value of α could be 0.1 or 0.5, and θ could be 1 or 5 in case there is no further information on them. The simulations, reported later on, showed that the initial values of (α, θ) had only quite minor effects on the convergence of the algorithm.

The Gibbs Sampling.
We next state the full conditional distributions used in the MCMC procedure, that is, the conditional distribution of a single latent variable η_ij given the data X_i = (X_i1, X_i2, …, X_in_i), i = 1, 2, …, I, and the remaining latent variables (η_i ∖ η_ij) ∪ (∪_{k≠i} η_k). Denote by η_i^{−j} the vector η_i with η_ij deleted, by η*_{i1}, …, η*_{iK_i^{−j}} the distinct values in η_i^{−j}, and by m_il = #{r : r ∈ {1, 2, …, n_i} ∖ {j} and η_ir = η*_il} the number of elements of η_i^{−j} taking the value η*_il. Thanks to the mutual independence among the random vectors (G_i, (X_i1, η_i1), …, (X_in_i, η_in_i)), i = 1, 2, …, I, it suffices to discuss the single-group situation, as stated in Theorem 1.

Theorem 1.
For each j = 1, 2, …, n_i, given η_i^{−j} and X_i, the conditional distribution of η_ij is

P(η_ij ∈ · | η_i^{−j}, X_i) = b[Σ_{l=1}^{K_i^{−j}} (m_il − α) f(X_ij | η*_il) δ(η*_il, ·) + (K_i^{−j} α + θ) (∫ f(X_ij | u) g_0(u) du) G_η(· | X_ij)],

where G_η(· | X_ij) is the posterior distribution of η_ij derived from the single observation X_ij, with density

g_η(u | X_ij) = f(X_ij | u) g_0(u) / ∫ f(X_ij | v) g_0(v) dv,

and the normalization quantity b is determined by the equality

b^{−1} = Σ_{l=1}^{K_i^{−j}} (m_il − α) f(X_ij | η*_il) + (K_i^{−j} α + θ) ∫ f(X_ij | u) g_0(u) du.
Proof. By the generalized Pólya urn scheme (1), the prior conditional distribution of η_ij given η_i^{−j} under PD(G_0; α, θ) is

P(η_ij ∈ · | η_i^{−j}) = Σ_{l=1}^{K_i^{−j}} ((m_il − α)/(n_i − 1 + θ)) δ(η*_il, ·) + ((K_i^{−j} α + θ)/(n_i − 1 + θ)) G_0(·).

Therefore, the conditional distribution of η_ij given (η_i^{−j}, X_i) can be computed by Bayes' theorem. Set

q_l = b (m_il − α) f(X_ij | η*_il), l = 1, …, K_i^{−j}, q_0 = b (K_i^{−j} α + θ) ∫ f(X_ij | u) g_0(u) du, (12)

and rewrite the equation above accordingly: the conditional distribution of η_ij takes the value η*_il with probability q_l and is a fresh draw from G_η(· | X_ij) with probability q_0. Obviously, q_0 and the q_l's satisfy q_0 + Σ_{l=1}^{K_i^{−j}} q_l = 1.

Starting from (g_0^(0), α^(0), θ^(0)), we now describe the iteration computing (g_0^(r+1), α^(r+1), θ^(r+1)) from (g_0^(r), α^(r), θ^(r)) for all r = 0, 1, 2, …, which consists of two alternately running phases: one samples from the posterior distribution of the latent variables η with the unknown parameters replaced by the estimates (g_0^(r), α^(r), θ^(r)), and the other estimates the parameters using the sampled latent variables η, as depicted in what follows. (1) Sampling. The sampling phase is a two-step procedure applied to every group i = 1, 2, …, I separately so as to produce the posterior samples η_i, starting from an arbitrarily selected η_i^(0). The first step is the standard Gibbs sampling derived from the full conditionals presented in Theorem 1, and the second performs an acceleration.
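For illustration, one Gibbs update of a single η_ij following the full conditional in Theorem 1 can be sketched as below. The component density, the marginal approximation, and the posterior sampler are supplied by the caller; all names are illustrative assumptions.

```python
import numpy as np

def update_eta_ij(x_ij, eta_minus, alpha, theta, f, marginal, post_sampler, rng):
    """One Gibbs update of a latent eta_ij given the rest of its group.

    eta_minus : values of eta_i with position j removed
    f(x, u)   : known component density f(x | u)
    marginal  : approximation of the integral of f(x_ij | u) g0(u) du
    post_sampler(rng) : draws from the single-observation posterior G_eta(.|x_ij)
    """
    stars, m = np.unique(eta_minus, return_counts=True)   # eta*_il and m_il
    q = (m - alpha) * np.array([f(x_ij, s) for s in stars])
    q0 = (len(stars) * alpha + theta) * marginal
    w = np.append(q, q0)
    w /= w.sum()                        # normalization constant b
    k = rng.choice(len(w), p=w)
    # last index means "open a new cluster": draw from the posterior G_eta
    return post_sampler(rng) if k == len(stars) else stars[k]
```

The update either reuses an existing cluster value, with weight proportional to (m_il − α) f(X_ij | η*_il), or opens a new cluster with weight proportional to (K α + θ) times the marginal.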
Algorithm 1: Estimation procedure.

In this two-step procedure, the first step updates the values of c_j, and in passing the values of η_ij, respecting the full conditionals in Theorem 1. When Σ_{l=1}^{K_i^{−j}} q_l is much larger than q_0, the Markov chain directly derived from Theorem 1 may converge slowly because the chance of introducing new values for η_ij is quite small; see equation (12). Hence, an acceleration method (e.g., [11]) may be helpful, which samples another round of η_i for acceleration.
This is realized by sampling η_i conditional on X_i and the cluster indicators c just obtained in Step 1 and then sampling c given (X_i, η_i). In other words, Step 2 is driven by the grouped conditional distribution of η_i given (X_i, c) and of c given (X_i, η_i). Because c is a deterministic function of η_i, Step 2 only needs to draw values from the conditional distribution of η_i given (X_i, c).
Moreover, for every j = 1, …, n_i, the integral ∫ f(X_ij | η_ij) g_0^(r)(η_ij) dη_ij involved in computing G_η(· | X_ij) and q_l, l = 0, 1, …, K_i^{−j}, is approximated by the average

h_rj = (1/d) Σ_{s=1}^d f(X_ij | u_s), (13)

where d is a positive integer and u_1, …, u_d are i.i.d. draws from g_0^(r). Since these draws depend only on the iteration indicator r, this approximation can be prepared at the beginning of each iteration.
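The averaging described above can be sketched as follows, assuming g_0^(r) can be sampled; the names are illustrative. For a standard normal base and a normal component density, the approximation can be checked against the exact N(0, 2) marginal.

```python
import numpy as np

def approx_marginal(x_obs, f, g0_sampler, d=10000, rng=None):
    """Monte Carlo approximation of the marginal integral of
    f(x_j | u) g0(u) du by the average (1/d) * sum_s f(x_j | u_s),
    with one shared set u_1, ..., u_d ~ g0 reused for every x_j."""
    rng = np.random.default_rng(rng)
    u = g0_sampler(d, rng)                   # drawn once per iteration
    return np.array([f(xj, u).mean() for xj in np.atleast_1d(x_obs)])

# illustrative check: N(u, 1) component with a standard normal base,
# whose exact marginal is the N(0, 2) density
f = lambda x, u: np.exp(-0.5 * (x - u) ** 2) / np.sqrt(2.0 * np.pi)
h = approx_marginal([0.0, 1.0], f, lambda d, rng: rng.standard_normal(d), rng=0)
```

Reusing one set of draws for all j within an iteration is exactly what makes the preparation-at-the-start-of-iteration trick possible.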
As a final point, we note that, due to the presence of the acceleration Step 2, the update of the values of η_ij in Step 1 essentially takes effect only through the computation of the probabilities q_l in equation (12). The two-step procedure is repeated for b = 1, 2, …, B, for a sufficiently large B.
With these newly obtained estimates, the two phases are repeated in the next iteration until the sequence of estimates is stable. The above computation process for the estimation of α, θ, and g_0 under the framework in (3) is briefly arranged in Algorithm 3.
We conclude this section with a remark on one of Neal's [11] algorithms. Remark 1. Algorithm 2 is similar, with two minor differences, to Algorithm 8 of Neal [11] at Step 1. First, we draw only a single set of d i.i.d. values from g_0^(r) in each iteration, while Neal [11] suggested drawing d i.i.d. values for each j = 1, …, n_i, hence needing n_i − 1 times more sampling computation than here. The algorithm here may therefore significantly reduce the total computational load. Second, we draw a value for η_{ic_j} from G_η(· | X_ij) if a new cluster for c_j is produced, while Neal [11] simply drew a value from g_0^(r). According to Theorem 1, we think Algorithm 2 draws a correct value.

Simulation
In this section, some simulations are reported to show the performance of the estimates provided above. The data structure of the PDP mixture model follows equation (3), and the simulations were conducted under two settings.
Of the three base probabilities we chose, the first two are symmetric, t_3 has heavy tails, and the last has positive skew. We tried a few distinct initial values of α and θ in all cases; the simulations showed that the estimates were insensitive to the initial values. For every combination of the values α, θ, I, and g_0, we produced 1000 replicates of the datasets under the setting described above. The simulation results are concisely discussed below.
(1) The estimates of the base density g_0. In the simulation, g_0 was estimated more accurately under the setting I = 100 than under the setting I = 1. The results from some replicates out of the 1000 are presented in Figures 3-5 for some parameters α, θ, and g_0, at iterations r = 0, 5, and 200 under I = 1 and I = 100; the dashed lines represent the curves of the estimates and the solid lines the true density curves. The figures show two clear features: the dashed lines get closer to the solid curves as the number of iterations increases in both data structures, and the estimated curves under multigroup data apparently perform better than those under single-group data. The center of the dashed curve under single-group data deviates a little at round 200 in Figure 3, the estimated curve is more concentrated than the true heavy-tailed density in Figure 4, and the curves of ĝ_0 are more fluctuating and skewed in Figure 5 under single-group data. Overall, in both the single-group and multigroup environments, our estimates stay near the true density curves. The histograms of the L_1 distance between ĝ_0 and the true g_0 over the 1000 replicates are shown in Figure 6 for single and multiple groups. Figure 7 delineates the total variation distance (TVD) between Ĝ_0 and G_0; the TVD of two probability distributions P and Q is sup_A |P(A) − Q(A)|, with the supremum taken over measurable sets A. Each panel plots the TVD between Ĝ_0 and G_0 as the rounds progress. As can be seen, the TVD gradually reaches a steady value after about 15 iterations for the bases G_0 of N(0, 1) and t_3, while the other needs about 25 iterations.
(2) The estimates of (α, θ) over the 1000 replicates are summarized in Table 1 in terms of mean and standard deviation, and their histograms are presented in Figure 8. The comparisons list means, standard deviations, and 95% credible intervals; the estimates under the multigroup structure are better than those under the single-group structure for every base and all three parameter combinations. In terms of the standard deviation criterion, the estimates of (α, θ) are more stable when I = 100.

Algorithm 2: Gibbs sampling under a single group.
(1) Draw d i.i.d. values u_1, u_2, …, u_d from g_0^(r) and compute h_rj, j = 1, …, n_i, by (13).
(2) Initiate the algorithm with an arbitrarily selected η_i^(0). Correspondingly, set c = c(η_i^(0)). Let η* = (η*_{i1}, η*_{i2}, …, η*_{iK_i}, 0, …, 0) be the vector of dimension n_i whose first K_i elements store the distinct values of η_i^(0), and let m = (m_{i1}, m_{i2}, …, m_{iK_i}, 0, …, 0) be an n_i-vector recording the numbers of occurrences of the η*_ij in η_i^(0).
(3) Given η_i^(b−1) and the corresponding auxiliary vectors c, m, and η*, generate the sampled values η_i^(b) by the following two-step procedure:
Step 1. For j = 1, 2, …, n_i, remove η_ij from its current cluster, draw a new value for c_j from {1, 2, …, n_i} according to the distribution law (12), and, in case m_{ic_j} = 0, draw a value for η_{ic_j} from G_η(· | X_ij) with g_0 replaced by g_0^(r), where B is the normalization constant subject to Σ_{c=0}^{n_i} q_c = 1; then set m_{ic_j} = m_{ic_j} + 1.
Step 2. For all c ∈ {c_1, c_2, …, c_{n_i}} obtained in Step 1, draw η^(b) from the grouped conditional density of η_i given (X_i, c).

Algorithm 3: Estimation under the multigroup framework.
Step 1. Initialize α^(0), θ^(0), and g_0^(0).
Step 2. For r = 0, 1, …, run Algorithm 2 for every group i = 1, 2, …, I with the current estimates (g_0^(r), α^(r), θ^(r)).
Step 3. Obtain α^(r+1), θ^(r+1), and g_0^(r+1) according to (10).
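The L_1 and total variation distances used to assess the estimated base density can be computed on a discretization grid; for densities, TVD = 0.5 ∫|ĝ_0 − g_0|. A small sketch follows, with the grid and densities as illustrative assumptions, checked against the known TVD between N(0, 1) and N(1, 1), which equals 2Φ(1/2) − 1 ≈ 0.3829.

```python
import numpy as np

def l1_and_tvd(g_hat, g_true, grid):
    """Riemann-sum L1 distance between two densities on a uniform grid,
    and the total variation distance TVD = 0.5 * L1."""
    dx = grid[1] - grid[0]
    l1 = np.abs(g_hat - g_true).sum() * dx
    return l1, 0.5 * l1

# illustrative check against two unit-variance normal densities
x = np.linspace(-10.0, 11.0, 200001)
phi = lambda z: np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
l1, tvd = l1_and_tvd(phi(x), phi(x - 1.0), x)
```

In the paper's setting, g_hat would be the kernel estimate ĝ_0 at round r evaluated on the grid, tracked across rounds as in Figure 7.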

Applications to Real Data
In this section, the PDP mixture model is employed to analyze three well-known real-world datasets. The first is the galaxies data [8, 25], consisting of velocities of 82 galaxies from six well-separated conic sections of the Corona Borealis region. The model with I = 1 was applied, and the algorithm provided in Section 3 was run for 200 rounds, giving the estimates (α̂, θ̂) = (0.429, 3.988) and the estimate ĝ_0 depicted in Figure 9, which shows that the base distribution is multimodal, corresponding to the different clumps of galaxies. The second dataset consists of records of stamp thickness [26], in millimeters, of 485 unwatermarked Hidalgo stamps dating from 1872 to 1874. Figure 10 presents the estimate ĝ_0 of stamp thickness at round 200 of the procedure, which exhibits a close bimodality of the density, one mode steeper than the other. The curve of ĝ_0 reflects the high probability around the two modes of the η*_i's, corresponding to the number of types of paper used; the stamp data thus depict the concentration of the shared variables at the modes. After 200 rounds of iteration, we obtain the estimates (α̂, θ̂) = (0.425, 6.810). The last dataset concerns the Danish fire losses [27], which are heterogeneous, as reflected in multimodality, skewness, and heavy tails; they were collected by Copenhagen Reinsurance and consist of 2492 records over the period 1980-1990. The losses are in millions of Danish kroner and are not adjusted for inflation over time. The data are divided into multiple groups based on time, and the observations between groups are independent, as conventionally assumed in the nonlife insurance industry. The estimated density ĝ_0 at round 200 under the Poisson-Dirichlet process mixture model is described over two intervals of abscissa in Figure 11, having one concentrated mode and a heavy tail. We get (α̂, θ̂) = (0.481, 0.793).

Conclusions
In this article, we have presented the posterior distribution in PDP mixture models; the estimates cannot be obtained directly because of the unobserved latent variables η_ij. Unbiased "pseudoestimates" are constructed, and the corresponding algorithm is depicted in detail based on nonparametric empirical Bayes. Simulations are presented under both data structures, single-group and multigroup, for the pseudoestimates of the base distribution G_0, the discount parameter α, and the concentration parameter θ. The numerical simulations show that the estimates under multigroup data outperform those under single-group data with the same total number of individuals. We also applied the model to well-known real-world datasets.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.