doi:10.1155/2010/201489 Research Article Spiked Dirichlet Process Priors for Gaussian Process Models

We expand a framework for Bayesian variable selection for Gaussian process (GP) models by employing spiked Dirichlet process (DP) prior constructions over set partitions containing covariates. Our approach results in a nonparametric treatment of the distribution of the covariance parameters of the GP covariance matrix that in turn induces a clustering of the covariates. We evaluate two prior constructions: the first one employs a mixture of a point-mass and a continuous distribution as the centering distribution for the DP prior, therefore, clustering all covariates. The second one employs a mixture of a spike and a DP prior with a continuous distribution as the centering distribution, which induces clustering of the selected covariates only. DP models borrow information across covariates through model-based clustering. Our simulation results, in particular, show a reduction in posterior sampling variability and, in turn, enhanced prediction performances. In our model formulations, we accomplish posterior inference by employing novel combinations and extensions of existing algorithms for inference with DP prior models and compare performances under the two prior constructions.


Introduction
In this paper, we expand a framework for Bayesian variable selection for Gaussian process GP models by employing spiked Dirichlet process DP prior constructions over set partitions containing covariates.Savitsky et al. 1 incorporate Gaussian processes in the generalized linear model framework of McCullagh and Nelder 2 by expanding the flexibility for the response surface to lie in the space of continuous functions.Their modeling approach results in a class of nonparametric regression models where the covariance matrix depends on the predictors.GP models, in particular, accommodate high-dimensional heterogenous covariate spaces where covariates possess different degrees of linear and nonlinear association to the response, Rasmussen and Williams 3 .
In this paper, we investigate mixture prior models that induce a nonparametric treatment of the distribution of the covariance parameters of the GP covariance matrix that, in turn, induces a clustering of the covariates.Mixture priors that employ a spike at zero are now routinely used for variable selection-see for example, George and McCulloch 4 and Brown et al. 5 for univariate and multivariate regression settings, respectively, and Sha et al. 6 for probit models-and have been particularly successful in applications to high-dimensional settings.These approaches employ mixture prior formulations for the regression coefficients that impose an a priori multiplicity penalty, as argued by Scott and Berger 7 .More recently, MacLehose et al. 8 have proposed a Bayesian nonparametric approach to the univariate linear model that incorporates mixture priors containing a Dirac measure component into the DP construction of Ferguson 9 and Antoniak 10 .Dunson et al. 11 use a similar spiked centering distribution in a logistic regression.As noted by these authors, DP models borrow information across covariates through model-based clustering, possibly induced through a spatial or temporal correlation, and can achieve improved variable selection and prediction performances with respect to models that use mixture priors alone.Within the modeling settings of MacLehose et al. 8 and Dunson et al. 11 , the clustering induced by the Dirichlet process is on the univariate regression coefficients and strength is borrowed across covariates.Kim et al. 12 have adapted the DP modeling framework to provide meaningful posterior probabilities of sharp hypotheses on the coefficients of a random effects model.Their goal is not necessarily variable selection, but rather the more general problem of testing hypotheses on the model parameters.Their model formulation does not share information across covariates but rather clusters vectors of regression coefficients across observations.While the prior constructions described above all use a mixture of a point mass and a continuous distribution as the centering distribution of the DP prior, in this paper we also investigate constructions that employ a mixture of a spike and a DP prior with a continuous distribution as the centering distribution.The former approach clusters all covariates, the latter induces clustering of the selected covariates only.The prior formulations we adopt show a reduction in posterior sampling variability with enhanced prediction performances in cases of covariates that express nearly the same association to the response.
In our model formulations, we accomplish posterior inference by employing novel combinations and extensions of existing algorithms for inference with DP prior models and variable selection.Unlike prior constructions for linear models, which are able to marginalize over the model space indicators and directly sample the model coefficients a posteriori, our non-linear modeling frameworks employ nonconjugate priors.We achieve robust selection results by using set partitions on which we impose a DP prior to enclose both the model and the associated parameter spaces.We optimize performances of posterior sampling with a modification of the auxiliary Gibbs algorithm of Neal 13 that accounts for a trivial cluster containing nuisance covariates.We investigate performances on simulated data under the two prior constructions.Our DP prior model constructions represent generalized nonconjugate formulations with associated posterior sampling algorithms that, while specific to GP models, may be applied to other nonconjugate settings.
The remainder of the paper is structured as follows: GP models and covariance matrix formulations are reviewed in Section 2. Section 3 introduces our spiked DP prior formulations, including separate models to cluster all covariates and only selected covariates.Sampling schemes for posterior inference are described in Section 4. Simulations are conducted in Section 5, where we compare the clustering construction to the mixture priors model and also compare the two DP prior model formulations.A benchmark dataset is analysed in Section 6. Concluding remarks are offered in Section 7.

Generalized Gaussian Process Models
where A j l∈R t j exp z x l l , with R t j being the set of individuals at risk right before t j and A i the A j evaluated at the ith failure time.The use of the partial likelihood conveniently avoids prior specification of the baseline hazard.

Covariance Formulation
A common choice for C in 2.2 is a covariance function that includes a constant term and an exponential term, that is, with J n being an n×n matrix of 1's and G a matrix with elements g ij x i −x j P x i −x j and P diag − log ρ 1 , . . ., ρ p , with ρ k ∈ 0, 1 associated to variable x k , k 1, . . ., p.This single-term exponential covariance provides a parsimonious representation that enables a broad class of linear and non-linear response surfaces.In particular, Rasmussen and Williams 3 show how the exponential form 2.5 can be derived from a linear construction by expanding the inputs, x j 's, into an infinite basis.The chosen parametrization allows simple prior specifications and it is also used by Linkletter et al. 15 as a transformation of the exponential term used by Neal 14 and Sacks et al. 17 in their covariance matrix formulations.This construction is sensitive to scaling and we find best results by normalizing the predictors to lie in the unit cube 0, 1 .Other choices of C, such as exponential constructions that include a second exponential term and Matern constructions, are reviewed in Savitsky et al. 1 .

Spiked Dirichlet Process Prior Models for Variable Selection
Variable selection can be achieved in the GP modeling framework with covariance matrix of type 2.5 by imposing mixture priors on the covariance parameters, that is, for ρ k , k 1, . . ., p, which employs a U 0, 1 prior for ρ k | γ k 1 and a δ 1 • , that is, a point mass distribution at one, for γ k 0. This formulation is similar in spirit to the use of mixture priors employed for variable selection in linear regression models, as, for example, in George and McCulloch 4 and Brown et al. 5 for univariate and multivariate regression settings, respectively.
In this paper, we embed mixture priors for variable selection into Dirichlet process prior models that cluster covariates to strengthen selection.The Dirichlet process DP construction of Ferguson 9 and Antoniak 10 is a typical choice for a prior on an unknown distribution, G.In particular, given a set of a priori i.i.d.parameters, φ φ 1 , . . ., φ p , with φ i ∼ G, we define the DP prior on G ∼ DP α, G 0 , where G 0 is the parametric base measure defining the prior mean, E G G 0 .The concentration parameter, α, expresses the prior confidence in the base measure.Draws from G are discrete a.s., implying a positive probability of ties to instantiate randomly generated partitions.Indeed, many contributions in nonparametric Bayesian inference are formulated in terms of random partition models, that is, probability models that cluster the set of experimental units.See Quintana 18 for a nice review of nonparametric Bayesian models.
Here we introduce probability distributions on set partitions with a particular focus on clustering the p covariates through the a priori i.i.d.covariance parameters, φ , rather than the usual choice of n i.i.d.observations.Let φ * φ * 1 , . . ., φ * M , for M ≤ p, define the unique values of φ, and let us define the clusters as S {k : φ k φ * }.Let F indicate the space of all possible partitions of the p covariates.The partition ν p {S 1 , . . ., S M } ∈ F captures a particular disjoint clustering of the covariates, with S k ∩S m ∅ for k / m, such that we recover the full set of covariates in the disjoint union, p k 1 S k S 0 {1, . . ., p}.The DP provides the P ólya urn scheme of Blackwell and MacQueen 19 by marginalizing over G to define a joint prior construction for a particular partition, where Γ x is the gamma function and p S the number of covariates in cluster S. Higher values of α tend to produce a larger number of clusters.This is evident if we factorize the joint prior as where we introduce cluster indicators, s k ⇒ k ∈ S , k 1, . . ., p and employ exchangeability for φ k to treat covariate k as the last one added.Here p −k,s / s t indicates the number of covariates, excluding covariate k, allocated to the nontrivial cluster S. Similarly, M − , captures the total number of clusters when excluding covariate k.In particular, this construction of the conditional prior reveals that the probability for covariate k to be clustered with m is uniform for all k or π s k s m | s −k ∝ 1 for m 1, . . ., k − 1, k 1, . . ., p.We complete the prior specification with α ∼ G a α , b α to allow the data to update the concentration parameter for a fully Bayesian approach.It is important to note that our prior construction is over set partitions that contain covariates and that all the observations are in every cluster.We next develop two specific and alternative prior formulations, the first permits clustering on all-trivial and selected-covariates and the second one focuses on clustering only the selected covariates.

Clustering All Covariates
The first prior construction we consider employs the mixture prior as the centering distribution for the DP prior model, therefore, clustering all covariates.Let us consider the covariance parameters, φ φ 1 , . . ., φ p with φ k γ k , ρ k , for k 1, . . ., p.We proceed with the usual DP prior construction

3.5
Notice how prior 3.5 strengthens selection by aggregating all trivial covariates into a single cluster.Later in the paper we will employ a data augmentation approach to conduct posterior samples from this nonconjugate model formulation.

Clustering Selected Covariates
Alternatively, we can use prior models that employ a mixture of a spike and a DP prior with a continuous distribution as the centering distribution, therefore, inducing clustering of the selected covariates only.We construct this model as which may be written more intuitively as ρ k | γ k 1, G ∼ G.This formulation confines the set partitions to cluster only the selected covariates.While the dimension of the selected covariates, p γ , will change at every iteration of the MCMC algorithm for posterior inference, we may still marginalize over G, given p γ , to produce the P ólya urn prior formulation 3.3 where we set

3.7
We note that the normalizing expression in the denominator now uses p γ , rather than p, to account for our reduced clustering set.Trivial covariates are not clustered together so that we a priori expect reduced efficacy to remove trivial covariates from the model space.
In other words, this prior construction produces a relatively flatter prior for assignment to nontrivial clusters under model sparsity as compared to 3.5 .Yet, we expect improvement in computational speed as we are only clustering p γ ≤ p covariates.

Markov Chain Monte Carlo Methods
We accomplish posterior inference by employing novel combinations and extensions of existing algorithms for inference with DP models and variable selection.In particular, we adapt the auxiliary Gibbs algorithm of Neal 13 to data augmentation schemes that draw posterior samples for our nonconjugate model formulations.
First we extend our notation and use ρ γ to indicate the ρ vector where ρ k 1 when γ k 0, for k 1, . . ., p.For model 3.

Clustering All Covariates
We first define our sampling algorithm using the covariate clustering construction 3.5 which includes all covariates-both trivial and selected.Our MCMC algorithm sequentially samples s, φ * , α, λ a , λ z , h in a Gibbs-type fashion.We improve efficiency of the auxiliary Gibbs algorithm of Neal 13 used to sample s by making a modification that avoids duplicate draws of the trivial cluster.The sampling scheme we propose is as follows.
1 Update s: The auxiliary Gibbs algorithm of Neal 13 achieves sampling of the cluster indicators by introducing temporary auxiliary parameters typically generated from the base distribution 3.4 .While multiple draws of nontrivial cluster are almost surely unique, repeated draws of the trivial cluster are entirely duplicative.We make a modification to the auxiliary Gibbs algorithm by ensuring that our state space always contains the trivial cluster, therefore, avoiding duplicate generations.The algorithm employs a tuning parameter, ω, as the number of temporary auxiliary parameters to be generated from the prior to facilitate sampling each s k at every MCMC iteration.We begin by drawing the ω auxiliary parameters from the conditional prior given the current state space values.If ∈ {1, . . ., p} : s l s t , then one of possibly multiple auxiliary parameters has a connection to the trivial state.We thus sample φ * M − 1 ∼ δ 0 γ * k δ 1 ρ * k , which draws this value as the trivial cluster.If, however, ∃ ∈ {1, . . ., p} : s l s t , then the auxiliary parameters are independent of the trivial state and are sampled as nontrivial clusters from δ 1 γ k I 0 ≤ ρ k ≤ 1 , as in the original auxiliary Gibbs algorithm.Next, we draw the cluster indicator, s k , in a Gibbs Step from the conditional posterior over the set partitions with a state that includes our auxiliary parameters, where we abbreviate 4.1 with π D | φ * s with φ * s ∈ φ * , the unique parameter associated to cluster index, s.In the examples below, we use ω 3, and therefore a probability to assign a covariate to each of the new clusters as proportional to α/w .Neal 13 notes that larger values of ω produce posterior draws of lower autocorrelation.
2 Update φ * : We update the cluster parameters, φ * φ * 1 , . . ., φ * M , M ≤ p, using a Metropolis-within-Gibbs.This scheme consists of 2 moves: a between-models move to jointly update γ * k , ρ * k for k 1, . . ., M in a componentwise fashion, and a within model move to update ρ * k for covariates in the current model after the between-models move.We use uniform proposals for the ρ * k s.Under our clustering formulation, we update the M clusters, and not the p covariates, therefore, borrowing strength among coclustered covariates.
3 Update α: We employ the two-step Gibbs sampling algorithm of Escobar and West 20 constructed as a posterior mixture of two Gamma distributions with the mixture component, η, drawn from a beta distribution.The algorithm is facilitated by the conditional independence of α from D, given s.
4 Update {λ a , λ z , h}: These are updated using Metropolis-Hastings moves.Proposals are generated from the Gamma distributions centered at the previously sampled values.

Clustering Selected Covariates
We describe the steps to perform updates for s, γ, ρ * γ , α, λ a , λ z , h from 3.7 . 1 Update s: Obtain draws in a Gibbs Step for s s 1 , . . ., s p γ , employing the auxiliary Gibbs algorithm in the usual way according to the conditional posterior

4.3
2 Update γ : We update γ componentwise by employing a Metropolis-within-Gibbs algorithm.Notice that an update that adds a covariate also adds a cluster and, similarly, the removal of a covariate also discards a cluster in the case where the cluster contains only the deleted covariate.
3 Update ρ * γ : We update the unique ρ k 's values for the previously selected p γ covariates in a componentwise fashion using a Metropolis-within-Gibbs algorithm and uniform proposals.For both MCMC schemes, final selection of the variables is accomplished by employing a cutoff value for the marginal posterior probabilities of inclusion of single variables based on a target expected False Discovery Rate EFDR in a fashion similar to Newton et al.21 .For example, let ξ k be the posterior probability of the event γ k 1, that is, a significant association of the kth predictor to the response.We fix α, a prespecified false discovery rate, and select those covariates with posterior probabilities of exclusion under the null hypothesis, 1 − ξ k , that are below threshold, κ, that is, with I • the indicator function.As noted by Kim et al. 12 , the optimal posterior threshold, κ, may be determined as the maximum value in the set {κ : EFDR κ ≤ α}.

Simulation Study
We explore performances of the proposed models on simulated data and on a benchmark dataset.Results will show that the application of DP priors may supply a reduction in posterior sampling variability that, in turn, enhances prediction performances, for cases where there is an expected clustering among covariates.We investigate performances under the two prior constructions described above.

Hyperparameters Setting
In all examples below, we generally follow the guidelines for hyperparameter settings given in Savitsky et al. 1 for prior settings related to the mixture prior construction of Section 3 and to specific data models.In particular, we employ G 1, 1 priors on λ a , λ z .In addition, we center and normalize the response and transform the design matrix to lie in 0, 1 p to produce a small intercept term, which in turn supplies a better conditioned GP covariance matrix.Savitsky et al. 1 note little sensitivity of the results to the choice of w, the prior expectation of covariate inclusion.Here we set w 0.025 in all examples below.In the univariate regression model 2.3 , the parameters of the prior on the precision error term, r ∼ G a r , b r , should be set to estimate the a priori expected residual variance.We choose a r , b r 2, 0.1 .As for the DP priors, we choose α ∼ G 1, 1 , a setting that produces a prior expected number of clusters of about 7.5 for p 1000 covariates.We briefly discuss sensitivity to the choice of these hyperparameter settings in the simulation results.

Clustering versus No Clustering
We first consider the univariate regression model 2.3 and compare performance of covariate clustering under 3.5 with the original GP construction 3.1 of Savitsky et al. 1 .With the latter approach, we employ their MCMC-scheme 2 algorithm to accomplish posterior inference.Results we report were obtained by using a response kernel that includes both linear and non-linear associations and where subgroups of covariates share the same functional form, to induce clustering, y x 1 x 2 x 3 sin 9x 4 sin 9x 5 1.3x 6 1.3x 7 , 5.1 with ∼ N 0, σ 2 , σ .05,and with covariates simulated from a U 0, 1 .We use p 1000 covariates, with the response kernel constructed from the first 7.We do 10,000 MCMC iterations, discarding half as burn-in.Results are presented in Figure 1; plots a , b present box plots for posterior samples of the ρ k 's without clustering and under the clustering model 3.5 , respectively.Only the first 20 covariates are displayed, to help visualization.One readily notes both the reduced spread between covariates sharing the same functional form and within covariate sampled values of ρ k for all covariates under application of our clustering model.Such reduction in within and between spreads of the posterior sampling affects, in turn, prediction performances.In particular, we assessed prediction accuracy by looking at the mean-squared prediction error MSPE normalized by the variance of the randomly selected test set, that we term "normalized MSPE".The normalized MSPE declines from 0.12 to 0.02 under application of our clustering model.We further applied the least-squares posterior clustering algorithm of Dahl 22 that chooses among the sampled partitions, post-burn-in, those that are closest to the empirical pairwise clustering probabilities obtained from averaging over posterior samples.Our model returned the correct 3 clusters.

Clustering All versus Selected Covariates
Next we compare performances for the two prior models 3.5 and 3.7 , clustering all covariates and selected covariates only, respectively.We conduct this comparison under the Cox survival model 2.4 .The latent response kernel is constructed as y 3.5x 1 3.5x 2 3.5x 3 − 1.5 sin 5x 4 − 1.5 sin 9x 5 − 2.5x 6 − 2.5x 7 , 5.2 with ∼ N 0, σ 2 , σ .05,and with covariates simulated from a U 0, 1 .We generate observed t ∼ Exp 1 / λ exp y , where we employ λ 1.We subsequently randomly censor 5% of our generated survival times.Figure 2 presents the results for clustering all covariates plots a , c and only selected covariates plots b , d .Again, we see the expected clustering behavior among selected covariates in both models, with a slightly less sharp cluster separation in the latter case, indicating a reduction in borrowing of strength among coclustered covariates.We further experimented with the prior expected number of clusters by employing α ∼ G a, 1 , with a 3 − −5, and found a further slight reduction of within covariate sampling spread for selected covariates with increasing a, likely resulting from the greater tendency to produce more clusters.
It is worth noting that, under model sparsity, the MCMC algorithm of the model clustering selected covariates only is about 13-14 times faster than the one under the model that clusters all covariates.More precisely, results presented here for the former model were obtained with a computation of 4600 CPU-seconds as compared to 63000 CPU-seconds for the latter model clustering all covariates.This is not surprising under model sparsity, since the model formulation clustering all covariates assigns p covariates to clusters on every MCMC iteration, while the construction clustering selected covariates assigns only p γ .Computation times do of course increase for both clustering methods proportionally to the number of true clusters.Reported CPU run times were based on utilization of Matlab with a 2.4 GHz Quad Core Q6600 PC with 4 GB of RAM running 64-bit Windows XP.

Benchmark Data Application
We analyze the Boston Housing data of Breiman and Friedman 23 using the covariate clustering model 3.5 that includes all covariates.This data set relates p 13 predictors to the median value of owner-occupied homes in n 506 census tracks in the Boston metropolitan area; see Breiman and Friedman 23 for a detailed description of the predictors.We held out a randomly chosen validation set of 250 observations.Figure 3 Dahl 22 , which provides an analytical approach for selecting clusters from among the posterior partitions, contained φ 1 and a separate cluster capturing {x 7 , x 11 }.The set partition with the second lowest least squares deviation score defines this second cluster as {x 7 , x 10 }.These results then generally support our subjective visual interpretation.

Discussion
In this paper, we have expanded the framework for Bayesian variable selection for generalized Gaussian process GP models by employing spiked Dirichlet process DP prior constructions over set partitions containing covariates.Our approach results in a nonparametric treatment of the distribution of the covariance parameters of the GP covariance matrix that in turn induces a clustering of the covariates.We have proposed MCMC schemes for posterior inference that use modifications of the auxiliary Gibbs algorithm of Neal 13 to facilitate posterior sampling under model sparsity avoiding the generation of duplicate trivial clusters.Our simulation results have shown a reduction in posterior sampling variability and enhanced prediction performances.In addition, we have evaluated two prior constructions: the first one employs a mixture of a point-mass and a continuous distribution as the centering distribution for the DP prior, therefore, clustering all covariates.The second one employs a mixture of a spike and a DP prior with a continuous distribution as the centering distribution, which induces clustering of the selected covariates only.While the former prior construction achieves a better clusters separation, clustering only selected covariates is computationally more efficient.
In the future, it may be interesting to extend our nonparametric covariate clustering models to hierarchical structures that impose some prior dependence among covariates.Another possible extension of our modeling framework includes augmentation with the simultaneous employment of a complementary clustering of observations in a Dirichlet mixture construction incorporating the regression error term of the model.There is no inherent conflict between these two schemes since all observations are in every covariate cluster.

4
Update α: As in Step 3. of the previous algorithm.5 Update {λ a , λ z , h}: As in Step 4. of the previous algorithm.

Figure 1 :
Figure 1: Effect of covariate clustering employing prior model 3.5 : Univariate regression model n 130, p 1000 .Box plots of posterior samples for the ρ k 's; a shows results without covariate clustering; b shows results with covariate clustering.

dFigure 2 : 1 {x 6 2 {x 7
Figure 2: Effect of covariate clustering: Survival model n 150, p 1000 .Box plots of posterior samples for the ρ k 's and marginal posterior probabilities for the γ k 's; a and c show results with clustering of all covariates; b and d show results with clustering of only selected covariates.

Figure 3 :
Figure 3: Boston Housing Data.Posterior samples of ρ.Plots a , b present box plots of ρ k .a without clustering models; b with clustering on all covariates.

failure times and let t 1 < t 2 < • • • < t D be the D ≤ n distinct noncensored failure times. In this paper, we use the partial likelihood formulation, defined as
Savitsky et al. 1 incorporate GP models within the generalized linear model framework of McCullagh and Nelder 2 by employing the relation 4which encloses the mixture prior on ρ k | γ k and the Bernoulli prior on γ k in the base distribution, and where w is the prior probability of covariate inclusion.A further Beta prior can be placed on w to introduce additional variation.Under model sparsity, we a priori expect most covariates to be excluded from the model space, which we accomplish by allocating the associated ρ k for a nuisance covariate to the Dirac measure component of the conditional mixture prior under the setting γ k 0, ρ k 1 , effectively reducing the dimensionality of the parameter space.Our clustering model 3.4 therefore strengthens exclusion of nuisance covariates with a prior construction that co-clusters nuisance covariates.Let us define the trivial cluster as S t 5 , we collect the covariance parameters in Θ s, φ * , λ a , λ z where φ * γ * , ρ * γ * indicates the M unique cluster values of φ γ, ρ γ .For model 3.7 , we have Θ s, γ, ρ * γ , λ a , λ z to reflect the fact that covariate selection is done separately from the clustering.We recall the augmented data likelihood of Savitsky et al. 1 employed as a generalized formulation for GP models to construct the joint posterior distribution over model parameters, i ∈ {y i , {t i , d i , z x i }} and D : {D 1 , . . ., D n } to capture the observed data augmented by the unobserved GP variate, z X , in the case of latent responses, such as for the survival model 2.4 .Note that D depends on the concentration parameter for clustering covariates, α, through the prior on s.We collect all other model-specific parameters in h; for example, h r for the univariate regression model 2.3 .