Posterior Propriety of an Objective Prior in a 4-Level Normal Hierarchical Model

'e use of hierarchical Bayesian models in statistical practice is extensive, yet it is dangerous to implement the Gibbs sampler without checking that the posterior is proper. Formal approaches to objective Bayesian analysis, such as the Jeffreys-rule approach or reference prior approach, are only implementable in simple hierarchical settings. In this paper, we consider a 4-level multivariate normal hierarchical model. We demonstrate the posterior using our recommended prior which is proper in the 4level normal hierarchical models. A primary advantage of the recommended prior over other proposed objective priors is that it can be used at any level of a hierarchical model.


Introduction
Bayesian hierarchical models have a wide range of modern applications including engineering [1], astrophysics [2], economics [3], environmental sciences [4], climatology [5], survival analysis [6], and genetics [7]. It is wonderful here to stay, but hyperparameter priors are often chosen in a casual fashion. Statisticians often use improper priors to express ignorance or to provide good frequency properties. However, it is dangerous to implement the Gibbs sampler without checking that the posterior is proper. As Hobert and Casella [8] pointed, without proper precaution, simple noninformative priors can be misused, sometimes unknowingly, and lead to other difficulties, such as the nonconvergence of the Gibbs sampler. erefore, it is hazardous to skip the demonstration at the risk of making the inference from an improper posterior distribution. ere are many examples of this in the statistical and other literatures. See especially, Wang et al. [9]; Hobert and Casella [8,10]; Berger and Strawderman [11]; Berger et al. [12]; Speckman et al. [13]; Roy and Dey [14]; Michalak and Morris [15]; Ramos et al. [16]; Tomazella et al. [6]; etc. e importance of the posterior propriety motivates us to explore it. e normal hierarchical distribution has received enormous attention and is also of substantial importance in contemporary statistical theory and application. Ning [17] proposed a 2-level multivariate normal hierarchical model for the degradation data of multiple units with change point. Wang and Coit [18] handled the reliability prediction problem using a multivariate normal distribution model, considering multiple degradation measures. Heuristically, one could also build the multivariate normal hierarchical model with adding a normal distribution for unknown mean vectors to deal with this reliability prediction problem. e unknown parameters because of either convenience or a lack of prior information are often modeled with improper objective priors. Some references of objective priors can be found in Berger and Strawderman [11]; Berger [19]; Pollo et al. [20]; and Ferreira et al. [21]. However, formal approaches to objective Bayesian analysis, such as the Jeffreysrule approach or reference prior approach, are only implementable in simple hierarchical settings. It is thus common to use less formal approaches, such as utilizing formal priors from nonhierarchical models in hierarchical settings. However, this can be fraught with danger. For instance, nonhierarchical Jeffreys-rule priors for variances or covariance matrices result in improper posterior distributions if they are used at higher levels of a hierarchical model (see [12]).
Berger et al. [12] addressed the question of choice of hyperpriors in the following 2-level normal hierarchical model: where for i � 1, 2, . . . , m, in which the y i are the k × 1 observation vectors and the θ i are the k × 1 unknown mean vectors (thus p � mk), β is a k × 1 unknown "hypermean" vector, and V is an unknown k × k "hypercovariance matrix." Some commonly priors for the hyperparameter β can be considered. For example, the constant prior and conjugate normal prior. For the prior of the covariance matrix V, a large literature can be found. For example, the articles by Dey and Srinivasan [22]; Lin and Perlman [23]; Haff [24]; Yang and Berger [25]; Daniels and Kass [26,27]; Ledoit and Wolf [28]; Sun and Berger [29]; Rajaratnam et al. [30]; Hoff [31]; And Berger et al. [12] explored posterior propriety and admissibility of various priors. But no overall conclusion was reached as to a specific prior in their paper. Following the seminal work of Berger et al. [12]; Berger et al. [32] rec-ommended a particular objective prior for choice of π(β) and π(V): where v 1 > v 2 · · · > v k are the ordered eigenvalues of V. eir recommendation is based on considerations of the posterior propriety, admissibility, ease of implementation (including computational considerations), and performance. A primary advantage of the recommended prior over other proposed objective priors is that it can be used at any level of a hierarchical model. But the property of the resulting posterior is not clear, and it represents a very difficult question to answer. Michalak and Morris [15] pointed out that except in the simplest models when improper priors are used, it can be daunting and time-consuming to verify that the resulting posterior distribution is proper. While Berger et al. [32] suspected that the propriety holds in any level normal hierarchical models, they were not strictly able to prove it. ey only showed that the posterior is proper in a 3-level hierarchical model. In this paper, we follow the story and consider the posterior propriety of the recommended prior in a 4-level normal hierarchical model. We demonstrate that the posterior using the recommended prior is still proper in the 4-level normal hierarchical model. e structure of this paper is as follows. Section 2 presents a 4-level normal hierarchical model. We propose the recommended hyperpriors for "hypermean" vectors and "hypercovariance matrix." In Section 3, we demonstrate that the posterior using the recommended prior is proper in the 4-level normal hierarchical model. In Section 4, we consider MCMC computation from the posterior. Section 5 gives the performance of this prior, in comparison with other objective priors that were studied in Berger et al. [12], presenting strong numerical evidence of the superiority of (3). Some concluding remarks are provided in Section 6.
As discussed by Berger et al. [12], the prior π(ξ) can be represented by the following hierarchical structure: For both intuitive and technical reasons, it is convenient , and O i being the matrix of eigenvectors corresponding to V * , W * , and Σ * , i � 1, 2, 3, respectively. Consider the one-to-one transformation from V to (V * , O 1 ) and rewrite the prior as where dV � i≤j dv ij ,dV * � k i�1 dv i , and dO 1 denotes the invariant Haar measure over the space of orthonormal Anderson [33] for definition). From Farrell [34], the functional relationship be- erefore, the prior for V becomes the prior of (V * , O 1 ): Use of the invariant prior on O 1 (essentially a uniform prior over rotations) is natural and noncontroversial. is transformation reveals a significant difficulty of any prior that can be written as a function of |V|; in the (V * , O 1 ) space, such priors contain the term i<j (v i − v j ), which gives low mass to close eigenvalues and hence effectively force the eigenvalues apart. It is a criticism of inverse Wishart and Jeffreys priors. is is contrary to the common intuition, in that one often chooses a prior that pushes the eigenvalues closer together.
Similarly, one can obtain the prior of (W * , O 2 ) and (Σ * , O 3 ), given by

Posterior Propriety
In this section, we consider the posterior propriety of the recommended prior in the 4-level normal hierarchical model (4). (4) with q ≥ 2. Assume that Z has rank ps and ZT has rank nq. en, the posterior distribution is always proper.

Theorem 1. Consider the hierarchical Bayes model
Proof. For the technical reason, we define en, (4) is equivalent to Mathematical Problems in Engineering 3 Level 1: (y | θ) ∼ N mk θ, I mk , We marginalize out over θ, β, and η, one by one. It yields the marginal distribution of y given (ξ, V, W, and Σ) is where G mk×q � ZT(1 n ⊗I q ) and Note that ξ | λ ∼ N p (0, λI q ), and marginalizing out ξ yields the marginal distribution of y given (V, W, Σ, and λ): erefore, the likelihood of y given (V, W, Σ, and λ) is where Note that π(V) From relationship (8) between V and (V * , O 1 ),

π(V)
Next, consider the integration over W. From the matrix identity |A + XBX ′ | � |A||B||B − 1 + X ′ A − 1 X|, we obtain Let z 0 > 0 be the smallest eigenvalue of Z ′ Z, then us, Using the above inequality, it yields π(W) Let u 1 ≥ · · · u q > 0 be the eigenvalues of G ′ G. Using the matrix identity |A + XBX ′ | � |A||B||B − 1 + X ′ A − 1 X| again, it follows that erefore, Last, to consider the integration over Σ. Let t 0 be the smallest eigenvalue of T ′ Z ′ ZT. en, us, π(Σ) which is finite if Since ϵ can be chosen arbitrarily small, the integration over Σ is finite if Since q, n, m, and s ≥ 2, (29) is true. e theorem is proved.
Next, consider the integration over Σ. It follows that Since ϵ can be chosen arbitrarily small, p � n and k � sn; this is true if is completes the proof.

Computation
In this section, we consider the MCMC computation from the posterior arising from the model (4). e normal hierarchical models are typically handled today by Gibbs sampling. One difficulty of computation is to sample the covariance matrix efficiently. e main new development discussed in Berger et al. [35] is a new and efficient computational algorithm for dealing with priors on covariance matrices as in (3). It overcomes the computational bottleneck mentioned in the introduction.   (θ, β) can be written as Here and in the following etr(A) represents exp(trace(A)) for a squared matrix A, and e conditional posterior of η given (β, ξ, W, Σ) is N nq (K − 1 ν, K − 1 ). (e) Defining r 2 � (s/2) + 1 − (1/2p), the conditional posterior density of W for given (β, η) can be written as To compute with the recommended prior for ξ, use the equivalent representation as follows: (g) Defining r 3 � (n/2) + 1 − (1/2q), the conditional posterior density of Σ for given (β, η) can be written as where H 3 � n l�1 (η l − ξ)(η l − ξ) ′ . (i) For sample λ from its full conditional, the Inverse Gamma ((q − 1)/2, (1 + ‖ξ‖ 2 )/2) density If q � 1, this step is not needed as the hyperprior, for ξ is constant.
(ii) Given (λ, Σ, η i ), Gibbs sampling of ξ can be done from its full conditional, which is (when q � 1, set λ � ∞) Sampling of θ, β, η, and ξ can simply be carried out with a Gibbs step, as its full conditional will be a normal distribution. To sample the covariance matrix V, W, and Σ from Fact 1, the Metropolis-Hastings Algorithm [36] and Hit-and-Run Method Chen and Schmeiser [37] could be used, based on proposal distributions that generate full-candidate matrices. But the two well-known methods are both inefficient, especially for highdimensional data. Fortunately, Berger et al. [35] proposed a new and efficient computational algorithm for sampling the covariance matrix V, W, and Σ from Fact 1, respectively.
Take sampling V from (45) as an example. For the conditional density of V given in (45), we use the eigenvalue-vector decomposition V � O 1 V * O 1 ′ , where O 1 is the orthogonal and V * is the diagonal matrix of ordered eigenvalues. For H 1 defined in (46), note that tr( , so the conditional density of V given in (45) can be transformed to Gibbs sampling of V * : following Lemma 1 in Berger et al. [12], i.e., we can first sample V * given (O 1 , H 1 ), from where c i is the (i, i) element of O 1 ′ H 1 O 1 /2. us, we can directly sample v i independently from inverse gamma (r 1 − 1, c i ) distributions, and then simply rearrange the v i , so that v 1 ≥ · · · ≥ v k . Gibbs sampling of O 1 : from (19), the conditional density of O 1 given V * and H 1 is Write H 1 � LUL ′ , where L is the matrix of normalized eigenvectors (LL ′ � I k ) and U � diag(u 1 , . . . , u k ) is the diagonal matrix of corresponding eigenvalues with en, the conditional density of Γ given V * and H 1 is Hoff [31] introduced an MCMC algorithm to sample from the posterior of O 1 by updating two randomly selected columns of O 1 . Alternatively, Berger et al. [35] suggested updating two randomly selected rows of Γ (essentially equivalent to Hoff's method in full rank cases, but considerably faster in situations of less than full rank). For example, to update the first two rows of Γ, we write Γ � diag(Ω, I k− 2 )(Γ 12 ′ , Γ − 12 ′ ) ′ , where Γ 12 is the first 2 rows of Γ, Γ − 12 is the rest k − 2 rows of Γ, and Here, ω ∈ (− π/2, π/2] and ϵ i � ± 1 for i � 1, 2. Write en, the conditional posterior of ω, π(ω | Γ 12 , Write where ϕ ∈ (− π/2, π/2] and s 1 > s 2 . From (59), the conditional posterior of θ is where c 0 � − (1/2)(s 1 − s 2 )(u 1 − u 2 ) ≤ 0. Let α � cos 2 (ω + ϕ). en, the full conditional density of α is Sampling α ∈ [0, 1] can be done with a rejection sampler with proposal Beta(1/2, 1/2) distribution.
From the studies in Berger et al. [35], the new method substantially outperforms the Metropolis and Hit-and-Run Mathematical Problems in Engineering algorithms in moderate dimensions and succeeds for k up to 100, whereas the other methods break down in much lower dimensions.

Simulation
From the mean square error (MSE) perspective, we compare the performance of 6 objective hyperpriors (considered in Berger et al. [12]), created from the product of three objective hyperpriors for ξ: (1) Constant prior: π 1 (ξ) � 1 (2) Conjugate prior: and two objective hyperpriors for V (1) Constant prior: π(V) � 1,π(W) � 1 and π(Σ) � 1 (2) Recommended reference prior: Except for the constant prior and recommended reference prior, Berger et al. [12] also studied the nonhierarchical independence Jeffreys prior (π(V) � |V| − (k+1)/2 ) and the hierarchical independence Jeffreys prior (π(V) � |I + V| − (k+1)/2 ) in the model (1). From Berger et al. [12], the nonhierarchical independence Jeffreys prior for the covariance matrix cannot be used for the hierarchical models since it results in improper posterior. For the hierarchical independence Jeffreys prior, the common sampling methods are very inefficient to sample V from the posterior distribution. erefore, we do not consider either of the abovementioned two Jeffreys priors.
Set q � n � 2, p � s � 4, m � k � 16. We generate Z � (z 1 , . . . , z mk ) ′ , where z i ∼ iid N sp (0, I sp ) for i � 1, . . . , mk. Similarly, T � (t 1 , . . . , t sp ) ′ , where t j ∼ iid N nq (0, I nq ) for j � 1, . . . , sp. We simulate the Bayes risks of the posterior means resulting from the 6 priors in every combination of the following cases: Each choice of (ξ, V, W, and Σ) specifies a "true" hierarchical model for the simulation and we wish to compute the Bayes risk corresponding to the posterior means for θ that arise from each of the 6 objective priors. To simulate these risks, we generate 2000 random θ l ∼ N mk (1 m ⊗ β, I m ⊗ V) and generate observations y l |θ l ∼ N mk (θ l , I mk ), l � 1, . . . , 2000. To obtain the posterior means, δ π (y l ), of θ l , under the 6 priors, we run 100, 000 MCMC cycles after 3, 000 burn-in cycles, using the algorithms described above. Finally, we approximate the Bayes risk as the average observed mean squared error (MSE) as follows: e results are given in Table 1. e recommended prior (in the last line) dominates all the others in terms of risk. Within each three-row segment, comparison of the first row with the third row is a comparison between using the constant prior for ξ and the recommended prior. e gains with the recommended prior are large. Comparison of each of the first three rows with each of the last three rows is a comparison between using the constant prior for V versus the recommended prior. e gains with the recommended prior are also large. e risk results present the strong numerical evidence of the superiority of the recommended priors (5).

Conclusion
In this paper, we follow Berger et al.'s [12] work and study a 4-level normal hierarchical model. We demonstrate that the posterior using the recommended prior is still proper in the 4-level normal hierarchical model. Herein, we do not demonstrate Berger et al.'s [32] conjecture, i.e., the posterior using the recommended prior is always proper in any level normal hierarchical models. But our method provides one useful guideline for the completion of the story.
Besides, the normal hierarchical models are typically handled by Gibbs sampling. One difficulty of computation is to sample the covariance matrix efficiently. e common sampling methods for covariance matrices, for example, Metropolis-Hastings algorithm [36] and Hit-and-Run method [25], are inefficient for higher dimensions. To overcome the computational bottleneck of sampling covariance matrices, we can use a powerful and efficient method proposed by Berger et al. [35] for sampling the conditional density of covariance matrices. erefore, there is no difficulty in computation for the 4-level normal hierarchical model with the recommended priors (5). In addition, the simulation result presents the strong numerical evidence of the superiority of the recommended priors (5).

Data Availability
No data were used to support this study.

Conflicts of Interest
e authors declare that they have no conflicts of interest.