AN IMPROVED BAYES EMPIRICAL BAYES ESTIMATOR

© Hindawi Publishing Corp.

Consider an experiment yielding an observable random quantity X whose distribution F_θ depends on a parameter θ, with θ being distributed according to some distribution G_0. We study the Bayesian estimation problem of θ under squared error loss based on X, as well as on additional data available from other similar experiments according to an empirical Bayes structure. In a recent paper, Samaniego and Neath (1996) investigated the question of whether, and when, this information can be exploited to provide a better estimate of θ in the current experiment. They constructed a Bayes empirical Bayes estimator that is superior to the original Bayes estimator (based only on the current observation X) for sampling situations involving exponential family-conjugate prior pairs. In this paper, we present an improved Bayes empirical Bayes estimator having a smaller Bayes risk than that of Samaniego and Neath's estimator. We further observe that our estimator is superior to the original Bayes estimator in more general situations than those of the exponential family-conjugate prior combination.


Introduction.
Suppose that an experiment yields an observable random variable X whose distribution is indexed by a parameter θ. Assume that θ is distributed according to some known (prior) distribution G_0. Consider the Bayesian estimation problem of θ based on X under squared error loss. Researchers have investigated how to be a better Bayesian whenever some additional data are available from other similar experiments. This similarity of other experiments is described using an empirical Bayes structure (Robbins [8, 9]). More specifically, it is assumed that the θ_i's are independent with

θ_i ∼ G, i = 1, ..., k, (1.1)

and that, given θ_i, X_i is distributed according to F_{θ_i} (here and in what follows, "∼" means "is distributed as", and i.i.d. means independent identically distributed).
It is further assumed that the pairs {(X_i, θ_i)} are mutually independent and are independent of (θ, X) as well. The above model is also known as the compound sampling model. The Bayesian approach to inference about θ depends on the assumed prior distribution G_0. This prior can depend on unknown parameters which in turn may follow some second-stage prior; this sequence of parameters and priors constitutes a hierarchical model. The hierarchy must stop at some point, with all remaining prior parameters assumed to be known. The Bayes estimators obtained with respect to such hyperpriors are known as hierarchical Bayes estimators. Alternatively, the basic empirical Bayes approach uses the observed data (X_1, ..., X_k) from other similar experiments to estimate those final-stage parameters (or to estimate the Bayes rule itself) and then proceeds as in a standard Bayesian analysis. The resulting estimators are generally known as parametric (or nonparametric) empirical Bayes estimators. There is a huge literature on empirical Bayes and hierarchical Bayes methods; the interested reader is referred to the monographs of Maritz and Lwin [7] and Carlin and Louis [2] for further details. On the other hand, the so-called Bayes empirical Bayes approach (Deely and Lindley [4]) operates under the assumption that the prior G_0 (or the hyperprior) is completely known and seeks to combine subjective information about the unknown parameter (via G_0) with the available data (X_1, ..., X_k, X) in the process of making inferences about the parameter of interest θ. For various developments on Bayes empirical Bayes methods, see the works of Rolph [10], Berry and Christensen [1], Deely and Lindley [4], Gilliland et al. [6], Walter and Hamedani [13, 14], Samaniego and Neath [12], and the references therein.
Like the Bayes empirical Bayes approach, hierarchical Bayes modeling is also a powerful tool for combining information from separate, but possibly related, experiments. The basic idea is to treat the unknown parameters from the individual experiments as realizations from a second-level distribution. The combined data can then be thought of as arising from a two-stage process: first, the parameters θ_1, ..., θ_k are drawn from the second-level distribution, say G_η(θ), and then the data X_1, ..., X_k are drawn from the resulting first-level distributions F_{θ_1}, ..., F_{θ_k}. Under this formulation, the first-level model explains the variation within experiments, and the second-level model explains the variation across experiments. For example, hierarchical Bayes methods form an ideal setting for combining information from several published studies of the same research area, a practice commonly referred to as meta-analysis (Cooper and Hedges [3]), though, in this context, primary interest is in the hyperparameters η rather than in the parameters θ from the individual studies.
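The two-stage process described above can be sketched in a few lines of code. The following is a minimal illustration only, assuming (hypothetically, not from the paper) a normal second-level distribution G_η and normal first-level distributions F_θ; the names `eta_mean`, `eta_sd`, and `sigma` are our own placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

k = 5
eta_mean, eta_sd = 0.0, 1.0   # hypothetical hyperparameters eta of G_eta
sigma = 0.5                   # hypothetical within-experiment spread

theta = rng.normal(eta_mean, eta_sd, size=k)  # stage 1: theta_i drawn from G_eta
x = rng.normal(theta, sigma)                  # stage 2: X_i drawn from F_{theta_i}
```

In this sketch the spread of `theta` reflects variation across experiments, while `sigma` controls variation within each experiment, mirroring the decomposition in the text.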
The appeal of Bayes empirical Bayes or hierarchical Bayes modeling is that information from all the experiments can be used for inference about the intermediate parameters θ_1, ..., θ_k as well as θ. The probabilistic formulation is tailor-made for a fully Bayesian analysis: by putting prior probability distributions on the third-level hyperparameters η and any nuisance parameters, all inferences can be carried out through posterior probability statements. The danger in these approaches, however, as noted by some authors, lies in misspecification of the prior and hyperprior distributions. Most seriously, there is the possibility of false association: perhaps θ should not be combined with θ_1, ..., θ_k at all. See Efron [5] for more information on these kinds of criticisms and for an empirical Bayes-likelihood approach to combining data.
In a recent paper, Samaniego and Neath (hereafter S&N, [12]) presented an efficient method for exploiting the past data (X_1, ..., X_k) along with the current observation X in the Bayesian estimation problem of θ, resulting in a Bayes empirical Bayes (BEB) estimator of θ. From a Bayesian perspective, they investigated under what circumstances their BEB estimator offers an improvement over the original Bayes estimator. They demonstrated that, in the traditional empirical Bayes framework and in situations involving exponential families, conjugate priors, and squared error loss, their BEB estimator of θ (which is based on the data X_1, ..., X_k and X) is superior to the original Bayes estimator of θ based only on X, thus achieving better Bayesian performance by combining relevant empirical findings with the subjective information in the prior distribution G_0. The performance of the S&N BEB estimator depends on the choice of the prior G_0, and it is not known how good their estimator is from a frequentist perspective. Generally speaking, however, besides having good Bayesian properties, estimators derived using Bayesian methods, such as BEB estimators, can have excellent frequentist properties (such as smaller frequentist risk) and can improve on estimators generated by frequentist- or likelihood-based approaches. See Carlin and Louis [2] for a more elaborate discussion of this point.
In this paper, we present another potentially useful BEB estimator, developed under the same setup as S&N. The proposed estimator of θ is shown to perform better than that of S&N in the sense of having a smaller Bayes risk. We further observe that our estimator is superior to the original Bayes estimator in more general situations than those of the exponential family-conjugate prior combination, showing the wider applicability of the proposed estimator. It is not known whether our strategy is the optimal way of combining data from past similar experiments; indeed, it is reasonable to expect that there may be other, better methods of exploiting the data (X_1, ..., X_k) from past experiments together with the current experiment. The next section contains the main results of this article. Section 3 contains a numerical example comparing the proposed estimator with the S&N estimator.

Bayes empirical Bayes estimator.
For convenience of notation, we assume throughout this section that the current experiment is the (k + 1)st experiment, so the problem is the Bayesian estimation of θ_{k+1} based on the data (X_1, ..., X_k) from the k past experiments as well as the current data vector X_{k+1}. The model is

X_i | θ_i ∼ F_{θ_i}, i = 1, ..., k + 1, (2.1)

θ_i ∼ G_0, i = 1, ..., k + 1, (2.2)

and it is further assumed that the pairs {(X_i, θ_i)}_{i=1}^{k+1} are mutually independent. Let G be a prior distribution on θ_{k+1} such that the Bayes estimator d_G(X_{k+1}) under squared error loss is given by

d_G(X_{k+1}) = α θ̂_{k+1} + (1 − α) E_G(θ), (2.3)

where α ∈ [0, 1) and θ̂_{k+1} denotes the uniformly minimum variance unbiased estimator (UMVUE) of θ_{k+1}. Based on the data (X_1, ..., X_k) from the past experiments, let G_k denote the prior distribution with prior mean c θ̄* + (1 − c) E_G(θ); that is, let G_k be the prior distribution on θ_{k+1} such that the Bayes estimator d_{G_k}(X_{k+1}) of θ_{k+1} under squared error loss is

d_{G_k}(X_{k+1}) = α θ̂_{k+1} + (1 − α)[c θ̄* + (1 − c) E_G(θ)], (2.4)

where θ̄* = (1/k) Σ_{i=1}^k θ̂_i and, for i = 1, ..., k, θ̂_i denotes the UMVUE of θ_i based on the observation vector X_i. S&N showed that d_{G_k} has smaller Bayes risk with respect to G_0 than d_G given by (2.3) when estimating θ_{k+1}; that is,

r(G_0, d_{G_k}) < r(G_0, d_G)

for any value of the constant c satisfying a condition that depends on A = |E_{G_0}(θ) − E_G(θ)|, where r(G_0, d) = E(d − θ_{k+1})², with E denoting expectation with respect to all the random variables governed by (2.1) and (2.2). This notation "E" is used in what follows without further mention.
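The shrinkage forms of d_G and d_{G_k} above can be made concrete with a short sketch. This is an illustration under our own naming, not the paper's code: `bayes_estimator` computes α θ̂_{k+1} + (1 − α) E_G(θ), and `sn_beb_estimator` replaces the prior mean by the S&N mixture c θ̄* + (1 − c) E_G(θ).

```python
import numpy as np

def bayes_estimator(theta_hat_current, prior_mean, alpha):
    # Original Bayes estimator d_G: shrink the current UMVUE toward
    # the prior mean E_G(theta), with weight alpha in [0, 1).
    return alpha * theta_hat_current + (1 - alpha) * prior_mean

def sn_beb_estimator(theta_hat_current, past_theta_hats, prior_mean, alpha, c):
    # S&N-style BEB estimator d_{G_k}: the prior mean is replaced by the
    # mixture c * mean(past UMVUEs) + (1 - c) * E_G(theta).
    theta_bar = float(np.mean(past_theta_hats))
    adjusted_mean = c * theta_bar + (1 - c) * prior_mean
    return alpha * theta_hat_current + (1 - alpha) * adjusted_mean
```

Note that with c = 0 the BEB estimator reduces to the original Bayes estimator; S&N's result identifies a range of c values for which the BEB estimator has strictly smaller Bayes risk.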

Theorem 2.1. The values w* = (w*_1, w*_2) of the weights that minimize the Bayes risk r(G_0, δ_{G_2,w}) of the BEB estimator δ_{G_2,w}(X_3) in (2.9) are given by expressions (2.10) and (2.11), stated in terms of constants a_1 and a_2.
With w = w* in (2.9), it is now clear that the following inequality holds:

r(G_0, δ_{G_2,w*}) ≤ min{ r(G_0, d_{G_2}), r(G_0, d_G) }, (2.12)
where r(G_0, d_{G_2}) and r(G_0, d_G) denote the Bayes risks with respect to G_0 of d_{G_2} and d_G, given by (2.8) and (2.3), respectively.

Proof. The derivation of w* = (w*_1, w*_2) is rather lengthy but straightforward, so we give only the main steps of the computation. Let w_0 = 1 − (w_1 + w_2). Then 0 ≤ w_0 < 1, w_0 + w_1 + w_2 = 1, and the BEB estimator δ_{G_2,w}(X_3), given by (2.9), takes the form

δ_{G_2,w}(X_3) = α θ̂_3 + (1 − α)(w_1 θ̂_1 + w_2 θ̂_2 + w_0 µ),

where µ = E_G(θ). Subject to the condition w_0 + w_1 + w_2 = 1, we minimize the Bayes risk, written as in (2.15), where CPT stands for the cross-product terms given by (2.16). Note that the second term on the right-hand side of (2.15) is equal to the product of (1 − α)² and the expression in (2.17); the second term on the right-hand side of (2.17) simplifies in turn, leading to the expression (2.19) for r(G_0, δ_{G_2,w}). We now minimize (2.19) subject to the restriction w_0 + w_1 + w_2 = 1. Let

r(w) = r(G_0, δ_{G_2,w}) + λ(w_0 + w_1 + w_2 − 1), (2.20)

where λ denotes the Lagrange multiplier and r(G_0, δ_{G_2,w}) is given by (2.19). Differentiating r(w) with respect to w_0, w_1, and w_2 separately and setting the derivatives equal to zero yields the three equations (2.21), (2.22), and (2.23), in which λ_1 = λ/[2(1 − α)²]. From (2.21), (2.22), and (2.23) we obtain w_1 and w_2 in terms of w_0 and λ_1, as given in (2.24) and (2.25). Substituting w_1 of (2.24) and w_2 of (2.25) into (2.21) and solving for w_0 gives (2.26), where a_1 and a_2 are as defined in the theorem. Then, from w_0 + w_1 + w_2 = 1, with w_1, w_2, and w_0 given by (2.24), (2.25), and (2.26), respectively, we obtain the solution (2.27) for λ_1. The proof is completed by substituting λ_1 of (2.27) into (2.24) and (2.25) and then using the unbiasedness of the estimators θ̂_i.

For general k, we consider estimators of θ_{k+1} of the form

δ_{G_k,w}(X_{k+1}) = α θ̂_{k+1} + (1 − α)( Σ_{i=1}^k w_i θ̂_i + (1 − Σ_{i=1}^k w_i) E_G(θ) ), (2.28)

where θ̂_i is the UMVUE of θ_i based on X_i, i = 1, ..., k, α ∈ [0, 1), and 0 ≤ w_i < 1, i = 1, ..., k, with 0 ≤ Σ_{i=1}^k w_i < 1. Again, the optimum values of w = (w_1, ..., w_k) may be obtained by minimizing the Bayes risk r(G_0, δ_{G_k,w}) = E(δ_{G_k,w}(X_{k+1}) − θ_{k+1})². The solution is given in the next theorem; we state the theorem without proof since the argument is similar to that of Theorem 2.1.
Theorem 2.2. Let r(G_0, δ_{G_k,w}) denote the Bayes risk of δ_{G_k,w}(X_{k+1}) given by (2.28). Then the values of w = (w_1, w_2, ..., w_k) that minimize r(G_0, δ_{G_k,w}) are obtained as the solution to the system of equations

A W = µ, (2.30)

where A is a (k + 1) × (k + 1) matrix and W and µ are (k + 1) × 1 vectors.
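Numerically, a linear system of the kind appearing in Theorem 2.2 is solved in one call. The entries of `A` and `mu` below are placeholders chosen only to give a well-conditioned example; they are not the paper's expressions for A and µ.

```python
import numpy as np

k = 2
A = np.array([[2.0, 0.5, 0.5],
              [0.5, 3.0, 0.5],
              [0.5, 0.5, 4.0]])   # placeholder (k + 1) x (k + 1) matrix
mu = np.array([1.0, 1.0, 1.0])   # placeholder (k + 1) x 1 right-hand side

W = np.linalg.solve(A, mu)       # weight vector solving A W = mu
assert np.allclose(A @ W, mu)    # verify W satisfies the system
```

For the paper's estimator one would build A and µ from the variances of the θ̂_i and the prior moments; the solve step itself is unchanged.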

A numerical example.
In this section, we apply the proposed method to a real data set in order to see how much improvement is gained by using the proposed BEB estimator over the S&N estimator. We employ the data provided by Efron [5, Table 1], based on forty-one randomized trials of a new surgical treatment for stomach ulcers conducted between 1980 and 1989 (Sacks et al. [11]). The kth experiment's data are recorded as the 2 × 2 table of counts (a_k, b_k, c_k, d_k), where a_k and b_k are the numbers of occurrences and nonoccurrences for the Treatment (the new surgery), and c_k and d_k are the numbers of occurrences and nonoccurrences for the Control (an older surgery). The true log-odds ratio in the kth experimental population is

θ_k = log [ P(Occurrence | Treatment) / P(Nonoccurrence | Treatment) ] − log [ P(Occurrence | Control) / P(Nonoccurrence | Control) ].
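The sample log-odds ratio and its usual large-sample (delta-method) standard deviation can be computed directly from the 2 × 2 counts. This is a standard formula shown as a sketch, not the paper's own code:

```python
import math

def log_odds_ratio(a, b, c, d):
    # Sample log-odds ratio from the 2 x 2 counts: a/b are the
    # occurrences/nonoccurrences for Treatment, c/d for Control.
    return math.log((a / b) / (c / d))

def log_odds_ratio_sd(a, b, c, d):
    # Usual large-sample standard deviation of the sample log-odds
    # ratio (sum of reciprocal cell counts, square-rooted).
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
```

For example, a balanced table with all four counts equal gives a log-odds ratio of zero, as expected.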
An estimate of θ_k is given by θ̂_k = log[(a_k/b_k) ÷ (c_k/d_k)], with estimated standard deviation SD_k reported in [5, Table 1]. In our computation, we acted as if the estimates θ̂_k were generated from a normal distribution with mean θ_k and standard deviation σ_k = SD_k, k = 1, ..., 41. The values of σ_1, σ_2, and σ_3 are σ_1 = 0.86, σ_2 = 0.66, and σ_3 = 0.68 from [5, Table 1]. We further assumed that the prior G_0 given by (2.2) is also normal, with mean µ_0 and standard deviation τ_0. Again from Efron's paper [5], we obtain µ_0 = −1.22 and τ_0 = 1.19 (see [5, equation (3.8)]). A computational expression for r(G_0, δ_{G_2,w*}) is easily obtained from (2.19) with w_1 and w_2 replaced by w*_1 and w*_2, respectively. In all our computations, we took the α appearing in (2.4) and (2.9) to be α = 0.6. Finally, we chose five values of µ = E_G(θ), namely µ = −0.5, 0, 1.0, 2.0, and 3.0, for the computation of r(G_0, δ_{G_2,w*}) and r(G_0, d_{G_2}), where r(G_0, d_{G_2}) denotes the Bayes risk with respect to G_0 of the S&N estimator d_{G_k} given by (2.4) with k = 2. The sample sizes of experiments 1, 2, and 3 in [5, Table 1] are n_1 = 28, n_2 = 35, and n_3 = 73, respectively. For the above specifications, we computed the efficiency of δ_{G_2,w*}(X_3) relative to d_{G_2}(X_3), that is, the ratio r(G_0, d_{G_2})/r(G_0, δ_{G_2,w*}). As percentages, the resulting values are 100.99, 101.56, 101.80, 102.56, and 110.30 (see the first row of Table 3.1), which shows the better performance of δ_{G_2,w*}(X_3) relative to d_{G_2}(X_3). Various other choices of (n_1, n_2, n_3) were also considered, and the relative efficiencies of the two estimators were computed; the results are given in Table 3.1. Various other values of µ were investigated as well, with results similar to those in Table 3.1. From Table 3.1 it is clear that when the difference µ − µ_0 is large, the efficiency of δ_{G_2,w*}(X_3) relative to d_{G_2}(X_3) is higher, as we would expect.
In all cases examined, we observe that δ_{G_2,w*}(X_3) is more efficient than d_{G_2}(X_3).
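The efficiency ratios above are ratios of Bayes risks r(G_0, d) = E(d − θ)². As a hedged sketch rather than the paper's exact computation, the following estimates such a risk by Monte Carlo for the original Bayes estimator d_G in a normal-normal setup, using Efron's µ_0 = −1.22 and τ_0 = 1.19 and the example's α = 0.6; the sampling standard deviation σ = 0.7 and the assumed prior mean µ_G = 0 are our own hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_bayes_risk(estimator, mu0=-1.22, tau0=1.19, sigma=0.7, n=200_000):
    # Monte Carlo estimate of the Bayes risk E(d - theta)^2 when
    # theta ~ N(mu0, tau0^2) (the prior G_0) and the UMVUE theta_hat
    # is normal about theta with standard deviation sigma.
    theta = rng.normal(mu0, tau0, size=n)
    theta_hat = rng.normal(theta, sigma)
    d = estimator(theta_hat)
    return float(np.mean((d - theta) ** 2))

alpha, mu_G = 0.6, 0.0   # alpha as in the example; mu_G = E_G(theta) assumed
risk_d_G = mc_bayes_risk(lambda t: alpha * t + (1 - alpha) * mu_G)
# Closed form for this setup:
#   alpha^2 sigma^2 + (1 - alpha)^2 (tau0^2 + (mu0 - mu_G)^2) ≈ 0.6411
```

Running the same routine on a competing estimator and forming the ratio of the two risks reproduces the kind of relative-efficiency percentages reported in Table 3.1.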