ON EMPIRICAL BAYES ESTIMATION OF MULTIVARIATE REGRESSION COEFFICIENT

We investigate the empirical Bayes estimation problem of multivariate regression coefficients under squared error loss function. In particular, we consider the regression model Y = Xβ+ ε, where Y is an m-vector of observations, X is a known m× k matrix, β is an unknown k-vector, and ε is an m-vector of unobservable random variables. The problem is squared error loss estimation of β based on some “previous” data Y1, . . . ,Yn as well as the “current” data vector Y when β is distributed according to some unknown distribution G, where Yi satisfies Yi = Xβi + εi, i = 1, . . . ,n. We construct a new empirical Bayes estimator of β when εi ∼N(0,σIm), i= 1, . . . ,n. The performance of the proposed empirical Bayes estimator is measured using the mean squared error. The rates of convergence of the mean squared error are obtained.


Introduction
In an EB problem (Robbins [15,16]), there is a sequence of independent random vectors (Y 1 ,β 1 ),(Y 2 ,β 2 ),...,(Y n+1 ,β n+1 ), where the Y j 's are observable whereas the β j 's are not observable.Furthermore, the β j s are independent identically distributed (i.i.d.) according to some unknown distribution G and, given β j = β, Y j is distributed according to some conditional distribution with density f (y | β).In this paper, we will assume that Y j and β j are related by a multiple linear regression model given by Y j = Xβ j + ε j , (1.1) j = 1,2,...,n + 1, where Y j is an m-vector of observations, X is a known m × k matrix, β j is a k-vector, and ε j is an m-vector of unobservable random variables.We will further assume that the conditional distribution of ε j , given β j = β, is a multivariate normal distribution N(0,σ 2 I m ), where σ 2 is an unknown constant and I m denotes the identity matrix of order m × m.The objective is to estimate β n+1 based on the observations (Y 1 ,...,Y n+1 ) under a squared error loss function given by where a 2 = a a = m i=1 a 2 i for any m-vector a = (a 1 ,...,a m ) .Denote β n+1 = β and Y n+1 = y.Let β n (y) denote an empirical Bayes (EB) estimator of β under the loss (1.2) based on (Y 1 ,...,Y n ) as well as Y n+1 = y.Then the sequence { β n } is said to be asymptotically optimal if the "regret" EL( β n ,β) − EL(β G ,β) → 0 as n → ∞, where expectation E is with respect to all the random variables involved, and β G denotes the Bayes estimator of β with respect to G under the loss function (1.2).
Martz and Krutchkoff [8] have considered the EB estimation problem in the general linear model of (1.1) above.They have, however, only studied their EB estimator of β through some Monte Carlo simulations and have shown that for certain priors their EB estimator performed better than the usual least squares estimator.Wind [21] has considered EB estimation of β when the error vector ε is assumed to have 0 mean and covariance σ 2 I m , but is not assumed to take a specific parametric form, for example, normal.For priors specifying their means and variances, he has exhibited restricted asymptotically optimal EB estimators of β.Singh [17] extended their work and exhibited two classes of estimators φ and φ for β, one for the case when nothing is known about the support of the prior, and the other for the case when it is known that the prior distribution has a compact support.He showed that φ is asymptotically optimal (a.o.) with rates O(n −1+η ) of the corresponding regret uniformly over a class of all priors satisfying certain moment conditions dependent on η, whereas φ is shown to be a.o. with rates O(n −1+η ) uniformly over the class of all priors with compact support.Singh's [17] results are for the case of known error variance σ 2 of ε.Recently, Zhang and Wei [22], Wei and Zhang [20], and Wei [19] extended Singh's work to the case of unknown error variance of ε.They discussed asymptotic optimality and convergence rates of their EB estimators of the regression coefficient β.
All of the works mentioned above have employed the regret as a measure of goodness in order to study the performance of their proposed EB estimators.In this paper, we study the performance of our EB estimator using the "mean squared error" (MSE) criterion.Our rationale to study the MSE of our EB estimator is motivated by the work of Pensky [11], where she has argued that MSE is a more appropriate criterion as far as applications are concerned.We will construct an EB estimator of β (for the unknown σ 2 case) based on some improved estimators of multivariate normal mixture density and its first partial derivatives.In the work of Zhang and Wei [22], Wei and Zhang [20], and Wei [19], they have used kernel-type estimators of the preceding functions.We show that the rate of convergence of the MSE of the proposed EB estimator is of the order O(n −1 (logn) m+1 ), showing a considerable improvement over the MSEs of the estimators presented in the above papers.
This paper is organized as follows.Section 2 contains the Bayes estimator of β under the loss function (1.2).Section 3 presents a new method of density estimation of a multivariate normal mixture and its first partial derivatives.Section 4 contains the proposed EB estimator and the main result.Two applications are discussed in Section 5. Section 6 gives some concluding remarks.The proofs are deferred to the appendix.

The Bayes estimator of regression coefficient
In this section, we derive the Bayes estimator of β under the loss function (1.2) for the model (1.1).First, we will assume that k < m and that X X is nonsingular throughout, where X is as defined in (1.1).Further, assume that the error variance σ 2 is bounded away from zero and finite, that is, 0 < γ 0 ≤ σ 2 ≤ γ 1 < ∞ for some known constants γ 0 and γ 1 .We assume further that the prior distribution G of β is a member of the family given by where ω 0 ≥ 2. Then the conditional density of Y given β is and the marginal density of Y with respect to G is given by By (2.2) and [19,Lemma A.1], we obtain that where ∇ f (y) is the gradient of f at y and is defined by Since (X X) is invertible by our assumption, from (2.5) we obtain Under the loss function (1.2), the Bayes estimator of β is the posterior expected loss, that is, where In view of (2.7) and (2.8), an estimator of δ G (y) can be constructed by developing estimators of f (y) and ψ(y) given by (2.3) and (2.8), respectively, based on a sequence of observations {Y i } n i=1 from f (y).This density estimation problem is discussed in the next section.

Estimation of a multivariate normal mixture density and its first partial derivatives
Let φ(t) denote the characteristic function of f (y) given by (2.3) above.Then, it is easy to show that where φ G denotes the characteristic function of the prior distribution G. Since |φ(t)|dt < ∞, by Fourier inversion theorem, we have here and in what follows, all integrals without limits are taken to be over R m , the mdimensional Euclidean space.It is easy to show from (3.1) and (3.2) that for any 0 where c > 0 is a constant, see the appendix.Since, for large M, the right-hand side of the inequality (3.3) is small, one can consider estimating dt with large M in order to estimate f (y) given by (3.2).We define where j=1 e it Yj , the empirical characteristic function of an i.i.d.sample Y 1 ,...,Y n from f (y) given by the formula (3.2).Then the f M (y) is an unbiased estimator of (1/ (2π) m ) [−M,M] m e −it y φ(t)dt.With an appropriate choice of M, as a function of n, f M (y) can be used to estimate density f (y) given by (3.2).A similar estimator has been studied by Singh and Pensky [18].This estimator provides optimal convergence rates, however, its common drawback is that it has heavy tails (it is not integrable since its Fourier transform is discontinuous).It would be therefore advantageous to use a smoother estimator that will have the same convergence rates and will decrease faster at infinity.Thus, we define the following modified estimator as our proposed estimator of f (y): where g M (t) = m i=1 g M (t i ) with g M (t i ) is given by where a = M − M −1 e −M 2 γ0/2 , and M will be chosen so that a satisfies 0 < a < M. Note that g M (t i ) is bounded and continuous on (−∞,∞), and vanishes outside [−M,M].Further, the Fourier transform of f M (y) is continuous on (−∞,∞).It can be shown that the mean squared error of f M (y) satisfies the following inequality: where c 1 and c 2 are the positive constants independent of n, y, and M (see the appendix).A similar bound holds for the mean squared error of f M (y).By choosing M = ((γ 0 m) −1 logn) 1/2 , we now obtain from (3.7) that for some constant c 3 > 0 independent of n and y.
Then an unbiased estimator of Again a smooth version of f (1)  N, j (y) is defined as where g N (t) = m i=1 g N (t i ) with g N (t i ) is given by (3.6) with M replaced by N. By choosing N = ((γ 0 m) −1 logn) 1/2 , it can be shown that (see the appendix) E f (1)  N, j (y for some constant c 4 > 0 independent of n and y, where ∂ f (y)/∂y j is given by (3.9).
Remark 3.1.The density estimators (3.5) and (3.11) present a multivariate extension of similar estimators that have been investigated by O'Bryan and Susarla [9] for the univariate case.Their estimators are, however, not smooth.The rates of convergence mentioned at (3.8) and (3.12) are much faster than those which have been obtained by using the kernel-type density estimators; see, for example, Wei [19] and Wei and Zhang [20].The reason for better rates attained here is that we have exploited the fact that the density f (y) to be estimated is a mixture of multivariate normal densities.

Empirical Bayes estimator and main results
In view of (2.7), an EB estimator of δ G (y) can be obtained by replacing ψ(y) in (2.7) with an estimator of ψ(y) given by (2.8) based on Y 1 ,...,Y n , given Y n+1 = y.For this purpose, following Pensky [11] and Penskaya [10], we introduce the function and define our proposed EB estimator of ψ(y) as where with f M (y) and f (1)  N, j (y) are given by (3.5) and (3.11), respectively, j = 1,...,m, and α n (y) is a positive number that is to be specified later.Note that ψ(y) defined by (4.2) depends on n but this is suppressed for notational convenience here and in what follows.
The (nuisance) parameter σ 2 in (2.7) is estimated by where γ 0 and γ 1 are as given in Section 2; see, before formula (2.1), and with H = I m − X(X X) −1 X and s = m − k.Note that H is an idempotent matrix such that rank(H) = s and s σ 2 (i) /σ 2 ∼ χ 2 (s) , the chi-square distribution with s degrees of freedom.The estimator σ 2 has been previously studied by Rao [14] under the same model as in (1.1).
Finally, our proposed EB estimator of δ G (y) given by (2.7) is defined by where ψ and σ 2 are given by (4.2) and (4.4), respectively.We will measure the performance of φ(y) by the quantity where the expectation E is with respect to density n i=1 f (y i ) with f (y) given by (2.3).In typical EB problems, the quality of an EB estimator is measured by the regret.In the present problem, the regret of φ is equal to R n (y) f (y)dy with R n (y) given by (4.7).However, Pensky [11] pointed out at least two advantages of using (4.7) compared to the regret.First, R n (y) enables one to calculate the mean squared error for the given observation Y n+1 = y which is the interesting quantity.Second, by using the risk function (4.7), we eliminate the influence on the risk function of the observations having very low probabilities.So, the use of R n (y) provides a way of getting EB estimators with better convergence rates.The next theorem is the main result of this paper, which establishes the rates of convergence of R n (y).Theorem 4.1.Let δ G (y) be given by (2.7) and let φ(y) be defined by (4.6) with M = N = ((γ 0 m) −1 logn) 1/2 in (3.5) and (3.11), respectively.Let α n (y) in (4.3) be given by α n (y) = n −2τ/(2τ+1) (logn) τ(4m+1+ε)/(2τ+1) for some number 0 < ε < 1 and τ is as defined in (4.1).Then Remark 4.2.The inequality established in Theorem 4.1 means that there exists some constant c 0 > 0 such that for every large n, The preceding inequality shows a very fast convergence rate for the pointwise mean squared error of the proposed EB estimator φ(y).It is easy to show that for the EB estimators of Zhang and Wei [22], Wei and Zhang [20], and Wei [19], which are based on kernel-type density estimates, the rate of the mean squared error is only of the order n −ρ with ρ (0 < ρ < 1) determined by the behavior of the prior distribution G. Thus, the proposed estimator has a faster convergence rate for the mean squared error compared to other competitors.
Remark 4.3.The purpose of introducing the function Δ(x, y,α) given at (4.1) is to construct a mean square consistent estimator of the ratio ψ(y) = Δ f (y)/ f (y) given by (2.8).
Other possible forms of estimators of a ratio are given in Penskaya [10].The "τ" in (4.1) is chosen as a positive integer because it makes 2τ an even positive number.Therefore, the ratio x 2τ /y 2τ in (4.1) is a well-defined quantity.In applications, one can simply take τ = 1.Furthermore, for w in (4.1), any positive number can be chosen so that 2τw ≥ 1 for the chosen value of τ.Thus, it is very easy to implement the proposed EB estimator φ(y) defined by (4.6) in practical applications.
Remark 4.4.In the model (1.1), we have assumed that E(ε j | β j ) = 0 and Var(ε j | β j ) = σ 2 I m , where σ 2 is an unknown constant and I m denotes the identity matrix of order m × m.The preceding variance assumption can be replaced by a more general assumption that Var(ε j | β j ) = σ 2 V , where V is a known nonsingular matrix.In this case, the proposed EB estimator defined at (4.6) takes the form where ψ(y) is given by (4.2) and σ 2 is given by (4.4) with σ 2 is replaced by σ 2 1 , where

Applications
In this section, we discuss two applications.In each case, the construction of the Bayes estimator as well as an empirical Bayes estimator is exhibited.

One-way classification.
Suppose we have data in I groups, with J i (i = 1,...,I) observations in each group as follows: ( The usual fixed-effects analysis of variance model for such a situation is where e i j are i.i.d.(independently and identically distributed) as N(0,σ 2 ).Then, in regression terms, we can write the model (5.2) as Since X X is a diagonal matrix with J i in the ith diagonal position and is zero elsewhere, its inverse is a diagonal matrix with 1/J i in the ith diagonal position.Then the Bayes estimator of β given by (2.7) takes the form where Ȳ = ( Ȳ1 ,..., ȲI ) , X and (X X) −1 are given by (5.4) and (5.5), respectively, and ψ(y) is given by (2.8).An empirical Bayes estimator of (5.6) can be constructed by replacing σ 2 and ψ(y) by estimators σ 2 and ψ(y) defined by (4.2) and (4.4), respectively, compare with (4.6).For instance, σ 2 defined by (4.5) now becomes where X given by (5.4), m = I i=1 J i , and {Y i } n i=1 are some independent auxiliary data satisfying the same form as (5.3), that is, Y i = Xβ i + ε i , i = 1,...,n.For ψ(y), one could use the estimator f M (y), given by (3.4), and its first partial derivatives and then employ 10 Empirical Bayes estimation the function (4.1) to form an estimator as in (4.2) and (4.3).Note that f M (y) is equal to where y = (y 1 ,..., y m ) , Y j = (Y j1 ,...,Y jm ) , j = 1,...,n, and K(x) = sinx/πx.Thus, f M (y) has the usual kernel estimator form in a multivariate setting.

Two
and so forth, and using the same approach which led to (5.3), we find that (5.9) can be expressed in the regression form where ε is N(0,σ 2 I m ), m = IJK, and X is m × IJ of rank IJ.Then the Bayes estimator of β given by (2.7) is obtained by using the X and Y of (5.9).Again, an empirical Bayes estimator can be constructed by replacing σ 2 and ψ(y) by their estimators based on some auxiliary data {Y i } n i=1 as in Section 5.1.

Concluding remarks
In this paper, we investigated empirical Bayes estimation in a multivariate regression model of the form Y = Xβ + ε.An empirical Bayes estimator of β was constructed with very fast rates of convergence of the corresponding mean squared error.The proposed empirical Bayes estimator is based on a newly developed multivariate density estimator of a normal mixture density and its first derivatives.The preceding results also extend similar results of the univariate case of a normal mixture density available in the literature.The results of the present paper can be easily extended to more complicated linear models such as The model and the estimator we developed in this paper are applicable in a wide variety of contexts.For instance, they can be used in the construction of a selection index for choosing individuals with a high intrinsic genetic value.For example, β i may represent unknown genetic parameters and Y i are observable characteristics on the ith individual, while p β i for given p is the genetic value to be estimated in terms of observed Y i .Early examples of such estimation by computing the regression of p β i on Y i (suggested by R. A. Fisher) are due to Fairfield Smith [6].A detailed study of the estimation problem from a decision theoretic view point with an estimated parametric prior distribution of β i is given by Rao [13].Some applications are given by Rao [12,13], and further developments of parametric estimation theory are developed by Rao [14], Efron and Morris [3][4][5], and Bunke and Gladitz [2], among others.

Appendix
Proof of (3.3).From (3.2), for any 0 < M < ∞, we have where c > 0 is a constant independent of n, y, and M.This completes the proof.