Bayesian Inference of a Multivariate Regression Model

We explore Bayesian inference of a multivariate linear regression model with use of a flexible prior for the covariance structure. The commonly adopted Bayesian setup involves the conjugate prior: a multivariate normal distribution for the regression coefficients and an inverse Wishart specification for the covariance matrix. Here we depart from this approach and propose a novel Bayesian estimator for the covariance. A multivariate normal prior for the unique elements of the matrix logarithm of the covariance matrix is considered. Such structure allows for a richer class of prior distributions for the covariance, with respect to strength of beliefs in prior location hyperparameters, as well as the added ability to model potential correlation amongst the covariance structure. The posterior moments of all relevant parameters of interest are calculated based upon numerical results via a Markov chain Monte Carlo procedure. The Metropolis-Hastings-within-Gibbs algorithm is invoked to account for the construction of a proposal density that closely matches the shape of the target posterior distribution. As an application of the proposed technique, we investigate a multiple regression based upon the 1980 High School and Beyond Survey.


Introduction
The multivariate multiple regression model is a natural extension of the univariate multiple regression model. The key difference, as the name implies, is that the univariate response variable is instead a multivariate response vector. By utilizing the multivariate multiple regression model the covariance of the response vector can be modeled. From an estimation standpoint van der Merwe and Zidek [1] suggest an intrinsic role to be played by the covariance structure, whereas, in the case of separate univariate multiple regression models, the covariance of the distinct response variables cannot be modeled. Although optimal point estimates of any linear combination of the means of the various response variables can still be obtained, an appropriate estimate of the variance of said estimator cannot be obtained without fully incorporating the covariance amongst the multivariate response vector. Under this framework multivariate analysis is required to most appropriately produce an estimate of the standard error.
As a particular example, in educational testing data when multiple subject area exams are administered it is common practice to simply report the sum of the individual exam scores as a total score. For instance, the metric most commonly associated with the Scholastic Aptitude Test (SAT) is simply the sum of the student's verbal score and the mathematical score. Other exams, such as the ACT exam, have even more than two subject areas and report a composite score which is the arithmetic average of the individual subject area scores. In these instances, multivariate analysis, by capturing the covariance amongst the various subject exams, is required to properly estimate the standard error of the final score.
Formal Bayesian analysis has long been used for multivariate multiple regression models [2]. The inverse Wishart is widely used in this respect, since it is a conjugate prior distribution for the multivariate normal covariance matrix [3][4][5]. Also see Dawid [6] for a general discussion of the inverse Wishart and Wishart distributions. However, in contrast to traditional Bayesian methods we will not make use of the standard inverse Wishart conjugate prior for the covariance matrix. The reason is that the inverse Wishart is a rather restrictive distribution in its ability to capture prior information that may be available to the practitioner.
Journal of Probability and Statistics

See Leonard and Hsu [7], Hsu et al. [8], and Sinay et al. [9] for a more detailed explanation of the disadvantages of the inverse Wishart as a prior distribution for the covariance matrix.
Leonard and Hsu [7] presented an alternative approach that remedies the shortcomings of the inverse Wishart and allows for greater flexibility in the prior specification. In a univariate normal model setting, the normal distribution has been used as a prior for the logarithm of the variance parameter. In this same vein, Leonard and Hsu [7] consider the matrix logarithm transformation of the covariance matrix for the multivariate case. Making use of a result from Bellman [10, page 171], it can be demonstrated that the exponential terms of a multivariate normal likelihood function can be expressed in the form of a Volterra linear integral equation. An approximation to a function that is proportional to the likelihood can then be obtained via Bellman's iterative solution to the Volterra integral equation. The resulting approximation has a multivariate normal form with respect to the unique elements of the matrix logarithm of the covariance matrix. This allows a normal prior specification to act as a conjugate prior distribution, thereby yielding an approximate normal posterior for the covariance structure.
One of the primary benefits of such a technique is the ability to specify varying degrees of confidence in each element of the normal prior hyperparameter mean vector via the variance terms of the prior hyperparameter covariance matrix. Obviously, larger variance terms in the prior hyperparameter covariance matrix indicate a lack of confidence in the corresponding prior location hyperparameter. Another chief advantage of this method is the ability to model beliefs about any possible interdependency between the covariance parameters. This can be accomplished by specifying the covariance terms of the prior hyperparameter covariance matrix. Note that in this way both the interrelationships and the strength of prior beliefs with respect to the covariance parameters can be modeled.
Bayesian estimates of all relevant parameters of interest are calculated using Markov chain Monte Carlo (MCMC) techniques. With respect to the covariance structure, since an approximate posterior distribution is used as a proposal density, we employ the Metropolis-Hastings-within-Gibbs (MHWG) algorithm [11, page 291] to correctly estimate the true target posterior density.
Having laid out the general outline we now move on to the body of the paper. We begin by introducing and defining the standard multivariate multiple regression model. We follow that by making a distributional assumption about the error matrix of the multivariate regression model. This provides us with a mechanism to state the likelihood function for the model. In turn, we then go through the formal analytical Bayesian derivations. Subsequent to the Bayesian analysis we outline the MCMC procedure and discuss how the posterior means and standard errors are numerically calculated. We conclude with an application to the High School and Beyond Survey [12].

Multivariate Multiple Regression Model
We consider the standard multivariate multiple regression model:

y_ij = ∑_{k=1}^{q} x_ik β_kj + ε_ij, i = 1, ..., n, j = 1, ..., p. (1)

Notationally, in matrix form we can succinctly write

Y = Xβ + ε, (2)

where Y is the (n × p) matrix of response variables, X is the (n × q) matrix of explanatory variables, β is the (q × p) matrix of unknown regression coefficients, and ε is the (n × p) matrix of random errors. In Section 2.1 we introduce the matrix normal distribution and make a general assumption about the distribution of the (n × p) random error matrix ε in (2). The matrix normal representation will greatly facilitate the Bayesian analysis that follows. Based upon the matrix normal, we proceed in Section 2.2 to develop the joint likelihood function for β and Σ. Hierarchical prior specifications are discussed in Section 2.3 and the joint posterior distribution is reported in Section 2.4.

Distributional Assumptions and the Matrix Normal Distribution.
The matrix normal distribution is closely related to, and is a generalization of, the multivariate normal. In particular, the (n × p) random matrix M ∼ MN(n×p)(Φ, Σ, Ω) if and only if the (np × 1) random vector Vec(M) ∼ N_np(Vec(Φ), Σ ⊗ Ω), where MN(n×p) denotes the (n × p) dimensional matrix normal distribution, Φ is an (n × p) location matrix, Σ is a (p × p) first covariance matrix, and Ω is an (n × n) second covariance matrix [13, page 54]. Vec(⋅) and ⊗ are the standard vector operator and Kronecker product, respectively. We make the distributional assumption that, conditional on the (p × p) covariance matrix Σ, the (n × p) random error matrix ε = (ε_1, ε_2, ..., ε_n)ᵀ in (2) follows a matrix normal with (n × p) zero mean matrix and covariance matrices given by Σ and I_n, where I_n is an (n × n) identity matrix. Formally, we have ε | Σ ∼ MN(n×p)(0, Σ, I_n), or equivalently, the error terms ε_1, ε_2, ..., ε_n are independent and identically distributed normal random vectors each with mean vector 0 and covariance matrix Σ. The probability density function for the error matrix is given by

f(ε | Σ) = (2π)^(−np/2) |Σ|^(−n/2) exp{−(1/2) tr[Σ⁻¹εᵀε]}, (3)

where tr(⋅) is the standard trace function.
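As a quick numerical sanity check of this equivalence, the sketch below (hypothetical NumPy code, not part of the paper) verifies that the quadratic form under the Kronecker covariance Σ ⊗ I_n equals the trace form tr[Σ⁻¹εᵀε] appearing in the matrix normal density:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3

# A positive definite "first" covariance matrix Sigma (p x p).
Z = rng.standard_normal((p, p))
Sigma = Z @ Z.T + p * np.eye(p)

# An arbitrary (n x p) matrix standing in for the error matrix.
M = rng.standard_normal((n, p))

# Column-stacking vec operator, matching Vec(.) in the text.
vec_M = M.flatten(order="F")

# Quadratic form under the Kronecker covariance Sigma ⊗ I_n ...
q_kron = vec_M @ np.linalg.inv(np.kron(Sigma, np.eye(n))) @ vec_M

# ... equals the trace form tr(Sigma^{-1} M' M) of the matrix normal density.
q_trace = np.trace(np.linalg.inv(Sigma) @ M.T @ M)

assert np.allclose(q_kron, q_trace)
```

This is the identity vec(M)ᵀ(Σ⁻¹ ⊗ I_n)vec(M) = tr(Σ⁻¹MᵀM) that links the two parameterizations.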
For a given value of β let S = n⁻¹(Y − Xβ)ᵀ(Y − Xβ). Note that S is a (p × p) symmetric almost surely positive definite matrix. Then the likelihood function (4) for Σ conditional on β can be written as

L(Σ | β, Y) ∝ |Σ|^(−n/2) exp{−(n/2) tr[Σ⁻¹S]}. (5)

In Bayesian analysis for a univariate normal model, the logarithm of the variance parameter has been modeled by a univariate normal prior distribution. In a multivariate setting the matrix logarithm of a covariance matrix has also been investigated by Chiu et al. [14]. Along these same lines, we consider the matrix logarithms of Σ and S:

A = log(Σ) = E log(D) Eᵀ, Λ = log(S) = E₀ log(D₀) E₀ᵀ, (6)

where E is a (p × p) orthonormal matrix whose columns are the normalized eigenvectors and D is the (p × p) diagonal matrix of the corresponding eigenvalues of Σ; E₀ and D₀ are defined analogously for S. Using the fact that A = log(Σ) from (6) and noting that |Σ| = exp{tr[A]}, we can express the exact likelihood function (5) in the following equivalent fashion:

L(A | β, Y) ∝ exp{−(n/2) tr[A] − (n/2) tr[exp(−A)S]}. (7)

We now define the following unconventional matrix operator Vec*(⋅). Let a_ij be the element in the ith row and jth column of the matrix A; then α = Vec*(A) stacks the p diagonal elements of A first, followed by the off-diagonal elements of the symmetric matrix A, so that

α = [α_1, ..., α_r]ᵀ (8)

is an (r × 1) vector with r = (1/2)p(p + 1). We analogously define λ = Vec*(Λ), which will appear again in Section 3.4. Moving forward we will use α, which captures the unique elements of the matrix logarithm of the covariance matrix, to model the covariance structure.
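The matrix logarithm and the Vec*(⋅) operator can be sketched as follows; the ordering of the off-diagonal elements inside Vec* is our assumption, since any fixed ordering of the r = p(p + 1)/2 unique elements serves the same purpose:

```python
import numpy as np

def mat_log(C):
    """Matrix logarithm of a symmetric positive definite matrix
    via its spectral decomposition C = E diag(d) E'."""
    d, E = np.linalg.eigh(C)
    return E @ np.diag(np.log(d)) @ E.T

def vec_star(A):
    """Stack the p diagonal elements of a symmetric matrix first,
    then the p(p-1)/2 above-diagonal elements (ordering assumed)."""
    p = A.shape[0]
    return np.concatenate([np.diag(A), A[np.triu_indices(p, k=1)]])

rng = np.random.default_rng(1)
p = 3
Z = rng.standard_normal((p, p))
Sigma = Z @ Z.T + p * np.eye(p)   # symmetric positive definite

A = mat_log(Sigma)
alpha = vec_star(A)
assert alpha.shape[0] == p * (p + 1) // 2   # r = p(p+1)/2

# exp(log Sigma) recovers Sigma, and |Sigma| = exp(tr A) as used in (7).
d, E = np.linalg.eigh(A)
Sigma_back = E @ np.diag(np.exp(d)) @ E.T
assert np.allclose(Sigma_back, Sigma)
assert np.isclose(np.linalg.det(Sigma), np.exp(np.trace(A)))
```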
2.3. Prior Distributional Specifications. We will assume a priori that β is independent of Σ. More specifically, with respect to β we make the assumption of a uniform prior distribution:

π(β) ∝ constant. (9)

Note that the uniform prior assumption for β can be viewed as a limiting case of the informative multivariate normal prior specification, which this modeling approach could accommodate.
We will assume a priori that, given ν and Υ, α follows an r dimensional normal distribution with mean location hyperparameter vector ν and covariance hyperparameter matrix Υ. The multivariate normal provides a very rich and flexible family of prior distributions for the matrix logarithm of the covariance structure. This adds far greater flexibility than the conventional inverse Wishart prior specification. Since the multivariate normal is fully parameterized by a mean vector and covariance matrix, we have the ability to model more complex prior information. In particular, we can specify different prior mean values for each element of α via the elements of the location hyperparameter ν. Moreover, we have the ability to model varying degrees of strength of the prior belief in each of the r elements of α through the r diagonal elements of the covariance hyperparameter matrix Υ. Additionally, with the multivariate normal prior we are able to model potential interdependency among the elements of α because we can specify nontrivial covariance terms in the covariance hyperparameter matrix. That is, the off diagonal elements of Υ can be used to specify any potential correlations amongst the elements of α.
We are now able to craft a more complex and accurate prior specification for the covariance structure. A subjective Bayesian may in fact wish to specify all r + (1/2)r(r + 1) hyperparameters. In this way, the practitioner can fully take advantage of any relevant prior information through use of the flexible multivariate normal prior specification for the covariance structure. Alternatively, we can opt to model ν = ν(θ) and Υ = Υ(Δ), where θ and Δ are of smaller order than ν and Υ, respectively. That is, a priori we may wish to only model certain subsets of the covariance structure. An obvious choice is to consider the variance components as one subset and the covariance components as another. However, we stress the point that the fully general multivariate normal prior specification can be utilized in its totality.
Here we will consider the intraclass matrix form for the prior specification as an example of the fully generalized multivariate normal prior distribution. Specifically, we will consider the first p elements of α separate from the remaining (r − p) terms. That is, we wish to model the variance components separately from the covariance components of α. Formally, we assume for the prior distribution

α | θ, Δ ∼ N_r(Jθ, Δ), (10)

where the (2 × 1) hyperparameter vector θ = [θ₁, θ₂]ᵀ and

Δ = diag(δ₁²I_p, δ₂²I_{r−p}). (11)

Note that J is an (r × 2) matrix whose first p elements of the first column are equal to one and whose remaining (r − p) terms of the first column are equal to zero. The second column of J consists of the first p elements equal to zero and the remaining (r − p) elements equal to one. In Δ, I_p and I_{r−p} are identity matrices of dimension p and r − p, respectively. Thus, θ₁ and δ₁² are the location and variance hyperparameters, respectively, for the variance components of α. Analogously, θ₂ and δ₂² are the location and variance hyperparameters for the covariance components of α. In this way we can specify two different location hyperparameters and two levels of confidence.
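A minimal sketch of the intraclass design (hypothetical helper name `intraclass_prior`), building J and Δ for given p, δ₁², and δ₂²:

```python
import numpy as np

def intraclass_prior(p, d1sq, d2sq):
    """Build the matrix J and block-diagonal Delta of the intraclass
    prior alpha | theta, Delta ~ N_r(J theta, Delta)."""
    r = p * (p + 1) // 2
    J = np.zeros((r, 2))
    J[:p, 0] = 1.0          # variance components load on theta_1
    J[p:, 1] = 1.0          # covariance components load on theta_2
    Delta = np.diag(np.concatenate([np.full(p, d1sq),
                                    np.full(r - p, d2sq)]))
    return J, Delta

J, Delta = intraclass_prior(p=3, d1sq=4.0, d2sq=1.0)
theta = np.array([0.5, -0.2])
mean = J @ theta
# The prior mean is theta_1 for every variance component and
# theta_2 for every covariance component.
assert np.allclose(mean[:3], 0.5) and np.allclose(mean[3:], -0.2)
assert Delta.shape == (6, 6)
```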
For the hyperparameters θ, δ₁², and δ₂², we will assume the following vague prior distribution:

π(θ, δ₁², δ₂²) ∝ constant. (12)

Note here that the uniform prior specification can be viewed as a limiting case of a multivariate normal and inverse Wishart prior specification for θ and Δ, respectively. Furthermore, the analysis could in fact accommodate such nontrivial specifications quite easily. Having stated all the prior distributional assumptions we now turn to the posterior Bayesian analysis. We begin this by first examining the exact joint posterior distribution.

Exact Joint Posterior Distribution.
The exact joint posterior distribution for all parameters and hyperparameters will be proportional to the product of the exact likelihood function (4), the prior distribution (9) for β, the prior distribution (10) for α, and the prior distribution (12) for θ and Δ.
Note that here we will use α interchangeably with Σ:

p(β, α, θ, Δ | Y) ∝ L(β, Σ | Y) π(α | θ, Δ) π(θ, δ₁², δ₂²). (13)

The θ term in (13) can be integrated out, so that

p(β, α, Δ | Y) ∝ L(β, Σ | Y) |Δ|^(−1/2) |JᵀΔ⁻¹J|^(−1/2) exp{−(1/2)(α − Jμ_θ)ᵀΔ⁻¹(α − Jμ_θ)}, (14)

where μ_θ = (JᵀΔ⁻¹J)⁻¹JᵀΔ⁻¹α.
We clearly see that the exact joint posterior distribution is in fact not tractable.This is the driving motivation behind the implementation of the numerical MHWG sampling techniques.Rather than working with the cumbersome exact joint posterior distribution it is much easier to consider the so-called full conditional distributions for each of the parameters and hyperparameters.

Markov Chain Monte Carlo Approach
As already noted, the joint posterior distribution in (14) is not analytically tractable with respect to drawing inference for the relevant parameters and hyperparameters. This will give rise to our consideration in the subsequent subsections of the full conditional posterior distributions. The conditional posterior distributions we derive below will provide the framework for the MHWG sampling techniques.

Exact Conditional Posterior Distribution for 𝛽.
Recall that β̂ = (XᵀX)⁻¹XᵀY is the maximum likelihood estimator of β. Furthermore, we can rewrite the exact likelihood function (4) as

L(β, Σ | Y) ∝ |Σ|^(−n/2) exp{−(1/2) tr[Σ⁻¹((Y − Xβ̂)ᵀ(Y − Xβ̂) + (β − β̂)ᵀXᵀX(β − β̂))]}. (15)

Therefore, the posterior distribution for β conditional on Σ will be proportional, with respect to only the terms involving β, to the product of the exact likelihood function (15) multiplied by the prior distribution (9) for β:

p(β | Σ, Y) ∝ exp{−(1/2) tr[Σ⁻¹(β − β̂)ᵀXᵀX(β − β̂)]}. (16)

We recognize that the posterior distribution of β conditional on Σ is of a matrix normal form. Making use of the relationship between the matrix normal and the multivariate normal distributions as stated above in Section 2.1, we have

β | Σ, Y ∼ MN(q×p)(β̂, Σ, (XᵀX)⁻¹), that is, Vec(β) | Σ, Y ∼ N_qp(Vec(β̂), Σ ⊗ (XᵀX)⁻¹). (17)
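The full conditional (17) is straightforward to sample. The sketch below (hypothetical `draw_beta` helper) draws Vec(β) from N(Vec(β̂), Σ ⊗ (XᵀX)⁻¹); in a noiseless toy problem the MLE β̂ recovers the true coefficients exactly:

```python
import numpy as np

def draw_beta(Y, X, Sigma, rng):
    """One Gibbs draw of beta | Sigma, Y under the flat prior:
    Vec(beta) ~ N(Vec(beta_hat), Sigma ⊗ (X'X)^{-1})."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y                  # MLE, (q x p)
    cov = np.kron(Sigma, XtX_inv)                 # column-stacking vec
    vec_draw = rng.multivariate_normal(beta_hat.flatten(order="F"), cov)
    return beta_hat, vec_draw.reshape(beta_hat.shape, order="F")

rng = np.random.default_rng(2)
n, q, p = 50, 2, 3
X = rng.standard_normal((n, q))
B_true = np.array([[1.0, 2.0, 3.0], [-1.0, 0.5, 0.0]])
Y = X @ B_true                                    # noiseless: MLE is exact
beta_hat, beta_draw = draw_beta(Y, X, np.eye(p) * 1e-12, rng)
assert np.allclose(beta_hat, B_true)
assert beta_draw.shape == (q, p)
```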

Exact Conditional Posterior Distribution for α.
The joint prior distribution for α, θ, and Δ is given by the product of the prior distribution (10) for α and the joint prior distribution (12) for θ and Δ. We can derive the joint prior distribution for just α and Δ by integrating over the joint prior distribution for α, θ, and Δ with respect to θ. Upon completion of the integration we have the following joint prior distribution for α and Δ:

π(α, Δ) ∝ |Δ|^(−1/2) |JᵀΔ⁻¹J|^(−1/2) exp{−(1/2)αᵀG*α}, (18)

where

G* = Δ⁻¹ − Δ⁻¹J(JᵀΔ⁻¹J)⁻¹JᵀΔ⁻¹. (19)

By integrating θ out we will help to facilitate the MCMC procedure both in terms of speed and simplification of the algorithm. The exact posterior distribution will be proportional to the product of the exact likelihood function (7) multiplied by the joint prior distribution (18) for α and Δ. Note that here we will use α interchangeably with A:

p(α | β, Δ, Y) ∝ exp{−(n/2) tr[A] − (n/2) tr[exp(−A)S] − (1/2)αᵀG*α}. (20)

Note that the above exact posterior distribution is not of a known form and cannot be directly simulated from in an easy fashion. This will motivate us to use a proposal density that closely matches this target density in a MHWG sampling routine.

Exact Conditional Posterior Distribution for Δ.
The exact posterior distribution for Δ = diag(δ₁²I_p, δ₂²I_{r−p}) conditional on α and β will be proportional to the joint prior distribution (18) for α and Δ. Note that the exact likelihood function (7) does not depend upon Δ and thus can be omitted entirely:

p(δ₁², δ₂² | α) ∝ (δ₁²)^(−(p−1)/2) exp{−(1/(2δ₁²)) ∑_{i=1}^{p}(α_i − ᾱ₁)²} (δ₂²)^(−(r−p−1)/2) exp{−(1/(2δ₂²)) ∑_{i=p+1}^{r}(α_i − ᾱ₂)²}, (21)

where ᾱ₁ = (1/p)∑_{i=1}^{p} α_i and ᾱ₂ = (1/(r − p))∑_{i=p+1}^{r} α_i are the arithmetic means of the variance and covariance components of α, respectively. We recognize that the posterior distributions for δ₁² and δ₂² conditional on α are independent Inverse Gamma random variables:

δ₁² | α ∼ Inverse Gamma((p − 3)/2, (1/2)∑_{i=1}^{p}(α_i − ᾱ₁)²), (22)
δ₂² | α ∼ Inverse Gamma((r − p − 3)/2, (1/2)∑_{i=p+1}^{r}(α_i − ᾱ₂)²). (23)

This result is intuitive and theoretically appealing in that the posterior distribution for δ₁², the variance hyperparameter for the variance components of α, depends only on the number of variance terms p and α_1, ..., α_p, whereas the posterior distribution for δ₂², the variance hyperparameter for the covariance components of α, depends only on the number of covariance terms r − p and α_{p+1}, ..., α_r. This draws out the point of modeling the variance components separate from the covariance components. In addition, the Inverse Gamma is highly tractable and lends itself to the numerical procedures in the subsequent section.
The full conditional posterior distributions for β, α, δ₁², and δ₂² are given by (17), (20), (22), and (23), respectively. However, we clearly see that the simulation of α based upon the true conditional posterior distribution in (20), the target distribution, is not tractable. Therefore, the Metropolis-Hastings algorithm is employed and a proposal density needs to be constructed. The algorithm works best if the proposal density closely matches the shape of the target distribution.
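A sketch of the inverse gamma updates follows; the shape constant is written with an explicit offset so it can be matched against (22) and (23) (the value 1.5, i.e., shape (m − 3)/2, is our reading under the uniform hyperprior and should be treated as an assumption):

```python
import numpy as np

def draw_delta_sq(components, shape_offset, rng):
    """Draw a variance hyperparameter from its inverse gamma full
    conditional.  The scale is half the sum of squared deviations of
    the relevant components of alpha from their mean; the shape is
    m/2 - shape_offset (an assumption to be matched against (22)-(23))."""
    m = components.size
    scale = 0.5 * np.sum((components - components.mean()) ** 2)
    shape = m / 2.0 - shape_offset
    # If X ~ Gamma(shape, rate=scale) then 1/X ~ InvGamma(shape, scale).
    return 1.0 / rng.gamma(shape, 1.0 / scale)

rng = np.random.default_rng(3)
p, r = 7, 28                        # p = 7 exams as in the application
alpha = rng.standard_normal(r)      # stand-in for the current draw of alpha
d1sq = draw_delta_sq(alpha[:p], 1.5, rng)   # variance components
d2sq = draw_delta_sq(alpha[p:], 1.5, rng)   # covariance components
assert d1sq > 0 and d2sq > 0
```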
We can construct an approximation to a function that is proportional to the likelihood function by utilizing the linear Volterra integral equation. The key advantage of this is that the approximation can be written as a multivariate normal with respect to the unique elements of the matrix logarithm of the covariance matrix. This allows for a multivariate normal to act as a conjugate prior. Hence, we have a multivariate normal posterior that is a good proxy for the true posterior, and the proposal can be easily simulated from. Interested readers should refer to Leonard and Hsu [7] and Hsu et al. [8] for a detailed exposition of how this is performed.
Leonard and Hsu [7] show how we can use Bellman's solution of a Volterra integral equation [10, page 171] to derive the following approximation that is proportional to the likelihood function of α given Y and β:

L̂(α | Y, β) ∝ exp{−(1/2)(α − λ)ᵀQ(α − λ)}. (24)

Recall from Section 2.2 that λ = Vec*(Λ), where Λ is as defined in (6). The (r × r) symmetric almost surely positive definite matrix Q is the likelihood information matrix of α and is a function of the eigenvalues and eigenvectors of S. In particular,

Q = (n/2) ∑_i ∑_j w(d_i, d_j) f_ij f_ijᵀ, (25)

where f_ij denotes the (r × 1) vector that satisfies the condition f_ijᵀα = e_iᵀAe_j, d_i and e_i for i = 1, ..., p are the ith eigenvalue and eigenvector, respectively, of S, and the scalar weights w(d_i, d_j) are functions of the eigenvalues; see Leonard and Hsu [7] and Hsu et al. [8] for their exact form. We see that the approximate likelihood function (24) is a multivariate normal form with respect to α. This functional form of the approximate likelihood function in (24) will be the driving mechanism in the Bayesian analysis for α. Again, for details of the derivation of the approximate likelihood function please refer to Leonard and Hsu [7] and Hsu et al. [8].
Equation (24) provides an excellent approximation to a function that is proportional to the exact likelihood (7) when the sample size n is large. To illustrate the effects of n, the sample size, and p, the dimension of the covariance matrix Σ, we conduct the following exercise. Without loss of generality, we consider a simplified model, where Y_1, ..., Y_n is a random sample from N_p(0, Σ). In our illustrative example, three dimension sizes (p = 3, 5, and 10) and four sample sizes (n = 20, 100, 500, and 5000) were considered for comparison. For a fair comparison, the same sample covariance is used for all four different sample sizes, for each p. The sample covariance matrix S is assumed to consist of elements s_ij = 1.0 − |i − j| × 0.1 for the ith row and jth column. For example, when p = 3,

S = [1.0 0.9 0.8; 0.9 1.0 0.9; 0.8 0.9 1.0].

The histograms in Figure 1 represent the exact likelihood from (7) of the (1, 1)th element, a_11, of A, where the histogram is normalized for comparison purposes, and the dotted curves represent the univariate normal densities based on approximation (24). Please note that these histograms are computed according to an importance sampling method using (24) as the importance function. For an overview of importance sampling methods, please see, for example, Rubinstein [15] and Leonard et al. [16]. It can be seen from Figure 1 that the approximation is better when the sample size n is bigger and the dimension size p is smaller. Similar patterns were found for other variance and covariance elements of the matrix A. The approximation is fairly accurate when n is 500 or greater.
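The illustration's sample covariance is easy to reproduce. The following sketch builds S from s_ij = 1 − 0.1|i − j| and confirms positive definiteness (needed for log(S) to exist):

```python
import numpy as np

def toeplitz_cov(p):
    """Sample covariance used in the illustration: s_ij = 1 - 0.1|i - j|."""
    idx = np.arange(p)
    return 1.0 - 0.1 * np.abs(idx[:, None] - idx[None, :])

S = toeplitz_cov(3)
assert np.allclose(S, [[1.0, 0.9, 0.8],
                       [0.9, 1.0, 0.9],
                       [0.8, 0.9, 1.0]])
# S must be positive definite for the matrix logarithm to exist.
assert np.all(np.linalg.eigvalsh(S) > 0)
```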
The approximate joint posterior distribution for α and Δ conditional on β will be proportional to the product of the approximate likelihood function (24) and the joint prior distribution (18):

p̂(α, Δ | β, Y) ∝ exp{−(1/2)(α − λ)ᵀQ(α − λ)} |Δ|^(−1/2) |JᵀΔ⁻¹J|^(−1/2) exp{−(1/2)αᵀG*α}. (28)

Completing the square for the two terms in the exponent of (28) yields the following approximate posterior distribution for α conditional on β and Δ:

p̂(α | β, Δ, Y) ∝ exp{−(1/2)(α − α*)ᵀ(Q + G*)(α − α*)}, (29)

where the (r × 1) vector α* = (Q + G*)⁻¹Qλ. Recall that Q and G* are as defined in (25) and (19), respectively. Thus, we have the following approximate posterior distribution for α conditional on β and Δ:

α | β, Δ, Y ∼ N_r(α*, (Q + G*)⁻¹), approximately. (30)

This demonstrates the conjugacy of utilizing the approximate likelihood function. Equation (30) provides an efficient proposal distribution for implementing a MHWG algorithm. In short, we have developed a highly flexible while at the same time tractable Bayesian methodology for the covariance structure.
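A Metropolis-Hastings step with an independence proposal of the form (30) can be sketched generically as below; `log_target` stands in for the log of (20), and the proposal moments stand in for α* and (Q + G*)⁻¹ (all names hypothetical):

```python
import numpy as np

def mh_step(alpha, log_target, prop_mean, prop_cov_chol, rng):
    """One Metropolis-Hastings step with an independence proposal
    N(prop_mean, prop_cov); log_target evaluates the (unnormalized)
    log of the exact conditional posterior."""
    z = rng.standard_normal(prop_mean.size)
    cand = prop_mean + prop_cov_chol @ z

    def log_prop(a):  # log proposal density, up to an additive constant
        resid = np.linalg.solve(prop_cov_chol, a - prop_mean)
        return -0.5 * resid @ resid

    log_ratio = (log_target(cand) - log_target(alpha)
                 + log_prop(alpha) - log_prop(cand))
    if np.log(rng.uniform()) < log_ratio:
        return cand, True
    return alpha, False

# Toy check: when the target equals the proposal, every move is accepted.
rng = np.random.default_rng(4)
mean, chol = np.zeros(2), np.eye(2)
target = lambda a: -0.5 * a @ a
alpha = np.array([5.0, -5.0])
alpha, accepted = mh_step(alpha, target, mean, chol, rng)
assert accepted
```

When the proposal closely matches the target, as (30) matches (20) for large n, the acceptance rate stays near one.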

Metropolis-Hastings within Gibbs Sampling Procedure.
Based upon the theoretical results derived above we outline the following procedure for implementing the MHWG algorithm. Specifically, from (17), (30), (20), (22), and (23) we have a formal setup for implementing a MCMC procedure with a Metropolis-Hastings step. Each sweep of the sampler proceeds as follows: (1) draw β from its matrix normal full conditional (17) given the current Σ; (2) draw δ₁² and δ₂² from the Inverse Gamma full conditionals (22) and (23) given the current α; (3) draw a candidate α from the approximate posterior (30) and accept or reject it against the target (20), updating Σ = exp(A) accordingly.
The last procedure in step (3) is the Metropolis-Hastings algorithm.We employ this procedure since we are utilizing an approximation to the exact posterior distribution [11, page 291].
Posterior moments for any parameter of interest can be calculated easily from the MHWG results. It is usually the case in MHWG estimation procedures that the first B simulations are not included in the estimates, due to the fact that the Markov chain has not yet reached a steady state. In the parlance of numerical procedures, B is usually referred to as the number of burn-in iterations.
In addition to the burn-in value, we also explored the potential need to perform a so-called thinning procedure. A thinning step entails only retaining every kth simulated value of the MHWG sampling, where k is chosen large enough so that any autocorrelation is removed. By examining the sample autocorrelation function (ACF) plots, practitioners can decide if thinning is necessary. We investigated several plots for numerous parameters. For illustrative purposes, in Figure 2, we present the sample ACF plots for a_11, a_12, θ₁, and δ₁². We can see that autocorrelation is not a significant concern in our particular application. Sample ACF plots for numerous other parameters looked quite similar. Practitioners should explore the need to use a thinning procedure in their specific applications.
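The sample ACF used for the thinning diagnostic can be computed directly; a sketch (hypothetical `sample_acf` helper) applied to an uncorrelated stand-in chain:

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation function of a 1-D chain of draws."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    return np.array([1.0] + [np.sum(x[k:] * x[:-k]) / denom
                             for k in range(1, max_lag + 1)])

# A strongly autocorrelated chain would motivate thinning; near-zero
# ACF at positive lags suggests thinning is unnecessary.
rng = np.random.default_rng(5)
chain = rng.standard_normal(10_000)      # stand-in for MHWG draws
acf = sample_acf(chain, max_lag=5)
assert acf[0] == 1.0
assert np.all(np.abs(acf[1:]) < 0.1)     # effectively uncorrelated
```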

Application: High School and Beyond Survey
In 1980 the National Education Longitudinal Studies program of the National Center for Education Statistics administered the High School and Beyond (HSB) Survey [12]. The HSB study contains both a 1980 senior class cohort and a 1980 sophomore class cohort. Within the sophomore class cohort we have a total sample size of n = 14,667 students. The HSB study has been analyzed extensively by many, for example, Astone and McLanahan [17], Grogger [18], St. John [19], and Zwick and Sklar [20].

Description of Data.
The HSB study contains a myriad of data and variables. In particular, for the sophomore class cohort a total of p = 7 exams were administered in the areas of vocabulary, reading, mathematics (two exams), science, writing, and civics. As is often the case with educational testing data, the test scores were standardized to have a mean of fifty and a standard deviation equal to ten. We will use these standardized test scores for the sophomore class cohort as our multivariate response variables. Grade point average data was also obtained from official transcripts that were included in the survey. In addition, a number of demographic variables were collected on both the school and individual student levels. These variables included school type, relative urbanization of school environment, geographic region of the country, gender, and race. All of these variables will serve as categorical or qualitative explanatory variables in the multivariate multiple regression. Table 1 provides a description of the categorical explanatory variables as well as the associated number of students per category. Note that the original HSB study did not include a school type four.
As can be expected with educational data there was some moderate degree of missing data. We employed a data augmentation technique for missing data imputation. The specific procedure was invoked using the mice library [21] in R. This particular data augmentation procedure fits nicely within the context of our research here since it employs a multivariate imputation by chained equations technique. That is, it uses a multivariate Gibbs sampler procedure to augment the missing data set. In particular, incomplete columns of data, that is, variables with missing data, are augmented by generating appropriate values of data given the values of the other columns of variables.

Treatment Contrast for Categorical Variables.
Characterization of the explanatory data or design matrix X to account for categorical explanatory variables is not unique [22, page 173]; there are actually several contrast methods. In our analysis, school type zero, which corresponds to a regular public school, will serve as the base level and will not in fact have its own regressor. In the same fashion, urban will serve as the base level for the relative urbanization categorical explanatory variable. Also, the New England region will act as the base level for the geographic variable. Hispanic or Spanish is the base level for the race identifier variable. Finally, we will designate male as the base gender level. Thus, after properly accounting for the base levels, in our particular application q = 26.
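Treatment contrasts drop the base level's indicator column so the intercept absorbs it. A small sketch (hypothetical `treatment_contrast` helper, not the coding used in the paper's software) illustrates:

```python
import numpy as np

def treatment_contrast(levels, base):
    """Indicator columns for a categorical variable, dropping the base
    level so that it has no regressor of its own (treatment contrasts)."""
    kept = [l for l in sorted(set(levels)) if l != base]
    return np.array([[1.0 if l == k else 0.0 for k in kept]
                     for l in levels]), kept

# Toy school-type variable with base level 0 (regular public school).
school = [0, 1, 2, 3, 0, 2]
Xs, kept = treatment_contrast(school, base=0)
assert kept == [1, 2, 3]        # base level has no regressor
assert Xs.shape == (6, 3)
assert np.all(Xs[0] == 0)       # base-level rows are all zeros
```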

Classical Multivariate Multiple Regression Results
We consider the standard multivariate regression model described in (2) for the HSB survey data within the sophomore class cohort. A total of n = 14,667 students in that cohort participated in the study. We estimated the model for the seven standardized exams VOCAB, READ, MATH 1, MATH 2, SCI, WRITE, and CIVIC regressed on GPA and the other explanatory variables described in the previous section. Most of the regression coefficient estimates are highly significant. In particular, GPA is highly significant for all seven exams, although certain levels of various categorical variables are not significant for some of the response variables. All of the associated F statistics are highly significant for all response variables. Moreover, based upon the analysis of variance that was performed, we can conclude that nearly all the explanatory variables are highly significant for all seven subject area exams. In practice, for the MCMC procedure, we found that a total iteration size of 500,000 was quite sufficient to establish convergence and we used B = 2,000 as the burn-in value.
From Table 4 we observe that the posterior means for the variance and covariance components are each slightly pulled towards central quantities. A similar phenomenon appeared for the off-diagonal elements (covariances). This is due to the fact that we have assumed the intraclass matrix form for the prior specification of α.
To further investigate the shrinkage property, we considered an informative prior distribution for δ₁² and δ₂² instead of the vague prior specification of (12). In particular, we assumed each ∼ Inverse Gamma(5000, 1). Specifically, observe that the posterior estimates for the variance of the MATH 2 and CIVIC exams are less than their associated classical frequentist estimates, whereas all other on-diagonal elements of the posterior mean estimate for Σ are greater than their respective classical frequentist estimates. An analogous statement concerning the covariance terms can also be made. This draws our attention to the notion of the Bayesian posterior mean as a compromise between the prior information and the information contained in the data.

Posterior Estimates of the Sum Total.
Of particular interest in educational testing data is some estimate of the overall summary or composite score of the individual subject area exams for a given set of explanatory variables. An obvious choice is to estimate the sum total of the individual exam scores. In a Bayesian framework, this can easily be accomplished by simply summing the individual posterior estimates of the seven subject area exams. However, the standard error associated with this estimate of the sum total cannot simply be calculated as the square root of the sum of the variances of the individual estimates. This would fail to incorporate the obvious covariance terms that exist amongst the subject area exams. Additionally, we would be overlooking any potential parameter uncertainty. Suppose for an individual we have the (p × 1) vector of response variables Y_h = [y_h1, ..., y_hp]ᵀ and the associated (q × 1) vector of explanatory variables x_h = [x_h1, ..., x_hq]ᵀ. Note that the linear model for a single observation can be expressed as Y_h = βᵀx_h + ε_h. Furthermore, the sum total score is T = 1ᵀY_h = ∑_{j=1}^{p} y_hj, where 1 = [1, ..., 1]ᵀ is a (p × 1) vector of ones. Then, given β and Σ, the total score T follows a univariate normal distribution with mean 1ᵀβᵀx_h and variance 1ᵀΣ1.
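Given β and Σ, the conditional moments of T are one line each; a toy sketch with hypothetical numbers shows why the covariance terms matter:

```python
import numpy as np

def total_score_moments(beta, Sigma, x_h):
    """Mean and standard error of the sum total T = 1'Y_h given beta
    and Sigma: T ~ N(1' beta' x_h, 1' Sigma 1)."""
    ones = np.ones(Sigma.shape[0])
    mean = ones @ beta.T @ x_h
    var = ones @ Sigma @ ones     # includes all covariance terms
    return mean, np.sqrt(var)

# Toy numbers: two exams, one predictor.
beta = np.array([[50.0, 60.0]])   # (q x p) with q = 1, p = 2
Sigma = np.array([[100.0, 60.0],
                  [60.0, 100.0]])
mean, se = total_score_moments(beta, Sigma, np.array([1.0]))
assert mean == 110.0
# A naive "sum of variances" would give sqrt(200); with the
# covariances included the variance is 1' Sigma 1 = 320.
assert np.isclose(se, np.sqrt(320.0))
```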
In order to more fully account for the added variability due to parameter uncertainty, we consider the unconditional mean and variance of the sum total for an individual, namely, E(T) = E(1ᵀβᵀx_h) and Var(T) = E(1ᵀΣ1) + Var(1ᵀβᵀx_h), where the outer expectations and variance are taken with respect to the joint posterior distribution of β and Σ and are estimated from the MHWG draws. For the individual under consideration this results in an estimated sum total of 361.9328 with a standard error of 44.2357.

Conclusion
In conclusion, we have demonstrated how a flexible prior specification for the covariance structure of a multivariate multiple regression model can provide a richer class of distributions than the inverse Wishart family.We discussed how the likelihood function for the covariance structure can be approximated based upon Bellman's solution of a linear Volterra integral equation.We discussed the shrinkage properties of the posterior mean of the covariance structure.This highlighted the concept of the posterior means as a compromise between the prior information and the information contained in the data.
All posterior estimates were calculated based upon the numerical results of a MHWG procedure.The Metropolis-Hastings algorithm was employed to account for sampling from an approximate posterior distribution for the covariance structure.

Figure 1: Comparison of the approximate likelihood (normalized) with the exact likelihood (normalized) for covariance matrices of dimensions p = 3, 5, and 10, and sample sizes n = 20, 100, 500, and 5000. The histogram is the normalized likelihood and the dashed curve is the approximate normalized likelihood.

Figure 2: Sample autocorrelation function (ACF) plots for a_11, a_12, θ₁, and δ₁².

4.4. Posterior Estimates of the Model Parameters.
In Tables 2, 3, 4, 5, and 6 we present the Bayesian posterior means and the associated standard errors for the matrices β and Σ.

Table 2: Posterior mean for β.

Table 5: Classical frequentist estimate of Σ.

If we make the element-wise comparison between the posterior mean in Table 4 and the classical frequentist estimate in Table 5, we see that, among all diagonal elements (variances), relatively smaller elements of the classical frequentist estimate are pulled up and relatively larger elements of the classical frequentist estimate are pulled down. For example, the estimated variance for MATH 1 moved up from 64.9922 in Table 5 to 65.0307 in Table 4, and the estimated variance for CIVIC moved down from 80.8225 in Table 5 to 80.8041 in Table 4. Table 7 presents the posterior mean for Σ with such an informative conjugate prior specification. Under this informative prior specification the variance and covariance components are each pulled more towards central quantities, respectively, in comparison to the elements of Table 4. For example, the estimated variance for MATH 1 moved further up from 65.0307 in Table 4 to 65.7643 in Table 7, and the estimated variance for CIVIC moved further down from 80.8041 in Table 4 to 79.0518 in Table 7. Under the informative prior specification for δ₁² and δ₂² we observe that the shrinkage property is more pronounced.

Table 7: Posterior mean for Σ with an informative prior.