Comparison of the Frequentist MATA Confidence Interval with Bayesian Model-Averaged Confidence Intervals

Model averaging is a technique used to account for model uncertainty, in both Bayesian and frequentist multimodel inferences. In this paper, we compare the performance of model-averaged Bayesian credible intervals and frequentist confidence intervals. Frequentist intervals are constructed according to the model-averaged tail area (MATA) methodology. Differences between the Bayesian and frequentist methods are illustrated through an example involving cloud seeding. The coverage performance and interval width of each technique are then studied using simulation. A frequentist MATA interval performs best in the normal linear setting, while Bayesian credible intervals yield the best coverage performance in a lognormal setting. The use of a data-dependent prior probability for models improved the coverage of the model-averaged Bayesian interval, relative to that using uniform model prior probabilities. Data-dependentmodel prior probabilities are philosophically controversial in Bayesian statistics, and our results suggest that their use is beneficial when model averaging.


Introduction
Historically, statistical inference has been based on a single model selected from among a set of predetermined candidate models, with no allowance made for model uncertainty.This process of model selection has been shown to produce biased estimators and result in the incorrect calculation of standard error terms [1][2][3][4].Recently, model averaging has gained popularity as a technique to incorporate model uncertainty into the process of inference [5][6][7].The use of model averaging has been studied in a variety of settings (e.g., [8,9]), where it generally exhibits favorable results relative to traditional model selection.
Model averaging is a natural extension in the Bayesian paradigm, where the choice of model is introduced as a discrete-valued parameter.A prior probability mass function is specified for this parameter, defining the prior probability of each candidate model.Posterior model probabilities are defined by the posterior distribution of the model parameter, and the posterior distributions for model parameters are not conditional upon a particular model and hence naturally account for model uncertainty [10,11].In practice, Bayesian model averaging is achieved by allowing a Gibbs sampler to traverse the augmented parameter space, which generates approximations to the posterior distributions of interest.Facilitated by recent advances in computation, Bayesian model averaging has been widely applied in a variety of application domains (e.g., [12][13][14]).
In the frequentist setting, a model-averaged estimate θ is defined as the weighted sum of single-model estimates: θ = ∑  =1   θ , where θ is the estimate under model   , model weights   are determined from an information criterion such as AIC, and the summation is over the set of  candidate models.
Several approaches to constructing frequentist modelaveraged confidence intervals have been suggested.Wald intervals of the form θ ±   ŝe( θ), where   is the (1 − ) quantile of the standard normal distribution, rely on accurate estimation of se( θ), the standard error of θ.Estimation of this term is complicated by the fact that the model weights and the single-model estimates are all random quantities.Burnham and Anderson [6] have suggested a variety of forms for ŝe( θ), which are studied by Claeskens and Hjort [7] and by Turek and Fletcher [15].In each of these studies, model-averaged Wald intervals of this form were found to perform poorly in terms of coverage rate.
An alternate methodology for the construction of frequentist model-averaged intervals is proposed by Turek and Fletcher [15].Here, each confidence limit is defined as the value for which a weighted sum of the resulting single-model Wald interval error rates is equal to the desired error rate.As this involves averaging the "tail areas" of the sampling distributions of single-model estimates, this new construction is called a model-averaged tail area Wald (MATA-Wald) interval.In a simulation study by Turek and Fletcher [15], the MATA-Wald interval outperformed model-averaged intervals of the form θ ±   ŝe( θ).Fletcher and Turek [16] applied the MATA construction to profile likelihood intervals to produce a model-averaged tail area profile likelihood (MATA-PL) interval.Coverage properties of MATA confidence intervals are also studied in Kabaila et al. [17], and a transformed version of the MATA interval was proposed by Yu et al. [18].
In this paper, we compare the performance of modelaveraged Bayesian credible intervals and the MATA-Wald and MATA-PL intervals of Turek and Fletcher [15] and Fletcher and Turek [16].The effect of using various model prior probabilities and parameter prior distributions on Bayesian intervals is considered.We also study the use of several information criteria to calculate frequentist model weights.A theoretical study of the asymptotic properties of these intervals is complicated by the random nature of the model weights.For this reason, we assess the performance of these intervals through a simulation study.
In Section 2, we define the Bayesian and frequentist model-averaged intervals.The differences between these intervals are shown in Section 3, through an example involving cloud seeding.We describe the simulation study used to compare these intervals in Section 4 and present the results of this study in Section 5. We conclude with a discussion in Section 6.

Model-Averaged Intervals
Assume a set of  candidate models {  } exists, where the parameter of interest  is common to all models.For data , let model   have likelihood function   (,   ), parameterized in terms of  and the nuisance parameter   , which may be vector-valued.We now define the Bayesian and frequentist model-averaged intervals for .

Bayesian Interval.
The model-averaged posterior distribution for  is where ( Evaluation of the integrals in ( 2) and ( 3) is generally difficult in practice, and Markov chain Monte Carlo (MCMC) simulation is used to approximate the posterior distributions of interest.In the multimodel case, this is implemented using the reversible jump MCMC (RJMCMC) algorithm [19].

Frequentist Interval.
The frequentist MATA intervals are constructed in a manner analogous to Bayesian model averaging.Confidence limits are defined such that the weighted sum of error rates under each single-model interval will produce the desired overall error rate.This utilizes model weights   , which are derived from an information criterion.
We initially focus on the information criterion AIC = −2 log L + 2 to define model weights, where L is the maximized likelihood and  is the number of parameters.Model weights are calculated as   ∝ exp(−ΔAIC  /2), where ΔAIC  ≡ AIC  − min =1,..., (AIC  ) and AIC  is the value of the information criterion for model   [20].Other choices of information criteria for defining model weights are addressed in the discussion in Section 6.

MATA-Wald Interval.
In the normal linear model, the confidence limits   and   of a single-model (1 − 2)100% Wald interval for  satisfy the equations where  ] (⋅) is the distribution function of the -distribution with ] degrees of freedom, ] is the error degrees of freedom associated with the model,   = ( θ −   )/ŝe( θ),   = ( θ −   )/ŝe( θ), and ŝe( θ) is the estimated standard error of θ [21,22].A MATA-Wald interval is constructed using a weighted sum of the single-model error rates.The lower and upper confidence limits of a MATA-Wald interval,   and   , are defined as the values satisfying where model   has ]  error degrees of freedom,  , = ( θ −   )/ŝe( θ ),  , = ( θ −   )/ŝe( θ ), and θ is the estimate of  under model   .
The MATA-Wald interval may be generalized to nonnormal data, assuming that we can specify a transformation  = () for which the sampling distribution of φ = ( θ ) is approximately normal when   is true.For example,  = logit() when  is a probability.In this case, the MATA-Wald confidence limits   and   are the values satisfying the pair of equations where Φ(⋅) is the standard normal distribution function,  , = ( φ −   )/ŝe( φ ),  , = ( φ −   )/ŝe( φ ),   = (  ), and   = (  ), as set out by Turek and Fletcher [15].

MATA Profile Likelihood Interval
. Assuming a single model with likelihood function (, ), the limits   and   of a (1 − 2)100% profile likelihood interval for  satisfy where () is the signed likelihood ratio statistic, defined as and   () = max  (, ) is the profile likelihood function for  [23, p. 126-129].The limits   and   of the MATA-PL interval are defined as the values which satisfy where   () is defined in terms of the corresponding likelihood function   (,   ), as in (8), and as described by Fletcher and Turek [16].

Example
We use a study of cloud seeding to illustrate the differences between these methods of model averaging.There is clear evidence that seeding clouds causes an increase in the mean volume of rainfall [24][25][26].However, the size of this effect may depend on the pattern of motion of the clouds.As rainfall volume has agricultural impacts, the results may affect the practicality and focus of cloud seeding operations.The data we consider come from testing conducted by the Experimental Meteorology Laboratory in Florida, USA.Total rainfall volume was measured for 27 stationary clouds, 16 of which were seeded and 11 of which were unseeded.The full data set appears in Biondini [27], and the subset relevant to our analysis is presented in Table 1.
Suppose that we aim to predict the expected rainfall from seeded, stationary clouds.The lognormal distribution can provide a good model for total rain volume [27].Denote the volume of rainfall from seeded, stationary clouds as   , where log   ∼ (  ,  2 ), and the volume of rainfall resulting from unseeded, stationary clouds as   , where log   ∼ (  ,  2 ).Let the quantity of interest be the expected rain volume resulting from the seeded clouds,   ≡ [  ] = exp(  +  2 /2), and we consider the following two models: 2 :   and   unspecified.
In the Bayesian analyses, we used a vague (0,  2 = 100 2 ) prior distribution for parameters   and   , a uniform prior distribution on the interval (0, 100) for  [28], and an equal prior probability for each model.We ran an MCMC algorithm for 300,000 iterations, with a 5% burn-in period.
Convergence was assessed using the Brooks-Gelman-Rubin (BGR) diagnostic on two parallel chains [29,30].This indicated convergence for each model, with all BGR values being less than 1.008.Frequentist models were fit using maximum likelihood.Since we are interested in prediction of   , each likelihood function was reparameterized using log   −  2 /2 in place of   and log   −  2 /2 in place of   .The MATA-Wald interval was constructed according to (6) and the MATA-PL interval following (9), both of which used AIC weights for   .
The resulting Bayesian posterior model probabilities were ( 1 | ) = ( 2 | ) = 0.50, which were equal to the model prior probabilities to two decimal places.The AIC weights slightly favored  2 , with  1 = 0.38 and  2 = 0.62. Figure 1 shows the predicted mean rain volume θ from seeded, stationary clouds, with 95% confidence intervals.Predictions and confidence intervals are shown for singlemodel inferences under  1 and  2 , as well as using model averaging.
The Bayesian posterior mean and the maximum likelihood estimate for predicted rainfall are reasonably similar, with the Bayesian estimate being approximately 15% higher under each model.As expected, all estimates under  2 (where seeding may cause increased rainfall) are greater than those under  1 .
The differences between methods are highlighted by confidence intervals for the expected rainfall.All lower limits are reasonably similar, while the upper limits from the Bayesian analyses are significantly higher than those from the frequentist analyses.This is particularly true under  2 and also when model averaging, where the MAB interval is 62% wider than the MATA-Wald interval.The MAB interval produces a visually appealing compromise between the single-model Bayesian intervals, especially when considering the high degree of model uncertainty.
Each profile likelihood interval is slightly more asymmetric than the corresponding Wald interval, as one would expect.The frequentist model-averaged intervals again produce a pleasing compromise between the separate inferences under each model.In light of the model uncertainty present, it would seem appropriate to use one of the model-averaged intervals to summarize the results of this analysis.It is important to realize the generalizability of the analysis presented here.The same approach is equally applicable to any data analysis situation in which there is model uncertainty, meaning that the true, underlying data-generating model is unknown.
The performance of each method was assessed by the actual coverage rate achieved, defined as the proportion of simulations for which  ∈ [  ,   ].We averaged results over 20,000 simulations, ensuring a standard error for the coverage rate less than 0.3%.In addition, we calculated the mean interval width of each method, defined as   −   .All calculations were performed in R, version 2.13.0 [31].

Bayesian Implementation.
Three sets of prior probabilities were considered, for the construction of three distinct model-averaged Bayesian intervals.The first Bayesian interval (MAB) used equal prior probabilities for each model and "flat" prior distributions for the parameters,   ∼ (0,  2 = 100 2 ) and  ∼ Uniform(0, 100), as suggested by Gelman [28].The second interval (MAB J ) used equal model prior probabilities and improper Jeffreys' prior distributions [32] for the parameters: (  ) ∝ 1 and () ∝ 1/ (see, e.g., [33]).The third interval (MAB KL ) used flat prior distributions for the parameters and the Kullback-Leibler (KL) prior probability for each model, defined as where   is the number of parameters in model   [6, p. 302-305].The KL model prior is a Bayesian counterpart to frequentist AIC model weights, being designed to produce posterior model probabilities asymptotically equal to AIC model weights.
A Gibbs sampler was implemented in R, using the RJMCMC algorithm.Convergence of two parallel chains was again assessed using the BGR convergence diagnostic.Simulations which failed to converge after 100,000 iterations (BGR > 1.1) were discarded.In total, 99.7% of the simulations were retained, with a maximum BGR value of 1.099 and a mean BGR value of 1.007.The initial 5% of each simulation was discarded as burn-in.

Frequentist Implementation.
Frequentist model-averaged intervals were constructed using AIC weights, as defined in Section 2.2.For the normal linear simulation, the MATA-Wald interval was constructed using ( 5) and the lognormal MATA-Wald interval following (6).The MATA-PL interval was defined according to (9), using the reparameterized likelihood in the lognormal case.Numerical solutions to these equations were found using the R root-finding command uniroot.

Results
In the normal linear setting, the results for  1 and  2 are identical by symmetry.In addition, in the lognormal setting, the results were qualitatively similar for  1 and  2 .Therefore, for simplicity, we focus on the results for  2 .Figure 2(a) shows the estimated coverage rate for the MATA-Wald, MATA-PL, and MAB intervals.The MATA-Wald interval performs best, in particular for small sample sizes, followed by the MATA-PL and MAB intervals.All intervals asymptotically approach the nominal coverage rate of 95%.We would expect the MATA-Wald interval to perform well, since  2 is the generating model, and the Wald interval based on this model will achieve exact nominal coverage in this setting.To observe the tradeoff between coverage rate and interval width, coverage is also plotted against mean interval width.For comparable interval width, the MATA-Wald interval achieves notably improved performance.
Figure 2(b) provides the same comparison for the Bayesian MAB, MAB J , and MAB KL intervals.The MAB KL interval provides a noticeable improvement in coverage performance, as compared to the MAB and MAB J intervals, each of which uses equal model prior probabilities.Use of the KL prior probability for models in the MAB KL interval provides an improvement of almost 2% in coverage rate for small sample sizes.This improvement comes at no noticeable increase in interval width.In addition, the use of Jeffreys' prior distributions for the parameters slightly degrades the performance of the Bayesian interval, relative to the use of flat prior distributions.
Figure 3 provides analogous comparisons in the lognormal setting.The MAB interval outperforms the frequentist intervals for small sample sizes, although it requires a substantial increase in interval width.The MAB interval remains roughly within 1% of the nominal coverage rate for all sample sizes, while the frequentist intervals deviate by as much as 3%.The MATA-Wald interval performs better than the MATA-PL interval, with both exhibiting comparable interval widths.
Comparison of the Bayesian intervals in the lognormal setting is qualitatively similar to that of the normal linear setting.The use of the KL prior probability for models in the MAB KL interval provides an improvement over the use of equal prior probabilities, and here the use of Jeffreys' prior distributions for the parameters severely degrades performance, relative to the use of flat prior distributions.Overall, the Bayesian interval using KL model prior probabilities outperforms all other model-averaged interval constructions in the lognormal setting.

Discussion
The aim of this paper has been to compare the performance of Bayesian and frequentist model-averaged confidence intervals.The frequentist MATA intervals are based upon model averaging the error rates of single-model intervals, rather than constructing an interval around a model-averaged estimator.This construction is analogous to Bayesian model averaging, and the idea was initially motivated using an analogy to a model-averaged Bayesian interval [16].The MATA construction was studied further in Turek and Fletcher's work [15], where it is shown that, asymptotically, a MATA interval will converge to the single-model interval based upon the candidate model with minimum Kullback-Leibler distance to the true, generating model.
Through simulation, the frequentist MATA-Wald interval produced the best coverage properties in the normal linear setting, where we would expect Wald intervals to perform well.In the lognormal setting, Bayesian intervals produced substantial improvement over the frequentist intervals.A Bayesian analysis fully allows for parameter uncertainty and does not rely on the asymptotic distributions of estimators.So long as we are willing to accept the prior distributions for the parameters, we might expect the Bayesian approach to be better suited for nonnormal settings.In contrast, when the assumptions of Wald intervals are satisfied exactly (as with normal data), use of the frequentist MATA-Wald interval resulted in improved coverage performance.
In both settings, the use of KL prior probabilities provided a noteworthy improvement in the performance of the Bayesian interval, when compared to the use of equal model prior probabilities.The KL model prior is designed to produce posterior model probabilities approximately equal to frequentist AIC model weights.This agreement between posterior probabilities and model weights was observed in our simulation.
Burnham and Anderson [6] describe prior probabilities which depend upon sample size and model complexity, such as the KL prior, as "savvy priors," and argue in favor of their use.Larger data sets have the potential to support more complex models, which may justify assigning model prior probabilities dependent upon the data available and the relative complexity of the models being considered.
In contrast, Link and Barker [34] argue that for large sample sizes the data ought to completely dominate the priors, and the use of prior probabilities which depend upon  the sample size may prevent this from occurring.They also argue that prior probabilities should represent one's beliefs prior to data collection and have no dependence upon the data observed.This is consistent with Box and Tiao [33], where a prior is defined as "what is known about [a parameter] without knowledge of the data."This discrepancy in what a prior probability may represent is interesting, especially considering that data-dependent priors were seen to be advantageous for Bayesian model averaging.
Thus far, we have presented results for frequentist MATA intervals constructed using AIC model weights.Two alternate information criteria were also considered: AIC c [35] and BIC [36].AIC c was derived as a small-sample correction to AIC and in certain contexts may be favorable for use in model selection [37].BIC provides an asymptotic approximation to Bayes factors and may also be used to approximate the posterior model probabilities which result from equal model priors [34].In our study, the MATA intervals based upon AIC c and BIC weights were consistently inferior to those using AIC weights.This was true in both simulation settings, and also for small sample sizes, when one may have expected AIC c to perform best.This result is consistent with the findings of Fletcher and Dillingham [38], in which model-averaged intervals constructed using AIC weights yielded improved coverage properties over a variety of other information criteria, including both AIC c and BIC.
Our study has used the assumption that "truth is in the model set."This assumption is also used in the derivations of the MATA-Wald and MATA-PL intervals, as well as generally in Bayesian multimodel inference.We do not feel that this assumption undermines our conclusions, since all model averaging techniques would be adversely affected when this assumption is not met.
Our simulation has also followed the assumption that "the largest model is truth."Philosophically this may not pose a problem, as Burnham and Anderson [6] believe that nature is arbitrarily complex, and it is unrealistic to assume that we might fully characterize the underlying process.From this viewpoint, model selection attempts to identify the most parsimonious approximating model to truth, given the data available.This assumption may in part explain the superior performance of AIC model weights, since AIC is known to favor increased model complexity [39].However we do not consider this an issue, since results from Fletcher and Turek [16] indicate that intervals using AIC weights perform at least as well as those using other information criteria when the most complex model is not the generating model.Furthermore, all simulations presented herein were repeated using data generated under the simpler of the two candidate models ( 1 ).In these simulations all modelaveraged constructions performed similar to one another and achieved very near to the nominal coverage rates.This would be expected, since model averaging takes place over two models, both of which now represent truth.
Any simulation study is inherently limited in scope.We have considered both normal and nonnormal data, as well as a wide range of sample sizes, and observed consistent patterns throughout.Bayesian model averaging was better suited for the nonnormal setting, and the frequentist MATA-Wald interval performed best in the normal linear setting.In addition, the performance of model-averaged Bayesian intervals was improved through use of the KL model prior, a data-dependent prior probability.This result raises consideration of exactly what Bayesian priors represent; in particular, whether or not knowledge of the size of an observed sample provides grounds to update model prior probabilities.

3 )Figure 1 :
Figure 1: Expected mean rainfall for seeded, stationary clouds, under each model and using model averaging.Vertical bars show 95% intervals for each prediction.Intervals shown: Bayesian and MAB (purple), Wald and MATA-Wald (orange), and profile likelihood and MATA-PL (blue).

Table 1 :
Rain volume data recorded in 1968 and 1970 cloud seeding experiments.All clouds are stationary and are categorized as seeded or unseeded.Rain volume is measured in thousands of cubic meters (10 3 m 3 ).