A Focused Bayesian Information Criterion

Myriads of model selection criteria (Bayesian and frequentist) have been proposed in the literature, aiming at selecting a single model regardless of its intended use. An honorable exception in the frequentist perspective is the "focused information criterion" (FIC), which aims at selecting a model based on the parameter of interest (focus). This paper takes the same view in the Bayesian context; that is, a model may be good for one estimand but bad for another. The proposed method exploits the Bayesian model averaging (BMA) machinery to obtain a new criterion, the focused Bayesian model averaging (FoBMA), for which the best model is the one whose estimate is closest to the BMA estimate. In particular, for two models, this criterion reduces to the classical Bayesian model selection scheme of choosing the model with the highest posterior probability. The new method is applied in linear regression, logistic regression, and survival analysis. This criterion is especially important in epidemiological studies, in which the objective is often to determine a risk factor (focus) for a disease, adjusting for potential confounding factors.


Introduction
A variety of model selection criteria (Bayesian or frequentist) have been proposed in the literature; most of them aim at selecting a single model for all purposes. For an overview of model selection criteria, see the studies by Leeb and Poetscher [1] and Zucchini [2]; for inference after model selection, see the studies by Nguefack-Tsague [3], Nguefack-Tsague and Zucchini [4], Zucchini et al. [5], Behl et al. [6], and Nguefack-Tsague [7][8][9]. Allen [10], within the context of Mallows' $C_p$ [11], developed a criterion that depends on a given prediction. In a frequentist approach, Claeskens and Hjort [12] developed a focused information criterion (FIC) for model selection which, unlike common model selection criteria that lead to a single model for all purposes, selects different models for different purposes. Thus Allen's criterion can be considered an early precursor of the FIC.

The FIC is gaining in popularity, as evidenced by its applications in various fields and specific models. These applications include missing response (Sun et al. [13]), energy substitution (Behl et al. [6]), economic applications (Behl et al. [14]), the Tobit model (Zhang et al. [15]), additive partial models (Zhang and Liang [16]), volatility forecasting (Brownlees and Gallo [17]), and Cox proportional hazard regression models (Hjort and Claeskens [18]). The focused information criterion combined with model averaging can be found in the studies by Sueishi [19] and Sun et al. [13]. More recent developments of the FIC for quantile regression can be found in the studies by Du et al. [20], Behl et al. [21], and Xu et al. [22].

The motivation for the new method is that this concept appears, to our knowledge, to be virtually unknown and overlooked in Bayesian model selection; thus, there is a need to develop a Bayesian counterpart. The present paper is organized as follows. Section 2 presents the concept of Bayesian model averaging and model selection, while Section 3 introduces the new criterion. Section 4 provides practical examples, while Section 5 provides discussions. The paper ends with concluding remarks.

Bayesian Model Selection and Model Averaging
2.1. Framework. Consider a situation in which some quantity of interest, $\mu$, is to be estimated from a sample of observations that can be regarded as realizations from some unknown probability distribution and that, in order to do so, it is necessary to specify a model for the distribution. There are usually many alternative plausible models available and, in general, they each lead to different estimates of $\mu$. Consider a sample of data, $y$, and a set of $K$ models $\mathcal{M} = (M_1, \ldots, M_K)$, which we will assume to contain the true model $M_t$. Each $M_k$ consists of a family of distributions $p(y \mid \theta_k, M_k)$, where $\theta_k$ represents a parameter (or vector of parameters).

The prior probability that $M_k$ is the true model is denoted by $p(M_k)$ and the prior distribution of the parameters of $M_k$ (given that $M_k$ is true) by $p(\theta_k \mid M_k)$. Conditioning on the data $y$ and integrating out the parameter $\theta_k$, one obtains the following posterior model probabilities:

$$p(M_k \mid y) = \frac{p(y \mid M_k)\, p(M_k)}{\sum_{l=1}^{K} p(y \mid M_l)\, p(M_l)}, \tag{1}$$

where

$$p(y \mid M_k) = \int p(y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k \tag{2}$$

is the integrated likelihood under $M_k$. If $p(\theta_k \mid M_k)$ is a discrete distribution, the integral in (2) is replaced by a sum. From the Bayes factor framework, the Bayes factor (Kass and Raftery [23]) for $M_k$ versus model $M_l$ is defined by

$$B_{kl} = \frac{p(y \mid M_k)}{p(y \mid M_l)}. \tag{3}$$

Bayesian Model Selection.
Model $M_k$ is chosen if $B_{kl} > 1$. Under certain assumptions and approximations (in particular the Laplace approximation) and taking all candidate models as a priori equally likely to be true, this leads to the Bayesian information criterion (BIC), also known as the Schwarz criterion [24]. More information on Bayesian model selection and applications can be found in the studies by Guan and Stephens [25], Nguefack-Tsague [26], Carvalho and Scott [27], Fridley [28], Robert [29], Liang et al. [30], and Bernardo and Smith [31].
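The link between BIC and posterior model probabilities can be illustrated with a minimal sketch. It assumes the common convention in which BIC is minus twice the maximized log-likelihood plus a complexity penalty (so smaller is better), under which the integrated likelihood is approximately proportional to exp(-BIC/2). The function names and the BIC values below are hypothetical and are not taken from the paper, whose reported BIC values may follow a different sign convention.

```python
import math

def posterior_model_probs(bic_values, prior_probs=None):
    """Approximate p(M_k | y) from BIC values, assuming p(y | M_k) ~ exp(-BIC_k / 2)."""
    k = len(bic_values)
    if prior_probs is None:
        prior_probs = [1.0 / k] * k  # uniform prior over models
    # Work on the log scale and subtract the maximum for numerical stability.
    log_w = [-0.5 * b + math.log(p) for b, p in zip(bic_values, prior_probs)]
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    return [x / total for x in w]

def approx_bayes_factor(bic_k, bic_l):
    """Approximate Bayes factor of model k versus model l from their BIC values."""
    return math.exp(-0.5 * (bic_k - bic_l))

bics = [100.0, 102.0, 110.0]  # hypothetical BIC values for three models
probs = posterior_model_probs(bics)
print(probs)
print(approx_bayes_factor(bics[0], bics[1]))
```

With equal prior probabilities, the ratio of two approximate posterior model probabilities equals the approximate Bayes factor, so a Bayes factor above 1 and a higher posterior probability point to the same model.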

Focused Information Criterion (FIC).
Note that, in classical Bayesian model selection, a single (selected) model is used to explain all aspects of the data, regardless of the purpose of the selection and irrespective of the inference to follow (Claeskens and Hjort [12]). Allen [10] first developed the idea of focusing on a parameter of interest, in a prediction problem where the quantity of interest is the prediction at a particular value (target) of the regressor vector that differs from the values in the sample. In Allen's method, the steps in the derivation of Mallows' $C_p$ are repeated for this target; thus, the criterion depends on that particular value and is therefore an early precursor to the FIC. Geisser [32] focused on prediction, rather than estimation, as a major inferential goal under many circumstances.

In the FIC framework, a parameter of interest, say, $\mu$, must have a definition making it meaningful across competing models. The FIC methodology uses general parametric models and maximum likelihood as the estimation method in a general large sample theory. The FIC is derived as an unbiased estimate of the limiting risk of any submodel-based estimator of the parameter $\mu$. It is based on the (crucial) assumption that the true data generating mechanism is contained in the largest parametric model considered. The candidate model with the smallest value of FIC is chosen. The FIC is thus based on a frequentist approach; to our knowledge, a Bayesian equivalent has not yet been considered.

Bayesian Model Averaging. Bayesian model averaging (BMA) is used to deal with the problem of model uncertainty.
A discussion on the issue of model uncertainty is given in the study by Clyde and George [33]. Let $\mu$ be a quantity of interest depending on $y$, for example, a future observation from the same process that generated $y$. The idea is to use a weighted average of the estimates of $\mu$ obtained using each of the alternative models, rather than the estimate obtained using any single model. More precisely, the posterior distribution of $\mu$ is given by

$$p(\mu \mid y) = \sum_{k=1}^{K} p(\mu \mid M_k, y)\, p(M_k \mid y). \tag{4}$$

Note that $p(\mu \mid y)$ is a weighted average of the posterior distributions $p(\mu \mid M_k, y)$, $k = 1, \ldots, K$, where the $k$th weight, $p(M_k \mid y)$, is the posterior probability that $M_k$ is the true model. The posterior distribution of $\mu$, conditioned on $M_k$ being true, is given by

$$p(\mu \mid M_k, y) = \int p(\mu \mid \theta_k, M_k, y)\, p(\theta_k \mid M_k, y)\, d\theta_k. \tag{5}$$

The posterior mean and posterior variance are given by

$$\hat{\mu}_{\mathrm{bma}} = E(\mu \mid y) = \sum_{k=1}^{K} \hat{\mu}_k\, p(M_k \mid y), \qquad \hat{\mu}_k = E(\mu \mid M_k, y), \tag{6}$$

$$\operatorname{Var}(\mu \mid y) = \sum_{k=1}^{K} \left[\operatorname{Var}(\mu \mid M_k, y) + \hat{\mu}_k^{2}\right] p(M_k \mid y) - \hat{\mu}_{\mathrm{bma}}^{2}. \tag{7}$$

Raftery et al. [34] call this averaging scheme Bayesian model averaging. Leamer [35] and Draper [36] advocate the same idea. Madigan and Raftery [37] note that BMA provides better predictive performance than any single model if the measure of performance is Good's [38] logarithmic scoring rule, under the posterior distribution of $\mu$ given $y$.

Hoeting et al. [39] give an extensive framework of BMA methodology and applications for different statistical models. Various real data and simulation studies have investigated the predictive performance of BMA (Clyde [40]; Clyde and George [33]). Nguefack-Tsague [41] uses BMA in the context of estimating a multivariate mean.
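As a minimal sketch of the BMA posterior mean and variance, the function below combines per-model posterior means and variances using posterior model probabilities as weights. The function name and the numbers are hypothetical; in practice the weights, means, and variances would come from fitting each candidate model.

```python
def bma_mean_variance(means, variances, weights):
    """Mixture mean and variance of the quantity of interest across models.

    means[k]     : posterior mean under model k
    variances[k] : posterior variance under model k
    weights[k]   : posterior probability of model k (must sum to 1)
    """
    mean_bma = sum(w * m for w, m in zip(weights, means))
    # Second moment of the mixture, then subtract the squared mean.
    second_moment = sum(w * (v + m * m)
                        for w, m, v in zip(weights, means, variances))
    return mean_bma, second_moment - mean_bma ** 2

# Hypothetical two-model illustration
mean_bma, var_bma = bma_mean_variance(
    means=[1.0, 2.0], variances=[0.5, 0.5], weights=[0.7, 0.3]
)
print(mean_bma, var_bma)
```

The BMA variance exceeds the weighted within-model variance whenever the per-model means differ, which is how the averaging accounts for model uncertainty.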

Challenges.
Implementing BMA is demanding, especially the computation of the integrated likelihood. Software for BMA implementation, as well as some BMA papers, can be found at "http://www.research.att.com/~volinsky/bma.html". An R [42] package for BMA is now available for computational purposes; this package provides ways of carrying out BMA for linear regression, generalized linear models, and survival analysis using Cox proportional hazard models. For computations, Monte Carlo methods or approximating methods are used; thus, many BMA applications are based on the BIC, an asymptotic approximation of the log posterior odds when the prior odds are all equal.

Another problem is the selection of priors for both models and parameters. In most cases, a uniform prior is used for each model; that is, $p(M_k) = 1/K$, $k = 1, 2, \ldots, K$. When the number of models is large, model search strategies are sometimes used to reduce the set of models by eliminating those that seem comparatively less compatible with the data.

Using BMA for Model Selection
The purpose of this section is to define the focused Bayesian model averaging (FoBMA).
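Definition 1. For a set of $K$ models $\mathcal{M} = (M_1, \ldots, M_K)$, each $M_k$ with focus parameter $\mu_k$, under the BMA framework, the selection criterion FoBMA consists of choosing the $M_k \in \mathcal{M}$ whose estimate $\hat{\mu}_k$ is closest (in terms of squared error) to the BMA estimate.

Proposition 2. Under the squared error loss and the weighted posterior probability of (4), FoBMA is an optimal model choice.

Proof. Conditioning on all models, that is, under (4), the optimal choice $M_k \in \mathcal{M}$ is the one whose estimate $\hat{\mu}_k$ minimizes the posterior expected squared error

$$E\left[(\mu - \hat{\mu}_k)^2 \mid y\right].$$

Writing $\mu - \hat{\mu}_k = (\mu - \hat{\mu}_{\mathrm{bma}}) + (\hat{\mu}_{\mathrm{bma}} - \hat{\mu}_k)$, this is equivalent to minimizing

$$E\left[(\mu - \hat{\mu}_{\mathrm{bma}})^2 \mid y\right] + 2\,(\hat{\mu}_{\mathrm{bma}} - \hat{\mu}_k)\, E\left[\mu - \hat{\mu}_{\mathrm{bma}} \mid y\right] + (\hat{\mu}_{\mathrm{bma}} - \hat{\mu}_k)^2.$$

The first term is the BMA posterior variance of $\mu$ and does not depend on $\hat{\mu}_k$. The second term vanishes because $\hat{\mu}_{\mathrm{bma}} = E(\mu \mid y)$. The only term that depends on $\hat{\mu}_k$ is therefore $(\hat{\mu}_{\mathrm{bma}} - \hat{\mu}_k)^2$. Hence, the preferred $M_k$ is the one whose estimate $\hat{\mu}_k$ is closest to the BMA estimate $\hat{\mu}_{\mathrm{bma}}$.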
In particular, for two models, Corollary 3 shows that FoBMA reduces to the classical Bayesian model selection scheme of choosing the model with the highest posterior probability.
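The selection rule itself can be sketched in a few lines: compute the BMA estimate as the probability-weighted average of the per-model estimates, then pick the model whose estimate has the smallest squared distance to it. The function name and the numbers below are hypothetical and serve only to illustrate the rule.

```python
def fobma_select(estimates, weights):
    """Return (index of selected model, BMA estimate, squared distances).

    estimates[k] : estimate of the focus parameter under model k
    weights[k]   : posterior probability of model k (must sum to 1)
    """
    mu_bma = sum(w * m for w, m in zip(weights, estimates))
    distances = [(m - mu_bma) ** 2 for m in estimates]
    best = min(range(len(estimates)), key=distances.__getitem__)
    return best, mu_bma, distances

# Hypothetical three-model illustration
estimates = [0.2, 0.5, 0.9]
weights = [0.5, 0.3, 0.2]
best, mu_bma, distances = fobma_select(estimates, weights)
highest_posterior = max(range(len(weights)), key=weights.__getitem__)
print(best, highest_posterior, mu_bma)
```

In this illustration the model closest to the BMA estimate (index 1) is not the one with the highest posterior probability (index 0), which shows that with more than two models the two rules can disagree.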

Corollary 3. For two models, the selected model is the one with the highest posterior probability.
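Proof. From Proposition 2, let us find the distance between each model estimate and the BMA estimate. For two models, $\hat{\mu}_{\mathrm{bma}} = p(M_1 \mid y)\,\hat{\mu}_1 + p(M_2 \mid y)\,\hat{\mu}_2$ with $p(M_1 \mid y) + p(M_2 \mid y) = 1$. Consider

$$(\hat{\mu}_1 - \hat{\mu}_{\mathrm{bma}})^2 = p(M_2 \mid y)^2\, (\hat{\mu}_1 - \hat{\mu}_2)^2.$$

Similarly,

$$(\hat{\mu}_2 - \hat{\mu}_{\mathrm{bma}})^2 = p(M_1 \mid y)^2\, (\hat{\mu}_1 - \hat{\mu}_2)^2.$$

Therefore, $M_1$ is selected if $p(M_2 \mid y)^2 < p(M_1 \mid y)^2$, that is, if $p(M_1 \mid y) > p(M_2 \mid y)$. Hence, FoBMA is equivalent to selecting the model with the highest posterior probability.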
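A small worked example may help to check the two-model case. The values are hypothetical: posterior model probabilities $0.7$ and $0.3$, and per-model estimates $\hat{\mu}_1 = 1$ and $\hat{\mu}_2 = 2$.

```latex
\begin{align*}
\hat{\mu}_{\mathrm{bma}} &= 0.7 \times 1 + 0.3 \times 2 = 1.3,\\
(\hat{\mu}_1 - \hat{\mu}_{\mathrm{bma}})^2 &= (1 - 1.3)^2 = 0.09 = 0.3^2 \times (1 - 2)^2,\\
(\hat{\mu}_2 - \hat{\mu}_{\mathrm{bma}})^2 &= (2 - 1.3)^2 = 0.49 = 0.7^2 \times (1 - 2)^2.
\end{align*}
```

The first model is closer to the BMA estimate, and it is also the one with the higher posterior probability, as the corollary states.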

Applications
In this section, we apply the methodology to three models: linear regression, logistic regression, and survival analysis. The three following examples have been widely used in Bayesian model averaging (Raftery et al. [43]) and are available in R packages, including survival. They are also used as tutorials in the R package for BMA. Since these are parametric models, the focus parameter, $\mu$, is in every case the regression coefficient ($\beta$). FoBMA is compared to the classical, well-known Bayesian information criterion (BIC). All computations were performed with R [42].

Linear Regression.
In this subsection, FoBMA is applied to data on the effect of punishment regimes on crime rates (Ehrlich [44]), also used in the study by Raftery [45]. The data are available in an R package.

Data Description.
Criminologists are interested in the effect of punishment regimes on crime rates. Table 1 describes all variables used for linear regression. The dependent variable is the rate of crimes in a particular category per head, and there are 15 potential independent variables perceived to be associated with crime rates (Ehrlich [44]). Table 2 shows that the classical BIC (no focus) selects the model with variables , , 1, , , 2, , ,  (-BIC = 55.91). Since the number of explanatory variables in this case is 15, the model space is small enough to allow full enumeration. If the focus was the probability of imprisonment (the number of offenders imprisoned per offense known), the selected model is , 1, , , ,  which is ranked 25 with BIC. If the focus was the average time spent in state prisons, the selected model is , , 1, , , 2, , ,  which is ranked 18 with BIC. The other focus parameters show that there is a great discrepancy between FoBMA and the classical BIC.

Logistic Regression.
In this subsection, FoBMA is applied to data on risk factors associated with low infant birth weight (Hosmer and Lemeshow [46]). These data were also used in the study by Raftery [45] and are available in an R package. The aim was to study the risk factors associated with low infant birth weight.

Data Description.
The "birthwt" data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Massachusetts, during 1986. Table 3 describes all variables used for logistic regression. The outcome variable is low (an indicator of birth weight (BWT) less than 2.5 kg), and there are 7 potential independent variables perceived to be associated with low birth weight (Hosmer and Lemeshow [46]). Table 4 shows that BIC selects the model with variables ,  (-BIC = 753). If the focus was the , the selected model is , ,  which is ranked 38 with BIC. If the focus was the , the selected model is , ,  which is ranked 39 with BIC. The other focus parameters show that there is a great discrepancy between FoBMA and BIC. If hypertension (HT) is the focus, the selected model is the one with variables , , ; but this model is ranked 20 if the focus is .

Survival Analysis.

Patients referred to Mayo Clinic during that ten-year interval met eligibility criteria for the randomized placebo-controlled trial of the drug D-penicillamine. The first 312 cases in the dataset participated in the randomized trial and contain largely complete data. The additional 112 cases did not participate in the clinical trial but consented to have basic measurements recorded and to be followed for survival. Six of those cases were lost to follow-up shortly after diagnosis, so the data here are on an additional 106 cases as well as the 312 randomized participants. Table 5 describes all variables used for survival analysis. The dependent variable was the survival time, while the 18 others were the independent variables. Table 6 shows that BIC selects the model with variables , , , , , . If the focus was the , the selected model is , , , , ,  which is ranked 14 with BIC. If the focus was the , the selected model is , , , , , ,  which is ranked 7 with BIC. The other focus parameters show that there is a great discrepancy between FoBMA and BIC. If the focus was on age, BIC and FoBMA select the same model.

Comparison to Other Bayesian Approaches.
As with any other Bayesian approach, parameter priors and model priors are a source of uncertainty in FoBMA. FoBMA shares the drawbacks of BMA, namely, the computation of the integrated likelihoods of the various models and the choice of model space. As a model selection criterion, it is simpler than comparable Bayesian criteria once the BMA estimates have been obtained. It was shown in Corollary 3 that FoBMA is equivalent to the classical approach of selecting the model with the highest posterior model probability if there are only two competing models.

Interpretation of Focus.
It was shown in Section 4 that, in general, FoBMA selects a model different from the one selected without a focus, and the difference in terms of ranking is sometimes large. For example, in studying the risk factors associated with low infant birth weight, the classical Bayesian model selection approach selects the one with variables , . However, if the focus was the mother's weight in pounds at last menstrual period, that is, if the main objective was to find out whether it was a risk factor, adjusted for possible confounding factors, the selected model is , ,  which is ranked 38 with BIC. Thus, a poor model for estimating a focus $\mu_1$ may be the best for $\mu_2$.

Heuristic Explanation of Focus.
Consider a set of data $y_1, \ldots, y_n$, each $y_i$ with PDF $p(y_i \mid \theta)$. Mean, median, and mode measure central tendency. The mean is obtained by minimizing the squared error loss, the median by minimizing the absolute error loss, and the mode by minimizing the 0-1 loss. In this context of model selection, the squared error loss was used, so it is reasonable to look for the "mean," which can be identified with the BMA estimate. Since the BMA estimate does not correspond to any of the competing models, it seems appropriate to select the model closest to it.
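The correspondence between loss functions and measures of central tendency can be checked numerically with a sample analogue. The sketch below uses a small hypothetical data set and a grid search over candidate values; it is an illustration, not part of the paper's method.

```python
import statistics

data = [1, 2, 2, 3, 10]                 # hypothetical observations
grid = [i / 100 for i in range(1001)]   # candidate values 0.00, 0.01, ..., 10.00

def grid_minimizer(loss):
    """Candidate value on the grid with the smallest total loss over the data."""
    return min(grid, key=lambda c: sum(loss(y, c) for y in data))

best_squared = grid_minimizer(lambda y, c: (y - c) ** 2)       # squared error loss
best_absolute = grid_minimizer(lambda y, c: abs(y - c))        # absolute error loss
best_zero_one = grid_minimizer(lambda y, c: 0 if y == c else 1)  # 0-1 loss

print(best_squared, statistics.mean(data))
print(best_absolute, statistics.median(data))
print(best_zero_one, statistics.mode(data))
```

Each minimizer coincides with the corresponding summary statistic of the sample, which mirrors the role the BMA estimate plays as the "mean" under squared error loss.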

Concluding Remarks
The present paper has derived a new model selection criterion (FoBMA), which, in contrast to other Bayesian model selection criteria, focuses on the parameter singled out for interest. The methodology was applied to concrete examples. The method needs to be applied in a variety of conditions; in particular, more work needs to be done to establish the asymptotic properties of FoBMA. It is hoped that this paper will motivate more research on focus-related criteria in Bayesian model selection.
Bayesian Model Selection. Within this framework, classical Bayesian model selection involves selecting the model with the highest posterior probability. Sometimes Bayes factors are used.
In the proof of Proposition 2, term (1) can be rearranged (see also the study by Bernardo and Smith [31]).

Table 6: Survival regression example: selected (best) model according to the focus parameter (FoBMA) compared to its BIC rank and value, the probability that the focus is not null ($100\,P(\beta \neq 0)$), its BMA mean ($\hat{\mu}_{\mathrm{bma}} = E_{\mathrm{bma}}(\beta \mid y)$), and standard deviation ($\mathrm{SD}_{\mathrm{bma}}(\beta \mid y)$).

The model , , , ,  is ranked 4 if the focus is , 11 if , 14 if , 23 if , and 5 if . It is not best (number 1) to estimate any of these focuses. Other applications of FIC for logistic regression in dental restoration can be found in the study by Candolo [47].