Bootstrap Order Determination for ARMA Models: A Comparison between Different Model Selection Criteria

The present paper deals with order selection for models of the autoregressive moving average (ARMA) class. A novel method, previously designed to enhance the selection capabilities of the Akaike Information Criterion and successfully tested, is now extended to three other popular selectors commonly used by both theoretical statisticians and practitioners: the final prediction error, the Bayesian information criterion, and the Hannan-Quinn information criterion. All of them are employed in conjunction with a semiparametric bootstrap scheme of the sieve type.


Introduction
Autoregressive moving average (ARMA) models [1] are a popular choice for the analysis of stochastic processes in many fields of applied and theoretical research. They are mathematical tools employed to model the persistence, over time and space, of a given time series. They can be used for a variety of purposes, for example, to generate predictions of future values, to remove the autocorrelation structure from a time series (prewhitening), or to achieve a better understanding of a physical system. As is well known, the performance of an ARMA model is critically affected by the determination of its order: once properly built and tested, such models can be successfully employed to describe reality, for example, trend patterns of economic variables and temperature oscillations in a given area, or to build future scenarios through simulation exercises. Model order choice plays a key role not only for the validity of the inference procedures but also, from a more general point of view, for the fulfillment of the fundamental principle of parsimony [2,3]. Ideally, observing this principle leads to models that have a simple structure and yet provide an effective description of the data set under investigation. Less parsimonious models tend to extract idiosyncratic information and are therefore prone to introduce high variability in the estimated parameters. Such variability deprives the model of generalization capabilities (e.g., when new data become available), even though, by adding more and more parameters, an excellent fit of the data is usually obtained [4]. Overfitting is more likely to occur when the system under investigation is affected by different sources of noise, for example, related to changes in survey methodologies, time evolving processes, and missing observations. These phenomena, very common and in many cases simply unavoidable in "real life" data, might have a significant impact on the quality of the data set at hand. Under noisy conditions, an overly complex model is likely to fit the noise components embedded in the time series, and not just the signal, and is therefore bound to yield poor predictions of future values. On the other hand, bias in the estimation process arises when underfitted models are selected, so that only a suboptimal reconstruction of the underlying Data Generating Process (DGP) can be provided. As will be seen, bias also arises as a result of the uncertainty conveyed by the model selection process itself. ARMA model order selection is a difficult step in time series analysis. This issue has attracted a lot of attention, and several methods, both parametric and nonparametric, resting on different philosophies and on different theoretical and practical assumptions, have been proposed over the years. Among them, bootstrap strategies [5][6][7][8][9] are gaining more and more acceptance among researchers and practitioners.
In particular, in [6] a bootstrap-based procedure applied to the Akaike Information Criterion (AIC) [10,11] in the case of ARMA models, called b-MAICE (bootstrap-Minimum AIC Estimate), has been shown to enhance the small sample performance of this selector. The aim of this work is to extend such a procedure to different selectors, that is, the final prediction error (FPE) [12] and two information based criteria, namely, the Bayesian information criterion (BIC) [13,14] and the Hannan-Quinn criterion (HQC) [15,16]. The present paper is aimed at giving empirical evidence of the quality of the bootstrap approach in model selection by comparing it with the standard procedure, which, as is well known, is based on the minimization of a selection criterion. To this end, the empirical study (presented in Section 4) has been designed to contrast the performance of each of the considered selectors in both the nonbootstrap and the bootstrap world. The validity of the proposed method is assessed not only in the case of pure ARMA processes but also when real life phenomena are simulated and embedded in the artificial data. In practice, the problem of order determination is also considered when the observed series is contaminated with outliers and additive Gaussian noise. The latter type of contamination has been employed, for example, in [17] for testing a model selection approach driven by information criteria in the autoregressive fractionally integrated moving average (ARFIMA) and ARMA cases. Such a source of disturbance has been employed here in order to test the degree of robustness of the proposed method against overfitting. As will be seen, computer simulations show that the addition of white noise generates a number of incorrect specifications comparable to that resulting from the contamination of the process with outliers of the innovational type. Outliers are a common phenomenon in time series, considering that real life time series from many fields, for example, economics, sociology, and climatology, can be subject to, and severely influenced by, interruptive events such as strikes, outbreaks of war, unexpected heat or cold waves, and natural disasters [18,19]. The issue is far from trivial, given that outliers can impact virtually all the stages of the analysis of a given time series. In particular, model identification can be heavily affected by additive outliers, as they can induce the selection of underfitted models as a result of the bias introduced into the inference procedures. In the simulation study (Section 4), outliers of the additive type (i.e., added to some observations) and of the innovational type (i.e., embedded in the innovation sequence driving the process) [19] will be considered.
The remainder of the paper is organized as follows: in Section 2, after introducing the problem of order identification for time series, the considered selectors are illustrated along with the related ARMA identification procedure. In Section 3 the employed bootstrap selection method is illustrated and the bootstrap scheme briefly recalled. Finally, the small sample performance of the proposed method is assessed via Monte Carlo simulations in Section 4.

Order Selection for Time Series Models
A key concept underlying the present paper is that, in general, "reality" generates complex structures, possibly infinite-dimensional, so that a model can at best capture only the main features of the system under investigation in order to reconstruct a simplified version of a given phenomenon. Models are just approximations of a given (nontrivial) phenomenon, and the related identification procedures can never lead to the determination of the "true" model. In general, there is no true model in a finite world. What we can do is find the one giving the best representation of the underlying DGP, according to a predefined rule. In this section, after highlighting the role played by model selection procedures in generating uncertainty, we briefly introduce the models belonging to the ARMA class along with the order selectors considered. Finally, the standard selection procedure based on information criteria is illustrated.

Uncertainty in Model Selection.
Uncertainty is an unfortunate, pervasive, and inescapable feature of real life data, which has to be faced continually by both researchers and practitioners. The framework dealt with here is clearly no exception: if the true model structure is an unattainable goal, approximation strategies have to be employed. Such strategies are generally designed on an iterative basis and provide an estimate of the model structure which embodies, by definition, a certain amount of uncertainty. Common sources of uncertainty are those induced by the lack of discriminating power of the employed selector and by the so-called model selection bias [20,21], which arises when a model is specified and fitted on the same data set. Unfortunately, not only are these two types of uncertainty not mutually exclusive, but statistical theory also provides little guidance for quantifying their effect in terms of the bias introduced in the model as a result [22]. The latter form of uncertainty is particularly dangerous, as it rests upon the strong and unrealistic assumption that correct inference can be made as if a model were known to be true, while its determination has been made on the same set of data. On the other hand, the first source of uncertainty is somewhat less serious, given its direct relationship with the size of the competition set, which is usually included in the design of the experiment. In practice, it is related to the fact that very close minima of the selection criterion can be found in the model selection process, so that even small variations in the data set can cause the identification of different model structures. In general, trying to explain only part of the complexity conveyed by the observed process, by means of structures that are as simple as possible, is a way to minimize uncertainty in model selection, as it is likely to lead to the definition of a smaller set of candidate models. This approach can be seen as an extension of the principle of parsimony to the competition set. In the sequel, it will be emphasized how the proposed procedure, being aimed at replicating both the original process and the related selection procedure, helps reduce both of the considered sources of uncertainty [23].

The Employed Identification Criteria.
Perhaps the most well-known model order selection criteria (SC), among those considered, are the AIC and the FPE, whose asymptotic equivalence to the $F$-test has been proved in [24]. AIC has been designed on an information-theoretic basis as an asymptotically unbiased estimate of the Kullback-Leibler divergence [25] of the fitted model relative to the true model.
Assuming the data $x_n = (x_1, \dots, x_n)$, $n$ being the sample size, to be randomly drawn from an unknown distribution $H(x)$ with density $h(x)$, the estimation of $h$ is done by means of a parametric family of distributions with densities $\{f(x \mid \theta);\ \theta \in \Theta\}$, $\theta$ being the vector of unknown parameters. Denoting by $f(x \mid \hat\theta)$ the predictive density function, by $h$ the true model, and by $f$ the approximating one, the Kullback-Leibler discrepancy can be expressed as follows:
\[ I(h; f) = E_H[\log h(X)] - E_H[\log f(X \mid \hat\theta)]. \tag{1} \]
As the first term on the right hand side of (1) does not depend on the model, it can be neglected, so that we can rewrite the distance in terms of the expected log likelihood, $\ell(x_n; H)$; that is,
\[ \ell(x_n; H) = E_H[\log f(X \mid \hat\theta)] = \int \log f(x \mid \hat\theta)\, dH(x). \]
This quantity can be estimated by replacing $H$ with its empirical distribution $\hat H$, so that $\ell(x_n; \hat H) = (1/n) \sum_{i=1}^{n} \log f(x_i \mid \hat\theta)$. This is an overestimate of the expected log likelihood, given that $\hat H$ is closer to $\hat\theta$ than $H$. The related bias can be written as follows:
\[ b(H) = E_H[\ell(x_n; \hat H) - \ell(x_n; H)], \]
and therefore an information criterion can be derived from the bias-corrected log likelihood; that is,
\[ \mathrm{IC} = \ell(x_n; \hat H) - b(H). \]
Denoting by $k$ and $n$ the number of estimated parameters and the sample size, respectively, Akaike proved that $b(H)$ is asymptotically equal to $k/n$, so that the information based criterion takes the form $\ell(x_n; \hat H) - k/n$. By multiplying this quantity by $-2n$, AIC is finally defined as $-2 \log L(\hat\theta) + 2k$, where $\log L(\hat\theta) = n\,\ell(x_n; \hat H)$ is the maximized log likelihood. In such a theoretical framework, AIC can be seen as a way to solve the Akaike Prediction Problem [6], that is, to find a model $M_0$ producing an estimate $\hat f$ of the density that minimizes the Kullback-Leibler discrepancy (1). Originally conceived for AR processes and extended to the ARMA case by Soderstrom and Stoica [24], FPE was designed as the minimizer of the one-step-ahead mean square forecast error, after taking into account the inflating effect of the estimated parameters. The FPE statistic is defined as $\mathrm{FPE}(k) = [(1 + k/n)/(1 - k/n)]\,\hat\sigma^2(k)$, where $\hat\sigma^2$ is the estimated variance of the residuals and $k$ is the model's size. A different perspective has led to the construction of BIC-type criteria, which are grounded on the maximization of the model posterior probability [14]. In more detail, they envision the specification of prior distributions on the parameter values and on the models, respectively denoted by $\pi(\theta \mid k)$ and $\pi(k)$, and their introduction into the analysis through the joint probability function $\pi(k, \theta) = \pi(k)\,\pi(\theta \mid k)$.
Posterior probabilities for $(k, \theta)$ are then obtained through Bayes' theorem, so that the value of $k$ maximizing (4), that is,
\[ P(k \mid x_n) \propto \pi(k) \int_{\theta \in \Theta} L(x_n; \theta, k)\, \pi(\theta \mid k)\, d\theta, \tag{4} \]
is found. With $L(x_n; \theta, k)$ being the likelihood function associated with both the data $x_n$ and the model $M_k$, the selected order will be $\hat k = \arg\max_k P(k \mid x_n)$. By assuming all the models equally probable, that is, $\pi(k) = 1/(k_{\max} + 1)$, the BIC criterion is hence defined by $-2 \log L(\hat\theta) + k \log(n)$. The last criterion considered, constructed from the law of the iterated logarithm, is the HQC, in which the penalty function grows at a very slow rate as the sample size increases. It is defined as follows: $\mathrm{HQC} = -2 \log L(\hat\theta) + 2k \log(\log(n))$.
All these selectors can be divided into two groups: one achieving asymptotic optimality [26] and one achieving selection consistency. AIC and FPE fall in the first group, in the sense that the selected model asymptotically tends to reach the smallest average squared error [27,28] if the true DGP is not included in the competition set. On the other hand, BIC and HQC are dimension consistent [29], in that the probability of selecting the "true" model approaches 1 as the sample size goes to infinity. However, it should be pointed out that such an asymptotic property holds only if the true density is in the set of candidate models. In this regard, AIC and FPE, as well as the other Shibata efficient criteria (e.g., Mallows' $C_p$ [30]), fail to select the "true" model asymptotically. As pointed out earlier, the infinite dimensionality of the "truth" implies that all models are "wrong" to some extent, except in trivial cases, so that no set of competing models will ever encompass the true DGP. As long as this view is held, asymptotically efficient criteria might be preferred. In this case, one may object that comparing any finite list of candidate models lacks significance once the existence of a true one is ruled out. Such a comparison is nonetheless justified in that, even if no model can ever represent the truth, we can still pursue the goal of finding the one that is approximately correct. Conversely, if one does believe that the true density belongs to the model space, then dimension consistent selection criteria can be preferred.

ARMA Model Selection through Minimization of Selection Criteria.
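In what follows, it is assumed that the observed time series $\{x_t\}_{t \in \mathbb{Z}^+}$ is a realization of a real valued, zero-mean, second-order stationary process admitting an autoregressive moving average representation of orders $p$ and $q$; that is, $x_t \sim \mathrm{ARMA}(p, q)$, with $(p, q) \in \mathbb{Z}^+$. Its mathematical expression is as follows:
\[ \phi(\mathcal{B})\, x_t = \theta(\mathcal{B})\, \varepsilon_t, \tag{5} \]
where $\phi(\mathcal{B})$ and $\theta(\mathcal{B})$, with coefficients $\phi_i \in \mathbb{R}$ and $\theta_j \in \mathbb{R}$, are the AR polynomial and the MA polynomial, respectively. Here $\mathcal{B}$ denotes the backward shift operator, such that $\mathcal{B}^k x_t = x_{t-k}$, whereas $\varepsilon_t$ is assumed to be a sequence of centered, uncorrelated variables with common variance $\sigma^2$. The parameter vector is denoted by $\Gamma$. The standard assumptions of stationarity and invertibility of the AR and MA polynomials, respectively, that is,
\[ \phi(z) \neq 0 \quad \text{for } |z| \le 1, \tag{6} \]
\[ \theta(z) \neq 0 \quad \text{for } |z| \le 1, \tag{7} \]
are supposed to be satisfied. Finally, the ARMA orders of the true underlying DGP (5) are denoted by $(p^\circ, q^\circ)$ (i.e., $\{x_t\} \sim \mathrm{ARMA}(p^\circ, q^\circ)$) and the related model by $M_0(\Gamma)$.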
Identification of the best approximating model for $M_0$ is carried out on an a priori specified set $\Lambda$ of plausible candidate models $M_j$ (8), from which the chosen model, say $\hat M_0(\hat\Gamma) = (\hat p_0, \hat q_0)$, is selected (i.e., $[\hat M_0(\hat\Gamma) \equiv (\hat p_0, \hat q_0) \in \Lambda] \approx M_0(\Gamma)$). In the ARMA case, each model $M_j \in \Lambda$ represents a specific combination of autoregressive and moving average orders $(p, q)$. The set $\Lambda$ is upper bounded by the two integers $P$ and $Q$ for the AR and MA part, respectively; that is,
\[ \Lambda = \{(p, q) : 0 \le p \le P,\ 0 \le q \le Q\}. \tag{9} \]
This assumption is a necessary condition for the abovementioned Shibata efficiency and dimension consistency properties to hold, as well as for the practical implementation of the procedure (the model space needs to be bounded). From an operational point of view, the four SC considered in this work, when applied to models of the ARMA class, take the forms (10)-(13); for the AIC, for instance,
\[ \mathrm{AIC}(p, q) = n \ln \hat\sigma^2_{p,q} + 2(p + q + 1), \tag{10} \]
where $\hat\sigma^2_{p,q}$ is an estimate of the Gaussian pseudo-maximum likelihood residual variance when fitting ARMA$(p, q)$ models (14). Equations (10)-(13) can be synthetically expressed in the common form (15), in which a term depending on $\hat\sigma^2_{p,q}$ is combined with a penalty term that is a function of model complexity.
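The operational criteria can be illustrated with a minimal Python sketch. It takes the AIC form from (10) and the FPE form from Section 2.2 as given in the text; the BIC and HQC expressions below are the textbook residual-variance forms, assumed here because (12) and (13) are not reproduced above. The function name `selection_criteria` and its arguments are hypothetical, and `k` stands for the number of estimated parameters (e.g., $p + q + 1$).

```python
import math


def selection_criteria(sigma2: float, n: int, k: int) -> dict:
    """Return AIC, FPE, BIC and HQC for a fitted model.

    sigma2 : estimated residual variance of the fitted model
    n      : sample size
    k      : number of estimated parameters (assumed p + q + 1 for ARMA)
    BIC and HQC use the standard textbook penalties (an assumption).
    """
    if sigma2 <= 0 or n < 2 or not 0 <= k < n:
        raise ValueError("need sigma2 > 0, n >= 2 and 0 <= k < n")
    fit = n * math.log(sigma2)  # goodness-of-fit term shared by AIC, BIC, HQC
    return {
        "AIC": fit + 2 * k,
        "FPE": sigma2 * (1 + k / n) / (1 - k / n),
        "BIC": fit + k * math.log(n),
        "HQC": fit + 2 * k * math.log(math.log(n)),
    }


# With unit residual variance the fit term vanishes and only penalties remain.
for name, value in selection_criteria(sigma2=1.0, n=100, k=3).items():
    print(f"{name}: {value:.3f}")
```

For $n = 100$ the penalties are ordered AIC < HQC < BIC, which is consistent with the slow growth of the HQC penalty noted in Section 2.2.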
The standard identification procedure, here called for convenience Minimum Selection Criterion Estimation (MSCE), is based on the minimization of the SC. In practice, the model $\hat M_0$ minimizing a given SC is the winner; that is,
\[ \hat M_0 : (\hat p_0, \hat q_0) = \arg\min_{p \le P,\ q \le Q} \mathrm{SC}(p, q). \tag{16} \]
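As a rough illustration of the MSCE rule, the sketch below scans a bounded grid of orders and returns the pair with the smallest criterion value. The function `msce` and the toy criterion are hypothetical; in a real application `criterion(p, q)` would fit an ARMA$(p, q)$ model and return the chosen SC. The grid is assumed to include both bounds, in line with the $(P + 1)(Q + 1)$ candidate pairs mentioned in Section 3.

```python
from itertools import product
from typing import Callable, Tuple


def msce(criterion: Callable[[int, int], float], P: int, Q: int) -> Tuple[int, int]:
    """Return the (p, q) pair minimising `criterion` over 0<=p<=P, 0<=q<=Q."""
    candidates = product(range(P + 1), range(Q + 1))
    return min(candidates, key=lambda pq: criterion(*pq))


# Toy criterion with a known minimum at (2, 1); it stands in for a fitted SC.
def toy_sc(p: int, q: int) -> float:
    return (p - 2) ** 2 + (q - 1) ** 2


print(msce(toy_sc, P=3, Q=3))  # (2, 1)
```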

The Bootstrap Method
As already pointed out, in [6] a bootstrap selection method has been proposed to perform AIC-based ARMA structure identification. The comparative Monte Carlo experiment with its nonbootstrap counterpart, commonly referred to as the MAICE (Minimum Akaike Information Criterion Estimate) procedure, gave empirical evidence in favor of the b-MAICE procedure. Such results motivated us to extend this approach to other selectors (see (11), (12), and (13)). For convenience, the proposed generalized version of the b-MAICE procedure has been called the bMSE (Bootstrap Minimum Selector Expectation) procedure. Finally, in order to keep the paper as self-contained as possible, and to reduce uncertainty in the experimental outcomes, AIC has also been included in the experiment.

The Bootstrapped Selection Criteria.
The proposed bMSE method relies on the bootstrapped version of a given SC, obtained by bootstrapping both the residual variance term $\hat\sigma^2_{p,q}$ and the penalty term, so that (15) becomes its bootstrap counterpart (17). The particularization of (17) to the criteria object of this study is straightforward and yields their bootstrapped versions, with $n$, $p$, $q$ defined as above and $\hat\sigma^{*2}_{p,q}$ being the variance of the residuals from the fitting of the bootstrapped series $x^*_t$ with its ARMA estimate $\hat x^*_t$. In essence, the bMSE method works as follows: the MSCE procedure is applied iteratively to each bootstrap replication $x^*_{t,b}$, $b = 1, \dots, B$, of the observed series. A winner model is selected at each iteration on the basis of a given SC, which in turn exploits the bootstrap estimated variances of the residuals. The final model is chosen on the basis of its relative frequency over the $B$ bootstrap replications.
The Applied Bootstrap Scheme.
The sieve bootstrap [31,32,33] is the scheme employed here. It is an effective and conceptually simple tool to borrow randomness from white noise residuals generated by fitting a "long" autoregression to the observed time series. This autoregression, here supposed to be zero-mean, is of the type $x_t = \sum_{j=1}^{p} a_j x_{t-j} + \varepsilon_t$, $t \in \mathbb{Z}$, under the stationarity conditions as in (6). Its use is here motivated by the AR($\infty$) representation of processes of type (5); that is,
\[ x_t = \sum_{j=1}^{\infty} a_j x_{t-j} + \varepsilon_t, \]
with $(\varepsilon_t)_{t \in \mathbb{Z}}$ being a sequence of iid variables with $E[\varepsilon_t] = 0$ and $\sum_{j=0}^{\infty} a_j^2 < \infty$. In essence, the sieve bootstrap approximates a given process by a finite autoregressive process whose order $\hat p = p(n)$ increases with the sample size $n$ such that $p(n) \to \infty$, $p(n) = o(n)$, as $n \to \infty$. In this regard, in the empirical study the estimation of the $\hat p$-vector of coefficients $(\hat a_1, \dots, \hat a_{\hat p})$ has been carried out through the Yule-Walker equations. The residuals $\hat\varepsilon_t = x_t - \sum_{j=1}^{\hat p} \hat a_j x_{t-j}$, $t = \hat p + 1, \dots, n$, obtained from the fitting of this autoregression to the original data are then employed to build up the centered empirical distribution function, which is defined as
\[ \hat F_{\tilde\varepsilon}(x) = (n - \hat p)^{-1} \sum_{t=\hat p+1}^{n} \mathbb{1}\{\tilde\varepsilon_t \le x\}, \]
where $\tilde\varepsilon_t = \hat\varepsilon_t - \bar\varepsilon$, with $\bar\varepsilon$ being the mean value of the available residuals, that is, $\hat\varepsilon_t$, $t = \hat p + 1, \dots, n$. From $\hat F_{\tilde\varepsilon}$, bootstrap samples $X^* = (x^*_{1-\hat p}, \dots, x^*_n)$ are generated by the recursion
\[ x^*_t = \sum_{j=1}^{\hat p} \hat a_j x^*_{t-j} + \varepsilon^*_t, \]
with starting values $x^*_t = 0$ and $\varepsilon^*_t = 0$ for $t \le -\max(p, q)$.
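The scheme just described can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: the Yule-Walker equations are solved with the Levinson-Durbin recursion, the sample mean is removed before fitting and added back afterwards, and a burn-in period (the hypothetical `burn_in` argument) replaces the exact start-up conditions. All function names are hypothetical.

```python
import random


def autocovariance(x, lag):
    """Biased sample autocovariance at `lag` (x is assumed already centred)."""
    n = len(x)
    return sum(x[t] * x[t + lag] for t in range(n - lag)) / n


def yule_walker(x, p):
    """AR(p) coefficients from the Yule-Walker equations (Levinson-Durbin)."""
    r = [autocovariance(x, k) for k in range(p + 1)]
    coeffs, err = [], r[0]
    for m in range(1, p + 1):
        acc = r[m] - sum(coeffs[j] * r[m - 1 - j] for j in range(m - 1))
        kappa = acc / err  # reflection coefficient of order m
        coeffs = [coeffs[j] - kappa * coeffs[m - 2 - j] for j in range(m - 1)] + [kappa]
        err *= 1.0 - kappa ** 2
    return coeffs


def sieve_bootstrap(x, p, rng, burn_in=100):
    """One sieve bootstrap replication of x based on an AR(p) approximation."""
    n = len(x)
    mean = sum(x) / n
    y = [v - mean for v in x]
    a = yule_walker(y, p)
    # Residuals of the fitted autoregression, then centred.
    resid = [y[t] - sum(a[j] * y[t - 1 - j] for j in range(p)) for t in range(p, n)]
    rbar = sum(resid) / len(resid)
    centred = [e - rbar for e in resid]
    # Recursion driven by residuals resampled with replacement, zero start-up.
    path = [0.0] * p
    for _ in range(n + burn_in):
        path.append(sum(a[j] * path[-1 - j] for j in range(p)) + rng.choice(centred))
    return [v + mean for v in path[-n:]]


# Toy usage on a simulated AR(1) series with coefficient 0.6.
rng = random.Random(0)
x, prev = [], 0.0
for _ in range(300):
    prev = 0.6 * prev + rng.gauss(0.0, 1.0)
    x.append(prev)
x_star = sieve_bootstrap(x, p=5, rng=rng)
print(len(x_star))
```

Resampling the centred residuals, rather than drawing from a fitted parametric distribution, is what makes the scheme semiparametric.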

The Proposed bMSE Procedure.
Let $\{x_t\}$ be the observed time series, a realization of the ARMA$(p, q)$ DGP (5), from which $B$ bootstrap replications $\{x^*_{t,b};\ b = 1, 2, \dots, B\}$ are generated via the sieve method (Section 3). Our bMSE procedure is based on the minimization, over all the combinations of ARMA structures, of a given SC by applying the MSCE procedure to each bootstrap replication $x^*_{t,b}$ of the original time series $x_t$. In what follows the proposed procedure is summarized in a step-by-step fashion.
(3) A bootstrap replication, $x^*_{t,b}$, of the original time series $x_t$ is generated via the sieve method.
(4) The competition set $\Lambda$ is iteratively fitted to $x^*_{t,b}$, so that one value of the SC$^*$ for each of the models in $\Lambda$ is computed and stored in a vector $V_b$.
(5) The minimum SC$^*$ value is extracted from $V_b$ so that a winner model, $\hat M^*_{0,b}$, is selected; that is,
\[ \hat M^*_{0,b} : (\hat p^*_b, \hat q^*_b) = \arg\min_{(p,q) \in \Lambda} \mathrm{SC}^*(p, q). \]
(6) By repeating steps (3) to (5) $B$ times, the final model $\hat M^*_0$ is chosen according to a mode-based criterion, that is, on the basis of its more frequent occurrence in the set of the bootstrap replications. In practice, the selected model is chosen according to rule (24), in which the symbol # is used as a counter of the number of cases satisfying the inequality condition.
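The mode-based rule of step (6) can be sketched with a frequency counter. The function `bmse_select` is hypothetical, and `select_order` stands for whatever routine returns the winning $(p, q)$ for one bootstrap replication (steps (4) and (5)); here it is replaced by an identity function applied to a toy list of winners so that the example runs without fitting any model.

```python
from collections import Counter


def bmse_select(bootstrap_series, select_order):
    """Return the most frequently selected order and the full tally.

    bootstrap_series : iterable of bootstrap replications
    select_order     : callable mapping one replication to its winning (p, q)
    """
    winners = [select_order(series) for series in bootstrap_series]
    counts = Counter(winners)
    best, _ = counts.most_common(1)[0]
    return best, counts


# Toy illustration: pretend the winners of B = 7 replications are already known.
toy_winners = [(1, 1), (2, 1), (1, 1), (1, 2), (1, 1), (2, 1), (1, 1)]
best, counts = bmse_select(toy_winners, select_order=lambda w: w)
print(best, counts[best])  # (1, 1) 4
```

Keeping the whole tally, and not only the mode, makes it possible to inspect how close the runner-up models are.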
The order $p_0$ of the sieve autoregression is chosen by iteratively computing the Ljung-Box statistic [34] on the residuals resulting from the fitting of tentative autoregressions to the original time series with sample size $n_0$. Further orders, say $p_i$, $i = 1, 2, \dots$, for increasing sample sizes $n_i$, $i = 1, 2, \dots$, are selected according to the relation $p_i = c\, n_i^{1/3}$, where $c = p_0 / n_0^{1/3}$. (In [6], $p_0$ is chosen by iteratively computing the spectral density of the residuals resulting from the fitting of tentative autoregressions to the original time series; the order $\hat p$ for which the spectral density is approximately constant is then selected.)
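The cube-root rule for the sieve order amounts to a one-line computation, sketched below. The function name `sieve_order` is hypothetical, and rounding to the nearest integer (with a floor of 1) is an assumption, since the text does not state how noninteger values are handled.

```python
def sieve_order(n_i: int, p0: int, n0: int) -> int:
    """Sieve AR order for sample size n_i, given the order p0 chosen at n0."""
    c = p0 / n0 ** (1 / 3)  # constant calibrated on the reference sample
    return max(1, round(c * n_i ** (1 / 3)))


# An order of 5 chosen at n0 = 100 doubles when the sample is 8 times larger.
print(sieve_order(100, p0=5, n0=100))  # 5
print(sieve_order(800, p0=5, n0=100))  # 10
```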
The presented method is exhaustive and hence highly computer intensive: for all the $(P + 1)(Q + 1)$ possible pairs $(p, q)$, the values of the given SC$^*$ must be computed for each of the $B$ bootstrap replications. In the attempt to reduce such a burden, the set of ARMA orders under investigation is sometimes restricted (see, e.g., [35]) to $\Lambda = \{(p, p - 1): 0 \le p \le \Psi\}$; that is, the competition set is made up of ARMA$(p, p - 1)$ models only. However, such an approach entails the obvious drawback of not being able to identify common processes, such as ARMA$(2, 0)$, which has appeared to be too strong a limitation. Therefore, in spite of its ability to drastically reduce the computational time, it has not been followed here.
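To give a sense of the computational burden, the short sketch below counts the ARMA fits required by the exhaustive search. The function name `bmse_fit_count` is hypothetical; the figures plugged in ($P = Q = 3$, 125 bootstrap replications, 500 simulated series per DGP) are those reported for the empirical study in Section 4.

```python
def bmse_fit_count(P: int, Q: int, B: int, n_series: int = 1) -> int:
    """Number of ARMA fits needed by an exhaustive search over 0..P x 0..Q."""
    models = (P + 1) * (Q + 1)  # size of the competition set
    return models * B * n_series


print(bmse_fit_count(3, 3, 1))         # 16 candidate models
print(bmse_fit_count(3, 3, 125))       # 2000 fits per observed series
print(bmse_fit_count(3, 3, 125, 500))  # 1000000 fits per simulated DGP
```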

Empirical Study
In this section, the outcomes of a simulation study are reported. It has been designed with the twofold purpose of (i) evaluating the small sample performance of the bMSE procedure and (ii) giving some evidence of its behavior for increasing sample sizes. As a measure of performance, the percentage frequency of selection of the true order $(p_0, q_0)$ has been adopted for both the MSCE and the bMSE procedure (25); it is obtained by counting, through the quantifier symbol #, the number of times the statement "time series correctly identified" is true and relating this count to the number of artificial time series employed in the experiment. Its extension to the bootstrap case is straightforward. Aspect (i) consists of a series of Monte Carlo experiments carried out on three different sets of time series, 10 for each set, detailed in Table 1, which (1) are realizations of three prespecified ARMA orders, that is, (1, 1), (2, 1), and (1, 2) (one order for each set), and (2) differ from each other, within the same set, only in the coefficients' values and not in the order $(p, q)$. Two sample sizes are considered, that is, $n = 100, 200$. Formally, these sets are denoted by $\{J_1, J_2, J_3\}$ and supposed to belong to the order subspace $I$: $\{I \supseteq J_i;\ i = 1, 2, 3\}$. For each DGP $\in I$, 10 different coefficient vectors are specified, that is, $\{I \supseteq J \equiv (\phi_1, \theta_1), \dots, (\phi_{10}, \theta_{10})\}$. The validity of the presented method is assessed on a comparative basis, using the standard MSCE procedure as a benchmark. For the sake of concision, the selection frequencies are computed by averaging over all the DGPs belonging to either the same set $J_i$ or $I$. In practice, two indicators are employed, that is, the Percentage Average Discrepancy (PAD) and the Overall Percentage Average Discrepancy (OPAD), depending on whether only one set $J_i$ or the whole order subspace $I$ is considered. They are formalized in (26) and (27), where the symbol $|\cdot|$ denotes the cardinality of a set. In other words, the average percentage difference in the frequency of selection of the true model is used as a measure of the gain or loss generated by the bMSE procedure with regard to a single $J$ (26) or by averaging over the sets in $I$ (27). As already outlined, in analyzing aspect (ii) the attention is focused on the behavior of the proposed method for increasing sample sizes, that is, $n = 100, 200, 500, 1000$. In Table 4, the results obtained for the case of 4 DGPs, detailed in the same table, are given. In both (i) and (ii), for each DGP $\in (J_1, J_2, J_3)$, a set of 500 time series has been generated. Each time series has been artificially replicated $B = 125$ times using the bootstrap scheme outlined in Section 3.2 (the simulations have been implemented using the software R (8.1 version) and performed using the hardware resources of the University of California, San Diego; in particular, the computer server EULER (maintained by the Mathematical Department) and the supercomputer IBM-TERAGRID have been employed).
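The PAD and OPAD indicators described above can be illustrated with a small sketch. Since the formal definitions (26) and (27) are not reproduced here, the code assumes the plain reading given in the text: PAD is the average, over the DGPs of one set, of the difference between the bootstrap and the standard selection frequencies, and OPAD is the average of the PAD values over the sets. The function names and all the numbers below are hypothetical and are not taken from the tables of the paper.

```python
def pad(freq_msce, freq_bmse):
    """Average gain (in percentage points) of bMSE over MSCE within one set."""
    diffs = [b - s for s, b in zip(freq_msce, freq_bmse)]
    return sum(diffs) / len(diffs)


def opad(sets):
    """Average PAD over several sets, each given as (freq_msce, freq_bmse)."""
    pads = [pad(s, b) for s, b in sets]
    return sum(pads) / len(pads)


# Hypothetical selection frequencies (%) for two small sets of DGPs.
set_1 = ([40.0, 50.0], [50.0, 56.0])
set_2 = ([30.0], [34.0])
print(pad(*set_1))           # 8.0
print(opad([set_1, set_2]))  # 6.0
```

A positive value indicates a gain for the bootstrap procedure, a negative one a loss.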
The number of bootstrap replications $B$ employed has been chosen on an empirical basis, as the best compromise between the performance yielded by the method and computational time.
The parameter space of all the DGPs considered always satisfies the invertibility and stationarity conditions (see (6), (7)), whereas the maximum orders $P$ and $Q$ investigated have been kept fixed and low throughout the whole experiment ($P = Q = 3$), mainly to keep the overall computational time reasonably low. Although arbitrary, such a choice seems able to reflect the time series usually encountered in practice in a number of fields, such as economics, ecology, or hydrology. It should be emphasized, however, that in many other contexts (e.g., signal processing) higher orders must be considered.

The Experiments.
Other than on the pure ARMA signal, aspect (i) has been investigated in terms of the robustness shown against outliers and noisy conditions. In practice, the simulated DGPs are assumed to be (a) a pure ARMA process, (b) an ARMA process contaminated with outliers, and (c) an ARMA process contaminated with additive Gaussian noise. The first set of simulations (experiment a) is designed to give empirical evidence for the case of a noise-free, uncontaminated ARMA process of type (5). Experiment b is aimed at mimicking a situation where a given dynamic system is perturbed by shocks resulting in aberrant data, commonly referred to as outliers. As already pointed out, such abnormal observations might be generated by unpredictable phenomena (e.g., sudden events related to strikes, wars, and exceptional meteorological conditions) or by noise components, and they can lead to an inappropriate model identification as well as to biased inference, low quality forecasts, and, if seasonality is present in the data, poor decomposition. Outliers undoubtedly represent a serious issue in time series analysis; therefore, testing the degree of robustness of any procedure against such a potentially disruptive source of distortion is an important task. This topic has attracted much attention from both theoretical statisticians and practitioners. Detection of time series outliers was first studied by Fox [19], whose results have been extended to ARIMA models by Chang et al. [36]. Other references include [37][38][39]. In addition, outlier detection algorithms are more and more often provided in the form of efficient stand-alone routines (for example, the library TSO of the software "R," based on the procedure of Chen and Liu (1993) [37]) or included in the automatic model identification procedures provided by many software packages, as in the case of the statistical program TRAMO (Time series Regression with ARIMA noise, Missing observations, and Outliers [40]) or SCA (Scientific Computing Associates [41]). Following [19], two common types of outliers, that is, additive (AO) and innovational (IO), will be considered. As will be illustrated, the proposed identification procedure is unfortunately sensitive to outliers, as they cause, even though to different extents, a noticeable deterioration of its selection performance.
In more detail, the observed time series $y_t$ is considered as being affected by a certain number $m$ of deterministic shocks at different times $t = T_1, \dots, T_m$; that is,
\[ y_t = x_t + \sum_{j=1}^{m} h_j\, \xi_j(\mathcal{B})\, I_t(T_j), \]
where $x_t$ is the uncontaminated series of type (5), $h_j$ measures the impact of the outlier at time $t = T_j$, $I_t(T_j)$ is an indicator variable taking the value 1 for $t = T_j$ and 0 otherwise, and $\mathcal{B}$ is the backward shift operator. Outlier-induced dynamics are described by the function $\xi_j(\mathcal{B})$, which takes the form
\[ \xi_j(\mathcal{B}) = 1 \ \text{(AO)}, \qquad \xi_j(\mathcal{B}) = \theta(\mathcal{B})/\phi(\mathcal{B}) \ \text{(IO)}. \]
Acting as the onset of an external cause, outliers of the IO type affect the level of the series from the time they occur up to a certain lag, whose location depends on the memory mechanism encoded in the ARMA model. Their effect can even be temporally unbounded, for example, under ARIMA DGPs with a nonzero integration (stationarity-inducing) order $d$. Conversely, AOs affect only the level of the observations at the time of their occurrence (in this regard, typical examples are errors related to the recording process or to the measurement device employed). They are liable to corrupt the spectral representation of a process, which tends to resemble that of white noise, and in general the autocorrelations are pulled towards zero (their effect on the Autocorrelation Function (ACF) and on the spectral density level has been discussed in the literature; see, e.g., [42] and the references therein), so that meaningful conclusions based on these functions might be severely compromised, depending on the location, magnitude, and probability of occurrence of the outliers. On the other hand, the effects produced by IOs are usually less dramatic, as the ACF tends to maintain the pattern of the uncontaminated process $x_t$ and the spectral density $f_y(\omega)$, $\omega$ being the frequency, roughly shows a shape consistent with the one computed on $x_t$ (i.e., $f_y(\omega) \propto f_x(\omega)$). The outcomes of the simulations conducted are consistent with the above.
In the present study, IOs have been randomized and introduced according to a Bernoulli (BER) distribution with parameter 0.04. In order to better assess the sensitivity of the proposed procedure to outlying observations, experiment b has been conducted considering two different levels of standard errors, that is, 3 (experiment $b_1$) and 4 (experiment $b_2$). In $b_3$, AOs have been placed according to a prespecified scheme. The last experiment, that is, c, has been designed to mimic a situation characterized by low quality data, induced, for example, by phenomena like changes in survey methodologies (e.g., sampling design or data collecting procedures) or in the imputation techniques. Practically, a Gaussian-type noise $\nu_t$ is added to the output signal, so that $y_t = x_t + \nu_t$, $x_t$ being the pure ARMA process. Using (5), we have $y_t = [\theta(\mathcal{B})/\phi(\mathcal{B})]\,\varepsilon_t + \nu_t$, where $\varepsilon_t \sim \mathrm{nid}(0, \sigma^2_\varepsilon)$ and $\nu_t \sim \mathrm{nid}(0, \sigma^2_\nu)$ is additive noise, independent of $\varepsilon_t$. The variance of $\nu_t$, say $\sigma^2_\nu$, has been chosen according to the relation $\sigma^2_\nu = (1/10)\,\sigma^2(x_t)$.
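The three kinds of contamination can be mimicked with the following sketch, which uses only the Python standard library. Several choices are assumptions made purely for illustration: the ARMA(1,1) recursion is written as $x_t = \phi x_{t-1} + \varepsilon_t + \theta \varepsilon_{t-1}$, an innovational outlier is a positive shock of `io_size` innovation standard deviations added to the innovation, and the AO positions are supplied by the user, since the placement scheme of experiment $b_3$ is not reproduced above. All function names are hypothetical.

```python
import math
import random


def simulate_arma11(n, phi, theta, rng, sigma=1.0, io_prob=0.0, io_size=0.0, burn_in=200):
    """ARMA(1,1) path; with io_prob > 0, Bernoulli-timed shocks hit the innovations."""
    x, prev_x, prev_e = [], 0.0, 0.0
    for _ in range(n + burn_in):
        e = rng.gauss(0.0, sigma)
        if rng.random() < io_prob:  # innovational outlier (assumed positive shock)
            e += io_size * sigma
        cur = phi * prev_x + e + theta * prev_e
        x.append(cur)
        prev_x, prev_e = cur, e
    return x[burn_in:]


def add_additive_outliers(x, positions, size):
    """Additive outliers: shift only the observations at the given positions."""
    y = list(x)
    for t in positions:
        y[t] += size
    return y


def add_gaussian_noise(x, rng, ratio=0.1):
    """Add independent Gaussian noise with variance ratio * var(x)."""
    m = sum(x) / len(x)
    var_x = sum((v - m) ** 2 for v in x) / len(x)
    sd = math.sqrt(ratio * var_x)
    return [v + rng.gauss(0.0, sd) for v in x]


rng = random.Random(1)
clean = simulate_arma11(200, phi=0.5, theta=0.3, rng=rng)                  # experiment a
with_io = simulate_arma11(200, 0.5, 0.3, rng, io_prob=0.04, io_size=3.0)   # like b1
with_ao = add_additive_outliers(clean, positions=[50, 120], size=4.0)      # AO example
noisy = add_gaussian_noise(clean, rng)                                     # experiment c
print(len(clean), len(with_io), len(with_ao), len(noisy))
```

Because an innovational shock enters the recursion, it propagates to later observations, whereas an additive outlier leaves every other observation untouched.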

Results
The empirical results pertaining to aspect (i) are summarized in Tables 2 and 3 for the sample sizes of 100 and 200, respectively. By inspecting these tables it is possible to notice that, with the exception of experiment $b_3$, in all the other cases the bMSE procedure gives non-negligible [...]

Regarding the gains over the standard procedure, BIC and HQC now show PAD values above 10 (with a spike of 12.7 for HQC in the case of $J_2$), whereas the performance of the AIC (PAD above 9) is still good. A less satisfactory job is done by the FPE (PAD = 8.2). Finally, it is worth mentioning that the greatest gains pertain to the HQC, with PAD($J_1$) = 15.5 for a sample size of 100 and PAD($J_2$) = 12.7 for a sample size of 200.
Even though to different extents, both procedures are affected by the presence of outliers, especially in the case of the smaller sample size. However, as long as IOs (experiments $b_1$ and $b_2$) are involved, bMSE seems to do a good job in counteracting their adverse effects. In fact, for a sample size of 200, this procedure, applied to dimension consistent criteria, selects the right model always more than 50% of the times in experiment $b_2$ and approximately 55% of the times in experiment $b_1$. For this type of criteria, the average gain over the standard procedure is noticeable, especially in the case of experiment $b_1$ (OPAD = 6.7 for BIC and 7.9 for HQC). On the other hand, Shibata efficient criteria achieve less remarkable results, with PAD values ranging from 4.9 for the FPE (PAD($J_3$)) to 5.7 for the AIC (PAD($J_2$)). As expected, for a sample size of 100 the impact of the IOs is stronger: applied to Shibata efficient criteria, the bMSE procedure selects the right model on average approximately 43.4% of the time, with a minimum of 34.6% recorded for FPE in the case of $J_2$, whereas dimension consistent criteria show a bootstrap frequency of selection on average equal to 55.7%. The selection performance granted by the proposed method, even though still acceptable, tends to deteriorate to a greater extent in experiment $b_2$, especially with a sample size of 100: here the frequency of selection of the true model for Shibata efficient SC is around 40.1% versus 35% for the standard procedure, for a recorded OPAD amounting to 5.5 for the AIC and 4.7 for the FPE. Slightly better results are recorded for a sample size of 200, where the correct model has been identified by dimension consistent criteria 55.2% of the times (OPAD = 5.8%) versus 49.4% for the standard procedure. Experiment $b_3$ is where the proposed procedure breaks down and offers little or no improvement over the standard one. The most seriously affected selector is the FPE, which selects the correct model on average only 18.9% and 22.4% of the times, versus 21% and 25.2% recorded for the nonbootstrap counterpart, for sample sizes of 100 and 200, respectively.

Finally, the effect of injecting Gaussian noise into the output signal (experiment c) is commented on. Here, the performance of the method appears to be adequate: averaging over I, the bootstrap frequency of selection is 61.8% (54.6% for the nonbootstrap counterpart) for dimension consistent criteria (sample size of 200), with particularly interesting improvements over the standard procedure yielded by HQC, which shows OPAD values amounting to 10% and 7.5% for sample sizes of 100 and 200, respectively. The bootstrapped version of HQC performs consistently better than the other criteria: in fact, it chooses the correct model on average 63.2% and 56.6% of the times for sample sizes of 100 and 200, respectively. On the other hand, FPE detects the true model with the smallest probability, reaching an average frequency of selection of the true model of 39.8% (sample size of 100) and 45.1% (sample size of 200). Shibata efficient criteria also show the smallest gains over the standard procedure; for example, for a sample size of 100 the maximum PAD is equal to 7.1 and 6.4 for AIC and FPE, respectively (both values recorded for $J_2$), whereas dimension consistent criteria, for the same sample size, show a maximum PAD of 8.9 and 11.1, in the case of BIC ($J_1$) and HQC ($J_2$), respectively.
In the analysis of aspect (ii), the performances yielded by the two procedures, in terms of frequency of selection of the correct model, are considered for increasing sample sizes (100, 200, 500, 1000). The results for four different ARMA(2, 1) models, along with their details, are presented in Table 4. As can be seen by inspecting this table, all the SC under test exhibit a roughly similar pattern: for the small sample sizes, remarkable discrepancies in selection performance between the two methods are noticeable, whereas such discrepancies become less pronounced for a sample size of 500 and very small for a sample size of 1000. For example, considering all 4 DGPs, BIC shows a PAD ranging from 12.6 (series D) to 14.7 (series C) with a sample size of 100, whereas for a sample size of 1000, PAD is in the range 1.9-2.9 for the series B and A, respectively. For this sample size, the smallest PAD has been recorded [...]

[...] selection of the different tentative ARMA models. In practice, the winning models generated at each and every bootstrap replication are ranked according to their relative frequency of selection. In this way, our confidence in the bootstrap selection procedure is linked to the difference between the relative frequency of selection of the winner model (the one with the highest selection rate) and the frequencies achieved by its closest competitors. Ideal situations are characterized by a high rate of choices of the winner model, which drops sharply when considering the rest of the competition set. In such a case, we can be reasonably sure that the selected model is closer to the true order than the one found by using the standard MAICE procedure (clearly, if different models are selected). On the other hand, slight discrepancies (say 3-4%) between the winning model and the others should be regarded with suspicion and carefully evaluated on a case-by-case basis.
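The ranking of the bootstrap winners and the check on the margin between the winner and its closest competitor can be sketched as follows. The function names and the tallies in the usage example are hypothetical and purely illustrative (they are not results from the simulations), and the 4% threshold is simply the upper end of the "3-4%" range mentioned above.

```python
from collections import Counter

def rank_bootstrap_winners(winners):
    """Rank candidate orders by how often they win across bootstrap replications.

    `winners` holds one selected (p, q) order per bootstrap replication.
    Returns a list of (order, relative frequency), most frequent first.
    """
    counts = Counter(winners)
    total = len(winners)
    return [(order, c / total) for order, c in counts.most_common()]

def selection_margin(ranking):
    """Gap in relative frequency between the winner and its closest competitor."""
    if not ranking:
        return 0.0
    if len(ranking) == 1:
        return ranking[0][1]
    return ranking[0][1] - ranking[1][1]

# Made-up tallies over 100 replications, for illustration only.
winners = [(2, 1)] * 61 + [(1, 1)] * 22 + [(2, 2)] * 12 + [(3, 1)] * 5
ranking = rank_bootstrap_winners(winners)
margin = selection_margin(ranking)

# Flag selections whose margin falls below the (assumed) 4% threshold.
needs_review = margin < 0.04
```

A large margin corresponds to the "ideal situation" described in the text, whereas a flagged selection is one that should be evaluated on a case-by-case basis.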

Final Remarks and Future Directions
In this paper two pairs of selectors, differing in their derivation and properties, have been brought into a bootstrap framework with the purpose of enhancing their selection capabilities. A set of Monte Carlo-type experiments has been employed in order to assess the magnitude of the improvements achieved. The encouraging results obtained can be explained in terms of the reduction of uncertainty induced by the bootstrap approach. Identification procedures of the type MSCE, in fact, base the choice of the final model on the minimum value attained by a given SC, no matter how small the differences in the values shown by other competing models might be. When they are actually small, standard MSCE procedures are likely to introduce a significant amount of uncertainty in the selection procedure; that is, different order choices can be determined by small variations in the data set. The proposed procedure accounts for such a source of uncertainty by reestimating the competing models and recomputing the related SC value once for each bootstrap replication. In doing so, the identification procedure is based on as many different data replications, each of them embodying random variations. The improvements achieved by the proposed method in the case of IOs can also be explained in the light of the reduction of uncertainty. Basically, what the procedure does is to reallocate these outliers at each replication, so that the related selection procedure can control for such anomalous observations. On the other hand, the bMSE procedure breaks down in the case of AOs, probably because the employed maximum likelihood estimation procedure is carried out on the residuals, which are severely affected by these types of outliers.

Consistently with other Monte Carlo experiments, in the proposed simulations the best results are achieved by dimension consistent criteria, especially by BIC. However, two drawbacks affect this criterion: a tendency to select underfitted models and consistency achieved only in the case of very large samples [4], under the condition that the true model is included in the competition set. The last assumption implies the existence of a model able to provide a full explanation of reality and the "existence" of an analyst able to include it in the competition set. Unfortunately, even assuming finite dimensionality of real life problems, reality is still very complex, so that a large number of models are likely to be included in the competition set. As a result, selection uncertainty will rise. The superiority of BIC should also be reconsidered in the light of a different empirical framework, as a Monte Carlo experiment cannot capture the aforementioned problems: it is in fact characterized by the presence of the true model in the portfolio of candidate models. This appears unfair if we consider that criteria of the types AIC and FPE are designed to relax such a strong, and in practice unverifiable, assumption and that they enjoy the nice Shibata efficiency property. In addition, in order to keep the computational time acceptable, in Monte Carlo experiments the true DGP is generally of low order, so that the underestimation tendency of BIC is likely to be masked or, at least, to appear less serious. For these reasons, from a more operational point of view, it can be advisable to consider the indications provided by both AIC* and BIC*, which are the best selectors in their respective categories according to the simulation experiment. This is particularly true when the sample size is "small" and the information criteria, either considered in their standard or bootstrap form, tend to yield values close to each other for similar models. As a result, a significant amount of uncertainty can be introduced in the selection process.

Finally, as a future direction, it might be worth emphasizing that the purpose for which a given model is built, and thus identified, can be usefully considered in assessing a selector's performance. For instance, in many cases computational time is a critical factor, so that one might be willing to accept less accurate model outcomes by reducing the number of bootstrap replications. In fact, global fitting is not necessarily the only interesting feature one wants to look at, as a model might also be evaluated on the basis of its potential ability to solve the specific problems it has been built for. In this regard, selection procedures optimized on a case-by-case basis and implemented in the bootstrap world might result in a more efficient tool for a better understanding of reality.
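The general idea of reestimating the competing models and recomputing the SC on each bootstrap replication can be illustrated with the following minimal sketch. It is a deliberately simplified stand-in, not the procedure used in the paper: candidates are restricted to pure AR models, estimation is by Yule-Walker (Levinson-Durbin) rather than maximum likelihood, only a BIC-type criterion is computed, and all function names, the sieve order, and the simulated AR(2) series are hypothetical.

```python
import math
import random
from collections import Counter

def autocov(x, max_lag):
    """Biased sample autocovariances gamma(0..max_lag)."""
    n = len(x)
    m = sum(x) / n
    d = [v - m for v in x]
    return [sum(d[t] * d[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]

def levinson(gamma, p_max):
    """Yule-Walker AR(p) coefficients and innovation variances, p = 0..p_max."""
    coeffs, sigma2 = [[]], [gamma[0]]
    for p in range(1, p_max + 1):
        prev = coeffs[-1]
        num = gamma[p] - sum(prev[j] * gamma[p - 1 - j] for j in range(p - 1))
        k = num / sigma2[-1]                      # reflection coefficient
        coeffs.append([prev[j] - k * prev[p - 2 - j] for j in range(p - 1)] + [k])
        sigma2.append(sigma2[-1] * (1.0 - k * k))
    return coeffs, sigma2

def select_order(x, p_max):
    """Pick the AR order minimizing n*log(sigma2_p) + p*log(n) (BIC-type)."""
    n = len(x)
    _, s2 = levinson(autocov(x, p_max), p_max)
    bic = [n * math.log(s2[p]) + p * math.log(n) for p in range(p_max + 1)]
    return min(range(p_max + 1), key=bic.__getitem__)

def sieve_bootstrap_order(x, p_max, p_sieve, n_boot, rng, burn=100):
    """Tally the order selected on each AR-sieve bootstrap replicate of x."""
    n = len(x)
    m = sum(x) / n
    d = [v - m for v in x]
    phi = levinson(autocov(x, p_sieve), p_sieve)[0][p_sieve]
    resid = [d[t] - sum(phi[j] * d[t - 1 - j] for j in range(p_sieve))
             for t in range(p_sieve, n)]
    r_mean = sum(resid) / len(resid)
    resid = [r - r_mean for r in resid]           # center the residuals
    winners = Counter()
    for _rep in range(n_boot):
        xb = [0.0] * p_sieve
        for _step in range(burn + n):             # regenerate from resampled residuals
            e = rng.choice(resid)
            xb.append(sum(phi[j] * xb[-1 - j] for j in range(p_sieve)) + e)
        winners[select_order(xb[-n:], p_max)] += 1
    return winners

# Hypothetical AR(2) series used only to exercise the sketch.
rng = random.Random(42)
x = [0.0, 0.0]
for _ in range(300):
    x.append(0.6 * x[-1] - 0.3 * x[-2] + rng.gauss(0.0, 1.0))
x = x[-200:]

winners = sieve_bootstrap_order(x, p_max=5, p_sieve=8, n_boot=50, rng=rng)
selected = winners.most_common(1)[0][0]           # most frequently chosen order
```

Reducing `n_boot` shortens the run at the cost of noisier selection frequencies, which is the trade-off between computational time and accuracy mentioned above.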

a: pure process (no contamination); b: contaminated with outliers of the type IO (experiments $b_1$, $b_2$) and AO (experiment $b_3$); c: contaminated with Gaussian additive noise.

Table 2:
Frequency of selection of the true model in the nonbootstrap and bootstrap world for a sample size of 100.

Table 3:
Frequency of selection of the true model in the nonbootstrap and bootstrap world for a sample size of 200.

[...] PAD between 8.4 for $J_2$ and 9.6 for $J_3$). As expected, for a sample size of 200 both methods show an increasing average frequency of selection of the correct model for all the SC: averaging over I and all the SC, the values of 55.4% and 65.6% have been recorded for the nonbootstrap and the bootstrap procedure, respectively.

Table 4:
Frequency of selection of the true model in the nonbootstrap and bootstrap world, for different sample sizes.