Complexity: Frontiers in Data-Driven Methods for Understanding, Prediction, and Control of Complex Systems 2022
On the Development of Information Theoretic Model Selection Criteria for the Analysis of Experimental Data

It can be argued that the identification of sound mathematical models is the ultimate goal of any scientific endeavour. On the other hand, particularly in the investigation of complex systems and nonlinear phenomena, discriminating between alternative models can be a very challenging task. Quite sophisticated model selection criteria are available, but their deployment in practice can be problematic. In this work, the Akaike Information Criterion is reformulated with the help of purely information theoretic quantities, namely, the Gibbs-Shannon entropy and the Mutual Information. Systematic numerical tests have demonstrated the improved performance of the proposed upgrades, including increased robustness against noise and the presence of outliers. The same modifications can also be implemented to rewrite Bayesian statistical criteria, such as the Schwarz criterion, in terms of information-theoretic quantities, proving the generality of the approach and the validity of the underlying assumptions.


Introduction to Nonfrequentist Model Selection Criteria
The promised land of modern scientific enterprises is often the formulation of robust and generally applicable mathematical models [1,2]. The ultimate validation of any model resides in the comparison with the results of experiments or observations. In the last decades, enormous quantities of data have become available in many fields of science and engineering.
Statistical inference has therefore progressively moved to centre stage. The older frequentist techniques, based on traditional significance level criteria, have been complemented by a series of Bayesian and information-theoretic criteria, in many respects more suited to managing large amounts of information.
One of the most popular model selection criteria (MSC) is the Akaike Information Criterion (AIC) [3]. The AIC can be derived from the Kullback-Leibler divergence and can be interpreted as the loss of information associated with the adoption of a model different from the exact one that generated the data. The basic idea underlying the AIC is that the less information a model loses, the higher its quality. The theoretical derivation of the AIC gives the unbiased form of the criterion [4]:
$$\mathrm{AIC} = -2\ln L + 2k, \tag{1}$$
where L is the likelihood of the data given the model and k is the number of estimated parameters in the model. The AIC is a metric that is minimised by the best model as a compromise between the goodness of fit (the first term) and complexity (the second term). The general formulation of the AIC is not always easy to apply in practice, as can be appreciated by a simple inspection of (1). First, in many instances, it can be impossible to reliably calculate the likelihood. Moreover, it is well known that the number of parameters is a poor quantifier of a model's complexity and is not inherently an information-theoretic indicator. The more practical expression of the AIC, very often the one used in practice, is even more distant from its information theoretic origin, as discussed in the next section.
The first quantity proposed to improve the AIC is the Gibbs-Shannon entropy H:
$$H = -\sum_i p_i \log p_i. \tag{2}$$
The higher the value of H, the more uniform the corresponding probability distribution function (whose values are indicated with $p_i$). The Gibbs-Shannon entropy can significantly improve the quantification of the model complexity, as discussed in detail in Section 2.2. The second quantity, used in the rest of the work, is the mutual information, MI:
$$\mathrm{MI} = \sum_x \sum_y P_{xy} \ln \frac{P_{xy}}{P_x P_y}, \tag{3}$$
where $P_{xy}$ is the joint pdf of the random variables X and Y, and $P_x$ and $P_y$ are the corresponding marginal pdfs.
Mutual Information can play a fundamental role in determining the goodness of fit of the models, as discussed in Section 2.1.
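The two quantities introduced above can be estimated from data in a few lines. The following is a minimal sketch, assuming a simple histogram (plug-in) estimator with 16 bins per axis; the function names, the bin count, and the synthetic samples are illustrative choices and are not taken from the paper.

```python
import numpy as np


def shannon_entropy(p):
    """Gibbs-Shannon entropy H = -sum_i p_i ln p_i (in nats)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with p_i = 0 contribute nothing
    return float(-np.sum(p * np.log(p)))


def mutual_information(x, y, bins=16):
    """Plug-in (histogram) estimate of MI(X, Y) in nats."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()            # joint pdf on the bin grid
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, bins)
    product = p_x @ p_y                     # P_x * P_y on the same grid
    mask = p_xy > 0                         # skip empty cells
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / product[mask])))


rng = np.random.default_rng(0)
x = rng.normal(size=5000)
noise = rng.normal(size=5000)

h_uniform = shannon_entropy(np.full(8, 1 / 8))        # flat pdf: H = ln 8
h_peaked = shannon_entropy([0.97, 0.01, 0.01, 0.01])  # concentrated pdf: low H
mi_dependent = mutual_information(x, x**2 + 0.1 * noise)  # nonlinear dependence
mi_independent = mutual_information(x, noise)             # independent samples
```

The flat distribution gives the larger entropy, and the nonlinearly dependent pair gives the larger mutual information even though its linear correlation is close to zero. Histogram estimators are biased upward for finite samples, so the estimate for independent variables is small but not exactly zero.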
With regard to the organization of the paper, the next section introduces the rationale and details of the proposed information-theoretic upgrades of the Akaike Information Criterion. Section 3 is devoted to a simple but challenging didactic case, meant to illustrate the effects of the modifications with an easy-to-grasp example. The family of functions and the types of noise statistics, implemented to perform a series of systematic tests, are summarised in Section 4. The results of the aforementioned tests are exemplified in Section 5 with the help of some representative cases. The extension of the approach to the Bayesian Selection criterion is covered in Section 6 before the conclusions and lines of future developments are discussed in the final section of the paper.

Model Selection Formulated in terms of Information Theoretic Quantities
Among the many indicators for identifying the "best model" among a set of candidates, the Akaike Information Criterion (AIC) was originally conceived as a purely information theoretic criterion. Unfortunately, the original formulation of the AIC is typically problematic to implement in practice, particularly in applications involving complex systems and nonlinear phenomena. Both terms in the AIC present significant issues [5][6][7]. To bypass the practical difficulties of calculating the likelihood, the most commonly invoked assumption is the strong one that the data are independently sampled from identical normal distributions. If this hypothesis, traditionally called iid, is valid, it can be demonstrated that the AIC can be written (up to an immaterial additive constant depending only on the number of entries in the database) as follows:
$$\mathrm{AIC} = n\ln(\mathrm{MSE}) + 2k. \tag{4}$$
In (4), formally derived in [4], the Mean Squared Error (MSE) is calculated from the residuals, i.e., the differences between the data and the estimates of the model; n indicates the number of entries in the database.
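For reference, the step from the likelihood-based criterion, $-2\ln L + 2k$, to its MSE-based form can be sketched as follows. The sketch assumes residuals $r_i$ that are iid Gaussian with zero mean and a variance $\sigma^2$ estimated by maximum likelihood; the symbols $r_i$ and $\sigma$ are introduced here for illustration only.

```latex
\begin{align*}
\ln L &= -\frac{n}{2}\ln\!\left(2\pi\sigma^{2}\right)
         - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n} r_i^{2},
\qquad
\hat{\sigma}^{2} = \frac{1}{n}\sum_{i=1}^{n} r_i^{2} = \mathrm{MSE},\\
\ln L\,\Big|_{\sigma^{2}=\hat{\sigma}^{2}}
      &= -\frac{n}{2}\ln\!\left(2\pi\,\mathrm{MSE}\right) - \frac{n}{2},\\
-2\ln L + 2k &= n\ln(\mathrm{MSE}) + 2k + n\left(\ln 2\pi + 1\right).
\end{align*}
```

The last term depends only on n, so it is the same for all candidate models fitted to the same database and does not affect their ranking.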
Equation (4) is certainly the most widely used form of the AIC. On the other hand, as can be easily appreciated by inspection, the criterion is now expressed in terms of quantities that are no longer information theoretic. Moreover, all the statistical information content, originally in the likelihood, is reduced to the mere MSE of the residuals. The first obvious question is whether some additional statistical information about the distribution of the residuals could be taken into account to improve the discriminatory capability of the criterion. The practical relevance of this issue is significant also because, in many applications, the assumptions behind (4) are clearly violated. In real life, indeed, the noise can have a non-Gaussian distribution, memory effects can be important, and a significant number of outliers can be unavoidable. How to improve the model selection criteria in this respect is the subject of Section 2.1.
The second term in (4) is also problematic because it is well known that the number of parameters is a rather poor indicator of the complexity of a model. More sophisticated quantifiers exist, such as the VC dimension [8] and the Rademacher complexity [9], but they are often impossible to calculate for most practical functions. An alternative, information theoretic and computationally simple way to calculate a model's complexity is the subject of Section 2.2.
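As a concrete illustration of the MSE-based criterion discussed above, the sketch below scores polynomial fits of increasing degree with the common form n ln(MSE) + 2k. The data, the noise level, and the helper name `aic_mse` are hypothetical and serve only to show how the two terms trade off.

```python
import numpy as np


def aic_mse(y, y_pred, k):
    """Practical AIC under the iid Gaussian assumption: n ln(MSE) + 2k."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y.size
    mse = np.mean((y - y_pred) ** 2)
    return float(n * np.log(mse) + 2 * k)


rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 200)
# Synthetic data: a quadratic with additive Gaussian noise (illustrative values)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)

scores = {}
for degree in (1, 2, 6):
    coeffs = np.polyfit(x, y, degree)
    # A polynomial of this degree has degree + 1 fitted coefficients
    scores[degree] = aic_mse(y, np.polyval(coeffs, x), k=degree + 1)

best_degree = min(scores, key=scores.get)  # candidate with the lowest AIC
```

The underfitting linear model is penalised through the first term, while the degree-6 polynomial can only recover its extra penalty of 8 if the additional coefficients reduce the MSE enough. Nothing in this score looks at the shape of the residual distribution, which is the limitation addressed in Section 2.1.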

Expressing the Goodness of Fit in terms of Mutual Information.
The main idea informing one of the AIC upgrades proposed in this work is based on the observation that the better a model, the more similar the residuals are to the noise affecting the measurements. In the case of a perfect model, the residuals should present exactly the same distribution as the noise. Assuming that the noise is not correlated with the measurements, which is legitimate in most practical applications, this consideration can be quantified mathematically by calculating the mutual information between the model predictions and the residuals, MI_MRes:
$$\mathrm{MI}_{\mathrm{MRes}} = \mathrm{MI}(y_{\mathrm{mod}}, y_{\mathrm{res}}). \tag{5}$$
The AIC can therefore be rewritten as follows:
Conceptually, (6) is to be preferred to (4) for various reasons. First, it formulates the criterion in terms of an information theoretic quantity, the mutual information. Moreover, it retains much more statistical information about the model and the residuals. At the same time, MI_MRes also takes into account nonlinear correlations and does not make any "a priori" assumption about the statistics of the noise or the presence of outliers. Consequently, as shown by numerical tests, AIC_MI is a much more general and sensitive model selection criterion than the original AIC.
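The behaviour of the mutual information between model predictions and residuals can be checked on a toy case. The following sketch assumes a histogram estimator with 16 bins and a hypothetical quadratic data set; it computes only MI(y_mod, y_res) for a misspecified and a correctly specified model and does not attempt to reproduce the full AIC_MI expression.

```python
import numpy as np


def mutual_information(a, b, bins=16):
    """Plug-in (histogram) estimate of MI(A, B) in nats."""
    counts, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = counts / counts.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    product = p_a @ p_b
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / product[mask])))


rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 4000)
# Synthetic data generated by a quadratic law plus Gaussian noise (illustrative)
y = 3.0 * x**2 + rng.normal(scale=0.05, size=x.size)

mi_mres = {}
for name, degree in (("linear", 1), ("quadratic", 2)):
    y_mod = np.polyval(np.polyfit(x, y, degree), x)  # model predictions
    y_res = y - y_mod                                # residuals
    mi_mres[name] = mutual_information(y_mod, y_res)
```

The residuals of the linear fit retain the curvature that the model missed, so they remain strongly dependent on the predictions and the estimated mutual information is large. For the quadratic fit the residuals are essentially the injected noise, and the estimate drops to the small positive bias of the histogram estimator.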

Expressing the Complexity in terms of the Shannon Entropy.
The other weakness in the original definition of AIC is certainly the quantification of complexity. Indeed, the simple number of parameters in a model is a very poor indicator of its flexibility and in particular of its potential to overfit (see Section 3). A possible alternative relies on the traditional idea that complexity is the middle ground between randomness and determinism. According to this view, complete randomness and perfect determinism are considered less complex than a combination of the two. This approach to complexity has a long pedigree and can be traced back to the interpretation of information as uncertainty, the concept at the basis of information theory [10]. A possible way of expressing this idea in mathematical terms is the following complexity measure C[X]:
where H is the usual Shannon entropy and D is the distance from a uniform distribution,
where, with the usual notation, n is the number of entries in the database. The distance D reduces the estimated complexity of models whose predictions are uniform. The entropy reduces the estimated complexity of models whose outputs are concentrated on a few well-defined values. Conceptually, the implementation of this quantification of complexity is quite simple. The pdf of the model predictions can be inserted in (7) to obtain a simple indicator, implementing the aforementioned information theoretic interpretation of complexity.
The most delicate aspect of (7) is the choice of the exponents α and β because they contribute significantly to determining the trade-off between entropy and distance. To this end, the increments of the model predictions have been calculated as follows:
The moving averages (Mov) of the mean and standard deviation of the squared increments are good indicators of the flexibility of a model and therefore of its potential to overfit. The normalized versions of these quantities are defined in (10). The ratio of the two averages calculated in (10) is
The parameter MF increases for functions that have stronger variations in the domain of interest and can therefore be considered more complex. Indeed, these more nervous functions would have a higher potential for overfitting the data by following the noise. This is the interpretation of the quantity MF, which is used to determine the exponents α and β.
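The qualitative behaviour described above (low complexity for both uniform and sharply concentrated distributions) can be reproduced with a product of entropy and distance. The sketch below assumes the form C = H^α · D^β with D = Σ_i (p_i − 1/N)², where N is the number of histogram bins; this functional form, the default exponents α = β = 1, and the function names are assumptions made for illustration and may differ from the exact definitions in (7) and (8).

```python
import numpy as np


def entropy_distance_complexity(p, alpha=1.0, beta=1.0):
    """Assumed complexity C = H**alpha * D**beta of a discrete pdf p.

    H is the Shannon entropy (nats); D is the squared Euclidean distance
    of p from the uniform distribution over the same number of states.
    """
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                      # normalise defensively
    nonzero = p[p > 0]
    entropy = max(0.0, float(-np.sum(nonzero * np.log(nonzero))))
    distance = float(np.sum((p - 1.0 / p.size) ** 2))
    return entropy**alpha * distance**beta


def prediction_pdf(y_mod, bins=20):
    """Histogram estimate of the pdf of the model predictions."""
    counts, _ = np.histogram(y_mod, bins=bins)
    return counts / counts.sum()


p_uniform = np.full(10, 0.1)                          # maximal randomness
p_delta = np.array([1.0] + [0.0] * 9)                 # perfect determinism
p_mixed = np.array([0.5, 0.3, 0.1, 0.1] + [0.0] * 6)  # in between

c_uniform = entropy_distance_complexity(p_uniform)
c_delta = entropy_distance_complexity(p_delta)
c_mixed = entropy_distance_complexity(p_mixed)

# Complexity of the predictions of a hypothetical model on a grid
x = np.linspace(0.0, 1.0, 2000)
c_model = entropy_distance_complexity(prediction_pdf(np.sin(6.0 * x)))
```

The uniform case vanishes because D is zero, the single-valued case vanishes because H is zero, and only the intermediate distribution receives a positive score. How α and β are derived from the flexibility indicator MF is not reproduced here.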
Finally, the proposed final versions of the AIC expressed only in terms of the mentioned information theoretic quantities read

A Didactic Example to Illustrate the Main Characteristics of AIC_MICx
To illustrate the potential and the meaning of the proposed upgrades of the AIC, an academic but challenging example, already discussed in detail in the literature [11], is described in this section. To this end, it is assumed that the actual data is generated with a polynomial function depending on 5 parameters.
The equations considered as possible candidate models for the data generated with (14) are reported in Table 1.
A comment about the sinusoidal functions is in order. These functions can be tuned to fit perfectly the data generated with (14) by increasing their frequency. This fact can be appreciated by inspection of the first two plots of Figure 1. If there is any noise added to the data, the sinusoidal functions, given their higher flexibility, can fit the data even better than the original equation generating it.
On the other hand, they depend only on two parameters, their amplitude and frequency. Therefore, the traditional version of the AIC would tend to prefer a well-adjusted sinusoidal model (because it would achieve lower values of both terms of the indicator). The proposed version AIC_MICx, on the contrary, manages to properly identify the right model, as shown in Figure 2. The plots report the differences between the AIC and AIC_MICx of the candidate models and those of the reference, the equation used to generate the data.
When these differences are positive, the reference model is the preferred one; the negative cases indicate that the criteria would have selected the wrong model. From the plots of Figure 2, it appears quite clearly that the traditional AIC would have preferred the sinusoids (particularly model 1) for various numbers of entries, whereas the AIC_MICx always identifies the reference model as the right one.
This is achieved by taking into account the distributions of the
Table 1: The four candidate models to fit the data generated by (14).
[Figure: ... (14). Red: the models of Table 1. From top left to bottom right, models 1 to 4.]

The Main Functional Classes and Noise Statistics for Practical Applications
To assess the performance of the alternative AIC model selection criterion proposed in Section 2, a series of systematic numerical tests have been performed. The analysis is focussed mainly on four classes of models, which cover those most widely used in practice: polynomials, power laws, power laws multiplied by a squashing term, and exponential functions. In the rest of the paper, only the results for bidimensional functions (of the form z = f(x, y)) are discussed, because they lend themselves to clear visualization, which helps illustrate the properties of the criterion. The extension to a larger number of variables is straightforward and does not pose any conceptual difficulty. Therefore, the considerations and conclusions reported are to be assumed valid also in higher dimensions. For the reader's convenience, the mathematical form of the aforementioned models is reported in the left column of Table 2.
Significant attention has been devoted to noise statistics. Three of the most relevant distribution functions have been tested: Gaussian, uniform, and multi-Gaussian [12]. Again for the reader's convenience, the mathematical formulation of these types of noise is summarised in the right column of Table 2, together with the parameter values valid for the runs reported in the rest of the paper. Since in practice the presence of outliers in the data very often cannot be excluded, the robustness of the proposed upgrade of the AIC in this respect has also been verified. This has been achieved by randomly adding to the synthetic data values sampled from a Gaussian distribution of small variance but nonzero mean (see the entry called Asymmetric noise in Table 2 for a precise mathematical definition).
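The noise families described above can be generated along the following lines. This is a minimal sketch: the function names, the numerical values, and the reading of the asymmetric noise as a two-component Gaussian mixture (a zero-mean core plus a shifted outlier component, with `ratio` as the fraction of core samples) are assumptions for illustration, not the exact settings of Table 2.

```python
import numpy as np


def gaussian_noise(rng, n, sigma):
    """Zero-mean Gaussian noise with standard deviation sigma."""
    return rng.normal(0.0, sigma, n)


def uniform_noise(rng, n, half_width):
    """Uniform noise on the interval [-half_width, +half_width]."""
    return rng.uniform(-half_width, half_width, n)


def asymmetric_noise(rng, n, sigma1, mu2, sigma2, ratio):
    """Mixture: a fraction `ratio` from N(0, sigma1), the rest from N(mu2, sigma2).

    Returns the noise samples and a boolean mask marking the outliers.
    """
    is_core = rng.random(n) < ratio
    core = rng.normal(0.0, sigma1, n)
    outliers = rng.normal(mu2, sigma2, n)
    return np.where(is_core, core, outliers), ~is_core


rng = np.random.default_rng(3)
n = 100_000
g = gaussian_noise(rng, n, sigma=10.0)
u = uniform_noise(rng, n, half_width=10.0)
a, is_outlier = asymmetric_noise(
    rng, n, sigma1=10.0, mu2=40.0, sigma2=3.0, ratio=0.9
)

outlier_fraction = is_outlier.mean()  # close to 1 - ratio
mean_shift = a.mean()                 # close to (1 - ratio) * mu2
```

The nonzero mean of the outlier component shifts the overall noise mean away from zero, which is precisely the kind of violation of the iid Gaussian hypothesis that the MSE-based criteria do not account for.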

Representative Results of Numerical Tests
As mentioned, a systematic series of tests with synthetic data has been performed to assess the competitive advantage of the proposed version of the AIC. All the combinations of cases summarised in Section 4 have been investigated. The new version AIC_MICx has always proved to have better discriminatory capabilities than the traditional AIC. In practice, this means that AIC_MICx at least provides better separation between the right model (the one used to generate the data) and its wrong competitors.
This has proved to occur for any type of function, noise statistics, and level of outliers. In general, the more severe the conditions (the higher the level of noise or outliers), the better the AIC_MICx performance compared to the traditional AIC. In some cases, such as the one already discussed in Section 3, only the AIC_MICx can converge on the right model.
In the rest of this section, some relevant examples of the performed tests are reported. They are representative of the vast majority of the systematic investigations performed.
In the first case discussed in the following, the model generating the data consists of a power law multiplied by a squashing term. The importance and popularity of power laws are difficult to overstate. Self-similarity can result in many quantities presenting a power law trend. Power laws are also particularly important for the investigation of scalings. On the other hand, power law monomials can be too rigid and the multiplication by a squashing factor can provide some additional flexibility. The function implemented to generate the synthetic data is reported in the last row of Table 3. The other rows of the same table report the alternative models. The synthetic data generated with the reference model of Table 3 is shown in Figure 3, together with the functions constituting the alternative models. Two different levels of Gaussian additive noise are shown, corresponding to a standard deviation of 15% and 30% of the synthetic data averaged amplitude. As can be derived by simple inspection of the plots, AIC_MICx not only increases the separation between the models, compared to the traditional AIC, but also allows identifying the equation generating the data. Indeed, whereas for some numbers of entries and 30% of added noise the AIC of the candidate models can be lower than that of the reference one, the AIC_MICx always identifies the model generating the data as the best; this can be seen by noticing that the values of the AIC_MICx differences, with respect to the best model, are always positive.
The discriminatory power of AIC_MICx is even higher in the case of high noise. This fact is exemplified by the following case, in which the generating model belongs to the class of exponential functions. The alternative models are reported in Table 4, whose last row reports the equation used to generate the data.
In addition to Gaussian noise, with a standard deviation of 30% and 60% of the synthetic data averaged amplitude, some concentrated high noise has also been added, according to the relations specified in the last row of Table 2. The better performance of AIC_MICx compared to the traditional AIC can be easily recognised by inspection of the plots in Figure 4. Indeed, the separation between the alternative models and the right one is much larger for the AIC_MICx than for the traditional AIC (note also the different scales of the plots in Figure 4).

Table 2: The main families of functions tested and the statistics of the additive noise.
Families of functions:
Polynomials: $y = a_0 x^{b_0} + a_1 x^{b_1} + \dots + a_n x^{b_n}$
Additive noise applied:
Uniform noise: $\mu = \pm 10$ until $\pm 50$
Asymmetric noise: $N_1$: $\mu_1 = 0$ and $\sigma_1 = 10$; $N_2$: $\mu_2 \neq 0$ and $\sigma_2 = 30$, with $\mu_2 = 2(\sigma_1 + \sigma_2)/100$, $f(x)$; ratio between $N_1$ and $N_2$: 0.75 until 0.95

Extension to Bayesian Model Selection
It is worth noting that the same modifications proposed for the AIC can also be applied to the Bayesian information criterion (BIC) [13]. BIC is based on Bayesian theory and has been designed to maximize the posterior probability of a model given the data. BIC is again a cost function and is therefore also an indicator to be minimised. The BIC's most general form is
$$\mathrm{BIC} = -2\ln L + k\ln n,$$
where again L is the likelihood of the data given the model, k is the number of estimated parameters in the model, and n is the number of entries in the database. BIC has the same structural form as the AIC and is affected by the same difficulties in practical applications, in particular the challenges posed by the calculation of the likelihood and the quantification of the model complexity.
[Figure: ... Table 2. The coloured curves are the various candidate models and the dash-dotted green curve is the reference one. The bottom plots compare the AIC and AIC_MICx results in terms of the difference with respect to the exact reference model.]
[Figure: ... Table 4. The coloured curves are the various candidate models and the dash-dotted green curve is the reference one. The bottom plots compare the AIC and AIC_MICx results in terms of the difference with respect to the exact reference model.]

Assumptions similar to the ones leading to (4) allow expressing the BIC criterion as follows:
$$\mathrm{BIC} = n\ln(\mathrm{MSE}) + k\ln n.$$
Even if the conceptual origins of BIC are different, the proposed changes have the same effects: they improve BIC's discriminatory power by including more statistical information about the residuals and by better quantifying the models' complexity. In full analogy to (13), the final upgraded version of the BIC criterion is
The tests of the AIC have also been performed for the BIC and they produce basically the same results. The discriminatory capability of BIC_MICx is clearly superior to that of the original version of the indicator, as can be seen in the plots of Appendix B. Of course, given that BIC is based on Bayesian statistics, the argument that the implemented upgrades improve the coherence with information-theoretic definitions and assumptions cannot be made. On the other hand, the fact that the proposed modifications also improve the quality of a Bayesian type of selection criterion increases the confidence in the validity of the ideas that have led to them.
Figure 6: Plots of the mutual information between the models and the residuals for the models of Table 1 in Section 3.
[Figure: ... The plots show the indicator difference between the candidate models and the reference one; therefore negative values indicate that the corresponding indicator would have reached the wrong conclusion about the model to select.]
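The MSE-based BIC and the difference plots used throughout the paper (indicator of a candidate minus indicator of the reference model) can be mimicked with a short script. The sketch below assumes the common practical forms n ln(MSE) + 2k and n ln(MSE) + k ln n; the data set, the polynomial candidates, and the helper names are hypothetical, and the upgraded BIC_MICx is not reproduced.

```python
import numpy as np


def aic_mse(y, y_pred, k):
    """Practical AIC: n ln(MSE) + 2k."""
    n = y.size
    return float(n * np.log(np.mean((y - y_pred) ** 2)) + 2 * k)


def bic_mse(y, y_pred, k):
    """Practical BIC: n ln(MSE) + k ln(n)."""
    n = y.size
    return float(n * np.log(np.mean((y - y_pred) ** 2)) + k * np.log(n))


rng = np.random.default_rng(4)
x = np.linspace(-1.0, 1.0, 200)
# Synthetic data from a quadratic reference model (illustrative values)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)

# Reference model: the quadratic that generated the data (3 coefficients)
y_ref = np.polyval(np.polyfit(x, y, 2), x)
bic_ref = bic_mse(y, y_ref, k=3)

# Differences with respect to the reference; positive values favour it
delta_bic = {}
for degree in (1, 6):
    y_hat = np.polyval(np.polyfit(x, y, degree), x)
    delta_bic[degree] = bic_mse(y, y_hat, k=degree + 1) - bic_ref

# The two criteria differ only in the complexity penalty
penalty_gap = bic_ref - aic_mse(y, y_ref, k=3)  # equals 3 * (ln n - 2)
```

For n larger than about 7 entries, ln n exceeds 2, so the BIC penalises each additional parameter more heavily than the AIC. Both criteria nevertheless share the same goodness-of-fit term and therefore the same sensitivity to violations of the iid Gaussian hypothesis.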

Conclusions
The Akaike Information Criterion was conceived to minimise the out-of-sample error and is based on information theory. Statistical models are indeed developed to represent the process that generated the data, and the AIC estimates the relative amount of information lost by a given model. On this basis, it is assumed that the better a model, the less information it loses. Unfortunately, the deployment of AIC is problematic because its practical versions are affected by significant limitations. Indeed, the most widely used version of AIC is valid under the assumption that the data are affected by Gaussian, zero-mean additive noise.
These hypotheses have to be accepted because, in most practical applications, it is often very difficult, if not impossible, to compute the likelihood of the data given the model. If the processes generating the data do not satisfy these assumptions, the traditional versions of the AIC can become poorly effective or even misleading.
On the other hand, other information theoretic quantities can be implemented to improve the discrimination potential of the criterion. In particular, the mutual information between the model estimates and the residuals can help reward the goodness of fit. The entropy in its turn can be used to quantify the model complexity. With these upgrades, the proposed version of the AIC has always proved to have much better convergence properties than the traditional version in all respects, including robustness against noise and zero-sum outliers. This has occurred in all the numerical tests performed, some of which consist of very challenging selection tasks, given the fact that some candidate models assume values very similar to the right one in the range covered by the data. The proposed improvements have an equally positive impact on the other criteria of the AIC family, such as TIC and AICc [4]. The extension of the same concepts to the Bayesian information criterion proves the soundness of the basic rationale behind the proposed modifications. The good performance in the presence of nonnormal noise distributions is particularly encouraging because model assessment in such situations has not yet received a lot of attention in the literature. Indeed, only a few publications have addressed the fact that many existing model selection criteria such as the BIC [...] conditional mean and variance of the response are dependent [14]. Synergies with other formulations of the complexity term would also be very interesting from the methodological point of view [15].

Given the quite positive results obtained with synthetic data, proving their better discriminatory capability, the proposed new versions of the selection criteria are expected to become useful in various fields. They are already being deployed for the investigation of complex systems, ranging from high-temperature plasmas [16][17][18][19][20][21][22][23] to remote sensing of the atmosphere and radar [24][25][26]. Another promising application seems to be in support of the regularization of recent tomographic inversion methods [27][28][29]. In these fields, Dimensional Analysis (DA) is a methodology widely used to identify key variables based on physical dimensions. Even if it has been granted some attention recently, in most literature DA is treated as merely a preprocessing tool, creating various statistical problems [30]. The upgrades of the criteria proposed in this work could hopefully help in devising an appropriate statistical methodology that integrates DA and model selection.