Sparse causality network retrieval from short time series

We investigate how efficiently a known underlying sparse causality structure of a simulated multivariate linear process can be retrieved from the analysis of short time series. Causality is quantified from conditional transfer entropy and the network is constructed by retaining only the statistically validated contributions. We compare results from three methodologies: two commonly used regularization methods, Glasso and ridge, and a newly introduced technique, LoGo, based on the combination of information filtering networks and graphical modelling. For these three methodologies we explore the regions of time series lengths and model parameters where a significant fraction of true causality links is retrieved. We conclude that, when time series are short, with lengths smaller than the number of variables, sparse models are better suited to uncover true causality links, with LoGo retrieving the true causality network more accurately than Glasso and ridge.


Introduction
Establishing causal relations between variables from observation of their behaviour in time is central to scientific investigation, and it is at the core of data science, where causal relations are the basis for the construction of useful models and tools capable of prediction. The capability to predict (future) outcomes from the analytics of (past) input data is crucial in modelling, and it should be the main property to take into consideration in model selection, when the validity and meaningfulness of a model is assessed. From a high-level perspective, we can say that the whole scientific method is constructed around a circular procedure consisting of observation, modelling, prediction, and testing. In such a procedure, the accuracy of prediction is used as a selection tool between models. In addition, the principle of parsimony favours the simplest model when two models have similar predictive power.
The scientific method is the rational process that, for the last 400 years, has mostly contributed to scientific discoveries, technological progress, and the advancement of human knowledge. Machine learning and data science are nowadays pursuing the ambition to mechanize this discovery process by feeding machines with data and using different methodologies to build systems able to make models and predictions by themselves. However, the automation of this process requires identifying, without the help of human intuition, the relevant variables and the relations between these variables out of a large quantity of data. Predictive models are methodologies, systems, or equations which identify and make use of such relations between sets of variables in such a way that the knowledge of one set of variables provides information about the values of the other set. This problem is intrinsically high-dimensional, with many input and output data. Any model that aims to explain the underlying system will involve a number of elements which must be of the order of magnitude of the number of relevant relations between the system's variables. In complex systems, such as financial markets or the brain, prediction is probabilistic in nature and modelling concerns inferring the probability of the values of a set of variables given the values of another set. This requires the estimation of the joint probability of all variables in the system and, in complex systems, the number of variables with potential macroscopic effects on the whole system is very large. This poses a great challenge for model construction/selection and parameter estimation because the number of relations between variables scales with, at least, the square of the number of variables but, for a given fixed observation window, the amount of information gathered from such variables scales, at most, linearly with the number of variables [1,2].
For instance, a linear model for a system with N variables requires the estimation from observation of N(N + 1)/2 parameters (the distinct elements of the covariance matrix). In order to estimate O(N²) parameters one needs a comparable number of observations, requiring time series of length T ∼ N or larger to gather a sufficient information content from a number of observations which scales as N × T ∼ O(N²). However, the number of parameters in the model can be reduced by considering only O(N) out of the O(N²) relations between the variables, reducing in this way the required time series length to O(1). Such models with reduced numbers of parameters are referred to in the literature as sparse models. In this paper we consider two instances of linear sparse modelling: Glasso [3], which penalizes nonzero parameters by introducing an ℓ1-norm penalization, and LoGo [4], which reduces the inference network to an O(N) number of links selected by using information filtering networks [5-7]. The results from these two sparse models are compared with the ℓ2-norm penalization (nonsparse) ridge model [8,9]. This paper is an exploratory attempt to map the parameter regions of time series length, number of variables, penalization parameters, and kinds of models to define the boundaries where probabilistic models can be reasonably constructed from the analytics of observation data. In particular, we investigate empirically, by means of a linear autoregressive model with sparse inference structure, the true causality link retrieval performances in the region of short time series and large numbers of variables, which is the most critical, and the most interesting, region in many practical cases.
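As a back-of-the-envelope illustration of this scaling argument (the numbers N = 100 and T = 50 below are hypothetical, chosen only to exemplify the short-time-series regime discussed in this paper):

```python
# Dense versus sparse parameter counting for an N-variable linear model.
N = 100                            # number of variables (illustrative)
T = 50                             # time series length: short regime, T < N

dense_params = N * (N + 1) // 2    # distinct entries of the covariance matrix
sparse_params = 3 * N              # a sparse model keeping only O(N) relations
observations = N * T               # total scalar observations available

print(dense_params)   # 5050: more parameters than the 5000 observations
print(sparse_params)  # 300: comfortably fewer than the observations
print(observations)   # 5000
```

With these illustrative numbers the dense model has more parameters than there are observations, while the sparse model remains well determined.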
Causality is defined in the information theoretic sense as a significant reduction of uncertainty over the present values of a given variable provided by the knowledge of the past values of another variable, obtained in excess of the knowledge provided by the past of the variable itself and, in the conditional case, the past of all other variables [10]. We measure such information by using transfer entropy and, within the present linear modelling, this coincides with the concept of Granger causality and conditional Granger causality [11]. The use of transfer entropy has the advantage of being a concept directly extensible to nonlinear modelling. However, nonlinearity is not tackled within the present paper. Linear models with multivariate normal distributions have the unique advantage that causality and partial correlations are directly linked, largely simplifying the computation of transfer entropy, and directly mapping the problem into the sparse inverse covariance problem [3,4].
Results are reported for artificially generated time series from an autoregressive model of N = 100 variables and time series lengths between 10 and 20,000 data points. Robustness of the results has been verified over a wider range of N from 20 to 200 variables. Our results demonstrate that sparse models are superior in retrieving the true causality structure for short time series. Interestingly, this is despite considerable inaccuracies in the inference network of these sparse models. We indeed observe that statistical validation of causality is crucial in identifying the true causal links, and this identification is highly enhanced in sparse models.
The paper is structured as follows. In Section 2 we briefly review the basic concepts of mutual information and conditional transfer entropy and their estimation from data that will then be used in the rest of the paper. We also introduce the concepts of sparse inverse covariance, inference network, and causality networks. Section 3 concerns the retrieval of the causality network from the computation and statistical validation of conditional transfer entropy. Results are reported in Section 4, where the retrieval of the true causality network from the analytics of time series from an autoregressive process of N = 100 variables is discussed. Conclusions and perspectives are given in Section 5.

Estimation of Conditional Transfer Entropy from Data
In this paper causality is quantified by means of statistically validated transfer entropy. Transfer entropy TE(Z_i → Z_j) quantifies the amount of uncertainty on a random variable, Z_j, explained by the past of another variable, Z_i, conditioned to the knowledge about the past of Z_j itself. Conditional transfer entropy, TE(Z_i → Z_j | W), includes an extra conditioning also on a set of variables W. These quantities are introduced in detail in Appendix A (see also [11-13]). Let us here just report the main expression for the conditional transfer entropy that we shall use in this paper:

TE(Z_i → Z_j | W) = H(Z_{j,t} | Z^lag_{j,t}, W) − H(Z_{j,t} | Z^lag_{i,t}, Z^lag_{j,t}, W),   (1)

where H(⋅ | ⋅) is the conditional entropy and Z_{j,t} is a random variable at time t, whereas Z^lag_{i,t} = {Z_{i,t−1}, . . . , Z_{i,t−p}} is the lagged set of random variable "i" considering previous times t − 1 ⋅ ⋅ ⋅ t − p, and W collects all other variables and their lags (see Appendix A, (A.5)).
In this paper we use Shannon entropy and restrict ourselves to linear modelling in a multivariate normal setting (see Appendix B). In this context the conditional transfer entropy can be expressed in terms of the determinants of conditional covariances det(Σ(⋅ | ⋅)) (see (B.5) in Appendix B):

TE(Z_i → Z_j | W) = (1/2) log [det Σ(Z_{j,t} | Z^lag_{j,t}, W) / det Σ(Z_{j,t} | Z^lag_{i,t}, Z^lag_{j,t}, W)].   (2)

Conditional covariances can be conveniently computed in terms of the inverse covariance of the whole set of variables, J = Σ^−1. Such inverse covariance matrix, J, represents the structure of conditional dependencies among all couples of variables in the system and their lags. Each subpart of J is associated with the conditional covariances of the variables in that part with respect to all others. In terms of J, the expression for the conditional transfer entropy becomes

TE(Z_i → Z_j | W) = (1/2) log [det(J_11) / det(J_11 − J_12 J_22^−1 J_21)],   (3)

where the indices "1" and "2" refer to submatrices of J, respectively, associated with the variables Z_{j,t} and Z^lag_{i,t}.
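For the multivariate normal case, (2) can be evaluated directly from a joint covariance matrix through Schur complements. The following is a minimal sketch of this computation (the function names are ours; the index lists select the rows/columns of each block):

```python
import numpy as np

def cond_cov(S, a, b):
    """Conditional covariance Sigma(a | b) = S_aa - S_ab S_bb^{-1} S_ba."""
    S_aa = S[np.ix_(a, a)]
    S_ab = S[np.ix_(a, b)]
    S_bb = S[np.ix_(b, b)]
    return S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)

def gaussian_cond_te(S, j_now, i_lag, j_lag, w=()):
    """Conditional transfer entropy TE(Z_i -> Z_j | W) for Gaussian variables:
    half the log-ratio of the two conditional covariance determinants in (2).
    S is the joint covariance of present values, lags, and conditioning set."""
    num = cond_cov(S, list(j_now), list(j_lag) + list(w))
    den = cond_cov(S, list(j_now), list(i_lag) + list(j_lag) + list(w))
    _, logdet_num = np.linalg.slogdet(num)
    _, logdet_den = np.linalg.slogdet(den)
    return 0.5 * (logdet_num - logdet_den)
```

With w left empty this returns the unconditioned transfer entropy; passing the indices of all remaining variables and lags gives the fully conditioned quantity used in the rest of the paper.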

Causality and Inference Networks
The inverse covariance J, also known as the precision matrix, represents the structure of conditional dependencies. If we interpret the structure of J as a network, where nodes are the variables and nonzero entries correspond to edges of the network, then any two subsets of nodes that are not directly connected by one or more edges are conditionally independent. The conditioning is with respect to all other variables. Links between variables at different lags are associated with causality, with direction going from larger to smaller lags. The network therefore becomes a directed graph. In such a network, entropies can be associated with nodes, conditional mutual information can be associated with edges between variables with the same lag, and conditional transfer entropy can be associated with edges between variables with different lags. A nice property of this mapping of information measures onto directed networks is that there is a simple way to aggregate information which is directly associated with topological properties of the network. Entropy, mutual information, and transfer entropies can be defined for any aggregated subset of nodes, with their values directly associated with the presence, direction, and weight of network edges between these subparts.
Nonzero transfer entropies indicating, for instance, variable i causing variable j are associated with some nonzero entries in the inverse covariance matrix J between the lagged variable i (i.e., Z_{i,t−τ}, with τ > 0) and the variable j (i.e., Z_{j,t}). In linear models these nonzero entries define the estimated inference network. However, not all edges in this inference network correspond to transfer entropies that are significantly different from zero. To extract the structure of the causality network we shall retain only the edges in the inference network which correspond to statistically validated transfer entropies.
Conditioning eliminates the effect of the other variables, retaining only the exclusive contribution from the two variables under consideration. This should provide estimations of transfer entropy that are less affected by spurious effects from other variables. On the other hand, conditioning in itself can introduce spurious effects; indeed, two independent variables can become dependent due to conditioning [13]. In this paper we explore two extreme conditioning cases: (i) conditioned on all other variables and their lags; (ii) unconditioned.
In principle, one would like to identify the maximal value of TE(Z_i → Z_j | W) over all lags and all possible conditionings W. However, the use of multiple lags and conditionings increases the dimensionality of the problem, making the estimation of transfer entropy very hard, especially when only a limited amount of measurements is available (i.e., short time series). This is because the calculation of the conditional covariance requires the estimation of the inverse covariance of the whole set of variables and such an estimation is strongly affected by noise and uncertainty. Therefore, a standard approach is to reduce the number of variables and lags to keep dimensionality low and to estimate conditional covariances with appropriate penalizers [3,8,9,14]. An alternative approach is to invert the covariance matrix only locally, on low dimensional subsets of variables selected by using information filtering networks [5-7], and then reconstruct the global inversion by means of the LoGo approach [4]. Let us here briefly account for these two approaches.

Penalized Inversions
The estimation of the inverse covariance is a challenging task to which a large body of literature has been dedicated [14]. From an intuitive perspective, one can say that the problem lies in the fact that uncertainty is associated with nearly zero eigenvalues of the covariance matrix. Variations in these small eigenvalues have relatively small effects on the entries of the covariance matrix itself but have major effects on the estimation of its inverse. Indeed, small fluctuations of small eigenvalues can yield unbounded contributions to the inverse. A way to cure such near-singular matrices is by adding finite positive terms to the diagonal, which move the eigenvalues away from zero: Ĵ = ((1 − λ)S + λI)^−1, where S = cov(Z) is the covariance matrix of the set of variables Z ∈ R^q estimated from data and I ∈ R^{q×q} is the identity matrix (with q = N × (p + 1); see later). This is what is performed in the so-called ridge regression [9], also known as the shrinkage mean-square-error estimator [15] or Tikhonov regularization [8]. The effect of the additional positive diagonal elements is equivalent to computing the inverse covariance which maximizes the log-likelihood log det(Ĵ) − tr(SĴ) − λ‖Ĵ‖₂, where the last term penalizes large off-diagonal coefficients in the inverse covariance with an ℓ2-norm penalization [16]. The regularizer parameter λ tunes the strength of this penalization. This regularization is very simple and effective. However, with this method insignificant elements in the precision matrix are penalized toward small values but they are never set to zero. By using instead the ℓ1-norm penalization, log det(Ĵ) − tr(SĴ) − λ‖Ĵ‖₁, insignificant elements are forced to zero, leading to a sparse inverse covariance. This is the so-called lasso regularization [3,14,17]. The advantage of a sparse inverse covariance consists in the provision of a network representing a conditional dependency structure.
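A minimal sketch of the ridge estimator described above (the function name is ours; for the ℓ1 case a widely used implementation is GraphicalLasso in scikit-learn):

```python
import numpy as np

def ridge_precision(S, lam):
    """l2-regularized inverse covariance J_hat = ((1 - lam) S + lam I)^{-1}.
    The lam * I term lifts near-zero eigenvalues away from zero, so the
    inverse stays bounded even when S is singular (T < q); off-diagonal
    entries are shrunk toward zero but never set exactly to zero."""
    q = S.shape[0]
    return np.linalg.inv((1.0 - lam) * S + lam * np.eye(q))
```

Even when the sample covariance is rank deficient (fewer observations than variables), the regularized matrix is positive definite and hence invertible for any 0 < lam ≤ 1.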
Indeed, let us recall that in linear models zero entries in the inverse covariance are associated with couples of conditionally independent variables.

Information Filtering Network Approach: LoGo
An alternative approach to obtain a sparse inverse covariance is by using information filtering networks generated by keeping the elements that contribute most to the covariance by means of a greedy process. This approach, named LoGo, proceeds by first constructing a chordal information filtering graph such as a Maximum Spanning Tree (MST) [18,19] or a Triangulated Maximally Filtered Graph (TMFG) [7]. These graphs are built by retaining edges that maximally contribute to a given gain function which, in this case, is the log-likelihood or, more simply, the sum of the squared correlation coefficients [5-7]. Then, this chordal structure is interpreted as the inference structure of the joint probability distribution function, with nonzero conditional dependency only between variables that are directly connected by an edge. On this structure the sparse inverse covariance is computed in such a way as to preserve the values of the correlation coefficients between couples of variables that are directly connected by an information filtering graph edge. The main advantage of this approach is that inversion is performed at the local level on small subsets of variables and then the global inverse is reconstructed by joining the local parts through the information filtering network. Because of this Local-Global construction this method is named LoGo. It has been shown that the LoGo method yields statistically significant sparse precision matrices that outperform the ones with the same sparsity computed with the lasso method [4].
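The Local-Global construction can be illustrated on the simplest chordal graph, a spanning tree, where the cliques are the edges and the separators are the shared nodes. The sketch below is our own minimal implementation for trees only (the TMFG case of [4] works analogously on 4-cliques and triangular separators): it sums the embedded local inverses of the clique covariances and subtracts those of the separators.

```python
import numpy as np

def logo_precision_tree(S, edges):
    """Sparse inverse covariance from local inversions on a tree-shaped
    information filtering network: add the embedded inverse of each 2x2
    edge (clique) covariance, then subtract each separator node, which is
    shared by (degree - 1) edges, via its inverse marginal variance."""
    q = S.shape[0]
    J = np.zeros((q, q))
    degree = np.zeros(q, dtype=int)
    for i, j in edges:
        idx = [i, j]
        J[np.ix_(idx, idx)] += np.linalg.inv(S[np.ix_(idx, idx)])
        degree[i] += 1
        degree[j] += 1
    for v in range(q):
        if degree[v] > 1:
            J[v, v] -= (degree[v] - 1) / S[v, v]
    return J
```

When the true dependency structure is Markov with respect to the chosen tree, this clique-separator sum reproduces the exact global inverse; only 2 × 2 and 1 × 1 inversions are ever performed.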

Simulated Multivariate Autoregressive Linear Process
In order to test whether causality measures can retrieve the true causality network in the underlying process, we generated artificial multivariate normal time series with a known sparse causality structure by using the following autoregressive multivariate linear process [20]:

Z_t = Σ_{τ=1}^{p} A_τ Z_{t−τ} + U_t,   (4)

where A_τ ∈ R^{N×N} are matrices with random entries drawn from a normal distribution. The matrices are made upper triangular (diagonal included) by putting to zero all lower diagonal coefficients and made sparse by keeping only an O(N) total number of entries different from zero in the upper and diagonal part. U_t ∈ R^N are random normally distributed uncorrelated variables. This process produces autocorrelated, cross-correlated, and causally dependent time series. We chose it because it is among the simplest processes that can generate this kind of structured dataset. The dependency and causality structure is determined by the nonzero entries of the matrices A_τ. The upper-triangular structure of these matrices simplifies the causality structure, eliminating causality cycles. Their sparsity reduces dependency and causality interactions among variables. The process is made autoregressive and stationary by keeping the eigenvalues of A_τ all smaller than one in absolute value. For the tests we used p = 5, N = 100, and sparsity is enforced to have a number of links approximately equal to N. We reconstructed the network from time series of different lengths between 5 and 20,000 points. To test statistical reliability the process was repeated 100 times, each time with a different set of randomly generated matrices A_τ. We verified that the results are robust and consistent by varying sample sizes from N = 20 to 200, by changing sparsity with a number of links from 0.5N to 5N, and for p from 1 to 10. We verified that the presence of isolated nodes or highly connected hub nodes does not affect results significantly.
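A sketch of such a generator follows (our own illustrative implementation, not the authors' code: the names, the shrink-until-stationary loop, and the burn-in length are assumptions; stationarity is checked here through the companion-matrix spectral radius, a standard sufficient condition for VAR processes):

```python
import numpy as np

def companion_radius(A):
    """Spectral radius of the VAR(p) companion matrix built from A[tau]."""
    p, N, _ = A.shape
    comp = np.zeros((N * p, N * p))
    comp[:N, :] = np.concatenate(list(A), axis=1)
    if p > 1:
        comp[N:, :-N] = np.eye(N * (p - 1))
    return np.abs(np.linalg.eigvals(comp)).max()

def generate_sparse_var(N=100, p=5, T=1000, n_links=None, burn=100, seed=0):
    """Z_t = sum_tau A_tau Z_{t-tau} + U_t with sparse upper-triangular A_tau,
    so the true causality network is acyclic with ~n_links directed links."""
    rng = np.random.default_rng(seed)
    n_links = N if n_links is None else n_links
    A = np.zeros((p, N, N))
    rows, cols = np.triu_indices(N)            # upper triangle incl. diagonal
    for k in rng.choice(rows.size, size=n_links, replace=False):
        A[rng.integers(p), rows[k], cols[k]] = rng.standard_normal()
    while companion_radius(A) >= 0.95:         # shrink until stationary
        A *= 0.9
    Z = np.zeros((T + burn, N))
    for t in range(p, T + burn):
        Z[t] = sum(A[tau] @ Z[t - 1 - tau] for tau in range(p))
        Z[t] += rng.standard_normal(N)         # uncorrelated normal innovations
    return Z[burn:], A
```

The nonzero pattern of the returned A matrices is the ground-truth causality network against which the retrieved networks are compared below.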

Causality and Inference Network Retrieval
We tested the agreement between the causality structure of the underlying process and the one inferred from the analysis of time series of different lengths T, Z_t ∈ R^N with t = 1 ⋅ ⋅ ⋅ T, generated by using (4). We have N different variables and p lags. The dimensionality of the problem is therefore q = N × (p + 1) variables at all lags including zero.
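Concretely, the q-dimensional object whose covariance is then inverted can be built by stacking each observation with its p lags (a minimal sketch; the function name is ours):

```python
import numpy as np

def lagged_matrix(Z, p):
    """Augment each row of Z (shape T x N) with its p lags, producing a
    (T - p) x q matrix with q = N * (p + 1): row t holds
    [Z_t, Z_{t-1}, ..., Z_{t-p}]."""
    T, N = Z.shape
    blocks = [Z[p - tau: T - tau] for tau in range(p + 1)]  # tau = 0 .. p
    return np.concatenate(blocks, axis=1)
```

The sample covariance of this matrix is the q × q matrix S whose regularized or LoGo inverse J enters (3).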
We retrieved the inference network by looking at all couples of variables, with indices i ∈ [1, . . . , N] and j ∈ [1, . . . , N], which have nonzero entries in the inverse covariance matrix J between the lagged set of j and the nonlagged i. Clearly, for the ridge method the result is a complete graph, but for Glasso and LoGo the results are sparse networks with edges corresponding to nonzero conditional transfer entropies between variables i and j. For the LoGo calculation we make use of the regularizer parameter λ as a local shrinkage factor to improve the local inversion of the covariance of the 4-cliques and triangular separators (see [4]).
We then estimated the transfer entropy between couples of variables, i → j, conditioned on all other variables in the system. This is obtained by estimating the inverse covariance matrix Ĵ (indicated with a "hat" symbol) by using (C.7) (see Appendix C.2), with W conditioning on all variables Z except Z_{j,t} and Z^lag_{i,t}. Finally, to retrieve the causality network we retained the network of statistically validated conditional transfer entropies only. Statistical validation was performed as follows.

Statistical Validation of Causality
Statistical validation has been performed with a likelihood ratio statistical test. Indeed, entropy and likelihood are intimately related: entropy measures uncertainty and likelihood measures the reduction in uncertainty provided by the model. Specifically, the Shannon entropy H(Z) associated with a set of random variables measures their uncertainty (see (B.1)), whereas the log-likelihood for the model p̂(Z) associated with a set of T independent observations Ẑ_t is log L = Σ_t log p̂(Ẑ_t). Note that T is the total available number of observations which, in practice, is the length of the time series minus the maximum number of lags. It is evident from these expressions that entropy and log-likelihood are strictly related, though this link might be nontrivial. In the case of linear modelling this connection is quite evident because the entropy estimate is H = (1/2)(−log|Ĵ| + q log(2π) + q) and the log-likelihood is log L = (T/2)(log|Ĵ| − Tr(Σ̂Ĵ) − q log(2π)). For the three models we study in this paper we have Tr(Σ̂Ĵ) = q and therefore the log-likelihood is equal to T times the opposite of the entropy estimate. Transfer entropy and conditional transfer entropy are differences between two entropies: the one of a set of variables conditioned on their own past minus the one conditioned also on the past of another variable. This, in turn, is the difference of the unitary log-likelihoods of two models and therefore it is the logarithm of a likelihood ratio. As Wilks pointed out [21,22], the null distribution of such a statistic is asymptotically quite universal. Following the likelihood ratio formalism, the probability of observing a transfer entropy larger than the estimated TE under the null hypothesis is given by pV ∼ 1 − χ²cdf(x, d), with x ≃ 2 T TE and χ²cdf the chi-square cumulative distribution function with d degrees of freedom, which are the difference between the number of parameters in the two models. In our case the two models have, respectively, (N² + 1) and (N² + 1) + (p) parameters.
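In practice the validation step reduces to a chi-square tail probability (a sketch assuming SciPy is available; the function name is ours):

```python
from scipy.stats import chi2

def te_pvalue(te, T, d):
    """Wilks likelihood-ratio p-value for an estimated transfer entropy:
    under the no-causality null, 2 * T * TE is asymptotically chi-square
    distributed with d degrees of freedom (the number of extra parameters,
    here the p lags of the candidate source variable)."""
    return chi2.sf(2.0 * T * te, df=d)
```

A link i → j is retained in the causality network only when te_pvalue falls below the chosen threshold pV.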

Statistical Validation of the Network
The procedures described in the previous two subsections produce the inference network and the causality network. Such networks are then compared with the known network of true causalities in the underlying process, which is defined by the nonzero elements in the matrices A_τ (see (4)). The overlap between the retrieved links in the inference or causality networks and the ones in the true network underlying the process is an indication of the discovery of true causality relations. However, some discoveries can be obtained just by chance, or some methodologies might discover more links only because they produce denser networks. We therefore tested the hypothesis that the matching links in the retrieved networks are not obtained just by chance by computing the null-hypothesis probability of obtaining the same or a larger number of matches randomly. Such probability is given by the conjugate cumulative hypergeometric distribution for a number equal to or larger than TP of "true positive" matching causality links between an inferred network of n links and a process network of L true causality links, from a population of N² − N possible links:

p(X ≥ TP | n, L, N) = Σ_{k=TP}^{min(n,L)} [C(L, k) C(N² − N − L, n − k)] / C(N² − N, n).   (6)

Small values of p indicate that the retrieved TP links out of n are unlikely to be found by randomly picking n edges from N² − N possibilities. Note that in the confusion matrix notation [23] we have n = TP + FP and L = TP + FN, with TP the number of true positives, FP the number of false positives, FN the number of false negatives, and TN the number of true negatives. The total number of "negatives" (unlinked couples of vertices) in the true model is instead N² − N − L = FP + TN.
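This tail probability is directly available from the hypergeometric survival function (a sketch assuming SciPy; the function name is ours):

```python
from scipy.stats import hypergeom

def overlap_pvalue(TP, n, L, N):
    """p(X >= TP | n, L, N): chance that n links drawn uniformly at random
    from the N^2 - N possible directed links match at least TP of the L
    true causality links of the process."""
    M = N * N - N                          # population of candidate links
    return hypergeom.sf(TP - 1, M, L, n)   # P(X >= TP) = P(X > TP - 1)
```

Small return values reject the null hypothesis that the TP matches arose from random edge picking.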

Computation and Validation of Conditional Transfer Entropies
By using (4) we generated 100 multivariate autoregressive processes with known causality structures. We here report results for N = 100, but analogous outcomes were observed for dimensionalities between N = 20 and 200 variables. Conditional transfer entropies between all couples of variables, conditioned on all other variables in the system, were computed by estimating the inverse covariances using the three methodologies, ridge, Glasso, and LoGo, and applying (3). Conditional transfer entropies were statistically validated with respect to the null hypothesis (no causality) at a pV = 1% value. Results for the Bonferroni adjusted value at 1% (i.e., pV = 0.01/(N² − N) ∼ 10⁻⁶ for N = 100) are reported in Appendix E. We also tested other values of pV from 10⁻⁸ to 0.1, obtaining consistent results. We observe that small pV reduces the number of validated causality links but increases the chance that these links match the true network in the process. Conversely, large values of pV increase the number of mismatched links but also of true link discoveries. Let us note that here we use pV as a thresholding criterion and we are not claiming any evidence of statistical significance of the causality. We assess the goodness of this choice a posteriori by comparing the resulting causality network with the known causality network of the process.

Statistical Significance of the Recovered Causality Network
Results for the contour frontiers of significant causality links for the three models are reported in Figure 1 for a range of time series lengths between 10 and 20,000 and regularizer parameters λ between 10⁻⁸ and 0.5. Statistical significance is computed by using (6) and results are reported for both p < 0.05 and p < 10⁻⁸ (continuous and dotted lines, respectively). As one can see, the overall behaviours of the three methodologies are little affected by the threshold on p. We observe that the LoGo significance region extends well beyond the Glasso and ridge regions.
The value of the regularizer parameter λ affects the results of the three models in different ways. Glasso has a region in the λ-T/N plane where it performs best (in this case it appears to be around λ ≃ 0.1 and T/N ≃ 2.5). Ridge appears instead to be little affected, with mostly constant performances across the range of λ. LoGo has best performances for small, even infinitesimal, values of λ. Indeed, differently from Glasso, in this case λ does not control sparsity but instead acts as a local shrinkage parameter. Very small values can be useful in some particular cases to reduce the effect of noise, but large values have only the effect of reducing information.

Figure 1: Continuous lines indicate p < 0.05 (see (6)) and dotted lines indicate p < 10⁻⁸ significance levels (p is averaged over 100 processes). The plots refer to N = 100 and report the region where the causality networks are all significant for 100 processes.
Beyond verifying that the retrieved networks are statistically significant, we also measured the fraction of true links retrieved. Indeed, given that the true underlying causality network is sparse, one could do significantly better than random by discovering only a few true positives. Instead, from any practical perspective we aim to discover a significant fraction of the edges. Figure 2 shows that the fraction of causality links correctly discovered (true positives, TP) with respect to the total number of causality links in the process (L) is indeed large, reaching values above 50%. This is the so-called true positive rate or sensitivity, which takes values between 0 (no links discovered) and 1 (all links discovered). Reported values are averages over 100 processes. We observe that the region with discovery of 10% or more true causality links greatly overlaps with the statistical validity region of Figure 1. We note that when the observation time becomes long, N/T ⪅ 0.25, the ridge discovery rate becomes larger than LoGo's. However, its statistical significance is still inferior to LoGo's; indeed the ridge network becomes dense when T increases and the larger discovery rate of true causality links is also accompanied by a larger rate of false links incorrectly identified (false positives, FP).
The fraction of false positives with respect to the total number of causality links in the process (FP/L) is reported in Table 1 together with the true positive rate for comparison. This number can reach values larger than one because the process is sparse and there are many more possibilities to randomly choose false links than true links. Note that this is not the false positive rate, which instead is FP/(N² − N − L) and cannot be larger than one. Consistent with Figure 1, we observe that, for short time series, up to T/N ∼ 0.5, the sparse models have a better capability to identify true causality links and to discard the false ones, with LoGo being superior to Glasso. Remarkably, LoGo can identify a significant fraction of causality links already from time series with lengths of only 30 data points. p-value significance, reported in the table with one or two stars, indicates when all values of p(X ≥ TP | n, L, N) from (6) for all 100 processes have, respectively, p < 0.05 or p < 10⁻⁸. Again we observe that the LoGo discovery rate region extends well beyond the Glasso and ridge regions.
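The table entries can be reproduced from the true and retrieved adjacency matrices as follows (a sketch; the function name is ours, and self-links are excluded as in the N² − N population above):

```python
import numpy as np

def tp_fp_fractions(true_adj, found_adj):
    """Return (TP/L, FP/L): true and false positives relative to the L true
    links, as reported in the tables; FP/L may therefore exceed one."""
    true_links = true_adj.astype(bool)
    found = found_adj.astype(bool)
    np.fill_diagonal(true_links, False)   # self-links are excluded
    np.fill_diagonal(found, False)
    L = true_links.sum()
    TP = (true_links & found).sum()       # retrieved links that are true
    FP = (~true_links & found).sum()      # retrieved links that are not
    return TP / L, FP / L
```

The conventional false positive rate used in the ROC plot below is instead FP divided by the number of true negatives plus false positives, N² − N − L.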

Inference Network
We have so far empirically demonstrated that a significant part of the true causality network can be retrieved from the statistically validated network of conditional transfer entropies. Results depend on the choice of the threshold value of pV at which the null hypothesis is rejected. We observed that lower pV is associated with networks with fewer true positives but also fewer false positives, and conversely larger pV yields causality networks with more true positives but also more false positives. Let us here report on the extreme case of the inference network, which contains all causality channels with no validation. For the ridge model this network is the complete graph with all variables connected to each other. Instead, for Glasso and LoGo the inference network is sparse. Results are summarized in Table 2. In terms of true positive rates we first notice that they are all larger than the ones in Table 1. Indeed, the network of statistically validated conditional transfer entropies is a subnetwork of the inference network. On the other hand, we notice that the false positive fraction is much larger than the ones in Table 1. The ridge network has a false positive rate of 1 because, in this case, the inference network is the complete graph.
Glasso also contains a very large number of false positives, reaching even 55 times the number of links in the true network and getting to lower fractions only for long time series with T > 1000. These numbers also indicate that Glasso networks are not sparse. LoGo has a sparser and more significant inference network with smaller fractions of false positives, which stay below 5, which is anyway a large number of misclassifications. Nonetheless, we observe that, despite such large fractions of FP, the discovered true positives are statistically significant.

Unconditioned Transfer Entropy Network
We finally tested whether conditioning on the past of all other variables gives better causality network retrieval than the unconditioned case. Here, the transfer entropy TE(Z_i → Z_j) is computed by using (3) with W = ∅, the empty set. For the ridge case this unconditional transfer entropy depends only on the time series Z_{j,t}, {Z_{j,t−1}, . . . , Z_{j,t−p}}, and {Z_{i,t−1}, . . . , Z_{i,t−p}} (with p = 5 in this case). The Glasso and LoGo cases are instead hybrid, because a conditional dependency has already been introduced in the sparse structure of the inverse covariance J (the inference network). Results are reported in Table 3, where we observe that these networks retrieve a larger quantity of true positives than the ones constructed from conditional transfer entropy. However, the fraction of false positives is also larger than the ones in Table 1, although it is smaller than what is observed for the inference network in Table 2. Overall, these results indicate that conditioning is effective in discarding false positives.

Summary of All Results in a Single ROC Plot.
In summary, we have investigated the networks associated with conditional transfer entropy, unconditional transfer entropy, and inference for three models under a range of different parameters. In the previous subsections we have provided some comparisons between the performances of the three models in different ranges of parameters. Let us here provide a summary of all results within a single ROC plot [23]. Figure 3 reports the ROC values for each model and each parameter combination: the x-axis is the false positive rate (fraction of absent links incorrectly retrieved) and the y-axis is the true positive rate (fraction of true links retrieved). Each point is an average over 100 processes. Points above the diagonal line are associated with relatively well performing models, with the upper left corner representing the point where a model correctly discovers all true causality links without any false positive. The plot reports with large symbols the cases for λ = 0.1 and validation at p-value p_V = 0.01, which can be compared with the data reported in the tables. We note that, by construction, LoGo models are sparse (with a number of edges ∼ 3N [4]). This restrains the ROC results to the left-hand side of the plot. For this reason an expanded view of the figure is also proposed with the x-axis rescaled. Note that this ROC curve is provided as a visual tool for intuitive comparison between models. Overall, from Tables 1, 2, and 3 and Figure 3, we conclude that all models obtain better results for longer time series and that conditional transfer entropy outperforms its unconditional counterpart (see Tables 1 and 3 and the two separate ROC figures for conditional and unconditional transfer entropies reported in Figure 5 in Appendix D). In the range of short time series, when T ≤ N, which is of interest for this paper, LoGo is the best performing model, with better performances achieved for small λ ≲ 10^−4 and validation with small p-values p_V ≲ 10^−4. LoGo is consistently the best performing model also for longer time series, up to lengths of T ∼ 1000.
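A minimal sketch of how each (model, parameter-combination) pair reduces to a single ROC point, and of the above-the-diagonal criterion, is given below. The numerical values are invented for illustration only and are not taken from Figure 3 or the tables.

```python
# Each (model, parameter-combination) pair contributes one ROC point:
# (false positive rate, true positive rate), averaged over the simulated
# processes. The values below are invented for illustration only.
points = {
    "ridge":  (0.60, 0.70),
    "Glasso": (0.30, 0.55),
    "LoGo":   (0.05, 0.50),   # sparse by construction: small FP rate
}

# A point above the diagonal retrieves proportionally more true links than
# the false links it admits; (0, 1) would be perfect retrieval.
well_performing = sorted(name for name, (fpr, tpr) in points.items() if tpr > fpr)
```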
Instead, above T = 2000 ridge begins to provide better results. For long time series, at T = 20,000, the best performing model is ridge with parameters λ = 10^−5 and p-value p_V = 5·10^−6. LoGo also performs well when time series are long, with best performance obtained at T = 20,000 for parameters λ = 10^−10 and p-value p_V = 5·10^−6. We note that LoGo instead performs poorly in the region of parameters with λ ≤ 0.1 and p_V ≤ 0.01 for short time series T ≤ N/2.

Conclusions and Perspectives
In this paper we have undertaken the challenging task of exploring models and parameter regions where the analytics of time series can retrieve significant fractions of true causality links from a linear multivariate autoregressive process with known causality structure. Results demonstrate that sparse models with conditional transfer entropy are the ones that achieve the best results, with significant causality link retrievals already for very short time series, even with T ≤ N/5 = 20. This region is very critical, and general considerations would suggest that no solutions can be discovered. Indeed, this result is in apparent contradiction with general analytical results in [24,25], which find that no significant solutions should be retrieved for T ≤ 150. However, we notice that the problem we are addressing here is different from the one in [24,25]. In this paper we have been considering an underlying sparse true causality structure, and such sparsity changes considerably the conditioning of the problem, yielding significant solutions even well below the theoretical limit from [24,25], which is instead associated with nonsparse models.
Unexpectedly, we observed that the structure of the inference networks in the two sparse models, Glasso and LoGo, has excessive numbers of false positives, yielding rather poor performances. However, in these models false positives can be efficiently filtered out by imposing statistical significance of the transfer entropies.
Results are affected by the choice of the parameters, and the fact that the models depend on several of them (the time series length T, the number of variables N, the regularizer λ, the validation p-value p_V, and the number of lags τ) makes the navigation of this space quite complex. We observed that the choice of the p-value, p_V, for validating transfer entropies affects results. Within our setting we obtained the best results with the smaller p-values, especially in the region of short time series. We note that the regularizer parameter λ also plays an important role, and best performances are obtained by combinations of the two parameters λ and p_V. Not surprisingly, longer time series yield better results. We observe that conditioning on all other variables, or not conditioning, affects the transfer entropy estimation, with better performing causality network retrieval obtained for conditional transfer entropies. However, qualitatively, results are comparable. Other intermediate cases, such as conditioning on the past of all other variables only, have been explored, again with qualitatively comparable results. It must be said that in the present system results are expected to be robust to different conditionings because the underlying network of the investigated processes is sparse. For denser inference structures, conditioning could affect the results more.
Consistently with the findings in [4], we find that LoGo outperforms the other methods. This is encouraging because the present setting of LoGo uses a simple class of information filtering networks, namely, the TMFG [7], obtained by retaining the largest correlations. There are a number of alternative information filtering networks which should be explored. In particular, given the importance of statistical validation that emerged from the present work, it would be interesting to explore statistical validation within the process of construction of the information filtering networks themselves.
In this paper we investigate a simple case with a linear autoregressive multivariate normal process analysed by means of linear models. Both LoGo and Glasso can be extended to the nonlinear case with LoGo being particularly suitable for nonparametric approaches as well [4].
There are alternative methods to extract causality networks from short time series; in particular, Multispatial CCM [26,27] appears to perform well for short time series. A comparison between different approaches and the application of these methods to real data will be extremely interesting. However, this should be the object of future work.

A. Conditional Transfer Entropy
Let us here briefly review two of the most commonly used information theoretic quantities that we use in this paper, namely, mutual information (quantifying dependency) and transfer entropy (quantifying causality) for the multivariate case [11-13].
A.1. Mutual Information. Let us first start from the simplest case of two random variables, X ∈ R^1 and Y ∈ R^1, where dependence can be quantified by the amount of shared information between the two variables, which is called mutual information: MI(X; Y) = H(X) + H(Y) − H(X, Y), where H(X) is the entropy of variable X, H(Y) is the entropy of variable Y, and H(X, Y) is the joint entropy of the two variables [13]. Extending to the multivariate case, the shared information between a set of random variables X = (X_1, ..., X_p) ∈ R^p and another set of random variables Y = (Y_1, ..., Y_q) ∈ R^q is

MI(X; Y) = H(X) + H(Y) − H(X, Y),   (A.1)

with H(X), H(Y) being the entropies, respectively, for the sets of variables X and Y and H(X, Y) being their joint entropy. It must be stressed that this quantity is the mutual information between two sets of multivariate variables; it is not the multivariate mutual information between all variables {X, Y}, which instead measures the intersection of information between all variables. Mutual information in (A.1) can also be written as

MI(X; Y) = H(Y) − H(Y | X),

which makes use of the conditional entropy of Y given X:

H(Y | X) = H(X, Y) − H(X).

Conditioning on a third set of variables W can also be applied to mutual information itself; its expression is a direct extension of (A.1) and it is called conditional mutual information:

MI(X; Y | W) = H(X | W) + H(Y | W) − H(X, Y | W).

To quantify causality one must investigate the transmission of information not only between two sets of variables but also through time.
A.2. Conditional Transfer Entropy. Causality between two random variables, X ∈ R^1 and Y ∈ R^1, can be quantified by means of the so-called transfer entropy, which quantifies the amount of uncertainty on Y explained by the past of X given the past of Y. Let us consider a series of observations and denote with Y_t the random variable at time t and with Y_{t−λ} the random variable at a previous time, λ lags before t. Using this notation, we can define transfer entropy from variable X to variable Y in terms of the following conditional mutual information: TE(X → Y) = MI(Y_t ; X_{t−λ} | Y_{t−λ}) [11,13].
For the multivariate case, given two sets of random variables X ∈ R^p and Y ∈ R^q, the transfer entropy is the conditional mutual information between the set of variables Y at time t and the past of the other set of variables, X_{t−λ}, conditioned on the past of the first set, Y_{t−λ}: TE(X → Y) = MI(Y_t ; X_{t−λ} | Y_{t−λ}) [13]. In general, the influence from the past can come from more than one lag and we can therefore extend the definition including different sets of lags for the two variables, λ_1, ..., λ_h and λ'_1, ..., λ'_h:

TE(X → Y) = MI(Y_t ; {X_{t−λ_1}, ..., X_{t−λ_h}} | {Y_{t−λ'_1}, ..., Y_{t−λ'_h}}).

A further generalization, which we use in this paper, includes conditioning on any other set of variables {W_{t−ν_1}, ..., W_{t−ν_k}} lagged at ν_1, ..., ν_k:

TE(X → Y | W) = MI(Y_t ; {X_{t−λ_1}, ..., X_{t−λ_h}} | {Y_{t−λ'_1}, ..., Y_{t−λ'_h}}, {W_{t−ν_1}, ..., W_{t−ν_k}}).   (1)

In this paper we adopt a simplified notation for this quantity. In the literature, there are several examples that use adaptations of (1) to compute causality and dependency measures [28]. A notable example is the directed information, introduced by Massey in [29], where X spans all lags in a range between 0 and t − 1 and Y spans the lags from 1 to t − 1. The directed information is then defined as a sum of transfer-entropy-like terms up to the present time:

DI({X}_1^t → {Y}_1^t) = Σ_{s=1}^{t} MI(Y_s ; {X}_1^s | {Y}_1^{s−1}),

where we adopted the notations {X}_1^s = {X_1 ⋅⋅⋅ X_s} and {Y}_1^{s−1} = {Y_1 ⋅⋅⋅ Y_{s−1}}. Interestingly, this definition includes the conditional synchronous mutual information contributions between X_s and Y_s. Following Kramer et al. [30,31], we observe that for stationary processes the directed information decomposes into a transfer entropy contribution plus a synchronous term. This identity supports the intuition that the directed information accounts for the transfer entropy plus an instantaneous term.
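For multivariate normal variables, the conditional mutual information in (1) can be computed from conditional covariances. The following self-contained sketch (the toy process, variable names, and parameter values are our own illustration, not the paper's simulation setup) estimates the transfer entropy of a bivariate linear process in which X drives Y with one lag:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear process where X causes Y with one lag:
#   X_t = 0.5*X_{t-1} + eps_t,   Y_t = 0.8*X_{t-1} + 0.2*Y_{t-1} + eta_t
T = 20000
x = np.zeros(T); y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.8 * x[t - 1] + 0.2 * y[t - 1] + rng.standard_normal()

def cond_cov(cov, a, c):
    """Sigma_{a|c} via the Schur complement of the covariance matrix."""
    A = cov[np.ix_(a, a)]
    C = cov[np.ix_(c, c)]
    B = cov[np.ix_(a, c)]
    return A - B @ np.linalg.solve(C, B.T)

def gaussian_cmi(cov, a, b, c):
    """MI(A;B|C) = 0.5*log(det(S_a|c)*det(S_b|c)/det(S_ab|c)) for normals."""
    s_a = cond_cov(cov, a, c)
    s_b = cond_cov(cov, b, c)
    s_ab = cond_cov(cov, a + b, c)
    return 0.5 * np.log(np.linalg.det(s_a) * np.linalg.det(s_b)
                        / np.linalg.det(s_ab))

# Variables [Y_t, X_{t-1}, Y_{t-1}]: TE(X -> Y) = MI(Y_t ; X_{t-1} | Y_{t-1})
Z = np.column_stack([y[1:], x[:-1], y[:-1]])
te_xy = gaussian_cmi(np.cov(Z, rowvar=False), [0], [1], [2])
# Reverse direction, TE(Y -> X), should be close to zero for this process
Zr = np.column_stack([x[1:], y[:-1], x[:-1]])
te_yx = gaussian_cmi(np.cov(Zr, rowvar=False), [0], [1], [2])
```

The asymmetry te_xy >> te_yx recovers the direction of the single true causality link of the toy process.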

B. Shannon-Gibbs Entropy
The general expression for the transfer entropy reported in Section A, (1), is independent of the kind of entropy definition. In this paper we use the Shannon entropy, which is defined as

H(X) = −∫ p(X) log p(X) dX,

where p(X) is the probability distribution function for the set of random variables X (and analogously p(Y) for Y). Similarly, the joint Shannon entropy for the variables X and Y is defined as

H(X, Y) = −∫∫ p(X, Y) log p(X, Y) dX dY,

with p(X, Y) being the joint probability distribution function of X and Y. This is the most common definition of entropy. It is a particularly meaningful and suitable entropy for linear modelling, which is the focus of this paper.

B.1. Multivariate Normal Modelling. For multivariate normal variables the Shannon-Gibbs entropy is

H(X) = (1/2) log((2πe)^p det(Σ_X)),

and its conditional counterpart is

H(X | Y) = (1/2) log((2πe)^p det(Σ_{X|Y})),

with Σ being the covariance matrix and det(⋅) being the matrix determinant. In the paper we use these expressions to compute mutual information and conditional transfer entropy.
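These closed-form expressions are straightforward to evaluate numerically. The sketch below (function name and the choice of correlation value are ours) computes the Shannon-Gibbs entropy from a covariance matrix and uses it to recover the well-known Gaussian mutual information MI = −(1/2) log(1 − ρ²) for two standard normals with correlation ρ:

```python
import numpy as np

def gaussian_entropy(cov):
    """H(X) = 0.5*log((2*pi*e)^p * det(Sigma)) for a p-variate normal (nats)."""
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    p = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (p * np.log(2 * np.pi * np.e) + logdet)

# MI(X;Y) = H(X) + H(Y) - H(X,Y); for a bivariate normal with correlation rho
# this equals -0.5*log(1 - rho**2)
rho = 0.8
joint = [[1.0, rho], [rho, 1.0]]
mi = gaussian_entropy([[1.0]]) + gaussian_entropy([[1.0]]) - gaussian_entropy(joint)
```

Using `slogdet` rather than `det` keeps the computation stable when the covariance matrix is large or nearly singular.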

C. Computing Conditional Covariances for Subsets of Variables from the Inverse Covariance
Let us consider three sets of variables Z_1 ∈ R^{p_1}, Z_2 ∈ R^{p_2}, and Z_3 ∈ R^{p_3} and the associated inverse covariance J = Σ^{−1}. The conditional covariance of Z_1 given Z_2 and Z_3 is the inverse of the p_1 × p_1 upper left part of J, with indices in i_1 = (1, ..., p_1) (see Figure 4):

Σ_{1|23} = (J_{1,1})^{−1}.   (C.1)

Instead, the conditional covariance of Z_1 given Z_3 is obtained by inverting the larger upper left part J_{12,12}, with both indices in {i_1, i_2}, where i_2 = (p_1 + 1, ..., p_1 + p_2), and then taking the part of the inverse with indices in i_1, which, using the Schur complement [13], is

Σ_{1|3} = ((J_{12,12})^{−1})_{1,1} = (J_{1,1} − J_{1,2} (J_{2,2})^{−1} J_{2,1})^{−1}.   (C.2)

Figure 4 schematically illustrates these inversions and their relations with conditional covariances. Let us note that these conditional covariances can also be expressed directly in terms of subcovariances by using again the Schur complement:

Σ_{1|23} = Σ_{1,1} − Σ_{1,23} (Σ_{23,23})^{−1} Σ_{23,1},
Σ_{1|3} = Σ_{1,1} − Σ_{1,3} (Σ_{3,3})^{−1} Σ_{3,1}.   (C.3)

However, when p_3 (the cardinality of i_3) is much larger than p_1 and p_2 (the cardinalities of i_1 and i_2), the equivalent expressions (C.1) and (C.2), which use the inverse covariance, involve matrices with much smaller dimensions. This can become computationally crucial when very large dimensionalities are involved. Furthermore, if the inverse covariance J is estimated by using a sparse modelling tool such as Glasso or LoGo [4,14] (as we do in this paper), then the computations in expressions (C.1) and (C.2) have to handle only a few nonzero elements, providing great computational advantages over (C.3).
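The equivalence between the inverse-covariance route and the direct subcovariance route can be checked numerically. The sketch below (block sizes and variable names are ours) builds a random well-conditioned covariance over three blocks and verifies that the inverse-covariance expressions reproduce the Schur-complement conditional covariances:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random SPD covariance over three blocks Z1 (2 vars), Z2 (2 vars), Z3 (4 vars)
n1, n2, n3 = 2, 2, 4
n = n1 + n2 + n3
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)   # well-conditioned covariance
J = np.linalg.inv(Sigma)          # inverse covariance

i1 = np.arange(n1)
i2 = np.arange(n1, n1 + n2)
i12 = np.arange(n1 + n2)
i3 = np.arange(n1 + n2, n)

# Sigma_{1|23} as the inverse of the upper left p1 x p1 block of J
cond_1_given_23 = np.linalg.inv(J[np.ix_(i1, i1)])

# Sigma_{1|3} as the (1,1) block of the inverse of J_{12,12}
cond_1_given_3 = np.linalg.inv(J[np.ix_(i12, i12)])[np.ix_(i1, i1)]

# The same quantities from subcovariances via the Schur complement
S13 = Sigma[np.ix_(i1, i3)]
direct_1_given_3 = (Sigma[np.ix_(i1, i1)]
                    - S13 @ np.linalg.solve(Sigma[np.ix_(i3, i3)], S13.T))
i23 = np.concatenate([i2, i3])
S123 = Sigma[np.ix_(i1, i23)]
direct_1_given_23 = (Sigma[np.ix_(i1, i1)]
                     - S123 @ np.linalg.solve(Sigma[np.ix_(i23, i23)], S123.T))
```

When Z3 is much larger than Z1 and Z2, the inverse-covariance route only ever inverts small blocks, which is the computational advantage discussed above.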
In the paper we make use of (C.1) and (C.2) to compute mutual information and conditional transfer entropy for the system of all variables and their lagged versions. Note that, although this is not directly evident, (C.6) is symmetric under the exchange of 1 and 2 (i.e., of X and Y).

C.2. Conditional Transfer Entropy.
Conditional transfer entropy (see (1)) is a conditional mutual information between lagged sets of variables and therefore it can be computed directly from (C.6). In this case we obtain an expression which is formally identical to (C.6), but with indices 1 and 2 referring to the lagged sets of variables instead. Note that index 3 does not appear in this expression. Information from the variables in set 3 (W) has been used to compute J, but then only the subparts 1 and 2 are required to compute the conditional transfer entropy. The fact that these expressions for conditional mutual information and conditional transfer entropy involve only local parts (1 and 2) of the inverse covariance can become extremely useful when high-dimensional datasets are involved.

D. Comparison between Conditional and Unconditional Transfer Entropies
The two ROC plots for conditional and unconditional transfer entropies are displayed in Figure 5. From the comparison it is evident that, for the process studied in this paper, conditional transfer entropy provides the best results. This is in line with what is observed in Tables 1, 3, 4, and 5.

E. Causality Network Results for Transfer Entropy Validation with 1% Bonferroni Adjusted Values
In Tables 4 and 5, true positive rates and fractions of false positives are reported for causality links statistically validated at a 1% Bonferroni adjusted p-value (i.e., p_V ≲ 10^−6). These tables must be compared with Tables 1 and 3 in the main text, where causality links are validated at a p_V = 1% nonadjusted p-value.
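The Bonferroni adjustment divides the family-wise significance level by the number of tests performed. As a sketch, assuming for illustration N = 100 variables, the N(N−1) candidate directed links (no self-loops) give a per-test threshold of about 10^−6 at a 1% family-wise level, matching the p_V scale quoted above:

```python
# Bonferroni adjustment for validating causality links: with N variables
# there are N*(N-1) candidate directed links (no self-loops), so a 1%
# family-wise significance level corresponds to a per-test p-value
# threshold of 0.01 / (N*(N-1)).  N = 100 is an illustrative assumption.
N = 100
n_tests = N * (N - 1)
p_family = 0.01
p_per_test = p_family / n_tests
```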

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.