Information integration from distributed threshold-based interactions

We consider a collection of distributed units that interact with one another through the sending of messages. Each message carries a positive ($+1$) or negative ($-1$) tag and causes the receiving unit to send out messages as a function of the tags it has received and a threshold. This simple model abstracts some of the essential characteristics of several systems used in the field of artificial intelligence, and also of biological systems epitomized by the brain. We study the integration of information inside a temporal window as the model's dynamics unfolds. We quantify information integration by the total correlation, relative to the window's duration ($w$), of a set of random variables valued as a function of message arrival. Total correlation refers to the rise of information gain above and beyond that which the units already achieve individually, being therefore related to consciousness studies in some models. We report on extensive computational experiments that explore the interrelations of the model's parameters (two probabilities and the threshold), highlighting relevant scenarios of message traffic and how they impact the behavior of total correlation as a function of $w$. We find that total correlation can occur at significant fractions of the maximum possible value and provide semi-analytical results on the message-traffic characteristics associated with values of $w$ for which it peaks. We then reinterpret the model's parameters in terms of the current best estimates of some quantities pertaining to cortical structure and dynamics. We find the resulting possibilities for best values of $w$ to be well aligned with the time frames within which percepts are thought to be processed and eventually rendered conscious.


Introduction
A threshold-based system is a collection of loosely coupled units, each characterized by a state function that depends on how the various inputs to the unit relate to a threshold parameter. The coupling in question refers to how the units interrelate, which is by each unit communicating its state to some of the other units whenever that state changes. Given a set of timing assumptions and how they relate to the exchange of states among units as well as to state updates, each individual unit processes inputs (states communicated to it by other units) and produces a threshold-dependent output (its own new state, which gets communicated to other units).
The quintessential threshold-based system is undoubtedly the brain, where each neuron's state function determines whether an action potential is to be fired down the neuron's axon. This depends on how the combined action potentials the neuron perceives through the synapses connecting other neurons' axons to its dendrites (its synaptic potentials) relate to its threshold potential [1]. The greatly simplified model of the natural neuron known as the McCulloch-Pitts neuron [2], introduced over seventy years ago, retained this fundamental property of being threshold-based and so did generalizations thereof such as generalized Petri nets [3] and threshold automata [4]. In fact, this holds for much of the descent from the McCulloch-Pitts neuron, which has extended through the present day in a succession of ever more influential dynamical systems.
Such descent includes the essentially deterministic Hopfield networks of the 1980s [5,6] and moves on through generalizations of those networks' Ising-model type of energy function and the associated need for stochastic sampling. The resulting networks include the so-called Boltzmann machines [7] and Bayesian networks [8,9], as well as the more general Markov (or Gibbs) Random Fields [10][11][12] and several of the probabilistic graphical models based on them [13]. A measure of the eventual success of such networks can be gained by considering, for example, the restricted form of Boltzmann machines [14,15] used in the construction of deep belief networks [16], as well as some of the other deep networks that have led to landmark successes in the field of artificial intelligence recently [17][18][19].
Our interest in this paper is the study of how information gets integrated as the dynamics of a threshold-based system is played out. The meaning we attach to the term information integration is similar to the one we used previously in other contexts [20,21]. Given a certain amount of time and a set of random variables, one for each of the system's units and each related to the corresponding unit's firing activity inside a temporal window of duration $w$, we quantify integrated information as the amount of information the system generates as a whole (relative to a global state of maximum entropy) beyond that which accounts for the aggregated information the units generate individually (now relative to local states of maximum entropy). This surplus of information is known as total correlation [22] and is fundamentally dependent on how the units interact with one another.
Our understanding of information integration, therefore, lies in between those used by approaches that seek it in the synchronization of input/output signals (see, e.g., [23]) and those that share our view but would consider not just the whole and the individual units but all partitions in between as well [24]. By virtue of this, we remain aligned with the latter theory by acknowledging that information gets integrated only when it is generated by the whole in excess of the total its parts can generate individually. On the other hand, by sticking with total correlation as an information-theoretic quantity that requires only two partitions of the set of units to be considered (one that is fully cohesive and another that is maximally disjointed), we ensure tractability way beyond that of the all-partitions theory.
We conduct all our study on a simple model of threshold-based systems. In this model, the units are placed inside a cube and exchange messages whose delivery depends on their propagation speed and the Euclidean distance between sender and receiver. Every message is tagged and, upon reaching its destination, its tag is used to move a local accumulator either toward or away from a threshold. Reaching the threshold makes the unit send out messages, and the accumulator is reset. There are three parameters in the model. Two of them are probabilities (that a message is tagged so that the accumulator at the destination gets decreased upon its arrival, and that a unit sends a message to each of the other units), the other being the value of the threshold. Parameter values are the same for all units.
This model is by no means offered as an accurate representation of any particular threshold-based system but nevertheless summarizes some key aspects of such systems through its relatively few parameters. In particular, it gives rise to three possible expected global regimes of message traffic. One of them is perfectly balanced, in the sense that on average as much traffic reaches the units as that leaving them. In this case, message traffic is sustained indefinitely. In each of the other two regimes, by contrast, either more traffic reaches the units than that leaving them or the opposite. Message traffic dies out in the former of these two (unless the units receive further external input) but grows indefinitely in the latter one.
We find that information integration is guaranteed to occur at high levels for some window durations whenever message traffic is sustained at the perfect-balance level or grows. We also find that this happens nearly independently of parameter variations. On the other hand, we also find that information integration is strongly dependent on the model's parameters, with significant levels occurring only for some combinations, whenever message traffic is imbalanced toward the side that prevents it from being sustained. Here we once again turn to the brain, whose cortical activity is in many accounts characterized as tending to be sparse [25,26], as an emblematic example.
We proceed as follows. Our message-passing model is laid out in Section 2, where its geometry and distributed algorithm are detailed and the question of message imbalance is introduced. An account of our use of total correlation is given in Section 3, followed by our methodology in Section 4. This methodology is based on the carefully planned computational experiments described in Section 4.2, all based on the distributed algorithm of Section 2.2, using the analyses in Sections 2.3 and 4.1 for guidance. Results, discussion, and conclusion follow, respectively, in Sections 5, 6, and 7.

Model
Our system model comprises a structural component and an algorithmic one. The two are described in what follows, along with some analysis of how they interrelate.

Underlying Geometry. For $1 \le d \le 3$, our model is based on $n$ simple processing units, henceforth referred to as nodes, each one placed at a fixed position inside the $d$-dimensional cube of side $\ell$. The position of node $i$ has coordinates $x_i^{(1)}, \dots, x_i^{(d)}$, so the Euclidean distance between nodes $i$ and $j$ is $d_{ij} = \sqrt{\sum_{k=1}^{d}\left(x_i^{(k)} - x_j^{(k)}\right)^2}$. We assume that nodes can communicate with one another by sending messages that propagate at the fixed speed $v$ on a straight line. The delay incurred by a message sent between nodes $i$ and $j$ in either direction is therefore $d_{ij}/v$.
Our computational experiments will all be such that nodes are placed in the $d$-dimensional cube uniformly at random. In this case, and in the limit of infinitely many nodes, the expected distance $\Delta^{(d)}_\ell$ between two randomly chosen nodes $i$ and $j$ is obtained by integrating $d_{ij}$ against the density $1/\ell$ of each of the $2d$ coordinate variables. Letting $x^{(k)} = \ell y^{(k)}$ in this integral for $x \in \{x_i, x_j\}$ and $k \in \{1, \dots, d\}$ yields the same expression over the unit cube, now with $y$'s in place of $x$'s. We then have $\Delta^{(d)}_\ell = \ell\,\Delta^{(d)}_1$, with the expected distances in the unit cube for the numbers of dimensions of interest being well known: $\Delta^{(1)}_1 = 1/3$, $\Delta^{(2)}_1 \approx 0.5214$ [27], and $\Delta^{(3)}_1 \approx 0.6617$ [28]. In addition to expected distances, the associated variances will also at one point be useful. Analytical expressions for most of them seem to have remained unknown thus far, but the underlying probability densities have been found to be more concentrated around the expected values given above as $d$ grows [29]. That is, variance is the greatest for $d = 1$.
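The unit-cube constants quoted above, and the linear scaling with $\ell$, are easy to reproduce by Monte Carlo. A minimal sketch (function name and sample counts are ours, for illustration):

```python
import math
import random

def mean_pairwise_distance(d, ell, samples=200_000, seed=42):
    """Monte Carlo estimate of the expected Euclidean distance between two
    points placed uniformly at random in the d-dimensional cube of side ell
    (the quantity Delta_ell^(d) in our notation)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += math.sqrt(sum((rng.uniform(0.0, ell) - rng.uniform(0.0, ell)) ** 2
                               for _ in range(d)))
    return total / samples
```

For $d = 1, 2, 3$ and $\ell = 1$ the estimates approach $1/3$, $0.5214$, and $0.6617$, and multiplying $\ell$ by any factor rescales the estimate by the same factor, as the substitution $x^{(k)} = \ell y^{(k)}$ predicts.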

Network Algorithmics. We view the $n$ nodes as running an asynchronous message-passing algorithm collectively. By asynchronous we mean that each node remains idle until a message arrives. When this happens, the arriving message is processed by the node, which may result in messages being sent out as well. Such a purely reactive stance on the part of the nodes requires at least one node to send out at least one message without having received one, for startup. We assume that this is done by all nodes initially, after which they start behaving reactively.
We assume that each message carries a signed-unit tag (i.e., either $+1$ or $-1$) with it. The specific tag to go with the message is chosen probabilistically by its sender at send time, with $-1$ being chosen with probability $p_-$. The processing done by node $i$ upon arrival of a message is the heart of the system's thresholding nature and involves manipulating an accumulator $a_i$, initially equal to $0$, to which every tag received is added (unless $a_i = 0$ and the tag is $-1$, in which case $a_i$ remains unchanged). Whenever $a_i$ reaches a preestablished integer value $\tau > 0$, node $i$ sends out messages of its own and $a_i$ is reset to $0$. Thus, the integer $\tau$ acts as a threshold governing the sending of messages by (the firing of) node $i$. The values of $p_-$ and $\tau$ are the same for all nodes.
It follows from this simple rule that the value of $a_i$ is perpetually confined to the interval $[0, \tau]$. The expected number of messages that node $i$ has to receive in order for $a_i$ to be increased all the way from $0$ to $\tau$ is the node's expected number of message arrivals between firings, henceforth denoted by $m = m(p_-, \tau)$. The value of $m$ can be calculated easily once we recognize that it is simply the expected number of steps for the following discrete-time Markov chain to reach state $\tau$ having started at state $0$. The chain has states $0, \dots, \tau$, and its transition probabilities are $1 - p_-$ from state $s$ to state $s + 1$ for $0 \le s < \tau$, $p_-$ from state $s$ to state $s - 1$ for $0 < s < \tau$, and $p_-$ from state $0$ to itself. It is easily solved and, letting $r = p_-/(1 - p_-)$, yields $m = \frac{\tau}{1 - 2p_-} - \frac{p_-(1 - r^\tau)}{(1 - 2p_-)^2}$ for $p_- \ne 1/2$, with $m = \tau(\tau + 1)$ in the limit $p_- = 1/2$ (6). The sending of messages when a node fires is based on another parameter, $p_\to$, which is the probability with which the node sends a message to each of the other $n - 1$ nodes. It follows that the expected number of messages that get sent out is $(n - 1)p_\to$. The value of $p_\to$ is the same for all nodes as well.
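The expected arrival count between firings can be cross-checked against a direct simulation of the accumulator rule. A sketch in our notation ($p_-$ the negative-tag probability, $\tau$ the threshold); the closed form shown is our solution of the chain described above:

```python
import random

def m_closed_form(p_minus, tau):
    """Expected arrivals between firings, m(p_-, tau): mean first-passage
    time from state 0 to state tau of the accumulator chain (our solution;
    for p_- = 1/2 the expression reduces to tau*(tau + 1))."""
    if p_minus == 0.5:
        return tau * (tau + 1)
    r = p_minus / (1.0 - p_minus)
    return (tau / (1.0 - 2.0 * p_minus)
            - p_minus * (1.0 - r ** tau) / (1.0 - 2.0 * p_minus) ** 2)

def m_simulated(p_minus, tau, trials=100_000, seed=1):
    """Direct simulation of one node's accumulator between two firings."""
    rng = random.Random(seed)
    steps = 0
    for _ in range(trials):
        a = 0
        while a < tau:
            steps += 1
            if rng.random() < p_minus:
                a = max(0, a - 1)   # a -1 tag cannot push the accumulator below 0
            else:
                a += 1
    return steps / trials
```

Note the sanity checks: $m = \tau$ when $p_- = 0$, and $m$ grows quadratically in $\tau$ at $p_- = 1/2$.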

Local Imbalance and Global Message Traffic. At node $i$, a balance exists between message output and message input when the expected number of messages sent out at each firing is the same as the expected number of messages received between two successive firings. That is, message traffic is locally balanced when $(n - 1)p_\to = m$. It is locally imbalanced otherwise, which can be quantified by the difference $b$, defined to be $b = (n - 1)p_\to - m$ (7). Given $b$, clearly the instantaneous density of global message output at time $t$, denoted by $\sigma(t)$, is expected to remain constant with $t$ if $b = 0$ or to decrease or increase exponentially with $t$ depending on whether $b < 0$ or $b > 0$, respectively. This behavior is described by $d\sigma(t)/dt = b\,\sigma(t)/(m t_0)$ (8), where the $t_0$ participating in the time constant $m t_0 / b$ is some fundamental amount of time related to the system's geometry and kinetics. In Section 5, we provide empirical evidence that $t_0$ is the expected delay undergone by a message, given by $t_0 = \Delta^{(d)}_\ell / v$ (9). Equation (8) is of immediate solution, yielding $\sigma(t) = (\sigma_0 / t_0)\,e^{bt/(m t_0)}$, where $\sigma_0$ is the expected number of messages that all nodes, collectively, send out initially. Similarly, the cumulative global message output inside a temporal window of duration $w$ starting at time $t$ is $S_w(t) = \int_t^{t+w} \sigma(u)\,du = \frac{\sigma_0 m}{b}\,e^{bt/(m t_0)}\left(e^{bw/(m t_0)} - 1\right)$ (12). In the case of locally balanced message traffic ($b = 0$), this expression is easily seen to yield $S_w(t) = \sigma_0 w / t_0$, therefore independent of $t$. Otherwise, $S_w(t)$ either decreases or increases exponentially with $t$, depending, respectively, on whether $b < 0$ or $b > 0$.
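The traffic regimes governed by the sign of the imbalance can be observed directly in a small discrete-event simulation of the model. The sketch below is ours, with illustrative parameter values; per firing, roughly $(n-1)p_\to$ messages go out against $m$ expected arrivals between firings, so subcritical settings die out while supercritical ones grow until truncation:

```python
import heapq
import math
import random

def simulate(n=50, d=2, ell=1.0, v=1.0, p_minus=0.2, tau=2, p_to=0.05,
             max_events=20_000, seed=7):
    """Hedged sketch of the model as a discrete-event simulation (all names
    and parameter values ours, for illustration). Every node fires once at
    startup; thereafter nodes are purely reactive. Returns the list of
    (node, arrival_time) events, truncated at max_events."""
    rng = random.Random(seed)
    pos = [[rng.uniform(0.0, ell) for _ in range(d)] for _ in range(n)]

    heap = []  # (arrival_time, receiver, tag)

    def fire(i, t):
        for j in range(n):
            if j != i and rng.random() < p_to:
                tag = -1 if rng.random() < p_minus else 1
                heapq.heappush(heap, (t + math.dist(pos[i], pos[j]) / v, j, tag))

    for i in range(n):          # startup: all nodes fire once
        fire(i, 0.0)

    acc = [0] * n
    events = []
    while heap and len(events) < max_events:
        t, i, tag = heapq.heappop(heap)
        events.append((i, t))
        acc[i] = max(0, acc[i] + tag)  # a -1 tag cannot push the accumulator below 0
        if acc[i] == tau:              # threshold reached: fire and reset
            acc[i] = 0
            fire(i, t)
    return events
```

With the defaults above, $m(0.2, 2) \approx 2.81$; choosing $p_\to = 0.01$ gives $(n-1)p_\to = 0.49$ (traffic dies out), while $p_\to = 0.09$ gives $4.41$ (traffic grows until the event cap is reached).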

Graph-Theoretic Interpretations. The model described so far in Section 2 can be regarded as a directed geometric graph, that is, a graph whose nodes are positioned in some region of interest (the $d$-dimensional cube of side $\ell$) and whose edges are directed. It is moreover a complete graph without self-loops, in the sense that an edge exists directed from every node $i$ to every node $j \ne i$.
Our use of the model in the sequel will require the nodes to be positioned at random before each new run of the distributed algorithm of Section 2.2, so an alternative interpretation that one might wish to consider views the model as a variation of the traditional random geometric graph [31]. In this variation, an edge exists directed from $i$ to $j \ne i$ with fixed probability $p_\to$, independently of any other node pair. That is, aside from node positioning, the graph underlying our model is an Erdős-Rényi random graph [32] as extended to the directed case [33].
This interpretation is somewhat loose, though, because it requires that we view each individual run of the algorithm as being equivalent to several runs on independent instances of the underlying random graph, with multiple further runs serving to validate any statistics that one may come up with at the end. This is hard to justify, however, particularly when one considers the nonlinearities characterizing the quantities we will average over all runs of the algorithm (see Section 3). Even so, interpreting our model in terms of random graphs remains tantalizing in some contexts. For example, it allows the parameter $p_-$ to be regarded as the fraction of a node's in-neighbors from which messages with negative tags are received. In the context of networks of the brain at the neuronal level, for example, an abstract rendering of the fraction of neurons that are inhibitory is obtained (see Section 6.1).

Total Correlation
We use the total correlation of $n$ random variables [22], each corresponding to one of the $n$ nodes, as a measure of information integration. Each of these variables is relative to a temporal window of fixed duration $w$, the variable corresponding to node $i$ being denoted by $X_i^{(w)}$ and taking up values from the set $\{0, 1\}$. The intended semantics is that $X_i^{(w)} = 1$ if and only if node $i$ receives at least one message in a time interval of duration $w$. We also use the shorthands $\mathbf{X}^{(w)}$ and $\mathbf{x}$ to denote the sequence of variables $(X_1^{(w)}, \dots, X_n^{(w)})$ and the sequence of values $(x_1, \dots, x_n)$, respectively. Thus, $\mathbf{X}^{(w)} = \mathbf{x}$ stands for the joint valuation $X_1^{(w)} = x_1, \dots, X_n^{(w)} = x_n$. Given the marginal Shannon entropy of each variable $X_i^{(w)}$, $H(X_i^{(w)}) = -\sum_{x_i \in \{0,1\}} \Pr(X_i^{(w)} = x_i)\log_2 \Pr(X_i^{(w)} = x_i)$ (14), and the joint Shannon entropy $H(\mathbf{X}^{(w)})$, defined analogously on the joint mass function, the total correlation of the $n$ variables given $w$ is defined as $C(\mathbf{X}^{(w)}) = \sum_{i=1}^{n} H(X_i^{(w)}) - H(\mathbf{X}^{(w)})$. (When $n = 2$ this formula coincides with that for mutual information, but one is to note that in the general case the two formulas are completely different [34].) To see the significance of this definition in our context, consider the flat joint probability mass function, $\Pr(\mathbf{X}^{(w)} = \mathbf{x}) = 2^{-n}$ for all $\mathbf{x} \in \{0, 1\}^n$. This mass function entails maximum uncertainty of the variables' values, hence the maximum possible value of the joint entropy, $\hat{H}(\mathbf{X}^{(w)}) = n$. It also implies flat marginals, $\Pr(X_i^{(w)} = x_i) = 0.5$ for all $x_i \in \{0, 1\}$ and all $i$, and again the maximum possible value of each marginal entropy, $\hat{H}(X_i^{(w)}) = 1$. The difference from the actual joint entropy $H(\mathbf{X}^{(w)})$ to its maximum reflects a reduction of uncertainty, or an information gain, $n - H(\mathbf{X}^{(w)})$ (17), the same holding for each of the marginals, $1 - H(X_i^{(w)})$. Thus, it is possible to rewrite the expression for $C(\mathbf{X}^{(w)})$ in such a way that $C(\mathbf{X}^{(w)}) = \left[n - H(\mathbf{X}^{(w)})\right] - \sum_{i=1}^{n}\left[1 - H(X_i^{(w)})\right]$. That is, the total correlation of all $n$ variables is the information gain that surpasses their combined individual gains.
This surplus is zero if and only if the variables are independent of one another, that is, precisely when $\Pr(\mathbf{X}^{(w)} = \mathbf{x}) = \prod_{i=1}^{n} \Pr(X_i^{(w)} = x_i)$ for all $\mathbf{x} \in \{0, 1\}^n$, since in this case we have $H(\mathbf{X}^{(w)}) = \sum_{i=1}^{n} H(X_i^{(w)})$. It is strictly positive otherwise, with a maximum possible value of $n - 1$.
Achieving this maximum requires a joint probability mass function assigning no mass to any but two members of $\{0, 1\}^n$, say $\mathbf{x}$ and $\mathbf{y}$, and moreover that these two be equiprobable (i.e., $\Pr(\mathbf{X}^{(w)} = \mathbf{x}) = \Pr(\mathbf{X}^{(w)} = \mathbf{y}) = 0.5$) and complementary to each other (i.e., $x_i + y_i = 1$ for every $i$). Referring back to the intended meaning of the random variables, total correlation is maximized in those runs of the distributed algorithm of Section 2.2 for which a partition $(\mathcal{X}_1, \mathcal{X}_2)$ of the set $\{X_1^{(w)}, \dots, X_n^{(w)}\}$ exists with the following two properties. First, no matter which particular window of duration $w$ we concentrate on, the set of nodes that receive at least one message inside the window is the one corresponding to either $\mathcal{X}_1$ or $\mathcal{X}_2$. Second, the first property holds with $\mathcal{X}_1$ for exactly half such windows.
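Both the definition and the maximization conditions above can be checked directly on small joint distributions. A minimal sketch (function name ours):

```python
from itertools import product
from math import log2

def total_correlation(joint):
    """Total correlation C = sum_i H(X_i) - H(X) of n binary random
    variables, with the joint pmf given as a dict mapping n-tuples in
    {0,1}^n to probabilities."""
    n = len(next(iter(joint)))
    H = lambda probs: -sum(p * log2(p) for p in probs if p > 0)
    H_joint = H(joint.values())
    H_marg = 0.0
    for i in range(n):
        marg = [0.0, 0.0]
        for x, p in joint.items():
            marg[x[i]] += p  # accumulate the i-th marginal
        H_marg += H(marg)
    return H_marg - H_joint

n = 4
# Independent fair coins: C = 0 (no integration at all).
flat = {x: 2.0 ** -n for x in product((0, 1), repeat=n)}
# Mass split between two equiprobable, complementary points:
# C attains its maximum, n - 1.
peaked = {(0,) * n: 0.5, (1,) * n: 0.5}
```

The flat distribution yields $C = 0$ while the two-point complementary one yields $C = n - 1$, exactly as stated.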
While these are exceedingly stringent conditions both spatially and temporally, perhaps implying that values of total correlation equal to or near $n - 1$ are practically unachievable, they serve to delineate those scenarios with a chance of generating substantial amounts of total correlation. Specifically, such scenarios will on average have a pattern of global message traffic, inside a window, that is neither too sparse nor too dense. Furthermore, sustaining such an amount of total correlation as time elapses will also require traffic patterns that deviate only negligibly from the ones yielding the average, possibly entailing some variability of window sizes. Our methodology to track and validate the values of $w$ leading to noteworthy total correlation is described next. It involves computational experiments for a variety of values of $w$ and also gauging the cumulative global message output $S_w(t)$ that results from each experiment against a function of the reference number of messages and reference delay embodied in $\sigma_0$ and $t_0$, respectively.

Methods
Our results are based on running the distributed algorithm of Section 2.2 for a fixed geometry (i.e., fixed number of dimensions $d \in \{1, 2, 3\}$, fixed value of the cube side $\ell$, and fixed positioning of the $n$ nodes in the cube) and a fixed set of values for the parameters ($p_-$, $\tau$, and $p_\to$). Each run of the algorithm terminates either when no more messages are in transit (so none will ever be, thenceforth, given the algorithm's reactive nature) or when a preestablished maximum number of messages in transit has been reached, whichever comes first. Imposing the latter upper bound is important because it serves to size the data structures where messages in transit are kept for later processing.
Node positioning is achieved uniformly at random, so multiple runs are needed for each fixed configuration $(d, \ell, n, p_-, \tau, p_\to)$. Each run leaves a trace of all events taking place as it unfolds, each event referring to the arrival of a message at a node and comprising the node's identification and the message's arrival time. A series of values for $w$ is then considered and, for each one, every trace is analyzed, yielding the total correlation produced by the corresponding run. The average total correlation over all the runs is then reported.
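To make the trace-analysis step concrete, the following sketch estimates $C(\mathbf{X}^{(w)})$ from a trace of (node, arrival time) events by tiling the run with consecutive windows of duration $w$. The exact window-sampling scheme is not fully specified above, so tiling is an assumption on our part, and all names are ours:

```python
from collections import Counter
from math import log2

def total_correlation_from_trace(events, n, w, t_end):
    """Estimate the total correlation C(X^(w)) of the n per-node binary
    variables from a run trace. `events` is a list of (node, arrival_time)
    pairs; [0, t_end) is tiled with consecutive windows of duration w, each
    window contributing one sample of (X_1, ..., X_n), where X_i = 1 iff
    node i received at least one message inside the window."""
    num_windows = int(t_end // w)
    if num_windows == 0:
        return 0.0
    samples = [[0] * n for _ in range(num_windows)]
    for node, t in events:
        k = int(t // w)
        if 0 <= k < num_windows:
            samples[k][node] = 1
    total = num_windows
    entropy = lambda counts: -sum((c / total) * log2(c / total)
                                  for c in counts if c > 0)
    H_joint = entropy(Counter(tuple(s) for s in samples).values())
    H_marginals = 0.0
    for i in range(n):
        ones = sum(s[i] for s in samples)
        H_marginals += entropy([ones, total - ones])
    return H_marginals - H_joint  # C = sum_i H(X_i) - H(X)
```

On a trace in which two complementary halves of the nodes alternate as message receivers from window to window, the estimate approaches the maximum $n - 1$, as the discussion in Section 3 predicts.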
Following our discussion at the end of Section 3, for each value of $w$ we gauge (12) against the approximation to it given by $S_{t_0}(0) = \sigma_0$, according to which $\sigma_0$ messages get sent at time $t = 0$ and received at time $t_0$. We do this by postulating a proportionality constant $c > 0$ between them, that is, by assuming $S_w(t) = c\,\sigma_0$ (20). Doing this allows us to express $w$ as a function of $t$ for each $c$ and, whenever possible, to characterize traffic regimes giving rise to substantial amounts of total correlation.

Supporting Analysis. We denote the value of $t$ upon termination of a run by $T$ and the total number of messages sent by $M$. An approximation to (12) similar to the one above can be used to relate $T$ and $M$ as $S_T(0) = M$, whose right-hand side quantifies what would be expected to happen if all $M$ messages were sent at time $t = 0$ and received at time $t_0$. This leads to $T = \frac{m t_0}{b}\ln\left(1 + \frac{bM}{m\sigma_0}\right)$ (21). Solving (20) for $w = w(t)$ given $c$ yields $w(t) = \frac{m t_0}{b}\ln\left(1 + \frac{cb}{m}\,e^{-bt/(m t_0)}\right)$ (22), whose value for $t = 0$ is the duration of the first window, $w_0 = \frac{m t_0}{b}\ln\left(1 + \frac{cb}{m}\right)$ (23). As for the duration of the last window, which we denote by $w^*$, it can likewise be found by solving (20) for $w$, now letting the window's start time be $t = T - w$ and then letting $w^* = w$.
We obtain $w^* = -\frac{m t_0}{b}\ln\left(1 - \frac{cb}{m}\,e^{-bT/(m t_0)}\right)$ (24). The average window duration between time $t = 0$ and time $t = T$, denoted by $\bar{w}$, is also of interest and comes from the indefinite integral $\int \ln\left(1 + a e^{-x}\right) dx = -\mathrm{Li}_2\left(-a e^{-x}\right) + \text{const.}$, where $\mathrm{Li}_2(z) = \sum_{k \ge 1} z^k / k^2$ is the dilogarithm of $z$. Given this, we obtain $\bar{w} = \frac{1}{T}\int_0^T w(t)\,dt = \frac{(m t_0)^2}{b^2 T}\left[\mathrm{Li}_2\!\left(-\frac{cb}{m}\,e^{-bT/(m t_0)}\right) - \mathrm{Li}_2\!\left(-\frac{cb}{m}\right)\right]$ (27). While for $b = 0$ we have $w = c\,t_0$ at all times (28), for $b \ne 0$ everything depends on the sign of $b$. If $b < 0$, then we need $c$ small enough for $\bar{w}$ to be well defined. Moreover, we have $w_0 < \bar{w} < w^*$ (29), where the first inequality holds if $c < T/t_0$, this being necessary and sufficient for the last inequality to hold as well. For $b > 0$, on the other hand, $\bar{w}$ is always well defined and we get $w_0 > \bar{w} > w^*$ (30). In this case, the constraint $c < T/t_0$ is necessary and sufficient for both the first and the last inequality to hold.
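The window-duration expression can be verified numerically: with output density $\sigma(u) = (\sigma_0/t_0)\,e^{bu/(m t_0)}$, the duration $w(t) = \frac{m t_0}{b}\ln\left(1 + \frac{cb}{m}e^{-bt/(m t_0)}\right)$ should make the cumulative output over $[t, t + w(t)]$ equal $c\sigma_0$. A quick sketch (illustrative parameter values; function names ours):

```python
from math import exp, log

def window_duration(t, c, b, m, t0):
    """Duration w of the window starting at time t whose cumulative message
    output equals c*sigma_0, given the density
    sigma(u) = (sigma_0/t0) * exp(b*u/(m*t0)); assumes b != 0."""
    return (m * t0 / b) * log(1.0 + (c * b / m) * exp(-b * t / (m * t0)))

def cumulative_output(t, w, b, m, t0, sigma0=1.0, steps=100_000):
    """Trapezoid-rule integral of sigma over [t, t+w], for cross-checking."""
    f = lambda u: (sigma0 / t0) * exp(b * u / (m * t0))
    h = w / steps
    return h * (0.5 * f(t) + sum(f(t + k * h) for k in range(1, steps))
                + 0.5 * f(t + w))
```

For $b > 0$ the required duration shrinks as $t$ grows, while for $b < 0$ it grows, matching the discussion of the first, average, and last window durations.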

Computational Experiments. We organize our computational experiments into settings, numbered I-IV, each comprising all configurations $(d, \ell, n, p_-, \tau, p_\to)$ for which $n$, $p_-$, and $\tau$ are fixed. In each setting, there are three possibilities for the value of $p_\to$, one ensuring $b < 0$ ($p_\to = 0.01$), one for $b = 0$ (extracted from (7)), and one for $b > 0$ ($p_\to = 0.06$). The four settings are summarized in Table 1. Each of settings II-IV is derived from setting I by a change in the value of $n$, $p_-$, or $\tau$, respectively. Each setting entails 2100 runs of the algorithm for each $p_\to$ value, this total comprising 100 independent trials for each combination of $d \in \{1, 2, 3\}$ and $\ell \in \{10^{-3}, 10^{-2}, \dots, 10^{3}\}$. Each run starts by placing the $n$ nodes anew, uniformly at random, and proceeds therefrom subject to the further indeterminacies that characterize the algorithm through the probabilities $p_-$ and $p_\to$. Messages are assumed to travel at the speed $v = 1$. Each of the 100 traces resulting from the same configuration is then analyzed and the corresponding value of $C(\mathbf{X}^{(w)})$ is computed, the average over the traces being reported at the end. This is done for each $w \in \{2^{-33}, 2^{-32.5}, \dots, 2^{10}\}$.
One last important aspect of a run has to do with the determination of the maximum number of messages ever allowed to be in transit as the run unfolds. We determine this number via the formula $1000m$, where $1000m$ can be interpreted as the expected number of messages a node must receive if it is to fire 1000 times during the run. This latter number is in many ways arbitrary, though, having to do only with ensuring that the computational resources required by all runs remain manageable. As a result, taken as a whole, the 25200 runs have generated a total of about 394 gigabytes of trace data in compressed format. Moreover, processing these data for the determination of the various $C(\mathbf{X}^{(w)})$ averages has required several weeks of computation on an Intel Xeon E5-1650 core clocked at 3.2 GHz and having exclusive access to 30 gigabytes of RAM.

Results
The average value of the total correlation $C(\mathbf{X}^{(w)})$ resulting from the computational experiments described in Section 4.2 is presented in normalized form (i.e., relative to the maximum $n - 1$) in Figures 1, 2, and 3, respectively, for the $b < 0$ cases, the $b = 0$ cases, and the $b > 0$ cases. Each figure comprises four panels, numbered I-IV to match the four settings of Table 1. Each panel contains 21 plots, each corresponding to one of the possible combinations of the number of dimensions $d$ and the cube side $\ell$.
Each plot in the three figures is given against the rescaled version of the window size given by $w/t_0$, where we recall from (9) that $t_0$ is the delay a message incurs when traversing the distance $\Delta^{(d)}_\ell$ at speed $v$. That is, the abscissas in Figures 1-3 are all relative to the geometric and kinetic underpinnings to which each plot refers. This rescaling reveals that, for each fixed combination of a setting with an imbalance profile (i.e., for each panel in the three figures), the behavior of the average $C(\mathbf{X}^{(w)})$ is essentially invariant with respect to how the various $d$ and $\ell$ values are paired. Our choice in Section 2.3 of $m t_0 / b$ as the system's time constant is then clearly justified, as the value of $m$ is fixed in each of the figures' panels. Moreover, such invariance also backs up our choice of $v = 1$ for all simulations (see Section 4.2), since the choice of any other value would simply alter the rescaling factors.
Nevertheless, there are signs in Figures 3(I) and 3(III) that such invariance may not hold quite as well when $d = 1$. We attribute this to the fact that $\Delta^{(d)}_\ell$ is only an expected distance and as such affects the role of $t_0$ as a rescaling factor differently for each number of dimensions $d$. In particular, recalling from Section 2.1 that the corresponding variances get larger as $d$ is decreased, it seems clear that what is affecting invariance in the two figures in question is precisely the poor representativeness of $\Delta^{(1)}_\ell$. Even so, we see in Figure 3 that this affects setting III much more severely than it does setting I. The reason for this has to do with the value of the imbalance in each case, as we discuss in Section 6.
In all but the case of Figure 1(IV), the average total correlation starts off at some negligibly small value, then slowly climbs toward a peak as $w/t_0$ increases by some orders of magnitude, and finally recedes as $w/t_0$ is made to vary by further orders of magnitude. The plots in each of the panels of Figures 1-3 are given against a backdrop of two vertical lines, the leftmost one marking the smallest $w/t_0$ at which total correlation peaks for some $(d, \ell)$ pair and the rightmost one marking the largest such value. (The smallest $w/t_0$ value in the setting III panel of Figure 1 does not take into account the cases $d = 2, 3$ with $\ell = 10^3$, whose peaks seem to occur past $w = 2^{10}$, the largest window used in the simulations, and are therefore unknown.) The backdrop lines' locations are detailed in Table 2, where the $(d, \ell)$ pair giving rise to each one is also shown. The table also contains for each such location a value for $c$, the proportionality constant that through (20) is used to relate the global message output $S_w(t)$ inside a size-$w$ window beginning at time $t$ to the reference output $\sigma_0$. For $b = 0$, we have found in Section 4.1 that the $w = w(t)$ necessary for (20) to hold is time-invariant and given as in (28), whence $c = w/t_0$. This is reflected in Table 2.
For $b \ne 0$, on the other hand, satisfying (20) for some fixed $c$ requires $w$ to vary with time according to (22). This is what one would expect, since the dwindling message output that results from a $b < 0$ scenario requires an ever larger window size to accommodate the amount of traffic given by $c\,\sigma_0$, and likewise the output expansion caused by $b > 0$ requires progressively smaller window sizes. This dependency on time is reflected in the inequalities of (29) and (30), respectively, for $b < 0$ and $b > 0$, where the earliest window size ($w_0$), as well as the average one ($\bar{w}$) and the latest ($w^*$), are put in perspective. The values of $c$ given in Table 2 for the $b \ne 0$ cases come from (27) by letting each value of $w/t_0$ reported in the table be such that $w = \bar{w}$.
[Figure: peak locations of the plots, per Table 2; the smallest such value for setting IV is off range.]
This use of (27) requires the expected value of $M$ (the number of messages sent in a run) to be estimated from the 100 independent trials for each configuration of interest. We do this by first averaging the number of messages received during each run (let $R$ denote this average) and then recalling that each message received entails an expected number of messages sent equal to $(n - 1)p_\to / m$. Our estimate of the expected $M$ is then $R(n - 1)p_\to / m$. (Our simulator is event-based, each event being the reception of a message. The total we have available at the end of each simulation is that of messages received, not sent, and the two may differ because not every message sent gets received, by virtue of the preestablished maximum number of messages in transit having been reached (see Section 4). We then need this workaround to find the expected $M$ through $R$.) An illustration highlighting the use of (27) on the setting III cases of Table 2 for $b \ne 0$ is given in Figure 4, where the plots relative to (24) rely on the same estimate of the expected $M$ as above.

Discussion
Even though our notation for total correlation, $C(\mathbf{X}^{(w)})$, may suggest that it is a function solely of the random variables in $\mathbf{X}^{(w)}$, it is in fact a function of the joint probability mass function associated with those variables as well. Therefore, changing the allotment of probability mass to the various points in $\{0, 1\}^n$ must have an impact on the value of $C(\mathbf{X}^{(w)})$ as a matter of principle. Thus, while we know that such variation must never lead total correlation to fall below zero or rise above $n - 1$, figuring out the resulting expected value has the potential of aiding in the interpretation of any particular figure one encounters in some situation of interest. As far as we know, such expected value has so far remained unknown, but resorting to results on the expected value of the joint Shannon entropy ($H(\mathbf{X}^{(w)})$, in our case) [35], it is possible to ascertain that an upper bound on the expected value of $C(\mathbf{X}^{(w)})$ is approximately 0.6 [20]. Once normalized to $n - 1$, this upper bound would be unnoticeable in any of the panels in Figures 1-3; in fact, it falls some two orders of magnitude below even the nearly flat values of Figure 1(IV), which are of the order of $10^{-1}$. Clearly, then, in all of settings I-IV and for all three imbalance scenarios, our model is seen to be promoting the rise of probability mass functions leading to total correlations significantly above the expected value. In most cases, this occurs over ranges for the value of $w$, the temporal-window duration that provides meaning to the random variables in $\mathbf{X}^{(w)}$, spanning several orders of magnitude.

[Table 2: smallest and largest $w/t_0$ at which $C(\mathbf{X}^{(w)})$ peaks for each combination of a setting with a possibility for $b$ relative to 0, all configurations in each combination considered. Each of the two extremes occurs for the configuration to which the $d$ and $\ell$ values shown refer, and is reported along with the value of $c$ such that $w = \bar{w}$. The smallest $w/t_0$ reported for the $b < 0$, setting III case does not take into account the peaks for $d = 2, 3$ with $\ell = 10^3$, which have remained undiscovered by the simulations.]

[Figure 4: plots follow (23), (27), and (24), respectively, for $w = w_0$, $w = \bar{w}$, and $w = w^*$. All plots refer to setting III, with either $b < 0$ (a) or $b > 0$ (b); the corresponding $(c, w/t_0)$ pairs are those of Table 2.]
Furthermore, in most cases we have found normalized total correlation to peak at substantial levels: between 0.22 and 0.67 in settings I and III when $\iota < 0$ (Figure 1); between 0.77 and 0.86 in all four settings when $\iota = 0$ (Figure 2); and between 0.93 and 0.97 in all four settings when $\iota > 0$ (Figure 3). The peaks relating to the $\iota = 0$ cases (and also those of the $\iota > 0$ cases, but to a more limited extent) seem to occur largely independently of the setting in question, that is, regardless of the number of nodes $n$ or the parameters that control firing (the probability $p_-$ and the threshold $\tau$). Understanding this independence comes from focusing on the $\iota = 0$ cases, since for $\iota = 0$ the value of the probability $p_\to$ varies from setting to setting as a function of $n$ and of a quantity that, by (6), depends on $p_-$ and $\tau$; see Table 1. Setting $p_\to$ in this way aims precisely to cause $\iota = 0$ and, as a consequence, tends to compensate for any dependency of total-correlation peaks on $n$, $p_-$, or $\tau$. It also suggests that the value of $\iota$, along with that of $w/\delta_0$, is a major player when it comes to determining how much total correlation can be achieved. However, increasing the value of $p_\to$ from those used to ensure $\iota = 0$ to $p_\to = 0.06$, thus ensuring $\iota > 0$, preserves nearly the same independence of peak values as when $\iota = 0$ but fails to keep $\iota$ constant from setting to setting (see Table 3 for the specific value of $\iota$ in each one). With the increased value of $p_\to$ and the ever-growing barrage of messages that ensues, this independence is now supported by window sizes at least two orders of magnitude lower (though yielding taller peaks) but not significantly by the value of $\iota$ itself. The value of $\iota$, however, does make its influence felt, in the following manner. As mentioned in Section 5, rescaling window size through division by $\delta_0$ makes some of the $\iota > 0$ cases stand out for $d = 1$ by failing to comply (though to a small degree) with the invariance to the values of $n$ and $\ell$ that seems to be the rule.
In that section we correctly attributed this to the increased variance of $\delta_0$ (through that of $\Delta^{(1)}_\ell$), but the deviation from invariance clearly increases with the value of $\iota$ as well (1.81 and 3.06, by Table 3, for settings I and III, resp.). The cases for which $\iota < 0$, on the other hand, are different in regard to this independence issue. In fact, it is clear from Figure 1 that total-correlation peaks are highly affected by the setting in question. In particular, they are increasingly lower than the corresponding peaks for $\iota = 0$, and also occur for increasingly lower values of $w/\delta_0$, as the value of $\iota$ (see Table 3) becomes more negative. For fixed $\delta_0$, therefore, this consolidates the value of $\iota$ as the main driver defining the window duration for which total correlation peaks, provided $\iota \le 0$. The $\iota = 0$ cases are particularly interesting also because total correlation was found, in all configurations of all settings, to peak roughly between $w/\delta_0 = 0.522$ and $w/\delta_0 = 0.96$ (see Table 2). As we noted in Section 5, this is also the range of $\sigma$, which for $\iota = 0$ is the fraction of the reference message output yielding the message traffic inside a window of duration $w$ beginning, essentially, at any time $t \ge 0$. The fact that total correlation should peak with $\sigma$ between roughly 0.522 and 0.96 seems well aligned with our discussion at the end of Section 3 on maximizing total correlation, through which we found that a condition necessary (though not sufficient) for such maximization is that only half the nodes receive messages inside the window.
Relativizing message traffic inside the window to the reference message output in the $\iota \neq 0$ cases must be approached differently, because now fixing $\sigma$ requires the duration of the window starting at time $t$ to either expand with $t$ (if $\iota < 0$) or shrink with it (if $\iota > 0$). In these cases, taking each $w$ in Table 2 to be the average duration of these windows reveals that total correlation peaks for $w/\delta_0$ between about 1.47582 and 3.41955 for $\iota < 0$ (therefore somewhat above the range recorded for the $\iota = 0$ cases), ignoring the more or less degenerate cases of settings II and IV, and between about 0.00101 and 0.00236 for $\iota > 0$ (therefore substantially below the $\iota = 0$ ranges, as noted above).
The corresponding values of $\sigma$, given by (27) and illustrated in Figure 4 for setting III, range between 0.71 and 1.917 for $\iota < 0$ and between 0.007 and 0.017 for $\iota > 0$.
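To make the role of the temporal window concrete, the following toy sketch values one binary variable per node as 1 exactly when that node receives at least one message inside a window of duration $w$. It assumes independent Poisson arrivals purely for illustration; this is not the model's actual traffic (and, the arrivals being independent across nodes, it yields essentially zero total correlation, whereas the correlations studied here arise from the units' interactions):

```python
import math
import random

random.seed(1)

def window_pmf(n, rate, w, trials=4000):
    """Empirical pmf of the windowed variables: per trial, X_i = 1 iff
    node i receives at least one message inside a window of duration w.
    Arrivals are modeled as independent Poisson processes of the given
    per-node rate, as a toy stand-in for the model's message traffic."""
    hit = 1.0 - math.exp(-rate * w)  # Pr(at least one arrival within w)
    counts = {}
    for _ in range(trials):
        x = tuple(1 if random.random() < hit else 0 for _ in range(n))
        counts[x] = counts.get(x, 0) + 1
    return {x: c / trials for x, c in counts.items()}

# Widening the window migrates probability mass from all-zeros to all-ones.
for w in (0.1, 1.0, 10.0):
    pmf = window_pmf(n=5, rate=1.0, w=w)
    mean_ones = sum(sum(x) * p for x, p in pmf.items())
    print(w, round(mean_ones, 2))
```

The point of the toy is the mechanism: varying $w$ reshapes the pmf over $\{0,1\}^n$ from which total correlation is computed, which is why peaks occur at intermediate window durations rather than at either extreme.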
6.1. The Case of the Brain. Right at the opening of Section 1 we mentioned the brain as the most representative threshold-based system we know. It is only fitting, then, that we should try to relate it more closely to the model we have developed and analyzed in this paper. This is no simple task, for at least two reasons. The first has to do with the model itself, which as we also mentioned in Section 1 is not meant to faithfully represent the specifics of any one threshold-based system. The second reason is that very little has so far been discovered about the brain's so-called microconnectome [36], that is, its network structure at the cellular level, which is where thresholds work their influence.
Here we resort to the few glimpses that are available in order to estimate parameter values within our model as well as possible. The largest microconnectome to have been mapped to date is from the V1 area of the mouse visual cortex [37]. It reveals a total of 1278 excitatory neurons and 581 inhibitory neurons, which following our discussion in Section 2.4 allows us to conclude that $p_- \approx 581/(581 + 1278) \approx 0.31$, in agreement with the commonly accepted range of values for the fraction of inhibitory neurons [38]. It also reveals a total of 29 connected pairs involving 45 neurons, which again in the spirit of Section 2.4 leads to $p_\to \approx 29/(45 \times 44) \approx 0.015$. Using $n = 45$ and $\tau = 7.5$ (halfway between 5 and 10, which seem to delimit the range of the typical neuron's ratio of threshold potential to synaptic potential [39-41]), we obtain an imbalance $\iota = -0.96$, therefore close to the minimum possible imbalance ($\iota = -1$).
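These estimates amount to simple ratios, reproduced in the sketch below (the variable names are ours, with $p_-$ and $p_\to$ denoting the two probabilities; the imbalance additionally requires (6), which is not reproduced here):

```python
# Reproducing the ratios above from the mouse-V1 counts reported in [37].
excitatory = 1278
inhibitory = 581
p_minus = inhibitory / (inhibitory + excitatory)  # fraction of inhibitory units

neurons = 45  # neurons involved in the observed connected pairs
pairs = 29    # observed axon-to-dendrite connections
p_to = pairs / (neurons * (neurons - 1))  # per ordered pair of neurons

print(round(p_minus, 2))  # 0.31
print(round(p_to, 3))     # 0.015
```

Note that the denominator $45 \times 44$ counts ordered pairs, consistent with the model's directed-graph interpretation of Section 2.4.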
Note that concluding that $p_\to \approx 0.015$ reflects an effective, rather than merely anatomical, view of the microconnectome. That is, the 29 axon-to-dendrite connections leading to this estimate of $p_\to$ do not result simply from the possibility that action potentials travel from a neuron's axon toward one of the dendrites of another neuron, but rather from the actual observation of such potentials. (To further strengthen the notion that the value of $p_\to$ is highly network-dependent in terms of effective signaling in the brain, we note that a completely different estimate of $p_\to$ can be deduced from a slightly earlier study [42], similar in the sense of targeting effective connections, whose authors report a neuron's expected out-degree to be of the order of $10^0$. The study in question is based on the rodent somatosensory cortex, but should a similar conclusion hold for the human cortex taken as a whole, with its 16 billion neurons [43], then an estimate of $p_\to$ would place its value at the order of $10^{-10}$ and that of $\iota$ very close indeed to $-1$. This is all rather speculative, though, especially considering the already established fact that different cortical areas of the primate brain have different neuronal densities [44].) This distinction is of paramount importance in the present study, since the parameter $p_\to$ is semantically tied to the actual sending of messages in our model (once the model is interpreted as a directed graph, as in Section 2.4), not merely to the existence of a channel through which such messages could be sent.
To judge from the values of $\iota$ for $p_\to = 0.01$ in Table 3 and from the conclusions we have drawn regarding the $\iota < 0$ cases of settings I-IV, this entire attempt to interpret cortical activity in the light of our model seems to suggest that we view the cortex as a threshold-based system severely imbalanced on the negative side, and therefore incapable of giving rise to any significant amount of total correlation. The missing component, of course, is that the cortex is continually subjected to new input, either external or originating spontaneously from within [45]; hence a better characterization of local imbalance in this case would take into account some core amount $x_0$ of external or spontaneous input expected to accompany the messages incoming from other cortical neurons. Clearly, $x_0 > 0$ implies a redefined imbalance $\iota' > \iota$ whenever all else remains constant. Local imbalance would then be defined through $\iota' \neq 0$, but without the availability of some rationale on which to base an estimate of $x_0$ this is essentially as far as we can go. In any event, such redefined imbalance might lie above the $\iota < 0$ values of Table 3 while still being negative. (Letting $\iota^*$ denote the greatest of these negative values of $\iota$, the purported $\iota'$ such that $\iota^* < \iota' < 0$ would require $0 < x_0 < -\iota^* (n-1) p_\to$, assuming for the former of these inequalities that all other parameters remain unchanged between $\iota^*$ and $\iota'$.) In this case, we could expect $C(\mathbf{X}^{(w)})$ to peak for $w/\delta_0$ a few orders of magnitude above 3.41955 (see Table 2).
We can go further and estimate the value of $\delta_0$ as well. Because the cerebral cortex takes up 82% of all brain mass in humans [43], and assuming a uniform density in addition to a brain volume of $1.5 \times 10^{-3}$ m$^3$ [46, BNID 112053], we can take the cortical volume to be roughly $1.23 \times 10^{-3}$ m$^3$. In our model, this volume corresponds to the three-dimensional cube of side $\ell \approx 0.107$ m, whence $\Delta^{(3)}_\ell \approx 0.071$ m (see Section 2.1). Assuming further that action potentials propagate on myelinated axons at a speed between 50 and 100 m/s [46, BNID 107125] [47] leads, by (9), to a value of $\delta_0$ roughly between 0.71 and 1.42 ms. This would place $w$, the time-window duration for peak total correlation, a few orders of magnitude above some value between 2.4 and 4.9 ms. Significant total correlation also arises for $w/\delta_0$ values below or above the peak's location by a small factor, so window durations of a few hundred milliseconds would support information integration as defined. These, as it turns out, are time lapses within range of the commonly accepted delay before a percept can be rendered conscious [48].
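The chain of estimates above can be retraced numerically. The sketch below is ours and assumes that $\Delta^{(3)}_\ell$ is the expected distance between two points drawn uniformly at random from the cube, approximately $0.66170\,\ell$ (the Robbins constant); under that assumption it reproduces the figures quoted in the text:

```python
# Retracing the delta_0 estimate from cortical volume and axonal speed.
brain_volume = 1.5e-3                  # m^3 [46, BNID 112053]
cortical_volume = 0.82 * brain_volume  # cortex: 82% of brain mass [43]
side = cortical_volume ** (1.0 / 3.0)  # cube side (l), in meters

# Assumption (ours): Delta is the mean distance between two uniformly
# random points in the cube, ~0.66170 * side (Robbins constant).
delta = 0.66170 * side

for speed in (100.0, 50.0):            # myelinated-axon speeds, m/s
    delta_0_ms = delta / speed * 1e3   # reference delay, in ms
    print(round(side, 3), round(delta, 3), round(delta_0_ms, 2))
```

The printed values match the 0.107 m, 0.071 m, and 0.71-1.42 ms quoted above; multiplying the latter by the peak location 3.41955 recovers the 2.4-4.9 ms range.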

Conclusion
Our approach to quantifying the integration of information to which interacting, distributed threshold-based units give rise has relied on a very general model, tailored to no specific system in particular but built around three fundamental notions: that the units interact with one another by passing positively or negatively tagged messages among themselves; that a unit sends messages out as a function of how the tag balance of incoming messages relates to a threshold; and that this sending out of messages is selective, in the sense of being addressable to a subset of the units only. Assuming that the units are positioned in a one-, two-, or three-dimensional cube uniformly at random, and moreover considering a temporal window of duration $w$, we have demonstrated by means of extensive computational experiments that information does get integrated in significant amounts inside the window, depending chiefly on the value of $w$, on the average delay incurred by a message, and on local message imbalance (how much gets transmitted compared to what is received, depending on the combined effect of the message-passing and threshold parameters).
We have analyzed situations of peak information integration and related the results to cortical dynamics by fixing the model's parameters accordingly. This has served to suggest a validation of the model, despite its purposeful generality, highlighting its potential usefulness for systemic studies of threshold-based interacting units. Though this suggested validation depends on the presently unavailable specification of an input-related quantity in the cortex (the core amount of external or spontaneous input discussed in Section 6.1), it nevertheless points at two distinct perspectives on whose interaction the integration of information seems to hinge, given the basic kinetic properties of the system. The first perspective is that of the system itself, through the local message imbalance that summarizes every stochastic and threshold-related aspect of neuronal interaction. The second perspective is that of an observer (some observer), as summarized in the window duration $w$. The latter notion of an observer-related dependency is admittedly elusive, but exploring it further may help clarify what, if anything, the interplay of these two perspectives has to do with the rise of consciousness.
Given the window duration $w$ and the set of random variables $\mathbf{X}^{(w)}$ (one variable per unit, each related to the reception of messages by the unit in question within the window), our measure of information integration has been the total correlation of the variables in $\mathbf{X}^{(w)}$, here denoted by $C(\mathbf{X}^{(w)})$. This measure seeks to quantify the interdependence of the variables on one another and can be interpreted as information gain that is attained in excess of the total gain the units achieve at the local level. Total correlation can also be interpreted as the Kullback-Leibler (KL) divergence of one probability mass function relative to another, namely, of $\Pr(\mathbf{X}^{(w)} = \mathbf{x})$ relative to $\prod_{i=1}^{n} \Pr(X_i^{(w)} = x_i)$. That is, $C(\mathbf{X}^{(w)})$ can be rewritten as
$$C(\mathbf{X}^{(w)}) = \sum_{\mathbf{x} \in \{0,1\}^n} \Pr(\mathbf{X}^{(w)} = \mathbf{x}) \log_2 \frac{\Pr(\mathbf{X}^{(w)} = \mathbf{x})}{\prod_{i=1}^{n} \Pr(X_i^{(w)} = x_i)}.$$
This expression highlights the well-known fact that the KL divergence is asymmetric with respect to the two mass functions it applies to. Note, however, that this is of no import in our context, because total correlation corresponds to the very specific case in which the KL divergence applies to the two mass functions above, and in the direction indicated only. We mention this issue because the current version of the all-partitions theory of integrated information mentioned in Section 1 abandons the KL divergence in favor of the so-called earth mover's distance, owing to the latter's symmetry and consequent status as a distance between two probability mass functions [49]. This provides further distinction between the two approaches. We finalize by noting that extensions to our approach are certainly possible, especially by attempting to encompass those systems that, despite embodying explicit references to a threshold-based dynamics, do so in a manner that is not completely aligned with the one our model assumes. One example involves the role of thresholds in the grouping and synchronization of the interacting units [50]. Perhaps our approach can provide useful insight in such contexts as well.
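The equivalence between this KL form and the entropy-based definition (sum of the marginal entropies minus the joint entropy) is easy to confirm numerically; the sketch below, an illustration of ours on an arbitrary strictly positive pmf with $n = 3$, computes both and checks that they coincide:

```python
import itertools
import math
import random

random.seed(0)
n = 3
points = list(itertools.product((0, 1), repeat=n))
weights = [random.random() for _ in points]
z = sum(weights)
joint = {x: wt / z for x, wt in zip(points, weights)}  # arbitrary pmf

# Marginals Pr(X_i = x_i), obtained by summing the joint pmf.
marginals = [{0: 0.0, 1: 0.0} for _ in range(n)]
for x, p in joint.items():
    for i in range(n):
        marginals[i][x[i]] += p

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Entropy form: sum of marginal entropies minus the joint entropy.
c_entropy = sum(entropy(m) for m in marginals) - entropy(joint)

# KL form: divergence of the joint pmf from the product of its marginals.
c_kl = sum(p * math.log2(p / math.prod(marginals[i][x[i]] for i in range(n)))
           for x, p in joint.items())

assert abs(c_entropy - c_kl) < 1e-9
print(round(c_entropy, 6))
```

Both forms are evaluated on the same pair of mass functions and in the same direction, which is the specific case to which total correlation corresponds.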

Competing Interests
The author declares that he has no competing interests.