Given the widespread use of lossless compression algorithms to approximate algorithmic (Kolmogorov-Chaitin) complexity and that, usually, generic lossless compression algorithms fall short at characterizing features other than statistical ones not different from entropy evaluations, here we explore an alternative and complementary approach. We study formal properties of a Levin-inspired measure m calculated from the output distribution of small Turing machines. We introduce and justify finite approximations mk that have been used in some applications as an alternative to lossless compression algorithms for approximating algorithmic (Kolmogorov-Chaitin) complexity. We provide proofs of the relevant properties of both m and mk and compare them to Levin’s Universal Distribution. We provide error estimations of mk with respect to m. Finally, we present an application to integer sequences from the On-Line Encyclopedia of Integer Sequences, which suggests that our AP-based measures may characterize nonstatistical patterns, and we report interesting correlations with textual, function, and program description lengths of the said sequences.
Vetenskapsrådet 2015-05299
1. Algorithmic Information Measures
Central to Algorithmic Information Theory is the definition of algorithmic (Kolmogorov-Chaitin or program-size) complexity [1, 2]: (1) $K_T(s) = \min\{|p| : T(p) = s\}$, where p is a program that outputs s running on a universal Turing machine T and |p| is the length in bits of p. The measure was first conceived to define randomness and is today the accepted objective mathematical measure of randomness, among other reasons, because it has been proven to be mathematically robust [3]. In the following, we use K(s) instead of $K_T(s)$ because the choice of T is only relevant up to an additive constant (invariance theorem). A technical inconvenience of K, as the function mapping s to the length of the shortest program that produces s, is its uncomputability. In other words, there is no program that takes a string s as input and produces the integer K(s) as output. This is usually considered a major problem, but one ought to expect a universal measure of randomness to have such a property.
In previous papers [4, 5], we have introduced a novel method to approximate K based on the seminal concept of algorithmic probability (or AP), introduced by Solomonoff [6] and further formalized by Levin [3] who proposed the concept of uncomputable semimeasures and the so-called Universal Distribution.
Levin’s semimeasure $m_T$ defines the so-called Universal Distribution [7], with the value $m_T(s)$ being the probability that a random program halts and produces s running on a universal Turing machine T. It is called a semimeasure because, unlike a probability measure, its sum over all strings is never 1, owing to the Turing machines that never halt. The choice of T is only relevant up to a multiplicative constant, so we will simply write m instead of $m_T$.
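It is possible to use m(s) to approximate K(s) by means of the following theorem.
Theorem 1 (algorithmic coding theorem [3]).
There is a constant c such that (2) $|-\log_2 m(s) - K(s)| < c$.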
This implies that if a string s has many descriptions (a high value of m(s), as the string is produced many times, which implies a low value of $-\log_2 m(s)$, given that m(s)<1), it also has a short description (a low value of K(s)). This is because the most frequent strings produced by programs of length n are those which were already produced by programs of length n-1, as extra bits can produce redundancy in an exponential number of ways. On the other hand, strings produced by programs of length n which could not be produced by programs of length n-1 are less frequently produced by programs of length n, as only very specific programs can generate them (see Section 14.6 in [8]). This theorem elegantly connects probability to complexity: the frequency (or probability) of occurrence of a string with its algorithmic (Kolmogorov-Chaitin) complexity. It implies that one can calculate the Kolmogorov complexity of a string from its frequency [4], simply rewriting the formula as (3) $K(s) = -\log_2 m(s) + O(1)$. Thanks to this connection established by (2) between algorithmic complexity and probability, our method can attempt to approximate an algorithmic probability measure by means of finite approximations using a fixed model of computation. The method is called the Coding Theorem Method (CTM) [5].
In this paper, we introduce m, a computable approximation to Levin’s semimeasure which can be used to approximate K by means of the algorithmic coding theorem. Computing m(s) requires the output of a countably infinite number of Turing machines, so we first undertake the investigation of finite approximations $m_k(s)$ that require only the output of machines up to k states. A key property of Levin’s semimeasure and of K is their universality: the choice of the Turing machine used to compute the distribution is only relevant up to a constant (multiplicative for the semimeasure, additive for K), independent of the objects. The computability of our measure m implies its lack of universality. The same is true when using common lossless compression algorithms to approximate K, but on top of their nonuniversality in the algorithmic sense, they are block entropy estimators, as they traverse files in search of repeated patterns in a fixed-length window to build a replacement dictionary. Nevertheless, this does not prevent lossless compression algorithms from finding useful applications, in the same way that more algorithmically motivated measures can contribute even if they are also limited. Indeed, m has found successful applications in cognitive sciences [9–13], in financial time series research [14], and in graph theory and networks [15–17]. However, a thorough investigation to explore the properties of these measures and to provide theoretical error estimations was missing.
We start by presenting our Turing machine formalism (Section 2) and then show that it can be used to encode a prefix-free set of programs (Section 3). Then, in Section 4, we define a computable algorithmic probability measure m based on our Turing machine formalism and prove its main properties, both for m and for finite approximations mk. In Section 5, we compute m5, compare it with our previous distribution D(5) [5], and estimate the error in m5 as an approximation to m. We finish with some comments in Section 7.
2. The Turing Machine Formalism
We denote by (n,2) the class (or space) of all n-state 2-symbol Turing machines (with the halting state not included among the n states) following the Busy Beaver Turing machine formalism as defined by Radó [18]. Busy Beaver Turing machines are deterministic machines with a single head and a single tape unbounded in both directions. When the machine enters the halting state, the head no longer moves and the output is considered to comprise only the cells visited by the head prior to halting. Formally, we have the following definition.
Definition 2 (Turing machine formalism).
We designate as (n,2) the set of Turing machines with two symbols {0,1} and n states {1,…,n} plus a halting state 0. These machines have 2n entries $(s_1,k_1)$ (for $s_1\in\{1,\ldots,n\}$ and $k_1\in\{0,1\}$) in the transition table, each with one instruction that determines their behavior. Such entries are represented by (4) $(s_1,k_1) \longrightarrow (s_2,k_2,d)$, where $s_1$ and $k_1$ are, respectively, the current state and the symbol being read and $(s_2,k_2,d)$ represents the instruction to be executed: $s_2$ is the new state, $k_2$ is the symbol to write, and d is the direction. If $s_2$ is the halting state 0, then d=0; otherwise d is 1 (right) or -1 (left).
Proposition 3.
Machines in (n,2) can be enumerated from 0 to $(4n+2)^{2n}-1$.
Proof.
Given the constraints in Definition 2, for each transition of a Turing machine in (n,2), there are 4n+2 different instructions $(s_2,k_2,d)$. These are 2 instructions when $s_2=0$ (given that d=0 is fixed and $k_2$ can be one of the two possible symbols) and 4n instructions if $s_2\neq 0$ (2 possible moves, n states, and 2 symbols). Then, considering the 2n entries in the transition table, (5) $|(n,2)| = (4n+2)^{2n}$. These machines can be enumerated from 0 to |(n,2)|-1. Several enumerations are possible. We can, for example, use a lexicographic ordering on transitions (4).
For the current paper, consider that some enumeration has been chosen. Thus, we use $\tau_t^n$ to denote the machine number t in (n,2) following that enumeration.
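As an illustration of Proposition 3, the following minimal Python sketch lists the instructions allowed by Definition 2 and checks the count in (5). The function `machine_from_index` is a hypothetical enumeration (the digits of t in base 4n+2 select the instructions); the paper only assumes that some enumeration has been fixed, so this particular ordering is an assumption made here for concreteness.

```python
def instructions(n):
    """All instructions (s2, k2, d) allowed by Definition 2 for n states."""
    halting = [(0, k2, 0) for k2 in (0, 1)]            # s2 = 0 forces d = 0
    moving = [(s2, k2, d)
              for s2 in range(1, n + 1) for k2 in (0, 1) for d in (-1, 1)]
    return halting + moving                            # 2 + 4n instructions


def num_machines(n):
    """|(n,2)| = (4n+2)^(2n), as in (5)."""
    return (4 * n + 2) ** (2 * n)


def machine_from_index(n, t):
    """Hypothetical enumeration: the 2n base-(4n+2) digits of t pick the instructions."""
    instr = instructions(n)
    entries = [(s1, k1) for s1 in range(1, n + 1) for k1 in (0, 1)]
    table = {}
    for entry in reversed(entries):    # least significant digit -> last entry
        t, digit = divmod(t, len(instr))
        table[entry] = instr[digit]
    return table


print(len(instructions(2)), num_machines(1), num_machines(2))  # 10 36 10000
```

Each index t in the range 0 to |(n,2)|-1 yields a different transition table under this scheme, which is all that the later definitions require of the enumeration.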
3. Turing Machines as a Prefix-Free Set of Programs
We show in this section that the set of Turing machines following the Busy Beaver formalism can be encoded as a prefix-free set of programs capable of generating any finite nonempty binary string.
Definition 4 (execution of a Turing machine).
Let τ∈(n,2) be a Turing machine. We denote by τ(i) the execution of τ over an infinite tape filled with i (a blank symbol), where i∈{0,1}. We write τ(i)↓ if τ(i) halts and τ(i)↑ otherwise. We write τ(i)=s if
(i) τ(i)↓, and
(ii) s is the output string of τ(i), defined as the concatenation of the symbols in the cells of the tape of τ that were visited at some instant of the execution τ(i).
As Definition 4 establishes, we are only considering machines running over a blank tape with no input. Observe that the output of τ(i) considers the symbols in all cells of the tape written on by τ during the computation, so the output contains the entire fragment of the tape that was used. To produce a symmetrical set of strings, we consider both symbols 0 and 1 as possible blank symbols.
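Definition 4 can be made concrete with a small simulator. The sketch below is only illustrative: it assumes that state 1 is the initial state and represents a transition table as a Python dictionary from $(s_1,k_1)$ to $(s_2,k_2,d)$; the example machine and the step limit `max_steps` are hypothetical choices, not taken from the paper.

```python
def run(table, blank, max_steps):
    """Run a machine on a tape filled with `blank`.

    Returns the output string (all visited cells, left to right) if the
    machine halts within `max_steps` steps, and None otherwise.
    """
    tape = {}              # position -> symbol for cells already written
    pos, state = 0, 1      # assumption: state 1 is the initial state
    lo = hi = 0            # leftmost and rightmost visited positions
    for _ in range(max_steps):
        state, write, move = table[(state, tape.get(pos, blank))]
        tape[pos] = write
        if state == 0:     # halting state: the head no longer moves
            return "".join(str(tape.get(p, blank)) for p in range(lo, hi + 1))
        pos += move
        lo, hi = min(lo, pos), max(hi, pos)
    return None


# Hypothetical 2-state machine used only as an example.
example = {
    (1, 0): (2, 1, 1),   # write 1, move right, go to state 2
    (1, 1): (1, 1, 1),   # keep moving right forever
    (2, 0): (0, 1, 0),   # write 1 and halt
    (2, 1): (0, 0, 0),   # write 0 and halt
}
print(run(example, 0, 100))   # 11
print(run(example, 1, 100))   # None (no halt within 100 steps)
```

Running the same table with both blank symbols corresponds to the two executions τ(0) and τ(1) considered in the definition.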
Definition 5 (program).
A program p is a triplet 〈n,i,t〉, where
(i) n≥1 is a natural number,
(ii) i∈{0,1},
(iii) $0\leq t<(4n+2)^{2n}$.
We say that the output of p is s if and only if $\tau_t^n(i)=s$.
Programs can be executed by a universal Turing machine that reads a binary encoding of 〈n,i,t〉 (Definition 6) and simulates τtn(i). Trivially, for each finite binary string s with length |s|>0, there is a program p that outputs s.
Now that we have a formal definition of programs, we show that the set of valid programs can be represented as a prefix-free set of binary strings.
Definition 6 (binary encoding of a program).
Let p=〈n,i,t〉 be a program (Definition 5). The binary encoding of p is a binary string with the following sequence of bits:
(i) First, there is $1^{n-1}0$, that is, n-1 repetitions of 1 followed by 0. This way we encode n.
(ii) Second, a bit with value i encodes the blank symbol.
(iii) Finally, t is encoded using $\lceil \log_2 (4n+2)^{2n} \rceil$ bits.
The use of $\lceil \log_2 (4n+2)^{2n} \rceil$ bits to represent t ensures that all programs with the same n are represented by strings of equal size. As there are $(4n+2)^{2n}$ machines in (n,2), with these bits we can represent any value of t. The process of reading the binary encoding of a program p=〈n,i,t〉 and simulating $\tau_t^n(i)$ is computable, given the enumeration of Turing machines.
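As an example, the binary representation of the program 〈2,0,185〉 is 10 0 00000010111001: the prefix 10 encodes n=2, the next bit 0 is the blank symbol, and the last $\lceil \log_2 10^{4} \rceil = 14$ bits encode t=185.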
The proposed encoding is prefix-free; that is, there is no pair of programs p and p′ such that the binary encoding of p is a prefix of the binary encoding of p′. This is because the n initial bits of the binary encoding of p=〈n,i,t〉 determine the length of the encoding. So p′ cannot be encoded by a binary string having a different length but the same n initial bits.
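The encoding of Definition 6 is easy to state in code. The following Python sketch is an illustration under the assumption that t is written as an unsigned binary number padded with leading zeros; the helper names `t_bits` and `encode` are introduced here and are not from the paper.

```python
def t_bits(n):
    """Number of bits used for t: ceil(log2((4n+2)^(2n))), computed exactly."""
    return ((4 * n + 2) ** (2 * n) - 1).bit_length()


def encode(n, i, t):
    """Binary encoding of the program <n, i, t> following Definition 6."""
    assert n >= 1 and i in (0, 1) and 0 <= t < (4 * n + 2) ** (2 * n)
    prefix = "1" * (n - 1) + "0"                  # encodes n
    index = format(t, "0{}b".format(t_bits(n)))   # fixed-width encoding of t
    return prefix + str(i) + index


code = encode(2, 0, 185)
print(code, len(code))   # 10000000010111001 17
```

All programs with the same n have length $n+1+\lceil \log_2 (4n+2)^{2n} \rceil$, and the unary prefix announces that length, which is why no encoding can be a proper prefix of another.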
Proposition 7 (programming by coin flips).
Every source producing an arbitrary number of random bits generates a unique program (provided it generates at least one 0).
Proof.
The bits in the sequence are used to produce a unique program following Definition 6. We start by producing the first part, encoding n, by selecting all bits until the first 0 appears. Then the next bit gives i. Finally, as we know the value of n, we take the following $\lceil \log_2 (4n+2)^{2n} \rceil$ bits to set the value of t. It is possible that, constructing the program in this way, the value of t is greater than the maximum $(4n+2)^{2n}-1$ in the enumeration, in which case we associate the program with some trivial nonhalting Turing machine, for example, a machine with the initial transition staying at the initial state.
The idea of programming by coin flips is very common in Algorithmic Information Theory. It produces a prefix-free coding system; that is, there is no string w encoding a program p which is a prefix of a string wz encoding a program p′≠p. These coding systems make longer programs (for us, Turing machines with more states) exponentially less probable than short programs. In our case, this is because of the initial sequence of n-1 repetitions of 1, which are produced with probability $1/2^{n-1}$. This observation is important because when we later use machines in $\bigcup_{n=1}^{k}(n,2)$ to reach a finite approximation of our measure, the greater k is, the exponentially smaller the error we will be allowing: the probability of producing by coin flips a random Turing machine with more than k states decreases exponentially with k [8].
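The procedure in this proof can be sketched as a decoder that consumes bits from any source. This is an illustrative Python sketch, not the authors' code: `decode` and `coin_flips` are hypothetical names, and the returned flag `valid` simply reports whether t falls inside the enumeration (out-of-range values are the ones the proof maps to a trivial nonhalting machine).

```python
import random


def t_bits(n):
    """ceil(log2((4n+2)^(2n))), the number of bits that encode t."""
    return ((4 * n + 2) ** (2 * n) - 1).bit_length()


def decode(bits):
    """Read one program <n, i, t> from an iterable of bits (0/1 integers)."""
    it = iter(bits)
    n = 1
    while next(it) == 1:        # unary part: count 1s until the first 0
        n += 1
    i = next(it)                # blank symbol
    t = 0
    for _ in range(t_bits(n)):  # fixed-width machine index
        t = 2 * t + next(it)
    valid = t < (4 * n + 2) ** (2 * n)
    return n, i, t, valid


def coin_flips(seed):
    """Endless stream of pseudo-random bits standing in for fair coin flips."""
    rng = random.Random(seed)
    while True:
        yield rng.randint(0, 1)


print(decode(coin_flips(0)))
```

Because each additional leading 1 has probability 1/2, programs for machines with many states are drawn exponentially less often than programs for small machines.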
4. A Levin-Style Algorithmic Measure
Definition 8.
Given a Turing machine A accepting a prefix-free set of programs, the probability distribution of A is defined as (6) $P_A(s) = \sum_{p \,:\, A(p)=s} \frac{1}{2^{|p|}}$, where A(p) is equal to s if and only if A halts with input p and produces s. The length in bits of program p is represented by |p|.
If A is a universal Turing machine, $P_A(s)$ measures how frequently the output s is generated when running random programs at A. Given that the sum of $P_A(s)$ over all strings is not 1 (the weight $2^{-|p|}$ of each nonhalting program p, which produces no string, is missing from the sum), it is said to be a semimeasure, also known as Levin’s distribution [3]. The distribution is universal in the sense that the choice of A (among all the infinite possible universal reference Turing machines) is only relevant up to a multiplicative constant and that the distribution is based on the universal model of Turing computability.
Definition 9.
Let M be a Turing machine executing the programs introduced in Definition 5. Then, m(s) is defined by (7) $m(s) = P_M(s)$.
Theorem 10.
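For any binary string s, (8) $m(s) = \sum_{n=1}^{\infty} \frac{|\{\tau\in(n,2) \mid \tau(0)=s\}| + |\{\tau\in(n,2) \mid \tau(1)=s\}|}{2^{\,n+1+\lceil \log_2 (4n+2)^{2n} \rceil}}$.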
Proof.
By Definition 6, the length of the encoding of program p=〈n,i,t〉 is $n+1+\lceil \log_2 (4n+2)^{2n} \rceil$. It justifies the denominator of (8), as (6) requires it to be $2^{|p|}$. For the numerator, observe that the set of programs producing s with the same n value corresponds to all machines in (n,2) producing s with either 0 or 1 as blank symbol. Note that if a machine produces s both with 0 and 1, it is counted twice, as each execution is represented by a different program (the two programs differ only in the i digit).
4.1. Finite Approximations to $m$
The value of m(s) for any string s depends on the output of an infinite set of Turing machines, so we have to manage ways to approximate it. The method proposed in Definition 11 approximates m(s) by considering only a finite number of Turing machines up to a certain number of states.
Definition 11.
The finite approximation to m(s) bound to k states, $m_k(s)$, is defined as (9) $m_k(s) = \sum_{n=1}^{k} \frac{|\{\tau\in(n,2) \mid \tau(0)=s\}| + |\{\tau\in(n,2) \mid \tau(1)=s\}|}{2^{\,n+1+\lceil \log_2 (4n+2)^{2n} \rceil}}$.
Proposition 12 (convergence of $m_k(s)$ to $m(s)$).
(10) $\sum_{s\in(0+1)^{\star}} \left( m(s)-m_k(s) \right) \leq \frac{1}{2^{k}}$.
Proof.
By (8) and (9), (11)
$$\sum_{s\in(0+1)^{\star}} \left( m(s)-m_k(s) \right) = \sum_{s\in(0+1)^{\star}} m(s) - \sum_{s\in(0+1)^{\star}} m_k(s) \leq \sum_{n=k+1}^{\infty} \frac{2(4n+2)^{2n}}{2^{\,n+1+\lceil \log_2 (4n+2)^{2n} \rceil}} \leq \sum_{n=k+1}^{\infty} \frac{2(4n+2)^{2n}}{2^{n}\cdot 2\cdot 2^{\log_2 (4n+2)^{2n}}} = \sum_{n=k+1}^{\infty} \frac{1}{2^{n}} = \frac{1}{2^{k}}.$$
Proposition 12 ensures that the sum of the error in $m_k(s)$ as an approximation to m(s), for all strings s, decreases exponentially with k. The question of this convergence was first broached in [19]. The bound of $1/2^{k}$ has only theoretical value; in practice, we can find tighter bounds. In fact, the proof counts all $2(4n+2)^{2n}$ programs of size n to bound the error (and many of them do not halt). In Section 5.1, we provide a finer error calculation for $m_5$ by removing from the count some very trivial machines that do not halt.
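To see how much slack the bound of Proposition 12 leaves, one can evaluate the first inequality in (11) numerically. The Python sketch below is an illustration only; it truncates the infinite sum at a hypothetical cutoff `n_max` (the omitted terms are each below $2^{-n}$) and uses exact rational arithmetic.

```python
from fractions import Fraction


def t_bits(n):
    """ceil(log2((4n+2)^(2n))), computed exactly with integers."""
    return ((4 * n + 2) ** (2 * n) - 1).bit_length()


def tail_bound(k, n_max=200):
    """Sum for n = k+1..n_max of 2(4n+2)^(2n) / 2^(n+1+t_bits(n))."""
    return sum(Fraction(2 * (4 * n + 2) ** (2 * n), 2 ** (n + 1 + t_bits(n)))
               for n in range(k + 1, n_max + 1))


for k in range(1, 7):
    # Compare the finer bound with the closed form 1/2^k.
    print(k, float(tail_bound(k)), 1 / 2 ** k)
```

The gap between the two columns comes entirely from rounding the number of bits for t up to an integer; removing nonhalting machines from the count tightens the bound further.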
4.2. Properties of $m$ and $m_k$
Levin’s distribution is characterized by some important properties. First, it is lower semicomputable; that is, it is possible to compute lower bounds for it. Also, it is a semimeasure, because the sum of probabilities for all strings is smaller than 1. The key property of Levin’s distribution is its universality: a semimeasure P is universal if and only if for every other semimeasure P′ there exists a constant c>0 (that may depend only on P and P′) such that, for every string s, c·P(s)≥P′(s). That is, a distribution is universal if and only if it dominates (modulo a multiplicative constant) every other semimeasure. In this section, we present some results pertaining to the computational properties of m and mk.
Proposition 13 (runtime bound).
Given any binary string s, a machine with k states producing s runs for at most $2^{|s|}\cdot|s|\cdot k$ steps before halting.
Proof.
Suppose that a machine τ produces s. We can trace back the computation of τ before halting by looking at the portion of |s| cells in the tape that will constitute the output. Before each step, the machine may be in one of k possible states, reading one of the |s| cells. Also, the |s| cells can be filled in $2^{|s|}$ ways (with a 0 or 1 in each cell). This makes for $2^{|s|}\cdot|s|\cdot k$ different possible instantaneous descriptions of the computation. So any machine may run, at most, that number of steps in order to produce s. Otherwise, it would produce a string with a greater length (visiting more than |s| cells) or enter a loop.
Observe that a key property of our output convention is that we use all visited cells in the machine tape. This is what gives us the runtime bound which serves to prove the most important property of mk, its computability (Theorem 14).
Theorem 14 (computability of $m_k$).
Given k and s, the value of mk(s) is computable.
Proof.
According to (9) and Proposition 3, there is a finite number of machines involved in the computation of mk(s). Also, Proposition 13 sets the maximum runtime for any of these machines in order to produce s. So an algorithm to compute mk(s) enumerates all machines in (n,2), 1≤n≤k, and runs each machine to the corresponding bound.
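The algorithm in this proof can be written down directly for very small k. The following Python sketch is a brute-force illustration, feasible only for k ≤ 2 or so; it assumes state 1 is the initial state, and the function names `run` and `m_k` are introduced here. Each machine with n states is run for at most $2^{|s|}\cdot|s|\cdot n$ steps, the bound of Proposition 13.

```python
from fractions import Fraction
from itertools import product


def run(table, blank, max_steps, max_cells):
    """Output of the machine, or None if it exceeds the step or cell limit."""
    tape, pos, state = {}, 0, 1          # assumption: state 1 is initial
    lo = hi = 0
    for _ in range(max_steps):
        state, write, move = table[(state, tape.get(pos, blank))]
        tape[pos] = write
        if state == 0:
            return "".join(str(tape.get(p, blank)) for p in range(lo, hi + 1))
        pos += move
        lo, hi = min(lo, pos), max(hi, pos)
        if hi - lo + 1 > max_cells:      # output would be longer than s
            return None
    return None


def m_k(s, k):
    """Exact value of m_k(s) from (9), by exhaustive enumeration."""
    total = Fraction(0)
    for n in range(1, k + 1):
        instr = [(0, k2, 0) for k2 in (0, 1)] + [
            (s2, k2, d) for s2 in range(1, n + 1) for k2 in (0, 1) for d in (-1, 1)]
        entries = [(s1, k1) for s1 in range(1, n + 1) for k1 in (0, 1)]
        bound = 2 ** len(s) * len(s) * n                 # Proposition 13
        bits = ((4 * n + 2) ** (2 * n) - 1).bit_length() # bits encoding t
        count = 0
        for choice in product(instr, repeat=2 * n):
            table = dict(zip(entries, choice))
            count += sum(run(table, b, bound, len(s)) == s for b in (0, 1))
        total += Fraction(count, 2 ** (n + 1 + bits))
    return total


print(m_k("0", 1), m_k("0", 2))   # 3/64 509/8192
```

The number of machines grows as $(4n+2)^{2n}$, so this direct approach quickly becomes impractical; it serves only to show that nothing in the definition of $m_k$ is uncomputable.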
Corollary 15.
Given a binary string s, the minimum k with mk(s)>0 is computable.
Proof.
Trivially, s can be produced by a Turing machine with |s| states in just |s| steps. At each step i, this machine writes the ith symbol of s, moves to the right, and changes to a new state. When all symbols of s have been written, the machine halts. So, to get the minimum k with $m_k(s)>0$, we can enumerate all machines in (n,2), 1≤n≤|s|, and run all of them up to the runtime bound given by Proposition 13. The first machine producing s (if the machines are enumerated from smaller to larger size) gives the value of k.
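The trivial machine used in this proof can be constructed explicitly. The Python sketch below is illustrative; it assumes state 1 is the initial state and that the last state writes the final symbol while entering the halting state, which is one way of realizing the construction described above.

```python
def trivial_machine(s):
    """Machine with |s| states that writes s from left to right, then halts."""
    n = len(s)
    table = {}
    for state, ch in enumerate(s, start=1):
        for read in (0, 1):              # the symbol read is ignored
            if state == n:
                table[(state, read)] = (0, int(ch), 0)          # write and halt
            else:
                table[(state, read)] = (state + 1, int(ch), 1)  # write, go right
    return table


def run(table, blank, max_steps):
    """Return (output, steps) on halting, or (None, max_steps) otherwise."""
    tape, pos, state = {}, 0, 1          # assumption: state 1 is initial
    lo = hi = 0
    for step in range(1, max_steps + 1):
        state, write, move = table[(state, tape.get(pos, blank))]
        tape[pos] = write
        if state == 0:
            out = "".join(str(tape.get(p, blank)) for p in range(lo, hi + 1))
            return out, step
        pos += move
        lo, hi = min(lo, pos), max(hi, pos)
    return None, max_steps


print(run(trivial_machine("0110"), 0, 100))   # ('0110', 4)
```

This gives |s| as an upper limit for the search over n, which is what makes the minimum k computable.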
Now, some uncomputability results of mk are given.
Proposition 16.
Given k, the length of the longest s with mk(s)>0 is noncomputable.
Proof.
We proceed by contradiction. Suppose that a computable function l(k) gives the length of the longest s with $m_k(s)>0$. Then l(k), together with the runtime bound in Proposition 13, provides a computable function that gives the maximum runtime that a machine in (k,2) may run prior to halting. But this contradicts the uncomputability of the Busy Beaver [18]: the highest runtime of halting machines in (k,2) grows faster than any computable function.
Corollary 17.
Given k, the number of different strings s with mk(s)>0 is noncomputable.
Proof.
Also by contradiction, if the number of different strings with mk(s)>0 is computable, we can run in parallel all machines in (k,2) until the corresponding number of different strings has been found. This gives us the longest string, which is in contradiction to Proposition 16.
We now demonstrate the key property of m: its computability.
Theorem 18 (computability of $m$).
Given any nonempty binary string s, m(s) is computable.
Proof.
As we argued in the proof of Corollary 15, a nonempty binary string s can be produced by a machine with |s| states. Trivially, it is then also produced by machines with more than |s| states. So, for every nonempty string s, the value of m(s), according to (8), is the sum of countably infinitely many rationals, which yields a real number. A real number is computable if and only if there is some algorithm that, given n, returns the first n digits of the number. And this is what $m_k(s)$ does. Proposition 12 enables us to calculate the value of k such that $m_k(s)$ provides the required digits of m(s), as $m(s)-m_k(s)$ is bounded by $1/2^{k}$.
The subunitarity of m and $m_k$ means that the sum of m(s) (or $m_k(s)$) over all strings s is smaller than one. This is because of the nonhalting machines.
Proposition 19 (subunitarity).
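The sum of m(s) for all strings s is smaller than 1; that is, (12) $\sum_{s\in(0+1)^{\star}} m(s) < 1$.
Proof.
By using (8), (13)
$$\sum_{s\in(0+1)^{\star}} m(s) = \sum_{n=1}^{\infty} \frac{|\{\tau\in(n,2) \mid \tau(0)\downarrow\}| + |\{\tau\in(n,2) \mid \tau(1)\downarrow\}|}{2^{\,n+1+\lceil \log_2 (4n+2)^{2n} \rceil}};$$
but $|\{\tau\in(n,2) \mid \tau(0)\downarrow\}| + |\{\tau\in(n,2) \mid \tau(1)\downarrow\}|$ is the number of machines in (n,2) which halt when starting with a blank tape filled with 0 plus the number of machines in (n,2) which halt when starting on a blank tape filled with 1. This number is at most twice the cardinality of (n,2), but we know that it is smaller, as there are very trivial machines that do not halt, such as those without transitions to the halting state, so (14)
$$\sum_{s\in(0+1)^{\star}} m(s) < \sum_{n=1}^{\infty} \frac{2(4n+2)^{2n}}{2^{\,n+1+\lceil \log_2 (4n+2)^{2n} \rceil}} = \sum_{n=1}^{\infty} \frac{(4n+2)^{2n}}{2^{n}\cdot 2^{\lceil \log_2 (4n+2)^{2n} \rceil}} \leq \sum_{n=1}^{\infty} \frac{(4n+2)^{2n}}{2^{n}(4n+2)^{2n}} = \sum_{n=1}^{\infty} \frac{1}{2^{n}} = 1.$$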
Corollary 20.
The sum of mk(s) for all strings s is smaller than 1.
Proof.
See Proposition 19, (8), and (9).
The key property of $m_k(s)$ and m(s) is their computability, given by Theorems 14 and 18, respectively. So these distributions cannot be universal, as Levin’s Universal Distribution is noncomputable. In spite of this, the computability of our distributions (and the possibility of approximating them with a reasonable computational effort), as we have shown, provides us with a tool to approximate the algorithmic probability of short binary strings. In some sense, this is similar to what happens with other (computable) approximations to (uncomputable) Kolmogorov complexity, such as common lossless compression algorithms. These, in turn, are estimators of the classical Shannon entropy rate (e.g., all those based on LZW) and, unlike $m_k(s)$ and m(s), are not able to find algorithmic content beyond statistical patterns, not even in principle, unless a compression algorithm is designed to seek a specific one. For example, the digital expansion of the mathematical constant π is believed to be normal and therefore will contain no statistical patterns of the kind that compression algorithms can detect, yet there will be a (short) computer program that can generate it or at least finite (and small) initial segments of π.
We have explored the sets of Turing machines in (n,2) for n≤5 in previous papers [4, 5]. For n≤4, the maximum time that a machine in (n,2) may run before halting is known [20], which allows us to calculate the exact values of $m_4$. For n=5, we have estimated [5] that 500 steps cover almost the totality of halting machines. We have the database of machines producing each string s for each value of n. So we have applied (9) to obtain $m_5$, which is an estimate rather than an exact value because of this low runtime cutoff.
In previous papers [5, 21], we worked with D(k), a measure similar to $m_k$, except that the denominator of (9) is the number of (detected) halting machines in (k,2). Using D(5) as an approximation to Levin’s distribution, algorithmic complexity is estimated by means of the algorithmic coding theorem (Theorem 1) as $K_{D(5)}(s) = -\log_2 D(5)(s)$ (values can be consulted at http://www.complexitycalculator.com/, accessed on June 22, 2017). Now, $m_5$ provides us with another estimation: $K_{m_5}(s) = -\log_2 m_5(s)$. Table 1 shows the 10 most frequent strings in both distributions, together with their estimated complexity.
Table 1: Top 10 strings in $m_5$ and D(5) with their estimated complexity.
s      $K_{m_5}(s)$   $K_{D(5)}(s)$
0      3.7671         2.5143
1      3.7671         2.5143
00     6.8255         3.3274
01     6.8255         3.3274
10     6.8255         3.3274
11     6.8255         3.3274
000    10.4042        5.3962
111    10.4042        5.3962
001    10.4264        5.4458
011    10.4264        5.4458
Figure 1 shows a rank comparison of both estimations of algorithmic complexity after application of the algorithmic coding theorem. With minor differences, there is an almost perfect agreement. So, in classifying strings according to their relative algorithmic complexity, the two distributions are equivalent.
Figure 1: Correlation of rank comparison between $K_{m_5}$ and $K_{D(5)}$.
The main difference between mk and D(k) is that D(k) is not computable, because computing it would require us to know the exact number of halting machines in (k,2), which is impossible given the halting problem. We work with approximations to D(k) by considering the number of halting machines detected. In any case, although mk is computable, it is computationally intractable, so in practice (approximations to) the two measures can be used interchangeably.
5.1. Error Calculation
We can make some estimations about the error in m5 with respect to m. “0” and “1” are two very special strings, both with the maximum m5 value. These strings are the most frequent outputs in (n,2) for n≤5, and we may conjecture that they are the most frequent outputs for all values of n. These strings then have the greatest absolute error, because the terms in the sum of m(“0”) (the argument for m(“1”) is identical) not included in m5(“0”) are always the greatest independent of n.
We can calculate the exact value of the terms for m(“0”) in (8). To produce “0,” starting with a tape filled with i∈{0,1}, a machine in (n,2) must have the transition corresponding to the initial state and read symbol i with the following instruction: write 0 and change to the halting state (thus not moving the head). The other 2n-1 transitions may have any of the 4n+2 possible instructions. So there are $(4n+2)^{2n-1}$ machines in (n,2) producing “0” when running on a tape filled with i. Considering both values of i, we have $2(4n+2)^{2n-1}$ programs of the same length $n+1+\lceil \log_2 (4n+2)^{2n} \rceil$ producing “0.” Then, for “0,” (15)
$$m(\text{“0”}) = \sum_{n=1}^{\infty} \frac{2(4n+2)^{2n-1}}{2^{\,n+1+\lceil \log_2 (4n+2)^{2n} \rceil}}.$$
This can be approximated by (16)
$$\begin{aligned}
m(\text{“0”}) &= \sum_{n=1}^{\infty} \frac{2(4n+2)^{2n-1}}{2^{\,n+1+\lceil \log_2 (4n+2)^{2n} \rceil}} = \sum_{n=1}^{\infty} \frac{2(4n+2)^{2n-1}}{2^{n+1}\, 2^{\lceil \log_2 (4n+2)^{2n} \rceil}} = \sum_{n=1}^{\infty} \frac{(4n+2)^{2n-1}}{2^{n}\, 2^{\lceil \log_2 (4n+2)^{2n} \rceil}} \\
&= \sum_{n=1}^{2000} \frac{(4n+2)^{2n-1}}{2^{n}\, 2^{\lceil \log_2 (4n+2)^{2n} \rceil}} + \sum_{n=2001}^{\infty} \frac{(4n+2)^{2n-1}}{2^{n}\, 2^{\lceil \log_2 (4n+2)^{2n} \rceil}} \\
&< \sum_{n=1}^{2000} \frac{(4n+2)^{2n-1}}{2^{n}\, 2^{\lceil \log_2 (4n+2)^{2n} \rceil}} + \sum_{n=2001}^{\infty} \frac{(4n+2)^{2n-1}}{2^{n}\, 2^{\log_2 (4n+2)^{2n}}} \\
&= \sum_{n=1}^{2000} \frac{(4n+2)^{2n-1}}{2^{n}\, 2^{\lceil \log_2 (4n+2)^{2n} \rceil}} + \sum_{n=2001}^{\infty} \frac{(4n+2)^{2n-1}}{2^{n}(4n+2)^{2n}} \\
&= \sum_{n=1}^{2000} \frac{(4n+2)^{2n-1}}{2^{n}\, 2^{\lceil \log_2 (4n+2)^{2n} \rceil}} + \sum_{n=2001}^{\infty} \frac{1}{2^{n}(4n+2)} \simeq 0.0742024;
\end{aligned}$$
we have divided the infinite sum into two intervals cutting at 2000 because the approximation of $2^{\lceil \log_2 (4n+2)^{2n} \rceil}$ to $(4n+2)^{2n}$ is not good for low values of n but has almost no impact for large n. In fact, cutting at 1000 or 4000 gives the same result with a precision of 17 decimal places. We have used Mathematica to calculate both the sum from 1 to 2000 and the convergence from 2001 to infinity. So the value m(“0”)=0.0742024 is exact for practical purposes. The value of $m_5$(“0”) is 0.0734475, so the error in the calculation of m(“0”) is 0.0007549. If “0” and “1” are the strings with the highest m value, as we (informedly) conjecture, then this is the maximum error in $m_5$ as an approximation to m.
As a reference, Km5(“0”) is 3.76714. With the real m(“0”) value, the approximated complexity is 3.75239. The difference is not relevant for most practical purposes.
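The figures quoted for “0” can be reproduced from (15) with exact rational arithmetic. The Python sketch below is an independent check written for illustration (the paper reports using Mathematica); it truncates the infinite sum at a hypothetical cutoff of 200 terms, beyond which each term is smaller than $2^{-200}$.

```python
from fractions import Fraction
from math import log2


def t_bits(n):
    """ceil(log2((4n+2)^(2n))), computed exactly with integers."""
    return ((4 * n + 2) ** (2 * n) - 1).bit_length()


def term_zero(n):
    """Contribution of the n-state machines to m("0"), as in (15)."""
    return Fraction(2 * (4 * n + 2) ** (2 * n - 1), 2 ** (n + 1 + t_bits(n)))


m5_zero = float(sum(term_zero(n) for n in range(1, 6)))     # m_5("0")
m_zero = float(sum(term_zero(n) for n in range(1, 201)))    # m("0"), truncated

print(round(m5_zero, 7), round(m_zero, 7))
print(round(m_zero - m5_zero, 7))                  # absolute error of m_5("0")
print(round(-log2(m5_zero), 5), round(-log2(m_zero), 5))    # coding theorem
```

The last line applies the coding theorem to both values and gives the two complexity estimates compared in the text.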
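We can also provide an upper bound for the sum of the error in $m_5$ for strings different from “0” and “1.” Our way of proceeding is similar to the proof of Proposition 12, but we count in a finer fashion. The sum of the error for strings different from “0” and “1” is (17)
$$\sum_{n=6}^{\infty} \frac{|\{\tau\in(n,2) \mid \tau(0)\downarrow,\ \tau(0)\notin\{\text{“0”},\text{“1”}\}\}| + |\{\tau\in(n,2) \mid \tau(1)\downarrow,\ \tau(1)\notin\{\text{“0”},\text{“1”}\}\}|}{2^{\,n+1+\lceil \log_2 (4n+2)^{2n} \rceil}}.$$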
The numerators of the above sum contain the number of computations (with blank symbol “0” or “1”) of Turing machines in (n,2), n≥6, which halt and produce an output different from “0” and “1.” We can obtain an upper bound of this value by removing, from the set of computations in (n,2), those that produce “0” or “1” and some trivial cases of machines that do not halt.
First, the number of computations in (n,2) is $2(4n+2)^{2n}$, as all machines in (n,2) are run twice for both blank symbols (“0” and “1”). Also, the computations producing “0” or “1” are $4(4n+2)^{2n-1}$. Now, we focus on two sets of trivial nonhalting machines:
(i) Machines with the initial transition staying at the initial state. For blank symbol i, there are $4(4n+2)^{2n-1}$ machines that when reading i at the initial state do not change the state (for the initial transition there are 4 possibilities, depending on the writing symbol and direction, and for the other 2n-1 transitions there are 4n+2 possibilities). These machines will keep moving in the same direction without halting. Considering both blank symbols, we have $8(4n+2)^{2n-1}$ computations of this kind.
(ii) Machines without transition to the halting state. To keep the intersection of this and the above set empty, we also consider that the initial transition moves to a state different from the initial state. So for blank symbol i, we have 4(n-1) different initial transitions (2 directions, 2 writing symbols, and n-1 states) and 4n different possibilities for the other 2n-1 transitions. This makes a total of $4(n-1)(4n)^{2n-1}$ different machines for blank symbol i and $8(n-1)(4n)^{2n-1}$ computations for both blank symbols.
Now, an upper bound for (17) is (18)
$$\sum_{n=6}^{\infty} \frac{2(4n+2)^{2n} - 4(4n+2)^{2n-1} - 8(4n+2)^{2n-1} - 8(n-1)(4n)^{2n-1}}{2^{\,n+1+\lceil \log_2 (4n+2)^{2n} \rceil}}.$$
The result of the above sum is 0.0104282 (smaller than 1/32, as guaranteed by Proposition 12). This is an upper bound of the sum of the error $m(s)-m_5(s)$ over all of the infinitely many strings s different from “0” and “1.” Smaller upper bounds can be found by removing from the above sum other kinds of predictable nonhalting machines.
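The value of the bound (18) can be checked with a few lines of Python. This is an illustrative sketch, not the computation used in the paper: the sum is truncated at a hypothetical cutoff of 200 terms, and the numerator removes, from all computations, those producing “0” or “1” and the two families of trivial nonhalting machines described above.

```python
from fractions import Fraction


def t_bits(n):
    """ceil(log2((4n+2)^(2n))), computed exactly with integers."""
    return ((4 * n + 2) ** (2 * n) - 1).bit_length()


def numerator(n):
    """Computations in (n,2) that may still halt with output other than 0 or 1."""
    a = 4 * n + 2
    all_computations = 2 * a ** (2 * n)
    zero_or_one = 4 * a ** (2 * n - 1)          # outputs "0" or "1"
    stay_initial = 8 * a ** (2 * n - 1)         # initial transition keeps the state
    no_halting = 8 * (n - 1) * (4 * n) ** (2 * n - 1)  # no halting transition
    return all_computations - zero_or_one - stay_initial - no_halting


bound = float(sum(Fraction(numerator(n), 2 ** (n + 1 + t_bits(n)))
                  for n in range(6, 201)))
print(round(bound, 7), bound < 1 / 32)
```

Subtracting further families of provably nonhalting machines in `numerator` would lower the bound, as noted in the text.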
6. Algorithmic Complexity of Integer Sequences
Measures that we introduced based on finite approximations of algorithmic probability have found applications in areas ranging from economics [14] to human behavior and cognition [9, 12, 13] to graph theory [15]. We have explored the use of other models of computation suggesting similar and correlated results in output distribution [22] and compatibility, in a range of applications, with general compression algorithms [21, 23]. We also investigated [5] the behavior of the additive constant involved in the invariance theorem from finite approximations to D(5), strongly suggesting fast convergence and smooth behavior of the invariance constant. In [15, 23], we introduced an AP-based measure for 2-dimensional patterns, based on replacing the tape of the reference Turing machine with a 2-dimensional grid. The actual implementation requires breaking any grid into smaller blocks for which we then have estimations of their algorithmic probability according to the Turing machine formalism described in [15, 23, 24].
Here we introduce an application of AP-based measures—as described above—to integer sequences. We show that an AP-based measure constitutes an alternative or complementary tool to lossless compression algorithms, widely used to find estimations of algorithmic complexity.
6.1. AP-Based Measure
The AP-based method used here is based on the distribution D(5) and is defined just like $m_k(s)$. However, to increase its range of applicability, given that D(5) produces all $2^{12}$ bit-strings of length 12 except for 2 (that are assigned maximum values and thus complete the set), we introduce what we call the Block Decomposition Method (BDM), which decomposes strings longer than 12 into strings of maximum length 12 whose complexity can be derived from D(5). The final estimation of the complexity of a string longer than 12 bits is then the sum of the complexities of its distinct substrings of length not exceeding 12, where a substring that occurs n times contributes its complexity only once plus $\log_2(n)$. The formula is motivated by the fact that n identical substrings do not have n times the complexity of one of them; the repetition adds only about $\log_2(n)$ bits. This is because the algorithmic complexity of the n substrings to be considered is at most the length of the “print(s) n times” program and not the length of “print(ss…s).” We have shown that this measure is a hybrid measure of complexity, providing local estimations of algorithmic complexity and global evaluations of Shannon entropy [24]. Formally, (19) $\mathrm{BDM}(X) = \sum_{i}^{k} m_5(x_i) + \log(s_i)$, where $m_5(x_i)$ stands for the complexity estimate of $x_i$ obtained from $m_5$, $s_i$ is the multiplicity of $x_i$, and the $x_i$ are the subsequences from the decomposition of X into k subsequences, with a possible remainder sequence y shorter than the decomposition length l if |X| is not a multiple of l. More details on error estimations for this particular measure extending the power of $m_5$ and on the boundary conditions are given in [24].
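The decomposition can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation behind the reported results: it uses non-overlapping blocks, treats a shorter remainder as one more block, takes logarithms in base 2, and relies on a hypothetical lookup table `complexity` that here holds only a few $K_{m_5}$ values from Table 1. A block length of 2 is used purely to keep the example small, whereas the method described above uses blocks of up to 12 bits.

```python
from collections import Counter
from math import log2

# Hypothetical lookup table with a few K_{m_5} values taken from Table 1.
complexity = {"0": 3.7671, "1": 3.7671,
              "00": 6.8255, "01": 6.8255, "10": 6.8255, "11": 6.8255}


def bdm(x, block_len, table):
    """Sum, over distinct blocks, of block complexity plus log2(multiplicity)."""
    blocks = [x[j:j + block_len] for j in range(0, len(x), block_len)]
    counts = Counter(blocks)             # block -> number of occurrences
    return sum(table[b] + log2(mult) for b, mult in counts.items())


print(bdm("010101", 2, complexity))   # one distinct block repeated 3 times
print(bdm("0100", 2, complexity))     # two distinct blocks, each seen once
```

A string made of one repeated block is therefore scored far lower than a string of the same length made of distinct blocks, which is the behavior the logarithmic term is meant to capture.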
6.2. The On-Line Encyclopedia of Integer Sequences (OEIS)
The On-Line Encyclopedia of Integer Sequences (OEIS) is a database of integer sequences, created and maintained by Neil Sloane and the OEIS Foundation.
Widely cited, the OEIS stores information on integer sequences of interest to both professional mathematicians and amateurs. As of 30 December 2016, it contained nearly 280,000 sequences, making it the largest database of its kind.
We found 875 binary sequences in the OEIS database, accessed through the knowledge engine WolframAlpha Pro and downloaded with the Wolfram Language.
Examples of sequences found to have the greatest algorithmic probability include the one described as “a maximally unpredictable sequence,” with associated sequence 0 1 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 0 0 0 1 1 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 1 1 1, and A068426, the “expansion of ln 2 in base 2,” with associated sequence 0 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 1 1 0 1 1 1 0 0 0 0 0 0. This contrasts with sequences of high entropy such as A130198, the single paradiddle, a four-note drumming pattern consisting of two alternating notes followed by two notes with the same hand, with sequence 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1, or A108737, found to be among the least compressible, with the description “start with S={}. For m=0,1,2,3,…, let u be the binary expansion of m. If u is not a substring of S, append the minimal number of 0’s and 1’s to S to remedy this. Sequence gives S” and sequence 0 1 0 1 1 0 0 1 1 1 0 0 0 1 0 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0 1 0 1 0 0 1 1 0 1. We found that the measure most driven by description length was compressibility.
The longest description of a binary sequence in the OEIS, identified as A123594, reads “unique sequence of 0s and 1s which are either repeated or not repeated with the following property: when the sequence is ‘coded’ in writing down a 1 when an element is repeated and a 0 when it is not repeated and by putting the initial element in front of the sequence thus obtained, the above sequence appears.”
6.3. Results
We found that the textual description length as derived from the database is, as illustrated above, best correlated with the AP-based (BDM) measure, with Spearman test statistic 0.193, followed by compression (only the sequence is compressed, not the description) with 0.17, followed by entropy, with 0.09 (Figure 2). Spearman rank correlation values among the complexity measures themselves show how they relate to each other: BDM versus compress, 0.21; BDM versus entropy, 0.029; and compress versus entropy, −0.01, over the 875 binary sequences in the OEIS database.
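For readers who wish to reproduce this kind of comparison, the following Python sketch shows the two statistical baselines and a Spearman rank correlation computed from first principles. It rests on assumptions that should be stated plainly: `zlib` is swapped in as a generic lossless compressor and is not necessarily the "compress" routine used for the figures above, entropy is taken over single symbols, and all function names are illustrative.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(bits: str) -> float:
    """Shannon entropy (bits per symbol) of the symbol frequencies in `bits`."""
    n = len(bits)
    return -sum(c / n * math.log2(c / n) for c in Counter(bits).values())


def compressed_length(bits: str) -> int:
    """Length in bytes after zlib compression (a stand-in compressor)."""
    return len(zlib.compress(bits.encode("ascii"), 9))


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

In use, one would compute a complexity value per sequence with each measure, collect the description lengths in a parallel list, and call `spearman(description_lengths, complexity_values)` for each measure in turn.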
Figure 2: (a) Correlation between the estimated algorithmic complexity (log) by the AP-based measure (BDM) and the length of the text description of each sequence from the OEIS. The fitted line for the highest correlation (BDM) is given by 1064.84+7.29x using least squares. (b) Algorithmic complexity estimation by BDM (log) and of compression on program length (in the Wolfram Language/Mathematica) as coming from the OEIS. The Spearman rank correlation values for each case are given in parentheses. Further compressing the program length using "compress" resulted in a lower correlation value, and BDM outperformed lossless compression.
We noticed that the descriptions of some sequences referred to other sequences to produce a new one (e.g., “A051066 read mod 2”). This artificially made some sequence descriptions look shorter than they should be. When avoiding all sequences referencing others, all Spearman rank values increased significantly, with values 0.25, 0.22, and 0.12 for BDM, compression, and entropy, respectively.
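One way such cross-references could be detected is by searching each description for OEIS identifiers. The sketch below is an assumption about how the filtering might be done, not a record of the procedure actually followed; it relies on identifiers having the form of the letter A followed by six digits, as in the examples quoted in this section, and the function name is hypothetical.

```python
import re

# Assumed identifier shape: "A" followed by exactly six digits, e.g. A051066.
A_NUMBER = re.compile(r"\bA\d{6}\b")


def references_other_sequence(description: str, own_id: str = "") -> bool:
    """True if the description mentions an A-number other than `own_id`."""
    return any(found != own_id for found in A_NUMBER.findall(description))


print(references_other_sequence("A051066 read mod 2"))
print(references_other_sequence("Expansion of ln 2 in base 2"))
```

Sequences for which the function returns true would then be set aside before the rank correlations are recomputed.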
To test whether the AP-based (BDM) measure captures some algorithmic content that the best statistical measures (compress and entropy) may be missing, we compressed the sequence description and compared again against the sequence complexity. The correlation between the compressed description and the sequence compression came closer to that of the AP-estimation by BDM, and BDM itself was even better. The Spearman values after compressing textual descriptions were 0.27, 0.24, and 0.13 for BDM, compress, and entropy, respectively.
We then looked at 139,546 integer sequences from the OEIS database, avoiding other noninteger sequences in the database. Those considered represent more than half of the database. Every integer was converted into binary, and for each binary sequence representing an integer an estimation of its algorithmic complexity was calculated. We compared the total sum of the complexity of the sequence (first 40 terms) against its text description length (both compressed and uncompressed, the latter obtained by converting every character into its ASCII code), its program length, and its function length, these last two in the Wolfram Language (using Mathematica). While none of those descriptions can be considered the shortest possible, their lengths are upper bounds on the lengths of the shortest versions. As shown in Figure 2, we found that the AP-based measure (BDM) performed best when comparing program size and estimated complexity from the program-generated sequence.
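The per-sequence quantity described above (binary expansion of each term, one complexity estimate per term, summed over the first 40 terms) can be sketched as follows. The handling of negative terms is not specified in the text, so dropping the sign is an assumption of this sketch, as are the function names; the estimator is passed in as a parameter and could be any per-string measure, such as BDM.

```python
from typing import Callable, Sequence


def to_binary(n: int) -> str:
    """Binary expansion of |n| without prefix; the sign is dropped (assumption)."""
    return format(abs(n), "b")


def sequence_complexity(
    terms: Sequence[int],
    estimator: Callable[[str], float],
    max_terms: int = 40,
) -> float:
    """Sum of per-term complexity estimates over the first `max_terms` terms."""
    return sum(estimator(to_binary(t)) for t in terms[:max_terms])


# Illustration with string length as a trivial stand-in estimator.
print(sequence_complexity([1, 2, 3, 5, 8, 13], len))
```

Replacing `len` with an actual complexity estimator yields the value that is then compared against description, program, and function lengths.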
7. Conclusion
Computable approximations to algorithmic information measures are certainly useful. For example, lossless compression methods have been widely used to approximate K, despite their limitations and their departure from algorithmic complexity. Most of these algorithms, for example, those based on LZ and LZW such as zip, gzip, and png, are closer to entropy-rate estimators than to algorithmic ones. In this paper, we have studied the formal properties of a computable algorithmic probability measure m and of finite approximations mk to m. These measures can be used to approximate K by means of the Coding Theorem Method (CTM), despite the invariance theorem, which sheds no light on the rate of convergence to K. Here we compared m and D(5) and concluded that for practical purposes the two produce similar results. What we have reported in this paper are the first steps toward a formal analysis of finite approximations to algorithmic probability-based measures based on small Turing machines. The results shown in Figure 2 strongly suggest that AP-based measures are not only an alternative to lossless compression algorithms for estimating algorithmic (Kolmogorov-Chaitin) complexity but may actually capture features that statistical methods, such as entropy and lossless compression based on popular algorithms such as LZW, cannot capture.
All calculations can be performed and reproduced by using the Online Algorithmic Complexity Calculator available at http://www.complexitycalculator.com/.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
The authors wish to thank the members of the Algorithmic Nature Group. Hector Zenil also wishes to acknowledge the support of the Swedish Research Council (Vetenskapsrådet) (Grant no. 2015-05299).
References
[1] Kolmogorov A. N. Three approaches to the quantitative definition of information.
[2] Chaitin G. J. On the length of programs for computing finite binary sequences: statistical considerations.
[3] Levin L. A. Laws on the conservation (zero increase) of information, and questions on the foundations of probability theory.
[4] Delahaye J.-P., Zenil H. Numerical evaluation of algorithmic complexity for short strings: a glance into the innermost structure of randomness.
[5] Soler-Toscano F., Zenil H., Delahaye J.-P., Gauvrit N. Calculating Kolmogorov complexity from the output frequency distributions of small Turing machines.
[6] Solomonoff R. J. A formal theory of inductive inference. Part II.
[7] Kirchherr W., Li M., Vitányi P. The miraculous universal distribution.
[8] Cover T. M., Thomas J. A.
[9] Gauvrit N., Zenil H., Soler-Toscano F., Delahaye J., Brugger P., Santos F. C. Human behavioral complexity peaks at age 25.
[10] Gauvrit N., Zenil H., Delahaye J.-P., Soler-Toscano F. Algorithmic complexity for short binary strings applied to psychology: a primer.
[11] Gauvrit N., Singmann H., Soler-Toscano F., Zenil H. Algorithmic complexity for psychology: a user-friendly implementation of the coding theorem method.
[12] Gauvrit N., Soler-Toscano F., Zenil H. Natural scene statistics mediate the perception of image complexity.
[13] Kempe V., Gauvrit N., Forsyth D. Structure emerges faster during cultural transmission in children than in adults.
[14] Zenil H., Delahaye J.-P. An algorithmic information theoretic approach to the behaviour of financial markets.
[15] Zenil H., Soler-Toscano F., Dingle K., Louis A. A. Correlation of automorphism group size and topological properties with program-size complexity evaluations of graphs and complex networks.
[16] Zenil H., Kiani N. A., Tegnér J. Methods of information theory and algorithmic complexity for network biology.
[17] Zenil H., Kiani N. A., Tegnér J. Low-algorithmic-complexity entropy-deceiving graphs.
[18] Radó T. On non-computable functions.
[19] Delahaye J.-P., Zenil H.
[20] Brady A. H. The determination of the value of Rado's noncomputable function Σ(k) for four-state Turing machines.
[21] Soler-Toscano F., Zenil H., Delahaye J.-P., Gauvrit N. Correspondence and independence of numerical evaluations of algorithmic information measures.
[22] Zenil H., Delahaye J.-P. On the algorithmic nature of the world.
[23] Zenil H., Soler-Toscano F., Delahaye J., Gauvrit N. Two-dimensional Kolmogorov complexity and an empirical validation of the coding theorem method by compressibility.
[24] Zenil H., Soler-Toscano F., Kiani N. A., Hernández-Orozco S., Rueda-Toicen A. A decomposition method for global evaluation of Shannon entropy and local estimations of algorithmic complexity. https://arxiv.org/abs/1609.00110