Generalizing Benford's law using power laws: application to integer sequences

A simple method to derive parametric analytical extensions of Benford's law for first digits of numerical data is proposed. Two generalized Benford distributions are considered, namely the two-sided power Benford distribution and the new Pareto Benford distribution. The fitting capabilities of these generalized Benford distributions are illustrated and compared on some interesting and important integer sequences.


Introduction.
Since Newcomb(1881) and Benford(1938) it is known that many numerical data sets follow Benford's law or are closely approximated by it. To be specific, if the random variable $X$, which describes the first significant digit in a numerical table, is Benford distributed, then

$$P(X = d) = \log_{10}\left(1 + \frac{1}{d}\right), \quad d = 1, \ldots, 9. \qquad (1.1)$$

Mathematical explanations of this law have been proposed by Pinkham(1961), Cohen(1976), Hill(1995a/b/c,97,98), Allart(1997), Janvresse and de la Rue(2004). In recent years an upsurge of applications of Benford's law has appeared, as can be seen from the recently compiled bibliography by Hürlimann(2006). Hill(1995c) also suggested shifting attention to probability distributions that follow or closely approximate Benford's law. Papers along this path include Leemis et al.(2000) and Engel and Leuenberger(2003). Some survival distributions that satisfy Benford's law exactly are known. However, not many simple analytical distributions that include Benford's law as a special case are known. Combining facts from Leemis et al.(2000) and Dorp and Kotz(2002), such a simple one-parameter family of distributions has been considered in Hürlimann(2003). As a sequel, the present paper considers a further generalization of Benford's law.
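As a quick numerical illustration of (1.1), the following minimal sketch tabulates the Benford probabilities of the nine first digits. The function name `benford_pmf` is an illustrative choice and not part of the paper.

```python
import math

def benford_pmf():
    """Benford probabilities P(X = d) = log10(1 + 1/d) for d = 1, ..., 9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

probs = benford_pmf()
for d, p in probs.items():
    print(f"digit {d}: {p:.4f}")
```

The probabilities telescope to $\log_{10} 10 = 1$, and digit 1 alone accounts for about 30% of the mass.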
The interest of enlarged Benford laws is two-fold. First, parametric extensions may provide a better fit of the data than Benford's law itself. Second, they yield a simple statistical procedure to validate Benford's law. If Benford's model is sufficiently "close" to the one-parameter extended model, then it will be retained. These points will be illustrated through our application to integer sequences.

Generalizing Benford's distribution.
If $T$ denotes a random lifetime with survival distribution $S(t) = P(T \geq t)$, then the value $Y$ of the first significant digit in the lifetime $T$ has probability distribution

$$P(Y = y) = \sum_{i=-\infty}^{\infty} \left[ S(y \cdot 10^{i}) - S((y+1) \cdot 10^{i}) \right], \quad y = 1, \ldots, 9. \qquad (2.1)$$

Alternatively, if $D$ denotes the integer-valued random variable satisfying $10^{D} \leq T < 10^{D+1}$, then the first significant digit can be written in terms of $T$ and $D$ as $Y = [T \cdot 10^{-D}]$, where $[x]$ denotes the greatest integer less than or equal to $x$. In particular, if the random variable $Z = \log T - D$ (with $\log$ the logarithm to base 10) is uniformly distributed as $U(0,1)$, then the first significant digit $Y$ is exactly Benford distributed. Starting from the uniform random variable $W = U(0,2)$ or the triangular random variable $W = Triangular(0,1,2)$ with probability density function $f_W(w) = w$ for $w \in (0,1)$ and $f_W(w) = 2 - w$ for $w \in [1,2)$, one shows that the random lifetime $T = 10^{W}$ generates the first digit Benford distribution (Leemis et al.(2000), Examples 1 and 2).
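The two examples can be checked numerically. The sketch below assumes that $T = 10^{W}$ with $W$ supported on $(0,2)$, so that the first digit equals $y$ exactly when $W$ falls in $[k + \log_{10} y, k + \log_{10}(y+1))$ for $k = 0$ or $k = 1$. The function names are illustrative and not taken from the cited papers.

```python
import math

def cdf_uniform_0_2(w):
    """Distribution function of W ~ U(0, 2)."""
    return min(max(w / 2.0, 0.0), 1.0)

def cdf_triangular_0_1_2(w):
    """Distribution function of W ~ Triangular(0, 1, 2)."""
    if w <= 0.0:
        return 0.0
    if w <= 1.0:
        return w * w / 2.0
    if w < 2.0:
        return 1.0 - (2.0 - w) ** 2 / 2.0
    return 1.0

def first_digit_pmf(cdf_w, decades=(0, 1)):
    """P(Y = y) for T = 10**W, summing the mass of W over each decade k."""
    pmf = {}
    for y in range(1, 10):
        lo, hi = math.log10(y), math.log10(y + 1)
        pmf[y] = sum(cdf_w(k + hi) - cdf_w(k + lo) for k in decades)
    return pmf

for name, cdf in [("uniform", cdf_uniform_0_2), ("triangular", cdf_triangular_0_1_2)]:
    pmf = first_digit_pmf(cdf)
    print(name, [round(pmf[y], 4) for y in range(1, 10)])
```

Both rows should agree with the Benford probabilities $\log_{10}(1 + 1/y)$ up to rounding.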
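A simple parametric distribution, which includes as special cases both the above uniform and triangular distributions, is the two-sided power random variable $W$ with parameter $c > 0$ considered in Dorp and Kotz(2002) with probability density function

$$f_W(w) = \begin{cases} \dfrac{c}{2}\, w^{c-1}, & 0 < w \leq 1, \\[4pt] \dfrac{c}{2}\, (2-w)^{c-1}, & 1 \leq w < 2. \end{cases}$$

If $c = 1$ then $W = U(0,2)$, and if $c = 2$ then $W = Triangular(0,1,2)$. The random lifetime $T = 10^{W}$ therefore generates first digit distributions closely related to Benford's distribution, at least if $c$ is close to 1 or 2.

Theorem 2.1. Let $W$ be the two-sided power random variable with the above probability density function. Then the first digit $Y$ of $T = 10^{W}$ has the one-parameter two-sided power Benford (TSPB) probability density function

$$f_Y(y) = \frac{1}{2} \left\{ [\log(1+y)]^{c} - [\log y]^{c} - [1 - \log(1+y)]^{c} + [1 - \log y]^{c} \right\}, \quad y = 1, \ldots, 9. \qquad (2.4)$$

Proof. This has been shown in Hürlimann(2003). ◊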
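A minimal sketch of the TSPB distribution follows. It assumes the closed form obtained by integrating a two-sided power density on $(0,2)$ with mode 1 and shape parameter $c$ over the two decades $[\log_{10} y, \log_{10}(y+1))$ and $[1 + \log_{10} y, 1 + \log_{10}(y+1))$. The name `tspb_pmf` is hypothetical.

```python
import math

def tspb_pmf(c):
    """First digit probabilities of T = 10**W for a two-sided power W with shape c > 0.

    Assumed closed form:
    0.5 * (a**c - b**c - (1 - a)**c + (1 - b)**c),
    with a = log10(y + 1) and b = log10(y).
    """
    pmf = {}
    for y in range(1, 10):
        a, b = math.log10(y + 1), math.log10(y)
        pmf[y] = 0.5 * (a**c - b**c - (1 - a) ** c + (1 - b) ** c)
    return pmf

for c in (1.0, 1.5, 2.0, 3.0):
    print(c, [round(p, 4) for p in tspb_pmf(c).values()])
```

For $c = 1$ and $c = 2$ the output coincides with Benford's law, which is why a fitted value of $c$ near 1 or 2 can be read as support for the Benford model.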

From the geometric Brownian motion to the Pareto Benford law.
Another interesting distribution, which also takes the form of a two-sided power law, is the double Pareto random variable $W$ with parameters $s, \alpha, \beta > 0$ considered in Reed(2001) with probability density function

$$f_W(w) = \begin{cases} \dfrac{\alpha\beta}{\alpha+\beta} \cdot \dfrac{1}{s} \left( \dfrac{w}{s} \right)^{\beta-1}, & 0 < w \leq s, \\[6pt] \dfrac{\alpha\beta}{\alpha+\beta} \cdot \dfrac{1}{s} \left( \dfrac{w}{s} \right)^{-\alpha-1}, & w \geq s. \end{cases}$$

Recall the stochastic mechanism and the natural motivation that generate this distribution. It is often assumed that the time evolution of a stochastic phenomenon $X_t$ involves a variable but size-independent proportional growth rate and can thus be modeled by a geometric Brownian motion (GBM) described by the stochastic differential equation

$$dX = \mu X \, dt + \sigma X \, dW,$$

where $dW$ is the increment of a Wiener process. Since the proportional increment of a GBM in time $dt$ has a systematic component $\mu \, dt$ and a random white noise component $\sigma \, dW$, GBM can be viewed as a stochastic version of a simple exponential growth model. The GBM has long been used to model the evolution of stock prices (Black-Scholes option pricing model), firm sizes, city sizes and individual incomes. It is well-known that empirical studies on such phenomena often exhibit power-law behavior. However, the state of a GBM after a fixed time $T$ follows a lognormal distribution, which does not exhibit power-law behavior.
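Why does one observe power-law behavior for phenomena apparently evolving like a GBM? A simple mechanism that generates power-law behavior in the tails is to assume that the time of observation $T$ is itself a random variable with an exponential distribution. The distribution of $X_T$ with fixed initial state $s$ is then described by the double Pareto distribution with parameters $s, \alpha, \beta$, where $\alpha$ and $\beta$ are the positive roots of the characteristic equations

$$\tfrac{1}{2}\sigma^{2} z^{2} + \left(\mu - \tfrac{1}{2}\sigma^{2}\right) z - \lambda = 0, \qquad \tfrac{1}{2}\sigma^{2} z^{2} - \left(\mu - \tfrac{1}{2}\sigma^{2}\right) z - \lambda = 0,$$

respectively, and $\lambda$ is the parameter of the exponentially distributed random variable $T$. Setting $s = 1$ one obtains the following generalized Benford distribution.

Theorem 3.1. Let $W$ be the double Pareto random variable with $s = 1$, that is with probability density function $f_W(w) = \frac{\alpha\beta}{\alpha+\beta} w^{\beta-1}$ for $0 < w \leq 1$ and $f_W(w) = \frac{\alpha\beta}{\alpha+\beta} w^{-\alpha-1}$ for $w \geq 1$. Then the first digit $Y$ of $T = 10^{W}$ has the two-parameter Pareto Benford (PB) probability density function

$$f_Y(y) = \frac{\alpha}{\alpha+\beta} \left\{ [\log(1+y)]^{\beta} - [\log y]^{\beta} \right\} + \frac{\beta}{\alpha+\beta} \sum_{k=1}^{\infty} \left\{ [k + \log y]^{-\alpha} - [k + \log(1+y)]^{-\alpha} \right\}, \quad y = 1, \ldots, 9. \qquad (3.5)$$

Proof. The probability density function of $T = 10^{W}$ is given by $f_T(t) = f_W(\ln t / \ln 10) / (t \ln 10)$, $t > 1$. Since $T > 1$, the first digit equals $y$ exactly when $T \in [y \cdot 10^{k}, (y+1) \cdot 10^{k})$ for some $k \geq 0$, hence $f_Y(y) = \sum_{k=0}^{\infty} \int_{y \cdot 10^{k}}^{(y+1) \cdot 10^{k}} f_T(t) \, dt$. Making the change of variable $u = \ln t / \ln 10$, one obtains $f_Y(y) = \sum_{k=0}^{\infty} \int_{k + \log y}^{k + \log(1+y)} f_W(u) \, du$. The term $k = 0$ lies in $(0,1]$ and yields the first summand of (3.5), while the terms $k \geq 1$ lie in $[1, \infty)$ and yield the infinite series. ◊

For $\beta = 1$ and $\alpha \to \infty$ the PB distribution converges to Benford's law.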
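The PB probabilities can be evaluated by truncating the infinite series. The sketch below assumes a double Pareto variable $W$ with $s = 1$, so that $P(W \leq w) = \frac{\alpha}{\alpha+\beta} w^{\beta}$ on $(0,1]$ and $P(W > w) = \frac{\beta}{\alpha+\beta} w^{-\alpha}$ on $[1,\infty)$, and takes the first digit of $T = 10^{W}$. The function name `pb_pmf` and the truncation index `m` are illustrative choices.

```python
import math

def pb_pmf(alpha, beta, m=1000):
    """Pareto Benford first digit probabilities, series truncated after m terms.

    Assumed form: alpha/(alpha+beta) * (a**beta - b**beta)
    + beta/(alpha+beta) * sum_{k=1..m} ((k + b)**(-alpha) - (k + a)**(-alpha)),
    with a = log10(y + 1) and b = log10(y).
    """
    w_body = alpha / (alpha + beta)   # mass of W on (0, 1]
    w_tail = beta / (alpha + beta)    # mass of W on [1, infinity)
    pmf = {}
    for y in range(1, 10):
        a, b = math.log10(y + 1), math.log10(y)
        body = a**beta - b**beta
        tail = sum((k + b) ** (-alpha) - (k + a) ** (-alpha) for k in range(1, m + 1))
        pmf[y] = w_body * body + w_tail * tail
    return pmf

print([round(p, 4) for p in pb_pmf(alpha=2.0, beta=1.5).values()])
print([round(p, 4) for p in pb_pmf(alpha=1e4, beta=1.0, m=50).values()])
```

Under these assumptions the truncated probabilities sum to $1 - \frac{\beta}{\alpha+\beta}(m+1)^{-\alpha}$, so the truncation error is easy to control. The second call, with $\beta = 1$ and a large $\alpha$, is close to Benford's law.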

Fitting the first digit distributions of integer sequences.
Minimum chi-square estimation of the generalized Benford distributions is straightforward with modern computer algebra systems. The fitting capabilities of the new distributions are illustrated on some interesting and important integer sequences.
The first digit occurrences of the analyzed integer sequences are listed in Table 4.1. The minimum chi-square estimators of the generalized distributions, together with the summation index $m$ at which the infinite series (3.5) is truncated, are displayed in Table 4.2. Statistical results are summarized in Table 4.3. For comparison we list the chi-square values and their corresponding p-values. The results are discussed below.
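The minimum chi-square procedure can be sketched without a computer algebra system. The example below is an illustration only. It fits the one-parameter TSPB model to the first digits of the squares $1^2, \ldots, 999^2$ by a plain grid search over $c$, which is a stand-in technique and not necessarily the optimizer used for Table 4.2. All function names, the grid bounds and the choice of sequence length are assumptions.

```python
import math

def digit_counts(seq):
    """Occurrences of the first digits 1..9 in a sequence of positive integers."""
    counts = [0] * 9
    for n in seq:
        counts[int(str(n)[0]) - 1] += 1
    return counts

def tspb_probs(c):
    """TSPB probabilities for y = 1..9 (assumed closed form; c = 1 gives Benford)."""
    probs = []
    for y in range(1, 10):
        a, b = math.log10(y + 1), math.log10(y)
        probs.append(0.5 * (a**c - b**c - (1 - a) ** c + (1 - b) ** c))
    return probs

def chi_square(counts, probs):
    """Pearson chi-square statistic of observed counts against model probabilities."""
    n = sum(counts)
    return sum((obs - n * p) ** 2 / (n * p) for obs, p in zip(counts, probs))

def fit_tspb(counts, c_min=0.1, c_max=10.0, step=0.01):
    """Grid search for the value of c that minimizes the chi-square statistic."""
    size = int(round((c_max - c_min) / step)) + 1
    grid = [c_min + step * i for i in range(size)]
    best_c = min(grid, key=lambda c: chi_square(counts, tspb_probs(c)))
    return best_c, chi_square(counts, tspb_probs(best_c))

counts = digit_counts(n * n for n in range(1, 1000))
chi_benford = chi_square(counts, tspb_probs(1.0))
c_hat, chi_tspb = fit_tspb(counts)
print("counts:", counts)
print("Benford chi-square:", round(chi_benford, 2))
print("TSPB estimate c:", round(c_hat, 2), "chi-square:", round(chi_tspb, 2))
```

Because Benford's law is the special case $c = 1$, the fitted chi-square can never be noticeably larger than the Benford chi-square. Comparing the two values is the validation idea described in the introduction.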
The definition, origin and comments on the mathematical interest of a great part of these integer sequences have been discussed in Hürlimann(2003). Further details on all sequences can be retrieved from the considerable related literature. The mixing sequence represents the aggregate of the integer sequences considered in Hürlimann(2003). All 19 integer sequences considered are fitted quite well by the new PB distribution. For 14 sequences the minimum chi-square value of the PB distribution is the smallest of the three compared values, and in the other 5 cases it does not differ much from the chi-square value of the TSPB distribution (green cells in Table 4.3).
The Benford property for the sequence of primes has long been studied. Diaconis(1977) shows that primes are not Benford distributed. However, it is known that the sequence of primes is Benford distributed with respect to densities other than the usual natural density (Whitney(1972), Schatte(1983), Cohen and Katz(1983)). Bombieri (see Serre(1996), p. 76) has noted that the analytical density of primes with first digit 1 is $\log_{10} 2$, and this result can be easily generalized to Benford behavior for any first digit. Table 4.3 shows that the primes less than 1,000 and less than 10,000 are not at all Benford or TSPB distributed, but they are approximately PB distributed with high p-values of 93.3% and 99.9%, respectively. Is this a new property of the prime number sequence? Unfortunately, the fit of the Pareto Benford distribution for the 78,499 prime numbers below 100,000 is rejected since the corresponding minimum chi-square value equals 1,391. Therefore it seems that the good fit of the PB distribution remains restricted to finite prime number sequences. Similar observations can be made for the sequences of squares and cubes. Recall that the exact probability distribution of the first digit of m-th integer powers with at most n digits is known and asymptotically related to Benford's law (e.g. Hürlimann(2004)). Here again the fit of the PB distribution is very good when restricted to finite sequences but breaks down for longer sequences. A further remarkable result is that Benford's law is rejected for the mixing sequence at the 5% significance level, while the PB law is accepted with a 93.6% p-value, which improves on the p-value of 25.2% obtained previously for the TSPB law in Hürlimann(2003).
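The departure of short prime sequences from Benford's law is easy to reproduce. The following sketch, whose function names are illustrative, sieves the primes below a bound, counts their first digits and computes the Pearson chi-square statistic against Benford's law. The 5% critical value 15.507 for 8 degrees of freedom is the standard table value.

```python
import math

def primes_below(n):
    """Sieve of Eratosthenes: all primes p < n, for n >= 2."""
    sieve = bytearray([1]) * n
    sieve[0] = sieve[1] = 0
    for p in range(2, math.isqrt(n - 1) + 1):
        if sieve[p]:
            # Cross out the multiples p*p, p*p + p, ... below n.
            sieve[p * p::p] = bytearray(len(range(p * p, n, p)))
    return [i for i in range(n) if sieve[i]]

def digit_counts(seq):
    """Occurrences of the first digits 1..9 in a sequence of positive integers."""
    counts = [0] * 9
    for k in seq:
        counts[int(str(k)[0]) - 1] += 1
    return counts

def chi_square_benford(counts):
    """Pearson chi-square statistic of the counts against Benford's law."""
    n = sum(counts)
    expected = [n * math.log10(1 + 1 / d) for d in range(1, 10)]
    return sum((o - e) ** 2 / e for o, e in zip(counts, expected))

CRITICAL_5_PERCENT_DF8 = 15.507  # chi-square quantile, 8 degrees of freedom

for bound in (1_000, 10_000):
    counts = digit_counts(primes_below(bound))
    stat = chi_square_benford(counts)
    print(bound, counts, round(stat, 2), stat > CRITICAL_5_PERCENT_DF8)
```

The first digits of primes below a power of ten are spread far more evenly than Benford's law predicts, so the statistic lies well above the critical value.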
Strong numerical evidence for the Benford property of the Fibonacci, Bell, Catalan and partition numbers is observed (corresponding yellow cells in Tables 4.2 and 4.3). In particular, the values of the parameters $\beta$ and $\alpha$ of the PB distribution for the Fibonacci sequence are close to 1 and $\infty$, respectively, which means that the PB distribution is almost Benford, as remarked after Theorem 3.1. It is well-known that the Fibonacci sequence is Benford distributed (e.g. Brown and Duncan(1970), Wlodarski(1971), Sentence(1973), Webb(1975), Raimi(1976), Brady(1978) and Kunoff(1987)). The same result for Bell numbers has been derived formally in Hürlimann(2003), Theorem 4.1.
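The Benford behavior of the Fibonacci numbers can be observed directly. The sketch below compares the first digit frequencies of the first 1,000 Fibonacci numbers with the Benford probabilities. The sequence length and the function names are illustrative choices and need not match the sample used in Table 4.1.

```python
import math

def fibonacci(count):
    """First `count` Fibonacci numbers, starting 1, 1, 2, 3, ..."""
    seq, a, b = [], 1, 1
    for _ in range(count):
        seq.append(a)
        a, b = b, a + b
    return seq

def first_digit_frequencies(seq):
    """Relative frequencies of the first digits 1..9."""
    counts = [0] * 9
    for n in seq:
        counts[int(str(n)[0]) - 1] += 1
    return [c / len(seq) for c in counts]

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
freqs = first_digit_frequencies(fibonacci(1000))
max_dev = max(abs(f - p) for f, p in zip(freqs, benford))

for d, (f, p) in enumerate(zip(freqs, benford), start=1):
    print(f"digit {d}: observed {f:.3f}, Benford {p:.3f}")
print("largest absolute deviation:", round(max_dev, 4))
```

Python integers have arbitrary precision, so the leading digit of each term is read from its exact decimal expansion and not from a floating-point approximation.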

Table 4.1: First digit distributions of some integer sequences

Table 4.3: Fitting integer sequences to the Benford and generalized Benford distributions