THE m-VERSION OF BINARY SEARCH TREES: AN AVERAGE-CASE ANALYSIS

Following a suggestion of Cichoń and Macyna, binary search trees are generalized by keeping m (classical) binary search trees and distributing incoming data at random among the individual trees. The costs of unsuccessful and successful search are analyzed, as well as the internal path length.


Introduction
Cichoń, together with his coauthor Macyna, had the seminal idea [2] to generalize approximate counting to approximate counting with m counters. While in the original version [3] a stream of letters (a word) is processed by a single counter in a certain way (that is of no interest here), the new version uses m counters and chooses for each letter one of these counters (with probability $\frac1m$), where it is processed as usual. The result of the procedure is the sum of the individual results of the m counters.
This fundamental idea should, however, not be restricted to approximate counting! Indeed, it can be considered within a variety of different contexts. In this paper, the fundamental idea is applied to binary search trees. They are very well understood and described in classic books such as [7] and [9], with plenty of backward pointers to the older literature due to Lynch, Hibbard, Louchard, Brown, Shubert, and many others. We assume that, instead of just one, m binary search trees are kept, and for each element, when inserting it, a decision is made to which of the m trees it is sent. Of course, for algorithmic purposes, this choice must be deterministic, so that one knows in which tree to search. For the analysis, however, it is assumed that each tree is equally likely and is selected with probability $\frac1m$.

Almost all the information about binary search trees that is known to this day can be found in the encyclopedic books [7, 9]. We only mention that they originate from random permutations (typically of {1, . . . , n}); a new element is compared to the root and, if its place is already occupied, moved to the left/right subtree if it is smaller/larger than the root; then the process continues recursively. A binary search tree is used as a data structure. It is thus essential that one can find existing elements in a reasonable number of steps, and also learn that a searched element is not present after a small number of comparisons.

This is the first paper about m-binary search trees, and the hope is that many more will be written in the future, by various specialists. Thus, no completeness is aimed at. Three parameters are studied: the cost of inserting a new element into the m trees, which is related to the cost of unsuccessful search; the cost of successful search, which is the average of the levels of all the elements in all m trees; and the internal path length, which is the sum of the internal path lengths of the m trees.
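The insertion model just described is easy to simulate. The following sketch (Python, with helper names of our own choosing) keeps m ordinary binary search trees and routes each incoming key to a uniformly random tree, exactly as in the probabilistic model used for the analysis:

```python
import random

def insert(tree, key):
    """Insert key into a BST stored as nested dicts; return comparisons used."""
    comparisons = 0
    node = tree
    while True:
        comparisons += 1                      # one comparison per visited node
        if key < node["key"]:
            if node["left"] is None:
                node["left"] = {"key": key, "left": None, "right": None}
                return comparisons
            node = node["left"]
        else:
            if node["right"] is None:
                node["right"] = {"key": key, "left": None, "right": None}
                return comparisons
            node = node["right"]

def build_m_bst(keys, m, rng):
    """Distribute each key to one of m BSTs, chosen uniformly at random."""
    trees = [None] * m
    for key in keys:
        i = rng.randrange(m)                  # each tree has probability 1/m
        if trees[i] is None:
            trees[i] = {"key": key, "left": None, "right": None}
        else:
            insert(trees[i], key)
    return trees

def size(node):
    """Number of nodes in a tree (None is the empty tree)."""
    return 0 if node is None else 1 + size(node["left"]) + size(node["right"])
```

Running `build_m_bst` on a shuffled list of N keys produces m trees whose sizes fluctuate around N/m, which is exactly the intuition made precise by the analysis below.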
In the classical case, probability generating functions are available, so that one can extract moments from them, which can be written in terms of harmonic numbers and generalizations. We describe here how these probability generating functions translate to the m-model.
We try to use consistent notation: if the probability generating function is $f(z)$, then we write $F(z)$ for the transformed m-version. We always write n for the number of nodes in a classical binary search tree and N for the total number of nodes in the m binary search trees. Furthermore, we write $P_{n,k}$, $P_{N,k}$, $E_n$, $E_n^{(2)}$, $E_N$, $E_N^{(2)}$ for probabilities and moments. We use the second factorial moments on our way to the variance.
A crucial expression is
$$\frac{1}{m^N}\binom{N}{n_1,\dots,n_m},\qquad n_1+\cdots+n_m=N,$$
which is the probability that the N data split into m sets of sizes $n_1,\dots,n_m$. It turns out that we have to use three auxiliary quantities, named $S_N$, $T_N$, $U_N$, which are introduced in the next section. All our quantities of interest can be expressed in terms of them. This is done in full in the section on unsuccessful search, but only sketched in the remaining sections, since the actual computations are quite long.
The intuition is of course that each of the m binary search trees should have roughly N/m nodes; the analysis that follows will make this precise.
The classic book [8] is an excellent source on harmonic numbers and their manipulation; in fact, the quantity $T_n$ appears in it already!

Unsuccessful search
The first parameter that we study is the number of comparisons needed to insert node n into a binary search tree holding n − 1 nodes. This is directly related to searching for a key which is not present, since such a search is equivalent to inserting the (nonexistent) item as the n-th node. The probability generating function is
$$g_n(z) = \frac{1}{n!}\,2z(2z+1)\cdots(2z+n-2),$$
so that the probability that k comparisons are needed is
$$P_{n,k} = [z^k]\,g_n(z) = \frac{2^k}{n!}\begin{bmatrix} n-1 \\ k \end{bmatrix},$$
with unsigned Stirling cycle numbers. From this, one derives
$$E_n = g_n'(1) = 2H_n - 2$$
and
$$E_n^{(2)} = g_n''(1) = 4(H_n-1)^2 - 4\big(H_n^{(2)}-1\big).$$
All this is classical. Now we translate this into the m-model. The last node, N, sits in one of the m binary search trees, say of size $n_i$. Therefore
$$P_{N,k} = \frac{1}{m^N}\sum_{n_1+\cdots+n_m=N}\binom{N}{n_1,\dots,n_m}\sum_{i=1}^{m}\frac{\binom{N-1}{n_i-1}}{\binom{N}{n_i}}\,P_{n_i,k}.$$
The quotient $\binom{N-1}{n_i-1}\big/\binom{N}{n_i}$ is the probability that the remaining $n_i-1$ nodes of that tree can be chosen. On the level of probability generating functions, this means
$$G_N(z) = \frac{1}{m^N}\sum_{n_1+\cdots+n_m=N}\binom{N}{n_1,\dots,n_m}\sum_{i=1}^{m}\frac{\binom{N-1}{n_i-1}}{\binom{N}{n_i}}\,g_{n_i}(z);$$
the last form is obtained by multiplication by $z^k$ and summing. Moments can be computed from this, using differentiations. In order to do so, we need some auxiliary sums that will also be useful in later sections.
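The classical facts above can be verified with exact rational arithmetic. The sketch below (ad hoc names; valid for n ≥ 2) expands the product form of $g_n(z)$ and checks that its coefficients are probabilities with mean $2H_n - 2$:

```python
from fractions import Fraction
from math import factorial

def g_coeffs(n):
    """Coefficients of g_n(z) = 2z(2z+1)...(2z+n-2)/n!, i.e. P_{n,k}; n >= 2."""
    poly = [Fraction(0), Fraction(2)]          # start with the factor 2z
    for j in range(1, n - 1):                  # multiply by (2z + j)
        new = [Fraction(0)] * (len(poly) + 1)
        for k, c in enumerate(poly):
            new[k] += c * j                    # constant part of the factor
            new[k + 1] += 2 * c                # 2z part of the factor
        poly = new
    return [c / factorial(n) for c in poly]

def mean(coeffs):
    """First moment of the distribution with these probabilities."""
    return sum(k * c for k, c in enumerate(coeffs))

def H(n):
    """Harmonic number H_n."""
    return sum(Fraction(1, j) for j in range(1, n + 1))
```

For instance, the coefficients of $g_6(z)$ sum to 1, and their mean is $2H_6 - 2 = 29/10$, in agreement with $E_n = 2H_n - 2$.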

Lemma 1. The auxiliary sums $S_n$, $T_n$, $U_n$ admit closed forms.
Proof. All the proofs use the basic recursion for binomial coefficients to create a first-order recursion, which can be solved by summation. The procedure for $T_n$ is contained in [8].
In our applications, $x = \frac{1}{m-1}$, and the formulae specialize accordingly. After these long but necessary computations have been done, we can now compute the moments: by differentiation we find $E_N$, and, by a similar (but much longer) computation, two differentiations yield $E_N^{(2)}$. From these results, we can get the variance explicitly as $E_N^{(2)} + E_N - E_N^2$. We don't display it since it is quite long. However, we will drop exponentially small terms of the form $O(\rho^N)$ with $1-\frac{1}{m} < \rho < 1$; then the results are a bit more appealing. (The sums can be extended to infinity; the extra terms are absorbed in our exponentially small remainder term.) The remaining sums can be asymptotically evaluated: nothing more is required than the generating function of the harmonic numbers. What we have done here is justified by singularity analysis, as described in [4]; note that $z = 1$ is the dominant singularity here.
Theorem 1. The expectation and variance of the number of comparisons needed to insert the last element into m binary search trees of altogether N nodes are given asymptotically by
$$E_N = 2\log\frac{N}{m} + 2\gamma - 2 + o(1),\qquad V_N = 2\log\frac{N}{m} + 2\gamma + 2 - \frac{2\pi^2}{3} + o(1).$$
More terms in the asymptotic expansions are easily available.
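By the splitting probabilities above, the tree receiving the last key has size n, where n − 1 is binomially distributed with parameters N − 1 and 1/m; conditioning on n and using $E_n = 2H_n - 2$ gives the exact m-model expectation. A sketch under that reading (function names are ours):

```python
from fractions import Fraction
from math import comb

def H(n):
    """Harmonic number H_n as an exact rational."""
    return sum(Fraction(1, j) for j in range(1, n + 1))

def E_N(N, m):
    """Exact expected cost of inserting the last of N keys into m trees.

    The tree receiving the last key has size n with n - 1 ~ Binomial(N-1, 1/m);
    conditionally on n, the classical expectation E_n = 2H_n - 2 applies.
    """
    p = Fraction(1, m)
    total = Fraction(0)
    for n in range(1, N + 1):
        weight = comb(N - 1, n - 1) * p**(n - 1) * (1 - p)**(N - n)
        total += weight * (2 * H(n) - 2)
    return total
```

With m = 1 this reduces to the classical $E_n = 2H_n - 2$, and the expectation decreases as m grows, since the individual trees get smaller.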

Successful search
Now we look at successful search in binary search trees. The model is that the comparisons to find all possible nodes are added, and this count is then divided by the total number of nodes. This parameter has the following probability generating function:
$$R_n(z) = \frac{z}{n}\sum_{j=1}^{n} g_j(z),$$
since the j-th inserted key is found with one more comparison than was needed to insert it.
It translates into the m-model as follows: we add the comparisons in each tree, given by $n_i R_{n_i}(z)$, and then divide by the total number N:
$$R_N(z) = \frac{1}{m^N}\sum_{n_1+\cdots+n_m=N}\binom{N}{n_1,\dots,n_m}\,\frac{1}{N}\sum_{i=1}^{m} n_i R_{n_i}(z).$$
The following results are classical:
$$nR_n'(1) = nE_n = 2(n+1)H_n - 3n,$$
$$nR_n''(1) = nE_n^{(2)} = 4(n+1)\big(H_n^2 - H_n^{(2)}\big) - 12nH_n - 4H_n + 16n.$$
Consequently we can evaluate moments, in the same style as in the last section. We don't present all the long computations here. Once again, we drop exponentially small terms to get shorter formulae; the asymptotic form is then computed as in the previous section.
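The first of these classical values can be checked directly: with $g_j'(1) = 2H_j - 2$ from the previous section, and one extra comparison per successful search, exact arithmetic confirms the closed form. A minimal check (helper names are ours):

```python
from fractions import Fraction

def H(n):
    """Harmonic number H_n as an exact rational."""
    return sum(Fraction(1, j) for j in range(1, n + 1))

def n_times_mean_successful(n):
    """n * R_n'(1): total expected comparisons, summed over all n keys.

    Key j was inserted with g_j'(1) = 2H_j - 2 expected comparisons,
    and finding it later costs exactly one comparison more.
    """
    return sum((2 * H(j) - 2) + 1 for j in range(1, n + 1))

def closed_successful(n):
    """The classical closed form 2(n+1)H_n - 3n."""
    return 2 * (n + 1) * H(n) - 3 * n
```

The agreement rests on the identity $\sum_{j=1}^n H_j = (n+1)H_n - n$.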
Theorem 2. The expectation and variance of the number of comparisons in a successful search related to m binary search trees of altogether N nodes are given asymptotically by
$$E_N = 2\log\frac{N}{m} + 2\gamma - 3 + o(1),$$
with an analogous expansion for the variance. More terms in the asymptotic expansions are easily available.

Internal path length
The last parameter that we study is the (internal) path length, namely the sum of the distances of all the nodes to the root (in the classical case). In the m-version, it is simply the sum of the path lengths of the m individual trees. It is known that the probability generating functions satisfy
$$g_n(z) = \frac{z^{n-1}}{n}\sum_{k=1}^{n} g_{k-1}(z)\,g_{n-k}(z),\qquad g_0(z)=1,$$
whence
$$G_N(z) = \frac{1}{m^N}\sum_{n_1+\cdots+n_m=N}\binom{N}{n_1,\dots,n_m}\,g_{n_1}(z)\cdots g_{n_m}(z),$$
since, conditioned on the split, the m path lengths are independent and add up. It is known that
$$g_n'(1) = 2(n+1)H_n - 4n,$$
$$g_n''(1) = 4(n+1)^2\big(H_n^2 - H_n^{(2)}\big) - 4(n+1)(4n+1)H_n + n(23n+17).$$
Therefore the moments can be computed as before (and again, the extremely long computations are not displayed). One can now plug in the aforementioned explicit formulae for S, T, U, which we do not display, because of their length. Instead, we decided to produce an asymptotic formula including terms of order N or higher. Eventually we arrive at the last result of this paper; more terms in the asymptotic expansions are easily available.
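The stated value of $g_n'(1)$ can be confirmed from the recursion: differentiating at z = 1 gives $p_n = (n-1) + \frac{2}{n}(p_0 + \cdots + p_{n-1})$ for the expected path length $p_n = g_n'(1)$. A quick exact check (Python; names are ours):

```python
from fractions import Fraction

def path_length_means(nmax):
    """p_n = g_n'(1) for n = 0..nmax, from the differentiated recursion.

    Differentiating g_n(z) = (z^{n-1}/n) * sum_k g_{k-1}(z) g_{n-k}(z)
    at z = 1 (where every g is 1) gives p_n = (n-1) + (2/n) * sum(p_0..p_{n-1}).
    """
    p = [Fraction(0)]                         # p_0 = 0: empty tree
    for n in range(1, nmax + 1):
        p.append((n - 1) + Fraction(2, n) * sum(p))
    return p

def H(n):
    """Harmonic number H_n."""
    return sum(Fraction(1, j) for j in range(1, n + 1))

def closed_path(n):
    """The classical closed form 2(n+1)H_n - 4n."""
    return 2 * (n + 1) * H(n) - 4 * n
```

For example, a two-node tree always has path length 1, and indeed $p_2 = 2\cdot 3\cdot H_2 - 8 = 1$.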

Conclusion
This was a first step towards the analysis of the m-model of binary search trees. Much more is known about binary search trees and could/should be lifted to that level. Just to mention something explicit, one could look at the depth of node j in an m-binary search tree of N random nodes. The average of this (in the classical case) is due to Arora and Dent [1] and is related to the number of passes that the recursive algorithm Quickselect needs to find the j-th largest element, see [5,6].
If one wants to compute higher moments, then one needs to introduce sums like
$$\sum_{k=1}^{n}\binom{n}{k} x^k \sum_{1\le h<i<j\le k}\frac{1}{hij}$$
and similar ones.
The quantity $C_2(m)$ is related to the dilogarithm function.