A Fast Algorithm for Computing Binomial Coefficients Modulo Powers of Two

I present a new algorithm for computing binomial coefficients modulo 2N. The proposed method has an O(N 3 · Multiplication(N) + N 4) preprocessing time, after which a binomial coefficient C(P, Q) with 0 ≤ Q ≤ P ≤ 2N − 1 can be computed modulo 2N in O(N 2 · log(N) · Multiplication(N)) time. Multiplication(N) denotes the time complexity of multiplying two N-bit numbers, which can range from O(N 2) to O(N · log(N) · log(log(N))) or better. Thus, the overall time complexity for evaluating M binomial coefficients C(P, Q) modulo 2N with 0 ≤ Q ≤ P ≤ 2N − 1 is O((N 3 + M · N 2 · log(N)) · Multiplication(N) + N 4). After preprocessing, we can actually compute binomial coefficients modulo any 2R with R ≤ N. For larger values of P and Q, variations of Lucas' theorem must be used first in order to reduce the computation to the evaluation of multiple (O(log⁡(P))) binomial coefficients C(P′, Q′) (or restricted types of factorials P′!) modulo 2N with 0 ≤ Q′ ≤ P′ ≤ 2N − 1.


Introduction
In this paper I present a novel efficient algorithm for computing binomial coefficients modulo 2 ( ≥ 1). The definition of the binomial coefficient ( , ) is the usual one: We will mainly consider the case where ≤ 2 − 1, and after fully handling this case, we will discuss how to compute ( , ) modulo 2 for ≥ 2 .
After the preprocessing stage, a binomial coefficient ( , ) (with ≤ 2 − 1) can be computed modulo 2 in ( 2 ⋅ log( ) ⋅ Multiplication( )) time. Of course, the presented algorithm can evaluate a binomial coefficient modulo any 2 with ≤ (after computing it modulo 2 , we only need to keep the least significant bits out of the computed bits).
The rest of this paper is structured as follows. In Sections 2-6 I present the steps of the preprocessing stage. In Sections 7-9 I present the algorithm which computes the binomial coefficients modulo 2 , considering that the preprocessing stage was completed. In Section 10 I discuss related work, and in Section 11 I conclude and discuss future work. I will summarize below the contents of each section describing the preprocessing steps and the algorithm itself.
The preprocessing stage consists of 5 steps. The first step consists of computing all the small binomial coefficients ( , ). The term small refers to the values of and . A binomial coefficient ( , ) is defined to be small if 0 ≤ ≤ ≤ . This step is presented in Section 2. There are ( 2 ) small binomial coefficients, and we can compute all of them with only ( 2 ) additions of pairs of -bit numbers.
The second preprocessing step (presented in Section 3) consists of computing a set of large binomial coefficients. All the binomial coefficients of the form (2 − , ), with 0 ≤ , , ≤ , are defined to be large. There are ( 3 ) such binomial coefficients and all of them can be computed in ( 2 ⋅Multiplication( )+ 4 ) time overall. This step requires 2 The Scientific World Journal the largest amount of memory (among all the preprocessing steps) in order to store the ( 3 ) large binomial coefficients. In Section 4 I present the third step of the preprocessing stage, which consists of computing power sums of consecutive numbers (where the number of such numbers is a power of 2). There are ( 2 ) power sums which can all be computed in ( 2 ⋅ log( ) ⋅ Multiplication( )) time.
The 4th preprocesing step (presented in Section 5) consists of computing the sums of the products of the elements of all the subsets of a given size of a set consisting of the first 2 (0 ≤ ≤ ) positive integer numbers. In order to achieve this goal, inclusion-exclusion-based equations from the theory of elementary symmetric functions and polynomials are used. There are only ( 2 ) values being computed, but it takes ( 3 ⋅ Multiplication( )) time to compute them. This step is the performance bottleneck step in the preprocessing stage.
Finally, in Section 6 I present the last step of the preprocessing stage, computing the sums of the products of the elements of all the odd-element subsets of a given size of a set consisting of the first 2 (0 ≤ ≤ ) positive integer numbers. Like in the previous case there are ( 2 ) values which need to be computed during this step.
In Section 7 I will show how to efficiently find the largest odd divisor (modulo 2 ) of ! (for ≤ 2 − 1). In Section 8 I will present the actual algorithm for computing binomial coefficients ( , ) modulo 2 (for ≤ 2 − 1), and in Section 9 I will discuss extensions to the case ≥ 2 .
Note that by precomputing all the mentioned values, we are capable of achieving a running time of ( 2 ⋅ log( ) ⋅ Multiplication( )) for computing a single binomial coefficient ( , ) (with 0 ≤ ≤ ≤ 2 − 1). When computing binomial coefficients we achieve a running time of ( ⋅ 2 ⋅ log( ) ⋅ Multiplication( )). Since computing a binomial coefficient requires the values from the preprocessing stage, the time complexity would increase significantly if we had to compute all those values for each binomial coefficient (instead of computing them only once and then reusing them for each binomial coefficient).

Computing Large Binomial Coefficients
Since all the computations are performed modulo 2 , we need to perform multiplications by −1 . But only has a multiplicative inverse (modulo 2 ) if it is odd. Thus, we will have to compute two different values: (i) LC 2 ( , , ) = the largest odd divisor of (2 − , ); (ii) Exp 2 ( , , ) = the exponent of 2 in the prime factor decomposition of (2 − , ).
We will assume that the multiplicative inverses of all the odd numbers from 1 to were precomputed. The inverse of an odd number (modulo 2 , for ≥ 3) is equal to For ≤ 2, the inverse of an odd number is the odd number itself. Equation (6) provides a way of computing the inverse of an odd number < 2 using ( ) -bit multiplications. Note that more efficient alternatives exist; for example, in [3], an algorithm with ( ) -bit additions, left bit-shifts, and right bit-shifts for computing the inverse is presented (which would take only ( 2 ) time overall The Scientific World Journal 3 instead of ( ⋅ Multiplication( ))). However, the algorithm obtained from (6) will never be the bottleneck in any step of the presented algorithm, so there will be no problem using it.
After computing the values LC 2 ( , , ), and Exp 2 ( , , ) we have We will assume that we previously precomputed all the numbers 2 for 0 ≤ ≤ . This way we have a method for computing all the values LC( , , ).

Computing Power Sums of Consecutive Numbers
In this section we will efficiently compute power sums of consecutive numbers starting from 1 and ending at a power of two. We define where 0 ≤ ≤ and 0 ≤ ≤ . We will start with the easy cases: PSUM(0, ) = 1 (independent of the value of ) and PSUM( , 0) = 2 (mod 2 ).
For 2 ≤ ≤ , we will use formulas based on the inclusion-exclusion principle derived from the theory of power sum and elementary symmetric polynomials [4]: The Scientific World Journal The problem that we face now is that −1 ( mod 2 ) exists only when is odd. This implies that we will not be able to compute all the values SSP( , ) (for a fixed value of ) with the same precision. Let us define Precision( , ) = the maximum value ≤ such that the last bits of SSP( , ) are correct (i.e., we were able to compute SSP( , ) modulo 2 but not modulo 2 for > , unless we perform computations modulo 2 for > ). Obviously, Precision( , 0) = Precision( , 1) = .
Let us consider the case ≥ 2. We will have Precision( , ) ≤ Precision( , − 1): the precision either stays the same or decreases as increases. At first we evaluate the following sum: This sum is correctly computed modulo 2 Precision( , −1) . Then we compute the following: (i) = the largest odd divisor of ; (ii) = the exponent of 2 in the prime factor decomposition of .
We should now prove that Precision( , ) > − for ≥ 1. This is obviously true for = 1. In general, Precision( , ) is equal to minus the exponent of 2 in !. Since the exponent of 2 in ! is equal to it is obvious that this value cannot exceed − 1. Thus, Precision( , ) ≥ − ( − 1) > − . This proof is very important. Although we cannot compute all the SSP( , ) values with the same precision, we will see that this precision will be sufficient in order to obtain exact and correct results when computing binomial coefficients modulo 2 . This is because in all the equations where SSP( , ) is involved, it will be multiplied by 2 .
Let us perform the time complexity analysis now. For each pair ( , ), we need to perform multiplication of -bit numbers. It would seem that we also need to compute −1 for each pair ( , ) (which would require another multiplications of -bit numbers for raising to the appropriate power). Although this would not affect the theoretical time complexity, we should notice that Precision( , ) depends only on the numbers 1, . . . , and does not depend on at all (we only need to have ≤ 2 ). Thus, we only need to compute −1 once for the first value of for which we will compute SSP( , ) (and then we will cache −1 and use it later when it is needed). Overall, the time complexity is dominated by the 3 multiplications which need to be performed, obtaining an ( 3 ⋅ Multiplication( )) time complexity. This step of the preprocessing stage is the bottleneck, having the largest time complexity (it is the step which dominates the computations of the preprocessing stage).
To be more precise, This time we are interested in subsets with large sizes, that is, sizes ranging from 2 −1 − to 2 −1 (thus having 0 ≤ ≤ ). For = 0, we are interested in computing the product By using Newton's binomial theorem, we can rewrite (19) as We notice that, for ≥ , each term of the sum is zero modulo 2 . Moreover, notice that SSP( − 1, ) is multiplied by 2 . Since the precision of SSP( − 1, ) is larger than − , the result of this multiplication is exact modulo 2 . Thus, by iterating through all the values of from 0 to min{2 −1 , −1}, we have an algorithm which requires ( ) multiplications of -bit numbers for computing SOSP( , 0). For > 0, we will need to use a different approach. Let us choose a subset of odd numbers from 1 to 2 −1 of size 2 −1 − : (1), . . . , (2 −1 − ). The product of all the elements in the subset is equal to ). By using Newton's binomial theorem, we can rewrite this product as Since (2 −1 − , 2 −1 − − ) = (2 −1 − , ) and we already computed these values as LC( − 1, , ), we can finally rewrite (22) as We can now use (23) directly in order to obtain an algorithm performing ( ) multiplications of -bit numbers for computing SOSP( , ). Of course, as before, all computations are performed modulo 2 . The time complexity in this case is simple to analyze: there are ( 3 ) multiplications performed, so the complexity is ( 3 ⋅ Multiplication( )).
This part can be improved because, as we will see in Section 7, for a given value of , we will only need the values SOSP( , ) such that 0 ≤ ≤ /( + 1). Thus, overall, we actually need to compute the values SOSP( , ) for only ( ⋅ log( )) pairs ( , ) instead of ( 2 ) such pairs. In this case the time complexity drops down to ( 2 ⋅ log( ) ⋅ Multiplication( )).

Computing the Binomial Coefficient
Let consider the binomial coefficient ( , ) with 0 ≤ ≤ ≤ 2 − 1. In order to evaluate it modulo 2 we will first need to compute 2 ( ), 2 ( ), and 2 ( − ). Then, we will need to find the largest exponent exp( ) such that 2 exp( ) is a divisor of !, for = , and − . We can use (17) for this.
Then, the answer will be As discussed earlier, the multiplicative inverse of an odd number modulo 2 can be computed by using (6)

Related Work
The problem of computing binomial coefficients modulo various numbers has been widely studied in the scientific literature. In [7] properties of binomial coefficients modulo prime numbers are discussed, including Lucas' theorem, which provides a simple way of computing ( , ) modulo , where is a prime number. The computation of ( , ) is reduced to the computation of multiple ( (log( ))) binomial coefficients ( , ) modulo , with 0 ≤ ≤ ≤ −1. In [8] the authors studied periodicity properties of the binomial coefficients modulo both prime and composite numbers. Congruence properties of binomial coefficients modulo prime powers were presented in [9], and congruence properties of products of binomial coefficients modulo composite numbers were studied in [10]. The general method of computing binomial coefficients modulo a composite number is to evaluate them modulo the (maximal) prime powers which are divisors of and then use the Chinese Remained Theorem [11] in order to obtain the result modulo (as observed in [10]), but in [10] a more direct study of these values is performed (i.e., the Chinese Remainder Theorem is not used). Properties of the residues of binomial coefficients and their products modulo prime powers were studied in [12].

Conclusions and Future Work
In this paper I presented a novel efficient algorithm for computing binomial coefficients modulo 2 . The algorithm consists of a preprocessing stage, after which any number The Scientific World Journal 7 of binomial coefficients can be computed modulo 2 (or modulo any number 2 with 0 ≤ ≤ ).
The time complexity of the presented algorithm is comparable with that of state-of-the-art algorithms for computing binomial coefficients modulo prime powers [6]. In fact, the time complexity of the algorithm presented in this paper is slightly better than that of the algorithm presented in [6], but since a sufficiently detailed analysis of the time complexity in [6] is not provided by the authors, it is not clear if the algorithm from [6] cannot be improved any further. In any case, because the algorithm presented in this paper consists of a preprocessing stage, a slightly worse preprocessing time complexity (in some cases) could be balanced by computing multiple binomial coefficients (when the preprocessing stage, which is the bottleneck, is run only once).
When computing a small number of binomial coefficients (e.g., just one), the bottleneck of the algorithm (in the preprocessing stage) consists of the computation of sums of products of elements of subsets having sizes 0 to (described in Section 5). That step requires 3 multiplications of -bit numbers, while all the other steps need fewer multiplications (and one of the steps requires 3 additions of -bit numbers). If the values SSP( , ) defined in Section 5 could be computed faster, the algorithm presented in this paper could be considerably improved. In future work, I intend to study the problem of computing the SSP( , ) values in a more efficient manner.