Techniques for Performance Improvement of Integer Multiplication in Cryptographic Applications

The performance of arithmetic operations in number fields is actively researched, as evidenced by significant publications in this field. In this work, we offer techniques to increase the performance of software implementations of a finite field multiplication algorithm on both 32-bit and 64-bit platforms. The developed technique, called the “delayed carry mechanism,” avoids the need to handle the carry out of the most significant bit at each iteration of the sum accumulation loop. This mechanism reduces the total number of additions and allows modern parallelization technologies to be applied effectively.


Introduction
Public key cryptographic transformations have evolved from the original proposal of Diffie and Hellman to modern cryptosystems based on algebraic curves [1]. However, the underlying transformations have remained the same: operations in a number field. Integer multiplication occupies a special place among these operations; see Figure 1. One of the pressing problems in improving public key cryptosystems is increasing the performance of their software and hardware implementations. One approach is to increase the performance of finite field arithmetic, in particular of multiplication.
An analysis of publications [2][3][4][5][6][7][8][9][10] identified the most effective multiplication algorithms: Comba [2,3] and Karatsuba [3,8,10]. However, the Comba algorithm shows better results in benchmarks of software implementations on modern platforms [3][4][5][6][7][8][9]. The Karatsuba-Comba multiplication (KCM) algorithm for RISC processors is described in [8]. The KCM algorithm is an interesting combination of the Comba and Karatsuba algorithms, in which the Karatsuba algorithm is used specifically for machine word multiplication. The main goal of this paper is therefore to propose ways of speeding up software implementations of finite field multiplication (and squaring) based on the well-known Comba algorithm [2,3,8]. This research was motivated by the need to confirm the efficiency of software implementations of known algorithms as modern 32-bit and 64-bit platforms continue to develop. It is worth noting that the last ten years have seen considerable development of multicore CPUs and multi-CPU systems [8,9].

Multiplication Algorithm-Prototype Description and Its Modification
Let us begin by introducing some notation and basic definitions. A carry is a digit that is transferred from one column of digits to the next, more significant column during a calculation; $w$ is the machine word size and $n$ is the number of machine words required to store a large integer. We represent large integers (multipliers) as arrays of $n$ machine words; see Figure 2. For example, a 65-bit integer requires three 32-bit machine words. The Comba algorithm [2] is based on the main loops p. 2 and p. 3 and the nested loops p. 2.1 and p. 3.1 (Algorithm 1). At the lowest level of the hierarchy, in loops p. 2.1 and p. 3.1, we compute a 64-bit product $(UV)^{(2w)}$, which consists of two 32-bit integers $U^{(w)}$ and $V^{(w)}$.
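The word count and the split of a double-word product can be illustrated with a minimal C sketch. The function names `words_needed` and `mul_split` are hypothetical and chosen only for this illustration; the sketch assumes 32-bit machine words and a compiler that provides `uint64_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of w-bit machine words needed to store a `bits`-bit integer. */
static size_t words_needed(size_t bits, size_t w)
{
    assert(w > 0);
    return (bits + w - 1) / w;          /* ceiling division */
}

/* Full 64-bit product of two 32-bit words, split into its two halves. */
static void mul_split(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint64_t uv = (uint64_t)a * b;      /* widen first so no bits are lost */
    *hi = (uint32_t)(uv >> 32);         /* most significant 32 bits */
    *lo = (uint32_t)uv;                 /* least significant 32 bits */
}
```

With these definitions, `words_needed(65, 32)` evaluates to 3, which matches the 65-bit example in the text.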
The algorithm has the following drawbacks.
(i) The sum is accumulated in the 32-bit temporary variables $r_0$, $r_1$, and $r_2$ on each iteration of the nested loops. Each accumulation takes 3 additions of 32-bit integers (2 of them with carry) and 3 assignments of the 32-bit variables $r_0$, $r_1$, and $r_2$. Accumulation with carry thus takes place in every iteration of loop p. 2.1.
(ii) In the nested loops p. 2.1 and p. 3.1, the carries between the 32-bit variables $r_0$, $r_1$, and $r_2$ are handled with assembler add-with-carry instructions. This prevents instruction pairing and parallelization [22], so processor resources are used inefficiently.
(iii) Loops p. 2 and p. 3 cannot be parallelized effectively, because carry handling creates tight dependencies between their iterations.
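The carry chain behind these drawbacks can be sketched in portable C. This is an illustration under assumptions, not the listing of Algorithm 1: the function name `comba_mul_classic` is hypothetical, the two main loops are merged into one loop over the column index `k`, word indices are 0-based, and the add-with-carry instructions are emulated with comparisons.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* c[0..2n-1] = a[0..n-1] * b[0..n-1]; little-endian 32-bit words.
   c must not overlap a or b. */
static void comba_mul_classic(uint32_t *c, const uint32_t *a,
                              const uint32_t *b, size_t n)
{
    assert(n > 0);
    uint32_t r0 = 0, r1 = 0, r2 = 0;    /* three-word column accumulator */
    for (size_t k = 0; k < 2 * n - 1; k++) {
        size_t i_min = (k < n) ? 0 : k - n + 1;
        size_t i_max = (k < n) ? k : n - 1;
        for (size_t i = i_min; i <= i_max; i++) {
            uint64_t uv = (uint64_t)a[i] * b[k - i];
            uint32_t u = (uint32_t)(uv >> 32), v = (uint32_t)uv;
            uint32_t carry;
            r0 += v;     carry = (r0 < v);       /* add */
            r1 += carry; carry = (r1 < carry);   /* add with carry ... */
            r1 += u;     carry += (r1 < u);
            r2 += carry;                         /* ... and its carry */
        }
        c[k] = r0;                      /* emit one result word */
        r0 = r1; r1 = r2; r2 = 0;       /* shift accumulator down one word */
    }
    c[2 * n - 1] = r0;
}
```

Every inner iteration depends on the carry of the previous addition, which is the dependency that items (ii) and (iii) refer to.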
It is easy to obtain the computational complexity of Comba's algorithm in terms of three operation costs: the assignment of 32-bit integers (subscript assign), the addition of 32-bit integers (subscript add), and the multiplication of 32-bit integers (subscript mul).
Figure 2 illustrates the drawbacks of the algorithm for $n = 3$ machine words and their impact on its computational complexity.
Modern CPUs allow the use of 64-bit data types and operations to achieve better performance, but the algorithm is not adapted for their use.
It should be noted that Comba's algorithm implements the well-known long multiplication technique, with one difference: a word $a_i$, $i = 1, \dots, n$, of one multiplier is multiplied by every word $b_j$, $j = 1, \dots, n$, of the other multiplier for which the condition $i + j = k$ holds (that is, column by column).
This approach sums columns of partial products rather than rows, as ordinary long multiplication does. Each column sum yields one word of the resulting product (written below the line). Each multiplication is accompanied by sum accumulation, as shown in Figure 3.
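The column structure can be made concrete by counting how many word products fall into each column. The sketch below is illustrative only: the names `column_terms` and `total_products` are hypothetical, and word indices are 0-based here, so the columns run from 0 to 2n - 2.

```c
#include <assert.h>
#include <stddef.h>

/* Number of word products a[i]*b[j] with i + j == k (0-based indices)
   when both operands have n words. */
static size_t column_terms(size_t n, size_t k)
{
    assert(n > 0 && k <= 2 * n - 2);
    size_t i_min = (k < n) ? 0 : k - n + 1;
    size_t i_max = (k < n) ? k : n - 1;
    return i_max - i_min + 1;
}

/* Total number of word multiplications over all 2n - 1 columns. */
static size_t total_products(size_t n)
{
    size_t total = 0;
    for (size_t k = 0; k + 1 < 2 * n; k++)
        total += column_terms(n, k);
    return total;
}
```

For three-word operands the columns contain 1, 2, 3, 2, and 1 products, so the column-wise order performs the same nine word multiplications as row-wise long multiplication and changes only the order of accumulation.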
The computational complexity for $n = 3$ machine words can be evaluated in the same way. The following modifications of the calculation procedure eliminate the drawbacks listed above.
(i) Modern 32-bit CPUs efficiently add 32-bit and 64-bit integers using 32-bit instructions. This allows carries to be accumulated by adding 32-bit variables into a 64-bit accumulator variable, which removes the need for carry accounting and correction after each addition to the variables $r_0$, $r_1$, and $r_2$. The accumulated carry is accounted for in the final steps of loops p. 2 and p. 3.
(ii) Modern CPUs have a multicore architecture that allows them to execute several instruction flows at the same time. This makes it possible to execute the iterations of loops p. 2 and p. 3 in parallel using OpenMP [22][23][24].
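The delayed carry idea in item (i) can be sketched as follows. This is a minimal illustration under assumptions rather than the listing of Algorithm 2: the name `comba_mul_delayed` is hypothetical, the two main loops are merged into one loop over the column index, and the low and high halves of each product are summed in two separate 64-bit accumulators so that no carry is examined inside the inner loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* c[0..2n-1] = a[0..n-1] * b[0..n-1]; little-endian 32-bit words.
   c must not overlap a or b. */
static void comba_mul_delayed(uint32_t *c, const uint32_t *a,
                              const uint32_t *b, size_t n)
{
    /* The bound on n keeps both 64-bit accumulators from overflowing. */
    assert(n > 0 && n < ((size_t)1 << 30));
    uint64_t carry = 0;                 /* carried into the next column */
    for (size_t k = 0; k < 2 * n - 1; k++) {
        size_t i_min = (k < n) ? 0 : k - n + 1;
        size_t i_max = (k < n) ? k : n - 1;
        uint64_t acc_lo = carry;        /* sum of low product halves */
        uint64_t acc_hi = 0;            /* sum of high product halves */
        for (size_t i = i_min; i <= i_max; i++) {
            uint64_t uv = (uint64_t)a[i] * b[k - i];
            acc_lo += (uint32_t)uv;     /* 32-bit value into 64-bit sum */
            acc_hi += uv >> 32;         /* no carry check needed here */
        }
        c[k] = (uint32_t)acc_lo;        /* carries resolved once per column */
        carry = (acc_lo >> 32) + acc_hi;
    }
    c[2 * n - 1] = (uint32_t)carry;
}
```

The inner loop contains two additions and no comparisons, and the carry is resolved once per column instead of once per product.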
The following notations are introduced in Algorithm 2.
(i) The superscript $(2w)$ denotes a 64-bit variable, and the superscript $(w)$ denotes a 32-bit variable; (ii) the operation $\mathrm{hi}^{(w)}(\cdot)$ extracts the 32 most significant bits of a 64-bit variable, and the operation $\mathrm{low}^{(w)}(\cdot)$ extracts the 32 least significant bits of a 64-bit variable.
It is not difficult to obtain the computational complexity of the modified Comba algorithm in terms of five operation costs: the assignment of 32-bit integers and the assignment of 64-bit integers (subscript assign), the addition of 32-bit integers and the addition of a 32-bit integer to a 64-bit integer (subscript add), and the multiplication of 32-bit integers (subscript mul).
Figures 4 and 5 illustrate Algorithm 2 for $n = 3$ machine words, for which the computational complexity can be evaluated in the same way.

Comparison with Other Algorithms
To provide an objective comparison of the results, the authors reviewed well-known software math libraries for public key cryptography [12][13][14][15][16][17][18][19][20][21]. Based on the reviews [25,26], the GMP library [12] was chosen as the reference. GMP uses Karatsuba's integer multiplication algorithm [12]. The software implementations are compared by the average execution time, over one million iterations, of the Comba algorithm, the modified Comba algorithm, and the multiplication implemented in the GMP library.
To measure the performance of the software implementations, we use the fields of Table 1 from [27], except (82). These fields are recommended [3,27] for use in cryptographic applications at different security levels. Testing was performed on a mainstream mobile platform with an Intel Core i3 350M CPU and on a desktop platform with an Intel Pentium Dual-Core E5400 CPU.
Performance timings for the different algorithms, implementations, and CPUs are shown in Table 2.
As the timings in Table 2 show, the proposed modification of the Comba algorithm is 1.5 times faster than GMP. The classic implementation of Comba's algorithm is the slowest, which agrees with the theoretical estimate (it contains a larger number of addition and assignment operations). In addition, the proposed software implementations of the multiplication algorithms are more efficient on the Pentium Dual-Core CPU, which has a higher clock frequency, than on the Core i3 CPU, which has several instruction streams. These implementations do not support parallelization; thus, the more powerful multicore Core i3 CPU with four instruction processing flows cannot realize its full potential.
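A measurement of this kind can be sketched with the standard C `clock` function. The names `average_seconds` and `workload` are hypothetical, the workload is a single word multiplication standing in for a full multi-word product, and the use of processor time rather than wall-clock time is an assumption; the paper does not state which timer was used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical workload: one 32x32 -> 64-bit multiplication whose result
   is added to a volatile sink so the compiler cannot remove the call. */
static void workload(void *ctx)
{
    volatile uint64_t *sink = ctx;
    *sink += (uint64_t)0xFFFFFFFFu * 0xFFFFFFFFu;
}

/* Average processor time in seconds of one call to fn(ctx). */
static double average_seconds(void (*fn)(void *), void *ctx,
                              unsigned long iterations)
{
    assert(fn != NULL && iterations > 0);
    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++)
        fn(ctx);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC / (double)iterations;
}
```

A call such as `average_seconds(workload, &sink, 1000000UL)` mirrors the one million iterations used in the comparison. The resolution of `clock` is coarse, so averaging over many iterations is what makes the figure meaningful.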

Conclusions
The research resulted in the following conclusions.
(1) We increased the performance of the software implementation of Comba's integer multiplication algorithm by a factor of 1.5-2 and surpassed the performance of the popular math library GMP v4.1.2 by a factor of 1.5 on average.
(2) The modified Comba multiplication algorithm is preferable to Karatsuba's algorithm [2], which is used in the GMP library, because its implementation is faster than the Karatsuba [2] implementation in GMP on modern hardware platforms (32-bit and 64-bit).
Recent microprocessor development keeps increasing the number of instruction processing flows. This calls for the development of algorithms suited to efficient parallelization.
NVIDIA has already proposed GPUs with more than 256 cores and a suitable CUDA toolkit [29], which allows multithreaded applications to be created. This area is already being studied closely, as demonstrated in publications [9,31]. Our further research will focus on investigating algorithms for arithmetic operations on integers and on parallelizing them effectively.

Table 1: Performance measurements of fields.

Table 2: Experimental results of software implementation of multiplication algorithms.