Security and Communication Networks, Volume 2020, Article ID 4087873
ISSN: 1939-0122, 1939-0114. Publisher: Hindawi. DOI: 10.1155/2020/4087873

Research Article

Parallel and Regular Algorithm of Elliptic Curve Scalar Multiplication over Binary Fields

Xingran Li (1, 2, 3; https://orcid.org/0000-0002-2057-8737), Wei Yu (1, 3; https://orcid.org/0000-0002-9015-9351), and Bao Li (1, 2)

1 State Key Laboratory of Information Security, Institute of Information Engineering, CAS, Beijing 100093, China (cas.cn)
2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China (ucas.ac.cn)
3 Data Assurances and Communications Security, Institute of Information Engineering, CAS, Beijing 100093, China (cas.cn)

Academic Editor: Benjamin Aziz

Received 17 October 2019; Accepted 8 May 2020; Published 24 June 2020

Copyright © 2020 Xingran Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Accelerating scalar multiplication has always been a significant topic in elliptic curve cryptography, and many approaches have been proposed to achieve this aim. An interesting perspective is that computers nowadays usually have multicore processors, which can be used to carry out cryptographic computations in parallel. Inspired by this idea, we present a new parallel and efficient algorithm to speed up scalar multiplication. First, we introduce a new regular halve-and-add method which is very efficient because it uses the λ projective coordinate. Then, we compare many different double-and-add and halve-and-add algorithms. Finally, we combine the best double-and-add and halve-and-add methods to obtain a new, faster parallel algorithm which costs around 12.0% less than the previous best. Furthermore, our algorithm is regular without any dummy operations, so it naturally provides protection against simple side-channel attacks.

Funding: National Natural Science Foundation of China (61872442, 61772515, 61502487, U1936209); National Cryptography Development Fund (MMJJ20180216); Beijing Municipal Science and Technology Commission (Z191100007119006).

1. Introduction

Elliptic curves were first introduced into cryptography by Neal Koblitz and Victor Miller independently in 1985 [1, 2] and are now increasingly used in practice for a wide range of cryptographic primitives such as public-key encryption and digital signatures. More than 30 years after its introduction, the practical advantages of the elliptic curve cryptosystem (ECC) are clear and well known: it has richer algebraic structures, a smaller key size, and relatively faster implementations at the same level of security compared with other deployed schemes such as RSA. Because of these benefits, ECC is particularly suitable for resource-constrained devices.

The efficiency of ECC is dominated by the speed of scalar multiplication. Namely, given a rational point $P$ of order $r$ on an elliptic curve, one must compute $kP = P + P + \cdots + P$ ($k$ times) for a given scalar $k \in [0, r]$. Obviously, scalar multiplication resembles exponentiation in a general multiplicative finite group. Therefore, inspired by the repeated “square-and-multiply” algorithm, the binary method called “double-and-add” has come to be regarded as a fundamental technique for scalar multiplication on elliptic curves.

In constrained environments, scalar multiplication is easily implemented by the “double-and-add” variant of Horner’s rule, given the binary expansion of the scalar $k = \sum_{i=0}^{l} k_i 2^i$. However, each bit of $k$ selects a different algorithmic path in its iteration: if $k_i = 0$, only a point doubling is necessary, whereas if $k_i = 1$, a point doubling is followed by a point addition. As a consequence, the different power and time consumption of these two building blocks can be detected by simple power analysis (SPA) and timing attacks, so this naive implementation leaks information about the secret scalar $k$.

Protection against simple side-channel attacks (SSCA) can be achieved by recoding scalars in a regular manner, meaning that scalar multiplication executes the same instructions in the same order for any input value. Coron introduced a countermeasure against SSCA named the “double-and-add always” algorithm. By inserting a dummy operation when necessary, it evaluates scalar multiplication by executing a doubling and an addition in each loop iteration. However, it was soon found to be vulnerable to safe-error fault attacks [5, 6]. By inducing a well-timed fault during the point addition of one iteration, an adversary can determine whether the operation is a dummy or not by checking the correctness of the output.

A countermeasure against safe-error fault attacks is to perform scalar multiplication in a predictable pattern without dummy operations. Besides the most commonly used Montgomery-ladder algorithm, another efficient method is $m$-ary recoding. This algorithm recodes a scalar into blocks of $m-1$ zeros followed by a nonzero digit, so that the density of nonzero digits is $1/m$. However, reading from a look-up table could be dangerous if this step cannot be performed in constant time.

Another area of growing interest in regular scalar multiplication is the use of efficient curve forms that admit a complete addition law. For any pair of $k$-rational points on an elliptic curve (or in a desired subgroup), a complete addition law computes the correct result regardless of whether the two addends are identical. As a corollary of the main results in , elliptic curves embedded in any projective space of dimension $n$ by a symmetric line bundle admit a complete system of addition laws of bidegree $(2, 2)$. The later work of Bosma and Lenstra shows that, when suitably chosen, a single addition law can serve as the add operation for all pairs of $k$-rational points. One of the well-studied examples is Edwards curves [11, 12], for which the exceptional pairs of the addition law lie outside the $k$-rational points. A recent work proposed an optimized algorithm that adds any pair of $k$-rational points on prime-order elliptic curves defined over fields of characteristic different from 2 and 3.

Negre and Robert introduce a new approach to scalar multiplication called the Montgomery-halving algorithm, which is a variation of the original Montgomery-ladder point multiplication. They also present a new strategy for parallel implementation of point multiplication over elliptic curves, in which the Montgomery-halving algorithm runs concurrently with the original Montgomery-ladder algorithm. Moreover, this parallel algorithm protects against SSCA. However, in their scheme, the affine coordinate has to be used for halving, because the projective form of the Montgomery-halving algorithm cannot be used to save operations.

In this paper, we provide a similar parallel implementation method based on a regular recoding technique, which is expected to be highly efficient because doubling and halving operations are processed in parallel on two different coprocessors. Our work makes two main contributions.

The first contribution is a new regular algorithm for halve-and-add, called zero-less signed-digit (ZSD) halve-and-add, which saves around 32.7% and 33.0% of the cost of the Montgomery-halving method of Negre and Robert for m = 233 and m = 409, respectively. The λ projective coordinate system offers projective coordinates that save inversions, which is especially useful for our ZSD halve-and-add algorithm (Algorithm 1). For the halving operation, the best choice is the λ affine coordinate; for the subsequent addition, the better choice is the λ projective coordinate. Because of its structure, the Montgomery-halving algorithm has to use affine coordinates, whereas in Algorithm 1 $R_1$ always stays in λ affine coordinates for halving and $R_0$ always stays in λ projective coordinates for addition, so the λ projective mixed addition law can be used and no further coordinate transformation is needed. In addition, the regular recoding technique ensures a secure implementation of scalar multiplication against SSCA.

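Algorithm 1: Regular ZSD halve-and-add (left-to-right) method.

Input: $P \in E(\mathbb{F}_{2^m})$ of odd order $r$, $k = (k_{n-1}, k_{n-2}, \ldots, k_0)_2$ with $k_{n-1} = 1$

Output: $2^{-n}kP$

$R_0 \leftarrow P/2$; $R_1 \leftarrow P/2$

For $i = n-1$ down to $1$ do

$t \leftarrow (-1)^{1+k_i}$

$R_1 \leftarrow R_1/2$; $R_0 \leftarrow R_0 + tR_1$

End for

$R_{k_0} \leftarrow R_{k_0} - R_{1-k_0}$

Return $R_0$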
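The control flow of Algorithm 1 can be checked with a minimal Python sketch. It is only an illustration under stated assumptions: the additive group of integers modulo an odd `r` stands in for the curve subgroup (so "halving" is multiplication by $2^{-1} \bmod r$), and the function name `halve_and_add` and the modulus 1019 are hypothetical choices, not part of the paper. It requires Python 3.8 or later for `pow` with a negative exponent.

```python
def halve_and_add(k, P, r):
    """Regular ZSD halve-and-add in the stand-in group (Z_r, +), r odd.

    Returns 2^(-n) * k * P mod r, where n is the bit length of k.
    """
    inv2 = pow(2, -1, r)               # halving = multiplying by 1/2 mod r
    n = k.bit_length()
    R = [P * inv2 % r, P * inv2 % r]   # R[0]: accumulator, R[1]: P / 2^j
    for i in range(n - 1, 0, -1):
        t = 1 if (k >> i) & 1 else -1  # t = (-1)^(1 + k_i)
        R[1] = R[1] * inv2 % r         # one halving per iteration
        R[0] = (R[0] + t * R[1]) % r   # one addition per iteration
    k0 = k & 1
    R[k0] = (R[k0] - R[1 - k0]) % r    # changes R[0] only when k is even
    return R[0]


r = 1019                               # small odd modulus (assumption)
k, P = 0b101101, 7
expected = k * P * pow(2, -k.bit_length(), r) % r
print(halve_and_add(k, P, r) == expected)
```

Every loop iteration performs exactly one halving and one addition, whatever the bit value, which is the regularity property the algorithm relies on.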

The second contribution is a new mixed-parallel algorithm. After analyzing all the algorithms in Table 1, we combine the fastest double-and-add method, the Montgomery double-and-add method in , with the fastest halve-and-add method, our ZSD halve-and-add algorithm. The resulting efficient and secure algorithm, which we call the mixed-parallel method, costs around 11.7% and 12.0% less than the Montgomery-parallel approach of Negre and Robert when m = 233 and m = 409, respectively. A more thorough analysis is given in Section 4, and the related estimates are displayed in Tables 1 and 2.

Table 1: Complexity comparison for m-bit k in different single algorithms.

| Method | Point operations | Field operations (I = 10M) | m = 233 | m = 409 |
|---|---|---|---|---|
| Montgomery-D | $mD_M + mA_M$ | 6mM + 10M + I | 1418M | 2474M |
| Montgomery-H | $mH_a + mA_a$ | 3mM + m(2M + I) | 3495M | 6135M |
| Algorithm 2-D (λ-projective) | $mD_{\lambda P} + mA_{\lambda P}$ | 10mM + 11M + I | 2351M | 4111M |
| Algorithm 2-D (twisted $\mu_4$) | $mD_{\mu_4} + mA_{\mu_4}$ | 9mM + 11M + I | 2118M | 3702M |
| Algorithm 1-H | $mH_{\lambda a} + mA_{\lambda P}$ | 10mM + 11M + I | 2351M | 4111M |

Montgomery-D = Montgomery double-and-add algorithm; Montgomery-H = Montgomery halve-and-add algorithm; Algorithm 2-D (λ-projective) = Algorithm 2 using the λ-projective coordinate system; Algorithm 2-D (twisted $\mu_4$) = Algorithm 2 using the twisted $\mu_4$ coordinate system; Algorithm 1-H = Algorithm 1 for halve-and-add.

Table 2: Complexity comparison for m-bit k in different parallel algorithms.

| Algorithm | Split (t), m = 233 | Estimate (m = 233) | Split (t), m = 409 | Estimate (m = 409) |
|---|---|---|---|---|
| Montgomery-parallel | 67 | 1028M | 118 | 1782M |
| Our mixed-parallel | 87 | 908M | 153 | 1568M |

The rest of this paper is organized as follows. In the next section, we introduce the arithmetic of binary elliptic curves, especially the efficient λ coordinate point representation, the twisted $\mu_4$-normal form, and how to evaluate scalar multiplication in parallel by combining point halving and doubling operations. In Section 3, we present our new regular algorithm for halve-and-add and show, with a parallel strategy similar to that of Negre and Robert, how to implement scalar multiplication efficiently in a regular and parallel manner. Cost comparison and expected performance analysis are presented in Section 4, where the new mixed-parallel algorithm is also given. Finally, we conclude the paper.

2. Preliminaries

We focus on elliptic curves $E$ defined over binary fields $\mathbb{F}_{2^m}$ by the Weierstrass equation
$$y^2 + xy = x^3 + ax^2 + b, \tag{1}$$
where $a, b \in \mathbb{F}_{2^m} = \mathbb{F}_2[t]/(f(t))$ and $f(t) \in \mathbb{F}_2[t]$ is an irreducible polynomial of degree $m$. The rational points $P(x, y)$ on $E$ together with the point at infinity $O$ form an abelian group, isomorphic to the divisor class group of degree 0. Its basic group operation, addition, is algebraically interpreted by the tangent-and-chord law.

Given two points $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ on $E(\mathbb{F}_{2^m})$ with $P_1 \neq \pm P_2$, the coordinates of the sum $Q = P_1 + P_2 = (x_3, y_3)$ can be computed according to the following formula:
$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \qquad y_3 = \lambda(x_1 + x_3) + x_3 + y_1, \tag{2}$$
with $\lambda = (y_1 + y_2)/(x_1 + x_2)$.

Similarly, given $P = (x_1, y_1) \in E(\mathbb{F}_{2^m})$ with $P \neq -P$, the coordinates of the double $2P = (x_3, y_3)$ can be computed according to the following formula:
$$x_3 = \lambda^2 + \lambda + a = x_1^2 + \frac{b}{x_1^2}, \qquad y_3 = x_1^2 + \lambda x_3 + x_3, \tag{3}$$
with $\lambda = x_1 + y_1/x_1$.
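To make formulas (2) and (3) concrete, here is a small Python sketch over a toy field. The field $\mathbb{F}_{2^5}$ with $f(t) = t^5 + t^2 + 1$, the curve parameters $a = b = 1$, and all function names are illustrative assumptions chosen so that the curve can be enumerated by brute force; they are not the NIST parameters discussed later.

```python
M = 5                 # toy extension degree (assumption)
POLY = 0b100101       # f(t) = t^5 + t^2 + 1, irreducible over F_2
A, B = 1, 1           # toy curve y^2 + xy = x^3 + A*x^2 + B (assumption)

def gf_mul(a, b):
    """Multiply two elements of F_2[t]/(f(t)), stored as bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:            # degree reached M: reduce modulo f(t)
            a ^= POLY
    return r

def gf_inv(a):
    """Inverse of a nonzero element, computed as a^(2^M - 2)."""
    r = 1
    for _ in range(2**M - 2):
        r = gf_mul(r, a)
    return r

def on_curve(P):
    x, y = P
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_mul(gf_mul(x, x), x) ^ gf_mul(A, gf_mul(x, x)) ^ B
    return lhs == rhs

def add(P, Q):
    """Formula (2); requires P != +-Q, i.e. distinct x-coordinates."""
    (x1, y1), (x2, y2) = P, Q
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)

def double(P):
    """Formula (3); requires P != -P, i.e. x1 != 0."""
    x1, y1 = P
    lam = x1 ^ gf_mul(y1, gf_inv(x1))
    x3 = gf_mul(lam, lam) ^ lam ^ A
    y3 = gf_mul(x1, x1) ^ gf_mul(lam, x3) ^ x3
    return (x3, y3)

# All affine points with x != 0, found by exhaustive search.
POINTS = [(x, y) for x in range(1, 2**M) for y in range(2**M)
          if on_curve((x, y))]
P = POINTS[0]
print(on_curve(double(P)))
```

Field addition is XOR in characteristic two, which is why no subtraction appears in the code. Each call to `add` or `double` costs one inversion, the expense that the projective systems of the next subsections are designed to avoid.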

The above formulas unavoidably involve inversions in the base field, which are time consuming. Projective coordinate systems are therefore usually preferred, because they require no field inversions. In practice, various coordinate systems are available. This paper exploits two state-of-the-art systems, the λ coordinate and the projective coordinate system of the twisted $\mu_4$-normal form, each of which performs best in a different situation.

2.1. λ Coordinates

Efficient point representation is of great importance for accelerating scalar multiplication. Inversions in the base field take a large amount of time; however, they are indispensable if points are represented in affine coordinates. The homogeneous projective coordinate system (also called the standard projective coordinate system) is usually used to eliminate this obstacle by mapping any $k$-rational affine point $P(x, y) \in \mathbb{A}^2$ to one of its projective copies $(X, Y, Z) = (x, y, 1) \in \mathbb{P}^2$, where $x = X/Z$, $y = Y/Z$, $Z \neq 0$. When the projective copy $(X, Y, Z) = (x, y, 1)$ corresponds to the affine point $(x, y)$ with $x = X/Z^2$, $y = Y/Z^3$, $Z \neq 0$, it is the Jacobian projective coordinate system. Later, López and Dahab proposed a new and efficient projective coordinate system, denoted LD coordinates for short, in which $x = X/Z$, $y = Y/Z^2$, $Z \neq 0$. Kim and Kim then presented a four-dimensional LD coordinate system for binary curves, which represents $P$ as $(X, Y, Z, T)$ with $x = X/Z$, $y = Y/T$, $T = Z^2$, and $Z \neq 0$.

The λ coordinate system was first noticed by Knudsen when studying halving operations on binary elliptic curves, and Oliveira further surveyed its arithmetic comprehensively. Given a point $P = (x, y) \in E(\mathbb{F}_{2^m})$ with $x \neq 0$, the λ affine representation of $P$ is defined as $(x, \lambda)$, where $\lambda = x + y/x$. The point addition and doubling formulas in λ affine coordinates are easily derived from the ordinary affine ones. Let $P = (x_P, \lambda_P)$ and $Q = (x_Q, \lambda_Q)$ be two points on $E(\mathbb{F}_{2^m})$ with $P \neq \pm Q$. Then $P + Q = (x_{P+Q}, \lambda_{P+Q})$ is given by the following formula:
$$x_{P+Q} = \frac{x_P\, x_Q}{(x_P + x_Q)^2}(\lambda_P + \lambda_Q), \qquad \lambda_{P+Q} = \frac{x_Q\,(x_{P+Q} + x_P)^2}{x_{P+Q}\, x_P} + \lambda_P + 1. \tag{4}$$

For the doubling operation, $2P = (x_{2P}, \lambda_{2P})$ is given as follows:
$$x_{2P} = \lambda_P^2 + \lambda_P + a, \qquad \lambda_{2P} = \frac{x_P^2}{x_{2P}} + \lambda_P^2 + a + 1. \tag{5}$$
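As a consistency check between the λ affine doubling formula (5) and the ordinary affine doubling formula (3), the following Python sketch doubles every point of a toy curve both ways and compares the results. The field $\mathbb{F}_{2^5}$, the parameters $a = b = 1$, and the function names are assumptions made for illustration only.

```python
M = 5                 # toy extension degree (assumption)
POLY = 0b100101       # f(t) = t^5 + t^2 + 1
A, B = 1, 1           # toy curve parameters (assumption)

def gf_mul(a, b):
    """Multiply in F_2[t]/(f(t)); elements are bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_inv(a):
    """Inverse of a nonzero element, computed as a^(2^M - 2)."""
    r = 1
    for _ in range(2**M - 2):
        r = gf_mul(r, a)
    return r

def on_curve(x, y):
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_mul(gf_mul(x, x), x) ^ gf_mul(A, gf_mul(x, x)) ^ B
    return lhs == rhs

def to_lambda(x, y):
    """Affine (x, y) with x != 0 -> lambda-affine (x, x + y/x)."""
    return x, x ^ gf_mul(y, gf_inv(x))

def affine_double(x, y):
    """Formula (3)."""
    lam = x ^ gf_mul(y, gf_inv(x))
    x3 = gf_mul(lam, lam) ^ lam ^ A
    y3 = gf_mul(x, x) ^ gf_mul(lam, x3) ^ x3
    return x3, y3

def lambda_double(x, lam):
    """Formula (5); requires x_{2P} != 0."""
    x2 = gf_mul(lam, lam) ^ lam ^ A
    lam2 = gf_mul(gf_mul(x, x), gf_inv(x2)) ^ gf_mul(lam, lam) ^ A ^ 1
    return x2, lam2

checked = 0
for x in range(1, 2**M):
    for y in range(2**M):
        if on_curve(x, y):
            x3, y3 = affine_double(x, y)
            if x3 != 0:       # lambda is undefined when x = 0
                assert lambda_double(*to_lambda(x, y)) == to_lambda(x3, y3)
                checked += 1
print(checked > 0)
```

Note that `lambda_double` never touches the $y$-coordinate, which is what makes the representation convenient for halving later on.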

In the projective setting, the translation between the affine representation $(x, y)$ and the λ projective representation $(X, L, Z)$ is defined by $x = X/Z$ and $\lambda = L/Z$ with $\lambda = x + y/x$. The negative of $(X, L, Z)$ is $(X, L + Z, Z)$. Given two points $P = (X_P, L_P, Z_P)$ and $Q = (X_Q, L_Q, Z_Q)$ in the λ model on a binary elliptic curve, the addition can be described, as in the affine case, by the following formulas:
$$A = L_P Z_Q + L_Q Z_P, \quad B = (X_P Z_Q + X_Q Z_P)^2, \quad X_{P+Q} = A\,(X_P Z_Q)(X_Q Z_P)\,A, \quad L_{P+Q} = (A\,X_Q Z_P + B)^2 + (A\,B\,Z_Q)(L_P + Z_P), \quad Z_{P+Q} = (A\,B\,Z_Q)\,Z_P, \tag{6}$$
and $2P = (X_{2P}, L_{2P}, Z_{2P})$ is given as follows:
$$T = L_P^2 + L_P Z_P + a Z_P^2, \quad X_{2P} = T^2, \quad Z_{2P} = T Z_P^2, \quad L_{2P} = (X_P Z_P)^2 + X_{2P} + T\,(L_P Z_P) + Z_{2P}. \tag{7}$$

The associated group addition P+Q and doubling 2P operations can be calculated by 11M+2S and 3M+4S, respectively, where M denotes a field multiplication and S denotes a squaring.

Having the above formulas, a direct thought is to combine doubling and addition formulas to obtain a formula evaluating 2Q+P, which is of great importance in the latter part of this paper.

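Let $P = (x_P, \lambda_P)$ and $Q = (X_Q, L_Q, Z_Q)$ be points of $E(\mathbb{F}_{2^m})$. Then $2Q + P = (X_{2Q+P}, L_{2Q+P}, Z_{2Q+P})$ can be computed as follows:
$$T = L_Q^2 + L_Q Z_Q + a Z_Q^2, \quad A = X_Q^2 Z_Q^2 + T\,(L_Q^2 + (a + 1 + \lambda_P) Z_Q^2), \quad B = (x_P Z_Q^2 + T)^2,$$
$$X_{2Q+P} = (x_P Z_Q^2)\,A^2, \quad Z_{2Q+P} = A\,B\,Z_Q^2, \quad L_{2Q+P} = T\,(A + B)^2 + (\lambda_P + 1)\,Z_{2Q+P}. \tag{8}$$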

Using this formula, a $2Q + P$ operation can be computed with 10M+6S instead of 11M+2S+3M+4S=14M+6S, where M denotes a field multiplication and S denotes a squaring.

2.2. Twisted $\mu_4$-Normal Form

The twisted $\mu_4$-normal form can be seen as a complement and extension of the $\mu_4$-normal form. The related definitions, theorems, equation forms, and group laws of both forms are given in Kohel's series of papers. There are three variants of the (twisted) $\mu_4$-normal form: the (twisted) $\mu_4$-normal form itself, the (twisted) semisplit $\mu_4$-normal form, and the (twisted) split $\mu_4$-normal form. For practical reasons, only the twisted split $\mu_4$-normal form is used here.

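Let $C^t$ be an elliptic curve over a finite field of characteristic two in the twisted split $\mu_4$-normal form:
$$X_0^2 + X_2^2 = c^2\left(X_1 X_3 + a(X_1 + X_3)^2\right), \qquad X_1^2 + X_3^2 = c^2 X_0 X_2, \tag{9}$$
and let $(X_0, X_1, X_2, X_3)$ and $(Y_0, Y_1, Y_2, Y_3)$ be two points on the curve. A complete system of addition laws is given by the following two maps:
$$\left((U_{00} + U_{22})^2,\; c(U_{00}U_{11} + U_{22}U_{33} + aG),\; (U_{11} + U_{33})^2,\; c(U_{00}U_{33} + U_{11}U_{22} + aG)\right),$$
$$\left((U_{13} + U_{31})^2,\; c(U_{02}U_{31} + U_{20}U_{13} + aF),\; (U_{02} + U_{20})^2,\; c(U_{02}U_{13} + U_{20}U_{31} + aF)\right), \tag{10}$$
respectively, where
$$U_{jk} = X_j Y_k, \quad F = V(U_{02} + U_{20}), \quad G = V(U_{00} + U_{22}), \quad V = (X_1 + X_3)(Y_1 + Y_3). \tag{11}$$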

For the point $(X_0, X_1, X_2, X_3)$, the doubling map sends it to
$$\left(X_0^4 + X_2^4 : c(X_0^2 X_1^2 + X_2^2 X_3^2) : X_1^4 + X_3^4 : c(X_0^2 X_3^2 + X_1^2 X_2^2)\right) \tag{12}$$
if $a = 0$, and to
$$\left(X_0^4 + X_2^4 : c(X_0^2 X_3^2 + X_1^2 X_2^2) : X_1^4 + X_3^4 : c(X_0^2 X_1^2 + X_2^2 X_3^2)\right) \tag{13}$$
if $a = 1$. In the twisted split $\mu_4$-normal form, the addition of generic points can be evaluated with 9M+2S and the doubling of a generic point with 2M+5S, where M denotes a field multiplication and S a squaring.

Among all the coordinate systems studied on binary curves, the twisted $\mu_4$-normal form and the λ projective coordinate appear to be the fastest. The difference is that the twisted $\mu_4$-normal form is better for double-and-add, while the λ projective coordinate can also be used with the halving operation. The costs of the point operations in the various point representation systems are shown in Table 3.

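Table 3: Cost comparison.

| Operation | Homogeneous | Jacobian | LD | λ | Twisted $\mu_4$ |
|---|---|---|---|---|---|
| Addition | 14M + 1S | 14M + 5S | 13M + 4S | 11M + 2S | 9M + 2S |
| Mixed-addition | 11M + 1S | 10M + 3S | 8M + 5S | 8M + 2S | 7M + 2S |
| Doubling | 7M + 3S | 4M + 5S | 3M + 5S | 3M + 4S | 2M + 5S |

2.3. Halving Operation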

The main object we consider is a cyclic subgroup of $E(\mathbb{F}_{2^m})$ of odd order $r$, denoted $G$. The multiplication-by-2 map $[2]: P \mapsto 2P$ on $G$ is an isomorphism, and so is its inverse, the halving operation $[1/2]: P \mapsto \frac{1}{2}P$. The use of point halving to speed up scalar multiplication was first investigated by Knudsen. Given a point $Q = (u, v) \in G$, one can compute the point $P = (x, y) \in G$ satisfying $Q = 2P$ at the cost of a field multiplication, a square root computation, and the solution of a quadratic equation, as can be seen directly from the formulas below:
$$\lambda = x + \frac{y}{x}, \qquad u = \lambda^2 + \lambda + a, \qquad v = x^2 + u(\lambda + 1). \tag{14}$$

The most commonly used method is to solve the second equation for λ, then the third one for x, and finally the first one for y.

When the λ coordinate $(x, \lambda)$ is used instead of the affine coordinate $(x, y)$, with $P = (x, \lambda_P)$ and $Q = (u, \lambda_Q)$, the halving formulas become
$$u = \lambda_P^2 + \lambda_P + a, \qquad x^2 = u(u + \lambda_Q + \lambda_P + 1). \tag{15}$$

This time only two steps are needed: solve the first equation for $\lambda_P$ and then the second one for $x$. Without computing $y$, the half point $P = (x, \lambda_P)$ of $Q = (u, \lambda_Q)$ is obtained more simply.

As proved in , solving a quadratic equation $x^2 + x = t$ on binary curves with $\mathrm{Tr}(a) = 1$ is equivalent to computing the half-trace function $H(t) = \sum_{i=0}^{(m-1)/2} t^{2^{2i}}$. Although extra memory is needed, Fong et al. showed a technique that significantly reduces the required time and space. With a dedicated implementation, a point halving takes approximately twice the time of a field multiplication, which is significantly faster than the customarily used point doubling.
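The half-trace can be illustrated in a toy field. The sketch below, in Python, assumes $\mathbb{F}_{2^5}$ with $f(t) = t^5 + t^2 + 1$ and hypothetical function names; it checks the standard identity $H(t)^2 + H(t) = t + \mathrm{Tr}(t)$ for odd $m$, so that $H(t)$ solves $x^2 + x = t$ whenever $\mathrm{Tr}(t) = 0$.

```python
M = 5                 # odd toy extension degree (assumption)
POLY = 0b100101       # f(t) = t^5 + t^2 + 1

def gf_mul(a, b):
    """Multiply in F_2[t]/(f(t)); elements are bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_sqr(a):
    return gf_mul(a, a)

def trace(t):
    """Tr(t) = t + t^2 + t^4 + ... + t^(2^(M-1)); the result is 0 or 1."""
    acc, s = 0, t
    for _ in range(M):
        acc ^= s
        s = gf_sqr(s)
    return acc

def half_trace(t):
    """H(t) = sum of t^(2^(2i)) for i = 0 .. (M-1)/2."""
    acc, s = 0, t
    for _ in range((M - 1) // 2 + 1):
        acc ^= s
        s = gf_sqr(gf_sqr(s))     # s -> s^4
    return acc

# H(t) solves x^2 + x = t exactly when Tr(t) = 0.
solvable = [t for t in range(2**M) if trace(t) == 0]
print(all(gf_sqr(half_trace(t)) ^ half_trace(t) == t for t in solvable))
```

Because squaring is linear over $\mathbb{F}_2$, $H$ is linear too, which is what allows table-based speedups of the kind attributed to Fong et al. above.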

From the algorithmic view, the halve-and-add method expands a scalar $k$ in a radix-$1/2$ representation. Let $l$ be the binary length of $r$. First compute $k' = k 2^l \bmod r = \sum_{i=0}^{l-1} k'_i 2^i$, that is, $k = k'/2^l \bmod r = \sum_{i=0}^{l-1} k'_i 2^i / 2^l = \sum_{i=0}^{l-1} k'_{l-i-1} (1/2)^{i+1}$. Much like double-and-add, the point multiplication
$$kP = \sum_{i=0}^{l-1} k'_{l-i-1} \frac{P}{2^{i+1}} = k'_{l-1}\frac{P}{2} + k'_{l-2}\frac{P}{2^2} + \cdots + k'_0 \frac{P}{2^l} \tag{16}$$
can be efficiently computed by applying point halving to an accumulator. It can be further optimized by combining it with methods such as ω-NAF to obtain better implementation performance, as shown in .

Following the treatment in halve-and-add, if we choose an appropriate number $t$ less than $l$, the scalar $k$ can be split naturally into two parts. Consequently, the halve-and-add method can easily be run concurrently with the double-and-add algorithm, making use of the increasing number of cores in modern processors. This is considerably faster than applying a single algorithm without parallelism, although some unavoidable overhead has to be taken into account. Specifically, if the length of $r$ is $l$ and a proper $t$ has been chosen, the scalar $k$ can be split into two portions that are processed simultaneously by the halve-and-add and double-and-add algorithms. The length of each part ($t$ and $l - t$) depends on the actual implementation speed of halving and doubling, which can be found experimentally. The split starts from
$$k' = 2^t k \bmod r. \tag{17}$$

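If we have the binary expansion $k' = \sum_{i=0}^{l-1} k'_i 2^i$, then, since the order $r$ is odd, it is easily derived that $k = k'/2^t \bmod r = (k'_{t-1} 2^{-1} + \cdots + k'_0 2^{-t}) + (k'_{l-1} 2^{l-t-1} + \cdots + k'_t) \bmod r$. The scalar multiplication $kP$ is then split directly into two parts:
$$kP = \left(k'_{t-1} 2^{-1} + \cdots + k'_0 2^{-t}\right)P + \left(k'_{l-1} 2^{l-t-1} + \cdots + k'_t\right)P. \tag{18}$$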

The first part is easily executed in the halve-and-add method; meanwhile, the second part can be performed through a double-and-add approach, in two different threads.
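The scalar split can be checked numerically. The Python sketch below uses ordinary modular arithmetic; the function name `split_scalar` and the small odd modulus 1019 are assumptions for illustration, and Python 3.8 or later is needed for `pow` with a negative exponent.

```python
def split_scalar(k, t, r):
    """Split k for the parallel method: k = k1 + k2 / 2^t (mod r), r odd.

    k1 feeds the double-and-add thread, k2 the halve-and-add thread.
    """
    k_prime = (k << t) % r            # k' = 2^t * k mod r
    k1 = k_prime >> t                 # most significant bits of k'
    k2 = k_prime & ((1 << t) - 1)     # least significant t bits of k'
    return k1, k2


r, t = 1019, 4                        # toy group order and split (assumptions)
k = 777
k1, k2 = split_scalar(k, t, r)
# Recombine: k1 + k2 * 2^(-t) must equal k modulo r.
print((k1 + k2 * pow(2, -t, r)) % r == k)
```

The identity holds for any odd `r`, because 2 is then invertible modulo `r`; the choice of `t` only affects how the work is balanced between the two threads.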

As far as side-channel attacks are concerned, noticing that double-and-add can be implemented using Montgomery-ladder point multiplication, Negre and Robert presented an analogous Montgomery-halving algorithm. The two registers hold a fixed difference, 2P, and the algorithm processes a point halving and an addition in each iteration. However, as noticed by the authors, this parallel algorithm can only be implemented in affine coordinates, since its halving operation cannot be implemented efficiently in projective coordinates. To overcome this drawback, we present another regular recoding algorithm that can be used when implementing parallel halve-and-add/double-and-add in a projective coordinate system.

3. Regular Implementation

The implementation of scalar multiplication can be protected against SSCA by many methods. Compared with an unprotected implementation, algorithmic countermeasures such as recoding scalars in a regular manner always sacrifice some efficiency, yet this loss can be mitigated by taking advantage of the inherent parallelism of modern processors.

3.1. Zero-Less Signed-Digit Expansion

Point addition and doubling on elliptic curves are far more complicated and time consuming than ordinary arithmetic operations, and much effort, including the work in this paper, has gone into speeding them up. As is well known, negating a point is very cheap, so subtraction of points on elliptic curves is just as efficient as addition. This motivates replacing the binary method with signed-digit representations, in which the scalar $k$ is represented by digits in the set $\{-1, 0, 1\}$ instead of $\{0, 1\}$. Among the many kinds of signed-digit representations, this paper uses the zero-less signed-digit expansion to obtain regular algorithms that improve the resistance of scalar multiplication against timing attacks and SPA.

Zero-less signed-digit (ZSD) expansion is a highly regular scalar recoding that expresses an odd integer $k$ with digits in $\{-1, 1\}$, where $-1$ is usually denoted $\bar{1}$. Since the digit 0 is avoided in the recoded sequence, each iteration of point multiplication requires a double-and-add operation, providing natural protection against timing attacks and SPA.

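Let $(k_{n-1}, k_{n-2}, \ldots, k_0)$ be the binary expansion of a scalar $k$. Note that any sequence of consecutive $\omega$ bits $00\cdots01$ in this expansion can be rewritten as $1\bar{1}\bar{1}\cdots\bar{1}$, i.e., $P = 2^{\omega}P - 2^{\omega-1}P - \cdots - 2P - P$. A similar treatment can be applied to the radix-$1/2$ expansion of $k$, since $P/2^{\omega-1} = P - P/2 - P/2^2 - \cdots - P/2^{\omega-1}$. When applying the halve-and-add algorithm, any consecutive $\omega$ bits $00\cdots01$ can likewise be rewritten as the $\omega$ digits $1\bar{1}\bar{1}\cdots\bar{1}$. So if $k$ is an odd integer, its radix-2 ZSD expansion $k = \sum_{i=0}^{n-1} \tilde{k}_i 2^i$ (or the corresponding radix-$1/2$ expansion $2^{-n}k = \sum_{i=0}^{n-1} \tilde{k}_i (1/2)^{n-i}$) with $\tilde{k}_i \in \{-1, 1\}$ can be obtained from
$$\tilde{k}_{n-1} = 1, \qquad \tilde{k}_i = (-1)^{k_{i+1}+1} \quad \text{for } 0 \le i \le n-2. \tag{19}$$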

From a security standpoint, every digit should be nonzero, so an even $k$ requires special treatment. This can be handled by computing $kP$ with the least significant bit of $k$ forced to 1 and finally subtracting $P$ (or $(1/2^n)P$ in the radix-$1/2$ case) from the result if bit $k_0$ is zero. The three algorithms in this paper use this approach to compute $kP$ or $2^{-n}kP$ correctly whether the input $k$ is even or odd.
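The recoding rule (19) is short enough to state directly in code. The following Python sketch is an illustration only; the function name `zsd_recode` is a hypothetical choice, and the digits are returned least significant first.

```python
def zsd_recode(k):
    """Return the ZSD digits [k~_0, ..., k~_{n-1}] of an odd k > 0.

    Each digit is +1 or -1, and sum(d * 2**i) equals k.
    """
    if k <= 0 or k % 2 == 0:
        raise ValueError("ZSD recoding is defined here for odd k > 0")
    n = k.bit_length()
    # k~_i = (-1)^(k_{i+1} + 1): +1 if bit i+1 of k is set, else -1.
    digits = [1 if (k >> (i + 1)) & 1 else -1 for i in range(n - 1)]
    digits.append(1)                  # k~_{n-1} = 1
    return digits


d = zsd_recode(9)                     # 9 = (1001)_2
print(d)                              # [-1, -1, 1, 1], i.e. 8 + 4 - 2 - 1
print(sum(di * 2**i for i, di in enumerate(d)))  # 9
```

An even scalar would be handled as described above, by recoding `k | 1` and applying one final corrective subtraction.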

Having known enough about ZSD expansion, we will get regular algorithms combining ZSD expansion and common binary methods to calculate the scalar multiplication. Algorithm 2 illustrates the regular ZSD double-and-add method based on radix-2 expansion from left to right, while Algorithm 3 does it from the opposite side.

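Algorithm 2: Regular ZSD double-and-add (left-to-right) method.

Input: $P \in E(\mathbb{F}_{2^m})$ of odd order $r$, $k = (k_{n-1}, k_{n-2}, \ldots, k_0)_2$ with $k_{n-1} = 1$

Output: $kP$

$R_0 \leftarrow P$; $R_1 \leftarrow P$

For $i = n-1$ down to $1$ do

$t \leftarrow (-1)^{1+k_i}$

$R_0 \leftarrow 2R_0 + tR_1$

End for

$R_{k_0} \leftarrow R_{k_0} - R_{1-k_0}$

Return $R_0$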
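A minimal Python sketch of Algorithm 2 follows. As an assumption made only for illustration, the additive group of integers modulo `r` stands in for the curve group, so that the result can be compared with `k * P % r`; the function name and the modulus are hypothetical.

```python
def zsd_double_and_add(k, P, r):
    """Regular ZSD double-and-add (left-to-right) in the stand-in group Z_r."""
    R = [P % r, P % r]                    # R[0]: accumulator, R[1]: fixed P
    n = k.bit_length()
    for i in range(n - 1, 0, -1):
        t = 1 if (k >> i) & 1 else -1     # t = (-1)^(1 + k_i)
        R[0] = (2 * R[0] + t * R[1]) % r  # one 2Q + P style step per bit
    k0 = k & 1
    R[k0] = (R[k0] - R[1 - k0]) % r       # subtracts P from R[0] iff k is even
    return R[0]


r = 1019                                  # toy group order (assumption)
print(zsd_double_and_add(44, 3, r) == 44 * 3 % r)
```

The loop body has the shape $2R_0 \pm R_1$, which is why the combined $2Q + P$ formula (8) applies to every iteration.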

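Algorithm 3: Regular ZSD double-and-add (right-to-left) method.

Input: $P \in E(\mathbb{F}_{2^m})$ of odd order $r$, $k = (k_{n-1}, k_{n-2}, \ldots, k_0)_2$ with $k_{n-1} = 1$

Output: $kP$

$R_1 \leftarrow P$

If $k_0 = 0$ then

$R_0 \leftarrow -P$

Else

$R_0 \leftarrow O$

End if

For $i = 1$ to $n-1$ do

$t \leftarrow (-1)^{1+k_i}$

$R_0 \leftarrow R_0 + tR_1$; $R_1 \leftarrow 2R_1$

End for

$R_0 \leftarrow R_0 + R_1$

Return $R_0$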
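The right-to-left variant can be sketched the same way. Again the integers modulo `r` stand in for the curve group (an assumption for illustration), with `0` playing the role of the point at infinity $O$; the function name is hypothetical.

```python
def zsd_double_and_add_rtl(k, P, r):
    """Regular ZSD double-and-add (right-to-left) in the stand-in group Z_r."""
    n = k.bit_length()
    R1 = P % r
    # For even k the forced low bit is compensated up front with -P.
    R0 = (-P) % r if (k & 1) == 0 else 0
    for i in range(1, n):
        t = 1 if (k >> i) & 1 else -1     # t = (-1)^(1 + k_i)
        R0 = (R0 + t * R1) % r            # uses R1 = 2^(i-1) * P
        R1 = (2 * R1) % r                 # R1 = 2^i * P
    return (R0 + R1) % r                  # top digit k~_{n-1} = 1


r = 1019                                  # toy group order (assumption)
print(zsd_double_and_add_rtl(44, 3, r) == 44 * 3 % r)
```

Here the addition and the doubling in each iteration act on different registers, in contrast to the left-to-right method, where the doubling feeds the addition.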

Algorithms 2 and 3 give regular binary methods for elliptic curve scalar multiplication based on the radix-2 expansion. To compute $2^{-n}kP$, the analogous radix-$1/2$ case has to be considered, for which the halve-and-add method is needed. With a slight modification of Algorithms 2 and 3, we obtain Algorithm 1 for the regular halving operation.

3.2. Parallelized Regular Scalar Multiplication

Let $P \in E(\mathbb{F}_{2^m})$ be a point of odd order $r$ with bit length $l$, and let $k \in [0, r-1]$ be a scalar. The parallelized double-and-add/halve-and-add algorithm for scalar multiplication consists of three parts: preprocessing, implementing, and postprocessing. Figure 1 gives an overview of the whole process.

Preprocessing: select a proper $t \in [0, l]$ and compute $k' = 2^t k \bmod r$.

Then $k \equiv 2^{-t}k' = \lfloor k'/2^t \rfloor + (k' \bmod 2^t)\,2^{-t} = k_1 + k_2 2^{-t} \pmod{r}$, where $k_1$ consists of the most significant $l - t$ bits and $k_2$ of the least significant $t$ bits of $k'$. This equation implies $kP = (k'/2^t)P = k_1 P + (k_2/2^t)P$.

Implementing: point multiplication is done by concurrently computing $k_1 P$ with the binary method and $(k_2/2^t)P$ with the radix-$1/2$ method in two different threads. In detail:

Feed the parameters $k_1$ and $P$ as inputs to the regular double-and-add algorithm (Algorithm 2 or Algorithm 3) in one thread. The resulting point $P_1$ is stored in a register.

Meanwhile, feed the parameters $k_2$ and $P$ as inputs to the regular halve-and-add algorithm (Algorithm 1) in another thread. The resulting point $P_2$ is stored in a register.

Postprocessing: a single point addition $P_1 + P_2$ yields the correct result of the scalar multiplication.
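The three phases can be sketched end to end in Python. This is a structural illustration under several assumptions: the integers modulo an odd `r` stand in for the curve group, all function names are hypothetical, and a thread pool is used only to mirror the two-thread layout (CPython threads do not run such code truly in parallel). The rescaling by $2^{n-t}$ for a $k_2$ shorter than $t$ bits is an addition of this sketch, not a step taken from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def double_and_add(k, P, r):
    """Algorithm 2 in the stand-in group Z_r; returns k*P mod r."""
    R = [P % r, P % r]
    for i in range(k.bit_length() - 1, 0, -1):
        t = 1 if (k >> i) & 1 else -1
        R[0] = (2 * R[0] + t * R[1]) % r
    k0 = k & 1
    R[k0] = (R[k0] - R[1 - k0]) % r
    return R[0]

def halve_and_add(k, P, r):
    """Algorithm 1 in Z_r; returns 2^(-n)*k*P mod r, n = k.bit_length()."""
    inv2 = pow(2, -1, r)
    R = [P * inv2 % r, P * inv2 % r]
    for i in range(k.bit_length() - 1, 0, -1):
        t = 1 if (k >> i) & 1 else -1
        R[1] = R[1] * inv2 % r
        R[0] = (R[0] + t * R[1]) % r
    k0 = k & 1
    R[k0] = (R[k0] - R[1 - k0]) % r
    return R[0]

def parallel_scalar_mul(k, P, r, t):
    # Preprocessing: k' = 2^t k mod r, split into k1 (high) and k2 (low).
    k_prime = (k << t) % r
    k1, k2 = k_prime >> t, k_prime & ((1 << t) - 1)
    # Implementing: the two halves run in separate threads.
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(double_and_add, k1, P, r)
        f2 = pool.submit(halve_and_add, k2, P, r)
        P1, P2 = f1.result(), f2.result()
    # Sketch-only fix-up: halve_and_add scales by 2^(-n), we need 2^(-t).
    P2 = P2 * pow(2, k2.bit_length() - t, r) % r
    # Postprocessing: a single addition.
    return (P1 + P2) % r


r, t = 1019, 4                            # toy parameters (assumptions)
print(parallel_scalar_mul(777, 7, r, t) == 777 * 7 % r)
```

In a real implementation the two workers would be native threads or coprocessors operating on curve points, and `t` would be tuned so that both finish at about the same time.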

4. Comparison and Expected Performance

Numerous standards include the NIST-recommended curves as groups for implementing cryptographic protocols. The general conclusions in Tables 1 and 2 apply specifically to NIST-recommended random curves of the form $y^2 + xy = x^3 + x^2 + b$, where $b$ is an element of $\mathbb{F}_{2^m} = \mathbb{F}_2[t]/(f(t))$. To allow easy comparison, the two curves considered in the estimates of this section are NIST B-233 and NIST B-409, defined by $f(t) = t^{233} + t^{74} + 1$ and $f(t) = t^{409} + t^{87} + 1$ over $\mathbb{F}_2[t]$, respectively.

4.1. Analysis

The theoretical complexity analysis of the four scalar multiplication approaches considered is reported in Table 1. Our aim is to improve the algorithms in  and to give a better parallel algorithm for evaluating scalar multiplication. Algorithms 2 and 3 have similar complexity, so only Algorithm 2 is discussed in the following.

For a regular implementation protected against SSCA, both the Montgomery methods and our new methods need $m$ doubling and $m$ addition point operations for double-and-add algorithms, and $m$ halving and $m$ addition point operations for halve-and-add algorithms. Specifically, in Montgomery-D, $D_M$ and $A_M$ denote the doubling and addition operations of a very efficient Montgomery double-and-add algorithm in . It needs only 6mM+10M+1I field operations, where M and I denote a field multiplication and a field inversion. In Montgomery-H, $H_a$ and $A_a$ are the halving and addition operations in affine coordinates. Halving usually involves a field multiplication, a trace computation, solving a quadratic equation, and a square root computation. According to the analysis and experimental results in [14, 23], we can assume that halving needs 3M field operations in affine coordinates and 2M in λ affine coordinates. Besides, $A_a$, the addition in affine coordinates, needs 2M+1I field operations. The structure of the Montgomery-H algorithm forces the use of affine coordinates, because no suitable projective coordinate system is known for it so far, and this affects its efficiency significantly, as the estimates below show.

In Algorithm 2, $D_{\lambda P}$ and $A_{\lambda P}$ denote doubling and addition in the λ projective coordinate, and $D_{\mu_4}$ and $A_{\mu_4}$ denote doubling and addition in the twisted $\mu_4$ projective coordinate. As for the corresponding field operations, $D_{\lambda P}$ and $A_{\lambda P}$ require 3M and 11M, while $D_{\mu_4}$ and $A_{\mu_4}$ require 2M and 9M. In Algorithm 1, $H_{\lambda a}$ denotes halving in the λ affine coordinate and requires 2M. In particular, if mixed addition and the formula for $2Q + P$ in Section 2.1 are used, Algorithm 2 needs 10mM+8M+3M+1I field operations in the λ projective coordinate, where 8M is the cost of the final mixed addition in Algorithm 2 and 3M+1I is the cost of transforming the final result from the λ projective coordinate to the affine coordinate. In the twisted $\mu_4$ projective coordinate, 9mM+7M+2M+1I field operations are needed, where 7M is the cost of the final mixed addition in Algorithm 2 and 2M+1I is the cost of transforming the final result from the twisted $\mu_4$ projective coordinate to the affine coordinate. As for Algorithm 1, the mixed λ projective coordinate system can be applied to save inversions, because the structure of Algorithm 1 differs from that of Montgomery-H. Similarly, 10mM+8M+3M+1I field operations are expected here.

In this work, following [14, 23], we assume I=10M and ignore the cost S of squaring. Squaring is nearly the fastest of all the field operations discussed in this paper, usually with S less than 0.2M, so it can be neglected. The value I=10M is a commonly used reference. On most platforms I may be larger than 10M, in which case Montgomery-H is affected most while the other three methods are almost unaffected; this is another benefit of using a projective coordinate system. Based on these costs, the two examples NIST B-233 and B-409 are worked out in Table 1.
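The estimates in Table 1 follow directly from its field-operation formulas under the stated assumption I = 10M. A short Python sketch reproduces them; the dictionary keys and variable names are illustrative labels, not notation from the paper.

```python
I = 10  # one inversion counted as 10 multiplications (I = 10M)

# Field-operation formulas from Table 1, in units of M, as functions of m.
COSTS = {
    "Montgomery-D":            lambda m: 6 * m + 10 + I,
    "Montgomery-H":            lambda m: 3 * m + m * (2 + I),
    "Algorithm 2-D (lambda)":  lambda m: 10 * m + 11 + I,
    "Algorithm 2-D (mu4)":     lambda m: 9 * m + 11 + I,
    "Algorithm 1-H":           lambda m: 10 * m + 11 + I,
}

for name, cost in COSTS.items():
    print(f"{name:24s} m=233: {cost(233)}M   m=409: {cost(409)}M")
```

Changing `I` shows the sensitivity discussed above: only the Montgomery-H row scales with `m * I`, while the other rows change by a constant.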

For double-and-add, the Montgomery-D algorithm is so efficient that Algorithm 2 cannot match its speed even with the twisted $\mu_4$ projective coordinate system, the fastest to date. For halve-and-add, Algorithm 1 saves 32.7% and 33.0% of the cost of Montgomery-H for m = 233 and m = 409, respectively. Thus, by using projective coordinates, our regular halve-and-add algorithm is considerably more useful in practice, and a parallel method built on the faster algorithm is also much more efficient.

One may ask why the mixed λ projective coordinate system cannot be applied to Montgomery-H, since comparing the two algorithms in different coordinate systems may seem unfair. The reason lies in the structure of Montgomery-H as given by Negre and Robert. Suppose that we have $Q_0$ in the λ affine coordinate and $Q_1$ in the λ projective coordinate when $k_i = 0$. When $k_{i+1} = 1$, we then face the need to transform $Q_1$ into the λ affine coordinate for the halving operation and $Q_0$ into the λ projective coordinate for the mixed addition in order to save inversions. This transformation has to be done every time two consecutive bits differ. For a random $m$-bit binary number whose leftmost bit is 1, the average number of positions at which consecutive bits differ is approximately $(m-1)/2$. Transforming from the λ projective coordinate into the λ affine coordinate costs 2M+1I field operations. Taking these costs into account, Montgomery-H would need around $(16m+7)$M field operations, which is more than with the affine coordinate. So the best way to deal with the problem is to use a new structure like Algorithm 1.

4.2. Parallel and New Discovery

Negre and Robert [14] get inspiration from  and utilize a split technique similar to the one introduced in . They also provide a Montgomery-halving algorithm analogous to the original Montgomery-ladder scalar multiplication method. Building on these ideas, they present a parallel method using the Montgomery-D and Montgomery-H algorithms. Unfortunately, because of its special structure, the Montgomery-H method from [14] can only use affine coordinates. To address this, we propose a new regular parallel approach combining Montgomery-D and Algorithm 1, which we call mixed-parallel.

After analyzing each algorithm in Section 4.1, we can choose a suitable split t and evaluate the complexity in the parallel setting. The results are shown in Table 2. In the Algorithm column, Montgomery-parallel is the parallel algorithm in [14], which executes Montgomery-D and Montgomery-H concurrently in two different threads. Our mixed-parallel in the last row is the new combined algorithm, which runs Montgomery-D and Algorithm 1 simultaneously on different coprocessors.
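
The role of the split t can be illustrated with a toy model. The sketch below is not the paper's algorithm. It assumes one common form of the split, k' = 2^t k mod n, and replaces the curve group with the integers modulo an odd n, so that doubling is multiplication by 2 and halving is multiplication by 2^(-1) mod n. The two loops are plain (non-regular) double-and-add and halve-and-add, standing in for Montgomery-D and Algorithm 1. All names, such as `split_scalar`, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split_scalar(k: int, t: int, n: int):
    """Return (k_hi, k_lo) such that k = k_hi + k_lo * 2^(-t) (mod n)."""
    k_prime = (k << t) % n
    return k_prime >> t, k_prime & ((1 << t) - 1)

def double_and_add(k_hi: int, P: int, n: int) -> int:
    """Left-to-right double-and-add in the toy group (Z_n, +)."""
    Q = 0
    for bit in bin(k_hi)[2:]:
        Q = (2 * Q) % n          # "point doubling"
        if bit == "1":
            Q = (Q + P) % n      # "point addition"
    return Q

def halve_and_add(k_lo: int, t: int, P: int, n: int) -> int:
    """Compute k_lo * 2^(-t) * P, least significant bit first, halving each step."""
    half = (n + 1) // 2          # inverse of 2 modulo an odd n
    Q = 0
    for i in range(t):
        if (k_lo >> i) & 1:
            Q = (Q + P) % n      # "point addition"
        Q = (Q * half) % n       # "point halving"
    return Q

def parallel_scalar_mul(k: int, P: int, n: int, t: int) -> int:
    """Run both halves concurrently and add the partial results."""
    k_hi, k_lo = split_scalar(k, t, n)
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_hi = pool.submit(double_and_add, k_hi, P, n)
        f_lo = pool.submit(halve_and_add, k_lo, t, P, n)
        return (f_hi.result() + f_lo.result()) % n

n = (1 << 61) - 1                # odd modulus standing in for the group order
k, P = 123456789, 5
print(parallel_scalar_mul(k, P, n, t=30) == (k * P) % n)   # True
```

In this model a larger t moves more bits to the halve-and-add thread, so t would be chosen to make the two threads finish at about the same time. The Python threads here only show the structure and are not expected to give a real speedup.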

Montgomery-D has the lowest cost among all the algorithms in Table 1. However, both parallel methods in Table 2 cost less than Montgomery-D. Comparing Montgomery-parallel with Montgomery-D first, Montgomery-parallel saves 27.5% and 28.0% of the cost of Montgomery-D when m = 233 and m = 409, respectively. Parallelism is therefore indeed effective for computing scalar multiplication. Furthermore, combining the best double-and-add algorithm, Montgomery-D, with the best halve-and-add algorithm, Algorithm 1-H, yields a new efficient parallel method, mixed-parallel. Our estimates show that mixed-parallel costs 11.7% and 12.0% less than Montgomery-parallel when m = 233 and m = 409, respectively. This is a new result and a new record.

5. Conclusion

In this paper, we present a new parallel algorithm that improves on the Montgomery algorithm in [14]. Both methods exploit the inherent parallelism of modern processors. Instead of a Montgomery-like ladder, our approach applies a regular recoding technique and achieves high efficiency by processing double-and-add and halve-and-add in parallel. Like the Montgomery approach, the regular method protects the computation against SSCA.

After careful analysis of these algorithms, we conclude that our regular halve-and-add approach, Algorithm 1, can use λ projective coordinates. This overcomes the disadvantage of Montgomery-H and saves about 32.7% and 33.0% of the cost of Montgomery-H for m = 233 and m = 409, respectively.

As a result, combining Montgomery-D and Algorithm 1 yields a new and preferable parallel approach, our mixed-parallel. It costs 11.7% and 12.0% less than Montgomery-parallel when m = 233 and m = 409, respectively. This is a new record as well as a good improvement on, and supplement to, the previous excellent work of [14].

Data Availability

All data generated or analyzed during this study are included in this published article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 61872442, 61772515, 61502487, and U1936209); the National Cryptography Development Fund (No. MMJJ20180216); and the Beijing Municipal Science & Technology Commission (Project no. Z191100007119006).

References

[1] N. Koblitz, "Elliptic curve cryptosystems," Mathematics of Computation, vol. 48, no. 177, pp. 203–209, 1987. doi:10.1090/s0025-5718-1987-0866109-5
[2] V. S. Miller, "Use of elliptic curves in cryptography," in Proceedings of the Conference on the Theory and Application of Cryptographic Techniques, Springer, Linz, Austria, April 1985.
[3] P. Kocher, J. Jaffe, and J. Benjamin, "Differential power analysis," in Proceedings of the Annual International Cryptology Conference, Springer, Santa Barbara, CA, USA, August 1999.
[4] J.-S. Coron, "Resistance against differential power analysis for elliptic curve cryptosystems," in International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Worcester, MA, USA, August 1999.
[5] S.-M. Yen and M. Joye, "Checking before output may not be enough against fault-based cryptanalysis," IEEE Transactions on Computers, vol. 49, no. 9, pp. 967–970, 2000. doi:10.1109/12.869328
[6] Y. Sung-Ming, "A countermeasure against one physical cryptanalysis may benefit another attack," in Proceedings of the International Conference on Information Security and Cryptology, Springer, Seoul, Korea, December 2001.
[7] P. L. Montgomery, "Speeding the Pollard and elliptic curve methods of factorization," Mathematics of Computation, vol. 48, no. 177, p. 243, 1987. doi:10.1090/s0025-5718-1987-0866113-7
[8] M. Joye and M. Tunstall, "Exponent recoding and regular exponentiation algorithms," in Proceedings of the International Conference on Cryptology in Africa, Springer, Gammarth, Tunisia, June 2009.
[9] H. Lange and W. Ruppert, "Complete systems of addition laws on abelian varieties," Inventiones Mathematicae, vol. 79, no. 3, pp. 603–610, 1985. doi:10.1007/bf01388526
[10] W. Bosma and H. W. Lenstra, "Complete systems of two addition laws for elliptic curves," Journal of Number Theory, vol. 53, no. 2, pp. 229–240, 1995. doi:10.1006/jnth.1995.1088
[11] D. J. Bernstein and T. Lange, "Faster addition and doubling on elliptic curves," in Proceedings of the International Conference on the Theory and Application of Cryptology and Information Security, Springer, Kuching, Malaysia, December 2007.
[12] H. Hisil, "Twisted Edwards curves revisited," in Proceedings of the International Conference on the Theory and Application of Cryptology and Information Security, Springer, Melbourne, Australia, December 2008.
[13] J. Renes, C. Craig, and L. Batina, "Complete addition formulas for prime order elliptic curves," in Proceedings of the Annual International Conference on the Theory and Applications of Cryptographic Techniques, Springer, Vienna, Austria, May 2016.
[14] C. Negre and J.-M. Robert, "New parallel approaches for scalar multiplication in elliptic curve over fields of small characteristic," IEEE Transactions on Computers, vol. 64, no. 10, pp. 2875–2890, 2015. doi:10.1109/tc.2015.2389817
[15] D. Hankerson, A. J. Menezes, and V. Scott, Guide to Elliptic Curve Cryptography, Springer Science & Business Media, Berlin, Germany, 2006.
[16] J. López and R. Dahab, "Improved algorithms for elliptic curve arithmetic in GF(2^n)," in Proceedings of the International Workshop on Selected Areas in Cryptography, Springer, Kingston, Canada, August 1998.
[17] E. W. Knudsen, "Elliptic scalar multiplication using point halving," in Proceedings of the International Conference on the Theory and Application of Cryptology and Information Security, Springer, Singapore, November 1999.
[18] T. Oliveira, "Lambda coordinates for binary elliptic curves," in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Santa Barbara, CA, USA, August 2013.
[19] D. Kohel, "Twisted μ4-normal form for elliptic curves," in Proceedings of the Annual International Conference on the Theory and Applications of Cryptographic Techniques, Springer, Paris, France, April 2017.
[20] D. Kohel, "Efficient arithmetic on elliptic curves in characteristic 2," in Proceedings of the International Conference on Cryptology in India, Springer, Chennai, India, 2012.
[21] D. Kohel, "A normal form for elliptic curves in characteristic 2," in Arithmetic, Geometry, Cryptography and Coding Theory (AGCT 2011), Springer, Berlin, Germany, 2011.
[22] D. Kohel, "Addition law structure of elliptic curves," Journal of Number Theory, vol. 131, no. 5, pp. 894–919, 2011. doi:10.1016/j.jnt.2010.12.001
[23] K. Fong, D. Hankerson, J. Lopez, and A. Menezes, "Field inversion and point halving revisited," IEEE Transactions on Computers, vol. 53, no. 8, pp. 1047–1059, 2004. doi:10.1109/tc.2004.43
[24] R. R. Goundar, M. Joye, A. Miyaji, M. Rivain, and A. Venelli, "Scalar multiplication on Weierstraß elliptic curves from Co-Z arithmetic," Journal of Cryptographic Engineering, vol. 1, no. 2, pp. 161–176, 2011. doi:10.1007/s13389-011-0012-0
[25] J. López and R. Dahab, "Fast multiplication on elliptic curves over GF(2^m) without precomputation," in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Worcester, MA, USA, August 1999.
[26] J. Taverne, A. Faz-Hernández, D. F. Aranha, F. Rodríguez-Henríquez, D. Hankerson, and J. López, "Speeding scalar multiplication over binary elliptic curves using the new carry-less multiplication instruction," Journal of Cryptographic Engineering, vol. 1, no. 3, pp. 187–199, 2011. doi:10.1007/s13389-011-0017-8