Parallel and Regular Algorithm of Elliptic Curve Scalar Multiplication over Binary Fields

Accelerating scalar multiplication has always been a significant topic in the elliptic curve cryptosystem, and many approaches have been proposed to achieve this aim. An interesting observation is that modern computers usually have multicore processors, which can be used to perform cryptographic computations in parallel. Inspired by this idea, we present a new parallel and efficient algorithm to speed up scalar multiplication. First, we introduce a new regular halve-and-add method which is very efficient thanks to the λ projective coordinate system. Then, we compare several algorithms for computing double-and-add and halve-and-add. Finally, we combine the best double-and-add and halve-and-add methods into a new, faster parallel algorithm which costs around 12.0% less than the previous best. Furthermore, our algorithm is regular without any dummy operations, so it naturally provides protection against simple side-channel attacks.


Introduction
The elliptic curve was first introduced into the world of cryptography by Neal Koblitz and Victor Miller independently in 1985 [1,2] and is now widely used for a broad range of cryptographic primitives in practice, such as public-key encryption and digital signatures. More than 30 years after its introduction to the field, the practical advantages of the elliptic curve cryptosystem (ECC) are clear and well known: it has richer algebraic structures, a smaller key size, and relatively faster implementations at the same security level compared with other deployed schemes such as RSA. For these reasons, ECC is particularly suitable for resource-constrained devices. The efficiency of ECC is dominated by the speed of scalar multiplication: given a rational point P of order r on an elliptic curve and a scalar k ∈ [0, r), one must compute kP = P + P + ⋯ + P (k times). There is an obvious analogy between scalar multiplication and exponentiation in a general multiplicative finite group. Therefore, inspired by the repeated square-and-multiply algorithm, the standard binary method for scalar multiplication over elliptic curves, called double-and-add, has become a fundamental technique.
In constrained environments, scalar multiplication is easily implemented by the double-and-add variant of Horner's rule, given the binary expansion k = Σ_{i=0}^{l} k_i·2^i of the scalar. However, each bit of k implies a different algorithmic path in each iteration: if k_i = 0, only a point doubling is needed, whereas if k_i = 1, a point doubling followed by a point addition is performed. As a consequence, the different power and time consumption of these two prominent building blocks can be detected by simple power analysis (SPA) [3] and timing attacks; this naive implementation leaks information about the secret scalar k.
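The key-dependent control flow is easy to see in code. The sketch below uses plain integers modulo a toy modulus as a stand-in group; `point_double` and `point_add` are hypothetical placeholders for the actual curve operations:

```python
# Left-to-right double-and-add: the branch on each key bit is what
# SPA/timing attacks observe. Integers mod N stand in for curve points.
N = 1009  # toy group modulus (illustrative assumption)

def point_double(p):      # stand-in for elliptic point doubling
    return (2 * p) % N

def point_add(p, q):      # stand-in for elliptic point addition
    return (p + q) % N

def double_and_add(k, p):
    acc = 0               # identity element ("point at infinity")
    for bit in bin(k)[2:]:            # scan bits left to right
        acc = point_double(acc)       # always a doubling ...
        if bit == '1':
            acc = point_add(acc, p)   # ... but an addition only when k_i = 1
    return acc

assert double_and_add(201, 5) == (201 * 5) % N
```

The `if` branch is executed only for nonzero bits, so an observer who can distinguish a doubling from a doubling-plus-addition reads off the bits of k directly.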
Protection against simple side-channel attacks (SSCA) can be achieved by recoding the scalar in a regular manner, meaning that scalar multiplication executes the same instructions in the same order for any input value. Coron introduced a countermeasure against SSCA called the double-and-add-always algorithm [4]. By inserting a dummy operation when necessary, it evaluates scalar multiplication by executing one doubling and one addition in each loop iteration. However, it was soon found to be vulnerable to safe-error fault attacks [5,6]: by timely inducing a fault during the point addition of one iteration, an adversary can determine whether that operation was a dummy by checking the correctness of the output.
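Coron's countermeasure can be sketched in the same toy-group style; the addition inside the loop is computed every iteration and simply discarded when k_i = 0, which is exactly the dummy operation a safe-error fault attack exploits:

```python
# Double-and-add-always (Coron): one doubling and one addition per bit,
# regardless of the bit value. Integers mod N stand in for curve points.
N = 1009

def point_double(p): return (2 * p) % N
def point_add(p, q): return (p + q) % N

def double_and_add_always(k, p, nbits):
    R = [0, 0]
    for i in range(nbits - 1, -1, -1):
        b = (k >> i) & 1
        R[0] = point_double(R[0])
        R[1] = point_add(R[0], p)   # performed whether needed or not
        R[0] = R[b]                 # the addition was a dummy when b = 0
    return R[0]
```

The instruction sequence is now independent of k, but corrupting `R[1]` in one iteration changes the output only if that iteration's bit was 1, which is the safe-error leak described above.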
A countermeasure against safe-error fault attacks is to perform scalar multiplication in a predictable pattern. Besides the most commonly used Montgomery-ladder algorithm [7], another efficient method is m-ary recoding [8]. This recodes a scalar into sequences of m − 1 zeros followed by a nonzero digit, so the proportion of nonzero digits is 1/m. However, scanning a look-up table can be dangerous if this step is not performed in constant time.
Another field of increasing interest for regular execution of scalar multiplication is exploiting efficient curve forms that admit a complete addition law. For any pair of k-rational points on an elliptic curve (or in the desired subgroup), a complete addition law computes the correct result regardless of whether the two addends are identical. As a corollary of the main results in [9], elliptic curves embedded in any projective space of dimension n by a symmetric line bundle admit a complete system of addition laws of bidegree (2,2). The later work of Bosma and Lenstra [10] shows that, when suitably chosen, a single addition law can act as the add operation for all pairs of k-rational points. One of the well-studied examples is Edwards curves [11,12], whose exceptional pairs for the addition law lie outside the k-rational points. A recent work [13] proposed an optimized algorithm that adds any pair of k-rational points on prime-order elliptic curves defined over fields of characteristic different from 2 and 3.
In [14], the authors introduce a new approach for scalar multiplication called the Montgomery-halving algorithm, a variation of the original Montgomery-ladder point multiplication. They also present a new strategy for parallel implementation of point multiplication on elliptic curves, running the Montgomery-halving algorithm and the original Montgomery-ladder algorithm concurrently. Moreover, this parallel algorithm protects against SSCA. However, in their scheme, affine coordinates have to be used for halving, because the projective form of the Montgomery-halving algorithm cannot be used to save operations.
In this paper, we provide a similar parallel implementation method using a regular recoding technique, which is highly efficient because the doubling and halving operations are processed in parallel on two different coprocessors. There are two main contributions. The first is a new regular algorithm for the halving-based computation, called zero-less signed-digit (ZSD) halve-and-add, which saves around 32.7% and 33.0% of the cost compared with the Montgomery-halving method in [14] for m = 233 and m = 409. The λ projective coordinate system offers projective coordinates that save inversions, which is especially useful for our ZSD halve-and-add algorithm (Algorithm 1). For the halving operation, the best coordinate system is λ affine; for the subsequent addition, the better choice is λ projective. The Montgomery-halving algorithm in [14] has to use affine coordinates because of its special structure, while our Algorithm 1 can make use of λ projective coordinates thanks to its different structure: R_1 can always be kept in λ affine coordinates for halving and R_0 in λ projective coordinates for addition, so the λ projective mixed addition law can be used and no extra coordinate transformations are needed. In addition, the regular recoding technique ensures a secure implementation of scalar multiplication against SSCA. The second contribution is the new mixed-parallel algorithm. After analyzing all the algorithms in Table 1, we combine the fastest double-and-add method, the Montgomery double-and-add method in [14], with the fastest halve-and-add method, our ZSD halve-and-add algorithm. The result is a new efficient and secure mixed-parallel algorithm, which costs around 11.7% and 12.0% less than the Montgomery-parallel approach in [14] for m = 233 and m = 409, respectively.
A more thorough analysis will be exhibited in Section 4, and the related estimates are displayed in Tables 1 and 2. The rest of this paper is organized as follows. In the next section, we introduce the relevant arithmetic of binary elliptic curves, focusing on the efficient λ coordinate point representation, the twisted μ4-normal form, and how to evaluate scalar multiplication in parallel by combining point halving and doubling. In Section 3, our new regular algorithm for halve-and-add is provided, and a parallel strategy similar to the one detailed in [14] shows how to implement scalar multiplication efficiently in a regular and parallel manner. Cost comparison and expected performance analysis are presented in Section 4. Finally, we conclude and summarize the new mixed-parallel algorithm.

Preliminaries
We focus on elliptic curves E defined over binary fields F_{2^m} by the Weierstrass equation

y² + xy = x³ + ax² + b, with a, b ∈ F_{2^m}, b ≠ 0,

where F_{2^m} = F₂[t]/(f(t)) and f(t) is an irreducible polynomial of degree m. Isomorphic to the divisor class group of degree 0, the rational points P(x, y) on E together with the point at infinity O form an abelian group, whose basic group operation, addition, is algebraically described by the tangent-and-chord law. Given two points P₁ = (x₁, y₁) and P₂ = (x₂, y₂) on E(F_{2^m}), where P₁ ≠ ±P₂, if the addition of the two points is written Q = P₁ + P₂, then the coordinates of Q = (x₃, y₃) can be computed with λ = (y₁ + y₂)/(x₁ + x₂) according to the following formula:

x₃ = λ² + λ + x₁ + x₂ + a, y₃ = λ(x₁ + x₃) + x₃ + y₁.

Similarly, given P = (x₁, y₁) ∈ E(F_{2^m}), where P ≠ −P, if the doubling of the point P is written 2·P, then the coordinates of 2P = (x₃, y₃) can be computed according to the following formula [15], with λ = x₁ + y₁/x₁:

x₃ = λ² + λ + a, y₃ = x₁² + (λ + 1)·x₃.

From the above formulas, it is easy to notice that there are unavoidable inversion operations in the base field, which consume much time. Usually, a projective coordinate system is preferred because it involves no field inversions. In practice, various coordinate systems are available. The work in this paper exploits the state-of-the-art ones: λ coordinates and the projective coordinate system of the twisted μ4-normal form. They perform excellently in different situations.
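The tangent-and-chord rules can be exercised end-to-end on a toy field. The sketch below uses GF(2⁴) with f(t) = t⁴ + t + 1 and the illustrative curve parameters a = b = 1 (an assumption for demonstration, not a standardized curve), and checks that the formulas really define a group law:

```python
# Affine tangent-and-chord arithmetic on E: y^2 + xy = x^3 + a*x^2 + b
# over the toy field GF(2^4) = F_2[t]/(t^4 + t + 1), with a = b = 1.
M, F_POLY = 4, 0b10011
A, B = 1, 1

def gf_mul(u, v):                   # carry-less multiply with reduction
    r = 0
    while v:
        if v & 1:
            r ^= u
        u <<= 1
        if u >> M:
            u ^= F_POLY
        v >>= 1
    return r

def gf_inv(u):                      # u^(2^M - 2) by square-and-multiply
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = gf_mul(r, u)
        u = gf_mul(u, u)
        e >>= 1
    return r

def on_curve(P):
    if P is None:                   # None encodes the point at infinity O
        return True
    x, y = P
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_mul(gf_mul(x, x), x) ^ gf_mul(A, gf_mul(x, x)) ^ B
    return lhs == rhs

def neg(P):                         # -(x, y) = (x, x + y) on binary curves
    return None if P is None else (P[0], P[0] ^ P[1])

def add(P, Q):
    if P is None: return Q
    if Q is None: return P
    x1, y1 = P; x2, y2 = Q
    if x1 == x2:
        if y1 != y2 or x1 == 0:     # Q = -P (or the 2-torsion point)
            return None
        lam = x1 ^ gf_mul(y1, gf_inv(x1))         # doubling slope x + y/x
        x3 = gf_mul(lam, lam) ^ lam ^ A
        y3 = gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3)
    else:
        lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))    # chord slope
        x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
        y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)
```

Enumerating all affine points of this curve and checking closure, commutativity, inverses, and associativity confirms that the formulas above reproduce the group structure.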

λ Coordinates.
Efficient point representation is of great importance for accelerating scalar multiplication. Inversions in the base field take a large amount of time; however, they are indispensable if points are represented in affine coordinates. The homogeneous projective coordinate system (also called the standard projective coordinate system) is usually used to eliminate this obstacle by injecting any k-rational affine point P(x, y) ∈ A² into one of its projective copies (X, Y, Z), with x = X/Z, y = Y/Z, and Z ≠ 0. When the projective copy (X, Y, Z) = (x, y, 1) corresponds to the affine point (x, y) with x = X/Z², y = Y/Z³, Z ≠ 0, it is the Jacobian projective coordinate system. Later, López and Dahab proposed a new and efficient projective coordinate system; compared with the above, the difference is that x = X/Z, y = Y/Z² here [16], denoted LD coordinates for short. Kim and Kim then presented a four-dimensional LD coordinate system for binary curves which represents P as (X, Y, Z, T), with x = X/Z, y = Y/T, T = Z², and Z ≠ 0. The λ coordinate system was first noticed by Knudsen [17] when studying halving operations on binary elliptic curves. Oliveira [18] further surveyed its comprehensive arithmetic. Given a point P = (x, y) ∈ E(F_{2^m}) with x ≠ 0, the λ affine representation of P is defined as (x, λ), where λ = x + y/x. It is then easy to derive point addition and doubling formulas for points in λ affine coordinates from the ordinary affine ones. Let P = (x_P, λ_P) and Q = (x_Q, λ_Q) be two points on E(F_{2^m}), where P ≠ ±Q; then P + Q = (x_{P+Q}, λ_{P+Q}) is given by the following formula:

x_{P+Q} = (x_P · x_Q · (λ_P + λ_Q))/(x_P + x_Q)²,
λ_{P+Q} = (x_Q · (x_{P+Q} + x_P)²)/(x_{P+Q} · x_P) + λ_P + 1.

ALGORITHM 1: Regular ZSD halve-and-add (left-to-right) method.
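As a numerical sanity check of the λ affine arithmetic (doubling x_{2P} = λ_P² + λ_P + a, λ_{2P} = x_P²/x_{2P} + λ_P² + a + 1, and the corresponding addition law), the following sketch compares the λ formulas with the standard affine ones on the toy field GF(2⁴) = F₂[t]/(t⁴ + t + 1) and the illustrative curve a = b = 1 (not a standardized curve):

```python
# Cross-checking lambda-affine addition/doubling against standard affine
# formulas on y^2 + xy = x^3 + a*x^2 + b over GF(2^4), a = b = 1.
M, F_POLY, A, B = 4, 0b10011, 1, 1

def gf_mul(u, v):
    r = 0
    while v:
        if v & 1: r ^= u
        u <<= 1
        if u >> M: u ^= F_POLY
        v >>= 1
    return r

def gf_inv(u):                       # u^(2^M - 2)
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1: r = gf_mul(r, u)
        u = gf_mul(u, u)
        e >>= 1
    return r

def gf_div(u, v): return gf_mul(u, gf_inv(v))

def to_lam(x, y):   return (x, x ^ gf_div(y, x))      # lambda = x + y/x
def from_lam(x, l): return (x, gf_mul(x, l ^ x))      # y = x(lambda + x)

def aff_double(x, y):
    l = x ^ gf_div(y, x)
    x3 = gf_mul(l, l) ^ l ^ A
    return (x3, gf_mul(x, x) ^ gf_mul(l ^ 1, x3))

def aff_add(x1, y1, x2, y2):         # assumes x1 != x2
    l = gf_div(y1 ^ y2, x1 ^ x2)
    x3 = gf_mul(l, l) ^ l ^ x1 ^ x2 ^ A
    return (x3, gf_mul(l, x1 ^ x3) ^ x3 ^ y1)

def lam_double(x, l):                # x2 = l^2+l+a, l2 = x^2/x2 + l^2 + a + 1
    x2 = gf_mul(l, l) ^ l ^ A
    l2 = gf_div(gf_mul(x, x), x2) ^ gf_mul(l, l) ^ A ^ 1
    return (x2, l2)

def lam_add(xp, lp, xq, lq):         # P != +-Q, all x-coordinates nonzero
    s = xp ^ xq
    x3 = gf_div(gf_mul(gf_mul(xp, xq), lp ^ lq), gf_mul(s, s))
    l3 = gf_div(gf_mul(xq, gf_mul(x3 ^ xp, x3 ^ xp)), gf_mul(x3, xp)) ^ lp ^ 1
    return (x3, l3)
```

Converting every affine point with x ≠ 0 to λ form, doubling/adding in both representations, and converting back shows the two systems agree wherever the λ representation is defined.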
(Notation for Tables 1 and 2: Montgomery-D = Montgomery double-and-add algorithm; Montgomery-H = Montgomery halve-and-add algorithm; Algorithm 2-D (λ-projective) = Algorithm 2 using the λ-projective coordinate system; Algorithm 2-D (twisted μ4) = Algorithm 2 using the twisted μ4 coordinate system; Algorithm 1-H = Algorithm 1 for halve-and-add.)

As for the doubling operation, 2P = (x_{2P}, λ_{2P}) is given as follows:

x_{2P} = λ_P² + λ_P + a,
λ_{2P} = x_P²/x_{2P} + λ_P² + a + 1.

In the projective setting, the affine representation (x, λ) corresponds to the λ projective representation (X, L, Z), with x = X/Z, λ = L/Z, and Z ≠ 0. The negative of (X, L, Z) is (X, L + Z, Z). Given two points P(X_P, L_P, Z_P) and Q(X_Q, L_Q, Z_Q) represented in the λ model on binary elliptic curves, addition and doubling formulas analogous to the affine case are obtained by clearing denominators. The associated group addition P + Q and doubling 2·P can be calculated with 11M + 2S and 3M + 4S, respectively, where M denotes a field multiplication and S a squaring.
Having the above formulas, a natural idea is to combine the doubling and addition formulas into a single formula evaluating 2Q + P, which is of great importance in the latter part of this paper. Let Q be given in λ projective coordinates and P in λ affine coordinates; then 2Q + P can be computed directly, without fully evaluating the intermediate point 2Q. Using this, 2Q + P can be calculated efficiently with 10 field multiplications M, plus a number of squarings S [18].
Let C_t be an elliptic curve over a characteristic-two finite field in the twisted split μ4-normal form, and let (X₁, X₂, X₃, X₄) and (Y₁, Y₂, Y₃, Y₄) be two points on the curve. A complete system of addition laws for this model is given by two explicit biquadratic maps, and the doubling map likewise has an explicit description, depending on whether a = 0 or a = 1; we refer to [19] for the formulas. In twisted split μ4-normal form, addition of generic points can be evaluated with 9M + 2S and doubling of a generic point with 2M + 5S, where M denotes a field multiplication and S a squaring [19]. Among all the studied coordinate systems on binary curves, the twisted μ4-normal form and λ projective coordinates appear to be the fastest. The difference is that the twisted μ4-normal form is better for computing double-and-add, while λ projective coordinates can be used with the halving operation. The costs of different point operations in various point representation systems are shown in Table 3.

Security and Communication Networks

Halving Operation.
The main ingredient we consider is a cyclic subgroup G of E(F_{2^m}) of odd order r. The multiplication-by-2 isogeny [2]: P ↦ 2P on G is an isomorphism, and so is its inverse map, the halving operation [1/2]: P ↦ (1/2)P. The use of point halving to speed up scalar multiplication was first investigated by Knudsen [17]. Given a point Q = (u, v) ∈ G, it allows computing the point P = (x, y) ∈ G satisfying Q = 2P at the cost of a field multiplication, a square-root computation, and solving a quadratic equation, as can be seen directly from the formulas below:

λ = x + y/x,
u = λ² + λ + a,
v = x² + u·(λ + 1).

The most commonly used method is to solve the second equation for λ, then the third one for x, and finally the first one for y.
When the λ coordinate (x, λ) is used instead of the affine coordinate (x, y), with P = (x, λ_P) and Q = (u, λ_Q), the halving formulas become

λ_P² + λ_P = a + u,
x² = u·(λ_P + λ_Q + u + 1).

This time only two steps are needed: solve the first equation for λ_P and then the second one for x. Without computing y, the halved point P = (x, λ_P) of Q = (u, λ_Q) is obtained more simply.
As proved in [23], on binary curves with Tr(a) = 1, solving the quadratic equation x² + x = t (solvable when Tr(t) = 0) amounts to computing the half-trace function H(t) = Σ_{i=0}^{(m−1)/2} t^{2^{2i}} for odd m. Although extra memory is needed, Fong et al. [23] showed a technique that significantly reduces the required time and space. With a dedicated implementation, a point halving takes approximately twice the time of a field multiplication, significantly faster than the customarily used point doubling.
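A direct, unoptimized sketch of the half-trace computation over the B-233 field follows (squarings are implemented as generic multiplications for brevity; real code would use fast squaring and the precomputation techniques of [23]):

```python
# Half-trace H(t) = sum_{i=0}^{(m-1)/2} t^(2^(2i)) over GF(2^233) with
# f(t) = t^233 + t^74 + 1, the NIST B-233 reduction polynomial.
M = 233
F_POLY = (1 << 233) | (1 << 74) | 1

def gf_mul(u, v):
    r = 0
    while v:
        if v & 1: r ^= u
        u <<= 1
        if u >> M: u ^= F_POLY
        v >>= 1
    return r

def gf_sq(u):                 # unoptimized: squaring as a generic multiply
    return gf_mul(u, u)

def trace(u):                 # Tr(u) = u + u^2 + ... + u^(2^(m-1)), in {0, 1}
    s = u
    for _ in range(M - 1):
        u = gf_sq(u)
        s ^= u
    return s

def half_trace(u):            # solves x^2 + x = u when Tr(u) = 0 (m odd)
    h = u
    for _ in range((M - 1) // 2):
        u = gf_sq(gf_sq(u))   # u -> u^(2^(2i))
        h ^= u
    return h
```

For odd m, H(u)² + H(u) = u + Tr(u), so whenever Tr(u) = 0 the half-trace is one of the two roots of x² + x = u (the other being H(u) + 1).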
From the algorithmic view, the halve-and-add method [17] expands a scalar k in the radix-(1/2) representation system. Let l be the binary length of r. Much like double-and-add, the point multiplication can then be computed efficiently by repeatedly applying point halving to an accumulator, and it can be further optimized with methods like ω-NAF, as shown in [23]. Enlightened by the treatment in halve-and-add, if we choose an appropriate number t less than l, the scalar k can be split into two parts naturally. Consequently, the halve-and-add method is easy to implement concurrently with the double-and-add algorithm in a parallel model, making use of the increasing number of cores in modern processors; this is considerably faster than running a single algorithm without parallelism (some unavoidable overhead should be taken into account in advance). Specifically, if the length of r is l and a proper t has been chosen, the scalar k can be split into two portions processed by the halve-and-add and double-and-add algorithms simultaneously; the length of each part (t and l − t) depends on the actual implementation speed of halving and doubling, which can be found experimentally. Writing k′ = 2^t·k mod r with binary expansion k′ = Σ_{i=0}^{l−1} k′_i·2^i and odd order r, it is easily derived that

k = k′·2^{−t} mod r = (k′_{t−1}·2^{−1} + ⋯ + k′_0·2^{−t}) + (k′_{l−1}·2^{l−t−1} + ⋯ + k′_t) mod r.

The scalar multiplication k·P is then split into two parts directly: the first part is easily executed with the halve-and-add method, while the second part is performed with a double-and-add approach, in two different threads.
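The split identity can be checked with plain modular arithmetic; r, k, and t below are small illustrative values:

```python
# Splitting k = k' * 2^(-t) (mod r): the low t bits of k' go to the
# halve-and-add thread and the high l - t bits to the double-and-add thread.
r = 1000003                         # odd group order (toy value)
k = 123456
t = 9                               # split point, tuned experimentally in practice

kp = (k << t) % r                   # k' = 2^t * k mod r
k2 = kp & ((1 << t) - 1)            # least significant t bits  -> halve-and-add
k1 = kp >> t                        # most significant bits     -> double-and-add

# kP = k1*P + (k2/2^t)P, since k = k1 + k2 * 2^(-t) (mod r):
assert (k1 + k2 * pow(2, -t, r)) % r == k   # pow(2, -t, r): Python 3.8+
```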
As far as side-channel attacks are concerned, noticing that double-and-add can be implemented with Montgomery-ladder point multiplication, Negre and Robert [14] presented an analogous Montgomery-halving algorithm. The two registers hold a fixed difference, and the algorithm processes a point halving and an addition in each iteration. However, as the authors noticed, this parallel algorithm can only be implemented in affine coordinates, since the halving operation cannot be implemented efficiently in projective coordinates. To overcome this drawback, we present another regular recoding algorithm that can be used when implementing parallel halve-and-add/double-and-add in a projective coordinate system.

Regular Implementation
Protecting the implementation of scalar multiplication against SSCA can be achieved by many methods. Compared with unprotected implementations, algorithmic countermeasures such as recoding scalars in a regular manner always sacrifice some efficiency, yet this loss can be mitigated by taking advantage of the inherent parallelism of modern processors.

Zero-Less Signed-Digit Expansion.
In general, point addition and doubling on elliptic curves are very different from the usual arithmetic operations; they are so complicated and time consuming that many researchers have spared no effort to find efficient ways to speed them up, like the work in this paper. As is well known, negating a point is a very cheap operation, which makes subtraction of points on elliptic curves just as efficient as addition. This motivates modifying the binary method to use signed-digit representations, that is, representing the scalar k with digits in the set {−1, 0, 1} instead of {0, 1}. Among the many kinds of signed-digit representations, this paper uses the zero-less signed-digit expansion to build regular algorithms that improve the resistance of scalar multiplication against timing attacks and SPA.
The zero-less signed-digit (ZSD) expansion [24] is a highly regular scalar recoding that expresses an odd integer k with digits in {−1, 1}; −1 is usually denoted 1̄. Since the digit 0 never occurs in the recoded sequence, each iteration of point multiplication performs a double-and-add operation, providing natural protection against timing attacks and SPA.
Let (k_{n−1}, k_{n−2}, ..., k_0) be the binary expansion of a scalar k. Note that any sequence of consecutive bits 00⋯01 can be rewritten with signed digits as 11̄1̄⋯1̄, i.e., P = 2^ω P − 2^{ω−1}P − ⋯ − 2P − P. A similar treatment applies to the radix-(1/2) expansion of k, since P/2^{ω−1} = P − (P/2) − (P/2²) − ⋯ − (P/2^{ω−1}); when applying the halve-and-add algorithm, any consecutive bits 00⋯01 can be rewritten as 11̄1̄⋯1̄ as well. So if k is an odd integer, it has a radix-2 ZSD expansion k = Σ_{i=0}^{n−1} k_i·2^i with k_i ∈ {−1, 1}, as well as a corresponding radix-(1/2) expansion. From a security standpoint, every digit should be nonzero. When k is even, a special treatment is required: compute kP with the least significant bit of k forced to 1, and finally subtract P (or (1/2^n)P in the radix-(1/2) case) from the result if bit k_0 is zero. The three algorithms in this paper apply this trick so that kP or 2^{−n}kP is computed correctly whether the input k is even or not.
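The recoding rule for an odd n-bit scalar is compact: the digit at position i (for i < n − 1) is +1 when bit i + 1 of k is 1 and −1 otherwise, and the top digit is +1. A minimal sketch:

```python
# Zero-less signed-digit (ZSD) recoding of an odd scalar k:
# digits d_i in {-1, +1} with k = sum d_i * 2^i.
def zsd_recode(k):
    assert k & 1, "ZSD expansion requires an odd scalar"
    n = k.bit_length()
    digits = [1 if (k >> (i + 1)) & 1 else -1 for i in range(n - 1)]
    digits.append(1)               # most significant digit is always +1
    return digits                  # least significant digit first

def reconstruct(digits):
    return sum(d << i for i, d in enumerate(digits))

for k in range(1, 1000, 2):
    assert reconstruct(zsd_recode(k)) == k
```

Every digit is nonzero, so a scalar-multiplication loop driven by these digits performs the same operation pattern for every scalar.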
Having described the ZSD expansion, we now combine it with the common binary methods to obtain regular algorithms for scalar multiplication. Algorithm 2 illustrates the regular ZSD double-and-add method based on the radix-2 expansion from left to right, while Algorithm 3 proceeds in the opposite direction.
Algorithms 2 and 3 give regular binary methods for evaluating elliptic scalar multiplication based on the radix-2 expansion. To compute 2^{−n}kP, the analogous radix-(1/2) setting has to be considered, for which the halve-and-add method is needed. Starting from Algorithms 2 and 3, a slight modification gives Algorithm 1 for the regular halving-based method.
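The shape of the resulting halve-and-add loop can be sketched on a toy group, integers modulo an odd r, where "halving" is multiplication by 2^{−1} mod r. Every iteration performs exactly one halving and one addition, with no zero-digit branch (this mirrors the regular structure of Algorithm 1, not its exact register layout):

```python
# Regular halve-and-add driven by ZSD digits on a toy group: integers mod
# an odd r, where "halving" is multiplication by the inverse of 2 mod r.
r = 1000003                        # odd toy group order (illustrative)
inv2 = pow(2, -1, r)

def zsd_recode(k):                 # digits in {-1, +1}, LSD first, k odd
    n = k.bit_length()
    return [1 if (k >> (i + 1)) & 1 else -1 for i in range(n - 1)] + [1]

def zsd_halve_and_add(k, p):       # returns (k * 2^-(n-1)) * p  mod r
    d = zsd_recode(k)
    acc = (d[0] * p) % r
    for di in d[1:]:               # one halving + one addition per digit
        acc = (acc * inv2 + di * p) % r
    return acc

k, p = 0b1011011, 7
n = k.bit_length()
assert zsd_halve_and_add(k, p) == (k * pow(2, -(n - 1), r) * p) % r
```

The Horner-style loop evaluates Σ d_i (1/2)^{n−1−i} · p, which is exactly the radix-(1/2) form of the recoded scalar.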

Parallelized Regular Scalar Multiplication.
Let P ∈ E(F_{2^m}) be a point of odd order r with bit length l, and let k ∈ [0, r − 1] be a scalar. The parallelized double-and-add/halve-and-add algorithm for scalar multiplication can be described in three phases: preprocessing, implementation, and postprocessing. Figure 1 gives an overview of the whole process.
Preprocessing: select a proper t ∈ (0, l) and compute k′ = 2^t·k mod r, where k′_1 is the most significant l − t bits and k′_2 is the least significant t bits of k′. This decomposition indicates kP = (k′/2^t)P = k′_1P + (k′_2/2^t)P. Implementation: the point multiplication is done by concurrently computing k′_1P with the binary method and (k′_2/2^t)P with the radix-(1/2) method in two different threads. In detail: (1) feed k′_1 and P as inputs to the regular double-and-add algorithm, using Algorithm 2 or Algorithm 3, in one thread; the final result point P_1 is stored in a register. (2) Meanwhile, feed k′_2 and P as inputs to the regular halve-and-add algorithm, Algorithm 1, in another thread; the final result point P_2 is stored in a register.
Postprocessing: a single point addition P_1 + P_2 is performed to obtain the final result of the scalar multiplication.
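The three phases can be sketched with two Python threads on a toy group (integers mod an odd r stand in for curve points; the split t would be tuned experimentally in a real implementation):

```python
# Two-thread sketch of the preprocessing / implementation / postprocessing
# flow on a toy group (integers mod an odd r stand in for curve points).
from concurrent.futures import ThreadPoolExecutor

r, P = 1000003, 7                   # toy odd order and "base point"
t = r.bit_length() // 2             # split point (tuned experimentally)

def dbl_add(k1):                    # thread 1: k1'*P by double-and-add
    acc = 0
    for bit in bin(k1)[2:]:
        acc = (2 * acc) % r
        if bit == '1':
            acc = (acc + P) % r
    return acc

def halve_add(k2, t):               # thread 2: (k2'/2^t)*P by halve-and-add
    inv2 = pow(2, -1, r)
    acc = 0
    for i in range(t):              # least significant bit first
        if (k2 >> i) & 1:
            acc = (acc + P) % r
        acc = (acc * inv2) % r
    return acc

k = 123457                          # the scalar to multiply
kp = (k << t) % r                   # preprocessing: k' = 2^t * k mod r
k1, k2 = kp >> t, kp & ((1 << t) - 1)

with ThreadPoolExecutor(max_workers=2) as ex:      # implementation
    f1 = ex.submit(dbl_add, k1)
    f2 = ex.submit(halve_add, k2, t)
    result = (f1.result() + f2.result()) % r       # postprocessing: one add

assert result == (k * P) % r
```

The final assertion checks that the two partial products recombine to kP, exactly the identity established in Section 2.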

Comparison and Expected Performance
Numerous standards include NIST-recommended curves as the underlying groups for cryptographic protocols. The general conclusions in Tables 1 and 2 are given specifically for NIST-recommended random curves of the form y² + xy = x³ + x² + b, where b is an element of F_{2^m}. To allow easy comparison, the two curves considered for the estimates in this section are NIST B-233 and NIST B-409, defined by f(t) = t²³³ + t⁷⁴ + 1 and f(t) = t⁴⁰⁹ + t⁸⁷ + 1 over F₂[t], respectively.

Analysis.
The theoretical complexity analysis of the four considered scalar multiplication approaches is reported in Table 1. Our work improves on the algorithms in [14] and gives a better parallel algorithm for evaluating scalar multiplication. (Algorithms 2 and 3 have similar complexity, so only Algorithm 2 is discussed in the following.) For a regular implementation secure against SSCA, the Montgomery methods and our new methods both need m doublings and m point additions for the double-and-add algorithms, and m halvings and m point additions for the halve-and-add algorithms. Specifically, in Montgomery-D, D_M and A_M denote the doubling and addition operations of the very efficient Montgomery double-and-add algorithm of [25]; only 6mM + 10M + 1I field operations are needed for Montgomery-D, where M and I denote field multiplication and inversion. In Montgomery-H, H_a and A_a are the halving and addition operations in affine coordinates. Halving typically involves a field multiplication, a trace computation, solving a quadratic equation, and computing a square root. According to the analysis and experimental results in [14,23], we can assume halving needs 3M field operations in affine coordinates and 2M in λ coordinates. Besides, A_a, addition in affine coordinates, needs 2M + 1I field operations. Unavoidably, the structure of the Montgomery-H algorithm is restricted to affine coordinates, because no suitable projective coordinate system can be applied to it so far, which significantly hurts its efficiency, as can easily be seen from the estimates below.
In Algorithm 2, D_{λP} and A_{λP} denote doubling and addition in λ projective coordinates, and D_{μ4} and A_{μ4} denote doubling and addition in twisted μ4 projective coordinates. In terms of field operations, D_{λP} and A_{λP} require 3M and 11M, while D_{μ4} and A_{μ4} require 2M and 8M. In Algorithm 1, H_{λa} denotes halving in λ affine coordinates and requires 2M. In particular, if the mixed addition operation and the 2Q + P formula of Section 2.1 are exploited, the field operation count of Algorithm 2 becomes 10mM + 8M + 3M + 1I in λ projective coordinates, where 8M is the cost of the final mixed addition in Algorithm 2 and 3M + 1I is the cost of transforming the final result from λ projective to affine coordinates. In twisted μ4 projective coordinates, 9mM + 7M + 2M + 1I field operations are needed, where 7M is the cost of the final mixed addition and 2M + 1I is the cost of transforming the final result from twisted μ4 projective to affine coordinates. As for Algorithm 1, the mixed λ projective coordinate system can be applied, saving inversions, owing to its algorithm structure being different from that of Montgomery-H; similarly, 10mM + 8M + 3M + 1I field operations are consumed there.
In this work, we assume I = 10M and ignore the cost S of squaring, following [14,23]. In fact, squaring is nearly the fastest of all the field operations discussed in this paper, usually below 0.2M, so it can be ignored. I = 10M is a commonly used reference value; on most occasions I may be larger than 10M, in which case Montgomery-H is affected the most while the other three methods are almost unaffected. This is another benefit of using projective coordinate systems. With the above cost comparison, the two examples NIST B-233 and B-409 are given in Table 1 for easier understanding. For double-and-add, the Montgomery-D algorithm is so outstanding that Algorithm 2 still cannot match its speed even with the twisted μ4 projective coordinate system, the fastest to date. For halve-and-add, Algorithm 1 saves 32.7% and 33.0% of the cost compared with Montgomery-H for m = 233 and m = 409. That means our algorithm for regular halve-and-add is much more useful in practice thanks to projective coordinates. Using the faster algorithm also makes the parallel method much more efficient.
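The headline halve-and-add figures can be reproduced from the operation counts above with I = 10M and squarings ignored:

```python
# Reproducing the halve-and-add comparison of this section with I = 10M.
I = 10                                      # field inversion, in multiples of M

def montgomery_h(m):                        # affine: per bit H_a = 3M, A_a = 2M + 1I
    return m * (3 + 2 + I)

def algorithm1_h(m):                        # lambda-projective: 10mM + 8M + 3M + 1I
    return 10 * m + 8 + 3 + I

for m in (233, 409):
    saving = 1 - algorithm1_h(m) / montgomery_h(m)
    print(m, round(100 * saving, 1))        # prints "233 32.7" then "409 33.0"
```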
One may ask why the mixed λ projective coordinate system cannot be applied to Montgomery-H; comparing these two algorithms in different coordinate systems may seem unfair, but it is not a deliberate trick. Looking closely at Montgomery-H in [14], suppose we have Q_0 in λ affine coordinates and Q_1 in λ projective coordinates when k′_i = 0. When k′_{i+1} = 1, we face the dilemma of converting Q_1 to λ affine coordinates for the halving operation and Q_0 to λ projective coordinates for the mixed addition, in order to save inversions. Every time two consecutive bits differ, such a conversion has to be performed. For a random m-bit binary number whose leftmost bit is 1, the average number of positions where 0 follows 1 or 1 follows 0 is approximately (m − 1)/2. Converting from λ projective to λ affine coordinates costs 2M + 1I field operations. Taking these costs into account, Montgomery-H would need around (16m + 7)M field operations, which is more than using affine coordinates. So the best solution is a new structure such as Algorithm 1.

Parallel and New Discovery.
Negre and Robert [14] took inspiration from [26] and use a split technique similar to the one introduced there. They also provide a Montgomery-halving algorithm analogous to the original Montgomery-ladder scalar multiplication method. Building on this, a parallel method using the Montgomery-D and Montgomery-H algorithms is presented. Unfortunately, the Montgomery-H method from [14] can only use affine coordinates because of its special structure. Aiming to solve this, we come up with a new regular parallel approach that combines Montgomery-D and Algorithm 1, which we call mixed-parallel.
After analyzing each algorithm in Section 4.1, we can choose a suitable split t and examine the complexity in the parallel setting.
The specific results are shown in Table 2. In the Algorithm column, Montgomery-parallel is the parallel algorithm of [14], which executes Montgomery-D and Montgomery-H concurrently in two different threads. Our mixed-parallel in the last line is the new combined algorithm, which applies Montgomery-D and Algorithm 1 simultaneously on different coprocessors.
It is evident that Montgomery-D has the least cost among all the algorithms in Table 1; however, either parallel method in Table 2 costs less than Montgomery-D. Comparing Montgomery-parallel and Montgomery-D first, the Montgomery-parallel algorithm saves 27.5% and 28.0% of the cost of Montgomery-D when m = 233 and m = 409. Parallelism is therefore indeed a good idea for computing scalar multiplication. Furthermore, combining the best double-and-add algorithm, Montgomery-D, with the best halve-and-add algorithm, Algorithm 1-H, yields the new efficient parallel method, mixed-parallel. The estimates demonstrate that our mixed-parallel method costs 11.7% and 12.0% less than Montgomery-parallel when m = 233 and m = 409, respectively.
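A back-of-the-envelope model reproduces the scale of these savings: choose the split so both threads finish together, using the per-bit costs stated in the analysis above (6M for Montgomery-D, 15M for Montgomery-H, 10M for Algorithm 1-H) and ignoring constant terms. Because the constants and final conversions are dropped, the percentages only approximate the exact Table 2 figures:

```python
# Rough parallel model: the halving thread handles t bits, the doubling
# thread m - t bits; pick t so both finish at about the same time.
def parallel_cost(m, d_cost, h_cost):
    t = m * d_cost // (d_cost + h_cost)
    return max((m - t) * d_cost, t * h_cost)

for m in (233, 409):
    serial   = 6 * m                        # Montgomery-D alone
    mont_par = parallel_cost(m, 6, 15)      # Montgomery-parallel [14]
    mix_par  = parallel_cost(m, 6, 10)      # mixed-parallel (this paper)
    print(m,
          round(100 * (1 - mont_par / serial), 1),    # rough saving vs Montgomery-D
          round(100 * (1 - mix_par / mont_par), 1))   # rough saving vs Montgomery-parallel
```

The model yields savings in the high-20s and low-teens of percent, consistent in magnitude with the 27.5-28.0% and 11.7-12.0% figures quoted from the full accounting.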
This is a new discovery and record.

Conclusion
In this paper, we present a new parallel algorithm that improves on the Montgomery algorithm in [14]. Both methods take advantage of the inherent parallelism of modern processors to construct parallel approaches. Instead of a Montgomery-like idea, a regular recoding technique is applied in our approach, which is highly efficient because double-and-add and halve-and-add are processed in parallel.
The regular method protects the computation against SSCA, in the same spirit as the Montgomery ladder.
After a careful analysis of these algorithms, we conclude that our regular halve-and-add approach, Algorithm 1, can use λ projective coordinates, making up for the disadvantage of Montgomery-H and saving about 32.7% and 33.0% of the cost compared with Montgomery-H for m = 233 and m = 409.
As a result, combining Montgomery-D and Algorithm 1 yields a preferable new parallel approach, our mixed-parallel. It costs 11.7% and 12.0% less than Montgomery-parallel when m = 233 and m = 409, respectively.
This is a new record as well as a good improvement on, and supplement to, the previous excellent work of [14].

Data Availability
All data generated or analyzed during this study are included in this published article.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.