FPGA Based High Speed SPA Resistant Elliptic Curve Scalar Multiplier Architecture

The high computational complexity of elliptic curve scalar point multiplication limits its implementation on general purpose processors. Dedicated hardware architectures are essential to reduce the computation time, which results in a substantial increase in the performance of associated cryptographic protocols. This paper presents a unified architecture to compute modular addition, subtraction, and multiplication operations over a finite field of large prime characteristic GF(p). Subsequently, dual instances of the unified architecture are utilized in the design of a high speed elliptic curve scalar multiplier architecture. The proposed architecture is synthesized and implemented on several different Xilinx FPGA platforms for different field sizes. The proposed design computes a 192-bit elliptic curve scalar multiplication in 2.3 ms on the Virtex-4 FPGA platform. It is 34% faster and requires 40% fewer clock cycles for elliptic curve scalar multiplication, and it consumes considerably fewer FPGA slices than other existing designs. The proposed design is also resistant to timing and simple power analysis (SPA) attacks; therefore it is a good choice in the construction of fast and secure elliptic curve based cryptographic protocols.


Introduction
Elliptic curve cryptography (ECC), proposed independently by Miller [1] and Koblitz [2], has established itself as a proper alternative to traditional systems such as RSA (Rivest, Shamir, and Adleman) [3]. The National Institute of Standards and Technology (NIST) recommends a key length of 256 bits for ECC to achieve the same level of security as 3072-bit RSA.
Because ECC offers similar security with considerably smaller key sizes than RSA, it has been standardized by IEEE and NIST [4]. As a result of the smaller key sizes, its implementation leads to a substantial reduction in power consumption and storage requirements and offers potentially higher data rates. These inherent properties make it a strong candidate for providing security in resource-constrained devices. Unfortunately, due to the underlying complex mathematical structure, its implementation on general-purpose processors (GPP) struggles to meet the speed requirements of many real-time applications.
Thus, several new implementation platforms have been explored in recent years. The field programmable gate array (FPGA) has been established as a proper platform for the implementation of security algorithms such as ECC and RSA. Its shorter design cycle time, lower design cost, and reconfigurability make it more attractive than other platforms, such as Application Specific Integrated Circuits (ASICs).
Elliptic curve scalar point multiplication is the central and most time consuming operation in all ECC based schemes. Its efficient implementation on various platforms is critical. It is achieved by manipulating points on a properly chosen elliptic curve over a finite field. Mathematically, it is expressed as Q = kP, where P is a base point, k is an integer value, and Q is the resultant point of the multiplication of k and P. For example, it can be achieved by adding P to itself (k − 1) times. The strength of any ECC scheme is based on the computational hardness of finding k given P and Q, known as the Elliptic Curve Discrete Logarithm Problem (ECDLP).
There are several elliptic curve representations satisfying different performance and security requirements. A flexible design capable of supporting different values for the elliptic curve parameters and the prime p is more demanding. Solving the ECDLP is not the only way of finding the scalar k; it can also be revealed by monitoring the timing [5] and power consumption of cryptographic devices, known as side channel attacks (SCAs) [6]. The simplest SCAs are based on timing and simple power consumption analysis (SPA). Detailed surveys on known SCAs, countermeasures, and secure ECC implementations are reported in [7, 8].
Elliptic curve scalar point multiplication involves many basic modular arithmetic operations such as addition, subtraction, multiplication, inversion, and division. Hence, optimization of these operations can significantly improve the performance of ECC schemes.
Elliptic curve cryptosystems can be designed on a finite field either with prime characteristic GF(p) or with binary characteristic GF(2^m). The GF(2^m) arithmetic is easier to implement in hardware than GF(p) because of carry-free arithmetic. However, field parameters in GF(2^m) are mostly fixed and are not very flexible. Some efficient ECC implementations over GF(2^m) are presented in [9-14]. A very good survey of high speed hardware implementations of ECC has been reported in [15].
Several hardware based elliptic curve processors over GF(p) have also been proposed in the literature [5, 16-26]. The design reported in [21] proposed two architectures to speed up the EC point multiplication operation. Both architectures are based on incorporating parallel dedicated hardware units to compute arithmetic operations such as addition, subtraction, multiplication, and division over GF(p). The GF(p) multiplication unit [21] is based on a bit-serial interleaved multiplication while, for a division over GF(p), a dedicated hardware unit based on a binary version of the extended Euclidean algorithm is used. Ghosh et al. proposed a speed and area optimized architecture for EC point multiplication by exploiting a concept of shared hardware arithmetic over GF(p) [20]. The saving in area is achieved by sharing hardware resources among different GF(p) arithmetic operations, while multiple copies of the arithmetic units are used to speed up EC point multiplication.
1.1. Contribution. Modern FPGAs have dedicated built-in arithmetic components (dedicated multipliers, block RAMs, etc.) to perform different signal processing tasks efficiently. However, in this work these components are not used because of a limitation of the adopted modular multiplication technique, the Interleaved Multiplication (IM) algorithm [27], which interleaves the reduction step by reducing each partial product. To the best of the authors' knowledge, no work has been reported targeting a digitwise implementation of the IM technique. However, the small dedicated multipliers available inside an FPGA can be very effective in the case of Montgomery multiplication [28] and the NIST recommended primes [29]. With these methods, a modular multiplication can be performed as an integer multiplication followed by a modular reduction.
This paper presents a novel architecture to speed up the EC point multiplication in affine coordinates. The proposed design is based on a unified GF(p) adder, subtractor, and multiplier (Add/Sub/Mul) unit. The unified Add/Sub/Mul unit is an extension of our previous GF(p) multiplier design reported in [30]. The proposed unified unit performs modular addition and subtraction in a single clock cycle, while modular multiplication is performed in ⌈n/3⌉ + 2 clock cycles, where n = log2 p. The careful FPGA implementation of the proposed EC point multiplication architecture outperforms the other existing designs. The main advantages of the proposed design are as follows.
(i) It reduces the number of required clock cycles and the computation time of EC point multiplication by almost 40% and 35%, respectively, with considerably smaller FPGA area consumption. The reduction in clock cycles and computation time is mainly due to the proposed GF(p) multiplier [30].
(ii) Furthermore, the adopted algorithm for EC point multiplication, with careful implementation of the GF(p) arithmetic primitives, is capable of resisting timing and SPA attacks [5].
(iii) It is flexible; all parameters (curve parameter a, EC point P, scalar value k, and the prime value p) can be easily changed without FPGA reconfiguration.
This paper is organized as follows. Section 2 briefly explains EC group operations such as EC point addition and EC point doubling. In addition, this section describes the Montgomery ladder structure for the EC point multiplication algorithm. The unified Add/Sub/Mul unit over GF(p) is presented in Section 3. Section 4 proposes a novel architecture for EC point multiplication based on the GF(p) unified Add/Sub/Mul unit. Implementation results and performance evaluation are presented in Section 5, and finally the paper is concluded in Section 6.

Elliptic Curve Group Operations
In this paper, we consider an elliptic curve E, defined over a prime field GF(p), where p is a large prime characteristic number. Field elements are represented as integers in the range [0, p − 1]. An elliptic curve E over GF(p) in short Weierstrass form is represented as

\[ E : y^2 = x^3 + ax + b, \tag{1} \]

where a, b, x, and y ∈ GF(p) and 4a^3 + 27b^2 ≠ 0 (modulo p). The set of all points (x, y) that satisfy (1), plus the point at infinity, forms an abelian group. EC point addition and EC point doubling operations over such groups are used to construct many elliptic curve cryptosystems. The EC point addition and EC point doubling operations in affine coordinates can be represented as follows: let P1 = (x1, y1) and P2 = (x2, y2) be two points on the elliptic curve. The group operation is the point addition, P3(x3, y3) = P1(x1, y1) + P2(x2, y2), which is defined by the group law and is given as

\[ x_3 = \lambda^2 - x_1 - x_2, \qquad y_3 = \lambda (x_1 - x_3) - y_1, \]

where

\[ \lambda = \begin{cases} (y_2 - y_1)/(x_2 - x_1) & \text{if } P_1 \neq P_2, \\ (3x_1^2 + a)/(2y_1) & \text{if } P_1 = P_2. \end{cases} \]

If P1 = P2, the special case of adding a point to itself is called the EC point doubling operation. In affine coordinates the EC point addition requires one division, two multiplications, and six addition or subtraction operations, whereas the EC point doubling can be performed by using one division, three multiplications, and seven addition or subtraction operations. Therefore, optimization of these operations has a significant impact on the overall performance of the EC point multiplication operation.

Algorithm 1: Montgomery ladder for EC point multiplication [20].
Input: An integer k and a point P on the elliptic curve.
Output: Q = kP.
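The affine formulas for EC point addition and doubling can be checked in software. The sketch below is only an illustration: the function name `ec_add`, the use of `None` for the point at infinity, and the toy curve y^2 = x^3 + 2x + 2 over GF(17) with the point (5, 1) are assumptions chosen for readability and do not come from the paper. The GF(p) division is modeled with Python's built-in modular inverse (Python 3.8 or later).

```python
def ec_add(P1, P2, a, p):
    """Affine addition on y^2 = x^3 + a*x + b over GF(p).

    Points are (x, y) tuples; None stands for the point at infinity.
    When P1 == P2 the doubling slope is used.
    """
    if P1 is None:
        return P2
    if P2 is None:
        return P1
    x1, y1 = P1
    x2, y2 = P2
    if x1 == x2 and (y1 + y2) % p == 0:
        return None  # P + (-P), including doubling a point with y = 0
    if P1 == P2:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p  # doubling slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p          # addition slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)


# Toy example (hypothetical parameters): a = 2, p = 17, base point (5, 1)
G = (5, 1)
G2 = ec_add(G, G, 2, 17)    # point doubling
G3 = ec_add(G, G2, 2, 17)   # point addition
```

Each call performs exactly one modular division, which is the operation that dominates the cost in affine coordinates.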

Elliptic Curve Scalar Multiplication. EC cryptosystems are mostly based on the EC point multiplication operation.
This operation can be performed as a sequence of EC point addition and EC point doubling operations given in Algorithm 1, which is known as the Montgomery ladder for EC point multiplication. Algorithm 1 works on the binary representation of k, and it is assumed that the most significant bit is equal to 1. The EC point addition and EC point doubling operations are not dependent on the bit pattern of k, so these operations can be performed in parallel. Because both operations are executed concurrently regardless of the key bit, Algorithm 1 also offers protection against timing and simple power analysis (SPA) attacks.
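As an illustration of the ladder structure, the following sketch implements the textbook form of the Montgomery ladder in software. It is not a transcription of Algorithm 1; the names `ec_add` and `montgomery_ladder`, the register names `R0` and `R1`, and the toy curve parameters are assumptions made for this example. One point addition and one point doubling are computed in every iteration, and the key bit only decides which register receives which result.

```python
def ec_add(P1, P2, a, p):
    """Affine point addition/doubling over GF(p); None is the point at infinity."""
    if P1 is None:
        return P2
    if P2 is None:
        return P1
    x1, y1 = P1
    x2, y2 = P2
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P1 == P2:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


def montgomery_ladder(k, P, a, p):
    """Compute kP for k >= 1, scanning the bits of k from the most significant one."""
    bits = bin(k)[2:]                      # bits[0] is the leading 1
    R0, R1 = P, ec_add(P, P, a, p)         # start from P and 2P
    for bit in bits[1:]:
        total = ec_add(R0, R1, a, p)       # addition, done in every iteration
        if bit == "0":
            R0, R1 = ec_add(R0, R0, a, p), total   # doubling of R0
        else:
            R0, R1 = total, ec_add(R1, R1, a, p)   # doubling of R1
    return R0
```

In this software model the branch on the key bit is still visible; in a hardware realization the same selection would typically be done with multiplexers so that both operations run side by side.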

Unified Add/Sub/Mul Unit
In this section we present a unified modular adder, subtractor, and multiplier (unified Add/Sub/Mul) unit. This unit is capable of performing modular addition, subtraction, and multiplication operations and supports any prime p; therefore it is able to provide hardware support for ECC over a variety of elliptic curves. Normally, to achieve a better performance of EC point multiplication on dedicated hardware, multiple copies of GF(p) adder, subtractor, multiplier, and divider units are integrated. These multiple copies help to execute several operations in parallel at the expense of area and cost, which can also result in more power consumption. Our objective is to accelerate the computation of the EC point multiplication operation with a minimum number of dedicated arithmetic units. Modular multiplication is a critical component in the architecture of the EC point multiplication operation. In this regard, several modular multipliers have been proposed. The design reported in [19] is based on an iterative addition and reduction algorithm. In every iteration, addition and reduction modulo p of partial products are performed. It computes an n-bit modular multiplication in n + 1 clock cycles. Two novel architectures based on radix-4 and radix-8 Booth encoding techniques are reported in [30, 31].
In [30] the radix-4 Booth encoded version computes a modular multiplication operation in n/2 + 1 clock cycles, whereas the radix-8 Booth encoded multiplier takes ⌈n/3⌉ + 2 clock cycles. The radix-8 Booth encoded multiplier given in Algorithm 2 is based on the iterative addition and reduction modulo p of partial products proposed by Blakley [27]. The two main components in the design are (i) a three-bit left shift modulo p unit (Step (3)) and (ii) a modular adder/subtractor unit.
There is also a logic circuit for Booth encoding in addition to these two core components. The presented unified Add/Sub/Mul unit is based on the same design. Because the radix-8 Booth encoded modular multiplier already contains a modular adder/subtractor unit, this paper modifies the multiplier so that it can also perform modular addition and subtraction operations in addition to its main task, a modular multiplication operation. With this modification, dedicated hardware units for modular addition and subtraction operations are not needed. The top-level block of the unified Add/Sub/Mul unit is shown in Figure 1. The logic components of the radix-8 Booth encoded modular multiplier are divided into shared and unshared logic parts. The shared logic components are used for modular addition, subtraction, and multiplication operations, whereas the unshared logic components are dedicated to the modular multiplication operation. A control unit decodes instructions on the basis of a two-bit operation code (opcode) and generates appropriate signals for the shared and unshared logic parts.
The shared logic is comprised of a modular adder/subtractor unit, while the unshared logic consists of the three-bit left shift modulo p unit and the Booth encoding logic. The adder/subtractor and three-bit left shift modulo p units are shown in Figure 2. The three-bit left shift modulo p unit is comprised of three identical D1 units cascaded in series. Each D1 unit performs a single-bit left shift modulo p operation and consists of one n-bit adder and a multiplexer. Hence, in total, the unshared logic consists of three n-bit adders, three multiplexers, and additional logic for Booth recoding. The adder/subtractor unit consists of two n-bit adders and five multiplexers.

Algorithm 2: Radix-8 BE IM algorithm [30].
Input: two operands and the modulus p, with both operands in the range [0, p − 1].
Output: the product of the two operands modulo p.
(1) Clear the accumulator and precompute 2, 3, and 4 times the multiplicand, each reduced modulo p.
(2) Loop from index n down to 0 in steps of 3, where n is the operand bit length.
(3) Shift the accumulator left by three bits modulo p.
(4)-(13) Add to or subtract from the accumulator the precomputed multiple selected by the Booth encoding.
(14) Return the accumulator.
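The interplay of the three-bit shift and the Booth-selected addition or subtraction can be modeled in software. The sketch below is an assumption-laden illustration, not the design of [30]: it uses the standard radix-8 Booth recoding with digits in [−4, 4], and the helper names (`mod_double`, `mod_addsub`, `booth8_digits`, `booth8_modmul`) are hypothetical. The exact step ordering and signal names of Algorithm 2 may differ.

```python
def mod_double(z, p):
    """One single-bit left shift modulo p (the role played by a D1 stage)."""
    z <<= 1
    return z - p if z >= p else z


def mod_addsub(z, y, p, subtract):
    """z + y or z - y, brought back into [0, p) with one conditional correction."""
    if subtract:
        r = z - y
        return r + p if r < 0 else r
    r = z + y
    return r - p if r >= p else r


def booth8_digits(x, n):
    """Radix-8 Booth digits of an unsigned n-bit x, most significant digit first."""
    padded = x << 1                      # appends the implicit bit x[-1] = 0
    count = n // 3 + 1                   # enough digits so the top digit is >= 0
    digits = []
    for j in reversed(range(count)):
        w = (padded >> (3 * j)) & 0xF    # bits x[3j+2], x[3j+1], x[3j], x[3j-1]
        d = -4 * (w >> 3) + 2 * ((w >> 2) & 1) + ((w >> 1) & 1) + (w & 1)
        digits.append(d)                 # each digit lies in [-4, 4]
    return digits


def booth8_modmul(x, y, p):
    """Interleaved modular multiplication, three multiplier bits per iteration."""
    n = p.bit_length()
    y2 = mod_double(y, p)
    # Precomputed multiples 0, y, 2y, 3y, 4y, all reduced modulo p
    multiples = [0, y, y2, mod_addsub(y2, y, p, False), mod_double(y2, p)]
    z = 0
    for d in booth8_digits(x, n):
        for _ in range(3):               # three cascaded shifts: z = 8z mod p
            z = mod_double(z, p)
        if d:
            z = mod_addsub(z, multiples[abs(d)], p, d < 0)
    return z
```

Every intermediate value stays below p, so the model needs only n-bit additions and conditional corrections, which is the property that makes the interleaved approach attractive for an arbitrary prime.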
In the proposed unified Add/Sub/Mul unit, these hardware logic resources are shared with other resources, so two n-bit adders and five multiplexers are saved. This unit is not capable of performing modular addition, subtraction, and multiplication operations in parallel. However, EC point representation in affine coordinates has a very limited scope of parallelism. Therefore, the proposed unified Add/Sub/Mul unit can increase the performance of EC point multiplication in affine coordinates with a lower area overhead. The proposed unified Add/Sub/Mul unit performs modular addition, subtraction, and multiplication operations as given in Table 1 in the following manner.
A GF(p) addition is performed by the shared logic unit if the two-bit input opcode = 00. The control unit decodes the opcode, activates the shared logic block (the adder/subtractor unit), and sets c_in = 0. The adder/subtractor unit consists of two n-bit adders and logic for input/output multiplexing, as shown in Figure 2. The first n-bit adder adds the two operands, and the result is fed into the second n-bit adder, where the modulus p is subtracted from it. Similarly, a GF(p) subtraction is performed by the same unit by setting opcode = 01; the first n-bit adder subtracts the second operand from the first, followed by the addition of the modulus p. The result of a modular addition or subtraction becomes available at the output port after a single clock cycle. In the case of a GF(p) multiplication, indicated by opcode = 10, the control unit generates appropriate signals for the shared and unshared logic components. The addition or subtraction of partial products modulo p is computed by the shared logic components depending on the cin signal generated by the Booth recoding logic, while the three-bit left shift modulo p (a multiplication by 8 modulo p) is computed by the unshared logic components. The detailed execution procedure and control signals for both shared and unshared logic components are given in [30]. The unified Add/Sub/Mul architecture takes ⌈n/3⌉ + 2 clock cycles to produce a GF(p) multiplication result at the output port. The main advantage of the proposed unified Add/Sub/Mul unit is that a single unit handles GF(p) addition, subtraction, and multiplication instructions. It eliminates the need for dedicated hardware units for GF(p) addition and subtraction operations, which would consume two n-bit adders in addition to I/O multiplexers. The proposed unit is not only optimized for hardware resources and for the number of clock cycles required by a GF(p) multiplication, but it is also programmable and supports any value of the modulus p.
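A small behavioral model can make the opcode dispatch concrete. The sketch below is an assumption: the function name `unified_add_sub_mul` is hypothetical, n is taken as the bit length of p, and the multiplication result is produced with Python's built-in arithmetic as a stand-in for the Booth encoded datapath. Only the opcodes 00, 01, and 10 named in the text are modeled. The function returns the result together with the clock cycle count stated in the text.

```python
def unified_add_sub_mul(opcode, x, y, p):
    """Behavioral model of the unit: returns (result, clock_cycles).

    opcode 0b00: modular addition, 0b01: modular subtraction,
    0b10: modular multiplication. Operands must lie in [0, p - 1].
    """
    n = p.bit_length()                    # assumed operand width in bits
    if opcode == 0b00:
        first = x + y                     # first adder: sum of the operands
        second = first - p                # second adder: subtract the modulus
        return (second if second >= 0 else first), 1
    if opcode == 0b01:
        first = x - y                     # first adder: difference
        second = first + p                # second adder: add the modulus back
        return (first if first >= 0 else second), 1
    if opcode == 0b10:
        cycles = -(-n // 3) + 2           # ceil(n / 3) + 2
        return (x * y) % p, cycles        # stand-in for the interleaved datapath
    raise ValueError("opcode not described in the text")


# Example with a small hypothetical modulus
result, cycles = unified_add_sub_mul(0b00, 15, 9, 17)
```

For addition and subtraction both candidate results are formed and one is selected, which mirrors the two-adder-plus-multiplexer structure described for the shared logic.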

Elliptic Curve Scalar Multiplier Architecture
ECC based schemes heavily rely on the EC scalar multiplication operation; therefore, its efficient implementation can greatly improve the performance of associated ECC based protocols.
The EC scalar multiplication is the computation of kP, where k is an integer and P is a base point of a chosen elliptic curve. Several algorithms have been proposed to compute the EC scalar multiplication operation [29]. The standard double-and-add method, the nonadjacent form (NAF), and the Montgomery ladder are the most widely used. EC point addition and EC point doubling operations can be executed in parallel using the Montgomery ladder method given in Algorithm 1. Because these EC point operations do not depend on the respective scalar bit k_i, their power consumption is symmetric, and it is not possible for an attacker to extract any information regarding the secret value k. Therefore, this technique provides protection against simple power analysis attacks. This section presents an efficient architecture for EC scalar multiplication in affine coordinates based on the unified Add/Sub/Mul unit proposed in Section 3. The proposed EC scalar multiplier architecture executes a scalar multiplication as a sequence of EC point addition and EC point doubling operations. These EC point operations are realized as a sequence of GF(p) arithmetic operations as given in Table 2.
The EC point addition operation requires six GF(p) subtractions, two GF(p) multiplications, and one GF(p) division. On the other hand, three GF(p) additions, four GF(p) subtractions, three GF(p) multiplications, and a single GF(p) division are required to perform the EC point doubling operation. As depicted in Table 2, the EC point operations in affine coordinates also require a GF(p) division operation in addition to GF(p) addition, subtraction, and multiplication operations. A GF(p) division or inversion can be performed either by Fermat's little theorem or by the extended Euclidean algorithm (EEA). The binary version of the EEA given in [29] is the most widely adopted algorithm for GF(p) division. The EEA implementation in this work is based on the guidelines presented in [34]. It takes 2n clock cycles to perform an n-bit GF(p) division or inversion operation.
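For readers unfamiliar with the binary EEA, the following sketch shows a common software formulation adapted to compute a quotient directly instead of an inverse. It is an illustration under stated assumptions, not the hardware divider of this work or of [34]: the function name `gf_div` and the variable names are hypothetical, and p is assumed to be an odd prime with the divisor in [1, p − 1]. Only shifts, additions, and subtractions are used.

```python
def gf_div(a, b, p):
    """Return a / b modulo an odd prime p, for 0 <= a < p and 0 < b < p."""
    u, v = b, p
    x1, x2 = a, 0      # invariants: x1*b = u*a (mod p) and x2*b = v*a (mod p)
    while u != 1 and v != 1:
        while u % 2 == 0:
            u //= 2
            # Halve x1 modulo p: add p first when x1 is odd so the shift is exact
            x1 = x1 // 2 if x1 % 2 == 0 else (x1 + p) // 2
        while v % 2 == 0:
            v //= 2
            x2 = x2 // 2 if x2 % 2 == 0 else (x2 + p) // 2
        if u >= v:
            u, x1 = u - v, x1 - x2
        else:
            v, x2 = v - u, x2 - x1
    return (x1 if u == 1 else x2) % p


# Example with a small hypothetical modulus: 3 / 5 in GF(17)
q = gf_div(3, 5, 17)
```

Initializing `x1` with the dividend rather than with 1 is what turns the inversion routine into a division, so no separate multiplication by the inverse is needed.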
It is evident from Table 2 that, in the computation of EC point operations, the scope of parallelism among GF(p) arithmetic operations is very limited. Therefore, a semiparallel architecture for EC scalar multiplication is proposed, as shown in Figure 3.
It consists of two GF(p) unified Add/Sub/Mul units, two GF(p) divider units, two register files (each comprised of 3 n-bit registers), I/O multiplexers, and a main controller. A GF(p) unified Add/Sub/Mul unit executes one GF(p) addition, subtraction, or multiplication operation at a time, while a GF(p) division unit executes a GF(p) division modulo p in 2n clock cycles. Therefore, the proposed design can execute two GF(p) addition, subtraction, or multiplication operations in parallel with two GF(p) division operations. We grouped these GF(p) arithmetic units into the SAU1 and SAU2 units. Each of SAU1 and SAU2 consists of one GF(p) unified Add/Sub/Mul unit, one GF(p) divider unit, and one register file. The EC point addition operation and the EC point doubling operation in Algorithm 1 can be performed in parallel. Therefore, the proposed architecture performs these EC point operations in parallel; however, on each unified Add/Sub/Mul unit, GF(p) addition, subtraction, and multiplication operations can only be performed serially. The SAU1 unit is dedicated to the EC point addition operation, while the EC point doubling operation is executed by the SAU2 unit. The register files store intermediate results during the execution of EC point addition and EC point doubling operations based on control signals generated and managed by the main controller.

Scheduling Strategy.
A scheduling policy to compute EC point addition and EC point doubling operations on the proposed SAU1 and SAU2 units is shown in Figure 4, where GF(p) addition, subtraction, multiplication, and division operations are denoted as +, −, ×, and /, respectively.

Implementation Results and Discussion
The elliptic curve scalar multiplier architecture presented in the previous section has been implemented in Verilog HDL. For simulation, synthesis, mapping, and routing purposes, the Xilinx ISE 9.1 design suite has been used. Table 4 shows the required number of clock cycles to compute the EC scalar multiplication operation. The proposed design computes EC point addition and EC point doubling operations in (11 + 8⌈n/3⌉) and (15 + 3n) clock cycles, respectively. Since the EC point operations are executed concurrently in the proposed design, a single iteration of Algorithm 1 is completed in (15 + 3n) clock cycles. The designs reported in [21] take (13 + 5n) clock cycles per iteration, so the proposed design needs almost 40% fewer clock cycles. Similarly, [18, 24-26] require 48%, 179%, 85%, and 62% more clock cycles to perform the EC scalar multiplication, respectively.
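The clock cycle formulas can be evaluated for concrete field sizes. The sketch below assumes the reading in which addition costs 11 + 8⌈n/3⌉ cycles, doubling costs 15 + 3n cycles, and the competing design costs 13 + 5n cycles per iteration, with n the field size in bits. It further assumes that a full scalar multiplication runs n ladder iterations; that assumption is not stated in the text, but it reproduces the 113,472 cycles reported for the 192-bit implementation. All function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def cycles_point_add(n):
    return 11 + 8 * ceil_div(n, 3)


def cycles_point_double(n):
    return 15 + 3 * n


def cycles_per_iteration(n):
    # Addition and doubling run concurrently, so the slower one sets the pace
    return max(cycles_point_add(n), cycles_point_double(n))


def cycles_scalar_mult(n):
    # Assumption: n ladder iterations for an n-bit scalar
    return n * cycles_per_iteration(n)


def saving_vs_reference(n):
    """Fraction of cycles saved relative to a (13 + 5n)-cycle iteration."""
    return 1 - cycles_per_iteration(n) / (13 + 5 * n)


for bits in (160, 192, 224, 256):
    print(bits, cycles_per_iteration(bits), cycles_scalar_mult(bits))
```

For n = 192 the model gives 523 cycles for addition, 591 for doubling, and a saving of about 39% against the (13 + 5n) reference, in line with the "almost 40%" figure.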
Table 5 presents a performance analysis of several existing FPGA based implementations of EC scalar multipliers. The design reported in [21] is based on parallel dedicated hardware units for GF(p) addition, subtraction, the slices for BRAM and dedicated embedded multipliers. The design presented in [33] consumes 15,775 slices and takes 5.99 ms to compute one EC scalar multiplication. On the same platform it is 25% faster but it consumes 28% more FPGA slices. The design proposed by Daly et al. in [18] is 262% slower but consumes 47% fewer slices. The design reported in [20] is 40% slower and consumes 13% more FPGA slices than the proposed design.
A performance comparison on the basis of throughput rate is given in the last column of Table 5. The proposed design has 0.5 times, 2.64 times, 1.30 times, 2.1 times, and 0.42 times higher throughput rate than the designs in [21], [18], [16], [32], and [20], respectively. The design in [33] has a 1.25 times higher throughput rate than our design; however, it consumes 1.42 times more FPGA slices. Therefore, our design is better in terms of computation time, slice area, and throughput rate than all the designs listed in Table 5. Because the proposed design executes EC point addition and EC point doubling operations concurrently in a fixed amount of time, (15 + 3n) clock cycles, it provides protection against timing and simple power analysis attacks, which is an important feature in modern security applications. Due to its lower computation time and high throughput rate, it is suitable for network applications such as SSL and IPsec. It is also suitable for low power, resource-constrained environments because of its smaller area and reduced clock cycles.

Conclusion
This paper first introduces a unified arithmetic architecture for GF(p) addition, subtraction, and multiplication operations. Then, a high speed elliptic curve scalar multiplier is developed on the basis of the unified arithmetic architecture. The proposed design has been synthesized using Xilinx ISE 9.1 and 14.2 Design Suites targeting various Xilinx FPGA devices. Performance is shown for 160-, 192-, 224-, and 256-bit elliptic curve scalar multiplication operations. Compared with other contemporary designs, it gives 34% and 40% better performance in terms of computation time and number of required clock cycles, respectively. It is programmable for any value of the prime p and is also resilient to timing and simple power analysis attacks. Therefore, it is a good choice for ECC based cryptosystems.

Table 1: Operation codes for unified Add/Sub/Mul unit.

Table 2: Sequences of GF(p) operations for EC point operations.

The coordinates of the two input points P1, P2 are denoted by x1, x2, y1, y2, while the resultant point coordinates are shown as x3, y3. The results of + and − operations are available after one clock cycle, whereas × and / operations are completed in ⌈n/3⌉ + 2 and 2n clock cycles, respectively. The register transfer logic of EC point addition and EC point doubling operations on the SAU1 and SAU2 units can be analyzed using Figure 4 and Table 2. Initially, registers R_x1, R_y1, R_x2, and R_y2 are loaded with the coordinates of the EC input points P and 2P, while register R_a is initialized with the EC parameter a. R_x1, R_y1, R_x2, and R_y2 are then updated with the new values of EC point addition and EC point doubling. Let T be the total number of clock cycles required to compute the EC point multiplication operation; then on the proposed architecture it can be estimated as

\[ T = n\,(15 + 3n). \]

The main controller waits for the respective done signals, checks the i-th bit of the scalar k, and either updates the register files with new values or outputs the result and stops execution.

Table 3 shows the performance of the proposed architecture for 160-, 192-, 224-, and 256-bit field sizes on several different FPGA platforms. For the 192-bit implementation it takes 3.2 ms, 2.3 ms, and 1.4 ms while running at a maximum frequency of 35 MHz, 48 MHz, and 81 MHz on the Virtex-II Pro, Virtex-4, and Virtex-6 FPGA platforms, respectively. Because the ISE 9.1 design suite does not support the Virtex-6 FPGA, the implementation on Virtex-6 has been done using Xilinx ISE 14.7. For the 192-bit field size, our implementation on Virtex-4 computes a single EC scalar multiplication in 2.3 ms, using 113,472 clock cycles at a maximum frequency of 48 MHz. The 192-bit implementation consumes 8,500 slices of the Virtex-4 FPGA and has a throughput of 83.5 Kbps. The same design on Virtex-II Pro takes 3.2 ms at a maximum frequency of 35 MHz and uses 7,930 slices. The performance comparison between the proposed architecture and other FPGA implementations is analyzed on the basis of clock cycles, computation time, frequency, occupied FPGA slices, and throughput (TP).
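The relation between clock cycles, frequency, computation time, and throughput can be checked with a few lines of arithmetic. The sketch below assumes that throughput is defined as the field size in bits divided by the time of one scalar multiplication, a definition that reproduces the 83.5 Kbps figure quoted for the 192-bit Virtex-4 implementation. The function names are hypothetical.

```python
def time_ms(cycles, freq_mhz):
    """Computation time in milliseconds for a given cycle count and clock."""
    return cycles / (freq_mhz * 1000.0)


def throughput_kbps(field_bits, time_in_ms):
    """Bits processed per millisecond, which equals kilobits per second."""
    return field_bits / time_in_ms


# 192-bit case on Virtex-4, using the figures quoted in the text
t = time_ms(113472, 48)            # about 2.36 ms at exactly 48 MHz
tp = throughput_kbps(192, 2.3)     # about 83.5 Kbps for the reported 2.3 ms
print(round(t, 3), round(tp, 1))
```

The time computed from the rounded 48 MHz clock is slightly above the reported 2.3 ms, presumably because the reported frequency and time are themselves rounded.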