Prime Field ECDSA Signature Processing for Reconfigurable Embedded Systems

The growing ubiquity and safety relevance of embedded systems strengthen the need to protect their functionality against malicious attacks. Communication and system authentication by digital signature schemes is a major issue in securing such systems. This contribution presents a complete ECDSA signature processing system over prime fields for bit lengths of up to 256 on reconfigurable hardware. By using a dedicated hardware implementation, the performance can be improved by up to two orders of magnitude compared to microcontroller implementations. The flexible system is tailored to serve as an autonomous subsystem providing authentication transparent for any application. Integration into a vehicle-to-vehicle communication system is shown as an application example.


Introduction
With the emerging ubiquity of embedded electronic systems and a growing share of distributed systems and functions even in safety-relevant areas, the security of embedded systems and their communication is quickly gaining importance. One major concern of security is the authenticity of communication peers and information exchange. Especially if many different remote participants have to communicate or not all participants are known in advance, asymmetric signature schemes are beneficial for authentication purposes. In contrast to symmetric schemes like the Keyed-Hash Message Authentication Code (HMAC) [1], asymmetric signature schemes like RSA [2], DSA [3], and the ECDSA scheme [3] considered in this contribution get along without key exchange or predistributed keys, usually relying on a certification authority as trusted third party instead.
This benefit comes at the cost of a much greater computational complexity of these schemes compared to authentication techniques based on symmetric ciphers or solely on hashing. This imposes major problems especially for embedded systems, where resources are scarce.
This contribution presents a hardware-implemented system for complete prime field ECDSA signature processing on FPGAs. It can be integrated as an autonomous subsystem for signature processing in embedded devices. As an application example, the integration in a vehicle-to-vehicle communication unit is presented.
The remainder of this paper is organized as follows. Section 2 reviews related work, Section 3 presents the basics of the implemented signature scheme ECDSA, and Section 4 outlines the assumed situation and the requirements for the system. The structure and implementation of the signature system itself are presented in Section 5, and Section 6 shows an application example and the integration in a wireless communication system. Section 7 details performance and resource usage, which are further discussed in Section 8. The paper is concluded in Section 9.

Related Work
Since elliptic curves were proposed as a basis for public key cryptography in 1985 by Koblitz [4] and Miller [5] independently, many implementations of the prime field Elliptic Curve Digital Signature Algorithm (ECDSA) and Elliptic Curve Cryptography (ECC) in general have been published. Software implementations on general purpose processors need a lot of computation power. The eBACS ECRYPT benchmark [6] gives values for 256-bit ECDSA of, for example, 1.88 ms for generation and 2.2 ms for verification on an Intel Core 2 Duo at 1.4 GHz, and 2.9 ms and 3.4 ms, respectively, on an Intel Atom 330 at 1.6 GHz. Values for a crypto system based on an ARM7 32-bit microcontroller are given in [7] for a key length of 233 bit. Using a comb table precomputation (w = 4), 742 ms are needed for the generation and 1240 ms for the verification of an ECDSA signature. An implementation for a RIM Blackberry [8] using an ARM 9EJ-S core achieves 150 ms for a signature generation and 168 ms for a signature verification [9].
To achieve usable throughputs and latencies on embedded systems, various specialized hardware solutions have been proposed, for example, many approaches for the implementation of F_p arithmetic and the ECC primitives point add and point double on reconfigurable hardware. A survey of hardware implementations can be found in [10]. McIvor et al. [11] propose a special ECC processor for F_p on a Virtex II Pro FPGA, calculating a 256-bit scalar multiplication in 3.86 ms using a clock frequency of 39.5 MHz. Orlando and Paar [12] achieve a scalar multiplication in 3 ms for a bit length of 192 on a Virtex-E FPGA. Güneysu and Paar present in [13] a very fast approach based on special DSP FPGA slices, achieving processing times of 620 μs for a 256-bit scalar multiplication on a Virtex-4 FPGA. The implementation presented here is based on an F_p ALU presented by Ghosh et al. in [14]. Implementation approaches on CMOS standard cells can be found, for example, in [15,16], achieving scalar multiplications of 256-bit length in 2.68 ms and 4.3 ms, respectively.
Nevertheless, open implementations of full signature processing units performing complete ECDSA are scarce. Järvinen and Skyttä [17] present a Nios II-based ECDSA system on an Altera Cyclone II FPGA for a key length of 163 bit, performing signature generation in 0.94 ms and verification in 1.61 ms.
This contribution presents an FPGA-based autonomous ECDSA system for longer key lengths of 256 bit, containing all necessary subsystems for application in embedded systems on reconfigurable hardware.

ECDSA Fundamentals
The Elliptic Curve Digital Signature Algorithm (ECDSA) is based on a group structure defined on an elliptic curve E over a finite field F_q. Two types of underlying finite fields are mostly used in practice: binary fields F_{2^n} of characteristic two and prime fields F_p with large primes p and corresponding characteristic. This paper focuses on prime fields F_p with characteristic char(F_p) > 3. In this case the group E and the respective operation are defined as follows.
Definition 1 (group operation on E). Let E be an elliptic curve over a finite field F_p of characteristic char(F_p) > 3 given by the Weierstrass equation
$$E: \; y^2 = x^3 + ax + b$$
with a, b ∈ F_p, 4a^3 + 27b^2 ≠ 0, and let P = (x_1, y_1), Q = (x_2, y_2) be points on E. A group is defined on E ∪ {O}, O being the special point at infinity, and the group law on E is defined by the following.
(i) P + O = O + P = P for all P ∈ E.
(ii) P + (−P) = O for all P ∈ E.
(iii) For P ≠ ±Q, the operation for R = P + Q = (x_3, y_3) is given by
$$x_3 = \lambda^2 - x_1 - x_2, \qquad y_3 = \lambda (x_1 - x_3) - y_1, \qquad \lambda = \frac{y_2 - y_1}{x_2 - x_1}.$$
(iv) For P = Q and P ≠ −P, there is R = 2P = (x_3, y_3) defined by
$$x_3 = \lambda^2 - 2x_1, \qquad y_3 = \lambda (x_1 - x_3) - y_1, \qquad \lambda = \frac{3x_1^2 + a}{2y_1}.$$
The set E ∪ {O} with the defined group law + is an abelian group with neutral element O. The inverse of a point P = (x, y) is given by −P = (x, −y).
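The group law of Definition 1 can be sketched in a few lines of Python. The curve used here (y^2 = x^3 + 2x + 2 over F_17 with the point (5, 1)) is a toy example chosen purely for illustration and is an assumption of this sketch, not one of the curves implemented in the system; the names `add`, `neg`, and `on_curve` are hypothetical.

```python
# Affine group law on y^2 = x^3 + A*x + B over F_P.
# Toy parameters (assumed for illustration only, not p224/p256).
P_MOD, A, B = 17, 2, 2
O = None  # the point at infinity


def on_curve(pt):
    """Check the Weierstrass equation for an affine point."""
    if pt is O:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0


def neg(pt):
    """Inverse of a point: -(x, y) = (x, -y)."""
    return O if pt is O else (pt[0], (-pt[1]) % P_MOD)


def add(p1, p2):
    """Point addition covering cases (i) to (iv) of the group law."""
    if p1 is O:                       # case (i)
        return p2
    if p2 is O:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O                      # case (ii): P + (-P) = O
    if p1 == p2:                      # case (iv): point doubling
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:                             # case (iii): general addition
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)
```

Note that both point operations need exactly one field division, which is why the divider dominates the cost of each point operation in affine coordinates.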
For the use of ECDSA, a set of common domain parameters needs to be known to all participants. These are the modulus p identifying the underlying field, the parameters a, b defining the elliptic curve E used, a base point G ∈ E, the order n of G, and the cofactor h = order(E)/n. In addition, a cryptographic hash function H is needed. The signature generation and verification for a key pair (Q, d), Q ∈ E being a point on the curve and d a scalar factor with Q = dG, can then be performed using the secret key d or the public key Q, respectively. The procedures needed are shown in Algorithms 1 and 2.
Algorithm 2 (signature verification). Input: domain parameters D = (q, a, b, G, n, h), public key Q, message m, signature (r, s). Output: acceptance or rejection of the signature.
To identify the most demanding operations, the algorithms were traced based on the hardware implementation presented in Section 5 and the possible parallelization. Of the total of 395,521 clock cycles needed for signature generation with the modulus p256 used (see Section 4), 99.8% or 394,752 cycles were spent computing the scalar multiplication kG. For signature verification, the share of cycles spent on the double scalar multiplication X = u_1 G + u_2 Q is even 99.9%. In the further consideration we therefore focus on these central operations.
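As a rough illustration of the generation and verification procedures, the following Python sketch implements textbook ECDSA on a toy curve. All parameters (y^2 = x^3 + 2x + 2 over F_17, base point (5, 1) of order 19) are assumptions made for readability and offer no security; the digest is simply reduced modulo n instead of being truncated as in the standard, and all function names are hypothetical.

```python
import hashlib
import secrets

# Toy domain parameters (assumed): curve y^2 = x^3 + 2x + 2 over F_17,
# base point G of prime order N. Far too small for real use.
P, A, G, N = 17, 2, (5, 1), 19


def add(p1, p2):
    """Affine point addition; None is the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P, -1, P) % P
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)


def mul(k, pt):
    """Plain double-and-add scalar multiplication (LSB first)."""
    acc = None
    while k:
        if k & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc


def digest(msg):
    """SHA-256 digest reduced mod N (simplification of this sketch)."""
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % N


def sign(d, msg):
    """Generate (r, s) with secret key d; retries if r or s is zero."""
    while True:
        k = secrets.randbelow(N - 1) + 1          # random k in [1, N-1]
        r = mul(k, G)[0] % N                      # scalar multiplication kG
        if r == 0:
            continue
        s = pow(k, -1, N) * (digest(msg) + d * r) % N
        if s != 0:
            return r, s


def verify(q, msg, sig):
    """Accept iff the x-coordinate of u1*G + u2*Q equals r mod N."""
    r, s = sig
    if not (1 <= r < N and 1 <= s < N):
        return False
    w = pow(s, -1, N)
    u1, u2 = digest(msg) * w % N, r * w % N
    x = add(mul(u1, G), mul(u2, q))               # double scalar multiplication
    return x is not None and x[0] % N == r
```

A key pair would be created as `d = 7; Q = mul(d, G)`, after which `verify(Q, m, sign(d, m))` is expected to hold for any message `m`.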

Setup and Situation
The objective of a digital signature is to guarantee authenticity and integrity of a signed message to the receiver and to prove the identity of the sender, including nonrepudiation. The usual method based on an asymmetric primitive like ECDSA contains three protocol steps. First the sender generates a key pair consisting of a secret signature key SK (d in the ECDSA case) and a public verification key VK (Q for ECDSA) and publishes VK to all possible verifiers. To sign a message m of arbitrary length, the sender generates a digest H(m) of the message using a publicly known cryptographic hash function H. This digest is of a fixed length and can be seen as a fingerprint of the message in the sense that finding a different message m' ≠ m with H(m) = H(m') is infeasible. This digest is then signed, meaning encrypted using the signing key SK of the sender, and sent along with the original plain text message m. The receiver or verifier is then able to verify the signature by decrypting the received hash value using the sender's public verification key VK and comparing the decrypted value to the output of H applied to the received plain message. If the two values match, the signature is positively verified (see Algorithm 3).
The security and correctness of the signature method are based on the assumption that a signed value (encrypted with the secret key) can only be verified (decrypted) with knowledge of the corresponding public key and vice versa and that the secret key cannot be computed from the public key. Secondly, the mapping of public keys to identities has to be guaranteed in some way. This is usually done using certification authorities as trusted third parties that verify the identity and issue a certificate for the public key.
We assume an embedded system communicating with several peers which are not entirely known in advance. Therefore, the exchanged signed messages are sent with a certificate attached that is issued by a commonly trusted certification authority. As an example scenario, vehicle-to-vehicle (V2V) communication is considered in Section 6.
This contribution focuses on prime field ECDSA as it is proposed for vehicle-to-vehicle communications, which is our general focus application (see also Section 6). In particular, two elliptic curves recommended by the U.S. National Institute of Standards and Technology (NIST) in [18] and Certicom Research in [19] are implemented, namely, the curves p224 (secp224r1) and p256 (secp256r1) with bit lengths 224 and 256, respectively, together with the corresponding domain parameters also given in the standard.
The proposed system works as a security subsystem exclusively performing signature processing and passing and receiving messages m to and from the external system.

Signature Processing System
Processing of ECDSA consists of several layers of computation. On the top level the signature generation and verification algorithms as well as the certificate validation are performed. This signature scheme-dependent layer is based on the group operations point add (PA) and point double (PD) in the underlying elliptic curve. These are in turn based on the underlying finite prime field (F_p) arithmetic, that is, modular arithmetic modulo a prime p. For the main operation of signature verification, the double scalar multiplication kG + rQ, the respective number of underlying operations needed on each layer to perform a single operation on the respective upper layer is given in Figure 1. In an even higher layer, there is also the communication protocol to consider, at least partially, as needed for the signature system.
The architecture and presentation of the system reflect this layering. The two upper layers are implemented as finite state machines (FSM) and make use of a basic F_p arithmetic logical unit (ALU) and some additional auxiliary modules. Figure 2 outlines the structure of the system. The different building blocks are detailed in the following paragraphs.

F_p Modular ALU.
The central processing is done by a specialized F_p ALU for primes of at most 256-bit length. It is based on the ALU proposed by Ghosh et al. in [14]. Figure 3 depicts the implemented structure. The ALU contains one F_p adder, subtractor, multiplier, and divider/inverter each. All registers and datapaths between the modules are 256 bit wide so that complete operands of up to 256-bit width (as in the p256 case) can be stored and transmitted within a single clock cycle. Four inputs, two outputs, and four combined operand/result registers as well as a flexible interconnect allow two operations to be started at the same time as long as they do not use the same basic arithmetic unit. The units perform operations independently, so that, using different starting points, parallel execution in all four subunits is possible. This allows parallelisation especially in the scalar multiplication (see Section 5.2.1). The F_p adder and subtractor each perform an operation in a single clock cycle as a general addition/subtraction with subsequent reduction. The F_p multiplier computes the modular multiplication iteratively as shift-and-add with reduction mod p in every step. It therefore needs |p| clock cycles for one modular multiplication, |p| being the bit length of the modulus and thereby also the maximum bit length of the operands.
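The shift-and-add multiplication with reduction in every step can be modelled in software as follows. This is a behavioural sketch under the assumption that the multiplier scans one operand from the most significant bit downwards; the function name and the bit order are illustrative and not taken from the hardware description.

```python
def modmul_shift_add(a, b, p, nbits):
    """Bit-serial modular multiplication a*b mod p.

    Requires 0 <= a, b < p. One loop pass models one clock cycle,
    so the loop runs nbits (= |p|) times.
    """
    acc = 0
    for i in reversed(range(nbits)):   # scan b from MSB to LSB
        acc <<= 1                      # shift: acc = 2 * acc
        if acc >= p:                   # acc < 2p, so one subtraction suffices
            acc -= p
        if (b >> i) & 1:
            acc += a                   # add the multiplicand
            if acc >= p:               # again acc < 2p
                acc -= p
    return acc
```

Because the accumulator is kept below p after every step, each reduction needs only a comparison and a single subtraction, which is what makes a one-step-per-cycle implementation plausible.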
Modular inversion and division is the most complex task of the ALU. It is based on a binary division algorithm on F_p; see [14] for details. The runtime depends on the input values, the maximum runtime being 2|p| clock cycles, in the p256 case therefore up to 512 cycles. Statistical analysis showed an average runtime of 1.5 · |p| clock cycles. ALU control is performed over multiplexer and module control wires and is implemented as a finite state machine presented in the following paragraph. The complete ALU allocates 14256 LUT/FF pairs in a Xilinx Virtex-5 FPGA and allows a maximum clock frequency of 41.2 MHz (after synthesis).
Table 1: Hardware execution of point addition.
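To illustrate why the runtime of a binary division depends on the operand values, the following sketch shows a common binary (extended-Euclid style) division over F_p that uses only shifts, additions, and subtractions. It is a generic textbook variant and is not claimed to be the exact algorithm of [14]; the function name is hypothetical.

```python
def moddiv_binary(y, x, p):
    """Compute y / x mod p for an odd prime p and 1 <= x < p.

    Invariants: x1 * x == u * y (mod p) and x2 * x == v * y (mod p).
    """
    def halve(t):
        # Division by two mod p: add p first if t is odd (p is odd).
        return t >> 1 if t % 2 == 0 else (t + p) >> 1

    u, v = x, p
    x1, x2 = y % p, 0
    while u != 1 and v != 1:
        while u % 2 == 0:              # strip factors of two from u
            u >>= 1
            x1 = halve(x1)
        while v % 2 == 0:              # strip factors of two from v
            v >>= 1
            x2 = halve(x2)
        if u >= v:                     # subtract the smaller from the larger
            u -= v
            x1 = (x1 - x2) % p
        else:
            v -= u
            x2 = (x2 - x1) % p
    return x1 if u == 1 else x2
```

Setting `y = 1` yields the modular inverse of `x`. The number of loop passes varies with the bit patterns of the operands, which matches the observation that only a maximum and an average cycle count can be given.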
In addition to the 256-bit arithmetic based on the modulus p256, the ECDSA unit also implements the arithmetic for modulus p224. This is done using the same hardware and is also implemented in the overlying FSM. Theoretically all moduli up to 256-bit width are supported by the ALU. Nevertheless, in the following, all given data refers to the 256-bit key case. Details on resource consumption and performance values are given in Section 7.

5.2. Elliptic Curve Processing.
On the elliptic curve E, addition of points is defined as the group operation. Doubling of a point is implemented separately as it requires a different computation, because general point addition is not defined for equal operands (see Section 3). A comprehensive introduction to elliptic curve arithmetic including algorithms can be found in [20]. To map the algorithms to the implemented specific ALU, the single operation steps have to be scheduled to the respective units. The operation schedules for point addition and point doubling for execution on the ALU are given in Tables 1 and 2.
In the tables, |p| stands for the bit length of the modulus p. In the case of p256, this means |p| = 256. The execution schedules map the operations to the executing units using three auxiliary registers t_1, t_2, t_3 for storing intermediate results. As can be seen in the tables, the third register t_3 is used only once in each point operation and reduces the cycle count in each case by one. If this additional clock cycle is accepted, one 256-bit register can be saved.

5.2.1. Scalar Multiplication on E.
Scalar multiplication in step 2 is the central operation of the signature generation of Algorithm 1. Computation is done iteratively using the so-called Montgomery ladder [21,22] shown in Algorithm 4.
The operations in the branches inside the for-loop, meaning steps 5 and 6 in the if-branch and steps 8 and 9 in the else-branch, respectively, can be executed in parallel. Since each branch consists of one point addition and one point doubling, a truly parallel execution on the ALU is possible using a tailored scheduling. Figure 4 depicts the implemented schedule. Although the computation of PA and PD is now done in parallel, a total of five registers for intermediate results is sufficient because the respective t_3 registers of PA and PD are not needed at the same time and can therefore be shared.
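The structure of the Montgomery ladder, with one point addition and one point doubling per scalar bit that do not depend on each other, can be sketched as below. The sketch starts from the point at infinity and processes all bits, which may differ in initialization and step numbering from the paper's Algorithm 4; the toy curve (y^2 = x^3 + 2x + 2 over F_17) and all names are assumptions for illustration.

```python
# Toy curve parameters (assumed): y^2 = x^3 + 2x + 2 over F_17.
P, A = 17, 2


def add(p1, p2):
    """Affine point addition/doubling; None is the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P, -1, P) % P
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)


def montgomery_ladder(k, pt, nbits):
    """Compute k*pt, scanning the nbits-bit scalar from MSB to LSB.

    Invariant: r1 - r0 == pt. In every iteration both right-hand sides
    are evaluated from the old (r0, r1), so the addition and the doubling
    are independent and could run in parallel in hardware.
    """
    r0, r1 = None, pt
    for i in reversed(range(nbits)):
        if (k >> i) & 1:
            r0, r1 = add(r0, r1), add(r1, r1)   # PA and PD
        else:
            r0, r1 = add(r0, r0), add(r0, r1)   # PD and PA
    return r0
```

The same amount of work is done for a zero bit and a one bit, which is the property that allows a fixed schedule of one PA/PD pair per iteration.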
The execution time using this schedule is 6|p| + 7 clock cycles for a single pair of point addition and point doubling. Compared to the (4|p| + 5) + (5|p| + 7) = 9|p| + 12 clock cycles needed for sequential processing of PA and PD, a performance gain of 33% can be achieved. The execution time for the complete scalar multiplication is therefore at most (|p| − 1) · (6|p| + 7) + (5|p| + 7) = 6|p|^2 + 6|p| clock cycles for the combination of point add and point double.
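The cycle counts above can be checked with a few lines of arithmetic. The sketch below only evaluates the formulas stated in the text; the conversion to milliseconds assumes the 50 MHz system clock reported in Section 7, and the helper names are hypothetical.

```python
def pa_cycles(n):
    """Sequential point addition for an n-bit modulus."""
    return 4 * n + 5


def pd_cycles(n):
    """Sequential point doubling for an n-bit modulus."""
    return 5 * n + 7


def pair_cycles(n):
    """One PA/PD pair using the parallel schedule."""
    return 6 * n + 7


def ladder_cycles(n):
    """Worst-case scalar multiplication: (n - 1) pairs plus one doubling."""
    return (n - 1) * pair_cycles(n) + pd_cycles(n)


n = 256
sequential = pa_cycles(n) + pd_cycles(n)          # 9n + 12
gain = 1 - pair_cycles(n) / sequential            # relative saving per pair
total = ladder_cycles(n)                          # equals 6n^2 + 6n
millis_at_50mhz = total / 50e6 * 1e3              # assumes a 50 MHz clock
print(sequential, round(gain * 100), total, round(millis_at_50mhz, 2))
```

For |p| = 256 the total evaluates to 394,752 cycles, the same figure that the trace in Section 3 attributes to the scalar multiplication kG.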

Double Scalar Multiplication.
For the verification of ECDSA signatures, two independent scalar multiplications have to be executed (see Algorithm 2, step 7). Instead of computing them independently in sequence, it is faster to compute them together using an approach originally proposed by Shamir (see [23]), also known as "Shamir's trick", shown in Algorithm 5.
In contrast to Algorithm 4, the central operations in steps 4 and 5 of Algorithm 5 cannot be parallelized as they depend directly on each other. Each iteration therefore costs a point doubling followed by a point addition in sequence, which bounds the maximum time consumption of the algorithm. The composition of the double scalar multiplication on the different levels of computation is shown in Figure 5.
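A minimal sketch of simultaneous double scalar multiplication ("Shamir's trick") is given below. It follows the common textbook formulation with a precomputed sum G + Q and may differ in detail from the paper's Algorithm 5; the toy curve (y^2 = x^3 + 2x + 2 over F_17) and the function names are assumptions for illustration.

```python
# Toy curve parameters (assumed): y^2 = x^3 + 2x + 2 over F_17.
P, A = 17, 2


def add(p1, p2):
    """Affine point addition/doubling; None is the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P, -1, P) % P
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)


def shamir(u1, g, u2, q, nbits):
    """Compute u1*g + u2*q with one shared chain of doublings."""
    table = {(1, 0): g, (0, 1): q, (1, 1): add(g, q)}  # g + q precomputed once
    r = None
    for i in reversed(range(nbits)):
        r = add(r, r)                                   # point doubling
        bits = ((u1 >> i) & 1, (u2 >> i) & 1)
        if bits != (0, 0):
            r = add(r, table[bits])                     # needs the doubled r
    return r
```

The addition in each pass consumes the result of the doubling in the same pass, so the two operations form a dependency chain and cannot share a parallel schedule the way the ladder steps do.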

Signature and Certificate Control System.
On top of the elliptic curve (EC) operations and the control FSM performing them, the actual signature algorithms and the certificate verification are implemented. This is done in a separate FSM, controlling the EC arithmetic FSM, some registers, and the auxiliary hashing and random number generation. Figure 6 shows the sequence of operations of the signature verification. See Algorithms 1 and 2 for the implemented procedures. This FSM is the uppermost layer of the signature module and provides a register interface for operands like messages, signatures, certificates, and keys. For integration in an embedded system, it has to be wrapped to support the message format and create the inputs to select the function needed. An example of an integration is given in Section 6.

SHA2 Hashing Module.
The SHA2 hashing unit provides the functions SHA-224 and SHA-256 according to the Secure Hash Algorithm (SHA) standard [24]. It is based on a freely available Verilog SHA-256 IP core (available as SHA IP Core at http://opencores.com/) adapted with a wrapper performing precomputation of the input data and providing a simple register interface accepting data in 32-bit chunks. In addition, the core has been enhanced to support SHA-224.
The unit processes input data in blocks of 512 bit, needing 68 clock cycles each, at a maximum clock frequency of 120 MHz (after synthesis) and with a resource usage of 2277 LUT/FF pairs. After finishing the operation, the result is available in a 256-bit output register.
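For a rough estimate of the hashing effort, the number of 512-bit blocks for a message follows from the standard SHA-2 padding (one marker byte plus an 8-byte length field). The sketch below combines this with the 68 cycles per block stated above; it assumes that padding is counted as part of the processed data, and the helper names are hypothetical.

```python
import hashlib

CYCLES_PER_BLOCK = 68  # per 512-bit block, as stated for the hashing unit


def sha2_blocks(msg_len_bytes):
    """Number of 64-byte blocks after padding (0x80 marker + 8-byte length)."""
    return (msg_len_bytes + 1 + 8 + 63) // 64


def hash_cycles(msg_len_bytes):
    """Estimated clock cycles to hash a message of the given length."""
    return CYCLES_PER_BLOCK * sha2_blocks(msg_len_bytes)


msg = b"example beacon payload"
print(hashlib.sha256(msg).hexdigest())   # 256-bit digest
print(hashlib.sha224(msg).hexdigest())   # 224-bit digest
print(sha2_blocks(len(msg)), hash_cycles(len(msg)))
```

Even for messages of a few hundred bytes this stays in the range of several hundred cycles, which is small compared to the scalar multiplications.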

Pseudorandom Number Generation.
For ECDSA signature generation, a random value k is needed. To provide this k, the system incorporates a Pseudorandom Number Generator (PRNG) consisting of two linear feedback shift registers (LFSRs), one with 256-bit length, feedback polynomial x^255 + x^251 + x^246 + 1, and a cycle length of 2^256 − 1, and a second LFSR with 224-bit length, feedback polynomial x^222 + x^217 + x^212 + 1, and a cycle length of 2^224 − 1, both taken from [25].
The LFSR occupies 480 LUT/FF pairs and allows a maximum clock frequency of 870 MHz, although it is operated at the general system clock of 50 MHz. It is operated continuously to reduce the predictability of the produced numbers. The current register content is read out on demand.
For further improvement of the security level, a True Random Number Generator (TRNG) could be integrated. An example implementation of an FPGA-based TRNG can be found in [26].
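The principle of a free-running LFSR that is sampled on demand can be illustrated with a small software model. To keep the full cycle checkable, the sketch uses an 8-bit register with the recurrence s[t+8] = s[t] + s[t+2] + s[t+3] + s[t+4] over GF(2), which corresponds to the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1; this size and polynomial are assumptions of the example and not the 256-bit or 224-bit configuration of the system. The class and method names are hypothetical.

```python
class FibonacciLFSR:
    """Right-shifting Fibonacci LFSR; the feedback bit enters at the top."""

    def __init__(self, nbits, tap_shifts, seed):
        if seed == 0:
            raise ValueError("the all-zero state never leaves zero")
        self.nbits = nbits
        self.tap_shifts = tap_shifts
        self.state = seed & ((1 << nbits) - 1)

    def clock(self):
        """Advance one clock cycle (done continuously in hardware)."""
        bit = 0
        for shift in self.tap_shifts:
            bit ^= (self.state >> shift) & 1
        self.state = (self.state >> 1) | (bit << (self.nbits - 1))

    def read(self):
        """Sample the current register content on demand."""
        return self.state


# 8-bit demo register (assumed parameters, see the text above).
lfsr = FibonacciLFSR(nbits=8, tap_shifts=(0, 2, 3, 4), seed=1)
period = 0
while True:
    lfsr.clock()
    period += 1
    if lfsr.read() == 1:
        break
print(period)
```

Because the register keeps running between requests, the value obtained depends on when it is read, which is the effect the continuous operation aims at. An LFSR remains a linear, predictable generator, however, which is why a TRNG is the stronger option.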

Certificate Cache.
Usually digital signatures or their respective public keys needed for verification are endorsed by a certificate issued by a trusted third party, a so-called certification authority (CA), to prove their authenticity. Verification of the certificate requires a signature verification itself and is therefore as complex as the main signature verification of the message. If several messages are exchanged with the same communication peer using the same signature key, the certificate can be stored, saving the effort of repeated verification.
The system incorporates a certificate cache for up to 81 certificates stored in two BRAM blocks. It can be searched in parallel with the signature verification (see Figure 6). Replacement of certificates is performed using a least recently used (LRU) policy.
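The behaviour of such a cache can be modelled in software as follows. The capacity of 81 entries is taken from the text; the class, its method names, and the use of a certificate identifier as key are assumptions of this sketch and say nothing about the BRAM organisation of the actual hardware.

```python
from collections import OrderedDict


class CertificateCache:
    """LRU cache mapping a certificate identifier to a verified public key."""

    def __init__(self, capacity=81):
        self.capacity = capacity
        self._entries = OrderedDict()   # least recently used entry first

    def lookup(self, cert_id):
        """Return the cached key or None; a hit refreshes the entry."""
        if cert_id not in self._entries:
            return None
        self._entries.move_to_end(cert_id)
        return self._entries[cert_id]

    def store(self, cert_id, public_key):
        """Insert a verified certificate, evicting the LRU entry if full."""
        if cert_id in self._entries:
            self._entries.move_to_end(cert_id)
        elif len(self._entries) >= self.capacity:
            self._entries.popitem(last=False)
        self._entries[cert_id] = public_key
```

On a hit, the certificate verification can be skipped and only the message signature has to be verified; on a miss, the certificate is verified first and then stored.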

Application Example
The system offers complete ECDSA signature and certificate handling and can be used in a variety of embedded systems requiring authentication and security of communication. As an application example, we show the integration into a vehicle-to-X (V2X) communication system. V2X communication is an emerging topic aiming at information exchange between vehicles on the road and between vehicles and infrastructure like roadside units [27]. This can be used to enhance safety on roads, optimize traffic flow, and help to avoid traffic congestion [28]. Usually two types of broadcast messages are used: a network beacon sent regularly with a frequency of 2-10 Hz containing status information of the sender, and additional event-triggered messages notifying about special events and situations. The latter messages can also be forwarded over several hops to reach receivers outside the direct wireless communication range. To be able to base decisions and applications on information received from other vehicles, trustworthiness of this information is mandatory. To ensure the validity and authenticity of information, signature schemes are used to protect the messages broadcast by the participating vehicles against malicious attacks [29,30]. As V2X communication is at present in the process of standardization, no fixed settings are available yet, but the use of ECDSA is proposed in the IEEE 1609 Wireless Access in Vehicular Environments (WAVE) standard draft [31] as well as in the proposals of European consortia [32], put together by the COMeSafety project [33].

International Journal of Reconfigurable Computing
In the chosen realization, V2X communication is performed by a modular FPGA-based On-Board Unit (OBU) presented in [34]; see Figure 7.
It consists of different functional modules connected by a packet-based on-chip communication system [35]. The signature verification system is integrated as a submodule and performs signature handling for incoming and outgoing messages automatically, being therefore transparent to the other modules except for the unavoidable processing latency. It is connected to two different on-chip communication systems, one transmitting unsecured messages over an 8-bit wide communication structure (BusNoC small), and the other (BusNoC wide) transmitting only secured messages containing signatures and certificates. These messages are larger because of the additional data, and the latter communication structure is therefore 32 bit wide. A short description of the security system and its system integration is given in [36]. Figure 8 depicts the wrapped signature system with the interfacing to both communication structures. This interfacing consists of a Direct Network Access (DNA) controller and two interfaces to the Network-on-Chip (NoC) communication structures. An 8-bit PicoBlaze processor controls and configures the components. The DNA controller manages the intramodular procedure and generates the input and control data to the encapsulated ECDSA module. The register set serves as data interface and buffer for intermediate results.
The signature system accepts incoming messages, verifies signatures and certificates, and passes only verified messages on to the Information Processing Module (IPM) for further processing. In case of an invalid signature, the outer system (IPM and Routing) is informed. For outgoing messages, signatures are generated, and the corresponding certificate is attached to the message, which is then passed on to the wireless interface.

6.1. Key Container.
In the V2X environment, privacy of participants is of major importance. As messages containing vehicle type and further information like current position, speed, and heading are continuously broadcast two to ten times a second, these messages could easily be used by an eavesdropper to trace participants. To counter such attempts, anonymity in the form of pseudonyms is used, which are changed on a regular basis. A number of pseudonyms for change are stored directly in the signature module's key container (see Figure 8). It also contains the public keys of trusted certification authorities needed for the verification of certificates. The change itself is triggered by a dedicated message sent to the signature processing system by the central information processing module of the V2X system. For all other modules this privacy function is fully transparent as well.

Caching of Certificates.
As V2X communication is not deployed in the fleet so far and realistic field tests with larger numbers of vehicles are only just beginning (e.g., simTD [39] in Germany), large-scale predictions of message numbers and network behaviour have to be based on simulations and estimations. For an estimation of the expected cache hit rate, results from the literature are used. Seada [40] shows, based on real-world measurements on American freeways, that the average communication time between two vehicles is approximately 65 seconds. Based on that, and assuming a beaconing frequency of 10 Hz and a sufficient cache size, the certificate has to be validated in only one out of 650 messages. In addition, pseudonym change has to be taken into account. Papadimitratos et al. [41] propose an exchange of pseudonyms every 60 seconds. Assuming stochastic independence of both values, a cache hit rate of 99.68% is possible. Since the communication is regular while the peer vehicle is in range, an LRU strategy is suitable. The required cache size depends strongly on the number of vehicles in range and should therefore be adapted to the expected situations.
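The quoted hit rate can be reproduced with a short calculation. The sketch assumes that a cache miss occurs once per contact period (one in 650 messages) and once per pseudonym period (one in 600 messages), and that both events are independent, so that the hit probabilities multiply; this reading is an interpretation that matches the stated 99.68%, and the function name is hypothetical.

```python
def cache_hit_rate(contact_s=65.0, pseudonym_s=60.0, beacon_hz=10.0):
    """Expected certificate cache hit rate for periodic beacons."""
    miss_new_peer = 1.0 / (contact_s * beacon_hz)      # 1 in 650 messages
    miss_pseudonym = 1.0 / (pseudonym_s * beacon_hz)   # 1 in 600 messages
    # Independent events: a hit requires that neither miss occurs.
    return (1.0 - miss_new_peer) * (1.0 - miss_pseudonym)


rate = cache_hit_rate()
print(f"{rate * 100:.2f} %")
```

With a lower beacon rate of 2 Hz the same model gives a noticeably lower hit rate, since fewer messages are exchanged per validated certificate.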

Resources and Performance
The presented system has been realized using a Xilinx XC5VLX110T Virtex-5 FPGA [42] on a Digilent XUP ML509 evaluation board [43]. The following values refer to an implementation of the complete signature generation and verification unit with interfacing for the application example given previously. Table 3 shows an overview of the resource usage.
After integration of all submodules, the ECDSA unit allows a maximum clock frequency of 50 MHz, which has been successfully tested. Table 4 shows signature verification performance values of the ECDSA unit at 50 MHz. Values for signature generation are given in Table 5.
In both tables the worst case values given are calculations based on the statistically estimated runtime of the algorithms for scalar multiplication. As these runtimes depend on the operand values, the measured average computation times are different.
Direct comparison of the system's performance is difficult, because implementations of complete ECDSA signature and verification units with certificate handling are scarce. We can therefore only compare the performance of the GF(p) processing unit, where values are available. Table 6 gives an overview in comparison to some of the implementations already presented in Section 2.

Discussion
The presented system implements the complete ECDSA signature processing in a modular way. As shown in the application example, it can be integrated as an autonomous subsystem to authenticate message traffic and provide verified information to the overlying system. In comparison to known full implementations (see Section 2), the system's performance of up to 110 verifications per second is one to two orders of magnitude better than software implementations on microcontrollers, providing sufficient performance for most applications. For high-performance applications like the V2X application example given in detail in Section 6, a still higher throughput of up to 1600 [41,44] or even over 2500 [45] signatures per second is needed, though. This can be achieved by a number of optimization steps; see Section 9.
The complete signature module from Section 6 is nevertheless prepared for further improvements. As can be seen in Figure 8, the ECDSA system is encapsulated as a submodule wrapped by the control and communication system that fits the external system structure. The ECDSA system can therefore easily be replaced by a faster system without having to adapt the overall system structure.

Conclusion and Further Work
We presented a hardware-implemented subsystem for ECDSA signature processing for integration into embedded systems based on reconfigurable hardware. It can be integrated as a stand-alone subsystem performing transparent authentication functionality for communication systems. Applicability of the system has been shown using vehicle-to-X communication as a practical example.
The performance values presented in Section 7 are sufficient for applications like entry control systems or electronic payment, where the number of communication peers is small. For V2X communication even larger throughput is necessary. Further work therefore includes speeding up the computation. Promising approaches that are subject to ongoing work here are the use of windowing techniques on the algorithmic level, the tailored use of optimized representations like projective coordinates on the mathematical level, and the speedup of the field operations on the implementation level, for example, by the use of hardware multipliers. The use of low-cost FPGAs and a reduction of the footprint are also required for use in embedded systems.

Figure 1: Execution layers of double scalar multiplication on E. On each layer the numbers of operations are given that are needed for a single operation on the respective upper layer.

Figure 3: Schematic overview of the F p ALU.

Figure 6: Procedure for signature and certificate verification on the implemented ALU. The blue states mark the main steps.

Figure 8: Wrapping of the signature system for V2X integration.
for details.The runtime depends on the input values,

Table 2: Hardware execution of point doubling.
Algorithm 4: Scalar multiplication in E. auxiliary register t 1 , t 2 , t 3 for storing intermediate results.As can be seen in the tables, the third register t 3 is used only once in each point operation and reduces the cycle count in each case by one.If this additional clock cycle is accepted, one 256-bit register can be saved.
three 5.2.1.Scalar Multiplication on E. Scalar multiplication in step 2 is the central operation of the signature generation of Algorithm 1.

Table 3: Resource usage on an XC5VLX110T with 69,120 LUTs.

Table 4: Performance of signature verification at 50 MHz.

Table 5: Performance of signature generation at 50 MHz.

Table 6: Performance comparison for signature verification and generation for ECDSA on GF(p). Values marked with an asterisk (*) are only for the core operations scalar multiplication and multiple scalar multiplication, respectively, without any pre- and postprocessing or hashing.