A High-Speed Elliptic Curve Cryptography Processor for Teleoperated Systems Security

Teleoperated robotic systems are those in which human operators control remote robots through a communication network. The deployment and integration of teleoperated robotic systems in medical operations have been hampered by many issues, such as safety concerns. Elliptic curve cryptography (ECC), an asymmetric cryptographic algorithm, is widely used in practical applications because it provides the same level of security as RSA with a significantly shorter key. The efficiency of ECC over GF(p) is dictated by two critical factors, namely, modular multiplication (MM) and point multiplication (PM) scheduling. In this paper, a high-performance ECC architecture for SM2 is presented. MM is composed of multiplication and modular reduction (MR) in the prime field. A two-stage modular reduction (TSMR) algorithm in the SCA-256 prime field is introduced to achieve low latency; it avoids most of the iterative subtraction operations required by traditional algorithms. To cut down the run time, a schedule is put forward that exploits the parallelism of multiplication and MR inside PM. Synthesized with a 0.13 µm CMOS standard cell library, the proposed processor occupies 341.98k gates, and each PM takes 0.092 ms.


Introduction
In teleoperated robotic systems, human operators, often geographically distant, interact with and control robots through a communication network. Teleoperated robotic systems have many applications, such as bomb disposal, search and rescue, robotic surgery, and medical operations. Teleoperated robotic surgery is a particularly important medical application: expert surgery can be performed remotely, without the surgeon being physically present. It is expected to have a significant impact on the quality of medical services in isolated regions, battlefields, and disaster areas. With the development of teleoperated systems and robots, the deployment and integration of teleoperated robots in medical operations have encountered many problems, such as safety concerns [1], time delay [2], and bilateral control [3]. Security is one of the biggest issues hampering the deployment and integration of teleoperated robots, and some work has addressed it [4].
Telerobotic surgery is expected to be employed in extreme conditions, where teleoperated robots may have to operate in harsh and low-power conditions, connecting to the Internet with potential loss. As depicted in Figure 1, the last communication link may even be a wireless link to a drone or a satellite, providing the connection to a trusted facility (possibly a large hospital with an established infrastructure) [5].
In such operating conditions, the security of the long-range control is significant: if teleoperated robots are attacked by hackers, the loss of proper control might cause damage. Besides, it is necessary to verify that these requirements are established and maintained during a teleoperated procedure [6].
In harsh conditions, low power consumption and low time delay are critical. Hence, security processes such as digital signature/verification and encryption/decryption should be accelerated in hardware. Compared with software implementation, hardware implementation has many advantages, such as high efficiency, low power consumption, and safety. ECC is a public key cryptography algorithm that can provide these security processes, proposed in 1986 by Miller [7] and Koblitz [8]. It has been demonstrated to be an alternative to the classical RSA [9] thanks to its significantly reduced key lengths [10]. ECC with 160-256-bit keys provides security similar to that of RSA or discrete logarithm schemes over finite fields with 1024-4096-bit keys [11]. SM2, as an ECC algorithm, was included in ISO/IEC 14888-3/AMD1 in November 2017.
Considerable efforts have been made to implement ECC in hardware, as can be seen in [12][13][14][15][16][17][18][19][20][21][22], in which the MM operation is widely used for PM. The designs proposed to accelerate MM can be classified into three categories [23]: (1) the recommended prime modular multiplication algorithm, (2) the Montgomery multiplication algorithm, and (3) the interleaved modular multiplication algorithm. Among those three categories, the first is the fastest, but it is limited to specific prime fields, such as the NIST primes and SCA-256. The architecture in [12] employs Montgomery multipliers ranging from 8-bit × 8-bit to 64-bit × 64-bit, aiming to improve area efficiency and reduce delay at the cost of speed. The designs in [9,20] are based on the recommended prime modular multiplication algorithm. However, those MR algorithms contain only one stage, which generates an intermediate result Z, such as Z ∈ [0, 14p) in [9] and Z ∈ [−4p, 5p) in [20]. An extra calculation is then required to get the final result Z ∈ [0, p). Notably, the architecture in [9] adopts a full-word 256-bit × 256-bit multiplier, and all the calculations are executed in the SCA-256 prime field. In the MR operation of the design in [9], 13 subtractions are needed in the worst case to bring the intermediate value Z (0 ≤ Z < 14p) to the final value, which incurs a large latency.
Traditional software implementations of cryptographic algorithms incur larger time delay and power consumption, whereas hardware implementation can resolve these issues. Motivated to provide highly efficient safety assurance for teleoperated systems, we realize ECC in hardware. The main contributions of this paper include the following:
(1) We propose a high-performance hardware processor, which adopts a half-word multiplier to improve performance while reducing hardware consumption. Compared with most of the other works, it has a better trade-off between performance and hardware overhead.
(2) The TSMR algorithm in SCA-256 is proposed to achieve low latency. The algorithm obtains the intermediate result Z ∈ [0, 2p), which requires at most one subtraction to get the final result Z ∈ [0, p). Compared with the traditional method [9], which obtains the intermediate result Z ∈ [0, 14p), our method avoids many subtractions when computing the final result. The TSMR algorithm is implemented with a carry-save adder architecture to reduce latency and hardware overhead. Combined with the Karatsuba-Ofman (KO) multiplication algorithm and pipeline design, MM requires an average of five clock cycles, even though one clock cycle for modular reduction and five clock cycles for multiplication are required.
The arrangement of this paper is as follows. In Section 2, the elliptic curve and PM are introduced. In Section 3, the high-performance architecture is illustrated. Then, the proposed method is implemented and validated in Section 4. Finally, in Section 5, the conclusion of this work is provided.

Elliptic Curve.
A nonsupersingular elliptic curve (EC) over GF(p) is defined as the set of points (x, y) that satisfy equation (1), also known as the Weierstrass equation, together with a point at infinity:

$$y^2 \equiv x^3 + ax + b \pmod{p}, \tag{1}$$

where a and b are the parameters identifying the EC, which satisfy $4a^3 + 27b^2 \not\equiv 0 \pmod{p}$.

Point Multiplication.
PM describes the operation in which k identical EC points are added together, denoted as a scalar times an EC point, "kP," where $k = (k_{l-1} \cdots k_0)$ and l represents the binary length of k. In this work, the width-w NAF addition-subtraction method [24], given in Algorithm 1, is applied to point multiplication. PM is the elemental operation of ECC and is performed as a sequence of elliptic curve additions (ECADD) and elliptic curve doublings (ECDBL). Let $P_1$, $P_2$, and $P_3$ be EC points; ECADD is defined as $P_3 = P_1 + P_2$ and ECDBL is defined as $P_3 = 2P_1$. To avoid the time-consuming modular inversion/division operation, ECADD is computed in mixed affine-Jacobian coordinates, where it is most efficient, while ECDBL is computed in Jacobian coordinates [25].
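As an illustration of the width-w NAF addition-subtraction method, the following Python sketch recodes a scalar and then evaluates kP. It is the textbook form of the method and is not a transcription of Algorithm 1. The function names `wnaf` and `scalar_mul_toy` are hypothetical, and plain integers stand in for EC points (doubling is 2x, addition is +) so that the control flow can be checked without any curve arithmetic.

```python
def wnaf(k, w):
    """Return the width-w NAF digits of k, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)            # window of the w low bits
            if d >= (1 << (w - 1)):     # map to the signed odd range
                d -= 1 << w
            k -= d                      # the next w-1 digits become zero
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def scalar_mul_toy(k, point, w=4):
    """Evaluate k*point with wNAF; integers stand in for EC points (assumption)."""
    # Precomputed odd multiples: 1P, 3P, 5P, 7P for w = 4.
    table = {d: d * point for d in range(1, 1 << (w - 1), 2)}
    acc = 0                             # stands in for the point at infinity
    for d in reversed(wnaf(k, w)):
        acc = 2 * acc                   # one ECDBL per digit
        if d > 0:
            acc += table[d]             # ECADD
        elif d < 0:
            acc -= table[-d]            # ECADD with the negated point
    return acc
```

With w = 4 the nonzero digits are restricted to ±1, ±3, ±5, and ±7, and on average one digit in w + 1 is nonzero, which is where the m/(w + 1) count of additions comes from.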
ECADD in mixed affine-Jacobian coordinates and ECDBL in Jacobian coordinates are given in the two following equations:
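As a hedged illustration, the sketch below uses the standard textbook formulas for Jacobian doubling and mixed affine-Jacobian addition; the operation order in the paper's own equations may differ. The curve is a toy one over GF(97) with a = 2 and b = 3 (an assumption chosen for readability, not the SM2 curve), and all function names are hypothetical. An affine reference implementation is included so that the projective results can be cross-checked.

```python
P = 97          # toy prime (assumption; the paper works in the 256-bit SCA-256 field)
A, B = 2, 3     # toy curve y^2 = x^3 + 2x + 3 over GF(97)


def dbl_jacobian(X1, Y1, Z1):
    """ECDBL in Jacobian coordinates (no inversion)."""
    S = 4 * X1 * Y1 * Y1 % P
    M = (3 * X1 * X1 + A * pow(Z1, 4, P)) % P
    X3 = (M * M - 2 * S) % P
    Y3 = (M * (S - X3) - 8 * pow(Y1, 4, P)) % P
    Z3 = 2 * Y1 * Z1 % P
    return X3, Y3, Z3


def add_mixed(X1, Y1, Z1, x2, y2):
    """ECADD of a Jacobian point and an affine point (no inversion)."""
    ZZ = Z1 * Z1 % P
    H = (x2 * ZZ - X1) % P
    R = (y2 * ZZ * Z1 - Y1) % P
    assert H != 0, "sketch only: the cases P1 = P2 and P1 = -P2 are not handled"
    HH = H * H % P
    HHH = HH * H % P
    V = X1 * HH % P
    X3 = (R * R - HHH - 2 * V) % P
    Y3 = (R * (V - X3) - Y1 * HHH) % P
    Z3 = Z1 * H % P
    return X3, Y3, Z3


def to_affine(X, Y, Z):
    """Convert (X, Y, Z) to (X/Z^2, Y/Z^3) using a single inversion."""
    zi = pow(Z, P - 2, P)               # Fermat inversion, valid for prime P
    return X * zi * zi % P, Y * zi * zi * zi % P


def affine_add(p1, p2):
    """Affine reference (one inversion per call), used only for checking."""
    (x1, y1), (x2, y2) = p1, p2
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, P - 2, P) % P
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, P - 2, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return x3, (lam * (x1 - x3) - y1) % P


def find_point():
    """Brute-force search for a curve point with y != 0 (toy field only)."""
    for x in range(P):
        rhs = (x ** 3 + A * x + B) % P
        for y in range(1, P):
            if y * y % P == rhs:
                return x, y
```

The projective routines use only multiplications, additions, and subtractions, so inversion is deferred to the final conversion back to affine coordinates.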

High-Performance Architecture of SM2
The PM architecture based on full-word multipliers is described below. TSMR and full-word multiplication constitute MM, while the binary modular inversion algorithm in [26] is applied to execute the modular inversion (MI) operation.

Modular Reduction.
SCA-256 has the characteristic that its prime can be written as $p = 2^{256} - 2^{224} - 2^{96} + 2^{64} - 1$. The traditional MR for SCA-256 [9] is given in Algorithm 2. After the fast reduction operation, the intermediate value can be represented as $Z = s_1 + s_2 + 2s_3 + 2s_4 + 2s_5 + s_6 + s_7 + s_8 + s_9 + 2s_{10}$, where $Z \in [0, 14p)$. It costs at most 13 subtractions to get the final result $Z \in [0, p)$. Since the modular reduction is to be computed in a single clock cycle, the repeated subtractions significantly increase latency and consume considerable hardware resources. A TSMR algorithm on SCA-256 is proposed in this paper to address this problem (Algorithm 3). The first stage takes sixteen addition/subtraction operations to calculate $Z_1$, while the second takes just two to calculate $Z_2$. The intermediate value after the two-stage fast reduction is $Z_2 \in [0, 2p)$, and at most one subtraction is needed to obtain the final value $Z \in [0, p)$.
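The two-stage idea can be sketched in Python as follows. This is not a transcription of Algorithm 3. It is a model derived only from the form of p: stage one folds the eight high 32-bit words of a 512-bit product into the low 256 bits using $2^{256} \equiv 2^{224} + 2^{96} - 2^{64} + 1 \pmod p$, and stage two folds the small overflow once more. The names `fold_table` and `tsmr_sketch`, and the choice to add p once in stage one to keep the value non-negative, are assumptions of this sketch.

```python
P = 2**256 - 2**224 - 2**96 + 2**64 - 1   # the SCA-256 prime stated in the text
MASK256 = (1 << 256) - 1


def fold_table():
    """Coefficients of 2^(32*i) mod P for i = 8..15 over word positions 0..7."""
    v8 = [1, 0, -1, 1, 0, 0, 0, 1]        # 2^256 = 1 - 2^64 + 2^96 + 2^224 (mod P)
    rows, cur = [], v8
    for _ in range(8):
        rows.append(cur)
        carry = cur[7]                    # coefficient shifted into position 8
        shifted = [0] + cur[:7]           # multiply by 2^32
        cur = [s + carry * k for s, k in zip(shifted, v8)]
    return rows


TABLE = fold_table()


def tsmr_sketch(x):
    """Reduce x (0 <= x < P*P) mod P; returns (result, stage-two value)."""
    assert 0 <= x < P * P
    words = [(x >> (32 * i)) & 0xFFFFFFFF for i in range(16)]

    # Stage 1: fold the high words with small signed coefficients.
    z1 = x & MASK256
    for i, row in enumerate(TABLE):
        for j, coeff in enumerate(row):
            z1 += (coeff * words[8 + i]) << (32 * j)
    z1 += P                               # assumption: offset by p to stay non-negative

    # Stage 2: fold the few overflow bits above 2^256 once more.
    q, r = z1 >> 256, z1 & MASK256
    z2 = r + q * ((1 << 256) - P)

    # Final step: a single conditional subtraction.
    return (z2 - P if z2 >= P else z2), z2
```

In this model the stage-one value stays below 16 · 2^256, so the overflow q is at most 15 and the stage-two value remains below 2p, which is why one conditional subtraction suffices.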
In ECADD and ECDBL operations, the result of a modular addition (MA) or modular subtraction (MS) is often needed by the MM operation that follows. One clock cycle can be saved each time such an MA/MS is merged into the reduction. The maximum delay of carry-save addition depends only on the final carry propagation. Therefore, adding one more value to the other twenty values does not have a large impact on latency. As shown in Algorithm 3, operand a of the previous MA/MS is added in the reduction to give (c + a) mod p. In Algorithm 5 proposed below, such an operation appears twice in ECADD (Step 9: T2T2-T4, Step 11: T1T2-T4) and twice in ECDBL (Step 6: T2T2-T1, Step 8: T1T2-T5). In total, m/(w + 1) × 2 + m × 2 = 256/(4 + 1) × 2 + 256 × 2 ≈ 614 clock cycles are saved.

Carry-Save Adder Architecture.
In the TSMR algorithm, there are five subtraction operations in $Z_1$ and one in $Z_2$. To reduce area consumption and clock latency, a new carry-save adder (CSA) architecture is presented for Algorithm 3; its main advantage is that it can handle subtraction operations. A subtraction becomes an addition by using the complement of the subtrahend. The first-stage reduction result $Z_1$, $0 \le Z_1 < 16p$, is designed as 261-bit data, and it is computed from 21 operands with 20 256-bit CSAs. Because each of the five subtrahends' complements carries an extended sign bit, as shown in Figure 2, the 20 most significant bits (MSBs) of the CSAs cannot be accumulated directly, and a CSA of 261 or more bits is not used. As shown in Figure 2, the MSB of $Z_1$, $Z_1[261]$, cannot be obtained from the sum of sc14[260] to sc21[260]. However, the 256th to 260th bits of each subtrahend's complement are set to 1, while the 257th to 261st bits of each addend are set to 0. The sum of the 256th to 260th bits of the subtrahends' complements can therefore be precalculated as 5 × 5'b11111 = 8'b10011011. Only the low 5 bits (5'b11011) are needed, and they can be placed in row 1 of the 1-bit CSA. In this way, the proposed CSA also handles the subtraction operations.
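The behaviour of a carry-save adder tree that absorbs subtractions through complements can be modelled in a few lines of Python. This is a generic 3:2 compressor model under assumed names (`csa`, `csa_sum`), not the layout of Figure 2: each subtrahend is replaced by its two's complement at a fixed width, the operands are compressed three at a time into a sum word and a carry word, and only the last two words go through a carry-propagating addition.

```python
def csa(a, b, c, mask):
    """3:2 compressor: a + b + c == s + cy (mod 2^width), no carry propagation."""
    s = a ^ b ^ c                                  # bitwise sum
    cy = ((a & b) | (a & c) | (b & c)) << 1        # majority, shifted left by one
    return s & mask, cy & mask


def csa_sum(addends, subtrahends, width):
    """Sum addends minus subtrahends modulo 2^width using a CSA chain."""
    mask = (1 << width) - 1
    ops = [x & mask for x in addends]
    # Subtraction as addition of the two's complement at the chosen width.
    ops += [(~x + 1) & mask for x in subtrahends]
    while len(ops) > 2:
        s, cy = csa(ops[0], ops[1], ops[2], mask)
        ops = ops[3:] + [s, cy]
    return sum(ops) & mask                         # single carry-propagate add


# The precalculated sign-extension term from the text: five 5-bit all-ones fields.
SIGN_SUM = 5 * 0b11111
LOW5 = SIGN_SUM & 0b11111
```

Because the upper bits of every complement are known constants, their total can be computed once at design time, which is the role of the precalculated 5-bit value in the text.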

Implementation and Validation
The architecture described above is implemented in Verilog HDL. It is synthesized using Design Compiler with the SMIC 130 nm CMOS standard cell library and is evaluated in terms of 2-way NAND gates. In addition, for comparison with other designs on an FPGA platform, it is also implemented on a Xilinx Virtex-6 xc6vlx760 using Xilinx ISE 14.7. The performance is obtained by ModelSim simulation. The test data conform to the ECC cryptography protocol and are randomly generated. For a hardware design, performance and hardware consumption are the two main evaluation metrics. Besides, the time-area product is a metric for validating the trade-off between performance and hardware consumption.
With the window NAF recoding method, the time for executing point multiplication is denoted as

$$\frac{m}{w+1} A + m D,$$

where $m = \log_2 p$; w refers to the width of NAF; A is the number of cycles that ECADD requires, while D is the cycle consumption of ECDBL. In this work, w is set to 4. The points 1P, 3P, 5P, and 7P are precalculated. Table 1 shows the clock cycles required by each operation. With a fixed point, the PM operation using NAF4 recoding of the scalar k takes an average of 14242 cycles over 1000 tests. After the PM operation, two MI operations are required for the coordinate conversion from Jacobian coordinates to affine coordinates.
Table 2 shows the comparison with other designs over GF(p) with a 256-bit field order. The architecture in [9] uses 256-bit multipliers. As a result, its area is large, at 659 K gates. Because it consumes so many hardware resources, it is not suitable for teleoperated robots. The architecture in [18] relies on two multiplier units using interleaved modular multiplication algorithms. Hence, it features a smaller area but worse computation efficiency. The proposed design is 32.7 times faster than that in [18]. The architecture in [22] adopts a systolic arithmetic unit and obtains a smaller area but takes more clock cycles. The area-time product (AT) of our architecture is smaller than those of [18,22]. The design in [28] adopts projective coordinates to avoid MI and employs a radix-2 modular multiplication algorithm for MM. In [29], Shah et al. presented a high-speed processor based on redundant signed digit (RSD) arithmetic to prevent lengthy carry propagation delay. It is able to run at a high frequency of 327 MHz and requires 0.47 ms to perform a single PM operation. The architecture in [11] uses half-word multipliers based on the Barrett modular multiplication algorithm. In [19], a unified architecture for computing MA, MS, and MM is proposed. The designs in [30,31] apply only adders, which results in worse performance than ours.
The radix-4 Booth-encoded interleaved modular multiplication algorithm is adopted in [30,31]. Besides, the NAF point multiplication algorithm is applied in [31], while the double-and-always-add point multiplication algorithm is employed in [30]. As NAF2 reduces the PM complexity from (m/2 × A + m × D) to (m/3 × A + m × D), the design in [30] takes more LUTs to reach a clock cycle count comparable to that of the design in [31] on the same platform. The architecture proposed here needs fewer clock cycles and performs point multiplication faster than the architectures in [11,18,19,21,30,31].
Security is one of the most important issues in teleoperated robotic systems. In harsh conditions, time delay and power consumption are important, so implementing cryptographic algorithms in hardware has become imperative. The ECC processor proposed here is implemented in hardware and provides high performance. The most complicated operations, such as PM, PA, and modular operations, are implemented by the proposed hardware, and this hardware module can be called by software to realize digital signature/verification and encryption/decryption, thereby resolving the safety issue of teleoperated systems.
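The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (m = 256, w = 4, an average of 14242 cycles per PM, and the 153.8 MHz clock given in the conclusion). The helper `pm_cycles` and its parameters are hypothetical, and the per-operation cycle counts A and D must come from Table 1, which is not reproduced here.

```python
M_BITS = 256            # m, the field size in bits
W = 4                   # NAF window width used in this work
CYCLES_PER_PM = 14242   # average PM cycle count reported in the text
F_HZ = 153.8e6          # clock frequency reported in the conclusion


def pm_cycles(a_cycles, d_cycles, m=M_BITS, w=W):
    """Window-NAF cost model: m/(w+1) ECADDs plus m ECDBLs."""
    return m / (w + 1) * a_cycles + m * d_cycles


n_add = M_BITS / (W + 1)            # expected number of ECADDs
n_dbl = M_BITS                      # number of ECDBLs
saved = pm_cycles(2, 2)             # two cycles saved per ECADD and per ECDBL
t_ms = CYCLES_PER_PM / F_HZ * 1e3   # PM latency in milliseconds

print(f"ECADDs ~ {n_add:.1f}, ECDBLs = {n_dbl}")
print(f"cycles saved by merging MA/MS ~ {saved:.0f}")
print(f"PM time ~ {t_ms:.4f} ms")
```

Dividing the average cycle count by the clock frequency gives about 0.0926 ms, in line with the 0.092 ms figure quoted for one PM.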

Conclusion
In a teleoperated system, robots interact with and are controlled by human operators through a communication network. Therefore, security becomes an important issue, and ECC is a good choice among the different cryptographic algorithms owing to its shorter key length. In this work, a high-performance ECC architecture for SM2 is proposed, which is suitable for securing teleoperated robots. To reduce the latency caused by iterated subtractions, a TSMR algorithm on SCA-256 is presented. Its intermediate result Z ∈ [0, 2p) improves on the Z ∈ [0, 14p) of traditional algorithms, and it is implemented with a carry-save adder architecture that also handles subtraction. To balance area and performance, a half-word multiplier is adopted, together with a pipeline design that enhances the calculation parallelism. The experimental results show that the proposed design takes 0.092 ms to perform a 256-bit PM at a frequency of 153.8 MHz and occupies 341.98 k gates. Furthermore, the implementation results indicate that the proposed architecture has better performance and a smaller AT than previous works.
In the future, the optimization of modular multiplication will be studied to further reduce the hardware overhead. The portability of the hardware modules and software-hardware codesign will be further studied to extend the application fields. Resistance to attacks is another interesting direction worth studying.
Data Availability
The raw/processed data required to reproduce these findings cannot be shared at this time as the data also form part of an ongoing study.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Mathematical Problems in Engineering