In the emerging IoT ecosystem, in which internetworking will reach an entirely new dimension, efficient security solutions for embedded devices will play a crucial and undisputed role. IoT-enabled devices are typically equipped with integrated circuits, such as ASICs or FPGAs, to carry out highly specific tasks. Such devices must implement cryptographic layers and must be able to access cryptographic functions for encrypting/decrypting and signing/verifying data using various algorithms, as well as for generating true random numbers, random primes, and cryptographic keys. Because typical IoT devices have limited resources owing to energy efficiency requirements, hardware structures that are efficient in terms of time, area, and power consumption must be deployed. In this paper, we describe a scalable, word-based, multivendor-capable cryptographic core that performs arithmetic operations in prime and binary extension finite fields based on Montgomery Arithmetic. The functional range comprises modular additions and subtractions, the determination of the Montgomery Parameters, and the execution of Montgomery Multiplications and Montgomery Exponentiations. A prototype implementation of the adaptable arithmetic core is detailed. Furthermore, the decomposition of cryptographic algorithms for use with the proposed core is described, and a performance analysis is given.

The next generation of embedded systems and IoT devices will exhibit a much higher degree of internetworking which gives rise to security considerations [

With regard to algorithm agility, an arithmetic engine with a minimal hardware footprint that can handle the arithmetic operations of a wide variety of cryptographic algorithms is of great importance for IoT devices. In particular, the computation time of the individual operations should be predictable, so that lower and upper bounds on the calculation time can be stated.

This paper proposes a compact, vendor-neutral cryptographic arithmetic core, implemented by way of example in FPGA logic. For time-intensive modular operations, such as multiplication and exponentiation, Montgomery Arithmetic is used for efficiency. Without the need for any expensive software precalculations, the core is able to perform a large number of cryptographic algorithms and handle various key sizes by simply processing operation lists. Furthermore, the core architecture is unified and can perform calculations in both prime finite fields (

The paper is organized as follows. Section

The efficiency of cryptographic algorithms implemented on reconfigurable hardware is mainly determined by how the underlying finite field arithmetic operations are realized [

In [

However, no publication focuses on how the Montgomery Multiplication architecture can be embedded into a comprehensive solution. In this paper, we propose an enhanced version of a bit-serial, word-based, unified Montgomery Multiplication core built from logic elements only. It is controlled by a state machine and offers the functional range needed to perform complete cryptographic algorithms without additional complex processing in software.
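As a point of reference for the terminology used here, a bit-serial (radix-2) Montgomery Multiplication in a prime field can be sketched in a few lines of Python. This is a software illustration only, not the proposed hardware design: the function names `mont_mul`, `mont_param`, and `mont_exp`, the parameter `k` (operand bit-length), and the derivation of the Montgomery Parameter by repeated modular doubling are assumptions made for this sketch.

```python
def mont_mul(a: int, b: int, n: int, k: int) -> int:
    """Return a * b * 2^(-k) mod n for odd n < 2^k and 0 <= a, b < n."""
    s = 0
    for i in range(k):
        # Scan the multiplier one bit per iteration (bit-serial).
        if (a >> i) & 1:
            s += b
        # Add the modulus whenever the sum is odd, so the halving is exact.
        if s & 1:
            s += n
        s >>= 1
    # A single conditional subtraction brings the result into [0, n).
    return s - n if s >= n else s


def mont_param(n: int, k: int) -> int:
    """Return 2^(2k) mod n using modular doublings only (assumed method)."""
    x = 1 % n
    for _ in range(2 * k):
        x += x
        if x >= n:
            x -= n
    return x


def mont_exp(base: int, exp: int, n: int, k: int) -> int:
    """Return base^exp mod n via left-to-right square-and-multiply."""
    r2 = mont_param(n, k)
    x = mont_mul(base % n, r2, n, k)   # base in the Montgomery domain
    acc = mont_mul(1, r2, n, k)        # 1 in the Montgomery domain
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, k)
        if bit == "1":
            acc = mont_mul(acc, x, n, k)
    return mont_mul(acc, 1, n, k)      # leave the Montgomery domain
```

Note that the number of multiplications in `mont_exp` depends on the exponent bits, so this form is not time-invariant.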

Today a large number of different public-key algorithms are in use. To ensure compatibility, cryptographic applications must support a large portion of those algorithms. While typical software implementations can often be upgraded easily in order to adopt new algorithms and larger key sizes, the same is not necessarily true for hardware implementations. Therefore, the following requirements have been identified for the Enhanced Montgomery Multiplication Core:

Figure

Overall architecture of the Enhanced Montgomery Multiplication Core.

Besides the pipeline of processing units handling the main part of the word-based Montgomery Multiplication algorithm, the core features an enhanced word-based Carry Look-Ahead adder being responsible for the calculation of the final result after the pipeline has processed all bits of an operand as well as for performing single modular addition and subtraction operations. The register files of the original design have been replaced with an internal dual-ported RAM which holds the operands as well as intermediate results of the core operations. Furthermore a word-based comparator component has been described which is queried during operations to decide if a modular addition or subtraction step must be performed. Two additional

The intelligence of the core is the controlling state machine which utilizes the defined components to perform standard modular addition and subtraction operations, Montgomery Multiplications, Montgomery Exponentiations, Montgomery Parameter calculation, and RAM reorganisation operations. Therefore it is responsible for controlling the RAM write and read access, the source and destination address signals of RAM, as well as the values passed through to the first processing unit, to the CLA adder, and to the comparator component. Furthermore it controls the assignments of

The described core can be parametrised in three ways. The parameter named

The heart of the core is the pipeline of processing units implementing the multiple word version of the Montgomery Multiplication algorithm. Therefore the processing unit structure has been described from scratch. The processing unit can be held in reset and keeps track of the cycle number according to the number of words to be processed depending on the supplied parameters. This control logic is needed to determine whether the supplied modulus has to be added to the processed words in this cycle or not, depending on the value of the signal

Processing unit with word size

Each processing unit consists of a cascade of two layers of so-called Unified Full Adder (UFA) cells. The Unified Full Adder cells basically consist of simple full adder cells which have been enhanced by an additional finite field selection input

Since the pipeline generates the result in carry save form, an additional step is necessary at the end of each calculation to obtain a nonredundant version of the result. For the sake of uniformity a circuit is required that can operate in both finite fields

Enhanced n-bit wide CLA adder.
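The idea of one adder circuit serving both finite fields can be illustrated with a small bit-level model. The sketch below is an assumption-laden simplification: it uses a ripple-carry chain instead of carry look-ahead logic, and the name `fsel` and its polarity (1 selects prime field arithmetic, 0 selects binary field arithmetic) are chosen for illustration only.

```python
def unified_full_adder(a: int, b: int, cin: int, fsel: int):
    """One-bit full adder whose carry output is gated by a field select bit."""
    s = a ^ b ^ cin
    # With fsel = 0 the carry is suppressed, so addition degenerates to XOR,
    # which is exactly addition of polynomial coefficients in GF(2^m).
    cout = fsel & ((a & b) | (a & cin) | (b & cin))
    return s, cout


def dual_field_add(x: int, y: int, width: int, fsel: int):
    """Add two width-bit words; return (sum word, carry out)."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = unified_full_adder((x >> i) & 1, (y >> i) & 1, carry, fsel)
        result |= s << i
    return result, carry
```

With `fsel = 1` the function behaves like an ordinary integer adder with carry out; with `fsel = 0` it returns the bitwise XOR of the operands and a carry of zero.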

The internal signal

The RAM of the core must be capable of holding all the necessary operands and intermediate values required during the execution of cryptographic algorithms. The basic structure of the described RAM is pictured in Figure

Enhanced Montgomery Multiplication Core RAM organization.

It features four symbolic horizontal RAM operand locations with

Besides the horizontal RAM operand locations three symbolic vertical RAM operand locations with

This section provides a description of the functional range of the proposed core. The following precisions (denoted in bit-length) are supported:

If further or other precision widths should be supported, the described core can easily be adjusted in an appropriate manner. For the parametrisation and the execution/abortion of an operation a

The

The

In the case of prime field arithmetic the Montgomery Parameter

In the case of binary field arithmetic the Montgomery Parameter

The

In the case of prime field arithmetic the Montgomery Parameter

In the case of binary field arithmetic the Montgomery Parameter

The

The

The

In order to support cryptographic algorithms which have been disassembled into a list of instructions, RAM copy operations are needed. According to the proposed RAM layout stated above four individual copy operations have been defined.

The

The

The

The

The

This section gives exemplary descriptions of how the specified functional range of the proposed building-block Enhanced Montgomery Multiplication Core design can be utilized to support a wide range of cryptographic algorithms, demanding the least possible memory capacity while supporting as many precision widths as possible. Information is given on how to perform Chinese Remainder Theorem [

To support elliptic curve cryptography over prime and binary finite fields, modular functions are given for preparing and conducting point operations on arbitrary elliptic curves for the supported precision widths. For all these algorithms, a list of operations and the number of operations of each type are given, which allows cryptographic algorithms to be performed by simply processing these operation lists.
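The notion of an operation list can be made concrete with a small software model. The sketch below is not one of the operation lists of this paper: it encodes the textbook Jacobian Point Doubling formulas for a curve y^2 = x^3 + ax + b as a list of three-address instructions and runs them through a minimal interpreter. The instruction format, the register names, and the use of plain modular multiplication in place of Montgomery Multiplication are assumptions made for illustration.

```python
from collections import Counter

# Each entry is (operation, destination, source 1, source 2).
DOUBLE_OPS = [
    ("mul", "t1", "Y", "Y"),    # Y^2
    ("mul", "t2", "X", "t1"),   # X * Y^2
    ("add", "t2", "t2", "t2"),  # 2 * X * Y^2
    ("add", "t2", "t2", "t2"),  # S = 4 * X * Y^2
    ("mul", "t3", "Z", "Z"),    # Z^2
    ("mul", "t3", "t3", "t3"),  # Z^4
    ("mul", "t3", "a", "t3"),   # a * Z^4
    ("mul", "t4", "X", "X"),    # X^2
    ("add", "t5", "t4", "t4"),  # 2 * X^2
    ("add", "t4", "t5", "t4"),  # 3 * X^2
    ("add", "t4", "t4", "t3"),  # M = 3 * X^2 + a * Z^4
    ("mul", "Z", "Y", "Z"),     # Y * Z
    ("add", "Z", "Z", "Z"),     # Z' = 2 * Y * Z
    ("mul", "X", "t4", "t4"),   # M^2
    ("sub", "X", "X", "t2"),    # M^2 - S
    ("sub", "X", "X", "t2"),    # X' = M^2 - 2 * S
    ("mul", "t1", "t1", "t1"),  # Y^4
    ("add", "t1", "t1", "t1"),  # 2 * Y^4
    ("add", "t1", "t1", "t1"),  # 4 * Y^4
    ("add", "t1", "t1", "t1"),  # 8 * Y^4
    ("sub", "t2", "t2", "X"),   # S - X'
    ("mul", "Y", "t4", "t2"),   # M * (S - X')
    ("sub", "Y", "Y", "t1"),    # Y' = M * (S - X') - 8 * Y^4
]


def run_ops(ops, regs, p):
    """Execute an operation list on a register dictionary, all modulo p."""
    for op, dst, s1, s2 in ops:
        x, y = regs[s1], regs[s2]
        if op == "add":
            regs[dst] = (x + y) % p
        elif op == "sub":
            regs[dst] = (x - y) % p
        elif op == "mul":
            regs[dst] = (x * y) % p
        else:
            raise ValueError(f"unknown operation: {op}")
    return regs


def count_ops(ops):
    """Count how often each operation type occurs in an operation list."""
    return Counter(op for op, _dst, _s1, _s2 in ops)
```

Counting the entries per operation type, as `count_ops` does, is what allows the run time of a complete algorithm to be derived from the run times of the individual core operations.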

In order to speed up RSA private key operations the CRT-accelerated version is also supported by the core. Therefore some operations have to be performed with full precision whereas most of the operations have to be performed with half precision. Algorithm

Table

Core operations list CRT-RSA.

| | | |
|---|---|---|
| 1 | - | Clear RAM |
| 2 | - | Write |
| 3 | | |
| 4 | | |
| 5 | - | Write |
| 6 | | |
| 7 | | |
| 8-9 | | |
| 10 | | |
| 11 | | |
| 12-14 | | |
| 15 | | |
| 16 | | |
| 17 | | |
| 18 | | |
| 19 | | |
| 20 | - | Write |
| 21 | | |
| 22 | | |
| 23-24 | | |
| 25 | | |
| 26 | | |
| 27-28 | | |
| 29 | | |
| 30 | | |
| 31-32 | | |
| 33 | - | Write |
| 34 | | |
| 35 | | |
| 36 | | |
| 37 | | |
| 38 | | |
| 39 | | |
| 40 | | |
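For orientation, the arithmetic behind a CRT-accelerated RSA private key operation can be written compactly in software. The sketch below uses Garner's recombination and Python's built-in `pow`; it does not reproduce the 40-step core operation list, and the function and parameter names (`rsa_crt_private`, `dp`, `dq`, `qinv`) are assumptions for illustration.

```python
def rsa_crt_private(c: int, p: int, q: int, dp: int, dq: int, qinv: int) -> int:
    """Compute c^d mod (p * q) from the CRT components of the private key.

    dp = d mod (p - 1), dq = d mod (q - 1), qinv = q^(-1) mod p.
    """
    # Two exponentiations with half-precision moduli and exponents.
    m1 = pow(c % p, dp, p)
    m2 = pow(c % q, dq, q)
    # Garner recombination; only this last step works at full precision.
    h = (qinv * (m1 - m2)) % p
    return m2 + h * q
```

The two half-precision exponentiations dominate the run time, which is why most steps of a CRT-RSA operation list operate at half precision and only a few at full precision.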

CRT-accelerated RSA private key operations require

Algorithm

Table

Core operations list for Miller-Rabin Primality Test.

| | |
|---|---|
| 1 | Clear RAM |
| 2 | Write |
| 3 | Write |
| 4-5 | |
| 6 | |
| 7 | |
| 8 | |
| 9-11 | |
| 12 | |
| 13 | Read |
| 14-15 | |
| 16 | |
| 17-18 | |
| 19 | |
| 20 | Read |
| 21 | |
| 22 | |
| 23-25 | |
| 26 | |
| 27 | Read |
| 28-29 | |
| 30 | |
| 31-32 | |
| 33 | |
| 34 | Read |
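As a software reference for the test itself, a standard Miller-Rabin Primality Test is sketched below. It relies on Python's built-in modular exponentiation rather than on core operations; the function name, the `rounds` parameter (playing the role of a security parameter), and the preliminary trial division by small primes are assumptions for illustration.

```python
import random


def miller_rabin(n: int, rounds: int = 8, rng=None) -> bool:
    """Return False if n is composite, True if n is probably prime."""
    if n < 2:
        return False
    # Trial division by a few small primes also covers the small cases.
    for small in (2, 3, 5, 7):
        if n % small == 0:
            return n == small
    # Write n - 1 as d * 2^s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    rng = rng or random.Random()
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)      # random base in [2, n - 2]
        x = pow(a, d, n)                 # one modular exponentiation
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)             # repeated modular squaring
            if x == n - 1:
                break
        else:
            return False                 # a is a witness: n is composite
    return True
```

Each round costs one modular exponentiation plus up to `s - 1` squarings, so the total work grows linearly with the number of rounds.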

The total number of needed core operations depends on the security parameter

Unlike modular exponentiation, which is based only on modular multiplications, elliptic curve Point Addition and Point Doubling operations, also in the Jacobian projective coordinate representation [

In order to utilize the core for elliptic curve operations the following modular functions have been specified for both

In the following, algorithms for utilizing the core to perform EC operations in

The prime field EC Preparation steps include the calculation of the Montgomery Parameter

A core prime field EC preparation operation requires

The prime field EC Montgomery Transformation steps are responsible for the transformation of the supplied affine point coordinates

A core prime field EC Montgomery Transformation operation requires

The prime field EC Affine-to-Jacobi Transformation steps are responsible for transforming the supplied montgomerized affine point coordinates

A core prime field EC Affine-to-Jacobi Transformation operation requires

The prime field EC Point Validation checks whether a supplied (or calculated) point is indeed a valid point of the elliptic curve given by the equation

A core prime field EC Point Validation operation requires

The prime field EC Point Doubling steps perform a single Point Doubling operation of a Point

A core prime field EC Point Doubling operation requires

The prime field EC Point Addition steps perform a single Point Addition operation of two Points

A core prime field EC Point Addition operation requires

The prime field EC Jacobi-to-Affine Transformation steps are responsible for the transformation of the supplied montgomerized Jacobi coordinates

A core prime field EC Jacobi-to-Affine Transformation operation requires

The prime field EC Montgomery Backtransformation steps are responsible for the transformation of the supplied montgomerized point coordinates

A core prime field EC Montgomery Backtransformation operation requires

In this section, parameter-dependent formulas for the computation times of the described basic core operations are given in clock cycles, which allows upper and lower bounds on the calculation time to be specified. Furthermore, for the supported precision widths in both finite fields, the number of words to be processed and the possible numbers of processing units are given. In order to estimate the size ratio of different core variations, the numbers of logic elements and dedicated logic registers for exemplary Altera and Xilinx FPGAs are stated. Results of a power estimation are given as well. Based on the resulting clock cycle counts of the core variations, a reference implementation that balances performance and area consumption has been defined. For this reference implementation, the computation times in clock cycles for the described exemplary cryptographic algorithms are given.

Table

Core RAM copy operations computation time in clock cycles (CC).


The computing time formulas of prime field core operations given in clock cycles are listed in Table

Core


The computation time of the

The

For the

Precision-dependent values of

| | | | | | | | |
|---|---|---|---|---|---|---|---|
| | 7 | 6 | 8 | 7 | 8 | 7 | 9 |
| | 8 | 11 | 8 | 10 | 9 | 12 | 9 |
| | | | | | | | |
| | 9 | 10 | 10 | 11 | 11 | 12 | |
| | 10 | 10 | 11 | 11 | 12 | 12 | |

For the

For the

For the

The prime field

The computing time formulas of binary field core operations given in clock cycles are listed in Table

Core


The computation time of the

While the determination of the Montgomery Parameter

The

For the

For the

The binary field

Depending on the needs, in terms of performance, area consumption, supported precisions, and the interfacing structure, different variations of the core can be generated by defining the parameters

Number of words


In contrast, Table

Number of words


Since all components of the design consist of simple logic elements, the proposed arithmetic core is vendor-neutral. In order to estimate the hardware footprint of different core implementations the design variations have been compiled on Altera and Xilinx FPGAs. Table

Amount of logic elements and logic registers for different core variations (Altera Cyclone IV).


Table

Amount of logic elements and logic registers for different core variations (Xilinx XC7Z020).


In order to evaluate the suitability of the proposed core for the application in the IoT area, a power estimation has been conducted using two common frequencies of 100 MHz and 200 MHz for various core variations. Timing analysis yields that the design can reliably be operated with these frequencies. The power consumption characteristics have been derived by applying the PowerPlay Power Analyzer Tool of the Quartus Prime IDE to the final design using default settings of a power toggle rate as well as a power input I/O toggle rate of

Total Thermal Power Dissipation (TTPD) values for different core variations (Altera Cyclone IV).


Furthermore, it should be noted that the optimization mode in the compiler settings was set to balanced and that no power-specific compiler optimizations were enabled. The results indicate that the core is well suited to applications with tight power consumption constraints. Depending on such constraints and on the desired clock frequency, a suitable variation can be implemented. The choice will have an impact on computing time and hardware footprint.

For the reference implementation a word width of

Core reference implementation RSA computation times.


Table

Core reference implementation Miller-Rabin computation times.


Table

Core reference implementation prime field EC computation times.


A comprehensive adaptable hardware structure for efficient prime finite field and binary finite field arithmetic operations that expand the capabilities of single Montgomery Multiplier hardware designs has been proposed which allows carrying out cryptographic calculations for a large range of different algorithms all based on the same arithmetic unit operations with arbitrary parameters. The approach taken by the proposed core is to combine standard modulo addition / subtraction support with the capability of performing Montgomery Multiplications, full Montgomery Exponentiations, and the calculation of Montgomery Parameters

The given values of possible hardware footprint and power consumption for specific core variations allow choosing the proper configuration for a specific implementation. The reference implementation showed that with an internal RAM of merely 3.5 kB the core is capable of performing complete prime field and binary field EC operations for various precision widths of standardised curves. Furthermore the same core configuration is capable of performing (CRT-accelerated) RSA operations for typical precision widths required today, (safe) prime testing/generation, and Diffie-Hellman key exchange operations up to

However, the way some core operations are implemented, such as the Montgomery Multiplication and especially the Montgomery Exponentiation, necessitates additional security considerations, since the calculation times depend on the structure of the processed operands. This makes the design prone to side-channel attacks if security-sensitive information, such as private keys, is processed. Not all operations are critical and need to be secured, however; the calculation of the Montgomery Parameters is one example. At the time of writing, the core is therefore being enhanced to provide a secure calculation bit within the command input word, which, if set, instructs the core to perform the specified arithmetic operation in a time-invariant fashion. In addition, special care has to be taken when defining core operation lists, for instance, for performing elliptic curve Point Multiplication operations. Descriptions that execute in a fixed amount of time, e.g., the Montgomery ladder [

The authors declare that there are no conflicts of interest regarding the publication of this paper.