A Vendor-Neutral Unified Core for Cryptographic Operations in GF(p) and GF(2^m) Based on Montgomery Arithmetic

In the emerging IoT ecosystem, in which internetworking will reach a totally new dimension, the crucial role of efficient security solutions for embedded devices will be without controversy. Typically IoT-enabled devices are equipped with integrated circuits, such as ASICs or FPGAs, to achieve highly specific tasks. Such devices must have cryptographic layers implemented and must be able to access cryptographic functions for encrypting/decrypting and signing/verifying data using various algorithms and to generate true random numbers, random primes, and cryptographic keys. In the context of the limited amount of resources that typical IoT devices will exhibit, due to energy efficiency requirements, efficient hardware structures in terms of time, area, and power consumption must be deployed. In this paper, we describe a scalable word-based multivendor-capable cryptographic core, being able to perform arithmetic operations in prime and binary extension finite fields based on Montgomery Arithmetic. The functional range comprises the calculation of modular additions and subtractions, the determination of the Montgomery Parameters, and the execution of Montgomery Multiplications and Montgomery Exponentiations. A prototype implementation of the adaptable arithmetic core is detailed. Furthermore, the decomposition of cryptographic algorithms to be used together with the proposed core is stated and a performance analysis is given.


Introduction
The next generation of embedded systems and IoT devices will exhibit a much higher degree of internetworking, which gives rise to security considerations [1]. As a logical consequence, such devices must become cryptographic nodes that are, among other things, capable of encrypting/decrypting and signing/verifying data as well as establishing spontaneous secured communications by exchanging common secrets used for secret key calculation. While many embedded chips already have support for hardware-accelerated symmetric algorithms (mainly AES) [2] and hash functions, for various reasons, such as complexity, space, and costs, they lack hardware support for a wide range of public-key and key exchange algorithms with different precision widths. Besides, many modern cryptographic primitives necessitate the capability for producing true random numbers and random prime numbers. Typical IoT devices furthermore very often exhibit only a limited amount of resources, which requires efficient cryptographic hardware structures in terms of area, power consumption, and calculation performance [3]. In general, enterprises developing IoT products have three options to include application functionalities in highly integrated devices: Application Specific Standard Products (ASSP), Application Specific Integrated Circuits (ASIC), or Field Programmable Gate Arrays (FPGA). Today FPGAs have become promising components for IoT applications [4]: ASSP solutions often cannot provide the required functionality, and FPGAs can provide a better Total Cost of Ownership (TCO) than ASIC solutions. Thus for devices which are equipped with an FPGA, it is valuable to examine how efficient hardware structures for performing cryptographic operations can be included.
In matters of algorithm agility, an arithmetic engine with minimal hardware footprint, which can handle the arithmetic operations of a great variety of cryptographic algorithms, is of great importance for IoT based devices. Especially the calculability of the individual operations, leading to lower and upper calculation time bounds, is quite important.
This paper proposes a compact vendor-neutral cryptographic arithmetic core, exemplarily implemented in FPGA logic. For efficiency, Montgomery Arithmetic is used for time-intensive modular operations such as multiplication and exponentiation. Without the need of any expensive software precalculations the core is able to perform a high number of cryptographic algorithms and handle various key sizes by simply processing operation lists. Furthermore the core architecture is unified and can perform calculations in both prime finite fields ($GF(p)$) and binary extension fields ($GF(2^m)$). To illustrate the versatility of the developed core, well-established cryptographic algorithms have been rewritten and fragmented into operation lists to be processed by the arithmetic engine.
The paper is organized as follows. Section 2 states the related work of this research. In Section 3 the design of the proposed Enhanced Montgomery Multiplication Core is stated; the specified functional range of the core is given in Section 4. In Section 5 some exemplary application descriptions for the core are mentioned and in Section 6 the results of the performance analysis are stated. Finally, Section 7 concludes the paper.

Related Works
The efficiency of cryptographic algorithms when implemented on reconfigurable hardware is mainly determined by how the underlying finite field arithmetic operations are realized [5]. Several applications in cryptography, such as ciphering and deciphering of asymmetric algorithms, the creation and verification of digital signatures, and secure key exchange mechanisms, require excessive use of the basic finite field modular arithmetic operations addition, multiplication, and the calculation of the multiplicative inverse. Especially the field multiplication operation is crucial to the efficiency of a design, since it is the core operation of many cryptographic algorithms [6].
In [7] P. L. Montgomery introduced a representation of residue classes in order to speed up modular multiplications without affecting modular additions and subtractions. Over the years numerous designs have been proposed implementing modular multiplications based on Montgomery's multiplication algorithm [8]. The foundation for these architectures was presented by A. Tenca and Ç. Koç in [9]. The architecture is based on a word-based Montgomery Multiplication algorithm for prime finite fields in which multiplications are performed in a bit-serial fashion. E. Savaş et al. in [10] have proposed an extension which, in addition to the standard integer modulo arithmetic, also allows polynomial computations over binary finite fields. An overview of algorithms and hardware architectures for Montgomery Multiplication can be found in [11]. Optimizations of the original design have been proposed concerning the hardware implementation of the Montgomery Multiplication algorithm [12] as well as by utilizing special arithmetic hardcore extensions of FPGAs to accelerate digital signal processing applications [13]. Some designs only focus on utilizing the Montgomery Multiplication method to accelerate modular exponentiation operations as required by the RSA algorithm [14,15].
However, no publication focuses on how the Montgomery Multiplication architecture can be embedded into a comprehensive solution. In this paper we propose an enhanced version of a bit-serial word-based unified Montgomery Multiplication core, based on logic elements only, which is controlled by a state machine and offers the functional range to perform complete cryptographic algorithms without additional complex processing required in software.

Enhanced Montgomery Multiplication Core
3.1. Requirements. Today a high number of different public-key algorithms are in use. To ensure compatibility, cryptographic applications must support a large portion of those algorithms. While typical software implementations can often easily be upgraded in order to adopt new algorithms and larger key sizes, the same is not necessarily true for hardware implementations. Therefore the following requirements have been identified for the Enhanced Montgomery Multiplication Core:
(i) Use of Montgomery Arithmetic. The design must be able to perform modulo operations in a time-efficient manner by using Montgomery Arithmetic. At a minimum the core must support Montgomery Multiplications and Montgomery Exponentiations. Furthermore the core must support standard modulo additions and modulo subtractions.
(ii) Works on Both Finite Fields $GF(p)$ and $GF(2^m)$. The architecture must exhibit a unified structure supporting both standard integer modulo operations of prime finite fields and polynomial calculations of binary finite fields.
(iii) Montgomery Parameter Calculation. In general the Montgomery Parameters ($R$ and $R^2$) can be precomputed for previously known moduli. However, as a requirement the core must be able to handle arbitrary moduli. Therefore it must be capable of calculating the Montgomery Parameters $R \bmod n$, $R^2 \bmod n$, and $R^{-1} \bmod n$ for a supplied modulus $n$ without the need of precalculations done in software.
(iv) Scalable Design. The architecture must be scalable in terms of timing, area, and power consumption. This includes the parametrisation of the word width, the internal storage size, and the number of processing units within the pipeline.
(v) Multialgorithm Support. The core must be based on a building-block design. The functional range provided by the arithmetic unit should empower algorithm agility by fragmenting cryptographic algorithms into a list of core operations. At a minimum the core must be capable of performing RSA [16] operations, (safe) prime number generation and primality testing (MR) [17,18], key exchange operations (DH) [19], and elliptic curve calculations (EC) [20] over both prime and binary finite fields.
(vi) Supporting as Many Precision Widths as Possible. The design must support a wide range of different precision widths determining the security level of the cryptographic algorithm. If a certain security level becomes inadequate due to increased attacking computing power, the precision width can be adjusted accordingly, which makes the hardware less prone to becoming obsolete due to higher security demands. The core must support the current recommendations for minimum key sizes [21] and should also support larger key sizes. For RSA algorithm and Diffie-Hellman key exchange support the architecture should be able to handle precisions up to 4096 bit moduli; for elliptic curve cryptography support, precisions up to 512 bits for prime finite fields and up to 571 bits for binary finite fields should be possible.
(vii) Time-Invariant Operations. The architecture must be capable of performing its operations in a time-invariant manner. If security sensitive information, such as private keys, is processed, it must be ensured that all operations exhibit the same execution time to prevent side-channel attacks based on timing analysis.

3.2. Overall Core Architecture. Figure 1 illustrates the overall architecture of the proposed Enhanced Montgomery Multiplication Core, which is capable of meeting all requirements as specified above.
Besides the pipeline of processing units handling the main part of the word-based Montgomery Multiplication algorithm, the core features an enhanced word-based Carry Look-Ahead (CLA) adder. It calculates the final result after the pipeline has processed all bits of an operand and also performs single modular addition and subtraction operations. The register files of the original design have been replaced with an internal dual-ported RAM which holds the operands as well as intermediate results of the core operations. Furthermore a word-based comparator component has been described which is queried during operations to decide if a modular addition or subtraction step must be performed. Two additional words for the $x$ operand and the exponent $e$ have been introduced, each as wide as the RAM, with the RAM width being four times the word width; they will be fetched from RAM in case of Montgomery Multiplication and Montgomery Exponentiation operations. An auxiliary word is used for RAM reorganisation operations as well as for the calculation of the Montgomery Parameters $R$ and $R^2$.
The intelligence of the core is the controlling state machine, which utilizes the defined components to perform standard modular addition and subtraction operations, Montgomery Multiplications, Montgomery Exponentiations, Montgomery Parameter calculation, and RAM reorganisation operations. It is therefore responsible for controlling the RAM write and read access, the source and destination address signals of the RAM, as well as the values passed through to the first processing unit, to the CLA adder, and to the comparator component. Furthermore it controls the assignments of the $x$ operand, exponent $e$, and auxiliary words.
The described core can be parametrised in three ways. The parameter named MAX_PRECISION_WIDTH specifies the highest supported precision width, whereas the parameter WORD_WIDTH is used to specify the word width of the operands involved in the calculations. These two parameters determine the size and the address space of the internal core RAM. The third parameter, MAX_NUM_PUS, specifies the maximum number of processing units of the pipeline implemented for a specific core variation, mainly affecting the performance and the size in terms of area consumption.

Processing Units.
The heart of the core is the pipeline of processing units implementing the multiple word version of the Montgomery Multiplication algorithm.Therefore the processing unit structure has been described from scratch.
The processing unit can be held in reset and keeps track of the cycle number according to the number of words to be processed, depending on the supplied parameters. This control logic is needed to determine whether the supplied modulus has to be added to the processed words in this cycle or not, depending on the value of the signal denoting an odd intermediate result. Note that buffering the output of a processing unit between two processing units is not required in this design. Compared to the original design presented in [10], a unified solution requires $\lceil \text{precision width} / \text{word width} \rceil + 1$ words for a given precision width and word size, and the pipeline must consist of a power-of-two number of processing units, with that number being smaller than the number of words minus one, in order to avoid pipeline stalls. Figure 2 illustrates the internal architecture of an exemplary processing unit with a word size of 4 bits.
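As a rough illustration of the sizing rule above, the following Python sketch computes the number of operand words and the admissible pipeline sizes. The function names are hypothetical, and the reading of the constraint (a power-of-two number of processing units strictly smaller than the number of words minus one) is an assumption drawn from the text, not from the implementation.

```python
def num_words(precision_bits, word_bits):
    """Words per operand: ceil(precision / word width) + 1."""
    return -(-precision_bits // word_bits) + 1


def valid_pu_counts(precision_bits, word_bits):
    """Powers of two strictly below (number of words - 1)."""
    limit = num_words(precision_bits, word_bits) - 1
    counts = []
    pus = 1
    while pus < limit:
        counts.append(pus)
        pus *= 2
    return counts


print(num_words(1024, 32))        # 33
print(valid_pu_counts(1024, 32))  # [1, 2, 4, 8, 16]
```

For a 1024-bit precision and 32-bit words this yields 33 words, so under the stated assumption at most 16 processing units could be used without stalling the pipeline.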
Each processing unit consists of a cascade of two layers of so-called Unified Full Adder (UFA) cells. The Unified Full Adder cells basically consist of simple full adder cells which have been enhanced by an additional finite field selection input $f_{sel}$. This allows for the creation of a unified multiplier architecture which can be used not only in prime fields $GF(p)$ ($f_{sel} = 1$) but also in binary fields $GF(2^m)$ ($f_{sel} = 0$), in which additions will be simple bitwise XOR calculations without any carry output.
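The behaviour of a single UFA cell can be sketched in Python as follows. This is a minimal bit-level model following common dual-field adder designs, not the cell netlist of the core; the function name `ufa` and the argument order are assumptions made for illustration.

```python
def ufa(a, b, c, f_sel):
    """One Unified Full Adder cell operating on single bits.

    f_sel = 1: ordinary full adder (prime field mode).
    f_sel = 0: the sum is a plain XOR and the carry is suppressed
               (binary field mode).
    """
    total = a ^ b ^ c
    carry = f_sel & ((a & b) | (a & c) | (b & c))
    return total, carry


print(ufa(1, 1, 0, 1))  # (0, 1): integer addition 1 + 1 = 2
print(ufa(1, 1, 0, 0))  # (0, 0): XOR only, no carry
```

Gating only the carry path with the selection input is what lets the same cell serve both fields: the sum logic is identical in both modes.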

Carry Look-Ahead Adder.
Since the pipeline generates the result in carry save form, an additional step is necessary at the end of each calculation to obtain a nonredundant version of the result. For the sake of uniformity a circuit is required that can operate in both finite fields $GF(p)$ and $GF(2^m)$. Furthermore, since the calculation in $GF(p)$ could require one further subtraction step, the Carry Look-Ahead adder in the design has been formulated to be able to perform word-based modular additions and subtractions. Figure 3 illustrates the logic of the proposed enhanced $w$-bit wide CLA adder of the core, with $w$ being the word width.
The internal signal $b'_i$ of the second operand will be calculated as $b'_i = b_i \oplus (\mathit{sub} \cdot f_{sel})$, in which $\mathit{sub}$ denotes an add-or-subtract signal ($\mathit{sub} = 0$ means addition, $\mathit{sub} = 1$ represents subtraction by performing an addition in two's complement representation) and $f_{sel}$ is the finite field selection input. The modified CLA adder involves the same common Carry Look-Ahead adder logic for the calculation of the generate ($g_i = a_i \cdot b'_i$) and propagate ($p_i = a_i + b'_i$) functions. The carry values $c_i$ of the CLA adder logic will be calculated as $c_0 = c_{in}$ for the least-significant bit and $c_i = g_{i-1} + (p_{i-1} \cdot c_{i-1})$ for all further bits. The final sum output bits $s_i$ will be calculated as $s_i = (c_i \cdot f_{sel}) \oplus a_i \oplus b'_i$; the carry output bit will be determined as $c_{out} = c_w \cdot f_{sel}$. If the selected finite field is $GF(2^m)$ ($f_{sel} = 0$), then the add-or-subtract input will be ignored, the final sum will simply be the bitwise modulo-2 addition of the two input values $a$ and $b$, and the carry output bit will be forced to zero.
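The adder equations can be checked with a small Python model. The sketch below evaluates the carry recurrence bit by bit instead of with parallel look-ahead logic, so it reproduces the results but not the timing of the circuit. The function name `cla_word`, the signal names, and the default word width of 8 bits are assumptions; for a subtraction the carry input of the first word is assumed to be 1, as two's complement arithmetic requires.

```python
def cla_word(a, b, c_in, sub, f_sel, w=8):
    """Add or subtract one w-bit word; returns (sum word, carry out)."""
    carry = c_in
    total = 0
    for i in range(w):
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ (sub & f_sel)    # invert b only in GF(p) mode
        s_i = (carry & f_sel) ^ a_i ^ b_i       # carry ignored in GF(2^m)
        total |= s_i << i
        g_i = a_i & b_i                         # generate
        p_i = a_i | b_i                         # propagate
        carry = g_i | (p_i & carry)             # carry into the next bit
    return total, carry & f_sel


print(cla_word(200, 100, 0, 0, 1))        # (44, 1): 300 = 256 + 44
print(cla_word(200, 100, 1, 1, 1))        # (100, 1): 200 - 100
print(cla_word(0b1100, 0b1010, 0, 0, 0))  # (6, 0): bitwise XOR
```

A carry out of 1 after a subtraction indicates a non-negative difference, which is the condition a modular subtraction evaluates before deciding whether the modulus has to be added back.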

Functional Range of the Core
This section provides a description of the functional range of the proposed core. The following precisions (denoted in bit-length) are supported: If further or other precision widths should be supported, the described core can easily be adjusted in an appropriate manner. For the parametrisation and the execution/abortion of an operation a 32-bit wide command input word has been defined. Besides the start, abort, and finite field selection signals, the encoded precision width, the operation code, and the RAM offsets for the specified operation can also be supplied. The following operations have been specified.

MontMult Operation.
The MontMult operation code instructs the core to perform a single Montgomery Multiplication with the supplied elements in the given finite field. A Montgomery Multiplication will start by reading the first word of operand $x$ from RAM. Afterwards the pipeline will be started and the appropriate bits of the $x$ operand will be fed to the individual processing units. If all bits of the $x$ operand word have been fed to the processing units, a new word will be read from RAM. Once the last bit of the $x$ operand has been processed, the temporary sum and temporary carry words will be fed into the CLA adder in order to reunite the two streams. After the last words of temporary sum and temporary carry have been brought together, the carry output bit of the CLA adder will be evaluated. If a carry bit is set, the modulus will be subtracted once; otherwise the result will be compared to the given modulus. If the result is equal to or greater than the modulus, the given modulus will be subtracted once.
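The arithmetic behind this operation can be sketched with a bit-serial radix-2 Montgomery Multiplication in Python. This is a functional model for prime field arithmetic only: it works on whole integers and ignores the word-based pipeline, the carry save representation, and the RAM accesses of the core. The name `mont_mult` and its parameters are assumptions for illustration.

```python
def mont_mult(x, y, n, k):
    """Return x * y * 2^(-k) mod n for an odd modulus n < 2^k and x, y < n."""
    s = 0
    for i in range(k):
        x_i = (x >> i) & 1      # next bit of the x operand
        s += x_i * y
        if s & 1:               # odd intermediate result: add the modulus
            s += n
        s >>= 1                 # exact division by two
    if s >= n:                  # single final correction step
        s -= n
    return s


n, k = 97, 7
r_inv = pow(1 << k, -1, n)      # requires Python 3.8 or newer
print(mont_mult(20, 30, n, k) == 20 * 30 * r_inv % n)  # True
```

Adding the odd modulus whenever the intermediate sum is odd makes the following right shift exact, which is the step that replaces the trial division of an ordinary modular multiplication.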

MontR Operation.
The MontR operation code instructs the core to calculate the Montgomery Parameter $R = 2^k$ regarding a supplied modulus in the given finite field, with $k$ being the bit-length of the given precision.
In the case of prime field arithmetic the Montgomery Parameter will be $R \equiv 2^k \bmod n$ for the supplied modulus $n$, so $R$ can be calculated as the two's complement of $n$, i.e., as the bitwise inverse of the given modulus plus 1. Therefore the individual words of the modulus will be XOR-ed with a constant word consisting of all ones. In addition the least-significant bit of the first word will be set to one; since the modulus is odd, its bitwise inverse is even, so setting this bit is equivalent to adding 1.
In the case of binary field arithmetic the Montgomery Parameter will be $R \equiv 2^k \bmod f(x)$ for the irreducible polynomial $f(x)$, so $R$ is equal to the binary expression of $f(x)$ with the most significant bit set to zero. Therefore the individual words of the modulus will be scanned and the appropriate most significant bit will be set to zero, depending on the given precision.
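Both rules can be illustrated with a short Python sketch on whole integers. It assumes that a prime field modulus is odd and has its most significant bit set at the chosen precision, and that a binary field polynomial of degree $k$ is stored as a $(k+1)$-bit pattern; the function names are hypothetical.

```python
def mont_r_prime(n, k):
    """R = 2^k mod n as the k-bit two's complement of n."""
    mask = (1 << k) - 1
    return (n ^ mask) | 1       # invert all bits, then set the LSB


def mont_r_binary(f, k):
    """R = x^k mod f(x): the polynomial with its top bit cleared."""
    return f & ~(1 << k)


print(mont_r_prime(97, 7))          # 31, equal to 128 mod 97
print(mont_r_binary(0b10011, 4))    # 3, since x^4 = x + 1 mod x^4 + x + 1
```

Neither rule needs a division or a multiplication, which is presumably why the parameter can be obtained with the XOR and bit-set steps described above.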

MontR2 Operation.
The MontR2 operation code instructs the core to calculate the Montgomery Parameter $R^2$ with $R^2 = 2^{2 \cdot k}$ for a supplied modulus in the given finite field, with $k$ being the bit-length of the given precision.
In the case of prime field arithmetic the Montgomery Parameter $R^2$ will be given by $R^2 \equiv R \cdot R \equiv 2^k \cdot 2^k \equiv 2^{2 \cdot k} \bmod n$ for the supplied modulus $n$. Therefore in a first step the Montgomery Parameter $R \equiv 2^k \bmod n$ will be calculated for prime fields as described above. In order to calculate $R^2$ one possible way is to calculate $2^k \cdot 2^j \bmod n$ with $j$ being a small divider of $k$. In the given implementation $j = 1$. Therefore the bits of $R$ will be shifted to the left by one bit. If the result is equal to or greater than the modulus, $n$ will be subtracted once. By using a square-and-multiply-like algorithm, multiple Montgomery Multiplications will be performed in order to calculate $R^2 \equiv 2^k \cdot 2^k \bmod n$.
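The prime field procedure can be sketched in Python as follows. The shifted value $2 \cdot R \bmod n$ is the Montgomery representation of 2, and raising it to the power $k$ with Montgomery Multiplications yields $2^k \cdot R = R^2 \bmod n$. The sketch uses a functional stand-in for the Montgomery Multiplication and assumes an odd modulus with its most significant bit set at precision $k$; all names are hypothetical.

```python
def mont_mult(a, b, n, k):
    """Functional stand-in: a * b * 2^(-k) mod n (Python 3.8 or newer)."""
    return a * b * pow(1 << k, -1, n) % n


def mont_r2(n, k):
    """R^2 mod n from one shift and a square-and-multiply over the bits of k."""
    r = (1 << k) - n            # R mod n, as produced by a MontR step
    two = r << 1                # Montgomery form of 2
    if two >= n:
        two -= n
    acc = two
    for bit in bin(k)[3:]:      # bits of k after the leading one
        acc = mont_mult(acc, acc, n, k)
        if bit == "1":
            acc = mont_mult(acc, two, n, k)
    return acc


print(mont_r2(97, 7))           # 88, equal to 128 * 128 mod 97
```

Because the exponent is the public precision $k$ and not a secret, the number of Montgomery Multiplications in this step depends only on the chosen precision.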
In the case of binary field arithmetic the Montgomery Parameter $R^2$ will be given by $R^2 \equiv R \cdot R \equiv 2^k \cdot 2^k \equiv 2^{2 \cdot k} \bmod f(x)$ for the irreducible polynomial $f(x)$. Therefore in a first step the Montgomery Parameter $R \equiv 2^k \bmod f(x)$ will be calculated for binary fields as described above. In order to calculate $R^2$ the resulting parameter $R$ will be shifted $k$ times bitwise to the left. After each shift, the most significant bit as given by the precision parameter will be evaluated. If the bit is one, the irreducible polynomial will be added to the intermediate result, which represents a modulo reduction with $f(x)$. Once the shift has been performed $k$ times the result will be $R^2 \equiv 2^k \cdot 2^k \bmod f(x)$.
Afterwards the first appearing one of the exponent word will be searched starting from the most significant bit. If the first word consists of all zeros then the next word of exponent $e$ will be read and evaluated. Once the highest bit of exponent $e$ has been found, multiple Montgomery Multiplications will be performed until all bits of the exponent have been processed following a square-and-multiply algorithm.
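The binary field shift-and-reduce computation of $R^2$ described above can be sketched in Python. Polynomials are represented as integers whose bits are the coefficients, with a degree-$k$ polynomial stored in $k+1$ bits; the function name is an assumption for illustration.

```python
def mont_r2_binary(f, k):
    """R^2 = x^(2k) mod f(x), obtained by k shift-and-reduce steps."""
    r = f ^ (1 << k)            # R = x^k mod f(x): clear the top bit
    for _ in range(k):
        r <<= 1                 # multiply by x
        if (r >> k) & 1:        # bit k set: reduce by adding f(x)
            r ^= f
    return r


# f(x) = x^4 + x + 1, so x^8 = (x + 1)^2 = x^2 + 1 (bit pattern 0b0101)
print(mont_r2_binary(0b10011, 4))   # 5
```

Whether the polynomial has to be added after a shift depends on the bit pattern of $f(x)$, so the number of additions varies between zero and $k$.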

ModAdd Operation.
The ModAdd operation code instructs the core to perform a modular addition of the supplied elements in the given finite field. After preparing the core for the addition operation, the CLA adder will add the given operands using the appropriate arithmetic given by the finite field selection input. Once the last words of the given operands have been added, the carry output bit of the CLA adder will be evaluated. If a carry bit is set, the modulus will be subtracted once; otherwise the result will be compared to the given modulus. If the result is equal to or greater than the modulus, the modulus will also be subtracted once.

ModSub Operation.
The ModSub operation code instructs the core to perform a modular subtraction of the supplied elements in prime fields. After preparing the core for the subtraction operation, the CLA adder will be used to perform a word-based subtraction by performing an addition in two's complement representation with prime field arithmetic. After the last words of the given operands have been processed, the carry output bit of the CLA adder will be evaluated. If the carry bit signals a negative result, the modulus will be added once; otherwise the result will be compared to the given modulus. If the result is equal to or greater than the modulus, the modulus will be subtracted once.
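The decision logic of the two prime field operations can be sketched in Python on whole integers. The sketch assumes operands smaller than the modulus and a precision of $k$ bits, and it takes a missing carry out of the two's complement addition as the sign of a negative difference; the function names are hypothetical. In binary fields a modular addition reduces to a bitwise XOR.

```python
def mod_add(a, b, n, k):
    """(a + b) mod n, decided from the carry out and one comparison."""
    mask = (1 << k) - 1
    raw = a + b
    carry, result = raw >> k, raw & mask
    if carry or result >= n:
        result = (result - n) & mask     # subtract the modulus once
    return result


def mod_sub(a, b, n, k):
    """(a - b) mod n via an addition in two's complement."""
    mask = (1 << k) - 1
    raw = a + (~b & mask) + 1
    carry, result = raw >> k, raw & mask
    if carry == 0:                       # no carry out: a - b was negative
        result = (result + n) & mask     # add the modulus once
    elif result >= n:
        result -= n
    return result


print(mod_add(90, 50, 97, 7))   # 43
print(mod_sub(5, 10, 97, 7))    # 92
```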

RAM Copy Operations.
In order to support cryptographic algorithms which have been disassembled into a list of instructions, RAM copy operations are needed. According to the proposed RAM layout stated above, four individual copy operations have been defined.
The CopyH2V operation code instructs the core to copy a number of words, according to the supplied precision parameter, from the horizontal RAM layout starting from the given source address to the vertical RAM layout starting from the given destination address.
The CopyV2V operation code instructs the core to copy a number of words, according to the supplied precision parameter, from the vertical RAM layout starting from the given source address to the vertical RAM layout starting from the given destination address.
The CopyH2H operation code instructs the core to copy a number of words, according to the supplied precision parameter, from the horizontal RAM layout starting from the given source address to the horizontal RAM layout starting from the given destination address.
The CopyV2H operation code instructs the core to copy a number of words, according to the supplied precision parameter, from the vertical RAM layout starting from the given source address to the horizontal RAM layout starting from the given destination address.

MontMult1 Operation.
The MontMult1 operation code instructs the core to perform a single Montgomery Multiplication of the supplied element with the constant 1 in the given finite field. This type of operation is needed when a montgomerized value should be transformed back from the Montgomery Domain and has been implemented as an independent operation, since an operand $x = 1$ would unnecessarily occupy a vertical RAM slot. A Montgomery Multiplication with the constant 1 will be executed in a manner analogous to the MontMult operation, with the only exception that, instead of the RAM words, constant words will be used for the $x$ operand.
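The reason a multiplication with the constant 1 leaves the Montgomery Domain follows directly from the definition: a montgomerized value $a \cdot R$ multiplied by 1 gives $a \cdot R \cdot 1 \cdot R^{-1} \equiv a$. A minimal Python check with a functional stand-in for the Montgomery Multiplication (the names and the example modulus are assumptions):

```python
def mont_mult(a, b, n, k):
    """Functional stand-in: a * b * 2^(-k) mod n (Python 3.8 or newer)."""
    return a * b * pow(1 << k, -1, n) % n


n, k = 97, 7
r = (1 << k) % n
value = 42
value_mont = value * r % n              # value inside the Montgomery Domain
print(mont_mult(value_mont, 1, n, k))   # 42
```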

Exemplary Core Application Descriptions
This section gives exemplary descriptions of how the specified functional range of the proposed building-block Enhanced Montgomery Multiplication Core design can be utilized to support a wide range of cryptographic algorithms, demanding the least possible memory capacity yet at the same time supporting as many precision widths as possible. Information is given on how to perform Chinese Remainder Theorem [22] (CRT) accelerated RSA private key operations and how to use the core in order to test/generate prime numbers.
For the support of elliptic curve cryptography over prime and binary finite fields, modular functions are given for preparing and conducting point operations for arbitrary elliptic curves for the supported precision widths. For all these algorithms a list of operations and the quantity of the different operations is given, so that cryptographic algorithms can be performed by simply processing these operation lists.

CRT-Accelerated RSA Operation.
In order to speed up RSA private key operations the CRT-accelerated version is also supported by the core. Therefore some operations have to be performed with full precision whereas most of the operations have to be performed with half precision. Algorithm 1 lists the necessary steps to utilize the core for CRT-accelerated RSA private key operations.
Table 1 illustrates the abstract operation lists of the core for the CRT-accelerated RSA application using the private key portion for all supported precision widths (512, 768, 1024, 1536, 2048, 3072, 4096). The number given in the index of the RAM locations denotes the offset given by the corresponding src_addr, dest_addr, src_addr_e, and src_addr_x input signals. The width of the processed values depends on the supplied mwmac_precision input signal, which depends on the operation. In the table, operations requiring full precision (the precision of the RSA modulus) are marked by (f); operations requiring half precision are marked by (h). The mwmac_f_sel signal must be set to $GF(p)$ arithmetic.
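The split into half-precision and full-precision work can be seen in a textbook CRT formulation of the RSA private key operation. The Python sketch below uses Garner's recombination and the built-in `pow`; it is not the operation list of the core, and the function name and toy key values are assumptions chosen for illustration.

```python
def rsa_crt_private(c, p, q, d):
    """Compute c^d mod (p * q) from two half-size exponentiations."""
    d_p, d_q = d % (p - 1), d % (q - 1)
    q_inv = pow(q, -1, p)               # requires Python 3.8 or newer
    m_p = pow(c % p, d_p, p)            # half precision
    m_q = pow(c % q, d_q, q)            # half precision
    h = q_inv * (m_p - m_q) % p
    return m_q + h * q                  # full precision recombination


p, q, e, d = 61, 53, 17, 2753           # toy key, far too small for real use
message = 65
cipher = pow(message, e, p * q)
print(rsa_crt_private(cipher, p, q, d))  # 65
```

Only the final recombination touches values as wide as the RSA modulus, which matches the observation that most operations run at half precision.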

Prime Generation/Testing Operation.
Algorithm 2 lists the necessary steps to utilize the core, in conjunction with a TRNG, as a Miller-Rabin primality tester. In the algorithm $n$ denotes the random integer to be tested for primality and $t$ denotes the confidence parameter determining the accuracy of the test, i.e., the number of Miller-Rabin loops. In a precomputation step the parameters $s$ and $d$ with $2^s \cdot d = (n - 1)$ must be calculated, which can be done by simple shift operations and counter increments in software.
The test furthermore requires a number of random integers $\{a_1, \ldots, a_t\}$ serving as random bases. The number given in the index of the RAM locations denotes the offset given by the corresponding src_addr, dest_addr, src_addr_e, and src_addr_x input signals. The width of the processed values depends on the supplied mwmac_precision input signal. The mwmac_f_sel signal must be set to $GF(p)$ arithmetic. Note that since the results of the performed operations will be in the Montgomery Domain, they will be checked against the Montgomery Parameter $R$ and $(n - R)$ instead of 1 and $(n - 1)$. Also note that the random bases $a_i$ that will be checked do not necessarily have to be transformed into the Montgomery Domain first; they will simply be interpreted as random montgomerized values.
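The comparison against $R$ and $(n - R)$ can be illustrated with a Miller-Rabin sketch that stays in the Montgomery Domain throughout. The Python code below uses a functional stand-in for the Montgomery Multiplication and the standard `random` module in place of a TRNG; the function name, the choice of the precision as the bit-length of $n$, and the skipping of trivial bases are assumptions of this sketch.

```python
import random


def miller_rabin_mont(n, t):
    """Miller-Rabin test for odd n > 3 with t rounds, compared against R and n - R."""
    k = n.bit_length()
    r = (1 << k) % n                    # montgomerized 1
    r_inv = pow(r, -1, n)               # requires Python 3.8 or newer

    def mont_mult(a, b):                # functional model: a * b * R^-1 mod n
        return a * b * r_inv % n

    s, d = 0, n - 1                     # n - 1 = 2^s * d with d odd
    while d % 2 == 0:
        s, d = s + 1, d // 2

    one, minus_one = r, n - r           # stand-ins for 1 and n - 1
    for _ in range(t):
        base = random.randrange(1, n)   # taken as an already montgomerized value
        if base in (one, minus_one):
            continue                    # trivial base, carries no information
        y = one
        for bit in bin(d)[2:]:          # left-to-right square-and-multiply
            y = mont_mult(y, y)
            if bit == "1":
                y = mont_mult(y, base)
        if y in (one, minus_one):
            continue
        for _ in range(s - 1):
            y = mont_mult(y, y)
            if y == minus_one:
                break
        else:
            return False                # witness found: n is composite
    return True


print(miller_rabin_mont(97, 10))        # True
```

Since every value stays montgomerized, no back-transformation is needed before the comparisons, and a uniformly random base is equally random whether or not it is read as a montgomerized value.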

Elliptic Curve Operations.
Unlike modular exponentiation, which is based only on modular multiplications, elliptic curve Point Addition and Point Doubling operations, also in the Jacobian projective coordinate representation [23], involve modular additions, subtractions, and multiplications. The algorithms for prime field elliptic curve Point Addition and Point Doubling using Jacobian coordinates furthermore involve multiplications by some constants. Since the described core performs multiplication operations by
(viii) EC Montgomery Backtransformation.
In the following, algorithms for utilizing the core to perform EC operations in $GF(p)$ are stated. For $GF(2^m)$ EC support, similar algorithms have been derived.

GF(p) EC Preparation.
The prime field EC Preparation steps include the calculation of the Montgomery Parameter $R^2 \bmod p$, the exponent $\mathit{exp} = p - 2$, as well as the montgomerized versions of the constants 2, 3, 4, 8 and the EC Domain Parameters $a$ and $b$ for a given elliptic curve $E : y^2 \equiv x^3 + a \cdot x + b \bmod p$ over $GF(p)$. Algorithm 3 lists the necessary steps to utilize the core for EC prime field preparation.
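The preparation values can be sketched in Python with a functional stand-in for the Montgomery Multiplication. Montgomerizing a value $v$ amounts to a Montgomery Multiplication with $R^2$, since $v \cdot R^2 \cdot R^{-1} \equiv v \cdot R$. The assumption that the exponent $p - 2$ serves modular inversion via Fermat's little theorem is an interpretation of this sketch, and the function name, dictionary layout, and toy curve are hypothetical.

```python
def ec_prepare(p, a, b, k):
    """Collect R^2 mod p, the exponent p - 2, and montgomerized constants."""
    r = (1 << k) % p
    r_inv = pow(r, -1, p)               # requires Python 3.8 or newer

    def mont_mult(x, y):                # functional model: x * y * R^-1 mod p
        return x * y * r_inv % p

    r2 = r * r % p

    def to_mont(v):                     # v * R mod p via one multiplication by R^2
        return mont_mult(v % p, r2)

    return {
        "r2": r2,
        "exp": p - 2,
        "consts": {c: to_mont(c) for c in (2, 3, 4, 8)},
        "a": to_mont(a),
        "b": to_mont(b),
    }


params = ec_prepare(p=23, a=1, b=1, k=5)
print(params["consts"][2])              # 18, i.e. 2 * R mod 23 with R = 9
print(pow(5, params["exp"], 23) * 5 % 23)  # 1: 5^(p-2) is the inverse of 5
```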
A core prime field EC preparation operation requires 1 × 2, 6 × , 1 × , 8 × 2, and 8 × 2.
of processing units is given. In order to estimate the size ratio of different core variations, the number of logic elements and dedicated logic registers for exemplary Altera and Xilinx FPGAs is stated. Furthermore results of power estimation are given. Depending on the resulting clock cycle times of core variations, a reference implementation exhibiting a balance of performance and area consumption has been defined. For this reference implementation the computation times in clock cycles for the described exemplary cryptographic algorithms are given.

Core Computation Time Formulas.
The computing time formulas of prime field core operations given in clock cycles are listed in Table 4. For the MontR2 operation computation time a best case and worst case formula is given. In the best case, after the shift operation, the comparator will only evaluate one word and an initial modular subtraction operation is not necessary. For the involved Montgomery Multiplication operations the best case formula is used. In the worst case, after the shift operation the comparator has to evaluate all words and decide that an initial modular subtraction operation is needed. For the involved Montgomery Multiplication operations the worst case formula is used. The number of operations of each kind involved depends on the chosen precision. Table 5 lists the values for all supported $GF(p)$ precisions.
For the Montgomery Exponentiation operation computation time a best case and worst case formula is given. In the best case the exponent operand is 3; therefore only two Montgomery Multiplications and one 2 operation is necessary. For the involved Montgomery Multiplication operations the best case formula is used. In the worst case the exponent is assumed to be 2 (||−1) ; therefore 2⋅(||−2)×, (||− 2) × 2 and (|| − 3) × 2 operations have to be performed. For the involved Montgomery Multiplication operations the worst case formula is used.
For the ModAdd operation computation time a best case and worst case formula is given. In the best case, after the modular addition the CLA adder carry-out bit will not be set, the comparator will only have to evaluate one word, and an additional modular subtraction is not needed. In the worst case the CLA adder carry-out bit will also not be set, but the comparator will have to evaluate all words to decide that an additional modular subtraction is necessary.
For the ModSub operation computation time a best case, worst case, and absolute worst case formula is given. In the best case, after the modular subtraction the CLA adder carry-out bit will not be set, the comparator will only have to evaluate one word, and an additional modular subtraction is not needed. In the worst case, after the modular subtraction the CLA adder carry-out bit will be set and a modular addition must be performed. In the absolute worst case, after the modular subtraction the CLA adder carry-out bit will not be set, the comparator will evaluate all words, and an additional modular subtraction step is necessary. Note that this will only occur if the resulting value after the first subtraction operation is identical to the modulus, which under normal operation conditions will not be the case. The prime field MontMult1 operation is identical to the $GF(p)$ MontMult operation; therefore the same best and worst case formulas apply.
The computing time formulas of binary field core operations given in clock cycles are listed in Table 6.
The computation time of the l operation in (2  ) depends on the specified precision parameter , the number of active processing units , and the number of words  = ⌈(/||)⌉ + 1 running through the pipeline.Since the additions in (2  ) are simple XOR-operations and the most significant bit of the resulting value will never be set after calculation only one formula is given.
While the determination of the Montgomery Parameter  in GF(2^m) differs from the calculation rule for GF(p), it also depends only on the chosen precision  and the specified word width || parameters.
The 2 operation in GF(2^m) is based on shifts and on modular additions that become necessary whenever the most significant bit of the intermediate value is set after a shift. The number of modular additions depends on the Montgomery Parameter , which itself depends on the irreducible polynomial. In order to specify lower and upper computation times, a best-case and a worst-case formula are given. The best case assumes that no modular addition is required at all, whereas the worst case assumes that a modular addition is required after each shift operation.
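The shift-and-reduce procedure described above can be sketched in Python as follows. The sketch assumes that the quantity being computed is x^(2k) mod f(x), that is, the square of a Montgomery radix R = x^k, with k defaulting to the field degree; the radix actually used by the core may differ (for example, it may be aligned to the word width). The function name and the reduction counter are illustrative only.

```python
def gf2m_r2(f, m, k=None):
    """Return (x^(2k) mod f, number of reductions) for f of degree m.

    f is the irreducible polynomial as a bit vector with bit m set.
    k defaults to m (assumed radix R = x^m).
    """
    if k is None:
        k = m
    value = 1
    reductions = 0
    for _ in range(2 * k):
        value <<= 1                 # multiply the intermediate value by x
        if (value >> m) & 1:        # degree reached m: one modular addition
            value ^= f              # addition in GF(2^m) is an XOR with f
            reductions += 1
    return value, reductions
```

The reduction counter is bounded by 0 (no shift ever overflows) and 2k (every shift overflows), which corresponds to the best-case and worst-case assumptions stated in the text.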
For the MontExp operation, a best-case and a worst-case computation time formula are given. In the best case the exponent operand is 3; therefore only two Montgomery Multiplications and one 2 operation are necessary. In the worst case the exponent is assumed to be 2 (−1) ; therefore 2 ⋅ ( − 2) × , ( − 2) × 2 and ( − 3) × 2 operations are required.
For the  operation, a best-case and an absolute worst-case computation time formula are given. In the best case the comparator has to evaluate only one word. In the absolute worst case the comparator has to evaluate all words to decide that an additional modular addition is necessary. Note that this only occurs if the value resulting from the first addition is identical to the modulus polynomial, which is not the case under normal operating conditions.
The binary field 1 operation is identical to the GF(2^m)  operation; therefore the same formula applies.
6.2. Core Variations. Depending on the needs in terms of performance, area consumption, supported precisions, and the interfacing structure, different variations of the core can be generated by defining the parameters MAX PRECISION WIDTH, WORD WIDTH, and MAX NUM PUS. Table 7 lists the resulting number of words  and the possible number of processing units  for the supported prime field precisions || and typical word widths || of 16, 32, and 64 bits.
In contrast, Table 8 lists the resulting number of words  and the number of possible processing units  for the supported binary field precisions  and typical word widths || of 16, 32, and 64 bits. Note that the number of possible processing units for binary fields within the defined core is subject to a further constraint. Once all bits of the  operand have been processed, the remaining processing units in the pipeline must be bypassed and the  and  words must be fed directly into the CLA adder. Since the result of the CLA adder is written back to RAM while the remaining words must still be read from RAM and fed into the first processing unit, the RAM source and destination signals must never address the same memory location at the same time. Therefore the expression ( − ( mod )) mod  must satisfy either ( − ( mod )) ≡ 0 mod , meaning that no processing unit is bypassed, or ( − ( mod )) ≡ 1 mod , meaning that the very last processing unit is bypassed at the last cycle of  operand bits.

6.3. Core Hardware Footprint. Since all components of the design consist of simple logic elements, the proposed arithmetic core is vendor-neutral. In order to estimate the hardware footprint of different core implementations, the design variations have been compiled for Altera and Xilinx FPGAs. Table 9 lists the number of total logic elements and the logic registers they comprise for varied values of WORD WIDTH (||) and MAX NUM PUS, generated for an Altera Cyclone IV (EP4CE115F29C9L) device featuring 114,480 logic elements and 3,981,312 memory bits.
Table 10 lists the number of total logic elements and the logic registers they comprise for varied values of WORD WIDTH (||) and MAX NUM PUS, generated for a Xilinx XC7Z020 (xc7z020clg484-1) device featuring 53,200 logic elements

final design using default settings of a power toggle rate as well as a power input I/O toggle rate of 12.5%, using a vectorless estimation and a board temperature of 25 °C. Table 11 lists the Total Thermal Power Dissipation values for varied WORD WIDTH (||) and MAX NUM PUS parameters generated for the Altera Cyclone IV (EP4CE115F29C9L) device. The values are comparable to the ones given in [24] for RSA calculation.
Furthermore, the optimization mode in the compiler settings was set to balanced and no specific compiler optimizations regarding power were turned on. The results show that the core is suitable for applications with special constraints regarding power consumption. According to such needs, as well as the desired clock frequency, a suitable variation can be implemented. The choice has an impact on computing time and hardware footprint.

operation a total of 17×, 15×2 and 1 × 2 operations will be performed. Since the private exponent differs between RSA keys, only worst-case computation times for the supported precision widths are given. The worst-case RSA private key and CRT-accelerated private key computation times assume the worst-case clock cycle times of the underlying operations given in the previous section.

Table 13 lists the worst-case computation times in clock cycles of the reference implementation for the Miller-Rabin prime testing application for one iteration. Note that the most time-consuming operation is part one of the outer loop of Algorithm 2, which is always performed in each iteration. Depending on the evaluation of the result, it might be necessary to execute part two of the outer loop. Furthermore, depending on the structure of the prime in question, it might be necessary to execute parts one and two of the inner loop multiple times.

Table 14 lists the computation time in clock cycles of the reference implementation for prime field EC operations for all supported precision widths. The Affine-to-Jacobi Transformation step requires a precision-dependent number of clock cycles. For the remaining steps worst-case clock cycle times are given. For the Point Multiplication operation an absolute worst-case computation time is stated in which a theoretical scalar is hypothesized to be 2 ||−1 ; therefore a maximum of (|| − 1) Point Doubling and (|| − 1) Point Addition operations would be necessary, assuming a simple double-and-add algorithm.
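The operation counts quoted in this section follow from the bit structure of the exponent or scalar. A minimal Python sketch, assuming a plain left-to-right double-and-add (or square-and-multiply) schedule in which the leading bit only initializes the accumulator, is given below; the function name is hypothetical and the sketch says nothing about the clock cycles of the individual operations.

```python
def double_and_add_counts(k):
    """Return (doublings, additions) of left-to-right double-and-add for k >= 1."""
    if k < 1:
        raise ValueError("scalar must be positive")
    doublings = k.bit_length() - 1        # one doubling per bit after the leading bit
    additions = bin(k).count("1") - 1     # one addition per set bit after the leading bit
    return doublings, additions
```

Under this schedule an exponent of 0x10001 needs 16 squarings and 1 multiplication, 17 operations in total, and a 256-bit scalar with all bits set needs 255 doublings and 255 additions, which matches the pattern of one operation less than the bit length for each kind in the absolute worst case.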

Conclusion and Future Work
A comprehensive, adaptable hardware structure for efficient prime finite field and binary finite field arithmetic has been proposed. It expands the capabilities of single Montgomery Multiplier hardware designs and allows cryptographic calculations for a large range of different algorithms to be carried out, all based on the same arithmetic unit operations with arbitrary parameters. The approach taken by the proposed core is to combine standard modular addition/subtraction support with the capability of performing Montgomery Multiplications, full Montgomery Exponentiations, and the calculation of the Montgomery Parameters  and  2 for arbitrary moduli, bringing together all arithmetic operations required for a wide range of cryptographic algorithms used today. By breaking down these algorithms, individual operation lists have been derived for the arithmetic unit, rendering extra precomputations in software unnecessary.
The given values of hardware footprint and power consumption for specific core variations allow the proper configuration to be chosen for a specific implementation. The reference implementation showed that with an internal RAM of merely 3.5 kB the core is capable of performing complete prime field and binary field EC operations for various precision widths of standardised curves. Furthermore, the same core configuration is capable of performing (CRT-accelerated) RSA operations for typical precision widths required today, (safe) prime testing/generation, and Diffie-Hellman key exchange operations with precision widths of up to 4096 bits. The design should be optimized further in terms of power consumption.
However, the way some core operations are implemented, such as the Montgomery Multiplication and especially the Montgomery Exponentiation, necessitates additional security considerations, since the calculation times depend on the structure of the processed operands. This makes the design prone to side-channel attacks if security-sensitive information, such as private keys, is processed. Not all operations are critical and need to be secured, however; the calculation of the Montgomery Parameters is one example. Therefore, at the time of writing, the core is being enhanced to provide a secure calculation bit within the command input word which, if set, instructs the core to perform the specified arithmetic operation in a time-invariant fashion. In addition, special care has to be taken when defining core operation lists, for instance, for performing elliptic curve Point Multiplication operations. Formulations that execute in a fixed amount of time, e.g., the Montgomery ladder [25], and thereby mitigate the risk of timing and power analysis attacks, must be chosen.
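To illustrate what a fixed operation sequence looks like, the following Python sketch applies the Montgomery ladder to modular exponentiation as a stand-in for elliptic curve Point Multiplication. It is not the core's operation list: the function name and the explicit bit-length parameter are assumptions, and Python integers and branches are not constant-time, so the sketch only shows that every bit triggers exactly one multiplication and one squaring regardless of its value.

```python
def ladder_pow(x, e, n, bits):
    """Compute x**e mod n with a Montgomery ladder over a fixed number of bits."""
    r0, r1 = 1, x % n                     # invariant: r1 == r0 * x (mod n)
    for i in reversed(range(bits)):
        if (e >> i) & 1:
            r0, r1 = (r0 * r1) % n, (r1 * r1) % n   # one multiply, one square
        else:
            r0, r1 = (r0 * r0) % n, (r0 * r1) % n   # one square, one multiply
    return r0
```

Because the loop always runs over the same number of bits and performs the same two operations per bit, the total number of arithmetic operations no longer depends on the Hamming weight of the exponent, in contrast to a plain square-and-multiply schedule.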

Figure 1: Overall architecture of the Enhanced Montgomery Multiplication Core.

3.2.3. Core RAM Structure. The RAM of the core must be capable of holding all the necessary operands and intermediate values required during the execution of cryptographic algorithms. The basic structure of the described RAM is pictured in Figure 4. It features four symbolic horizontal RAM operand locations of MAX PRECISION WIDTH bits each, which are organized as eight pieces of MAX PRECISION WIDTH/8 bits each. The location named  is intended to hold operand  in Montgomery Multiplication and Montgomery Exponentiation operations; the location named  is intended to hold the modulus. The location  usually holds the temporary sum value during Montgomery Multiplications and Montgomery Exponentiations or the first operand in modular addition or subtraction operations. The location  usually holds the temporary carry stream during Montgomery Multiplications and Montgomery Exponentiations or the second operand in modular addition or subtraction operations. Besides the horizontal RAM operand locations, three symbolic vertical RAM operand locations of MAX PRECISION WIDTH bits each have been defined, which are likewise organized as eight pieces of MAX PRECISION WIDTH/8 bits each. The locations named , , and  are, for convenience, usually used to hold operand  in Montgomery Multiplication and Montgomery Exponentiation operations as well as the exponent operand  and the auxiliary operand  in Montgomery Exponentiation operations. In addition, all RAM slots are intended to hold intermediate values during the execution of cryptographic algorithms.
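The resulting RAM size can be checked with a short Python calculation. It uses the four horizontal and three vertical operand locations described above together with the MAX PRECISION WIDTH of 4096 bits stated for the reference implementation; the variable names are illustrative and the interpretation of "3.5 kB" as 3.5 × 1024 bytes is an assumption.

```python
MAX_PRECISION_WIDTH = 4096        # bits, value of the reference implementation
HORIZONTAL_LOCATIONS = 4          # symbolic horizontal operand locations
VERTICAL_LOCATIONS = 3            # symbolic vertical operand locations
PIECES_PER_LOCATION = 8           # each location is organized as eight pieces

total_bits = (HORIZONTAL_LOCATIONS + VERTICAL_LOCATIONS) * MAX_PRECISION_WIDTH
total_bytes = total_bits // 8
piece_bits = MAX_PRECISION_WIDTH // PIECES_PER_LOCATION

print(total_bits, total_bytes, total_bytes / 1024, piece_bits)
```

Seven locations of 4096 bits give 28,672 bits, that is, 3584 bytes or 3.5 kB, which agrees with the RAM size reported for the reference implementation.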

4.4. MontExp Operation. The MontExp operation code instructs the core to perform a Montgomery Exponentiation consisting of multiple Montgomery Multiplication steps in the given finite field. A Montgomery Exponentiation starts by reading the first -bit word of exponent  from RAM.

Table 3
specified precision (|| for GF(p) and  for GF(2^m)), the word width || parameter for which the core variation has been generated, and the resulting RAM width parameter || with || = 4 ⋅ ||. The operations 2, 2, and 2 exhibit the same computation time, whereas the operation 2 is performed in fewer clock cycles.

Table 5: Precision-dependent values of  and  for the GF(p) 2 operation.

Table 9: Number of logic elements and logic registers for different core variations (Altera Cyclone IV).

Table 10: Number of logic elements and logic registers for different core variations (Xilinx XC7Z020).
MAX PRECISION WIDTH was set to 4096, leading to a RAM consisting of 28,672 bits. Table 12 lists the computation time in clock cycles of the reference implementation for the RSA application. For RSA public-key operations best-case and worst-case computation times are given under the assumption that the public exponent is  = 010001. Therefore during the

Table 12: Core reference implementation RSA computation times.

Table 13: Core reference implementation Miller-Rabin computation times.

Table 14: Core reference implementation prime field EC computation times.