Montgomery Modular Multiplication on Reconfigurable Hardware: Systolic versus Multiplexed Implementation

This paper compares two Montgomery modular multiplication architectures, one systolic and one multiplexed, both targeting FPGA devices. Modular multiplication is the core of modular exponentiation, which is the most important operation of several public-key cryptographic algorithms, including the most popular of them, RSA. The proposed systolic architecture is a high-radix implementation with a one-dimensional array of Processing Elements. The multiplexed implementation is a new alternative composed of multiplier blocks working in parallel with new, simplified Processing Elements, and it provides a pipelined operation mode. We compare the time × area efficiency of both architectures as well as an RSA application. The systolic implementation runs the 1024-bit RSA decryption process in just 3.23 ms, and the multiplexed architecture executes the same operation in 4.36 ms, but the second approach saves up to 28% of logical resources. These results are competitive with the state of the art.


Introduction
Modular multiplication is widely employed in public-key cryptography, especially where modular exponentiation is essential. For instance, the most commonly used asymmetric cryptographic algorithm is the RSA [1]. The RSA security depends on the difficulty of factoring large numbers. Here, large numbers mean prime numbers of up to 4096 bits, used as cryptographic keys.
In this cryptosystem the main operation is the modular exponentiation using the public and private keys, the first to encrypt and the second to decrypt messages. So, the performance of the whole system depends on the efficiency of modular arithmetic implementations.
As modular operations are time consuming, it is common to use hardware devices to perform both the modular multiplication and the exponentiation. Among the hardware approaches, the increased use of reconfigurable devices to implement cryptographic operations, especially the FPGAs, is evident.
One of the most suitable methods for performing modular multiplications in hardware is the Montgomery multiplication [2]. This algorithm is fast and power efficient in hardware implementations. Taking the modular multiplication as A · B mod N, the Montgomery multiplication avoids the division by N by replacing it with right shifts. This method also allows the use of multiprecision arithmetic, which is useful for high-radix operations. High-radix operations in turn make it easier to develop modular multiplication architectures.
Aiming to implement RSA systems based on hardware, many authors proposed Montgomery multiplications in FPGAs [3][4][5][6][7][8][9]. Fully systolic architectures designed to speed up the modular multiplication have been presented. These architectures offer an array of Processing Elements (PEs) where each PE performs arithmetic additions and multiplications in a multiprecision context with carry propagation [10]. Depending on the word size (or radix) used, the architecture can employ a high number of Processing Elements, consequently increasing the demand for logic elements (area) in FPGA implementations.
International Journal of Reconfigurable Computing
As a new alternative in terms of implementation, the execution of additions and multiplications can be multiplexed by a block positioned parallel to the Processing Elements. This can be done by inserting multiplexed multipliers in parallel with Processing Elements. Forcing a pipelined operation mode and using a high-radix architecture (16 or 32 bits), the multiplexed multipliers ensure the high speed performance provided by systolic architectures, with reduced arithmetic and logic elements and also minimal carry signals propagation.
This paper presents a trade-off between two proposed modular multiplication architectures: a systolic implementation and a very high-radix multiplexed implementation. Our approach uses 16-bit and 32-bit words (radix $2^{16}$ and $2^{32}$) in both implementations to speed up the processes and to match the resources of the Xilinx Virtex-4 and Virtex-5 FPGA series [11]. The proposed architectures show significant improvements compared to our previous work [12]. The systolic architecture uses simplified Processing Elements in order to reduce the utilization of FPGA resources. The multiplexed implementation is arranged in arithmetic cores, which allow us to control the number of Processing Elements and multiplier blocks. Our goal is to show that the small increase in the number of clock cycles caused by the multiplexed multipliers is compensated by a significant reduction in the use of logic and arithmetic resources.
This paper is organized as follows: Section 2 presents the Montgomery modular multiplication algorithm. Section 3 discusses related state-of-the-art works. The proposed architectures are presented in Section 4. Finally, the results and conclusion are presented in Sections 5 and 6, respectively.

Montgomery Modular Multiplication
The Montgomery Multiplication Algorithm is a method of performing modular multiplication A · B mod N without needing to divide by N . In cryptography, the Montgomery Algorithm is very suitable for the hardware implementation of modular multiplication, because it allows long integer numbers to be represented in a numeric precision given by a radix (generally a power of two).
The algorithm version used in this work is the original one, with some preconditions. Algorithm 1 shows the modular multiplication with the notation proposed in [13], which is used for the remainder of this text.
The value $N'$ is the negated modular inverse of $N$ modulo $2^k$, computed so that $N \cdot N' \equiv -1 \bmod 2^k$; this is the condition that makes $S + a_i \cdot B + q_i \cdot N$ a multiple of $2^k$ in each iteration. The final result is placed in $S$ after $m$ iterations and is equal to $A \cdot B \cdot R^{-1} \bmod N$, with $R = 2^{km}$, which must be corrected to retrieve the expected result ($A \cdot B \bmod N$). The correction is done by performing an additional Montgomery multiplication with $S$ and $R^2 \bmod N$ as parameters. This correction is inexpensive during a modular exponentiation, because it only needs to be made once, after the whole exponentiation.
Since its publication in 1985 by Montgomery [2], the Montgomery Algorithm has undergone many modifications and improvements [14,15]. One of those is particularly interesting, because it avoids the final subtraction simply by choosing the input data correctly. By limiting the operands $A$ and $B$ to integers less than $2N$ and by defining $2N$ as less than $2^{km}$, the final $S$ is guaranteed to be less than $N$ [15]. These preconditions are shown in Algorithm 1 and applied to our architecture, as explained in Section 4.
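The word-serial iteration described in this section can be sketched in a few lines of Python. This is a behavioural model under stated assumptions, not the listing of Algorithm 1: the function name `montgomery_multiply` and its parameters are hypothetical, $N$ is assumed odd, and a final conditional subtraction is kept as a safeguard instead of relying on the operand preconditions of [15].

```python
def montgomery_multiply(a, b, n, k, m):
    """Return a * b * R^-1 mod n, with R = 2**(k*m), scanning a in k-bit words.

    Assumes n is odd and a, b < n < 2**(k*m). Requires Python 3.8+ for
    pow(n, -1, modulus).
    """
    mask = (1 << k) - 1
    # N' = -N^-1 mod 2^k, so that N * N' = -1 mod 2^k
    n_prime = (-pow(n, -1, 1 << k)) & mask
    s = 0
    for i in range(m):
        a_i = (a >> (k * i)) & mask                      # i-th k-bit word of A
        # quotient: q_i = (S_0 + a_i * B_0) * N' mod 2^k
        q_i = (((s & mask) + a_i * (b & mask)) * n_prime) & mask
        # S + q_i*N + a_i*B is now a multiple of 2^k, so the shift is exact
        s = (s + q_i * n + a_i * b) >> k
    # Safeguard (an assumption of this sketch): bring S into [0, n)
    return s - n if s >= n else s


if __name__ == "__main__":
    n = (1 << 61) - 1          # odd modulus used only for illustration
    k, m = 16, 4               # four 16-bit words, R = 2**64
    r = 1 << (k * m)
    a, b = 0x123456789ABCDEF, 0x0FEDCBA987654321
    s = montgomery_multiply(a, b, n, k, m)
    # One more Montgomery multiplication by R^2 mod n removes the R^-1 factor
    print(montgomery_multiply(s, (r * r) % n, n, k, m) == (a * b) % n)
```

The second call in the usage part mirrors the correction step described above: multiplying $S$ by $R^2 \bmod N$ in the Montgomery sense returns $A \cdot B \bmod N$.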

Related Works
Tenca and Koç are widely referenced for their work on radix-2 Montgomery Algorithm implementations. These authors initially proposed architectures with improvements for the radix-2 Montgomery Algorithm, as in [16]. Even though the input operands are large numbers, radix-2 modular multiplications avoid the expensive multiplications that appear in high-radix implementations (radix 8 or more). Unlike the classic radix-2 Montgomery Algorithm [13], Tenca and Koç's modifications make the modular multiplication architecture scalable, that is, their Montgomery multiplier is able to work with any precision of the input operands. In terms of hardware implementation, it is a systolic array architecture composed of Processing Elements and control blocks for managing the I/O words of the architecture. Each Processing Element contains only a few logic elements, providing a reduced area and a high clock frequency when synthesized for FPGA or ASIC.
Based on the above work, improvements to the Tenca and Koç proposal are presented in [4,17]. The advantage of these new approaches lies in the optimization of the Processing Elements and, consequently, in the reduction of the latency of the Montgomery modular multiplication by a factor of at least two, that is, the modular multiplication is at least twice as fast as in [16]. So, the main contributions are the improved modular multiplication speed and the reduced number of logic elements in the Processing Elements. In [4], a radix-4 scalable Montgomery modular multiplication architecture is proposed to enhance the speed. Despite improvements in speed, these radix-2 and radix-4 architectures are still limited by the large number of clock cycles required.
Furthermore, in the context of high-radix implementations, a systolic architecture is presented in [3] which is composed of Processing Elements able to provide modular multiplication for a radix greater than 4. Despite its time and area efficiency, this architecture requires preprocessing before the modular multiplication execution. The authors make use of the optimized Montgomery algorithm initially proposed in [14], which presented a way to simplify the computation of the quotient $q_i$, reducing it to a simple truncation operation $S \bmod 2^k$. However, as a consequence, the input operands must meet additional limitations, and the optimized Montgomery Algorithm needs three additional iterations, because the B input operand is left shifted by 2 k and has to be corrected with these further iterations.
To avoid preprocessing in a high-radix modular multiplication, [5] presents a fully systolic array architecture composed of Processing Elements containing internal multipliers and adders. The Montgomery algorithm version used in this implementation is also the optimized version proposed in [14]. As a radix-16 implementation, the modular multiplications take only 103 clock cycles, significantly fewer than other architectures [3,16,17].

The Proposed Architectures
The proposed architectures for performing Montgomery modular multiplication are detailed in this section. First, the systolic architecture is described in detail as well as the Processing Elements behaviour. Second, the multiplexed and systolic Montgomery modular multiplication architecture is presented.

The Systolic Architecture.
The concept of systolic architecture combines a highly parallel array of identical Processing Elements or data-paths with local connections, which take external inputs and process them in a predetermined manner and in a pipelined fashion.
The proposed systolic architecture is directly based on the arithmetic operations of the Montgomery Algorithm, which are performed in a numerical base $2^k$, in which the large input operands are processed in a multi-precision context containing $m$ words of $k$ bits. As seen in Section 2, the Montgomery Algorithm has additions and multiplications involving large integers that make use of multiple-precision arithmetic.
The architecture is composed of $m$ Processing Elements distributed in a one-dimensional array, where each Processing Element is responsible for the computation involving the $k$-bit words of the input operands with the same index as the Processing Element. For example, for a 1024-bit modular multiplication with 32-bit words, the operands are split into 32 words of 32 bits, which results in a one-dimensional array of 32 Processing Elements.
Between the Processing Elements, there is a propagation of carry signals which are the most significant bits of the arithmetic processes in each PE. The carry signals are processed as input parameters by the Processing Elements that receive them.
In the systolic architecture, the Processing Elements are designed as finite state machines. The control block communicates with the first Processing Element (PE1) and with the block responsible for the quotient calculation $q_i = (S_0 + a_i \cdot B_0) N' \bmod 2^k$, according to line 4 of the Montgomery Algorithm. Figure 1 presents the systolic architecture.
The finite state machine of the control block is designed to provide the words required for a modular multiplication to the Processing Elements and to the quotient block. Thus, at each Montgomery Algorithm iteration, these words are read from an external RAM memory and passed to the rest of the architecture. At the end of the modular multiplication, the control block provides the Montgomery multiplication result $A \cdot B \cdot R^{-1} \bmod N$ through an output multiplexer.
The one-dimensional array of Processing Elements performs the calculation of $S_{i+1} = (S_i + q_i \times N + a_i \times B)/2^k$, according to the Montgomery Algorithm. This operation contains two multiplications between an input operand and a $k$-bit word, followed by the addition of the two products. Therefore, the systolic architecture works in a multi-precision context, and each Processing Element is responsible for performing the arithmetic operations involving one word of each input operand. Thus, the number of words of each operand is equal to the number of Processing Elements. Figure 2 shows the flowchart of the arithmetic operations within each Processing Element.
According to Figure 2, the multiplication between the $a_i$ and $B_i$ words returns a $2k$-bit result, whose least significant bits are added to the least significant bits of the $q_i \times N_i$ multiplication result. Finally, the least significant bits of this sum are added to a $k$-bit word of the $S$ result of the previous iteration. The carry signals propagated to the next Processing Element are the most significant bits of the two multiplications and the most significant bits of the last addition.
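The data flow through the one-dimensional array can be illustrated with a short behavioural model. It is a sketch under assumptions, not the authors' RTL: the several carry signals of Figure 2 are lumped into a single integer carry, the word index is written `j` to keep it apart from the iteration index, and all function names are hypothetical.

```python
def to_words(x, k, m):
    """Split integer x into m words of k bits, least significant word first."""
    mask = (1 << k) - 1
    return [(x >> (k * j)) & mask for j in range(m)]


def pe_step(a_i, q_i, b_j, n_j, s_j, carry_in, k):
    """One Processing Element: two k x k products plus additions.

    Returns (k-bit result word, carry passed to the next PE).
    """
    total = a_i * b_j + q_i * n_j + s_j + carry_in
    return total & ((1 << k) - 1), total >> k


def array_iteration(a_i, q_i, b_w, n_w, s_w, k):
    """One iteration of S = (S + q_i*N + a_i*B) / 2^k across the PE array."""
    out, carry = [], 0
    for j in range(len(b_w)):
        word, carry = pe_step(a_i, q_i, b_w[j], n_w[j], s_w[j], carry, k)
        if j == 0:
            assert word == 0      # q_i makes the lowest word vanish (the /2^k)
        else:
            out.append(word)      # PE j produces word j-1 of the new S
    out.append(carry)             # last PE also emits the top word from carries
    return out


def montgomery_by_array(a, b, n, k, m):
    """Montgomery product computed word by word; result lies in [0, 2n)."""
    mask = (1 << k) - 1
    n_prime = (-pow(n, -1, 1 << k)) & mask    # N' = -N^-1 mod 2^k
    b_w, n_w, s_w = to_words(b, k, m), to_words(n, k, m), [0] * m
    for a_i in to_words(a, k, m):
        q_i = (((s_w[0] + a_i * b_w[0]) & mask) * n_prime) & mask
        s_w = array_iteration(a_i, q_i, b_w, n_w, s_w, k)
    return sum(w << (k * j) for j, w in enumerate(s_w))
```

The model reproduces two details of the description: the first Processing Element discards its result word, which implements the division by $2^k$, and the last one delivers an extra word built from the incoming carries.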

General Processing Element.
The other Processing Elements are different from PE1 because they have a word from the $S$ result as output and they also transmit and receive the carry signals of the multi-precision multiplications and additions. Each Processing Element is activated by the previous Processing Element when the latter finishes its calculation and sends out its carry signals, which means that the architecture works with a pipeline behaviour. Only the last Processing Element provides two words of the $S$ result as a response at each iteration of Algorithm 1, because the $S_{m-1}$ word is obtained as a sum of carry signals. To avoid instantiating a new Processing Element just to perform this sum, it is calculated in the last Processing Element. Figure 4 presents the internal architecture of the general Processing Elements.

Quotient Block.
At each iteration of Algorithm 1, line 3 presents the computation of the quotient $q_i$ so that $S + a_i \cdot B + q_i \cdot N$ becomes a multiple of $2^k$. The internal architecture of the quotient block is shown in Figure 5. This structure has a combinational behaviour where the $q_i$ result is obtained in one clock cycle. $S_0$, $a_i$, $B_0$, and $N'$ are $k$-bit words which are provided to this block at each iteration of Algorithm 1.
The zero index of $B$ and $S$ means that these words contain the $k$ least significant bits (LSBs) of the $B$ and $S$ operands, respectively. As we can see on the right side of Figure 5, a multiplication between $a_i$ and $B_0$ provides a $2k$-bit result. Just the LSB part of this result is used in the next operation. Another input of the quotient block, $S_0$, is then added to the LSB part obtained from the first multiplication. Again, we only need the LSB part of this addition, which is finally multiplied by $N'$, the negated modular inverse of $N$ modulo $2^k$. The LSB part of this last multiplication is the desired $q_i$ result. As seen in Algorithm 1, the numerical base is a power of 2, so in hardware the $\bmod\ 2^k$ operation is simply performed by keeping the $k$ least significant bits (LSB selection).
So, the complexity of the quotient block amounts to two single-precision multiplications and one single-precision addition. To evaluate the number of clock cycles for a modular multiplication, we have to consider the first $m$ cycles, which read the $A$ and $B$ operands from RAM memories for a square or a modular multiplication, respectively. The first iteration of Algorithm 1 also needs $m$ clock cycles. The remaining iterations of Algorithm 1 are performed in $4 \times m$ clock cycles.
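The quotient datapath described above, with its LSB selection after every step, can be checked with a minimal sketch. The function name `quotient_block` and the sample word values are illustrative assumptions; `n_prime` stands for $N' = -N^{-1} \bmod 2^k$ computed from the least significant word of $N$.

```python
def quotient_block(s0, a_i, b0, n_prime, k):
    """q_i = (S_0 + a_i * B_0) * N' mod 2^k, truncating to k bits at each step."""
    mask = (1 << k) - 1
    prod_lsb = (a_i * b0) & mask          # LSB half of the 2k-bit product
    sum_lsb = (prod_lsb + s0) & mask      # LSB part of the addition
    return (sum_lsb * n_prime) & mask     # LSB half of the second product


if __name__ == "__main__":
    k = 16
    n0 = 0xBEEF                           # least significant word of an odd N
    n_prime = (-pow(n0, -1, 1 << k)) & ((1 << k) - 1)
    s0, a_i, b0 = 0x1234, 0xFFFF, 0xABCD
    q_i = quotient_block(s0, a_i, b0, n_prime, k)
    # The lowest k bits of S + a_i*B + q_i*N are now zero
    print((s0 + a_i * b0 + q_i * n0) % (1 << k) == 0)
```

Truncating after each operation gives the same $q_i$ as reducing the full expression once, which is why two $k \times k$ multipliers and one $k$-bit adder are sufficient.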

The Multiplexed Systolic Architecture.
As seen in the previous section, the systolic architecture presents a one-dimensional array of Processing Elements, and each PE is responsible for addition and multiplication operations. When the numerical base ($2^k$) is high ($2^{16}$, $2^{32}$), the internal multiplications become more complex, mainly if the design is applied to an FPGA or an ASIC. So, as the number of multipliers increases, the physical limitations increase proportionally, for example, in the maximum clock frequency and in the area.
Based on these constraints, a multiplexed and systolic architecture with multiplier blocks working in parallel with the Processing Elements is presented in this section. It moves the $k \times k$-bit multipliers from the Processing Elements to the multiplier blocks. Each multiplier block, together with four Processing Elements, forms an arithmetic core. The one-dimensional arrangement of these arithmetic cores forms the structure of the modular multiplication architecture. Figures 6 and 7 show the multiplexed and systolic architecture and the arithmetic core structure, respectively.
The multiplexed architecture is composed of exactly $m/4$ arithmetic cores, and the first one is managed by a control block designed as a finite state machine. According to Figure 7, each arithmetic core contains four Processing Elements, a multiplier, and an $8 \times k$-bit RAM memory. Being a multi-precision arithmetic architecture, the number of Processing Elements is equivalent to the number of words in each input operand. So, the RAM memory placed in each arithmetic core stores four words of the $B$ operand and four words of the $N$ operand. The multiplier block performs the $q_i \times N_i$ and $a_i \times B_i$ multiplications. The least significant bits of the $q_i \times N_i$ multiplication are added to a $k$-bit word of the previous $S_i$ result. The least significant bits of this addition and the least significant bits of the $a_i \times B_i$ multiplication are sent to the Processing Elements to be added. The Processing Elements provide the $S$ words of the current iteration result. Figure 8 illustrates the operations performed by Arithmetic Core 1. This illustration shows that, instead of having two single-precision multiplications in each Processing Element, there is a multiplier block that performs all single-precision multiplications for a total of four Processing Elements. In other words, the quantity of single-precision multiplications is reduced by a factor of four. With these improvements, each Processing Element needs to perform just one addition.
Figure 9: Multiplier block architecture.
The calculation of the quotient q i is performed by a block with architecture that is identical to that of the quotient block presented in the systolic architecture.
The Montgomery Algorithm's multiplications are made by a multiplier block that utilizes the multipliers available in the FPGA. The internal architecture of the multiplier blocks is shown in Figure 9.
The carry signals propagated inside the multiplexed architecture are the $k$ most significant bits of the $q_i \cdot N$ and $S_i + a_i \cdot B$ operations presented in Algorithm 1, and they are propagated between the multiplier blocks. The last multiplier block sends its carry signals to the fourth and final Processing Element placed in the last arithmetic core. The other carry signal, $C_{PE}$, consists of the most significant bits of the result of the addition between the $q_i \cdot N$ and $S_i + a_i \cdot B$ terms. This last addition is performed by the Processing Elements.
At the end of iteration $m - 1$, the result $S_{i+1} = A \cdot B \cdot R^{-1} \bmod N$ is sent out by an $m{:}1$ multiplexer of $k$-bit words. This result is sent to the memory that is part of the modular exponentiation architecture (described in the next section). In terms of clock cycles for the Montgomery modular multiplication, we can define the following: initially, $m$ clock cycles are reserved for the internal storage of the $B$ operand, which is read from RAM memories. Considering that the $N$ modulus is already available in the internal RAM memories placed in the arithmetic cores, the first iteration also takes $m$ clock cycles, and the architecture takes $6 \times m$ clock cycles to perform the remaining iterations of the Montgomery Algorithm. Thus, the total number of clock cycles for a modular multiplication (or squaring) is $n_{MM} = m + m + 6 \times m = 8m$.
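The cycle counts stated for the two architectures can be tabulated with a small helper. This is only arithmetic on the figures given in the text; the function names are hypothetical, and the systolic total assumes that its three phases ($m$ cycles for operand reading, $m$ for the first iteration, $4m$ for the remaining iterations) simply add up, as they do in the multiplexed case.

```python
def words_per_operand(n_bits, k):
    """Number m of k-bit words needed for an n_bits operand."""
    return (n_bits + k - 1) // k


def cycles_systolic(m):
    """Operand read + first iteration + remaining iterations (assumed additive)."""
    return m + m + 4 * m


def cycles_multiplexed(m):
    """n_MM = m + m + 6m = 8m, as given for the multiplexed architecture."""
    return m + m + 6 * m


if __name__ == "__main__":
    for k in (16, 32):
        m = words_per_operand(1024, k)
        print(k, m, cycles_systolic(m), cycles_multiplexed(m))
```

For a 1024-bit operand with 32-bit words, $m = 32$, so this sketch gives 192 cycles for the systolic array and 256 cycles for the multiplexed one; clock frequencies are not modelled here.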

The Processing Elements (PEs).
The proposed modular multiplication architecture is composed of $m$ Processing Elements (where $m$ is the number of words of the operands and also the number of iterations of Algorithm 1). Due to the placement of a multiplier block in each arithmetic core, each Processing Element needs to perform just one addition between two $2k$-bit words and sends out a word of the $S_{i+1}$ result at each iteration of Algorithm 1. The first Processing Element must discard the least significant bits of its first addition in order to perform the right shift operation, which corresponds to the division of $(S_i + q_i \times N + a_i \times B)$ by $2^k$. The remaining Processing Elements perform the addition between the $(S_i + a_i \times B)$ and $q_i \times N$ terms, and the resulting $k$ least significant bits of this addition are sent out as a word of the $S$ result. The $k + 1$ most significant bits are sent to the next Processing Element as a carry signal. The last Processing Element (PE$_m$) is responsible for providing two words of the $S$ result ($S_{m-2}$ and $S_{m-1}$), considering that the input words for the $S_{m-1}$ computation are the carry signals from the last multiplier block. Figure 10 shows the first, the general case, and the last Processing Elements.

Modular Exponentiation.
For a real cryptographic application concerning the RSA algorithm, a modular exponentiation structure that incorporates the modular multiplication architecture is proposed in this section. The modular exponentiation algorithm used in this work is left-to-right square and multiply [13], and thus on average $1.5 \times n$ modular multiplications (including square and multiply executions) are performed to achieve the final exponentiation result, where $n$ is the operand's precision. Algorithm 2 shows the Montgomery modular exponentiation algorithm. Four Block RAM memories generated through the Xilinx Coregen tool were placed to store the input operands of size $n$. These input operands are the $N$ modulus, the $E$ exponent, the message $X$ converted to the Montgomery domain ($X \cdot R \bmod N$), and an auxiliary term $A = R \bmod N$. A control block with a finite state machine manages the read and write operations from the memories (see Figure 11). The results of the successive modular multiplications are stored in the RAM memory that previously stored the $A = R \bmod N$ operand, because this operand is necessary just in the first square execution.

Results
Table 1 summarizes the FPGA synthesis results of the two proposed modular multiplication architectures. The designs were described in hardware description languages (VHDL and Verilog) and synthesized for Xilinx Virtex-4 and Virtex-5 FPGAs. All results are post-implementation, and no area or speed optimizations were set for the synthesis. The results presented in this paper are improvements when compared with our previous work [12]. The multiplexed architecture is implemented with a reduced number of slice registers and DSP48s. However, synthesis of the systolic architecture yielded higher clock frequencies. Table 2 presents RSA encryption and decryption applications of the proposed architectures.
Since the modular exponentiation is performed by successive modular multiplication executions, the left-to-right (MSB-first) binary square and multiply algorithm was employed in the modular exponentiation. The results show that, considering the number of clock cycles for a modular multiplication execution, the multiplexed architecture is faster than the systolic implementation. On the other hand, the systolic architecture has a higher clock frequency than the multiplexed architecture. Table 3 shows a state-of-the-art comparison with our results. Every work referred to in this table used the Montgomery Algorithm for its hardware modular multiplication architecture, and for a direct comparison with our approaches only 1024-bit applications are shown. The times of modular multiplications, when not given in the references, are estimated considering a modular exponentiation of $n = 1024$ bits through the square and multiply algorithm, running $1.5n$ modular multiplications.
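The left-to-right square and multiply exponentiation in the Montgomery domain can be sketched as follows. This is a textbook-style software model, not the listing of Algorithm 2: the names `mont_mul` and `mont_exp` are hypothetical, the multiplier keeps a final conditional subtraction as a safeguard, and the result leaves the Montgomery domain through one multiplication by 1.

```python
def mont_mul(a, b, n, k, m):
    """a * b * R^-1 mod n with R = 2**(k*m); n odd, a, b < n (Python 3.8+)."""
    mask = (1 << k) - 1
    n_prime = (-pow(n, -1, 1 << k)) & mask
    s = 0
    for i in range(m):
        a_i = (a >> (k * i)) & mask
        q_i = (((s & mask) + a_i * (b & mask)) * n_prime) & mask
        s = (s + q_i * n + a_i * b) >> k
    return s - n if s >= n else s


def mont_exp(x, e, n, k, m):
    """Return (x**e mod n, number of Montgomery multiplications used)."""
    r = 1 << (k * m)
    x_bar = (x * r) % n          # message in the Montgomery domain
    acc = r % n                  # auxiliary term A = R mod N
    count = 0
    for bit in bin(e)[2:]:       # scan exponent bits from the MSB
        acc = mont_mul(acc, acc, n, k, m)        # square
        count += 1
        if bit == "1":
            acc = mont_mul(acc, x_bar, n, k, m)  # multiply
            count += 1
    # One multiplication by 1 removes the remaining factor R
    return mont_mul(acc, 1, n, k, m), count


if __name__ == "__main__":
    n = (1 << 61) - 1
    result, count = mont_exp(0xC0FFEE, 65537, n, 16, 4)
    print(result == pow(0xC0FFEE, 65537, n), count)
```

The counter makes the cost model visible: every exponent bit costs one squaring and every set bit one additional multiplication, which gives about $1.5n$ multiplications for a random $n$-bit exponent.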

Conclusion
This paper presented two Montgomery modular multiplication architectures and the results of their synthesis for Xilinx Virtex-4 and Virtex-5 FPGAs. A systolic implementation and a multiplexed implementation, suitable for the RSA public-key cryptosystem, were developed, and the designs were carefully matched with features of the FPGAs, utilizing embedded DSP48E slices and Block RAM. The designs are improvements of a previous work. The multiplexed implementation presented a good performance considering time × area efficiency. The systolic architecture can run the 1024-bit RSA decryption process in 3.23 ms, and the multiplexed implementation executes the same operation in 4.36 ms. Because of the multiplexed approach, the architecture is scalable. If the key size increases, the architecture can be easily modified by adding arithmetic cores, keeping the performance. Another speed improvement can be achieved by using a parallel modular exponentiation algorithm, for example, the Montgomery Powering Ladder [18], where a full modular exponentiation would be performed in exactly $n \times n_{MM}$ clock cycles, that is, 33% faster than the square and multiply algorithm.
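The ladder mentioned above can be sketched in software to show where the saving comes from. This is an illustration under assumptions: plain modular multiplication replaces the Montgomery multiplier for brevity, the names are hypothetical, and the $n \times n_{MM}$ figure is read as assuming that the two multiplications of each step run in parallel on two multipliers.

```python
def ladder_pow(x, e, n):
    """Montgomery Powering Ladder: one multiply and one square per exponent bit."""
    r0, r1 = 1, x % n
    for bit in bin(e)[2:]:            # scan exponent bits from the MSB
        if bit == "0":
            r1 = (r0 * r1) % n        # the two updates are independent of
            r0 = (r0 * r0) % n        # each other and could run in parallel
        else:
            r0 = (r0 * r1) % n
            r1 = (r1 * r1) % n
    return r0


def relative_saving(n_bits, n_mm):
    """Fraction of cycles saved versus square and multiply (1.5 * n * n_MM)."""
    square_and_multiply = 1.5 * n_bits * n_mm
    ladder = n_bits * n_mm
    return 1 - ladder / square_and_multiply


if __name__ == "__main__":
    n = (1 << 61) - 1
    print(ladder_pow(0xC0FFEE, 65537, n) == pow(0xC0FFEE, 65537, n))
    print(relative_saving(1024, 256))
```

The saving is independent of $n_{MM}$ and of the key size: going from $1.5n$ to $n$ sequential multiplication slots removes one third of the cycles, which matches the 33% figure.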