This paper describes a comparison of two Montgomery modular multiplication architectures: a systolic one and a multiplexed one. Both implementations target FPGA devices. Modular multiplication is employed in modular exponentiation, which is the most important operation of some public-key cryptographic algorithms, including the most popular of them, RSA. The proposed systolic architecture is a high-radix implementation with a one-dimensional array of Processing Elements. The multiplexed implementation is a new alternative composed of multiplier blocks in parallel with new simplified Processing Elements, and it provides a pipelined operation mode. We compare the

Modular multiplication is widely employed in public-key cryptography, especially where modular exponentiation is essential. For instance, the most commonly used asymmetric cryptographic algorithm is the RSA [

In this cryptosystem the main operation is the modular exponentiation using the public and private keys, the first to encrypt and the second to decrypt messages. So, the performance of the whole system depends on the efficiency of modular arithmetic implementations.

As modular operations are time consuming, it is common to use hardware devices to perform both the modular multiplication and the exponentiation. Among the hardware approaches, reconfigurable devices, especially FPGAs, are increasingly used to implement cryptographic operations.

One of the most suitable methods for performing modular multiplications in hardware is the Montgomery multiplication [

Aiming to implement RSA systems based on hardware, many authors proposed Montgomery multiplications in FPGAs [

As a new alternative in terms of implementation, the additions and multiplications can be multiplexed by inserting multiplexed multipliers in parallel with the Processing Elements. By forcing a pipelined operation mode and using a high-radix architecture (16 or 32 bits), the multiplexed multipliers retain the high speed of systolic architectures while using fewer arithmetic and logic elements and minimal carry-signal propagation.

This paper examines the trade-off between two proposed modular multiplication architectures: a systolic implementation and a very high-radix multiplexed implementation. Our approach uses radix-16 and radix-32 in both implementations to speed up the processes and to match the resource usage of Virtex-4 and Virtex-5 Xilinx FPGA Series [

This paper is organized as follows: Section

The Montgomery Multiplication Algorithm is a method of performing modular multiplication
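For reference, a word-serial form of Montgomery multiplication can be sketched in software as follows. This is a textbook formulation with a final conditional subtraction, and it is not necessarily the exact variant or set of preconditions used in this work. The names `word_bits`, `num_words`, and `n0_inv` are illustrative assumptions, and the modulus is assumed to be odd.

```python
def montgomery_multiply(a, b, n, word_bits, num_words):
    """Return a * b * R^(-1) mod n, with R = 2**(word_bits * num_words).

    Assumes n is odd, a < n, b < n, and n < R (textbook preconditions).
    """
    beta = 1 << word_bits           # numerical base, e.g. 2**16 or 2**32
    mask = beta - 1
    # Precomputed constant: -n^(-1) mod beta (requires Python 3.8+ for pow(..., -1, ...)).
    n0_inv = (-pow(n, -1, beta)) % beta

    s = 0
    for i in range(num_words):
        a_i = (a >> (i * word_bits)) & mask   # i-th word of the multiplier
        s += a_i * b                          # partial product accumulation
        q = ((s & mask) * n0_inv) & mask      # quotient digit for this iteration
        s = (s + q * n) >> word_bits          # exact division by beta
    if s >= n:                                # final conditional subtraction
        s -= n
    return s
```

With 16-bit words, a 1024-bit operand corresponds to `num_words = 64`; each loop iteration is the work that a hardware implementation spreads across its processing elements.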

The algorithm version used in this work is the original one, with some preconditions. Algorithm

The

Since its publication in 1985 by Montgomery [

Tenca and Koç are widely referenced for their work on radix-2 Montgomery Algorithm implementations. These authors initially proposed architectures with improvements for the radix-2 Montgomery Algorithm, like in [

Based on the above work, in [

Furthermore, in the context of high-radix implementations, a systolic architecture is presented in [

To avoid preprocessing in a high-radix modular multiplication, [

The proposed architectures for performing Montgomery modular multiplication are detailed in this section. First, the systolic architecture and the behaviour of its Processing Elements are described in detail. Second, the multiplexed and systolic Montgomery modular multiplication architecture is presented.

The concept of systolic architecture combines a highly parallel array of identical Processing Elements or data-paths with local connections, which take external inputs and process them in a predetermined manner and in a pipelined fashion.

The proposed systolic architecture is directly based on the arithmetic operations of the Montgomery Algorithm, which are performed in a numerical base

The architecture is composed of

Between the Processing Elements, there is a propagation of carry signals which are the most significant bits of the arithmetic processes in each PE. The carry signals are processed as input parameters by the Processing Elements that receive them.

In the systolic architecture, the Processing Elements are designed as finite state machines. The control block communicates with the first Processing Element (PE1) and with the block responsible for the quotient calculation

Systolic Architecture.

The finite state machine structure of the control block is designed to provide the required words for a modular multiplication to the Processing Elements and to the quotient block. Thus, at each Montgomery Algorithm iteration, these words are read from an external RAM memory and passed to the remaining architecture. At the end of the modular multiplication, the control block provides the Montgomery multiplication result

The one-dimensional array of Processing Elements performs the calculation of

Arithmetic operations of the Processing Elements to obtain

According to Figure

The first Processing Element (PE1) establishes communication with the control block and receives

First Processing Element internal architecture.

The other Processing Elements are different from PE1 because they have a word from the

General processing element internal architecture.

At each iteration of Algorithm

Internal architecture of the quotient block.

The zero index of

So, the complexity of the quotient block relies on two single precision multiplications and one single precision addition. To evaluate the number of clock cycles for a modular multiplication, we have to consider the first
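The cost stated above (two single-precision multiplications and one single-precision addition) can be illustrated with a small software model. This sketch assumes the standard Montgomery quotient formula, q = ((s_0 + a_i · b_0) · n0') mod β with n0' = -n^{-1} mod β; the function and parameter names are hypothetical and do not come from the described hardware.

```python
def neg_inverse_mod_word(n, word_bits):
    """Return -n^(-1) mod 2**word_bits for an odd modulus n (precomputed once)."""
    beta = 1 << word_bits
    return (-pow(n, -1, beta)) % beta


def quotient_digit(s0, a_i, b0, n0_inv, word_bits):
    """Quotient digit from the least significant words only.

    s0, b0: least significant words of the accumulator and of operand b.
    a_i:    current word of operand a.
    """
    mask = (1 << word_bits) - 1
    t = (s0 + a_i * b0) & mask     # first multiplication and the addition
    return (t * n0_inv) & mask     # second multiplication
```

The digit is chosen so that the accumulated sum plus q times the modulus is divisible by β, which is what allows the word shift in each iteration.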

As seen in the previous section, the systolic architecture presents a one-dimensional array of Processing Elements, and each PE is responsible for addition and multiplication operations. When the numerical base is large (2^{16}, 2^{32}), the internal multiplications become more complex, especially if the design targets an FPGA or an ASIC. So, as the number of multipliers increases, the physical limitations grow proportionally, for example in the maximum clock frequency and area.

Based on these constraints, a multiplexed and systolic architecture with multiplier blocks working in parallel with the Processing Elements is presented in this section. It provides a migration of

Proposed multiplexed modular multiplication architecture.

The Arithmetic core architecture.

The multiplexed architecture is composed of exactly

The multiplier block performs the

Arithmetic operations performed in arithmetic core 1.

The calculation of the quotient

The Montgomery Algorithm's multiplications are made by a multiplier block that utilizes the multipliers available in the FPGA. The internal architecture of the multiplier blocks is shown in Figure

Multiplier block architecture.

The carry signals propagated inside the multiplexed architecture are the

At the end of the

In terms of clock cycles for the Montgomery modular multiplication, we can define the following: initially,

The proposed modular multiplication architecture is composed of

The remaining Processing Elements perform the addition between

General PE with the carry propagation.

For a real cryptographic application concerning the RSA algorithm, a modular exponentiation structure that incorporates the modular multiplication architecture is proposed in this section. The modular exponentiation algorithm used in this work is left-to-right square and multiply [
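The left-to-right square-and-multiply method named above can be sketched as follows. This is a generic software illustration, not the proposed exponentiation architecture: plain modular products stand in for the Montgomery multiplications, and the returned operation count is added only for illustration.

```python
def mod_exp_left_to_right(base, exponent, n):
    """Left-to-right binary exponentiation.

    Returns (base**exponent mod n, number of modular multiplications used).
    In a hardware design each product below would be one modular multiplication.
    """
    result = 1
    mults = 0
    for bit in bin(exponent)[2:]:          # scan exponent bits, most significant first
        result = (result * result) % n     # square on every bit
        mults += 1
        if bit == "1":
            result = (result * base) % n   # multiply only when the bit is set
            mults += 1
    return result, mults
```

For a k-bit exponent with about half of its bits set, this scheme performs roughly 1.5k modular multiplications, so the latency of a single modular multiplication dominates the exponentiation time.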

Four Block RAM memories generated through

Modular exponentiation architecture.

The results of the successive modular multiplications are stored in the RAM memory that previously stored the

Table

Proposed architectures synthesis.

FPGA | Architecture | Operand size (bits) | Radix (bits) | Slices | Clock cycles | DSP48 | Freq. (MHz) | BRAM (Bytes) |
---|---|---|---|---|---|---|---|---|
Virtex-4 | Systolic | 512 | 16 | 3322 | 192 | 68 | 110 | 128 |
Virtex-4 | Systolic | 512 | 32 | 4199 | 96 | 36 | 78 | 128 |
Virtex-4 | Systolic | 1024 | 16 | 7012 | 384 | 130 | 110 | 256 |
Virtex-4 | Multiplexed | 512 | 16 | 2199 | 256 | 32 | 120 | 256 |
Virtex-4 | Multiplexed | 512 | 32 | 2499 | 128 | 32 | 80 | 256 |
Virtex-4 | Multiplexed | 1024 | 16 | 4876 | 512 | 64 | 120 | 512 |
Virtex-4 | Multiplexed | 1024 | 32 | 5118 | 256 | 64 | 80 | 512 |
Virtex-5 | Systolic | 512 | 16 | 3205 | 192 | 68 | 130 | 128 |
Virtex-5 | Systolic | 512 | 32 | 3876 | 96 | 36 | 95 | 128 |
Virtex-5 | Systolic | 1024 | 16 | 6642 | 384 | 130 | 130 | 256 |
Virtex-5 | Multiplexed | 512 | 16 | 2078 | 256 | 32 | 120 | 256 |
Virtex-5 | Multiplexed | 512 | 32 | 2370 | 128 | 32 | 90 | 256 |
Virtex-5 | Multiplexed | 1024 | 16 | 4876 | 512 | 64 | 120 | 512 |
Virtex-5 | Multiplexed | 1024 | 32 | 5005 | 256 | 64 | 90 | 512 |
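As a rough reading aid for the synthesis table, the time of one modular multiplication can be estimated as clock cycles divided by clock frequency. The sketch below assumes that the clock-cycle column counts cycles per modular multiplication and that the design runs at the reported maximum frequency; the function name is hypothetical.

```python
def multiplication_time_us(clock_cycles, freq_mhz):
    """Estimated time of one modular multiplication in microseconds.

    Cycles divided by a frequency in MHz gives microseconds directly.
    """
    return clock_cycles / freq_mhz


# Virtex-5, 1024-bit operands, values taken from the table above.
systolic_us = multiplication_time_us(384, 130)      # systolic, radix 16
multiplexed_us = multiplication_time_us(256, 90)    # multiplexed, radix 32
print(f"systolic: {systolic_us:.2f} us, multiplexed: {multiplexed_us:.2f} us")
```

Under these assumptions the two Virtex-5 designs land within a few percent of each other per multiplication, with the multiplexed design using fewer slices.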

Table

RSA application (Virtex-5).

Architecture | Operand size (bits) | Freq. (MHz) | RSA decryption | Clock cycles |
---|---|---|---|---|
Systolic | 1024 | 130 | 3.23 ms | 491520 |
Multiplexed | 1024 | 90 | 4.36 ms | 393216 |

Table

State-of-art implementations of modular multiplication architectures.

Design | FPGA | Clock | Area | Mod exp |
---|---|---|---|---|
Systolic | XC5VLX110T | 130 MHz | 6642 Slices | 3.23 ms |
Multiplexed | XC5VLX110T | 90 MHz | 5005 Slices | 4.36 ms |
[ | XC2VP70 | 101.86 MHz | 5709 Slices | 3.01 ms |
[ | XC5VLX110T | 95 MHz | 3044 Slices | 6 ms |
[ | XC2V2000 | 248 MHz | 4051 Slices | 9.4 ms |
[ | Virtex-4 | 150.5 MHz | 2613 Slices | 13.94 ms |

This paper presented two Montgomery modular multiplication architectures and the results of their synthesis for Xilinx Virtex-4 and Virtex-5 FPGAs. A systolic implementation and a multiplexed implementation, both suitable for the RSA public-key cryptosystem, were developed, and the designs were carefully matched to the features of the FPGAs, utilizing embedded DSP48E slices and Block RAM. The designs are improvements of a previous work. The multiplexed implementation presented a good performance considering

This paper is a result of the project “INOVALAB-Laboratories Technological Innovation in Electronic and Microelectronic”. We acknowledge the financial support received from FINEP.