Revisiting Sum of Residues Modular Multiplication

In the 1980s, when the introduction of public key cryptography spurred interest in modular multiplication, many implementations performed modular multiplication using a sum of residues. As the field matured, sum of residues modular multiplication lost favor to the extent that all recent surveys have either overlooked it or incorporated it within a larger class of reduction algorithms. In this paper, we present a new taxonomy of modular multiplication algorithms. We include sum of residues as one of four classes and argue why it should be considered different from the other, now more common, algorithms. We then apply techniques developed for other algorithms to reinvigorate sum of residues modular multiplication. We compare FPGA implementations of modular multiplication up to 24 bits wide. The sum of residues multipliers reduce latency by nearly 50% compared to Montgomery architectures, at the cost of nearly doubled circuit area. The new multipliers are useful for systems based on the Residue Number System (RNS).


Introduction
Modular multiplication is important for many applications including cryptography and image processing. Many different modular multiplication algorithms have been published [1][2][3][4][5] and have been deployed for public-key cryptography or digital signal processing. In this paper, we reinvigorate the sum of residues class of modular multipliers by describing a new modular multiplication algorithm and implementation.
Section 3 surveys the literature of modular multiplication to arrive at 4 classes of algorithm: sum of residues, classical, Barrett, and Montgomery. The goal of this section is not a comprehensive survey of all publications, but to support the claim that sum of residues is a distinct class that has been largely ignored. In Section 4, we apply optimizations, originally proposed for other classes of reduction, to breathe new life into sum of residues modular multiplication. Section 5 evaluates the reinvigorated sum of residues approach by comparing it with Montgomery multiplication on an FPGA.

Motivation: The Residue Number System
Our interest in modular multiplication at word lengths up to around 24 bits stems from its application to systems built using the Residue Number System (RNS). By representing integers in independent short word length channels, residue number systems offer advantages for digital signal processing [6][7][8] and long word length arithmetic, especially for crypto-operations [9,10].
A residue number system [11] is characterized by a set of N coprime moduli $\{m_1, m_2, \ldots, m_N\}$. In the RNS, a number X is uniquely represented in N channels: $X = \{x_1, x_2, \ldots, x_N\}$, where $x_i$ is the residue of X with respect to $m_i$, that is, $x_i = \langle X \rangle_{m_i} = X \bmod m_i$.
If X, Y, and Z have RNS representations given by $X = \{x_1, x_2, \ldots, x_N\}$, $Y = \{y_1, y_2, \ldots, y_N\}$, and $Z = \{z_1, z_2, \ldots, z_N\}$, then, using $*$ to denote any of the operations $+$, $-$, or $\times$, the RNS version of $Z = X * Y$ satisfies
$$Z = \{\langle x_1 * y_1 \rangle_{m_1}, \langle x_2 * y_2 \rangle_{m_2}, \ldots, \langle x_N * y_N \rangle_{m_N}\}. \tag{1}$$
Thus, addition, subtraction, and multiplication can be concurrently performed on the N residues within N parallel channels, and it is this high-speed parallel operation that makes the RNS attractive. The multipliers described in this paper are intended for the modular multiplications $\langle x_i \times y_i \rangle_{m_i}$ within RNS channels. These typically have a word length from 4 to 24 bits.
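As a rough illustration of the channel-wise property in (1), the following Python sketch multiplies two small integers residue by residue. The moduli, the helper names `to_rns`, `rns_op`, and `from_rns`, and the use of the Chinese Remainder Theorem to convert back are assumptions made for this example; they are not taken from the paper. It needs Python 3.8 or later for `math.prod` and modular inverses via `pow`.

```python
from math import prod


def to_rns(x, moduli):
    """Represent x by its residues with respect to each modulus."""
    return [x % m for m in moduli]


def rns_op(xs, ys, moduli, op):
    """Apply op (+, -, or *) independently in every channel, as in (1)."""
    return [op(x, y) % m for x, y, m in zip(xs, ys, moduli)]


def from_rns(residues, moduli):
    """Convert back with the Chinese Remainder Theorem (used here only as a check)."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_hat = big_m // m
        total += r * m_hat * pow(m_hat, -1, m)  # pow(..., -1, m) is the inverse mod m
    return total % big_m


moduli = [7, 11, 13]  # hypothetical pairwise coprime moduli, dynamic range 1001
x, y = 23, 40
z = rns_op(to_rns(x, moduli), to_rns(y, moduli), moduli, lambda a, b: a * b)
print(z, from_rns(z, moduli))
```

Each channel only ever handles values smaller than its own modulus, which is why short word length modular multipliers are the building block of interest here.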
2.1. Contribution. This paper contributes to the literature of modular multiplication by (1) identifying sum of residues as a separate class of modular reduction algorithm, (2) presenting a new algorithm and implementation for sum of residues modular multiplication, and (3) demonstrating that this approach can deliver performance benefits, particularly for channel-width modular multipliers for RNS systems on FPGA.

Notation.
We consider the modular multiplication $C = A \times B \bmod M$, where A, B, and M are n-digit integers of the form $X = \sum_{i=0}^{n-1} x_i r^i$. The radix r is typically a positive power of 2.
Note that it is common for modular multiplication algorithms to produce a result $C > M$, so that a few subtractions of M are required to fully reduce the result. The usual approach is to design the algorithm so that C can be fed back to the input without overflow, even if C is not fully reduced.

Classes of Modular Multiplication
Some early modular multipliers [4,12-15] proceed by accumulating residues modulo M. Equation (2) is a typical starting point:
$$A \times B \bmod M = \left( \sum_{i=0}^{n-1} a_i \left( B r^i \bmod M \right) \right) \bmod M. \tag{2}$$
The residues, typified by $(B r^i \bmod M)$ in (2), may be precomputed and retrieved from a table (e.g., [12]) or evaluated recursively during the modular multiplication (e.g., [13]). A typical algorithm will be shown in the next section, from which point the sum of residues algorithm will be analyzed and improved.
Instead of accumulating residues modulo M, reduction can be performed by subtracting multiples of M. Papers that take this approach include [5,16-19]. Algorithm 1 is typical. Reduction in this way can be understood as a division in which the quotient is discarded and the remainder retained. Development of modular multipliers along this line has, therefore, closely followed the development of division, especially SRT division (as originally in [20]).
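The two approaches just described can be contrasted in a short Python sketch: one function accumulates the residues $B r^i \bmod M$, the other subtracts a multiple of M after every step. This is an illustration only, not the paper's Algorithm 1; the function names are hypothetical, and the classical version uses an exact integer division where a hardware design would use an estimated quotient digit.

```python
def digits(x, r, n):
    """Return the n radix-r digits of x, least significant first."""
    return [(x // r**i) % r for i in range(n)]


def sum_of_residues_modmul(a, b, m, r=2, n=8):
    """Accumulate a_i * (B * r^i mod M); the result is congruent to a*b mod m."""
    acc = 0
    residue = b % m                    # B * r^0 mod M
    for a_i in digits(a, r, n):
        acc += a_i * residue
        residue = (residue * r) % m    # next residue, evaluated recursively
    return acc                         # not fully reduced


def classical_modmul(a, b, m, r=2, n=8):
    """Interleaved shift-and-add with subtraction of q*M at each step."""
    c = 0
    for a_i in reversed(digits(a, r, n)):   # most significant digit first
        c = c * r + a_i * b
        q = c // m                     # exact quotient digit (assumption: no estimation)
        c -= q * m                     # quotient discarded, remainder retained
    return c


print(sum_of_residues_modmul(15, 11, 9, n=4) % 9, classical_modmul(15, 11, 9, n=4))
```

The sum of residues version never subtracts M, so its output needs a final reduction; the classical version keeps the partial result below M at every step.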
The Quotient Digit Selection (QDS) function has received a great deal of attention, with the aims of permitting quotient digits ($q_i$) to be trivially estimated from only the most significant bits of the partial result C, allowing the partial result to be stored in a redundant form, and moving the QDS function off the critical path (e.g., [17,18]).
The relationship between division and modular multiplication is made explicit in
$$A \times B \bmod M = A \times B - \left\lfloor \frac{A \times B}{M} \right\rfloor \times M. \tag{3}$$
This equation suggests an alternative mechanism: one may perform the division $(A \times B)/M$ by multiplying by $M^{-1}$. Papers that follow this line include [1,21,22]. Note that $M^{-1}$ is a real number, so correct evaluation of (3) using fixed-point arithmetic requires careful design of the representation. A typical example is the improved Barrett algorithm (named after Barrett's reduction in [1]) described in [22] and shown in Algorithm 2.
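To make the idea of replacing the division by a multiplication with a fixed-point $M^{-1}$ concrete, here is a minimal Python sketch of the textbook form of Barrett reduction. It is not the improved Barrett algorithm of [22] (Algorithm 2); the names `barrett_setup` and `barrett_reduce` and the choice of scaling $M^{-1}$ by $2^{2k}$ are assumptions of this example.

```python
def barrett_setup(m):
    """Precompute k = bit length of M and mu = floor(2^(2k) / M)."""
    k = m.bit_length()
    mu = (1 << (2 * k)) // m       # fixed-point approximation of 1/M
    return k, mu


def barrett_reduce(x, m, k, mu):
    """Reduce x modulo m for 0 <= x < 2^(2k) using a multiply and a shift."""
    q = (x * mu) >> (2 * k)        # estimate of floor(x / m); never too large
    r = x - q * m
    while r >= m:                  # a few correcting subtractions at most
        r -= m
    return r


k, mu = barrett_setup(9)
print(barrett_reduce(15 * 11, 9, k, mu))
```

Because `mu` is rounded down, the quotient estimate can only undershoot, which is why the correction loop only ever subtracts.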
Most recently, modular multipliers based on Montgomery's reduction algorithm [2] have been popular [3,23,24]. A typical form is shown in Algorithm 3. Note that the quotient digit selection step examines only the least significant digit of the partial result C. Also, note that Montgomery's method does not produce a fully reduced residue $A \times B \bmod M$ but rather the Montgomery residue $A \times B \times R^{-1} \bmod M$, where $R = r^n$. Computation can proceed with Montgomery residues as an internal representation. An extra modular multiplication is required to convert the final result to a fully reduced residue.
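The behaviour described above can be sketched in Python for radix 2. This is a generic interleaved Montgomery multiplication written for illustration, not a transcription of Algorithm 3; it assumes an odd modulus M, an n-bit operand `a`, and the hypothetical function name `montgomery_modmul`.

```python
def montgomery_modmul(a, b, m, n):
    """Return c congruent to a * b * 2^(-n) mod m, for odd m and a < 2^n."""
    c = 0
    for i in range(n):                 # least significant bit of a first
        c += ((a >> i) & 1) * b
        q = c & 1                      # quotient digit: only the LSB of c is examined
        c = (c + q * m) >> 1           # c + q*m is even, so the shift is exact
    return c                           # below b + m, so not fully reduced


m, n = 9, 4
c = montgomery_modmul(7, 5, m, n)      # Montgomery residue of 7 * 5
r2 = pow(2, 2 * n, m)                  # R^2 mod M with R = 2^n
full = montgomery_modmul(r2, c, m, n) % m   # the extra multiplication removes R^-1
print(c, full)
```

The second call shows the extra modular multiplication mentioned in the text: multiplying by $R^2 \bmod M$ cancels the factor $R^{-1}$ left by the first call.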
We have, therefore, identified 4 different classes of modular multiplication algorithm according to the way in which they perform reduction.
(1) Sum of Residues: reduction is achieved by accumulating residues modulo M.
(2) Classical: multiples of the modulus $q_i M$ are subtracted according to a QDS function that examines the most significant digits of the partial result C.
(3) Barrett: multiplication by $M^{-1}$ is used to reduce modulo M.
(4) Montgomery: multiples of the modulus $q_i M$ are accumulated according to a QDS function that examines the least significant digits of the partial result C.
We note here that each of the 4 classes permits separated and interleaved implementations. In a separated implementation, $A \times B$ is evaluated before being reduced modulo M. The alternative is to interleave the multiplication and reduction steps. This has the benefit of keeping intermediate values to the approximate magnitude of M but does not take advantage of any preexisting nonmodular multiplier. In this paper, we are largely concerned with interleaved implementations.
Surveys of modular reduction offer different classifications. References [25][26][27] identify three classes: classical, Barrett, and Montgomery. References [22,28] include the Barrett algorithms with other classical algorithms and therefore divide the field into only two classes: classical and Montgomery. None of these surveys covers sum of residues papers. Other publications [4,29,30] have made note of the sum of residues technique but categorize it along with other classical algorithms. At first sight, Tomlinson's algorithm [4] may look like a classical algorithm, as the quotient digit q is selected from the most significant bits of the partial result. The difference is that a classical algorithm then performs reduction by subtracting the multiple of the modulus $q_i M$, whereas Tomlinson's algorithm performs the reduction by setting the most significant bits to zero and accounting for this change by adding the precomputed residue $(q \times 2^{n+1} \bmod M)$.

Reinvigorating Sum of Residues
An architecture for Tomlinson's algorithm is shown in Figure 1. Note that the intermediate result C[i + 1] denotes the value of C in the previous iteration.
A Carry-Save Adder (CSA) is used to perform the three-term addition. So that the shifted partial result is n bits long, the same as the other two addends, q is set to be the upper 3 bits, instead of 2 bits, of the current partial result.
Note that the carry-save representation is a type of redundant representation that has been applied to other modular multiplication algorithms [15,19,31]. We will keep using this technique to enhance the sum of residues modular multiplier.
Algorithm 4: Tomlinson's sum of residues modular multiplication.
Reference [15] gives a similar algorithm but sets q to be only two bits long. This means that the partial result C[i] may be n + 1 bits long. To bound it within n bits, a subtracter is used to repeatedly subtract M until C[i] has only n bits. This redundant step greatly increases the latency of the algorithm. In the following sections, we describe new enhancements to improve the performance of this algorithm.
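The reduction mechanism described in this section can be sketched in Python as follows. Since the listing of Algorithm 4 is not reproduced in the text, this is a reconstruction from the prose rather than Tomlinson's exact algorithm. It assumes radix 2, processing of A from the most significant bit, $B < 2^n$, $M < 2^n$, and a table holding $q \times 2^n \bmod M$; the function name is hypothetical.

```python
def tomlinson_style_modmul(a, b, m, n):
    """Interleaved sum of residues multiplication; result is congruent to a*b mod m."""
    lut = [(q << n) % m for q in range(8)]    # precomputed q * 2^n mod M, 3-bit q
    low_mask = (1 << (n - 1)) - 1
    c = 0
    for i in reversed(range(n)):              # most significant bit of a first
        q = c >> (n - 1)                      # upper 3 bits of the partial result
        low = c & low_mask                    # upper bits set to zero
        a_i = (a >> i) & 1
        # Three-term addition: 2*low and a_i*b are n bits, lut[q] is below M.
        # Doubling moves the cleared bits to weight 2^n, hence the residue q*2^n mod M.
        c = (low << 1) + a_i * b + lut[q]
    return c                                  # below 3 * 2^n, not fully reduced


print(tomlinson_style_modmul(15, 11, 9, 4) % 9)
```

No multiple of M is ever subtracted inside the loop; the cleared upper bits are compensated purely by adding a table entry, which is the feature that separates this class from the classical algorithms.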

Eliminating the Carry-Propagate Adder.
There are two obvious demerits of the architecture in Figure 1. Firstly, a Carry-Propagate Adder (CPA) is used to transform the redundant representation of C[i] to its nonredundant form. This is required because the upper 3 bits of C[i] have to be known to look up $q \times 2^n \bmod M$ before the next iteration. The CPA delay contributes significantly to the latency of the implementation. The second problem is that the lookup of $q \times 2^n \bmod M$ is on the critical path.
Both of these problems can be solved by keeping the intermediate result in a redundant carry-save form. The CPA of Figure 1 is eliminated so that the partial result is calculated as a pair $C_1[i]$ and $C_2[i]$ of sum and carry terms, respectively. A modified architecture is shown in Figure 2. The CPA is replaced by a second CSA.
The precomputed residue $(q_1 + q_2) \times 2^n \bmod M$, which must be retrieved from a lookup table (LUT), can be sent to the second CSA rather than the first. All three addends to the first CSA are available at the beginning of each iteration, and the table lookup step can be performed in parallel with the first CSA.
In Figure 1, it can be seen that the carry output of the first CSA is n + 1 bits wide. This cannot be input directly to the second CSA, which is only n bits wide. Thus, in Figure 2, the MSB of the (n + 1)-bit carry is sent to the LUT circuit instead. The LUT retrieves two possible values of $(q_1 + q_2) \times 2^n \bmod M$ corresponding to the case of either a 0 or a 1 in the MSB of the carry output from the first CSA. A MUX selects the appropriate value of $(q_1 + q_2) \times 2^n \bmod M$ once the MSB is available. Thus, although the LUT executes in parallel with the first CSA, an additional MUX appears on the critical path.
Figure 3: n-bit carry-save adders (left: n-bit carry-save adder; right: modified n-bit carry-save adder).

Further Enhancements.
If the second CSA in Figure 2 can be modified to accept an (n + 1)-bit input, the MUX can be eliminated. The left of Figure 3 shows a conventional n-bit CSA. Note that the output sum is only n bits wide. To accept an (n + 1)-bit input, we can simply copy the MSB of the (n + 1)-bit input to the MSB of the output sum. This is illustrated in the right of Figure 3. This modified CSA accepts one (n + 1)-bit input and two n-bit inputs and produces two (n + 1)-bit outputs.
Figure 4 shows the resulting modular multiplication architecture. The algorithm corresponding to this new architecture is given as Algorithm 5. The CPA has been eliminated from the iteration, and the residue lookup has been shifted from the critical path. Also, no subtraction is needed at the end of the algorithm to bound the output within n + 1 bits. If $C_1[0]$ and $C_2[0]$ are simply summed using a CPA, the resulting output C could be n + 2 bits, which needs some further subtraction to be reduced. Therefore, the same technique as in the loop is applied. Both $C_1[0]$ and $C_2[0]$ are set to n − 1 bits, and the n-bit residue corresponding to the 2 upper reset bits is retrieved from another LUT. The final sum yields an (n + 1)-bit output C.
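A behavioural Python sketch of the carry-save scheme described above is given below. The listing of Algorithm 5 is not reproduced in the text, so this is a reconstruction from the prose under stated assumptions: radix 2, A processed from the most significant bit, $B < 2^n$, $M < 2^n$, a loop table holding $(q_1 + q_2) \times 2^n \bmod M$, and a final table holding $(q_1 + q_2) \times 2^{n-1} \bmod M$. The names `csa` and `csa_sum_of_residues_modmul` are hypothetical.

```python
def csa(x, y, z):
    """3:2 carry-save adder: returns (s, c) with x + y + z == s + c."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def csa_sum_of_residues_modmul(a, b, m, n):
    """Carry-save sum of residues multiplication; result is congruent to a*b mod m."""
    low_mask = (1 << (n - 1)) - 1
    lut_loop = [(q << n) % m for q in range(7)]         # (q1 + q2) * 2^n mod M
    lut_final = [(q << (n - 1)) % m for q in range(7)]  # table for the last step
    c1 = c2 = 0                                         # sum and carry words
    for i in reversed(range(n)):
        q = (c1 >> (n - 1)) + (c2 >> (n - 1))           # upper 2 bits of each word
        a_i = (a >> i) & 1
        # First CSA: all three addends are known at the start of the iteration,
        # so the table lookup can proceed in parallel with it.
        s, c = csa((c1 & low_mask) << 1, (c2 & low_mask) << 1, a_i * b)
        # Second CSA: accepts the (n + 1)-bit carry c and adds the looked-up residue.
        c1, c2 = csa(s, c, lut_loop[q])
    q = (c1 >> (n - 1)) + (c2 >> (n - 1))
    # Final step: clear the upper bits again and add their residue with one CPA.
    return (c1 & low_mask) + (c2 & low_mask) + lut_final[q]


print(csa_sum_of_residues_modmul(15, 11, 9, 4) % 9)
```

In this model no carry-propagate addition occurs inside the loop, and both words stay within n + 1 bits, so the sum of the two 2-bit fields never exceeds 6.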
The LUTs have a 4-bit input and an n-bit output, so that a $(2^4 \times n)$-bit ROM can be used. Moreover, note that the sum $(q_1 + q_2)$ is at most binary 110, which occurs when $q_1$ and $q_2$ are both 11. This implies that the possible sum $(q_1 + q_2)$ is in the range from 000 to 110, which has only 7 values. Therefore, a ROM with a further reduced size of $(7 \times n)$ bits can be used for $(q_1 + q_2) \times 2^n \bmod M$. For example, a 128-bit modular multiplier only needs a 1 Kbit ROM, which is reasonable for an RNS channel modular multiplier.
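The ROM sizes quoted above are easy to check numerically. The short Python sketch below tabulates the seven residues for a small example modulus and compares the two ROM sizes; the helper names and the choice of M = 9, n = 4 are assumptions made for illustration.

```python
def residue_rom(m, n):
    """Entries (q1 + q2) * 2^n mod M for q1 + q2 = 0, 1, ..., 6."""
    return [(q << n) % m for q in range(7)]


def rom_bits(n, entries):
    """ROM size in bits for the given number of n-bit entries."""
    return entries * n


rom = residue_rom(9, 4)
print(rom)
print(rom_bits(128, 2**4), rom_bits(128, 7))   # full 4-bit address versus 7 entries
```

For n = 128 the 7-entry table needs 896 bits, which fits within the 1 Kbit figure, against 2048 bits for a table addressed directly by the 4 input bits.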
Figure 5 shows an example of the new algorithm for the case r = 2, n = 4, $A = 15 = (1111)_2$, $B = 11 = (1011)_2$, and $M = 9 = (1001)_2$. Note that at the last step, a second LUT of the same size is needed. Also, because the output C from the 4-bit CPA is at most 5 bits long, the final subtraction might not be necessary if an (n + 1)-bit C is acceptable, as is the case in quite a few other algorithms. Even if a 4-bit C is required, only one subtraction is needed.
If $r = 2^k$, this version executes in n/k iterations; however, a larger LUT and (n + k)-bit CSAs are required.

Evaluation
5.1. Evaluation Environment. FPGA implementations have been prepared. A Xilinx Virtex2 FPGA was used as the implementation target. All the implementations have been performed using the Xilinx ISE environment, using XST for synthesis and ISE standard tools for place and route with standard effort for all speed optimizations (see Table 2). Pure delays of the combinatorial circuit were measured, excluding those between pads and pins. They were generated from the post-place and route static timing analyzer with a standard place and route effort level.
Algorithm 5: New sum of residues modular multiplication.

Results. Implementation results are listed in Table 1 for n = 4 to n = 24 at radix-2, the most popular word lengths of RNS channel modular multipliers [6,9,32]. The table includes results for the old architecture based on Tomlinson's algorithm (Figure 1) as well as for a carefully optimized Montgomery architecture.
The Montgomery architecture has incorporated various published techniques for optimization. For example, the techniques in [3,23] have been applied to improve performance by making the quotient digit selection step trivial and moving it from the critical path. The cost of these techniques is that they impose limits on the possible values of the modulus M, which may impact other enhancements such as the use of higher radices. Consequently, the Montgomery architecture is interleaved, uses radix 2 and trivial quotient digit selection (as in [3,33] described in Section 3), and was arrived at by varying these parameters to find the multiplier with the lowest delay.
These multipliers are compared with the new binary sum of residues architecture of Figure 4. It can be seen that the new sum of residues modular multiplier is a competitive alternative implementation on FPGA. It demonstrates better timing performance than both Tomlinson's architecture and the Montgomery architecture, although its hardware cost is the highest among the three.

Conclusions and Future Work
Sum of residues is a distinct class of modular multiplication that has been overlooked in recent years. We have shown that techniques pioneered for other modular multiplication algorithms, such as the use of redundant representations and higher radices, can also be applied to sum of residues. By doing this, we have arrived at a new sum of residues modular multiplier that uses carry-save adders and redundant number representation to achieve a more parallel structure than previous versions. FPGA implementations of the new architecture demonstrate low latency relative to previous sum of residues and Montgomery architectures, at the cost of increased space overhead. Future work will focus on reducing the space overhead. More advanced programmable logic devices will be tried in order to utilize hardware resources more efficiently. ASIC design will also be investigated, where much of the hardware redundancy is expected to be reduced. For example, our new CSA shown in Figure 3 accepts one extra input bit without needing extra hardware resources. However, this benefit is hard to see in the FPGA implementation because splitting bits is impossible in the configurable logic blocks (CLBs) embedded on the FPGA, and the embedded 18-bit multipliers do not help either. The benefit would be readily apparent in an ASIC implementation. Therefore, the new sum of residues modular multiplication algorithm is expected to be suitable for ASIC applications.

Algorithm 1: A typical example of classical modular multiplication.

Algorithm 3: Montgomery modular multiplication.

4.1. Tomlinson's Algorithm. We take the modular multiplier developed by Tomlinson in [4] as our starting point for further development. Tomlinson's algorithm is shown in Algorithm 4.

Figure 2: Modified sum of residues modular multiplier architecture.

Figure 4: New sum of residues modular multiplier architecture.

Figure 6: New higher-radix sum of residues modular multiplier architecture.

Table 1: Latency and space overhead of three interleaved modular multipliers.