A Parallel Residue-to-binary Converter for the Moduli Set

In this paper, a high-speed parallel residue-to-binary converter is proposed for a recently introduced moduli set $S_k = \{2^m - 1, 2^{2^0 m} + 1, 2^{2^1 m} + 1, \ldots, 2^{2^k m} + 1\}$ for a general value of $k$. The proposed converter uses simple cyclic shift and concatenation operations and does not require any multiplier. Individual converters for the cases of $k = 0$ and $k = 1$ are derived from the general architecture and compared with those existing in the literature. The converter for $S_0$ is twice as fast and requires only one-half of the hardware, while that for $S_1$ is three times as fast and requires only 60% of the hardware, as compared to the corresponding ones existing in the literature. Furthermore, the proposed converters are implemented using 0.5-micron CMOS VLSI technology. Based on $S_0$, the layouts for 8-bit, 16-bit, 32-bit and 64-bit converters are generated, and the corresponding simulation results obtained.


INTRODUCTION
During the past decade, residue number system (RNS) arithmetic has received considerable attention in arithmetic computation and signal processing applications, such as the fast Fourier transform, digital filtering and image processing. The main reasons for this attention are the inherent properties enjoyed by the RNS, such as parallelism, modularity, fault tolerance and carry-free operations [2-4]. The crucial step for any successful RNS application is the residue-to-binary (R/B) conversion. In recent years, the conversion process has been studied very extensively [5-20].
In order to use the RNS to represent binary numbers, a moduli set has to be chosen. Recently, several new moduli sets have been proposed [5-9]. One such set is $S_k = \{2^m - 1, 2^{2^0 m} + 1, 2^{2^1 m} + 1, \ldots, 2^{2^k m} + 1\}$, for which the R/B converters for the cases of $k = 0$ and $k = 1$ have also been proposed [5]. According to Ref. [5], this moduli set is expected to play an important role in the RNS, since the multiplications in the R/B conversion of this moduli set have been replaced by simple shift operations of signed-digit numbers. It has been shown in Ref. [5] that the R/B converter for $S_k$ is much faster and simpler compared to the existing converters. For 8-bit dynamic range, the FA-based R/B converter of Ref. [10] requires 837 transistors with 51 gate delays, while the converter based on $S_k$ of Ref. [5] requires only 510 transistors with 38 gate delays. However, no R/B converter for $S_k$ has been designed so far for $k \geq 2$. Since more than two or three moduli must be considered for large dynamic ranges [11], the introduction of a converter for a general $k$ is essential.
In this paper, we propose a high-speed parallel R/B converter for the general moduli set $S_k$; this converter also uses no multipliers. Instead of shifting the signed-digit numbers, we use simple cyclic shift and concatenation operations. For the purpose of comparison, the individual converters for the cases of $k = 0$ and $k = 1$ are derived from the general architecture. The new converter for $S_0$ is twice as fast as the one in Ref. [5] and requires only one-half of the hardware, while that for $S_1$ is three times as fast as the corresponding one in Ref. [5] and requires only 60% of the hardware. For the same 8-bit dynamic range mentioned above, the proposed converter requires only 220 transistors with 18 gate delays. Furthermore, we implement the proposed converters using 0.5-micron CMOS VLSI technology. Based on the moduli set $S_0$, layouts of the 8-bit, 16-bit, 32-bit and 64-bit R/B converters are generated and simulation results obtained.
The paper is organized as follows. In the second section, we introduce the necessary background material. In the third section, we propose a parallel converter for the general moduli set $S_k$. Using these results, we derive new converters for $S_0$ and $S_1$ in the fourth section, while in the fifth section, we present the VLSI implementation of these converters. In the sixth section, we present the conclusion.

BACKGROUND MATERIAL
For any two numbers $X$ and $P_i$, we denote $X \bmod P_i$ by $|X|_{P_i}$.
Residue Number System. A residue number system is defined in terms of a set of relatively prime moduli $\{P_1, P_2, \ldots, P_n\}$; that is, $\mathrm{GCD}(P_i, P_j) = 1$ for $i \neq j$. A binary number $X$ can be represented as $X = (x_1, x_2, \ldots, x_n)$, where $x_i = |X|_{P_i}$, and $M = P_1 P_2 \cdots P_n$ is the dynamic range of the moduli set $\{P_1, P_2, \ldots, P_n\}$. To convert $(x_1, x_2, \ldots, x_n)$ into the binary number $X$, the Chinese remainder theorem (CRT) and mixed radix conversion (MRC) are generally used.
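Chinese Remainder Theorem. The binary number $X$ is computed by
$$X = \left| \sum_{i=1}^{n} N_i \, \big| |N_i^{-1}|_{P_i} \, x_i \big|_{P_i} \right|_M,$$
where $N_i = M/P_i$ and $|N_i^{-1}|_{P_i}$ is the multiplicative inverse of $|N_i|_{P_i}$ [4].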
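The CRT reconstruction can be illustrated with a minimal Python sketch. The function name `crt_to_binary` is hypothetical, and the code assumes Python 3.8 or later (for `math.prod` and for `pow` with a negative exponent). It uses the moduli (3, 5, 17) and the number 11 that appear in the worked example of this paper.

```python
from math import prod

def crt_to_binary(residues, moduli):
    """Reconstruct X from its residues using the Chinese remainder theorem."""
    M = prod(moduli)                      # dynamic range M = P_1 * ... * P_n
    total = 0
    for x_i, P_i in zip(residues, moduli):
        N_i = M // P_i
        N_i_inv = pow(N_i, -1, P_i)       # multiplicative inverse of |N_i|_{P_i}
        total += N_i * ((N_i_inv * x_i) % P_i)
    return total % M

moduli = (3, 5, 17)
X = 11
residues = tuple(X % P for P in moduli)   # residues of 11: (2, 1, 11)
print(residues, crt_to_binary(residues, moduli))
```

Every term in the sum needs a multiplication by $N_i$ and a modular inverse, which is the cost the shift-based approach of this paper is meant to avoid.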
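Mixed Radix Conversion. The number $X$ can be computed by the formula
$$X = \sum_{i=1}^{n} a_i v_i,$$
where $v_i = \prod_{j=1}^{i-1} P_j$ for $2 \leq i \leq n$ and $v_1 = 1$; the $a_i$, called the mixed radix digits, are computed by the formulas:
$$a_1 = x_1, \quad a_2 = \big| (x_2 - a_1) \, |P_1^{-1}|_{P_2} \big|_{P_2}, \quad a_3 = \big| \big( (x_3 - a_1) \, |P_1^{-1}|_{P_3} - a_2 \big) \, |P_2^{-1}|_{P_3} \big|_{P_3}, \quad \ldots$$
The MRC approach is a sequential algorithm and is not as "parallel" as the CRT method. Thus, to solve the residue-to-binary conversion problem, the CRT schemes are considered for efficient VLSI implementations [21].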
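The sequential nature of the MRC can be seen in a short Python sketch. The function name `mrc_to_binary` is hypothetical and the code follows the standard textbook form of the algorithm, not necessarily the exact formulation used in the cited references. Each digit $a_i$ depends on all earlier digits, so the outer loop cannot be parallelized.

```python
def mrc_to_binary(residues, moduli):
    """Mixed radix conversion: returns (X, mixed radix digits)."""
    n = len(moduli)
    work = list(residues)
    digits = []
    for i in range(n):
        a_i = work[i] % moduli[i]         # next mixed radix digit
        digits.append(a_i)
        # Remove a_i from the remaining residues and divide by P_i.
        for j in range(i + 1, n):
            inv = pow(moduli[i], -1, moduli[j])
            work[j] = ((work[j] - a_i) * inv) % moduli[j]
    # X = a_1 * v_1 + a_2 * v_2 + ..., with v_1 = 1 and v_i = P_1 * ... * P_{i-1}
    X, v = 0, 1
    for a_i, P_i in zip(digits, moduli):
        X += a_i * v
        v *= P_i
    return X, digits

print(mrc_to_binary((2, 1, 11), (3, 5, 17)))
```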
Assuming $m$ and $k$ to be integers, we define the moduli set $S_k$ as $S_k = \{P_{-1}, P_0, P_1, \ldots, P_k\} = \{2^m - 1, 2^{2^0 m} + 1, 2^{2^1 m} + 1, \ldots, 2^{2^k m} + 1\}$ and $M = P_{-1} P_0 \cdots P_k = 2^{2^{k+1} m} - 1$. A binary number $X$ in the dynamic range $[0, M - 1]$ is represented as $(x_{-1}, x_0, x_1, \ldots, x_k)$, where $x_{-1}$ is an $m$-bit binary number, $x_i$ is an $(m2^i + 1)$-bit binary number for $i = 0, 1, \ldots, k$, and $\bar{x}_i$ is the one's complement of $x_i$. The values of $x_{-1}$, $x_i$ and $\bar{x}_i$ are given by
$$x_{-1} = |X|_{2^m - 1}, \quad x_i = |X|_{2^{2^i m} + 1}, \quad \bar{x}_i = 2^{m2^i + 1} - 1 - x_i.$$
For the moduli set $S_k$, the binary number $X = (x_{-1}, x_0, x_1, \ldots, x_k)$ can be computed by the following proposition [5], which has been derived from the CRT.
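The properties of $S_k$ stated above can be checked numerically with a small Python sketch. The helper names `moduli_set` and `to_residues` are hypothetical. The sketch confirms, for a few small values of $m$ and $k$, that the moduli are pairwise relatively prime, that their product is $2^{2^{k+1} m} - 1$, and that the largest residue modulo $2^{2^i m} + 1$ needs $m2^i + 1$ bits.

```python
from itertools import combinations
from math import gcd, prod

def moduli_set(m, k):
    """S_k = {2^m - 1, 2^(2^0 m) + 1, ..., 2^(2^k m) + 1}."""
    return [2**m - 1] + [2**(2**i * m) + 1 for i in range(k + 1)]

def to_residues(X, m, k):
    """Forward conversion: (x_-1, x_0, ..., x_k)."""
    return [X % P for P in moduli_set(m, k)]

for m, k in [(2, 0), (2, 1), (3, 1), (4, 2)]:
    S = moduli_set(m, k)
    assert all(gcd(a, b) == 1 for a, b in combinations(S, 2))
    assert prod(S) == 2**(2**(k + 1) * m) - 1
    for i in range(k + 1):
        # The largest residue modulo P_i is P_i - 1 = 2^(2^i m).
        assert (S[i + 1] - 1).bit_length() == m * 2**i + 1

print(moduli_set(2, 1), to_residues(11, 2, 1))
```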
The computation of $X$ given by Eq. (5) is now carried out in three steps as suggested in Ref. [5]. In Step 1, the multiplications in Eq. (5) are performed by partitioning $x_{-1}$, $x_0$ and $x_1$ into eight sections, while the additions and subtractions are performed by redundant adders/subtractors to output one signed-digit number for each section. These are detailed below.
In Step 2, the above eight signed-digit outputs are converted to binary numbers. Redundant adders/subtractors are used to produce a sum and a carry bit. The carry bit is a signed-digit number in the range $[-2, 2]$ and controls the operation of Step 3. For the case under consideration, $0 + 2^1 \times 1 + 2^2 \times 2 + 0 + 2^4 \times (-2) + 2^5 \times 1 + 0 + 2^7 \times 2 = 100001010$ represents the sum and the carry bit; the sum is 00001010 and the carry bit is 1.
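The arithmetic of this example can be replayed in a few lines of Python. The digit list below is read off the expression above, least significant section first; this ordering is an assumption about how the eight sections are arranged. The sketch handles only this non-negative case and does not model a negative carry.

```python
# Signed-digit outputs of the eight sections, weights 2**0 .. 2**7.
digits = [0, 1, 2, 0, -2, 1, 0, 2]

value = sum(d * 2**i for i, d in enumerate(digits))
n = 8                                    # word length for M = 2**8 - 1
sum_bits = value & (2**n - 1)            # low 8 bits: the sum
carry = value >> n                       # what overflows past bit 7: the carry

# Since 2**8 = 1 (mod 255), the carry is added back at the low end.
X = sum_bits + carry

print(format(value, "09b"), format(sum_bits, "08b"), carry, X)
```

The result agrees with the residues $(2, 1, 11)$ of the example, since 11 leaves remainders 2, 1 and 11 when divided by 3, 5 and 17.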
In Step 3, $X$ is generated by adding the carry bit 1 to the sum. Thus, $X = 00001010 + 1 = 1011$.
In order to develop the R/B converter for the general moduli set $S_k$, we introduce the following definitions.
Definition 1 We define the variables $T_1$, $T_{2i+2}$ and $T_{2i+3}$, $i = 0, 1, \ldots, k$. It is easy to see that all the $T_i$'s are $(m2^{k+1})$-bit binary numbers. In the "R/B converter for $S_k$" section, we will show that each of the $T_i$'s can be generated by concatenation and cyclic shift operations, which are defined below.

Definition 3
We define concatenation of two numbers $x_1$ and $x_2$ to be $\langle x_1 \rangle \langle x_2 \rangle = x_1 2^{m_2} + x_2$, where $x_1$ is an $m_1$-bit number and $x_2$ an $m_2$-bit number.
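Definition 3 amounts to a left shift followed by a bitwise OR, as the following Python sketch shows. The function name `concat` is hypothetical. The width $m_2$ must be passed explicitly, because leading zeros of $x_2$ are significant.

```python
def concat(x1, x2, m2):
    """<x1><x2> = x1 * 2**m2 + x2, where x2 is an m2-bit number."""
    assert 0 <= x2 < 2**m2, "x2 must fit in m2 bits"
    return (x1 << m2) | x2

print(format(concat(0b10, 0b1011, 4), "b"))   # 10 followed by 1011
print(format(concat(0b10, 0b01, 2), "b"))     # the leading zero of x2 is kept
```

In hardware this operation is pure wiring, which is why a converter built from concatenations needs no multiplier.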
Definition 4 For any integer $a > 0$, we denote $[0]^a = \underbrace{\langle 0 \rangle \cdots \langle 0 \rangle}_{a}$ and $[1]^a = \underbrace{\langle 1 \rangle \cdots \langle 1 \rangle}_{a}$. Assuming that $n$, $n'$ ($n \geq n'$) and $k'$ are integers, and $x$ is an $n'$-bit binary number, we denote
Therefore, from Definitions 3 and 4, we get
In addition, we need the following well-known results:
The parallel R/B converter to be proposed later in this section is based on the following theorem, which presents a method of generating $T_1$, $T_{2i+2}$ and $T_{2i+3}$ by the concatenation and cyclic shift operations on $x_{-1}$, $x_i$ and $\bar{x}_i$, respectively.
Theorem 1
Proof. The successive steps are justified as follows: (by Definition 2, 3a); (using the modulo operation); (by Definition 2, 3c). Hence the theorem. ∎
Note:
1. For $T_1$, $n = n' = m$. Hence, from Definition 4, we have $[x_{-1}] = x_{-1}$.
2. For $T_{2i+2}$, $n = m2^{i+1}$ and $n' = m2^i + 1$. Therefore, from Definition 4, we have $[x_i] = [0]^{m2^i - 1} \langle x_i \rangle$.
3. For $T_{2i+3}$, $n = m2^{i+1}$ and $n' = m2^i + 1$. Thus, from Definition 4, $[\bar{x}_i] = [1]^{m2^i - 1} \langle \bar{x}_i \rangle$.
Example 2 We now apply Theorem 1 to Example 1 already considered. Recalling that $m = 2$, $k = 1$, $S_1 = (3, 5, 17)$ and $M = 255$, we find the binary number $X = (2, 1, 11)$ for the moduli set $S_1$ as follows.
It is observed that it is not necessary to calculate $[x_1]$ and $[\bar{x}_1]$, since the term $[x_i]^{2^{k-i} - 1}$ is eliminated when $k = i = 1$. Now,
$$T_1 = \langle x_{-1,1} \ldots x_{-1,0} \rangle [x_{-1}]^{2^{1+1} - 1} \langle x_{-1,2-1} \ldots x_{-1,1+1} \rangle = 10|10|10|10.$$
It is easy to see that the above calculations based on Theorem 1 are very much simpler than those in Example 1. Theorem 1 also enables us to implement a parallel R/B converter for the general moduli set $S_k$ without using multipliers.
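A multiplier-free conversion of this kind can be sketched in Python. The operand expressions in `t_terms` below were derived for this sketch directly from the CRT, using the fact that a multiplication by a power of two modulo $2^n - 1$ is a cyclic shift of an $n$-bit word. They are an assumption about the form the operands take, not a transcription of Theorem 1, although the first operand reproduces the value $T_1 = 10|10|10|10$ of Example 2. All function names are hypothetical.

```python
def rot_right(v, s, n):
    """Cyclic right shift of the n-bit word v by s positions (s may be negative)."""
    s %= n
    mask = (1 << n) - 1
    return ((v >> s) | (v << (n - s))) & mask

def replicate(word, width, copies):
    """Concatenate `copies` copies of a `width`-bit word."""
    out = 0
    for _ in range(copies):
        out = (out << width) | word
    return out

def t_terms(m, k, residues):
    """Operands T_1 .. T_{2k+3} for residues (x_-1, x_0, ..., x_k)."""
    n = m * 2**(k + 1)
    x_m1, xs = residues[0], residues[1:]
    terms = [rot_right(replicate(x_m1, m, 2**(k + 1)), k + 1, n)]
    for i, x_i in enumerate(xs):
        w = m * 2**(i + 1)                               # width of padded x_i
        copies = 2**(k - i)
        pos = replicate(x_i, w, copies)                  # zero-padded x_i, repeated
        neg = replicate((1 << w) - 1 - x_i, w, copies)   # one's complement, repeated
        terms.append(rot_right(pos, k - i + 1, n))
        terms.append(rot_right(neg, k - i + 1 - m * 2**i, n))
    return terms

def convert(m, k, residues):
    """Sum the operands modulo M = 2^(m 2^(k+1)) - 1."""
    M = 2**(m * 2**(k + 1)) - 1
    return sum(t_terms(m, k, residues)) % M

print([format(t, "08b") for t in t_terms(2, 1, (2, 1, 11))])
print(convert(2, 1, (2, 1, 11)))
```

Only shifts, complements and concatenations are used to form the operands; the single remaining arithmetic step is the modular addition.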
The converter to be presented later in this section uses an $m2^{k+1}$-bit carry save adder (CSA) with an end-around carry (EAC) as a fundamental building block. A CSA with EAC consists of a series of adders as shown in Fig. 1. Since each adder accepts a 3-bit input and produces a 2-bit output, the $m2^{k+1}$-bit CSA accepts three $m2^{k+1}$-bit numbers, $A = a_{m2^{k+1}-1} \ldots a_0$, $B = b_{m2^{k+1}-1} \ldots b_0$ and $D = d_{m2^{k+1}-1} \ldots d_0$, and produces two $m2^{k+1}$-bit numbers, namely, Sum and Carry. The most significant bit in Carry, $C_{m2^{k+1}-1}$, is moved to the least significant bit position; this is called the end-around carry. Assuming $\mathrm{Carry} = C_{m2^{k+1}-2} \ldots C_0 C_{m2^{k+1}-1}$ and $\mathrm{Sum} = S_{m2^{k+1}-1} \ldots S_0$, we have
$$|A + B + D|_{2^{2^{k+1} m} - 1} = |\mathrm{Sum} + \mathrm{Carry}|_{2^{2^{k+1} m} - 1}.$$
For the $(2k + 3)$ input numbers $T_1, T_2, \ldots, T_{2k+3}$, we need a CSA tree to reduce these input numbers to a pair of Sum and Carry. This CSA tree consists of $(2k + 1)$ CSAs, each with an EAC, arranged in $\lceil \log_{3/2}((2k + 3)/2) \rceil$ levels, where $\lceil x \rceil$ stands for the least integer $i \geq x$. Each CSA is of $m2^{k+1}$ bits [22]. This CSA tree is shown in Fig. 2.
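The behaviour of a CSA with EAC can be modelled at the word level in Python. The names `csa_eac` and `csa_reduce` are hypothetical. The reduction loop below applies the CSAs one after another, so it reproduces the CSA count of $(2k + 3) - 2 = 2k + 1$ but not the level arrangement of the tree. The five operands are values consistent with Example 2 ($m = 2$, $k = 1$), chosen here for illustration.

```python
def csa_eac(a, b, d, n):
    """n-bit carry save adder with end-around carry: returns (Sum, Carry)."""
    mask = (1 << n) - 1
    s = a ^ b ^ d                               # per-bit sum outputs
    c = (a & b) | (a & d) | (b & d)             # per-bit carry outputs
    carry = ((c << 1) | (c >> (n - 1))) & mask  # top carry bit wraps to bit 0
    return s, carry

def csa_reduce(operands, n):
    """Reduce a list of n-bit operands to (Sum, Carry); also count the CSAs used."""
    ops = list(operands)
    used = 0
    while len(ops) > 2:
        s, c = csa_eac(ops[0], ops[1], ops[2], n)
        ops = ops[3:] + [s, c]
        used += 1
    return ops[0], ops[1], used

s, c, used = csa_reduce([170, 68, 238, 133, 167], 8)
print(used, (s + c) % 255)
```

The wrap-around of the top carry bit is valid because $2^n \equiv 1 \pmod{2^n - 1}$, so doubling the carry word modulo $2^n - 1$ is a cyclic left shift by one position.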
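The proposed parallel R/B converter for $S_k$ is shown in Fig. 3. It consists of two parts: a CSA tree and a modulo adder. The CSA tree reduces the modulo addition of the $(2k + 3)$ $T_i$'s to a pair of Sum and Carry, i.e.,
$$\left| \sum_{i=1}^{2k+3} T_i \right|_{2^{2^{k+1} m} - 1} = |\mathrm{Sum} + \mathrm{Carry}|_{2^{2^{k+1} m} - 1}.$$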
Then, the modulo adder adds Sum and Carry together to generate the binary number X.
The CSA tree consists of $(2k + 1)$ CSAs, each of which consists of $m2^{k+1}$ full adders or half adders (FAs/HAs). The modulo adder used in Fig. 3 is the one proposed in Ref. [23] and has an approximate complexity of $m2^{k+1}$ full adders. Thus, the converter has a total of $(2k + 2)m2^{k+1}$ FAs/HAs.
The CSA tree has $\lceil \log_{3/2}((2k + 3)/2) \rceil$ levels, each having a delay, $t_{FA}$, of a full adder [24]. The delay of the modulo adder is approximately $m2^{k+1} t_{FA}$ [23]. Thus, the total delay of the converter is $(\lceil \log_{3/2}((2k + 3)/2) \rceil + m2^{k+1}) t_{FA}$.

TWO SPECIAL CASES
R/B Converter For S 0
The moduli set $S_0 = \{2^m - 1, 2^{2^0 m} + 1\}$ with $M = 2^{2m} - 1$ is a special case of the moduli set $S_k$ ($k = 0$). A binary number $X$ in the dynamic range $[0, M - 1]$ is represented by $(x_{-1}, x_0)$. By Theorem 1, $X$ is computed as follows:
For the moduli set $S_0$, we obtain the converter from the general architecture of Fig. 3. This converter consists of a $2m$-bit CSA with EAC and a $2m$-bit modulo adder. This is shown in Fig. 4.
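The adder count and delay expressions of the general architecture can be evaluated with a short Python sketch. The function names are hypothetical. The number of CSA levels is obtained here by simulating a tree in which every level turns each group of three operands into two, which is an assumption about how the tree of Fig. 2 is arranged.

```python
def csa_levels(num_operands):
    """Levels of a CSA tree that reduces num_operands numbers to two."""
    levels = 0
    while num_operands > 2:
        num_operands -= num_operands // 3   # each full group of three loses one operand
        levels += 1
    return levels

def converter_cost(m, k):
    """Return (number of FAs/HAs, delay in units of t_FA) for the S_k converter."""
    n = m * 2**(k + 1)                      # word length of the CSAs and modulo adder
    adders = (2 * k + 2) * n                # (2k + 1) CSAs plus the modulo adder
    delay = csa_levels(2 * k + 3) + n       # CSA levels plus modulo adder delay
    return adders, delay

for m, k in [(4, 0), (2, 1), (4, 1)]:
    print(m, k, converter_cost(m, k))
```

For $S_0$ with $m = 4$ the sketch gives one CSA level and a delay of $9\,t_{FA}$, which would correspond to the 18 gate delays quoted for the 8-bit case if one full-adder delay is taken as two gate delays (an assumption made here, not a figure stated in the text).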
Out of the $2m$ adders needed for the $2m$-bit CSA with EAC, $(2m - 2)$ adders are half adders, since $T_2$ has $(m - 1)$ bits of 0's and $T_3$ has $(m - 1)$ bits of 1's [15]. Hence, the CSA has 2 full adders and $(2m - 2)$ half adders. As mentioned in the previous section, the $2m$-bit modulo adder is equivalent to $2m$ full adders in terms of complexity. Thus, the complexity of the proposed converter for $S_0$, measured by the number of transistors, is equivalent to $2 \times 20 + (2m - 2) \times 10 + 2m \times 20 = 60m + 20$ transistors, using Table I of Ref. [5].
The delay calculation is carried out in a similar manner. Table I gives a performance comparison of the proposed converter with the one in Ref. [5], where $D$ refers to the delay of a logic gate. It is clear from Table I that the proposed converter is twice as fast as the converter in Ref. [5], while requiring only one-half of the hardware. For 8-bit dynamic range ($m = 4$), the FA-based R/B converter of Ref. [10] requires 837 transistors with 51 gate delays, the converter based on $S_k$ of Ref. [5] requires 510 transistors with 38 gate delays, while the proposed converter requires only 220 transistors with 18 gate delays.

R/B Converter For S 1
The moduli set $S_1 = \{2^m - 1, 2^{2^0 m} + 1, 2^{2^1 m} + 1\}$ with $M = 2^{4m} - 1$ is another special case of the moduli set $S_k$ ($k = 1$). A binary number $X$ in the dynamic range $[0, M - 1]$ is represented by $(x_{-1}, x_0, x_1)$. By Theorem 1, $X$ is computed as follows:
For the moduli set $S_1$, we obtain the converter again from the general architecture of Fig. 3. This converter
The performance of the proposed converter is compared with that of the converter in Ref. [5] using the same measures (namely, the number of transistors and delays) as used in the "R/B converter for $S_0$" section. These results are given in Table II. It is evident from Table II that the proposed converter is three times as fast, while requiring only 60% of the hardware.

VLSI IMPLEMENTATION
The VLSI implementation of the proposed converters is carried out using 0.5-micron CMOS technology. The design flow is as follows [25,26]. First, we construct a new library to include all the building blocks mentioned in the sections "R/B converter for $S_k$" and "Two special cases". Then, the architectures of the converters are coded in VHDL using this library. The code is simulated at the RTL level to verify the correctness of the design. Logic synthesis is then carried out to optimize the design, and gate-level simulation is performed. Next, placement and routing are carried out automatically to generate the layout. Finally, the performance analysis at the layout level is carried out. The software packages used include Cadence V4.4.
Following the design flow, the R/B converters based on $S_k$ can be implemented for any given $k$, for the 8-bit, 16-bit, 32-bit and 64-bit dynamic ranges. Here, we only show the implementation of the converters based on $S_0$. For this implementation, the $2^8 - 1$, $2^{16} - 1$, $2^{32} - 1$ and $2^{64} - 1$ ranges are chosen instead of the four entire binary ranges. The reason is that these ranges are efficient for $S_0$; the entire binary ranges can be accommodated by the proposed architectures with an additional combinational circuit or ROM for the overflow cases, but this introduces additional complexity. The implementations of the 8-bit, 16-bit, 32-bit and 64-bit converters of $S_0$ for the $2^8 - 1$, $2^{16} - 1$, $2^{32} - 1$ and $2^{64} - 1$ ranges based on the special sets are shown in Table III.
Results concerning the area and time performance of these four converters are obtained using Cadence V4.4.1 (Release 9504) tools and summarized in Table IV. The core area refers to the area of the circuit layout, the chip area refers to the area of the circuit and input/output/power pads, while the time indicates the latency of the chip. No comparison of the performance has been carried out with the existing converter for $S_0$ in Ref. [5], since no implementation data are available in Ref. [5].

CONCLUSION
A high-speed parallel R/B converter for the general moduli set $S_k = \{2^m - 1, 2^{2^0 m} + 1, 2^{2^1 m} + 1, \ldots, 2^{2^k m} + 1\}$ has been proposed. The new converter uses no multipliers. The individual converters for the moduli sets $S_0$ and $S_1$ have been derived from the general architecture. The R/B converter for $S_0$ is twice as fast and requires only one-half of the hardware, while that for $S_1$ is three times as fast and requires only 60% of the hardware,