Analysis of Fast Radix-10 Digit Recurrence Algorithms for Fixed-Point and Floating-Point Dividers on FPGAs

Decimal floating point operations are important for applications that cannot tolerate errors from conversions between binary and decimal formats, for instance, commercial, financial, and insurance applications. In this paper we present five different radix-10 digit recurrence dividers for FPGA architectures.The first one implements a simple restoring shift-and-subtract algorithm, whereas each of the other four implementations performs a nonrestoring digit recurrence algorithm with signed-digit redundant quotient calculation and carry-save representation of the residuals. More precisely, the quotient digit selection function of the second divider is implemented fully by means of a ROM, the quotient digit selection function of the third and fourth dividers are based on carrypropagate adders, and the fifth divider decomposes each digit into three components and requires neither a ROMnor amultiplexer. Furthermore, the fixed-point divider is extended to support IEEE 754-2008 compliant decimal floating-point division for decimal64 data format. Finally, the algorithms have been synthesized on a Xilinx Virtex-5 FPGA, and implementation results are given.


Introduction
Many applications, particularly commercial and financial applications, require decimal floating-point operations to avoid errors from conversions between binary and decimal formats.This paper presents five different decimal fixedpoint dividers and analyzes their performances and resource requirements on FPGA platforms.All five architectures apply a radix-10 digit recurrence algorithm but differ in the quotient digit selection (QDS) function.
The first fixed-point divider (type1) implements a simple shift-and-subtract algorithm.It is characterized by an unsigned and nonredundant quotient digit calculation.Nine divisor multiples are precomputed, and in each iteration step nine carry-propagate subtractions are performed on the residual.Finally, the smallest, nonnegative difference is selected by a large fan-in multiplexer.This type1 implementation is characterized by a high area use.
The second divider (type2) uses a signed-digit quotient calculation with a redundancy of  = 8/9 and operands scaling to get a normalized divisor in the range of 0.4 ≤ divisor < 1.0.The quotient digit selection (QDS) function can be implemented fully by a ROM because it depends only on the two most significant digits (MSDs) of the residual as well as the divisor.The residual uses a redundant carrysave representation but, because of performance issues, the two MSDs are implemented by a nonredundant radix-2 representation.
The quotient digit selection (QDS) functions of the third and fourth divider (type3.aand type3.b)are based on comparators for the two most significand digits.The comparators consist of short binary carry-propagate adders (CPA), which can be implemented very efficiently in the FPGA's slice structure.The corresponding comparative values depend on the divisor's value and are precomputed and stored in a small ROM.The redundancy is  = 8/9; thus, 17 binary CPAs are required.Similar to the type2 divider, the type3.aand type3.bdividers use prescaling of the divisor 0.4 ≤ divisor < 1.0, a redundant carry-save representation of the residual, and a nonredundant radix-2 representation of the two MSDs.The dividers type3.aand type3.bdiffer only in the implementation of large fan-in multiplexers in the digit recurrence step.The type3.a divider implements multiplexers that minimize the LUT usage but have long latencies, whereas International Journal of Reconfigurable Computing the type3.bdivider implements a faster dedicated multiplexer that exploits the FPGA's internal carry chains.
The quotient digit selection function of the last divider (type4) requires neither a ROM nor a multiplexer.It is characterized by divisor scaling (0.4 ≤ divisor < 0.8) and a signed-digit redundant quotient calculation with a redundancy of  = 8/9.The quotient digit is decomposed into three components having values {−5, 0, 5}, {−2, 0, 2}, and {−1, 0, 1}.The components are computed one by one, whereby the digit selection function is constant; that is, the selection constants do not depend on the divisor's value.Similar to the type2, type3.a,and type3.bdividers, the type4 divider uses a carry-save representation for the residual with a nonredundant radix-2 representation of the two MSDs.
The fixed-point division algorithms are implemented and analyzed on a Virtex-5 FPGA.Finally, the type2 divider, which shows the best tradeoff in area and delay, is extended to a floating-point divider that is fully IEEE 754-2008 compliant for decimal64 data format, including gradual underflow handling and all required rounding modes.
The architectures of the type1, type2, and type4 dividers have already been published in [1].However, this paper gives a more detailed description of the previous research and introduces two new dividers, which fill the design gap between the type1 and type2 dividers because they are based on two extreme examples of algorithms: the type1 divider implements a restoring quotient digit selection (QDS) function that requires nine decimal carry-propagate adders (CPAs) of full precision, whereas the nonrestoring QDS function of the type2 divider is implemented fully by means of a ROM with limited precision.In comparison, the new dividers implement nonrestoring QDS functions that are based on fast binary CPAs with limited precision.
The outline of this paper is given as follows: Section 2 motivates the use of decimal floating-point arithmetic and its advantage compared to binary floating-point arithmetic.The underlying decimal floating-point standard IEEE 754-2008 is introduced in Section 3. The digit recurrence division algorithms as well as the five different architectures of fixedpoint dividers are presented in Section 4. These fixed-point dividers are extended to a decimal floating-point divider in Section 5. Postplace and route results are presented in Section 6, and finally in Section 7 the main contributions of this paper are summarized.

Decimal Arithmetic
Since its approval in 1985, the binary floating-point standard IEEE 754-1985 [2] is the most widely used implementation of floating-point arithmetic and the dominant floating-point standard for all computers.In contrast to binary arithmetic, decimal units are more complex, require more area, and are more expensive, and the simple binary coded decimal (BCD) data format has a storage overhead of approximately 20%.Thus, at that time of approval the use of binary in preference to decimal floating-point arithmetic was justified by the better efficiency.
Most people in the world think in decimal arithmetic.These decimal numbers must be converted to binary numbers when using a computer.However, some common finite numbers can only be approximated by binary floating-point numbers.The decimal number 0.1, for example, has a periodical continued fraction 0.1 10 = 0.0001100 2 .It cannot be represented exactly in a binary floating-point arithmetic with finite precision, and the conversion causes rounding errors.
As a consequence, binary floating-point arithmetic cannot be used for any calculations which do not tolerate conversion errors between decimal and binary numbers.These are, for instance, financial and business applications that even require decimal arithmetic by law [3].Therefore, commercial application often use nonstandardized software to perform decimal floating-point arithmetic.However, these software implementations are usually from 100 to 1000 times slower than equivalent binary floating-point operations in hardware [3].
Because of the increasing importance, specifications for decimal floating-point arithmetic have been added to the IEEE 754-2008 standard for floating-point arithmetic [4] that has been approved in 2008 and offers a more profound specification than the former radix-independent floatingpoint arithmetic IEEE 854-1987 [5].Therefore, new efficient algorithms have to be investigated, and providing hardware support for decimal arithmetic is becoming more and more a topic of interest.
IBM has responded to this market demand and integrates decimal floating-point arithmetic in recent processor architectures such as z9 [6], z10 [7], Power6 [8], and Power7 [9].The Power6 is the first microprocessor that implements IEEE 754-2008 decimal floating-point format fully in hardware, while the earlier released z9 already supports decimal floating-point operations but implements them mainly in millicode.Nevertheless, the Power6 decimal floating-point unit is as small as possible and is optimized to low cost.It reuses registers from the binary floating-point unit, and the computing unit mainly consists of a wide decimal adder.Thus, its performance is rather low.Other floating-point operations such as multiplication and division are based on this adder and are performed sequentially.The decimal floating-point units of z10 and Power7 are designed similarly to those of the Power6 [7,9].

IEEE 754-2008
The floating-point standard IEEE 754-2008 [4] has revised and merged the IEEE 754-1985 standard for binary floatingpoint arithmetic [2] and IEEE 854-1987 standard for radixindependent floating-point arithmetic [5].As a consequence, the choice of radices has been focused on two formats: binary and decimal.In this paper we consider only the decimal data format.A decimal number is defined by the triple consisting of the sign (), significand (), and exponent (): with  ∈ {0, 1},  ∈ [0, 10 In this paper we exclusively use the integer representation  with the exponent .
Unlike binary floating-point format, the decimal floatingpoint number is not necessarily normalized.This leads to a redundancy, and a decimal number might have multiple representations.The set of representations is called the floating-point number's cohort.For example, the numbers 123 ⋅ 10 0 , 1230 ⋅ 10 −1 , and 12300 ⋅ 10 −2 are all members of the same cohort.More precisely, if a number has  ≤  significant digits, the number of representations is  −  + 1.
IEEE 754-2008 defines three decimal interchange formats (decimal32, decimal64, and decimal128) of fixed width 32, 64, and 128 bits.As depicted in Figure 1, a floating-point number is encoded by three fields: the sign bit , the combination field , and the trailing significand field.The combination field encodes whether the number is finite, infinite, or not a number (NaN).Furthermore, in case of finite numbers the combination field comprises the biased exponent ( =  + bias) and the most significant digit (MSD) of the significand.The remaining 3 ⋅  digits are encoded in the trailing significand field of width  =  ⋅ 10.The trailing significand can either be implemented as a binary integer or as a densely packed decimal (DPD) number [4].Binary encoding makes software implementations easier, whereas DPD encoding is favored by hardware implementations, as it is the case in this paper.
The encoding parameters for the three fixed-width interchange formats are summarized in Table 1.This paper focuses on the data format decimal64 with DPD coded significand.DPD encodes three decimal digits (four bits each) into a declet (10 bits) and vice versa [4].It results in an storage overhead of only 0.343% per digit.
IEEE 754-2008 defines five rounding modes.These are two modes to the nearest (round ties to even and round ties to away) and three directed rounding modes (round toward positive, round toward negative, and round toward zero) [4].As floating-point operations are obtained by first performing the exact operation in the set of real numbers and then mapping the exact result onto a floating-point number, rounding is required whenever all significant digits cannot be placed in a single word of length .Moreover, inexact, underflow, or overflow exceptions are signaled when necessary.

Decimal Fixed-Point Division
Oberman and Flynn [10] distinguish five different classes of division algorithms: digit recurrence, functional iteration, very high radix, table look-up, and variable latency, whereby many practical algorithms are combinations of multiple classes.Compared to binary arithmetic, decimal division is more complex.Currently, there are only a few publications concerning radix-10 division.
Wang and Schulte [11] describe a decimal divider based on the Newton-Raphson approximation of the reciprocal.The latency of a decimal Newton-Raphson approximation directly depends on the latency of the decimal multiplier.A pipelined multiplier has the advantage that more than one division operation can be processed in parallel; otherwise the efficiency is poor.However, the algorithm lacks remainder calculation, and the rounding is more complex.The first FPGA-based decimal Newton-Raphson dividers are presented in [12].The dividers as well as the underlying multipliers are sequential; hence, these dividers have a high latency.
Digit recurrence division is the most widely implemented class of division algorithms.It is an iterative algorithm with linear convergence; that is, a fixed number of quotient digits is retired every iteration step.Compared to radix-2 arithmetic, radix-10 digit recurrence division is more complex because, on the one hand, decimal logic is less efficient by itself and, on the other hand, the range of the quotient digit selection (QDS) function comprises a larger digit set.Therefore, the performance of the digit recurrence divider depends on the choice of the QDS function and the implementation of decimal logic.
Nikmehr et al. [13] select quotient digits by comparing the truncated residual with limited precision multiples of the divisor.Lang and Nannarelli [14] replace the divisor's multiples by comparative values obtained by a look-up table and decompose the quotient digit into a radix-2 digit and a radix-5 digit in such a way that only five and two times the divisor are required.Vázquez et al. [15] take a different approach: the selection constants in the QDS function are obtained of truncated multiples of the divisor, avoiding lookup tables.Therefore, the multiples are computed on-thefly.Moreover, the digit recurrence iteration implements a slow carry-propagate adder, but an estimation of the residual is computed to make the determination of the quotient digits independent of this carry-propagate adder.The decimal divider of the Power6 microprocessor [16] uses extensive prescaling to bound the divisor to be greater than or equal to 1.0 but lower than 1.1112 in order to simplify the digit selection function.However, this prescaling is very costly, it requires a 2-digit multiply, and it needs overall six cycles on each operand.Furthermore, the digit selection function still requires a look-up table.
The first decimal fixed-point divider designed for FPGAs applies digit recurrence algorithm and is proposed by Ercegovac and McIlhenny [17,18].However, it has a poor cycle time, because it does not fit well on the slice structure of FPGAs but would probably fit well into ASIC designs with dedicated routing.

Digit Recurrence Division.
In the following, we use different decimal number representations.A decimal number B is called BCD- 0  1  2  3 coded when B can be expressed by A number representation is called redundant if one or more digits have multiple representations.
Moreover, we consider a division  = / in which the decimal dividend  and the decimal divisor  are positive and normalized fractional numbers of precision , that is, 0.1 ≤ ,  < 1.The quotient  and the remainder rem are calculated as follows: Then the radix-10 digit recurrence is implemented by with the initial residual [0] = /10 and with the quotient digit calculated by the selection function Digit recurrence division is subdivided into the classes of restoring and nonrestoring algorithms, that differ in the quotient digit selection (QDS) function and the dynamical range of the residual.A restoring divider selects the next positive quotient digit 0 ≤  +1 ≤ 9 such that the next partial residual is as small as possible but still positive.The IBM z900 architecture, for instance, implements such a decimal restoring divider [19].By contrast, the QDS function of a decimal nonrestoring divider uses a digit set that is positive as well as negative − ≤  +1 ≤ , and the partial residual [ + 1] might also be negative.One advantage of restoring division is the enhanced performance since estimates of limited precision ( ŵ[] ≈ [] and d ≈ ) might be used in the QDS function SEL (10 ŵ[], d).This performance gain is achieved by using a redundant digit set, − ≤   ≤ , which defines a redundancy factor Then, in each iteration step the quotient digit  +1 should be selected such that the next residual [ + 1] is bounded by which is called the convergence condition [20].This paper presents five different decimal fixed-point dividers that are described in the following.The first divider (type1) is a restoring divider whereas the four others (type2, type3.a,type3.b,and type4) are nonrestoring dividers.

Type1 QDS Function.
The QDS function of the type1 algorithm is very simple.Nine multiples of the divisor (1, . . ., 9) are subtracted in parallel from the residual by carry-propagate adders (CPAs), and the smallest positive result is selected.The CPAs exploit the FPGA's internal fast carry logic, as described in [21].
The nine multiples are precomputed and are composed of the multiples 1, 2, 5, 10 and their negatives (10's complement), which require at most one additional CPA per multiple.The multiples 2 and 5 can be easily computed by digit recoding and constant shift operations where ( 9) is read as follows.A BCD-5421 coded number X left-shifted by one bit is equivalent to the corresponding BCD-8421-coded number multiplied by two.In a similar fashion we obtain a multiplication by five using (10).
The implementation of the type1 digit recurrence divider is characterized by a high area use, due to the utilization of nine parallel CPAs.The corresponding algorithm of the type1 digit recurrence is shown in Algorithm 1.

Type2 QDS Function.
The approach of the type2 division algorithm is based on the implementation of the QDS function fully by a ROM.This ROM is addressed by estimates of the residual and divisor.The use of estimates is feasible because a signed digit set together with a redundancy greater than 1/2 is used.The estimates 10[] and d are obtained by truncation; that is, only a limited number of MSDs of the residual 10[] and divisor  are regarded.Furthermore, the residual is implemented in BCD-4221 carry-save representation such that the maximum error introduced by an estimation with precision  (one integer digit and  − 1 fractional digits) is bounded by  = 2 ⋅ 10 −+1 .Negative residuals are represented by their 10's complement.
These selection constants are bounded by selection intervals Furthermore, the continuity condition states that every value 10[] must belong at least to one selection interval [20].Considering the maximum error due to truncation, the continuity condition can be expressed by If we suppose the subrange width to be constant ( +1 −   = 10 − ) and consider that the minimum overlapping occurs for minimum  min and maximum  max , then we obtain the term We choose  = 8/9 ( max = 8) and  min = 0.4 that leads to  = 2 and  = 2. Thus, for the divisors's estimation d is required an accuracy of two fractional digits, and for the residual's estimation 10 is required an accuracy of one integer and one fractional digit.Furthermore, the divisor must be pre-scaled such that 0.4 ≤  < 1.0.The P-D diagram is a visualization technique for designing a quotient digit selection function and for computing the decision boundaries   ().It plots the shifted residual versus the divisor.The P-D diagram for the type2 quotient digit selection function with the selection constants   () is shown in Figure 2.
The two MSDs of the dividend and residual address the ROM in order to obtain the signed quotient digit, which comprises 5 bits.In order to reduce the ROM's size, the two MSDs of the divisor and the residual are implemented by using a nonredundant radix-2 representation.This leads to an address that is composed of 15 bits: seven bits from the ( unsigned radix-2 representation of the divisor's two MSDs and eight bits from the signed radix-2 representation of the residual's two MSDs.Hence, the size of the ROM is 2 15 × 5 bits. Unfortunately, the use of radix-2 representation complicates the multiplication by 10, which is required according to (5).Therefore, an additional binary adder is needed: The corresponding algorithm for the calculation of the ROM entries can be derived from the P-D diagram and is depicted in Algorithm 2. It should be noted that the radix-2 representations of d and ŵ are truncations of  and ; that is, d ≤  and ŵ ≤ .
Once the quotient digit  +1 has been determined, the multiple  +1 ⋅  is subtracted from the current residual to compute the next residual, as stated in (5).The subtracter is a fast redundant BCD-4221 carry-save adder (CSA), as described in [21], except for the two MSDs in wich we apply  (1) Select Algorithm 3: Pseudocode for the type2 digit recurrence.
For both, radix-2 and BCD-4221 radix-10 representation, the complements can be computed by inverting each bit and adding one.In summary, the subtraction uses a (3 : 1) radix-2 CPA for the two MSDs and a redundant radix-10 (4 : 2) CSA for the remaining digits.The digit recurrence algorithm of the type2 divider is summarized in Algorithm 3, and its block diagram is depicted in Figure 3.

Type3 QDS Function.
The frequency limiting component of the type2 divider is the digit recurrence step, which comprises the ROM (BRAM) to calculate the next quotient digit.This BRAM is slower compared to common FPGA logic.Hence, it appears to be beneficial to remove the slow BRAM from the critical path and use fast FPGA logic instead.
To analyze this impact of the BRAM delays, we implement two dividers (type3.aand type3.b),which can also be realized without BRAM in the critical path.Lang and Nannarelli [14] propose a divider that implements a quotient digit selection function based on CPAs and sign detection.These CPAs subtract fixed values from the current residual with limited precision.These fixed values depend on the estimates of the divisor and are precomputed and stored in a ROM.Furthermore, the quotient digit is divided into two parts, which further simplifies the quotient digit selection function and reduces the critical path.Unfortunately, this divider cannot be implemented efficiently on FPGAs because the required carry-save adder with small error estimation shows a poor performance on the FPGA's slice structure.Nevertheless, to investigate the impact of BRAM delays in the type2 divider, we implement another two dividers (type3.aand type3.b)that have no BRAM in the critical path but a CPA-based quotient digit selection function instead.
The type3 dividers are modifications of the type2 divider.The common architectural features are (i) the multiples of the divisor are compos of two components: , where  1,2 ∈ {0, ±1, ±2, ±5, ±10},   ( Algorithm 5: Pseudocode for the type3 digit recurrence. iteration step.The advantage of the type3 QDS function is its short critical path, which comprises only one 8-bit binary carry-propagate adder.However, this short critical path is bought at a high price because the complexity is moved to the selection of the divisors multiples.Hence, large fan-in multiplexers with poor latencies are required.In order to minimize the impact of these multiplexers, we designed a new dedicated (17 : 1) multiplexer that exploits the FPGA's fast carry chains and has a delay of only one LUT instance.In the following we name the divider with improved fast multiplexers type3.b and with traditional multiplexers type3.a.The new dedicated (17 : 1) multiplexer uses the 16-sign bits as select lines and is coded as follows: if sel =  1111.1111.1111.1111 then  =  (16) if sel =  0111.1111.1111.1111 then  =  (15) . . .if sel =  0000.0000.0000.0001 then  =  (1) if sel =  0000.0000.0000.0000 then  =  (0) ,  ( 15) ( 14) (1) (0) the type2, type3.a,and type3.bdividers, the type4 divider uses signed-digit redundant quotient calculation, carry-save representation of the residual, fast BCD-4221 CSAs for the digit recurrence, and a radix-2 implementation of the MSDs.Quotient digits are decomposed into three components  +1 = 5 ⋅  3 +1 + 2 ⋅  2 +1 +  1 +1 , and each component   is computed by a distinct selection function.These components can hold three values   ∈ {−1,0,1} so that only two comparators per component are required to distinguish between these values.Since −8 ≤  +1 ≤ 8, we get a redundancy factor of  = 8/9.Furthermore, the selection functions become very simple due to prescaling of the divisor (0.4 ≤  < 0.8); that is, they are constant functions SEL  that do not depend on the divisor anymore: The recurrence is then defined as follows: The multiples 2 and 5 can be easily and fast computed according to ( 9) and (10) Likewise, the lower limits    can be computed Due to the redundancy factor  = 8/9 > 1/2 we have overlapping regions [  +1 ,    ], where more than one quotient digit component   +1 may be selected.The decision boundary of a selection function should lie inside this overlapping regions.Figure 6 shows the P-D diagram for  3 .The figures for  2 and  1 look similar.As  3 can hold three different values, the selection function requires two decision boundaries.Generally, it is implemented by a staircase function (see Figure 2), but for the bounded divisor  ∈ [0.4,0.8) it is independent of the divisor and is reduced to a constant function (see Figure 6).Fortunately, the selection functions for  2 and  1 are also independent of the divisor and can also be implemented by constant functions in a similar fashion.
We now determine the required number of fractional digits for the estimates 10, v1 , and v2 .Moreover, we choose suitable selection constants   pos and   neg .These constants depend on the maximum error due to the estimates and on the minimum overlapping width, which is given by   0 (0.4) −   1 (0.8) as well as   −1 (0.8) −   0 (0.4).Since we use BCD-4221 carry-save redundant digit representation, for a precision of  digits (one integer digit and  − 1 fractional digits), we have a maximum error of  = 2 ⋅ 10 −(−1) .Moreover, the estimations ŵ are computed by truncations ( − ŵ ≤ ).This means, for given positive and negative selection constants   pos > 0 and   neg < 0, the following expressions must be true: If we choose the selection constants as listed in Table 2, all conditions are fulfilled for a precision of  = 2 digits, and each selection function is reduced to two simple 2-digit comparators.
As soon as one component   of the quotient digit is computed, the corresponding multiple of the divisor must be subtracted from the partial residual.Each component   can hold three different values   ∈ {−1,0,+1} that are multiplied with the corresponding weighted divisor  ⋅  with  ∈ {5, 2, 1}.In other words,  ⋅  is either passed, negated, Similar to type2 and type3 division, we implement the two MSDs of the type4 residual by using a nonredundant radix-2 representation, which requires less LUTs and results in faster quotient digit selection function.The pseudocode for the digit recurrence iteration is shown in Algorithm 6, and the corresponding block diagram is depicted in Figure 7.

Proposed Decimal Fixed-Point Divider.
For each type of division algorithms presented in the preceding sections we have implemented a corresponding fixed-point divider, which is described in the following.For all fixed-point dividers we expect the input operands to be normalized.
The block diagram of the type1 divider is depicted in Figure 8. First, the divisor multiples {2, . . ., 9} are precomputed.In the following cycles  = 16 quotient digits (16 + 1 if the first digit is zero) are computed one by one, followed by an additional rounding digit.The quotient (Z) and the quotient +1 (ZP1) are computed on-the-fly by using two registers that are updated every cycle.The algorithm has the advantage that no additional slow decimal CPA is required to compute the incremented quotient.It is described more precisely in Section 5.The normalization of the result is also performed on-the-fly by locking the conversion when 16 + 1 quotient digits for the significand and the rounding digit have been computed; that is, in the worst case, when the first quotient digit is zero, 16 + 1 + 1 = 18 cycles are required.Furthermore, the sticky bit is calculated, which is required for rounding.It is set to one whenever the final remainder is unequal to zero.
The architectures of the type2, type3.a,type3.b,and type4 are similar in their structure but differ in their digit recurrence algorithm and scaling.The common block diagram is shown in Figure 9. First, the dividend and the divisor are rescaled and five as well as two times the divisor are precomputed according to ( 9) and (10).Then, the operands are recoded because the digit recurrence iteration uses a redundant digit representation with BCD-4221 carry-save adders.
The digit recurrence retires in each iteration one signed quotient digit   .The on-the-fly-conversion algorithm converts this signed-digit representation to the BCD-8421-coded International Journal of Reconfigurable Computing  (1) select Algorithm 6: Pseudo code for the type 4 digit recurrence.result [20] and computes the quotient (Z), the quotient +1 (ZP1), and quotient −1 (ZM1).This conversion is accomplished every iteration step and does not need a slow CPA.The incremented and decremented quotients are required for rounding.Moreover, the quotient is also normalized on-thefly by locking the conversion when 16 + 1 quotient digits (including the rounding digit) have been computed.The algorithm is described explicitly in Section 5 because it is accomplished together with the gradual underflow handling of the floating-point divider.Furthermore, the sticky bit is calculated, which is required for rounding.It is set to one whenever the final remainder is unequal to zero.Moreover, when the final remainder is negative, the quotient of the type2, type3.a,type3.b,and type4 dividers has to be adjusted by subtracting one LSD.This subtraction does not need another slow CPA because the quotient −1 (ZM1) has already been computed on-the-fly.The calculations of both, the sticky and sign bit, require the reduction of the redundant remainder by a CPA.This CPA might be subdivided into multiple smaller CPAs to keep the latency low.

IEEE 754-2008 Floating-Point Division
The IEEE 754-2008 compliant decimal floating-point divider is an extension of the normalized fixed-point divider.The divider presented in this paper supports the interchange format IEEE 754-2008 decimal64 with DPD encoding, but it can be easily adapted to any other precision and exponent range.
The block diagram of the divider is depicted in Figure 10.The DFP division begins with decoding of the dividend X and divisor D and the extraction of their signs, significands, and exponents.In the following, the significands of the operands X (  ) and D (  ) are regarded as integers.Therefore, the corresponding exponents are   and   .
According to [4], if one of the operands is a signaling NaN (sNAN) or quiet NaN (qNAN), then the result is also a quiet NaN with the payload of the original NaN.Hence, in order to preserve the payload while using an unmodified divider, the NaN handling unit sets the NaN holding operand as the dividend and resets the divisor to 1.0 ⋅ 10 0 , as depicted in Algorithm 7.
The fixed-point dividers presented in Section 4 require normalized operands.Therefore, the number of leading zeros for both operands is counted, and the significands are normalized by barrel shifters.Due to performance issues, the leading zeros counter exploit the FPGA's fast carry chains, as proposed in [23].
The exponent of a floating-point division is first estimated by the normalized quotient in fractional representation (QNF) and is updated iteratively in each digit recurrence step.Since most of the decimal floating-point numbers specified by IEEE 754-2008 have multiple representations of the same value, the result of a division might also have more than one correct representations that differ in the exponent.However, to obtain a unique and reproducible result in the case of such an ambiguity, IEEE 754-2008 defines the exponent of the result, which is called preferred exponent (QP).
Both exponents, QNF and QP, are positive integers and are determined as follows: The offset1 is a bias and assures the quotients to be positive since unsigned integer calculation is used in this paper.
The normalized fixed-point division unit then computes  = 16 significand digits and one additional rounding digit.Additionally, the divider encompasses the on-the-fly conversion unit that also detects and handles gradual underflow with zero delay overhead.One signed quotient digit   is retired in each cycle, and the on-the-fly-conversion algorithm converts this signed-digit representation into the BCD-8421-coded partial result [20].The conversion is accomplished every iteration step and does not need a slow CPA (see Algorithm 8).The partial quotient is stored in a register Z  .Moreover, two additional registers ZM1  and ZP1  are provided.ZM1  is the partial quotient decremented by one least significant digit (LSD), and ZP1  is the partial quotient incremented by one LSD.ZM1  is selected when the final residual in the last iteration is negative, and ZP1  is required by the following rounding operation.Furthermore, the exponent for the normalized significand in integer representation (QNI) is calculated (see (29)), and gradual underflow handling is performed at no extra cost (see Algorithm 8).The exponent QNI is computed by decrementing the exponent of the normalized significand in fractional representation (QNF) in each iteration step by one.The recurrence iteration terminates earlier in case of gradual underflow.The gradual underflow signal (GUF) is asserted high when QNI =  min , and the calculated integer quotient is less than 10  : The extended quotient (composed of the calculated quotient and the rounding digit) must be decremented by one if the rounding digit or the sticky bit is unequal to zero.Overflow and underflow are computed as described in the exponent calculation unit and are listed in Algorithms 8 and 9.The infinity exception is asserted when the result is greater than the largest number, that is, when either overflow or division by zero exception is asserted, or when the dividend is infinity while the divisor is a finite number.

Implementation Results
All dividers are modeled using VHDL and are implemented for Xilinx Virtex-5 devices with speed grade −2 using Xilinx ISE 10.1.The postplace and route results of the fixed-point dividers are listed in Table 4.
All five types of fixed-point dividers require 19 cycles to perform a division.This includes 17 cycles to determine the quotient digits and the rounding digit, one additional cycle to normalize the result (when the first quotient digit is zero), and one cycle latency to perform on-the-fly conversion and sticky bit determination.As expected, the type1 divider uses the most resources in terms of look-up tables (LUTs) and flipflops (FFs).The type2, type3.a,and type3.bdivider consumes less LUTs but require, further five (type2) or two (type3.aand type3.b)BRAMs.However, comparing the delay of the three dividers leads to an unexpected result.Contrary to decimal divider implementations in ASIC designs, on FPGA platforms the shift-and-subtract algorithm is the fastest.The reason is that the signal propagation on the FPGA's internal fast carry chains is faster than interconnections between slices over the FPGA's general routing matrix.The type1 divider exploits this fast carry chains, whereas type2, type3.a,type3.b,and type4 dividers only use the normal, slow slice interconnection resources.In ASIC implementations, for instance, the longest paths of the type4 divider would be much shorter than the longest path of the type1 divider, but on FPGA architectures the situation is the opposite.
One of the fastest decimal fixed-point divider on ASICs is the design of Lang and Nannarelli [14].They minimize the critical path in the digit recurrence by using a fast quotient digit selection function based on binary carry-propagate adders that subtract fixed values from the estimation of the current residual.These fixed values are dependent on estimations of the divisor and are precomputed and stored in a ROM.Therefore, the latency of the ROM does not contribute to the critical path of the digit recurrence.The implementation of such a divider on FPGAs would have a poor performance because the required carry-save adder with small error estimation cannot be implemented efficiently on the FPGA's slice structure.However, the concept of removing the ROM from the critical path is applied to the type3.aand type3.bdividers, and the corresponding postplace and route results are shown in Table 4.These results point out that the type3.autilizes a similar amount of LUTs as the type2 divider but the cycle time is much higher.The reason for this increased cycle time can be explained by the raised complexity of the multiplexers for the selection of the divisor's multiples in the digit recurrence step.These multiplexers show a poor performance in terms of propagation delay because they are implemented as a tree with slow slice interconnections.On the contrary, the type3.bdivider applies dedicated fast multiplexers, which have a propagation delay of only one LUT instance and eight fast carry chains.This dedicated fast multiplexer speeds up the divider and reduces the maximum cycle time by one nanosecond.Unfortunately, the number of used LUTs is increased dramatically by approximately 1000 LUTs.
The comparison of the type2, type3.a,type3.b,and type4 dividers in terms of speed and area (number of occupied LUTs and FFs combined) shows that the type2 algorithm is the fastest and requires the least number of slices.Hence, there is no benefit in removing the ROM's latency from the critical path because the complexity is moved to the selection of the divisor's multiples.The type1 divider is only 5% faster than the type2 divider, but the speed is bought at a high price because the number of occupied slices is 75% higher.Therefore, the type2 divider shows a good tradeoff in terms of area and latency and is used for the floating-point divider presented in this paper.In the following paragraph we compare the type2 implementation with other published fixed-point dividers.
Four other FPGA-based dividers are presented by Ercegovac and McIlhenny [17,18], Deschamp and Sutter [24], Zhang et al. [25], and Véstias and Neto [12].These implementations are compared to our type2 divider in the following.The divider presented in [17,18] is based on a digit recurrence algorithm that only requires limited-precision multipliers, adders, and LUTs.Furthermore, a compensation term is computed in the digit recurrence that compensates the error caused by this limited precision.The design is optimized for Virtex-5 FPGAs and has a good area characteristic (a 14digit divider requires 1263 LUTs) but suffers from a high cycle time of 13.1 ns, which is more than 50% higher compared to the type1 or type2 design proposed in this paper.The reason for that high cycle time is (similar to the type4 divider) the long critical paths across many LUTs that cannot exploit the FPGA's fast carry chains.Therefore, the design would probably fit better on ASICs with dedicated routing.The dividers implemented in [24] apply two different digit recurrence algorithms with a redundancy of  = 1.The quotient digit selection function is very complex because it requires three MSDs of the divisor and four MSDs of the residual.The cycle times are high but it must be taken into account that these designs are optimized for Virtex-4 FPGAs.Nevertheless, the number of used LUTs is very high.The divider presented in [25] applies a radix-100 digit recurrence algorithm with comprehensive prescaling of the dividend and divisor.It is optimized for Virtex-II Pro devices; hence, it is hard to compare it with the dividers presented in this paper.However, the critical path includes a decimal carry-save adder tree, two carry-propagate adder stages, and three multiplexer stages.It can therefore be assumed that this algorithm would also have a poor performance on Virtex-5 devices, although two quotient digits are retired in each iteration step.Véstias and Neto [12] propose four different decimal dividers optimized for different speed and area tradeoffs.Contrary to the circuits presented in this paper, the dividers apply the Newton-Raphson algorithm.The used multiplier is sequential; therefore, many cycles per division are required.However, each divider has a higher delay and a greater LUT usage compared to our type2 implementation.Table 5 lists the results of the dividers presented in [17,24,25] together with the results of the type2 divider.However, for a fair performance comparison the precision of the type2 divider is equalized to match the number of calculated quotient digits of the corresponding divider.Furthermore it should be noted that the dividers presented in [12,17,24,25] do not provide rounding support.
For the floating-point implementation, the fixed-point divider of type2 is used because type1 and type3.bconsume too many resources, and type3.aand type4 are too slow.However, five additional BRAMs are required, but these are available in sufficient quantities on Virtex-5 devices.The corresponding postplace and route results for two configurations with different numbers of serialization stages are listed in Table 6.A comparison with other implementations is complicated because there are no other FPGA-based implementations of decimal floating-point dividers published yet.Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA.In Table 7 postplace and route results of binary floating-point dividers for the double data format on a Virtex-5 provided by the Xilinx Core Generator [26] are listed.The dividers also provide exception signals overflow, underflow, invalid operation, and divide-by-zero.Unfortunately, the binary floatingpoint divider does not support gradual underflow because this feature increases the complexity.The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design.The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation.Obviously, decimal arithmetic has a great overhead regarding the total number of used LUTs, but the cycle time and latency are similar.This resource overhead can be explained, on the one hand, by the inefficient representation of decimal values on binary logic and, on the other hand by the more complex specification of the decimal standard IEEE 754-2008.For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.

Conclusion
In

Figure 2 :
Figure 2: P-D diagram for the type2 QDS function.

(
ii) the fast redundant BCD-4221 carry-save adders for the digit recurrence step as stated in(5), (iii) the radix-2 implementation of the two MSDs, and (iv) the 10's and 2's complements for  1,2 < 0.Furthermore, since the architectures of the type2 and type3 dividers are similar, with the exception of the quotient digit selection function, most of the dividers' characteristics are also identical, including the (i) the P-D diagram, (ii) the redundancy factor  = 8/9,  max = 8, (iii) the need of prescaling the divisor 0.4 ≤  < 1.0, (iv) the accuracy of the divisor's estimation  = 2, and (v) the accuracy of the residual's estimation  = 2.Contrary to the type2 divider, the QDS functions of the type3 dividers are implemented by 16 carry-propagate adders with sign detection.These adders subtract fixed selection constants from the estimation of the current residual.The selection constants are dependent on the current divisor and are stored in a ROM, which is then no more part of the critical path.The selection constants are computed according to Algorithm 4. The common digit recurrence algorithm of the type3.aand type3.bdividers is shown in Algorithm 5, and the corresponding block diagram is depicted in Figure4.The sign signals of the binary carry-propagate adders are used to determine the corresponding quotient digit and to select the multiples of the divisor in the digit recurrence

4. 5 .
Type4 QDS Function.The type4 divider applies a new algorithm as proposed by us in a previous paper[22].It is based on the decomposition of the signed quotient digit into three components having values {−5, 0, 5}, {−2, 0, 2}, and {−1, 0, 1} as well as a fast constant digit selection function.In this implementation, neither a ROM for the QDS function nor a multiplexer to select the multiple   ⋅  in the digit recurrence iteration step is required.The divider is intended to utilize less resources than other implementations.As

Figure 6 :
Figure 6: P-D diagram for the type4 QDS function.
this paper we have presented five different radix-10 digit recurrence division algorithms.The type1 divider implements the simple shift-and-subtract algorithm.The type2 divider is based on a nonrestoring algorithm with ROM-based quotient digit selection function.The type3.a and type3.bdividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function.However, the type3.adivider uses LUT-based multiplexers whereas the type3.buses fast carry chain-based multiplexers.The type4 divider applies a new algorithm with constant digit selection functions.This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.We have shown that the subtract-and-shift algorithm with the worst latency in ASIC architectures is the fastest design on FPGAs, but uses the most FPGA resources in terms of LUTs.A good tradeoff between latency and area is the architecture type2 with a redundant carry-save representation of the residual, a radix-2 representation of the two MSDs, and a quotient digit selection function implemented in ROM.Furthermore, we have extended thisfixed-point to a fully IEEE 754-2008 compliant decimal floating-point divider for decimal64 data format, and, finally, we have shown implementation results of all dividers.International Journal of Reconfigurable Computing Otherwise, if the divisor is greater than 10 −1 it follows that   ⏟⏟ ⏟⏟⏟ ⏟⏟  digits = (9.9⋅ ⋅ ⋅ 9) ⋅   + rem = (10 − 10 −+1 ) ⋅   + rem = 10 ⋅   + (−10 −+1 ⋅   + rem) ⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟ <0 = 10 ⋅   +  ⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟ ⩾+1 digits (A.5) with −10 <  < 0. The left term of the equation has, by definition, a precision of  digits while the right term of the equation has a precision of more than  digits.This contradiction, finally, proves that no rounding overflow can occur.
. Each selection function requires two comparators implemented as carry-save adders that subtract constant values from the residuals' estimations ( ŵ[], v1 [], and v2 []).As we will show in the following, these estimations require only the two most significant digits (MSDs) of the corresponding exact values.First, the selection intervals [   ,    ] for   +1 =  with  ∈ {−1, 0, 1} and  ∈ {1, 2, 3} have to be determined, where    is the smallest and    is the greatest value of the selection constant   () such that the next residual is still bounded.

Table 2 :
Constants for type4 quotient digit selection function.

Table 5 :
Performance comparison of fixed-point dividers.

Table 6 :
Postplace and route result of the floating-point divider based on type2.

Table 7 :
Postplace and route result of a 64-bit binary floating-point divider (Core Gen).