A Novel Low-power Shared Division and Square-root Architecture Using the GST Algorithm

Although the SRT division and square-root approaches and the GST division approach have been known for a long time, square-root architectures based on the GST approach that do not require a final division/multiplication by the scale factor have not been proposed so far. A GST square-root architecture is developed that requires neither a multiplication to update the scaled square-root quotient in each iteration nor a division/multiplication by the scaling factor after the square-root iterations are complete. Additionally, a quantitative comparison of the speed and power consumption of GST and SRT division/square-root units is presented. Shared divider and square-root units are designed based on the SRT and the GST approaches, in minimally and maximally redundant radix-4 representations. Simulations demonstrate that the worst-case overall latency of the minimally redundant GST architecture is 35% smaller than that of the SRT architecture. Alternatively, for a fixed latency, the division and square-root operations based on the minimally redundant GST architecture consume 32% and 28% less power, respectively, than the maximally redundant SRT approach.


INTRODUCTION
Among the four basic arithmetic functions, division is the most difficult to implement in silicon. Assuming the simplest implementations of adder, multiplier and divider, addition using a carry-ripple adder has a critical path of $n$ full-adder delays, and multiplication employing a carry-save scheme and a final carry-ripple adder requires $2 \cdot n$ full-adder delays (the multiplexer delays are neglected). However, division using the paper-and-pencil method, in which a carry-ripple adder is used in each iteration, has a critical path of $n \cdot n$ full-adder delays. Hence, there is a clear demand for fast, area-efficient and power-efficient divider algorithms and architectures. The computation of the square-root can be based on the division, so that most of the hardware can be shared between the two operations. Division and square-root operations are required in pocket calculators as basic arithmetic operations. Hence, division and square-root have to be included in every microprocessor, mostly implemented in the form of a co-processor. Nevertheless, due to the introduction of smaller technologies and the trend of implementing an entire system on a single chip (SOC), division and square-root operations are nowadays implemented on the main processor. Bose et al. [1] indicate that the total number of division operations can range from one-third to one-half the number of multiplications in computations. Oberman [2] concludes that even though division is an infrequent operation, it can result in performance degradation if its implementation is neglected.

In general, division and square-root are defined as

$$N = Q_d \cdot D + R, \tag{1}$$

$$X = Q_s \cdot Q_s + R, \tag{2}$$

where $N$, $D$, $Q_d$, $R$, $X$ and $Q_s$ are the dividend, divisor, division quotient, remainder, square-root operand and square-root quotient, respectively. Divider algorithms can generally be separated into two kinds, the Multiplicative Algorithms (MA) [3-5] and the Iterative Digit Recurrence Algorithms (IDRA), the latter also known as the paper-and-pencil method. A general overview of existing divider algorithms is given in Figure 1.

The IDRA performs the division by repeated subtractions,

$$R_{i+1} = r \cdot R_i - q_i \cdot D, \tag{3}$$

while the MA employs repeated multiplications to obtain the final quotient. The Newton-Raphson and the Goldschmidt algorithms are the most popular among the multiplicative algorithms. Among the IDRA, there are two main algorithms, the SRT (named after Sweeney, Robertson and Tocher [6,7]) and the GST (the G stands for generalized, ST for Svoboda and Tung [8,9]). The case of higher radix division with redundant digit sets, which also implies similar considerations for the corresponding higher radix square-root, has been extensively studied in [10]. Other significant work on division and division/square-root can be found in [11][12][13][14][15][16][17].

The number of iterations of the Multiplicative Algorithm is proportional to $\log_2(\text{word-length})$. However, since every iteration consists not of simple addition/subtraction operations but of two multiplications and one addition, very fast multipliers are needed to obtain the same efficiency as the IDRA for small and medium word-lengths. The large area requirement of this approach is a major drawback.
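The recurrence in Eq. (3) can be illustrated with a minimal sketch of the paper-and-pencil method. It assumes a non-redundant digit set $\{0, \ldots, r-1\}$ and a fractional dividend with $0 \le N < D$; the function name `digit_recurrence_divide` and its parameters are hypothetical and chosen only for illustration, not taken from the architectures discussed here.

```python
from fractions import Fraction


def digit_recurrence_divide(n, d, radix=4, digits=8):
    """Paper-and-pencil division using R_{i+1} = r * R_i - q_i * D.

    Assumes 0 <= n < d and a non-redundant digit set {0, ..., radix - 1}.
    Returns the truncated quotient and the final partial remainder.
    """
    remainder = Fraction(n)
    d = Fraction(d)
    if not 0 <= remainder < d:
        raise ValueError("this sketch expects 0 <= n < d")
    quotient = Fraction(0)
    for i in range(1, digits + 1):
        shifted = radix * remainder           # r * R_i
        q = shifted // d                      # largest digit with q * D <= r * R_i
        remainder = shifted - q * d           # R_{i+1}
        quotient += Fraction(q, radix ** i)   # append digit q_i with weight r^-i
    return quotient, remainder
```

After `digits` iterations the invariant $N = Q \cdot D + R \cdot r^{-\text{digits}}$ holds, which can be checked exactly because the sketch uses rational arithmetic.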
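For comparison, a multiplicative scheme can be sketched with the Newton-Raphson reciprocal iteration $x_{j+1} = x_j (2 - D x_j)$, which uses two multiplications and one subtraction per step. The sketch below assumes a divisor in $[1, 2)$ and a fixed starting value of 0.75; both the starting value and the name `newton_raphson_divide` are illustrative assumptions rather than details of the cited designs.

```python
def newton_raphson_divide(n, d, iterations=6):
    """Approximate n / d via the reciprocal iteration x <- x * (2 - d * x).

    Assumes 1 <= d < 2. With x0 = 0.75 the initial error |1 - d * x0|
    is at most 0.5 and is squared in every iteration.
    """
    if not 1.0 <= d < 2.0:
        raise ValueError("this sketch expects a divisor in [1, 2)")
    x = 0.75
    for _ in range(iterations):
        x = x * (2.0 - d * x)   # two multiplications and one subtraction
    return n * x
```

The quadratic convergence is what makes the iteration count grow with the logarithm of the word-length.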

According to IEEE standard 754, the unsigned operands of the division and the square-root have to be in the range $[1, 2)$ and $[0.25, 1)$, respectively. If the inputs do not conform to this standard, they must be normalized and their exponents adjusted. Assuming binary operands of word-length $n$, the division/square-root computation requires $n$ additions/subtractions to obtain the entire quotient.

The computation can be accelerated either by reducing the number of iterations (using higher radices) or by reducing the critical path of an iteration. The SRT and GST algorithms take advantage of a redundant number system and a higher radix $r$. This leads to a critical path of the addition and subtraction that is independent of the word-length, and reduces the number of iterations to $n / \log_2 r$. In [18] a novel shared architecture for division and square-root was presented, which developed a GST square-root architecture without requiring an additional division by the scaling factor after the square-root operation as in [19].

This paper is organized as follows. Section 2 presents the mathematical background for the GST division and GST square-root operations. Section 3 presents the architecture of the GST divider/square-root unit and the estimated power consumption of the SRT and GST square-root architectures. Section 4 presents the simulation results, while Section 5 concludes the paper.

THE GST DIVISION AND SQUARE-ROOT

2.1. Mathematical Background

The quotient digit selection is a two-dimensional function that depends, in every iteration, on the partial remainder $R_i$ and the denominator $D$ for the division, and on the partial remainder $R_i$ and the square-root quotient $Q_{s,i}$ for the square-root computation. By employing the recurrence equations for the division and the square-root, the new remainder $R_{i+1}$ can be computed according to

$$R_{i+1} = r \cdot R_i - q_i \cdot D \tag{5}$$

and

$$R_{i+1} = r \cdot R_i - q_i \cdot (2 \cdot Q_{s,i-1} + r^{-i} \cdot q_i) = r \cdot R_i - q_i \cdot 2\hat{Q}_{s,i}, \tag{6}$$

where $q_i$ is the newly computed square-root quotient digit of iteration $i$ and $Q_{s,i-1}$ is the square-root quotient from the previous iteration. Note that these recurrence equations are also used for the SRT division. The decision criteria can be simplified by restricting $D$ and $2\hat{Q}_{s,i}$ to a certain range $[1, 1+\delta)$, in which the decision function is independent of $D$ or $2\hat{Q}_{s,i}$. This can be performed by a multiplication with a scale factor $k$. Pre-scaling of both operands does not alter the sought quotient, since

$$Q_{\text{division}} = \frac{N}{D} = \frac{N \cdot k}{D \cdot k} = \frac{N'}{D'}, \tag{7}$$

$$Q_{\text{square-root}} = \frac{X}{Q_s} = \frac{X \cdot k}{Q_s \cdot k} = \frac{X'}{Q_s'}. \tag{8}$$

Before the first iteration can be performed, the operands have to be scaled. Additionally, the arithmetic condition, which guarantees that the most significant digit equals zero after the subtraction/addition of the multiple of the denominator, has to be satisfied. Moreover, since the square-root quotient is unknown at the beginning of the computation, it has to be updated and scaled according to the quotient digits $q_i$ and $k$.

Since $Q_s$ changes in every iteration, $kQ_s$ has to be updated according to the quotient digits $q_i$ and $k$. The sought result is obtained after every iteration by

$$Q_{s,i} = Q_{s,i-1} + r^{-i} \cdot q_i. \tag{9}$$

Scaling Eq. (6) results in

$$R_{i+1} = r \cdot R_i - q_i \cdot (2 \cdot Q_{s,i-1} + r^{-i} \cdot q_i) \cdot k = r \cdot R_i - q_i \cdot 2\hat{Q}_{s,i} \cdot k. \tag{10}$$

The goal of this section is to obtain an equation for $\hat{Q}_{s,i+1}$ in terms of $\hat{Q}_{s,i}$. Increasing $i$ by 1 in (10) leads to

$$R_{i+2} = r \cdot R_{i+1} - q_{i+1} \cdot (2 \cdot Q_{s,i} + r^{-(i+1)} \cdot q_{i+1}) \cdot k = r \cdot R_{i+1} - q_{i+1} \cdot 2\hat{Q}_{s,i+1} \cdot k. \tag{11}$$

Recalling that in (6) $2\hat{Q}_{s,i}$ was defined as

$$2\hat{Q}_{s,i} = 2 \cdot Q_{s,i-1} + q_i \cdot r^{-i}, \tag{12}$$

substituting (9) into (11) and comparing Eqs. (10), (11) and (12), the sought equation is obtained as

$$2\hat{Q}_{s,i+1} \cdot k = 2\hat{Q}_{s,i} \cdot k + q_i \cdot r^{-i} \cdot k + q_{i+1} \cdot r^{-(i+1)} \cdot k. \tag{13}$$

The next scaled square-root quotient can now be obtained by just adding the quantities $q_i \cdot r^{-i} \cdot k$ and $q_{i+1} \cdot r^{-(i+1)} \cdot k$ to the previous scaled square-root quotient [18].
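The additive update of the scaled square-root quotient can be checked numerically. The following sketch compares the scaled term $(2 Q_{s,i-1} + q_i r^{-i}) \cdot k$, computed directly from its definition, with the value obtained by repeatedly adding $q_i r^{-i} k$ and $q_{i+1} r^{-(i+1)} k$. The function names, the digit sequence and the scale factor are arbitrary illustrative assumptions, and the indexing assumes digits $q_0, q_1, \ldots$ with $Q_{s,-1} = 0$.

```python
from fractions import Fraction


def direct_scaled_term(digits, i, k, radix=4):
    """(2 * Q_{s,i-1} + q_i * r^-i) * k computed from the definition."""
    q_prev = sum(Fraction(q, radix ** j) for j, q in enumerate(digits[:i]))
    return (2 * q_prev + Fraction(digits[i], radix ** i)) * k


def incremental_scaled_terms(digits, k, radix=4):
    """Same quantity for every i, obtained with additions of scaled digits only."""
    terms = [Fraction(digits[0]) * k]  # i = 0: Q_{s,-1} = 0, so the term is q_0 * k
    for i in range(len(digits) - 1):
        step = (Fraction(digits[i], radix ** i)
                + Fraction(digits[i + 1], radix ** (i + 1))) * k
        terms.append(terms[-1] + step)
    return terms
```

Because only shifted multiples of $k$ are added, no multiplication by the running quotient is needed in the loop, which is the property the derivation relies on.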

2.2. The Arithmetic Condition and the Range of the Radicand

Let us assume the same symmetric redundant digit set $DS(r, \alpha)$ as used for the division [20]. The maximum and the minimum valid remainders are given by

$$R_{\max} = 0.\alpha\alpha\alpha\ldots\alpha, \qquad R_{\min} = -R_{\max}, \tag{14}$$

which can be rewritten as the symmetric bound $|R_i| \le R_{\max}$. By employing the arithmetic condition in the square-root recurrence equation

$$R_{i+1} = r \cdot R_i - q_i \cdot (2 Q_{s,i-1} + q_i \cdot r^{-i}), \tag{17}$$

an upper and a lower bound can be computed between which the square-root quotient has to be located.

From these upper and lower bounds for the square-root quotient, the limitations for the operand $X$ can be computed. By introducing the same rewrite condition as in the division [20], the square-root quotient has to be scaled into the range $1 \le 2Q_s < 1 + \delta$. This corresponds to the same range as for the division. Given the upper and lower bounds for the square-root quotient $Q$, the corresponding bounds for the operand $X$ can be computed according to

$$X_{l,u} = 0.25 \cdot (2Q_{l,u})^2,$$

where $X_l$, $X_u$, $2Q_l$ and $2Q_u$ represent the lower and upper bounds of the operand $X$ and of the doubled square-root quotient, respectively.
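The mapping from bounds on the doubled square-root quotient to bounds on the radicand follows directly from $X = Q^2 = 0.25 \cdot (2Q)^2$. A minimal sketch is given below; the function name `radicand_bounds` is a hypothetical helper introduced for illustration.

```python
from fractions import Fraction


def radicand_bounds(two_q_lower, two_q_upper):
    """Map bounds on 2Q to bounds on X using X = 0.25 * (2Q)^2."""
    quarter = Fraction(1, 4)
    return quarter * two_q_lower ** 2, quarter * two_q_upper ** 2
```

For example, the interval $[1, 2)$ for $2Q$ maps back to the operand range $[0.25, 1)$ stated for the square-root input.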
An even distribution of the scale intervals does not lead to an optimal solution as shown in Table I.
Besides the fact that eight bits of the square-root operand $X$ have to be examined, the shown bounds of $X$ lead to a longer critical path in the update of the square-root quotient, since the word-length of $k$ corresponds to seven bits. To minimize the number of bits to be examined for the prediction of the scale factor $k$ and to limit the scale factor to a multiple of 1/16, the upper and lower bounds of $X$ have to be slightly altered (see Table II). This also leads to a useful symmetry which can be used for simplifying the scale factor selection.

In [21] a minimally redundant radix-4 architecture for a shared division and square-root implementation was introduced (see Fig. 2). Every iteration requires a decision criterion, multiplexers which select a multiple of the denominator $D$ or the square-root quotient $Q_i$, a redundant adder and a part of the on-line converter. As already mentioned, a total of 10 bits of the remainder and the denominator/square-root quotient have to be examined to predict the next quotient digit. Depending on the choice of adder, the redundant remainder has to be converted to a two's complement number by employing a fast seven-bit adder or an on-line converter for four digits. In order to reduce the cost of the decision criteria, this two's complement representation is converted into a sign-magnitude representation, which guarantees a smaller implementation. Thus, the decision criteria only distinguish between the quotient digits $q_i = 0$, $q_i = 1$ and $q_i = 2$. The MSB of the remainder determines whether the multiple of the denominator $q_i \cdot D$ is added to or subtracted from the previous remainder. Nevertheless, the combinational logic implementing the decision criteria is quite costly, even though it only has to differentiate between three quotient digits. Most efficiently, the decision criterion for $q_i = 0$ is implemented using random logic, while the distinction between the remaining quotient digit values is implemented using a Programmable Logic Array (PLA). The addition is performed either by carry-save adders or by a Hybrid-adder. The on-line converter has to operate in two different modes simultaneously. The first one computes the correct square-root quotient $Q_i$, while the second one updates the previous square-root quotient $Q_{i-1}$ by half of the new quotient digit.

The architectures presented in [22] and [23] use a radix-8 and a radix-2 digit set, respectively, and are not compared to the radix-4 implementations due to the different number of iterations. In [17], an architecture is presented which does not require an initial PLA. However, after the initial iteration, the design still uses a PLA to predict the new quotient digit.
3.2. Maximally Redundant Radix-4 SRT

As already mentioned, the digit set $DS(4,3)$ has larger overlapping regions and hence requires fewer bits to be examined to predict the new quotient digit. The drawback of the triple multiple of the denominator can be resolved by employing two stages of CSA adders: the first one adds a multiple from the digit set $q_{i,1} \in \{-2, 0, 2\}$, the second one adds a multiple from the digit set $q_{i,2} \in \{-1, 0, 1\}$. By adding $q_{i,1}$ and $q_{i,2}$, the quotient digit $q_i = q_{i,1} + q_{i,2}$ is obtained, which is performed in DD. The corresponding circuitry for one iteration is shown in Figure 3 and was introduced in [24].

The quotient selection function is divided into two parts, LS and NS; the first one predicts the digit $q_{i,1}$, the latter one $q_{i,2}$. The on-line converter (OC) converts the redundant quotient digit into a binary format. As in the minimally redundant radix-4 architecture, the redundant remainder is converted into a binary representation by a fast binary adder (MXAT) and then converted into a sign-magnitude representation (EOL) to guarantee a smaller implementation of the decision criteria. The on-line converter has to operate in two different modes simultaneously. The first one computes the correct square-root quotient $Q_i$, while the second one updates the previous square-root quotient $Q_{s,i-1}$ by half of the new quotient digit.

3.3. Minimally Redundant Radix-4 GST

In [25] and [20] two minimally redundant radix-4 division architectures have been presented. The architecture has been expanded according to (13).
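The two-stage digit composition can be illustrated by splitting every digit of the maximally redundant set $\{-3, \ldots, 3\}$ into a part from $\{-2, 0, 2\}$ and a part from $\{-1, 0, 1\}$. The table below is only one possible assignment (for instance, 1 could also be formed as $2 - 1$), and the name `split_digit` is a hypothetical helper; the actual selection logic of the cited design is not reproduced here.

```python
# One possible split of each radix-4 digit q into q1 + q2,
# with q1 in {-2, 0, 2} and q2 in {-1, 0, 1} (illustrative choice).
SPLIT_TABLE = {
    -3: (-2, -1),
    -2: (-2, 0),
    -1: (0, -1),
    0: (0, 0),
    1: (0, 1),
    2: (2, 0),
    3: (2, 1),
}


def split_digit(q):
    """Return (q1, q2) such that q1 + q2 == q for q in {-3, ..., 3}."""
    return SPLIT_TABLE[q]
```

With such a split, only the multiples $\pm 2D$ and $\pm D$ are needed, so the triple multiple $3D$ never has to be generated explicitly.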
The first remainder $R_0$ can be obtained by summing the three partial sums with a row of Carry-Save adders and an additional binary tree adder. The final scaled numerator can be converted into a redundant representation without any hardware costs. Alternatively, $T_1$ and $T_2$, which have a smaller critical path than $T_3$, can be added by a Hybrid-adder, resulting in $S_{red}$. Hence, either $T_1$ or $T_2$ has to be converted into a redundant representation. This is done without any hardware costs. $T_3$ is added to $S_{red}$ by another Hybrid-adder.

In the first iteration, the scaled denominator $k \cdot D$ and the scaled square-root operand $k \cdot X$ are not yet available. However, in the case of division, the first quotient digit is restricted to either $q_0 = 1$ or $q_0 = 2$. This is caused by the restricted range of the operands. Furthermore, $q_0 = 1$ covers the quotient range $[1/2, 5/3)$. Thus, $q_0 = 2$ is only selected if the numerator is close to 2 and the denominator is close to 1. Hence, by checking the second to fourth most significant bits of the numerator and the denominator, the most significant quotient digit can be predicted. To obtain $q_0 = 2$, the numerator has to satisfy $N \ge 1.875$, which corresponds to $n_1 = n_2 = n_3 = 1$, and the denominator has to satisfy $D < 1.125$, which corresponds to $d_1 = d_2 = d_3 = 0$.

The output of the first decision criterion selects the multiple of the two partial sums $D_{scaled,carry}$ and $D_{scaled,sum}$ that is subtracted from the first remainder $R_0$. The required modified Hybrid-adder has a critical path of 2.5 full-adder delays and does not lengthen the overall critical path of an iteration. In the case of a square-root operation, the most significant quotient digit is always one, while the following quotient digit is either $q_1 = 1$ or $q_1 = 2$. This covers the entire range from 0.5 to 1.

To overcome the additional iteration caused by the leading $q_0 = 1$ and the need to compute 13 fractional digits to meet the precision requirements, the first subtraction can be performed by replacing $h$, which is zero, by the scale factor $k$. The second quotient digit can be obtained by applying random logic to the non-scaled radicand. The combination $q_0.q_1 = 1.1$ covers the range $[0.584, 1)$. Hence, similar to the division, it is sufficient to examine three bits ($x_1 x_2 x_3$) of the unscaled input operand to predict the correct digit for $q_1$.
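The bit test for the first division quotient digit can be sketched as follows. The helper assumes operands in $[1, 2)$ written as $1.b_1 b_2 b_3\ldots$ and simply implements the stated condition ($q_0 = 2$ only when the three examined numerator bits are all one and the three examined denominator bits are all zero); the function names are hypothetical.

```python
def leading_fraction_bits(value, count=3):
    """Return the first `count` fraction bits of a value in [1, 2)."""
    if not 1.0 <= value < 2.0:
        raise ValueError("this sketch expects an operand in [1, 2)")
    scaled = int((value - 1.0) * (1 << count))  # truncate to `count` bits
    return tuple((scaled >> (count - 1 - j)) & 1 for j in range(count))


def first_quotient_digit(numerator, denominator):
    """Predict q0: 2 if n1 = n2 = n3 = 1 and d1 = d2 = d3 = 0, else 1."""
    n_bits = leading_fraction_bits(numerator)
    d_bits = leading_fraction_bits(denominator)
    return 2 if all(n_bits) and not any(d_bits) else 1
```

The check `all(n_bits)` is equivalent to $N \ge 1.875$ and `not any(d_bits)` to $D < 1.125$, so no comparison against the scaled operands is required.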
In every iteration, a decision has to be made, a selection between division and square-root has to be performed ($kD$ or $kQ_i$), and a multiple of $kY$ has to be selected (using a multiplexer structure) and subtracted from the previous remainder (Hybrid-adder) (see Fig. 4). RW/DC represents the rewrite and decision criteria of the most significant two digits. The most significant two digits have to be rewritten for the following cases: $2\bar{2} \rightarrow 12$, $1\bar{2} \rightarrow 02$, $\bar{1}2 \rightarrow 0\bar{2}$, $\bar{2}2 \rightarrow \bar{1}\bar{2}$. This step ensures the convergence of the algorithm [20]. The residual bounds can also be found in [20] and are easily applied to the square-root architecture. Rewriting the most significant two digits, which is performed by random logic, results in a quotient digit which corresponds exactly to the most significant digit. The three select signals neg, val and zero select the correct multiple of the denominator or square-root quotient according to the following criteria. If $R_0 \le 0$ (the bit with the weight $-2$ in the coding of the minimally redundant digit set is 1), neg is set to high. The control signal val indicates which value the new quotient digit has (either 1 or 2). This can be achieved by simply XORing the two bits of weight $+1$ of the most significant digit. The control signal zero is set to high if the three bits of the most significant digit are all zero or all one. The control signal div distinguishes between the division and the square-root operation.

The square-root iteration also includes a four-bit adder that adds twice the value of the corresponding multiple of the scale factor $k$ to the previous scaled square-root quotient ($2 q_i \cdot r^{-i} \cdot k$). In the case of a negative multiple, the addition leads to a wrong result, since the most significant bits of the negative multiple of $k$ are ones. Normally, this calls for a full-length addition, which increases the critical path tremendously. Nevertheless, this bottleneck can be solved by modifying the Hybrid-adder in such a way that not one but two hybrid-additions are performed. The first Hybrid-addition subtracts the possibly wrong scaled square-root quotient, while the second addition corrects the result by adding ones up to the bit position $2i$, where $i$ corresponds to the current iteration. If there are negative quotient digits, entirely different words with leading ones have to be added. This bottleneck can be eliminated by realizing that the addition of all those correcting terms can be simplified by using an on-the-fly converter which uses the scale factor $k$ and the quotient digit $q_i$ as its inputs.

The signal divneg selects the correct value for $h$ when a square-root operation is performed and a negative quotient digit has been predicted. The result of the updated square-root quotient and the correcting term are pipelined for the next iteration. To perform the update of the term $q_{i+1}^2 \cdot r^{-(i+1)} \cdot k$, a simple multiplexer structure can be chosen which selects between $0$, $k$ and $4k$. This term is always positive due to $q_{i+1}^2$. The most significant bit of this term has a weight $2^{-1}$ smaller than the correcting term $h$ and can be added in parallel to $h$. Figure 5 indicates the scheme of the update of $h$. Depending on a positive or negative quotient digit, either a word with all zeros or all ones is added to the previous value of $h$. However, the ones are only placed up to the bit position $2i$, where $i$ corresponds to the iteration number.

In [13] a very high radix square-root architecture which utilizes prescaling and rounding is presented. The shown architecture indicates that two multiplications per iteration have to be performed. These multiplications are in the critical path and increase the iteration delay.
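The generation of the select signals can be sketched for a three-bit digit coding with one bit of weight $-2$ and two bits of weight $+1$. This coding is an assumption inferred from the description above (the weight of the XORed bits is taken to be $+1$), and the function name `decode_digit` and its argument names are hypothetical.

```python
def decode_digit(b_neg2, b_pos1_a, b_pos1_b):
    """Decode a minimally redundant radix-4 digit and its select signals.

    Assumed coding: value = -2 * b_neg2 + b_pos1_a + b_pos1_b.
    Returns (value, neg, val, zero).
    """
    value = -2 * b_neg2 + b_pos1_a + b_pos1_b
    neg = bool(b_neg2)                       # weight -2 bit set: value <= 0
    val = b_pos1_a ^ b_pos1_b                # 1: magnitude 1, 0: magnitude 2 (if not zero)
    zero = b_neg2 == b_pos1_a == b_pos1_b    # 000 or 111 both encode the value 0
    return value, neg, val, zero
```

Under this assumed coding, the patterns 000 and 111 both represent zero, which is why the zero signal checks for all-equal bits instead of all-zero bits.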
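The multiplexer that supplies the squared-digit term can be sketched as a lookup on the digit magnitude. The sketch assumes the minimally redundant digit set $\{-2, \ldots, 2\}$, so that $q_{i+1}^2 \in \{0, 1, 4\}$; the name `squared_digit_multiple` is hypothetical, and the weighting by $r^{-(i+1)}$ is left out because it corresponds to a fixed shift.

```python
from fractions import Fraction


def squared_digit_multiple(q_next, k):
    """Select q_next^2 * k from the candidates 0, k and 4k (no multiplier needed)."""
    candidates = {0: 0 * k, 1: k, 2: 4 * k}
    return candidates[abs(q_next)]
```

Since the result depends only on the magnitude of the digit, the selected term is never negative, in line with the remark on $q_{i+1}^2$ above.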

SIMULATION RESULTS
The algorithm has been implemented using 32-bit operands, 24 bits for the mantissa and 8 bits for the exponent, in a 0.5 µm CMOS technology. The building blocks are designed using minimal transistor widths of $W_n = 3\lambda$. However, $W_p = 6\lambda$ is chosen to guarantee equal slew rates. Only drivers make use of larger transistor widths. HSPICE simulations have shown a 35% smaller latency of the GST ($t_d = 4\,\text{ns} \cdot 13$) compared to the maximally redundant SRT implementation ($t_d = 6.2\,\text{ns} \cdot 13$), and a 44% smaller latency compared to the minimally redundant SRT implementation ($t_d = 7.1\,\text{ns} \cdot 13$) [26, 18]. Other power consumption studies have been published in [27,28]; however, they are limited to dividers.

In Tables III, IV and V, the results of the power estimation are shown for a frequency of 100 MHz and $V_{dd} = 3.3\,\text{V}$. All elements are simulated separately. However, the load capacitance of the next stage and the wiring capacitance between the two blocks are considered in the power simulations.

Those results are also shown in Tables VI and VII. Many improvements to the divider implementations have been suggested in [30]. Since these improvements are applicable to both SRT and GST dividers in an identical way, they do not change the overall ratio between the GST and SRT behavior in terms of speed and power.
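The quoted latency reductions can be reproduced from the reported delays. The sketch below assumes that each figure is a per-iteration delay multiplied by 13 iterations; this reading of the reported values, as well as the function name `latency_reduction_percent`, is an assumption made for illustration.

```python
def latency_reduction_percent(t_ref_ns, t_new_ns, iterations=13):
    """Relative latency reduction of the new design versus the reference, in percent."""
    ref_latency = t_ref_ns * iterations
    new_latency = t_new_ns * iterations
    return 100.0 * (ref_latency - new_latency) / ref_latency


# GST (4 ns) versus maximally redundant SRT (6.2 ns) and minimally redundant SRT (7.1 ns)
gst_vs_srt_max = latency_reduction_percent(6.2, 4.0)
gst_vs_srt_min = latency_reduction_percent(7.1, 4.0)
```

Rounded to whole percent, the two values agree with the 35% and 44% figures given in the text.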

CONCLUSION
The GST division algorithm has been successfully applied to the square-root operation in a hardware-efficient manner. The additional hardware increases the critical path only slightly, so that the speed benefits of the GST algorithm over the SRT algorithm are maintained. The power consumption increases by a significant amount (plus 36%); however, when the architectures are operated at the same speed, the supply voltage of the GST architecture can be reduced so that its critical path matches the clock period. Simulations have shown that the overall latency of the GST is 35% smaller compared to the fastest implementation of the SRT. Alternatively, for a fixed latency, the GST division/square-root implementation requires 29% less power compared to the SRT using a maximally redundant radix-4 digit set. To conclude, the GST approach leads to a superior design for division, square-root and shared division/square-root architectures in latency-critical and power-critical applications.

FIGURE 2 Block diagram of the SRT minimally redundant radix-4.

FIGURE 3 Block diagram of the SRT maximally redundant radix-4.

FIGURE 4 The architecture of the division/square-root algorithm based on the GST.

FIGURE 5 The selection of the correcting term h required for the square-root computation.

TABLE
Square-root K-selection table for the digit set DS(4,2)

TABLE III
Power estimation of the minimally redundant radix-4 GST architecture

TABLE IV
Power estimation of the minimally redundant radix-4 SRT

TABLE V
Power estimation of the maximally redundant radix-4 SRT


TABLE VII
Comparison between GST and SRTmr Algorithm