Low-power Application-speciﬁc Parallel Array Multiplier Design for DSP Applications

Digital Signal Processing (DSP) often involves multiplications with a ﬁxed set of coefﬁcients. This paper presents a novel multiplier design methodology for performing these coefﬁcient multiplications with very low power dissipation. Given bounds on the throughput and the quantization error of the computation, our approach scales the original coefﬁcients to enable the partitioning of each multiplication into a collection of smaller multiplications with shorter critical paths. Signiﬁcant energy savings are achieved by performing these multiplications in parallel with a scaled supply voltage. Dissipation is further reduced when conventional array multiplier is modiﬁed disabling the multiplier rows that do not affect the multiplication’s outcome. We have used our methodology to design low-power parallel array multipliers for the Fast Fourier Transform (FFT). Simulation results show that our approach can result in signiﬁcant up to 76% power savings over conventional array multipliers on 64-coefﬁcient FFT computation.


INTRODUCTION
Many common Digital Signal Processing (DSP) functional units such as Finite Impulse Response (FIR) filters and Fast Fourier Transform (FFT) modules perform extensive sequences of multiply-and-accumulate computations. Multipliers tend to be the most dissipative elements in these computations, and power-efficient multipliers are therefore essential for the design of lowpower DSP hardware.
In this paper we present a low power design methodology that manipulates the bit-patterns of a DSP computation's coefficients to increase the energy efficiency of the multiplications performed. Specifically, our approach scales the values of the given coefficients to derive a representation that enables the partitioning of the original multiplication into several, smaller multiplications that can be performed in parallel. The critical path of each small multiplication is shorter than the original one. The coefficient values are chosen so that a quantization error bound for the overall computation is satisfied. Therefore, the supply voltage can be reduced to decrease power dissipation while maintaining any given throughput and quantization error constraints. Additional reductions in power dissipation are achieved by disabling the portion of multiplier that do not affect the final result of the multiplication. For high-speed operation, the multiplier can be pipelined to achieve a low worst-case critical delay. Our design methodology focuses on multiplications where the number of coefficients is larger than the number of multipliers available in which case previously proposed techniques [1 -4], that hardwire the coefficient bits to reduce circuit complexity and power are not applicable. Throughout the paper, we assume that it is possible to select arbitrary voltage in the multiplier design even though it may incur a problem of voltage compatibility in some implementation.
The remainder of this paper has six sections. In the second section we review previous research on power reduction approaches applied to multiplier design. The third section describes our methodology for designing low-power multipliers. The fourth section discusses implementation details, including the selective disabling of rows, and an analysis of the worst-case delay and power-dissipation of the multiplier. In the fifth section, we present results from the application of our methodology to the design of a multiplier for the 64-coefficient FFT. Numerical results show that the low-power multipliers proposed in this paper can be used to cut the power requirements of this computation in half without degrading its throughput or quantization error. Our contributions are summarized in the fourth section.

BACKGROUND ON LOW-POWER MULTIPLIERS
In this section, we describe previous research on lowpower design for DSP multiplications. Specifically, we describe architectural-based approaches separately from coefficient-based ones, even though the distinction is not clear.

Architectural Approaches
To first order, the average power required to perform a DSP operation of type i is given by the expression where C eff is the average effective capacitance switched per operation of type i (corresponding to adder, multiplexor, etc.), V supply is a single operating supply voltage, and f clock is the clock frequency which is fixed in many DSP and dedicated applications. Thus, given the fixed data sampling rate f clock , an important design goal is to minimize the product term C eff £ V 2 supply : The average effective switching capacitance C eff is the product of load capacitance and its switching activity ratio [5]. Most of power reduction techniques attempt to reduce the power dissipation of multipliers by changing the power-affecting parameters of Eq. (1) in circuit, logic and architecture level [6 -11].
The power dissipation of a multiplication operation is directly proportional to the switching activity of all internal node capacitances of the multiplier. The internal node capacitance switching activity depends on the multiplier input values (data samples and coefficients), thus the multiplier may be designed to eliminate any unnecessary switching activity by deactivating any circuitry that is not needed for the evaluation [12]. This approach may increase the physical capacitance due to the introduction of additional circuitry for deactivation control, but the effective switching capacitance C eff may be reduced.
Power dissipation can also be reduced by reducing the supply voltage V supply which influences the power dissipation quadratically. However, to maintain a given throughput, the delay penalty must be compensated since the multiplier delay T d depends on the supply voltage as: where I is the supply current, V th is the threshold voltage of the transistors, and C load is the load capacitance. In recent short channel device, exponent of ðV supply 2 V th Þ in the above equation is about 1.3. The proportionality factor comes from the given processing technology and is assumed to be fixed. In order to compensate for the delay penalty at the architectural level, a simple solution is to parallelize and/or pipeline the multiplier [12 -14]. A conventional M £ N multiplier can be duplicated with additional control circuitry to perform the multiplications at a half of the execution frequency. In this structure, the combined effect on the power equation is identical but the voltage scaling technique can be easily employed resulting in the reduction of the power dissipation. For example, let C before , V before , and f before be the effective switching capacitance, supply voltage, and execution frequency of the original multiplier and let C after , V after , and f after be the corresponding parameters for the parallel multiplier where, C after ¼ 2C before þ C overhead and f after ¼ f before =2: Assuming C overhead is negligible if the supply voltages V before ¼ V after ; the no power reduction has been achieved. Since the voltage scaling reduces supply voltage of the parallel multiplier due to reduced throughput requirement so that V after , V before ; the power reduction achieved by the parallelizing the multiplier is a factor of V 2 after =V 2 before : Thus, this architectural change primarily reduces the operating frequency f clock which results in the reduction of the operating supply voltage V supply at the cost of increase in the switching capacitance C load . One of the drawbacks of a large degree of parallelism is that the overhead complexity in terms of the overhead switching capacitance C overhead may become larger than the power reduction gained by reducing the f clock and V 2 supply [15]. In the case where the scaled supply voltage reaches the minimum allowable voltage level V min set by the processing technology, any further parallelization and/or pipelining do not improve power efficiency.

Coefficient Optimization
Even though substantial research has been devoted to lowpower multiplier design at the circuit and logic level alone, power reduction can be enhanced by properly designing the coefficient bit patterns. The coefficient multiplication is very common in DSP where the multiplication requires random data samples and known set of coefficients. The class of the multiplication can be broadly categorized as fixed-and variable-coefficient multiplications. In the fixed-coefficient multiplication, the number of coefficients is equal to the number of multipliers available. Thus, the coefficients can be effectively hardwired into each multiplier. This type of multiplications is often encountered in DSP. One main drawback is that the complexity of the overall architecture may increase significantly when the number of coefficients is very large. In the variablecoefficient multiplication, the number of coefficients is larger than the number of multiplier available. Each multiplier uses different coefficient in every multiplication. The variable-coefficient multiplication is often incorporated in the time-multiplexed multiplication such as in large transform size FFT or FIR filters and its coefficients are usually stored in a non-volatile memory.
Power reduction strategies for variable-coefficient multiplication differ from those for fixed-coefficient multiplication. In the fixed-coefficient multiplications, previous known approaches include minimization of nonzero coefficient bits by using either signed-magnitude or canonic signed digit number representations [5,12]. The minimization of the number of nonzero bits is very effective in reducing power, since the number of nonzero bits in a coefficient corresponds to the number of rows of adders when array multiplier is adopted. In addition, further minimization of nonzero bits is achieved by perturbing the coefficients by multiplying a scale factor [1 -4,16]. So when the fixed coefficients are hardwired into the circuits, any unnecessary circuitry corresponding to zero coefficient bits can be removed. Moreover, the architecture-driven voltage scaling technique can be easily employed since the critical path influenced by the depth of the multiplier is reduced. However, in the variablecoefficient multiplications, a simple minimization of nonzero coefficient bits cannot be directly incorporated since the coefficient values cannot be hardwired.
Due to the linearity possessed by both FIR or FFT, scaling the output preserves the filter characteristics in terms of filtering performance, but the results in an overall magnitude gain equal to the scale factor. For scale factor a, the equations for FIR or FFT, respectively, can be written as Thus the coefficients of the scaled FIR or FFT are given by ah[k ], aW kp P ; respectively. For the purpose of discussion, we will denote the coefficients as h[k ] to represent both FIR and FFT coefficients. This coefficient perturbation technique with the scaling factor a is widely used in digital filter designs [3,4]. Given an allowable range of scaling, an optimum scaling factor a can be found such that the total number of ones in the binary representations of ah[k ] coefficients is less than the total number of ones in the binary representation of the h[k ] coefficients. This can save power in hardwired coefficient multiplication with an array multiplier.
Differential coefficient method [16] is another approach to reducing computational complexity and hence the power dissipation of DSP applications. The main goal is to reduce the Hamming distance between the consecutive coefficients so that the internal node switching is minimized. This is achieved by using various orders of differences between the coefficients in conjunction with stored pre-computed results rather than using the coefficients directly. The results in [16] indicate that for certain type of DSP applications this technique enables significant power reduction.
Another approach is to code the data samples and the coefficient differentially simultaneously so that the overall switching is low. Significant power saving can be achieved with this approach in cases where the Hamming distance between the consecutive data samples or the coefficient is very small. Thus, this method is most beneficial for data samples with high temporal correlation.

POWER REDUCTION METHODOLOGY
In this section, we highlight our power reduction methodology for multiplication using a fixed set of coefficients. We first describe how to reduce power dissipation by coefficient clustering. We then describe coefficient partitioning which can further reduce power dissipation by partitioning the original multiplication into several smaller ones.

Clustering
The cluster width C of an N-bit coefficient Y is defined as the distance between the first and the last nonzero bits in Y. Clustering confines the maximum among the cluster widths of the multiplication coefficients to a value C max smaller than N. Consequently, the effective bit-width of the coefficient is reduced from N to C max . From a power dissipation standpoint, clustering results in two main benefits. First, the multiplier can be designed to reduce switching activity by ignoring the bit positions outside the cluster. Second, the supply voltage required to meet a given throughput requirement can be reduced, since the worst-case critical path among all multiplications is decreased. These benefits are applicable to most multiplier architectures but especially to array multipliers.
The effect of reducing C on the worst-case delay of the multiplier is illustrated in Fig. 1. As C decreases, the depth of the multiplier also decreases. The output of the M £ C multiplier is shifted by an amount S equal to the width of zeros following the cluster (S controls the shifter). Since power dissipation reduces quadratically with the supply voltage, the reduction of C is intuitively expected to result in decreased dissipation primarily due to voltage supply scaling rather than due to the reduction in switching activity.

Partitioning
A particularly effective approach to lowering energy consumption is to operate at the lowest possible power supply voltage. At lower supply voltages, however, the individual circuit elements run slower, and circuit performance degrades. One way to maintain throughput at reduced voltages is to reduce the input-to-output delay by parallelization. In this subsection, we describe coefficient partitioning, a technique that can be used to parallelize coefficient multiplications.
For the partitioned coefficient, we have where C a is the cluster width of Y a , C b is the cluster width of Y b , C is the cluster width of Y, and K is the partition point. Inequalities (5) and (6) imply that coefficient partitioning can decrease coefficient cluster widths.
After coefficient partitioning, the original multipli- Thus, the original M £ N multiplication is turned into an M £ N a multiplication and an M £ N b multiplication which can be performed in parallel. These two parallelized multiplications can employ the coefficient clustering discussed previously to further reduce complexity as shown in Fig. 2. The depth of each resulting multiplication X £ Y a and X £ Y b ; which is C a and C b , respectively, is much less than that of the multiplication X £ Y. Two shifters and an additional adder are required to generate the final product. The output of the M £ C a multiplier is shifted by a number  of bits equal to the sum of N b and the number S a of zero bits between the LSB of Y a and the LSB of the cluster. Similarly, the output of the M £ C b multiplication is shifted by the number S b of zero bits preceding the cluster of Y b . The outputs of the two partitioned multipliers must be shifted properly and added to correctly represent the final product. The power dissipation of the two shifters and the additional adder is offset by the savings that can be achieved due to the strictly smaller cluster widths C a and C b of the multiplications X £ Y a and X £ Y b : These multiplications can also be pipelined to achieve the highspeed operation with low-power consumption.
In general, the delay of a multiplier increases with the maximum cluster width of its coefficients. Thus, the optimal cluster partitioning problem is to select the partition value K such that the maximum critical delay of two clusters is minimized. The computation of an optimum partition can thus be a particularly challenging problem from a computational standpoint, since the solution space is not convex.
Coefficient partitioning can be applied to generate more than two smaller multipliers. In this case, however, the dissipation overhead of the additional shifters and adders may reduce the gains of parallelization. Even though multi-way partitioning could still be beneficial for larger coefficient bit-widths, we only investigate 2-way partitioning in this paper. A heuristic algorithm for coefficient partitioning for example of FFT will be explained in the "Case study: FFT multiplier design" section. This is an reasonable class of partitions in practice, since the number of bits used is in the range 12 -24 in most DSP applications.

LOW-POWER MULTIPLIER DESIGN
The degree of spatial and temporal correlation between the architectural attributes and the coefficient bit-patterns can significantly affect the power dissipation of the multiplier. In this section, we describe a multiplier design whose worst-case delay and switching activity are highly correlated with the cluster widths and the number of ones in the coefficients.

Conventional Array Multiplier
A generic M £ N array multiplier operates by computing partial products in parallel and by shifting and accumulating the partial products. The multiplier width corresponds to the word length M of the data samples. The multiplier depth corresponds to the word length N of the coefficients. The switching activity of this multiplier is poorly correlated with the coefficient. In particular, if the k-th bit of a coefficient is 0, the k-th row of adders does not need to be activated. The partial product of the previous adder rows can simply be shifted and bypassed to the next row of adders. In this case, the function of the k-th row of adders is simply a one-bit shift of the partial products. The adders of a conventional array multiplier corresponding to zero coefficients are still switching, however, even though the addition is not required. The increased switching activity of the internal node capacitance results in unnecessary power dissipation.

Array Multiplier With Reduced Switching Activity and Critical Path
In order to reduce the power dissipated by the multiplier, the conventional array multiplier structure is modified by incorporating deactivation circuitry and bypass logic so that one-bit shifts can be performed without unnecessarily switching adder cells. Figure 3 shows a 4 £ 4 parallel multiplier with reduced switching activity (RSA) that uses the adder cells shown in Fig. 4. Two multiplexors added to the original adder cells route partial product terms to bypass the adders when the corresponding bits are zero. Three-state buffers are incorporated at the inputs of the adder cells to avoid any switching activity in the adder cells which are bypassed.
An L-bit adder, called a vector-merge adder, is used to generate the final result of the multiplication. The actual word-size of L may be less than the actual word-size of the product of the multiplication, which is determined by the applications depending on accuracy requirement (i.e. L MSBs are generated). For a given coefficient Y, the critical path delay t mult (Y ) of the RSA multiplier is given by the expression where t carry,o and t carry,m are the propagation delays between the input and output carry of the conventional and the RSA adder cells, respectively, t mux is the delay of a multiplexor, t and is the delay of an AND gate, and C nz is the number of nonzero bits in the cluster of Y. The worstcase critical delay is defined, when C nz is equal to C. The critical delay and the power dissipation of our RSA multiplier are decreased by selecting coefficient representations with small cluster widths (i.e. small effective multiplier size) and a small number of nonzero bits in the cluster (i.e. fewer switching adders). Another approach for reducing the propagation delay of our RSA array structure is to reduce Mt carry,o by using a fast adder implementation for the merge adder such as a carry-select or a carrylookahead structure.
The RSA multiplier architecture presented in the above reduces the power dissipation over the conventional array multiplier by reducing the C eff and the V supply . However, the worst-case delay is still limited by the cluster width. For a high speed operation, the multiplier will not allow effective voltage scaling.
In order to reduce the power dissipated by the RSA multiplier by allowing the voltage scaling at the high operating speed requirement, the RSA multiplier structure is modified again by pipelining. Additional circuitry latches, are incorporated into the adder cell at the output. Figure 5 shows a 4 £ 4 RSA pipeline multiplier that uses the RSA adder cells shown in Fig. 6. Three one-bit registers are added to the RSA adder cells with multiplexers.
For a given coefficient Y, the worst-case critical path delay t mult (Y ) of the pipelined RSA multiplier is given by the expression where t carry,m are the propagation delays between the input and output carry of the RSA adder cells, t mux is the delay of a multiplexor, and t latch is the setup-time and clock-tooutput of a latch. The number of stage is equal to the cluster width C. The critical delay of the pipelined RSA multiplier is decreased by reducing the number of stages and consequently the power dissipation can be achieved through scaling the voltage at the high operating speed.

Throughput and Power-dissipation
The RSA multiplier in this paper was evaluated and characterized in terms of its throughput and dissipation using HSPICE simulation with RC-parameters extracted from standard cell layout using a 0.35-mm CMOS standard process technology. Figure 7(a)shows worstcase delay of the RSA multiplier when the supply voltage was varied from 1.2 to 3.3 V. The curves in the graph correspond to the two-way partitioned RSA multiplier with cluster widths C ¼ 3; 4; 5; 6: Critical delay and average switching power dissipation of an individual RSA  adder have been increased approximately by 15 and 60%, respectively, due to additional circuitry. The multiplier delay of the RSA multipliers are compared with 12 £ 12 conventional array multiplier. It is evident in the figure that the critical delay of the RSA multiplier with cluster width of six is comparable to the delay of the conventional array multiplier due to the additional circuitry. Also shown in the figure are pipelined version of the RSA multiplier. Figure 7(b) shows the normalized average power-delay product of the RSA multiplier as well as the pipelined version of the RSA multipliers. As shown in these figures, the RSA multipliers generally have much lower powerdelay product than that of the conventional array multiplier due to its lower critical delay and power dissipation. Even though the additional circuitry is incorporated into the pipelined RSA multiplier, its power-delay product is the lowest among the multipliers considered mainly because of its small delay. When the coefficients are optimized by partitioning, our RSA multipliers can result in significant power savings.

CASE STUDY: FFT MULTIPLIER DESIGN
In this section, we describe our low-power design methodology through the design of multipliers for the 64-coefficient FFT. First, we present our coefficient optimization algorithm. We subsequently give HSPICE simulation results from multiplier designs that were obtained using our algorithm.

Coefficient Optimization
The P-point FFT, including quantization noise and some scaling factor a, is given by the equation aðX½p þ e d ½pÞðW kp P þ e w ½pÞ; ð9Þ where W kp P is the pth coefficient, X[ p ] is the input data sample, e w [ p ] is the error due to twiddle-factor coefficient approximation, and e d [ p ] is the input sample quantization error.
Given a set W of infinite-precision twiddle-factor coefficients W kp P ; p ¼ 0; 1; . . .; P 2 1; an error-ratio bound d, a delay bound T max , and a cluster width bound C target , our optimization process returns an encoded set of P twiddle-factor coefficients Y p ¼ aW kp P ; p ¼ 0; 1; . . .; P 2 1 such that max p {C a;p } # C target ; max p {C b;p } # C target ; the multiplier critical delay is less than T max , the coefficient quantization error ratio is less than d, and the multiplier dissipation is minimal.
The pseudo-code of our heuristic coefficient optimization algorithm is given in Fig. 8. This algorithm comprises three nested loops. The outer loop steps through the possible scaling factors a. The middle loop steps through the possible partition points K. The inner loop steps through the possible number representations f (two's complement or canonic signed digit representations). For each a, K, and f, the algorithm partitions the encoded coefficients with respect to the current partition point. It subsequently compares its power dissipation, delay, and error ratio with the best previous encoding of a scaled coefficients set. The minimum supply voltage for which the delay bound T max is not exceeded is determined by Eqs. (8) and (9) in its main body. For the minimum supply voltage, the power constraint is checked by simulation or using the expression where C p,nz is the number of nonzero bits in the coefficient Y p and N p is the coefficient width and C load is the output load capacitance. The error ratio constraint d is checked using the expression where Ỹ p is the unquantized coefficient value. The best  solution encountered until any given iteration is stored in the variable temp which can be initialized to any possible encoding of the coefficients.

Numerical Results
We have applied our algorithm to the design of a lowpower multiplier for the 64-coefficient FFT. One multiplier is used to compute FFT which requires 64 coefficients. Both data and coefficient are 12-bit. Two number representations were considered: Two's complement and canonic signed digit. Table I gives the minimum error ratio d for various maximum cluster width pairs C a,max , C b,max . For reference, d for 12-bit quantization is about 2 £ 10 27 : Depending on the error ratio specification, the appropriate multiplier with small cluster widths is incorporated for low power applications.
Two separate comparison are performed. First of all, the effectiveness of our coefficient optimization methodology in array multipliers with and without the switching activity reduction feature is compared with a conventional array multiplier under identical quantization error bound d. Second, our design methodology is applied to the other well-known multipliers, independent of the multiplier architecture for reducing the power dissipation.
In the simulation, 2-way partition with a cluster width (effective coefficient width) of 4 is used for the multipliers with the coefficient optimization and the coefficient width of 12 is used for the multipliers without the coefficient optimization. These two provide identical quantization error ratio bound d , 2 £ 10 210 : For both coefficientoptimized and unoptimized multipliers, the data width is fixed at 12. The entire FFT was performed assuming a random input data sample. These multipliers are designed and synthesized with 0.35 mm CMOS. However, no lowlevel circuit optimization is considered.
For all three multipliers considered in the first comparison, the worst-case critical delay of the multipliers and the normalized average power-delay product of the multipliers during the 64-coefficient FFT operation are shown in Fig. 9 as a function of the operating supply voltage ranges from 1.2 to 3.3 V. The maximum computation speed of the FFT operation is limited by the worst-case critical delay of the multipliers. In Fig. 9(a), the partitioned multipliers with the coefficient optimization are generally faster than that the conventional array multiplier since the critical delay paths of the optimized multipliers is shorter for a given operating supply voltage.
However, between the coefficient optimized multipliers, the delay of the RSA multiplier is longer than that of the conventional array multiplier. This is expected since additional circuitry introduced in the RSA multiplier for row disabling contributed to the increase in the worst-case critical delay. In Fig. 9(b), the conventional array multiplier has the largest average power-delay product for a given operating supply voltage. The switching capacitance and activity make the difference in power dissipation among coefficient optimized multipliers. Even though, the optimized conventional array multiplier has smaller delay, the optimized RSA multiplier has lower average power-delay product than the optimized conventional array multipliers mainly due to the less effective switching capacitance.  Similar analysis is performed in the second comparison. All multipliers use optimized coefficients except the baseline conventional array multiplier. In Fig. 10(a), the Booth encoding multiplier with the coefficient optimization is generally faster than the RSA multiplier. This is expected since the Booth encoding multiplier has only half the number of partial products to sum. Moreover, the RSA multiplier has many additional circuitry that contributes to the increase in the delay. The pipelined RSA multiplier is the fastest among the multipliers considered. In Fig. 10(b), the conventional multiplier has the largest average powerdelay product among the multipliers and the RSA multiplier has the smallest. The pipelined RSA multiplier also consumes considerable amount of power compare to the RSA multiplier but its power-delay product is much smaller due to the speed. Figure 11(a) shows the power dissipation savings achieved with our multiplier design methodology (for C a,max and C b,max set to 4) over the use of conventional array multipliers. The curves in this graph give power reduction as a function of operating clock frequency. The supply voltage of the multipliers has been scaled appropriately. In general, relative savings increase as operating clock frequency increases. Because voltage scaling dominates at high operating clock frequency resulting in a significant power reduction. As the required clock frequency decreases, most of the power savings come from the reduced switching activity of the multiplier. For example, 76 and 63% power reductions by the optimized RSA multiplier were achieved at 100 and 40 MHz, respectively. The graph also shows that the coefficient optimization alone saves a significant amount FIGURE 10 Four multipliers are considered: conventional 12 £ 12 array multiplier, 2-way partitioned RSA multiplier with cluster width 4, 2-way partitioned pipelined RSA multiplier with cluster width 4, and 2way partitioned Booth encoding multiplier with cluster width 4. Part (a) gives the worst-case delay of the multipliers as a function of the supply voltage. Part (b) gives their normalized average power-delay product for 64-coefficient FFT operation as a function of the supply voltage. FIGURE 11 (a) Relative power savings over 12-bit conventional array multiplier as a function of throughput when using our multiplier design methodology to implement a 64-coefficient FFT computation. (b) Relative power savings over 12-bit conventional array multiplier as a function of throughput when using our multiplier design methodology on Booth encoding multiplier and pipelined RSA multiplier to implement a 64-coefficient FFT computation. of power. For example, 69 and 50% power reductions by the optimized conventional array multiplier were achieved at 100 and 40 MHz, respectively. The power reduction gap between the two curves are narrower at the high operating speed than low operating speed. Since the worst-case critical delay is smaller for the multiplier without rowdisabling, the benefit of low switching is compensated by the amount of supply voltage scaled. However, as the required operating speed becomes slower, all multipliers operates at the lowest supply voltage which results in no supply voltage scaling and the power reduction due to the reduced switching activity dominates.
In Fig. 11(b), voltage scaling dominates at high operating frequency resulting in a significant power reduction. This is evident for the pipelined RSA multiplier since it has the lowest worst-case delay hence a significant reduction can be achieved by the voltage scaling. However, at low operating speed, the power reduction is not as significant since power reduction by the switching capacitance dominates. Even though the worst-case critical delay of the Booth encoding multiplier is smaller than the RSA multiplier so that the amount of voltage scaled is larger with the Booth encoding multiplier given a operating speed, the actual power is compensated by the switching activity difference. In summary, the effectiveness of power saving is different for different region of the operating frequency. At low operating frequency, the pipelined RSA multiplier is the worst among the multipliers considered, but it is the best as the operating clock frequency increases beyond 65 MHz.

CONCLUSION
In this paper we have presented a novel methodology for designing low-power multipliers by coefficient optimization and architectural modifications. Specifically, the coefficients are scaled and encoded to achieve minimum bit clustering and partitioning. In conjunction with a reduced switching activity (RSA) multiplier that uses selective row disabling, this coefficient optimization scheme can result in significantly more efficient DSP computations over conventional array multipliers. In simulation with the 64-coefficient FFT, our methodology reduced dissipation in half, in comparison with a conventional array implementation, without any throughput or error rate degradation. Our design methodology can be applied to every DSP computation that performs multiplications with a set of coefficients. Many well-known design approaches at the device and the circuit level as well as coefficient level such as differential encoding the coefficient can be applied to the multiplier obtained by our methodology for additional gain in the power efficiency.