Graphical Design Techniques for Fixed-point Multiplication

This tutorial paper examines the problem of performing fixed-point constant integer multiplications using as few adders as possible. The driving application is the design of digital filters, where several products of a single multiplicand are often required. Two specific problems are therefore examined in detail: the one-input/one-output case and the one-input/several-output case. The latter is of interest because it can take advantage of redundancy between the different coefficient multipliers. Graphical methods can be used to design multipliers in both cases.


INTRODUCTION
This paper examines the problem of minimising the number of adders required to perform fixed-point shift-and-add multiplication by one or more constants. Constant integer multiplication can be implemented using a network of binary shifts and adders. In an integrated circuit implementation of parallel arithmetic, binary shifts (scaling by a power of 2) can be hardwired and therefore do not require gates. Gates are thus only required to implement adders and subtractors, which require approximately equal numbers of gates. The hardware cost of the multiplier is thus approximately proportional to the number of adders and subtractors required; for convenience, both will be referred to as "adders", and their number as the "adder cost". The paper is divided into three parts, addressing the distinct problems of the single multiplier case, the case where several products are required of a single multiplicand, and the matrix multiplication case where the requirement is for several sums of products of several inputs. Almost all of the material discussed and presented has been published before, but this is the first paper where these findings have been gathered together.
The problem of reducing the number of adders required for a single fixed-point multiplication has been studied for some years. As early as 1951, Booth [1] recognised that if subtractors were also allowed, the total number of adders and subtractors, hereafter collectively called "adders", could be reduced. The 1960s saw the introduction of signed-digit (SD) representation by Avizienis [2]. A coefficient c in n-bit signed-digit representation can be written

c = Σ_{i=0}^{n−1} b_i 2^i

where b_i is taken from the set {1̄, 0, 1} and 1̄ represents −1. It is therefore a ternary representation (binary representation would be identical except that the b_i would be taken from the set {0, 1}). In general, there are several different SD representations for a given integer, and the representation that has fewest non-zero digits is known as the canonic signed-digit (CSD) representation. Reitwiesner [3] showed that the representation with no two adjacent non-zero digits is canonic, and an algorithm for finding this representation was presented by Hwang [4]. Garner [5] showed that, on average, and for long wordlengths, CSD requires 33% fewer non-zero digits than binary. Since non-zero digits represent additions (or subtractions), CSD is therefore significantly more efficient in adders than binary. The graph multiplier technique, developed by the author and described in Section 2, has proven to be significantly more efficient than CSD.
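The CSD (nonadjacent) form can be computed digit by digit. The following is an illustrative sketch (the function name and interface are this paper's own, not from the cited references): each odd remainder emits +1 or −1 so that the next remainder is divisible by 4, which guarantees no two adjacent non-zero digits.

```python
def csd(n):
    """Canonic signed-digit digits of positive integer n, least significant first."""
    digits = []
    while n != 0:
        if n % 2 == 0:
            digits.append(0)
            n //= 2
        else:
            d = 2 - (n % 4)   # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            digits.append(d)
            n = (n - d) // 2  # (n - d) is now divisible by 2, next digit is 0
    return digits
```

For 45 this yields the digits of 101̄01̄01 (i.e., 64 − 16 − 4 + 1), with four non-zero digits and hence three adders.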
In signal processing applications, several products of a single multiplicand are often required; e.g., in a direct form finite impulse response (FIR) digital filter, the input data at some stage is multiplied by each of the coefficients. Transposition of the FIR filter as in Figure 1 allows all of these multiplications to be performed at once, with the individual multipliers replaced by the "multiplier block". Redundancy between the coefficient multipliers can be exploited in order to reduce the number of adders required to produce all of the products. Various techniques have been proposed to maximise these savings. The multiplier block method described in Section 3 was developed by the author from a technique first described by Bull and Horrocks [6]. It uses the same graphical methods used for the graph multipliers of Section 2 and has proven to be more efficient in terms of adders than any other method examined by the author.

GRAPH MULTIPLIERS

Graph Representation of Multiplication
Multiplication by a constant integer can be described in terms of a graph as follows. There is an initial vertex of the graph, which can nominally be assigned the value 1. There is a terminal vertex of the graph, which is assigned the value of the multiplier being designed. The multiplicand can be considered as being input to the initial vertex. The product is output from the terminal vertex.

FIGURE 1 Replacement of individual coefficient multipliers by a single multiplier block. In the general transposed Nth-order direct form (a), there are N+1 multipliers, which are replaced by a multiplier block (b). Similarly, for an even order-N symmetric filter (c), the N/2+1 multipliers are replaced with a single multiplier block. Note that in all cases, there are N adders intrinsic to the filter structure, i.e., those not used for multiplication.

Each vertex of the graph except the initial vertex represents an adder, which has two inputs and can have any number of outputs. Each edge of the graph can be assigned a value from {±1, ±2, ±4, ±8, ...}, representing multiplication of the value of the initial vertex of that edge by a value that can be implemented as a binary shift. Since adders and subtractors are treated as having equal cost, positive and negative values can be freely chosen. The example in Figure 2 is 45, the smallest integer that can be represented using fewer adders than in CSD. Figure 2(a) shows 45x computed as ((x − 4x) − 16x) + 64x, using 3 "adders", i.e., this graph has adder cost 3. This corresponds to the CSD representation 45 = 64 − 16 − 4 + 1, or 101̄01̄01 in signed-digit notation. Because there is no string of adjacent non-zero signed digits, this signed-digit representation is canonic [4], i.e., this is a CSD graph. Optimal (cost-2) representations, which require only two adders, are shown in Figures 2(b) and (c); note the optimal representation is not necessarily unique. The important thing to note about these graphs is that they have a different topology from the CSD graph. Shifted versions of the newly created vertex values are used to produce the result, rather than simply adding shifted versions of the input (multiplicand), which is the basis of both binary and CSD representations.
We have used the term "fundamental" to describe the values assigned to the vertices of the graphs.The three graphs of Figure 2 have sets of fundamentals {-3,-19,45}, {5,45} and {9,45}, respectively.
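The three graphs of Figure 2 can be checked numerically. In the sketch below (function names are illustrative), each addition or subtraction is one "adder" and each left shift is a free hardwired scaling:

```python
def times45_csd(x):
    """Figure 2(a), the CSD graph: 45x = ((x - 4x) - 16x) + 64x, 3 adders."""
    return ((x - (x << 2)) - (x << 4)) + (x << 6)

def times45_via5(x):
    """Figure 2(b), fundamentals {5, 45}: 5x = x + 4x, then 45x = 5x + 8*(5x). 2 adders."""
    t = x + (x << 2)
    return t + (t << 3)

def times45_via9(x):
    """Figure 2(c), fundamentals {9, 45}: 9x = x + 8x, then 45x = 9x + 4*(9x). 2 adders."""
    t = x + (x << 3)
    return t + (t << 2)
```

The cost-2 graphs shift the intermediate vertex value (5x or 9x), something the CSD graph, which only ever shifts the input, cannot do.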
2.2. General Algorithms for Graph Design

General Algorithms
There are two types of algorithm that have been devised for the design of graph multipliers.
Exhaustive algorithms, discussed in Section 2.3, evaluate all multipliers possible for all graph topologies. This is a very time-consuming task, and is typically done only once to produce a lookup table that includes the graphs for a given multiplier. This lookup table is also expensive in terms of memory, so in practice, the algorithms are restricted to shorter wordlengths. Due to their exhaustive nature, these algorithms produce optimal results, i.e., they use the fewest possible adders.
On the other hand, general algorithms, the subject of this section, are in general suboptimal. They are given the required multiplier value and design the graph with no other information. This tends to produce results relatively quickly using far less memory, so long wordlengths can be readily accommodated. The problem of producing an optimal general algorithm remains a target of research.
Both types of algorithm operate on positive odd integers only. Even integers can then be produced with a shift, and negative integers by replacing some adders with subtractors.
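This normalisation is straightforward; a minimal sketch (the function name is illustrative):

```python
def normalize(n):
    """Reduce a general multiplier n to the positive odd integer the graph
    algorithms design for. Returns (odd, shifts, negate) such that
    n == (-1 if negate else 1) * odd * 2**shifts."""
    assert n != 0
    negate = n < 0
    n = abs(n)
    shifts = 0
    while n % 2 == 0:   # trailing even factors become free hardwired shifts
        n //= 2
        shifts += 1
    return n, shifts, negate
```

For example, a multiplier of −180 is designed as 45, with a 2-bit shift and a final subtraction instead of an addition.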

The Bull and Horrocks Algorithm
The algorithm of Bull and Horrocks [6] was not designed to produce single multipliers, although it can be used for this purpose. Originally, it exploited redundancy in designing multiplier blocks, a problem more completely addressed in Section 3. The reader is referred to [6] for a detailed description of the algorithm, hereafter denoted "BH", and to [7] for a version modified by the author, denoted "BHM", which produces significantly improved results. The basis of the algorithm is that it starts with the input ("1") vertex and takes all pairwise sums of power-of-two multiples of that vertex. The sum closest to the required multiplier value is added to the graph as a vertex. This process is repeated, using power-of-two multiples of all vertices in the graph, until the multiplier value is added to the graph.
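The greedy principle can be sketched as follows. This is a simplification in the spirit of BH, not the published BH or BHM algorithm (the shift bound and tie-breaking here are arbitrary choices for illustration):

```python
def greedy_graph(target, max_shift=10):
    """Greedy sketch: repeatedly form sums/differences of power-of-two
    multiples of existing fundamentals and keep the candidate closest to
    the target, until the target itself is reached."""
    assert target > 0 and target % 2 == 1
    fundamentals = [1]
    adders = 0
    while target not in fundamentals:
        best = None
        for a in fundamentals:
            for b in fundamentals:
                for i in range(max_shift + 1):
                    for j in range(max_shift + 1):
                        for s in (a * 2**i + b * 2**j, a * 2**i - b * 2**j):
                            if 0 < s <= target and s not in fundamentals:
                                if best is None or abs(target - s) < abs(target - best):
                                    best = s
        fundamentals.append(best)   # one new vertex = one adder
        adders += 1
    return fundamentals, adders
```

For 45 this greedy sketch needs 3 adders (via 40 and 44), illustrating that greedy growth toward the target is suboptimal: the cost-2 graphs of Figure 2 do better.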

The Bernstein Algorithm
The problem of reducing the number of adders required for a hardware integer multiplier is closely related to the problem of reducing the number of ADD, SUB and SHIFT operations required to perform an integer multiplication in software. An algorithm for this purpose was proposed by Bernstein [8], and has been used in various software compilers (e.g., [9,10]). Recently, the algorithm was extended by Wu [11] to account for the LEA (shift/add) instruction of the Pentium processor.
Bernstein measured cost in instructions, where SHIFT had the same cost as ADD and SUB. For the hardware case, shifts can be hardwired and are thus essentially costless. Therefore, the algorithm, modified to evaluate the number of adders required to produce a product with integer n, computes a cost of the form

Cost(1) = 0
Cost(n) = Cost(n/2)                                              (n even)
Cost(n) = 1 + min{ Cost(n − 1), Cost(n + 1),
                   Cost(n/(2^i − 1)), Cost(n/(2^i + 1)) }        (n odd)

where the terms Cost(n/(2^i ± 1)) are included only where n/(2^i ± 1) is integral. This algorithm will be referred to as "BERN".

2.2.4. The BBB Algorithm

It was found [12] that each of the BHM and BERN algorithms can perform better than the other for individual designs because they produce graphs of different basic topology. An example of this is shown in Figure 3 [12]. For integer 711, BERN's "product" graph is optimal, whereas for 707, BHM's "entangled" graph is optimal. Therefore, an algorithm that selects the "better of BHM and BERN" (BBB) was defined. It is difficult to predict which of the BHM or BERN graph topologies will be more appropriate for a given integer, so the BBB algorithm simply designs using both methods and chooses the better result.
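A BERN-style cost evaluation can be sketched with memoisation. This is an illustrative reconstruction of the recursion, not Bernstein's published code, and it tries both n − 1 and n + 1 rather than choosing by n mod 4:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def bern_cost(n):
    """Adder cost of multiplication by positive n when shifts are free."""
    if n == 1:
        return 0
    if n % 2 == 0:
        return bern_cost(n // 2)          # trailing zeros are a free shift
    candidates = [bern_cost(n - 1) + 1,    # nx = (n-1)x + x
                  bern_cost(n + 1) + 1]    # nx = (n+1)x - x
    i = 2
    while (1 << i) - 1 <= n:
        for d in ((1 << i) - 1, (1 << i) + 1):
            if n % d == 0:                 # nx = d * (n/d)x, d costs one adder
                candidates.append(bern_cost(n // d) + 1)
        i += 1
    return min(candidates)
```

For example, 45 = 5 × 9 gives a cost of 2, matching Figure 2, and 711 = 9 × 79 = 9 × (80 − 1) gives a cost of 3, the "product" topology mentioned above.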

Comparison of Algorithms
Figure 4 [7] clearly shows the advantage of the BHM algorithm over binary, CSD and the original BH algorithm. It can be seen that the BH algorithm does not produce noticeable improvements over CSD for wordlengths of less than 22 bits. The BHM algorithm always produces better average results than CSD, with a 26.6% reduction in adder cost for 32-bit words.

FIGURE 4 Average adder costs for multipliers designed using various techniques. For wordlengths up to 12 bits, results are exhaustive. For wordlengths above 12 bits, the averages are over 1000 uniformly distributed random integers.

For wordlengths between 14 and 32, a similar comparison in Figure 5 [12] shows that despite the BERN algorithm being inferior on average, individual instances of superiority can be exploited to give a significant average cost gain for the BBB algorithm. For 32-bit words, the average BERN cost is 3.5% worse than the BHM cost, whereas the BBB cost is 3.4% better than that of BHM. Similarly, for all 12-bit words, the results in Table I indicate that the BBB algorithm provides designs much closer to optimal cost, in general, than either the BHM or BERN algorithm. The optimal costs are those found by exhaustive search.

Because we are looking at the VLSI implementation of multiplication, we have always assumed that a shift operation is free. However, algorithms have been designed to cater for the case where shifts bear a cost (e.g., the original Bernstein algorithm [8]) or are not available. If shifts and, further, subtractions are assumed not to be available, the problem of designing the graph for multiplier n is equivalent to finding the shortest addition chain for n, a well-researched area in mathematics. An addition chain for n is defined as a sequence of integers 1 = a_0 < a_1 < a_2 < ... < a_r = n with the property that a_i = a_j + a_k, for some k ≤ j < i, for all i = 1, 2, ..., r. An example of an addition chain represented by a graph is shown in Figure 6, illustrating the Fibonacci sequence. Knuth's "power trees" [13] optimally search these chains and hence could design an optimal graph under the above constraints. If subtractors are allowed, the problem becomes that of finding the shortest addition/subtraction chain [14]. Bull and Horrocks' "add-only" and "add-subtract" algorithms [6] are general algorithms that perform suboptimal searches for the shortest addition and addition/subtraction chain, respectively. Graphs such as we have seen earlier are produced, but there is no scaling of the graph edges, so the graphs necessarily require more adders.

FIGURE 6 The graph to produce the Fibonacci sequence. Note that edges are not labelled; each represents a scaling of 1. This is an example of an addition chain.
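The defining property of an addition chain can be checked mechanically; a minimal sketch (the function name is illustrative):

```python
def is_addition_chain(chain):
    """True if every element after a_0 = 1 is the sum of two (not
    necessarily distinct) earlier elements of the chain."""
    if not chain or chain[0] != 1:
        return False
    for i in range(1, len(chain)):
        if not any(chain[i] == chain[j] + chain[k]
                   for j in range(i) for k in range(j + 1)):
            return False
    return True
```

The Fibonacci sequence 1, 2, 3, 5, 8, 13 passes, since each term is the sum of the two before it; 1, 2, 6 fails, since 6 is not the sum of two elements of {1, 2}.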
The context in which Knuth introduces his "power trees" [13] is not, in fact, efficient multiplication, but efficient exponentiation, i.e., the best way to raise an input x to a power n. This problem, which involves the best choice of square and multiply operations, is a direct analogue of our choice of shift and add operations. For instance, to calculate x^11 (note 11 = 1011 in binary), square x, square the result, multiply by x, square, multiply by x, giving, sequentially, x^2, x^4, x^5, x^10 and x^11.
The algorithmic procedure is simply to replace the binary digits "1" (excepting the leading "1") with "square, multiply" and "0" with "square". Almost identically, a multiplication by 11 can be achieved by replacing binary "1" with "shift, add" and binary "0" with "shift". Thus 11x could be evaluated as: shift x, shift the result, add x, shift, add x, giving, sequentially, 2x, 4x, 5x, 10x and 11x. Note the product is built up identically to the power of x in the previous example. This method is almost 4000 years old when used for multiplication, and over 2000 years old for exponentiation [13].
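The binary method just described can be written in a few lines (an illustrative sketch; the function name is this paper's own):

```python
def multiply_by(n, x):
    """Multiply x by n with the binary method: scan the bits of n from the
    most significant end, shifting (doubling) for every bit and adding x
    for every 1 bit after the leading one."""
    bits = bin(n)[2:]        # e.g. n = 11 gives '1011'
    acc = x                   # the leading 1
    for b in bits[1:]:
        acc <<= 1             # "shift"
        if b == '1':
            acc += x          # "add"
    return acc
```

For n = 11 the accumulator passes through 2x, 4x, 5x, 10x and 11x, exactly the sequence above. The adder cost is the number of 1 bits minus one, which is why CSD and graph methods, with fewer non-zero digits, do better.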
Efficient exponentiation is a subject of research interest because the popular RSA cryptographic algorithm [15] requires the calculation of m^e (mod N), where e and N are very large integers. Any of the algorithms already discussed could be used to perform this task, but when e is large, heuristic techniques have been used. These require certain powers to be pre-computed. The analogy with the graphical method is to force certain fundamentals into the graph prior to the graph design. Two algorithms, the k-SR algorithm [16] and the SS(1) algorithm [17], trade off the amount of work in this pre-computation (and, more importantly, the memory space that this requires, as the pre-computed results are stored in a look-up table) against the speed with which the rest of the exponentiation task is completed, measured as the number of multiplications required. They could be considered useful for designing extremely long wordlength shift-and-add multipliers. For these long wordlength multipliers, it was recently shown [18] that the number of adders required for multiplication by an n-bit integer is O(n/log n).

Exhaustive Algorithms for Graph Design

2.3.1. Requirements
There are three important areas of consideration that ensure that an algorithm has searched all possible graph configurations:

1. All possible graph topologies are searched,
2. All possible vertex values (fundamentals) are accounted for, and
3. All possible edge values are accounted for.
These requirements place different restrictions on the algorithm. For graphs up to cost-4, all of the graphs of Figure 7 [7] must be searched. A recent algorithm presented by Li [19], which claims to be optimal, does not search all of these graphs [20] and is therefore not truly exhaustive. However, because it exhaustively searches the graphs that it does recognise, it can be considered an algorithm of the exhaustive type. Vertex value theorems in [7] show that we need only search using positive, odd fundamentals, simplifying the search significantly. Edge value theorems in [7] indicate that there is a limit to the size of integers that can appear as edge values and fundamentals, thus ensuring that the algorithm is exhaustive and can be executed in finite time. These considerations led to the design of the MAG algorithm.

The Minimised Adder Graph (MAG) Algorithm
The operation of the minimised adder graph (MAG) algorithm is described in detail in [7]. Here it need only be said that it exhaustively searches all the graphs of Figure 7. The algorithm produces two lookup tables, one with the cost of the multiplier and the other with a record of the fundamentals of all the graphs used to produce the multiplier at that cost. This second "fundamentals" table grows exceedingly quickly with wordlength, and the capability of the machine used to produce the results limited the extent of those results to 12-bit wordlengths. The results of applying the algorithm to all integers up to 2^12 are shown in Figure 8 [7]. In general, binary and CSD implementations are limited to the graph topology labelled "1" in Figure 7 for each cost, although tree structures such as cost-3 graph 7 have been proposed [21] to minimise propagation delay. Not restricted by these limitations, the BHM and MAG algorithms are shown in Figure 8 to be clearly less costly. Although the advantage of MAG over BHM is only about 5%, this advantage is expected to improve for longer wordlengths.

FIGURE 7 The possible graph topologies for costs 1 to 4 (dotted lines indicate the building of higher cost graphs from lower cost graphs).

FIGURE 8 Average number of adders required against wordlength in bits for single integer multipliers.

MULTIPLIER BLOCKS
Multiplier blocks produce several products of a single multiplicand by exploiting redundancy in the multiplication process. For example, if the multipliers 5 and 45 are required, Figure 2(b) could be used to produce the multiplication by 45, with the multiplication by 5 as a "free" by-product (pun intended).
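In code, this sharing amounts to returning the intermediate vertex as a second output (an illustrative sketch; the function name is this paper's own):

```python
def multiplier_block_5_45(x):
    """Two products of one multiplicand from the graph of Figure 2(b),
    using two adders in total; 5x is a free by-product of forming 45x."""
    t5 = x + (x << 2)      # adder 1: 5x
    t45 = t5 + (t5 << 3)   # adder 2: 45x = 5x + 8*(5x)
    return {5: t5, 45: t45}
```

Built separately, the two products would cost one adder for 5x and two for 45x; the block saves one adder by reusing the 5x vertex.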

The n-dimensional Reduced Adder Graph (RAG-n) Algorithm
The n-dimensional reduced adder graph (RAG-n) algorithm is currently the best algorithm for designing short-wordlength multiplier blocks. It is in two parts; the first is optimal, i.e., if the set of coefficients is completely synthesised by this part of the algorithm, minimum adder cost is assured, and the second part is heuristic. It uses the two lookup tables generated by the MAG algorithm, which, at present, cover the range up to 4096.
The algorithm is described in detail in [22] and operates roughly as follows. A set of coefficients is input to the algorithm, and the graph is built up in stages. First, the cost-1 coefficients are added to the graph. Power-of-two multiples of the fundamentals in the graph are then added together, and if another coefficient is produced, it is added to the graph; the process is repeated until no further coefficients can be added to the graph. If the whole coefficient set is synthesised by this part of the algorithm, the resulting graph is guaranteed to be optimal. If there are still some coefficients left to synthesise, heuristic methods are used, with the MAG algorithm's fundamentals lookup table, to try to select the best coefficient to add next. Under certain conditions, the result of this process is also optimal. A hybrid algorithm has also been defined, which replaces this final heuristic stage with the BHM algorithm in order to increase speed. The algorithm of Bull and Horrocks mentioned earlier in Section 2.2.2 was originally designed for the multiplier block application. Again, in the multiplier block context, we denote the algorithm "BH". Modifications to the algorithm for multiplier blocks are described in [22], and the modified algorithm is again denoted "BHM".
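The optimal first part can be sketched as follows. This is a simplification of the published RAG-n algorithm [22] for illustration only (the shift bound is arbitrary, and the heuristic second part is omitted); because each absorbed coefficient costs exactly one adder, an empty remainder meets the one-adder-per-coefficient lower bound and is therefore optimal:

```python
def rag_optimal_part(coeffs, max_shift=12):
    """Sketch of the optimal part of RAG-n: seed with cost-1 coefficients,
    then absorb any coefficient that is a sum/difference of power-of-two
    multiples of fundamentals already in the graph."""
    def odd_part(n):
        while n % 2 == 0:
            n //= 2
        return n

    remaining = {odd_part(abs(c)) for c in coeffs if c != 0} - {1}
    graph = {1}
    # cost-1 fundamentals have the form 2^i +/- 1 (one adder from the input)
    cost1 = {(1 << i) + s for i in range(1, max_shift + 1) for s in (-1, 1)}
    graph |= remaining & cost1
    remaining -= graph

    changed = True
    while changed and remaining:
        changed = False
        shifted = {f << i for f in graph for i in range(max_shift + 1)}
        for c in sorted(remaining):
            # c = a + b, c = a - b or c = b - a for shifted fundamentals a, b
            if any((c - a) in shifted or (c + a) in shifted or (a - c) in shifted
                   for a in shifted):
                graph.add(c)          # one adder per absorbed coefficient
                remaining.discard(c)
                changed = True
    return graph, remaining   # remaining empty => design is optimal
```

For the set {5, 45}, 5 is cost-1 and 45 = 5 + 8·5 is absorbed immediately, giving two adders; 45 on its own cannot be finished by this part, illustrating where the heuristic stage takes over.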
Nakayama's permuted differences [23], the subexpression elimination techniques of Hartley [24,25] and Potkonjak et al. [26], and the nested structures of Mahanta et al. [27] have been shown to produce structures that can be represented by particular types of multiplier block graphs. However, these methods have been shown [28] to be far less versatile than the two mentioned above and therefore need not be discussed further here.
The author has also defined an optimal algorithm for the design of 2-coefficient multiplier blocks, known as MAG2 [28]. The computation time of this algorithm increases factorially, so exhaustive results have only been calculated up to 8 bits, where MAG2 produces a 27% average reduction in adder cost over CSD coefficients. For a pair of coefficients of the same wordlength, application of BHM results in a 21% reduction and RAG-n a 24% reduction.

RAG-n and BHM Performance
In the experiments used to test the performance of the algorithms, uniformly distributed random coefficients were used. The non-uniform distribution of coefficients in typical FIR filters leads to even better results [22]. For set sizes (numbers of coefficients) of 3, 5, 7, 10, 15, 25, 40 and 80, one hundred uniformly distributed random sets of coefficients were costed for even wordlengths up to 12 bits. The average adder cost is shown in Figure 9, where it can be seen that for a given wordlength, average adder cost increases roughly linearly with set size. A set of 80 coefficients of 12-bit wordlength requires fewer than one adder per coefficient on average. For the smaller wordlengths in the figure, an asymptote is reached, which is the cost of the graph that can fully represent all of the coefficients of that wordlength. The value of this asymptote is the number of odd integers of wordlength w, i.e., 2^(w−1). Once this asymptote is reached, any "new" coefficient is simply a repetition of a coefficient already in the set.
For a set size of 5 (the number of coefficients that would be required for the implementation of linear phase FIR filters of order 9 or 10), the average adder cost of a multiplier block for the BH, BHM, hybrid and RAG-n algorithms was compared over a range of wordlengths, as shown in Figure 10. Comparisons with individual multipliers using CSD and binary are also shown. The RAG-n algorithm provides a significant improvement in cost over BHM (8.4% for 12-bit words), which in turn provides a significant improvement (10.6%) over the original BH method. All the algorithms that utilise graph synthesis techniques are far superior in terms of adder cost to CSD and binary.
The computation time of these multiplier block design algorithms is an interesting subject and has been discussed at some length in [22,28]. The optimal part of the RAG-n algorithm is actually very quick, while the heuristic part is slow. It was also found in [22] that for a given wordlength, there is an "optimality threshold" set size above which it is highly likely that the design has optimal cost. These are counterintuitive results, since given the potentially explosive growth of complexity of the problem with set size (the problem is NP-complete [6]), optimality might seem to be less likely for large set sizes. The explanation is that the number of optimal solutions grows quickly and they are easy to find.

FIGURE 9 Average adder costs for the RAG-n algorithm for various wordlengths against uniformly distributed coefficient set size.

Application of Multiplier Blocks to Digital Filters

Cost Reductions
We have used multiplier blocks in the design of both finite impulse response (FIR) and infinite impulse response (IIR) filters. We measured the adder cost of these filters as the number of adders in the multipliers plus the adder cost of the structural elements (the adders and delays). Due to the more efficient multiplier implementation, the proportion of the total adder cost due to the multipliers is drastically reduced. In the examples examined, this proportion dropped from approximately 40% to 20% for FIR filters [22] and from 80% to 50% for IIR filters [28,29]. This has a number of important implications. In the past, the emphasis in reducing the complexity of a filter has been on the multipliers. Multiplier blocks have been so successful in reducing that cost that there is incrementally less to be achieved by further attention to multiplier complexity. In other words, elaborate schemes that select the coefficients in some "optimal" way (e.g., [30]) may not offer significant savings over a technique that selects the coefficients in a straightforward fashion and implements them as a multiplier block.
It is important to note that multiplier blocks are applied directly to a selected set of coefficients, and the complexity savings they offer are limited to that application. Methods that select simple coefficients can be used in conjunction with multiplier blocks, such as statistical wordlength minimisation as described by Crochiere [31] for IIR filters, Grenez [32] for FIR filters, and the author [33,34] for average wordlengths. There are many techniques that are aimed at reducing filter wordlength (see the reference list for [22]), which can all be used in conjunction with multiplier blocks.
The multiplier block method does not in itself attempt to minimise the number of adders that are required to meet a given filter specification; instead it aims to minimise the number of adders required to produce the products for a given set of coefficients. Some methods have been described that attempt to minimise the number of adders in the filter directly. The earliest technique of this type, described by Jain et al. [35,36], aimed to minimise the number of CSD "bits" required by the coefficients (without using redundancy between the coefficient multipliers). Another method, of Wade et al. [37,38], tries to reduce the number of adders in a cascade of primitive sections that meets the filter specification. The cost functions associated with this type of optimisation are badly behaved, so non-gradient searches such as genetic algorithms have been devised for this task by Roberts [39] and Suckley [40], and for the relationship between this adder cost function and the filter error specification by Wilson and Macleod [41].
These various optimisation algorithms could be modified to operate with multiplier blocks producing the cost function. However, this cost function has been shown to be relatively flat in a local region [28,29], providing further discouragement for optimisation, in addition to the reasons discussed earlier in this section.

The Complexity Hierarchy
Multiplier blocks apply where several products of a single multiplicand are required. The larger the block, the more the cost of the multipliers can be reduced. Therefore, structures that allow the use of large blocks, such as direct-form FIR and IIR filters, can be expected to gain more from using multiplier blocks than structures with isolated multipliers, such as the lattice wave structure. In fact, early results [42,43] showed that using multiplier blocks can make the direct form structure more efficient than the wave structure! In these studies, we found that whereas traditional methods favour the lattice wave structure for filters, multiplier block implementation so drastically reduces the cost of cascaded second-order forms that they become significantly less costly. The cost of the direct structure is reduced to less than that of the wave structure, despite it having more coefficients and requiring a much longer coefficient wordlength. Even when data wordlength noise effects are taken into account [28], the direct structure is still competitive with the wave structure. However, the direct form has always had poor limit-cycle (instability due to nonlinear feedback) performance, and a more recent study [44] shows that although cascaded second-order forms still outperform the lattice wave structure when limit cycles are eliminated, the direct form no longer competes.
3.3.3. The Order-complexity Trade-off

In Section 3.3.1, reference was made to the flattening of the cost function in coefficient space due to the use of multiplier blocks. This means that the total cost of a set of coefficients in a block does not vary very much if the values of the coefficients are varied. This slow variation in cost has also been observed when the number of coefficients in the block is varied, corresponding, for example, to a variation in filter order for an FIR filter. This slow variation means that there may be an incentive to increase the order of the filter in order to reduce the complexity.
Studies of both FIR [45] and IIR [28] filters indeed show that there is an incentive to increasing the order, and that multiplier blocks flatten the cost of FIR filters such that any order up to 10% above the estimate produced by the usual order estimators may produce the most efficient design.

Comparison with Other Efficient Filter Design Methods

We applied multiplier blocks to some of the filters published as examples of advanced techniques for designing low-complexity filters. These methods include Powell and Chau's CSD delta-modulation of the coefficients [46] and the cascade of primitive sections due to Wade et al. [37,38]. For the multiplier block filters used in the comparison, the output of the Remez exchange algorithm design was simply quantised prior to the application of multiplier blocks, i.e., no special technique was used to select the coefficients. For all the examples, it was shown [28,47] that multiplier blocks produced a filter that required fewer adders. Where a recursive running sum (RRS) prefilter was not used, the cost of the multiplier block filter ranged from 48% to 78% of the cost of the other design. Where an RRS prefilter was used, the advantage of the multiplier block design was less significant.
Jones [48] has proposed a distributed arithmetic method, which extends the idea originally described by Peled and Liu [49]. He shows that this method requires more adders than the Bull and Horrocks method [6], and it also uses RAM, ROM and control circuitry. It is therefore also less efficient than the RAG-n algorithm, but it is not coefficient-dependent.

Filter banks (parallel connections of digital filters) are used in many signal processing applications, including the design of analysis and synthesis filters for multirate signal processing [50], time-frequency analysis [51], wavelets [52] and fractional delay filter design [53]. Figure 11 [54] shows that for a simple filter bank of two second-order FIR filters, all of the coefficients in the filter bank can be incorporated into a single block. Figure 11(a) is the usual direct-form interconnection of the filter bank. If we consider the filter banks as separate filters, each accepting the same input (i.e., remove link A), and then transpose this structure, we get the structure in Figure 11(b) (without link B). With link B in place, however, we see that all of the coefficients multiply a single data input, and can be placed in a single multiplier block as in Figure 11(c). We have examined the application of multiplier blocks to filter banks and found [54] that, once again, the costs of the multiplier elements can be reduced significantly. Hence, the cost of the structural components becomes even more significant than if each filter were built separately. In designing a filter bank, there are a variety of structures that can be used, including, for the interpolation application we examined, the Farrow structure [53]. If multiplier blocks are used for multiplication, this choice of structure then dominates the overall cost of the filter bank.

MATRIX MULTIPLICATION
The problem of performing matrix multiplication using graphical techniques increases the complexity of the problem by one dimension. If the single multiplier case of Section 2 has zero dimensionality, and the single-input, multiple-output case of Section 3 has dimension 1 (a vector multiplied by a scalar), then the multiple-input, multiple-output task of multiplying a matrix by a vector has dimension 2. The algorithms described already can be used to design the multipliers, but there is no guarantee of an optimal result. The multiple-input nature of the problem means that an optimal graph will be even more "entangled" than some of the complex-looking graphs produced by RAG-n or BHM. Take, for example, a 2x2 matrix multiplying the input vector (x1, x2) to produce the outputs (y1, y2). Using the RAG-n or BHM algorithm to design the matrix multiplier would result in the structure of Figure 12a, where the various products of the inputs are produced and then combined at the end. This method uses 7 adders. A more efficient graph is shown in Figure 12b, which requires only 6 adders. The outputs are synthesised using the equations

y1 = 7(-3x1 + 5x2) + 4x2
y2 = 3x1 + 5x2

using an intermediate result, C = -3x1 + 5x2, which uses both inputs and supplies both outputs. It would appear that algorithms of the type used for the 0- and 1-dimensional designs are not appropriate for matrix multiplication. The search for an appropriate algorithm remains the subject of ongoing research.

FIGURE 12 Implementing a matrix multiplication using graphs. (a) Using the RAG-n design and combining separate outputs requires 8 adders. (b) A better method, using only 6 adders, that uses an intermediate vertex C = -3x1 + 5x2 that uses both inputs and feeds both outputs.
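The 6-adder graph of Figure 12b can be verified with a short sketch (illustrative Python, not part of the paper; it assumes the equations above, i.e., that the matrix is [[-21, 39], [3, 5]]). Shifts are free hardwiring; each explicit addition or subtraction counts as one adder.

```python
def graph_12b(x1, x2):
    # Each '+' or '-' below is one adder; shifts (<<) are free wiring.
    adders = 0
    t3 = (x1 << 1) + x1; adders += 1   # 3*x1
    t5 = (x2 << 2) + x2; adders += 1   # 5*x2
    C = t5 - t3;         adders += 1   # intermediate vertex C = -3*x1 + 5*x2
    y1 = (C << 3) - C;   adders += 1   # 7*C
    y1 = y1 + (x2 << 2); adders += 1   # y1 = 7*C + 4*x2 = -21*x1 + 39*x2
    y2 = C + (t3 << 1);  adders += 1   # y2 = C + 6*x1   =   3*x1 +  5*x2
    return y1, y2, adders

for x1, x2 in [(1, 1), (5, 7), (-4, 13)]:
    y1, y2, cost = graph_12b(x1, x2)
    assert (y1, y2) == (-21 * x1 + 39 * x2, 3 * x1 + 5 * x2)
    assert cost == 6
```

The saving over the separate-output design comes entirely from the intermediate vertex C, which is computed once from both inputs and then reused by both outputs.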

DISCUSSION AND CONCLUSIONS
Graph multipliers and multiplier blocks have many advantages as we have discussed above.
However, there are also some limitations that affect their application. First, they are only of use where constant multipliers are required. If the filter coefficients need to be programmable or variable, another technique should be used. Second, when synthesised, they do not produce regular structures. It is believed that the gains they make will outweigh the inefficiency due to irregular layout, but this conjecture has yet to be tested. The comparisons made herein and throughout this work have been at the adder level, in an attempt to make them as independent of technology as possible. Technology-dependent comparisons will be explored in the near future. These comparisons will also extend to serial arithmetic, as all the comparisons here effectively apply to parallel arithmetic. Third, the products produced from a multiplier block do not necessarily have the same latency, so for pipelined applications, extra pipelining registers will be required.
To summarise the findings of the multiplier work:

1. For single coefficients, the MAG algorithm guarantees minimum adders for a given multiplier. Due to memory use, the MAG algorithm has been limited to a given wordlength. Above this wordlength, the BBB algorithm, the better of BHM and BERN, is the best available. For extremely long wordlengths, the exponentiation algorithms k-SR and SS(1) are worth considering. In addition to the VLSI application of primary interest here, all of these algorithms can also be used to reduce the number of ADD (and SHIFT) instructions a software compiler assigns to a multiply, and may assist in reducing the exponentiation overhead in cryptographic algorithms.

2. When multipliers can be blocked, i.e., where several products of a single multiplicand are required, the RAG-n algorithm is the best. It is often optimal, but its use of the MAG algorithm also limits its maximum wordlength. The BHM algorithm is the best to use above that wordlength. These algorithms design filters that are more efficient in terms of adders than any other method to which they have been compared.

3. For both FIR and IIR filters, the use of multiplier blocks substantially reduces the contribution to overall complexity made by the multipliers, reducing the imperative to optimise the multiplier contribution. The remaining elements (adders and delays) are intrinsic to the structure of the filter and cannot be optimised. Attention must then turn to the selection of structure and order.

4. This choice of structure should not be made without examination of the effects of the use of multiplier blocks. Without the use of multiplier blocks, wave structures are the most efficient choice. Application of multiplier blocks so dramatically reduces the cost of cascade structures that they are then least costly.
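As a concrete illustration of the baseline against which the algorithms in point 1 compete, the sketch below (illustrative Python, not from the paper) recodes a constant into canonic signed-digit form; a shift-and-add multiplier built directly from any representation needs one adder per non-zero digit beyond the first.

```python
def csd(n):
    # Canonic signed-digit recoding of a positive integer: digits in
    # {-1, 0, +1}, least significant first, no two adjacent non-zeros.
    digits = []
    while n:
        if n & 1:
            d = 2 - (n % 4)   # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits

def adder_cost(digits):
    # One adder/subtractor per non-zero digit beyond the first.
    return sum(1 for d in digits if d) - 1

# 7 = 111 in binary (two adders), but 8 - 1 in CSD (one subtractor).
d = csd(7)
assert sum(di * 2**i for i, di in enumerate(d)) == 7
assert adder_cost(d) == 1
```

The graph algorithms of points 1 and 2 improve on this CSD baseline by sharing intermediate values between non-zero digit positions, rather than treating each position independently.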

FIGURE 3 The graphs designed by BHM, (a) and (c), and BERN, (b) and (d), for integers 707 and 711. Both cost-3 designs are optimal.

FIGURE 5 Average adder cost of 100 uniformly distributed integers of indicated wordlength, using the BHM, BERN, and BBB algorithms.

FIGURE 10 Average cost in adders evaluated for various algorithms against wordlength.Each point represents the average over one hundred uniformly distributed 5-coefficient sets.

3.5. Multiplier Blocks and Filter Banks

FIGURE 11 How to incorporate all coefficients of an FIR filter bank into a single block.

TABLE
The increase in average costs of the BHM, BERN and BBB algorithms compared with the optimal costs for 12-bit coefficients.