^{1}

^{2}

^{3}

^{1}

^{2}

^{3}

We present a method for implementing high speed finite impulse response (FIR) filters on field programmable gate arrays (FPGAs). Our algorithm is a multiplierless technique where fixed coefficient multipliers are replaced with a series of add and shift operations. The first phase of our algorithm uses registered adders and hardwired shifts. Here, a modified common subexpression elimination (CSE) algorithm reduces the number of adders while maintaining performance. The second phase optimizes routing delay using prelayout wire length estimation techniques to improve the final placed and routed design. The optimization target platforms are Xilinx Virtex FPGA devices where we compare the implementation results with those produced by Xilinx Coregen, which is based on distributed arithmetic (DA). We observed up to 50% reduction in the number of slices and up to 75% reduction in the number of look up tables (LUTs) for fully parallel implementations compared to DA method. Also, there is 50% reduction in the total dynamic power consumption of the filters. Our designs perform up to 27% faster than the multiply accumulate (MAC) filters implemented by Xilinx Coregen tool using DSP blocks. For placement, there is a saving up to 20% in number of routing channels. This results in lower congestion and up to 8% reduction in average wirelength.

There has been a tremendous growth for the past few years in the field of embedded systems, especially in the consumer electronics segment. The increasing trend towards high performance and low power systems has forced researchers to come up with innovative design techniques that can achieve these objectives and meet the stringent system requirements. Many of these systems perform some kind of streaming data processing, which requires the extensive evaluation of arithmetic expressions.

FPGAs are being increasingly used for a variety of computationally intensive applications, especially in the realm of digital signal processing (DSP) [

The major contributions of this paper are as follows.

The development of a novel algorithm (will be referred as modified CSE throughout the paper) for optimizing the fixed coefficient multiplier block for FIR filters for FPGA implementation. The modified CSE algorithm utilizes a modified cost function for common subexpression elimination that explicitly considers the underlying FPGA architecture.

Taking interconnect delay into account which makes physical and logic co-synthesis feasible.

The rest of the paper is organized as follows: Section

Most signal processing and communication applications including FIR filters, audio, video and image processing use some sort of constant multiplication. The generation of a multiplier block from the set of constants is known as the multiple constant multiplication (MCM) problem. Finding the optimal solution, namely, the one with the fewest number of additions and subtractions, is known to be NP-complete [

While there has been a lot of work on optimizing constant multiplications using adders and employing redundancy elimination [

Dempster and Macleod present a method using addition chains to reduce the number of adders in the multiplier block [^{12} can be computed using only four additions.

Gustafsson et al. [

Gustafsson et al. present two approaches [

In comparison with the other algorithms for common subexpression elimination [

Meyer-Baese et al. [

Macpherson and Stewart [

SPIRAL [

In this section, a review of FIR filter architecture is presented. This is followed by the illustration of two major implementations of FIR filters that are widely used: MAC and DA methods.

Equation (

Mathematically identical MAC FIR filter structures: (a) The direct form of a finite impulse response (FIR) filter (b) The transposed direct form of an FIR filter.

Constant multipliers of Figure

Most FPGAs include embedded multipliers/DSP blocks to handle these multiplications. For example, Xilinx Virtex II/Pro provides embedded multipliers while more recent FPGA families such as Virtex 4/5 devices offer embedded DSP blocks. In either case, there are two major limitations. First, the multipliers or DSP blocks can accept inputs with limited bit width, for example, 18 bits for Virtex 4 devices. A Virtex 5 device provides additional precision of 25 bit input for one of the operands. In the case of higher input width, Xilinx Coregen tools combines these blocks with CLB logic [

Since many FIR filters use constant coefficients, the full flexibility of a general purpose multiplier is not required, and the area can be reduced using techniques developed for constant multiplication [

An alternative to the MAC approach is DA which is a well known method to save resources and was developed in the late 1960’s independently by Croisíer et al. [

Assuming coefficients

A serial DA FIR filter block diagram.

In a conventional MAC, with a limited number of MAC blocks, the system sample rate decreases as the filter length increases due to the increasing bit width of the adders and multipliers and consequently the increasing critical path delay. However, this is not the case with serial DA architectures since the filter sample rate is decoupled from the filter length. As the filter length is increased, the throughput is maintained but more logic resources are required. While the serial DA architecture is efficient by construction, its performance is limited by the fact that the next input sample can be processed only after every bit of the current input sample is processed. Each bit of the current input sample takes one clock cycle to process.

As an example, if the input bit width is 12, a new input can be sampled every 12 clock cycles. The performance of the circuit can be improved by using a parallel architecture that processes the data bits in groups. Figure

A 2 bit parallel DA FIR filter block diagram.

(a) Nonregistered output adder used by DA or other competing algorithms that do not take FPGA architecture into account. (b) Registered output adder used in add and shift method leveraging the new cost function that takes FPGA architecture into account.

A transposed direct form FIR filter as shown in Figure

Our research considers two aspects of the FIR design: First, it presents the development of a novel algorithm for optimizing the multiplier block for FIR filters, using a modified algorithm for common subexpression elimination. Here, our goal is to produce a design that can provide the maximum sample rate with the least amount of hardware. Our algorithm takes into account the specific features of FPGA architecture to reduce the total number of occupied slices. The reduced number of slices also leads to a reduction in the total power on the FPGA. The implementation results in terms of FPGA area (registers, LUTs, slices), performance, and power consumption are compared against the industry standard Xilinx Coregen as well as SPIRAL [

The goal of our optimization is to reduce the area of the multiplier block by minimizing the number of adders and any additional registers required for the fastest implementation of the FIR filter. In the following, a brief overview of the common subexpression elimination methods is presented in Section

An occurrence of an expression in a program is a common subexpression if there is another occurrence of the expression whose evaluation always precedes this one in execution order and if the operands of the expression remain unchanged between the two evaluations. The CSE algorithm essentially keeps track of available expressions block (AEB) that is, those expressions that have been computed so far in the block and have not had an operand subsequently changed. The algorithm then iterates, adding entries to and removing them from the AEB as appropriate. The iteration stops when there can be no more common subexpressions detected.

The CSE algorithm uses a polynomial transformation to model the constant multiplications. Given a representation for the constant C, and the variable

We use the

Common subexpression elimination is used extensively to reduce the number of adders, which leads to a reduction in the area. Additional registers will be inserted, wherever necessary, to synchronize all the intermediate values in the computations. Performing common subexpression elimination can sometimes increase the number of registers substantially, and the overall area could possibly increase. Consider the two expressions

Extracting common subexpression (a) Unoptimized expression trees. (b) Extracting common expression (A + B + C) results in higher cost due to inserting additional synchronizing registers. (c) A more careful extraction of common subexpression (A + B) applied by our modified CSE algorithm results in lower cost.

FPGAs have a fixed architecture where every slice contains a LUT/flip flop pair. If either the LUT or flip flop are unused, then FPGA resource usage efficiency is reduced. For example, the structure shown in Figure

Another important factor is minimizing the number of registers required for our design. This can be done by arranging the original expressions in the fastest possible tree structure, and then inserting registers. For example, for the six term expression

The fastest possible tree is formed and a synchronizing register is inserted, such that new values for the inputs can be read in every clock cycle.

The first step of the modified CSE algorithm is to generate all the divisors for the set of expressions describing the multiplier block. The next step is to use our iterative algorithm where the divisor with the greatest value is extracted. To calculate the value of the divisor, we assume that the cost of a registered adder and a register is the same. The value of a divisor is the same as the number of additions saved by extracting it minus the number of registers that have to be added. After selecting the best divisor, the common subexpressions can be extracted. We then generate new divisors from the new terms that have been generated due to rewriting, and add them to the dynamic list of divisors. The modified CSE algorithm halts when there is no valuable divisor remaining in the set of divisors. Algorithm

_{i} in

_{i});

_{i}->MinRegisters = Calculate Minimum registers required

_{i};

The modified CSE algorithm presented here is a greedy heuristic algorithm. In this algorithm for the extraction of arithmetic expressions, the divisor that obtains the greatest savings in the number of additions is selected at each step. To the best of our knowledge, there has been no previous work done for finding an optimal solution for the general common subexpression elimination problem, though recently there has been an approach for solving a restricted version of the problem using Integer Linear Programming (ILP) [

The area optimization algorithm presented in Algorithm

Interconnect delay is the dominant factor in the overall performance of modern FPGAs. Pre-layout wire length estimation techniques can help in early optimizations and improve the final placed and routed design. Our modified CSE algorithm (see Algorithm

We propose a metric to evaluate the proximity of elements connected in a netlist. This metric is capable of predicting short connections more accurately and deciding which groups of nodes should be clustered to achieve good placement results. Here, divisors are referred as nodes. In other words, we are trying to find the common subexpression that not only eliminates computation, but also results in to the best placement and routing. This metric is embedded into our cost function and various design scenarios are considered based on maximizing or minimizing the modified cost function on total wirelength and placement. Experiments show that taking physical synthesis into account can produce better results.

The first step to produce more efficient layout is to predict physical characteristics from the netlist structure. To achieve this, the focus will be on pre-layout wire length and congestion estimations using mutual contraction metric [

Multipin net (a) versus two pin net (b) [

The contraction measure for groups of nodes quantifies how strongly those nodes are connected to each other. A group of nodes are strongly contracted if they share many small fan-out nets. In general a strong contraction means shorter length of connecting wires in placed design. Connectivity [

In order to include mutual contraction in wire length prediction, a clique has to be formed for multi-pin nets. Given a graph with nodes

A more precise metric for mutual contraction is used, which is the product of the two relative weights to measure the contraction of the connection as in (

Consider the circuit in Figure

Calculating the edge weights according to modified CSE algorithm: (a) Divisors that are used multiple times are shown as multi-terminal nets with edge weights based on equation (

The cost function presented in the previous section considers only area reduction as a constraint. This cost function can be modified according to mutual contraction concept. We have defined different cost functions based on maximizing or minimizing the average mutual contraction (AMC):

In the following we compare our results with other architectures for both area and performance. Add and shift method results are compared with the Coregen DA approach and SPIRAL software developed by Carnegie Mellon University in Sections

The main goal of our experiments is to compare the number of resources consumed by the add and shift method with that produced by the cores generated by the commercial Coregen tool based on DA. We compared the power consumption of the two implementations, and also measured the performance. The results use 9 FIR filters of various sizes (6, 10, 13, 20, 28, 41, 61, 119 and 151 tap filters). The target platform for experiments is Xilinx Virtex II device. The constants were normalized to 17 digit of precision and the input samples were assumed to be 12 bits wide. For the add and shift method, all the constant multiplications are decomposed into additions and shifts and further optimized using the modified CSE algorithm explained in Section

Figure

(a) Resource utilization in terms of # of slices, flip flops, and LUTs for various filters using add and shift method (this paper). (b) Performance implementation results (Msps) for various filters using add and shift method (this paper) versus parallel distributed arithmetic.

In Figure

Figure

Reduction in resources for add and shift method (this paper) relative to that for DA showing an average reduction of 58.7% in the number of LUTs, and 25% reduction in the number of slices and FFs.

In DA full parallel implementation, LUT usage is high. Therefore the percentage of saving amount is also high. Though our modified CSE algorithm does not optimize for performance, the synthesis produces better performance in most of the cases, and for the 13 and 20 tap filters, an improvement of about 26% can be seen in performance (see Figure

Figure

Comparison of power consumption for add and shift (this paper) relative to that for the DA showing up to 50% reduction in dynamic power consumption.

Resource utilization and performance implementation results for various filters using add and shift method (this paper) versus MAC method on Virtex IV. (a) Resource utilization in terms of no. of slices and DSP blocks presented in logarithmic scale. (b) Performance (Msps).

In Figure

Figure

In this work, a comparison is made primarily with the Coregen implementation of DA, which is also a multiplierless technique. Based on the implementation results, our designs are much more area efficient than the DA based approach for fully parallel FIR filters. We also compare our method with MAC based implementations, where significantly higher performance is achieved (see Figure

In the following, the add and shift method experimental results are compared against two competing methods: SPIRAL automatic software and RAG-

SPIRAL is a system that automatically generates platform-adapted libraries for DSP transforms. The system uses a high level algebraic notation to represent, generate, and manipulate various algorithms for a user specified transform. SPIRAL optimizes the designs in terms of number of additions and it tunes the implementation to the platform by intelligently searching in the space of different algorithms and their implementation options for the fastest on the given platform. The SPIRAL software is available for download. SPIRAL generates the C code (not the HDL code) for multiplier block of the FIR filter. In order to have a complete comparison, the C code for the multiplier block was generated for each filter using SPIRAL software and then converted to HDL code with the addition of the delay line. The resulting code was run by Xilinx ISE software and the implementation results are shown in Figure

Resource utilization and performance implementation results for various filters using add and shift method (this paper) relative to that of SPIRAL automatic software. SPIRAL shows a saving of 72% in FFs, 11% in LUTs, and 59% in slices at the cost of 68% drop in performance. (a) Resource utilization in terms of # of FFs, LUTs, and SLICEs. (b) Performance (Msps).

In order to have a fair comparison, all inputs and outputs were registered. We implemented all experiments with the HDL codes (converted C code that was generated by SPIRAL software) and the results are shown in Figure

The SPIRAL implementation is an optimum solution for software oriented platforms since it focuses on minimizing number of additions. However, this is not necessarily the best method for FPGA implementation. An important factor in FPGA implementation is to use the slice architecture in an efficient way and have a balanced usage of LUTs and registers.

Figure

High level resource utilization in terms of # adders and registers for various filters using add and shift method (this paper) versus SPIRAL automatic software. SPIRAL shows a saving of 15% in number of adders and 81% in number of registers at the cost of 68% drop in performance.

It is impossible to compare our implementation results with RAG-

Mutual contraction defines a new edge weight for nets and then computes the relative weight of a connection. It can be used to estimate the relative length of interconnect. This concept can be extended to measure the contraction of a node group. Our CSE based cost function considers only area reduction as a constraint. It is based on extracting the divisors in a polynomial that minimizes the number of operations needed but constraints have been modified to incorporate mutual contraction concept. Section

Figure

Implementation flow using mutual contraction concept.

For placement and routing we have followed VPR design flow summarized in [

We have implemented various size FIR filters taking mutual contraction into account. We have embedded four additional constraints introduced in Section

Number of routing channels versus filter size for various cost functions discussed in Section

Average wirelength versus filter size. for various cost functions discussed in Section

Figure

For placement, as Figure

In comparison with [

In this paper, we presented a multiplierless technique, based on add and shift method and common subexpression elimination for low area, low power and high speed implementations of FIR filters. Our techniques are validated on Virtex II and Virtex 4 devices where significant area and power reductions are observed over traditional DA based techniques. In future, we would like to improve our modified CSE algorithm to make use of the limited number of embedded multipliers available on the FPGA devices. Also, the new cost function can be embedded into other optimization algorithms such as RAG-

We have extended our add and shift method to reduce the FPGA resource utilization by incorporating mutual contraction metric that estimates pre-layout wirelength. The original cost function in add and shift method is modified using mutual contraction concept to introduce five different constraints, two of which maximize and two others minimize the average mutual contraction. As a result, an improvement is expected in routing and total wirelength in routed design. Based on the overall results

For routing, there is up to 8% saving in average wirelength and up to 20% in number of routing channels for

In comparison with SPIRAL, our method shows better performance. SPIRAL shows a saving of 72% in FFs, 11% in LUTs, and 59% in slices at the cost of 68% drop in performance. SPIRAL multiplier block is not pipelined and depending on the coefficients used, the cascaded adder tree could synthesize to several levels of logic and consequently result into low performance. This is a good solution for software implementation but not necessarily for FPGA implementation. An important factor in FPGA implementation is to use the slice architecture in an efficient way. Each FPGA slice includes a combinatorial part (LUT) and a storage element (register). Multiplier block generated by SPIRAL uses only the LUTs and registers that are left can not be used for other logic and consequently they are wasted.