High Efficiency Generalized Parallel Counters for Look-Up Table Based FPGAs

Generalized parallel counters (GPCs) are used in constructing high speed compressor trees. Prior work has focused on utilizing the fast carry chain and mapping the logic onto Look-Up Tables (LUTs). This mapping is not optimal in the sense that the LUT fabric is not fully utilized, which results in low efficiency GPCs. In this work, we present a heuristic that efficiently maps the GPC logic onto the LUT fabric. We have used our heuristic on various GPCs and have achieved an improvement in efficiency ranging from 33% to 100% in most of the cases. Experimental results using Xilinx 5th-, 6th-, and 7th-generation FPGAs and Stratix IV and V devices from Altera show a considerable reduction in resource utilization and dynamic power dissipation, for almost the same critical path delay. We have also implemented GPC-based FIR filters on 7th-generation Xilinx FPGAs using our proposed heuristic and compared their performance against conventional implementations. Implementations based on our heuristic show improved performance. Comparisons are also made against filters based on integrated DSP blocks and inherent IP cores from Xilinx. The results show that the proposed heuristic provides performance that is comparable to the structures based on these specialized resources.


Introduction
Multioperand addition is an important operation in many arithmetic circuits. It is frequently used in many applications like filtering [1], motion estimation [2], array multiplication [3-7], and so forth. Compressor trees form the basic elements in multioperand additions. Compressor trees based on carry save adders (CSA) typically provide higher speeds due to the avoidance of long carry chains. Dadda [3] and Wallace [7] trees are CSA based compressor trees which are frequently used in application specific integrated circuit (ASIC) design. However, the introduction of fast carry chains in FPGAs has made ripple carry addition faster than carry save addition. Consequently, CSA based compressor trees are not well suited for FPGA implementation [8].
Prior work on compressor tree synthesis using FPGAs has used GPCs as the basic constituent element. It has been demonstrated that the usage of GPCs can lead to a considerable reduction in the critical path delay with comparable resource utilization [8-14]. Initial attempts in this regard were made by Parandeh-Afshar et al. [8-11]. In [9] they claim to report the first method that synthesizes compressor trees on FPGAs. The proposed heuristic constructs compressor trees from a library of GPCs that can be efficiently implemented on FPGAs. Their later work [11] focuses on further reducing the combinational delay and any increase in area by formulating the mapping of GPCs as an integer linear programming (ILP) problem. The authors reported an average reduction in delay by 32% and area by 3% when compared to an adder tree. In [10] the focus is on reducing the combinational delay by using embedded fast carry chains. This concept was further extended in [8] and a delay reduction of 33% and 45% was achieved in Xilinx Virtex-5 and Altera Stratix-III FPGAs, respectively.
Matsunaga et al. [12,14] also formulated the mapping of GPCs as an ILP with speed and power as optimization goals. Their results show a 28% reduction in GPC count when compared to [9]. A reduction in GPC count results in a reduction of compression stages, thereby reducing the delay and power consumption. Recent attempts from Kumm and Zipf [15,16] focus on exploiting the low-level structure of Xilinx FPGAs to develop novel GPCs with high compression ratios and efficient resource utilization. Both general purpose LUT fabric and specialized carry chains have been used for synthesizing resource-efficient delay-optimal GPCs. All the abovementioned approaches (except [9]) focus on exploiting the fast carry chain embedded in modern FPGAs. The idea is to use the fast carry chain to connect the adjacent logic cells and bypass the programmable routing network to reduce delay [10]. This results in reduced critical path delay and hence increased speed. The mapped GPCs, however, suffer from poor efficiency.
In this paper, we use a heuristic that improves the efficiency of the mapped GPCs by reducing the number of LUTs required to map the GPC logic. Our heuristic mainly targets 6-input LUTs from Xilinx FPGAs that can implement a single 6-input function or two 5-input functions with shared inputs. However, the same heuristic can be used for Altera FPGAs that support decomposition of 6-input LUTs into dual 4-input LUTs, when used in the arithmetic or shared-arithmetic mode. Additionally, in both devices the fast carry chain is used to handle the carry rippling so that there is no increase in the critical path delay.
The rest of the paper is organized as follows: Section 2 briefly introduces the basic preliminaries about the GPCs, the Xilinx and Altera LUT architecture, and the terminology used in this paper. Section 3 discusses the heuristic used to synthesize different GPCs. Synthesis and implementation are carried out in Section 4. Conclusions are drawn in Section 5 and references are listed at the end.

Preliminaries and Terminology
A compressor tree is a circuit that takes $r$ $n$-bit unsigned operands $A_{r-1}, A_{r-2}, \ldots, A_1, A_0$ and generates two output values, sum ($S$) and carry ($C$), such that

$$S + C = \sum_{i=0}^{r-1} A_i. \tag{1}$$

A generalized parallel counter computes the sum of bits having different weights. A GPC is traditionally represented as a tuple $(K_{n-1}, K_{n-2}, \ldots, K_1, K_0; m)$, where $K_i$ denotes the number of input bits of weight $i$ and $m$ is the number of output bits. The upper limit on the value of a GPC is given by

$$M = \sum_{i=0}^{n-1} K_i \, 2^{i}. \tag{2}$$

The efficiency of a GPC is measured by the number of reduced bits in relation to the hardware resources and is given by

$$E = \frac{N_i - N_o}{L}, \tag{3}$$

where $N_i$ is the number of input bits; $N_o$ is the number of output bits; and $L$ is the number of LUTs used. As an example, a (1, 4, 1, 5; 5) GPC has five input bits of weight 0; one input bit of weight 1; four input bits of weight 2; and one input bit of weight 3. The upper limit on the output value is 31 and five bits are required to represent the output.
Logic synthesis is concerned with hardware realization of a desired functionality with minimum possible cost. The cost of a circuit is a measure of its speed, resource utilization, power consumption, or any combination of these. A Boolean network is a directed acyclic graph (DAG) that represents a combinational function. Logic gates, primary inputs (PIs), and primary outputs (POs) within this network are represented by nodes. A node may have zero or more predecessor nodes known as fan-in nodes. Similarly, a node may drive zero or more successor nodes known as fan-out nodes. A network is said to be k-bounded if the fan-in of every node does not exceed k. Each node implements a local function. A global function is implemented by connecting the logic implemented by individual nodes. The level of a node v is the length of the longest path from any PI node to v. Network depth is defined as the largest level of a node in the network. The critical path delay and area of a circuit are measured by the depth and number of LUTs, respectively. The transformation of a Boolean network into targeted logic elements gives the circuit netlist. For FPGAs the targeted element is a k-LUT.
Altera Stratix IV and V FPGAs have the Adaptive Logic Module (ALM) as the basic logic cell. The LUT resources within each ALM are divided into two adaptive LUTs (ALUTs). Normal operating mode uses a combination of these ALUTs within an ALM to implement functions with up to eight different inputs [17]. However, Altera supports specialized arithmetic and shared-arithmetic modes for each Stratix ALM for arithmetic-intensive applications. In these modes each individual LUT can implement two 4-input functions with shared inputs. The arithmetic and shared-arithmetic modes also enable the use of fast carry chains that result in efficient implementation of different arithmetic functions. A typical Stratix IV ALM in arithmetic mode driving the carry chain is shown in the schematic of Figure 1.
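The GPC quantities defined in this section (upper limit on the output value, number of output bits, and efficiency) can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original design flow; the function names are hypothetical, and the tuple is assumed to be listed with the most significant weight first, as in the (1, 4, 1, 5; 5) notation.

```python
def gpc_max_value(counts):
    """Largest output value of a GPC.

    `counts` follows the tuple notation, most significant weight first,
    e.g. (1, 4, 1, 5) for the (1, 4, 1, 5; 5) GPC.
    """
    top = len(counts) - 1
    # counts[pos] input bits each carry weight 2**(top - pos)
    return sum(k << (top - pos) for pos, k in enumerate(counts))


def gpc_output_bits(counts):
    """Number of bits needed to represent the largest output value."""
    return gpc_max_value(counts).bit_length()


def gpc_efficiency(counts, luts):
    """Efficiency as (input bits - output bits) / LUTs used."""
    return (sum(counts) - gpc_output_bits(counts)) / luts


if __name__ == "__main__":
    gpc = (1, 4, 1, 5)
    print(gpc_max_value(gpc), gpc_output_bits(gpc), gpc_efficiency(gpc, 4))
```

For the (1, 4, 1, 5; 5) example this reproduces the upper limit of 31 and the five output bits stated in the text.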
Xilinx 5th-, 6th-, and 7th-generation FPGAs have 6-input LUTs as basic logic elements. Each logic slice provides combinatorial and synchronous resources, supporting 6-input LUTs, storage elements, function generators, arithmetic logic gates, and a fast carry chain in the form of the CARRY4 primitive [18,19]. The 6-input LUTs can be used in dual mode to implement two 5-input Boolean functions that share inputs, as shown in Figure 2. The carry chain along with the logic gates performs fast arithmetic addition based operations in a slice. Each carry chain supports four-bit operands. The absence of a special arithmetic mode in Xilinx FPGAs sometimes results in inefficient arithmetic circuits in these devices. Also, the use of the carry chain primitive requires an additional XOR gate to be included at each CARRY4 input.
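The dual-output mode of the 6-input LUT can be illustrated with a small behavioural model. The sketch below is in Python and assumes the usual organisation of the Xilinx dual-output LUT: the lower 32 truth-table bits drive the O5 output, and the O6 output selects between the upper and lower halves with the sixth input, which is tied high in dual mode. The helper names `pack_init` and `lut6_2` are hypothetical, and the full-adder functions are only an illustrative choice of two functions sharing inputs.

```python
def pack_init(f_o6, f_o5):
    """Pack two 5-input functions into one 64-bit truth table.

    Bits 31..0 hold the O5 function; bits 63..32 hold the function
    seen on O6 when the sixth input is tied high.
    """
    init = 0
    for addr in range(32):
        bits = [(addr >> i) & 1 for i in range(5)]
        if f_o5(*bits):
            init |= 1 << addr
        if f_o6(*bits):
            init |= 1 << (32 + addr)
    return init


def lut6_2(init, i0, i1, i2, i3, i4, i5=1):
    """Behavioural model of a dual-output 6-input LUT; returns (O6, O5)."""
    addr5 = i0 | (i1 << 1) | (i2 << 2) | (i3 << 3) | (i4 << 4)
    o5 = (init >> addr5) & 1
    o6 = (init >> (addr5 | (i5 << 5))) & 1
    return o6, o5


def fa_sum(a, b, c, *_unused):
    """Full-adder sum; the two remaining LUT inputs are ignored."""
    return a ^ b ^ c


def fa_carry(a, b, c, *_unused):
    """Full-adder carry (majority of the three inputs)."""
    return (a & b) | (a & c) | (b & c)


# One LUT produces both full-adder outputs from the same shared inputs.
INIT = pack_init(fa_sum, fa_carry)
```

Calling `lut6_2(INIT, a, b, c, 0, 0)` returns the sum on the first output and the carry on the second for every combination of `a`, `b`, and `c`.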

GPC Mapping Heuristic
This section describes the heuristic for efficiently mapping the GPC logic onto LUTs. The primary goal of the heuristic is to map the GPCs onto the minimum possible number of LUTs. The heuristic involves elimination of the carry logic; combination of the redundant nodes; covering and restructuring of the Boolean network; and finally reinsertion of the carry logic. We explain the different steps involved in the heuristic by considering the mapping of the (1, 4, 1, 5; 5) GPC, whose conventional implementation requires four LUTs and a CARRY4 primitive. The same steps can be used for Altera FPGAs; however, the availability of the special arithmetic mode makes the mapping process highly efficient and the combination step of the heuristic is only used in some complex cases. Figure 3 shows the Boolean network for the (1, 4, 1, 5; 5) GPC. The carry logic is shown by the shaded portion. All the primary inputs, primary outputs, and intermediate signals have been labeled. Figure 4 shows the conventional implementation using the fast carry chain [15].

Elimination.
In the elimination step the logic implemented by the carry chain is eliminated from the GPC. For a (1, 4, 1, 5; 5) GPC the Boolean network after the elimination of the carry chain is shown in Figure 5. The GPC consists of four separate networks and each of these networks is mapped onto a separate LUT. Note that for Altera FPGAs the XOR gates in each LUT will be eliminated.

Combination.
In this step the heuristic looks for redundant nodes for possible combination. The feasibility of combination is determined by the total number of outputs of the networks whose nodes are being combined. If the total number of outputs does not exceed two, then the nodes can be combined to eliminate the redundancy. For example, in the GPC network of Figure 5, network 1 and network 2 have a common full adder node with the same inputs. Each network has only one output, so the total number of outputs does not exceed two. The redundant nodes from these two networks are thus combined, such that the two networks now share a common node. Similarly, the full adder nodes in network 3 and network 4 share common inputs and thus can be combined into a single node. The Boolean network for a (1, 4, 1, 5; 5) GPC after node combination is shown in Figure 6.
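The feasibility rule of the combination step can be expressed compactly. The following Python sketch is only an illustration of that rule under simplifying assumptions: a network is modelled as a set of nodes plus a tuple of output names, two nodes are considered redundant when they have the same function and the same inputs, and the class and signal names (`Node`, `Network`, `a0`, `p0`, and so on) are hypothetical rather than taken from the figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    func: str          # e.g. "FA" for a full adder
    inputs: frozenset  # names of the signals feeding the node


@dataclass(frozen=True)
class Network:
    nodes: frozenset   # set of Node objects
    outputs: tuple     # names of the network outputs


def can_combine(a, b, max_outputs=2):
    """Two networks may be combined if they share a redundant node and
    their total number of outputs does not exceed `max_outputs`."""
    shared = a.nodes & b.nodes
    return bool(shared) and len(a.outputs) + len(b.outputs) <= max_outputs


def combine(a, b):
    """Merge two networks; the shared node is kept only once."""
    return Network(a.nodes | b.nodes, a.outputs + b.outputs)


# Hypothetical example: two single-output networks built on the same full adder.
fa = Node("FA", frozenset({"a0", "a1", "a2"}))
net1 = Network(frozenset({fa, Node("XOR", frozenset({"fa_s", "a3"}))}), ("p0",))
net2 = Network(frozenset({fa, Node("AND", frozenset({"fa_s", "a3"}))}), ("g0",))

if can_combine(net1, net2):
    merged = combine(net1, net2)
```

In this example the merged network holds three nodes instead of four, because the common full adder is no longer duplicated.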

Covering and Restructuring.
The Boolean network after combination is traversed in a post-order depth-first fashion and the individual nodes are covered with suitable LUTs. Since we are targeting 6-input LUTs with dual output capability, the aim is to completely utilize the LUTs. For the combined network of Figure 6, the covering process is straightforward. Each of the combined networks has five inputs and two outputs and will map optimally onto 6-input LUTs when operated in dual output mode, as shown in Figure 7. However, in some cases the number of outputs per combined network may exceed two. In such cases the heuristic restructures the combined network by either moving the nodes from one network to the adjacent one or duplicating the nodes to create additional networks. For example, the covering of the (0, 6; 3) GPC is shown in Figure 8. The Boolean network obtained after carry chain elimination has five inputs and four outputs and would require two 6-input LUTs. This is possible if the individual nodes are duplicated to form two networks which are then mapped onto two separate LUTs. For Altera FPGAs the covering process is straightforward. Each of the full adders can be mapped onto a single LUT operating in the arithmetic mode to obtain both sum and carry outputs.
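The covering check described above can be sketched as a post-order traversal that collects the primary inputs feeding a group of outputs and tests them against the dual-output LUT limits (at most five shared inputs and two outputs). This is a minimal Python illustration, not the authors' implementation; the network is assumed to be a dictionary mapping each gate to its fan-ins, primary inputs are the names that do not appear as keys, and the example network is hypothetical.

```python
def post_order(network, root, seen=None):
    """Yield the nodes in the cone of `root`, fan-ins before the node itself."""
    if seen is None:
        seen = set()
    if root in seen:
        return
    seen.add(root)
    for fanin in network.get(root, ()):
        yield from post_order(network, fanin, seen)
    yield root


def support(network, root):
    """Primary inputs that `root` depends on."""
    return {n for n in post_order(network, root) if n not in network}


def fits_dual_output_lut(network, outputs, max_inputs=5, max_outputs=2):
    """True if `outputs` can share one LUT used in dual-output mode."""
    inputs = set()
    for out in outputs:
        inputs |= support(network, out)
    return len(outputs) <= max_outputs and len(inputs) <= max_inputs


# Hypothetical combined network: five inputs, shared node "s", outputs "p" and "g".
network = {
    "s": ("x0", "x1", "x2"),
    "c": ("x0", "x1", "x2"),
    "p": ("s", "x3", "x4"),
    "g": ("s", "x3"),
}

ok_two_outputs = fits_dual_output_lut(network, ("p", "g"))
ok_three_outputs = fits_dual_output_lut(network, ("p", "g", "c"))
```

A group that fails the check is a candidate for the restructuring step, in which nodes are moved or duplicated until every group fits.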

Reinsertion.
After the covering and restructuring, the carry logic is inserted back into the GPC structure by simply including the fast carry chain in the GPC network. For the (1, 4, 1, 5; 5) GPC this is shown in Figure 9. The LUT count is reduced to two and there is a 100% increase in the efficiency.
Different GPCs from prior work were implemented using the proposed heuristic. An increase in efficiency was observed in most of the cases. The implemented circuits for different GPCs are shown in Figures 10 and 11 for Xilinx FPGAs and in Figures 12 and 13 for Altera FPGAs. A theoretical evaluation of different GPCs is listed in Table 1.
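The 100% figure follows directly from the efficiency definition, taken here as the number of input bits minus the number of output bits, divided by the number of LUTs. As a worked check for the (1, 4, 1, 5; 5) GPC, which has 11 input bits and 5 output bits, with the subscripts "conv" and "prop" used here only as labels for the four-LUT conventional mapping and the two-LUT proposed mapping:

$$E_{\text{conv}} = \frac{11 - 5}{4} = 1.5, \qquad E_{\text{prop}} = \frac{11 - 5}{2} = 3, \qquad \frac{E_{\text{prop}} - E_{\text{conv}}}{E_{\text{conv}}} = \frac{3 - 1.5}{1.5} = 100\%.$$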

Implementation and Results
The main aim of the proposed heuristic is to improve the efficiency of the GPCs as defined in Section 2, since this is the performance metric used in the prior literature. From an implementation point of view this means that the mapped GPCs will utilize the underlying FPGA fabric more efficiently. Since dynamic power dissipation in FPGAs is a function of the amount of logic utilized, a reduction in the mapped logic will result in reduced dynamic power dissipation. Another useful side effect of area reduction is the reduction in the critical path, which leads to increased speed. However, this occurs only in GPCs where the reduced logic is part of the critical path. In our implementation, therefore, area and power are the primary metrics of concern.
Table 2 footnotes: (1) delay associated with LUT; (2) delay associated with routing; (3) delay associated with carry chain.
Synthesis and implementation are done using the XC5VLX30-2FF324 device from Virtex-5; the XC6VLX75T-2FF484 device from Virtex-6; and the XC7K70T-2FBG676 device from Kintex-7 FPGAs. The parameters considered are resource utilization (in terms of LUTs) and power delay product (PDP). Constraints relating to synthesis and implementation are duly provided and complete timing closure is ensured in each case. Design entry is done using VHDL. However, instead of writing inferential code, we have adopted an instantiation based coding strategy. This complicates the design entry but gives better control over the mapping. Dynamic timing analysis is done for each GPC to verify the functionality after Placement and Routing (PAR). This is done by applying different test vectors and checking for correct output vectors. Dynamic timing analysis gives information about the switching activity of the design, which is captured in the value change dump (VCD) file. Apart from post-PAR timing analysis, the functionality of the design is also verified by downloading the design onto the Virtex-5 platform. Synthesis and implementation are carried out in Xilinx ISE 14.2 [20] with speed as the optimization goal. Power analysis is done using the XPower Analyzer tool. For power analysis, switching activity is provided by the VCD file obtained during dynamic timing analysis. Similar test benches have been used to ensure a fair comparison.
For initial comparison we have implemented all the GPCs reported in prior work and compared their performance against the implementation based on the proposed heuristic. Table 2 provides a comparison of performance metrics for different GPCs. From Table 2 it is observed that most of the GPC mappings based on the proposed heuristic show an increase in area efficiency ranging from 33% to 100%. Since the underlying LUT fabric is utilized efficiently, there is a reduction in resources utilized. This results in reduced power dissipation. The critical path delay is also reduced in GPCs where the reduced logic is a part of the critical path. Our implementation shows an average reduction in power delay product of more than 20%.
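The power delay product used in this comparison is the product of the dissipated power and the critical path delay, and the reduction is expressed relative to the conventional mapping. A minimal Python sketch of this arithmetic follows; the function names are hypothetical and the numbers in the example are placeholders chosen for illustration, not measurements from Table 2.

```python
def power_delay_product(power_w, delay_s):
    """Power delay product in joules (watts times seconds)."""
    return power_w * delay_s


def percent_reduction(baseline, improved):
    """Reduction of `improved` relative to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Placeholder values only: 20 mW vs 16 mW at the same 5 ns critical path.
pdp_conventional = power_delay_product(0.020, 5e-9)
pdp_proposed = power_delay_product(0.016, 5e-9)
reduction = percent_reduction(pdp_conventional, pdp_proposed)
```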
We have also implemented FIR filters using different GPCs. The implementation is carried out for different filter orders and for an operand word-length of 16 bits. The filter structures are based on fixed-point array multipliers and multioperand adders. Each of these units is constructed using the GPCs. Since FPGAs provide a high potential for pipelining, we have used both combinational and pipelined versions of these individual units. For purely combinational structures, critical path delay is used as a metric for speed, and for pipelined designs throughput gives an idea about the speed of the structure. For high throughput DSP systems it is more appropriate to quantify the power efficiency through energy analysis. In our implementation we have used three energy related parameters for FIR systems. These include energy per operation (EOP), which is the average amount of energy required to compute one operation; energy throughput (ET), which is the energy dissipated for every output bit processed; and energy density (ED), which is the energy dissipated per LUT. We have used GPCs with maximum efficiencies from [8,9,15] and compared their performance against those based on the proposed heuristic. Figures 14-19 provide the performance comparison of the filters based on different GPCs implemented using the conventional methods and using our proposed heuristic. Note that the Xilinx synthesizer uses its own optimization strategies during the mapping process. In our implementation we have done separate analysis for area and speed. This is done by selecting the desired optimization goal prior to synthesis and implementation. The optimization effort in each case is selected to be high.
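The three energy parameters can be related to quantities reported by the tools. The Python sketch below is one plausible reading of the verbal definitions, since no formulas are given in the text: it assumes that EOP is dynamic power multiplied by the time per operation, that ET is EOP divided by the number of output bits, and that ED is EOP divided by the number of LUTs. The function name and the example values are hypothetical and are not results from the figures.

```python
def energy_metrics(dynamic_power_w, time_per_op_s, output_bits, luts):
    """Return EOP, ET and ED in joules under the stated assumptions.

    EOP: energy per operation  = power * time per operation
    ET : energy per output bit = EOP / output bits   (assumed form)
    ED : energy per LUT        = EOP / LUTs          (assumed form)
    """
    eop = dynamic_power_w * time_per_op_s
    return {
        "EOP": eop,
        "ET": eop / output_bits,
        "ED": eop / luts,
    }


# Placeholder values only: 50 mW, one result every 4 ns, 32 output bits, 400 LUTs.
metrics = energy_metrics(0.050, 4e-9, 32, 400)
```

For a pipelined filter the time per operation would be the clock period, while for a combinational one it would be the critical path delay.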
Finally, we have compared our filter implementation against those based on integrated DSP blocks and IP cores. DSP based filters have adders and multipliers constructed using DSP macros. For IP based filters the adder unit is the LogiCORE IP Adder/Subtracter v11.0 and the multiplier unit is the LogiCORE IP Multiplier v11.2. Although these specialized inbuilt resources are highly optimized, they do suffer from some disadvantages like fixed bit-width, limited number, and so forth [9,21]. However, their biggest drawback is that they remain fixed in the FPGA fabric. This limits the ability of the synthesizer to alter their position during the PAR phase of the design cycle, and sometimes the post-PAR performance may be highly degraded. In our analysis we have synthesized 16-tap direct form filters with an input bit-width of 16 bits. The target platform is from Kintex-7 and the optimization goal is speed. The results provided in Table 3 show that the performance of our design is comparable to filters based on these specialized resources.

Conclusions
GPCs form an inherent part of high speed compressors. In this work we proposed a heuristic that maps GPCs onto the minimum possible number of LUTs by exploiting the improved logic handling capability of modern day FPGAs. A comparative analysis of our implementation against prior work showed a reduction in LUT count and the average power dissipated. This resulted in an increased compression efficiency in most of the GPCs. Filter structures based on our modified GPCs show enhanced performance when compared to the conventional GPC-based filters. We also compared our results against filters based on specialized resources like DSP macros and IP cores. The results indicated that the performance of our design is comparable with these specialized resources. Our future work aims at efficiently pipelining the GPCs by eliminating the carry chain and using only a combination of LUTs and registers to implement the GPCs.

Figure 10: Area optimal mapping for different GPCs using proposed heuristic on Xilinx FPGAs.

Figure 11: Area optimal mapping for different GPCs using proposed heuristic on Xilinx FPGAs.

Figure 15: Critical path delay (in ns) for filters based on different GPCs on Kintex-7 FPGA.

Figure 17: Energy per operation (in nJ) for filters based on different GPCs on Kintex-7 FPGA.

Table 3: Performance comparison of the proposed GPC-based filters with DSP- and IP-based filters.