Generalized parallel counters (GPCs) are used in constructing high-speed compressor trees. Prior work has focused on utilizing the fast carry chain and mapping the logic onto look-up tables (LUTs). This mapping is suboptimal because the LUT fabric is not fully utilized, which results in low-efficiency GPCs. In this work, we present a heuristic that efficiently maps the GPC logic onto the LUT fabric. We have applied the heuristic to various GPCs and achieved an improvement in efficiency ranging from 33% to 100% in most cases. Experimental results using Xilinx 5th-, 6th-, and 7th-generation FPGAs and Stratix IV and V devices from Altera show a considerable reduction in resource utilization and dynamic power dissipation for almost the same critical path delay. We have also implemented GPC-based FIR filters on 7th-generation Xilinx FPGAs using the proposed heuristic and compared their performance against conventional implementations; the implementations based on our heuristic perform better. Comparisons are also made against filters based on integrated DSP blocks and IP cores from Xilinx. The results show that the proposed heuristic provides performance comparable to structures based on these specialized resources.
Multioperand addition is an important operation in many arithmetic circuits. It is frequently used in many applications like filtering [
Prior work on compressor tree synthesis using FPGAs has used GPCs as the basic constituent element. It has been demonstrated that the use of GPCs can lead to a considerable reduction in the critical path delay with comparable resource utilization [
Matsunaga et al. [
Recent attempts from Kumm and Zipf [
All the abovementioned approaches (except [
The rest of the paper is organized as follows: Section
A compressor tree is a circuit that takes
A generalized parallel counter computes the sum of bits having different weights. A GPC is traditionally represented as a tuple (
The efficiency of a GPC is measured by the number of reduced bits in relation to the hardware resources and is given by
Logic synthesis is concerned with hardware realization of a desired functionality with minimum possible cost. The
Altera Stratix IV and V FPGAs use the Adaptive Logic Module (ALM) as the basic logic cell. The LUT resources within each ALM are divided into two adaptive LUTs (ALUTs). The normal operating mode uses a combination of these ALUTs within an ALM to implement functions with up to eight different inputs [
Stratix IV, V ALM when used in arithmetic mode.
Xilinx 5th-, 6th-, and 7th-generation FPGAs use 6-input LUTs as their basic logic elements. Each logic slice provides combinatorial and synchronous resources, including 6-input LUTs, storage elements, function generators, arithmetic logic gates, and a fast carry chain in the form of the CARRY4 primitive [
Xilinx 6-input dual LUT for 5th-, 6th-, and 7th-generation FPGAs.
This section describes the heuristic for efficiently mapping the GPC logic onto LUTs. The primary goal of the heuristic is to map the GPCs onto the minimum possible number of LUTs. The heuristic involves
Figure
Boolean network for (1, 4, 1, 5; 5) GPC.
FPGA realization of (1, 4, 1, 5; 5) GPC using fast carry chain.
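As a point of reference for the networks discussed in this section, the arithmetic function of a GPC can be modelled behaviourally. The Python sketch below is illustrative only and is not part of the design flow described in this paper. It assumes the usual convention that the rightmost entry of the tuple counts the input bits of weight 2^0, and the names `gpc_sum` and `output_width` are hypothetical.

```python
def gpc_sum(columns, bits):
    """Weighted sum computed by a GPC.

    columns: input counts per column, most significant first, e.g. (1, 4, 1, 5)
    bits:    one tuple of 0/1 values per column, in the same order
    """
    n = len(columns)
    total = 0
    for idx, (count, column_bits) in enumerate(zip(columns, bits)):
        if len(column_bits) != count:
            raise ValueError("column %d expects %d bits" % (idx, count))
        weight = 1 << (n - 1 - idx)  # assumption: rightmost column has weight 2**0
        total += weight * sum(column_bits)
    return total


def output_width(columns):
    """Number of output bits needed to hold the largest possible sum."""
    n = len(columns)
    max_sum = sum(count << (n - 1 - idx) for idx, count in enumerate(columns))
    return max_sum.bit_length()


# All inputs of a (1, 4, 1, 5; 5) GPC set to 1: 8 + 16 + 2 + 5
all_ones = [(1,), (1, 1, 1, 1), (1,), (1, 1, 1, 1, 1)]
print(gpc_sum((1, 4, 1, 5), all_ones))  # 31
print(output_width((1, 4, 1, 5)))       # 5
```

Under this convention the largest sum of a (1, 4, 1, 5; 5) GPC is 31, which fits in the five output bits given in the tuple.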
In the elimination step the logic implemented by the carry chain is eliminated from the GPC. For a (1, 4, 1, 5; 5) GPC the Boolean network after the elimination of the carry chain is shown in Figure
Boolean network for (1, 4, 1, 5; 5) GPC with no carry logic.
In this step the heuristic looks for redundant nodes that can be combined. The feasibility of a combination is determined by the total number of outputs of the networks whose nodes are being combined: if the total number of outputs does not exceed two, the nodes can be combined to eliminate the redundancy. For example, in the GPC network of Figure
Boolean network for (1, 4, 1, 5; 5) GPC after node combination.
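The feasibility rule for node combination can be written down in a few lines. The sketch below encodes only the criterion stated in the text (the combined outputs must not exceed two). The `Node` record and the function names are hypothetical, and any further constraints a real mapper would apply, such as limits on shared inputs, are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical record for a sub-network of the Boolean network."""
    name: str
    inputs: frozenset
    n_outputs: int


def can_combine(a, b, max_outputs=2):
    """Rule from the text: total number of outputs must not exceed two."""
    return a.n_outputs + b.n_outputs <= max_outputs


def combine(a, b):
    """Merge two nodes into one dual-output node if the rule allows it."""
    if not can_combine(a, b):
        raise ValueError("combined node would need more than two outputs")
    return Node(a.name + "+" + b.name, a.inputs | b.inputs,
                a.n_outputs + b.n_outputs)


# Two single-output nodes sharing the same inputs (illustrative names only)
x = Node("x", frozenset({"a0", "a1", "a2"}), 1)
y = Node("y", frozenset({"a0", "a1", "a2"}), 1)
xy = combine(x, y)
print(xy.n_outputs, sorted(xy.inputs))  # 2 ['a0', 'a1', 'a2']
print(can_combine(xy, x))               # False
```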
The Boolean network after combination is traversed in a post-order depth-first fashion and the individual nodes are covered with suitable LUTs. Since we are targeting 6-input LUTs with dual output capability, the aim is to completely utilize the LUTs. For the combined network of Figure
Covering of the combined network using 6-input dual LUTs.
Covering for (0, 6; 3) GPC. (a) Combined network. (b) Restructuring for optimal covering.
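The traversal order used in the covering step can be illustrated with a small sketch. The network representation (a dictionary from each internal node to its fan-in nodes), the example network, and the 6-input fit check are assumptions made for illustration; they are not taken from the paper's implementation.

```python
def post_order(network, root):
    """Internal nodes reachable from root, fan-ins first (post-order DFS)."""
    order, seen = [], set()

    def visit(node):
        # Names that are not keys of the dict are treated as primary inputs
        if node in seen or node not in network:
            return
        seen.add(node)
        for fanin in network[node]:
            visit(fanin)
        order.append(node)

    visit(root)
    return order


def support(network, node):
    """Set of primary inputs that a node ultimately depends on."""
    if node not in network:
        return {node}
    result = set()
    for fanin in network[node]:
        result |= support(network, fanin)
    return result


# Hypothetical three-node network over six primary inputs a..f
net = {"s": ["x", "y"], "x": ["a", "b", "c"], "y": ["d", "e", "f"]}
for node in post_order(net, "s"):
    fits = len(support(net, node)) <= 6  # assumed limit: one 6-input LUT
    print(node, sorted(support(net, node)), fits)
```

Visiting fan-ins before the nodes they feed means that, by the time a node is considered for covering, the logic below it has already been assigned.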
After the covering and restructuring, the carry logic is inserted back into the GPC structure by simply including the fast carry chain in the GPC network. For (1, 4, 1, 5; 5) GPC this is shown in Figure
Area optimal mapping for (1, 4, 1, 5; 5) GPC.
Different GPCs from prior work were implemented using the proposed heuristic. An increase in efficiency was observed in most of the cases. The implemented circuits for different GPCs are shown in Figures
Comparison of different GPCs.
GPCs | Previous mappings | | | Mappings based on proposed | | | Mappings based on proposed | | |
---|---|---|---|---|---|---|---|---|---|
 | LUTs | Delay | Efficiency | LUTs | Delay | Efficiency | LUTs | Delay | Efficiency |
(3; 2) | 1 | | 1 | 1 | | 1 | 1 | | 1 |
(6; 3) | 3 | | 1 | | | | | | |
(1, 5; 3) | 3 | | 1 | | | | | | |
 | | | | | | | | | |
(6; 3) | 4 | 2 | 0.75 | | | | | | |
(1, 5; 3) | 3 | | 1 | | | | | | |
(2, 3; 3) | 3 | | 0.67 | | | | | | — |
(7; 3) | 4 | 2 | 1 | | | | | | |
(1, 6; 4) | 4 | 2 | 0.75 | | | | | | |
(3, 5; 4) | 4 | 2 | 1 | | | | | | |
(4, 4; 4) | 4 | 2 | 1 | | | | | | |
(5, 3; 4) | 4 | 2 | 1 | | | | | | |
(6, 2; 4) | 4 | 2 | 1 | | | | | | |
 | | | | | | | | | |
(6; 3) | 3 | 2 | 1 | | | | | | |
(1, 5; 3) | 2 | | 1.5 | | | | | | |
(2, 3; 3) | 2 | | 1 | | | | | | — |
(7; 3) | 3 | 2 | 1.33 | | | | | | |
(5, 3; 4) | 3 | 2 | 1.33 | 3 | | 1.33 | | | |
(6, 2; 4) | 3 | 2 | 1.33 | | 2 | | | | |
(5, 0, 6; 5) | 4 | | 1.5 | 4 | | 1.5 | 4 | | 1.5 |
(1, 4, 1, 5; 5) | 4 | | 1.5 | | | | | | |
(1, 4, 0, 6; 5) | 4 | | 1.5 | | | | | | |
(2, 0, 4, 5; 5) | 4 | 2 | 1.5 | 4 | 2 | 1.5 | 4 | 2 | 1.5 |

² Delay associated with routing.
³ Delay associated with carry chain.
Area optimal mapping for different GPCs using proposed heuristic on Xilinx FPGAs.
Area optimal mapping for different GPCs using proposed heuristic on Xilinx FPGAs.
Area optimal mapping for different GPCs using proposed heuristic on Altera FPGAs.
Area optimal mapping for different GPCs using proposed heuristic on Altera FPGAs.
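The efficiency values reported for the previous mappings are consistent with dividing the number of reduced bits (inputs minus outputs) by the LUT count. The sketch below assumes that reading of the definition; the function name `gpc_efficiency` is hypothetical.

```python
def gpc_efficiency(columns, n_outputs, n_luts):
    """Assumed definition: (input bits - output bits) / LUTs used."""
    return (sum(columns) - n_outputs) / n_luts


# Previous mappings listed in the comparison table
print(round(gpc_efficiency((6,), 3, 4), 2))          # 0.75
print(round(gpc_efficiency((2, 3), 3, 3), 2))        # 0.67
print(round(gpc_efficiency((1, 4, 1, 5), 5, 4), 2))  # 1.5

# Effect of saving LUTs: a (7; 3) GPC going from 4 LUTs to 2 LUTs
before = gpc_efficiency((7,), 3, 4)
after = gpc_efficiency((7,), 3, 2)
print(100 * (after - before) / before)               # 100.0
```

With this definition, reducing a GPC from four LUTs to three raises its efficiency by about 33%, and halving the LUT count doubles it, which matches the 33% to 100% range quoted in the abstract.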
The main aim of the proposed heuristic is to improve the efficiency of the GPCs as defined in Section
Synthesis and implementation are done using the XC5VLX30-2FF324 device from Virtex-5, the XC6VLX75T-2FF484 device from Virtex-6, and the XC7K70T-2FBG676 device from Kintex-7. The parameters considered are resource utilization (in terms of LUTs) and power-delay product (PDP). Synthesis and implementation constraints are provided and complete timing closure is ensured in each case. Design entry is done in VHDL. Instead of writing inferential code, however, we have adopted an instantiation-based coding strategy; this complicates design entry but gives better control over the mapping. Dynamic timing analysis is performed for each GPC to verify its functionality after placement and routing (PAR), by applying different test vectors and checking for the correct output vectors. Dynamic timing analysis also gives information about the switching activity of the design, which is captured in the value change dump (VCD) file. Apart from post-PAR timing analysis, the functionality of the design is also verified by downloading the design onto the Virtex-5 platform. Synthesis and implementation are carried out in Xilinx ISE 14.2 [
For initial comparison we have implemented all the GPCs reported in prior work and compared their performance against the implementation based on the proposed heuristic. Table
Performance comparison of different GPCs on XC5VLX30-2FF324.
GPCs | XC5VLX30-2FF324 | XC6VLX75T-2FF484 | XC7K70T-2FBG676 | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Previous | Proposed | Previous | Proposed | Previous | Proposed | |||||||
LUTs | PDP | LUTs | PDP | LUTs | PDP | LUTs | PDP | LUTs | PDP | LUTs | PDP | |
|
||||||||||||
(3; 2) | 1 | 0.008 | 1 | 0.008 | 1 | 0.021 | 1 | 0.021 | 1 | 0.007 | 1 | 0.007 |
(6; 3) | 3 | 0.021 | 2 | 0.019 | 3 | 0.054 | 2 | 0.05 | 3 | 0.018 | 2 | 0.017 |
(1, 5; 3) | 3 | 0.021 | 1 | 0.015 | 3 | 0.054 | 1 | 0.039 | 3 | 0.018 | 1 | 0.013 |
|
||||||||||||
|
||||||||||||
(6; 3) | 4 | 0.043 | 2 | 0.019 | 4 | 0.114 | 2 | 0.05 | 4 | 0.039 | 2 | 0.017 |
(1, 5; 3) | 3 | 0.024 | 1 | 0.015 | 3 | 0.061 | 1 | 0.039 | 3 | 0.021 | 1 | 0.013 |
(2, 3; 3) | 3 | 0.024 | 1 | 0.015 | 3 | 0.061 | 1 | 0.039 | 3 | 0.021 | 1 | 0.013 |
(7; 3) | 4 | 0.041 | 2 | 0.03 | 4 | 0.118 | 2 | 0.085 | 4 | 0.039 | 2 | 0.028 |
(1, 6; 4) | 4 | 0.041 | 3 | 0.035 | 4 | 0.118 | 3 | 0.098 | 4 | 0.039 | 3 | 0.032 |
(3, 5; 4) | 4 | 0.041 | 2 | 0.032 | 4 | 0.118 | 2 | 0.091 | 4 | 0.039 | 2 | 0.03 |
(4, 4; 4) | 4 | 0.041 | 3 | 0.035 | 4 | 0.118 | 3 | 0.098 | 4 | 0.039 | 3 | 0.032 |
(5, 3; 4) | 4 | 0.041 | 3 | 0.022 | 4 | 0.118 | 3 | 0.057 | 4 | 0.039 | 3 | 0.019 |
(6, 2; 4) | 4 | 0.041 | 2 | 0.032 | 4 | 0.118 | 2 | 0.091 | 4 | 0.039 | 2 | 0.03 |
|
||||||||||||
|
||||||||||||
(6; 3) | 3 | 0.041 | 2 | 0.019 | 3 | 0.105 | 3 | 0.05 | 3 | 0.036 | 3 | 0.017 |
(1, 5; 3) | 2 | 0.021 | 1 | 0.015 | 2 | 0.053 | 2 | 0.039 | 2 | 0.018 | 2 | 0.013 |
(2, 3; 3) | 2 | 0.021 | 1 | 0.015 | 2 | 0.053 | 2 | 0.039 | 2 | 0.018 | 2 | 0.013 |
(7; 3) | 3 | 0.036 | 2 | 0.03 | 3 | 0.102 | 3 | 0.085 | 3 | 0.034 | 3 | 0.028 |
(5, 3; 4) | 3 | 0.036 | 3 | 0.022 | 3 | 0.102 | 3 | 0.057 | 3 | 0.034 | 3 | 0.019 |
(6, 2; 4) | 3 | 0.036 | 2 | 0.032 | 3 | 0.102 | 3 | 0.091 | 3 | 0.034 | 3 | 0.03 |
(5, 0, 6; 5) | 4 | 0.026 | 4 | 0.026 | 4 | 0.063 | 4 | 0.063 | 4 | 0.022 | 4 | 0.021 |
(1, 4, 1, 5; 5) | 4 | 0.026 | 2 | 0.021 | 4 | 0.063 | 4 | 0.051 | 4 | 0.022 | 4 | 0.017 |
(1, 4, 0, 6; 5) | 4 | 0.033 | 3 | 0.03 | 4 | 0.066 | 4 | 0.061 | 4 | 0.023 | 4 | 0.02 |
(2, 0, 4, 5; 5) | 4 | 0.043 | 4 | 0.043 | 4 | 0.079 | 4 | 0.079 | 4 | 0.027 | 4 | 0.027 |
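The savings in the table can be expressed as relative reductions. The helper below is a plain arithmetic sketch (the name `percent_reduction` is hypothetical), applied to the XC5VLX30-2FF324 entries for the (6; 3) GPC in the second group of rows, where the previous mapping uses 4 LUTs with a PDP of 0.043 and the proposed mapping uses 2 LUTs with a PDP of 0.019.

```python
def percent_reduction(previous, proposed):
    """Relative saving of the proposed value with respect to the previous one."""
    return 100.0 * (previous - proposed) / previous


lut_saving = percent_reduction(4, 2)
pdp_saving = percent_reduction(0.043, 0.019)
print(f"LUT reduction: {lut_saving:.1f}%  PDP reduction: {pdp_saving:.1f}%")
# LUT reduction: 50.0%  PDP reduction: 55.8%
```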
We have also implemented FIR filters using different GPCs. The implementation is carried out for different filter orders and for an operand word length of 16 bits. The filter structures are based on fixed-point array multipliers and multioperand adders, each of which is constructed using the GPCs. Since FPGAs offer high potential for pipelining, we have used both combinational and pipelined versions of these units. For purely combinational structures, critical path delay is used as the metric for speed; for pipelined designs, throughput indicates the speed of the structure. For high-throughput DSP systems it is more appropriate to quantify power efficiency through energy analysis. In our implementation we have used three energy-related parameters for FIR systems: energy per operation (EOP), which is the average amount of energy required to compute one operation; energy throughput (ET), which is the energy dissipated for every output bit processed; and energy density (ED), which is the energy dissipated per LUT. We have used GPCs with maximum efficiencies from [
Area usage for filters based on different GPCs on Kintex-7 FPGA.
Critical path delay (in ns) for filters based on different GPCs on Kintex-7 FPGA.
Throughput (in MHz) for filters based on different GPCs on Kintex-7 FPGA.
Energy per operation (in nJ) for filters based on different GPCs on Kintex-7 FPGA.
Energy throughput (in nJ/bit) for filters based on different GPCs on Kintex-7 FPGA.
Energy density (in nJ/LUT) for filters based on different GPCs on Kintex-7 FPGA.
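The three energy parameters can be related to one another with simple arithmetic. The sketch below is one possible reading of the verbal definitions, not the exact procedure used to produce the figures. It assumes one operation per clock cycle in the pipelined design, so that EOP is average power divided by throughput, and it derives ET and ED from EOP. The function name and all numerical inputs are hypothetical.

```python
def energy_metrics(power_mw, throughput_mhz, output_bits, n_luts):
    """Return (EOP in nJ, ET in nJ/bit, ED in nJ/LUT).

    Assumption: one operation completes per clock cycle, so the energy per
    operation is average power / throughput (1 mW / 1 MHz = 1 nJ).
    """
    eop = power_mw / throughput_mhz  # energy per operation
    et = eop / output_bits           # energy per output bit processed
    ed = eop / n_luts                # energy per LUT
    return eop, et, ed


# Hypothetical operating point, chosen only to keep the arithmetic simple
eop, et, ed = energy_metrics(power_mw=100.0, throughput_mhz=200.0,
                             output_bits=250, n_luts=500)
print(eop, et, ed)
```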
Finally, we have compared our filter implementation against implementations based on integrated DSP blocks and IP cores. DSP-based filters have adders and multipliers constructed using DSP macros. For IP-based filters the adder unit is the LogiCORE IP Adder/Subtractor v11.0 and the multiplier unit is the LogiCORE IP Multiplier v11.2. Although these specialized built-in resources are highly optimized, they suffer from disadvantages such as fixed bit width, limited number, and so forth [
Performance comparison of the proposed GPC-based filters with DSP-based and IP-based filters.
Filter design | LUTs | Registers (pipelined) | DSP cores | Critical path (ns) | Throughput (pipelined) (MHz) | EOP (nJ) | ET (nJ/bit) |
---|---|---|---|---|---|---|---|
IP based | 747 | 1883 | 0 | 14.701 | 303.366 | 0.765 | 0.003 |
DSP based | 128 | 240 | 31 | 18.398 | 363.086 | 0.581 | 0.0022 |
Proposed (5, 3; 4) | 732 | 1683 | 0 | 7.68 | 355.65 | 0.573 | 0.0022 |
Proposed (6, 2; 4) | 716 | 1683 | 0 | 11.31 | 335.4 | 0.599 | 0.0023 |
Proposed (1, 4, 0, 6; 5) | 693 | 1683 | 0 | 12.69 | 355.21 | 0.571 | 0.0022 |
Proposed (1, 4, 1, 5; 5) | 672 | 1683 | 0 | 12.34 | 355.2 | 0.555 | 0.0021 |
GPCs form an inherent part of high-speed compressor trees. In this work we proposed a heuristic that maps GPCs onto the minimum possible number of LUTs by exploiting the improved logic-handling capability of modern FPGAs. A comparative analysis of our implementation against prior work showed a reduction in LUT count and in average power dissipation, which increased the compression efficiency of most of the GPCs. Filter structures based on our modified GPCs showed enhanced performance compared to conventional GPC-based filters. We also compared our results against filters based on specialized resources such as DSP macros and IP cores; the results indicated that the performance of our design is comparable to that of these specialized resources. Our future work aims at efficiently pipelining the GPCs by eliminating the carry chain and using only a combination of LUTs and registers to implement the GPCs.
The authors declare that there is no conflict of interest regarding the publication of this paper.