
This paper presents optimized implementations of two different pipeline FFT processors on Xilinx Spartan-3 and Virtex-4 FPGAs. Different optimization techniques and rounding schemes were explored. The implementations achieved better performance with lower resource usage than prior art. The 16-bit 1024-point FFT with the R2^{2}SDF architecture had a maximum clock frequency of 95.2 MHz and used 2802 slices on the Spartan-3, for a throughput per area ratio of 0.034 Msamples/s/slice. The R4SDC architecture ran at 123.8 MHz and used 4409 slices on the Spartan-3, for a throughput per area ratio of 0.028 Msamples/s/slice. On the Virtex-4, the 16-bit 1024-point R2^{2}SDF architecture ran at 235.6 MHz and used 2256 slices, giving a ratio of 0.104 Msamples/s/slice; the 16-bit 1024-point R4SDC architecture ran at 219.2 MHz and used 3064 slices, giving a ratio of 0.072 Msamples/s/slice. The R2^{2}SDF was more efficient than the R4SDC in terms of throughput per area due to a simpler controller and an easier balanced rounding scheme. This paper also shows that balanced stage rounding is an appropriate rounding scheme for pipeline FFT processors.

The Fast Fourier Transform (FFT), as an efficient algorithm to compute the Discrete Fourier Transform (DFT), is one of the most important operations in modern digital signal processing and communication systems. The pipeline FFT is a special class of FFT algorithms which can compute the FFT in a sequential manner; it achieves real-time behavior with nonstop processing when data is continually fed through the processor. Pipeline FFT architectures have been studied since the 1970's when real-time large scale signal processing requirements became prevalent. Several different architectures have been proposed, based on different decomposition methods, such as the Radix-2 Multipath Delay Commutator (R2MDC) [^{2} Single-Path Delay Feedback (R2^{2}SDF) [^{2} to Radix-2^{4} SDF FFTs were studied and compared in [^{3}SDF was implemented and shown to be area efficient for 2 or 3 multipath channels. Each of these architectures can be classified as multipath or single-path. Multipath approaches can process

From the hardware perspective, Field Programmable Gate Array (FPGA) devices are increasingly being used for hardware implementations in communications applications. FPGAs at advanced technology nodes can achieve high performance, while having more flexibility, faster design time, and lower cost. As such, FPGAs are becoming more attractive for FFT processing applications and are the target platform of this paper.

The primary goal of this research is to optimize pipeline FFT processors to achieve better performance and lower cost than prior art implementations. In this paper, two comparative implementations (R4SDC and R2^{2}SDF) of pipeline FFT processors targeted towards Xilinx Spartan-3 and Virtex-4 FPGAs are presented. Different parameters such as throughput, area, and SQNR are compared.

The rest of the paper is organized as follows. Section

The major characteristics and resource requirements of several pipeline FFT architectures are listed in Table ^{2} Single Path Delay Feedback (R2^{2}SDF) architectures provide the highest computational efficiency and were selected for implementation. The R4SDC architecture is appealing due to the computational efficiency of its addition; however the controller design is complex. The R2^{2}SDF architecture has a simple controller but a less efficient addition scheme. These designs are both radix-4 and scalable to an arbitrary FFT size

Hardware resource requirements comparison of pipeline FFT architectures (based on [

Architecture | Complex multipliers | Complex adders | Memory size | Control logic | Add/sub utilization | Multiplier utilization |
---|---|---|---|---|---|---|
R2SDF | | | | simple | 50% | 50% |
R4SDF | | | | medium | 25% | 75% |
R4SDC | | | | complex | 100% | 75% |
R2^{2}SDF | | | | simple | 75% | 75% |
R2MDC | | | | simple | 50% | 50% |
R4MDC | | | | medium | 25% | 25% |

The R4SDC was proposed by Bi and Jones [

R4SDC commutator of stage

R4SDC butterfly element of stage

The R2^{2}SDF architecture was proposed by He and Torkelson [

Using the Common Factor Algorithm (CFA) to decompose the twiddle factor, the FFT can be reconstructed as a set of 4 DFTs of length

The R2^{2}SDF algorithm can be mapped to the architecture shown in Figures

R2^{2}SDF pipeline FFT processor architecture.

R2^{2}SDF BF2 I structure.

R2^{2}SDF BF2 II structure.
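The arithmetic behind the radix-2^{2} decomposition can be checked in software. The following is a minimal sketch, not the authors' VHDL and not a model of the pipeline timing: it assumes a decimation-in-frequency formulation in which a first radix-2 butterfly (the BF2 I role) is followed by a second one that applies the trivial factor -j to part of the difference branch (the BF2 II role), with one full twiddle multiplication per pair of butterflies. The function names `dft` and `fft_radix22` are hypothetical and the output is produced in natural order for easy comparison.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a reference result."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def fft_radix22(x):
    """Recursive radix-2^2 DIF FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    half, quarter = n // 2, n // 4

    # First butterfly (BF2 I role): plain radix-2 add/subtract.
    a = [x[i] + x[i + half] for i in range(half)]
    b = [x[i] - x[i + half] for i in range(half)]

    # Second butterfly (BF2 II role). The factor -j on the difference
    # branch is trivial: a real/imaginary swap plus a sign change.
    c0 = [a[i] + a[i + quarter] for i in range(quarter)]
    c2 = [a[i] - a[i + quarter] for i in range(quarter)]
    c1 = [b[i] - 1j * b[i + quarter] for i in range(quarter)]
    c3 = [b[i] + 1j * b[i + quarter] for i in range(quarter)]

    # One full twiddle multiplication per radix-2^2 stage:
    # branch r is multiplied by W_N^(r*i) and feeds a length-N/4 transform.
    branches = []
    for r, c in enumerate((c0, c1, c2, c3)):
        tw = [c[i] * cmath.exp(-2j * cmath.pi * r * i / n) for i in range(quarter)]
        branches.append(fft_radix22(tw))

    # Branch r holds the outputs X[4k + r].
    out = [0j] * n
    for r, sub in enumerate(branches):
        for k, value in enumerate(sub):
            out[4 * k + r] = value
    return out


if __name__ == "__main__":
    data = [complex(i % 5, -(i % 3)) for i in range(64)]
    err = max(abs(p - q) for p, q in zip(fft_radix22(data), dft(data)))
    print("max deviation from direct DFT:", err)
```

The sketch keeps the radix-4 multiplier count (one non-trivial twiddle multiplication per two butterfly columns) while using only radix-2 butterflies, which is the property the architecture exploits.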

Both of these FFT architectures were implemented with generic synthesizable VHDL code and verified with simulation against Matlab scripts using Modelsim. Synplify or XST was used to perform the synthesis, and ISE was used for place and route and implementation. The architectures were optimized to achieve maximum throughput with minimal area (slices). The tools and development environment used are shown in Table

Implementation tools.

Design step | Tool |
---|---|
VHDL simulation | ModelSim SE 6.2b |
FPGA synthesis | Synplicity Synplify Pro; Xilinx XST |
FPGA implementation | Xilinx ISE 9.1 |
Target FPGA | Spartan-3 Family; Virtex-E Family; Virtex-4 Family |
Verification | Matlab R2006a |

Some general optimization measures were performed, including FSM encoding, retiming, and CAD-related optimizations. Since the FFT processors were targeted to Xilinx Spartan-3 and Virtex-4 FPGAs (and also synthesized for Virtex-E FPGAs), the SRL16 component, which implements a 16-bit shift register within a single LUT, was inferred wherever possible to conserve LUTs. This particularly helped the R2^{2}SDF architecture because of its large number of shift registers. The R4SDC also benefited from SRL16 components in its commutator registers. Block RAMs were used to store the twiddle factors, which dramatically reduced the combinational logic utilization.

A number of architecture-specific optimizations were used. For both architectures, a complex multiplication technique was used. Usually, a complex multiplication is computed as:

This requires 4 multiplications and 2 add/sub operations. As is well known, the equation can be rearranged to save one multiplier:

This requires only 3 multiplications and 5 add/sub operations. Pipeline registers were also added to break the long critical path created by chaining real adders and multipliers. Figure

The pipelined complex multiplier.
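The operation counts quoted above can be illustrated with a short sketch. It assumes the operands are written as (a + jb) and (c + jd), which are symbols chosen here for illustration, and it uses one common three-multiplier rearrangement; the exact grouping used in the paper's multiplier may differ. The function names `cmul4` and `cmul3` are hypothetical.

```python
def cmul4(a, b, c, d):
    """(a + jb)(c + jd) the direct way: 4 multiplications, 2 add/sub."""
    real = a * c - b * d
    imag = a * d + b * c
    return real, imag


def cmul3(a, b, c, d):
    """Same product with 3 multiplications and 5 add/sub operations.

    The product d * (a - b) is shared between the real and imaginary parts.
    """
    shared = d * (a - b)          # 1 subtraction, 1 multiplication
    real = a * (c - d) + shared   # 1 subtraction, 1 multiplication, 1 addition
    imag = b * (c + d) + shared   # 1 addition, 1 multiplication, 1 addition
    return real, imag


if __name__ == "__main__":
    print(cmul4(3, -4, 5, 7), cmul3(3, -4, 5, 7))
```

Expanding the terms confirms the identity: a(c - d) + d(a - b) = ac - bd and b(c + d) + d(a - b) = ad + bc. In hardware, the extra adders lengthen the combinational path, which is consistent with the pipeline registers described above.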

The R4SDC has a complex controller, which creates a long critical path. By observing that all stages have the same control bits but have different sequences, using a ROM with an incremental address was a simpler solution than using a complex FSM. Pipeline registers were also added to the butterfly elements, multipliers, and between stages. Figure

Adding pipeline registers to R4SDC butterfly element.

Adding pipeline registers between elements and stages.

Special measures were taken within the controller to keep the control signals properly timed. The twiddle factors must also be delayed to match the delayed data sequence.

Due to its simple control requirements, a simple counter was sufficient as the entire controller for the R2^{2}SDF. To speed up the controller, a fast adder could potentially be faster than a simple ripple-carry adder. However, due to the small number of stages (^{2}SDF is not suited for adding pipeline registers within individual butterfly elements, because this would break the timing for the data feedback path. Figure

Adding pipeline registers for R2^{2}SDF.

Due to finite wordlength effects, the implemented FFTs always scaled by

In this scenario, if the MSB of the number to be divided is 0 (i.e., a positive number), it is rounded half up, which introduces a positive bias. If the MSB is 1 (i.e., a negative number), it is truncated, which introduces a negative bias. Assuming that positive and negative numbers are uniformly distributed, the two biases cancel and the scheme is unbiased. However, selecting the rounding mode from the MSB means that both rounding methods must coexist at a single rounding position, which requires extra hardware and lengthens the critical path, harming performance. This scheme was therefore not chosen.

In this scenario, if the bit to be rounded is 1, the value is randomly rounded up or down. If it is 0, the same rounding as in the previous scheme is performed. From a statistical point of view, no bias exists. However, this method requires a random bit generator and a long accumulation time, which consumes substantial extra hardware and significantly degrades performance, so it was not implemented.

This rounding method explores balancing between stages. Round-half-up and truncation are used in an interlaced fashion, as shown in Figure

For an even number of stages, this achieves the same result as the randomized approach, with smaller resource usage and simpler control. This scheme fits the R2^{2}SDF architecture particularly well, because the two butterfly elements within the same stage of the R2^{2}SDF can be naturally balanced. This method was chosen for the designs presented in this paper.
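The bias argument can be made concrete with a toy model. The sketch below assumes each rounding position divides a two's complement integer by 2 (a one-bit right shift), which is an assumption made here for illustration and not a description of the paper's datapath. It measures the mean error of truncation, round-half-up, and an interlaced chain over a symmetric range of inputs. All function names are hypothetical.

```python
def truncate_half(x):
    """Divide by 2 with truncation (floor); the dropped bit is discarded."""
    return x >> 1


def round_half_up_half(x):
    """Divide by 2 with round-half-up; add the dropped bit back in."""
    return (x + 1) >> 1


def mean_error(stages, values):
    """Mean of (rounded result - exact result) after applying all stages."""
    scale = 2 ** len(stages)
    total = 0.0
    for v in values:
        y = v
        for stage in stages:
            y = stage(y)
        total += y - v / scale
    return total / len(values)


if __name__ == "__main__":
    values = range(-4096, 4096)
    print("truncate once:      ", mean_error([truncate_half], values))
    print("round-half-up once: ", mean_error([round_half_up_half], values))
    print("truncate twice:     ",
          mean_error([truncate_half, truncate_half], values))
    print("round-half-up twice:",
          mean_error([round_half_up_half, round_half_up_half], values))
    print("interlaced:         ",
          mean_error([truncate_half, round_half_up_half], values))
```

In this toy model a single truncation has a mean error of -0.25 LSB and a single round-half-up has +0.25 LSB, so the two biases have opposite signs. Chaining two stages of the same scheme accumulates the bias, while interlacing the two schemes leaves a smaller residual. In a real FFT the data between stages also passes through butterflies and twiddle multiplications, so these numbers only indicate the trend.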

To compute the signal-to-quantization noise ratio (SQNR), randomly generated noise was used as the input to the pipeline FFT. A Matlab script generated double-precision floating-point FFT results, which were used as the true values. Figure

Balanced stages rounding.

SQNR calculation.
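The SQNR measurement described above can be sketched as follows. This is an illustrative helper, not the authors' Matlab script: it assumes the fixed-point output has already been converted back to the same scale as the floating-point reference, and it uses the usual definition of SQNR as the ratio of reference signal power to error power. The name `sqnr_db` is hypothetical.

```python
import math


def sqnr_db(reference, measured):
    """SQNR in dB: reference power over the power of (reference - measured).

    Both sequences may hold real or complex samples and must use the same
    scaling. The caller is assumed to have undone any fixed-point scaling.
    """
    signal_power = sum(abs(r) ** 2 for r in reference)
    noise_power = sum(abs(r - m) ** 2 for r, m in zip(reference, measured))
    return 10.0 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    ref = [1.0 + 0.0j, 1.0 + 0.0j]
    out = [1.1 + 0.0j, 1.0 + 0.0j]
    print(round(sqnr_db(ref, out), 2), "dB")
```

The helper raises an error if the two sequences are identical, since the noise power is then zero and the ratio is undefined.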

Figure ^{2}SDF, respectively, for a 16-bit data width (input data, twiddle factors, and output data are 16 bits). Balanced stage rounding typically improved the SQNR by 1-2 dB. The balanced stages scheme gives a better SQNR because it leverages the randomness between stages, whereas truncation alone and round-half-up alone each preserve only half of the information.

Rounding effects on SQNR.

R4SDC

R2^{2}SDF

Table

SQNR with different FFT sizes.

Architecture | FFT size | Input data width | Twiddle factor width | Number of stages | SQNR (dB) |
---|---|---|---|---|---|
R4SDC | 16 | 16 | 16 | 2 | 82.29 |
R4SDC | 64 | 16 | 16 | 3 | 73.49 |
R4SDC | 256 | 16 | 16 | 4 | 67.47 |
R4SDC | 1024 | 16 | 16 | 5 | 61.25 |
R2^{2}SDF | 16 | 16 | 16 | 2 | 81.82 |
R2^{2}SDF | 64 | 16 | 16 | 3 | 74.47 |
R2^{2}SDF | 256 | 16 | 16 | 4 | 68.22 |
R2^{2}SDF | 1024 | 16 | 16 | 5 | 62.68 |

FFT architectures with wordlengths smaller than 16 bits were also implemented. The example in the Figure

SQNR variation with different word lengths.
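The dependence of SQNR on word length follows from the standard uniform quantization noise model. The derivation below is a textbook sketch under that model, and the symbols are introduced here only for illustration: the quantization step is written as \Delta and the noise power as \sigma_q^2. It is not a result specific to these implementations.

```latex
\sigma_q^2 = \frac{\Delta^2}{12},
\qquad
\Delta \rightarrow \frac{\Delta}{2}
\;\Longrightarrow\;
\sigma_q^2 \rightarrow \frac{\sigma_q^2}{4},
\qquad
\text{SQNR gain per added bit} = 10\log_{10} 4 \approx 6.02\ \mathrm{dB}
```

Each additional bit halves the quantization step, which reduces the noise power by a factor of four for a fixed signal power.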

Table ^{2}SDF achieved a smaller area and better throughput per area than the R4SDC. Because of the pipelined design, the maximum clock frequency did not change drastically with FFT size for either design. As expected, the throughput per area decreases for larger FFT sizes, which require more stages and more area.

Implementation results on Spartan-3 devices.

Architecture | Point size | Input data width | Twiddle factor width | Slices | Block RAM | Max. speed (MHz) | Latency (cycles) | Transform time (cycles) | Transform time (µs) | Throughput (MS/s) | Throughput/area (MS/s/slice) |
---|---|---|---|---|---|---|---|---|---|---|---|
R4SDC | 16 | 16 | 16 | 468 | 2 | 108.20 | 21 | 16 | 0.15 | 108.20 | 0.231 |
R4SDC | 64 | 16 | 16 | 952 | 2 | 107.23 | 73 | 64 | 0.60 | 107.23 | 0.113 |
R4SDC | 256 | 16 | 16 | 1990 | 3 | 111.98 | 269 | 256 | 2.76 | 111.98 | 0.056 |
R4SDC | 1024 | 16 | 16 | 4409 | 8 | 123.84 | 1041 | 1024 | 8.27 | 123.84 | 0.028 |
R2^{2}SDF | 16 | 16 | 16 | 427 | 2 | 121.24 | 22 | 16 | 0.13 | 121.24 | 0.284 |
R2^{2}SDF | 64 | 16 | 16 | 810 | 2 | 98.14 | 74 | 64 | 0.65 | 98.14 | 0.121 |
R2^{2}SDF | 256 | 16 | 16 | 1303 | 3 | 98.73 | 270 | 256 | 2.59 | 98.73 | 0.076 |
R2^{2}SDF | 1024 | 16 | 16 | 2802 | 8 | 95.25 | 1042 | 1024 | 10.75 | 95.25 | 0.034 |
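The derived columns of the table can be reproduced from the clock frequency and slice count. The sketch below assumes one sample is accepted per clock cycle, so that the throughput in Msamples/s equals the clock frequency in MHz, as the table's throughput column indicates. The function names are hypothetical.

```python
def transform_time_us(points, fmax_mhz):
    """Time to stream one N-point transform, in microseconds."""
    return points / fmax_mhz


def throughput_per_area(fmax_mhz, slices):
    """Throughput per area in Msamples/s/slice, at one sample per cycle."""
    return fmax_mhz / slices


if __name__ == "__main__":
    # 1024-point designs on the Spartan-3, values taken from the table.
    for name, fmax, slices in (("R4SDC", 123.84, 4409),
                               ("R2^2SDF", 95.25, 2802)):
        print(name,
              round(transform_time_us(1024, fmax), 2), "us,",
              round(throughput_per_area(fmax, slices), 3), "MS/s/slice")
```

For the 1024-point designs this gives 8.27 µs and 0.028 MS/s/slice for the R4SDC, and 10.75 µs and 0.034 MS/s/slice for the R2^{2}SDF, matching the table.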

Table

Implementation results on Virtex-4 devices.

Architecture | Point size | Input data width | DSP48 | Slices | Block RAM | Max. speed (MHz) | Latency (cycles) | Transform time (cycles) | Transform time (µs) | Throughput (MS/s) | Throughput/area (MS/s/slice) |
---|---|---|---|---|---|---|---|---|---|---|---|
R4SDC | 16 | 16 | 4 | 530 | 1 | 236.7 | 21 | 16 | 0.07 | 236.7 | 0.447 |
R4SDC | 64 | 16 | 8 | 803 | 2 | 236.4 | 73 | 64 | 0.27 | 236.4 | 0.294 |
R4SDC | 256 | 16 | 12 | 1370 | 3 | 218.9 | 269 | 256 | 1.17 | 218.9 | 0.160 |
R4SDC | 1024 | 16 | 16 | 3064 | 8 | 219.2 | 1041 | 1024 | 4.67 | 219.2 | 0.072 |
R2^{2}SDF | 16 | 16 | 4 | 517 | 1 | 237.9 | 22 | 16 | 0.07 | 237.9 | 0.460 |
R2^{2}SDF | 64 | 16 | 8 | 779 | 2 | 236.7 | 74 | 64 | 0.27 | 236.7 | 0.304 |
R2^{2}SDF | 256 | 16 | 12 | 1234 | 3 | 236.7 | 270 | 256 | 1.08 | 236.7 | 0.192 |
R2^{2}SDF | 1024 | 16 | 16 | 2256 | 8 | 235.6 | 1042 | 1024 | 4.35 | 235.6 | 0.104 |

Comparisons with prior art are shown in Table ^{2}SDF method for a 16-bit 1024-point FFT was published by Sukhsawas and Benkrid in [

Performance comparison versus prior art on Virtex-E devices.

FFT design | Point size | Input data width | Twiddle factor width | Slices | Block RAM | Max. speed (MHz) | Latency (cycles) | Transform time (cycles) | Transform time (µs) | Throughput (MS/s) | Throughput/area (MS/s/slice) |
---|---|---|---|---|---|---|---|---|---|---|---|
Amphion [ | 1024 | 13 | 13 | 1639 | 9 | 57 | 5097 | 4096 | 71.86 | 14.25 | 0.009 |
Xilinx [ | 1024 | 16 | 16 | 1968 | 24 | 83 | 4096 | 4096 | 49.35 | 20.75 | 0.011 |
Sundance [ | 1024 | 16 | 10 | 8031 | 20 | 49 | 1320 | 1320 | 27.00 | 49.00 | 0.006 |
Sukhsawas R2^{2}SDF [ | 1024 | 16 | 16 | 7365 | 28 | 82 | 1099 | 1024 | 12.49 | 82.00 | 0.011 |
Our R2^{2}SDF | 1024 | 16 | 16 | 5008 | 32 | 95.0 | 1042 | 1024 | 10.78 | 95.00 | 0.019 |
Our R4SDC | 1024 | 16 | 16 | 7052 | 32 | 94.2 | 1041 | 1024 | 10.87 | 94.20 | 0.013 |

On the Virtex-E, our R2^{2}SDF reached 95 MHz in 5008 slices, compared with 82 MHz and 7365 slices for the earlier R2^{2}SDF design, giving a superior throughput per area ratio of 0.019 Msamples/s/slice. Our R4SDC architecture also outperformed prior art, running at 94.2 MHz and using 7052 slices, for a throughput per area ratio of 0.013 Msamples/s/slice.

Another point of reference is the Xilinx FFT IP core. For comparison's sake, the IP core for Virtex-E is shown in the table. The Virtex-E core has four times the latency in cycles (4096) due to its internal architecture, and its throughput per area ratio is only 0.011 Msamples/s/slice. Note that the throughput per area comparisons do not take block RAMs into account, though each of the designs required a similar number of block RAMs. However, the Xilinx FFT DSP core can perform better with the newer Xtreme technology: on the Virtex-4 device 4vsx25-10, a 1024-point FFT can be completed within 2.85 microseconds in the best case, at a cost of 2141 slices, 7 block RAMs, and 46 Xtreme DSP slices [

In this paper, optimized implementations of R4SDC and R2^{2}SDF pipeline FFT processors on Spartan-3, Virtex-4, and Virtex-E FPGAs were presented. The 16-bit 1024-point FFT with the R2^{2}SDF architecture had a maximum clock frequency of 95.2 MHz and used 2802 slices on the Spartan-3. The R4SDC ran at 123.8 MHz and used 4409 slices on the Spartan-3. On the Virtex-4, the figures were 235.6 MHz and 2256 slices for the R2^{2}SDF and 219.2 MHz and 3064 slices for the R4SDC. Different rounding schemes were analyzed and compared. SQNR analysis showed that the balanced stages rounding scheme gave a high SQNR with small overhead. The SQNR improves by around 6 dB with every additional bit of word length.

The R2^{2}SDF architecture outperformed the R4SDC architecture in terms of throughput per area, a measure of efficiency, for the 1024-point FFT. This is due to its simpler controller and its compatibility with pipeline register insertion. Both architectures have comparable maximum clock frequencies and SQNR under the balanced stages rounding scheme.