Modules for Pipelined Mixed Radix FFT Processors

Attribution
International Journal of Reconfigurable Computing

Introduction
The fast Fourier transform (FFT) algorithm is widely used in signal processing and communication systems. Because of its intensive computational requirements, it occupies a large area and consumes high power when implemented in hardware.
The FFT uses a divide-and-conquer approach to reduce the computations of the discrete Fourier transform (DFT). In the Cooley-Tukey radix-2 algorithm, the N-point DFT is subdivided into two (N/2)-point DFTs, and each (N/2)-point DFT is recursively divided into smaller DFTs until two-point DFTs are reached. This last procedure, named the radix-2 butterfly, is just an addition and a subtraction of complex numbers.
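The recursive subdivision described above can be illustrated with a minimal Python sketch. It is not taken from the paper; the function names `dft` and `fft_radix2` are chosen here for illustration, and the code assumes the input length is a power of two.

```python
import cmath

def dft(x):
    """Direct N-point DFT, used here only as a reference."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])  # (N/2)-point DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])   # (N/2)-point DFT of the odd-indexed samples
    out = [0j] * n
    for m in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * m / n) * odd[m]  # twiddle factor multiplication
        out[m] = even[m] + t           # butterfly: addition
        out[m + n // 2] = even[m] - t  # butterfly: subtraction
    return out

samples = [1, 2, 3, 4, 0, -1, -2, -3]
result = fft_radix2(samples)
```

The result agrees with the direct DFT of the same samples up to rounding error, while the recursion needs on the order of N log N operations instead of N squared.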
Higher radix algorithms, such as radix-4 and radix-8, can be employed to reduce the number of complex multiplications, but the butterfly structure becomes more complex. Therefore, a split radix algorithm [1] is adopted to get the benefits of both the radix-2 and radix-4 algorithms.
Prime factor algorithms use the Good-Thomas mapping and the Chinese Remainder Theorem to decompose the N-point DFT into smaller DFTs whose lengths are mutually prime factors of N [2]. With this mapping the twiddle factor multiplications are avoided at the cost of an increased number of additions and an irregular structure. A modification of the prime factor algorithm is the Winograd fast Fourier transform algorithm. It achieves the minimum number of complex multiplications, but the number of additions is increased.
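The following Python sketch illustrates the Good-Thomas index mapping for two coprime factors. It is an illustrative sketch rather than the cited implementation: the names `good_thomas_dft`, `n1`, and `n2` are assumptions, and the modular inverse uses the three-argument `pow` with a negative exponent, which requires Python 3.8 or later. Note that no twiddle factors appear between the row and column transforms.

```python
import cmath
from math import gcd

def dft(x):
    """Direct DFT, used for the short transforms and as a reference."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]

def good_thomas_dft(x, n1, n2):
    """DFT of length n1*n2 (n1, n2 coprime) without twiddle factor multiplications."""
    n = n1 * n2
    assert len(x) == n and gcd(n1, n2) == 1
    inv2 = pow(n2, -1, n1)  # inverse of n2 modulo n1
    inv1 = pow(n1, -1, n2)  # inverse of n1 modulo n2
    # Input index map: sample (i1, i2) is taken from position (n2*i1 + n1*i2) mod n.
    grid = [[x[(n2 * i1 + n1 * i2) % n] for i2 in range(n2)] for i1 in range(n1)]
    rows = [dft(row) for row in grid]  # n1 transforms of length n2
    out = [0j] * n
    for k2 in range(n2):
        col = dft([rows[i1][k2] for i1 in range(n1)])  # transform of length n1
        for k1 in range(n1):
            # Output index map given by the Chinese Remainder Theorem.
            out[(n2 * inv2 * k1 + n1 * inv1 * k2) % n] = col[k1]
    return out

data = [complex(i, -0.5 * i) for i in range(15)]
spectrum = good_thomas_dft(data, 3, 5)
```

The price of avoiding the twiddle factors is the irregular addressing in both index maps, which is the structural irregularity mentioned above.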
Pipelined FFT architectures are fast, high-throughput architectures that combine parallelism and pipelining [3]. Among them, the single-path delay feedback architecture and the multipath delay commutator architecture are the most popular.
In the first kind of architecture, each pipeline stage contains a twiddle factor multiplier, a small-point DFT unit (of two or four points), and a data buffer, which sits in the feedback path of the stage. This buffer must be filled before the DFT computations start. Therefore, such a structure cannot be fully loaded [4,5].
In the second kind of architecture, each pipeline stage contains a data buffer, a small-point DFT unit, and a twiddle factor multiplier, which are connected in sequence. The data buffer is based on a multipath delay commutator and provides sets of complex data that feed the DFT unit in parallel. Such an architecture provides the maximum throughput at the cost of a high hardware volume [3,6]. Besides, it must implement a uniform radix FFT, because for the mixed radix FFT the data buffers become too complex.
A systolic array scheme has also been proposed for FFT computations [7,8]. The N-point DFT in it is calculated as N separate sums of N weighted samples. It is attractive because of its regularity, scalability, locality of interconnections, and suitability for non-power-of-two transforms. However, such a processor requires substantial hardware volume. For example, the 16 × 16-point processor for 16-bit data contains 3982 adaptive logic modules (ALMs) and 33 multipliers of the Altera Stratix III FPGA, compared to the usual pipelined FFT processor of 20-bit width, which contains 4261 ALMs and only 24 multipliers [8].
The implementation of the pipelined FFT architecture in modern field programmable gate arrays (FPGAs) provides a high-speed hardware solution with small energy consumption. One FFT of 256 16-bit complex points dissipates approximately one microjoule in FPGA [6]. The FFT processor for N = 4096, which occupies 36.7 k ALMs and 60 multipliers, provides a speed of up to 90 GFLOPS and an efficiency near 10 GFLOPS per watt, which is many times higher than in a CPU or GPU implementation [9]. Besides, this architecture can be adapted in FPGA to the conditions of the problem being solved by trading off the throughput, the transform length N, or the computation precision.
In most cases, power-of-two FFT processors are implemented in FPGA. In [13] it was shown that the radix-2 FFT method requires the least number of FPGA slices, the Good-Thomas method is faster than Cooley-Tukey, and the Rader method has the lowest operating frequency of the pipelined processor in FPGA.
Non-power-of-two transforms are widely used in OFDM modems. In such transforms the Winograd algorithm minimizes the number of multiplications in the DFT modules but also adds a degree of complexity and significantly increases the total number of adders utilized in FPGA [14,15].
In [16] a parallel architecture of the DFT module has been proposed for the computation of this algorithm. This architecture is able to deal with a large number of FFT sizes that decompose into factors of 2, 3, 4, 5, 7, or 8.
In [17] a pipelined processor design is proposed that uses the Cooley-Tukey FFT algorithm only in those cases where the factors of the number N are not relatively prime.
The DFT modules used in the examples of pipelined FFT processors are designed by a one-to-one mapping of the respective small-point FFT algorithms. As a result, an N-point module needs its data fed through N input ports. Consider a two-stage pipeline in which the transform length is factored into two unequal factors. Then the buffer between the stages consists of several FIFOs, whose number and length are set by these factors and which are fed from the parallel inputs in a nonuniform order. Therefore, to provide the proper input data order for these stages, complex data buffers must be attached to their ports.
Consider instead DFT modules that accept their N input data sequentially over N steps. Then both the data buffers and the twiddle factor multipliers are simplified substantially. These modules operate with a slowdown of N times. Besides, the hardware volume of the DFT modules can be decreased by up to N times. The disadvantage of this architecture is the decrease of the FFT processor throughput by up to N times. In this situation, however, the required system throughput can be provided by increasing the number of FFT processors configured in the FPGA.
In this paper we propose the design of a set of N-point DFT units, which help to implement pipelined FFT processors when the data flow is a single sample per clock cycle.

A Method of Pipelined Datapath Synthesis
In high-level synthesis, the DSP algorithm is usually described by a signal flow graph or a synchronous data flow (SDF) graph. In the SDF the nodes (actors) and edges represent the operators and the data transfers between them, respectively. Each actor consumes and generates the same amount of data in each SDF cycle [21,22].
A uniform SDF has the property that its graph is equal to the graph of the pipelined datapath that implements the algorithm with a period of one clock cycle. Then the SDF nodes are mapped into operational resources such as adders and multipliers, the edges are mapped into connections, and the delays in the edges represent pipeline registers. This property is widely used in the synthesis of DSP modules for FPGA in many CAD tools, such as Matlab-Simulink System Generator [23].
The synthesis of a pipelined datapath with a period of more than one clock cycle is usually performed in the steps of resource selection, actor scheduling, and resource assignment. Then the datapath structure is found, and the control unit is synthesized.
It is worth mentioning that most DFT algorithms are acyclic. The most popular scheduling methods for limited resources and execution time consider the acyclic SDF. These methods are list scheduling and force-directed scheduling [24]. Register allocation is effectively implemented by the Tseng heuristic and by left edge scheduling. The use of the cyclic interval graph takes into account the cyclic nature of the SDF algorithm [25]. Retiming methods and graph folding methods simplify the SDF mapping [26,27].
In [28,29] a method of datapath synthesis based on SDF is proposed. This method, adapted to the acyclic SDF, is described below. In this method, the SDF is represented in three-dimensional space as a triple $G = (K, D, A)$, where $K$ is the matrix of vectors-nodes $K_i$, which denote the operators or actors, $D$ is the matrix of vectors-edges $D_j$, which represent the links between operators, and $A$ is the incidence matrix of the SDF.
In the vector $K_i = (k_i, s_i, t_i)^T$ the coordinates $k_i$, $s_i$, and $t_i$ correspond to the type of operator, the processor unit (PU) number, and the clock cycle, respectively. The SDF graph in this representation is called the spatial SDF.
The spatial SDF is split into the spatial configuration $G_S = (K_S, D_S, A)$ and the event configuration $G_T = (K_T, D_T, A)$, which correspond to the datapath structure and its schedule, respectively. In the splitting process the vectors $K_i = (k_i, s_i, t_i)^T$ are decomposed into the vectors $K_{Si} = (k_i, s_i)^T$, which give the PU coordinates, and the vectors $K_{Ti} = t_i$, which give the execution times of the relevant operators in the PU $K_{Si}$. Then the temporal component $D_{Tj}$ of the vector $D_j$ is equal to the delay of transfer or processing of the respective variable.
We can assume that the matrix $K$ encodes some acceptable structural solution, since the matrix $D$ is calculated by

$$D = KA. \qquad (1)$$

The structural optimization consists in finding a matrix $K$ that minimizes a given quality criterion. It is possible to specify a matrix of vectors that provides the minimum value of this criterion. Then the vectors $K_i$ are found from relationship (2), which involves this matrix and the incidence matrix of the maximum spanning tree of the SDF.

When looking for an effective structural solution, the following relations have to be considered. The spatial SDF is valid if the matrix $K$ has no two identical vectors; that is,

$$K_i \neq K_j, \quad i \neq j. \qquad (3)$$

The schedule with a period of $L$ clock cycles is correct if the operators that are mapped into the same PU are performed in different cycles; that is,

$$t_i \not\equiv t_j \pmod{L} \quad \text{for } K_{Si} = K_{Sj},\ i \neq j. \qquad (4)$$

This inequality provides the correct circular schedule. Moreover, the next operator has to be executed no earlier than the previous one; that is,

$$D_{Tj} \geq 0. \qquad (5)$$

The operators of the same type should be mapped into PUs of the same type; that is,

$$K_i, K_j \in K_{p,q} \;\Rightarrow\; k_i = k_j = p,\ s_i = s_j = q, \qquad (6)$$

where $K_{p,q}$ is the set of $p$-type vectors-operators that are mapped into the $q$th PU of the $p$th type ($q = 1, 2, \ldots, q_{p\max}$).

Then the search for an effective schedule proceeds as follows. The vectors $D_j$ that belong to the spanning tree are assigned the time coordinate 1; that is, the respective operators have delays of a single clock cycle. The matrix $K_T$ is found from (2). The remaining elements of the matrix $D_T$ are found from (1). If inequality (5) is not satisfied for some vectors, then the time coordinate is increased for certain spanning-tree vectors $D_j$, and the schedule search is repeated. The rest of the $K_i$ coordinates are found from conditions (3)-(6). In this way the fastest schedule is built, as each statement is executed in a single clock cycle without unnecessary delays.
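The validity conditions stated in this section can be checked mechanically. The Python sketch below is only an illustration under stated assumptions: each node is a hypothetical tuple (operator type, PU number, clock cycle), each edge is a pair of node indices (source, destination), and `check_schedule` is a name introduced here, not part of the authors' framework. Because a PU is identified by the pair (type, number) in this encoding, the requirement that operators of one type share PUs of that type holds by construction and is not checked separately.

```python
from itertools import combinations

def check_schedule(nodes, edges, period):
    """Return the sorted numbers of the violated conditions (3), (4), (5)."""
    violations = set()
    for a, b in combinations(nodes, 2):
        if a == b:
            violations.add(3)  # two identical node vectors
        elif a[:2] == b[:2] and (a[2] - b[2]) % period == 0:
            violations.add(4)  # same PU busy in the same cycle modulo the period
    for src, dst in edges:
        if nodes[dst][2] - nodes[src][2] < 0:
            violations.add(5)  # consumer scheduled before its producer
    return sorted(violations)

# Three additions folded onto one adder with a period of three cycles.
ok_nodes = [("add", 1, 0), ("add", 1, 1), ("add", 1, 2), ("reg", 1, 3)]
ok_edges = [(0, 1), (1, 2), (2, 3)]
ok = check_schedule(ok_nodes, ok_edges, 3)

# Cycles 0 and 3 coincide modulo 3, so the adder is used twice at once.
clash = check_schedule([("add", 1, 0), ("add", 1, 3)], [(0, 1)], 3)

# The destination node runs one cycle before its source.
backward = check_schedule([("add", 1, 1), ("reg", 1, 0)], [(0, 1)], 3)
```

A check of this kind is useful when the graph is optimized by hand, because every manual move of a node can be validated before the structure is translated further.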
The resulting spatial SDF can be described in VHDL, so the pipelined datapath description can be translated into the gate-level description of the FPGA configuration by a suitable synthesis compiler [30].
During the structure synthesis, the nodes are placed in the space according to a set of rules that provide the minimum hardware volume for the given number of clock cycles in the algorithm period. The resulting spatial SDF is described in VHDL and is modelled and compiled using suitable CAD tools.
The method is similar to the known method of SDF folding by a factor equal to the period [31]. However, it differs from the intuitive folding procedure in its formal approach and directed optimization process. In this method, resource selection, operator scheduling, and resource allocation are implemented in a single step, which provides more effective optimization.
The method was built into a framework intended for SDF graph input and graphical editing. Both the algorithm and the resulting structure are stored in XML files. The framework can translate the XML description into a synthesizable VHDL model, which can be modelled and synthesized by the usual CAD tools provided by different companies. The present limitation is that the SDF graph is optimized only by hand, using the relations, definitions, and theorems mentioned above [32]. The examples shown below were synthesized by Xilinx ISE, version 13.3.
The method has been successfully proven by the synthesis of a set of pipelined FFT processors, IIR filters, and other pipelined datapaths for FPGA [33]. A set of DFT modules was designed using it as well.

Example of the DFT Module Synthesis
Consider a design example of a DFT module of N = 3 points. It performs the Winograd DFT algorithm described in [5], where $\{x_0, x_1, x_2\}$ is the input complex data set, $\{X_0, X_1, X_2\}$ is the complex result set, and $j = \sqrt{-1}$. In this algorithm $\cos(2\pi/3) = -0.5$; therefore, the product that involves this cosine reduces to a multiplication by $-1.5$, since $\cos(2\pi/3) - 1 = -1.5$. The algorithm has twelve real additions and four real multiplications.
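One common formulation of the three-point Winograd DFT is sketched below in Python. The intermediate names `s1`, `s2`, `m1`, `m2`, and `c1` are assumptions made for this sketch and may differ from the notation in [5]. The operation count matches the one stated above: six complex additions (twelve real additions) and two multiplications of a complex value by a real constant (four real multiplications).

```python
import cmath
import math

def winograd_dft3(x0, x1, x2):
    """Three-point DFT with 6 complex additions and 2 real-constant multiplications."""
    s1 = x1 + x2                              # addition 1
    s2 = x1 - x2                              # addition 2
    y0 = x0 + s1                              # addition 3, first output
    m1 = -1.5 * s1                            # (cos(2*pi/3) - 1) * s1
    m2 = 1j * math.sin(2 * math.pi / 3) * s2  # multiplying by j only swaps the parts
    c1 = y0 + m1                              # addition 4
    return y0, c1 - m2, c1 + m2               # additions 5 and 6

def dft(x):
    """Direct DFT used as a reference."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]

inputs = (1 + 2j, -3 + 0.5j, 4 - 1j)
outputs = winograd_dft3(*inputs)
```

The outputs agree with the direct three-point DFT up to rounding error.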
The multiplication by $\sin(2\pi/3)$ is then implemented with shift and add operations. The SDF of this algorithm is shown in Figure 1. The black circles in it represent the input-output nodes, a circle with a cross performs complex addition, and the symbols "≫" denote a right shift by the indicated number of bits. An edge labeled $j$ denotes multiplication by $j$, which is performed by inverting the imaginary part of the data and swapping its real and imaginary parts. Each node introduces a delay of a single clock cycle. Therefore, this SDF gives the structure of a module that computes the DFT with a period of one clock cycle.
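A multiplierless constant multiplication of this kind can be sketched as follows. The particular shift-and-subtract decomposition below is an assumption chosen for illustration and is not necessarily the one used in the paper; it approximates sin(2π/3) ≈ 0.8660254 as 1 − 2⁻³ − 2⁻⁷ − 2⁻¹⁰ − 2⁻¹³ − 2⁻¹⁴. The function names are hypothetical.

```python
import math

def mul_sin_2pi_3(x):
    """Multiply integer x by about sin(2*pi/3) using shifts and subtractions only."""
    return x - (x >> 3) - (x >> 7) - (x >> 10) - (x >> 13) - (x >> 14)

def mul_j(re, im):
    """Multiply (re + j*im) by j: negate the imaginary part and swap the two parts."""
    return -im, re

sample = 1 << 16
approx = mul_sin_2pi_3(sample)
exact = sample * math.sin(2 * math.pi / 3)
```

Each right shift truncates, so for general inputs the result can differ from the exact product by a few least significant bits; a hardware design would choose the number of terms to match the required precision.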
Following the method described above, the SDF is represented in three-dimensional space for a period of three clock cycles as a spatial SDF, which is illustrated in Figure 2. Compared to the SDF in Figure 1, the spatial SDF is placed in a space with the resource coordinate $s$ and the time coordinate $t$. The coordinates of a node give the number of the PU in which it is executed and the number of the clock cycle. The operator type is coded by one character for a register and another for an adder. Below the $t$ axis, an axis with the values ($t$ mod 3) is placed to simplify checking the SDF correctness by formula (4).
We can see that the spatial SDF encodes both the algorithm schedule and the structure of the module in which it is performed. This SDF is formally translated into a VHDL description in a synthesizable style, as shown in Algorithm 1.
Here signals and ports are used that represent the outputs of the respective operator nodes of the SDF in Figure 2. Note that each complex variable is replaced by a pair of signed-type signals, whose names carry distinguishing suffixes for the real and imaginary parts. The input pulse START synchronizes the phase generator, which is described in the process CNTRL. It generates the phase number signal CYC with a period of three clock cycles.
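The behaviour of such a phase generator can be modelled at the clock-cycle level with a short Python sketch. This is a behavioural illustration only, not the VHDL of Algorithm 1: it assumes that START resets the counter synchronously and that the counter otherwise increments modulo the period on every clock edge. The function name and its arguments are hypothetical.

```python
def phase_generator(start_pulses, period=3):
    """Return the value of CYC after each clock edge for a sequence of START values."""
    cyc = 0
    trace = []
    for start in start_pulses:
        if start:
            cyc = 0                   # START resynchronizes the phase
        else:
            cyc = (cyc + 1) % period  # otherwise count modulo the period
        trace.append(cyc)
    return trace

trace = phase_generator([1, 0, 0, 0, 0, 0, 0])
```

With a single START pulse in the first cycle, the trace repeats the phases 0, 1, 2, which is the value that selects the active alternative in each clock cycle.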
The DFT calculations are performed in the process CALC. The CASE statement in it consists of three alternatives selected by the signal CYC. Each alternative contains the operators that are performed in the clock cycles whose number modulo 3 equals the number of the alternative, according to the SDF in Figure 2.
Each signal assignment in this process is mapped into the respective pipeline register, because it is activated on the rising edge of the clock signal. The CASE alternatives are mapped into the respective adder PUs using the known resource sharing technique [35]. The resulting DFT module structure is shown in Figure 3. It is shown only as an illustration, because deriving it is not necessary for the module design.
As we can see in Figure 3, adders with two input multiplexers are placed between two neighboring registers in this module. A single bit of such a unit is mapped into a single 6-input LUT, which is used in modern FPGAs. Therefore, this module has the shortest critical path and a maximized clock frequency.

Results of the DFT Module Synthesis
A set of DFT modules was designed using the spatial SDF method described above. Each of them inputs a single sample per clock cycle. The example 64-point FFT processor is compared with similar processors in Table 2. Its advantages are a small hardware volume in terms of the number of DSP48 units and a high clock frequency under nonrestrictive constraints.

Conclusions
The implementation of the N-point DFT modules in FPGA supports the design of high-speed pipelined FFT processors with optimized hardware volume. It is shown that a DFT module with a slowdown of N times has a high clock frequency and a small hardware volume due to the pipelined calculations, the properties of the 6-input LUTs, and the application-specific multipliers. The designed DFT modules were used to build pipelined FFT processors with N = 64, 128, and 256, which are deposited in the free IP core site [36] and can be downloaded for investigation and use.
Our future work aims at the design of a framework that provides automatic synthesis of pipelined FFT processors based on the DFT modules.

Figure 3: Structure of the DFT module.