Coarse-grain reconfigurable architectures, like the Montium TP, have proven to be a very successful approach for low-power and high-performance computation of regular digital signal processing algorithms. This paper presents the implementation of a class of non-power-of-two FFTs to discover the limitations and flexibility of the Montium TP for less regular algorithms. A non-power-of-two FFT is less regular than a traditional power-of-two FFT. The results show the processing time, accuracy, energy consumption and flexibility of the implementation.

More and more functionality is integrated in state-of-the-art mobile systems. One of the applications is digital broadcasting of audio and video. As the reception of digital broadcasts requires a considerable amount of digital processing, efficient architectures are required that can provide the necessary processing at a low energy budget.

Until recently, the processing resources of mobile systems were mainly provided by means of Application Specific ICs (ASICs). Such architectures have the advantage of low energy consumption, but provide no flexibility. For a multistandard system, a multichip ASIC solution can therefore be much more expensive than a highly integrated reconfigurable single-chip architecture. Consequently, heterogeneous reconfigurable multicore architectures are becoming more and more attractive for multistandard appliances. Such an architecture consists of multiple processor types (heterogeneity) that can be used to execute applications within a bounded application domain. In order to support multiple emerging applications from this application domain, the processors can be reconfigured.

An example broadcast application is Digital Radio Mondiale (DRM) [

The most challenging algorithms in the DRM application are the Fast Fourier Transform (FFT) and the Inverse Fast Fourier Transform (iFFT) required in the baseband processing of the OFDM receiver, as presented in [

The paper is organized as follows. Section

The Discrete Fourier Transform (DFT) transforms a digital signal from the time domain to the frequency domain. It is defined by the following relation between
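The defining relation maps the N time-domain samples x[n] to N frequency bins X[k]. As a purely illustrative sketch (not the Montium implementation), the DFT can be evaluated directly from this definition:

```python
import cmath

def dft(x):
    # direct O(N^2) evaluation of X[k] = sum_n x[n] * exp(-2j*pi*n*k/N)
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]
```

The quadratic cost of this direct form is what the FFT algorithms discussed next avoid.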

The FFT efficiently implements the DFT, by exploiting symmetry in its twiddle factors. A well-known FFT algorithm is the “divide and conquer” approach reintroduced by Cooley and Tukey [
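The divide-and-conquer idea can be sketched as a minimal recursive radix-2 decimation-in-time FFT (an illustration of the principle only; the input length must be a power of two):

```python
import cmath

def fft_radix2(x):
    # recursive radix-2 DIT FFT: split into even/odd halves, combine with
    # twiddle factors exp(-2j*pi*k/n); len(x) must be a power of two
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Each level halves the problem, giving the familiar O(N log N) operation count instead of the DFT's O(N^2).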

The restriction of the radix FFT is that it can only handle FFTs that have a length that is a power of the radix value (e.g., two for radix-2). If other lengths are required a mixed-radix algorithm [

Another more efficient approach was introduced by Good [

Good's mapping optimizes the PFA for the number of calculations to be done, but assumes that input data is ordered in Ruritanian Correspondence (RC) order and output data in Chinese Remainder Theorem (CRT) order or vice versa, as presented by [

Because the RC mapping is used for the input vector, the CRT mapping has to be applied to retrieve the correct output vector. The output vector

A graphical description of the steps required for a PFA decomposed FFT using Good's mapping is given in Figure

Steps in a PFA decomposed FFT.
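The steps above can be sketched end-to-end in Python. This is our illustration of Good's mapping, with numpy's FFT standing in for the short DFT kernels: load the input in RC order, run the two sets of short DFTs with no twiddle factors in between, and unload in CRT order.

```python
import numpy as np

def pfa_fft(x, n1, n2):
    # Prime Factor Algorithm for N = n1*n2 with gcd(n1, n2) = 1
    n = n1 * n2
    a = np.empty((n1, n2), dtype=complex)
    # 1. load the input in Ruritanian Correspondence (RC) order
    for i1 in range(n1):
        for i2 in range(n2):
            a[i1, i2] = x[(i1 * n2 + i2 * n1) % n]
    # 2. n1-point and n2-point DFTs; no twiddle factors between the stages
    a = np.fft.fft(a, axis=0)
    a = np.fft.fft(a, axis=1)
    # 3. unload the output in Chinese Remainder Theorem (CRT) order
    m1 = pow(n2, -1, n1)  # multiplicative inverse of n2 modulo n1
    m2 = pow(n1, -1, n2)  # multiplicative inverse of n1 modulo n2
    out = np.empty(n, dtype=complex)
    for k1 in range(n1):
        for k2 in range(n2):
            out[(k1 * n2 * m1 + k2 * n1 * m2) % n] = a[k1, k2]
    return out
```

For the DRM case N = 1920 = 15 × 128, `pfa_fft(x, 15, 128)` reproduces `np.fft.fft(x)` up to rounding, while only 15-point and 128-point kernels are ever executed.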

The DRM receiver has to process the DRM audio samples at a certain rate, to avoid loss of synchronization, unwanted noise and clicks in the audio stream. Every 400 milliseconds, a DRM frame is transmitted. Depending on the transmission mode, such a frame consists of 15 up to 24 symbols. For each symbol, both the OFDM baseband processing and the audio source decoding have to be performed. Since the FFT is the most computationally intensive task in the baseband processing, it should be executed efficiently and fast.

Depending on the channel quality, the transmission mode of the broadcast can change. If the receiver does not immediately adapt to the accompanying decoding scheme, data gets lost. Changing the coding scheme requires different lengths of FFTs and iFFTs. Therefore, the receiver should be very flexible.

During the last few decades, many efficient implementations of FFTs have been proposed. Often, the gain in computational performance obtained when optimizing the algorithm leads to irregularity in the algorithm. This requires the architecture to be flexible. Thus, performance and flexibility are tightly linked, which seems to be in contrast with many architectures used nowadays.

Generally, the amount of operations required to perform a DFT can be reduced by transforming it to smaller FFTs. However, for DFTs with a non-power-of-two length, this transformation introduces some irregularity in the operations, which makes it hard to perform such an FFT efficiently.

A frequently used method to overcome this problem is “zero padding”, which appends zeros to the input vector and increases its length to a power-of-two, such that a regular power-of-two FFT can be applied. However, this changes the filter response of the FFT and it will lose its orthogonal characteristics. To illustrate the effects of zero padding on OFDM systems, we simulated an OFDM system with QAM-16 modulation. Figure

QAM-16 bit errors occurring due to transmission and decoding.

Transmitted data

Received data

Received data (using zero padding)

The effect of the white noise added by the channel is clear: small errors occur in the received samples. However, for the zero-padding-based receiver the input samples are not recognizable at all.

In most applications the error introduced by zero padding is acceptable, as the gain in performance is more important. However, OFDM-based applications exploit the orthogonal characteristics of FFTs to improve the spectral efficiency and, therefore, the requirements for the FFT are more stringent. In order to obtain an acceptable performance, efficient non-power-of-two FFT implementations are required.
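The loss of orthogonality is easy to reproduce. In this small numpy sketch, a single subcarrier of an FFT-1920 lands in exactly one bin of the proper transform, but smears across the spectrum when the input is zero-padded to 2048 samples:

```python
import numpy as np

n, k = 1920, 7
tone = np.exp(2j * np.pi * k * np.arange(n) / n)  # one OFDM subcarrier
direct = np.fft.fft(tone)         # proper FFT-1920
padded = np.fft.fft(tone, 2048)   # zero-padded to a power of two
# the proper transform concentrates all energy in bin k, while the padded
# transform leaks energy into a large number of bins (lost orthogonality)
bins_direct = np.count_nonzero(np.abs(direct) > 1.0)
bins_padded = np.count_nonzero(np.abs(padded) > 1.0)
```

With the proper FFT-1920, `bins_direct` is 1; with zero padding, hundreds of bins carry significant energy, which is why neighboring OFDM subcarriers interfere.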

For the DRM receiver, a large variety of FFTs is required. DRM can be used in several modes, each requiring a different set of FFTs. The OFDM processing requires a number of radix-2 FFTs (512, 256) and a set of non-power-of-two FFTs (1920, 576, 352, 288, 224, 176 and 112). The non-power-of-two FFTs mentioned before can be generalized to a group of 2-dimensional PFA-decomposable DFTs of the following form:

Table

A selection of the FFTs that can be generated with the PFA mapping, where row index m and column index p give the FFT length N = (2m + 1) · 2^p. FFTs used in DRM are shown in bold.

m \ p | 4 | 5 | 6 | 7 |
---|---|---|---|---|
2 | 80 | 160 | 320 | 640 |
3 | **112** | **224** | 448 | 896 |
4 | 144 | **288** | **576** | 1152 |
5 | **176** | **352** | 704 | 1408 |
6 | 208 | 416 | 832 | 1664 |
7 | 240 | 480 | 960 | **1920** |
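Reading the table's row index m and column index p as N = (2m + 1) · 2^p — an assumption, but one consistent with every entry shown — reproduces the table and confirms that all non-power-of-two DRM lengths fall in this family:

```python
# assumed encoding of the table: row m, column p, FFT length N = (2m+1) * 2**p
sizes = {(2 * m + 1) * 2 ** p for m in range(2, 8) for p in range(4, 8)}
drm_non_pow2 = {112, 176, 224, 288, 352, 576, 1920}
assert drm_non_pow2 <= sizes                    # every DRM length is covered
assert {80, 448, 144, 704, 208, 240} <= sizes   # spot-check visible entries
```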

Strictly taken, a DFT-

Implementations of FFTs are mainly focused on the power-of-two FFTs that use the radix-2 FFT approach. Those are widely used to compare implementations and for benchmarking Digital Signal Processor (DSP) architectures. The algorithms for non-power-of-two FFTs are mainly focused on reducing the number of multiplications and additions, as, for example, discussed in [

We have not found many articles that implement a non-power-of-two FFT on a DSP. Several ASIC implementations were found, but due to the outdated process technology used, the results are not very useful [

A high-speed FFT-1872 implementation on a Field Programmable Gate Array (FPGA) is presented in [

As described in Section

An example of a multiprocessor architecture is a heterogeneous tiled System-on-Chip (SoC), which consists of several (possibly small) processors connected in a very regular Network-on-Chip (NoC) topology [

Heterogeneous tiled SoC.

In such a heterogeneous system, several types of processors can be combined: General Purpose Processors (GPPs), DSPs, FPGAs, ASICs, and Domain Specific Reconfigurable Cores (DSRCs).

A typical GPP found in mobile devices is the Advanced RISC Machine (ARM) family (

DSPs are designed for high performance and flexibility. Compared to a GPP, a DSP performs much better within a bounded application domain, while its energy consumption is relatively low compared to the GPP.

FPGAs are bit-level reprogrammable. This allows a dedicated configuration for a specific application. Although operations on word level can be performed well on an FPGA, the infrastructure is more suited to bit-level operations. This can be seen when the performance and energy consumption are compared to the other architectures' figures: FPGAs provide a huge processing capacity, but at the cost of a relatively high energy consumption.

The ultimate processor for a certain task is the ASIC: once it is produced, it can only execute the application it was designed for. Therefore, the ASIC processor has both a high performance and low energy consumption, but is not flexible at all.

DSRCs are used to fill the gap between GPPs, DSPs, FPGAs and ASICs. Similarly to DSPs, their algorithm domain is bounded, but the computational performance is close to that of an ASIC.

An example of such a reconfigurable processor is the Montium tile processor [

Figure

Montium processing tile, consisting of a Montium TP and a CCU.

Each ALU's input operands are fetched from a register file, which can store up to four values per input. A large crossbar connects all ALUs, memories and register files, providing the very high connectivity that is required to utilize the ALUs' processing blocks as much as possible. Since there are 10 memories available that can be accessed simultaneously, the crossbar is based on 10 bidirectional 16-bit buses that can be read and written by each of the ALUs.

Each memory unit consists of an

The data path is controlled by a centralized sequencer that contains an SRAM memory in which the instructions are stored. The program memory is not accessible via the datapath, hence it cannot be modified during execution. The sequencer contains a program counter that addresses the instruction memory. Since there is a direct connection to the instruction memory, fetching an instruction requires only a single clock cycle. The selected instruction is decoded in two stages by decoders and configuration registers. Simultaneously, the program counter selects the next instruction in the instruction memory or, in case of a jump instruction, jumps to the address specified in the program. Since all instructions execute in a single cycle, the program behaves deterministically. Therefore, the instruction fetch can be considered transparent.

For each ALU, AGU, register file or interconnect component a small number of configurations (varying from 4 to 16 configurations) is stored in a local configuration register. The decoders contain combinations of these configurations (varying from 16 to 64 combinations) that are addressed by the sequencer. All configuration registers and decoders are implemented as asynchronous memories, which have to be filled prior to execution of a program. Therefore, the ALUs can be considered pipeline-less, such that typical problems like pipeline stalls never occur.

Table

Characteristics of the Montium TP.

Word size | 16 bits |
Area | 1.8 mm² |
Memory size | 10 local memories |
Clock frequency | 100 MHz |
CMOS technology | 0.13 μm |
Voltage | 1.2 V |
Power | 577 μW/MHz |

An example of a SoC in which the Montium TP is used, is the Annabelle SoC [

Annabelle SoC block diagram.

The streaming nature of the Montium TP's algorithm domain is based on data-driven operations. Typically, operations are performed on chunks of data for several tens to thousands of iterations before the operation has to be changed.

The Montium TP's instructions are programmed by filling the instruction memory, configuration memory and decoders. These memories can be modified by an external controller that takes care of the configuration and program control: the CCU [

Centralized in the CCU is a state register, which determines the Montium TP’s current state. Possible states are the following:

The Montium TP can communicate via the CCU in two modes:

State transition diagrams for the two communication modes.

Block mode

Streaming mode

Using the PFA decomposition (see (

The FFT-1920 is partitioned using the parameters

Figure

Generally, the

For the summations in

In general, the number of clock cycles required to compute an odd-size FFT-

The optimization proposed in [

The FFT-128 is implemented using a standard radix-2 approach. Radix-2 algorithms can be calculated efficiently on the Montium TP, since one FFT-2 can be executed in a single clock cycle. A detailed explanation of the mapping is presented in [

In traditional radix-2 FFT implementations the most difficult part is the bit-reversed addressing scheme of either the input or output values. In most DSP architectures, and in the Montium TP as well, special hardware in the AGUs overcomes this problem. However, in the PFA both input and output have to be ordered according to the RC or CRT mapping. The input reordering in the Montium TP gives the user of the algorithm the possibility to stream the data into the Montium TP in-order.
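For comparison, the power-of-two case only needs the classic bit-reversed index, which AGU hardware generates by mirroring the address bits. A small sketch of that addressing scheme:

```python
def bit_reverse(i, bits):
    # mirror the 'bits' least significant bits of index i,
    # as an FFT address generation unit does in hardware
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

order = [bit_reverse(i, 3) for i in range(8)]  # bit-reversed order for N = 8
```

The RC/CRT orderings of the PFA have no such simple bit-level structure, which is why they are the hard part of this implementation.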

The address patterns for RC ordering cannot be generated efficiently with the AGUs. Moreover, since the input values for the FFT-15 are stored in two memories, the address patterns become even less regular. A straightforward solution for the ordering would be to use an indirection table of 1920 entries. However, there is not enough free memory space in the Montium TP for a table of this size. For smaller FFTs this is the preferred approach.

For the input reordering for the FFT-1920 we use the following steps.

1. The complex input vector is written in-order into 2 local memories

2. An indirection read address is calculated using (

3. Using the indirection address, an input value is selected from the local memories

4. The value is stored in the other local memories

5. The write address in memories

Steps 2 to 5 are repeated 1920 times, until all values
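The RC read addresses themselves need no true modulo operation: because 15 · 128 ≡ 0 (mod 1920), they can be generated incrementally with additions and a conditional subtraction — the "pseudo-modulo" that a compare/select unit implements. The following is our illustration of that scheme, not the exact Montium microcode:

```python
N1, N2 = 15, 128
N = N1 * N2  # 1920
addr, addrs = 0, []
for i1 in range(N1):
    for i2 in range(N2):
        addrs.append(addr)
        addr += N1            # step to the next element of this row
        if addr >= N:
            addr -= N         # conditional subtract instead of a modulo
    # after N2 steps of +N1 the address is back at i1*N2 (N1*N2 = 0 mod N),
    # so one +N2 step moves to the start of the next row
    addr += N2
    if addr >= N:
        addr -= N
```

Every address is visited exactly once, so the whole 1920-entry indirection table is replaced by two adders and a compare/select.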

Memory organization before FFT-15 (RC order).

Figure

Memory organization after FFT-128 (CRT order).

The most complex step in the ordering process of the outputs is the calculation of the indirection address. This address has to be calculated using modulo operations. In the appendix we explain the output ordering for streaming out the complex sample

The total number of clock cycles required to calculate a non-power-of-two FFT of length

Implementation costs of FFTs used in DRM.

FFT length | odd factor | power-of-two factor | clock cycles |
---|---|---|---|

112 | 7 | 16 | 472 |

176 | 11 | 16 | 960 |

224 | 7 | 32 | 1014 |

288 | 9 | 32 | 1450 |

352 | 11 | 32 | 1950 |

576 | 9 | 64 | 3116 |

1920 | 15 | 128 | 14098 |

A fixed-point implementation of a digital signal processing algorithm is liable to overflow after an addition. To prevent overflow the amplitude of the input signal can be limited or the intermediate values can be scaled down. Scaling the intermediate fixed-point numbers results in a shift of the decimal point. Scaling a number in

In the FFTs considered in this paper, the signal is always scaled down with an integer factor

For an FFT the worst-case required scaling factor equals

Positions in the FFT-1920 algorithm where scaling can be applied.
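In a 16-bit butterfly the sum a + b can double in magnitude at every stage, so the worst-case growth over the whole FFT is bounded by N. Dividing by 2 at a scaling position (an arithmetic right shift) trades one bit of precision against overflow headroom. A sketch of a scaled butterfly stage, assuming 16-bit two's-complement data as on the Montium TP:

```python
import numpy as np

def scaled_butterfly(a, b, shift=1):
    # compute (a+b, a-b) in a wider type, then scale back into int16 range;
    # without the shift, sums near +/-32767 would wrap around (overflow)
    s = (a.astype(np.int32) + b.astype(np.int32)) >> shift
    d = (a.astype(np.int32) - b.astype(np.int32)) >> shift
    return s.astype(np.int16), d.astype(np.int16)
```

With shift = 0 the int16 result of 30000 + 30000 would wrap to a negative value; with shift = 1 it stays representable, at the cost of half an LSB of rounding error per scaled stage.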

For both communication modes (block mode and streaming mode), we created a Montium TP implementation for our FFT. In the block mode, the input samples need to be ordered (in RC order as explained before) before they can be transferred to the memories of the Montium TP. When the FFT has finished, the results can be transferred from the memories and then need to be reordered. Both ordering steps have to be done outside the Montium TP.

The streaming mode version requires no external processing. Simultaneously with the input ordering, the input can be scaled. Input scaling is applied by multiplying the input stream with a factor

The computation of the iFFT is very similar to the computation of the FFT, as can be seen when comparing the FFT (

The main difference between the FFT and the iFFT is the

Additionally, for the

As described in Section

As an indication of the performance and accuracy, we compared the Montium TP implementation with a 32-bit reference implementation for an ARM9 platform (see Table

ARM9 reference architecture.

Word size | 32 bits |
Area | 4.7 mm² |
Clock frequency | 96 MHz |
CMOS technology | 0.13 μm |
Voltage | 1.2 V |
Power | 250 μW/MHz |

Suppose the Montium TP needs to be reconfigured to run an FFT-1920. This configuration requires three phases. First, we configure the configuration registers, then we initialize some of the register files and the last step is to write coefficients into the local memory. These three steps have to be performed once before the algorithm can be started. The configuration size depends on the actual algorithm settings and whether block mode or streaming mode is used. For streaming we use the most generic configuration, which enables very fast partial reconfiguration. This results in a configuration size as depicted in Table

Number of bytes required for configuration of the FFT-1920.

Streaming | Block | |
---|---|---|

Configuration memory | 2970 | 2438 |

Register files | 44 | 8 |

Coefficient memory | 904 | 452 |

Total | 3918 | 2898 |

The main differences between the generic streaming mode configuration and the specialized block mode configuration are (1) the extra input and output reordering of the samples and (2) the enabling of partial reconfiguration.

After the main configuration it is possible to adjust the generic streaming configuration with small modifications, for example, to change from FFT to iFFT or to change the scaling factor. These modifications can be done via partial reconfiguration and require a limited number of bytes, as given in Table

Number of bytes required for partial reconfiguration of the FFT-1920.

Partial reconfiguration | Size |
---|---|

Enabling/disabling input scaling | 2 |

Change the input scaling factor ( | 8 |

Change the FFT-128 scaling factors ( | 14 |

Switch from FFT to iFFT (or vice versa) | 8 |

Comparison of required clock cycles between streaming mode and block mode.

Phase | Operation | Streaming | Block |
---|---|---|---|
Initialization | Configuration | 2113 | 1871 |
Preprocessing | Load input | 1922 | 1924 |
| Input scaling | N/A | 964 |
| Input ordering | 2114 | N/A |
Processing | FFT execution | 14098 | 14098 |
Postprocessing | Output ordering | 1927 | N/A |
| Retrieve output | N/A | 1924 |
Total (w/o initialization) | | 20061 | 18910 |

The execution of a FFT-1920 can be separated into several phases. Table

When adding up the processing time (initialization not considered), the block mode operation requires slightly fewer cycles than the streaming mode version, because the input and output ordering is not done in the block mode version; another processor or Montium TP has to take over this job. Obviously, it will be very hard for a GPP to reorder the data efficiently, due to the irregular address patterns (see also Section

Using the figures of Table

To demonstrate the accuracy of the algorithm we executed the FFT-1920 with several combinations of the scaling factors (see Table

11 cases to demonstrate the accuracy of the FFT-1920.

Case | Input | Pos. 1 | Pos. 2 | Pos. 3 | Pos. 4 | Pos. 5 | Pos. 6 | Pos. 7 |
---|---|---|---|---|---|---|---|---|

1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |

2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 |

3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 |

4 | 4 | 1 | 1 | 2 | 2 | 2 | 2 | 2 |

5 | 4 | 2 | 2 | 1 | 2 | 1 | 2 | 2 |

6 | 4 | 2 | 2 | 2 | 2 | 2 | 1 | 1 |

7 | 8 | 1 | 1 | 1 | 2 | 2 | 2 | 2 |

8 | 8 | 2 | 1 | 2 | 1 | 2 | 1 | 2 |

9 | 8 | 2 | 2 | 2 | 2 | 1 | 1 | 1 |

10 | 16 | 1 | 1 | 1 | 1 | 2 | 2 | 2 |

11 | 16 | 2 | 2 | 2 | 1 | 1 | 1 | 1 |

The input used for the test cases was a typical complex DRM sample stream consisting of 9600 samples. The sample stream was cut in 5 segments of 1920 samples and on each segment an FFT-1920 was applied. The amplitude of the stream was scaled to three levels (31%, 63% and 100% of the fixed-point scale) to analyze the effects of the input scaling and intermediate scaling. The results of the FFT computed by the Montium TP are compared with a floating-point FFT calculated by Matlab. For both, a total scaling factor of 128 was used.
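All 11 cases realize the same overall factor: the product of the input scaling and the seven intermediate factors is 128 = 2^7 in every row, so the cases differ only in where the scaling is applied, not in how much:

```python
from math import prod

# (input scaling, then the seven intermediate scaling factors) per case
cases = [
    (1, 2, 2, 2, 2, 2, 2, 2), (2, 1, 2, 2, 2, 2, 2, 2),
    (2, 2, 2, 2, 2, 2, 2, 1), (4, 1, 1, 2, 2, 2, 2, 2),
    (4, 2, 2, 1, 2, 1, 2, 2), (4, 2, 2, 2, 2, 2, 1, 1),
    (8, 1, 1, 1, 2, 2, 2, 2), (8, 2, 1, 2, 1, 2, 1, 2),
    (8, 2, 2, 2, 2, 1, 1, 1), (16, 1, 1, 1, 1, 2, 2, 2),
    (16, 2, 2, 2, 1, 1, 1, 1),
]
totals = [prod(c) for c in cases]  # every case yields 128
```

This is what makes the accuracy comparison fair: only the rounding behavior differs between the cases, not the output magnitude.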

Figure

Rounding errors for various scaling combinations.

From this figure it is clear that, for an input signal at 31% of the range, the low-numbered cases have a higher accuracy. However, applying such input scaling decreases the dynamic range of the algorithm considerably, resulting in a less accurate Fourier transform. Therefore, input scaling should be avoided as much as possible. From Figure

These results show the benefit for partial reconfiguration, where the system can quickly adjust the scaling factors depending on the input signal level. It can make a trade-off between accuracy and the risk of an overflow.

From the number of clock cycles we can derive the power consumption of the FFT-1920. For the Montium TP the worst-case power consumption is estimated at 0.577 mW/MHz [
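The quoted 0.577 mW/MHz is equivalently 0.577 nJ per clock cycle, independent of the clock frequency, which allows a back-of-the-envelope estimate for one streaming-mode FFT-1920 using the cycle counts given earlier:

```python
NJ_PER_CYCLE = 0.577        # 0.577 mW/MHz = 0.577 nJ per clock cycle
CYCLES_STREAMING = 20061    # streaming-mode FFT-1920, w/o initialization
time_us = CYCLES_STREAMING / 100.0              # at 100 MHz, in microseconds
energy_uj = CYCLES_STREAMING * NJ_PER_CYCLE / 1000.0
```

Under these assumptions one streaming FFT-1920 takes about 200 μs and roughly 11.6 μJ; the processing phase alone (14098 cycles) accounts for about 8.1 μJ of that.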

In this paper, the implementation of a wide range of non-power-of-two FFTs and iFFTs on the coarse-grain reconfigurable Montium TP architecture is discussed in detail. This range of FFTs proved to be an ideal test case to explore and validate the flexibility of the coarse-grained architecture.

The Montium TP is very well suited for executing algorithms with a regular kernel operation. Due to the parallelism in the data path, it can perform up to 5 operations in parallel, while each operation can use up to 4 inputs. The 10 local memories deliver a very high memory bandwidth. For algorithms like the FFT, the kernel operation (a butterfly) is executed repeatedly. The Montium TP can provide the memory bandwidth required for executing several butterfly operations in parallel, while the address patterns used for accessing the memories are generated quite easily.

The non-power-of-two FFT is a less regular algorithm. By optimizing the algorithm for regularity, rather than for the number of multiplications, we managed to map a non-power-of-two FFT on the Montium TP. Using the Prime Factor decomposition, the class of non-power-of-two FFTs could be partitioned such that a radix-2 component was recognized (which can be mapped and executed very efficiently on the Montium TP).

The possibility to use the data path for the generation of addresses makes it possible to map almost any algorithm with less regular addressing patterns to the Montium TP. Although this type of address pattern calculation is difficult, there is still enough regularity left to map the address calculation efficiently. Generic modulo operations are difficult to implement in hardware; however, the (pseudo-)modulo operations required for address calculations can be implemented efficiently using the Compare/Select unit available in each ALU of the Montium TP.

After adding the input scaling, input ordering and output ordering, the Montium TP’s configuration space was almost fully utilized. This implies both the physical usage (e.g., the bandwidth provided by the memories, interconnections and the ALUs) and the logical usage (e.g., the amount of instructions stored in the configuration space).

Considering the performance regarding accuracy and energy consumption, the Montium TP outperforms the reference 32-bit ARM9 implementation by a factor of 10. By choosing smart scaling factors

This example explains the output ordering for streaming out the complex sample

When (

For example, to obtain the location of

Equation (

One ALU is used for determining the values of

This research has been conducted within the Smart Chips for Smart Surroundings project (IST-001908) supported by the Sixth Framework Programme of the European Community.