Two-dimensional discrete Fourier transform (DFT) is an extensively used and computationally intensive algorithm, with a plethora of applications. 2D images are, in general, nonperiodic but are assumed to be periodic while calculating their DFTs. This leads to cross-shaped artifacts in the frequency domain due to spectral leakage. These artifacts can have critical consequences if the DFTs are being used for further processing, specifically for biomedical applications. In this paper we present a novel FPGA-based solution to calculate 2D DFTs with simultaneous edge artifact removal for high-performance applications. Standard approaches for removing these artifacts, using apodization functions or mirroring, either involve removing critical frequencies or necessitate a surge in computation by significantly increasing the image size. We use a periodic plus smooth decomposition-based approach that was optimized to reduce DRAM access and to decrease 1D FFT invocations. 2D FFTs on FPGAs also suffer from the so-called “intermediate storage” or “memory wall” problem, which is due to limited on-chip memory, increasingly large image sizes, and strided column-wise external memory access. We propose a “tile-hopping” memory mapping scheme that significantly improves the bandwidth of the external memory for column-wise reads and can reduce the energy consumption up to

Discrete Fourier Transform (DFT) is a commonly used and vitally important function for a vast variety of applications including, but not limited to, digital communication systems, image processing, computer vision, biomedical imaging, and biometrics [

The Cooley-Tukey fast Fourier transform (FFT) algorithm [

While calculating 2D DFTs, it is assumed that the image is periodic, which is usually not the case. The nonperiodic nature of the image leads to artifacts in the Fourier transform, usually known as edge artifacts or series termination errors. These artifacts appear as several crosses of high-amplitude coefficients in the frequency domain, as seen in [

We propose optimized periodic plus smooth decomposition (OPSD) as an optimization for standard periodic plus smooth decomposition (PSD) for edge artifact removal (Section

Based on OPSD, we propose an architecture that can reduce the access to DRAM and can decrease the number of 1D FFT invocations by performing column-by-column operations on the fly (Section

Since OPSD is heavily dependent on efficient FPGA-based 2D FFT implementation which is limited by DRAM access problems, we design a memory mapping scheme based on

The proposed OPSD and memory “tile-hopping” optimizations also lead to better energy performance as compared to row-major access (Section

We use our implementation as an accelerator for filtered back-projection (FBP), an analytical tomographic reconstruction method, and show that for large datasets our 2D FFT with edge artifact removal (EAR) can significantly improve reconstruction run time (Section

As compared to our previous work [

There are several resource-efficient, high-throughput implementation approaches of multidimensional DFTs on a variety of different platforms. Many of these methods are software-based and have been optimized for efficient performance on general-purpose processors (GPPs), for example, Intel MKL [

Due to their inherent parallelism and reconfigurability, FPGAs are attractive for accelerating FFT computations, since they can exploit the parallel nature of the FFT algorithm. FPGAs are a particularly attractive target for medical and biomedical imaging instruments such as electron microscopes and tomographic scanners. Such devices require high bandwidth and do not have to be manufactured in bulk to justify application-specific solutions. Moreover, increased mobility and portability are a future objective for many medical imaging systems. FPGAs are also efficient for prototyping machine vision applications, since they are more fine-grained than GPPs and GPUs and can serve as a bridge between general-purpose and application-specific acceleration solutions.

There have been several high-throughput 2D FFT FPGA-based implementations over the past few years. Most of these rely on repeated invocations of 1D FFTs by row and column decomposition (RCD) with efficient use of memory [

(a) Dataset-level view: an overview of row-column decomposition (RCD) for 2D FFT implementation. Intermediate storage is required because all elements of the row-by-row operations must be available for column-by-column processing. (b) Memory-level view: an overview of strided column-wise access from DRAM as compared to trivial row-wise access. An entire row of elements must be read into the row buffer even to access a single element within a specific row.
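The row-column decomposition described above can be sketched in a few lines of NumPy. This is a minimal software illustration, not the FPGA implementation; the function name `fft2_rcd` and the `intermediate` buffer are names chosen here for illustration. The `intermediate` array plays the role of the intermediate storage: every row FFT must be written to it before any column FFT can start.

```python
import numpy as np

def fft2_rcd(image):
    """2D FFT by row-column decomposition (RCD) using repeated 1D FFTs."""
    x = np.asarray(image, dtype=np.complex128)
    rows, cols = x.shape

    # Row pass: one 1D FFT per row, written to intermediate storage.
    intermediate = np.empty_like(x)
    for r in range(rows):
        intermediate[r, :] = np.fft.fft(x[r, :])

    # Column pass: each column needs one element from every row of the
    # intermediate result, which is what forces the strided access.
    out = np.empty_like(x)
    for c in range(cols):
        out[:, c] = np.fft.fft(intermediate[:, c])
    return out
```

For any 2D array the result agrees with `np.fft.fft2` up to floating-point error.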

Since the column-by-column 1D FFT requires data from all rows, intermediate storage becomes a major problem for large datasets. Many implementations rely on local memory such as resource-implemented block RAM for intermediate storage which is not possible for large datasets [

As shown in Figure

(a) An overview of the DRAM hierarchy. (b) The structure of a single DRAM bank. (c) Row buffer hit and miss: flow chart explaining the additional latency introduced when a new row has to be loaded into the row buffer to access a specific element.
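The cost of strided access can be made concrete with a toy model of a single DRAM bank that has one open row. The sketch below only counts row activations (row buffer misses). The image size (64 × 64) and DRAM row size (256 elements) are hypothetical values chosen for illustration, and the model ignores multiple banks, burst lengths, and timing parameters.

```python
def count_row_activations(addresses, row_size):
    """Count row-buffer misses for a stream of element addresses (one bank)."""
    open_row = None
    activations = 0
    for addr in addresses:
        row = addr // row_size
        if row != open_row:      # miss: a new DRAM row must be activated
            activations += 1
            open_row = row
    return activations

def row_major_address(r, c, n_cols):
    """Linear address of element (r, c) in a row-major layout."""
    return r * n_cols + c

def traversal_activations(n, row_size):
    """Activations for row-wise and column-wise reads of an n x n image."""
    row_wise = (row_major_address(r, c, n) for r in range(n) for c in range(n))
    col_wise = (row_major_address(r, c, n) for c in range(n) for r in range(n))
    return (count_row_activations(row_wise, row_size),
            count_row_activations(col_wise, row_size))

if __name__ == "__main__":
    print(traversal_activations(64, 256))
```

Under these assumptions the row-wise traversal needs 16 activations, while the column-wise traversal needs 1024, because consecutive column elements fall into different DRAM rows almost every time.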

If a new row has to be activated and accessed into the row buffer, a

While calculating 2D DFTs, it is assumed that the image is periodic, which is usually not the case. The nonperiodic nature of the image leads to artifacts in the Fourier transform, usually known as edge artifacts or series termination errors. These artifacts appear as several crosses of high-amplitude coefficients in the frequency domain (Figure

(1a) An image with nonperiodic boundary. (1b) 2D DFT of (1a). (1c) DFT of the smooth component, i.e., the removed artifacts from (1a). (1d) Periodic component, i.e., DFT of (1a) with edge artifacts removed. (1e) Reconstructed smooth component. (1f) Reconstructed periodic component.

The most common approach is to ramp the image near its boundaries so that the edges are slowly attenuated. Ramping is usually accomplished with an apodization function such as a Tukey (tapered cosine) or a Hamming window, which smoothly reduces the intensity toward zero. Such an approach can be implemented on an FPGA as a preprocessing operation by storing the window function in a look-up table (LUT) and multiplying it with the image stream before calculating the FFT [
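The windowing approach can be illustrated in software as follows. This is a sketch under stated assumptions: the function name `apodize` is hypothetical, the window is a separable 2D Hamming window built with `np.hamming`, and floating-point arithmetic stands in for the fixed-point LUT multiplication that an FPGA design would use.

```python
import numpy as np

def apodize(image, window_fn=np.hamming):
    """Multiply an image by a separable 2D window before taking its FFT."""
    img = np.asarray(image, dtype=np.float64)
    rows, cols = img.shape
    # The outer product of two 1D windows acts as the precomputed LUT.
    lut = np.outer(window_fn(rows), window_fn(cols))
    return img * lut

if __name__ == "__main__":
    ramp = np.tile(np.arange(32.0), (32, 1))   # strongly nonperiodic test image
    spectrum = np.fft.fft2(apodize(ramp))
    print(spectrum.shape)
```

Note that a Hamming window tapers to 0.08 rather than exactly zero at its endpoints, so a small boundary discontinuity remains, and the taper also attenuates genuine image content near the borders.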

Simultaneously removing the edge artifacts while calculating a 2D FFT imposes an additional design challenge, regardless of the method used. However, these artifacts must be removed in applications where they may be propagated to subsequent processing levels. An ideal method for removing these artifacts should involve making the image periodic while removing minimal information from the image. Periodic plus smooth decomposition (PSD), first presented by Moisan [

A major concern while designing complex image processing hardware accelerators is how to fully harness the divide-and-conquer approach. Algorithms that have to be mapped to multiple FPGAs are often marred by communication problems, and custom FPGA boards reduce flexibility for large-scale and evolving designs. For rapid prototyping of our algorithms, we used LabVIEW FPGA 2016 (National Instruments), a robust data-flow-based graphical design environment. LabVIEW FPGA provides integration with National Instruments (NI) Xilinx-based reconfigurable hardware, allowing efficient communication with a host PC and high-throughput communication between multiple FPGAs through a PXIe (PCI eXtensions for Instrumentation Express) bus. LabVIEW FPGA also enables us to integrate external Hardware Description Language (HDL) code and gives us the flexibility to expand our design for future processing stages. We used NI PXIe-7976R FPGA boards that have a Xilinx Kintex 7 FPGA and 2 GB of high-bandwidth (10 GB/s) external memory. This platform has already been extensively used for rapid prototyping of communication standards and protocols before moving to ASIC designs. The optimizations and designs we present here are scalable to most reconfigurable computing-based systems. Moreover, LabVIEW FPGA provides efficient high-level control over memory via a smart memory controller.

Periodic plus smooth decomposition (PSD) involves decomposing the image into a periodic component and a smooth component to remove edge artifacts with minimal loss of information from the image [

Let us have discrete

Since in general

The DFT of the image
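For readers who want to experiment with the decomposition, the following NumPy sketch follows the commonly published formulation of Moisan's periodic plus smooth decomposition. It is an assumption that this matches the notation of the equations in this section, and the function name `periodic_smooth_decomposition` is hypothetical. The boundary image holds the intensity jumps across opposite borders, the smooth component is obtained by dividing its DFT by the eigenvalues of the periodic discrete Laplacian, and the periodic component is what remains.

```python
import numpy as np

def periodic_smooth_decomposition(u):
    """Return (p_hat, s_hat): DFTs of the periodic and smooth components of u."""
    u = np.asarray(u, dtype=np.float64)
    M, N = u.shape

    # Boundary image: jumps across opposite borders, zero in the interior.
    v = np.zeros_like(u)
    v[0, :] += u[-1, :] - u[0, :]
    v[-1, :] += u[0, :] - u[-1, :]
    v[:, 0] += u[:, -1] - u[:, 0]
    v[:, -1] += u[:, 0] - u[:, -1]

    # Eigenvalues of the periodic discrete Laplacian in the DFT basis.
    q = np.arange(M).reshape(M, 1)
    r = np.arange(N).reshape(1, N)
    denom = 2.0 * np.cos(2.0 * np.pi * q / M) \
          + 2.0 * np.cos(2.0 * np.pi * r / N) - 4.0
    denom[0, 0] = 1.0            # avoid 0/0; the DC term is set explicitly below

    s_hat = np.fft.fft2(v) / denom
    s_hat[0, 0] = 0.0            # the smooth component has zero mean
    p_hat = np.fft.fft2(u) - s_hat
    return p_hat, s_hat
```

Applying `np.fft.ifft2(...).real` to either output recovers the corresponding spatial component, and the two components sum to the original image.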

Algorithm

While FPGA 1 and FPGA 2 can run in parallel, the result of step A from FPGA 1 has to be stored on the host PC while steps B, C, and D are completed on FPGA 2 before step E can be completed on the host PC.

The DRAM intermediate storage problem explained in Section


A top-level architecture for OPSD using two FPGAs and a host PC connected over a high-bandwidth bus. The steps are associated with Algorithm

As for

In this section, we optimize the original PSD algorithm to reduce both the number of 1D FFT invocations and the number of DRAM accesses.

Equation (

In computing the FFT of

It can be shown that the 1D FFT of the column

To compute the column-by-column 1D FFT of the matrix,
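The idea behind this shortcut can be checked numerically. The sketch below is not the derivation given in this section. It is an independent illustration that assumes the boundary image is defined as in Moisan's PSD, and the function names are hypothetical. Because every interior column of the boundary image is zero except for its first and last entries, its 1D FFT is a fixed vector scaled by one element of the boundary row vector. Only the boundary column vector needs a genuine 1D FFT, which is then added to the first column and subtracted from the last.

```python
import numpy as np

def boundary_image(u):
    """Boundary image of u (nonzero only on the outermost rows and columns)."""
    v = np.zeros_like(u, dtype=np.float64)
    v[0, :] += u[-1, :] - u[0, :]
    v[-1, :] += u[0, :] - u[-1, :]
    v[:, 0] += u[:, -1] - u[:, 0]
    v[:, -1] += u[:, 0] - u[:, -1]
    return v

def boundary_column_ffts(u):
    """Column-wise 1D FFTs of the boundary image using a single 1D FFT."""
    u = np.asarray(u, dtype=np.float64)
    M, N = u.shape
    row_b = u[-1, :] - u[0, :]     # boundary row vector (one scale per column)
    col_b = u[:, -1] - u[:, 0]     # boundary column vector

    # FFT of the length-M vector (1, 0, ..., 0, -1), known in closed form.
    k = np.arange(M)
    w = 1.0 - np.exp(2j * np.pi * k / M)

    V = np.outer(w, row_b)         # every column is a scaled copy of w
    C = np.fft.fft(col_b)          # the only genuine 1D FFT required
    V[:, 0] += C
    V[:, -1] -= C
    return V
```

The result matches `np.fft.fft(boundary_image(u), axis=0)` for any real image with at least two columns, which suggests why most column FFT invocations for the boundary image can be replaced by scalings.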


Table

Comparing mirroring, PSD, and OPSD.

Algorithm | DRAM Access (Points) | DFT (Points) |
---|---|---|
Mirroring | | |
P + S Decomposition (PSD) | | |
Optimized PSD (Proposed) | | |

Graph showing DRAM access (equal to number of DFT points to be computed) with increasing image size for mirroring, periodic plus smooth decomposition (PSD), and our proposed optimized periodic plus smooth decomposition (OPSD).

In this section we propose a

We propose

Instead of writing the results of the row-by-row 1D FFT in row-major order we remap the results in a blocked or tiled pattern as shown in Figure

Image showing tile-hopping. (a) Dataset view: image-level view showing tiles. (b) Memory view: DRAM-level view showing tile placement while writing. (c) Memory view: DRAM-level view showing column reading from the tiles.
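The effect of a tiled layout on column-wise reads can be explored with a toy single-bank model. The sketch below is a generic tiled address mapping and not the exact tile-hopping scheme, which may additionally distribute tiles across banks. The function names, the 64 × 64 image, and the 16 × 16 tile (sized so that one tile fills one hypothetical 256-element DRAM row) are assumptions made for illustration, and the image side must be a multiple of the tile side.

```python
def count_row_activations(addresses, row_size):
    """Count row-buffer misses for a stream of element addresses (one bank)."""
    open_row = None
    activations = 0
    for addr in addresses:
        row = addr // row_size
        if row != open_row:
            activations += 1
            open_row = row
    return activations

def tiled_address(r, c, n_cols, tile):
    """Linear address of element (r, c) when tile x tile blocks are contiguous."""
    tiles_per_row = n_cols // tile
    tile_index = (r // tile) * tiles_per_row + (c // tile)
    offset = (r % tile) * tile + (c % tile)
    return tile_index * tile * tile + offset

def tiled_activations(n, tile):
    """Activations for row-wise writes and column-wise reads in a tiled layout."""
    row_size = tile * tile       # assumption: one tile fills one DRAM row
    writes = (tiled_address(r, c, n, tile) for r in range(n) for c in range(n))
    reads = (tiled_address(r, c, n, tile) for c in range(n) for r in range(n))
    return (count_row_activations(writes, row_size),
            count_row_activations(reads, row_size))

if __name__ == "__main__":
    print(tiled_activations(64, 16))
```

In this model both the row-wise write and the column-wise read need 256 activations. A row-major layout with the same parameters needs 16 and 1024 respectively, so tiling trades some write locality for a large reduction in column-read misses.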

We refer to this method as

Since 2D DFTs are usually used for simplifying convolution operations in complex image processing and machine vision systems, we needed to prototype our design on a system that can be expanded to subsequent levels of processing. As mentioned earlier, for rapid prototyping of our proposed OPSD algorithm and tile-hopping memory mapping scheme, we used a PXIe-based reconfigurable system. PXIe is an industrial extension of a PCI system with an enhanced bus structure that gives each connected device dedicated access to the bus with a maximum throughput of 24 GB/s. This allows a high-speed dedicated link between a host PC and several FPGAs. The LabVIEW FPGA graphical design environment is efficient for rapid prototyping of complicated signal and image processing systems. It allows us to effectively integrate external HDL code and LabVIEW graphical design on a single platform. Moreover, it allows a combination of high-level synthesis (HLS) and custom logic. Since current HLS tools have limitations when it comes to complex image and signal processing tasks, LabVIEW FPGA tries to bridge these gaps by streamlining the design process.

We used FlexRIO (Flexible Reconfigurable I/O) FPGA boards plugged into a PXIe chassis. PXIe FlexRIO FPGA boards are adaptable and can be used to achieve high throughput, because they allow direct data transfer between multiple FPGAs at rates as high as 8GB/s. This can significantly simplify multi-FPGA systems, which usually communicate via a host PC. This feature allows expansion of our system to further processing stages, making it flexible for a variety of applications. Figure

Block diagram of a PXIe-based multi-FPGA system with a host PC controller connected through a high-speed bus on a PXIe chassis [

Specifically, we used two NI PXIe-7976R FlexRIO boards, each of which has a Kintex 7 FPGA and 2 GB of external DRAM with a theoretical data bandwidth of up to 10 GB/s. These FPGA boards were plugged into a PXIe-1085 chassis along with a PXIe-8880 Intel Xeon PC controller. The PXIe-1085 can hold up to 16 FPGAs and has 8 GB/s per-slot dedicated bandwidth and an overall system bandwidth of 24 GB/s.

As per Algorithms

The design flow presented in Figure

Step B was accomplished using standard LabVIEW FPGA HLS tools for programming (

We need the boundary column vector for 1D FFT calculation of the first and last columns. We also need the boundary row vector for appropriate scaling of

Functional block diagram of PXIe-based 2D FFT implementation with simultaneous edge artifact removal using optimized periodic plus smooth decomposition. The OPSD algorithm is split between two NI-7976R (Kintex-7) FPGA boards with 2 GB of external memory and a host PC connected over a high-bandwidth bus. The image is streamed from the PC controller to FPGA 1 and FPGA 2. FPGA 1 calculates the row-by-row 1D FFT followed by the column-by-column 1D FFT with intermediate tile-hopping memory mapping and sends the result back to the host PC. FPGA 2 receives the image, calculates the boundary image, and proceeds to calculate the column-by-column 1D FFT using the shortcut presented in (

Block diagram of 2D FFT showing data transfer between external memory and local memory scheduled via a Control Unit (CU).

The overall performance of the system was evaluated using the setup presented in Figure

Comparison of OPSD^{1} 2D FFT with regular RCD-based implementation.

Platform | SEAR^{2} (Yes/No) | Precision (bits) | RT^{3} (ms), 512 × 512 | RT^{3} (ms), 1024 × 1024 |
---|---|---|---|---|
Kintex 7, 28nm [ | Yes | 16 (fixed) | 32.4 | 116.7 |
Stratix IV [ | No | 64 (double) | - | 6.1 |
Virtex-5-BEE3, 65nm [ | No | 32 (single) | 24.9 | 102.6 |
Virtex-E, 180nm [ | No | 16 (fixed) | 28.6 | 76.9 |
ASIC, 180nm | No | 32 (single) | 21.0 | - |

DRAM energy consumption baseline vs tile-hopping.

| | | |
---|---|---|---|
EPR^{∘} Read (Baseline) | 4.46 | 5.77 | 7.12 |
EPR CW Read | 2.54 | 2.95 | 3.36 |

Performance evaluation in terms of frames per second for (a) 2D FFT performance with the DRAM tile-hopping memory pattern and (b) 2D FFT with edge artifact removal (EAR) performance, PSD versus OPSD. The performance evaluation shows the significance of the two optimizations proposed. Both axes are on a log scale.

The overall energy consumption of the custom computing system depends on

We estimate the DRAM power consumption for both the baseline (standard, strided) and the optimized (

2D FFT + EAR energy consumption baseline vs optimized (OPSD + tile hopping).

| | | |
---|---|---|---|
EPP (Baseline) | 36.92 | 41.25 | 48.35 |
2D FFT+EAR (Opt.) | 15.88 | 17.06 | 18.11 |
 | 2.3× | 2.4× | 2.8× |

The metric used to compare the overall energy optimization achieved for 2D FFTs with EAR is energy per point, i.e., the amount of average energy required to compute the 2D FFT of a single point in an image with simultaneous edge artifact removal. This was achieved by calculating the energy consumed by Xilinx LogiCORE IP for 1D FFTs, the DRAM, and the edge artifact removal part separately. The estimated energy calculated does not include energy consumed by the PXIe chassis and the host PC. Essentially, the FPGA-based architecture presented here could be used without the host controller. The energy consumption incorporates dynamic as well as static power. The overall energy consumption per point is reduced by 56.9%, 58.6%, and 62% for calculating

To further demonstrate the effectiveness of our implementation, we use the 2D FFT module as an accelerator to reduce the run time of filtered back-projection (FBP). FBP is a fundamental analytical tomographic image reconstruction method. In-depth details regarding the basic FBP algorithm have been left out for brevity, but can be found in [

Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR).

3D Density | CPU (i7), sec | FPGA + Host PC (i7), sec |
---|---|---|
 | 21.3 | 19.5 |
 | 47.5 | 42.4 |
 | 94.8 | 81.3 |
 | 322.3 | 275.3 |
 | 1687.7 | 1364.4 |
 | 16463.1 | 12599.4 |

Figure showing a thin slice of filtered back-projection results by reconstructing a

2D FFTs often become a major bottleneck for high-performance imaging and vision systems. The inherent computational complexity of the 2D FFT kernel increases further if effective removal (using PSD) of spurious artifacts introduced by the nonperiodic nature of real-life images is taken into account. We developed and implemented an FPGA-based design for calculating high-throughput 2D DFTs with simultaneous edge artifact removal. Our approach is based on a PSD algorithm that splits the frequency domain of a 2D image into a smooth component, which contains the high-frequency, cross-shaped artifacts, and a periodic component, which is artifact-free and is obtained by subtracting the smooth component from the 2D DFT of the original image. Since this approach calculates two 2D DFTs simultaneously, external memory addressing and repeated 1D FFT invocations become problematic. To solve this problem we optimized the original PSD algorithm to reduce the number of DFT samples to be computed and the number of DRAM accesses. Moreover, to reduce strided access from the DRAM during column-wise reads we presented and analyzed

Our methods were tested through extensive synthesis and benchmarking on a Xilinx Kintex 7 FPGA communicating with a host PC over a high-speed PXIe bus. Our system is expandable to support several FPGAs and can be adapted to various large-scale computer vision and biomedical applications. Despite decomposing the image into periodic and smooth frequency components, our design requires less run time than traditional FPGA-based 2D DFT implementation approaches and can be used for a variety of highly demanding applications. One such application, filtered back-projection, was accelerated using the proposed implementation, with the largest gains for larger raw tomographic datasets.

The authors declare that they have no conflicts of interest.

This work was supported by Japanese Government OIST Subsidy for Operations (Ulf Skoglund) Grant no. 5020S7010020. Faisal Mahmood and Märt Toots were additionally supported by the OIST Ph.D. Fellowship. The authors would like to thank National Instruments Research for their technical support during the design process. The authors would also like to thank Dr. Steven D. Aird for language assistance and Shizuka Kuda for logistical arrangements.

File: 2D FFT-EAR-Demo.mp4: video demonstrating the FPGA output of 2D FFT with edge artifact removal.