On the Feasibility of Fast Fourier Transform Separability Property for Distributed Image Processing

Given the high algorithmic complexity of Fast Fourier Transforms (FFT) applied to images, computational-resource-usage efficiency has been a challenge in several engineering fields. Accelerator devices such as Graphics Processing Units are very attractive solutions that greatly improve processing times. However, when the number of images to be processed is large, their limited amount of memory is a serious problem. This can be addressed by using more accelerators or higher-capability accelerators, which implies higher costs. The separability property is a resource frequently used in hardware approaches to divide the two-dimensional FFT work into several one-dimensional FFTs, which can be processed simultaneously by several computing units. A feasible alternative to address this problem is therefore distributed computing through an Apache Spark cluster. However, the feasibility of the separability property in distributed systems, when migrating from hardware implementations, is not evident. For this reason, this paper presents a comparative study between distributed versions of two-dimensional FFTs that use the separability property, in order to determine the most suitable way to process large image sets using both the Spark RDD and DataFrame APIs.


Introduction
In the image-processing field, two-dimensional convolution is the simplest way to filter images in the spatial domain. However, as the kernel size increases, processing times grow geometrically [1]. For this reason, the two-dimensional Discrete Fourier Transform (DFT), whose applications include image filtering in the frequency domain [2,3], allows for reducing (to some extent) the complexity of operations with respect to two-dimensional convolutions.
Although two-dimensional DFTs improve filtering performance, they still represent a processing challenge due to their high computational complexity of O(N⁴) [4], which translates into very long processing times. One solution to this problem is the Fast Fourier Transform (FFT), a set of more efficient algorithms that compute two-dimensional DFTs in considerably shorter processing times.
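To make the O(N⁴) cost concrete, the following sketch (using numpy as a stand-in, not the paper's FFTW/Scala stack) evaluates the two-dimensional DFT with four nested loops and checks it against a library FFT, which computes the same transform far faster:

```python
import numpy as np

def naive_dft2(f):
    """Direct two-dimensional DFT: four nested loops, i.e., O(M^2 * N^2)
    complex multiply-adds (O(N^4) for a square N x N image)."""
    M, N = f.shape
    F = np.zeros((M, N), dtype=complex)
    for u in range(M):
        for v in range(N):
            for x in range(M):
                for y in range(N):
                    F[u, v] += f[x, y] * np.exp(-2j * np.pi * (x * u / M + y * v / N))
    return F

rng = np.random.default_rng(0)
img = rng.random((8, 8))
# The FFT produces the same result with far fewer operations.
assert np.allclose(naive_dft2(img), np.fft.fft2(img))
```

The assertion confirms that the FFT is an exact (up to floating-point error) but much cheaper evaluation of the same transform.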
Even though two-dimensional FFTs are more efficient, and a real benefit can be achieved in applications, they still show a high computational cost when processing high-resolution images. As a consequence, researchers now seek alternative ways to compute faster and more efficient FFTs, taking advantage of novel High-Performance Computing (HPC) systems.
Recently, the parallel nature of two-dimensional FFTs has allowed users to take advantage of five different types of parallel hardware: Very-Large-Scale Integration (VLSI), Field-Programmable Gate Arrays (FPGA), Graphics Processing Units (GPU), Digital Signal Processors (DSP), and distributed processing via clusters.
Most successful efforts for accelerating FFTs can be found in real-time hardware implementations. For example, Pythosh and Magnani [5] presented a VLSI implementation of a 256 × 256 two-dimensional FFT, modifying the original Cooley-Tukey algorithm to achieve a speed-up of 27x and a computational complexity of O( ). This approach divided the original data into several submatrices to compute data concurrently.
Because FPGAs are popular due to their reconfigurability, a large number of successful approaches can be found in the literature [6][7][8][9]. For instance, Shirazi proposed processing FFTs on images [1] using an FPGA, achieving a speed-up of about 180x. This proposal exploits the DFT separability property by performing several row-wise one-dimensional FFTs and then, on those results, several column-wise one-dimensional FFTs.
Another recent effort for performing more efficient FFTs is using other types of accelerators, i.e., Graphics Processing Units (GPU) [10,11]. One of the most famous parallel FFTs can be found in the NVIDIA® Compute Unified Device Architecture (CUDA) Toolkit. CUDA includes the cuFFT library [12], which provides CUDA-accelerated FFT algorithms that are up to 10x faster than CPU implementations. Although GPU scalability is more feasible than FPGA/VLSI scalability, increasing the number of accelerators implies higher costs.
Some other FFT implementations take advantage of multicore processing. For example, Kharin et al. [13] provide a DSP software implementation that concurrently computes radix-2 and radix-4 FFTs, achieving valid results up to 3 times faster than sequential implementations. Another implementation uses serial I/O ports to build a DSP network, where every DSP computes part of the entire FFT job. For instance, an FFT parallel implementation is performed with two DSPs [14], where a worker DSP cooperates with a master DSP to concurrently compute several FFTs, achieving a speed-up of up to 2x.
However, hardware-accelerator proposals present two disadvantages: low computational resources and low scalability. As a consequence, they cannot operate with large image sets and the possibility of increasing the number of processing elements is practically null, unfeasible, or very expensive.
Because of this, alternative implementations use distributed systems to increase the amount of computational resources in order to scale up processing. For instance, a cluster tool that uses Apache Spark's [15] Resilient Distributed Datasets (RDD) for analyzing seismic data (used in the oil industry) is presented in [16]. It integrates the Breeze [17] FFT library, achieving a speed-up of about 150x using a cluster with 576 processing cores. Another Spark RDD implementation is proposed by Yang et al. [18], who presented a distributed version of the Cooley-Tukey FFT. Although this is not a two-dimensional approach, it uses the divide-and-conquer strategy by breaking down one-dimensional FFTs into several simpler data structures. Here, each structure is computed in a distributed way and results are merged to achieve the final outcome.
This approach focuses on minimizing network traffic to avoid latencies that could degrade FFT performance. However, it can only be scaled up for a single multidimensional array (big array), leaving aside applications that need to process large image sets.
Due to the great variety of FFT algorithms and the tendency to separate data to compute one-dimensional and two-dimensional FFTs, this paper presents the following contributions: (i) a comparative study on the feasibility of the divide-and-conquer strategy frequently found in the literature, derived from the two-dimensional FFT separability property, for the analysis of large image sets (in the frequency domain) in Apache Spark clusters, in terms of computational resource usage; (ii) a comparative analysis of FFT processing times for large image sets using RDDs and DataFrames.
In order to get the best possible performance, the highly optimized Fastest Fourier Transform in the West software [23], the Java Native Interface, and the Apache Spark middleware are used. The variables used to determine the feasibility of separability-property usage are execution times, CPU and memory usage, and generated network traffic.

Methods
This section presents the main methods used in this work, namely, the basics of the two-dimensional DFT and our proposed methodology for computing Apache-Spark-based distributed FFTs.
One of the most exhaustive operations in image processing is the Fourier Transform, which is used for frequency analysis and filtering. The following sections describe the two-dimensional Fast Fourier Transform and its separability property.

The Two-Dimensional Discrete Fourier Transform.
The DFT [4,24] of an image of M × N pixels is defined by the following equation:

F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y) e^{-j2\pi((xu/M)+(yv/N))},  (1)

where e^{-j2\pi((xu/M)+(yv/N))} represents the twiddle factors.
With (1), image frequencies can be split into several sinusoidal components, which can be analyzed separately. Frequently, the Fourier image is modified to eliminate specific frequency components, resulting in low-pass, high-pass, or band-pass filters. Thanks to the DFT separability property, (1), which computes the two-dimensional DFT, can be broken down into multiple one-dimensional DFTs [1], which can be expressed as follows:

F(u, v) = \sum_{x=0}^{M-1} F(x, v) e^{-j2\pi(xu/M)},  (2)

F(x, v) = \sum_{y=0}^{N-1} f(x, y) e^{-j2\pi(yv/N)},  (3)

where e^{-j2\pi(xu/M)} and e^{-j2\pi(yv/N)} are also twiddle factors. This property allows a two-dimensional DFT to be expressed as several one-dimensional DFTs applied to the rows and columns of an image.
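The separability property can be verified numerically. The following sketch (using numpy's FFT routines as a stand-in for the one-dimensional transforms) applies a 1-D FFT to every row and then a 1-D FFT to every column, and checks that the result matches the direct 2-D FFT:

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.random((4, 8))  # M x N image; M != N to catch axis mix-ups

# Separability: 1-D DFTs along every row, then 1-D DFTs along every
# column, yield exactly the two-dimensional DFT.
rows = np.fft.fft(img, axis=1)    # one-dimensional FFT of each row
two_d = np.fft.fft(rows, axis=0)  # one-dimensional FFT of each column
assert np.allclose(two_d, np.fft.fft2(img))
```

The same equality holds if columns are transformed first, since the row and column sums in (2) and (3) commute.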
However, direct evaluation of (1)-(3) produces redundant twiddle factors, which can generate terms that cancel each other out and thus do not contribute to the final result. An alternative that solves this problem is the use of FFT algorithms. In this paper, some of the most important and recent Fast Fourier Transform algorithms are considered.

The Fast Fourier Transform.
Due to the high computational complexity of a one-dimensional DFT [4], researchers have proposed alternatives to reduce the number of operations required for this procedure. These algorithms are known as fast DFTs. One of the most popular is the Cooley-Tukey algorithm [25,26], which reduces the one-dimensional DFT algorithmic complexity from O(N²) to O(N log₂ N) by eliminating redundant twiddle factors.
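A minimal sketch of the radix-2 Cooley-Tukey idea (not the paper's FFTW implementation): the size-N transform is split recursively into transforms of the even- and odd-indexed samples, which are combined with one set of twiddle factors per level:

```python
import numpy as np

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of 2."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    if n == 1:
        return x
    even = fft_radix2(x[0::2])  # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])   # DFT of odd-indexed samples
    # One twiddle factor per output bin in the first half.
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

signal = np.arange(16, dtype=float)
assert np.allclose(fft_radix2(signal), np.fft.fft(signal))
```

The halving-and-combining recursion is where the O(N log₂ N) cost comes from: log₂ N levels, each doing O(N) work.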
Although the computational complexity of this algorithm is lower than that of (2) or (3), it has the disadvantage of requiring M and N to be powers of 2 to operate adequately. This makes the algorithm inefficient for arbitrary sizes, because several zero values must be incorporated into the image (zero padding).
Other more efficient but less intuitive FFTs can be found in the literature. For instance, the Prime Factor FFT [27] re-expresses an N-point DFT as an N₁ × N₂ two-dimensional DFT, where N₁ and N₂ are coprime. In comparison with the Cooley-Tukey FFT, its execution time is greatly reduced.
Currently, some of the fastest FFT algorithms can be found in the Fastest Fourier Transform in the West (FFTW) software. It provides a set of FFT algorithms that can be optimized for the specific hardware on which they will be executed, ensuring the fastest FFTs regardless of the values of M or N. Its efficiency and feasibility are shown in several works [21]. An evolution of FFTW, known as FFTW++, is presented in [22].
According to the FFTW documentation, this software is a collection of fast FFT algorithms written in the C programming language that can adapt their underlying structure to the hardware used, making high performance possible.
All FFT algorithms in FFTW are meticulously optimized and computed in two stages: (1) a planner heuristically learns the fastest way to compute FFTs on the machine where they will be executed, and then creates a data structure called the FFTW plan; (2) the FFTW plan is executed to transform an input array. Plans can be reused later as necessary.
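The plan/execute split can be illustrated with a toy analogue (this is not the FFTW API; `ToyPlan` and its size-dependent precomputation are illustrative stand-ins): the expensive, size-dependent setup happens once, and the resulting "plan" is then reused to transform many arrays of that size:

```python
import numpy as np

class ToyPlan:
    """Toy analogue of an FFTW plan: size-dependent setup is done once,
    then the plan is reused for many arrays of that size."""
    def __init__(self, n):
        # "Planning" stage: precompute the n x n DFT matrix for this size.
        k = np.arange(n)
        self.dft_matrix = np.exp(-2j * np.pi * np.outer(k, k) / n)

    def execute(self, x):
        # "Execution" stage: apply the precomputed transform to the input.
        return self.dft_matrix @ np.asarray(x, dtype=complex)

plan = ToyPlan(8)          # plan once...
rng = np.random.default_rng(2)
for _ in range(3):         # ...execute many times
    data = rng.random(8)
    assert np.allclose(plan.execute(data), np.fft.fft(data))
```

In real FFTW the planning stage measures candidate algorithms on the target machine rather than building a matrix, but the amortization principle, plan once and execute many times, is the same one the paper relies on.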
As described above, FFTW provides several algorithms to compute FFTs after a plan is scheduled: (i) the Cooley-Tukey algorithm; (ii) the prime-factor algorithm; (iii) Rader's algorithm for prime sizes; (iv) the split-radix algorithm (conjugate-pair variation); (v) the FFTW code generator's algorithms. Also, according to the FFTW documentation, it can deliver plans with varying degrees of rigor. For example, if users need to create plans quickly, regardless of whether the execution of such plans is optimal, the FFTW_ESTIMATE flag must be set. But if users need the fastest possible FFTs for the underlying hardware, the most appropriate mode is FFTW_EXHAUSTIVE.
Another interesting FFTW feature is the Words of Wisdom. Every time an FFTW plan is created, it is stored in a region of memory called the Wisdom. If users need to save all plans stored in the Wisdom for later reuse, they can export them to a string variable or to the local file system. Previously saved plans can then be imported back into the Wisdom as needed.

The Distributed FFT.
In order to determine the feasibility of the separability property of two-dimensional FFTs when large amounts of images are processed in a distributed manner, the following methodology is proposed, assuming that the Apache Spark distributed system [15] is selected. We also need a Spark-compatible Application Programming Interface (API) that provides RDD and DataFrame distributed image representations. In this case, the Spark Image Processing Toolkit (SIPT) [28] is selected to enable distributed processing of large image sets.
Before explaining the details of this methodology, it is necessary to define the distributed data representation for both APIs. For real image-set representation, the following structure has been selected: Moreover, in order to achieve maximum performance while massively computing FFTs on images, we must select an API with highly optimized FFT algorithms, such as FFTW, and a large image set that a conventional computer (e.g., a desktop computer) cannot easily process. In this case, the well-known MIRFLICKR-25000 image set, comprised of 25,000 JPEG images collected from the Flickr® social network, is used.
In order to ensure that Spark applications gain access to FFTW algorithms, the Java Native Interface (JNI) is used, so we have to create the following: (i) a wrapper class, written in the Scala language, that allows users to invoke native methods (written in the C programming language); (ii) a dynamic library (lib.so file), linked to the previous wrapper class, which contains the FFTW functionalities. Once these requirements are satisfied, we can proceed to explain the proposed methodology in detail: (1) Create a Wisdom with 1D/2D forward/backward plans using the FFTW_EXHAUSTIVE rigor. Subsequently, all Wisdom content is exported to each worker's local file system in the root user's directory (∼/sipt.wisdom). This step must be executed offline.
(2) Once the FFTW Wisdom has been created and exported to a file, a single executor on every worker should load it from the path (∼/sipt.wisdom), either when the Java Virtual Machine (JVM) is initialized or just before using plans. In this way, plans are loaded into the Wisdom only once per worker, which should not impact the overall performance. (3) After all plans have been loaded into every worker's Wisdom, every executor is able to find the appropriate plans for both direct (forward) and inverse (backward) FFTs. Wrapper-class native methods are then called directly and concurrently, ensuring that only plans stored in the Wisdom are executed. This means that every executor eventually loads and executes different plans and data while accessing the same worker's Wisdom.
Once the Spark application finishes, each worker's Wisdom is forgotten. It is important to clarify that, in order to face an FFTW limitation that prohibits the concurrent creation and destruction of plans, the mutex.h library has been used to force executors on the same worker node to create plans from the Wisdom sequentially. Because creation and destruction of plans are performed offline, performance should not be impacted. (4) Finally, every Spark worker can execute two-dimensional FFTs in one of two ways: (a) Separability: using the FFT separability property, i.e., (2) and (3), compute one-dimensional FFT plans (see Figure 1(a)) as follows: (i) split rows; (ii) compute one-dimensional FFT plans on rows; (iii) merge computed rows and store them in memory in a transposed way; (iv) split columns; (v) compute one-dimensional FFT plans on columns; (vi) merge computed columns and store them in memory in a transposed way. (b) Nonseparability: compute two-dimensional FFT plans directly (see Figure 1(b)).
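The row/transpose/row scheme of step (4a) can be sketched as follows (a minimal single-machine sketch with numpy standing in for the per-row FFTW plans; in the actual methodology each batch of rows would be handled by a different executor):

```python
import numpy as np

def fft2_by_separability(img):
    """Two-dimensional FFT via the separability steps of (4a):
    row-wise 1-D FFTs, store transposed, row-wise 1-D FFTs again,
    store transposed back."""
    # Steps (i)-(iii): 1-D FFT on every row, merge, store transposed.
    step1 = np.fft.fft(img, axis=1).T
    # Steps (iv)-(vi): the former columns are now rows; 1-D FFT on
    # them, merge, and transpose back to the original orientation.
    return np.fft.fft(step1, axis=1).T

rng = np.random.default_rng(3)
img = rng.random((4, 6))
assert np.allclose(fft2_by_separability(img), np.fft.fft2(img))
```

The transpositions let both passes run the same row-wise one-dimensional plan; in the distributed setting they correspond to the shuffle between the row stage and the column stage, which is precisely the network traffic the experiments later measure.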
(5) Write results to local file system.

The High-Performance Computing Hardware.
To accomplish this research, an HPC system with sufficient resources and the appropriate software platform is needed to massively process large image sets. For the experimentation, we use an HPC cluster comprised of 4 nodes with the following characteristics: It should be clarified that the first two experiments were performed three times and the execution times reported here are averaged. Also, execution times for inverse (backward) FFTs are not reported in this paper because they are very similar to those obtained with direct (forward) FFTs.

Analysis of Results.
Given the limited availability of tools that process the FFT for large numbers of images, we introduced four versions of the distributed FFT (see Section 2.3) in Apache Spark. To achieve this, we use the sequential version of the single-precision floating-point (complex-to-complex) FFTW library [23]. In order to differentiate between the implementations, we give them distinctive names in Tables 1-3 as follows: (1) using the separability property: After carrying out the experimentation, Table 1 presents the processing-time comparison between two-dimensional FFT implementations on 8192 images (taken from the MIRFLICKR-25000 image set), varying the number of cores as specified in the first experiment described in the previous subsection.
According to Table 1, the best processing times are those obtained with the RDD-based FFT that does not use the separability property, as can be seen in all columns of row 4. Note that this happens in all cases, regardless of the number of cores used. As expected, the best result is obtained when using all 96 cores. Notice that the RDD-based FFT is approximately 2x faster than the DataFrame-based FFT when not using the separability property (see 4th and 5th rows) and is approximately 10x faster than when using it (3rd and 4th rows). However, observe that processing times obtained using the separability property are very similar between RDD and DataFrame versions across core variations (2nd and 3rd rows).
Notice in Table 1 that, when using a single core, i.e., when processing is performed sequentially, processing times are very similar to each other. For example, when the separability property is not used, RDD-based FFTs are only 18.8% faster than DataFrame-based FFTs; however, when using the separability property, RDD-based FFTs get processing times that are practically the same as for DataFrame-based FFTs.
Consequently, speed-up results based on the data from Table 1 are presented in Table 2. As can be seen, RDD-based FFTs that do not use the separability property perform computations most efficiently, obtaining a speed-up of up to 32x with respect to the sequential version. A general observation is that, as the number of cores increases, almost all FFT versions become less efficient.
This can be clearly seen in all rows of Table 2 (except the 4th row), where speed-ups tend to stagnate or degrade, in accordance with Amdahl's Law [30], as the number of cores increases.
On the other hand, in order to observe the influence of the number of images on speed-up, Table 3 presents the following results. Observe in row 4 that the best speed-ups obtained are those processed by RDD-based FFTs. Note that, regardless of the number of processed images, the speed-ups obtained are higher than those of the rest of the implementations. This confirms that RDD-based FFTs that do not use the separability property are the most efficient.
One of the limitations that we faced during this experiment is that versions that use the separability property could not process more than 8192 images. In order to find out the reason for this, the third experiment was carried out, in which the number of images to be processed is consecutively increased until reaching the maximum number possible (ranging from 128 to 8192 images), for both the RDD and DataFrame versions.
The obtained results are shown in Figure 2. Observe how both CPU and memory usage increase as the number of images increases over time. According to Figure 2(a), two peaks in memory usage can be seen (when 8192 images are processed), one for the RDD-based version and another for the DataFrame-based version, reaching almost all of the available memory in the cluster. In contrast, note in Figure 2(b) that not using the separability property involves fewer resources. This is the reason why more than 8192 images cannot be processed with the DataFrame-based FFTs that use the separability property.
Although CPU occupancy is not fully exploited, the separability property requires more CPU usage, especially when processing DataFrame-based FFTs. This could be due to the additional processing that a DataFrame requires when inferring data (from a schema) for each of its records. Therefore, the use of the separability property directly impacts two-dimensional FFT performance in Apache Spark, because it generates greater incoming and outgoing traffic while shuffling data across the network.
After experimentation, several quite interesting findings arise. When processing FFTs on large image sets using Spark + JNI + FFTW, we found the following: (i) the RDD API is more efficient for processing two-dimensional FFTs than the DataFrame API, since RDD processing is faster and requires less memory and less computational effort than DataFrame processing; (ii) when avoiding the separability property while computing RDD-based FFTs, greater speed-ups can be achieved, no matter how many cores are used or how many images are transformed; in addition, memory usage is lower in this case.

Conclusions
This research presented a comparative study for determining the feasibility of the separability property in distributed systems such as Apache Spark. Based on the obtained results, we can conclude that the 2D FFT separability property is not feasible in this context. This is because the divide-and-conquer strategy (very useful in several hardware implementations found in the literature), which separates and distributes image information between cluster nodes, generates network traffic that seriously degrades FFT performance.
In addition, we found that the RDD API is the most appropriate interface for massive FFT processing on large image sets because it requires less memory and less computational effort than DataFrame processing, which translates into greater speed-ups.
As future work, a combination of hardware accelerators and Apache Spark is proposed.
This could carry out the divide-and-conquer strategy within accelerator devices in order to offer the highest possible performance when processing FFTs on large image sets.

Data Availability
We used the public image set MIRFLICKR-25000 which can be found at https://press.liacs.nl/mirflickr/.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.