Multiple Memory Structure Bit Reversal Algorithm Based on Recursive Patterns of Bit Reversal Permutation

With the increasing demand for online/inline data processing, efficient Fourier analysis becomes more and more relevant. Because the bit reversal process accounts for a considerable share of the processing time of the Fast Fourier Transform (FFT) algorithm, it is vital to optimize the bit reversal algorithm (BRA). This paper introduces an efficient BRA with multiple memory structures. In 2009, Elster showed the relation between the first and the second halves of the bit reversal permutation (BRP) and stated that it may cause serious impact on the cache performance of the computer, if implemented. We found exceptions, especially when the said index mapping was implemented with multiple one-dimensional memory structures instead of a multidimensional or a single one-dimensional memory structure. We also found a new index mapping that holds even after the recursive splitting of the BRP into equal-sized slots. The four-array and the four-vector versions of the BRA with the new index mapping reported 34% and 16% improvement in performance in relation to similar versions of the Linear BRA of Elster, which uses a single one-dimensional memory structure.


Introduction
The efficiency of a bit reversal algorithm (BRA) plays a critical role in the Fast Fourier Transform (FFT) process because it contributes 10% to 50% of the total FFT process time [1]. Therefore, it is vital to optimize the BRA to achieve an efficient FFT algorithm. In 2009, Elster showed the relation between the first and the second halves of the bit reversal permutation (BRP) [2], but did not implement it. Elster stated that implementation of this relation may cause serious impact on the cache performance of modern computers. As Elster stated, use of a two-dimensional memory structure to implement this relation reduced the efficiency of the BRA. In contrast, the efficiency of the BRA increased when a one-dimensional memory structure was used for index mapping. When two equal-sized one-dimensional memory structures were used, the performance was even better than with a single one-dimensional memory structure. It was also found that the BRP can be further split into equal-size blocks recursively, up to a maximum of $\log_2(N)$ times, where $N$ is the number of samples and $N = 2^m$, $m \in \mathbb{Z}^+$. These two findings motivated us to introduce a BRA that is capable of using $2^s$ ($s = 2, 3, \ldots, \log_2(N)$) equal-sized ($N/2^s$) one-dimensional memory structures.
In 1965, Cooley and Tukey introduced the FFT algorithm, which is an efficient algorithm to compute the discrete Fourier transform (DFT) and its inverse [3]. The FFT has replaced the direct computation of the DFT, which had been used frequently in the fields of signal and image processing [4][5][6][7]. The structure of the FFT algorithm published by Cooley and Tukey, known as the radix-2 algorithm [3], is the most popular one [5]. There are several other algorithm structures, such as radix-4, radix-8, radix-16, mixed-radix, and split-radix [8].
To apply the FFT to a signal, there are two major requirements. The first requirement is $N = r^m$, where $N$ is the number of samples of the signal, $m \in \mathbb{Z}^+$, and $r$ is the selected radix structure, for example, $r$ = 2, 4, 8, and 16 for radix-2, radix-4, radix-8, and radix-16, respectively. The second requirement is that the input (or output) samples must be arranged according to a certain order to obtain the correct output [3,5,8,9]. The BRA is used to create the order of the input or output permutation according to the required order. The BRA used in most FFT algorithms, including the original Cooley-Tukey algorithm [3], is known as the bit reversal method (BRM). The BRM is an operation for exchanging two elements $x(k)$ and $x(\hat{k})$ of an array of length $N$, as shown in (1) and (2), respectively, where the digits $a_i$ are either 0 or 1 for radix-2 (in general, $0 \le a_i < r$) and $r$ is the relevant base 2, 4, 8, or 16 depending on the selected radix structure:
$$k = \sum_{i=0}^{m-1} a_i r^i, \quad (1)$$
$$\hat{k} = \sum_{i=0}^{m-1} a_i r^{m-1-i}. \quad (2)$$
All the later algorithms for creating the BRP were named BRA (bit reversal algorithm), though they used other techniques, such as patterns of the BRP, instead of bit reversing techniques.
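The digit reversal underlying the BRM can be sketched in C++ as follows. This is only a minimal illustration, assuming an index of $m$ base-$r$ digits; the function names `reverse_digits` and `digit_reversal_permutation` are hypothetical and are not taken from any of the cited algorithms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reverse the m base-r digits of k (hypothetical helper, for illustration).
// For r = 2 this is the classic bit reversal of an m-bit index.
std::size_t reverse_digits(std::size_t k, unsigned m, std::size_t r) {
    assert(r >= 2);
    std::size_t reversed = 0;
    for (unsigned i = 0; i < m; ++i) {
        reversed = reversed * r + (k % r);  // append the lowest digit of k
        k /= r;                             // drop that digit
    }
    return reversed;
}

// Build the whole permutation for N = r^m samples by reversing every index.
std::vector<std::size_t> digit_reversal_permutation(unsigned m, std::size_t r) {
    std::size_t n = 1;
    for (unsigned i = 0; i < m; ++i) n *= r;
    std::vector<std::size_t> brp(n);
    for (std::size_t k = 0; k < n; ++k) brp[k] = reverse_digits(k, m, r);
    return brp;
}
```

For example, `digit_reversal_permutation(4, 2)` yields the 16-element radix-2 permutation that starts 0, 8, 4, 12. Every element costs $m$ divisions and multiplications, which is the per-element work that pattern-based algorithms try to avoid.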
During the last decades, many publications addressed new BRAs [10] by improving the original BRA (BRM) or by using totally different approaches. In 1996, Karp compared the performance of 30 different algorithms [10] on uniprocessor systems (computer systems with a single central processing unit) with different memory systems. Karp found that the performance of a BRA depends on the memory architecture of the machine and the way of accessing the memory. Karp stated two hardware factors that influence the BRA, namely, the memory architecture and the cache size of the machine. According to Karp, a machine with hierarchical memory is slower than a machine with vector memory (computers with a vector processor), and algorithms do not perform well when the array size is larger than the cache size. Karp also pointed out four features of an algorithm that influence the BRA, namely, the memory access technique, the data reading sequence, the size of the index of memory, and the type of arithmetic operations. According to Karp, an algorithm that uses efficient memory access techniques is the fastest among algorithms with exactly the same number of arithmetic operations. Algorithms are faster if (i) they require only a single pass over the data, (ii) they use short indexes, and (iii) they operate with addition instead of multiplication.
Karp especially mentioned that the algorithm published by Elster [11] in 1989 was different from other algorithms, because it used a pattern of the BRP rather than improving decimal-to-binary and binary-to-decimal conversion. According to the findings of Karp, Elster's "Linear Bit Reversal Algorithm" (LBRA) performs much better in most of the cases. The publication of Elster (1989) [11] consists of two algorithms to achieve the BRP. One algorithm used a pattern of the BRP and the other one used bit shifting operations. Both algorithms are interesting because they eliminate the conventional bit reversing mechanism, which needs more computing time. The algorithm by Rubio et al. (BRA-Ru) of 2002 [12] is another approach that uses an existing pattern of the BRP. However, the pattern described in Rubio's algorithm is different from the pattern described by Elster [11]. In 2009, Elster and Meyer published an improved version of the "Linear Register-Level Bit Reversal" of 1989 [11] as the "Elster's Bit Reversal" (EBR) algorithm [2]. Elster mentioned that it is possible to generate the second half of the BRP by incrementing the relevant element of the first half by one. Elster also mentioned that there can be a serious impact on the cache performance of the computer if the said pattern (Figure 1) is used.
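A pattern-based generation of the BRP of this kind can be sketched as follows. The sketch is an assumption-laden illustration of the doubling pattern, not the published code of Elster; the function name `linear_pattern_brp` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Generate the bit reversal permutation of length n (a power of two) by
// repeatedly doubling the filled prefix: each new element is an already
// computed element plus a step that halves from n/2 down to 1.
std::vector<std::size_t> linear_pattern_brp(std::size_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);  // n must be a power of two
    std::vector<std::size_t> brp(n, 0);
    std::size_t step = n / 2;
    for (std::size_t filled = 1; filled < n; filled *= 2, step /= 2) {
        for (std::size_t j = 0; j < filled; ++j) {
            brp[filled + j] = brp[j] + step;  // additions only, no bit reversing
        }
    }
    return brp;
}
```

In the last pass of the outer loop the step equals one, so the second half of the permutation is the first half incremented by one, which is the relation between the two halves noted above.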
Programming languages provide different data structures [13], which handle memory in different ways. In addition, the performance of the memory depends on the machine architecture and the operating system [14]. Therefore, the efficiency of the memory is the combined result of the performance of the hardware, the operating system, the programming language, and the selected data structure.
Based on the physical arrangement of memory elements, there are two common ways of allocating memory for a series of variables: "slot of continuous memory elements" and "collection of noncontinuous memory elements," commonly known as "stack" and "heap" [15]. In most programming languages, the term array is used to refer to a "slot of continuous memory elements." Arrays are the simplest and most common type of data structure [16,17] and, due to the continuous physical arrangement of memory elements, provide faster access than "collection of noncontinuous memory elements" memory types. However, with the development of programming languages, different types of data structures were introduced with names very similar to standard names like array, stack, and heap. The names of the new data structures sometimes did not agree with the commonly accepted meaning. "Stack," "Array," and "ArrayList" provided by Microsoft Visual C++ (VC++) [18,19] are good examples. According to the commonly accepted meaning they should be a "slot of continuous memory elements," but they are in fact a "collection of noncontinuous memory elements." Therefore, it is not good practice to judge the performance of a data structure just by its name. To avoid this ambiguity, we use "slot of continuous memory elements" to refer to "primitive array" (or array) type memory structures.
Due to its very flexible nature, the vector is the most common among the different types of data structures [14]. The vector was introduced with C++, one of the most common and powerful programming languages, which has been in use since 1984 [20]. However, as with the names of most other data structures, the term "vector" is ambiguous: it is also used to refer to the memory in computers with a processor architecture called "vector processor." In this paper, the term vector refers to the vector data structure that is used in the C++ interface.
Index mapping is a technique that can be used to improve the efficiency of an algorithm by reducing the arithmetic load of the algorithm [8]. If $n \in [0, N - 1]$ and $N$ is not prime, $N$ can be defined as $N = \prod_{i=0}^{M-1} N_i$, where $n_i \in [0, N_i - 1]$ and $N_i < N$. This allows the use of the small ranges of $n_i$ instead of the large range of $n$ and maps a function $f(n)$ into a multidimensional function $f'(n_0, n_1, \ldots, n_{M-1})$.
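As an illustration of index mapping, the following sketch splits a flat index into one small index per factor and back. It assumes a mixed-radix decomposition with the first factor varying fastest, which is only one of several possible maps; the function names `to_multi_index` and `to_flat_index` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split a flat index n in [0, N-1] into one small index per factor of N,
// using a mixed-radix decomposition (an assumed, common choice of map).
std::vector<std::size_t> to_multi_index(std::size_t n,
                                        const std::vector<std::size_t>& factors) {
    std::vector<std::size_t> idx(factors.size());
    for (std::size_t i = 0; i < factors.size(); ++i) {
        assert(factors[i] > 0);
        idx[i] = n % factors[i];  // n_i lies in [0, N_i - 1]
        n /= factors[i];
    }
    return idx;
}

// Inverse map: rebuild the flat index from the per-factor indices.
std::size_t to_flat_index(const std::vector<std::size_t>& idx,
                          const std::vector<std::size_t>& factors) {
    assert(idx.size() == factors.size());
    std::size_t n = 0;
    for (std::size_t i = factors.size(); i-- > 0;) {
        n = n * factors[i] + idx[i];
    }
    return n;
}
```

With $N = 16 = 4 \times 4$, the flat index 11 maps to the pair (3, 2), since $11 = 3 + 4 \cdot 2$, and `to_flat_index` recovers 11 from that pair.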
There are two common methods of implementing index mapping: one-dimensional or multidimensional memory structures. In addition, it is also possible to implement the index mapping using several equal-size one-dimensional memory structures. However, this option is not popular because it is inconvenient to program. The performance of modern computers is highly dependent on the effectiveness of the cache memory of the central processing unit (CPU) [4]. To achieve the best performance for a series of memory elements, the best practice is to maintain sequential access [4]. Otherwise, the effectiveness of the cache memory of the CPU will be reduced. Index mapping with multidimensional data structures violates sequential access due to switching between columns/rows and thus reduces the effectiveness of the cache memory. Therefore, it is generally accepted that the use of a multidimensional data structure reduces computer performance [4].
In this paper, an efficient BRA is introduced to create the BRP based on multiple memory structures and recursive patterns of the BRP. The findings of this paper show that the combination of multiple one-dimensional memory structures, index mapping, and the recursive pattern of the BRP can be used to improve the efficiency of the BRA. These findings are important to the field of signal processing as well as to any field that involves index mapping techniques.

New Algorithm (BRA-Split).
Elster stated that it is possible to generate the second half of the BRP by incrementing items in the first half by one [2] (Figure 1), without changing the order and the total number of calculations of the algorithm. Due to the recursive pattern of the BRP, it can be further divided into equal-size blocks by splitting each block recursively (at most $\log_2 N$ times). After splitting $s$ times, the BRP is divided into $2^s$ equal blocks, each containing $N/2^s$ elements. The relation between the elements in blocks is given as follows:
$$B(2^t + j)[i] = B(j)[i] + 2^{s-t-1}, \quad (3)$$
where $B(2^t + j)[i]$ is the $i$th element of block $(2^t + j)$ and $t = 0, 1, \ldots, s - 1$, $j = 1, \ldots, 2^t$. Table 1 shows the relationship between elements in blocks according to the index mapping shown in (3), after splitting the BRP one time and two times for $N$ = 16. Depending on the requirement, the number of splits can be increased.
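The block relation can be sketched in C++ as follows. This is a minimal illustration rather than the implementation evaluated in this paper: it assumes 0-based block numbers, stores each block in its own `std::vector`, and derives the constant offset of every block from the recursive pattern (each newly created block equals an existing block plus a power of two). The function name `split_brp` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fill 2^s separate one-dimensional blocks so that their concatenation is the
// bit reversal permutation of length n. Names are illustrative only.
std::vector<std::vector<std::size_t>> split_brp(std::size_t n, unsigned s) {
    assert(n != 0 && (n & (n - 1)) == 0);  // n must be a power of two
    const std::size_t blocks = std::size_t{1} << s;
    assert(blocks <= n);
    const std::size_t len = n / blocks;    // elements per block

    // Offset of every block relative to block 0, from the recursive relation:
    // block (2^t + j) = block j + 2^(s-t-1), here with 0-based block numbers.
    std::vector<std::size_t> offset(blocks, 0);
    for (unsigned t = 0; t < s; ++t) {
        const std::size_t half = std::size_t{1} << t;
        for (std::size_t j = 0; j < half; ++j) {
            offset[half + j] = offset[j] + (std::size_t{1} << (s - t - 1));
        }
    }

    std::vector<std::vector<std::size_t>> b(blocks, std::vector<std::size_t>(len));
    for (std::size_t q = 0; q < blocks; ++q) b[q][0] = offset[q];

    // Block 0 follows the doubling pattern; every other block receives the
    // same element plus its offset in the same pass over the index i.
    std::size_t step = n / 2;
    for (std::size_t filled = 1; filled < len; filled *= 2, step /= 2) {
        for (std::size_t i = 0; i < filled; ++i) {
            const std::size_t base = b[0][i] + step;
            for (std::size_t q = 0; q < blocks; ++q) {
                b[q][filled + i] = base + offset[q];
            }
        }
    }
    return b;
}
```

For $N$ = 16 and two splits, the sketch produces the four blocks (0, 8, 4, 12), (2, 10, 6, 14), (1, 9, 5, 13), and (3, 11, 7, 15), whose concatenation is the BRP. Only the elements of the first block require the doubling arithmetic; the remaining blocks are obtained by one addition per element.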

Evaluation Process of New Algorithm.
To evaluate the algorithms, we used Windows 7 and Visual C++ 2012 on a PC with a multicore CPU (4 cores, 8 logical processors) and 12 GB memory. Detailed specifications of the PC and the software are given in Table 2. To eliminate limits of memory and address space related to the selected platform, the compiler option "/LARGEADDRESSAWARE" was set [21] and the platform was set to "x64." All other options of the operating system and the compiler were kept unchanged.
The new algorithm was implemented using a single one-dimensional memory structure and the most common multidimensional memory structure. Furthermore, the new BRA was implemented using several equal-size one-dimensional memory structures (multiple memory structure). The next task was to identify a suitable data structure from the different types of available data structures. We considered several common techniques, as summarized in Table 3. Data structure 1 in Table 3 does not support dynamic memory allocation (the size of the array must be given when the array is declared). A general bit reversal algorithm requires dynamic memory allocation to cater for different sample sizes. Even after setting the compiler option "/LARGEADDRESSAWARE" [21], data structures 3 and 4 in Table 3 did not support accessing memory greater than 2 GB. Therefore, structures 1, 3, and 4 were rejected, and memory structures 2 (array) and 5 (vector) were used to create all one-dimensional memory structures. The same versions of array and vector were used to create the multidimensional memory structures.
The new algorithm mentioned in Section 2.1 was implemented in C++ with 24 types of memory structures, as shown in Table 4. The performance of these algorithms was evaluated considering the "clocks per element" (CPE) consumed by each algorithm. To retrieve this value, first, the average CPE for each sample size from $2^{21}$ to $2^{31}$ (11 sample sizes) was calculated after executing each algorithm 100 times. This gave 11 CPE values, one for each sample size. Finally, the combined average of CPE was calculated for each algorithm by averaging those 11 values, along with the "combined standard deviation." The combined average of CPE was considered as the CPE for each algorithm. The built-in "clock" function of C++ was used to count the clocks. The combined standard deviation was calculated using the following:
$$S_c = \sqrt{\frac{\sum_{i=1}^{k} n_i \left( S_i^2 + (\bar{x}_i - \bar{x}_c)^2 \right)}{\sum_{i=1}^{k} n_i}}, \quad (4)$$
where $\bar{x}_c = \sum_{i=1}^{k} n_i \bar{x}_i / \sum_{i=1}^{k} n_i$, $k$ is the number of samples, $n_i$ is the number of observations in sample $i$, and $S_i$ is the standard deviation of sample $i$. Algorithms 1, 2, and 3 illustrate the implementation of the new BRA with a single one-dimensional memory structure, a multidimensional memory structure, and multiple memory structures, respectively. The algorithm illustrated in Algorithm 1 (BRA Split 1 1A) was implemented using a primitive array for split = 1. The algorithm BRM Split 2 4A (Algorithm 2) was implemented using vectors for split = 2. The algorithm BRM Split 2 4A (Algorithm 3) was implemented using a primitive array for split = 2. A sample permutation filling sequence of algorithms with a single one-dimensional memory structure is illustrated in Figure 2. Figure 3 illustrates a sample permutation filling sequence of both multidimensional and multiple memory structures.
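The combination of the per-sample-size statistics can be sketched as follows. The sketch assumes the population form of the combined standard deviation, in which each group contributes its own variance plus the squared distance of its mean from the combined mean; the type `GroupStats` and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct GroupStats {  // summary of one sample size (hypothetical type)
    double count;    // n_i: number of runs in the group
    double mean;     // mean CPE of the group
    double stddev;   // S_i: standard deviation of the group
};

// Weighted (combined) mean of the group means.
double combined_mean(const std::vector<GroupStats>& groups) {
    double weighted = 0.0, total = 0.0;
    for (const GroupStats& g : groups) {
        weighted += g.count * g.mean;
        total += g.count;
    }
    assert(total > 0.0);
    return weighted / total;
}

// Combined standard deviation (assumed population form): within-group
// variance plus the spread of the group means around the combined mean.
double combined_stddev(const std::vector<GroupStats>& groups) {
    const double mean_c = combined_mean(groups);
    double weighted = 0.0, total = 0.0;
    for (const GroupStats& g : groups) {
        const double shift = g.mean - mean_c;
        weighted += g.count * (g.stddev * g.stddev + shift * shift);
        total += g.count;
    }
    return std::sqrt(weighted / total);
}
```

For two groups of two runs each with means 1 and 3 and zero internal deviation, the combined mean is 2 and the combined standard deviation is 1, which equals the population standard deviation of the pooled values 1, 1, 3, 3.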
Secondly, arithmetic operations per element (OPPE) were calculated for each algorithm. Arithmetic operations within each algorithm were located in three regions of the code: the inner FOR loop, the outer FOR loop, and outside of the loops. Then, the total number of operations (OP) can be defined as in (5), where $C_1$, $C_2$, and $C_3$ are the numbers of operations in the inner FOR loop, the outer FOR loop, and outside of the loops, and $I_1$ and $I_2$ are the numbers of iterations of the outer loop and the inner loop. Equation (5) can be represented in terms of NS and $s$ as in (6), where NS is the number of samples and $s$ is the number of splits.
To evaluate the performance of new BRA, we selected three algorithms (LBRA, EBR, and BRA-Ru) which used a pattern instead of conventional bit reversing method.The performance of vector and array versions of the best version of new BRA was compared with the relevant versions of selected algorithms.

Results and Discussion
Our objective was to introduce a BRA using the recursive pattern of the BRP that we identified. We used multiple memory structures, which is a feasible yet unpopular technique to implement index mapping. According to Table 5, the numbers of operations in all the array and vector versions of both multidimensional and multiple memory structures are the same. Also, Figure 4 shows a continuous decrease of OPPE when the number of splits increases. On that basis, the algorithm with the highest number of splits, and thus the lowest number of operations, would be expected to be the most efficient.
However, the results for CPE (Figure 5) show that the new algorithm with four memory structures of array is the fastest and most consistent in the selected range. The two-, four-, eight-, and sixteen-array implementations of the new BRA reported 25%, 34%, 33%, and 18% higher efficiency, respectively, in relation to the array version of LBRA. The algorithm with eight memory structures has nearly the same CPE as the four-array and four-vector versions, but is less consistent. On the other hand, the four-vector implementation of the new algorithm is the fastest and most consistent among all vector versions. The two-, four-, eight-, and sixteen-vector implementations of the new BRA reported 13%, 16%, and 16% higher and 23% lower efficiency, respectively, in relation to the vector version of LBRA. This result shows that, up to a certain point, multiple memory structures give the best performance in the considered domain. Also, the use of multiple memory structures of primitive array is a good option for implementing the index mapping of the BRP compared to multidimensional or single one-dimensional memory structures.
Due to the flexible nature of the vector, it is commonly used for implementing algorithms.According to Figure 4 there is no difference in OPPE between array and vector versions.However, our results in Figure 5 show that the vector versions of BRA always required more CPE (44%-142%) than the array version.The structure of vector gives priority to generality and flexibility rather than to execution speed, memory economy, cache efficiency, and code size [22].Therefore, vector is not a good option with respect to efficiency.
The results in Table 5 and Figure 4 show that there is no difference in the number of calculations and OPPE between equal versions of the algorithms with multidimensional and multiple memory structures. The structure and type of calculations are the same for both types. The only difference is the nature of the memory structure: multidimensional or multiple one-dimensional. When CPE is considered, the algorithms with multiple one-dimensional memory structures show 19%-79% higher performance. The reason for this observation is that the memory access of a multidimensional memory structure is less efficient than that of a one-dimensional memory structure [22].
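The difference between the two kinds of memory structure can be illustrated for one split as follows. This is a sketch under stated assumptions, not the code measured in this paper: variant A keeps the two halves of the BRP in two independent primitive arrays, variant B keeps them in one two-dimensional vector, and both perform identical arithmetic. The names `TwoArrays`, `fill_two_arrays`, and `fill_two_dimensional` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Variant A: two independent primitive arrays (multiple one-dimensional
// memory structures), one per half of the permutation.
struct TwoArrays {
    std::unique_ptr<std::size_t[]> first;
    std::unique_ptr<std::size_t[]> second;
    std::size_t len;
};

TwoArrays fill_two_arrays(std::size_t n) {
    assert(n >= 2 && (n & (n - 1)) == 0);  // n must be a power of two
    const std::size_t len = n / 2;
    TwoArrays out{std::unique_ptr<std::size_t[]>(new std::size_t[len]),
                  std::unique_ptr<std::size_t[]>(new std::size_t[len]), len};
    out.first[0] = 0;
    out.second[0] = 1;
    std::size_t step = n / 2;
    for (std::size_t filled = 1; filled < len; filled *= 2, step /= 2) {
        for (std::size_t i = 0; i < filled; ++i) {
            out.first[filled + i] = out.first[i] + step;
            out.second[filled + i] = out.first[filled + i] + 1;  // second half
        }
    }
    return out;
}

// Variant B: the same arithmetic, stored in one two-dimensional structure.
std::vector<std::vector<std::size_t>> fill_two_dimensional(std::size_t n) {
    assert(n >= 2 && (n & (n - 1)) == 0);
    const std::size_t len = n / 2;
    std::vector<std::vector<std::size_t>> out(2, std::vector<std::size_t>(len));
    out[0][0] = 0;
    out[1][0] = 1;
    std::size_t step = n / 2;
    for (std::size_t filled = 1; filled < len; filled *= 2, step /= 2) {
        for (std::size_t i = 0; i < filled; ++i) {
            out[0][filled + i] = out[0][i] + step;
            out[1][filled + i] = out[0][filled + i] + 1;
        }
    }
    return out;
}
```

Both variants execute the same additions per element, so any difference in CPE between them comes from how the elements are addressed and laid out in memory rather than from the operation count.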
We agree with the statement of Elster about index mapping of the BRP [2] and with the generally accepted fact (the use of multidimensional memory structures reduces the performance of a computer) [4] only with respect to multidimensional memory structures of vector. Our results show that even with multidimensional arrays there are situations where the new BRA performs better than the same type of one-dimensional memory structure. The four-, eight-, and sixteen-dimensional array versions of the new BRA perform 8%, 10%, and 2% better than the one-dimensional array version of the new BRA. Some results for the single one-dimensional memory structure implementation of the new BRA are also not in agreement with the generally accepted idea. For example, at sample size = $2^{31}$, the two-dimensional vector version of the new BRA (BRA Split 1 2DV) reported $5.42 \times 10^{-5}$ CPE, which is 389% higher than the average CPE over the sample size range of $2^{21}$ to $2^{30}$. The inconsistency was also very high. Therefore, we excluded the values related to sample size = $2^{31}$ for the two-dimensional vector version.
We observed very high memory utilization with the two-dimensional vector version, especially with sample size = $2^{31}$. The Windows task manager showed that the memory utilization of all the considered algorithms was nearly the same for all sample sizes except for the multidimensional versions of vector. The multidimensional version of vector initially utilizes more memory and then drops down to the normal value. The results for sample size = $2^{30}$ showed that the extra memory requirement of the two-dimensional vector was higher than that of the four-dimensional vector. Based on that, it can be predicted that BRA Split 1 2DV needs an extra 3 GB (13 GB in total) for normal execution at sample size = $2^{31}$, but the total memory barrier of 12 GB of the machine slows the process down. The most likely reason for this observation is the influence of the memory pooling mechanism. When defining a memory structure, it is possible to allocate the required amount of memory in advance. This is known as memory pooling [23]. We used memory pooling in all the memory structures used in the algorithms discussed in this paper. Memory pooling allocates the required memory at startup and divides this block into small chunks. Without memory pooling, memory becomes fragmented, and accessing fragmented memory is inefficient. When the existing memory is not sufficient for the allocation, the system switches to fragmented memory for allocating memory elements. In the considered situation, the existing memory (12 GB) is not sufficient for allocating the required amount (13 GB), so the allocation switches to fragmented memory.
The total cache size of the machine is 8.25 MB, which is less than the minimum memory utilization of the considered algorithms (16 MB to 8 GB) for the sample size range from $2^{22}$ to $2^{31}$. Only the sample size $2^{21}$ occupies 8 MB of memory, which is less than the total cache memory. Except for the BRA Split 4 1V structure, all algorithms reported constant CPE over the entire sample size range. In particular, the best algorithms of each category reported very steady behaviour. This observation is in disagreement with the statement of Karp "that a machine with hierarchical memory does not perform well when array size is larger than the cache size" [10].
A comparison (Figure 6) of the best reported version in the considered domain (the four-memory-structure version) with the selected algorithms shows that the array version of EBR performs the best. The four-array version of the new BRA reported 1% lower performance than the array version of EBR. However, the four-array version of the new BRA reported 34% and 23% higher performance than the array versions of LBRA and BRA-Ru. Also, the four-vector version of the new BRA has the best performance among all the vector versions. It reported 16%, 10%, and 22% higher performance than the vector versions of LBRA, EBR, and BRA-Ru, respectively.

Conclusion and Outlook
The main finding of this paper is the recursive pattern of the BRP and a method of implementing it using multiple memory structures. In particular, with multiple memory structures, the newly identified index mapping performs much better than with multidimensional or single one-dimensional memory structures. Furthermore, the findings of this paper show that the performance of the primitive array is higher than that of the vector type. The result is in disagreement with the statement of Karp "that a machine with hierarchical memory does not perform well when array size is larger than the cache size." Almost all the sample sizes we used were larger than the total cache size of the computer. However, the multiple memory structure and the multidimensional memory structure versions showed reasonably steady performance with those samples. In general, these results show the effects of data structures and memory allocation techniques and open a new window for creating efficient algorithms with multiple memory structures in many other fields where index mapping is involved.
The new bit reversal algorithm with $2^s$ independent memory structures splits the total signal into $2^s$ independent portions and the total FFT process into $s + 1$ levels. These $2^s$ signal portions can then be processed independently by means of $2^s$ independent processes on the first level. On the next level, the results from the previous level, stored in independent memory structures, can be processed with $2^s/2$ processes, and so on, until the last level. Therefore, we suggest using the concept of multiple memory structures in the total FFT process, along with the new algorithm with multiple memory structures and a suitable parallel processing technique. We expect that a proper combination of a parallel processing technique and the new algorithm can achieve higher performance from the FFT process than using the new algorithm only to create the bit reversal permutation. Figure 7 shows such an approach with four (when $s$ = 2) independent memory structures for sample size = 16.
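The level structure suggested above can be sketched as follows, assuming that the number of independent processes halves from one level to the next. The function name `processes_per_level` is hypothetical, and the sketch only computes the schedule; it does not start any processes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of independent processes on each FFT level when the signal is held
// in 2^s memory structures (hypothetical helper, for illustration only).
std::vector<std::size_t> processes_per_level(unsigned s) {
    assert(s < sizeof(std::size_t) * 8);  // keep the shift well defined
    std::vector<std::size_t> plan;
    for (std::size_t p = std::size_t{1} << s; p >= 1; p /= 2) {
        plan.push_back(p);  // 2^s, 2^(s-1), ..., 1 processes
    }
    return plan;
}
```

For $s$ = 2, the schedule has three levels with 4, 2, and 1 processes, which corresponds to the four-memory-structure arrangement for sample size = 16.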

Figure 1: Relation between first and second halves of the BRP for $N$ = 16.

Figure 2: Permutation filling sequence of new BRA with single memory structure for $N$ = 16 and split = 1 (BRA Split 1 1A).

Figure 3: Permutation filling sequence of 4 individual and single 4-dimensional memory structure for $N$ = 16.

Figure 4: Operations per element versus reference algorithms and new algorithm with different $s$ (splits), where the dashed column corresponds to both array and vector versions of the algorithm of Rubio, the cross lines column corresponds to both array and vector versions of the "Linear Bit Reversal" algorithm, the dotted column corresponds to both array and vector versions of the "Elster's Bit Reversal" algorithm, the vertical lines column corresponds to both array and vector versions of the new algorithm in a single one-dimensional memory structure, the inclined lines column corresponds to both array and vector versions of the new algorithm in a single multidimensional memory structure, and the horizontal lines column corresponds to both array and vector versions of the new algorithm in multiple one-dimensional memory structures.

Figure 6: Clocks per element versus best version of new algorithm and selected algorithms, where blue dotted column corresponds to array version and green column corresponds to vector version of the algorithm.

Figure 7: Four-memory-structure version of algorithm and the parallel processes for sample size = 16.

Table 2: Hardware and software specifications of the PC.

Table 5: The number of operations in the inner loop ($C_1$) of each algorithm for different splits ($s$). (The total number of operations in each algorithm ≈ $C_1$.)

Figure 5: Clocks per element (combined average) versus algorithm for the sample size range from $2^{21}$ to $2^{31}$, where the blue dotted column corresponds to the array version of the algorithm, the green column corresponds to the vector version of the algorithm, and the red rhombus corresponds to the minimum reported CPE of the category. * For the vector version, sample size $2^{31}$ was excluded, because at sample size $2^{31}$ it showed a huge deviation due to the memory limitation of the machine.