A Parallel Framework with Block Matrices of a Discrete Fourier Transform for Vector-Valued Discrete-Time Signals

This paper presents a parallel implementation of a kind of discrete Fourier transform (DFT): the vector-valued DFT. The vector-valued DFT is a novel tool to analyze the spectra of vector-valued discrete-time signals. This parallel implementation is developed in terms of a mathematical framework with a set of block matrix operations. These block matrix operations contribute to the analysis, design, and implementation of parallel algorithms on multicore processors. In this work, an implementation and experimental investigation of the mathematical framework are performed using MATLAB with the Parallel Computing Toolbox. We found that there is an advantage to using multicore processors and a parallel computing environment to reduce the high execution time. Additionally, the speedup increases as the number of logical processors and the length of the signal increase.


The Scientific World Journal

The present paper is organized as follows. Section 2 explains the mathematical background on block matrix operations and the discrete Fourier transform. Section 3 defines the concept of the vector-valued DFT for vector-valued discrete-time signals. Section 4 develops a mathematical framework of the vector-valued DFT in terms of block matrix operations for vector-valued discrete-time signals with length $N = KM$. This mathematical framework contributes to the implementation of parallel algorithms. Section 5 explains an implementation and experimental investigation of this mathematical framework using parallel computing on multicore processors with MATLAB. Finally, some conclusions are presented in Section 6.
Throughout the paper, the following notations are used. $\mathbb{Z}_N = \{0, 1, \ldots, N-1\}$ is the additive group of integers modulo $N$, $\mathbb{C}^{M \times N}$ is the space of matrices with $M$ rows and $N$ columns with complex entries, and $\mathbb{C}^{M} = \mathbb{C}^{M \times 1}$. The rows and columns of $\mathbf{A} \in \mathbb{C}^{M \times N}$ are indexed by the elements of $\mathbb{Z}_M$ and $\mathbb{Z}_N$, respectively. $\mathbf{A}(m, n)$, $\mathbf{A}(m, :)$, $\mathbf{A}(:, n)$, and $\mathbf{A}^{T}$ represent entry $(m, n)$, row $m$, column $n$, and the transpose of $\mathbf{A}$, respectively. $\mathbf{I}_N \in \mathbb{C}^{N \times N}$ is the identity matrix.

Block Matrix Operations.
A block matrix $\mathbf{A} \in \mathbb{C}^{MP \times NQ}$ with $M$ row partitions and $N$ column partitions and a block vector $\mathbf{x} \in \mathbb{C}^{NQ}$ with $N$ row blocks are defined as
$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{0,0} & \cdots & \mathbf{A}_{0,N-1} \\ \vdots & \ddots & \vdots \\ \mathbf{A}_{M-1,0} & \cdots & \mathbf{A}_{M-1,N-1} \end{bmatrix}, \qquad \mathbf{x} = \begin{bmatrix} \mathbf{x}_{0} \\ \vdots \\ \mathbf{x}_{N-1} \end{bmatrix},$$
respectively, where $\mathbf{A}_{m,n} \in \mathbb{C}^{P \times Q}$ designates block $(m, n)$ and $\mathbf{x}_{n} \in \mathbb{C}^{Q}$ designates block $n$. In this paper, the following block matrix operations are used: Kronecker product, direct sum, stride permutation, vec operator, and vec inverse operator. The Kronecker product of two matrices $\mathbf{A} \in \mathbb{C}^{M \times N}$ and $\mathbf{B} \in \mathbb{C}^{P \times Q}$ is defined as $\mathbf{A} \otimes \mathbf{B} \in \mathbb{C}^{MP \times NQ}$; it replaces every entry $(m, n)$ of $\mathbf{A}$ by the matrix $\mathbf{A}(m, n)\mathbf{B}$. In the special case $\mathbf{A} = \mathbf{I}_M$, the product $\mathbf{I}_M \otimes \mathbf{B}$ is called a parallel operation [12].
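The parallel operation can be illustrated numerically. The sketch below (Python with NumPy, used here only for illustration, since the paper's implementation is in MATLAB) checks that $\mathbf{I}_M \otimes \mathbf{B}$ is block diagonal and therefore applies $\mathbf{B}$ independently to each block of a vector, which is the source of parallelism:

```python
import numpy as np

# Kronecker product: A ⊗ B replaces entry A[m, n] by the block A[m, n] * B
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])
AB = np.kron(A, B)           # shape (4, 4)

# Parallel operation: I_M ⊗ B is block diagonal, so it applies B
# to each length-2 block of x independently
M = 2
P = np.kron(np.eye(M), B)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = P @ x                    # equals [B @ x[0:2], B @ x[2:4]] stacked
print(y)                     # [2. 1. 4. 3.]
```

Since B swaps the two entries of each block, the two blocks of x are swapped independently, as a block diagonal operator must.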
The direct sum of matrices constructs a block diagonal matrix from a set of matrices; that is, for $\{\mathbf{C}_k\}_{k \in \mathbb{Z}_M}$ such that $\mathbf{C}_k \in \mathbb{C}^{P_k \times Q_k}$:
$$\bigoplus_{k \in \mathbb{Z}_M} \mathbf{C}_k = \begin{bmatrix} \mathbf{C}_0 & & \\ & \ddots & \\ & & \mathbf{C}_{M-1} \end{bmatrix} \in \mathbb{C}^{P \times Q},$$
where $P = \sum_{k \in \mathbb{Z}_M} P_k$ and $Q = \sum_{k \in \mathbb{Z}_M} Q_k$. Let $N = KM$. The stride permutation matrix is defined as $L^{N}_{K} \in \mathbb{C}^{N \times N}$ such that it permutes the elements of the input signal $\mathbf{x} \in \mathbb{C}^{N}$ as $iK + j \mapsto jM + i$, $i \in \mathbb{Z}_M$, and $j \in \mathbb{Z}_K$ [12,14]. This matrix permutation governs the data flow required to parallelize a Kronecker product computation [12]. We clarify that the superscript $N$ in $L^{N}_{K}$ is an index, not a power.
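A minimal sketch of the stride permutation (Python/NumPy, illustrative only, under the index convention just stated): for $N = KM$, $L^{N}_{K}$ gathers the input at stride $K$, which is exactly the reordering that groups data for the block diagonal stages of the factorizations below.

```python
import numpy as np

def stride_perm(N, K):
    """Stride permutation L^N_K for N = K*M: sends input index i*K + j
    to output index j*M + i, with i in Z_M and j in Z_K."""
    M = N // K
    L = np.zeros((N, N), dtype=int)
    for i in range(M):
        for j in range(K):
            L[j * M + i, i * K + j] = 1
    return L

# Example: N = 6, K = 2 — even-indexed entries come first, then the odd ones
x = np.arange(6)
print(stride_perm(6, 2) @ x)   # [0 2 4 1 3 5]
```

Being a permutation matrix, $L^{N}_{K}$ is orthogonal, so its transpose is its inverse.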
The vec operator, $V \colon \mathbb{C}^{M \times N} \to \mathbb{C}^{MN}$, transforms a matrix into a vector by stacking all the columns of the matrix one underneath the other. Conversely, the vec inverse operator, $R_{M,N} \colon \mathbb{C}^{MN} \to \mathbb{C}^{M \times N}$, transforms a vector of dimension $MN$ into a matrix of size $M \times N$.
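In NumPy terms (illustrative sketch; the paper works in MATLAB, where `A(:)` plays the same role), vec is a column-major reshape and the vec inverse is its converse:

```python
import numpy as np

# vec stacks the columns of a matrix one underneath the other;
# in NumPy this is reshape in Fortran (column-major) order.
A = np.array([[1, 2],
              [3, 4],
              [5, 6]])            # A is 3 x 2
v = A.reshape(-1, order='F')     # vec(A) = [1 3 5 2 4 6]

# vec inverse R_{M,N} maps a vector of dimension M*N back to an M x N matrix
R = v.reshape(3, 2, order='F')
print(v)                          # [1 3 5 2 4 6]
print(np.array_equal(R, A))       # True
```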

Discrete Fourier Transform.
Let $\ell^2(\mathbb{Z}_N)$ be the set of $\mathbb{C}$-valued signals on $\mathbb{Z}_N$; that is, $\mathbf{x} \in \ell^2(\mathbb{Z}_N)$ if and only if $\mathbf{x} \in \mathbb{C}^{N}$ [9]. Additionally, each signal is extended periodically: for each $k_1 \in \mathbb{Z}$, $\mathbf{x}(k_1) = \mathbf{x}(k_1 \bmod N)$. As mentioned in [14], there are two different approaches to representing the DFT: as matrix-vector products or using summations. Consequently, fast algorithms using parallel computing are represented either with a matrix formalism as in [10,12-14] or with summations as in most signal processing books. Below, the matrix formalism is introduced and used to express the Cooley-Tukey FFT algorithm, corresponding to the decomposition of the transform size $N$ into the product of two factors $K$ and $M$; that is, $N = KM$.
The matrix representation of the DFT of $\mathbf{x}$ is $\hat{\mathbf{x}} = F_N \mathbf{x}$, where $F_N \in \mathbb{C}^{N \times N}$ such that $F_N(m, n) = \omega_N^{-mn}$, with $\omega_N = e^{i 2\pi / N}$. If $N = KM$, then the matrix formalism can be used to express $F_N$ as a factorization into block matrix operations [10,12,13]:
$$F_N = (F_K \otimes \mathbf{I}_M)\, T^{N}_{M}\, (\mathbf{I}_K \otimes F_M)\, L^{N}_{K}. \quad (3)$$
Here, $T^{N}_{M}$ is a diagonal matrix containing the twiddle factors. We clarify that the superscript $N$ is an index, not a power. This factorization of $F_N$ is the matrix representation of the Cooley-Tukey FFT for $N = KM$. In addition, this representation of $F_N$ allows an implementation using parallel computing [14].
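The factorization (3) can be verified numerically for a small size. The sketch below (Python/NumPy, illustrative only) assumes the conventions $F_N(m, n) = \omega_N^{-mn}$ with $\omega_N = e^{i2\pi/N}$, the stride mapping $iK + j \mapsto jM + i$, and one standard explicit form of the twiddle matrix, $T^{N}_{M} = \bigoplus_{j \in \mathbb{Z}_K} \operatorname{diag}(1, \omega_N^{-j}, \ldots, \omega_N^{-j(M-1)})$, since the excerpt does not display $T^{N}_{M}$ itself:

```python
import numpy as np

def dft_matrix(n):
    # F_n(m, k) = w_n^{-mk} with w_n = exp(2*pi*i/n)
    idx = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / n)

def stride_perm(N, K):
    # L^N_K: input index i*K + j -> output index j*M + i (N = K*M)
    M = N // K
    L = np.zeros((N, N))
    for i in range(M):
        for j in range(K):
            L[j * M + i, i * K + j] = 1
    return L

def twiddle(N, K):
    # T^N_M = direct sum over j in Z_K of diag(1, w_N^{-j}, ..., w_N^{-j(M-1)})
    M = N // K
    d = np.concatenate([np.exp(-2j * np.pi * j * np.arange(M) / N)
                        for j in range(K)])
    return np.diag(d)

K, M = 3, 4
N = K * M
lhs = dft_matrix(N)
rhs = (np.kron(dft_matrix(K), np.eye(M))
       @ twiddle(N, K)
       @ np.kron(np.eye(K), dft_matrix(M))
       @ stride_perm(N, K))
print(np.allclose(lhs, rhs))   # True
```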

DFT for Vector-Valued Signals
Based on [2, 6-9, 15, 16], the space of vector-valued discrete-time signals with $N$ samples is defined as $\ell^2(\mathbb{Z}_N, \mathbb{C}^{L}) = \{\mathbf{x} \mid \mathbf{x} \colon \mathbb{Z}_N \to \mathbb{C}^{L}\}$. The vector-valued DFT of $\mathbf{x} \in \ell^2(\mathbb{Z}_N, \mathbb{C}^{L})$ is defined as
$$\hat{\mathbf{x}}(m) = \sum_{n \in \mathbb{Z}_N} W_N^{-mn}\, \mathbf{x}(n), \quad m \in \mathbb{Z}_N, \quad (5)$$
where $W_N \in \mathbb{C}^{L \times L}$ is the matrix kernel. Algorithm 1 shows the implementation of (5). This implementation is a sequential algorithm.
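A direct sequential evaluation of (5) can be sketched as follows (Python/NumPy, an illustrative stand-in for the paper's MATLAB Algorithm 1, not the authors' code). With $L = 1$ and the scalar kernel $W_N = e^{i2\pi/N}$, the transform reduces to the ordinary DFT, which gives a convenient check:

```python
import numpy as np

def vv_dft(x, W):
    """Sequential vector-valued DFT: xhat(m) = sum_n W^{-m*n} x(n),
    where x is an N x L array (N samples, each a vector in C^L)
    and W is the invertible L x L matrix kernel."""
    N, L = x.shape
    Winv = np.linalg.inv(W)
    xhat = np.zeros((N, L), dtype=complex)
    for m in range(N):               # two nested loops: O(N^2) block operations
        for n in range(N):
            xhat[m] += np.linalg.matrix_power(Winv, m * n) @ x[n]
    return xhat

# Check: with L = 1 and W = [[exp(2j*pi/N)]], (5) is the standard DFT
N = 8
x = np.random.default_rng(0).standard_normal((N, 1))
W = np.array([[np.exp(2j * np.pi / N)]])
print(np.allclose(vv_dft(x, W).ravel(), np.fft.fft(x.ravel())))   # True
```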
From the reviewed literature, there are two kinds of kernels for this transform. The first one is the hypercomplex DFT kernel [8], built from a matrix $J \in \mathbb{C}^{L \times L}$ such that $J^2 = -\mathbf{I}_L$. The second one is the DFT frame kernel [7], a diagonal matrix whose diagonal entries are $N$th roots of unity determined by a DFT frame. In this paper, subsets $A \subset \mathbb{Z}^{+}$ with $\operatorname{card}(A) = L$ are used, although such a subset does not necessarily represent a DFT frame.
Lemma 1. Let $N = KM$, and let $W_N$, $W_K$, and $W_M$ be kernels of the same kind (hypercomplex or DFT frame) for sizes $N$, $K$, and $M$, respectively. Then $W_N^{N} = \mathbf{I}_L$, $W_N^{K} = W_M$, and $W_N^{M} = W_K$.

Proof. For the hypercomplex DFT kernel, the proof of each case is similar to the proof for $N$th roots of unity. For the DFT frame kernel, $W_N$ is a diagonal matrix, and then the proof of each case is straightforward.
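Under the assumption that the hypercomplex kernel has the standard form $W_N = \mathbf{I}_L \cos(2\pi/N) + J \sin(2\pi/N)$ (the explicit formula from [8] is not reproduced in this excerpt), the root-of-unity behavior used in the proof can be checked numerically; here $J = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$ is one matrix with $J^2 = -\mathbf{I}_2$:

```python
import numpy as np

def hypercomplex_kernel(N, J):
    # Assumed kernel form: cos(2*pi/N) I + sin(2*pi/N) J, i.e., exp((2*pi/N) J)
    L = J.shape[0]
    return np.cos(2 * np.pi / N) * np.eye(L) + np.sin(2 * np.pi / N) * J

J = np.array([[0.0, -1.0],
              [1.0,  0.0]])      # J^2 = -I_2
K, M = 4, 8
N = K * M
W_N = hypercomplex_kernel(N, J)

# The three identities behind the Cooley-Tukey split (cf. Lemma 1):
print(np.allclose(np.linalg.matrix_power(W_N, N), np.eye(2)))   # W_N^N = I
print(np.allclose(np.linalg.matrix_power(W_N, K),
                  hypercomplex_kernel(M, J)))                   # W_N^K = W_M
print(np.allclose(np.linalg.matrix_power(W_N, M),
                  hypercomplex_kernel(K, J)))                   # W_N^M = W_K
```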

A Parallel Framework for $N = KM$
In this section, the main results of this paper are presented. Firstly, a block matrix representation of the vector-valued DFT is given. Secondly, a new mathematical framework is derived from the matrix representation of the vector-valued DFT, using a block matrix formalism (i.e., Theorem 2). This new result is inspired by the matrix representation of the Cooley-Tukey FFT algorithm for complex discrete-time signals, corresponding to the decomposition of the transform size $N$ into the product of two factors $K$ and $M$, which is developed in [10,12,13]. The result obtained in Theorem 2 is then transformed into a new block matrix representation that contributes to the analysis, design, and implementation of parallel algorithms (i.e., Corollary 3). This new result is inspired by (3). Finally, a computational complexity analysis of the new algorithm is developed.
Similar to the DFT matrix representation explained in Section 2.2, there are two different approaches to representing the vector-valued DFT: as summations (see (5)) or using matrix-vector products. Both approaches allow a parallel implementation. In fact, the proof of Theorem 2 is developed using summation notation.
The vector-valued DFT can be presented as matrix-vector products. The block matrix representation of the vector-valued DFT of $\mathbf{x} \in \ell^2(\mathbb{Z}_N, \mathbb{C}^{L})$ is defined as $\hat{\mathbf{x}} = \mathbf{F}_N \mathbf{x}$, where $\mathbf{F}_N \in \mathbb{C}^{NL \times NL}$ such that $(\mathbf{F}_N)_{m,n} = W_N^{-mn} \in \mathbb{C}^{L \times L}$, for $m, n \in \mathbb{Z}_N$. In this section, a block matrix factorization of $\mathbf{F}_N$ is developed, inspired by (3). First, a generalization of the stride permutation is defined. Let $N = KM$. The block stride permutation matrix [14,17] is defined as $L^{N}_{M,L} \in \mathbb{C}^{NL \times NL}$ such that $L^{N}_{M,L} = L^{N}_{M} \otimes \mathbf{I}_L$, and, for each $\mathbf{x} \in \mathbb{C}^{NL}$ with blocks $\mathbf{x}_n \in \mathbb{C}^{L}$, the operation $L^{N}_{M,L}\mathbf{x}$ permutes the blocks of the input as $iM + j \mapsto jK + i$, $i \in \mathbb{Z}_K$, and $j \in \mathbb{Z}_M$. We clarify that the superscript $N$ is an index, not a power.
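The block stride permutation moves whole length-$L$ blocks rather than scalars. A small sketch (Python/NumPy, illustrative, under the stated index convention) builds $L^{N}_{M,L} = L^{N}_{M} \otimes \mathbf{I}_L$ and confirms it permutes blocks while leaving each block's internal order intact:

```python
import numpy as np

def stride_perm(N, S):
    # Scalar stride permutation L^N_S: index i*S + j -> j*(N//S) + i
    Q = N // S
    P = np.zeros((N, N))
    for i in range(Q):
        for j in range(S):
            P[j * Q + i, i * S + j] = 1
    return P

N, M, L = 6, 3, 2
B = np.kron(stride_perm(N, M), np.eye(L))   # block stride L^N_{M,L}

# Signal with N = 6 blocks of length L = 2: block n is [n, n + 0.5]
x = np.concatenate([[n, n + 0.5] for n in range(N)])
y = B @ x
blocks = [tuple(y[2 * b: 2 * b + 2]) for b in range(N)]
print(blocks)   # blocks appear in stride order 0, 3, 1, 4, 2, 5, each intact
```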

Theorem 2. Let $N = KM$ and let $\mathbf{F}_N \in \mathbb{C}^{NL \times NL}$ be the block matrix of the DFT for vector-valued signals. Then
$$\mathbf{F}_N = \left( L^{N}_{K,L}\, (\mathbf{I}_M \otimes \mathbf{F}_K)\, L^{N}_{M,L} \right) \mathbf{T}^{N}_{M,L}\, (\mathbf{I}_K \otimes \mathbf{F}_M)\, L^{N}_{K,L},$$
where $\mathbf{F}_K$ and $\mathbf{F}_M$ are the block DFT matrices with kernels $W_K = W_N^{M}$ and $W_M = W_N^{K}$, respectively, and $\mathbf{T}^{N}_{M,L} = \bigoplus_{j \in \mathbb{Z}_K} \bigoplus_{i \in \mathbb{Z}_M} W_N^{-ij}$ is the block diagonal twiddle matrix. The grouped factor is the block analogue of $F_K \otimes \mathbf{I}_M$ in (3).

Proof. Let $\mathbf{x} \in \ell^2(\mathbb{Z}_N, \mathbb{C}^{L})$, let $n_1, k_1 \in \mathbb{Z}_M$, and let $n_2, k_2 \in \mathbb{Z}_K$. Writing $n = Kn_1 + n_2$ and $k = k_1 + Mk_2$, the definition (5) gives
$$\hat{\mathbf{x}}(k_1 + Mk_2) = \sum_{n_2 \in \mathbb{Z}_K} \sum_{n_1 \in \mathbb{Z}_M} W_N^{-(k_1 + Mk_2)(Kn_1 + n_2)}\, \mathbf{x}(Kn_1 + n_2).$$
The block vector $\mathbf{y} = (\mathbf{I}_K \otimes \mathbf{F}_M) L^{N}_{K,L}\, \mathbf{x}$ is defined. From Lemma 1, $W_N^{-Kn_1k_1} = W_M^{-n_1k_1}$ and $W_N^{-Nn_1k_2} = \mathbf{I}_L$; then
$$\mathbf{y}(n_2 M + k_1) = \sum_{n_1 \in \mathbb{Z}_M} W_M^{-k_1 n_1}\, \mathbf{x}(Kn_1 + n_2).$$
Now let $\mathbf{z} = \mathbf{T}^{N}_{M,L}\, \mathbf{y}$, so that $\mathbf{z}(n_2 M + k_1) = W_N^{-k_1 n_2}\, \mathbf{y}(n_2 M + k_1)$. Finally, let $\mathbf{w} = L^{N}_{K,L} (\mathbf{I}_M \otimes \mathbf{F}_K) L^{N}_{M,L}\, \mathbf{z}$. From Lemma 1, $W_N^{-Mn_2k_2} = W_K^{-n_2k_2}$; then
$$\mathbf{w}(Mk_2 + k_1) = \sum_{n_2 \in \mathbb{Z}_K} W_K^{-k_2 n_2}\, \mathbf{z}(n_2 M + k_1) = \hat{\mathbf{x}}(k_1 + Mk_2),$$
and therefore $\mathbf{w} = \mathbf{F}_N \mathbf{x}$.

Figure 1: Parallel model of the vector-valued DFT for $\mathbf{x} \in \ell^2(\mathbb{Z}_N, \mathbb{C}^{L})$, $N = KM$, using a matrix representation.

Corollary 3. Let $N = KM$ and let $\mathbf{F}_N \in \mathbb{C}^{NL \times NL}$ be the block matrix of the DFT for vector-valued signals. Then
$$\mathbf{F}_N = L^{N}_{K,L}\, (\mathbf{I}_M \otimes \mathbf{F}_K)\, L^{N}_{M,L}\, \mathbf{T}^{N}_{M,L}\, (\mathbf{I}_K \otimes \mathbf{F}_M)\, L^{N}_{K,L}, \quad (15)$$
where $\mathbf{T}^{N}_{M,L}$ was defined in Theorem 2. Every factor in (15) is either a block stride permutation or a block diagonal matrix, which contributes to the analysis, design, and implementation of parallel algorithms.
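The block factorization can be checked numerically for a small case. The sketch below (Python/NumPy, illustrative only; it assumes the hypercomplex kernel form $\cos(2\pi/N)\mathbf{I} + \sin(2\pi/N)J$, the stride mapping $iS + j \mapsto j(N/S) + i$, and the twiddle $\bigoplus_{j}\bigoplus_{i} W_N^{-ij}$ stated above) builds $\mathbf{F}_N$ directly from kernel powers and compares it with the right-hand side of (15) for $N = 6 = 2 \cdot 3$ and $L = 2$:

```python
import numpy as np
from numpy.linalg import matrix_power, inv

def stride_perm(N, S):
    # Scalar stride permutation L^N_S: index i*S + j -> j*(N//S) + i
    Q = N // S
    P = np.zeros((N, N))
    for i in range(Q):
        for j in range(S):
            P[j * Q + i, i * S + j] = 1
    return P

def block_dft(n, W):
    # Block DFT matrix F_n: block (m, k) of size L x L is W^{-m*k}
    L = W.shape[0]
    Winv = inv(W)
    F = np.zeros((n * L, n * L), dtype=complex)
    for m in range(n):
        for k in range(n):
            F[m*L:(m+1)*L, k*L:(k+1)*L] = matrix_power(Winv, m * k)
    return F

def block_twiddle(K, M, W):
    # T^N_{M,L}: direct sum over j in Z_K, i in Z_M of the block W^{-i*j}
    L = W.shape[0]
    Winv = inv(W)
    T = np.zeros((K * M * L, K * M * L), dtype=complex)
    for j in range(K):
        for i in range(M):
            b = j * M + i
            T[b*L:(b+1)*L, b*L:(b+1)*L] = matrix_power(Winv, i * j)
    return T

J = np.array([[0.0, -1.0], [1.0, 0.0]])      # J^2 = -I
K, M, L = 2, 3, 2
N = K * M
W_N = np.cos(2*np.pi/N) * np.eye(L) + np.sin(2*np.pi/N) * J
W_K = matrix_power(W_N, M)    # Lemma 1: W_N^M = W_K
W_M = matrix_power(W_N, K)    # Lemma 1: W_N^K = W_M

LK = np.kron(stride_perm(N, K), np.eye(L))   # block stride L^N_{K,L}
LM = np.kron(stride_perm(N, M), np.eye(L))   # block stride L^N_{M,L}

lhs = block_dft(N, W_N)
rhs = (LK @ np.kron(np.eye(M), block_dft(K, W_K)) @ LM
       @ block_twiddle(K, M, W_N)
       @ np.kron(np.eye(K), block_dft(M, W_M)) @ LK)
print(np.allclose(lhs, rhs))   # True
```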

Computational Complexity Analysis.
In this section, the computational complexity analysis of (15) is developed. First, consider the matrix operation $L^{N}_{K,L}\mathbf{x}$. The computational complexity (CC) of $L^{N}_{K,L}\mathbf{x}$ is $O(N^2)$ [8] if it is computed as the multiplication between a block matrix in $\mathbb{C}^{NL \times NL}$ and a block vector in $\mathbb{C}^{NL}$. But $L^{N}_{K,L}\mathbf{x}$ is a permutation and can be implemented with a CC of $O(N)$ (see, e.g., [12,14]). Let $\mathbf{F}_N \in \mathbb{C}^{NL \times NL}$ be the block matrix and let $\mathbf{x} \in \ell^2(\mathbb{Z}_N, \mathbb{C}^{L})$ be a vector-valued signal, where $N = KM$. It is known that the CC of the direct operation $\hat{\mathbf{x}} = \mathbf{F}_N \mathbf{x}$ is $O(N^2 L^2) = O(K^2 M^2 L^2)$. Now consider the operation $\hat{\mathbf{x}} = \mathbf{F}_N \mathbf{x}$ using (15). If we consider each matrix-vector multiplication, we obtain the following:
(1) The CC of $\mathbf{y}_1 = L^{N}_{K,L}\, \mathbf{x}$ is $O(N)$.
(2) The CC of $\mathbf{y}_2 = (\mathbf{I}_K \otimes \mathbf{F}_M)\, \mathbf{y}_1$ is $O(NML^2)$, because it is a block diagonal matrix multiplication.
(3) The CC of $\mathbf{y}_3 = \mathbf{T}^{N}_{M,L}\, \mathbf{y}_2$ is $O(NL^2)$, because $\mathbf{T}^{N}_{M,L}$ is a block diagonal matrix with blocks of size $L \times L$.
(4) The CC of $\mathbf{y}_4 = L^{N}_{M,L}\, \mathbf{y}_3$ is $O(N)$.
(5) The CC of $\mathbf{y}_5 = (\mathbf{I}_M \otimes \mathbf{F}_K)\, \mathbf{y}_4$ is $O(NKL^2)$, because it is a block diagonal matrix multiplication.
(6) The CC of $\mathbf{y}_6 = L^{N}_{K,L}\, \mathbf{y}_5$ is $O(N)$.
Therefore, the CC of $\mathbf{F}_N \mathbf{x}$ using (15) is $O(N(K + M)L^2)$. Thus, the CC of the direct operation $\mathbf{F}_N \mathbf{x}$ is $O(K^2 M^2 L^2)$, while the CC of $\mathbf{F}_N \mathbf{x}$ using (15) is $O(KM(K + M)L^2)$, which shows the efficiency of the matrix formulation in (15). The implementation of Algorithms 1 and 2 to compute the vector-valued DFT is performed using MATLAB. Algorithm 2 is computed using the Parallel Computing Toolbox. MATLAB offers built-in multithreading and parallelism using MATLAB workers; in this work, parallelism using MATLAB workers is used. Multiple MATLAB workers (MATLAB computational engines) can run on a multicore computer to execute applications in parallel with the Parallel Computing Toolbox. This approach allows more control over the parallelism than built-in multithreading. With programming constructs such as parallel for-loops (parfor) and batch, we write the parallel MATLAB programs of the parallel framework for the vector-valued DFT.
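The same idea can be sketched outside MATLAB. The following snippet (Python with `concurrent.futures`, a hypothetical analogue of the paper's parfor-based Algorithm 2, not the authors' code) distributes the $K$ independent sub-transforms of a block diagonal stage such as $(\mathbf{I}_K \otimes \mathbf{F}_M)\,\mathbf{y}_1$ across worker threads:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_block_diag_apply(F_M, blocks, workers=4):
    """Apply the block diagonal matrix I_K (x) F_M: each of the K blocks
    is an independent product F_M @ blocks[j], so they can run in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: F_M @ b, blocks))

# Tiny demonstration: K = 4 independent blocks of length 6
rng = np.random.default_rng(1)
F_M = rng.standard_normal((6, 6))
blocks = [rng.standard_normal(6) for _ in range(4)]
par = parallel_block_diag_apply(F_M, blocks)
seq = [F_M @ b for b in blocks]
print(all(np.allclose(p, s) for p, s in zip(par, seq)))   # True
```

Threads are used here because NumPy releases the interpreter lock during matrix products; a process pool would mirror MATLAB workers more closely at the cost of data transfer.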

Results and Discussion
Let $t^{*}$ be the execution time of Algorithm 1 without any parallel implementation, and let $t_c$ be the execution time of Algorithm 2, where $c$ is the number of cores. The value of $t_c$ is expected to be less than $t^{*}$ for two reasons: Algorithm 2 has a parallel implementation, and the matrix multiplication sizes are different. Algorithm 2 is computed with matrices in $\mathbb{C}^{KL \times KL}$ and $\mathbb{C}^{ML \times ML}$, whereas Algorithm 1 is computed with matrices in $\mathbb{C}^{NL \times NL}$, where $N = KM$.
The computational performance of Algorithm 2 is evaluated using the metrics speedup (or acceleration) and efficiency. The speedup is the ratio between the execution time of the parallel implementation with one core and the execution time with two or more cores [18]; it is represented by the formula $S_c = t_1 / t_c$. The efficiency estimates how well utilized the processors are in solving the problem, compared to how much effort is wasted in communication and synchronization [18]; it is determined by the ratio between the speedup and the number of processing elements, represented by the formula $E_c = S_c / c = t_1 / (c\, t_c)$. Table 1 shows the execution time, in seconds (s), of both algorithms. A significant reduction in the parallel execution time of the vector-valued DFT is observed. Table 1 shows that Algorithm 1 with the hypercomplex kernel for a Wiener CAZAC signal in $\ell^2(\mathbb{Z}_{8192}, \mathbb{C}^{5})$ produces a serial execution time of $t^{*} = 13408$ s. Using Algorithm 2, however, we obtain $t_1 = 106.7$ s (0.80% of $t^{*}$), $t_2 = 80.44$ s (0.60% of $t^{*}$), $t_3 = 57.35$ s (0.43% of $t^{*}$), and $t_4 = 32.67$ s (0.24% of $t^{*}$). This result shows the advantage of using multicore processors and a parallel computing environment to minimize the high execution time of the vector-valued DFT. This is because parallel computing is a form of computation in which many calculations are carried out simultaneously [19,20], operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently, minimizing the execution time [20,21]. The difference between $t^{*}$ and $t_1$ is because $t_1$ is computed with matrices in $\mathbb{C}^{KL \times KL}$
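Using the Algorithm 2 times quoted above from Table 1, the speedup and efficiency formulas can be evaluated directly (plain Python arithmetic, shown only to make the metric definitions concrete):

```python
# Execution times of Algorithm 2 from Table 1 (hypercomplex kernel,
# Wiener CAZAC signal in l^2(Z_8192, C^5)), in seconds
t = {1: 106.7, 2: 80.44, 3: 57.35, 4: 32.67}

for c in (2, 3, 4):
    S = t[1] / t[c]        # speedup    S_c = t_1 / t_c
    E = S / c              # efficiency E_c = S_c / c
    print(f"c = {c}: speedup = {S:.2f}, efficiency = {E:.2f}")
```

For four cores this gives a speedup of roughly 3.3, i.e., an efficiency of roughly 0.8.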