Parallel Jacobi EVD Methods on Integrated Circuits

Design strategies for parallel iterative algorithms are presented. In order to further study different tradeoff strategies in design criteria for integrated circuits, A 10 × 10 Jacobi Brent-Luk-EVD array with the simplified μ-CORDIC processor is used as an example.The experimental results show that using the μ-CORDIC processor is beneficial for the design criteria as it yields a smaller area, faster overall computation time, and less energy consumption than the regular CORDIC processor. It is worth to notice that the proposed parallel EVD method can be applied to real-time and low-power array signal processing algorithms performing beamforming or DOA estimation.


Introduction
We are on the edge of many important developments which will require parallel data and information processing.The transmission systems are using higher and higher frequencies and the carrier frequencies are increasing to 10 GHz and above.Because of the smaller wavelength more antennas can be implemented on a single device leading to massive MIMO systems.Parallel VLSI architectures will be needed in order to provide the required computational power for 10 GHz and above, massive MIMO, and big data processing [1,2].
In parallel matrix computation at the circuit level, implementing an iterative algorithm on a multiprocessor array results in a tradeoff between the complexity of an iteration step and the number of required iteration steps.Therefore, as long as the algorithm's convergence properties are guaranteed, it is possible to adjust the architecture, which can significantly reduce the complexity with regard to the implementation.Computing the parallel eigenvalue decomposition (EVD) as a preprocessing step to MUSIC or ESPRIT algorithm with Jacobi's iterative method is used as an important example as the convergence of this method is extremely robust to modifications of the processor elements [3][4][5][6].
In [7], it was shown that Brent-Luk-EVD architecture with a modified CORDIC for performing the plane rotation of the Jacobi algorithm can be realized in advanced VLSI design.Based on it, a Jacobi EVD array is realized by implementing a scaling-free microrotation CORDIC (-CORDIC) processor in this paper, which only performs a predefined number of CORDIC iterations.Therefore, the size of the processor array can be reduced for implementing a large-scale EVD array in parallel VLSI architectures.After that, several modifications of the algorithm/processor are studied and their impact on the design criteria is investigated for different sizes of EVD array (10 × 10 to 80 × 80).Finally, a strategy to comply with the design criteria is established, especially in terms of balancing the number of microiterations and the computational complexity.The proposed architecture is ideal for real-time antenna array applications, such as a flying object carrying an antenna array for beamforming or DOA estimation that would require a real-time, low-power, and efficient architecture for EVD, or joint time-delay and frequency estimation using a sensor network.
This paper is organized as follows.Serial and parallel Jacobi methods are described in Section 2. In Section 3, the design issues of the parallel Jacobi EVD array are discussed, leading to the simplification from a regular full CORDIC to the -CORDIC processor with an adaptive number of iterations.Section 4 shows the implementation results.Section 5 concludes this paper.

Parallel Eigenvalue Decomposition
2.1.Jacobi Method.An eigenvalue decomposition of a real symmetric  ×  matrix  is obtained by factorizing  into three matrices  =  ∧   , where  is an orthogonal matrix (  = ) and ∧ is a diagonal matrix containing the eigenvalues of .The Jacobi method approximates the EVD iteratively as follows: where   is an orthonormal plane rotation by the angle  in the (, ) plane.
The plane rotations   , where  = 1, 2, 3, . .., can be executed in various orders to obtain the eigenvalues.The most common order of sequential plane rotations {  } is called cyclic-by-row, meaning (, ) is chosen as follows: The execution of all  = ( − 1)/2 index pairs (, ) is called a sweep.Matrix  will converge into a diagonal matrix ∧ once  sweeps are applied, where ∧ contains the eigenvalues (3) 2.2.Jacobi EVD Array.Instead of performing the plane rotations   one by one in a cyclic-by-row order, they can be separated into multiple subproblems and executed in parallel on a log  dimensional multicore platform.Ahmedsaid et al. [3] first presented a parallel array based on Jacobi's method.It consists of /2×/2 PEs and each PE contains a 2×2 subblock of the matrix . Figure 1 shows a typical 4 × 4 EVD array with 16 PEs.This Jacobi array can perform /2 subproblems in parallel.Initially, each PE holds a 2 × 2 submatrix of : where  and  = 1, 2, . . ., /2.
A rotation angle has to be chosen in order to zero out the off-diagonal elements of the submatrix by solving a 2 × 2 symmetric EVD subproblem as shown in the following: where  = [ cos  − sin  sin  cos  ].The maximal reduction {   ,    } = 0 can be obtained by applying the optimal angle of rotation  opt : where the range of  opt is limited to | opt | ≤ /4.This optimal angle  opt , which can annihilate the offdiagonal elements ( 2−1,2 and  2,2−1 ), is computed using diagonal PEs in (6).Once these rotation angles are computed, they will be sent to the off-diagonal PEs.This transmission is indicated by the dashed lines in the vertical and horizontal direction in Figure 1.All off-diagonal PEs will perform a two-sided rotation with the corresponding rotation angles obtained from the row (  ) and column (  ), respectively.
Once these rotations are applied, the matrix elements are interchanged between processors as indicated by the diagonal solid lines in Figure 1, for the execution of the next /2 rotations.One sweep needs to perform  − 1 of these parallel rotation steps.After several sweeps (iterations) are executed, the eigenvalues will concentrate in the diagonal PEs.

CORDIC Approach
3.1.Regular CORDIC.Within each PE, a simple way to solve the subproblem of ( 5) in VLSI for zeroing out the off-diagonal elements is to use the CORDIC algorithm.An orthogonal CORDIC rotator is defined as [8,9] when  → ∞,   ≅ 1.647.
In the Cartesian coordinate system, the CORDIC orthogonal rotation mode can be used to compute (5) by separating the two-sided rotation into two parts,  = [  1 ;   2 ] =  ⋅   and  ⋅ . ⋅   that is computed by where the plane rotation with the desired rotation angle  opt is executed using two CORDIC rotators.The CORDIC processors apply  steps, usually  = 32 for single floating precision.
A constant scaling value  = 1/  = 0.6073 is subsequently required to fix the rotated vectors in order to retain the orthonormality.Similarly, these two CORDIC rotators can also be applied to compute  ⋅ : Meanwhile, the angle  opt can also be determined by using the CORDIC orthogonal vector mode.The CORDIC rotates the input vector through whatever angle is necessary to align the resulting vector with the -axis: The CORDIC with an orthogonal vector mode can compute the arctangent result iteratively  = arctan(/), if the angle accumulator is initialized with zero ( 0 = 0).
In the VLSI design, two common approaches can be used to realize the CORDIC dependence flow graph in hardware: the folded (serial) or the parallel (pipelining) [10,11].Note that we limit our efforts to the conventional CORDIC iteration scheme, as given in (7).In Figure 2(a), the structure of a folded CORDIC PE is shown, which requires a pair of adders for plane rotation and another adder for steering the next angle direction (computing the following   and   ).All internal variables are buffered in the registers separately until the iteration number is large enough to obtain the result.The signs of all three intermediate variables are fed into a control unit that generates the rotation direction flags   to steer the add or suboperations and keep track of the rotation angle   .For example, off-diagonal PE 43 can directly apply the flags   from PE 33 to (8)'s  1 and PE 44 to (8)'s  2 .After the rotation, the required scaling procedure can be obtained using the part of Figure 2(b) that fixes   , where two multiplexers are required to select the inputs into the barrel shifters.This folded dependence graph is typical for the orthogonal rotation mode and benefits in a small area in the VLSI design.
In practice, the angle accumulator is not required for the off-diagonal PEs.The   from ( 7) can be used to steer the rotators.Thus, the transmission on the vertical and horizontal dashed lines in Figure 1 will be replaced by a sequence of   flags, meaning that the off-diagonal computation efforts for computing the optimal angle  opt can be omitted.

Simplified 𝜇-Rotation CORDIC.
As the process technologies continue to shrink, it becomes possible to directly implement a Brent-Luk-EVD array with the Jacobi method [12,13].However, the size of the EVD array that can be implemented on the current configurable device with the regular CORDIC is still small, say, 4 × 4. Therefore, we must simplify the architecture in order to integrate more processors.A scaling-free -CORDIC for performing the plane rotation in ( 5) is used [5,6], where the number of inner iterations is reduced from 32 iterations to only one iteration.
The definition of -CORDIC can be developed from ( 7) as where m is the required scaling factor per iteration and  is the scaling error.The idea of the -CORDIC rotation is to reduce the number of iterations of the full CORDIC to only a few iterations.Meanwhile, the scaling error  will be small enough to be neglected as long as the orthonormality is retained.Figure 3 shows four different methods for different sizes of -rotation angles and Table 1 shows a lookup table for the -CORDIC, listing 32 approximated rotation angles for each rotation type, the required number of shift-add operations and its computation cycles.Note that the approximated angles are stored as two times of tan .When the rotation angle is very tiny (i.e.,  is tiny, too), Type I with only one iteration will comply with the limited working range 1 − 2 −(  +1) < m < 1 + 2 −(  +1) , if the selected   (  ∈ 1 ⋅ ⋅ ⋅ 32) is larger than 16.In Figure 3(a), a pair of shift-add operations realizing one iteration step is sufficient.Furthermore, it is scaling free when the angle 2 × tan  ≤ 3.05176 × 10 −5 .These orthonormal -rotations are chosen such that they satisfy a predefined accuracy condition in order to approximate the original rotation angles and are constructed by the least computation efforts.
Next, for the Type II rotation (as shown in Figure 3(b)), when   is selected from 8 to 15 for small angles, two pairs of shift-add operations are enough to retain the orthonormality.Moreover, when the   is selected from 5 to 7, Type III requires three -rotations.No scaling is required by Types I through III.Finally, for large rotation angles, the scaling errors cannot be omitted.Figure 3  itself, it requires two pairs of shift-add operations at the beginning of the flow graph, while 2 to 4 pairs of shift-add operations are required to fix the scaling factor m: Note that the scaling costs  = 2 to 4 pairs of shift-add operations.In general, the cost of Type IV is bounded by 2 +  pairs of shift-add operations.For example, when the index  is 2, the scaling is These four subtypes have three identical parts: Type I with one iteration, the scaling part of Type IV, and the second iteration of Type II.These three parts can be integrated together by using multiplexers to select the data paths, as shown in Figure 4, where 2 adders, 2 shifters, and 4 multiplexers are required [5].

Adaptive 𝜇-CORDIC Iterations.
To improve the computational efficiency, the -CORDIC has been modified to perform 6 iterations per cycle as CORDIC-6.As the global clock in a synchronous circuit is determined by the critical path, the maximum timing delay per iteration is 6 cycles (when the index  is 1, Type IV).Therefore, the inner iteration steps of the angles are repeated until they are close to the critical one.The required number of repetitions is quoted in Table 1.For example, when the rotation angle index  is 8, it will repeat three times from the index  = 8 to the index  = 10; when the rotation angle index  is 20, it is repeated six times from the index  = 20 to the index  = 25.On the other hand, we can adjust the number of iterations by selecting the average angle during the last sweep and name it as CORDICmean.

Matlab Simulation.
The full CORDIC with 32 iteration steps, the -CORDIC with one iteration step, and two different adaptive modes have been tested using numerous  random symmetric matrices  of size 8 × 8 to 160 × 160 (i.e., EVD array sizes range from 4 × 4 to 80 × 80). Figure 5(a) shows the average number of sweeps needed to compute the eigenvalues/eigenvectors for each size of the EVD array, where the sweep number increases monotonically.
When the Jacobi EVD array size is 10×10, the -CORDIC requires 12 sweeps while the full CORDIC only requires 6 sweeps per EVD computation.If we adjust the inner rotations to six times, the sweep number will be 10, smaller than the -CORDIC but more than the full CORDIC.Note that using the average rotation angle to decide the rotation number as the CORDIC-mean seems to be an unwise method because it requires more sweeps.Although the -CORDIC requires double sweeps than the full CORDIC, it actually reduces the number of the inner CORDIC rotations, which results in improved computational complexity.For example, a 10 × 10 array with the Full CORDIC PE needs 6 sweeps × 32 inner CORDIC rotations and the CORDIC-6 needs 10 sweeps × 6 inner CORDIC rotations whereas the -CORDIC PE requires only 12 sweeps × 1 inner CORDIC rotation.In Figure 5(b), the average number of shift-add operations required for each rotation method for different sizes of EVD arrays is demonstrated whereas -CORDIC needs significantly fewer shift-add operations than others.The adaptive CORDIC-6 method can offer a compromise between the hardware complexity and the computational effort.
Figure 5(c) shows the off-diagonal Frobenius norm versus the sweep numbers for each array size of 80 × 80 with double floating precision.Each rotation method converges to the predefined stop criteria: ‖ off ‖  × 10 −8 .The ‖ off ‖  is the Frobenius norm of the off-diagonal elements of  (i.e.,  off =  − diag(diag())).
Figure 5(d) shows the reduction of the off-diagonal Frobenius norm versus the sweep numbers for single floating precision.It can be noticed that the off-norms do not reach the convergence criteria, and each size of the EVD array has different stop criteria for each rotation method (default IEEE 754 single).Therefore, we can first analyze the Frobenius norm of the off-diagonal elements in Matlab and then observe it until it reaches its maximal reduction.Afterwards, a lookup table can be generated and directly assign these stop criteria to the target hardware circuit or IP component.

VLSI Implementation.
The -CORDIC is modeled and compared to the folded Full CORDIC in VHDL with the resizing feature.These two methods have been integrated into parallel EVD arrays, with sizes 4 × 4 and 10 × 10, through a configurable interface separately.After that, they have been synthesized by using the Synopsys Design Compiler with the TSMC 45 nm standard cell library.Note that the word length is 32 bits with the IEEE 754 single floating precision for both CORDIC methods using the same floating point unit from OpenCORE.Table 2 lists the synthesis results for area, timing delay, and power consumption.
As expected, the combinational logic area and the power consumption of the -CORDIC PE are much smaller than the Full CORDIC.Furthermore, in order to determine the time required to compute the EVD of a  ×  symmetric matrix, it can be obtained by where  = 8, 16, 20, 30, . . ., 160, Δ = /2 − 1.
The total timing delay per EVD operation is defined by the critical timing delay × the number of inner CORDIC rotations × average number of outer sweeps × size of the matrix .It can be observed that the total operation time is dependent on the relationship between the inner CORDIC rotations and the outer sweeps.Therefore, one obtains a speedup by a factor of 21.4 by reducing the number of inner CORDIC rotations.Although the reduction of power consumption is less significant due to an extra -CORDIC's controller and multiplexers, it actually 6 consumes much less energy per EVD computation due to the shorter computation time.Note that the -CORDIC PE requires two inner iterations on average due to the different rotation cycles, from six to one inner iteration, as shown in Table 1. Figure 6 shows the energy consumption for sizes of the array from 4 × 4 to 80 × 80.Both rotation methods consume much less energy than the Full CORDIC, where the 6-CORDIC can obtain a factor of 40.9 and the -CORDIC can obtain a factor of 104.3 on average for energy reduction compared to the Full CORDIC.
In [14], a Jacobi single cycle-by-row EVD algorithm [15] has been implemented with a single CORDIC processor.Since it requires a very complex controller and lookup tables, the throughput is not comparable with a real Brent-Luk's parallel EVD array [13].In comparison to [13], Full CORDIC for Jacobi Brent-Luk-EVD parallel architecture is implemented in FPGA; however, current configurable device can only perform 4 × 4 EVD array.The experimental results show that performing the unitary rotation in CORDIC processor is a good solution.It required smaller area size, improved the overall computation time, and reduced the energy consumption.Furthermore, the unitary-rotation method can be also applied to other more efficient CORIDC architectures as long as the orthogonality is obtained during CORDIC iterations, such as pipeline CORDIC [16,17], or implementing the rotators with better adder structures [18,19].
As the process technologies continue to shrink, it becomes possible to directly implement a Brent-Luk-EVD array with the Jacobi method [12,13].However, the size of the EVD array that can be implemented on the current configurable device with the regular CORDIC is still small, say, 4×4.Therefore, we must simplify the architecture in order to integrate more processors.A scaling-free -CORDIC for performing the plane rotation in ( 5) is used [5,6], where the number of inner iterations is reduced from 32 iterations to only one iteration.

Conclusions
The EVD was computed by the parallel Jacobi method, which was selected as an example for a typical iterative algorithm which exhibits very robust convergence properties.A configurable Jacobi EVD array with both Full CORDIC and -CORDIC is implemented in order to further study the tradeoff strategies in design criteria for parallel integrated circuits.The experimental results indicate that the presented -CORDIC method can reduce the size of the combinational logic, speed up the overall computation time, and improve the energy consumption.
(d)  shows the corresponding dependence flow graph for Type IV.Besides the rotation

Figure 5 :
Figure 5: Simulations of four rotation methods.

Figure 6 :
Figure 6: The energy consumption per EVD operation with each size of EVD array (operating at 100 MHz).

Table 1 :
The lookup table for -rotations CORDIC with 32-bit accuracy, showing the rotation type, the 2 × tan  angle, the required shift-add operations for rotation and scaling, the required cycle delay, and repeat number for CORDIC-6.