Design strategies for parallel iterative algorithms are presented. In order to further study different tradeoff strategies in design criteria for integrated circuits, A 10 × 10 Jacobi Brent-Luk-EVD array with the simplified μ-CORDIC processor is used as an example. The experimental results show that using the μ-CORDIC processor is beneficial for the design criteria as it yields a smaller area, faster overall computation time, and less energy consumption than the regular CORDIC processor. It is worth to notice that the proposed parallel EVD method can be applied to real-time and low-power array signal processing algorithms performing beamforming or DOA estimation.
1. Introduction
We are on the edge of many important developments which will require parallel data and information processing. The transmission systems are using higher and higher frequencies and the carrier frequencies are increasing to 10 GHz and above. Because of the smaller wavelength more antennas can be implemented on a single device leading to massive MIMO systems. Parallel VLSI architectures will be needed in order to provide the required computational power for 10 GHz and above, massive MIMO, and big data processing [1, 2].
In parallel matrix computation at the circuit level, implementing an iterative algorithm on a multiprocessor array results in a tradeoff between the complexity of an iteration step and the number of required iteration steps. Therefore, as long as the algorithm's convergence properties are guaranteed, it is possible to adjust the architecture, which can significantly reduce the complexity with regard to the implementation. Computing the parallel eigenvalue decomposition (EVD) as a preprocessing step to MUSIC or ESPRIT algorithm with Jacobi's iterative method is used as an important example as the convergence of this method is extremely robust to modifications of the processor elements [3–6].
In [7], it was shown that Brent-Luk-EVD architecture with a modified CORDIC for performing the plane rotation of the Jacobi algorithm can be realized in advanced VLSI design. Based on it, a Jacobi EVD array is realized by implementing a scaling-free microrotation CORDIC (μ-CORDIC) processor in this paper, which only performs a predefined number of CORDIC iterations. Therefore, the size of the processor array can be reduced for implementing a large-scale EVD array in parallel VLSI architectures. After that, several modifications of the algorithm/processor are studied and their impact on the design criteria is investigated for different sizes of EVD array (10×10 to 80×80). Finally, a strategy to comply with the design criteria is established, especially in terms of balancing the number of microiterations and the computational complexity. The proposed architecture is ideal for real-time antenna array applications, such as a flying object carrying an antenna array for beamforming or DOA estimation that would require a real-time, low-power, and efficient architecture for EVD, or joint time-delay and frequency estimation using a sensor network.
This paper is organized as follows. Serial and parallel Jacobi methods are described in Section 2. In Section 3, the design issues of the parallel Jacobi EVD array are discussed, leading to the simplification from a regular full CORDIC to the μ-CORDIC processor with an adaptive number of iterations. Section 4 shows the implementation results. Section 5 concludes this paper.
An eigenvalue decomposition of a real symmetric n×n matrix A is obtained by factorizing A into three matrices A=Q∧QT, where Q is an orthogonal matrix (QQT=I) and ∧ is a diagonal matrix containing the eigenvalues of A. The Jacobi method approximates the EVD iteratively as follows:
(1)Ak+1=QkAkQkT,withk=0,1,2,…,
where Qk is an orthonormal plane rotation by the angle θ in the (i,j) plane.
The plane rotations Qk, where k=1,2,3,…, can be executed in various orders to obtain the eigenvalues. The most common order of sequential plane rotations {Qk} is called cyclic-by-row, meaning (i,j) is chosen as follows:
(2)(i,j)=(1,2)(1,3)⋯(1,n)(2,3)⋯(2,n)⋯(n-1,n).
The execution of all N=n(n-1)/2 index pairs (i,j) is called a sweep. Matrix A will converge into a diagonal matrix ∧ once k sweeps are applied, where ∧ contains the eigenvalues λ1,λ2,…,λn:
(3)limk→∞Ak=diag[λ1,λ2,…,λn]=[λ10⋯00λ2⋮⋮⋱00⋯0λn].
2.2. Jacobi EVD Array
Instead of performing the plane rotations Qk one by one in a cyclic-by-row order, they can be separated into multiple subproblems and executed in parallel on a log n dimensional multicore platform. Ahmedsaid et al. [3] first presented a parallel array based on Jacobi’s method. It consists of n/2×n/2 PEs and each PE contains a 2×2 subblock of the matrix A. Figure 1 shows a typical 4×4 EVD array with 16 PEs. This Jacobi array can perform n/2 subproblems in parallel. Initially, each PE holds a 2×2 submatrix of A:
(4)PEpq=(a2p-1,2q-1a2p-1,2qa2p,2q-1a2p,2q),
where p and q=1,2,…,n/2.
A 4×4 Brent-Luk-EVD array, where n=8 for an 8 × 8 symmetric matrix [3].
A rotation angle has to be chosen in order to zero out the off-diagonal elements of the submatrix by solving a 2×2 symmetric EVD subproblem as shown in the following:
(5)[aii′aij′aji′ajj′]=R·[aiiaijajiajj]·RT,
where R=[cosθ-sinθsinθcosθ].
The maximal reduction {aij′,aji′}=0 can be obtained by applying the optimal angle of rotation θopt:
(6)θopt=12arctan(2aijajj-aii),
where the range of θopt is limited to |θopt|≤π/4.
This optimal angle θopt, which can annihilate the off-diagonal elements (a2p-1,2q and a2p,2q-1), is computed using diagonal PEs in (6). Once these rotation angles are computed, they will be sent to the off-diagonal PEs. This transmission is indicated by the dashed lines in the vertical and horizontal direction in Figure 1. All off-diagonal PEs will perform a two-sided rotation with the corresponding rotation angles obtained from the row (θr) and column (θc), respectively.
Once these rotations are applied, the matrix elements are interchanged between processors as indicated by the diagonal solid lines in Figure 1, for the execution of the next n/2 rotations. One sweep needs to perform n-1 of these parallel rotation steps. After several sweeps (iterations) are executed, the eigenvalues will concentrate in the diagonal PEs.
3. CORDIC Approach3.1. Regular CORDIC
Within each PE, a simple way to solve the subproblem of (5) in VLSI for zeroing out the off-diagonal elements is to use the CORDIC algorithm. An orthogonal CORDIC rotator is defined as [8, 9]
(7)xi+1=Ai[xi-yi·di·2-i]yi+1=Ai[yi+xi·di·2-i]zi+1=zi-di·tan-12-iAi=1+2-2ii=1,2,3,…,n
when n→∞, An≅1.647.
In the Cartesian coordinate system, the CORDIC orthogonal rotation mode can be used to compute (5) by separating the two-sided rotation into two parts, G=[G1T;G2T]=A·RT and R·G. A·RT that is computed by
(8)G1=[aiir,aijr]T=[cosθ-sinθsinθcosθ]·[aii,aij]T,G2=[ajir,ajjr]T=[cosθ-sinθsinθcosθ]·[aji,ajj]T,
where the plane rotation with the desired rotation angle θopt is executed using two CORDIC rotators. The CORDIC processors apply n steps, usually n=32 for single floating precision. A constant scaling value K=1/An=0.6073 is subsequently required to fix the rotated vectors G1=[aiir,ajir]T and G2=[aijr,ajjr]T in order to retain the orthonormality. Similarly, these two CORDIC rotators can also be applied to compute R·G:
(9)[aii′,aji′]T=[cosθ-sinθsinθcosθ]·[aiir,ajir]T,[aij′,ajj′]T=[cosθ-sinθsinθcosθ]·[aijr,ajjr]T.
Meanwhile, the angle θopt can also be determined by using the CORDIC orthogonal vector mode. The CORDIC rotates the input vector through whatever angle is necessary to align the resulting vector with the x-axis:
(10)xn=Anx02+y02yn=0zn=z0-di·tan-12-i.
The CORDIC with an orthogonal vector mode can compute the arctangent result iteratively θ=arctan(y/x), if the angle accumulator is initialized with zero (z0=0).
In the VLSI design, two common approaches can be used to realize the CORDIC dependence flow graph in hardware: the folded (serial) or the parallel (pipelining) [10, 11]. Note that we limit our efforts to the conventional CORDIC iteration scheme, as given in (7). In Figure 2(a), the structure of a folded CORDIC PE is shown, which requires a pair of adders for plane rotation and another adder for steering the next angle direction (computing the following zi and di). All internal variables are buffered in the registers separately until the iteration number is large enough to obtain the result. The signs of all three intermediate variables are fed into a control unit that generates the rotation direction flags di to steer the add or suboperations and keep track of the rotation angle zi. For example, off-diagonal PE43 can directly apply the flags di from PE33 to (8)’s G1 and PE44 to (8)’s G2. After the rotation, the required scaling procedure can be obtained using the part of Figure 2(b) that fixes An, where two multiplexers are required to select the inputs into the barrel shifters. This folded dependence graph is typical for the orthogonal rotation mode and benefits in a small area in the VLSI design.
Flow graph of a folded CORDIC (recursive) processor.
Rotating
Scaling
In practice, the angle accumulator is not required for the off-diagonal PEs. The di from (7) can be used to steer the rotators. Thus, the transmission on the vertical and horizontal dashed lines in Figure 1 will be replaced by a sequence of di flags, meaning that the off-diagonal computation efforts for computing the optimal angle θopt can be omitted.
3.2. Simplified μ-Rotation CORDIC
As the process technologies continue to shrink, it becomes possible to directly implement a Brent-Luk-EVD array with the Jacobi method [12, 13]. However, the size of the EVD array that can be implemented on the current configurable device with the regular CORDIC is still small, say, 4×4. Therefore, we must simplify the architecture in order to integrate more processors. A scaling-free μ-CORDIC for performing the plane rotation in (5) is used [5, 6], where the number of inner iterations is reduced from 32 iterations to only one iteration.
The definition of μ-CORDIC can be developed from (7) as
(11)xi+1=m^[xi-yi·di·2-i]yi+1=m^[yi+xi·di·2-i]m^=cos2θ+sin2θ=1+ϵ,
where m^ is the required scaling factor per iteration and ϵ is the scaling error. The idea of the μ-CORDIC rotation is to reduce the number of iterations of the full CORDIC to only a few iterations. Meanwhile, the scaling error ϵ will be small enough to be neglected as long as the orthonormality is retained. Figure 3 shows four different methods for different sizes of μ-rotation angles and Table 1 shows a lookup table for the μ-CORDIC, listing 32 approximated rotation angles for each μ-rotation type, the required number of shift-add operations and its computation cycles. Note that the approximated angles are stored as two times of tanθ. When the rotation angle is very tiny (i.e., ϵ is tiny, too), Type I with only one iteration will comply with the limited working range 1-2-(nm+1)<m^<1+2-(nm+1), if the selected nm (nm∈1⋯32) is larger than 16. In Figure 3(a), a pair of shift-add operations realizing one iteration step is sufficient. Furthermore, it is scaling free when the angle 2×tanθ≤3.05176×10-5. These orthonormal μ-rotations are chosen such that they satisfy a predefined accuracy condition in order to approximate the original rotation angles and are constructed by the least computation efforts.
The lookup table for μ-rotations CORDIC with 32-bit accuracy, showing the rotation type, the 2×tanθ angle, the required shift-add operations for rotation and scaling, the required cycle delay, and repeat number for CORDIC-6.
Index
Type
Angle
Shift-add
Cycle
k
2×tanθ
rot.
sca.
cnt.
re.
1
IV
1.49070
4
8
6
1
2
IV
0.54296
4
6
5
1
3
IV
0.25501
4
6
5
1
4
IV
0.12561
4
4
4
1
5
III
6.25841 10−2
6
0
3
2
6
III
3.12606 10−2
6
0
3
2
7
III
1.56263 10−2
6
0
3
2
8
II
7.81266 10−3
4
0
2
3
9
II
3.90627 10−3
4
0
2
3
10
II
1.95313 10−3
4
0
2
3
11
II
9.76563 10−4
4
0
2
3
12
II
4.88281 10−4
4
0
2
3
13
II
2.44141 10−4
4
0
2
3
14
II
1.22070 10−4
4
0
2
4
15
II
6.10352 10−5
4
0
2
5
16
I
3.05176 10−5
2
0
1
6
17
I
1.52588 10−5
2
0
1
6
18
I
7.62939 10−6
2
0
1
6
19
I
3.81470 10−6
2
0
1
6
20
I
1.90735 10−6
2
0
1
6
21
I
9.53674 10−7
2
0
1
6
22
I
4.76837 10−7
2
0
1
6
23
I
2.38419 10−7
2
0
1
6
24
I
1.19209 10−7
2
0
1
6
25
I
5.96046 10−8
2
0
1
6
26
I
2.98023 10−8
2
0
1
6
27
I
1.49012 10−8
2
0
1
6
28
I
7.45058 10−9
2
0
1
5
29
I
3.72529 10−9
2
0
1
4
30
I
1.86265 10−9
2
0
1
3
31
I
9.31323 10−10
2
0
1
2
32
I
4.65661 10−10
2
0
1
1
Four μ-CORDIC rotations.
Type I
Type II
Type III
Type IV
Next, for the Type II rotation (as shown in Figure 3(b)), when nm is selected from 8 to 15 for small angles, two pairs of shift-add operations are enough to retain the orthonormality. Moreover, when the nm is selected from 5 to 7, Type III requires three μ-rotations. No scaling is required by Types I through III. Finally, for large rotation angles, the scaling errors cannot be omitted. Figure 3(d) shows the corresponding dependence flow graph for Type IV. Besides the rotation itself, it requires two pairs of shift-add operations at the beginning of the flow graph, while 2 to 4 pairs of shift-add operations are required to fix the scaling factor m^:
(12)m^=(1+2-2(k+1))(1+2-4(k+1))⋯(1+2-2M(k+1)).
Note that the scaling costs M=2 to 4 pairs of shift-add operations. In general, the cost of Type IV is bounded by 2+M pairs of shift-add operations. For example, when the index k is 2, the scaling is
(13)m^2=(1+2-6)(1+2-12)(1+2-24).
These four subtypes have three identical parts: Type I with one iteration, the scaling part of Type IV, and the second iteration of Type II. These three parts can be integrated together by using multiplexers to select the data paths, as shown in Figure 4, where 2 adders, 2 shifters, and 4 multiplexers are required [5].
The block diagram of a scaling-free μ-CORDIC PE, including 2 adders, 2 shifters, and 4 multiplexers.
3.3. Adaptive μ-CORDIC Iterations
To improve the computational efficiency, the μ-CORDIC has been modified to perform 6 iterations per cycle as CORDIC-6. As the global clock in a synchronous circuit is determined by the critical path, the maximum timing delay per iteration is 6 cycles (when the index k is 1, Type IV). Therefore, the inner iteration steps of the angles are repeated until they are close to the critical one. The required number of repetitions is quoted in Table 1. For example, when the rotation angle index k is 8, it will repeat three times from the index k=8 to the index k=10; when the rotation angle index k is 20, it is repeated six times from the index k=20 to the index k=25. On the other hand, we can adjust the number of iterations by selecting the average angle during the last sweep and name it as CORDIC-mean.
4. Experimental Results4.1. Matlab Simulation
The full CORDIC with 32 iteration steps, the μ-CORDIC with one iteration step, and two different adaptive modes have been tested using numerous random symmetric matrices A of size 8×8 to 160×160 (i.e., EVD array sizes range from 4×4 to 80×80). Figure 5(a) shows the average number of sweeps needed to compute the eigenvalues/eigenvectors for each size of the EVD array, where the sweep number increases monotonically.
Simulations of four rotation methods.
The average number of sweeps versus array sizes
The number of shift-add operations versus array sizes
The required number of sweeps versus array off-diagonal norm (80×80 array with double precision)
The required number of sweeps versus array off-diagonal norm (80×80 array with single precision)
When the Jacobi EVD array size is 10×10, the μ-CORDIC requires 12 sweeps while the full CORDIC only requires 6 sweeps per EVD computation. If we adjust the inner rotations to six times, the sweep number will be 10, smaller than the μ-CORDIC but more than the full CORDIC. Note that using the average rotation angle to decide the rotation number as the CORDIC-mean seems to be an unwise method because it requires more sweeps. Although the μ-CORDIC requires double sweeps than the full CORDIC, it actually reduces the number of the inner CORDIC rotations, which results in improved computational complexity. For example, a 10×10 array with the Full CORDIC PE needs 6 sweeps × 32 inner CORDIC rotations and the CORDIC-6 needs 10 sweeps × 6 inner CORDIC rotations whereas the μ-CORDIC PE requires only 12 sweeps × 1 inner CORDIC rotation. In Figure 5(b), the average number of shift-add operations required for each rotation method for different sizes of EVD arrays is demonstrated whereas μ-CORDIC needs significantly fewer shift-add operations than others. The adaptive CORDIC-6 method can offer a compromise between the hardware complexity and the computational effort.
Figure 5(c) shows the off-diagonal Frobenius norm versus the sweep numbers for each array size of 80×80 with double floating precision. Each rotation method converges to the predefined stop criteria: ∥Aoff∥F×10-8. The ∥Aoff∥F is the Frobenius norm of the off-diagonal elements of A (i.e., Aoff=A-diag(diag(A))).
Figure 5(d) shows the reduction of the off-diagonal Frobenius norm versus the sweep numbers for single floating precision. It can be noticed that the off-norms do not reach the convergence criteria, and each size of the EVD array has different stop criteria for each rotation method (default IEEE 754 single). Therefore, we can first analyze the Frobenius norm of the off-diagonal elements in Matlab and then observe it until it reaches its maximal reduction. Afterwards, a lookup table can be generated and directly assign these stop criteria to the target hardware circuit or IP component.
4.2. VLSI Implementation
The μ-CORDIC is modeled and compared to the folded Full CORDIC in VHDL with the resizing feature. These two methods have been integrated into parallel EVD arrays, with sizes 4×4 and 10×10, through a configurable interface separately. After that, they have been synthesized by using the Synopsys Design Compiler with the TSMC 45 nm standard cell library. Note that the word length is 32 bits with the IEEE 754 single floating precision for both CORDIC methods using the same floating point unit from OpenCORE. Table 2 lists the synthesis results for area, timing delay, and power consumption.
Comparison of 4 × 4 and 10 × 10 Jacobi EVD arrays.
PE array size
Full CORDIC 4 × 4
μ-CORDIC 4 × 4
Full CORDIC 10 × 10
μ-CORDIC 10 × 10
Area
Combinational
0.847 mm2
0.296 mm2
5.143 mm2
1.829 mm2
Noncombinational
0.390 mm2
0.123 mm2
2.306 mm2
0.833 mm2
Total
1.237 mm2
0.419 mm2
7.449 mm2
2.662 mm2
Power
Cell
62.283 mW
18.239 mW
388.379 mW
123.215 mW
Net
0.465 mW
0.433 mW
2.993 mW
2.678 mW
Leakage
11.909 mW
3.765 mW
86.136 mW
23.966 mW
Total
74.657 mW
22.437 mW
477.508 mW
149.859 mW
Timing
Critical
4.454 ns
1.213 ns
4.286 ns
2.247 ns
Frequency
224.5 MHz
824.4 MHz
233.3 MHz
445 MHz
As expected, the combinational logic area and the power consumption of the μ-CORDIC PE are much smaller than the Full CORDIC. Furthermore, in order to determine the time required to compute the EVD of a n×n symmetric matrix, it can be obtained by
(14)Ttotal=Tdelay×Kiteration×Ksweep×[3(n-1)+Δ+3],
where n=8,16,20,30,…,160, Δ=n/2-1.
The total timing delay per EVD operation is defined by the critical timing delay × the number of inner CORDIC rotations × average number of outer sweeps × size of the matrix A. It can be observed that the total operation time is dependent on the relationship between the inner CORDIC rotations and the outer sweeps. Therefore, one obtains a speedup by a factor of 21.4 by reducing the number of inner CORDIC rotations. Although the reduction of power consumption is less significant due to an extra μ-CORDIC’s controller and multiplexers, it actually 6 consumes much less energy per EVD computation due to the shorter computation time. Note that the μ-CORDIC PE requires two inner iterations on average due to the different rotation cycles, from six to one inner iteration, as shown in Table 1. Figure 6 shows the energy consumption for sizes of the array from 4×4 to 80×80. Both rotation methods consume much less energy than the Full CORDIC, where the 6-CORDIC can obtain a factor of 40.9 and the μ-CORDIC can obtain a factor of 104.3 on average for energy reduction compared to the Full CORDIC.
The energy consumption per EVD operation with each size of EVD array (operating at 100 MHz).
In [14], a Jacobi single cycle-by-row EVD algorithm [15] has been implemented with a single CORDIC processor. Since it requires a very complex controller and lookup tables, the throughput is not comparable with a real Brent-Luk's parallel EVD array [13]. In comparison to [13], Full CORDIC for Jacobi Brent-Luk-EVD parallel architecture is implemented in FPGA; however, current configurable device can only perform 4×4 EVD array. The experimental results show that performing the unitary rotation in CORDIC processor is a good solution. It required smaller area size, improved the overall computation time, and reduced the energy consumption. Furthermore, the unitary-rotation method can be also applied to other more efficient CORIDC architectures as long as the orthogonality is obtained during CORDIC iterations, such as pipeline CORDIC [16, 17], or implementing the rotators with better adder structures [18, 19].
As the process technologies continue to shrink, it becomes possible to directly implement a Brent-Luk-EVD array with the Jacobi method [12, 13]. However, the size of the EVD array that can be implemented on the current configurable device with the regular CORDIC is still small, say, 4×4. Therefore, we must simplify the architecture in order to integrate more processors. A scaling-free μ-CORDIC for performing the plane rotation in (5) is used [5, 6], where the number of inner iterations is reduced from 32 iterations to only one iteration.
5. Conclusions
The EVD was computed by the parallel Jacobi method, which was selected as an example for a typical iterative algorithm which exhibits very robust convergence properties. A configurable Jacobi EVD array with both Full CORDIC and μ-CORDIC is implemented in order to further study the tradeoff strategies in design criteria for parallel integrated circuits. The experimental results indicate that the presented μ-CORDIC method can reduce the size of the combinational logic, speed up the overall computation time, and improve the energy consumption.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
The authors would like to thank the National Science Council of Taiwan for the support under the contract of NSC 101-2218-E-150-001.
AggarwalS.KhareK.CORDIC-based window implementation to minimise area and pipeline depth20137542743510.1049/iet-spr.2012.0021MR3112719AhmedH. M.DelosmeJ.MorfM.Highly concurrent computing structure for matrix arithmetic and signal processing1982151658210.1109/MC.1982.16538282-s2.0-0019928904AhmedsaidA.AmiraA.BouridaneA.Improved SVD systolic array and implementation on FPGAProceedings of the IEEE International Conference on Field-Programmable Technology2003342AndrakaR.Survey of CORDIC algorithms for FPGA based computersProceedings of the ACM/SIGDA 6th International Symposium on Field Programmable Gate Arrays (FPGA '98)February 19981912002-s2.0-0031645163BravoI.JiménezP.MazoM.LázaroJ. L.GardelA.Implementation in FPGAS of Jacobi method to solve the eigenvalue and eigenvector problemProceedings of the International Conference on Field Programmable Logic and Applications (FPL '06)August 2006Madrid, Spain1410.1109/FPL.2006.3113012-s2.0-46249102178BrentR. P.LukF. T.The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays198561698410.1137/0906007MR773282GolubG. H.van LoanC. F.19963rdBaltimore, Md, USAJohns Hopkins University PressMR1417720GötzeJ.HekstraG. J.An algorithm and architecture based on orthonormal μ-rotations for computing the symmetric EVD1995201213910.1016/0167-9260(95)00016-XZBL0875.685572-s2.0-0029545884GötzeJ.PaulS.SauerM.An efficient Jacobi-like algorithm for parallel eigenvalue computation19934291058106510.1109/12.241595MR12523152-s2.0-0027658980HakkarainenA.WernerJ.DandekarK. R.ValkamaM.Widely-linear beamforming and RF impairment suppression in massive antenna arrays201315438339710.1109/JCN.2013.000069KlaukeS.GötzeJ.Low power enhancements for parallel algorithmsProceedings of the IEEE International Symopsium on Circuits and Systems2001234237LiuY.BouganisC.-S.CheungP. Y. K.Hardware architectures for eigenvalue computation of real symmetric matrices200931728410.1049/iet-cdt:200800082-s2.0-58049097176MeherP. K.ParkS. Y.CORDIC designs for fixed angle of rotation201321221722810.1109/TVLSI.2012.21870802-s2.0-84872892494ParhiK. K.NishitaniT.1999Marcel DekkerPurohitS.MargalaM.Investigating the impact of logic and circuit implementation on full adder performance20122071327133110.1109/TVLSI.2011.21575432-s2.0-84862024132RamkumarB.KitturH. M.Low-power and area-efficient carry select adder201220237137510.1109/TVLSI.2010.21016212-s2.0-84856292234RusekF.PerssonD.LauB. K.LarssonE. G.MarzettaT. L.EdforsO.TufvessonF.Scaling up MIMO: opportunities and challenges with very large arrays20133014046SunC.-C.GötzeJ.VLSI circuit design concept for parallel iterative algorithms in nanoscaleProceedings of the 9th International Symposium on Communications and Information Technology (ISCIT '09)September 2009Icheon, Republic of Korea68869210.1109/ISCIT.2009.53411552-s2.0-74549203242WaltherJ. S.A unified algorithm for elementary functionsProceedings of the Spring Joint Computer Conference1971379385