Low-Complexity Inverse Square Root Approximation for Baseband Matrix Operations

Baseband functions like channel estimation and symbol detection of sophisticated telecommunications systems require matrix operations, which apply highly nonlinear operations like division or square root. In this paper, a scalable low-complexity approximation method of the inverse square root is developed and applied in Cholesky and QR decompositions. Computation is derived by exploiting the binary representation of the fixedpoint numbers and by substituting the highly nonlinear inverse square root operation with a more implementation appropriate function. Low complexity is obtained since the proposed method does not use large multipliers or look-up tables (LUT). Due to the scalability, the approximation accuracy can be adjusted according to the targeted application. The method is applied also as an accelerating unit of an application-specific instruction-set processor (ASIP) and as a software routine of a conventional DSP. As a result, the method can accelerate any fixed-point system where cost-efficiency and low power consumption are of high importance, and coarse approximation of inverse square root operation is required.


Introduction
Ever higher data rates require sophisticated transmission techniques but, on the other hand, the latest technologies allow use of advanced and more complex receiver algorithms.Such algorithms apply matrix operations which require highly nonlinear division by square root operation.For example, linear minimum mean square error (LMMSE) estimation has been proposed for the receivers of the current 3G, Universal Mobile Telecommunications System [1], and Cholesky decomposition can be used for the inevitable matrix inversion or for solving linear systems.In the upcoming 3G Long-Term Evolution (LTE) systems, multiple-input multiple-output (MIMO) receivers require demanding symbol detection methods like list sphere detector (LSD), which applies QR decomposition of a complex-valued channel matrix.When compared to matrix computation studies targeted for processing large matrices with highly parallel resources [2,3], there are four notable differences in the baseband processing in the telecommunications field: (i) the matrices are relatively small, (ii) fixed-point number system is preferred, (iii) there are real-time limits, (iv) low complexity and low power consumption are of high importance.
In this paper, a low-complexity inverse square root approximation method is proposed for baseband matrix operations.The method relies on binary presentation of the fixed-point number system and it avoids large LUTs, large multipliers, and floating-point arithmetic units.The principal idea of the method is to substitute the highly nonlinear inverse square root function with a less nonlinear function with appropriate pre-and postprocessing.The ISRN Signal Processing accuracy and complexity of the method can be adjusted with one design parameter.Thus, the method lends itself to lower-complexity applications where coarse approximations and fixed-point computations are preferred.In addition to comparison of hardware implementations of inverse square root methods, we show how the proposed method can be applied as a software routine or as an accelerating unit of an ASIP.The implementations are applied for Cholesky and QR decompositions required by 3G and 3G LTE receivers.

Previous Work
There are several methods to compute the inverse square root function.One of the basic approaches is to use lookup tables (LUT) for obtaining an initial value for iterations, which refine the value to higher accuracy [4,5].The main differences among these kinds of methods are in the size and content of LUT and the used iteration algorithm.In [5], a large multiplier was used since it was available in the targeted general purpose processor.In [6], savings were obtained by using a m × n multiplier, m ≤ n, and utilizing the fact that less significant bits of intermediate result do not contribute to the accuracy of the final result.A software implementation using LUT initialization followed by iterations was presented in [7].Another software approximation in [8] relied heavily on the binary representation of floating-point numbers.
LUTs using low-order polynomial approximation were applied in [9].Polynomial approximation was also used in [10] where a second-degree minimax polynomial approximation was followed by modified Goldschmidt iteration.A comparison considering area costs was also given.Digit recurrence methods were proposed in [11,12].The main disadvantage of using digit recurrence when compared to iterative algorithms is their linear convergence.Approximation based on LUT followed by multiplication with operand modification was proposed in [13,14] and used also in [15].Argument reduction followed by series expansion was applied in [16].Another approach is to work in logarithmic domain [17,18] where the computation of the inverse square root is straightforward [19,20].
For shorter word lengths (WLs) and for using fixedpoint numbers, table addition methods have been proposed.These methods consist of parallel LUTs and multioperand additions.As a benefit, no multipliers are required.In [21], a symmetric table addition method (STAM) was developed as an extension to a simpler bipartite method.Selecting appropriate multipartite method, that is, design space exploration, was considered in [22].The STAM enhanced with an error correction term and internal presentation in exponent and mantissa form was used in [23].
When compared to the previously mentioned methods, the proposed method in this paper is novel, that is, it is not a derivative of any of the existing methods.The area costs are kept at low as large LUTs and large multipliers are avoided.The proposed method lends itself also to software implementation.Furthermore, the proposed method can be adjusted to work only in subunitary range, which is sufficient for, for example, Cholesky decomposition, and the accuracy of the method can be adjusted along with the complexity up to a certain level while maintaining high area efficiency.

Targeted Matrix Decompositions in Baseband Processing
In this section, we describe where the targeted low-complexity inverse square root operations have been applied.
where it is assumed that autocorrelation E(dd H ) = I.The estimation in (1) can be derived into a form, which can be presented as a linear system with positive definite real-valued matrix.With Cholesky decomposition A = L T L, such a linear system, Ax = b, can be solved with the aid of two triangular systems, that is, L T z = b and Lx = z.Diagonal elements, l ii , of Cholesky factor L are defined as and nondiagonal elements, l i j , as where a i j denotes the elements of A. Equations ( 2) and (3) show that nondiagonal elements require division by square root and diagonals square root operations.Thus, the division by square root can be replaced with multiplication with the inverse square root, that is, two demanding operations are substituted with one demanding and one less demanding operation.The square root operation of (2) can also be computed with an additional multiplication, as x.An important property of Cholesky decomposition is the preservation of the subunitary of matrix elements, which limits the arguments of 1/ √ x operations efficiently.

QR Decomposition for LSD.
The LSD is used in MIMO receivers to estimate the transmitted symbol, s, by approximating maximum likelihood detection: where y is the received symbol vector and H is complexvalued channel matrix whose dimensions are equal to the number of transmit and receive antennas of MIMO system.The approximation is based on substitution with QR decomposition QR = H, that is, The LSD approximates ( 5) by gradually increasing the dimensions of symbol vector and computing partial Euclidean distances.With this practice, the search space can be limited efficiently.
QR factorization with modified Gram-Schmidt algorithm [24] is presented in Algorithm 1.
It decomposes H n×n to the orthogonal Q n×n and upper triangular R n×n .Conjugated transpose is denoted with (•) H .The lines 2 and 3 show that division by square root is required as the elements are divided by diagonals which are norms, • .In a similar way as with Cholesky decomposition, the division can be substituted with multiplication by inverse square root.

Low-Complexity Approximation Method
The main principle of the proposed method is to avoid straightforward approximation of 1/ √ x function which is highly nonlinear in subunitary range 0 < x ≤ 1.Instead, the more softly nonlinear function 1/ √ c + u with c ≥ 1 and 0 < u ≤ 1 is approximated.The usage of 1/ √ c + u is justified by the following fixed-point representations in two complement formats of x, c, and u.If the positive subunitary x has α leading zeros, c and u can be defined so that In other words, the bits of c and u do not overlap and the word lengths of c and u are denoted with N and M, respectively.Positive nonsubunitary range, x > 1, is presented similarly, except that the number of leading zeros, α, can have negative values.Since c N−1 = 1 for all valid values of x, the x can be presented with the aid of shifting by α, that is, Thus, the desired form, c + u, can be obtained, and we note that u is a positive subunitary number.The targeted function can be written as We can distinguish two cases depending on the value of α, which represents the number of leading zeros of fixedpoint binary representation of x (6).This distinct behavior is obtained because the remainder of α/2 in (8) can be either zero or one.For even values α = 2k, and, for odd values α = 2k + 1, In order to approximate (9) or (10), the expression 1/ √ c + u must be considered.A tempting solution is to approximate 1/ √ c + u with binomial series.In principle, the 1/ √ c + u could be approached with arbitrarily high precision, as the binomial series converges.Multipliers are required if polynomial approximation [9,10] or series expansion [16] is applied.Although the approximation with binomial series has a solid basis, it does not lend itself to lowcomplexity implementations due to the high-order terms.

Linear Approximation. We attempt to identify the characteristic of 1/
√ c + u and to determine a first-degree polynomial that will give the smallest approximation error for a low-complexity hardware implementation.So, we will approximate the expression 1/ √ c + u with straight lines, that is, where a t , b t > 0 and subscript t is the integer interpretation of the concatenation c N−2 • • • c 0 α 0 .The number of approximating lines, that is, the accuracy of the approximation, depends on the WL of c.Since the MSB of c has always constant value, c N−1 = 1, the number of approximating lines is 2 N .The range of the targeted expression is 1 WL, N, that is, Naturally, the domain of u depends on N and M, that is, 0 1) .In practice, the approximating lines are formed by dividing the domain of c + u into evenly spaced regions, which are determined by the N highest bits of c.The values in the start and end points are given by 1/ √ c and the value of the last The linear approximation is illustrated in Figure 1(a) where 1/ √ c + u, with even α approximated with N = 1, 2, 3.The error of approximation is shown in Figure 1(b).The figures indicate that by increasing the word length N, the accuracy of the approximation can be adjusted conveniently.

2/
√ c + u is approximated in a similar way.The lines used for even values, α = 2k, cannot be used without multiplication with √ 2, and, therefore, different approximating lines are preferred.To obtain the final result, that is, the approximation of 1/ √ x, the approximating straight lines in must be scaled with 2 k as shown in ( 9) and (10).The scaling can be carried out easily with shift operation, whose direction depends on the sign of α.

Coefficients for Hardware Implementation.
The linear approximation has the form a t − b t (c + u), which includes multiplication.However, for obtaining low complexity, the multiplications should be avoided.Braun multiplier adds shifted values of the multiplicand multiplied with one bit of the multiplier.The principle of adding shifted values can be used to approximate the product b t (c + u).Since b t ≤ 1/2, the product can be presented as where d i,t ∈ {−1, 0, 1}.As division with powers of two can be implemented with hardwired shifting in hardware, an approximation of the previous form is suitable for lowcomplexity implementation.Naturally, the accuracy depends on the number of terms included in the sum.In the proposed method, at maximum three terms are included, that is, an approximation, in which d i,t ∈ {−1, 0, 1} and e i,t ∈ {1, . . ., 8}, is used.The coefficients d i,t and e i,t are searched for each approximating line, that is, for each c and α 0 , separately.Instead of three shifters with freely variable shift count, three multiplexers can be used to select appropriate terms.

Inverse Square Root Unit Implementations
The block diagrams of the hardware implementations of the inverse square root units are shown in Figure 2. The structure is further extended in Figure 2(c) to allow free range, that is, x > 0. Basically, nonsubunitary range of x results also in negative values of α, and, therefore, both left shifting and right shifting are required as indicated in Figure 2(c).In practice, shifters consists of hardwired shift operations from which one is selected with multiplexer.Therefore, a combination of left and right shifters can be assumed to have the same complexity of unidirectional shifter with respectively wider range of shifted bits.As the input signal x has wider WL in Figure 2(c), the negative α is detected by comparing the number of leading zeros and IWL.
Only basic arithmetic and logic units are being used.The key components are priority encoder, adders, multiplexers, and shifters.Part of the functionality, for example, constant scaling, is implemented by hardwiring bits to the new positions.Due to the scaling, WLs of intermediate signals are relatively short.As the targeted accuracy depends on N, different implementations can be generated according to targeted application.Figure 2(a) shows a general case, that is, the number of inputs of multiplexers and N are free variables.In Figures 2(b) and 2(c) N = 1 and, therefore, multiplexers are controlled solely by α 0 .If N > 1, the c is obtained from the output of the first shifter(s) and the control signal is

Comparisons
Areas of the proposed method and competitive methods are estimated for a suggestive comparison.The proposed method is synthesized with 130 nm technology.The areas of other methods are estimated by considering their most expensive area components, such as multipliers and LUTs, unless more accurate details are clearly specified in the referred design.Only the mantissa of floating-point implementations is considered since its computation is similar in fixed-point number system.

Estimation of Area.
Areas in terms of logic gate equivalents (GEs) of the synthesized arithmetic and logic operations with different WLs are given in Table 1.Since the basic unit of area is one NAND gate, fractions are possible.On the contrary to simple cost estimation of LUTs in [10], we have estimated the area of all LUTs individually.If structures of LUTs are not specified in detail, fair assumptions are made for the referred works.The synthesized LUTs are filled with random bits.The main reason for accurate modeling of LUT complexity is that the relative area of LUT depends both on the address line width and data WL.Estimated areas of all the LUTs are given in Table 1.

Compared Implementations.
Since low area is emphasized in the targeted application domain of baseband processing, the methods are compared using the area efficiency as the ratio of accuracy versus area.The metric is defined as area efficiency = accuracy in bits area in GEs .( For single precision (SP) methods the accuracy is 23 bits and for dual precision (DP) methods 52 bits.The area efficiency   results for all the methods are shown in Table 2.The average accuracy of the proposed method in Table 2 is obtained in the subunitary range.Nonsubunitary range would increase the average accuracy even further.There are four versions of the proposed method with design parameter N = 1, 2, 3, 4.
The results show that the proposed method has the lowest area, and even if the accuracy is adjusted with N, the area efficiency remains the highest except with N = 1.Naturally, the accuracy is relatively modest, as we have preferred the lowest area instead of high accuracy.
The first method in Table 2 was targeted to DP general purpose processor [5].It required LUTs of sizes 2 10 × 23 and 2 11 × 23 and multiplier for 76 × 76 operands.Since the implementation was targeted to the general purpose processor, the hardware resources were not dedicated only to the inverse square root function.In [6] two 16 × 56 and one 56 × 56 multipliers were required.The total memory size was 72192 bits divided into four tables.For smaller gate count, we have assumed uniform division to four tables of 18048 bits with WL 22 bits, which is the widest word fetched from the tables.High speed was emphasized in [10].Therefore, we compare with the method with single multiply and accumulate unit [10], which had better area efficiency.The authors also reported the complexity of 5030 full adders and, therefore, their value is used instead of our own estimates.In [13], SP floating-point numbers were targeted.A 2 10 × 25 LUT was required and a 20 × 20 multiplier.In addition, a requirement of 15 inverters was reported.Symmetric table addition method (STAM) was used in [21].The smallest total LUT size was obtained with four LUTs of sizes 2 9 ×19, 2 8 ×10, 2 8 × 8, and 2 8 × 6.In addition, a sum of all the data read from LUTs must be generated, which requires adders with operand sizes 19 × 10, 8 × 6, and 20 × 9. Also a requirement of 45 XOR gates was reported.Both SP and DP were targeted in [16] but the method for SP gave better area efficiency.The SP method required 2 7 × 36 LUT, four 4 × 4 multipliers, and one 22×23 multiplier.Fixed-point number systems were targeted in [23].The method applied STAM enhanced with added correction value.Estimated complexity of 625 GE and LUT size of 3456 bits were given in [23].Since the structures of LUTs were not reported, we have assumed that, due to the STAM, the memory is divided at least to two LUTs.We also assume 16-bit WL.Several smaller LUTs or shorter WL would degrade the area efficiency.The estimated complexity of LUTs is added to the reported gate count.) with 100 MHz is 0.339 mW, which includes the power required by input and output registers.Naturally, the static power consumption is proportional to the area, and, therefore, low complexity has been targeted.The dynamic power is proportional both to the area and average switching activity of the gates.Even if the average switching activity of competitive methods cannot be estimated sufficiently accurately, the differences in the area are significant.For example, [23] has the smallest area, 1602 GE, of the referred methods and the average switching activity of [23] should get as low as 622/1602 × 100% = 39% of the average switching activity of the largest proposed unit (622 GE, N = 4) to achieve roughly the same dynamic power consumption.

Matrix Decomposition Implementation Case Studies
In this section, the proposed method for 1/ √ x approximation is applied in QR and Cholesky decomposition implementations.Many matrix decompositions are implemented with systolic arrays [25] applying inherent regularity of the algorithm, and the computations are alleviated with CORDIC [26] algorithm.However, such structures are not as flexible as programmable processors and such high parallelism can easily result in so fast processing that the array processor must idle most of the time if applied, for example, in QR decomposition of MIMO receiver.

QR Decomposition with ASIP.
Complex-valued QR decomposition was implemented with transport triggered architecture (TTA) [27] processor in [28].The TTA is an ASIP template where parallel computing resources can be tailored according to the application.The proposed 1/

√
x unit is instantiated in the processor as shown in Figure 3.In addition, there are units for complex addition and subtraction and complex multiplication with optional conjugation.
Typically, MIMO systems have only a couple of transmit and receive antennas, and, therefore, a 4 × 4 matrix decomposition is targeted.The modified Gram-Schmidt algorithm requires 2n 3 operations for n × n matrix [24].The processor implementation takes 139 clock cycles for 4 ×   [29].Equations ( 2) and (3) show that the algorithm lends itself to the multiply and accumulate instruction, for which DSPs are typically optimized.Furthermore, an efficient hardware looping can be applied in the innermost loops as testing within the loop can be avoided.With the simple 1/ √ x approximation the developed program decomposes 64 × 64 matrix in 85070 clock cycles.Maintenance of a continuous flow of computations is more important for an efficient software implementation than focusing on avoidance of multiplications.In other words, pipeline should be kept full by avoiding conditional branching when possible.For example, describing computation of α as defined in (6) in C language would result in a cumbersome loop testing bits of x.However, it can be avoided as the instruction set of the applied DSP has adequate assembly instruction for obtaining the number of leading zeros.Furthermore, short branches according to the parity of α can be avoided with guarded instructions, that is, the computations proceed uninterrupted by both branches but only the other branch affects the state.Thus, the proposed method lends itself to an efficient software implementation on a DSP with adequate instructions for obtaining the number of leading zeros and for guarded execution.

Conclusions
The inverse square root function is highly nonlinear function and, therefore, approximated usually with high-complexity implementation.The proposed approximation method targets moderate precision fixed-point numbers.The computation has been based on an appropriate substitute, which allowed approximation without large LUTs and large multipliers.The method has one design parameter which allows scaling of the accuracy and hardware complexity.The area efficiency of the proposed method has been given in terms of approximation accuracy per area.Comparisons with previously reported methods show that the proposed method achieves low complexity and high area efficiency.Finally, the method has been applied on the targeted baseband functions as a function unit of an ASIP and as a software routine on a DSP.

Figure 2
(a)   shows only the linear approximation a t − b t (c + u).The top three multiplexers correspond with term b t (c + u), and the fourth multiplexer outputs a t .The selections of multiplexer are controlled by parity of α and bits of c excluding the c N−1 which has constant value.In the next block diagram in Figure2(b) the previous unit is instantiated in the inverse square root unit.The range of the unit in Figure2(b) is positive subunitary, that is, 0 < x ≤ 1, which is sufficient for the Cholesky decomposition.

Figure 2 :
Figure 2: Units for (a) linear approximation with a t − b t (c + u), (b) approximation of 1/ √ x in subunitary range, and (c) approximation in nonsubunitary range.Left shifting and right shifting are denoted with and , respectively.Negation is marked with (−1).

Figure 3 :
Figure3: TTA processor for QR decomposition.Filled circles denote valid connections between resources and internal buses.The processor has special function units for complex-valued arithmetic and 1/ √ x approximation.

Table 1 :
Estimated area in terms of GEs of arithmetic and logic units and LUTs in compared implementations.

Table 2 :
Suggestive comparison of inverse square root methods.The proposed method has the highest area efficiency with N = 2, 3, 4.
4 matrix.If 2048 subcarriers must be processed within a coherence time of the channel, 160 MHz clock frequency is adequate.The processor is synthesized with 130 nm technology and it takes 16.3 kGE with 160 MHz and 23.2 kGE with 269 MHz.The power consumption with 160 and 269 MHz clock frequencies is 6.91 mW and 16.79 mW, respectively.7.2.Cholesky Decomposition with DSP.Cholesky decomposition was implemented as a software routine on TI's C55x DSP in