A VLSI Architecture for the V-BLAST Algorithm in Spatial-Multiplexing MIMO Systems

This paper presents a VLSI architecture for the suboptimal hard-output Vertical-Bell Laboratories Layered Space-Time (V-BLAST) algorithm in the context of Spatial-Multiplexing Multiple-Input Multiple-Output (SM-MIMO) systems immersed in Rayleigh fading channels. The design and implementation of its corresponding data-path and control-path components over FPGA devices are considered. Results on synthesis, bit error rate performance, and data throughput are reported.


Introduction
Multiple-Input Multiple-Output (MIMO) communication systems enhance spectral efficiency and bit error rate (BER) performance over wireless communication links [1][2][3]. MIMO is already considered the transmission scheme for emerging wireless communication standards such as 802.11n (WiFi), 802.16d/e (WiMAX), and 802.11ac (multiuser MIMO WLAN) [4]. Digital signal processing (DSP) algorithms for symbol decoding in these systems, when immersed in Rayleigh fading channels, pose design trade-offs among BER performance, data throughput, and complexity. Sub-optimal hard-output Spatial-Multiplexing MIMO (SM-MIMO) demodulation techniques [1,2] offer low complexity and very high data throughput, but at the cost of BER performance degradation compared to Maximum-Likelihood (ML) detection. The V-BLAST (Vertical-Bell Laboratories Layered Space-Time) algorithm [5,6] is an adequate sub-optimal hard-output SM-MIMO demodulation technique because of these properties. To the best of the authors' knowledge, previous state-of-the-art VLSI implementations of hard-output V-BLAST-based sub-optimal SM-MIMO demodulation architectures are those reported in [7,8]. Although they seek low power consumption and cost-effectiveness, these ASIC-based (Application-Specific Integrated Circuit) approaches permit no flexibility in meeting these attributes when prototyping this kind of DSP solution. In this work, a VLSI architecture for the sub-optimal hard-output V-BLAST detection algorithm is presented. The contribution of this paper is a novel VLSI architecture of the V-BLAST algorithm, implemented on FPGA devices, that suits SM-MIMO demodulation requirements regarding BER performance, hardware complexity, and data throughput while operating in Rayleigh fading channels, and that behaves competitively against earlier approaches. The organization of this paper is as follows: Section 2 presents the MIMO communication model. The V-BLAST algorithm is presented in Section 3. Section 4 highlights the architecture proposal for the V-BLAST algorithm. Implementation results and comparison analysis are presented in Section 5. Conclusions are covered in Section 6.

MIMO Communication Model
The MIMO communication model consists of an antenna array of $n_T$ elements at the transmitter end and $n_R$ elements at the receiver, in the presence of a ccs-iid AWGN (circularly-complex symmetric, identically-distributed Additive White Gaussian Noise) Rayleigh fading channel [1][2][3]. The information signal vector $\mathbf{s} = [s_1 \cdots s_{n_T}]^T$, whose $n_T$ entries are symbols drawn from an $M$-QAM ($M$-ary Quadrature Amplitude Modulation) constellation (also known in this context as a Gaussian finite-integer lattice), is transmitted through these $n_T$ antennas. The received signal vector $\mathbf{y} = [y_1 \cdots y_{n_R}]^T$ of dimension $n_R$ can be described as

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{v}, \tag{1}$$

where $\mathbf{y}$ and $\mathbf{s}$ were defined above, $\mathbf{v} = [v_1 \cdots v_{n_R}]^T$ is an $n_R$-dimensional AWGN vector, and $\mathbf{H}$ is the $n_R \times n_T$ MIMO channel matrix (entry $h_{jk}$ of $\mathbf{H}$ corresponds to the fading between the $j$th receiver and the $k$th transmitter antennas). System statistics are assumed to be invariant during a MIMO channel realization [1]. Without loss of generality, $n_T = n_R = n$ will be considered in the sequel. Applying QR decomposition [9,10] to $\mathbf{H}$ in (1), that is, $\mathbf{H} = \mathbf{Q}\mathbf{R}$, yields

$$\tilde{\mathbf{y}} = \mathbf{R}\mathbf{s} + \tilde{\mathbf{v}}, \tag{2}$$

where $\mathbf{Q}$ and $\mathbf{R}$ are $n \times n$ unitary (complex orthogonal) and upper-triangular matrices, respectively; moreover, $\tilde{\mathbf{y}} = \mathbf{Q}^H\mathbf{y}$ and $\tilde{\mathbf{v}} = \mathbf{Q}^H\mathbf{v}$, where $[\cdot]^H$ is the conjugate-transpose operator. Since $\mathbf{Q}^H\mathbf{Q} = \mathbf{I}$, $\mathbf{Q}^H$ is the inverse of $\mathbf{Q}$, that is, $\mathbf{Q}^{-1}$. The problem to solve in this MIMO communication scenario is to find the transmitted vector $\mathbf{s}$ among $M^n$ possible candidates, given that vector $\mathbf{y}$ was received and matrix $\mathbf{H}$ has been accurately estimated. An exhaustive search over this set is prohibitively complex, which is why sub-optimal decoding algorithms are preferred. A fast decoding procedure alleviates this complexity constraint by taking advantage of the triangular structure of $\mathbf{R}$ in (2) under a successive interference cancellation (SIC) strategy [5,6], thus yielding high data throughput for symbol decoding. One of the best sub-optimal hard-output decoding algorithms is V-BLAST, which is explained next.
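The model in (1) and the size of the exhaustive search space can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names (`rayleigh_channel`, `received_vector`, `ml_candidates`) are hypothetical, the 16-QAM alphabet is left unnormalized with levels ±1 and ±3, and the noise standard deviation is an arbitrary value.

```python
import random

# Assumed, unnormalized 16-QAM alphabet: real/imaginary parts in {-3, -1, +1, +3}.
LEVELS = (-3, -1, 1, 3)
QAM16 = [complex(re, im) for re in LEVELS for im in LEVELS]

def rayleigh_channel(n, rng):
    """n x n matrix of i.i.d. complex Gaussian entries with unit variance."""
    std = 0.5 ** 0.5
    return [[complex(rng.gauss(0, std), rng.gauss(0, std)) for _ in range(n)]
            for _ in range(n)]

def received_vector(H, s_vec, noise_std, rng):
    """Evaluate y = H s + v, with v complex AWGN (noise_std per component)."""
    y = []
    for row in H:
        clean = sum(h * s for h, s in zip(row, s_vec))
        noise = complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
        y.append(clean + noise)
    return y

def ml_candidates(M, n):
    """Number of candidate vectors an exhaustive (ML) search must test."""
    return M ** n

rng = random.Random(1)
n = 4
H = rayleigh_channel(n, rng)
s_vec = [rng.choice(QAM16) for _ in range(n)]
y = received_vector(H, s_vec, noise_std=0.1, rng=rng)
print(len(y), ml_candidates(16, n))  # 4 65536
```

For 16-QAM and four transmit antennas the exhaustive search already covers 65536 candidate vectors per received vector, which is the complexity that sub-optimal detection avoids.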

The V-BLAST Algorithm
The main idea behind the V-BLAST algorithm, as a sub-optimal hard-output SM-MIMO demodulation technique, is that, instead of performing an exhaustive search, symbol decoding of $\mathbf{s}$ is performed in an ordered, iterative, back-propagating manner (also identified as OSIC: Ordered Successive Interference Cancellation), in which noise (associated with $\mathbf{v} = [v_1 \cdots v_n]^T$) and co-channel interference (related to elements in $\mathbf{H}$) are treated through an SNR (Signal-to-Noise Ratio) optimization criterion that determines the order in which symbol entries of $\mathbf{s}$ will be decoded [5,6]. At each iteration, an entry of vector $\mathbf{s}$ is sliced into an $M$-QAM value at the receiver end, assuming that only one transmitter element possesses the highest SNR (a choice based on the optimization criterion); the remaining transmitter elements are considered as interference (besides the presence of AWGN). With these ideas in place, the V-BLAST algorithm is defined by the following steps (Steps 1-8).
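Step 1 (initialization). Set $i = 1$, $\mathbf{H}_1 = \mathbf{H}$, $\mathbf{y}_1 = \mathbf{y}$, and the index vector $\mathbf{f}_1 = [1 \cdots n]$.

Step 2 (pre/post-detection). While $i \le n$, and for $1 \le j \le n - i + 1$, an SNR optimization criterion dictates the order in which symbol entries of $\mathbf{s}$ will be decoded according to index $k_i$, and is stated as

$$k_i = \arg\min_{j} \|\mathbf{g}_j\|^2, \tag{5}$$

where the elements $\mathbf{g}_1, \ldots, \mathbf{g}_{n-i+1}$ are the rows of the Generalized-Inverse $\mathbf{G}_i$ (or Left-Pseudoinverse) [11] of the deflated versions $\mathbf{H}_i$ of the MIMO channel matrix, that is,

$$\mathbf{G}_i = \mathbf{H}_i^{\dagger}, \tag{6}$$

with

$$\mathbf{H}_i^{\dagger} = (\mathbf{H}_i^H\mathbf{H}_i)^{-1}\mathbf{H}_i^H. \tag{7}$$

Let $p_i$ be the $k_i$th element of $\mathbf{f}_i$.

Step 3 (nulling). Co-channel interference is mitigated by the use of a nulling vector $\mathbf{w}_{k_i}$, obtained from the rows of (7), through the operation

$$z_{k_i} = \mathbf{w}_{k_i}\,\mathbf{y}_i, \tag{8}$$

with

$$\mathbf{w}_{k_i} = [\,g_{k_i,1} \cdots g_{k_i,n}\,], \tag{9}$$

where $z_{k_i}$ is the signal to be sliced.

Step 4 (slicing). The signal $z_{k_i}$ is quantized into the nearest signal point $\hat{s}_{k_i}$ of the $M$-QAM constellation,

$$\hat{s}_{k_i} = \mathcal{Q}(z_{k_i}). \tag{10}$$

Step 5 (cancellation). The contribution of the symbol detected through (8)-(10) at iteration $i$ is cancelled out of the received vector according to

$$\mathbf{y}_{i+1} = \mathbf{y}_i - \hat{s}_{k_i}\,\mathbf{h}_{p_i}, \tag{11}$$

where $\mathbf{h}_{p_i} = [h_{1,p_i} \cdots h_{n,p_i}]^T$ is the $p_i$th column of $\mathbf{H}_1$ (equivalently, the $k_i$th column of $\mathbf{H}_i$).

Step 6 (updating). By removing, respectively, the $k_i$th element of $\mathbf{f}_i$ and the $k_i$th column of $\mathbf{H}_i$, the reshaped index vector and the deflated version of the MIMO channel matrix are created:

$$\mathbf{H}_{i+1} = \mathbf{H}_i \ \text{without its } k_i\text{th column}, \tag{12}$$

$$\mathbf{f}_{i+1} = \mathbf{f}_i \ \text{without its } k_i\text{th element}. \tag{13}$$

Step 7. Increment $i$ and go to Step 2.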
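The ordering, nulling, slicing, cancellation, and updating steps can be sketched in software as follows. This is a minimal floating-point Python sketch under stated assumptions, not the hardware design of this paper: the helper names (`pinv_left`, `slice_16qam`, `vblast_detect`) are hypothetical, antenna indexes are 0-based, the constellation is assumed to be unnormalized 16-QAM with levels ±1 and ±3, and the pseudoinverse is computed by plain Gauss-Jordan elimination rather than by the LPI kernel.

```python
LEVELS = (-3, -1, 1, 3)  # assumed unnormalized 16-QAM levels per axis

def pinv_left(H):
    """Left pseudoinverse (H^H H)^-1 H^H of a complex matrix (list of rows)."""
    rows, cols = len(H), len(H[0])
    Hh = [[H[r][c].conjugate() for r in range(rows)] for c in range(cols)]
    gram = [[sum(Hh[a][r] * H[r][b] for r in range(rows)) for b in range(cols)]
            for a in range(cols)]
    # Gauss-Jordan elimination on [gram | H^H] leaves gram^-1 H^H on the right.
    aug = [gram[a] + Hh[a] for a in range(cols)]
    for c in range(cols):
        piv = max(range(c, cols), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        p = aug[c][c]
        aug[c] = [v / p for v in aug[c]]
        for r in range(cols):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[cols:] for row in aug]

def slice_16qam(z):
    """Hard decision: nearest 16-QAM point, taken independently per axis."""
    def nearest(v):
        return min(LEVELS, key=lambda level: abs(v - level))
    return complex(nearest(z.real), nearest(z.imag))

def vblast_detect(H, y):
    n = len(H[0])
    f = list(range(n))            # index vector (0-based antenna indexes)
    Hi = [row[:] for row in H]    # H_1 = H
    yi = list(y)                  # y_1 = y
    s_hat = [0j] * n
    for _ in range(n):
        G = pinv_left(Hi)                                        # generalized inverse
        k = min(range(len(G)),                                   # ordering criterion
                key=lambda j: sum(abs(g) ** 2 for g in G[j]))
        p = f[k]                                                 # original antenna index
        z = sum(g * v for g, v in zip(G[k], yi))                 # nulling
        s_hat[p] = slice_16qam(z)                                # slicing
        yi = [v - s_hat[p] * H[r][p] for r, v in enumerate(yi)]  # cancellation
        Hi = [row[:k] + row[k + 1:] for row in Hi]               # deflation
        f.pop(k)                                                 # index reshaping
    return s_hat

# Noiseless toy example with an arbitrary, well-conditioned channel matrix.
H = [[2 + 1j, 0.3, -0.2j, 0.1],
     [0.1j, 1.5 - 0.5j, 0.2, -0.3],
     [0.2, -0.1, 1.8 + 0.2j, 0.4j],
     [-0.3j, 0.2, 0.1, 2.2 - 1j]]
s = [3 - 1j, -1 + 3j, 1 + 1j, -3 - 3j]
y = [sum(h * x for h, x in zip(row, s)) for row in H]
print(vblast_detect(H, y) == s)
```

Selecting the row of smallest norm picks the layer with the highest post-nulling SNR, so the most reliable symbol is decoded and cancelled first.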

Architecture Proposal
The block diagram of the proposed architecture for the sub-optimal hard-output V-BLAST-based SM-MIMO algorithm is shown in Figure 1. The overall architecture consists of two components: the Data-Path (DP), constituted by processing elements that implement all necessary mathematical operations, and the Control-Path (CP), which provides the signaling that synchronizes the data flow for the appropriate decoding of vector $\mathbf{s}$ (represented by output $\hat{\mathbf{s}}$). In Figure 1, clock is the system clock, and $\mathbf{y}$ and $\mathbf{H}$ are the inputs. Additional external signals are employed for initialization of V-BLAST operations through a reset signal, detection of valid information at the inputs (signal DV_inputs), and indication of valid information at the output (flag DV_output). The V-BLAST architecture is designed to perform symbol decoding of $\mathbf{s}$ generated from a 16-QAM constellation in AWGN Rayleigh fading channels for MIMO systems with $n = 4$. The details concerning the design of the data-path and control-path components are presented in the subsequent sections.

Finite-Precision Analysis
The configuration of the DP component in the V-BLAST architecture considers design specifications based on fixed-point arithmetic for inputs $\mathbf{y}$ and $\mathbf{H}$ (results were obtained from floating-point and fixed-point MATLAB simulations). Figure 2 shows the BER performance of the V-BLAST algorithm under these specifications. Optimal performance is given by the ML solution (ML-D legend, red line). It can be seen from the figure that a 16-bit fixed-point word-length for $\mathbf{H}$ gives a V-BLAST performance within 0.01 dB of its floating-point model (blue and blue-dotted lines). In contrast, reducing the precision to an 8-bit word-length caused a performance degradation of more than 1 dB (black-dotted line). In addition, 16 bits were also considered for both the real and imaginary parts of each entry in $\mathbf{y}$.

Data-Path Architecture
The architecture of the DP component is illustrated in Figure 3 and consists of the following elements: (i) data multiplexors MR and MYG for selecting between $\mathbf{H}_i = \mathbf{H}$ and $\mathbf{H}_i$, as well as between $\mathbf{y}_i = \mathbf{y}$ and $\mathbf{y}_{i+1}$; (ii) registers REG_Hi and REG_Xi, which store, respectively, information regarding $\mathbf{H}_i$ in (6) and $\mathbf{y}_{i+1}$ in (11); (iii) a block InviPseudo for the matrix inversion and left-pseudoinversion [9][10][11] presented in (7); (iv) P/P-D, which implements the optimization criterion (5); (v) a block that computes the nulling operation in (8); (vi) a slicer block that implements the slicing operation for quantization (10); (vii) a block that performs the cancellation (11); (viii) So, which is in charge of properly assigning decoded symbol entries to vector $\hat{\mathbf{s}}$; and (ix) a block that manages the deflated matrix versions in (12) and the index vector reshaping in (13).
The V-BLAST architecture provides one decoded symbol entry of $\hat{\mathbf{s}}$ per iteration $i$ (exactly $n$ iterations are required), meaning that multiplexors MYG and MR and registers REG_Hi and REG_Xi regulate the data flow of the iterates $\mathbf{H}_i$, $\mathbf{y}_i$, and $\mathbf{y}_{i+1}$. For $i = 1$, MR selects matrix $\mathbf{H}$ and MYG selects vector $\mathbf{y}$; whenever $i > 1$, they select the registered values instead. The deflation block generates the deflated versions $\mathbf{H}_i$ from $\mathbf{H}_1$ and keeps track of indexes $p_i$ and $k_i$. It also provides the pertinent column $\mathbf{h}_{p_i}$ at every iteration $i$. As mentioned before, to perform the deflating operation in (12), the $p_i$th column of the current matrix is removed and substituted by an $n$-dimensional zero vector. For the index vector update in (13), the $k_i$th element of the $(n - i + 1)$-dimensional array $\mathbf{f}_i$ is assigned to index $p_i$, that is, $p_i = \mathbf{f}_i[k_i]$; this element is then removed from $\mathbf{f}_i$, which is resized into an $(n - i)$-dimensional array $\mathbf{f}_{i+1}$.
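The column zeroing and the index vector resizing described above amount to simple bookkeeping, sketched below in Python. The function name `deflate_step`, the 0-based indexes, the integer matrix entries, and the sequence of selected positions are all hypothetical and serve only to show how the two indexes relate.

```python
def deflate_step(H_cur, f_cur, k):
    """One deflation/index update with hardware-style column zeroing.

    H_cur : n x n matrix as a list of rows; decoded columns are already zero.
    f_cur : antenna indexes still to be decoded (0-based in this sketch).
    k     : position selected by the ordering criterion within f_cur.
    """
    p = f_cur[k]  # original antenna index of the selected layer
    H_next = [[0 if c == p else v for c, v in enumerate(row)] for row in H_cur]
    f_next = f_cur[:k] + f_cur[k + 1:]  # index vector shrinks by one element
    return p, H_next, f_next

H = [[11, 12, 13, 14],
     [21, 22, 23, 24],
     [31, 32, 33, 34],
     [41, 42, 43, 44]]
f = [0, 1, 2, 3]
order = []
for k in (2, 0, 1, 0):  # hypothetical positions chosen at iterations 1 to 4
    p, H, f = deflate_step(H, f, k)
    order.append(p)
print(order)  # [2, 0, 3, 1]
```

Keeping the matrix at a fixed $n \times n$ size and zeroing columns avoids resizing storage in hardware, while the shrinking index vector maps each selected position back to its original antenna.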
The InviPseudo block is the key element for computing the iterative generalized inverses in (7). The heaviest and most critical operation is $(\mathbf{H}_i^H\mathbf{H}_i)^{-1}$, so InviPseudo computes $\mathbf{G}_i$ with a block-matrix Left-Pseudoinverse approach (hereinafter referred to as the LPI kernel), as proposed in [11]. For the case $n = 4$ developed in this V-BLAST architecture, the LPI kernel is divided into a set of entities, each a 2 × 2 complex-valued matrix: the blocks of the partitioned matrix together with the intermediate inverses and products derived from them. Once the iterations are accomplished, $(\mathbf{H}_i^H\mathbf{H}_i)^{-1}$ is reassembled from these entities. Depending on the V-BLAST iteration $i$, the data inside each entity is initialized for the correct computation of the block-matrix inversion within $\mathbf{G}_i$; for example, in case (i), at $i = 1$, two of the blocks are loaded with the 2 × 2 zero matrix and one with the 2 × 2 identity matrix. For the cases (i)-(iv), the matrix elements are taken accordingly from $\mathbf{H}_i^H\mathbf{H}_i$ at the corresponding V-BLAST iteration $i$.

Additionally, all arithmetic divisions in the LPI kernel and throughout the iterations concerning (7) were implemented with CORDIC (Coordinate Rotation Digital Computer) processors [13]. For this purpose, the CORDIC processor (or CORDIC engine) is structured as

$$x[j+1] = x[j] - \sigma_j 2^{-j} y[j], \qquad y[j+1] = y[j] + \sigma_j 2^{-j} x[j], \tag{15}$$

where $\sigma_j \in \{-1, +1\}$ is associated with the polarity of the $N$ micro-rotations in (15) for $0 \le j \le N - 1$; $x[0] = x > 0$ and $y[0] = y$ are the initial conditions in (15); $K$ is an approximated scaling factor; and this engine is customized to compute $\theta = \sum_{j=0}^{N-1} \sigma_j \tan^{-1}(2^{-j})$.

In the control path, the deflated versions of $\mathbf{H}_i$ in (12) are generated through signals get_sREG_Hi01, get_sREG_Hi10, and get_sREG_Hi11. In addition, the role of the iteration signal $i$ is fundamental for synchronizing V-BLAST iterations, because it selects elements in MYG and MR, controls data flow in the data path, allows the generation of the generalized inverses $\mathbf{G}_i$ at InviPseudo, and regulates the choice of nulling vectors in the nulling block.
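The idea of inverting a 4 × 4 matrix through 2 × 2 complex-valued entities, as in the LPI kernel, can be illustrated with the generic Schur-complement identity. The Python sketch below is an assumption-laden illustration: it is not necessarily the exact formulation of [11], the function names are hypothetical, the example blocks are arbitrary, and floating-point division stands in for the CORDIC-based division of the hardware.

```python
def mul(A, B):
    """Product of two 2 x 2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

def sub(A, B):
    return [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]

def neg(A):
    return [[-v for v in row] for row in A]

def inv2(A):
    """Closed-form inverse of a 2 x 2 matrix (divisions by the determinant)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

def block_inverse(A, B, C, D):
    """Blocks of the inverse of [[A, B], [C, D]] via the Schur complement of A."""
    Ai = inv2(A)
    S = sub(D, mul(C, mul(Ai, B)))      # Schur complement: D - C A^-1 B
    Si = inv2(S)
    AiB, CAi = mul(Ai, B), mul(C, Ai)
    top_left = add(Ai, mul(AiB, mul(Si, CAi)))   # A^-1 + A^-1 B S^-1 C A^-1
    top_right = neg(mul(AiB, Si))                # -A^-1 B S^-1
    bottom_left = neg(mul(Si, CAi))              # -S^-1 C A^-1
    return top_left, top_right, bottom_left, Si

# Arbitrary Hermitian example, partitioned into four 2 x 2 blocks.
A = [[4, 1 + 1j], [1 - 1j, 3]]
B = [[0.5, 0.2j], [-0.1j, 0.3]]
C = [[0.5, 0.1j], [-0.2j, 0.3]]
D = [[5, -1j], [1j, 4]]
TL, TR, BL, BR = block_inverse(A, B, C, D)
inverse_4x4 = [TL[0] + TR[0], TL[1] + TR[1], BL[0] + BR[0], BL[1] + BR[1]]
print(inverse_4x4[0])
```

Only two 2 × 2 inversions are needed, each requiring divisions by a single determinant, which is the kind of operation the text assigns to the CORDIC engines.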

Results
The sub-optimal hard-output V-BLAST architecture was designed to operate with transmitted signal vectors generated from 16-QAM modulators in AWGN Rayleigh fading channels with $n = 4$. Simulations for functional validation were programmed in MATLAB. One million MIMO channel realizations were simulated, or fewer if one thousand block errors were found first, where a decoded block consists of $n \log_2 M$ bits. Synthesis was performed with the Altera Quartus II IDE tool for Cyclone III FPGA devices.
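The stopping rule of the simulations (a maximum number of channel realizations, or an earlier stop once enough block errors are collected) can be sketched in Python as follows. The function `run_ber` and the toy error model are hypothetical stand-ins: the actual simulations were written in MATLAB, and the 16 bits per block assume $n = 4$ with 16-QAM.

```python
import random

def run_ber(trial, bits_per_block, max_realizations=1_000_000,
            max_block_errors=1_000):
    """Accumulate bit errors until either stopping condition is met.

    trial() must return the number of bit errors in one decoded block.
    Returns the estimated BER and the number of realizations actually run.
    """
    bit_errors = block_errors = realizations = 0
    while realizations < max_realizations and block_errors < max_block_errors:
        errors = trial()
        realizations += 1
        bit_errors += errors
        block_errors += errors > 0  # a block is in error if any bit is wrong
    return bit_errors / (realizations * bits_per_block), realizations

# Toy stand-in for "transmit, apply the channel, detect, compare":
# each of the 16 bits of a block is flipped with probability 0.01.
rng = random.Random(0)

def toy_trial():
    return sum(rng.random() < 0.01 for _ in range(16))

ber, runs = run_ber(toy_trial, bits_per_block=16,
                    max_realizations=20_000, max_block_errors=200)
print(ber, runs)
```

Stopping on a fixed count of block errors keeps the relative accuracy of the BER estimate roughly constant across SNR points, while the cap on realizations bounds the run time at high SNR.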
5.1. BER Performance. The BER results shown in Figure 2 were corroborated for the finite-precision analysis of the V-BLAST architecture. That is, the FPGA implementation showed the same performance as the one obtained in simulation for the fixed-point case, confirming a negligible degradation compared to the theoretical fixed-point performance. Two different simulation scenarios were considered: (i) a MATLAB-based simulation model used for obtaining floating-point and fixed-point (restricted to 16-bit precision) ML and V-BLAST performances; (ii) a testbench designed for evaluating the performance of the device under test, whose test vectors were generated with MATLAB.

Synthesis Results
The Cyclone III device from the Altera FPGA family was selected as the implementation target for the V-BLAST architecture. Different synthesis and fitting modes were performed within the Quartus II IDE tool: synthesis considered speed (sp), balanced (bd), and area (ar) optimization techniques, while fitting considered standard (std) and fast (fst) fitter efforts. Therefore, six different modes were evaluated for implementation purposes, namely: sp-std, bd-std, ar-std, sp-fst, bd-fst, and ar-fst. In all of these cases, hardware complexity resided in logic elements (LEs) and embedded multipliers (eMults). The whole V-BLAST architecture demanded, respectively, 27% and 72% of the total LEs and eMults available in the FPGA device. Regarding LE usage, 62.98% belonged to the InviPseudo block, 24.94% to P/P-D, and 3.67% to [...]. In the floorplan of Figure 4, the labelled regions include [...] (bottom view, label C1) and P/P-D (right-bottom view, label C2).

Figure 4: Floorplan results for the V-BLAST architecture.

Comparison Analysis
Earlier architectural implementations of sub-optimal hard-output V-BLAST-based SM-MIMO solutions are reported in [7,8]. Those implementations are structured on ASIC devices, which inhibits design flexibility and cost-effectiveness; in contrast, the FPGA-based VLSI architectural approach developed in this work represents a modular, portable, and scalable implementation of the V-BLAST algorithm, as depicted in Figure 3. The modules constituting the V-BLAST architecture are described at RTL level, yielding moderate support for high-dimensional lattices (i.e., $M$-QAM with $M$ = 16, 64, 256) and high-dimensional MIMO communication scenarios. Performance results reported in Table 1 are competitive against other state-of-the-art solutions: a low-dimensional lattice (i.e., QPSK) V-BLAST [7] and a high-dimensional lattice (i.e., 16-QAM) V-BLAST [8]. The heaviest hardware complexity resides in how the iterative $\mathbf{H}_i^{\dagger}$ operations are performed. For instance, both the LPI kernel and the $\mathbf{H}_i^{\dagger}$ operations handled in [7] yield the same $O(n^3)$ complexity; however, the LPI kernel avoids unitary matrix transformations, which demand a more sophisticated and complex control-path design. Also, the use of more CORDIC engines in [7,8] affects data throughput as well as symbol decoding latencies. On the other hand, the $\mathbf{H}_i^{\dagger}$ operations in [8] are alleviated through level-thresholding, which incurs BER performance degradation for sub-optimal hard-output SM-MIMO demodulation purposes.

Conclusions
In this work, a VLSI architecture for the sub-optimal hard-output SM-MIMO V-BLAST algorithm was proposed. The architecture was designed to operate with symbols drawn from 16-QAM modulators under AWGN Rayleigh fading channels with parameter $n = 4$. Simulation testing and hardware implementation on Altera Cyclone III FPGA devices validated the functionality of the V-BLAST architecture.

F 1 :
Block diagram for the V-BLAST architecture proposal.

F 3 :
Architecture for the DP component of the V-BLAST.Red arrows are control signals generated by CP.
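In the nulling operation (8), $\mathbf{y}_i = [y_1 \cdots y_n]^T$, $\mathbf{w}_{k_i} = [\,g_{k_i,1} \cdots g_{k_i,n}\,]$, and $z_{k_i}$ is the signal to be sliced.

The slicer block transforms $z_{k_i}$ into a signal point $\hat{s}_{k_i}$ belonging to an $M$-QAM constellation ($\hat{s}_{k_i}$ is coded into a $\log_2 M$-bit word). For $1 \le i \le n$, the complex additions and multiplications inherent in $\mathbf{y}_{i+1} = \mathbf{y}_i - \mathbf{h}_{p_i} \cdot \hat{s}_{k_i}$ complete the cancellation with the sliced symbol $\hat{s}_{k_i}$, the multiplexed $\mathbf{y}_i$ vector, and the selected column $\mathbf{h}_{p_i}$ from $\mathbf{H}_1$. The So block registers the decoded symbol entries $\hat{s}_{k_i}$ and assigns them to elements of vector $\hat{\mathbf{s}}$ based on index $p_i$, in other words, $\hat{\mathbf{s}}[p_i] = \hat{s}_{k_i}$ (the value of $p_i$ is the $k_i$th element of vector $\mathbf{f}_i$). The vector $\mathbf{y}_i$ is updated into $\mathbf{y}_{i+1}$ following (11) through REG_Xi; the same happens to the deflated matrix versions $\mathbf{H}_i$ following (12) through REG_Hi. Moreover, the deflated matrix versions are handled by simply zeroing the $p_i$th column after each iteration $i$.

The control signals operate as follows: (i) signals get_Hi and get_Xi capture data $\mathbf{H}_i$ and $\mathbf{y}_i$ in registers REG_Hi and REG_Xi, respectively; (ii) signal sasiSo captures every decoded entry $\hat{s}_{k_i}$ in $\hat{\mathbf{s}}$ at block So; (iii) the resizing of $\mathbf{f}_i$ in (13) is regulated by signals get_sXYZ, get_sAB, and get_sL12.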