Performance and Complexity Evaluation of Iterative Receiver for Coded MIMO-OFDM Systems

Multiple-input multiple-output (MIMO) technology in combination with channel coding is a promising solution for reliable high data rate transmission in future wireless communication systems. However, these technologies pose significant challenges for the design of an iterative receiver. In this paper, an efficient receiver combining soft-input soft-output (SISO) detection based on the low-complexity K-Best (LC-K-Best) decoder with various forward error correction codes, namely, LTE turbo decoder and LDPC decoder, is investigated. We first investigate the convergence behaviors of the iterative MIMO receivers to determine the required inner and outer iterations. Subsequently, the performance of the LC-K-Best based receiver is evaluated in various LTE channel environments and compared with other MIMO detection schemes. Moreover, the computational complexity of the iterative receiver with different channel coding techniques is evaluated and compared for different modulation orders and coding rates. Simulation results show that the LC-K-Best based receiver achieves satisfactory performance-complexity trade-offs.


Introduction
The ever-increasing demand for higher data rates and better link reliability poses challenges for modern wireless communication systems such as IEEE 802.11, 802.16, DVB-NGH, 3GPP long term evolution (LTE), and LTE-Advanced (LTE-A). The combination of multiple antennas at the transmitter and/or receiver, the orthogonal frequency-division multiplexing (OFDM) technique, state-of-the-art channel coding schemes, and iterative reception techniques is seen as a promising solution for future wireless systems.
MIMO technology, which utilizes multiple antennas at the transmitter and/or receiver, is able to achieve high diversity through space-time coding and high data rate through spatial multiplexing [1]. It is commonly used in combination with the OFDM technique to combat intersymbol interference (ISI) and therefore achieve better spectral efficiency. Modern channel coding schemes such as turbo codes or LDPC codes are powerful forward error correction (FEC) codes that are able to protect the integrity of the transmitted data and to approach the channel capacity. Therefore, coded MIMO-OFDM systems are recognized as attractive solutions for future high speed wireless communication systems. However, the practical design of such coded MIMO-OFDM systems involves numerous challenges at the receiver.
The reception strategy that offers the best performance is to jointly detect and decode the received symbols. However, this joint detection scheme has been shown to be very complex and infeasible for practical implementation [2]. Alternatively, the optimal performance can be approached by iterative processing, commonly referred to as turbo processing [3][4][5][6], which replaces the joint detection by iteratively performing independent detection and decoding. It consists of a soft-input soft-output (SISO) detector and a channel decoder that exchange "soft" information [7].
Regarding the MIMO detection method, the optimal way relies on the maximum a posteriori probability (MAP) algorithm. However, its complexity increases exponentially with the number of transmit antennas and the modulation order. Hence, several suboptimal but low-complexity detectors have been proposed in the literature. These solutions include the families of linear equalizers, interference cancellers, and tree-search detectors. To achieve better performance, the design and the implementation of SISO MIMO detectors have also been widely investigated, such as the minimum mean square error-interference cancellation (MMSE-IC) [8,9], improved VBLAST (I-VBLAST) [10,11], list sphere decoder (LSD) [12], single tree-search sphere decoder (STS-SD) [13][14][15], K-Best decoder [16][17][18][19][20], and fixed sphere decoder (FSD) [21][22][23]. Among them, MMSE-IC and I-VBLAST present low computational complexity, but they are not able to fully exploit the spatial diversity of the MIMO system. Meanwhile, the sphere decoder is able to achieve superior performance. However, the sphere decoder uses a depth-first search method. Therefore, its computational complexity varies significantly with the channel condition, yielding prohibitive worst-case complexity. Moreover, the sphere decoder suffers from variable throughput due to its sequential tree-search strategy, which makes it unsuitable for parallel implementation. In contrast, the breadth-first search based K-Best and FSD algorithms are more attractive for practical implementation than sphere decoding, as they can offer stable throughput at the cost of an acceptable performance loss.
Despite these efforts, it is still very challenging to develop a high speed iterative MIMO receiver that meets the high throughput requirements of future wireless communication systems at affordable complexity and implementation cost. In [24], the performance-complexity trade-offs of iterative MIMO receivers have been investigated. However, that investigation is limited to turbo channel coding and theoretical channel cases. In this contribution, the performance and the complexity of iterative MIMO receivers are evaluated in a much broader and more practical scope. We investigate in depth the soft joint iterative detection schemes with various symbol detection schemes, various soft-input soft-output channel decoders, and various ways of constructing joint loops, under different channel conditions. In particular, the most representative modern channel coding schemes, including LTE turbo code and LDPC code, are considered. Several LTE multipath channel models are employed in the simulation to evaluate the performance in real propagation scenarios. Consequently, a detailed comparative study is conducted among iterative receivers with different modulations and channel coding schemes (turbo, LDPC). The comparison shows that the LC-K-Best based receiver achieves the best trade-off between performance and complexity among the iterative MIMO receivers considered in this work.
The remainder of this paper is organized as follows. Section 2 presents the MIMO-OFDM system model and the concept of the iterative detection-decoding process. Channel decoding based on turbo decoder and LDPC decoder is described in Section 3. Section 4 briefly reviews the most relevant SISO MIMO detection algorithms based on sphere decoder, LC-K-Best decoder, and interference canceller. In Section 5, the convergence behavior of the iterative receivers is discussed using the extrinsic information transfer (EXIT) chart to determine the required number of inner and outer iterations. Section 6 illustrates the performance of our proposed approaches in LTE-based channel environments. Then, the computational complexity of the receivers with both turbo and LDPC coding techniques is evaluated and compared for different modulation orders and coding rates. Section 7 concludes the paper.

MIMO-OFDM System Model.
We consider a MIMO-OFDM system based on the bit-interleaved coded modulation (BICM) scheme [25] with $N_t$ transmit antennas and $N_r$ receive antennas ($N_r \geq N_t$) as depicted in Figure 1.
At the transmitter, the information bits of length $K_b$ are first encoded by a channel encoder which outputs a codeword $\mathbf{c}$ of length $N_b$ with a coding rate $R_c = K_b/N_b$. The channel encoder can be a turbo encoder or an LDPC encoder. The encoded bits are then randomly interleaved and mapped into complex symbols of a $2^Q$-ary quadrature amplitude modulation (QAM) constellation, where $Q$ is the number of bits per symbol. The symbols are mapped onto $N_t$ transmit antennas using either space-time block coding (STBC) schemes or spatial multiplexing (SM) schemes offering different diversity gain and multiplexing gain trade-offs. Herein, the SM-based MIMO system is considered without loss of generality. IFFT is applied to $N_c$ parallel symbols to obtain the time domain OFDM symbols, where $N_c$ is the number of useful subcarriers. The symbols are then sent through the radio channel after the addition of the cyclic prefix (CP), which is assumed larger than the maximum delay spread of the channel. The time domain symbol transmitted by the $i$th antenna is expressed as
$$x_i(t) = \frac{1}{\sqrt{N_{\mathrm{FFT}}}} \sum_{n=0}^{N_c-1} s_i(n)\, e^{j 2\pi n t / N_{\mathrm{FFT}}}, \quad -N_g \le t \le N_{\mathrm{FFT}} - 1, \tag{1}$$
where $s_i(n)$ is the symbol in the frequency domain before IFFT, $N_{\mathrm{FFT}}$ is the size of the FFT, and $N_g$ is the length of the CP. The transmit power is normalized so that $E\{\mathbf{s}\mathbf{s}^H\} = (E_s/N_t)\,\mathbf{I}_{N_t}$, where $\mathbf{I}_{N_t}$ is the $N_t \times N_t$ identity matrix. The transmission information rate is $R_c \cdot N_t \cdot Q$ bits per channel use.
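The IFFT and cyclic-prefix steps described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the simulation chain of the paper: all subcarriers are used, the sizes (64-point FFT, 16-sample CP), the QPSK symbols, and the three channel taps are hypothetical values chosen for the example. It also shows why a CP longer than the channel memory turns the channel into one complex gain per subcarrier.

```python
import numpy as np

def ofdm_modulate(freq_symbols, cp_len):
    """IFFT followed by cyclic-prefix insertion for one OFDM symbol."""
    n_fft = len(freq_symbols)
    # Unitary scaling so that the transform preserves the average power.
    time_samples = np.fft.ifft(freq_symbols) * np.sqrt(n_fft)
    return np.concatenate([time_samples[-cp_len:], time_samples])

def ofdm_demodulate(rx_samples, cp_len):
    """Cyclic-prefix removal followed by FFT."""
    body = rx_samples[cp_len:]
    return np.fft.fft(body) / np.sqrt(len(body))

rng = np.random.default_rng(0)
n_fft, cp_len = 64, 16                      # illustrative sizes (assumption)
symbols = (rng.choice([-1.0, 1.0], n_fft)
           + 1j * rng.choice([-1.0, 1.0], n_fft)) / np.sqrt(2)
channel_taps = np.array([0.8, 0.5, 0.3])    # hypothetical channel, shorter than the CP

tx = ofdm_modulate(symbols, cp_len)
rx = np.convolve(tx, channel_taps)[:len(tx)]     # noise-free multipath channel
rx_freq = ofdm_demodulate(rx, cp_len)
channel_freq = np.fft.fft(channel_taps, n_fft)   # per-subcarrier channel gain
equalized = rx_freq / channel_freq               # one-tap equalization
```

Because the CP absorbs the channel memory, the linear convolution acts as a circular one on the useful part of the symbol, so each subcarrier sees a single multiplicative coefficient.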
Using the OFDM technique, the frequency-selective fading channel is divided into a series of orthogonal and flat-fading subchannels. The signal equalization is performed by a simple one-tap equalizer at the receiver. Therefore, after the removal of the CP, FFT is performed to get the frequency domain signal vector $\mathbf{y}_n = [y_1, y_2, \ldots, y_{N_r}]^T$ that can be expressed as
$$\mathbf{y}_n = \mathbf{H}_n \mathbf{s}_n + \mathbf{n}_n, \tag{2}$$
where $n = 1, \ldots, N_c$ is the index of subcarriers. For simplicity, the subcarrier index $n$ is omitted in the sequel. $\mathbf{H}$ is the $N_r \times N_t$ channel matrix with its $(j, i)$th element $h_{j,i}$, the channel frequency response of the link from the $i$th transmit antenna to the $j$th receive antenna. The coefficients of the channel matrix $\mathbf{H}$ are assumed to be perfectly known at the receiver. $\mathbf{n} = [n_1, n_2, \ldots, n_{N_r}]^T$ is the additive noise vector.
Figure 1: Block diagram of MIMO-OFDM system using bit-interleaved coded modulation with iterative detection and decoding.

Iterative Detection-Decoding Principle.
At the receiver, to recover the transmitted signal from interference, an iterative detection-decoding process based on the turbo principle is applied as depicted in Figure 1. The MIMO detector and the channel decoder exchange soft information, that is, log-likelihood ratios (LLRs), in each iteration.
The MIMO detector takes the received symbol vector $\mathbf{y}$ and the a priori information $L_{A1}$ of the coded bits from the channel decoder and computes the extrinsic information $L_{E1}$. The MIMO detection algorithm can be the MAP algorithm or other suboptimal algorithms like STS-SD, K-Best decoder, I-VBLAST, or MMSE-IC. The extrinsic information is deinterleaved and becomes the a priori information $L_{A2}$ for the channel decoder. The channel decoder computes the extrinsic information $L_{E2}$ that is reinterleaved and fed back to the detector as the a priori information $L_{A1}$.
The channel decoding is performed either by an LTE turbo decoder or by an LDPC decoder, each of which exchanges soft information between its component decoders as described in the next section. In our iterative process, we denote the number of outer iterations between the MIMO detector and the channel decoder by $I_{\mathrm{out}}$ and the number of iterations within the turbo decoder or LDPC decoder by $I_{\mathrm{in}}$.
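For QAM, the mapping process can be done independently for the real and imaginary parts. The system model expressed in (2) can be converted into an equivalent real-valued model:
$$\begin{bmatrix} \mathrm{Re}(\mathbf{y}) \\ \mathrm{Im}(\mathbf{y}) \end{bmatrix} = \begin{bmatrix} \mathrm{Re}(\mathbf{H}) & -\mathrm{Im}(\mathbf{H}) \\ \mathrm{Im}(\mathbf{H}) & \mathrm{Re}(\mathbf{H}) \end{bmatrix} \begin{bmatrix} \mathrm{Re}(\mathbf{s}) \\ \mathrm{Im}(\mathbf{s}) \end{bmatrix} + \begin{bmatrix} \mathrm{Re}(\mathbf{n}) \\ \mathrm{Im}(\mathbf{n}) \end{bmatrix}, \tag{3}$$
where $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ represent the real and imaginary parts of a complex number, respectively. Each QAM constellation point is treated as two PAM symbols, and the matrix dimension is doubled. However, as shown in [17], the real-valued model is more efficient for the implementation of the sphere decoder. Hence, it will be used as the system model in case of sphere decoding in the following sections.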
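The complex-to-real conversion above can be checked numerically. The following NumPy sketch builds the stacked real-valued quantities for a noise-free 2 × 2 example and leaves the equivalence to be verified; the helper name `to_real_model` and the random test channel are illustrative assumptions.

```python
import numpy as np

def to_real_model(y, H):
    """Stack real and imaginary parts as in the equivalent real-valued model."""
    y_r = np.concatenate([y.real, y.imag])
    H_r = np.block([[H.real, -H.imag],
                    [H.imag,  H.real]])
    return y_r, H_r

rng = np.random.default_rng(1)
n_t = n_r = 2
H = (rng.standard_normal((n_r, n_t))
     + 1j * rng.standard_normal((n_r, n_t))) / np.sqrt(2)
s = np.array([1 + 1j, -1 + 1j]) / np.sqrt(2)   # two QPSK symbols
y = H @ s                                      # noise-free for clarity

y_r, H_r = to_real_model(y, H)
s_r = np.concatenate([s.real, s.imag])         # each QAM symbol becomes two PAM symbols
```

The real-valued channel matrix has twice as many rows and columns as the complex one, which is the dimension doubling mentioned in the text.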

Soft-Input Soft-Output Channel Decoder
Channel coding is used to protect the useful information from channel distortion and noise by introducing some redundancy. State-of-the-art channel coding schemes such as LDPC codes [26] and turbo codes [3] can effectively approach the Shannon bound. LDPC codes are nowadays adopted in many standards including IEEE 802.11 and DVB-T2, as they achieve very high throughput due to the inherent parallelism of the decoding algorithm. Turbo codes are also adopted in LTE, LTE-A (binary turbo codes), and WiMAX (double binary turbo codes). In this paper, LDPC codes and LTE turbo codes are considered.

Turbo Decoder.
Initially proposed in 1993 [3], turbo codes have attracted great attention due to their capacity-approaching performance. The turbo encoder is constituted by a parallel concatenation of two recursive systematic convolutional encoders separated by an interleaver. The first encoder processes the original data while the second processes the interleaved version of the data. The main role of the interleaver is to reduce the degree of correlation between the outputs of the component encoders.
In the LTE system, recursive systematic encoders with 8 states and generator polynomials $[13, 15]$ (in octal notation) are adopted. A quadratic polynomial permutation (QPP) interleaver is used as a contention-free interleaver, and it is suitable for parallel decoding of turbo codes as illustrated in Figure 2(a). The mother coding rate is 1/3. Coding rates other than the mother rate can be achieved by puncturing or repetition using the rate matching technique.
The turbo decoding is performed by two SISO component decoders that exchange soft information of their data substreams. Each component decoder takes systematic or interleaved information, the corresponding parity information, and the a priori information from the other decoder to compute its soft output. Two families of decoding algorithms can be used: soft-output Viterbi algorithms (SOVA) [27,28] and the maximum a posteriori (MAP) algorithm [29]. The MAP algorithm offers superior performance but suffers from high computational complexity. Two suboptimal algorithms, namely, log-MAP and max-log-MAP, are used in practice [30]. Herein, the log-MAP algorithm is considered by using the Jacobian logarithm [30]:
$$\max{}^{*}(x, y) = \ln\left(e^{x} + e^{y}\right) = \max(x, y) + f_c(|x - y|), \qquad f_c(|x - y|) = \ln\left(1 + e^{-|x - y|}\right), \tag{4}$$
where $f_c(|x - y|)$ is a correction function that can be computed using a small look-up table (LUT).
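The Jacobian logarithm and its LUT-based variant can be sketched as follows. The table size (8 entries) and quantization step (0.5) are hypothetical choices for illustration, not values taken from the paper; dropping the correction term altogether gives the max-log-MAP approximation.

```python
import math

def max_star(a, b):
    """Jacobian logarithm: ln(exp(a) + exp(b)), computed in a numerically stable way."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

# Hypothetical LUT of the correction term ln(1 + exp(-d)), sampled every 0.5.
LUT_STEP = 0.5
CORRECTION_LUT = [math.log1p(math.exp(-i * LUT_STEP)) for i in range(8)]

def max_star_lut(a, b):
    """log-MAP kernel with the correction term read from a small table."""
    index = int(abs(a - b) / LUT_STEP)
    correction = CORRECTION_LUT[index] if index < len(CORRECTION_LUT) else 0.0
    return max(a, b) + correction

def max_log(a, b):
    """max-log-MAP kernel: the correction term is ignored."""
    return max(a, b)
```

The correction term is largest (ln 2) when the two arguments are equal and vanishes quickly as they move apart, which is why a coarse table is usually considered sufficient.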
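The decoder computes the branch metrics $\gamma_k(s', s)$ and the forward $\alpha_k(s)$ and backward $\beta_k(s)$ metrics between two states $s'$ and $s$ in the trellis as follows:
$$\alpha_k(s) = \max_{s'}{}^{*}\left\{\alpha_{k-1}(s') + \gamma_k(s', s)\right\}, \qquad \beta_{k-1}(s') = \max_{s}{}^{*}\left\{\beta_k(s) + \gamma_k(s', s)\right\}, \tag{5}$$
where $\gamma_k(s', s)$ is the log-domain metric of the transition from state $s'$ to state $s$. The a posteriori LLRs of the information bits are computed as
$$L(u_k) = \max_{(s', s):\, u_k = +1}{}^{*}\left\{\alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s)\right\} - \max_{(s', s):\, u_k = -1}{}^{*}\left\{\alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s)\right\}. \tag{6}$$
The component decoders exchange only the extrinsic LLR, which is defined by
$$L_e(u_k) = L(u_k) - L_a(u_k) - L_s(u_k), \tag{7}$$
where $L_a(u_k)$ and $L_s(u_k)$ correspond to the a priori information from the other decoder and the systematic information bits, respectively.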

LDPC Decoder.
LDPC codes belong to a class of linear error correcting block codes, first proposed by Gallager [26].
Their main advantages lie in their capacity-approaching performance and their low-complexity parallel implementations [31]. LDPC codes can be represented by a parity check matrix $\mathbf{H}_{\mathrm{LDPC}}$, or intuitively through a Tanner graph [32]. The optimal maximum a posteriori decoding of LDPC codes is infeasible from the practical implementation point of view. Alternatively, LDPC decoding is done using message passing or belief propagation algorithms, which iteratively pass messages between check nodes and variable nodes as shown in Figure 3(b). Belief propagation is also called sum-product decoding because the probabilities can be represented as LLRs, which allows the messages to be calculated using sum and product operations.
Let $L_{v_n \to c_m}$ be the message from variable node $n$ to check node $m$ and $L_{c_m \to v_n}$ the message from check node $m$ to variable node $n$. Let $\mathcal{N}(m)$ and $\mathcal{M}(n)$ denote the set of adjacent variable nodes connected to the check node $m$ and the set of adjacent check nodes connected to the variable node $n$, respectively, where $0 \le m \le M$ and $0 \le n \le N$. For the first iteration, the input to the LDPC decoder is the LLRs $L(c_n)$ of the codeword $\mathbf{c}$, which are used as initial values of the extrinsic variable node messages; that is, $L_{v_n \to c_m} = L(c_n)$. For the $l$th iteration, the algorithm can be summarized as follows:
(1) Each check node $m$ computes the extrinsic message to its neighboring variable node $n$:
$$L_{c_m \to v_n} = 2 \tanh^{-1}\left(\prod_{n' \in \mathcal{N}(m) \setminus n} \tanh\left(\frac{L_{v_{n'} \to c_m}}{2}\right)\right). \tag{8}$$
(2) Each variable node $n$ updates its extrinsic information to the check node $m$ in the next iteration:
$$L_{v_n \to c_m} = L(c_n) + \sum_{m' \in \mathcal{M}(n) \setminus m} L_{c_{m'} \to v_n}. \tag{9}$$
(3) The a posteriori LLR of each codeword bit is computed as
$$L_{\mathrm{post}}(c_n) = L(c_n) + \sum_{m \in \mathcal{M}(n)} L_{c_m \to v_n}. \tag{10}$$
The decoding algorithm alternates between check node processing and variable node processing until a maximum number of iterations is reached or until the parity check condition is satisfied. When the decoding process is terminated, the decoder outputs the a posteriori LLRs.
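The three steps above can be sketched as a small flooding-schedule decoder. This is a minimal, unoptimized illustration and not the decoder used in the paper: the LLR sign convention (a positive LLR favours bit 0), the clipping constants, and the toy (7,4) Hamming parity check matrix (which is not actually low density) are assumptions made for the example.

```python
import numpy as np

def sum_product_decode(H, llr_in, max_iter=20):
    """Sum-product decoding; a positive LLR favours bit 0 (assumed convention)."""
    m, n = H.shape
    edges = H.astype(bool)
    v2c = np.where(edges, llr_in, 0.0)       # initial variable-to-check messages
    for _ in range(max_iter):
        # Step 1: check node update (tanh rule, excluding the target variable).
        t = np.tanh(np.clip(v2c, -30.0, 30.0) / 2.0)
        c2v = np.zeros((m, n))
        for i in range(m):
            idx = np.flatnonzero(edges[i])
            for j in idx:
                prod = np.prod(t[i, idx[idx != j]])
                prod = np.clip(prod, -0.999999, 0.999999)   # keep arctanh finite
                c2v[i, j] = 2.0 * np.arctanh(prod)
        # Step 3: a posteriori LLRs and hard decision.
        llr_post = llr_in + c2v.sum(axis=0)
        hard = (llr_post < 0).astype(int)
        if not np.any((H @ hard) % 2):       # all parity checks satisfied
            break
        # Step 2: variable node update (remove the message of the target check).
        v2c = np.where(edges, llr_post - c2v, 0.0)
    return llr_post, hard

# Toy example: (7,4) Hamming code, all-zero codeword, third bit received unreliably.
H_toy = np.array([[1, 1, 0, 1, 1, 0, 0],
                  [1, 0, 1, 1, 0, 1, 0],
                  [0, 1, 1, 1, 0, 0, 1]])
llr_channel = np.array([2.0, 2.0, -0.5, 2.0, 2.0, 2.0, 2.0])
llr_post, decoded = sum_product_decode(H_toy, llr_channel)
```

In this example the two checks involving the third bit push its LLR back to a positive value within the first iteration, so the early-termination test on the parity checks stops the decoder immediately.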

Soft-Input Soft-Output MIMO Detection
The aim of MIMO detection is to recover the transmitted vector $\mathbf{s}$ from the received vector $\mathbf{y}$. The state-of-the-art MIMO detection algorithms have been presented in [24]. These algorithms can be divided into two main families, namely, tree-search-based detection and interference-cancellation-based detection. In this section, we briefly review the main existing SISO MIMO detection algorithms that are used in the following sections.

Maximum A Posteriori Probability (MAP) Detection.
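The MAP algorithm achieves the optimum performance through an exhaustive search over all $2^{Q \cdot N_t}$ possible symbol combinations to compute the LLR of each bit. The LLR of the $b$th bit in the $i$th transmit symbol, $x_{i,b}$, is given by
$$L(x_{i,b}) = \ln \frac{\sum_{\mathbf{s} \in \chi_{i,b}^{+1}} p(\mathbf{y} \mid \mathbf{s})\, P(\mathbf{s})}{\sum_{\mathbf{s} \in \chi_{i,b}^{-1}} p(\mathbf{y} \mid \mathbf{s})\, P(\mathbf{s})}, \tag{11}$$
where $\chi_{i,b}^{+1}$ and $\chi_{i,b}^{-1}$ denote the sets of symbol vectors in which the $b$th bit in the $i$th antenna is equal to $+1$ and $-1$, respectively. $p(\mathbf{y} \mid \mathbf{s})$ is the conditional probability density function given by
$$p(\mathbf{y} \mid \mathbf{s}) = \frac{1}{(\pi N_0)^{N_r}} \exp\left(-\frac{\|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2}{N_0}\right). \tag{12}$$
$P(\mathbf{s})$ represents the a priori information provided by the channel decoder in the form of a priori LLRs:
$$P(\mathbf{s}) = \prod_{i=1}^{N_t} \prod_{b=1}^{Q} P(x_{i,b}), \qquad L_A(x_{i,b}) = \ln \frac{P(x_{i,b} = +1)}{P(x_{i,b} = -1)}. \tag{13}$$
The max-log-MAP approximation is commonly used to compute the LLRs with lower complexity [12]:
$$L(x_{i,b}) \approx \min_{\mathbf{s} \in \chi_{i,b}^{-1}} \{d_1\} - \min_{\mathbf{s} \in \chi_{i,b}^{+1}} \{d_1\}, \tag{14}$$
$$d_1 = \frac{\|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2}{N_0} - \sum_{i=1}^{N_t} \sum_{b=1}^{Q} \ln P(x_{i,b}), \tag{15}$$
where $d_1$ represents the Euclidean distance between the received vector $\mathbf{y}$ and the lattice points $\mathbf{H}\mathbf{s}$, complemented by the a priori term.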
Based on the a posteriori LLRs $L(x_{i,b})$ and the a priori LLRs $L_A(x_{i,b})$, the detector computes the extrinsic LLRs $L_E(x_{i,b})$ as
$$L_E(x_{i,b}) = L(x_{i,b}) - L_A(x_{i,b}). \tag{16}$$
The MAP algorithm is not feasible in practice due to its exponential complexity, since $2^{Q \cdot N_t}$ hypotheses have to be considered within each minimum term and for each bit. Therefore, several suboptimal MIMO detectors with reduced complexity have been proposed, as briefly discussed in the following sections.
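To make the cost of the exhaustive search concrete, the max-log-MAP detector can be sketched for a toy 2 × 2 QPSK system. This is an illustrative sketch under assumptions: bits take values in {-1, +1}, the noise variance is N0, the a priori term uses ln P(x) = x·L_A/2 up to a constant that cancels in the LLR, and the QPSK labelling and function name are hypothetical.

```python
import itertools
import numpy as np

def max_log_map_llrs(y, H, n0, constellation, bit_labels, prior_llrs=None):
    """Exhaustive max-log-MAP detection; returns (a posteriori, extrinsic) LLRs."""
    n_t = H.shape[1]
    q = bit_labels.shape[1]
    if prior_llrs is None:
        prior_llrs = np.zeros((n_t, q))
    # best[i, b, 0] / best[i, b, 1]: smallest metric with bit (i, b) = -1 / +1.
    best = np.full((n_t, q, 2), np.inf)
    for idx in itertools.product(range(len(constellation)), repeat=n_t):
        s = constellation[list(idx)]
        bits = bit_labels[list(idx)]                 # shape (n_t, q), entries +-1
        d1 = (np.linalg.norm(y - H @ s) ** 2 / n0
              - 0.5 * np.sum(bits * prior_llrs))     # distance plus a priori term
        sel = (bits > 0).astype(int)
        for i in range(n_t):
            for b in range(q):
                best[i, b, sel[i, b]] = min(best[i, b, sel[i, b]], d1)
    llr_post = best[..., 0] - best[..., 1]
    return llr_post, llr_post - prior_llrs

rng = np.random.default_rng(2)
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
labels = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]])   # assumed labelling
H = (rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))) / np.sqrt(2)
sent = [0, 3]
y = H @ qpsk[sent]                                   # noise-free observation
llr_post, llr_ext = max_log_map_llrs(y, H, 0.1, qpsk, labels)
n_hypotheses = len(qpsk) ** H.shape[1]               # 2^(Q*Nt) = 16 here
```

Even this toy case visits 16 hypotheses per received vector; with 16-QAM and four antennas the same loop would visit 65,536, which illustrates why the suboptimal detectors reviewed next are needed.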

Tree-Search-Based Detection.
The tree-search-based detection methods generally fall into two main categories, namely, depth-first search like the sphere decoder and breadth-first search like the K-Best decoder.

List Sphere Decoder (LSD).
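The basic idea of the sphere decoder is to limit the search space of the MAP solution to a hypersphere of radius $r_s$ around the received vector. Instead of testing all the hypotheses of the transmitted signal, only the lattice points that lie inside the hypersphere are tested, reducing the computational complexity [33]:
$$\|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 \le r_s^2. \tag{17}$$
Using the QR decomposition in the real-valued model, the channel matrix $\mathbf{H}$ can be decomposed into two matrices $\mathbf{Q}$ and $\mathbf{R}$ ($\mathbf{H} = \mathbf{Q}\mathbf{R}$), where $\mathbf{Q}$ is a $2N_r \times 2N_t$ matrix with orthonormal columns ($\mathbf{Q}^T\mathbf{Q} = \mathbf{I}_{2N_t}$) and $\mathbf{R}$ is a $2N_t \times 2N_t$ upper triangular matrix with real-positive diagonal elements [34]. Therefore, the distance in (17) can be computed as $\|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 = \|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2$ (up to a constant independent of $\mathbf{s}$ when $N_r > N_t$), where $\tilde{\mathbf{y}} = \mathbf{Q}^T\mathbf{y}$ is the modified received symbol vector. Exploiting the triangular nature of $\mathbf{R}$, the Euclidean distance metric $d_1$ in (15) can be recursively evaluated through the accumulated partial Euclidean distance (PED) $d_i$ with $d_{2N_t+1} = 0$ as [13]
$$d_i = d_{i+1} + m_i^C + m_i^A, \qquad i = 2N_t, \ldots, 1, \tag{18}$$
where $m_i^C$ and $m_i^A$ denote the channel-based partial metric and the a priori-based partial metric at the $i$th level, respectively.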
This process can be illustrated by a tree with $2N_t + 1$ levels as depicted in Figure 4(a). The tree search starts at the root level with the first child node at level $2N_t$. The partial Euclidean distance $d_{2N_t}$ in (18) is then computed. If $d_{2N_t}$ is smaller than the sphere radius $r_s$, the search continues at level $2N_t - 1$ and steps down the tree until finding a valid leaf node at level 1.
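The recursive evaluation of the partial Euclidean distances along one tree path can be sketched as follows. The sketch keeps only the channel-based metric (no a priori term), uses a square real-valued channel, and relies on `numpy.linalg.qr`, which does not guarantee positive diagonal entries in R; these are simplifying assumptions that do not affect the distances.

```python
import numpy as np

def partial_distances(y_r, H_r, s_r):
    """Accumulated PEDs of candidate s_r, from the root side (last layer) to the leaf."""
    Q, R = np.linalg.qr(H_r)
    y_t = Q.T @ y_r                          # modified received vector
    n = R.shape[1]
    ped = 0.0
    peds = []
    for i in range(n - 1, -1, -1):           # layers are visited bottom-up in R
        residual = y_t[i] - R[i, i:] @ s_r[i:]
        ped += residual ** 2                 # channel-based partial metric
        peds.append(ped)
    return peds

rng = np.random.default_rng(3)
H_r = rng.standard_normal((4, 4))            # real-valued model of a 2 x 2 system
s_r = np.array([1.0, -1.0, -1.0, 1.0])
y_r = H_r @ s_r + 0.1 * rng.standard_normal(4)
peds = partial_distances(y_r, H_r, s_r)
full_distance = np.linalg.norm(y_r - H_r @ s_r) ** 2
```

Because each layer only adds a non-negative term, a path whose PED already exceeds the sphere radius can be pruned without visiting its children.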
The list sphere decoder was proposed to approximate the MAP detector [12]. It generates a list $\mathcal{L} \subset 2^{Q N_t}$ that includes the best possible hypotheses. The LLR values are then computed from this list as
$$L(x_{i,b}) \approx \min_{\mathbf{s} \in \mathcal{L} \cap \chi_{i,b}^{-1}} \{d_1\} - \min_{\mathbf{s} \in \mathcal{L} \cap \chi_{i,b}^{+1}} \{d_1\}. \tag{19}$$
The main issue of LSD is the missing counter-hypothesis problem, which depends on the list size. A limited list size causes an inaccurate approximation of the LLR when no entry can be found in the list for a particular bit value $x_{i,b} \in \{+1, -1\}$. Several solutions have been proposed to handle this issue. LLR clipping is a frequently used solution, which simply sets the LLR to a predefined maximum value [12,35].
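The list-based LLR computation with clipping can be sketched as below. The clipping level of 8 and the three-entry candidate list are hypothetical values for illustration; the function only assumes that each candidate comes with its bit labels in {-1, +1} and its metric.

```python
import numpy as np

def list_llrs(candidate_bits, metrics, llr_clip=8.0):
    """Max-log LLRs from a candidate list, clipped when a counter-hypothesis is missing."""
    candidate_bits = np.asarray(candidate_bits)      # (list size, number of bits)
    metrics = np.asarray(metrics, dtype=float)
    llrs = np.empty(candidate_bits.shape[1])
    for b in range(candidate_bits.shape[1]):
        plus = metrics[candidate_bits[:, b] > 0]
        minus = metrics[candidate_bits[:, b] < 0]
        if plus.size == 0:                           # no candidate with bit = +1
            llrs[b] = -llr_clip
        elif minus.size == 0:                        # no candidate with bit = -1
            llrs[b] = llr_clip
        else:
            llrs[b] = np.clip(minus.min() - plus.min(), -llr_clip, llr_clip)
    return llrs

# Three candidates; the last bit equals -1 in all of them (missing counter-hypothesis).
example_bits = [[+1, +1, -1],
                [+1, -1, -1],
                [-1, +1, -1]]
example_metrics = [0.5, 2.0, 3.5]
example_llrs = list_llrs(example_bits, example_metrics)
```

For this list the first two LLRs are 3.0 and 1.5, while the third is set to the clipping value with a negative sign because no candidate carries the opposite bit value.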
Several methods can be considered to reduce the complexity of the sphere decoder, such as Schnorr-Euchner (SE) enumeration [36], layer ordering [34], and channel regularization [37]. Layer ordering allows the selection of the most reliable symbols at a high layer using the sorted QR (SQR) decomposition. Channel regularization, however, introduces a biasing factor in the metrics, which should be removed in the LLR computation to avoid performance degradation as discussed in [38]. In the sequel, the SQR decomposition is considered in the preprocessing step.

Single Tree-Search Sphere Decoder (STS-SD).
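One of the two minima in (14) corresponds to the MAP hypothesis $\mathbf{s}^{\mathrm{MAP}}$, while the other corresponds to the counter hypothesis. The computation of the LLR can be expressed by
$$L(x_{i,b}) = \begin{cases} d^{\mathrm{MAP}} - \bar{d}_{i,b}^{\mathrm{MAP}}, & \text{if } x_{i,b}^{\mathrm{MAP}} = -1, \\ \bar{d}_{i,b}^{\mathrm{MAP}} - d^{\mathrm{MAP}}, & \text{if } x_{i,b}^{\mathrm{MAP}} = +1, \end{cases} \tag{20}$$
with
$$d^{\mathrm{MAP}} = \min_{\mathbf{s} \in 2^{Q N_t}} \{d_1\}, \qquad \bar{d}_{i,b}^{\mathrm{MAP}} = \min_{\mathbf{s} \in \chi_{i,b}^{\overline{x_{i,b}^{\mathrm{MAP}}}}} \{d_1\}, \tag{21}$$
where $\bar{d}_{i,b}^{\mathrm{MAP}}$ denotes the metric of the bitwise counter hypothesis of the MAP hypothesis, which is obtained by searching over all the solutions with the $b$th bit of the $i$th symbol opposite to the current MAP hypothesis. Originally, the MAP hypothesis and the counter hypotheses were found by repeating the tree search [39]. The repeated tree search incurs a large computational cost. To overcome this, the single tree-search algorithm [13,40] was developed to compute all the LLRs concurrently. The $d^{\mathrm{MAP}}$ metric and the corresponding $\bar{d}_{i,b}^{\mathrm{MAP}}$ metrics are updated through one tree-search process. Through the use of the extrinsic LLR clipping method, the STS-SD algorithm is tunable between MAP performance and hard-output performance. Implementations of STS-SD have been reported in [14,15].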

SISO K-Best Decoder.
The K-Best algorithm is a breadth-first search based algorithm, in which the tree is traversed only in the forward direction [41]. This approach searches only a fixed number $K$ of paths with the best metrics at each detection layer. Figure 4(b) shows an example of the tree search with $N_t = 2$. The algorithm starts by extending the root node to all possible candidates. It then sorts the new paths according to their metrics and retains the $K$ paths with the smallest metrics for the next detection layer.
The K-Best algorithm is able to achieve near-optimal performance with a fixed and affordable complexity that is suitable for parallel implementation. Yet, the major drawbacks of the K-Best decoder are the expansion and sorting operations, which are very time consuming. Several proposals have been made in the literature to approximate the sorting operations, such as relaxed sorting [42] and distributed sorting [43], or even to avoid sorting by using an on-demand expansion scheme [44]. Moreover, like LSD, the K-Best decoder suffers from the missing counter-hypothesis problem due to the limited list size. Numerous approaches have been proposed to address this problem, such as smart candidates adding [45], bit flipping [46], and path augmentation and LLR clipping [12,35].
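The expand-sort-retain cycle of the classical K-Best search can be sketched as follows. This is a plain hard-output illustration with full expansion and full sorting in the real-valued model, without a priori information; it does not implement the relaxed sorting or on-demand expansion variants cited above, and the 4-PAM alphabet and random channel are assumptions for the example.

```python
import numpy as np

def k_best_detect(y_r, H_r, pam_levels, k):
    """Breadth-first K-Best search; returns the K surviving (metric, symbols) pairs."""
    Q, R = np.linalg.qr(H_r)
    y_t = Q.T @ y_r
    n = R.shape[1]
    survivors = [(0.0, [])]                  # (accumulated PED, symbols of layers > i)
    for i in range(n - 1, -1, -1):
        expanded = []
        for ped, tail in survivors:
            interference = R[i, i + 1:] @ np.array(tail) if tail else 0.0
            for level in pam_levels:         # expand every child of the survivor
                increment = (y_t[i] - interference - R[i, i] * level) ** 2
                expanded.append((ped + increment, [level] + tail))
        expanded.sort(key=lambda cand: cand[0])
        survivors = expanded[:k]             # keep the K best paths only
    return survivors

rng = np.random.default_rng(4)
pam4 = [-3.0, -1.0, 1.0, 3.0]                # real part / imaginary part of 16-QAM
H_r = rng.standard_normal((4, 4))
s_r = np.array([3.0, -1.0, 1.0, -3.0])
y_r = H_r @ s_r                              # noise-free observation
survivors = k_best_detect(y_r, H_r, pam4, k=4)
best_metric, best_symbols = survivors[0]
```

The number of visited nodes is fixed by K and the alphabet size regardless of the channel realization, which is the stable-throughput property discussed in the text; the surviving list can also feed a list-based LLR computation.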

Interference-Cancellation-Based Detection.
Interference-cancellation-based detection can be carried out either in a parallel way as in MMSE-IC [8,9] or in a successive way as in VBLAST [47].

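Minimum Mean Square Error-Interference Cancellation (MMSE-IC) Equalizer.
The MMSE-IC equalizer can be implemented using two filters [4]. The first filter $\mathbf{p}_i$ is applied to the received vector $\mathbf{y}$, and the second filter $\mathbf{q}_i$ is applied to the estimated vector $\hat{\mathbf{s}}_i$ in order to cancel the interference from other layers. The equalized symbol $\tilde{s}_i$ can be written as
$$\tilde{s}_i = \mathbf{p}_i^H \mathbf{y} - \mathbf{q}_i^H \hat{\mathbf{s}}_i, \tag{22}$$
where $\hat{\mathbf{s}}_i$ denotes the estimated vector given by the previous iteration with the $i$th symbol omitted: $\hat{\mathbf{s}}_i = [\hat{s}_1, \ldots, \hat{s}_{i-1}, 0, \hat{s}_{i+1}, \ldots, \hat{s}_{N_t}]^T$. Each $\hat{s}_i$ is calculated by the soft mapper as $\hat{s}_i = E[s_i] = \sum_{s \in 2^Q} s\, P(s_i = s)$ [48]. The filters $\mathbf{p}_i$ and $\mathbf{q}_i$ are optimized using the MMSE criterion and are given in [6,24]. For the first iteration, since no a priori information is available, the equalization process is reduced to the classical MMSE solution:
$$\mathbf{p}_i^H = \mathbf{h}_i^H \left(\mathbf{H}\mathbf{H}^H + \frac{N_t N_0}{E_s}\, \mathbf{I}_{N_r}\right)^{-1}, \qquad \mathbf{q}_i = \mathbf{0}. \tag{23}$$
The equalized symbols $\tilde{s}_i$ are associated with a bias factor $\beta_i$ in addition to some residual noise plus interference $\eta_i$:
$$\tilde{s}_i = \beta_i s_i + \eta_i. \tag{24}$$
These equalized symbols are then used by the soft demapper to compute the LLR values using the max-log-MAP approximation [48]:
$$L(x_{i,b}) = \frac{1}{\sigma_{\eta_i}^2} \left( \min_{s \in \chi_{i,b}^{-1}} |\tilde{s}_i - \beta_i s|^2 - \min_{s \in \chi_{i,b}^{+1}} |\tilde{s}_i - \beta_i s|^2 \right), \tag{25}$$
where $\sigma_{\eta_i}^2$ is the variance of $\eta_i$. The MMSE-IC equalizer requires $N_t$ matrix inversions for each symbol vector. For this reason, several approximations of MMSE-IC have been proposed. For example, in [9], a low-complexity variant of MMSE-IC is described that performs a single matrix inversion without performance loss. This algorithm is referred to as LC-MMSE-IC.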
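The first-iteration case, in which no a priori information is available and the equalizer reduces to the classical MMSE filter, can be sketched as follows. The sketch assumes a noise variance N0 and a per-antenna symbol energy Es/Nt, and it uses a small fixed 2 × 2 channel chosen only for illustration; the soft interference cancellation of later iterations is not implemented.

```python
import numpy as np

def mmse_first_iteration(y, H, n0, es=1.0):
    """Classical MMSE equalization; returns equalized symbols and bias factors."""
    n_r, n_t = H.shape
    sigma_s2 = es / n_t                                  # energy per transmitted symbol
    # Row i of W is the filter h_i^H (H H^H + (N0 / sigma_s2) I)^-1.
    W = H.conj().T @ np.linalg.inv(H @ H.conj().T + (n0 / sigma_s2) * np.eye(n_r))
    s_tilde = W @ y
    beta = np.real(np.diag(W @ H))                       # bias of each equalized symbol
    return s_tilde, beta

H = np.array([[1.0, 0.3j],
              [0.3, 1.0]])                               # hypothetical channel
s = np.array([1 + 1j, -1 - 1j]) / 2.0                    # QPSK with energy Es/Nt = 1/2
y = H @ s                                                # noise-free observation

s_tilde_hi, beta_hi = mmse_first_iteration(y, H, n0=1e-9)   # very high SNR
s_tilde_lo, beta_lo = mmse_first_iteration(y, H, n0=0.5)    # low SNR
```

At high SNR the bias factors approach one and the filter behaves like a zero-forcing equalizer, whereas at low SNR the bias is clearly below one, which is why the soft demapper has to scale the constellation points by the bias before computing distances.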

Successive Interference Cancellation (SIC) Equalizer.
The SIC-based detector was initially used in VBLAST systems. In the VBLAST architecture [47], a successive cancellation step followed by an interference nulling step is used to detect the transmitted symbols. However, this method suffers from error propagation. An improved VBLAST for iterative detection and decoding is described in [49]. At the first iteration, an enhanced VBLAST which takes decision errors into account is employed [24]. When the a priori LLRs are available from the channel decoder, soft symbols are computed by a soft mapper and are used in the interference cancellation. To describe the enhanced VBLAST algorithm, we assume that the detection order has been determined according to the optimal detection order [47]. For the $i$th step, the contribution of the predetected symbol vector $\hat{\mathbf{s}}_{i-1}$ until step $i - 1$ is canceled out from the received signal:
$$\tilde{\mathbf{y}}_i = \mathbf{y} - \mathbf{H}_{i-1} \hat{\mathbf{s}}_{i-1},$$
where $\hat{\mathbf{s}}_{i-1} = [\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_{i-1}]^T$ and $\mathbf{H}_{i-1} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_{i-1}]$, with $\mathbf{h}_i$ being the $i$th column of $\mathbf{H}$. Then the estimated symbol $\tilde{s}_i$ is obtained using a filter matrix $\mathbf{W}_i$ based on the MMSE criterion that takes decision errors into account [11,49]. This filter depends on the decision error covariance matrix of the previously detected symbols and on $\mathbf{e}_i$, a unit vector having zero components except the $i$th component, which is one; the corresponding expressions are given in [11,49]. A soft demapper is then used to compute LLRs according to (25). We refer to this algorithm as improved VBLAST (I-VBLAST) in the sequel.

Low-Complexity K-Best Decoder.
The low-complexity K-Best (LC-K-Best) decoder recently proposed in [20] uses two improvements over the classical K-Best decoder for the sake of lower complexity and latency. The first improvement simplifies the hybrid enumeration of the constellation points in the real-valued system model, when the a priori information is incorporated into the tree search, by using two look-up tables. The second improvement is a relaxed on-demand expansion that reduces the need for exhaustive expansion and sorting operations. The LC-K-Best algorithm can be described as follows.
The preprocessing step is as follows: (1) Input y, H, .
(2) Enumerate the constellation symbols based on   for all layers.
In the case of missing counter hypothesis, LLR clipping method is used.
It has been shown in [20] that the LC-K-Best decoder achieves almost the same performance as the classical K-Best decoder with different modulations. Moreover, the computational complexity in terms of the number of visited nodes is significantly reduced, especially in the case of high-order modulations.

Convergence of Iterative Detection-Decoding
The EXtrinsic Information Transfer (EXIT) chart is a useful tool to study the convergence behavior of iterative decoding systems [50].It describes the exchange of the mutual information in the iterative process in order to predict the required number of iterations, the convergence threshold (corresponding to the start of the waterfall region), and the average decoding trajectory.
In the iterative receiver considered in our study, two iterative processes are performed, one inside the channel decoder (turbo or LDPC) and the other between the MIMO detector and the channel decoder. For simplicity, we separately study the convergence of the channel decoding and the MIMO detection. We denote by $I_{A1}$ and $I_{A2}$ the a priori mutual input information of the MIMO detector and the channel decoder, respectively, and by $I_{E1}$ and $I_{E2}$ their corresponding extrinsic mutual output information.
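The mutual information $I_x$ ($I_A$ or $I_E$) can be computed through Monte Carlo simulation using the probability density function $p_{L_x}$ [50]:
$$I_x = \frac{1}{2} \sum_{x \in \{-1, +1\}} \int_{-\infty}^{+\infty} p_{L_x}(l \mid x)\, \log_2 \frac{2\, p_{L_x}(l \mid x)}{p_{L_x}(l \mid -1) + p_{L_x}(l \mid +1)}\, dl.$$
A simple approximation of the mutual information is used in our analysis [51]:
$$I_x \approx 1 - \frac{1}{L_b} \sum_{n=0}^{L_b - 1} \log_2\left(1 + e^{-x_n L_n}\right),$$
where $L_b$ is the number of transmitted bits and $L_n$ is the LLR associated with the bit $x_n \in \{-1, +1\}$.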
The a priori information $L_A$ can be modeled by applying an independent Gaussian random variable $n_A$ with zero mean and variance $\sigma_A^2$ in conjunction with the known transmitted information bits $x$ [50]:
$$L_A = \mu_A x + n_A, \qquad \mu_A = \frac{\sigma_A^2}{2}.$$
For each given mutual information value $I_A \in [0, 1]$, $\sigma_A^2$ can be computed using the following equation [52]:
$$\sigma_A^2 \approx \left(-\frac{1}{H_1} \log_2\left(1 - I_A^{1/H_3}\right)\right)^{1/H_2},$$
where $H_1 = 0.3073$, $H_2 = 0.8935$, and $H_3 = 1.1064$. At the beginning, the a priori mutual information is $I_{A1} = 0$ and $I_{A2} = 0$. Then, the extrinsic mutual information $I_{E1}$ of the MIMO detector becomes the a priori mutual information $I_{A2}$ of the channel decoder, and so forth (i.e., $I_{E1} = I_{A2}$ and $I_{E2} = I_{A1}$). For successful decoding, there must be an open tunnel between the curves; the exchange of extrinsic information can be visualized as a "zigzag" decoding trajectory in the EXIT chart.
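The EXIT-chart ingredients described above (Gaussian a priori LLR generation, the closed-form link between mutual information and LLR variance, and the sample-average estimate of the mutual information) can be sketched as follows. The function names, the sample size, and the target value 0.5 are illustrative assumptions; the constants are those quoted in the text.

```python
import numpy as np

H1, H2, H3 = 0.3073, 0.8935, 1.1064

def j_function(sigma):
    """Approximate mutual information carried by Gaussian LLRs of standard deviation sigma."""
    return (1.0 - 2.0 ** (-H1 * sigma ** (2.0 * H2))) ** H3

def j_inverse(mutual_info):
    """Standard deviation sigma_A giving a target mutual information in (0, 1)."""
    return (-(1.0 / H1) * np.log2(1.0 - mutual_info ** (1.0 / H3))) ** (1.0 / (2.0 * H2))

def a_priori_llrs(bits, mutual_info, rng):
    """Gaussian a priori LLRs with mean (sigma^2 / 2) * bit and variance sigma^2."""
    sigma = j_inverse(mutual_info)
    return 0.5 * sigma ** 2 * bits + sigma * rng.standard_normal(bits.shape)

def mutual_information(bits, llrs):
    """Sample-average estimate: 1 - mean(log2(1 + exp(-x * L)))."""
    return 1.0 - np.mean(np.logaddexp(0.0, -bits * llrs)) / np.log(2.0)

rng = np.random.default_rng(5)
bits = rng.choice([-1.0, 1.0], 200000)
target = 0.5
llrs = a_priori_llrs(bits, target, rng)
measured = mutual_information(bits, llrs)
sigma_for_target = j_inverse(target)
```

Sweeping the target from 0 to 1, feeding the generated LLRs to a detector or decoder, and measuring the mutual information of its extrinsic output with `mutual_information` would trace one transfer characteristic of the chart.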
To visualize the exchange of extrinsic information of the iterative receiver, we present the MIMO detector and the channel decoder characteristics in a single chart. For our convergence analysis, a 4 × 4 MIMO system with 16-QAM constellation, turbo decoder, and LDPC decoder ($R_c = 1/2$) is considered. Table 1 summarizes the main parameters for the convergence analysis.
Figure 5 shows the extrinsic information transfer characteristics of MIMO detectors at different $E_b/N_0$ values. As the I-VBLAST detector performs successive interference cancellation at the first iteration and parallel interference cancellation of the soft estimated symbols for the remaining iterations, it is less intuitive to present its convergence in the EXIT chart. Therefore, the convergence analysis of VBLAST is not considered.
The characteristics of the detectors are shifted upward as $E_b/N_0$ increases. We observe that, for low $E_b/N_0$ (0 dB) and for low mutual information (< 0.1), MMSE-IC performs better than the LC-K-Best decoder. However, with larger mutual information, its performance is lower. Moreover, the mutual information of STS-SD is higher than that of the LC-K-Best decoder and MMSE-IC for the different $E_b/N_0$ values. For higher $E_b/N_0$ (5 dB), MMSE-IC presents lower mutual information than the other decoders when $I_{A1} < 0.9$.
Figure 6 shows the EXIT chart for $E_b/N_0 = 2$ dB with several MIMO detectors, namely, STS-SD, LC-K-Best, and MMSE-IC. We note that the characteristic of the channel decoder is independent of the $E_b/N_0$ value. The extrinsic mutual information of the channel decoder increases with the number of iterations. We see that after 6 to 8 iterations in the case of the turbo decoder (Figure 6(a)), there is no significant improvement in the mutual information. Meanwhile, in the case of the LDPC decoder in Figure 6(b), 20 iterations are enough for the decoder to converge.
By comparing the characteristics of STS-SD, LC-K-Best decoder, and MMSE-IC equalizer with both coding schemes, we notice that STS-SD has a larger mutual information at its output. The LC-K-Best decoder has slightly less mutual information than STS-SD. MMSE-IC shows low mutual information levels at its output compared to the other algorithms when $I_{A1} < 0.9$, while for $I_{A1} > 0.9$ the extrinsic mutual information is comparable to the others.
In the case of the turbo decoder (Figure 6(a)) with $I_{\mathrm{in}} = 8$, 3 outer iterations are sufficient for STS-SD to converge at $E_b/N_0 = 2$ dB. However, the same performance can be attained by performing 4 outer iterations with only 2 inner iterations. Similarly, the LC-K-Best decoder shows an equivalent performance, but a slightly higher $E_b/N_0$ is required. The convergence speed of the LC-K-Best decoder is a bit lower than that of STS-SD, so it requires more iterations to reach the same performance. The reason is mainly the unreliability of the LLRs caused by the small list size (L = 16). In the case of MMSE-IC, the characteristic presents a lower mutual information than the LC-K-Best decoder when $I_{A1} < 0.9$. Therefore, an equivalent performance can be obtained only at higher $E_b/N_0$ or by performing more iterations.
In a similar way, we study the convergence of the MIMO detection algorithms with the LDPC decoder. Figure 6(b) shows the EXIT chart at $E_b/N_0 = 2$ dB. The same conclusions can be drawn as in the case of the turbo decoder. We can see that, at $E_b/N_0 = 2$ dB, a clear tunnel is observed between the MIMO detector and the channel decoder characteristics, allowing iterations to bring improvement to the system. Similarly, STS-SD offers higher mutual information than the LC-K-Best decoder and the MMSE-IC equalizer, which suggests its superior symbol detection performance.
Additionally, the average decoding trajectory resulting from free-run iterative detection-decoding simulations is illustrated in Figure 6 at $E_b/N_0 = 2$ dB, with $I_{out} = 4$, $I_{in} = 2$ in the case of the turbo decoder and $I_{out} = 4$, $I_{in} = 5$ in the case of the LDPC decoder. The decoding trajectory closely matches the characteristics in the case of the STS-SD and LC-K-Best decoders. The small deviation from the characteristics after a few iterations is due to the correlation of the extrinsic information caused by the limited interleaver depth. In the case of MMSE-IC, the decoding trajectory diverges from the characteristics at high mutual information because the equalizer uses the a posteriori information, instead of the extrinsic information, to compute the soft symbols. The best trade-off is therefore to schedule $I_{out}$ iterations in the outer loop, with a total of 8 iterations inside the turbo decoder or 20 iterations inside the LDPC decoder distributed across these $I_{out}$ outer iterations.
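The mutual information values plotted in an EXIT chart can be measured from simulated LLRs. A minimal sketch follows, assuming the LLR convention $L = \ln(P(b=0)/P(b=1))$ and the common time-average estimator $I \approx 1 - \mathrm{E}[\log_2(1 + e^{-xL})]$, with $x = +1$ for bit 0 and $x = -1$ for bit 1. The function names and the Gaussian LLR model used to exercise the estimator are illustrative assumptions, not the simulation setup of this paper.

```python
import math
import random


def mutual_information(llrs, bits):
    """Estimate I(L; X) from LLR samples and the bits that were sent.

    Assumes LLR = ln(P(b=0) / P(b=1)), so a positive LLR favours bit 0.
    """
    total = 0.0
    for llr, bit in zip(llrs, bits):
        x = 1.0 if bit == 0 else -1.0
        z = -x * llr
        # Numerically stable evaluation of log2(1 + exp(z)).
        total += (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / math.log(2.0)
    return 1.0 - total / len(llrs)


def gaussian_llrs(sigma, n, seed=0):
    """Draw n consistent Gaussian LLRs (mean x*sigma^2/2, variance sigma^2).

    This a priori model is an assumption commonly used for EXIT analysis.
    """
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(n)]
    llrs = []
    for b in bits:
        x = 1.0 if b == 0 else -1.0
        llrs.append(x * sigma * sigma / 2.0 + sigma * rng.gauss(0.0, 1.0))
    return llrs, bits


if __name__ == "__main__":
    for sigma in (0.5, 1.0, 2.0, 4.0):
        llrs, bits = gaussian_llrs(sigma, 20000)
        print(sigma, round(mutual_information(llrs, bits), 3))
```

Feeding the detector with a priori LLRs drawn this way and measuring the mutual information of its extrinsic output, for a sweep of `sigma`, is one way to trace a detector characteristic.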

Performance and Complexity Evaluation of Iterative Detection-Decoding
In this section, we evaluate and compare the performance and the complexity of different MIMO detectors, namely, STS-SD, LC-K-Best decoder, MMSE-IC, and I-VBLAST equalizers, with different channel coding techniques (turbo, LDPC). A detailed analysis of the performance and complexity trade-off of MIMO detection with the LTE turbo decoder and 16-QAM modulation in a Rayleigh channel has been presented in [24]. Herein, the performance and the complexity of the receiver with the LDPC decoder are investigated. Moreover, several modulation and coding schemes are considered to quantify the gain achieved by such an iterative receiver in different channel environments. Finally, a comparative study of the iterative receiver with both coding schemes (turbo, LDPC) is conducted.
For the turbo code, the rate 1/3 turbo encoder specified in 3GPP LTE with a block length of 1,024 bits is used in the simulations. Puncturing is performed in the rate matching module to achieve an arbitrary coding rate $R_c$ (e.g., $R_c = 1/2$, $3/4$). Meanwhile, the LDPC encoder specified in IEEE 802.11n is considered. The encoder is defined by a parity check matrix that is formed out of square submatrices of sizes 27, 54, or 81. Herein, a codeword length of 1,944 bits with coding rates $R_c = 1/2$ and $3/4$ is considered.
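The three submatrix sizes map directly to the three IEEE 802.11n codeword lengths, because the base matrices of that standard have 24 block columns. The short sketch below works through this arithmetic; the function name is illustrative.

```python
BASE_COLUMNS = 24  # block columns in every IEEE 802.11n LDPC base matrix


def ldpc_dimensions(z, rate_num, rate_den):
    """Return (codeword length, information bits) for submatrix size z."""
    n = BASE_COLUMNS * z
    k = n * rate_num // rate_den
    return n, k


if __name__ == "__main__":
    for z in (27, 54, 81):
        print(z, ldpc_dimensions(z, 1, 2), ldpc_dimensions(z, 3, 4))
```

With a submatrix size of 81, the codeword has 1,944 bits and carries 972 information bits at rate 1/2, which is about 5% fewer than the 1,024 information bits of the turbo code block.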

Performance Evaluation.
The simulations are first carried out in a Rayleigh fading channel to assess the general performance of the iterative receivers. Realistic channel models are then considered to evaluate the performance in more practical scenarios: the 3GPP LTE-(A) channel environments with low, medium, and large delay spread values and Doppler frequencies. The low spread channel is the Extended Pedestrian A (EPA) model, which emulates an urban environment with small cell sizes ($\tau_{rms} = 43$ ns). The medium spread channel ($\tau_{rms} = 357$ ns) is the Extended Vehicular A (EVA) model. The Extended Typical Urban (ETU) model is the large spread channel, which has a larger excess delay ($\tau_{rms} = 991$ ns) and simulates extreme urban, suburban, and rural cases. Table 2 summarizes the characteristic parameters of these channel environments. In all cases, the channel is assumed to be perfectly known at the receiver. Table 3 gives the principal parameters of the simulations. The performance is measured in terms of bit error rate (BER) with respect to the signal-to-noise ratio (SNR) per bit, $E_b/N_0$.
In our previous study [24], the performance of MIMO detectors with the LTE turbo decoder was evaluated in a Rayleigh channel with various numbers of outer and inner iterations. It was shown that the performance is improved by about 1.5 dB at a BER level of $1 \times 10^{-5}$ with 4 outer iterations. It was also shown that no significant improvement is achieved beyond 4 outer iterations; the additional improvement is less than 0.2 dB with 8 outer iterations.
Similarly to the turbo decoder, we fix the number of inner iterations inside the LDPC decoder to 20 while varying the number of outer iterations. Figure 7 shows the BER performance of the MIMO detectors with the LDPC decoder in a Rayleigh channel with $I_{in} = 20$ iterations and $I_{out} = 1$, 2, 4, or 8 iterations. STS-SD is used with an LLR clipping level of 8, which gives close to MAP performance with a considerable reduction in complexity. For the LC-K-Best decoder, an LLR clipping level of 3 is used in the case of a missing counter hypothesis. A performance improvement of 1.5 dB is observed with 4 outer iterations. For $I_{out} = 8$ iterations, the additional improvement is less than 0.2 dB. Therefore, $I_{out} = 4$ iterations are considered in the sequel.
Figure 8 shows the BER performance of a 4 × 4 16-QAM system in a Rayleigh fading channel with $I_{out} = 4$ and $I_{in} = 2$ in the case of the turbo decoder and $I_{in} = [3, 4, 6, 7]$ in the case of the LDPC decoder. The notation $I_{in} = [3, 4, 6, 7]$ denotes that 3 inner iterations are performed in the 1st outer iteration, 4 inner iterations in the 2nd outer iteration, and so on. The performance of STS-SD with $I_{in} = 8$ (turbo decoder) and $I_{in} = 20$ (LDPC decoder) for each outer iteration is also plotted as a reference. In the case of the turbo decoder, performing $I_{in} = 8$ and $I_{out} = 4$ iterations does not bring a significant improvement compared to performing $I_{in} = 2$ and $I_{out} = 4$ iterations. Similarly, in the case of the LDPC decoder, the performance with $I_{in} = 20$ and $I_{out} = 4$ iterations is comparable to the performance with $I_{in} = [3, 4, 6, 7]$ and $I_{out} = 4$ iterations. Hence, using a large number of iterations does not seem to be efficient, which confirms the results obtained in the convergence analysis of Section 5.
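The inner-iteration schedules discussed above can be expressed as a simple nested loop. The sketch below only illustrates the control flow: `detect` and `decode_one_iteration` are hypothetical placeholders for the SISO detector and for one decoder iteration, and the way information is passed between them is simplified to a single value.

```python
def iterate_receiver(inner_schedule, detect, decode_one_iteration):
    """Run one outer iteration per entry of inner_schedule.

    Outer iteration j calls the detector once (with the decoder output of
    the previous outer iteration as a priori input) and then runs
    inner_schedule[j] decoder iterations.
    """
    prior = None          # no a priori information in the first pass
    inner_total = 0
    for n_inner in inner_schedule:
        state = detect(prior)
        for _ in range(n_inner):
            state = decode_one_iteration(state)
            inner_total += 1
        prior = state
    return prior, inner_total


if __name__ == "__main__":
    # Dummy callables that only count how many decoder iterations ran.
    detect = lambda prior: 0 if prior is None else prior
    decode = lambda state: state + 1
    for schedule in ([2, 2, 2, 2], [5, 5, 5, 5], [3, 4, 6, 7]):
        print(schedule, iterate_receiver(schedule, detect, decode)[1])
```

Both `[5, 5, 5, 5]` and `[3, 4, 6, 7]` spend 20 LDPC iterations in total over 4 outer iterations; the increasing schedule simply spends fewer of them in the early outer iterations, when the detector output is still unreliable.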
Comparing the algorithms, the LC-K-Best decoder shows a degradation of less than 0.2 dB compared to STS-SD at a BER level of $2 \times 10^{-5}$. However, it outperforms the MMSE-IC and I-VBLAST equalizers by about 0.2 dB at a BER level of $2 \times 10^{-5}$. MMSE-IC and I-VBLAST show almost the same performance. In addition, in the case of the LDPC decoder (Figure 8(b)), we notice that increasing the number of inner iterations with each outer iteration, $I_{in} = [3, 4, 6, 7]$, gives slightly better performance than performing an equal number of iterations, $I_{in} = 5$, in each outer iteration.
In the case of high-order modulation, higher spectral efficiency can be achieved at the cost of increased symbol detection difficulty. Figure 9 shows the BER performance of 64-QAM with $R_c = 3/4$. The LC-K-Best decoder with a list size of 32 presents performance similar to that of STS-SD. However, the I-VBLAST and MMSE-IC equalizers present a degradation of more than 2 dB at a BER level of $1 \times 10^{-5}$ compared to the LC-K-Best decoder. Therefore, the LC-K-Best decoder is more robust in the case of high-order modulations and high coding rates. The figure also shows that the BER performance of the LDPC decoder is almost identical to that of the turbo decoder.
In order to summarize the performance of different detectors with different channel decoders, we provide the $E_b/N_0$ values achieving a BER level of $1 \times 10^{-5}$ in Table 4. The values given in parentheses in the table represent the performance loss compared to STS-SD.
Next, we evaluate the performance of the iterative receiver in more realistic channel environments. Figures 10, 11, and 12 show the BER performance of the detectors with the channel decoders on EPA, EVA, and ETU channels, respectively. Similar behaviors can be observed with the LTE turbo decoder and with the LDPC decoder.
In the EPA channel (Figure 10), the LC-K-Best decoder achieves performance similar to STS-SD in the case of 64-QAM and presents a degradation of less than 0.2 dB in the case of 16-QAM. Meanwhile, MMSE-IC presents a significant performance loss of more than 6 dB in the case of 64-QAM and $R_c = 3/4$. With 16-QAM and $R_c = 1/2$, the degradation of MMSE-IC compared to the LC-K-Best decoder is about 1 dB at a BER level of $1 \times 10^{-4}$.
In the EVA channel (Figure 11), the performance loss of MMSE-IC compared to the LC-K-Best decoder is reduced to approximately 5 dB with 64-QAM and 0.5 dB with 16-QAM. The LC-K-Best decoder presents a degradation of about 0.1 to 0.3 dB compared to STS-SD in the case of 16-QAM and 64-QAM.
Similarly, in the ETU channel (Figure 12), MMSE-IC presents a performance degradation compared to the LC-K-Best decoder. This degradation is less than 4 dB in the case of 64-QAM and less than 0.5 dB in the case of 16-QAM. We also notice that the LC-K-Best decoder is comparable to STS-SD in the case of 64-QAM and has a degradation of 0.2 dB in the case of 16-QAM.
Comparing the performance of the iterative receiver in different channels, it can be seen that the iterative receivers perform best in the ETU channel compared to the EPA and EVA channels. This is due to the high diversity of the ETU channel. At a BER level of $1 \times 10^{-4}$, the performance gain in the ETU channel compared to the EVA channel in the case of the LTE turbo decoder is about 0.8 dB with 16-QAM and 1.3 dB with 64-QAM. In the case of the LDPC decoder, this gain is 0.4 dB and 1 dB with 16-QAM and 64-QAM, respectively. Compared to the EPA channel, the performance gain in the ETU channel with either the turbo decoder or the LDPC decoder is more than 1 dB with both 16-QAM and 64-QAM. Table 5 summarizes the $E_b/N_0$ values achieving a BER level of $1 \times 10^{-4}$ for different detectors combined with different channel decoders and modulation orders in different channel models. The values given in parentheses in the table represent the performance loss compared to STS-SD. As indicated in Table 5, the iterative receivers with the turbo decoder and the LDPC decoder have comparable performance with a coding rate $R_c = 1/2$ (16-QAM). However, with $R_c = 3/4$ (64-QAM), the receivers with the LDPC decoder present slightly better performance, especially in the ETU channel (0.6 dB).
These results show that the iterative receiver substantially improves the performance of coded MIMO systems, with either the turbo decoder or the LDPC decoder, in the Rayleigh channel (Figures 8 and 9) and in more realistic channels (Figures 10, 11, and 12). Moreover, performing a large number of inner iterations does not bring a significant improvement. In addition, the LC-K-Best decoder achieves good performance with different modulations and channel coding schemes. The figures suggest that the BER performance of the iterative receiver with the turbo decoder is almost comparable to that with the LDPC decoder. It is therefore meaningful to evaluate the computational complexity of the iterative receivers with both decoding techniques, as discussed in the next section.

Complexity Evaluation.
The computational complexity has a significant impact on the latency, throughput, and power consumption of the device. Therefore, the receiver algorithms should be optimized to achieve a good trade-off between performance and cost. In this part, we evaluate the computational complexity of the iterative receiver in terms of basic operations such as addition, subtraction, and multiplication. In the overall complexity, $C_{det1}$ denotes the complexity of the first iteration of the MIMO detection algorithm per symbol vector, without taking into consideration the a priori information; $C_{deti}$ denotes the complexity per iteration per symbol vector taking into consideration the a priori information; $C_{dec}$ denotes the complexity of the channel decoder per iteration per information bit; $N_{bit}$ is the number of information bits at the input of the encoder; and $N_{symb}$ is the number of symbol vectors. $N_{symb}$ and $N_{bit}$ are linked by the relation $N_{symb} = N_{bit} / (Q \cdot R_c \cdot N_t)$, where $Q$ is the number of bits in the constellation symbol, $R_c$ is the coding rate, and $N_t$ is the number of transmit antennas.
[30]. The complexity of the max-log-MAP decoder corresponds to three principal computations: branch metrics, recursive state metrics, and LLRs of the bits. Table 6 summarizes the total number of operations per information bit per iteration for the LTE turbo decoder with 8 states (a component encoder memory length of 3) and 2 output bits. Therefore, the overall complexity of the turbo decoder can be obtained by multiplying these figures by the information block length and the number of iterations $I_{in} \cdot I_{out}$.
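The quantities defined above can be combined into a frame-level operation count. The sketch below assumes one plausible form: the detector runs once without a priori information and then once per remaining outer iteration, and the decoder cost scales with the number of information bits and the total number of decoder iterations. This form is inferred from the definitions and is an assumption; the function names and the per-operation costs in the example are hypothetical placeholders.

```python
def symbol_vectors(n_bit, q, r_c, n_t):
    """Symbol vectors per frame: each carries q * r_c * n_t information bits."""
    return n_bit / (q * r_c * n_t)


def total_complexity(c_det1, c_deti, c_dec, n_bit, n_symb, i_out, inner_total):
    """Assumed frame-level operation count.

    c_det1      -- detector cost, first outer iteration, per symbol vector
    c_deti      -- detector cost, later outer iterations, per symbol vector
    c_dec       -- decoder cost per iteration per information bit
    inner_total -- decoder iterations summed over all outer iterations
    """
    detection = n_symb * (c_det1 + (i_out - 1) * c_deti)
    decoding = n_bit * c_dec * inner_total
    return detection + decoding


if __name__ == "__main__":
    # 4 x 4 system, 16-QAM (4 bits per symbol), rate 1/2, 1,024 information bits.
    n_symb = symbol_vectors(1024, 4, 0.5, 4)
    print(n_symb)
    # Placeholder costs, not values from Tables 6 to 8.
    print(total_complexity(10, 5, 2, 1024, n_symb, i_out=4, inner_total=8))
```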
The complexity of the LDPC decoder depends on the scheduling used to exchange the messages between check nodes (CNs) and variable nodes (VNs). There are two distinct schedules of belief propagation: the flooding schedule and the layered schedule. In the flooding schedule, the messages are passed back and forth along all the edges. This schedule increases the complexity, especially with long block lengths. A layered schedule is therefore proposed, in which only a small number of check nodes and variable nodes are updated per subiteration [53]. The messages generated in a subiteration are immediately used in the subsequent subiterations of the current iteration. This leads to a faster convergence of LDPC decoding and a reduction of the required memory size.
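The layered schedule described above can be illustrated with a compact decoder in which each row of the parity check matrix forms one layer. This is a minimal sketch under stated assumptions: the min-sum check node update is chosen here for brevity (the text only specifies the schedule), LLRs are defined so that positive values favour bit 0, and the (7,4) Hamming matrix in the example is a toy stand-in for the IEEE 802.11n matrices.

```python
def layered_min_sum(H, channel_llr, max_iter=20):
    """Layered min-sum decoding; returns the hard-decision bit list."""
    rows = [[j for j, h in enumerate(row) if h] for row in H]
    posterior = list(channel_llr)                    # running a posteriori LLRs
    c2v = [{j: 0.0 for j in cols} for cols in rows]  # check-to-variable messages
    hard = [1 if llr < 0 else 0 for llr in posterior]
    for _ in range(max_iter):
        for r, cols in enumerate(rows):              # one subiteration per layer
            # Variable-to-check messages: remove this check's old contribution.
            v2c = {j: posterior[j] - c2v[r][j] for j in cols}
            for j in cols:
                others = [v2c[k] for k in cols if k != j]
                negative = sum(1 for m in others if m < 0)
                sign = -1.0 if negative % 2 else 1.0
                c2v[r][j] = sign * min(abs(m) for m in others)
                # The refreshed posterior is used by the following layers
                # within the same iteration.
                posterior[j] = v2c[j] + c2v[r][j]
        hard = [1 if llr < 0 else 0 for llr in posterior]
        if all(sum(hard[j] for j in cols) % 2 == 0 for cols in rows):
            break                                    # all parity checks satisfied
    return hard


if __name__ == "__main__":
    H = [[1, 1, 0, 1, 1, 0, 0],
         [1, 0, 1, 1, 0, 1, 0],
         [0, 1, 1, 1, 0, 0, 1]]
    # All-zero codeword with one unreliable, wrong-signed LLR on bit 0.
    print(layered_min_sum(H, [-1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]))
```

A flooding decoder would instead compute all check node messages from the posteriors of the previous iteration before updating any of them, which is why it typically needs more iterations to converge.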
The computational complexity of the layered LDPC decoder can be expressed as a function of the degree of connectivity, as summarized in Table 7. $d_v$ and $d_c$ denote the degree of connectivity of the variable node $v$ and the check node $c$, respectively. $\bar{d}_c$ and $\bar{d}_v$ denote the average row weight and the average column weight of the LDPC code, respectively.

Iterative MIMO Detection Complexity.
The computational complexity of MIMO detection depends on the detection algorithm. In the case of tree-search-based algorithms, the commonly used approach to measure the complexity is to count the number of visited nodes in the tree-search process [54][55][56]. However, in the case of the interference-cancellation-based equalizers, the complexity is evaluated in terms of the real or complex operations required to compute the filter coefficients. For a fair comparison, the complexity is estimated based on basic operations (ADD, SUB, MUL, DIV, SQRT, Max, and LUT) in this work.
The complexity of tree-search-based algorithms can be divided into two steps: the preprocessing and the tree-search process. The complexity of interference-cancellation-based equalizer algorithms is dominated by the computation of the filter coefficients and the matrix inversion. Several methods for matrix inversion, such as Cholesky decomposition and QR decomposition, have been widely studied in the literature. Herein, QR decomposition based on the Gram-Schmidt method is used to compute the matrix inversion. However, more efficient methods for QR decomposition may be considered to reduce the computational cost in a hardware implementation, such as Givens rotations (GR), which can be performed efficiently with the coordinate rotation digital computer (CORDIC) scheme.
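To make the operation types behind the Gram-Schmidt method concrete, the sketch below implements a plain (unsorted, real-valued) modified Gram-Schmidt QR decomposition and marks where MUL, ADD/SUB, SQRT, and DIV operations arise. It is an illustration under simplifying assumptions: a real matrix with full column rank stored as a list of rows, without the column sorting used in the SQR decomposition.

```python
import math


def gram_schmidt_qr(H):
    """Return (Q, R) with H = Q R, Q having orthonormal columns."""
    m, n = len(H), len(H[0])
    cols = [[H[i][j] for i in range(m)] for j in range(n)]  # work column-wise
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            # Projection onto an earlier basis vector: MUL and ADD.
            R[k][j] = sum(q_cols[k][i] * v[i] for i in range(m))
            # Remove that component: MUL and SUB.
            v = [v[i] - R[k][j] * q_cols[k][i] for i in range(m)]
        R[j][j] = math.sqrt(sum(x * x for x in v))   # one SQRT per column
        q_cols.append([x / R[j][j] for x in v])      # DIV for normalization
    Q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return Q, R


if __name__ == "__main__":
    Q, R = gram_schmidt_qr([[2.0, 1.0], [1.0, 3.0], [0.0, 1.0]])
    print(R)
```

Once $H = QR$ is available, the detection problem can be solved on the triangular system (tree search) or the inverse can be obtained by back-substitution (equalizers), which is why the decomposition is counted only once per channel realization.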
In the case of STS-SD, it is very difficult to find an analytical expression of the complexity due to the sequential nature of the tree search and its dependence on the channel statistics. Therefore, Monte Carlo simulations were used to measure the average number of operations of STS-SD over the whole SNR range.
The complexity of the interference-cancellation-based equalization includes the complexity of soft mapping and soft demapping. In the case of STS-SD and the LC-K-Best decoder, the computational complexity includes the complexity of the SQR decomposition for the first iteration and the complexity of the LLR computation. The SQR decomposition is based on the Gram-Schmidt method, which requires many ADD, MUL, DIV, and SQRT operations. It is important to note that, in the LC-K-Best decoder, the comparisons needed to choose the best candidates are not taken into consideration in the complexity comparisons.
Figure 13 summarizes the complexity of different detection algorithms in terms of the number of operations in the case of a 4 × 4 spatial multiplexing system using 16-QAM for the 1st and ith iteration. The MAP algorithm presents the highest complexity ($4.7 \times 10^6$ MUL, $4.6 \times 10^6$ ADD). It is not represented in the graph, but it is used as a reference to view the reduction in the complexity of the other algorithms compared to the optimal detector. The average number of arithmetic operations of STS-SD is 90% lower than that of the MAP algorithm. However, it still has a larger complexity than the other algorithms. The complexity of LC-K-Best is approximately 30% higher than that of the MMSE equalizer and 50% lower than that of I-VBLAST. I-VBLAST is more complex because it requires a matrix inversion for each detected symbol. At the ith iteration, the LC-MMSE-IC algorithm proposed in [9] has slightly lower complexity than the LC-K-Best decoder in terms of MUL (7%) and ADD (19%), with additional DIV and SQRT operations required for the matrix inversion.
Figure 14 illustrates the complexity of different detection algorithms in terms of the number of operations in the case of a 4 × 4 spatial multiplexing system using 64-QAM for the 1st and ith iteration. Similarly, STS-SD presents more than 90% reduction in complexity compared to the MAP algorithm ($1.2 \times 10^9$ MUL, $1.1 \times 10^9$ ADD). We note that the complexity of MMSE, LC-MMSE, and I-VBLAST slightly increases because the complexity of the soft mapper and soft demapper increases with the constellation size. Meanwhile, the complexity of computing the filter coefficients is not affected since the number of antennas is the same. We also notice that the complexity of the LC-K-Best decoder is approximately twice that of the LC-MMSE-IC equalizer. However, its complexity is about 40% lower than that of STS-SD (44% MUL and 45% ADD). It should be noted that even though LC-MMSE-IC has a lower complexity, it presents a severe degradation of more than 2 dB in the case of 64-QAM in the Rayleigh channel, and more than 4 dB in realistic channels (cf. Section 6.1).
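The jump in the MAP reference figures between 16-QAM and 64-QAM is roughly what exhaustive enumeration over all transmit symbol vectors would predict. The sketch below counts the candidate vectors, assuming the MAP detector evaluates every one of them; the function name is illustrative.

```python
def map_hypotheses(bits_per_symbol, n_tx):
    """Number of candidate symbol vectors an exhaustive search must evaluate."""
    return (2 ** bits_per_symbol) ** n_tx


if __name__ == "__main__":
    qam16 = map_hypotheses(4, 4)   # 16-QAM, 4 transmit antennas
    qam64 = map_hypotheses(6, 4)   # 64-QAM, 4 transmit antennas
    print(qam16, qam64, qam64 // qam16)
```

The candidate count grows by a factor of 256, close to the ratio of the quoted MUL counts ($1.2 \times 10^9$ against $4.7 \times 10^6$, about 255). This exponential growth is why the list-based and interference-cancellation detectors compared here matter most for high-order modulation.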

Complexity of Iterative Receivers.
In this section, we compare the complexity of the iterative receivers using different coding techniques. The same simulation parameters as those used in the previous section are considered. We consider a block length of 1,024 for the turbo decoder and a codeword length of 1,944 in the case of the LDPC decoder, which gives an information block length slightly lower (5%) than in the turbo decoder case for $R_c = 1/2$. Four outer iterations are performed between the MIMO detectors and the channel decoders. The total number of iterations inside the LDPC decoder and the turbo decoder is chosen to be 20 and 8 iterations, respectively, because these numbers of iterations were found sufficient for the convergence of both decoders (cf. Sections 5 and 6).
The number of operations consumed by the LDPC decoder and the turbo decoder per information block with code rates $R_c = 1/2$ and $R_c = 3/4$ is listed in Table 8. We notice that the LDPC decoder requires 20% to 40% fewer operations than the turbo decoder. Note that the decoding complexity of the turbo code is constant and does not depend on the code rate, because all code rates are generated from the mother coding rate of 1/3. In contrast, the complexity of the LDPC decoder depends on the code rate: the decoding complexity decreases when the code rate increases.
Figure 15 shows the computational complexity of the iterative receivers for one signal frame using both coding schemes and 16-QAM. In the case of the turbo decoder, the LC-MMSE-IC equalizer presents the lowest computational complexity in terms of MUL and ADD. However, it requires more DIV and SQRT operations. The complexity of STS-SD is much higher than that of the LC-K-Best decoder (about 60% MUL and 30% ADD). Note that this additional complexity brings a performance improvement of only about 0.2 dB at a BER level of $1 \times 10^{-5}$. In addition, the LC-K-Best decoder presents a lower complexity than I-VBLAST (20% to 30% MUL, 2% to 5% ADD, approximately 50% DIV, and approximately 50% SQRT). The reason is that I-VBLAST requires multiple matrix inversions for the first iteration. Similar complexity results can be observed in the case of the LDPC decoder. Comparing the complexity of the receivers with both coding techniques, we notice that the complexity of the iterative receiver with the LDPC decoder is smaller than the complexity with the turbo decoder in terms of ADD, Max, and LUT operations. However, both receivers present approximately similar complexity in terms of MUL, DIV, and SQRT.
It is therefore worthwhile to compare the complexity of the iterative receiver with a high modulation order and coding rate. Figure 16 illustrates the computational complexity of the iterative receivers for one transmitted frame in a 4 × 4 spatial multiplexing system with 64-QAM. As shown in the figure, the complexity of the receiver based on STS-SD and the LC-K-Best decoder increases significantly since the tree-search detection depends on the modulation order. The complexity of the receiver based on LC-MMSE-IC and I-VBLAST slightly increases compared to the case of 16-QAM due to the small increase in the complexity of the soft mapper and soft demapper. Furthermore, the complexity of the LC-MMSE-IC equalizer is much lower than that of the LC-K-Best decoder (about 55% MUL, about 26% ADD). However, LC-MMSE-IC presents a significant degradation of about 2 dB in a Rayleigh fading channel and more than 4 dB in more realistic channels at a BER level of $1 \times 10^{-4}$ compared to the LC-K-Best decoder (cf. Section 6).
In addition, Figure 16(b) shows that the iterative receiver with the LDPC decoder presents lower computational complexity in terms of ADD and LUT operations. However, similar complexity of the receiver with both coding techniques is observed in terms of MUL, DIV, and SQRT. Since MUL and DIV are more complex than ADD, Max, and LUT, we can conclude that the complexity of the iterative receiver with both coding schemes is comparable.
From this evaluation, we conclude that the performance and the complexity of the iterative receiver with the turbo decoder and the LDPC decoder are highly comparable. We should also note that the turbo decoder is recommended for small to moderate block lengths and coding rates. Meanwhile, the LDPC decoder is favored for large block sizes due to its superior performance and lower complexity. In addition, we see that the LC-K-Best decoder achieves a good performance-complexity trade-off compared to the other detection algorithms. Furthermore, the LC-K-Best decoder performs a breadth-first search that can be easily parallelized and pipelined in a hardware architecture, as discussed in [16,41]. The LC-K-Best decoder can also be easily implemented and can provide a high and fixed detection rate for future communication systems.

Conclusions
Iterative receivers have recently emerged as very attractive solutions for high data rate transmission in next generation wireless systems. In this paper, an efficient iterative receiver combining MIMO detection based on the K-Best decoder with channel decoding, namely, the turbo decoder and the LDPC decoder, has been investigated. Several soft-input soft-output MIMO detection algorithms have been considered in this work. We analyzed the convergence of combining these detection algorithms with different channel decoders (turbo, LDPC) using EXIT charts. Based on this analysis, we determined the number of inner/outer iterations required for the convergence of the iterative receiver. Additionally, we provided a detailed comparison of different combinations of detection algorithms and channel decoders in terms of performance and complexity with realistic channel environments, various modulation orders, and coding rates. Through the performance and complexity evaluation, we showed that the LC-K-Best decoder achieves the best trade-off between performance and complexity among the considered detectors. We also showed that the performance and the complexity of iterative receivers with the turbo decoder and the LDPC decoder are highly comparable. Future work can include other aspects such as optimization of the computational complexity in the hardware architecture, estimation of the required memory, conversion of the algorithm into a fixed point format, and implementation in real environments.

Figure 3: LDPC decoder, (a) matrix representation and Tanner graph and (b) iterative decoding between VNs and CNs.

Figure 4: Tree-search strategies, (a) depth-first search sphere decoder and (b) breadth-first search K-Best decoder.

(1) For layer $i = 2N_t$: (a) expand all $\sqrt{2^Q}$ possible constellation nodes, (b) calculate the corresponding PEDs, (c) if $\sqrt{2^Q} > K$, select the $K$ best nodes and store them in the list $\mathcal{L}_{2N_t}$.
(2) For layer $i = 2N_t - 1 : 1$: (a) enumerate the constellation points for the $K$ surviving paths in the list $\mathcal{L}_{i+1}$, (b) find the first child (FC) of each parent node, (c) compute their PEDs, (d) select the best children with the smallest PEDs among the FCs and add them to the list $\mathcal{L}_i$, (e) if $|\mathcal{L}_i| < K$, find the next child (NC) of the selected parent nodes, calculate their PEDs, and go to step (2)(d), (f) else move to the next layer $i = i - 1$ and go to step (2).
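The breadth-first search behind these steps can be sketched in a few lines. The code below is a plain K-Best detector that expands every child of every survivor and then keeps the $K$ smallest partial Euclidean distances (PEDs); it does not implement the first-child/next-child enumeration that gives the low-complexity variant its savings. It assumes a real-valued model with an upper-triangular matrix `R` and a rotated received vector `y`, both obtained from a prior QR decomposition; the function name and argument layout are illustrative.

```python
def k_best_detect(R, y, alphabet, k):
    """Breadth-first K-Best search for min ||y - R s||^2, s in alphabet^n.

    R is an n x n upper-triangular real matrix (list of rows). Returns the
    surviving (PED, symbol vector) pairs sorted by increasing PED; each
    vector is ordered s[0], ..., s[n-1].
    """
    n = len(y)
    survivors = [(0.0, [])]            # (PED, symbols s[i+1], ..., s[n-1])
    for i in range(n - 1, -1, -1):     # from the last layer down to the first
        children = []
        for ped, tail in survivors:
            # Interference from the symbols already fixed on this path.
            interference = sum(R[i][i + 1 + t] * s for t, s in enumerate(tail))
            for s in alphabet:
                e = y[i] - interference - R[i][i] * s
                children.append((ped + e * e, [s] + tail))
        children.sort(key=lambda c: c[0])
        survivors = children[:k]       # keep the K smallest PEDs
    return survivors


if __name__ == "__main__":
    R = [[2.0, 0.5],
         [0.0, 1.5]]
    y = [0.5, -4.5]                    # noise-free observation of s = [1, -3]
    for ped, s in k_best_detect(R, y, [-3, -1, 1, 3], k=2):
        print(ped, s)
```

The final survivor list plays the role of the candidate list from which soft outputs are computed, which is why a small list size limits the reliability of the LLRs.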

Figure 13: Complexity of a 4 × 4 spatial multiplexing system with 16-QAM for different detection algorithms in terms of the number of operations per symbol vector at the 1st and ith iteration.

Table 2: Characteristic parameters of the investigated channel models.
* * The number in the parenthesis corresponds to the performance loss in dB compared to STS-SD.

Table 5: $E_b/N_0$ values achieving a BER level of $1 \times 10^{-4}$ in LTE channel models for different detectors and channel decoders (turbo, LDPC) in a 4 × 4 spatial multiplexing system with 16-QAM, $R_c = 1/2$, and 64-QAM, $R_c = 3/4$. The number in parentheses corresponds to the performance loss in dB compared to STS-SD.

Table 6: Complexity of turbo decoder per information bit per iteration.

Table 8: Complexity of turbo decoder (8 iterations) and LDPC decoder (20 iterations) in terms of number of operations.

Figure 14: Complexity of a 4 × 4 spatial multiplexing system with 64-QAM for different detection algorithms in terms of the number of operations per symbol vector at the 1st and ith iteration.