Main-Branch Structure Iterative Detection Using Approximate Message Passing for Uplink Large-Scale Multiuser MIMO Systems

The emerging large-scale (massive) multiple-input multiple-output (MIMO) system combined with orthogonal frequency division multiplexing (OFDM) is considered a key technology because it improves spectral efficiency. In this paper, we introduce an iterative detection algorithm for uplink large-scale multiuser MIMO-OFDM communication systems. We design a Main-Branch structure iterative turbo detector that uses the Approximate Message Passing algorithm simplified by linear approximation (AMP-LA) and applies the Mean Square Error (MSE) criterion to calculate the correlation coefficients between the main detector and the branch detector at each iteration. The complexity of our method is compared with that of other detection algorithms. Simulation results show that our scheme achieves better performance than conventional detection methods at an acceptable complexity.


Introduction
Multiple-input multiple-output (MIMO) systems, which employ multiple antennas at both the transmitter and the receiver, have attracted considerable attention because their multiplexing and diversity capabilities offer much higher data rates and enhance the system capacity [1]. OFDM has been shown to be an attractive scheme for mitigating the effects of intersymbol interference (ISI) caused by broadband wireless channels with long response memory [2]. In recent years, large-scale (or massive) multiuser MIMO systems, which equip the base station (BS) with tens to hundreds of antennas as shown in Figure 1, have become a major research topic for next-generation 5G wireless communication. Large-scale multiuser MIMO technology combined with OFDM promises significant improvements in spectral efficiency, link reliability, and coverage compared to conventional small-scale MIMO systems [3]. Unfortunately, the promised benefits of large-scale MIMO come at the cost of significantly increased computational complexity at the BS, especially for antenna array design and signal processing. A major challenge in uplink communications of broadband multiuser MIMO systems is to design a receiver algorithm that can efficiently detect the multiple signals transmitted from the antennas of different uplink users [4].
Turbo detection, which performs detection/equalization and decoding iteratively for coded data transmission over ISI channels, has been widely studied [5]. The maximum a posteriori probability (MAP) detector achieves optimal performance, but its complexity is exponential in the number of transmit antennas [6]. To reduce the complexity, a detector using soft interference cancellation (SIC) based on the Minimum Mean Square Error (MMSE) criterion was proposed for MIMO systems in [7, 8]. MMSE-SIC has lower complexity than the MAP algorithm, but its accuracy is poor because its structure has no feedback design. Moreover, the complexity of these suboptimal detectors remains prohibitive for large-scale antenna arrays. Over the past decade, message passing based on factor graph models [9, 10] has been studied for detection/equalization on ISI channels. In [11], Kaynak et al. proposed a belief propagation (BP) algorithm that performs better than MMSE-SIC; however, its computational complexity is still very large because of the marginalization operations over discrete symbols. In [12], a BP variant based on Gaussian tree approximation was proposed, which approximates the dense factor graph of the MIMO system by a tree and passes exact messages over the resulting tree. More recently, Wu et al. proposed a relatively low-complexity iterative detection algorithm for large-scale multiuser MIMO-OFDM systems using Approximate Message Passing [13], but its performance is worse than that of the conventional MMSE detection algorithm when the number of iterations is small. Besides the algorithms derived from factor graphs, an iterative BP algorithm based on the Markov random field (MRF) was also investigated in [14].
In this paper, we propose a new iterative detection algorithm for uplink multiuser MIMO systems that combines several signal processing methods. First, we design a Main-Branch structure for iterative signal detection. Soft information iterates between the main detector and the branch detector in parallel with the LLRs that iterate between the main detector and the decoders. The Main-Branch structure allows the detectors to exchange soft information in the absence of interleavers and improves the performance of the suboptimal constituent detectors. Second, the principle of expectation propagation and a linear approximation method are applied to obtain the symbol belief and to approximate the Gaussian messages, which reduces the computational complexity. Finally, we employ the AMP-LA algorithm as the constituent detector of the Main-Branch structure and use the Mean Square Error (MSE) criterion to calculate the correlation coefficients between the main detector and the branch detector at each iteration; we also give the LLR computation between the Main-Branch structure detector and the decoders. The proposed scheme not only gains performance by canceling the residual ISI and CCI but also reduces the computational complexity of the detectors, because it avoids the matrix inversion on each frequency bin required by the conventional MMSE detection algorithm. Simulation results show that the proposed algorithm achieves a better tradeoff between performance and complexity than existing turbo detection algorithms and approaches the optimal performance with a small number of iterations.
The rest of the paper is organized as follows. The system model is described in Section 2, and the proposed Main-Branch iterative structure with the AMP algorithm is discussed in Section 3. In Section 4, the novel iterative detection that joins the Main-Branch structure with the AMP-LA algorithm is proposed. Simulation results are presented and discussed in Section 5, and Section 6 concludes the paper.

System Model and Iterative Receiver
We consider the uplink of a large-scale multiuser MIMO system with $N_u$ single-antenna users, where the receiver at the base station (BS) is equipped with $N_r$ antennas. At the transmitter, the information bits are first encoded into a code sequence and interleaved; every $Q$ interleaved coded bits are mapped into one symbol. Let the frequency-domain symbol vector transmitted by the $n$th user be $\mathbf{X}_n = [x_n(1), x_n(2), \ldots, x_n(K)]^T$, where $x_n(k) \in \mathcal{A}$ is the frequency-domain symbol transmitted on the $k$th subcarrier and $K$ is the number of subcarriers in the OFDM system; here $\mathcal{A} = \{\alpha_1, \alpha_2, \ldots, \alpha_{2^Q}\}$ denotes the modulation constellation set. The large-scale multiuser MIMO system uses OFDM modulation at the transmitters. Suppose the maximum number of taps of the MIMO intersymbol interference (ISI) channel is $L$. A $K$-point IFFT is applied to the symbol sequence $\mathbf{X}_n$, and a cyclic prefix (CP) of length $V \geq L$ is inserted at the beginning of each OFDM signal to remove the interblock interference and turn the linear convolution into a circular convolution, as shown in Figure 2. The OFDM modulated signals are then sent through the MIMO-ISI channel. After the CP is removed, the FFT transforms the received signal from the time domain to the frequency domain at the receiver. The frequency-domain channel estimation algorithm then provides the current channel state knowledge to the detector. As a core part of this large-scale multiuser MIMO system, the proposed iterative detector at the receiver can suppress cochannel interference (CCI).
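The role of the cyclic prefix can be checked numerically. The following is a minimal single-user, noiseless sketch, not the authors' simulator: the helper names (`dft`, `idft`, `ofdm_through_channel`), the three example taps, and the block size of 8 subcarriers are all illustrative assumptions. It shows that, once the CP is at least as long as the channel memory, each subcarrier output equals the transmitted symbol multiplied by the channel frequency response on that subcarrier.

```python
import cmath
import random

def dft(x):
    """Plain O(n^2) discrete Fourier transform (illustrative, not optimized)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse of dft(), with the 1/n normalization on the inverse side."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def ofdm_through_channel(symbols, taps, cp_len):
    """Send one OFDM block through a noiseless ISI channel and return its FFT."""
    K = len(symbols)
    time_block = idft(symbols)
    tx = time_block[K - cp_len:] + time_block      # prepend the cyclic prefix
    # Linear convolution of the transmitted samples with the channel taps.
    rx = [sum(taps[l] * tx[t - l] for l in range(len(taps)) if t - l >= 0)
          for t in range(len(tx))]
    return dft(rx[cp_len:cp_len + K])              # drop the CP, back to frequency domain

random.seed(0)
K = 8                                              # hypothetical number of subcarriers
taps = [1.0, 0.5 - 0.2j, 0.25j]                    # hypothetical 3-tap channel
qpsk = [complex(a, b) for a in (1, -1) for b in (1, -1)]
symbols = [random.choice(qpsk) for _ in range(K)]

received = ofdm_through_channel(symbols, taps, cp_len=len(taps))
channel_freq = dft(taps + [0] * (K - len(taps)))   # frequency response per subcarrier
max_error = max(abs(received[k] - channel_freq[k] * symbols[k]) for k in range(K))
```

Here `max_error` stays at the level of floating-point rounding, which is why detection can be carried out independently on each subcarrier.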
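Hence the received $N_r$-dimensional baseband signal vector at the $k$th subcarrier in the frequency domain can be written as
$$\mathbf{Y}(k) = \mathbf{H}(k)\mathbf{X}(k) + \mathbf{W}(k), \tag{1}$$
where $\mathbf{X}(k) \in \mathbb{C}^{N_u \times 1}$ denotes all the transmitted frequency-domain symbols associated with the $k$th subcarrier ($N_u$ is the number of users and $N_r$ the number of BS antennas) and $\mathbf{W}(k)$ denotes an $N_r$-dimensional stationary Additive White Gaussian Noise vector in the frequency domain with zero mean and covariance matrix $\sigma^2\mathbf{I}$. $\mathbf{H}(k)$ is an $N_r \times N_u$ matrix representing the MIMO channel frequency response at the $k$th subcarrier; it is obtained from the time-domain channel matrices $\mathbf{h}_l$ by the FFT, where $\mathbf{h}_l$ is an $N_r \times N_u$ time-domain block circulant channel matrix whose entry $h_l^{(m,n)}$ is the $l$th channel impulse response tap from the $n$th user transmit antenna to the $m$th receive antenna.
The matrix form of $\mathbf{H}(k)$ can be denoted as
$$\mathbf{H}(k) = \begin{bmatrix} H_{1,1}(k) & \cdots & H_{1,N_u}(k) \\ \vdots & \ddots & \vdots \\ H_{N_r,1}(k) & \cdots & H_{N_r,N_u}(k) \end{bmatrix}. \tag{3}$$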

Iterative Detection Receiver and Approximate Message Passing Algorithm
3.1. Turbo Main-Branch Iterating Detection Structure. The iterative process of the Main-Branch structure soft detector in the system is shown in Figure 2. Turbo MIMO detection exchanges extrinsic log likelihood ratios (LLRs) of the coded bits between a Main-Branch iterating structure and a bank of channel decoders. The Main-Branch iterative part consists of one soft-input soft-output (SISO) main detector and one SISO branch detector; soft information iterates between the main detector and the branch detector in parallel with the LLRs that iterate between the main detector and the decoders. We omit the subcarrier index for notational simplicity.
Unlike the extrinsic information exchanged between the main detector and the channel decoders, the extrinsic information exchanged between the main detector and the branch detector is significantly correlated because no interleaving can be applied. On the other hand, the Main-Branch structure soft detector can also iterate without the decoder, which we call "self-iteration," so it can be used in uncoded systems as well. For coded systems, the proposed turbo detection algorithm is organized as in Figure 2 to cover the common case. At the $i$th iteration, the received signal $\mathbf{Y}$ is detected by the main detector, and its extrinsic LLR $\lambda_{e,M}(c_n^q)$ of each coded bit $c_n^q$ corresponding to the symbol $x_n$ is passed to the branch detector, where it is used to generate the a priori information $\lambda_{a,B}(c_n^q)$. At the same time, the extrinsic information generated by the main detector is passed directly to the channel decoders as their a priori information. The branch detector in turn generates its own extrinsic information $\lambda_{e,B}(c_n^q)$ from the given a priori information $\lambda_{a,B}(c_n^q)$. Because $\lambda_{e,B}(c_n^q)$ and $\lambda_{a,B}(c_n^q)$ are significantly correlated, they are combined to generate the a priori information $\lambda_{a,M}(c_n^q)$. Then $\lambda_{a,M}(c_n^q)$, combined with the a priori information $\lambda_D(c_n^q)$ fed back from the decoders, is used to generate the extrinsic LLRs of the main detector for the next detection process. The extrinsic LLRs of the main detector are given by
$$\lambda_{e,M}(c_n^q) = \ln\frac{P_M(c_n^q = 1 \mid \mathbf{Y})}{P_M(c_n^q = 0 \mid \mathbf{Y})} - \lambda(c_n^q), \tag{4}$$
where the first term is the a posteriori LLR at the output of the main detector and $\lambda(c_n^q)$ represents the total a priori information at the input of the main detector, given as
$$\lambda(c_n^q) = \lambda_D(c_n^q) + \lambda_{a,M}(c_n^q), \tag{5}$$
where $\lambda_D(c_n^q) \triangleq \ln\left(P(c_n^q = 1)/P(c_n^q = 0)\right)$ represents the a priori LLRs passed down from the decoders and $\lambda_{a,M}(c_n^q)$ represents the a priori LLRs passed from the branch detector.
The extrinsic LLRs of the branch detector are given by
$$\lambda_{e,B}(c_n^q) = \ln\frac{P_B(c_n^q = 1 \mid \mathbf{Y})}{P_B(c_n^q = 0 \mid \mathbf{Y})} - \lambda_{a,B}(c_n^q), \tag{6}$$
where the first term is the a posteriori LLR at the output of the branch detector and $\lambda_{a,B}(c_n^q)$ represents the a priori information at the input of the branch detector.
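The rule "extrinsic LLR equals a posteriori LLR minus a priori LLR" can be illustrated on the simplest possible case. The sketch below is not the paper's MIMO detector: it assumes a single BPSK bit (bit 1 mapped to +1, bit 0 to -1) observed in real Gaussian noise, and the function names are hypothetical. For this toy channel the extrinsic LLR reduces to the well-known value 2y/σ², whatever the a priori LLR is, which is exactly why only the extrinsic part is passed on in a turbo loop.

```python
import math

def posterior_llr(y, noise_var, prior_llr):
    """A posteriori LLR ln(P(c=1|y)/P(c=0|y)) for BPSK in real Gaussian noise."""
    p1 = 1.0 / (1.0 + math.exp(-prior_llr))          # prior P(c = 1)
    like1 = math.exp(-(y - 1.0) ** 2 / (2.0 * noise_var))
    like0 = math.exp(-(y + 1.0) ** 2 / (2.0 * noise_var))
    return math.log((like1 * p1) / (like0 * (1.0 - p1)))

def extrinsic_llr(posterior, prior_llr):
    """Remove the a priori part so that only new information is passed on."""
    return posterior - prior_llr

def total_prior(decoder_llr, branch_llr):
    """Total a priori LLR at the main detector: decoder part plus branch part."""
    return decoder_llr + branch_llr

y, noise_var = 0.3, 0.5                               # hypothetical observation and noise variance
extrinsics = []
for decoder_llr, branch_llr in [(-1.0, 0.0), (0.0, 0.0), (1.5, 0.5)]:
    prior = total_prior(decoder_llr, branch_llr)
    extrinsics.append(extrinsic_llr(posterior_llr(y, noise_var, prior), prior))
```

All three entries of `extrinsics` agree with 2y/σ² up to rounding, even though the priors differ.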
The problem is how to use the extrinsic information of the main detector to generate the a priori information of the branch detector, and how to use the extrinsic information of the branch detector to generate the a priori information of the main detector. Following the method of [15], we regard the a priori LLRs and the extrinsic LLRs as the outputs of an equivalent AWGN channel. The unbiased versions of the LLRs associated with the branch detector are modeled as the transmitted symbol vector $\mathbf{X}$ corrupted by AWGN, where the noise vectors $\mathbf{u}_{a,B}$ and $\mathbf{u}_{e,B}$ are assumed to be zero-mean Gaussian random vectors with covariance matrices $\sigma_{a,B}^2\mathbf{I}$ and $\sigma_{e,B}^2\mathbf{I}$; they are independent of the transmitted data $\mathbf{X}$ but correlated with each other through a correlation coefficient. We assume equal variances, $\sigma_{a,B}^2 = \sigma_{e,B}^2$, and construct the a priori LLRs for the main detector from the two correlated LLRs of the branch detector. To measure the correlation between the a priori LLRs and the extrinsic information accurately, a linear relationship with a positive scaling factor is forced between them, and the scaling factor is chosen to minimize the Mean Square Error (MSE) as in (10). Evaluating the derivative of the MSE function with respect to the scaling coefficient and setting it to zero, and following the theory in [16], we get $\beta_1 \approx (1 - \rho_1)/(1 + \rho_1)$ and $\beta_2 \approx (1 - \rho_2)/(1 + \rho_2)$ together with two linear correlation evaluation equations, where $\rho_1$ is the noise correlation parameter of the first pair of a priori and extrinsic LLRs and $\rho_2$ is the noise correlation parameter of the second pair. The noise correlation parameters $\rho_1$ and $\rho_2$ can be estimated as in [17, 18], using the sign function $\mathrm{sign}(\cdot)$.

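Factor Graph and Approximate Message Passing Algorithm.
The detector generates the extrinsic LLRs of each coded bit $c_n^q$ based on the received signal $\mathbf{Y}$ and the a priori LLRs, as expressed in (16). Following the factor graph and message passing framework in [9, 10], we obtain the joint probability distribution, in which $f_m(y_m \mid \mathbf{X})$ denotes the channel transition function of the $m$th receive antenna. It can be defined as
$$f_m(y_m \mid \mathbf{X}) \propto \exp\left(-\frac{\left|y_m - \sum_{n=1}^{N_u} H_{m,n}x_n\right|^2}{\sigma^2}\right),$$
where $H_{m,n}$ is the component of $\mathbf{H}$ in the $m$th row and $n$th column, $N_u$ is the number of users, and $\sigma^2$ is the noise variance.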
Let us investigate the message passing algorithm on the factor graph model in Figure 3, where each mapping node represents the mapping constraint between the coded bits and the symbol $x_n$, expressed with the Kronecker delta function $\delta[\cdot]$. The node "=" denotes the cloning node of a variable, and note that the code constraints span all the subcarriers. We assume that messages are passed from the top to the bottom of Figure 3 and back immediately, so that inner iterations are avoided. When the extrinsic information has been updated and passed downward, a new iteration starts.
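Using the sum-product updating rule, for the $i$th turbo iteration, the message passed from the channel transition node $f_m$ to the cloning node of $x_n$ is given by (19): the channel transition function is multiplied by the messages arriving from all the other cloning nodes and marginalized over all symbols except $x_n$. In the opposite direction, the message passed from the cloning node to the channel transition node is the product of the message passed from the mapping node to the cloning node and the messages arriving from all the other channel transition nodes. The message from the mapping node to the cloning node is determined by the a priori probabilities of the coded bits.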
Let us assume that $x_n$ is a continuous random variable and approximate the message passed from the cloning node to the channel transition node by a complex Gaussian probability density function; the sum in (19) then takes an integral form. The parameters of the Gaussian approximation, namely its mean and variance, are obtained by the criterion of minimum Kullback-Leibler (KL) divergence in [19]. In practice, we can use the symbol belief, following the principle of expectation propagation in [20-22], at the $i$th iteration to obtain the approximate message. From factor graph theory, the symbol belief is the a posteriori probability of the symbol $x_n$, so it can be computed approximately from the a posteriori probabilities of the coded bits fed back from the decoders. The message leaving the cloning node can be regarded as the product of all the incoming messages. By the expectation propagation (EP) principle and the canonical form of the Gaussian PDF in [23], the approximate message is again Gaussian. To reduce the complexity, we use the linear approximation (LA) in [13] to simplify the parameters, which is justified when the number of users is large. Then, by using the first-order linear approximation and the Taylor formula in (27) and (33), and after some mathematical simplifications, we obtain the simplified update equations. Finally, the LLRs of the coded bits of the symbol $x_n$ in the frequency domain are expressed by (37), with the symbol probabilities given by (38). The detailed processing procedure of the AMP algorithm simplified by linear approximation (AMP-LA) is illustrated in Algorithm 1.
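The step from bit probabilities to a Gaussian message can be made concrete with a small moment-matching sketch. It is only an illustration under stated assumptions: the Gray-mapped, unit-energy QPSK table `QPSK`, the LLR convention ln(P(c=1)/P(c=0)), and all function names are hypothetical and are not taken from the paper's equations. The code forms the symbol belief from the bit LLRs and returns the mean and variance that a Gaussian approximation of that belief would use.

```python
import math

# Hypothetical Gray-mapped QPSK constellation with unit average energy.
SCALE = 1.0 / math.sqrt(2.0)
QPSK = {
    (0, 0): complex(1, 1) * SCALE,
    (0, 1): complex(1, -1) * SCALE,
    (1, 0): complex(-1, 1) * SCALE,
    (1, 1): complex(-1, -1) * SCALE,
}

def bit_prob_one(llr):
    """P(c = 1) for an LLR defined as ln(P(c=1)/P(c=0))."""
    return 1.0 / (1.0 + math.exp(-llr))

def symbol_belief(bit_llrs, constellation):
    """Probability of each constellation point, assuming independent bits."""
    belief = {}
    for bits, point in constellation.items():
        prob = 1.0
        for bit, llr in zip(bits, bit_llrs):
            p1 = bit_prob_one(llr)
            prob *= p1 if bit == 1 else 1.0 - p1
        belief[point] = prob
    return belief

def mean_and_variance(belief):
    """First two moments of the symbol belief (Gaussian moment matching)."""
    mean = sum(prob * point for point, prob in belief.items())
    variance = sum(prob * abs(point - mean) ** 2 for point, prob in belief.items())
    return mean, variance

# No knowledge about the bits: zero mean, variance equal to the symbol energy.
mean_unknown, var_unknown = mean_and_variance(symbol_belief([0.0, 0.0], QPSK))
# Very reliable bits (1, 1): mean close to that point, variance close to zero.
mean_known, var_known = mean_and_variance(symbol_belief([20.0, 20.0], QPSK))
```

As the bit LLRs grow over the iterations, the variance shrinks toward zero, which is the mechanism by which the soft symbol estimates sharpen.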

Iterating Soft Turbo Detection Using Approximate Message Passing.
We let both the main detector and the branch detector apply the AMP-LA algorithm, which keeps the processing procedure simple. The iteration between the main detector and the branch detector is called the Main-Branch self-iteration, while the turbo iteration between the main detector and the decoders is the outer iteration. The total number of iterations of the proposed algorithm equals the number of outer iterations. The Main-Branch self-iteration and the outer iteration are performed in parallel. At outer iteration $i$, the a priori LLRs from the decoders and from the branch detector are combined for the main detector as $\lambda^{(i)}(c_n^q) = \lambda_{a,M}^{(i)}(c_n^q) + \lambda_D^{(i)}(c_n^q)$. When the main detector generates its extrinsic LLRs $\lambda_{e,M}^{(i)}(c_n^q)$ by Algorithm 1, the extrinsic LLR in (16) is set to $\lambda_{e,M}^{(i)}(c_n^q)$. We pass the extrinsic information of the main detector to the decoders and its correlation-compensated extrinsic information to the branch detector. Likewise, when the branch detector generates its extrinsic LLRs $\lambda_{e,B}^{(i)}(c_n^q)$ by Algorithm 1, the extrinsic LLR in (16) is set to $\lambda_{e,B}^{(i)}(c_n^q)$ and the a priori LLR to $\lambda_{a,B}^{(i)}(c_n^q)$. The main steps of the proposed iterative detection structure combined with Approximate Message Passing are described in Algorithm 2.

Complexity Analysis.
The computational complexity of the different algorithms is measured mainly by the number of floating-point operations (FLOPs) used for multiplications. A $K$-point multiplication of complex numbers by real numbers needs $2K$ FLOPs, and a $K$-point multiplication of two complex sequences requires $6K$ FLOPs. In this paper, the complexity of channel estimation and decoding is not considered, because the different detection schemes estimate the channel and decode with the same amount of computation.
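The counting rule can be written down directly. In the sketch below, the helper `count_flops` and the operation labels are hypothetical names introduced only for illustration; the per-element costs follow from the fact that a complex-by-real product takes 2 real multiplications, while a complex-by-complex product takes 4 real multiplications and 2 real additions.

```python
# FLOPs per element for the two kinds of multiplication used in the counts.
FLOPS_PER_ELEMENT = {
    "complex_times_real": 2,      # (a + jb) * r -> 2 real multiplications
    "complex_times_complex": 6,   # 4 real multiplications + 2 real additions
}

def count_flops(operations):
    """Sum the FLOPs of a list of (kind, number_of_points) operations."""
    return sum(FLOPS_PER_ELEMENT[kind] * points for kind, points in operations)

# Example: one 64-point complex-by-real product and one 64-point
# complex-by-complex product.
example_total = count_flops([
    ("complex_times_real", 64),
    ("complex_times_complex", 64),
])
```

Counting in this way keeps the comparison between detectors independent of any particular hardware or library implementation.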
We compare the average complexity of the proposed detection algorithm with that of other algorithms in Table 1, where $O(\cdot)$ is the Big-O notation expressing the complexity of an algorithm as a function of a given input.
The traditional MMSE-SIC frequency-domain algorithm in [7] has to compute the extrinsic means and variances of the symbols $x_n$. Let $N_u$ be the number of users and $N_r$ the number of receive antennas. The MMSE-SIC method has the largest computational complexity compared with AMP using Gaussian approximation (AMP-G), AMP-EP, and AMP-LA, but it also achieves a better performance improvement than they do. All the message passing algorithms require $3N_rN_u$ FLOPs to compute $|\mathbf{H}(k)|^2$ in the preprocessing stage, but the proposed algorithm needs this preprocessing only once for both the main and the branch detector. At the observation nodes, AMP-G and AMP-EP both require $13N_rN_u$ FLOPs, whereas the AMP-LA algorithm needs about $10N_rN_u$ FLOPs plus a linear-order term. The computational complexity of the extrinsic LLRs is $[(Q + 7)|\mathcal{A}| + 1]N_u$ FLOPs. We give the normalized computational complexity versus the number of antennas for the different detection algorithms in $N_r \times N_u$ multiuser MIMO-OFDM systems with QPSK modulation in Figure 4. For $N_r = N_u$, the BP algorithm based on the Markov random field (MRF) in [14] and the MMSE-SIC algorithm have the same cubic order of complexity in the number of antennas, so we only consider the normalized computational complexity of the MMSE-SIC algorithm. Figure 4 shows that the proposed algorithm needs fewer floating-point operations than the conventional MMSE-SIC algorithm and the AMP-G algorithm, especially when the number of antennas is large. As the number of antennas increases, the normalized computational complexity of AMP-EP becomes very close to that of the proposed detection method. AMP-LA has the lowest complexity. Similar normalized complexity comparisons are obtained for other types of modulation.

Algorithm 1: Processing procedure of the AMP-LA algorithm.
(2) for $n = 1 \to N_u$ do (computation at the cloning nodes)
(3) Calculate the bit probabilities $P^{(i)}(c_n^q)$ by (25).
(4) Calculate the symbol mean $\hat{x}_n^{(i)}$ and variance $\hat{\tau}_n^{(i)}$ by (30).
(5) end for
(6) for $m = 1 \to N_r$ do (computation at the channel transition nodes)
(7) Calculate $V_m^{(i)}$ by (32).
(8) Calculate the channel transition node output by (34).
(9) end for
(10) for $n = 1 \to N_u$ do (computation at the cloning nodes)
(11) Calculate the updated message parameters by (35) and (36).
(12) Calculate the symbol probabilities $P^{(i)}(x_n)$ by (38).
(13) for $q = 1 \to Q$ do (computation of LLRs)
(14) Calculate the LLRs $\lambda^{(i)}(c_n^q)$ by (37).
(15) end for
(16) end for

Simulation Results and Performance Analysis
We use convolutionally coded transmission to test the performance of the different detection methods in the MIMO system. Two channel models are used: a 16-tap Rayleigh fading channel with an equal-tap power delay profile, and the 9-tap ITU-EVA model, which has an unequal-tap power delay profile, so that frequency-selective fading channels are simulated adequately. When the number of taps is large and the paths have equal energy, the MIMO-ISI channel has a severe delay spread, which reproduces the scenario of [13, 14]. The ITU-EVA channel represents a realistic 5G application scenario.
We first consider an RSC-coded 64 × 64 MIMO system with QPSK and 16QAM modulation over the 16-tap Rayleigh fading channel with an equal-tap power delay profile. Figure 5 shows that the proposed method has better BER performance than the conventional detector using soft interference cancellation (SIC) based on the Minimum Mean Square Error (MMSE) criterion when the number of iterations is $i > 2$ for QPSK and $i > 5$ for 16QAM. Our method also performs better than the Approximate Message Passing detection by expectation propagation (AMP-EP) scheme and the AMP with linear approximation (AMP-LA) scheme in [13]. For instance, at a BER of $10^{-3}$, the proposed method gains about 0.3 dB over MMSE-SIC with $i = 6$ iterations and QPSK. At a BER of $10^{-3}$ with 12 iterations, the proposed method gains about 1.1 dB over MMSE-SIC with 16QAM. At this BER level with 6 iterations, the proposed iterative equalization scheme also achieves better BER performance than MMSE-SIC, while AMP-EP performs worse than MMSE-SIC. Note that, in the low $E_b/N_0$ region, the performance curve of the proposed algorithm is the first to reach the matched filter bound (MFB). It should be stressed that the proposed method is already accurate at the first iteration. From Figures 6 and 7 we observe that the proposed method also has better BER performance than the AMP-EP algorithm for different numbers of iterations in multiuser 64 × 64 MIMO systems. As the number of iterations increases, the performance improvement of our algorithm becomes more pronounced. Figure 6 shows that 6 and 12 iterations are enough for AMP-EP to approach the MFB with QPSK and 16QAM modulation at BER = $10^{-5}$, while the proposed algorithm approaches the MFB after just 5 and 10 iterations at BER = $10^{-4}$, as shown in Figure 7. The performance gap between the AMP-EP algorithm and the MFB is clearly larger than that between the proposed method and the MFB at the same BER level in the low $E_b/N_0$ region. Because AMP-LA is a simplification of AMP-EP, the performance of AMP-LA is worse than that of AMP-EP, so we only compare against AMP-EP.
It should also be noted that the performance of our scheme can be seriously affected by error propagation in the approximations made in the message updates (34) and (35). For the performance curves of our scheme to approach the MFB, more iterations are needed in low-dimensional MIMO systems. Figures 8 and 9 present the BER performance of the different methods in 16 × 16 and 4 × 4 MIMO systems with QPSK and 16QAM modulation. Compared with the MFB at BER = $10^{-2}$, the performance loss of the proposed algorithm with $i = 6$ iterations is about 1.3 dB and 1.5 dB in the 16 × 16 and 4 × 4 MIMO systems with the 16QAM constellation, respectively, while the loss in a 64 × 64 MIMO system is about 0.7 dB for the same number of iterations, BER level, and constellation. This is very important for large-dimensional systems, because the error propagation effects are smaller there. On the other hand, the performance gap between the proposed method and the AMP-EP algorithm is clearly larger in MIMO systems with more antennas than in a small-scale 4 × 4 MIMO system at the same BER level. The proposed algorithm can also be applied in uncoded systems, because the Main-Branch iterating soft detection structure does not use interleaving. In fact, in uncoded systems our algorithm can be seen as low-complexity turbo detection implemented in the frequency domain without channel decoders in the feedback loop. It therefore behaves similarly to turbo detection, with the BER performance curves of our scheme approaching the MFB in the higher $E_b/N_0$ region. Figure 9 shows that the performance of the proposed algorithm is even worse than that of MMSE-SIC when the number of iterations is $i = 3$ for QPSK and $i = 6$ for 16QAM, and Figure 8 shows that the performance of the proposed algorithm is close to that of the other algorithms when the number of iterations is small. We therefore conclude that the proposed algorithm is best used in large-scale MIMO systems, such as those with 16 or more antennas.
In Figures 10 and 11, the performance of the proposed algorithm is compared with that of the other methods over the ITU-EVA channel for QPSK and 16QAM modulation, respectively. The performance gap between the proposed method and the other detection algorithms is larger over the ITU-EVA channel than over the 16-tap Rayleigh fading channel with an equal-tap power delay profile.

Conclusions
In this paper, we proposed an iterative detection algorithm to combat intersymbol interference and improve the spectral efficiency of uplink large-scale MIMO communication systems. The iterative detection algorithm achieves near-optimal performance by iteratively exchanging probabilistic information about the coded bits between a soft-input soft-output (SISO) Main-Branch structure detector and a SISO channel decoder. Our method calculates the correlation coefficients between the iterative main detector and the branch detector by computing the gradient that minimizes the MSE. By applying AMP-LA, the precision of detection is improved while the complexity remains acceptable. The simulation results show that the proposed method performs better than traditional MMSE-SIC detection and AMP-type detection with a small number of iterations. Moreover, the complexity of the proposed algorithm is lower than that of the MMSE-SIC algorithm and close to that of AMP-EP when the number of system antennas is large.

Figure 2: Block diagram of the coded large-scale multiuser MIMO-OFDM system with iterative detector at the receiver.

Figure 3: Factor graph model of multiuser MIMO communication system at one subcarrier.

Algorithm 2: Processing procedure of the proposed Main-Branch structure turbo iteration detection.

Figure 4: Comparison of normalized computational complexity versus number of antennas.

Figure 7: BER performance versus $E_b/N_0$ for the proposed algorithm with different numbers of iterations in a 64 × 64 MIMO system, with QPSK and 16QAM modulation, 16-tap Rayleigh channel with equal tap power.

Figure 9: BER performance for different detection algorithms in a 4 × 4 MIMO system, with QPSK and 16QAM modulation, 16-tap Rayleigh channel with equal tap power.

Figure 10: BER performance for different detection algorithms in a 64 × 64 MIMO system, with QPSK modulation and ITU-EVA channel.

Figure 11: BER performance comparison for different detection algorithms in a 64 × 64 MIMO system, with 16QAM modulation and ITU-EVA channel.
To measure the correlation between the a priori LLRs and the extrinsic information, we force a linear relationship in which the estimate of one LLR equals the other LLR multiplied by $\beta_1$, where $\beta_1$ is a positive scaling factor; another linear relationship with scaling factor $\beta_2$ is forced in the same way for the second pair of a priori and extrinsic LLRs. The coefficient $\beta_1$ is computed to minimize the error between the LLR and its linear estimate.

Table 2: Simulation parameters of large-scale MIMO-OFDM system.
International Journal of Antennas and Propagation