Improved Approximate Expectation Propagation Massive MIMO Detector with Second-Order Richardson Iteration

,e expectation propagation (EP) detector achieves significantly better performance than the linear detectors (such as minimum mean squared error detector) in massive MIMO systems, which has drawn great attention recently. EP’s approximation (EPA) algorithm simplifies the update formula of the EP algorithm by reexpressing the moment matching condition so that the number of matrix inversions in the EP algorithm is reduced to one. However, the expense is that the EPA algorithm requires higher accuracy for this inversion; otherwise, the bit-error-rate (BER) performance will suffer serious losses. To tackle this issue, the SORI iterative algorithm is introduced to obtain the high-precision result of this inversion to ensure the good BER performance of the EPA algorithm. First, the new expression of the SORI iterative algorithm is derived under the equivalent real-valued system. Second, the improved EPA-SORI algorithm is then introduced by the SORI algorithm, which is used to approximate the initial value of the EPA algorithm under the real-valued system. Finally, by designing the initial solution and the relaxation factor of the EPA-SORI iterative algorithm, the convergence rate can be quickly increased without increasing the complexity. Simulation and complexity results exhibit that in various massive MIMO system configurations, the proposed EPA-SORI algorithm can achieve the same BER performance as the Exact EP algorithm with significantly lower complexity. At the same time, compared with MMSE and the existing EPA algorithms, the proposed EPA-SORI algorithm has a better performance-complexity trade-off advantage, which is more obvious in scenarios with high modulation order and a large number of users.


Introduction
Future wireless communication requires a high data rate and a tremendous amount of connection for emerging applications such as the Internet of things (IoT) [1,2]. e access of massive IoT devices to the network will lead to tremendous growth in the data volume of mobile communication services, and the wireless network capacity will face unprecedented challenges [3][4][5]. Aiming at this challenge, main solutions include the usage of larger bandwidth, higher-order MIMO [6,7], higher-order modulation, more effective coding, and so on. Among these solutions, through deep utilization of spatial dimensions, massive MIMO technology attains enhanced wireless communication capacity and spectral efficiency and has become a key technology for 5 G/B5G wireless communication [8]. With enormous system dimensions and the use of higher-order modulation, signal detection faces a challenge in terms of computational burden and hardware implementation [9]. e traditional optimal signal detector, maximum likelihood (ML) detector, faces the problem of exponential increase in computational complexity for massive MIMO systems [10]. In contrast, linear detection algorithms (such as MMSE, ZF) have reduced computational complexity. Especially when the loading factor ζ ≪ 1 (ζ ≜ N t /N r , where N t and N r represent the number of single antenna users and the number of base station antennas, respectively), the minimum mean squared error (MMSE), and Zero Forcing (ZF) linear detection algorithms can achieve near-optimal system performance. However, as a large number of IoT devices are connected to the cellular network, more users need to be served in a cell (e.g., mobile phones, unmanned aerial vehicles (UAVs) [11,12], sensors [13], vehicle to vehicle (V2V) [14]). Unfortunately, as the number of users increases, the performance of the linear detection algorithm suffers severe degradation. Compared with ML algorithm, other complex detectors (e.g., belief propagation (BP) [15,16], approximate message passing (AMP) [17]) can achieve excellent performance. However, the convergence speed of the iterative update in the BP algorithm will decrease, and the update formula during the iteration process becomes more complicated, because massive MIMO contains a large number of ring structures when the number of users increases. To address the above issues, an EP algorithm is proposed [18,19]. For a random value of ζ and high-order modulation, the EP algorithm not only shows significantly better performance than other algorithms such as BP, AMP and MMSE, but also has great flexibility and strong robustness, which has attracted wide attention. However, during each iteration of EP, it is necessary to perform a full matrix inversion with a complexity as high as O (N 3 ). In addition, the huge computational cost makes it difficult to implement on hardware. To address this problem, EP's approximation algorithm (EPA) simplifies the update formula of the EP algorithm by reexpressing the moment matching condition, so that the number of matrix inversions in the EP algorithm is reduced to one [20]. However, the expense is that the EPA algorithm requires higher accuracy for this inversion; otherwise, the bit-error-rate (BER) performance will suffer serious performance losses. Recently, some methods based on matrix polynomial decomposition are proposed to apply to the EPA algorithm (such as EPA-NSA and EPA-wNSA) [9,20]. When loading factor ζ ≪ 1, the EPA-NSA algorithm shows good performance. But with the increase of ζ, the EPA-NSA algorithm shows slow convergence or even nonconvergence, resulting in serious degradation of system performance. Although EPA-wNSA algorithm has an improved performance compared with the EPA-NSA algorithm, the degree of improvement is very limited. In addition, the above-mentioned algorithms like EPA-NSA, EPA-wNSA, etc., in spite of avoiding the direct inversion of the matrix and reducing some complexity, calculation of the Gram matrix with a complexity up to O (N 2 t N r ) still needs to be involved to obtain high-precision signal detection. Some famous linear iterative method, such as Gauss-Seidel [21][22][23], successive over relaxation [24,25], suffer from calculating the Gram matrix and low parallelism. erefore, they can not directly be applied to alleviate the computational burden of EP iteration. In order to solve these complex issues, an improved algorithm EPA-SORI is proposed in this paper. e EPA-SORI algorithm introduces the SORI iterative algorithm in EPA to obtain a high-precision result of this one inversion to ensure that the EPA algorithm has good error bit rate performance. e contributions of this paper are as follows: (1) We first deduce a new expression of the SORI iterative algorithm in the real value system. en, the SORI algorithm under the real value system is further applied to the EPA algorithm. Furthermore, LLR approximation is provided to enhance the accuracy of the EPA-SORI detector. (2) According to the random matrix theory, under the real-valued system, promising initial solution and relaxation factors are used to further enhance the convergence rate and accuracy and then reduce the computational complexity. Furthermore, a theoretical analysis of the convergence speed of the proposed EPA-SORI algorithm is presented. eoretical analysis proves that the convergence speed of the proposed EPA-SORI algorithm is significantly higher than the recently reported EPA-wNSA algorithm. (3) In the iterative process, the proposed EPA-SORI algorithm not only requires no matrix inversion operations but also avoids direct calculation of the Gram matrix, effectively reducing the complexity of the entire algorithm. At the same time, the proposed EPA-SORI algorithm is high-parallel and hardwarefriendly. (4) Simulation and complexity results show that the bit error rate performance of the EPA-SORI algorithm is much better than MMSE, and the complexity is much lower than MMSE. Compared with existing EPA algorithms (such as EPA-INSA, EPA-wNSA, etc.), the proposed EPA-SORI can achieve performance close to Exact EP with lower complexity and higher convergence rate. Furthermore, with high modulation order and a large number of users, EPA-SORI will show a more obvious performancecomplexity trade-off advantage.
1.1. Notation. Matrices and column vectors are represented by uppercase and lowercase boldface letters, respectively. e element in i-th row and j-th column of matrix A is denoted by A (i,j) . (·) T , (·) H , (·) − 1 , ‖ · ‖ 2 , and | · | denote transpose, conjugate transpose, inversion, 2 − norm and determinant, respectively. Also, the probability distribution of s is denoted by p(s). e real part and imaginary part are denoted by Re(·) and Im(·) , respectively. N(y: μ, Σ) represents the Gaussian probability distribution with mean μ and variance Σ.

System Model
We consider a massive MU-MIMO system in which N r antennas are deployed at the base station to serve N t users. s � [s 1 , s 2 , . . . , s N t ] T is a (N t × 1)-dimensional transmission signal vector, where s i ∈ Ω and Ω are the symbol set of the M− QAM constellation (e.g., M � 16/64/256, etc.). We assume that H ∈ C N r ×N t is a flat Rayleigh fading channel. 2 Wireless Communications and Mobile Computing e received signal y ∈ C N r ×1 at the receiver can be modeled by the following: where n is the additive white Gaussian noise (AWGN), and its elements satisfy a complex Gaussian distribution with mean 0 and variance σ 2 n . Here, the received signal can be reexpressed by the equivalent real-valued augmented vectors as follows: where y � Re(y) Im(y)] T ∈ R 2N r ×1 , s � Re(s) Im(s)] T ∈ Ω 2N t ×1 and Ω represents the set of real and imaginary parts of the point set on the constellation in M− QAM modulation. n � Re(n) Im(n)] T , where n ∈ N(0, σ 2 n I 2N r ) satisfy a real Gaussian distribution with mean 0 and variance σ 2 n � (σ 2 n /2

EP and EPA Algorithms
EP algorithm is a reasoning method based on Bayesian inference, which is used to estimate the value of x under the condition that y is known when a joint distribution p xy is given [26][27][28]. If we use p xy to estimate x directly, the complexity will increase exponentially with the dimensions of x and y, which consume huge hardware resources in massive MIMO scenario. erefore, an approximate distribution is used to estimate p xy . In model y � Hs + n, the joint posterior distribution of the y and s is p(s|y) ∝ N(y: Hs, σ 2 n I 2N r ) · p(s). To facilitate the analysis, according to [7], when the transmitted symbols are independent of each other, the EP detector uses a nonstandard Gaussian distribution to replace the prior distribution p(s) of each transmitting antenna, that is, And then, a posterior distribution p(s|y) whose distribution satisfies the exponential family approximate. p(s|y) can be expressed as follows: e cumulative multiplication in (3) can be transformed into the following form: p(s|y) � N y: Hs, σ 2 n I 2N r exp − is a diagonal matrix. According to (4), the expression of mean μ and variance Σ of p(s|y) is as follows: Having constructed p(s|y), it is necessary to use the moment matching technique so that p(s|y) and p(s|y) are as similar as possible. However, the complexity of p(s|y) is very high, and it is difficult to directly obtain Λ and c. Aiming at this problem, a sequential EP algorithm is proposed in [9,20]. It is assumed that p(s i |y) is the i− th edge of p(s|y): c 0 i � 0 and Λ 0 i � E s are adopted as the initial solutions of (7), where E s denotes the mean symbol energy. en, the alternative distribution expression is expressed as follows: where p (l)/i (s i ) is expressed as follows: en, the first-order moments η l i and the second-order moments χ l i of p (l) r (s i |y) can be obtained by the following: Assuming that t l i , h 2(l) i are the mean and variance of q (l)/i (s i ), the values of V l i and ρ l i for l iterations can be obtained as follows: Next, the updated Λ and c can be obtained by matching the first-order and second-order moments of p (l) r (s i |y) and the cavity marginal distribution p \i (s i |y). Since the calculation process of the EP algorithm is too cumbersome, and each iteration involves the matrix inversion calculation in equation (5), which brings difficulties to the signal detection in massive MIMO scenarios. To simplify the calculation Wireless Communications and Mobile Computing process and reduce computational complexity, EP's approximate algorithm EPA is proposed in [20]. e EPA algorithm reexpresses the moment matching conditions and no longer uses explicit matrix matching to calculate the results of each iteration. It simplifies the update formula in the iteration process of the EP algorithm and uses a fixed matrix to estimate the matrix V before the iteration. en, the approximate value is used to estimate the value of Σ, and finally, we will obtain an approximate algorithm EPA which is different from the Exact EP iteration process. e EPA algorithm reduces the EP algorithm that requires multiple matrix inversion calculations to one time, which can effectively avoid the impact of multiple inaccurate approximate matrix inversion results. However, the expense is that the EPA algorithm requires a complete and high-precision matrix inversion to obtain an initial value of iteration. e implementation of EPA is detailed in Algorithm 1.

The Proposed EPA-SORI Detector
MMSE solution is usually used as the iterative initial solution of EP detection [20], then we also use MMSE solution as the iterative initial solution t 0 of the proposed EPA-SORI algorithm. Define W ≜ H T H + (σ 2 n /E s )I 2N t , and the initial iteration t 0 can be obtained by the following: From (14), the initial iteration solution problem can be treated as a solution of the equation: Wt 0 � y MF . Note that the result of this equation is required to be highly accurate; otherwise, the bit error rate performance will suffer serious degradation, so we propose to apply SORI iteration to obtain a high-precision iterative initial solution. In the complex value system, the corresponding SORI algorithm can be constructed as follows: where l ≥ 1, s denotes the estimated value of W − 1 y MF , and α 1 is the relaxation factor, β 1 � 2/(λ max + λ min ), λ max and λ min are the maximum and minimum eigenvalues of W, respectively, where Note that the EPA algorithm is implemented in a real-valued system; thus, we need to apply the SORI algorithm to a real-valued system. Here, the SORI method can be reexpressed as follows: where l ≥ 1, s ∈ R 2N r ×1 denotes the estimated value of t 0 , and α 2 denotes the relaxation factor, β 2 � 2/(λ max + λ min ), and λ max and λ min are the maximum eigenvalue and minimum eigenvalue of W, respectively. e relationship between y MF and y MF satisfies: y MF � Re(y MF ) Im(y MF ) T . In recently reported literature [29], the SORI algorithm is mainly applied to signal detection under complex-valued systems and cannot be directly applied to real-valued systems. us, it is necessary to further deduce the maximum and minimum eigenvalues of W. As long as the eigenvalues of H T H can be obtained, the eigenvalues of W can be obtained.

Lemma 1.
In the real-valued system, the maximum and minimum eigenvalues of W are as follows: where Size C (·) and Size R (·) denote the number of columns and rows.
Proof. Assuming that one of the eigenvalues of H T H is λ 1 , and one of the eigenvalues of H H H is λ. Based on the properties of eigenvalues, we have the following: Here, H H H and H T H can be expressed as follows: Substitute (20) and (21) into (18) and (19), and we have the following: Simplify (23), and we have the following: e comparison between (24) and (22) shows that the eigenvalues of the two are the same, i.e., In addition, we can get the same conclusion by using random matrix theory [30,31]. According to random matrix theory [30], in a real-valued system, us, Lemma 1 is proved. e next step is to derive the relaxation factor α 2 in the real-valued system. According to [29], where ρ(G ri ) denotes the spectral radius of G ri , and G ri � (I 2N t − β 2 W). Given that b and a are the maximum and minimum eigenvalues of G ri , ρ(G ri ) � |b|, thus α 2 can be obtained by the following formula: 4 Wireless Communications and Mobile Computing Since us, we have the following: It can be seen that α 2 is only related to λ min and λ min . According to the typical properties of the massive MIMO channel, H satisfies asymptotic orthogonality, and the Gram matrix W is dominant diagonally. From (20) and (21), we have W � W 0 0 W , thus, the matrix W is also dominant diagonally. erefore, W could satisfy the approximation as follows: It can be seen from (29), when β 2 � 2/(λ max + λ min ), β 2 y MF can be used to approximate W − 1 y MF , the SORI algorithm can choose the initial iteration solution of s 0 � 0, e SORI algorithm is summarized in Algorithm 2.
In the proposed EPA-SORI algorithm, s 0 � 0 and s 1 � D − 1 y MF are used as the initial iteration solutions to further improve the convergence speed. Since D is the diagonal matrix of the W matrix, the complexity will not increase.
e EPA-SORI algorithm is summarized in Algorithm 3, and an intuitive diagram of the processing of the EPA-SORI algorithm is presented in Figure 1.

Convergence Performance Analysis
is section mainly analyzes the convergence performance of the EPA-SORI algorithm and compares it with that of the EPA-wNSA algorithm. Note that the convergence performance of the EPA-SORI and EPA-wNSA algorithms mainly depends on the convergence performance of the initialization part (i.e., SORI and wNSA).

Lemma 2.
e iterative spectral radius of the proposed EPA-SORI and the EPA-wNSA satisfy the following relationship: Proof. First, ε l is denoted as the SORI estimation error after l iterations: From (16), the relationship between ε l ε l+1 and ε l− 1 ε l can be obtained by the following: us, G sori denotes the iterative matrix of the SORI iterative algorithm which can be expressed as follows [29]: (1) Input: y, H, L, σ 2 n , E s , ϑ. Wireless Communications and Mobile Computing 5 Assuming that λ ri and λ sori are the eigenvalues of G ri and G sori , respectively. b 1 and a 1 are the maximum and minimum eigenvalues of G sori , respectively. At the same time, b and a are the maximum and minimum eigenvalues of G ri , respectively. us, we have |λ ri I 2N t − G ri | � 0 and |λ sori I 4N t − G sori | � 0. en, after substituting (33) into |λ sori I 4N t − G sori | � 0, we have | (λ 2 sori + α 2 − 1/λ sori α 2 ) (1) Input: y, H, K, σ 2 n ,E s .
end for ALGORITHM 2: e SORI algorithm.
(1) Input: y, H, L, K, σ 2 n , E s , ϑ.  Wireless Communications and Mobile Computing us, compared with |λ ri I 2N t − G ri | � 0, it can be derived as follows: Next, by substituting b into (34), the maximum eigenvalue b 1 of G sori can be obtained by ), m can be calculated as follows: Based on the above derivation, we have b 1 � (α 2 b/2). And after simplification, the spectrum radius of G sori can be calculated by the following: Assuming that G w denotes the iterative matrix of the wNSA iterative algorithm in a real-valued system, which is given by the following: where δ denotes the weight factor of EPA-wNSA, and D satisfies the following approximation: us, we have the following: (39) can be further simplified by substituting the minimum eigenvalue λ min of W. Furthermore, (σ 2 n /E s ) can be neglected since it is much smaller than N r . us, we have the following: Comparing the spectrum radius G sori and G w , we have the following: us, Lemma 2 is proved. e convergence rate R of the iterative algorithm is closely related to the spectral radius of the iterative matrix G, i.e., R � − lg(ρ(G)). Hence, the smaller the spectral radius of the iteration matrix, the greater the convergence speed. From (30), G sori has a smaller spectrum radius than G w ; it means that the proposed EPA-SORI algorithm exhibits favorable convergence performance.

Computational Complexity Analysis
In this section, the computational complexity is given by the number of real-valued multiplications (RMULs). As shown in Figure 1, the entire calculation process is divided into two parts: the initialization process and the iterative processes of the EPA-SORI algorithm.
Firstly, in the initialization process, the number of RMULs required by y MF and b are 4N t N r and 2N t , respectively. Since D is a diagonal matrix, thus the complexity is 4N t N r . Please note here that compared with matrix multiplication, the complexity of λ max , λ min and α 2 can be ignored. us, the initial solution s 0 does not need to be calculated, and the number of RMULs involved in calculating s 1 is 4N t . e next step is the SORI iteration part of the initialization process. e RMULs of h k and s l+1 for each

Wireless Communications and Mobile Computing
iteration is approximately 4N t N r and 4N t N r + 6N t , respectively. erefore, the RMULs complexity involved in calculating the SORI iteration is K (8N t N r + 6N t ), where K denotes the number of iterations of SORI. e final step of this process is the initialization of t, whose computational complexity is 4N t .
Secondly, the RMULs complexity involved in calculating the EPA iteration is L (8N t N r + 8N t ), where L denotes the number of EPA iterations.
Finally, the overall complexity of the proposed EPA-SORI algorithm is approximately as follows: Table 1 shows the computational complexity of EPA-SORI algorithm, MMSE [10], EPA-INSA [9], EPA-wNSA [32], and the Exact EP [18] algorithm. From Table 1, note that Gram matrix calculations with a complexity of up to 4N 2 t N r cannot be avoided in many of the reported methods, such as MMSE, EPA-wNSA, and the Exact EP. Additionally, MMSE and the Exact EP also involve the inversion of a matrix with a complexity of up to 4N 3 t . In contrast, the proposed EPA-SORI algorithm only involves operations with complexity of about 8N t N r (K + 1) + 8N t N r L. is is because in the SORI algorithm, direct calculation of the Gram matrix is avoided by splitting calculation. For example, to calculate H T Hs 1 , we first calculate Hs 1 , and then multiply the result with H T . Here, we have a much lower calculation complexity, which is 4N t N r + 4N t . Compared with these methods of calculating Gram matrix first and then multiplying with vector, the complexity, in this case, is greatly reduced. Since K and L are much smaller than N t and N r for massive MIMO systems, the complexity of the EPA-SORI algorithm is much lower than that of other algorithms.

Simulation Result
In this section, the BER performance results of the EPA-SORI algorithm are presented and compared with EPA-wNSA [32], MMSE [10], EP-INSA [9], and the Exact EP [18] algorithms. To fully demonstrate and verify the BER performance of the proposed EPA-SORI, simulations are performed under different modulation methods (i.e., 16/64/256QAM) and different loading factors (i.e., ζ � 0.5/0.25). For some damping factor ϑ ∈ [0, 1], we set ϑ � 0.5 in EPA-wNSA and EPA-SPRI algorithms according to [20]. Assuming that the base station is able to obtain perfect channel state information (CSI) and complete signal detection based on the obtained CSI.
To further exhibit the convergence performance of EPA-SORI, Error-vector magnitude (EVM), which is defined as EVM � (‖s − s‖ 2 /‖s‖ 2 ) × 100% [33], is considered in Figure 2. As presented in Figure 2, EPA-INSA diverges for three different modulation methods when N r � 128, N t � 64. In contrast, the convergence performance of EPA-wNSA can be improved by the optimal choice of the weighted factor, but the degree of improvement is very limited [32]. At the same time, the proposed EPA-SORI algorithm could fast converge to the accurate Exact EP algorithm with obviously fewer iterations. As the modulation order increases, the benefit brought by EPA-SORI is more obvious. erefore, it is verified that the advantage of the proposed algorithm is in fast convergence.
As shown in Figures 3-5, when the Massive MIMO system is configured as 32 × 64, 32 × 128, or 64 × 128, the loading factor ζ is 0.5 and 0.25. In each system configuration, we consider three modulation methods: 16QAM, 64QAM, and 256QAM, and different algorithms are compared and analyzed. In Figure 4, at BER 10 − 3 with N r � 128, N t � 32 for 256-QAM, EPA-SORI (K � 3, L � 7) has a better performance than that of MMSE 0.9 dB. And in Figure 3, at BER 10 − 3 with N r � 64, N t � 32 for 256-QAM, EPA-SORI (K � 9, L � 7) outperforms MMSE 2.2 dB. It can be seen that the BER of EPA-SORI algorithm is obviously superior to that of the MMSE, and as the loading factor ζ increases, the advantage will grow further. In Figures 4 and 5, EPA-INSA (K � 6, L � 7) has a good performance when ζ � 0.25, but does not converge when ζ � 0.5. Meanwhile, choosing the weight factor of EPA-wNSA helps improve the convergence performance of EPA-wNSA but makes it difficult for further improvement, especially at a low ζ. In contrast, EPA-SORI uses SORI iteration to solve the only one-time matrix inversion in EPA, which enables it to fast converge to the accurate Exact EP algorithm with low complexity. For example, in Figure 4, at BER 10 − 3 with N r � 128, N t � 32 for 64-QAM, EPA-SORI (K � 3, L � 7) outperforms EPA-wNSA (K � 5, L � 7) 0.3 dB, and for 256-QAM, EPA-SORI (K � 3, L � 7) outperforms EPA-wNSA (K � 5, L � 7) 3.1 dB. In Figure 5, at BER 10 − 3 with N r � 128, N t � 64 for 64-QAM, with only 8 iterations used by EPA-SORI (K � 8, L � 7), its performance is outperforming that of EPA-wNSA algorithm (K � 21, L � 7) 2 dB which requires 21 iterations 0.2 dB. And for 256-QAM, the advantage will grow further. In other words, the performance of the EPA-SORI algorithm is always superior to that of EPA-wNSA under the same Massive MIMO system configuration.
Furthermore, a clear overview of the performance-complexity trade-off under different modulation methods and system configurations is provided in Figure 6. From Figure 6, the EPA-SORI algorithm can achieve not only significantly better performance than MMSE with a significantly lower computational complexity, but also almost the same performance as the Exact EP. For example, in Figures 6(b) and 6(c), at BER 10 − 3 with N r � 64, N t � 32 for 64-QAM and 256-QAM, EPA-SORI (K � 9, L � 7) outperforms MMSE 2.4 dB and 3.8 dB, respectively, which is very close to that of Exact EP. However, EPA-SORI (K � 9, L � 7) only consumes 67% and 63% of computational cost of MMSE and Exact EP, respectively. It is also observed from Figure 6, EPA-wNSA can improve the BER performance by increasing the number of iterations K at the cost of a substantial increase in computational complexity. In contrast, the EPA-SORI algorithm can achieve BER performance close to Exact EP with fewer iterations, and its complexity is much lower than the Exact EP and EPA-wNSA. For example, in Figure 6(b), at BER 10 − 3 with N r � 64, N t � 32 for 64-QAM, by increasing the number of iterations K, the performance of EPA-wNSA (K � 29, L � 7) is increased by 0.9 dB compared with EPA-wNSA (K � 16, L � 7), but the complexity is increased by 11% compared with EPA-wNSA (K � 16, L � 7). However, EPA-8 Wireless Communications and Mobile Computing Detector scheme Computational complexity MMSE [10] 4N t (N 2 t + N t N r + N r ) Exact EP [18] 4N t (N 2 SORI (K � 8, L � 7) can achieve a performance close to the Exact EP when K � 8, but it only consumes 66% of complexity of EPA-wNSA (K � 29, L � 7). Next, we compare the EPA-SORI, EP-wNSA and Exact EP algorithms in Figures 6(a) and 6(d). In Figure 6(a), at BER 10 − 3 with N r � 128, N t � 32 for 256-QAM, EPA-SORI (K � 3, L � 7) outperforms EPA-wNSA (K � 5, L � 7) 2.3 dB, and the complexity is 65% of EP-wNSA (K � 10, L � 7) and 52% of that of the Exact EP. In addition, in Figure 6(d), at BER 10 − 3 with N r � 128, N t � 64 for 256-QAM, EPA-SORI (K � 8, L � 7) outperforms EPA-wNSA (K � 21, L � 7) 2 dB. At the same time, it only consumes 42% of the computational cost of EP-wNSA (K � 27, L � 7) and 29% of that of Exact EP. In summary, compared with MMSE, Exact EP and the recently reported EPA-wNSA algorithms, the proposed EPA-SORI algorithm has a better performance-complexity trade-off advantage, which is more obvious in scenarios with high modulation order and a large number of users.

Conclusion
In this paper, we propose a novel data-detection scheme, EPA-SORI detector, which can achieve the same BER performance as the Exact EP algorithm with significantly lower complexity in various massive MIMO system configurations. At the same time, compared with MMSE and the existing EPA algorithms, the proposed EPA-SORI algorithm has a better performance-complexity trade-off advantage, which is more obvious in scenarios with high modulation order and a large number of users. e proposed algorithm avoids the direct calculation of the Gram matrix. At the same time, several effective techniques (i.e., the iteration initial solution and the optimal relaxation factor) are adopted to further enhance the convergence rate and accuracy. In future work, there will be many potential applications. e proposed design can be extended to other more complex scenarios, such as the extension of EPA-SORI to decentralized architectures [34][35][36]. Also, it can be further combined with deep learning methods to improve performance [37,38]. Finally, we will investigate the proposed design to more realistic channel scenarios in our future work.

Data Availability
e data used to support the findings of this study are included within the article. No other data were used beyond those in this article.  Figure 6: e performance-complexity trade-off comparison for different detectors under different system configurations. Exact EP is set as the performance benchmark to compare the SNR loss of MMSE, EPA-wNSA detectors. EPA-INSA detector is not compared in this figure because it fails to converge results in very poor performance, which cannot achieve the illustrated BER level.