An Improved Data-Driven Decision Feedback Receiver via Deep Unfolding

The proposed decision feedback receiver (DFR) is an end-to-end data-driven iterative receiver whose performance gain is achieved through iterations. However, a mismatch between the training set and the test set exists in DFR training, which introduces performance degradation, slow convergence, and oscillation. On the other hand, deep unfolding with parameter sharing is a practical method to reduce the number of model parameters and improve training efficiency, but whether parameter sharing causes performance degradation is rarely considered. In this work, we discuss and analyze these two problems in a general way, and we then introduce either a solution to each problem or conditions under which the problem no longer exists. We give improvements that address the mismatch problem in the DFR, and thus propose an improved-DFR via deep unfolding. The improved-DFRs without and with parameter sharing, namely, DFR-I and DFR-IS, are both developed with low computation complexity and model complexity and can be executed by parallel processing. Besides, practical training tricks and performance analysis, including computation complexity and model complexity, are given. In the experiments, the improved-DFRs outperform the DFR in various scenarios in terms of convergence speed and symbol error rate. The simulation results also show that, in comparison to DFR-I, the DFR-IS is easier to train and its slight performance loss can be reduced by increasing the model complexity.


Introduction
Machine learning (ML) or deep learning (DL) [1][2][3][4][5] is promising for wireless communications. The communication system is typically model-driven, and thus the corresponding design is derived from domain knowledge. Meanwhile, the ML-enabled approach is data-driven: it learns to optimize the system design through data training. ML is applicable to end-to-end optimization problems in complex scenarios with inaccurate and/or intractable models [6,7]. The neural network (NN) is a fundamental model of ML. As a universal function approximator, the NN has been widely studied for a range of optimization problems in wireless communications, such as signal detection [8][9][10][11], degree of arrival (DOA) estimation [12,13], beam prediction [14], and resource allocation [15,16]. Besides, the rapid development of massively parallel processing hardware with distributed memory guarantees that the deployment and execution of NN-based algorithms are fast and efficient.
Deep unfolding [17] is proposed as a combination of data-driven NN-based learning approaches and model-based iterative algorithms, and it has been widely investigated in the wireless communication community [6,18,19]. In deep unfolding, the iterations are unfolded into a layer-wise structure analogous to a NN, and the model parameters across layers are untied to obtain learnable NN-like architectures. The resulting structure makes an iterative algorithm capable of learning from data, and thus a better performance can be achieved [20]. In some cases where the signal dimension is high and the mathematical model is unavailable, the iterative algorithms are developed in a data-driven manner; the scale of trainable parameters in an unfolded NN is then extremely large [21,22], and the training is challenging. Parameter sharing is a practical technique that allows the parameters across different layers to be the same and thus reduces the number of trainable parameters [9,22,23]. On the other hand, without parameter sharing, the number of samples with respect to the corresponding trainable parameter is inversely proportional to the number of layers. Therefore, the decrease in model complexity and the increase in sample complexity are beneficial to overcome the overfitting problem and to improve the learning efficiency. Intuitively, the reduction of parameters in the whole unfolded NN will lead to a decrease in model approximation quality. But strictly speaking, whether parameter sharing leads to performance degradation is rarely investigated.
Considering signal detection, Ye et al. presented initial results on deep learning for signal detection in orthogonal frequency-division multiplexing (OFDM) systems [8]. In that letter, they exploited deep learning to handle wireless OFDM channels in an end-to-end manner. Different from existing OFDM receivers that first estimate channel state information (CSI) explicitly and then detect/recover the transmitted symbols using the estimated CSI, the proposed deep learning-based approach estimates CSI implicitly and recovers the transmitted symbols directly. Samuel et al. considered the use of deep neural networks in the context of multiple-input multiple-output (MIMO) detection; a model- and data-driven neural network architecture suitable for this detection task was proposed [24]. Furthermore, a MIMO detector specially designed by unfolding an iterative algorithm and adding some trainable parameters is proposed in [9]. Since the number of trainable parameters is much smaller than that of the data-driven DL-based signal detector, the model-driven DL-based MIMO detector can be rapidly trained with a much smaller data set.
In [25], a point-to-point band-pass wireless communication system is considered. The transmitted symbols are mapped by m-ary phase position shift keying (MPPSK) modulation [26], and infinite impulse response (IIR) band-pass filters [27] are deployed at both the transmitter and the receiver, to shape the transmitted signal and obviate the disturbances, respectively. However, the filters also introduce inter-symbol interference (ISI), waveform distortions, and nonwhite noise. To address these issues, a matched filter, an equalizer, and a demodulator are usually required at the receiver. The ISI becomes severe when the allocated bandwidth is narrowed, and equalization becomes the key issue in this receiver. However, equalizer design in such a system is challenging. First, the system is IIR, but typical equalizers are developed under the assumption that the system is finite impulse response (FIR) [28]. Second, the symbols are mathematically described as scalars in a typical equalizer; meanwhile, the symbols modulated by MPPSK are formulated as vectors in the time domain.
Third, the block-by-block system design is cumbersome. Due to these issues, the receiver design is intractable, and thus we proposed an end-to-end NN-based receiver that iteratively estimates the transmitted symbols, namely the decision feedback receiver (DFR). Simulation results showed that, after several iterations, the DFR had better detection performance than the receiver without feedback. However, we also found that the first solution of the DFR performed worse than the receiver without feedback, although theoretically they are the same. Besides, detection instability along the iterations occurred when the soft information of both posterior and previous adjacent symbols was utilized as feedback. These problems need to be solved.
In this work, we study the principle behind the performance degradation of the DFR in a general way. Based on the theoretical analysis, we propose an improved-DFR to address the existing problems in the DFR. Specifically, this work makes the following contributions: (i) We consider a mismatch between the training set and the test set, and point out that the increased test error is caused by data missing and data divergence.
We also give the upper and lower bounds of the test error. Then, we introduce, respectively, the solution that eliminates the error caused by data missing and the conditions under which the inherent data divergence no longer exists.
(ii) We consider training one single model on an integration of multiple training sets and point out that the increased test error is caused by data divergence.
We also introduce the conditions under which data divergence no longer exists. Based on these conclusions, under the assumption of sufficient model complexity, we prove that data divergence does not exist in deep unfolding with parameter sharing. Furthermore, we use the Markov decision process (MDP) [29] to describe the iterative algorithm with insufficient model complexity, and the condition for the nonexistence of data divergence is given.
(iii) We point out that the existing problems in the DFR are introduced by the mismatch between the data sets.
To address the mismatch problem, we further propose the improved-DFR via deep unfolding. The improvements include the following: the DFR is unfolded into several sub-NNs, and the intermediate solutions can be obtained during training; each sub-NN is trained with its own parameter, and the possibly existing data divergence is avoided; the soft information of both posterior and previous adjacent symbols is utilized as feedback, and the compromise previously needed to guarantee stability becomes unnecessary.
(iv) The improved-DFR is an end-to-end data-driven receiver; it has low computation complexity and model complexity and can be executed by parallel processing. We also propose a clip technique to solve the numeric overflow problem in practical training. In the experiments, the improved-DFRs outperform the DFR in various scenarios in terms of convergence speed and symbol error rate (SER). The simulation results also show that, in comparison to DFR-I, the DFR-IS is easier to train and its slight performance loss can be reduced by increasing the model complexity. The remainder of this study is organized as follows. The theoretical analysis, including the test error brought by the mismatch problem, the integration of multiple training sets, and parameter sharing, is given in Section 2. The DFR and the proposed improved-DFR are presented in Section 3. The simulation results of the improved-DFR with and without parameter sharing and of the DFR are demonstrated in Section 4. Conclusions are presented in Section 5.

Theoretical Analysis
2.1. Preliminary. We begin by establishing the models of the data set. We assume that the potential mapping in the data set is a surjection, in which every element of the image space is the value of some member of the domain, and that the mapping satisfies Lipschitz continuity. The mapping f in the training set is expressed as

y = f(x),                                                  (1)

where the input vector x, following the random distribution X, is a random input vector, and the output vector y follows the random distribution Y. The probability density function (PDF) of X is p(x). We sample the input-output pairs of f independently and identically, obtaining the training set S_train = {(x_n, y_n)}_{n=1}^{N_train}, where the subscript "train" denotes the training set and N_train is the number of training samples. Similarly, the mapping g in the test set is defined as

y = g(x),                                                  (2)

where X′ is a random variable with PDF q(x). We sample the input-output pairs of g and obtain the test set S_test = {(x_n, y_n)}_{n=1}^{N_test}, where the subscript "test" denotes the test set and N_test is the number of test samples.
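The sampling procedure above can be sketched as follows. This is an illustrative toy setup: the mappings f = g = sin and the Gaussian input distributions p(x) and q(x) are placeholders, not the paper's signal model.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hypothetical Lipschitz-continuous target mapping (placeholder)
    return np.sin(x)

# training set: i.i.d. input-output pairs drawn from the input PDF p(x)
N_train = 1000
x_train = rng.normal(loc=0.0, scale=1.0, size=N_train)  # p(x): N(0, 1)
S_train = [(x, f(x)) for x in x_train]

# test set: inputs may follow a different PDF q(x), creating a mismatch
N_test = 200
x_test = rng.normal(loc=0.5, scale=1.0, size=N_test)    # q(x): N(0.5, 1)
S_test = [(x, f(x)) for x in x_test]

print(len(S_train), len(S_test))
```

Shifting the mean of q(x) relative to p(x) is the simplest way to produce the distribution mismatch studied in the next subsection.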
Second, we define a space D(p ∩ q), the intersection of the support sets of the functions p(x) and q(x):

D(p ∩ q) = {x : p(x) > 0} ∩ {x : q(x) > 0}.                (3)

We also define a space D(p/q), the complement of the support set of q(x) in the support set of p(x), and similarly a space D(q/p). The two spaces are written as

D(p/q) = {x : p(x) > 0} \ {x : q(x) > 0},
D(q/p) = {x : q(x) > 0} \ {x : p(x) > 0}.                  (4)

Figure 1 shows an illustration of these spaces with a one-dimensional variable x. Particularly, D(p ∩ q) is the intersection of the support sets of p and q, D(p/q) is the complement of the support set of q in the support set of p, and D(q/p) is the complement of the support set of p in the support set of q.
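For discrete distributions, the three spaces reduce to plain set algebra on the supports, which the following sketch illustrates (the pmfs here are arbitrary examples; the paper treats continuous densities).

```python
# D(p ∩ q): intersection of supports; D(p/q), D(q/p): set differences
def support(pmf):
    return {x for x, prob in pmf.items() if prob > 0}

p = {0: 0.2, 1: 0.5, 2: 0.3, 3: 0.0}   # note: 3 has zero probability under p
q = {1: 0.4, 2: 0.4, 3: 0.2}

D_pq = support(p) & support(q)        # D(p ∩ q)
D_p_not_q = support(p) - support(q)   # D(p/q): seen in training, never in test
D_q_not_p = support(q) - support(p)   # D(q/p): seen in test, missing in training

print(D_pq, D_p_not_q, D_q_not_p)
```

Here D(q/p) = {3} is exactly the region that later produces the "data missing" error: the model never observes those inputs during training.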

Mismatch between Training Set and Test Set.
In ML, the potential mappings in the training set and the test set are usually the same, i.e., f = g, and they follow the same distribution, i.e., p(x) = q(x). Using a training set where N_train ⟶ ∞, a parameterized model f_θ of sufficient complexity can be sufficiently trained, and then we have f_θ ⟶ f = g. This indicates that the learned model f_θ has zero error on the test set. First, we investigate the mismatch problem between the training set and the test set: when there exists a difference between the distributions of the data sets, i.e., KL(p‖q) > 0, where KL(·‖·) is the Kullback-Leibler divergence [2], and thus f ≠ g, how will the learned model perform on the test set? Without loss of generality, we use some distance function d(·) as an error function, and the expected error on the test set can be expressed as

L(f_θ, g) = E_{x∼q}[d(f_θ(x), g(x))] = ∫ q(x) d(f_θ(x), g(x)) dx,     (5)
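The mismatch criterion KL(p‖q) > 0 can be checked numerically; a minimal sketch for discrete pmfs follows (the example distributions are arbitrary).

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete pmfs over the same alphabet.

    Defined only when q(x) > 0 wherever p(x) > 0.
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# identical distributions -> KL = 0, i.e. no distribution mismatch
p = {0: 0.5, 1: 0.5}
print(kl_divergence(p, p))  # 0

# mismatched distributions -> KL > 0, the condition studied in this subsection
q = {0: 0.8, 1: 0.2}
print(kl_divergence(p, q))
```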

Mathematical Problems in Engineering
The upper bound of the test error is given in (5), and it indicates that the error is composed of two parts. First, in the space D(p ∩ q), the images of the same variable are different, i.e., g(x) ≠ f(x); we call this inherent error the data divergence, and it cannot be cancelled. The other part of the error comes from the input space D(q/p), which appears only in the test set and is absent from the training set. This error is caused by data missing, and the model cannot capture the proper mapping during learning. Particularly, data missing is different from the generalization error: by increasing the sample complexity, the data divergence cannot be reduced, but the generalization error can be alleviated. However, samples that follow the distribution of the test set can be added to the training set, and then the data missing can be removed. We consider three cases. First, after proper training with the added samples, the upper bound of the test error is reduced, since the data-missing term vanishes. Second, when either of two conditions, (8) or (9), is satisfied but the added samples are unavailable, the upper bound of the test error reduces to the data-missing term alone. Third, when the learned model eliminates the data missing with the added samples and the data set satisfies either condition (8) or (9), then sup L(f, g) = 0. According to the nonnegativity of the PDF and the distance function, this upper bound is also the lower bound.

Integration of Multiple Training Sets.
We have quantitatively analyzed the test error when the training set and the test set mismatch; in this subsection, we focus on the situation where multiple training sets that follow different distributions are integrated into a new training set. First, we consider two training sets, i.e., K = 2, whose potential mappings are f and g, respectively. We assume that f is sampled with probability u ∈ (0, 1) in the integrated training set S_train, and that the test set S_test follows the same distribution as the training set. Therefore, the expected errors on the training set S_train and the test set S_test are the same. The model h is trained with S_train, and the corresponding error can be written as in (11). According to (11), ∀x ∈ D(p ∩ q), there exists a minimizer h*(x) as in (12). Especially, according to the symmetry and triangle inequality of d(·), when u = q(x)/(p(x) + q(x)), where u is a function of x, we have (13). Plugging (13) into (11), (11) can be rewritten as (14).

Figure 1: The illustration of the spaces D(p ∩ q), D(p/q), and D(q/p) with a one-dimensional variable x.
More generally, after sufficient training, the learned function h* can be achieved as in (15). Plugging (15) into (11), we obtain (16). The upper bound of the error is given in (16). Data missing no longer exists in this situation, but data divergence arises from the integration. Similarly, when the data set satisfies condition (8) or (9), the upper and lower bounds of the error are both zero. The above conclusions can be extended to K > 2. Given a set of mapping functions S_f = {f^(k)}_{k=1}^{K}, with the element mappings expressed accordingly, when either of the two conditions (19) and (20) is satisfied, and we train a model h of sufficient complexity with a sufficiently large training set, the expected error on the data set reaches the lower bound, i.e., L(h, S_f) = 0. In summary, the data missing can be removed by adding the missing samples to the training set. Moreover, condition (19) or (20) guarantees that the mapping in the integrated training set is still a surjection, and then the inherent data divergence no longer exists.
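The compromise a single model must make on the overlap D(p ∩ q) can be seen numerically. In this sketch, the label at a point x is f(x) with mixing weight w and g(x) with weight 1 − w; for the squared distance (an assumption, since the paper's d(·) is generic), the risk-minimizing output is the weighted mean, which differs from both f(x) and g(x) — that residual is the data divergence.

```python
import numpy as np

def mixture_risk(h, f_val, g_val, w):
    # expected squared error when the label is f(x) w.p. w and g(x) w.p. 1 - w
    return w * (h - f_val) ** 2 + (1 - w) * (h - g_val) ** 2

f_val, g_val, w = 1.0, 3.0, 0.25
candidates = np.linspace(0.0, 4.0, 4001)
risks = mixture_risk(candidates, f_val, g_val, w)
h_star = candidates[np.argmin(risks)]

# for squared distance the minimizer is the weighted mean w*f + (1-w)*g
print(h_star)
```

Unless f(x) = g(x) on the overlap (condition-(8)/(9)-style agreement), h* cannot match both mappings, so the integrated-training error stays strictly positive.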

Deep Unfolding.
In deep unfolding, an iterative algorithm can be unfolded as

ϕ^(k) = f(x, ϕ^(k−1); θ^(k)),                              (21)

where k is the iteration index, x is the input, ϕ^(k) is the current iterative solution and also part of the input of the next iteration, and θ^(k) is the parameter of layer k. In the first iteration, ϕ^(0) is some initial value. The target solution of the iterative algorithm is y. Given an algorithm with K_max iterations, the parameter set of the corresponding model is {θ^(k)}_{k=1}^{K_max}. In some cases, the parameters in different layers are shared, i.e., θ^(k) = θ, ∀k.
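The unrolled structure can be sketched as below. The layer function is a toy affine update (an assumption for illustration), not the paper's sub-NN; the point is only the chaining of ϕ^(k) and the choice between untied and shared parameters.

```python
# K_max iterations of phi = f(x, phi; theta) unrolled into a layer-wise chain
K_max = 4

def layer(x, phi, theta):
    # one unfolded iteration: refine the running solution phi using input x
    return phi + theta * (x - phi)

def unfolded_forward(x, thetas, phi0=0.0):
    phi = phi0                  # phi^(0): some initial value
    for theta in thetas:        # layer k applies its own parameter theta^(k)
        phi = layer(x, phi, theta)
    return phi

x = 1.0
untied = unfolded_forward(x, thetas=[0.5, 0.5, 0.5, 0.5])  # theta^(k) untied
shared = unfolded_forward(x, thetas=[0.5] * K_max)         # theta^(k) = theta
print(untied, shared)
```

With identical parameter values the two variants coincide, which is exactly why sharing is tempting; the analysis below asks when tying them costs accuracy.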
We discuss the training procedures with and without parameter sharing here. First, we are given a parameter set {θ^(k)}_{k=1}^{K_max} and a set of training sets {S_train^(k)}_{k=1}^{K_max}, where the element training set S_train^(k) is provided for the parameter θ^(k), and the element sample in S_train^(k) is (x, ϕ^(k−1), y). Without parameter sharing, there are K_max layers, and each layer is trained on its own training set with respect to its own parameter. Second, we consider the training case with parameter sharing. According to the discussions in subsections 2.2 and 2.3, a single model is used to approximate multiple potential mappings, and thus the corresponding training set must include samples of all these mappings to eliminate data missing. Hence, an integrated training set S_train = ∪_{k=1}^{K_max} S_train^(k) is provided for the model with the shared parameter θ. Using parameter sharing, only one model is trained with respect to the shared parameter on the integrated training set.
Several benefits can be gained from parameter sharing. Assuming that the numbers of parameters in different layers are the same, the total number of model parameters is reduced to 1/K_max of the original, while the number of samples associated with the corresponding parameter is increased by a factor of K_max. The model complexity and parameter redundancy are reduced, but the sample complexity is increased; thus, overfitting can be alleviated and the learning efficiency can be improved.
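The 1/K_max reduction is simple bookkeeping, sketched below for a fully-connected sub-NN; the layer sizes are illustrative examples, not the paper's Table 1 values.

```python
# Parameter counting for an unfolded model with K_max structurally identical layers
def mlp_param_count(layer_sizes):
    # weights (n_in * n_out) plus biases (n_out) of a fully-connected network
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

K_max = 4
per_layer = mlp_param_count([140, 128, 128, 4])  # hypothetical sub-NN shape

without_sharing = K_max * per_layer  # theta^(1), ..., theta^(K_max) all untied
with_sharing = per_layer             # a single shared theta
print(per_layer, without_sharing, with_sharing)
```

Samples per parameter move the other way: the integrated training set pools all K_max layer-wise sets onto the one shared θ.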
On the other hand, some negative issues can arise from parameter sharing. For example, the decrease in the parameter number also leads to a decrease in the model approximation quality. When parameter sharing is considered, the model complexity can be appropriately increased, because the union space D(p^(1) ∪ · · ·) of the layer-wise input distributions is larger than each individual input space. Moreover, there is always a trade-off between the model approximation quality and the computation efficiency. Normally, a proper NN scale can be determined through simulations. Most importantly, we consider the following problem: does the test error caused by data divergence exist in deep unfolding with parameter sharing?

Sufficient Model Complexity.
According to the introduced formulation and training of deep unfolding, given any input (x, ϕ^(k−1)) and output ϕ^(k) in any layer k, the target solutions are the same, i.e., Y^(k) = y, ∀k. This indicates that when the model complexity and sample complexity are both sufficient, after sufficient training, we arrive at (22). Equation (22) shows that deep unfolding satisfies condition (20), and the potential mapping in the integrated training set is still a surjection. Therefore, the data divergence does not exist in deep unfolding with parameter sharing, and the corresponding upper bound of the test error is zero.
In fact, the assumption of sufficient model complexity is impractical. Due to the extreme complexity of the potential function f, the practical parameterized model f_θ usually cannot approximate f with arbitrary precision. Therefore, we resort to iterative methods that approach the optimal solution step-by-step, instead of reaching the optimum in a single step. Consequently, the data divergence may still exist due to the insufficient model complexity.

Insufficient Model Complexity.
We resort to the MDP to describe the iterative optimization in deep unfolding with insufficient model complexity. The MDP is a typical model in reinforcement learning [29], which concerns an agent interacting with an environment in the single-agent case. In each interaction, the agent takes an action a by policy π using the observed state s, and then receives a feedback reward r and an updated state s′ from the environment. The agent aims to find an optimal policy that maximizes the cumulative reward over the continuous interactions. The inference process of iterative algorithms can be described as an MDP, as shown in Figure 2. The agent policy is the model function, π = f. At time k, the state is s_k = (x, ϕ^(k−1)); the action, produced by the policy (namely the model function), is a_k = ϕ^(k). The reward is a negative distance function −d(ϕ^(k), ϕ*), where ϕ* denotes the optimal solution with respect to x. The environment then transfers to the updated state s_{k+1} = (x, ϕ^(k)). In this formulation, the iterative inference is consistent with the MDP.
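The state/action/reward bookkeeping above can be sketched as a rollout. The policy here is a toy contraction toward a fixed point (an assumption for illustration), standing in for the model function f.

```python
# Iterative inference as an MDP: state s_k = (x, phi_{k-1}),
# action a_k = phi_k from policy pi = f, reward r_k = -d(phi_k, phi_star)
def policy(x, phi):
    # toy policy: move halfway toward x (the unknown fixed point in this example)
    return phi + 0.5 * (x - phi)

def rollout(x, phi_star, K_max, phi0=0.0):
    phi, rewards = phi0, []
    for _ in range(K_max):
        phi = policy(x, phi)                  # action a_k = phi^(k)
        rewards.append(-abs(phi - phi_star))  # reward r_k = -d(phi^(k), phi*)
    return rewards

rewards = rollout(x=1.0, phi_star=1.0, K_max=5)
print(rewards)
```

Here the reward sequence is strictly increasing, which is precisely the monotonicity condition under which the next paragraph shows the data divergence does not arise.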
We can prove the following: when the performance of the intermediate solution, −d(ϕ^(k), ϕ*), is monotonically increasing, the model f is a surjection and thus the data divergence does not exist.

Proof. Given any input x and intermediate solutions ϕ^(i) and ϕ^(j) with −d(ϕ^(i), ϕ*) ≠ −d(ϕ^(j), ϕ*), then according to the monotonicity of the distance function, we have ϕ^(i) ≠ ϕ^(j). Given a solution set {ϕ^(k)}_{k=0}^{K_max} whose elements satisfy −d(ϕ^(0), ϕ*) < · · · < −d(ϕ^(K_max), ϕ*), we thus have ϕ^(i) ≠ ϕ^(j) for all i, j ∈ {0, 1, . . . , K_max} with i ≠ j. The image of any input (x, ϕ^(k−1)) is a unique ϕ^(k). Therefore, the mapping f is a bijection and hence a surjection, and the data divergence does not exist. □

The above conclusion is derived without the assumption of sufficient model complexity. When the monotonically increasing condition is not satisfied, oscillation may occur. Besides, when the monotonically increasing condition is satisfied, the data divergence does not exist even in the situation where the model parameter is shared and the model complexity is limited.

Figure 2: The iterative inference described as an MDP: the agent policy is π = f, the state is s_k = (x, ϕ^(k−1)), the action is a_k = ϕ^(k), and the reward is r_k = −d(ϕ^(k), ϕ*).

System Model and Problem Formulation.

As shown in Figure 3, a point-to-point band-pass wireless communication system is considered. The symbol sequence vector s is produced with frame size N_F, and the symbols are modulated by m-ary phase position shift keying (MPPSK) modulation [30]. The MPPSK signal of symbol m ∈ {1, . . . , M} is given by the waveform g_m(t), where T_c = 1/f_c represents the carrier period, and K and N denote the number of carrier periods in each time slot and in each symbol, respectively. Apparently, we have N = KM and T = NT_c, where T denotes the symbol duration. Under the MPPSK modulation, the transmitted symbols are mapped into a base-band signal s_l(t), where the waveform of the nth symbol satisfies g_n(t) ∈ {g_m(t)}_{m=1}^{M}. Then, s_l(t) is up-converted by the carrier frequency f_c into the signal s(t) = s_l(t)e^{j2πf_c t} and shaped by a band-pass IIR filter with impulse response h_bp(t) into the transmitted radio-frequency signal x(t). The additive white Gaussian noise (AWGN) channel is considered, and the additive noise is denoted by w(t) with variance σ_w². At the receiver, the same band-pass IIR filter is used to denoise and obviate the disturbances. The filtered received signal z(t) is obtained as

z(t) = s″(t) + w′(t) = s(t) * h_bp(t) * h_bp(t) + w(t) * h_bp(t),      (25)

where * represents the convolution operator. It can be seen that z(t) is composed of two parts: the colored Gaussian noise w′(t), which is filtered once, and the band-pass signal s″(t), which is filtered twice. The continuous signal z(t) is then sampled at the sampling frequency f_s, where f_s is usually a positive integer multiple of f_c, i.e., f_s/f_c ∈ N+, and finally we obtain the sampled signal z^up. In this work, we investigate the detection problem: estimate the transmitted symbol sequence s from the received sampled signal z^up. The band-pass IIR filters are deployed at both the transmitter and the receiver to shape the transmitted signal and obviate the disturbances, respectively. However, the filters also introduce ISI, waveform distortions, and nonwhite noise. To overcome these issues, a matched filter, an equalizer, and a demodulator are usually required. The receiver design is intractable, and thus we proposed an end-to-end data-driven approach.
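The twice-filtered-signal, once-filtered-noise structure of z(t) can be sketched numerically. Everything here is an assumption for illustration: the resonator coefficients stand in for the paper's 21st-order design, the cosine burst stands in for the MPPSK waveform, and f_c and f_s are example values.

```python
import numpy as np

rng = np.random.default_rng(0)

def iir_filter(b, a, x):
    # direct-form difference equation:
    # a[0] y[n] = sum_i b[i] x[n-i] - sum_{j>=1} a[j] y[n-j]
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[j] * y[n - j] for j in range(1, len(a)) if n - j >= 0)
        y[n] = acc / a[0]
    return y

fc, fs = 1000.0, 8000.0                  # carrier / sampling rates (assumed)
w0, r = 2 * np.pi * fc / fs, 0.95        # 2nd-order resonator tuned near fc
b, a = [1.0, 0.0, -1.0], [1.0, -2 * r * np.cos(w0), r * r]

n = np.arange(64)                        # one symbol: N = 8 carrier periods
s = np.cos(w0 * n)                       # toy band-pass symbol waveform
x = iir_filter(b, a, s)                  # shaped by the transmitter filter
rx = x + 0.05 * rng.normal(size=x.size)  # AWGN channel adds w(t)
z = iir_filter(b, a, rx)                 # receiver filter: s filtered twice, w once
print(z.shape)
```

The receiver sees z, in which the useful signal has passed h_bp twice while the noise has become colored after one pass, matching the decomposition in (25).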

Decision Feedback Receiver.
In [25], we proposed an end-to-end NN-based receiver that iteratively estimates the transmitted symbols, called the decision feedback receiver (DFR). There are L hidden layers in the fully-connected NN, and the number of neurons in the lth hidden layer is N_l. In the kth iteration, to detect symbol n, the input of the DFR includes a windowed received sampled signal z_n^up and a priori information vector q_n^(k−1), and the output p_n^(k) is a conditional probability vector. The mapping of the DFR is expressed as

p_n^(k) = f(z_n^up, q_n^(k−1); θ),                         (26)

where θ denotes the model parameter. As shown in Figure 4, z_n^up is a vector filtered by a rectangular window. The window length is N_s × f_s/f_c, where N_s ∈ N+, and the window center is on symbol n. The priori information vector q_n^(k−1) is composed of the conditional probability vectors of the N_pre ∈ N+ previous and N_pos ∈ N+ posterior adjacent symbols of symbol n. Therefore, the number of feedback symbols is N_f = N_pre + N_pos, and the length of the priori information vector is M × N_f. Then, the length of the DFR input is given as

N_in = N_s × f_s/f_c + M × N_f.                            (27)

Besides, the length of the DFR output is equal to M. The estimated symbol can be obtained by

ŝ_n = argmax_m p_{n,m}^(k).                                (28)

In summary, the DFR utilizes the prior soft information of adjacent symbols derived from the last estimation to iteratively estimate the transmitted symbols.
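The input-length bookkeeping above can be checked against the values reported in Section 4 (N_f = 3, window of 128 samples, N_in = 140). The split N_pre = 2, N_pos = 1 and the values f_s/f_c = 8, N_s = 16 are assumptions consistent with those totals, not values the paper states explicitly.

```python
M = 4            # MPPSK modulation order
fs_over_fc = 8   # samples per carrier period, f_s / f_c (assumed)
N_s = 16         # window length in carrier periods (assumed)
N_pre, N_pos = 2, 1  # previous/posterior feedback split (assumed)

N_f = N_pre + N_pos            # number of feedback symbols
len_window = N_s * fs_over_fc  # windowed received signal z_n^up
len_prior = M * N_f            # priori information vector q_n^(k-1)
N_in = len_window + len_prior  # total DFR input length, as in (27)

print(N_f, len_window, len_prior, N_in)
```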

Existing Problems in DFR.
Simulation results in [25] showed that, after several iterations, the DFR had better detection performance than the NN-based receiver without feedback. However, we also found that the first solution of the DFR performed worse than the receiver without feedback, although theoretically they are the same. This phenomenon is caused by the mismatch between the training set and the test set. In (26), the DFR obtains the estimated soft information derived from the last estimation. However, the soft information q is unavailable in the training set, and q is replaced by the hard information generated from the correct output. From the analysis in subsection 2.2, we know that this mismatch in the input will lead to performance degradation.
Owing to the mismatch in the input data between the training set and the test set, three sub-problems arise, which we summarize as follows: (i) First, according to the analysis in subsection 2.2, data missing exists in the DFR due to the mismatch problem. When the band is narrow and the signal-to-noise ratio (SNR) is low, the mismatch becomes severe and the DFR detection performance on the test set is ruined. (ii) Second, data divergence can exist due to the insufficient complexity of the DFR. There is only one single model in (26), and its parameter θ is shared in all iterations. (iii) Third, when the soft information of both posterior and previous adjacent symbols is utilized, the detection becomes unstable and oscillation occurs. Therefore, we adopted an alternative method where only the soft information of previous adjacent symbols is used as feedback.

Framework of Improved-DFR.
To overcome these existing problems, we unfold the DFR and make adjustments to the training algorithm via deep unfolding, and the improved-DFR is proposed. As shown in Figure 5, the iterative algorithm is unfolded into K_max sub-NNs, and the parameter of NN-k is θ^(k). All the sub-NNs have the same structure and scale. The activation functions in the hidden layers and the output layer are sigmoid and softmax, respectively. In the DFR, the detection is frame-by-frame, and thus we use tensors to describe the data. The first dimension of the tensor-like data refers to the sample number, which is the frame size N_F in the DFR. The tensor is a matrix X when the sample is a vector x. The mapping of sub-NN k can be described as

P^(k) = f(X^(k); θ^(k)),                                   (29)

where X^(k) = [Z^up, Q^(k−1)] denotes the input of NN-k. In the first iteration, Q^(0) is initialized as a zero matrix.

Figure 3: The system block diagram: source, MPPSK modulation, band-pass filter (BPF), channel, BPF, and DFR (receiver).
In the proposed improved-DFR, Memory I transforms the received sampled signal z^up into Z^up and stores Z^up. Besides, the output matrix P^(k−1) is transmitted to Memory II-(k−1) and then transformed into the priori information matrix Q^(k−1). The Q^(k−1) is stored in Memory II-(k−1) and then transmitted to NN-k as input. There are one Memory I and K_max Memory IIs in the improved-DFR. In Algorithm 1, the detection algorithm of the improved-DFR is given in the symbol-wise form. In addition, the detection algorithm can be executed in parallel. The detailed hyper-parameters of the improved-DFR are given in Table 1.
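The inference loop through the K_max sub-NNs can be sketched as follows. This is a structural sketch only: `sub_nn` is an untrained random stand-in for NN-k, and the feedback Q is taken directly as the previous soft output rather than the adjacent-symbol rearrangement performed by Memory II.

```python
import numpy as np

rng = np.random.default_rng(0)
N_F, M, K_max = 5, 4, 3   # frame size, modulation order, iterations

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sub_nn(Z_up, Q, theta):
    # placeholder sub-NN: any map from [Z_up, Q] to per-symbol probabilities
    X = np.concatenate([Z_up, Q], axis=1)
    return softmax(X @ theta)

Z_up = rng.normal(size=(N_F, 8))                 # Memory I: stored inputs
thetas = [rng.normal(size=(8 + M, M)) for _ in range(K_max)]  # untied theta^(k)

Q = np.zeros((N_F, M))                           # Q^(0) initialized to zero
for k in range(K_max):
    P = sub_nn(Z_up, Q, thetas[k])               # NN-k: whole frame in parallel
    Q = P                                        # soft feedback toward NN-(k+1)

s_hat = P.argmax(axis=1)                         # symbol decision per position
print(P.shape, s_hat.shape)
```

Note that all N_F symbols of a frame go through each sub-NN as one batched matrix product, which is what makes the parallel execution claimed above possible.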

3.4.2.
Training of Improved-DFR. First, the received sampled signal z^up and the corresponding transmitted symbol sequence s are transformed into the input Z^up and the target output P, respectively. Given the initial Q^(0) and the input Z^up, the improved-DFR serially obtains the solutions of each sub-NN, and we obtain the solution set {P^(k)}_{k=1}^{K_max}. According to the cross-entropy loss function, the error of one frame is expressed as in (30). To minimize (30), each parameter is updated along the negative gradient direction during learning:

θ^(k) ← θ^(k) − η ∇_{θ^(k)} L,                             (31)

where η denotes the learning rate. Especially, we use the Adam optimizer [31][32][33] to dynamically adjust the learning rate. In Algorithm 2, the training algorithm is given in the parallel-processing form. We summarize the improvements of the improved-DFR as follows: (i) During learning, the training set is dynamically generated, where the priori information is transformed from the last solution of the previous sub-NN. Therefore, the training data and test data follow the same distribution, and the error caused by data missing is eliminated. (ii) Each sub-NN is trained with its own parameter, and the possibly existing data divergence caused by insufficient model complexity is avoided. (iii) The soft information of both posterior and previous adjacent symbols is utilized as feedback, and the detection instability does not exist.
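The frame loss over all intermediate solutions can be sketched as below. The plain sum over iterations is an assumption, since the exact per-iteration weighting of (30) is not restated here; the point is that every sub-NN output, not only the last one, is supervised against the one-hot target P.

```python
import numpy as np

def cross_entropy(P_target, P_pred, eps=1e-12):
    # average per-symbol cross-entropy between one-hot targets and soft outputs
    return -np.sum(P_target * np.log(P_pred + eps)) / P_target.shape[0]

def frame_loss(P_target, solutions):
    # supervise every sub-NN output so intermediate solutions are also trained
    return sum(cross_entropy(P_target, P_k) for P_k in solutions)

P_target = np.eye(4)[[0, 2, 1, 3]]   # one-hot labels for a 4-symbol frame
good = [np.full((4, 4), 0.05) + 0.8 * P_target for _ in range(3)]  # confident
bad = [np.full((4, 4), 0.25) for _ in range(3)]                    # uniform

print(frame_loss(P_target, good), frame_loss(P_target, bad))
```

A gradient step on this loss, taken per layer as in (31), is what Adam then rescales adaptively during training.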

Practical Tricks in Training.
In practical training, the numeric overflow problem often occurs, especially when logarithmic or exponential calculations are involved. Take the DFR as an example: the target output for symbol m is a one-hot vector whose mth element is 1 while the other elements are 0. The mth output element of the model is the softmax value

p_m = exp(W_m x + b_m) / Σ_n exp(W_n x + b_n),             (32)

where W and b are the parameters of the previous layer. When the model approximates the mth output element to be 1, from (32) we know that p_m ⟶ 1 only when (W_m x + b_m) ⟶ +∞. The parameters W, b will sharply increase, and thus the numeric overflow is caused. To fix this issue, one simple and efficient trick is to clip the elements of the output by the function

p̂ = min(max(p, ϵ), 1 − ϵ),                                (33)

where ϵ is a small value, and ϵ = 1 × 10^−4 in our training. On the other hand, overfitting often exists in practical training, due to the imbalance between the model complexity and the sample complexity. We use a validation set to observe the overfitting and evaluate the DFR performance after each training episode, and the model with the smallest error on the validation set is saved and then executed on the test set.

Algorithm 1: The iterative detection algorithm of the improved-DFR (input: the received sampled signal z^up, the frame size N_F, the iteration times K_max, and the terminal threshold θ_th; initialization: transform z^up to Z^up and store it in Memory I, and initialize P^(0) and s^(0) as 0; output: s = s^(k)).
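The clip trick of (33) can be sketched directly, using the ϵ value stated above.

```python
import numpy as np

EPS = 1e-4  # the epsilon value used in training, per (33)

def clip_output(p, eps=EPS):
    # bound each softmax output into [eps, 1 - eps] so log(p) and log(1 - p)
    # stay finite and the weights are not pushed toward infinity
    return np.clip(p, eps, 1.0 - eps)

p = np.array([0.0, 1.0, 0.3])  # raw outputs that would overflow log()
p_safe = clip_output(p)

print(p_safe)
print(np.log(p_safe))  # finite everywhere after clipping
```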
Algorithm 2: The training algorithm of the improved-DFR (input: the frame number N_sam, the received sampled signal set {z_u^up}_{u=1}^{N_sam} and the corresponding transmitted symbol sequence set {s_u}_{u=1}^{N_sam}, the initial learning rate η, the frame size N_F, the training episode times N_episode, and the iteration times K_max; initialization: randomly initialize {θ^(k)}_{k=1}^{K_max} and initialize Adam with learning rate η; in each episode and for each randomly selected frame, initialize P^(0) to 0, serially obtain {P^(k)}_{k=1}^{K_max}, calculate the error L(P, {P^(k)}_{k=1}^{K_max}) according to (30), and use Adam to calculate the gradients {∇_{θ^(k)} L}_{k=1}^{K_max} and update the parameter set according to (31)).

Furthermore, the parameters across sub-NNs can be constrained to be the shared parameter θ, and then the improved-DFR with parameter sharing is derived. In short, we call the improved-DFR with parameter sharing DFR-IS, and the improved-DFR without parameter sharing DFR-I.

Complexity Analysis. First, we consider the computation complexity of receivers with various model scales. The computation complexity of the DFR and the improved-DFR is the same. In each inference, per iteration and per symbol, the number of additions N_add, multiplications N_mul, and exponentiations N_exp are listed in Table 2. Both the computation complexity and the time cost are proportional to the iteration times K_max. Meanwhile, all the symbols in one frame can be processed in parallel. Therefore, when parallel computation is fully utilized, the inference time cost is unchanged across frame sizes.
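As a rough sanity check on how such per-symbol, per-iteration counts scale, the helper below tallies multiplications, additions, and exponentiations for a generic fully connected NN with a softmax output. It is a generic estimate under the stated simplifications, not a reproduction of Table 2.

```python
def mlp_cost(n_in, hidden, n_out):
    # per-symbol, per-iteration operation counts for one forward pass of a
    # fully connected NN; bias adds and activation costs other than the
    # final softmax are ignored for simplicity
    sizes = [n_in] + list(hidden) + [n_out]
    n_mul = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    n_add = n_mul   # one accumulate per multiply
    n_exp = n_out   # softmax exponentiations at the output layer
    return n_add, n_mul, n_exp
```

The total cost over a frame is then K_max times this per-iteration figure, while the wall-clock time is unchanged by the frame size if all symbols are processed in parallel.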
Second, we consider the model complexity and storage complexity of the improved-DFR. The number of parameters N_θ and the amount of stored data N_mem are listed in Table 3. The stored data include the received sampled signal z_up and its transformation Z_up in Memory I, the DFR output P, and its transformation Q in all Memory IIs. It is noteworthy that the K_max Memory IIs are virtual, and in fact there is only one physical Memory II in the DFRs.

Simulation Results
The parameters of the band-pass wireless communication system are listed in Table 4. With the values and settings in Table 4, the number of carrier periods in each symbol is N = 8, and the order of the designed band-pass filter is 21. Besides, the detailed hyper-parameters of the improved-DFR are listed in Table 1. The number of feedback symbols is N_f = 3, the length of the priori information vector is 12, the length of the input received sampled signal is 128, and the input vector length of the DFR is N_in = 140. The number of training samples is 1 × 10^5. The DFR-I and DFR-IS are tested in this section, and the DFR with N_pre = 3 and N_pos = 0 is regarded as a comparison. These values and settings are fixed in the following simulations and will be particularly mentioned if modified.
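The input dimension N_in = 140 in Table 1 decomposes as the 128-sample received window plus the 12-element priori information vector. The sketch below assembles such an input, assuming the 12 elements come from N_f = 3 feedback symbols with 4-ary one-hot encoding; the modulation order is our assumption, not stated in this section.

```python
import numpy as np

N_F_SYMBOLS = 3    # N_f feedback symbols (from the text)
ONE_HOT_LEN = 4    # assumed modulation order, so 3 * 4 = 12 priori elements
LEN_PRIOR = N_F_SYMBOLS * ONE_HOT_LEN
LEN_SIGNAL = 128   # received sampled signal window (from the text)

def build_input(z_window, prior):
    # concatenate the priori information vector and the sampled signal window
    assert z_window.size == LEN_SIGNAL and prior.size == LEN_PRIOR
    return np.concatenate([prior, z_window])
```

Concatenating a zero priori vector with a zero signal window yields a vector of length 140, matching N_in in Table 1.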
The simulation results are averaged over 20 repeats.

Error. We first examine whether the candidate NN structures have sufficient model complexity, and we use the error function (33) to measure the training quality. As shown in Figure 6, the error on the training set is recorded along the episode times N_episode. In general, the final error decreases as the model complexity increases, and the DFR-I achieves a smaller final training error than the DFR-IS. Besides, the NNs with higher complexity converge faster. These simulation results basically agree with the theoretical expectations. Meanwhile, the error on the validation set is recorded in Figure 7. As the episode times increase, the validation errors of the NNs with {128}, {64, 64}, and {128, 128} increase to varying degrees, which verifies our speculation that overfitting occurs.

The DFR-IS with {128} obtains the lowest final validation error. The error on the validation set also indicates that parameter sharing is useful to alleviate overfitting when the model is complex.

Detection.
The learned models are then tested on the test set. After 5 iterations, the SER curves are plotted in Figure 8. Generally, the SNR-SER performance of the DFR-I and the DFR-IS is similar, but the ones using parameter sharing are slightly worse. All the SERs of the improved-DFRs are lower than 1.0 × 10−5 when SNR = 8 dB, except for the NN with {2}. However, the smallest NN can still achieve SER = 1.0 × 10−5 when the SNR is increased to 10 dB.
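Measuring SERs near 1.0 × 10−5 reliably requires far more than 10^5 symbols per point, which is why the curves are averaged over repeated runs. The toy Monte-Carlo below estimates the SER of BPSK over AWGN as a stand-in for a receiver chain; the paper's band-pass system and DFR are not reproduced here.

```python
import numpy as np

def bpsk_ser(snr_db, n_sym=200_000, seed=0):
    # Monte-Carlo SER estimate for BPSK over AWGN; a generic benchmark,
    # not the paper's detector
    rng = np.random.default_rng(seed)
    bits = rng.integers(0, 2, n_sym)
    s = 2.0 * bits - 1.0                       # BPSK mapping {0,1} -> {-1,+1}
    snr = 10 ** (snr_db / 10)
    noise = rng.standard_normal(n_sym) / np.sqrt(2 * snr)
    decisions = (s + noise >= 0).astype(int)   # threshold detector
    return np.count_nonzero(decisions != bits) / n_sym
```

At SNR = 0 dB this lands near the theoretical Q(√2) ≈ 7.9 × 10−2; pushing the estimate down to the 10−5 region would demand millions of symbols per SNR point.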

Complexity.
High-speed processing is significant for wireless communication systems, and thus low complexity is preferred. The specific computation times and parameter numbers of different NNs are listed in Table 5. Generally, the computation times are proportional to the total number of hidden neurons. Although the improved-DFR with the {2} NN suffers about a 2 dB SNR loss to achieve SER = 1.0 × 10−5, the corresponding computation cost and parameter number are extremely small. Taking account of training difficulty, detection performance, and computation and model complexity, an improved-DFR with {32} is adopted in the following simulations.

Iteration Times.
In this subsection, we study the relationship between the iteration times and the detection performance, with SNR ∈ [0, 10] dB. As shown in Figure 9, the SNR-SER curves of the DFR and the DFR-I after 5 iterations are plotted. Generally, the SER of the DFR-I declines sharply in the first 3 iterations; then the SER almost stops declining, and thus the corresponding solution converges. Meanwhile, the SER of the DFR still declines slowly after 4 iterations, and the corresponding solution converges slowly. In the high-SNR region, the SER gap between the DFR-I and the DFR becomes larger than that in the low-SNR region. The SER performance of the DFR after 3 iterations is comparable to that of the DFR-I after the first iteration. This indicates that the mismatch between the training set and the test set seriously ruins the performance of the DFR. To achieve SER = 1.0 × 10−5 after 5 iterations, the required SNR of the DFR is 10 dB, while the DFR-I only needs about SNR = 7 dB. In summary, the improved-DFR addresses the mismatch problem between the training set and the test set in the vanilla DFR, and thus the improved-DFR outperforms the DFR. In comparison to the DFR, the detection SER of the initial solution of the improved-DFR is greatly reduced. Besides, the convergence speed of the improved-DFR is faster and its final SER is lower.
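The convergence behavior above, namely, iterating until the soft output stops changing, can be sketched as a generic fixed-point loop with the terminal threshold θ_th from Algorithm 1. The `step` callable is a stand-in for the trained per-iteration detector, not the paper's network.

```python
import numpy as np

def iterative_detect(step, z, k_max=5, theta_th=1e-6):
    # run the unfolded detector for up to k_max iterations and stop early
    # once the soft output P changes by less than the terminal threshold;
    # step(z, p) stands in for one trained iteration P^(k) = f(Z, P^(k-1))
    p = np.zeros_like(z)
    for k in range(k_max):
        p_next = step(z, p)
        if np.max(np.abs(p_next - p)) < theta_th:
            p = p_next
            break
        p = p_next
    return p, k + 1
```

With a contracting `step`, the loop either hits the threshold early or runs all K_max iterations, mirroring how the DFR-I's SER curve flattens after the first few iterations.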
On the other hand, we study the influence of parameter sharing on the detection performance of the improved-DFR after different iteration times. As shown in Figure 10, the SNR-SER performance of the DFR-I is generally slightly better than that of the DFR-IS, which uses parameter sharing. When SNR = 10 dB, symbol errors cannot be captured on the test set with the DFR-I after 3 iterations; meanwhile, the SER of the DFR-IS is around 1.0 × 10−6. To improve the detection performance of the improved-DFR using parameter sharing, we can properly increase the model complexity.

We then study the detection performance under different bandwidths. The SNR-SER curves of the DFR-I after the first iteration, and of the DFR, DFR-I, and DFR-IS after 5 iterations, are illustrated in Figure 11. As a baseline, the DFR-I after the first iteration is regarded as a normal NN-based receiver without feedback. In general, under different bandwidths, the DFR-I outperforms the other receivers, and the detection performance of the improved-DFRs is better than that of the DFR. When the bandwidth is reduced to B = 15 MHz, the ISI is serious, and the SERs of all receivers drop slowly. When SNR = 14 dB, the estimated result of the normal receiver without feedback is SER = 7.6 × 10−2. The performance improvement of the DFR is poor, and its SER is only reduced to 1.6 × 10−2. In contrast, the SERs of the DFR-I and the DFR-IS after 5 iterations are 1.5 × 10−3 and 2.1 × 10−3, respectively. Compared with the DFR, the SER of the improved-DFR is reduced by about one order of magnitude. Then, we turn to the scenario with B = 25 MHz. It can be seen that as the bandwidth increases, the ISI is alleviated. The detection performance of the three receivers gets closer, and the performance gain achieved by iterations is insignificant. However, the DFR-I still outperforms the other receivers. When B = 25 MHz and SNR = 6 dB, the SER of the DFR-I is 2.5 × 10−6, while the SERs of the DFR-IS and the DFR are around 1.0 × 10−5.

Conclusions
In a general way, we have quantitatively analyzed the increased test error brought by the mismatch problem between the training set and the test set, and the corresponding upper and lower bounds are also given. We have pointed out that the increased test error is composed of data divergence and data missing, and the solution to eliminate the data missing and the conditions under which the data divergence no longer exists are also given. The established analysis is further developed for the case where one single model is trained on an integral training set. We have proved that deep unfolding using parameter sharing has no data divergence with sufficient model complexity. Meanwhile, with the aid of the MDP model, we have given the condition under which the data divergence does not exist when the model complexity is insufficient. Based on the aforementioned analysis, we have studied the mismatch problem and the resulting sub-problems in the DFR. Then, the improvements to solve these problems are proposed, and thus the improved-DFR is developed. The improved-DFR has low computation complexity and model complexity and can be executed by parallel processing. The simulation results show that the improved-DFR has a faster convergence speed and better final SER performance than the DFR. Moreover, the performance of the DFR-I and the DFR-IS is similar. In comparison to the DFR-I, the DFR-IS is easier to train, and its slight performance loss can be reduced by increasing the model complexity. In our future work, we will focus on a model- and data-driven method for the DFR, to further reduce the training complexity of the data-driven DFR.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that there are no conflicts of interest.