Cyclic prefix (CP) in multicarrier modulation systems has been considered as an alternative to the training sequences to track channel estimates. In this paper, two new algorithms are developed that exploit CP from their data detection part and employ systolic block Householder transformation recursive least squares (SBHT-RLS) algorithms for channel tracking in multicarrier systems. The new methods are compared with the existing CP exploiting correlation matrix based block RLS (CMB-RLS) channel tracking approach to outline their relative advantages. Aspects of computational complexity and parallel implementation are addressed, and the algorithms are tested in terms of their channel estimation and tracking capabilities. Performance of the algorithms is also evaluated for varying forgetting factor parameter values, constellation size, and word lengths. Floating-point and fixed-point simulations are tailored to illustrate pertinent tradeoffs.
1. Introduction
Over the last two decades, multicarrier modulation has received considerable interest for its use in wireless and wireline communication systems [1–4]. It has been adopted in many communication standards, including digital audio broadcasting (DAB) [5], digital video broadcasting (DVB) [6], high-speed modems over digital subscriber lines (xDSLs) [4], and local area mobile wireless broadband [7].
Most multicarrier systems use coherent detection of data symbols, which requires reliable estimation of channel at the receiver. Channel state information is also necessary for techniques such as channel shortening [8], adaptive modulation/loading, and/or power control [9]. In applications such as discrete multitone (DMT) xDSL [4], channel is estimated through some initial training process, and retraining is required to track the channel variation. To avoid the system overhead due to retraining and thus to track the channel more efficiently, in [10], a correlation matrix based block recursive least-squares (CMB-RLS) algorithm is proposed. The algorithm takes advantage of the inherent redundancy introduced by the cyclic prefix (CP) to blindly estimate the channel. In [11], performance of the algorithm is analyzed considering both the effect of channel noise and decision error. The algorithm is further explored in [12], where its performance is analyzed considering the impact of exponential forgetting factor values, constellation size, and channel nulls. Also, in [13], the method is used in single-carrier (SC) modulation with frequency domain equalization (FDE) to maintain both system performance and throughput.
While the CP-based CMB-RLS approach is standard compliant, two basic problems make it unsuitable for real-time implementation. First, it relies on computing the inverse of the correlation matrix $\bar{\Phi}$ at every time update. The computational cost of performing the required matrix inversion in real time can be prohibitively high for a system with a large channel length. (This inversion cannot be done recursively using the matrix inversion lemma, as in the conventional RLS (CRLS) algorithm [14], to reduce the computational complexity and thus the processing power.) Second, the direct inversion and recursive inversion approaches are known to severely limit the parallelism and pipelining that can effectively be applied in a practical implementation.
Usually, to minimize the round off error, matrix inversions are done with general-purpose digital signal processing (DSP) devices/processors using floating-point arithmetic. A disadvantage of this approach, however, is severe processing power limitation due to small number of floating-point processing units commonly available per device. Specialized hardware with high-processing power is therefore required to execute requisite computations in real time. An appealing alternative for implementation is not to do this inversion explicitly and solve the problem through a computationally cheaper approach that works directly with data matrix and is realizable on the systolic array architecture offering large amounts of parallelism for high-speed very large scale integration (VLSI) implementation. In VLSI implementation, floating-point arithmetic units are more complex than those of fixed-point arithmetic, involving extra hardware overhead and more clock cycles [15]. Hence, the bit-level systolic architecture must be implemented with fixed-point arithmetic.
The QR decomposition (QRD) approaches to the RLS problem have played an important role in adaptive signal processing, adaptive equalization, and adaptive spectrum estimation [16]. It is generally agreed that QRD-RLS algorithms are among the most promising RLS algorithms, due to their numerical stability [17, 18] and suitability for VLSI implementation [19, 20]. There are three approaches to the QRD-RLS problem, namely, the Givens rotation (GR), modified Gram-Schmidt (MGS), and Householder transformation (HT) methods. These methods have been successfully applied to the development of QRD-RLS systolic arrays [16, 21–24]. Because HT generally outperforms the GR and MGS methods under finite precision computations (see the references in [16]), and because in our application the channel needs to be updated for each block input data matrix, we focus our attention on the QRD-RLS algorithm based on block HT. Notice that HT is a well-known rank-v update approach and is one of the most efficient methods to compute the QRD. (Rank-1 updating fast QRD-RLS algorithms [14], in which the QRD is updated after the original data matrix has been modified by the addition or deletion of a row or column, are not suitable here, in particular because of the high throughput and speed requirements; here the term throughput indicates the total number of data vectors at the input of the RLS algorithm.) In [24, 25], Liu et al. investigated one such QRD-RLS algorithm using block HT. The work in [24] describes the block HT implementation on a systolic array and its application to the RLS algorithm, called systolic block HT-RLS (SBHT-RLS). So far, SBHT-RLS has been used in beamforming and linear prediction applications but has not been applied to channel tracking in high-throughput multicarrier applications. The algorithm is well known for its computational efficiency, very good numerical properties, and parallel processing implementation advantages.
In this paper, we develop two new CP exploiting SBHT-RLS approaches for adaptive channel estimation in multicarrier systems.
The first approach is based on the SBHT-RLS approach of Liu et al. In its original form, SBHT-RLS does not provide access to the channel weights, as its use has been limited to problems seeking an estimate of the output error signal. In the context of our application, the proposed approach finds the channel explicitly. To differentiate between the two techniques, the new method will be referred to as the CP-based Direct SBHT-RLS approach. The proposed scheme is computationally efficient and can be mapped to triangular systolic arrays for efficient parallel implementation. Unfortunately, the scheme suffers from a major drawback, namely, back substitution, which is a costly operation to perform in an array structure [26, 27].
The second approach relies on inverse factorizations to calculate least squares channel coefficients (weight vector) without back substitution. This approach also employs SBHT to recursively update the channel coefficients and thus preserves the inherent stability property of SBHT-RLS approach. The derivation of the inverse factorization method in this paper is done by generalizing the Extended QRD-RLS algorithm to block RLS case [28]. For this reason, this method will be referred to as CP-based Extended SBHT-RLS approach. We underscore here that this simple and straightforward derivation is different than the previous challenging work on block RLS using inverse factorizations in [29, 30]. Computational complexity of this scheme is equivalent to the first proposed scheme, but unlike the first scheme it is fully amenable to VLSI implementation and also results in improved steady-state performance.
For the sake of brevity, in the rest of this paper, we refer to the CP based CMB-RLS as CPE1, Direct SBHT-RLS as CPE2, and Extended SBHT-RLS as CPE3. Also, for uniformity, we closely follow the notation that appears in [10].
The paper is organized as follows. In the next section, we provide an overview of the DMT system model [10]. Section 3 explains the newly proposed algorithms, followed by a discussion of their computational complexity and systolic array implementation in Section 4. In Section 5, illustrative floating-point and fixed-point simulations are presented, and conclusions are drawn in Section 6. Some results contained in this paper have been presented or accepted for presentation in [31, 32].
Notation 1.
$(\cdot)^T$, $(\cdot)^*$, and $E[\cdot]$ denote the transpose, complex conjugate, and expectation operations, respectively. The Matlab notation $X(:,m:m')$ is used to denote the submatrix of $X$ that contains columns $m$ to $m'$. $x(n:n')$ denotes the subvector of $x$ comprising entries $n$ through $n'$. $I_n$ denotes the identity matrix of size $n$, $0$ denotes the all-zeros matrix of appropriate dimensions, and $j=\sqrt{-1}$. The meaning of other variables will be clear from the context.
2. System Model
We consider a high-speed DMT data transmission system over digital subscriber lines, shown in Figure 1. The system has $m/2$ complex parallel subchannels and illustrates the typical CP-based adaptive channel estimation task, which is our main concern in this paper. Let $\{s_n\}$ represent the data sequence to be transmitted over the channel. This input data is buffered into blocks, and each data block is divided into $m/2$ bit streams and then mapped to quadrature amplitude modulation (QAM) constellation points $X_{i,k}$, $i=0,\ldots,m/2-1$, at time $k$. After an $m$-point inverse fast Fourier transform (IFFT) on the $k$th DMT block $X_k=[X_{0,k},X_{1,k},\ldots,X_{m-1,k}]^T$ (here the last $m/2$ samples are just the conjugates of the first $m/2$ samples), the modulated real-valued time-domain signal is $x_k=[x_{0,k},x_{1,k},\ldots,x_{m-1,k}]^T$. A CP $x_k^{(f)}=[\bar{x}_{m-v,k},\ldots,\bar{x}_{m-1,k}]^T$, where $x_{-i,k}=x_{m-i,k}$ and $i=1,\ldots,v$, is then prepended to $x_k$ before transmission through the channel $H(z)=\sum_{l=0}^{v}h_{l,k}z^{-l}$, having impulse response $h_k=[h_{0,k},h_{1,k},\ldots,h_{v,k}]^T$ of length $r=v+1$. At the receiver, the prefix part $y_k^{(f)}=[y_{-v,k},\ldots,y_{-1,k}]^T$ is removed.
Figure 1: Multicarrier system with CP-based adaptive channel estimation.
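The transmitter side of this model can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the paper's implementation: the function name `dmt_modulate` is hypothetical, and the sketch assumes that the DC and Nyquist bins carry no data, so that conjugate symmetry of the frequency-domain block gives a real-valued time-domain block.

```python
import numpy as np

def dmt_modulate(qam, m, v):
    """Build one real DMT block of length m and prepend a length-v cyclic prefix.

    qam holds m/2 - 1 complex symbols for bins 1 .. m/2 - 1 (assumption:
    bins 0 and m/2 are left empty).
    """
    X = np.zeros(m, dtype=complex)
    X[1:m // 2] = qam                     # data-carrying bins
    X[m // 2 + 1:] = np.conj(qam[::-1])   # mirrored conjugates: X[m - i] = conj(X[i])
    x = np.fft.ifft(X).real               # imaginary part is zero up to rounding
    cp = x[-v:]                           # x_{-i} = x_{m-i}, i = 1 .. v
    return X, x, np.concatenate([cp, x])
```

The returned transmit vector has length m + v, and its first v samples repeat the last v samples of the block, which is the redundancy the channel estimator later exploits.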
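The relationship between the prefix part $y_k^{(f)}$ and the transmitted signal may be expressed as [10]
$$y_k^{(f)}=A_kh_k+n_k^{(f)}, \tag{1}$$
where
$$A_k=\begin{bmatrix}x_{-v,k}&x_{m-1,k-1}&\cdots&x_{m-v,k-1}\\ \vdots&\ddots&\ddots&\vdots\\ x_{-1,k}&\cdots&x_{-v,k}&x_{m-1,k-1}\end{bmatrix}=[a_{0,k},a_{1,k},\ldots,a_{v,k}], \tag{2}$$
$a_{j,k}$ is the $j$th column of $A_k$, $n_k^{(f)}=[n_{-v,k},\ldots,n_{-1,k}]^T$, and $n_{i,k}\sim\mathcal{N}(0,\sigma^2)$ is the channel noise.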
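To make the structure of $A_k$ concrete, the following sketch builds the $v\times(v+1)$ matrix from the tail of the previous block and the current cyclic prefix. The function name `build_cp_matrix` and the argument layout are assumptions made for illustration; the only property relied on is that row $i$ of $A_k h_k$ is the linear convolution sample that produces $y_{-v+i,k}$.

```python
import numpy as np

def build_cp_matrix(prev_tail, cp):
    """Return the v x (v+1) Toeplitz matrix A_k.

    prev_tail: [x_{m-v,k-1}, ..., x_{m-1,k-1}]  (last v samples of block k-1)
    cp:        [x_{-v,k},   ..., x_{-1,k}]      (cyclic prefix of block k)
    """
    prev_tail = np.asarray(prev_tail, dtype=float)
    cp = np.asarray(cp, dtype=float)
    v = len(cp)
    seq = np.concatenate([prev_tail, cp])   # seq[v + i] holds x_{-v+i,k}
    A = np.empty((v, v + 1))
    for i in range(v):
        for j in range(v + 1):
            A[i, j] = seq[v + i - j]        # sample multiplied by tap h_j
    return A
```

With this layout the first row starts with $x_{-v,k}$ followed by $x_{m-1,k-1}$, and the last row ends with $x_{-v,k}$ and $x_{m-1,k-1}$, matching the corner entries of the matrix above.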
After the FFT operation on $y_k=[y_{0,k},y_{1,k},\ldots,y_{m-1,k}]^T$, the demodulated signal is $Y_k=[Y_{0,k},Y_{1,k},\ldots,Y_{m-1,k}]^T$. The CP removes interblock interference (IBI) between the $X_k$'s. The received symbols can thus be written as
$$Y_{i,k}=X_{i,k}\mathcal{H}_{i,k}+N_{i,k},\quad i=0,\ldots,m-1, \tag{3}$$
where $\mathcal{H}_{i,k}=\sum_{l=0}^{v}h_{l,k}e^{-j(2\pi il)/m}$ is the channel frequency response and $N_{i,k}=(1/m)\sum_{l=0}^{m-1}n_{l,k}e^{-j(2\pi il)/m}\sim\mathcal{N}(0,\sigma^2)$ is the noise of the $i$th subchannel.
To estimate $X_{i,k}$ from $Y_{i,k}$, a one-tap minimum mean square error (MMSE) equalizer $w_{i,k}=(\Gamma_i^{1/2}\mathcal{H}_{i,k}^*)/(\Gamma_i\|\mathcal{H}_{i,k}\|^2+\sigma_i^2)$, where $i=0,\ldots,m-1$ and $\Gamma_i=E[\|X_{i,k}\|^2]$, is employed on the $i$th subchannel. The estimated data is then $\hat{X}_{i,k}=Y_{i,k}w_{i,k}$. A decision is then made on $\hat{X}_{i,k}$ to obtain the final output $\bar{X}_{i,k}=q(\hat{X}_{i,k})$, where $q(\cdot)$ is the decision operation.
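The equalize-and-decide step can be sketched as follows. The helper names are hypothetical, and the decision operation is written for a unit-energy 4-QAM alphabet, which is an assumption made here for illustration (the paper's $q(\cdot)$ applies to whatever constellation is in use).

```python
import numpy as np

def channel_frequency_response(h, m):
    """H_i = sum_l h_l exp(-j 2 pi i l / m), i = 0 .. m-1 (zero-padded FFT)."""
    return np.fft.fft(h, n=m)

def mmse_one_tap(H, gamma, sigma2):
    """One-tap MMSE coefficients as written in the text."""
    return np.sqrt(gamma) * np.conj(H) / (gamma * np.abs(H) ** 2 + sigma2)

def qam4_decision(x_hat):
    """Project soft estimates onto a unit-energy 4-QAM alphabet (assumed)."""
    return (np.sign(x_hat.real) + 1j * np.sign(x_hat.imag)) / np.sqrt(2)
```

In a decision-directed loop, `channel_frequency_response` would be evaluated on the previous channel estimate, and the decisions returned by `qam4_decision` would serve as the known symbols for the next channel update.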
3. CP-Based SBHT-RLS Algorithms
3.1. CP-Based Direct SBHT-RLS Algorithm (CPE2)
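Based on the CP data model (1), we define the $nv\times r$ weighted data matrix and the $nv\times1$ weighted received vector in a recursive manner as
$$\dot{A}_k=\Lambda\begin{bmatrix}A_{k-(n-1)}\\A_{k-(n-2)}\\\vdots\\A_k\end{bmatrix}=\begin{bmatrix}\lambda^{1/2}\dot{A}_{k-1}\\A_k\end{bmatrix},\qquad \dot{y}_k^{(f)}=\Lambda\begin{bmatrix}y_{k-(n-1)}^{(f)}\\y_{k-(n-2)}^{(f)}\\\vdots\\y_k^{(f)}\end{bmatrix}=\begin{bmatrix}\lambda^{1/2}\dot{y}_{k-1}^{(f)}\\y_k^{(f)}\end{bmatrix}, \tag{4}$$
where $\Lambda$ is an $nv\times nv$ block-diagonal forgetting matrix of the form
$$\Lambda=\begin{bmatrix}\lambda^{(n-1)/2}I_v&\cdots&0&0\\\vdots&\ddots&\vdots&\vdots\\0&\cdots&\lambda^{1/2}I_v&0\\0&\cdots&0&I_v\end{bmatrix}, \tag{5}$$
with forgetting factor across blocks $0<\lambda\le1$. The forgetting factor $\lambda$ is incorporated in the scheme to avoid overflow in the processors as well as to facilitate nonstationary data updating [25].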
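The equivalence between the block-diagonal weighting and the recursive stacking can be checked numerically. The sketch below is illustrative only; `forgetting_matrix` is a hypothetical helper that builds $\Lambda$ for $n$ blocks of $v$ rows each.

```python
import numpy as np

def forgetting_matrix(lam, n, v):
    """Block-diagonal Lambda with blocks lam^((n-1)/2) I_v, ..., lam^(1/2) I_v, I_v."""
    weights = lam ** (np.arange(n - 1, -1, -1) / 2.0)
    return np.kron(np.diag(weights), np.eye(v))

def weighted_stack(blocks, lam):
    """Apply Lambda to the vertically stacked blocks (oldest block first)."""
    v = blocks[0].shape[0]
    return forgetting_matrix(lam, len(blocks), v) @ np.vstack(blocks)
```

Calling `weighted_stack` on $n$ blocks gives the same matrix as scaling the weighted stack of the first $n-1$ blocks by $\lambda^{1/2}$ and appending the newest block unweighted, which is the recursion used throughout this section.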
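Suppose that at the $(k-1)$th update we have the QRD
$$Q_{k-1}\dot{A}_{k-1}=\begin{bmatrix}R_{k-1}\\0\end{bmatrix}, \tag{6}$$
where $Q_{k-1}$ is an $(n-1)v\times(n-1)v$ orthogonal matrix and $R_{k-1}$ is an $r\times r$ upper triangular matrix.
Now, by denoting
$$\bar{Q}_{k-1}=\begin{bmatrix}Q_{k-1}&0\\0^T&I_v\end{bmatrix}, \tag{7}$$
we then have
$$\bar{Q}_{k-1}\dot{A}_k=\begin{bmatrix}\lambda^{1/2}R_{k-1}\\0\\A_k\end{bmatrix}. \tag{8}$$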
An $n\times n$ HT matrix $T$ is of the form $T=I_n-\beta vv^T$, where $\beta=2/(v^Tv)=2/\|v\|^2$. When a vector $x=[x_1,x_2,\ldots,x_n]^T$ is multiplied by $T$, it is reflected in the hyperplane defined by $\mathrm{span}\{v\}^{\perp}$. Choosing $v=x\pm\|x\|_2e_1$, where $e_1=[1,0,0,\ldots,0]^T$, $x$ is reflected onto $e_1$ by $T$ as $Tx=\mp\|x\|_2e_1$.
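A single Householder reflection can be sketched as follows. The function name is hypothetical, the input is assumed to be a nonzero real vector, and the sign is chosen to match the sign of the first entry, which is the usual way to avoid cancellation. Note that the vector `v` in the code is the Householder vector, not the CP length.

```python
import numpy as np

def householder(x):
    """Return (v, beta) with (I - beta v v^T) x = -sign(x[0]) * ||x||_2 * e_1.

    Assumes x is a nonzero real vector.
    """
    x = np.asarray(x, dtype=float)
    sign = 1.0 if x[0] >= 0 else -1.0
    v = x.copy()
    v[0] += sign * np.linalg.norm(x)   # v = x + sign * ||x||_2 * e_1
    beta = 2.0 / (v @ v)
    return v, beta

def apply_householder(v, beta, x):
    """Compute T x without forming T explicitly."""
    x = np.asarray(x, dtype=float)
    return x - beta * v * (v @ x)
```

For example, with $x=[3,4,0]^T$ the Householder vector is $[8,4,0]^T$ with $\beta=1/40$, and the reflected vector is $[-5,0,0]^T$.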
A series of HTs is then used to zero out $A_k$ on the right-hand side of (8). Let $H_k=H_k^{(r)}H_k^{(r-1)}\cdots H_k^{(1)}$ (a sequence of $r$ ordered matrix multiplications), where $H_k^{(i)}$ denotes the $i$th HT matrix (which zeroes out the $i$th column of the updated $A_k$), given as
$$H_k^{(i)}=\begin{bmatrix}H_{k,11}^{(i)}&0&H_{k,12}^{(i)}\\0&I_{(n-1)v-r}&0\\H_{k,21}^{(i)}&0&H_{k,22}^{(i)}\end{bmatrix}, \tag{9}$$
where $H_{k,11}^{(i)}$ is the $r\times r$ identity matrix except for the $i$th diagonal entry, $H_{k,12}^{(i)}$ is the $r\times v$ zero matrix except for the $i$th row, $H_{k,21}^{(i)}=(H_{k,12}^{(i)})^T$, and $H_{k,22}^{(i)}$ is a symmetric $v\times v$ matrix.
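We thus have $H_k\bar{Q}_{k-1}\dot{A}_k=\begin{bmatrix}R_k\\0\end{bmatrix}$ and $Q_k=H_k\bar{Q}_{k-1}$. Now with
$$Q_k\begin{bmatrix}\dot{A}_k&\dot{y}_k^{(f)}\end{bmatrix}=\begin{bmatrix}R_k&u_k\\0&v_k\end{bmatrix}, \tag{10}$$
where $u_k=[u_{0,k},u_{1,k},\ldots,u_{v,k}]^T$ and
$$R_k=\begin{pmatrix}r_{(0,0),k}&r_{(0,1),k}&\cdots&r_{(0,v),k}\\0&r_{(1,1),k}&\cdots&r_{(1,v),k}\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&r_{(v,v),k}\end{pmatrix}, \tag{11}$$
the optimal solution is obtained by solving the upper triangular system $R_k\hat{h}_k=u_k$ by back substitution as follows:
$$\hat{h}_{i,k}=\frac{u_{i,k}-\sum_{j=i+1}^{v}r_{(i,j),k}\hat{h}_{j,k}}{r_{(i,i),k}},\quad i=v,\ldots,0. \tag{12}$$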
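The back substitution recursion can be written directly as a loop that runs from the last coefficient to the first. This is a plain sequential sketch (the function name is hypothetical) and is not meant to represent the systolic mapping; it assumes a real upper triangular matrix with nonzero diagonal entries.

```python
import numpy as np

def back_substitute(R, u):
    """Solve R h = u for upper triangular R, computing h[r-1] first and h[0] last."""
    r = len(u)
    h = np.zeros(r)
    for i in range(r - 1, -1, -1):
        # Subtract the contribution of the already computed coefficients.
        h[i] = (u[i] - R[i, i + 1:] @ h[i + 1:]) / R[i, i]
    return h
```

Each coefficient depends on all coefficients computed before it, which is why the operation is inherently sequential.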
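The matrix $\dot{A}_{k-1}$ can be uniquely QR factorized only if it has full column rank (i.e., $\mathrm{rank}\,\dot{A}_{k-1}=r$). Therefore, the number of rows in $\dot{A}_{k-1}$ must be at least as large as the number of columns. To satisfy this requirement and thus to reduce the number of received blocks needed by CPE2 (and CPE3 in Section 3.2), in step (10), we set
$$\dot{A}_k=\begin{bmatrix}\lambda^{1/2}R_{k-1}\\A_k\end{bmatrix},\qquad \dot{y}_k^{(f)}=\Lambda\begin{bmatrix}y_{k-2}^{(f)}(v)=y_{-1,k-2}\\y_{k-1}^{(f)}\\y_k^{(f)}\end{bmatrix}=\Lambda\begin{bmatrix}\dot{y}_{k-1}^{(f)}\\y_k^{(f)}\end{bmatrix}. \tag{13}$$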
Based on the above discussion, CPE2 algorithm is summarized in Table 1.
Table 1: CP-based Direct SBHT-RLS algorithm (CPE2).
Input: $y_k^{(f)}$, $y_{k-1}^{(f)}$, …, $y_{k-2}^{(f)}(v)$ and $Y_k$
Known parameters: $\Gamma_i$ and $\sigma_i$
Selecting parameters: $\lambda$ (with $0<\lambda\le1$)
Initialization: $k=0$; an initial training process is used to initialize $\hat{h}_0$ and $R_0=\delta I$ (where $0<\delta\ll1$ is a small positive scalar).
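One channel estimation update of the direct approach can be sketched as below. Several things here are assumptions made for illustration: the function name, the use of `np.linalg.qr` (a LAPACK Householder QR) as a stand-in for the systolic block HT, the use of the rotated vector $u_{k-1}$ as the stored upper part of the right-hand side, and `np.linalg.solve` in place of an explicit back substitution.

```python
import numpy as np

def direct_block_update(R_prev, u_prev, A_k, y_k, lam):
    """Triangularize [sqrt(lam) R_prev; A_k] and solve R_k h_k = u_k."""
    r = R_prev.shape[0]
    s = np.sqrt(lam)
    stacked = np.vstack([s * R_prev, A_k])          # (r + v) x r data matrix
    rhs = np.concatenate([s * u_prev, y_k])         # matching right-hand side
    Q, R_full = np.linalg.qr(stacked, mode="complete")
    rotated = Q.T @ rhs                             # apply the same orthogonal transform
    R_k, u_k = R_full[:r], rotated[:r]
    h_k = np.linalg.solve(R_k, u_k)                 # stands in for back substitution
    return R_k, u_k, h_k
```

The returned `h_k` is the least-squares solution of the stacked system, so only the $r\times r$ factor and the length-$r$ vector need to be stored between blocks.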
3.2. CP-Based Extended SBHT-RLS Algorithm (CPE3)
In this section, we propose an alternative approach by appending one more column to the matrices of the CPE2 algorithm. To simplify the derivation, we combine the first column of (10) and the new column to construct the formula
$$Q_k\begin{bmatrix}\sqrt{\lambda}R_{k-1}&R_{k-1}^{-T}/\sqrt{\lambda}\\A_k&0^T\end{bmatrix}=\begin{bmatrix}R_k&R_k^{-T}\\0&W_k^T\end{bmatrix}. \tag{14}$$
We next state a lemma, known as the matrix factorization lemma [33], which is a very elegant tool in the development of QRD-RLS algorithms.
Lemma 1.
If $A$ and $B$ are any two $N\times M$ $(N\le M)$ matrices, then
$$A^TA=B^TB \tag{15}$$
if and only if there exists an $N\times N$ unitary matrix $Q$ $(Q^TQ=I)$ such that
$$QA=B. \tag{16}$$
Applying Lemma 1 to (14), we obtain
$$R_k^{-T}R_k^T=R_{k-1}^{-T}R_{k-1}^T=I_r. \tag{17}$$
This shows that the block $R_k^{-T}$ obtained is indeed the inverse of $R_k^T$ and can be updated by using the same orthonormal transformation $Q_k$.
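The intermediate step can be spelled out as follows. Lemma 1 implies that the product of the first block column (transposed) with the second block column is the same before and after the transformation $Q_k$. Writing $X$ for the upper-right block of the transformed matrix, and using the weight $\sqrt{\lambda}$ from (13), a sketch of the calculation is:

```latex
\begin{aligned}
\text{before: }\;& \bigl(\sqrt{\lambda}\,R_{k-1}\bigr)^{T}\,\frac{R_{k-1}^{-T}}{\sqrt{\lambda}} + A_k^{T}\,0
   \;=\; R_{k-1}^{T}R_{k-1}^{-T} \;=\; I_r,\\
\text{after: }\;& R_k^{T}\,X + 0^{T}\,(\cdot) \;=\; R_k^{T}X,\\
\Rightarrow\;& R_k^{T}X = I_r \quad\Longrightarrow\quad X = R_k^{-T}.
\end{aligned}
```

The weight on $R_{k-1}$ and its reciprocal on $R_{k-1}^{-T}$ cancel, and the lower-right block does not enter because it is multiplied by the zero block below $R_k$.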
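Next, we combine the second column of (10) and the new column to construct the formula
$$Q_k\begin{bmatrix}\sqrt{\lambda}\,\dot{y}_{k-1}^{(f)}&R_{k-1}^{-T}/\sqrt{\lambda}\\y_k^{(f)}&0^T\end{bmatrix}=\begin{bmatrix}u_k&R_k^{-T}\\v_k&W_k^T\end{bmatrix}. \tag{18}$$
Applying Lemma 1 to (18) yields
$$R_k^{-1}u_k+W_kv_k=R_{k-1}^{-1}u_{k-1}. \tag{19}$$
Since the least-squares solution satisfies $R_kh_k=u_k$, (19) gives a simple recursion to compute the channel vector:
$$h_k=h_{k-1}-W_kv_k. \tag{20}$$
This recursion can be written in component form as
$$h_{i,k}=h_{i,k-1}-w_{i,k}^Tv_k,\quad i=0,\ldots,v, \tag{21}$$
where $w_{i,k}$ is the $i$th column of the matrix $W_k^T$.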
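The inverse-factorization update can be checked numerically with a short sketch. As before, this is an illustration under assumptions rather than the paper's implementation: the function name is hypothetical, `np.linalg.qr` stands in for the block HT, the weight is taken as $\sqrt{\lambda}$, and the upper entry of the received-signal column is taken to be the rotated vector $u_{k-1}$ so that $h_{k-1}=R_{k-1}^{-1}u_{k-1}$ holds.

```python
import numpy as np

def extended_block_update(R_prev, Rinv_T_prev, u_prev, h_prev, A_k, y_k, lam):
    """Update R_k, R_k^{-T}, u_k and the channel h_k without back substitution."""
    r, v = R_prev.shape[0], A_k.shape[0]
    s = np.sqrt(lam)
    data = np.vstack([s * R_prev, A_k])                    # data block column
    rhs = np.concatenate([s * u_prev, y_k])                # received-signal column
    aug = np.vstack([Rinv_T_prev / s, np.zeros((v, r))])   # appended block column
    Q, R_full = np.linalg.qr(data, mode="complete")
    rot_rhs = Q.T @ rhs
    rot_aug = Q.T @ aug
    R_k = R_full[:r]
    u_k, v_k = rot_rhs[:r], rot_rhs[r:]
    Rinv_T_k, W_T = rot_aug[:r], rot_aug[r:]               # W_T has shape v x r
    h_k = h_prev - W_T.T @ v_k                             # channel recursion
    return R_k, Rinv_T_k, u_k, h_k
```

In this sketch the channel update needs only a matrix-vector product with the lower block of the rotated appended column, so no triangular system has to be solved at any step.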
Based on the above discussion, CPE3 is formulated in Table 2.
Table 2: CP-based Extended SBHT-RLS algorithm (CPE3).
Input: $y_k^{(f)}$, $y_{k-1}^{(f)}$, …, $y_{k-2}^{(f)}(v)$ and $Y_k$
Known parameters: $\Gamma_i$ and $\sigma_i$
Selecting parameters: $\lambda$ (with $0<\lambda\le1$)
Initialization: $k=0$; an initial training process is used to initialize $\hat{h}_0$, $R_0=\delta I$,
(i) Both algorithms are initialized in a training mode and then switch to a decision-directed mode for channel tracking. In step (1), the previous frequency response $\tilde{\mathcal{H}}_{k-1}=[\mathcal{H}_{0,k-1},\ldots,\mathcal{H}_{m-1,k-1}]$ is computed from the previous channel estimate $\hat{h}_{k-1}$. In step (2), $\tilde{\mathcal{H}}_{k-1}$ is used to compute the equalization coefficients. The decision-directed data vector $\hat{X}_k$ is then computed in step (3). In step (4), the symbol estimates are projected onto the finite alphabet (FA), and the estimated transmitted CP data $\bar{x}_k^{(f)}$ is obtained by performing a partial FFT on the decision-directed projected samples $\bar{X}_k$. In steps (5) through (7), the new channel estimate is obtained by treating the resulting symbol estimates as known symbols. The process of alternating between the channel and symbol estimation steps is applied repeatedly.
(ii) In [29], Sakai derived a method for extracting weight coefficients based on the inverse factorization method of Pan and Plemmons [34] and Liu's SBHT-RLS algorithm. The time updating formula for the channel coefficients is obtained by first generalizing the inverse factorizations to the block case and then deriving a formula for updating the channel coefficients. The complicated and challenging derivation gets rid of matrix operations by exploiting the relation between the a priori and a posteriori error vectors. Based on [35] and as suggested by its author, Sakai also presented a simpler derivation for updating the channel vector in [30]. In contrast, in the above discussion, the same result is derived in a straightforward way by generalizing the Extended QRD-RLS algorithm of Yang and Bohme [28] to the block RLS case.
4. Computational Complexity and Systolic Array Implementation
4.1. Computational Complexity
The CPE1, CPE2, and CPE3 algorithms are similar in the CP estimation part (i.e., steps (1) through (4)); we therefore compare their complexities in the channel estimation part. The CPE1 channel estimation stage requires $\mathcal{O}(r^3)$ computations to update $h$. In contrast, because CPE2 and CPE3 involve no matrix inversion, their channel estimation parts can be implemented with $\mathcal{O}(r^2)$ operations per time update. This indicates that the proposed algorithms are computationally superior to CPE1.
4.2. Systolic Array Implementation
The detection part of both proposed algorithms (comprising steps (1) through (4)) is particularly simple, and many efficient systolic array architectures have been proposed for it. We therefore limit our discussion to possible implementation architectures for the channel estimation part of the proposed algorithms.
The systolic array implementation of the channel estimation section of CPE2 and its processing cells are shown in Figure 2, where the adaptive filtering triangular update part (comprising step (6)) is realized on a triangular vectorial systolic array as in [24] for $R_k$ and $u_k$ extraction. It consists of two sections: the upper triangular array (part (a) of Figure 2), which stores and updates $R_k$, and the right-hand column of cells (part (b) of Figure 2), which stores and updates $u_k$. The input data are fed from the top and propagate to the bottom of the array. The rotation angles are calculated in the left boundary cells and propagate from left to right. The resulting $R_k$ and $u_k$ updates in step (6) are subsequently used in the linear bidirectional systolic array section [19] (part (c) of Figure 2) to obtain the channel estimate by back substitution. Unfortunately, a critical obstruction appears: the triangular update runs from the upper-left corner to the lower-right corner of the array, while back substitution runs in exactly the opposite direction. Pipelining the two steps (the triangular update and back substitution) therefore seems impossible on a triangular array. Back substitution may be implemented as a separate operation on a parallel two-dimensional array [36]. Nevertheless, the two-dimensional array can become quite large for long channel lengths, requiring a substantial area for VLSI implementation. On the other hand, the comparatively simpler linear array structure shown in Figure 2 is highly sequential and thus incurs more delay because of the additional clock cycles needed to compute the channel coefficients. For these reasons, back substitution in CPE2 is a costly operation to perform.
Figure 2: Systolic array implementation of the channel estimation section of CPE2 (using Householder transformations) with processing cell descriptions.
The CPE3 approach involves a time-recursive QR solution to compute the channel vector $h$. The channel estimation part of the CPE3 algorithm can be implemented by a fully pipelined rhombic systolic array obtained by combining a lower triangular array with an upper triangular array. This implementation was presented by Sakai in [29, 30] and is reproduced in Figure 3. The components of $R_k$ are updated in the upper triangular part (a) of Figure 3. The components of $u_k$ are updated in part (b) in the same fashion as the off-diagonal components of $R_k$, with the input data $y_k$ entering from the top of this column and the output $v_k$ leaving from the bottom. The systolic implementation in the upper section of part (c) is similar to that in part (a), except that the array is now lower triangular, each element is divided by $\beta=\sqrt{\lambda}$ before updating, and the input to the array is provided from the top in the form of a zero vector. A systolic array performing (20) is shown in the lower portion of part (c) of Figure 3, where the cells in the bottom line, shown by small circles, perform (20) to calculate the tap coefficients. Each column of the lower triangular array, whose cells are shown by diamonds, performs the $R_k^{-T}$ update. The cells also calculate each column of $U_k^T$, which appears from the last diamond cell. Because back substitution is absent, the CPE3 algorithm is rich in parallel operations and therefore leads to a simpler and more efficient implementation on systolic processors.
Figure 3: Systolic array implementation of the channel estimation section of CPE3 (using Householder transformations) with processing cell descriptions.
5. Simulation Results
In this section, floating-point and fixed-point simulation results are presented to examine and compare the performance of the CPE1, CPE2, and CPE3 approaches. All simulations were carried out in a typical asymmetric digital subscriber line (ADSL) environment with perfect block synchronization, FFT size $m=512$, CP length $v=32$, $\lambda=0.75$, $\delta=10^{-3}$, and a 4-QAM constellation for modulation, unless otherwise stated. For a fair comparison, for CPE1 we set the forgetting factor across blocks $\mu_1=\lambda$ and the forgetting factor within blocks $\mu_2=1$. The mismatch performance is evaluated by the averaged mean-square error (MSE) per subchannel, $\mathrm{err}=\sum_{i\in U}\|X_i-\hat{X}_i\|/|U|$, where $U$ is the set of indexes corresponding to the used subchannels and $|U|$ is the number of used subchannels [10]. The transmit power of all used subchannels is the same (i.e., $\sigma_i^2=\sigma^2$), and the noise power was set such that SNR = 30 dB (a typical value of SNR in ADSL environments).
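The performance measure and the noise setting can be expressed compactly. The helper names below are hypothetical, and the error measure follows the formula exactly as written in the text (a sum of norms divided by the number of used subchannels); the noise variance helper assumes the SNR is defined as signal power over noise power.

```python
import numpy as np

def avg_error_per_subchannel(X, X_hat, used):
    """err = sum_{i in U} |X_i - X_hat_i| / |U|, following the formula in the text."""
    used = np.asarray(used)
    return np.sum(np.abs(X[used] - X_hat[used])) / used.size

def noise_variance(signal_power, snr_db):
    """Noise variance that yields the requested SNR in dB (assumed definition)."""
    return signal_power / 10.0 ** (snr_db / 10.0)
```

With unit signal power, an SNR of 30 dB corresponds to a noise variance of $10^{-3}$.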
The discrete channel impulse response with transfer function H0(D) for carrier service loop area (CSA) loop # 1 was obtained from the Matlab DMTTEQ Toolbox [37] and sampled at 2.208 MHz. For simulation purposes, the shorter channel was generated by subsampling. H0(D) was perturbed to obtain another test channel H1(D) (to mimic small variation in H0(D)). Corresponding frequency responses for the two test channels are shown in Figure 4. Initially, the channel transfer function is H0(D), which remains unchanged for the first 400 data blocks. At data block 401, the channel is switched from H0(D) to H1(D). For all adaptive schemes, only the first DMT symbol was sent as pure training sequence to identify the initial channel for fast convergence. Also, the inverse of the correlation matrix in CPE1 is initialized to a constant multiple of the identity matrix.
Figure 4: Frequency responses for channels $H_0(D)$ and $H_1(D)$.
Example 1.
Figure 5 shows typical learning curves of the three algorithms, with forgetting factor $\lambda$ values of 0.75 (top plots) and 0.55 (bottom plots), under a double-precision floating-point implementation (using the IEEE standard for floating-point arithmetic, IEEE 754). It can be seen that all the schemes converge and can track the channel variation. The learning curves of CPE1 and CPE2 overlap, and both algorithms converge faster than CPE3. Compared to CPE3, the two algorithms also show more uneven performance. In contrast, although CPE3 converges more slowly, it demonstrates superior steady-state performance. A close examination of the CPE2 algorithm shows that the back substitution operation involves decision-feedback computation of the channel coefficients. If a channel coefficient is in error, this error weighs heavily on the estimation of the next and subsequent channel coefficients. The erroneous channel estimate causes the next detection error. This decision error propagates further and causes subsequent decision errors. Consequently, CPE2 suffers a performance loss. In contrast, the channel is recursively updated without back substitution in CPE3, which is therefore seen to yield better performance.
A close observation of the top and bottom plots of Figure 5 also indicates that the convergence rate and steady-state performance of the three algorithms can be improved by lowering the value of $\lambda$. The price paid is more uneven performance. This can be reduced, and numerical stability thereby improved, by increasing the data block size (i.e., by increasing the CP length), at the cost of increased system latency.
Figure 5: Effect of the forgetting factor $\lambda$ on the performance curves of the CPE1, CPE2, and CPE3 algorithms when tracking channels $H_0(D)$ and $H_1(D)$: $\lambda=0.75$ (top plots) and $\lambda=0.55$ (bottom plots).
Example 2.
Without giving a rigorous stability analysis, we verify the stability of the CPE1, CPE2, and CPE3 algorithms experimentally through a long simulation with $5\times10^3$ data blocks (a considerably large number of samples). The corresponding results in Figure 6 show that the three algorithms do not show any sign of divergence and have very stable performance.
Figure 6: Stability performance comparison with $5\times10^3$ data blocks ($\lambda=0.75$ and 4-QAM).
Example 3.
The more complex the modulation alphabet, the narrower the gap between the symbol decision regions and the higher the probability of error in detecting the signal [38]. Since the three algorithms rely on the FA property of the source symbols, significant performance degradation is expected as the constellation size increases. The three algorithms may therefore not be suitable for rate adaptation. To verify this, in this simulation example we repeat Example 1 with 16-QAM and 64-QAM constellation sizes. The corresponding simulation results in Figure 7 show that the three algorithms take the same number of data blocks to converge. However, as expected, their performance degrades when the constellation size is increased.
Figure 7: Effect of modulation constellation size on algorithm performance curves (16-QAM (top plots), 64-QAM (bottom plots), $\lambda=0.75$).
Example 4.
In this example, motivated by their inherent parallelism and thus their suitability for fixed-point VLSI implementation, we examine the fixed-point performance of the CPE2 and CPE3 algorithms with word lengths (WLs) of 16, 24, and 32 bits for both data and channel coefficients. These WLs are selected as a reasonable approximation, as they are suitable for many applications. For the fixed-point simulations, routines were developed in Matlab to mimic the operations of fixed-point arithmetic, and all quantities in the algorithms are represented with a finite number of bits. The fixed-point representation requires WL = ($l_i$ bits for the integer part) + ($l$ bits for the fractional part) + (1 bit for the sign). For a real number $x$, its quantized value $x_q$ is obtained as follows. With WL bits, the largest integers that can be represented are $\pm(2^{\mathrm{WL}-1}-1)$. When the value of $x$ falls outside the interval $[-(2^{\mathrm{WL}-1}-1),\,+(2^{\mathrm{WL}-1}-1)]$, saturation occurs, and $x_q$ is taken as one of the boundary values, $+(2^{\mathrm{WL}-1}-1)$ or $-(2^{\mathrm{WL}-1}-1)$. On the other hand, if $x$ lies within this interval, the $l_i$ bits are computed to represent the integer part of $x$, and the remaining $l$ bits are used to represent the fractional part of $x$. It is important to note that, for the above choice of WLs, the thresholds are sufficiently larger than the signal values involved in both algorithms. The quantizer is therefore always expected to operate on values that are much smaller than the boundary values, and no saturation errors are expected. The only errors introduced by the finite precision approximation are round-off errors.
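A quantizer of this kind can be sketched as follows. This is not the Matlab routine used in the paper: the function name, the round-to-nearest rule, and the choice to saturate the scaled integer code (rather than the real value itself) are assumptions made for illustration.

```python
import numpy as np

def quantize(x, wl, frac_bits):
    """Signed fixed-point quantization with wl total bits and frac_bits fractional bits.

    The value is scaled by 2**frac_bits, rounded to the nearest integer code,
    saturated to +/-(2**(wl-1) - 1), and scaled back.
    """
    scale = 2.0 ** frac_bits
    max_code = 2 ** (wl - 1) - 1
    code = np.round(np.asarray(x, dtype=float) * scale)
    code = np.clip(code, -max_code, max_code)   # saturation at the boundary codes
    return code / scale
```

With this convention the round-off error of an unsaturated value is at most half of the least significant bit, that is, $2^{-(l+1)}$ for $l$ fractional bits.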
Figure 8 provides performance plots of CPE2 (top) and CPE3 (bottom) with different WL choices, together with the floating-point performance. From the performance curves, we infer that both algorithms are able to track the channel without numerical stability issues with WLs of 24 and 32 bits. The performance curves with a WL of 16 bits indicate unacceptable performance or breakdown caused by quantization errors for both algorithms. For both algorithms, increasing the WL above 24 bits does not result in any improvement, and the performance curves of their 24-bit and floating-point implementations overlap (there is no visible difference). The 24-bit finite word implementation is therefore a reasonable approximation of the floating-point computation.
Figure 8: Fixed-point performance of CPE2 (top) and CPE3 (bottom) with three choices of quantization bits ($\lambda=0.75$, 4-QAM).
6. Conclusion
In this paper, by using numerically robust block HTs, two CP-based adaptive channel estimation algorithms have been presented for multicarrier systems. Conceptually, the new schemes maintain the same spirit of the CP based CMB-RLS channel tracking scheme. More precisely, the basic idea is to utilize CP data from the data detection part for adaptive channel estimation. The new approaches achieve the same purpose by replacing the computationally expensive CMB-RLS channel estimation part with the computationally cheaper SBHT-RLS alternatives. Among the two schemes, the method called CP based Direct SBHT-RLS is based upon Liu’s algorithm in the channel estimation part but adaptively updates channel vector instead of the error vector. The second method called CP based Extended SBHT-RLS is based upon Sakai’s algorithm in the channel estimation part but uses an independent and simpler derivation.
Floating-point performance curves indicate that all the three schemes are able to converge and can track channel variation without any stability problems. CPE1 and CPE2 exhibit identical stable performance, whereas CPE3 outperforms both the CPE1 and CPE2 techniques. In contrast to CPE1, what is remarkable here is that the CPE2 and CPE3 algorithms achieve their performance at lower computational complexity, enhanced parallelism, and pipelining for systolic array/VLSI implementation. All the three algorithms are seen to converge faster and perform better with lower values of forgetting factor parameter λ. Our simulation results suggest that such advantages come at the price of greater uneven performance. Hence, moderate values of forgetting factor would be preferred where a balance in both performance and stablility is required. The three techniques also show reduction in performance with the increase in modulation constellation size. Hence, these techniques are more appealing when the constellation size is small and may not be suitable for rate adaptation. It is also shown that in terms of finite word length behavior, 24-bit finite word implementation is a reasonable approximation of their typical floating-point computation (In practice, the word lengths are optimized with respect to the actual system requirements (i.e., chip area, latency, power consumption, FFT size, throughput), noise, channel length, and desired acceptable performance.).
Systolic array structures that allow efficient parallel implementations of the schemes with VLSI technology in real time were considered. The CPE2 approach is partially concurrent due to costly back substitution operation, whereas, CPE3 approach is highly concurrent due to the absence of back substitution operation and therefore lead to more efficient implementation on systolic processors.
The methods proposed in this paper are well suited for applications where good numerical properties, computational savings, and parallel processing implementation advantages are desired, with CPE3 additionally offering improved performance. Although a real baseband DMT case is the main focus of this paper, the proposed approaches can also be applied to the complex baseband case (wireless multicarrier systems). In that case, a further improvement in performance is possible by including forward error correction (FEC) decoding in the reliable reconstruction of transmitted symbols. Interesting future directions include studying hardware implementation problems; fine-grain implementation/architecture of the processing elements to work out the total cost of operators (adders, multipliers, dividers, and memory/delay elements) and algorithm latencies; and modifications of the schemes to achieve reduced complexity, improved performance, and stable implementations with reduced word lengths.
Acknowledgment
The author wishes to express his sincere thanks to the reviewers for their constructive comments and useful suggestions towards improving this paper.