Implementations of Signal-Processing Algorithms for OFDM Systems

1 Department of Electrical Engineering, National Cheng Kung University, Tainan 600, Taiwan
2 Department of Electrical Engineering, National Chung Hsing University, Taichung 40227, Taiwan
3 Department of Information and Communication Engineering, Inha University, Incheon, Republic of Korea
4 School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore
5 Department of Electrical & Computer Engineering, Lehigh University, Bethlehem, PA 18015, USA

Orthogonal frequency-division multiplexing (OFDM) techniques, coupled with multiple-input multiple-output (MIMO) processing, have been widely deployed in wireless communication systems such as IEEE 802.11n, WiMAX, and 3GPP LTE. To enhance system performance, novel signal processing techniques are continually developed; these often require massive and complicated computations and pose major challenges for implementation. With the advances of VLSI technology, such sophisticated signal processing algorithms can now be mapped onto powerful digital signal processors or dedicated hardware modules. Joint consideration of algorithm development and implementation issues is crucial to achieving optimized solutions. This special issue brings together state-of-the-art research contributions on digital signal processing techniques tailored for OFDM systems. It attracted many high-quality submissions, from which eight papers were selected after a conscientious peer-review process. These papers form an inspiring ensemble with a topic spectrum ranging from novel algorithm development and theoretical system characterization to efficient hard-wired implementations in FPGA/ASIC. Although research topics on signal processing for MIMO-OFDM systems are too broad to be covered in one special issue, we believe the collected papers serve as a good starting point. Finally, we would like to thank the authors for their excellent contributions and their cooperation during the editorial process. The competent and timely work of all the reviewers of these papers is also highly appreciated.
Of the eight papers included in this special issue, two aim at channel estimation and tracking, two address MIMO signal detection schemes, and two concern trellis decoder designs. One paper focuses on the mitigation of RF impairments, and one provides a probabilistic behavior analysis of MIMO channels.
Specifically, in the paper entitled "Probabilistic behavior analysis of MIMO fading channels under geometric mean decomposition," P. H. Kuo and P. A. Ting present a statistical characterization of the link gains and channel capacity achievable with a GMD precoding scheme. Problem formulation, PDF and channel capacity derivations, finite-state Markov chain constructions, and simulation results are provided, showing satisfactory performance.
In the paper entitled "CP-based SBHT-RLS algorithms for tracking channel estimates in multicarrier modulation systems" by H. Ali, two new algorithms are presented that exploit the cyclic prefix for data detection and employ systolic block Householder transformation recursive least squares (SBHT-RLS) algorithms for channel tracking. The new methods outperform existing RLS-based channel tracking schemes. Computational complexity and parallel implementation issues are also addressed. Complete performance evaluations subject to varying forgetting-factor values, constellation sizes, and word lengths illustrate the advantages of the proposed work.
In the paper entitled "Advanced receiver design for mitigating multiple RF impairments in OFDM systems: algorithms and RF measurements," Kiayani et al. investigate the joint impact of frequency-selective I/Q imbalance at both the transmitter and the receiver together with channel distortions and CFO errors. Two estimation and compensation structures based on different pilot patterns are proposed to separate the individual impairments. Low-complexity time-domain estimation and compensation algorithms are then developed to mitigate these problems. Both simulation and measurement results support the performance claims.
Two papers address MIMO-OFDM signal detection. In the paper entitled "Loop-reduction LLL algorithm and architecture for lattice-reduction-aided MIMO detection," Liao et al. propose a loop-reduction LLL algorithm and its architecture for lattice-reduction-aided MIMO detection. A look-ahead check technique reduces the algorithm complexity while preserving the original LLL criterion. Simulation results show that both the computational complexity and the computing latency are effectively reduced.
In the paper "Two-dimensional iterative processing for DAB receivers based on trellis-decomposition," W. J. Houtum and F. M. J. Willems investigate trellis decoding and iterative techniques for digital audio broadcasting (DAB). Trellis-decomposition methods allow estimation of the unknown channel phase, since this phase is related to the subtrellises. Simulations of this iterative approach on the TU-6 (COST 207) channel show an improvement of 2.4 dB at a Doppler frequency of 10 Hz.
In the paper entitled "A power-efficient soft-output detector for spatial-multiplexing MIMO communications," Wang et al. propose a VLSI implementation of a configurable, power-efficient MIMO detector that supports 4 × 4 spatial multiplexing and modulations from QPSK to 64-QAM. A novel tree-search algorithm enables the detector to provide soft outputs and to be implemented in a parallel, pipelined hardware architecture.
The paper entitled "Interference cancellation using replica signal for HTRCI-MIMO/OFDM in time-variant large delay spread longer than guard interval" focuses on time-domain ISI compensation through replica signal insertion with the shifted channel impulse response. System models and a complexity analysis are also provided. Exhaustive simulations demonstrate the performance advantage of the approach.
In the paper entitled "VLSI architectures for sliding-window-based space-time turbo trellis code decoders," G. Passas and S. Freear present VLSI architectures for space-time turbo trellis code decoders, as well as for a set of sliding-window-based SISO-MAP component channel decoders used in turbo decoding. They consider space-time turbo-code receivers based on nonbinary trellises. Finally, measurements of the complexity and decoding delay of several sliding-window-based MAP decoder architectures are presented and compared.

Introduction
The ability of turbo codes (TCs) to achieve very low BER approaching the Shannon limit is very attractive. These capacity-approaching codes were proposed by Berrou et al. [1,2]. Iterative turbo decoding involves the parallel concatenation of two or more component codes, which operate on the data and exchange information in order to progressively reduce the error rate. The exchanged information takes the form of log-likelihood ratio (LLR) soft decisions passed between elementary decoders, which apply either the soft-output Viterbi algorithm (SOVA) [3,4] or the maximum a posteriori (MAP) probability algorithm [5]. The principles of iterative turbo decoding can be combined with those of space-time coding, resulting in a bandwidth-efficient, low-error-rate channel coding scheme named space-time turbo trellis coding (STTuTC) [6][7][8][9][10]. The latter scheme benefits from the impressive coding gain of turbo codes and the diversity gain of space-time codes to obtain a very power-efficient system. These power gains are very important for high-performance communication systems, particularly in scenarios where operation at low signal-to-noise ratio is required.
Despite various complexity-reduction efforts, iterative turbo decoders still have prohibitively high implementation complexity and suffer from large decoding delay, motivating researchers to seek efficient implementations [11][12][13][14][15][16][17][18][19]. The sliding window technique (SWT) presented in [17][18][19][20][21] enables early acquisition of the state metric values without scanning the whole trellis frame in one direction before scanning it in the other; this reduces both the elementary decoder latency and the state metric memory requirements.

Turbo Transmitter.
A space-time turbo trellis code is a parallel concatenation scheme in which two identical elementary encoders operate on an information message C = [C_0, C_1, ..., C_t, ..., C_{P−1}], where each symbol consists of b bits [8,9]. The same message is sent to both component encoders, but each receives the information in a different order through an interleaver.
Elementary encoders using recursive space-time trellis coding (STTC) benefit from interleaver gain and iterative decoding [6][7][8]. These encoders add redundancy to the encoded messages, and the second sequence is deinterleaved back to its original order. The two sequences of codewords are multiplexed and truncated to preserve bandwidth efficiency, forming a single sequence of length P denoted by X = [X_0, X_1, ..., X_t, ..., X_{P−1}], where each space-time vector X_t = [x_{t,p}^1, x_{t,p}^2, ..., x_{t,p}^i, ..., x_{t,p}^K]^T consists of K b-bit modulated symbols. In the vector X, half of the symbols (odd locations) come from component encoder 1 (p = 1) and the other half (even locations) come from component encoder 2 (p = 2) [6]. The sequence is forwarded for carrier modulation and transmission, since baseband modulation occurs within the encoders. The structure results in a long block code built from small-memory convolutional codes, which enables the turbo code to approach the Shannon limit.
Note that the recursive STTCs are trellis based, meaning that during the encoding process each elementary encoder creates a trellis path [7]. Let S = [s_0, s_1, ..., s_t, ..., s_{P−1}, s_P] be a state row vector containing all the states in the trellis path chosen by the information message. The combined trellis paths will be decomposed and exploited at the receiver side to decode the message.

Proposed Turbo Receiver.

Consider a baseband MIMO receiver, shown in Figure 1. The received frame R = [R_0, R_1, ..., R_t, ..., R_{P−1}], where R_t = [r_t^1, r_t^2, ..., r_t^j, ..., r_t^ψ]^T, can be used to estimate the original sequence C. Using the baseband signal representation, the received noisy signal from antenna j at time t over a MIMO (possibly fading, when h_{j,i}^t ≠ 1) Gaussian channel is given by (1), where V_t^j represents the noiseless version of the r_t^j signal [7][8][9]. The receiver, after estimating the channel fading coefficients, separates the signals using maximal ratio receiver combining (MRRC), which also equalises the signals by removing the effect of the channel. With MRRC, costly joint detection of space-time symbols is avoided.
The equalised frame symbols are delivered to a soft demapper to calculate the soft channel output given in (2). The received symbols have already been prescaled by −E_s/2σ^2 by an SNR scaler in the analogue domain [12]. A frame formation block (FBB) separates the even and odd symbols and inserts zeros, ensuring that each log-MAP decoder core receives information only from its corresponding component encoder.
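The frame formation step above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and the assignment of even/odd (0-based) positions to the two encoders are illustrative assumptions.

```python
def frame_formation(symbols):
    """Sketch of the frame formation block: route the symbols of each
    component encoder to its own log-MAP core, zero-filling the other
    positions. Positions 0, 2, 4, ... are assumed to carry encoder-1
    symbols and positions 1, 3, 5, ... encoder-2 symbols."""
    to_decoder1 = [s if i % 2 == 0 else 0 for i, s in enumerate(symbols)]
    to_decoder2 = [s if i % 2 == 1 else 0 for i, s in enumerate(symbols)]
    return to_decoder1, to_decoder2
```

The zero fill guarantees that positions belonging to the other component encoder contribute no information to a decoder core.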
The STTuTC decoder at the algorithmic level consists of two soft-in soft-out (SISO) component decoders, which exchange soft information and, through many iterations, progressively improve the estimates of the information symbols [11,12]. Since the two component decoders do not operate on the data at the same time, a single silicon IP core can implement both. As illustrated in Figure 1, the decoder contains a symbol-based SISO component core that applies the log-MAP algorithm to a nonbinary trellis. It also uses two memories, one for storing the exchanged soft information and another for symbol interleaving/deinterleaving.

MAP and Log-MAP Algorithms.
Maximum likelihood decoding using the Viterbi algorithm determines the most likely path through the trellis and, from that, the most likely information sequence. The MAP decoder instead computes, for each information symbol u_t, the log-likelihood ratios λ_f(u_t | R) = ln(Pr(u_t = f | R)/Pr(u_t = 0 | R)), where f = 0, 1, 2, ..., 2^b − 1. This expresses the excess probability, in the logarithmic domain, of a symbol u_t being f over the probability of it being equal to the reference symbol 0 [11,13]. The log-MAP decoder calculates the LLRs of all the possible modulation symbols in the constellation diagram using the forward state metrics A_t(s) = ln(α_t(s)), the backward state metrics B_t(s) = ln(β_t(s)), and the branch metrics Γ_t^f(s', s) = ln(γ_t^f(s', s)) in the logarithmic domain. The A_t(s) metric represents the logarithmic probability that state s at stage t has been reached from the beginning of the trellis. Similarly, the B_{t−1}(s') metric represents the logarithmic probability that state s' at stage t − 1 has been reached from the end of the trellis [13,15]. The branch metric probabilities can be computed using (5), where C_1 = 1/2πσ^2; this equation gives the probability of a transition at time t from state s' to state s in the trellis diagram [13]. Translating (5) into the logarithmic domain [15] gives (6), where C'_1 = ln(C_1) is a constant, the H_t(s', s) term is the soft channel output, and the ln(Pr_t(f)/Pr_t(0)) term is the a priori information coming from the other component decoder. Finally, (7) gives the elementary decoder output, an a posteriori LLR expressed in terms of the logarithmic-domain state and branch metrics [13].

The BM-CM calculates all possible normalised branch metric values (see Section 3.1) and stores them in a register file consisting of M = 2^b registers. These metrics are fed to the A-SM-CM, B-SM-CM, and LLR-CM. The A-metric CM scans the entire trellis diagram in the forward direction, starting from stage zero and ending at stage P, calculating all A-state metrics and storing them in the A-metric RAM.
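The core operation behind all the log-domain recursions above is the Jacobi logarithm, usually written max*(). A minimal sketch in Python, computing the correction term exactly where a hardware implementation would read it from a small LUT:

```python
import math

def max_star(a, b):
    """Jacobi logarithm used throughout log-MAP decoding:
    ln(e^a + e^b) = max(a, b) + ln(1 + e^(-|a - b|)).
    In hardware, the correction term ln(1 + e^(-|a - b|)) is read from
    a small precalculated LUT rather than computed exactly."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))
```

Replacing the correction term with 0 degrades this to the max-log-MAP approximation; quantizing it into a few LUT entries gives the usual hardware compromise.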

Soft-In Soft-Out-MAP Component Decoder
After this first trellis pass, the B-metric CM and the LLR-CM simultaneously compute the B-state metrics and the LLR outputs, respectively. The B-SM-CM scans the entire trellis in the backward direction, from stage P to stage 0. The B-metrics need not be stored in a memory, since each time a B-metric is calculated, the corresponding A-metric is extracted from the A-metric RAM and the LLR calculation is performed.
Therefore, only the A-state metrics are stored in a RAM, the size of which must be (P + 1) · 2^G · w_L for continuous-mode decoders. In a decoder operating in the terminated mode, where the trellis path begins and ends at state s(0), that is, s_0 = s(0) and s_P = s(0), the memory requirement is slightly reduced to [(P − 3) · 2^G + 2 · 2^b + 2] · w_L, whereas in the truncated operating mode (i.e., s_0 = s(0) must hold for every frame, but there is no termination condition) the storage requirement is [(P − 1) · 2^G + 2^b + 1] · w_L, where w_L is the word length of the data format used to represent the internal data of the MAP decoder. Table 1 shows the complexity and decoding delay of a log-MAP decoder operating in the different modes. The metrics are calculated using data flow units (DFUs) that operate in decoding steps. As can be seen, the log-MAP decoder results in very high memory requirements and long elementary decoder latencies.
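The three storage formulas above can be collected into a small calculator; the numeric example in the usage note is illustrative (the parameter values are not taken from the paper):

```python
def a_metric_ram_bits(P, G, b, w_L, mode="continuous"):
    """A-metric RAM size in bits for a log-MAP decoder, following the
    mode-dependent formulas in the text (P trellis steps, 2^G states,
    b bits per symbol, w_L-bit internal word length)."""
    if mode == "continuous":
        return (P + 1) * 2**G * w_L
    if mode == "terminated":    # trellis path starts and ends at s(0)
        return ((P - 3) * 2**G + 2 * 2**b + 2) * w_L
    if mode == "truncated":     # path starts at s(0), no termination
        return ((P - 1) * 2**G + 2**b + 1) * w_L
    raise ValueError("unknown mode: " + mode)
```

For example, a frame of P = 1024 symbols on an 8-state trellis (G = 3) with b = 2 and w_L = 10 needs 82,000 bits of A-metric RAM in continuous mode, which illustrates why the sliding window technique discussed later is attractive.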

Branch Metric Calculation Module.

The branch metric calculation module computes the normalised branch metrics Γ_t^f(s', s). The branch metric appears in both the numerator and the denominator of the a posteriori LLR formula, so any constants cancel out and play no role in the calculation. The module receives as input the soft channel output H_t(s', s) and the a priori information from the other component decoder, and computes all possible branch metric values, as shown in Figure 3.

State Metric Calculation Module.
In the literature, the proposed architectures that recursively calculate the state metrics are based on the so-called ACSO (add-compare-select-offset) processing unit, equivalent to the add-max*() unit presented in [18]. It is responsible for computing a state metric from the previous-stage state metrics and the branch metrics arriving at the state in question.
Several different but similar architectures have been proposed. The diagram in Figure 4 suggests a new type of ACSO unit which, instead of employing more multiplexers to cope with the more than two input state metrics of nonbinary trellises, recursively accepts the input state and branch metrics and calculates the output state metric over a number of cycles. A look-up table can be used to store precalculated correction factors; its quantization is discussed in [18,22].
This module requires M clock cycles to calculate the output state metric. In the first clock cycle, the feedback register is reset, and A_{t−1}(s_1) and Γ_t^0(s_1, s) are applied at the inputs. The SW1 switch is connected to the zero register, so a_1 = A_{t−1}(s_1) + Γ_t^0(s_1, s) is stored in the feedback register. In the remaining cycles, the LUT output is connected through SW1 to the output adder. In the second cycle, a_2 = A_{t−1}(s_2) + Γ_t^1(s_2, s) is applied at the input, so max*(a_1, a_2) is stored in the feedback register. Similarly, max*(a_3, max*(a_1, a_2)) is stored in the third cycle, and the output state metric, equal to max*(a_4, max*(a_3, max*(a_1, a_2))), is ready at the output port during the fourth clock cycle. A-metric calculation is illustrated, but the architecture works equally well for B-metric calculation in a backward recursion. This unit will be referred to as the recACSO unit.
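The cycle-by-cycle behaviour of the recACSO unit can be modelled with a short sketch (names and the exact max*() evaluation are illustrative assumptions; hardware would use the LUT-based correction):

```python
import math

def max_star(a, b):
    # add-compare-select-offset: max() plus the correction term that
    # the hardware unit reads from its LUT
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def rec_acso(prev_metrics, branch_metrics):
    """Model of the recACSO recursion: cycle 1 loads
    a1 = A(s1) + Gamma(s1, s) into the feedback register (SW1 on the
    zero register); each later cycle folds one more candidate in via
    max*(), so M candidate paths take M cycles."""
    acc = prev_metrics[0] + branch_metrics[0]          # cycle 1
    for A, G in zip(prev_metrics[1:], branch_metrics[1:]):
        acc = max_star(acc, A + G)                     # cycles 2..M
    return acc
```

Because max*() computes an exact ln(e^a + e^b) here, the M-cycle result equals ln Σ_i e^{a_i} over all M candidate paths, which is the desired log-domain state metric.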
The forward/backward recursion computations may lead to overflow. If a state metric value grows continually, the finite word length w_L will not be sufficient to hold it and an overflow will occur. Since the max*() operation is linear and shift invariant, a global shift of the state metric values does not change the LLR output: it is the differences between the state metrics, not their absolute values, that matter. Rescaling of the state metrics can therefore be readily performed to avoid overflow. The rescaling technique is the same as that used in other SISO log-MAP implementations [11,18,23].
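A minimal sketch of the rescaling idea (one of several equivalent normalisations; subtracting the maximum is an assumption, other implementations subtract a fixed state's metric):

```python
def rescale(metrics):
    """Subtract the largest state metric from all of them so the values
    stay near zero and fit in w_L bits; only metric *differences* enter
    the LLR, so the soft output is unchanged."""
    m = max(metrics)
    return [x - m for x in metrics]
```

Because every metric is shifted by the same constant, any pairwise difference, and hence any LLR built from the metrics, is exactly preserved.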
The precision of the state metrics determines the word length w L . The precision depends on the dynamic range of the state metrics.
Thus, 2^G recACSO units can be employed to calculate all the states of a trellis stage in one step, as depicted in Figure 5 for a 2^G-state trellis. This module receives the 2^G possible branch metrics and the previous state metrics and computes all the state metrics of the current stage. Configuring the module this way speeds up computation, since it avoids reading the previous-stage state metrics from the RAM. At the input of the module, two selectors drive the recACSO units with the appropriate inputs.

LLR Output Calculation Module.

The LLR-CM depicted in Figure 6 is responsible for computing the output reliability values. A careful observation of (9) reveals a close relation to the recACSO unit calculation. The LLR computation has two terms that can be computed separately and then subtracted. If A_{t−1}(s') and B_t(s) are added in advance, the recACSO unit can be used to calculate the max*() terms for f = 0, 1, ..., 2^b − 1. Because the max*() term for f = 0 has to be subtracted from all the other LLRs, an inverter is included to negate it.
As already mentioned, λ_0(u_t | R) need not be calculated, since zero is the reference symbol and its LLR is therefore zero. The remaining 2^b − 2 log-likelihood ratios are computed by the proposed architecture and output from the symbol-based log-MAP component decoder.

SWT-Based Architectures for the SISO-MAP Decoder
4.1. The Concept. SISO-MAP decoders suffer from very long latency and high memory requirements. A different organisation (scheduling) of the computations, exploiting the convergence property [24][25][26] of the decoder, reduces both the latency and the number of state metrics that must be stored. For this reason, the sliding window technique (SWT) has been proposed in the literature [18,19]. SWTs are divided into single-flow-structure (SFS), double-flow-structure (DFS), and pointer-based (PNT) techniques. In these techniques, the trellis (or transmitted frame) is divided into windows of size L, where L is the convergence length [25]. Care must be taken to ensure that the whole frame divides exactly into an integer number of windows. The MAP decoder convergence property allows a recursive state metric calculation to yield a good approximation of the actual metric value after L recursion steps, even if the initial data is a dummy value [8,9].
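The convergence property can be demonstrated numerically. The sketch below runs two backward recursions over a toy 2-state trellis with random branch metrics (the trellis, metric ranges, and initialisations are illustrative assumptions, not the paper's system): one starts from the "true" boundary metrics, the other from an all-zero dummy vector. After L steps, the two metric vectors differ only by a common offset, so the differences that feed the LLR agree.

```python
import math
import random

def max_star(a, b):
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def backward_step(B_next, gamma):
    """One backward log-MAP recursion on a toy 2-state trellis;
    gamma[s][sp] is the branch metric from state s at stage t to
    state sp at stage t + 1."""
    return [max_star(gamma[s][0] + B_next[0], gamma[s][1] + B_next[1])
            for s in (0, 1)]

random.seed(0)
L = 32  # convergence length for this toy example
gammas = [[[random.uniform(-2.0, 2.0) for _ in (0, 1)] for _ in (0, 1)]
          for _ in range(L)]

B_true = [0.0, -5.0]   # "correct" initialisation at the window edge
B_dummy = [0.0, 0.0]   # arbitrary dummy initialisation
for g in reversed(gammas):
    B_true = backward_step(B_true, g)
    B_dummy = backward_step(B_dummy, g)

# After L recursions the two trajectories differ only by a common
# offset, which cancels in the LLR: the metric differences agree.
gap = abs((B_true[0] - B_true[1]) - (B_dummy[0] - B_dummy[1]))
```

The initial difference of the two initialisations is 5; after the L recursions, `gap` shrinks by orders of magnitude, which is exactly why a dummy warm-up of L steps yields valid boundary metrics.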
Three important metrics can be considered when comparing MAP decoders: (i) decoding delay (latency), (ii) memory requirements (or storage lifetime), and (iii) computational complexity (number of recursion modules). Different schedulings of the MAP algorithm computation reveal trade-offs between these quantities. Since the window size is L recursion steps, a DFU as proposed in Section 3 can be employed to calculate all the state metric vectors of a window in L steps. A DFU can perform forward, backward, or dummy metric calculations. Defining the parameter set (technique, Π, π, shift, ε) will assist in the analysis of SWT-based MAP decoders. The "technique" element is the type of computation organisation, with possible values SFS, DFS, SFS/PNT, and DFS/PNT. "Π" is the relative position between the coupled forward (A-) and backward (B-) metric recursive calculations, and therefore determines when the B-metric calculation begins relative to the A-metric calculation. It is a continuous-valued variable, but only three values are meaningful choices: Π = 0 means that the backward calculation starts after the forward ends, Π = 2L means that the forward calculation starts after the backward ends, and Π = L means that the two calculations take place simultaneously and switch roles in the middle of the convergence-block calculation.
"π" is the ratio of invalid (dummy) metric recursions to valid A- and B-metric recursions. To compute the backward metrics before the end of the whole frame, a dummy metric calculation is required to estimate an intermediate value using the convergence property. Typically, one dummy module is required for every A- and B-metric pair, but this can vary depending on the technique.
The computation can take place in two directions (two flows) simultaneously, each flow calculating half of the frame. The "shift" is the time shift, in decoding steps, between the two flows. In the pointer technique, pointers are used to reduce memory requirements; ε is the number of pointers, and its presence indicates a pointer-based (PNT) technique.
This set of SWT-based architectures has been fully investigated, and some results are given in Table 2, which indicates the hardware requirements and decoding delay of some common implementation structures. Figure 7 shows the DFG of a single flow; the SM-CM architectures of the (SFS, Π = 0, π = 1) and (SFS, Π = 0, π = 1/2) single-flow structures are also illustrated. The flow diagram explains the computation.
In the DFG, the horizontal axis represents time, quantized in symbol periods, whereas the vertical axis represents the symbol number within the frame. This specific diagram presents a frame of P = 4L symbols decoded by a trellis of P + 1 = 4L + 1 stages, where L is the convergence length of the decoder.
Three data flow units can be seen in the diagram. The dashed arrow represents L recursions of a forward DFU calculating A-metrics, which are stored as they are calculated. The continuous arrow represents L recursions of a backward DFU calculating B-metrics; this calculation is done simultaneously with the LLR soft-output calculation. The dotted arrow represents L recursions of a dummy metric DFU.
In the diagrams that follow, the dotted arrow always represents dummy metric use of a DFU, whereas for forward and backward metric DFUs the arrow is continuous or dashed according to whether the metrics are discarded or stored in memory, respectively. The dummy metric DFU yields a valid metric after its L backward recursions.
The fact that the dummy metric DFU works backwards can be seen from its projection on the vertical axis, which points downwards; DFUs whose projections on the vertical axis point upwards operate on the data in a forward manner. Also note that the maximum number of arrows a vertical line can cross in the diagram gives the number of DFUs employed in the structure [21].
Storage is also represented by the shaded rectangular areas. For any one of these areas (they are all the same), its projection onto the vertical axis gives the number of state metric vectors to be stored [18,21], whereas its projection onto the horizontal axis gives the time for which the vectors must be stored.
Decoding delay (or latency) is the horizontal distance between the acquisition curve (always y_1 = t) and the decoding curve in the DFG. Since the decoding curve in this structure is y_2 = t − 4L, the decoding delay is y_1 − y_2 = t − (t − 4L) = 4L symbol periods. The symbol period is denoted by t_S.
DFGs and tile graphs model the same thing: the resource-time scheduling of the recursions of the algorithm. A diagram is a DFG when viewed as a concrete graph, or a tile graph when viewed as a tile repetition [17]. On the right of Figure 7, a tile consisting of 3 DFUs is illustrated. This tile is repeated as many times as required to form a complete DFG.
Let us concentrate on the scheduling of the operations described by the DFG in Figure 7. During symbol periods 2L to 3L − 1, the dummy metric DFU (dotted arrow 1), starting from a zero vector assigned to state metric vector B_{2L}, computes invalid metrics from B_{2L−1} down to B_L. This calculation results in a valid state metric vector B_L, because convergence is reached; all other metrics calculated during these L recursions are invalid. At the same time, the forward DFU (dashed arrow 2) calculates and stores a sequence of L A-metrics, from A_0 to A_{L−1}.
In the next L symbol periods, from 3L to 4L − 1, the backward DFU (continuous arrow 3) comes into play to calculate B_{L−1} down to B_0. During this backward calculation, the A-state metrics from A_{L−1} down to A_0 are extracted from the memory and used together with the B-state metrics to calculate the first L soft outputs.
Thus, a set of 3 recursion units (a "tile" in the tile diagram, as indicated in the DFG) results in the calculation of the L A- and B-metrics required to produce L soft-output LLR values. Note that the LLR values are calculated in reverse order; at symbol periods 4L to 5L − 1, the order of the soft-output values is restored [18]. In a turbo coding scheme, the interleaver reordering can be exploited to perform this reversal. If this last LLR-reversing step is done by the MAP decoder, the latency of the core is 4L symbol periods; if the reversing is done by the interleaver, the decoding delay is 3L symbol periods.
The above process (tile) is repeated until the end of the frame. The set of tiles in the tile graph (or DFG) represents a single flow passing through the data of a frame (or trellis); hence, the structure is designated a single-flow structure (SFS). Structures where the process occurs in both directions are called double-flow structures (DFSs) and are discussed later in this paper.
In single-flow structures, the benefit of Π = 0 compared with Π = L is the use of a single RAM instead of the two half-size RAMs required in the latter case. Although the memory and other requirements are the same, less complex datapath circuits are needed in the (SFS, Π = 0, π) case. The (SFS, Π = L, π) structure also experiences a smaller decoding delay than (SFS, Π = 2L, π) when LLR reversing is performed in the interleaver rather than in the MAP decoder. Thus, for SFS, Π = 0 is clearly the best choice. For (SFS, Π = 0, π) decoders, the decoding delay is 3πL or 2πL for π > 1, and (2 + 2π)L or (2 + π)L for π ≤ 1, where the second value in each case applies when the interleaver performs the LLR reversal.
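The piecewise delay expression above can be captured in a small helper (the function name is illustrative):

```python
def sfs_decoding_delay(pi, L, interleaver_reversing=False):
    """Decoding delay, in symbol periods, of an (SFS, Pi = 0, pi)
    decoder, per the piecewise expressions in the text; pass
    interleaver_reversing=True when the interleaver, rather than
    the MAP decoder, restores the LLR order."""
    if pi > 1:
        return 2 * pi * L if interleaver_reversing else 3 * pi * L
    return (2 + pi) * L if interleaver_reversing else (2 + 2 * pi) * L
```

With π = 1 this reproduces the 4L (decoder-side reversal) and 3L (interleaver-side reversal) delays derived for the Figure 7 schedule.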
Note that when π > 1, both the storage requirements and the decoding delay increase without any reduction in the number of DFUs; therefore, architectures with π > 1 are not recommended.
For π < 1, the storage requirements and decoding delay decrease as π decreases, but the number of required DFUs increases. The decoding delay is (2 + 2π)L when the MAP decoder reverses the LLR outputs and (2 + π)L when the interleaver reverses them.

Double-Flow Structure SISO-MAP Decoders.
The DFG and the state metric calculation modules for a (DFS, Π = L, π = 1/2, shift = L/4) structure are shown in Figures 9 and 10, respectively. The double-flow structures use all of the concepts of single flows: they also have the parameters Π and π, with the interesting values being Π = 0, L, and 2L and π = 1/2, 1, and 2. Some DFSs have the benefit of operating at twice the speed without doubling the hardware requirements.
For example, when Π = L, the RAM memories used by the data flow units to store the state metrics A and B of the SISO-MAP algorithm operate in write mode for half of the time and in read mode for the other half. The read period can be exploited by a second flow, which runs over half of the frame data in the opposite direction. This is illustrated in Figure 9 for Π = L, π = 1, which shows a structure that decodes a frame of P = 8L symbols in half the time. Figure 13 depicts a DFS with Π = L, π = 1/2, shift = L/4. Note that for RAM sharing between the two data flows a shift of L/4 is required in this case, as opposed to the L/2 shift used in other DFSs; the appropriate shift is thus not constant but depends on the other parameters of the structure. The diagram clearly indicates that two flows operate, splitting the 8L frame into two 4L-symbol blocks. The first flow uses 4 DFUs: one forward, one backward, and two dummy metric DFUs. The same resources are employed by the second flow, so in total 8 DFUs are required.
Double-flow structures are efficient only for Π = L, since only for this value of Π can memory sharing take place. For all other values of Π, the memory requirements and the number of DFUs double, with the sole advantage of doubling the MAP decoder speed, since the frame decoding is split between the two flows. For (DFS, Π = L, π, shift) decoders, setting π above 1 results in prohibitively large decoding delay: the delay is (4π − 3/2)L, or (3π − 3/2)L with interleaver LLR reversing, so even for π = 3 it is 10.5L or 7.5L, which is unacceptably large. Therefore, only architectures with π ≤ 1 are of interest. Figures 11 and 12 illustrate the DFG and the SM-CM for an SFS/PNT structure.
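A sketch of the DFS delay as a function of π; the closed form is reconstructed from the worked value in the text (π = 3 giving 10.5L and 7.5L) and should be treated as an assumption rather than the paper's exact formula:

```python
def dfs_decoding_delay(pi, L, interleaver_reversing=False):
    """Decoding delay of a (DFS, Pi = L, pi, shift) decoder:
    (4*pi - 3/2)*L, or (3*pi - 3/2)*L when the interleaver performs
    the LLR reversal. Reconstructed closed form -- an assumption."""
    coeff = 3 * pi - 1.5 if interleaver_reversing else 4 * pi - 1.5
    return coeff * L
```

The rapid growth in π makes clear why values above 1 are dismissed in the text.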

Pointer Technique-Based SISO-MAP Decoders.
In all of the above structures, all the necessary metrics are stored in RAM. An alternative is to selectively store some of the state metrics and recompute the others; this idea of selective recomputation leads to reduced storage requirements.
The pointers are nothing other than actual state metric values, stored in and extracted from a storage device whenever necessary. The values to be stored as pointers are depicted by small white circles in Figure 11.
In this structure, during the interval t = 2Lt_S to (3L − 1)t_S, after L dummy-metric recursion steps, the resulting valid metric B_L is stored in a register. This value also initialises a backward recursion, which operates during symbol periods 3L to 4L − 1. As this backward B-metric DFU progresses, it stores the values B_{3L/4} and B_{L/2} in registers and the last L/4 state-metric values, from B_{L/4−1} down to B_0, in the RAM memory. Thus, only L/4 state vectors are stored in RAM. An extra backward DFU then extracts the pointer B_{L/2} to recalculate the next L/4 B-metric values (B_{L/2−1} down to B_{L/4}) and stores them in the memory. This procedure is repeated until all L B-metric vectors in the window have been calculated. As the backward metrics are calculated, a forward DFU computes the corresponding A-state metrics.
As can be observed from Figure 11, at any given time only L/4 state-metric vectors, each of size 2^G · w_L bits, are stored in the RAM. The storage requirements of this structure are therefore a RAM of (L/4) · 2^G · w_L bits and 3 pointer vectors of 2^G · w_L bits each.
For a pointer-based SFS with Π = 0, the storage of state metrics starts right after the dummy-metric calculation has produced a valid metric, so there is no time slack for pointers. Π = 2L behaves like the Π = L case with regard to pointers, in the sense that the trade-offs mentioned above are not affected. Increasing ε (the number of pointers) reduces the RAM memory but has no effect on the number of DFUs. The size of the required memory is (L/(ε + 1)) · 2^G · w_L bits, plus a register of 2^G · w_L bits for each pointer. The decoding delay is affected by the number of pointers only if interleaver LLR reversing is employed; in this case the decoding delay is 4L − (1/(ε + 1))L, which holds unless π > ε.
If the structures (SFS/PNT, Π, π = 1, shift = −, ε = 3) with Π = 0, L, 2L are compared with the corresponding SFS structures that use no pointers, it is observed that using three pointers and one more DFU yields a 75% reduction in the required RAM memory. When interleaver LLR reversing is used, the decoder delay of the pointer-based structures is increased by 0.75L for the first two cases, Π = 0 and Π = L.
For π ≠ 1, if mod(ε + 1, 1/π) ≠ 0, the pointer-based techniques (single flow and double flow) are not applicable. If mod(ε + 1, 1/π) = 0, the resulting architecture does not reduce the memory size but has higher throughput (the computation speeds up π times). In this case, (12a) and (12c) hold as they are, but in (12b) the number of DFUs is given by 1/π + 3. Π = 2L is the best choice for a pointer-based SFS, whereas Π = L is the better choice for a pointer-based DFS. Again, note that increasing ε reduces the storage requirement but not the data-flow-unit count of the structures. Figure 13 shows a complete SISO-MAP decoder VLSI architecture using a double flow with Π = L, π = 1, ε = 3 pointers, and shift = L/8. This architecture requires 8 DFUs (2 dummy, 2 pointer-saving, 2 forward, and 2 backward) and two RAM memories of (L/8) · 2^G · w_L bits each, and has a decoding delay of 4L. With selective recomputation, only a fraction of the required state-metric vectors are stored in the RAM memories: if ε pointers are used, only L/(ε + 1) of the state metrics need to be stored in RAM, and the remaining L − L/(ε + 1) are recalculated in blocks of L/(ε + 1).
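The selective-recomputation idea can be illustrated with a small sketch. The toy scalar recursion below stands in for the SISO-MAP B-metric update (the real update operates on 2^G-state metric vectors); the block size, pointer positions, and the recursion itself are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

L_win = 16          # window length L
eps = 3             # epsilon pointers -> RAM holds only L_win/(eps+1) metrics
blk = L_win // (eps + 1)
g = np.arange(L_win, dtype=float)        # toy branch metrics

def step(b, gamma):
    # toy scalar recursion standing in for the vector B-metric update
    return 0.9 * b + gamma

# reference: full backward sweep storing every metric (no pointers)
full = np.empty(L_win)
b = 0.0                                  # valid B_L from the dummy recursion
for t in range(L_win - 1, -1, -1):
    b = step(b, g[t])
    full[t] = b

# pointer technique: the sweep keeps boundary metrics in registers and
# writes only the lowest block (B_{blk-1} .. B_0) into RAM
pointers = {L_win: 0.0}                  # B_L itself acts as the top pointer
ram = np.empty(blk)
b = 0.0
for t in range(L_win - 1, -1, -1):
    b = step(b, g[t])
    if t % blk == 0 and t >= 2 * blk:
        pointers[t] = b                  # e.g. B_{3L/4} and B_{L/2}
    if t < blk:
        ram[t] = b

# extra DFUs later regenerate each missing block from its pointer
rebuilt = np.empty(L_win)
rebuilt[:blk] = ram
for t0, bp in pointers.items():
    b = bp
    for t in range(t0 - 1, t0 - blk - 1, -1):
        b = step(b, g[t])
        rebuilt[t] = b
```

Only blk = L/(ε + 1) metrics ever sit in RAM; the ε register-held pointers trade a few extra recursion steps (one more DFU) for the 75% RAM reduction quoted above for ε = 3.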

Conclusions
The VLSI architectures of space-time turbo trellis coding (STTuTC) decoders, as well as of a set of SISO-MAP component channel decoders used in turbo coding, are proposed and investigated. In contrast to binary turbo codes, the space-time turbo code receiver is based on nonbinary trellises, which imposes a number of differences. Apart from the fact that channel estimation and MRRC combining are needed to handle the multiple diversity received symbols, the frame-formation block differs from that of a traditional binary turbo receiver: in the latter, systematic and parity (redundant) information can be separated and stored in separate memory banks, whereas in the former it cannot, so the whole symbol is stored in one memory bank. In the STTuTC receiver, the equalised symbols are demultiplexed and stored in memory banks, zeros are inserted at even or odd locations, and one memory bank at a time is sent to the decoder. This ensures that the SISO-MAP decoder receives the information from the corresponding encoder. The symbols are demapped by a hard or soft symbol demapper.
The proposed STTuTC architectures are based on a different ACSO unit than binary turbo codes. This ACSO unit can handle more than two state- and transition-metric pairs iteratively: in nonbinary trellises there are more than two such pairs because of the increased number of transitions between states, so the ACSO unit must either grow or operate iteratively. Many ACSO units can be used to calculate all state metrics in one step without storing those values in a state-metric memory, so the state-metric calculation modules differ considerably from those of binary turbo decoders. The LLR calculation module is also different because more than one LLR value must be calculated in every decoding step: f appropriately connected ACSO units, rather than two, are required to calculate the LLR values for all possible symbols. This also means that the number of soft LLR outputs the SISO-APP demapper delivers at a time depends on the f possible values; that is, f − 1 LLRs must be output, so the LLR memory size differs from the binary turbo decoder case.
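The iterative reuse of a radix-2 ACSO unit over f candidates can be sketched as a folded Jacobian-logarithm (max*) reduction; the function names are ours, and a hardware unit would use a small lookup table for the correction term rather than math.log1p.

```python
import math

def max_star(a, b):
    # Jacobian logarithm: max*(a, b) = ln(e^a + e^b)
    #                               = max(a, b) + ln(1 + e^-|a-b|)
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def acso(metrics):
    # A radix-2 ACSO unit reused f-1 times folds the f candidate
    # (state + branch) metric sums of a nonbinary trellis into one value.
    acc = metrics[0]
    for m in metrics[1:]:
        acc = max_star(acc, m)
    return acc

# folding is exact: the result equals the log-sum-exp of all f inputs
vals = [0.3, -1.2, 2.5, 0.7]                    # f = 4 candidates
lse = math.log(sum(math.exp(v) for v in vals))
```

Because max* is associative, one radix-2 unit applied f − 1 times gives the same result as a grown radix-f unit, which is the size/latency trade-off described above.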
A parameter set (FS, Π, π, shift, ε) facilitated the comparison and led to defining equations for many different cases: single-flow, double-flow, and pointer-based techniques. Table 3 lists the formulae that determine the memory size, the number of deployed DFUs, and the decoding delay of the most efficient techniques among those discussed in this paper.

Introduction
Recently, sophisticated terminals such as smartphones have become widely used, and many multimedia services are provided [1,2]. High-speed packet access (HSPA), based on wideband code-division multiple access (W-CDMA), is used in mobile communications and provides data services with a maximum transmission rate of about 14 Mbps [3]. However, since the available frequency band is limited, HSPA cannot improve the transmission rate much further [4]. To solve this problem, orthogonal frequency-division multiplexing (OFDM) and multiple-input multiple-output (MIMO) techniques are actively used [5][6][7][8][9]. OFDM is a multicarrier digital modulation in which the signal is transmitted in parallel over many mutually orthogonal subcarriers. In MIMO systems, the signals of several transmit antennas are transmitted in the same frequency band; moreover, since each transmitted signal propagates over an independent channel, space diversity can be obtained at the receiver. Long Term Evolution (LTE) has therefore been standardized as a 3.9G system using MIMO/OFDM [10,11]; LTE provides a maximum transmission rate of about 50-100 Mbps. Moreover, LTE-Advanced, as a 4G system, will support broadband data services with a maximum transmission rate of about 100 Mbps-1 Gbps.
Since the received signal is distorted by the amplitude and phase variations of frequency-selective fading, channel estimation (CE) is important to compensate for the channel variation. In conventional MIMO/OFDM, orthogonal-pilot-symbol-based CE is used to obtain accurate channel state information (CSI) [7]. However, since orthogonal-pilot-based CE requires a large number of pilot symbols and large transmission power, the transmission rate is degraded. To mitigate these problems, high time resolution carrier interferometry (HTRCI) for MIMO/OFDM has been proposed [12]. In wireless communications, intersymbol interference (ISI) and intercarrier interference (ICI) are serious problems that degrade system performance. To prevent them, a guard interval (GI) is generally inserted; the GI is usually designed to be longer than the delay spread of the channel. If, however, the maximum delay spread is longer than the GI, the system performance is significantly degraded by ISI and ICI, and the conventional HTRCI-MIMO/OFDM does not consider the case of a time-variant large delay spread longer than the GI. In this paper, we propose ISI and ICI compensation methods for HTRCI-MIMO/OFDM under a time-variant large delay spread longer than the GI. Several schemes have already proposed ISI and ICI compensation methods for large-delay-spread channels. For example, [13] has proposed an ISI reduction method that extends the GI; however, since the GI length is extended, the maximum throughput is degraded and the transmission power is also increased. Reference [14] has proposed ISI and ICI compensation using turbo equalization, but it incurs large complexity owing to the iterative processing of the turbo equalizer.
Reference [15] has proposed ISI and ICI compensation using the estimated channel coefficients and the CSI to reproduce the interference components; however, the packet becomes longer because a training symbol is used. To mitigate the above-mentioned problems, in this paper we propose a time-domain ISI compensation method with replica-signal-based ICI compensation for HTRCI-MIMO/OFDM. This paper is organized as follows. In Section 2, we present the system model. We describe the proposed system in Section 3 and show the computer simulation results in Section 4. Finally, the conclusion is given in Section 5.

System Model
This section describes the system model, which employs the time-division multiplexing (TDM) transmission for multiple users. This system is illustrated in Figure 1.

Channel Model.
We assume that the propagation channel consists of L discrete paths with different time delays. The impulse response between the mth transmit and nth receive antenna, h_{m,n}(τ, t), is represented as h_{m,n}(τ, t) = Σ_{l=0}^{L−1} h_{m,n,l}(t) δ(τ − τ_{m,n,l}), where h_{m,n,l} and τ_{m,n,l} are the complex channel gain and the time delay of the lth propagation path, Σ_{l=0}^{L−1} E[|h_{m,n,l}|²] = 1, and E[·] denotes the ensemble average. The channel transfer function H_{m,n}(f, t) is the Fourier transform of h_{m,n}(τ, t). 2.2. HTRCI-MIMO/OFDM. The transmission block diagram of the proposed system is shown in Figure 1(a). First, the coded binary information data sequence is modulated, and N_p pilot symbols are appended at the beginning of the sequence. The HTRCI-MIMO/OFDM transmitted signal for the mth transmit antenna can be expressed in its equivalent baseband representation, where N_d and N_p are the numbers of data and pilot symbols, N_c is the number of subcarriers, T_s is the effective symbol length, S is the average transmission power, and T is the OFDM symbol length. The frequency separation between adjacent orthogonal subcarriers is 1/T_s, and the kth subcarrier of the ith symbol is scrambled by c_PN, a long pseudonoise (PN) sequence used as a scrambling code to reduce the peak-to-average power ratio (PAPR). A GI is inserted in order to eliminate the ISI due to multipath fading, where T_g is the GI length and ε = T_s/T_g. In communication systems ε is generally considered to be 4 or 5; in this paper we assume ε = 4. In (3), g(t) denotes the transmission pulse. For 0 ≤ i ≤ N_p − 1, the transmitted pilot signal of the kth subcarrier for the mth transmit antenna element is given by (7), where ζ = ⌊ε/2⌋, m' = mod(m, ζ), and ⌊x⌋ denotes the largest integer not exceeding x.
In this case, the HTRCI-MIMO/OFDM pilot signal can multiplex the same impulse responses from the transmit antenna elements, at each receive antenna element, ζ times in the time domain without overlapping. For example, with ε = 4 and 2 transmit antenna elements, (7) gives {d_0(k, 0)}_{k=0}^{N_c−1} = {1, 0, . . . , 1, 0} and {d_1(k, 0)}_{k=0}^{N_c−1} = {1, 0, −1, 0, . . . , 1, 0, −1, 0} as the pilot signals of the first and second transmit antenna elements, as shown in Figure 2(a). Since each pilot signal contains "0" components, the CSI can be identified with half the transmission power of a conventional MIMO/OFDM system using orthogonal pilots. However, if τ_{m,n,L} > T_g, where τ_{m,n,L} is the maximum delay spread for the mth transmit and nth receive antenna, the conventional HTRCI-MIMO/OFDM cannot obtain accurate CSI. To solve this problem, we propose a new channel estimation for HTRCI-MIMO/OFDM. First, at the transmitter, we shift the channel impulse responses of (7) as in (8); the shifted impulse responses are shown in Figure 2(a). This operation makes it possible to estimate a maximum delay spread longer than the GI and to obtain accurate CSI at the receiver. The receiver structure is illustrated in Figure 1(b). By applying the FFT operation, the received signal r_n(t) is resolved into N_c subcarriers. The received signal for the nth receive antenna, r_n(t), can be expressed in its equivalent baseband representation as in (9), where M is the number of transmit antennas and n_n(t) is additive white Gaussian noise (AWGN) with single-sided power spectral density N_0 for the nth receive antenna. The kth subcarrier r_n(k, i) is given by (10), where n_n(k, i) is AWGN with zero mean and variance 2N_0/T_s.
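The time-domain multiplexing of the two pilot patterns can be checked numerically: the IFFT of the {1, 0, 1, 0, …} pattern concentrates antenna 0's response at delays 0 and N_c/2, while the {1, 0, −1, 0, …} pattern places antenna 1's at N_c/4 and 3N_c/4, so the ζ = 2 copies never overlap. This is a minimal sketch with an assumed N_c = 16.

```python
import numpy as np

Nc = 16                                  # subcarriers (illustrative)
p0 = np.tile([1, 0], Nc // 2)            # pilot of Tx antenna 0 from (7)
p1 = np.tile([1, 0, -1, 0], Nc // 4)     # pilot of Tx antenna 1 from (7)

t0 = np.fft.ifft(p0)                     # impulses at n = 0 and Nc/2
t1 = np.fft.ifft(p1)                     # impulses at n = Nc/4 and 3Nc/4

# expected time-domain combs: two impulses of amplitude 1/2 each
e0 = np.zeros(Nc); e0[[0, Nc // 2]] = 0.5
e1 = np.zeros(Nc); e1[[Nc // 4, 3 * Nc // 4]] = 0.5
```

After the channel, each impulse is replaced by that antenna's impulse response, so the receiver can cut the ζ copies apart with time windows, which is what (14) exploits.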
After simplification, (10) can be rewritten as (11). After descrambling, the output signal r̃_n(k, i) for the nth receive antenna element is given by (12), where (·)* is the complex conjugate and c*_PN(k)/|c_PN(k)|² is the descrambling operation. Observing (11) and (12), the noise components keep the same notation; in this paper, the descrambling operation rotates the phase of each subcarrier by using a PN code. Since ε = T_s/T_g, HTRCI-MIMO/OFDM can multiplex the same impulse responses ζ times in the time domain. After the pilot-signal separation, the pilot signal is converted back to the time-domain signal r_n(t) as in (13), where P is the transmitted pilot-signal power and ñ_n(t) is the noise component. Here, if τ_{m,n,L} > T_g, the channel impulse responses of a different time window overlap the desired channel impulse responses, so accurate CSI cannot be obtained. Therefore, we process the received pilot as in (14) at the receiver, as illustrated in Figure 2(b).

Rewritten Matrix Form.
After the pilot-signal separation, (13) for τ_{m,n,L} ≤ T_g can be rewritten in matrix form as (15), where λ_{i,m,n} is the N_c × N_c time-domain channel matrix for the ith symbol between the mth transmit and nth receive antenna, N_{i,n} is the N_c × 1 noise matrix, and F is the IFFT operation. However, for τ_{m,n,L} > T_g, (15) is rewritten as (16), where λ_{isi,i−1,m,n} and λ_{ici,i,m,n} denote the ISI and ICI channel matrices for the (i − 1)th and ith symbols between the mth transmit and nth receive antenna, respectively. Observing (16), since the received signal R_{i,n} contains ISI and ICI terms, equalization of both is necessary.

ISI and ICI Equalization.
For τ_{m,n,L} > T_g, since the first data symbol of R_{i,n} does not contain ISI [16], (16) for i = 0 reduces to (17). The data symbols for i > 0, however, do contain ISI. The ISI equalization is performed by using the previously detected symbol D_{i−1,m,n} and the estimated ISI channel matrix λ̂_{isi,m,n}, which consists of the estimated channel impulse responses. The estimated channel impulse response ĥ_{m,n,l}(t)δ(τ − τ_{m,n,l}) is obtained from the channel response Ĥ_{m,n}(k) after the IFFT operation, for the paths whose delays exceed the GI. The ISI-equalized signal R̃_{i,n} is then given by (18), where Ñ_{i,n} is the noise term including the residual ISI. After the FFT operation, maximum-likelihood detection (MLD) yields the detected signal D̂_i as in (19), where (·)^T is the transpose operation, N is the number of receive antennas, D_{i,m} ranges over the constellation of symbol-replica candidates, C is the modulation level, and H_{m,n} is the matrix form of H_{m,n}(k). From (19), ISI is eliminated from the received signal; however, the subcarrier orthogonality remains destroyed, since the ISI compensation does not restore the cyclicity of the symbol. If ICI is also eliminated from (19), the detected signal D̂_{i,m} can be detected more accurately.

Replica Signal Insertion Based on ICI Equalization.
To eliminate the ICI, we reconstruct the orthogonality by inserting a replica of the detected signal after the ISI compensation. The ICI-equalized signal Ȓ_{i,n}, obtained by re-inserting the part of the signal removed together with the ISI, using the previously detected symbol D̂_{i,m}, is given by (20), where λ̂_{ici,m,n} is the estimated ICI channel matrix and N̆_{i,n} is the noise term including the residual ISI and residual ICI. λ̂_{ici,m,n} consists of the estimated channel impulse responses ĥ_{m,n,l}(t)δ(τ − τ_{m,n,l}). After the FFT operation, the detected signal D̆_i is obtained as in (21). Comparing (19) and (21), since the ICI is eliminated, D̆_{i,m} is detected more accurately than D̂_{i,m}.
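The two steps, ISI subtraction as in (18) followed by cyclic replica re-insertion as in (20), can be demonstrated end to end on a single-antenna toy link with perfect channel knowledge and correct previous decisions (both idealizing assumptions): after the compensation, the FFT output is exactly the circular-channel response H(k)·d(k, i), i.e. orthogonality is restored.

```python
import numpy as np

rng = np.random.default_rng(0)
N, G = 8, 2                              # subcarriers and GI length
h = np.array([1.0, 0.0, 0.0, 0.5])      # tap at delay 3 > G: GI too short
Lc = len(h)

# two OFDM symbols, QPSK in the frequency domain
d = (rng.integers(0, 2, (2, N)) * 2 - 1) \
    + 1j * (rng.integers(0, 2, (2, N)) * 2 - 1)
u = np.fft.ifft(d, axis=1)                              # time-domain blocks
tx = np.concatenate([np.r_[ui[-G:], ui] for ui in u])   # prepend GIs

r = np.convolve(tx, h)                   # noiseless linear channel
w = r[(N + G) + G : (N + G) + G + N]     # symbol-1 window after GI removal

wc = w.copy()
for m in range(N):
    for l in range(Lc):
        if l > m + G:                            # tap reaches past the GI
            wc[m] -= h[l] * u[0][N + G + m - l]  # (18): subtract symbol-0 ISI
            wc[m] += h[l] * u[1][(m - l) % N]    # (20): re-insert cyclic replica

H = np.fft.fft(h, N)
rec = np.fft.fft(wc) / H                 # one-tap equalization per subcarrier
```

Without the two correction terms, the FFT of w would contain both ISI from symbol 0 and ICI from the broken cyclicity; with them, wc is an exact circular convolution and rec recovers the transmitted QPSK symbols.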

Complexity Comparison.
Here, we compare the complexity of the conventional and proposed methods. First, consider [14], which proposed ISI and ICI compensation using turbo equalization. From Table 1 of [14], its complexity is given by (22), where N_s is the number of sampling points in the GI plus the effective symbol. Next, consider the proposed method. The complexity of (18) is given by (23), whose first term is the complexity of the MLD and whose second term is the complexity of the IFFT. The complexity of (20) is given by (24), whose first and second terms are the MLD and IFFT complexities of (23), and the complexity of (21) is given by (25). The total complexity of the proposed method is therefore (26). For example, when M = N = 2, N_c = 64, N_d = 20, N_s = 80, L = 2, and C = 4, the complexities of [14] and the proposed method are 3000320 and 130560 from (22) and (26), respectively; the proposed method is thus far less complex than [14].

Computer Simulation Results
In this section, we show the performance of the proposed method. Figure 1 shows the simulation model. We assume 2 × 2 and 4 × 4 MIMO systems. At the transmitter, the pilot signal is assigned to each transmit antenna using (8); the proposed system can then multiplex the same impulse responses from the transmit antenna elements, at each receive antenna element, ζ times in the time domain. This has been found to be efficient for OFDM transmission over frequency-selective fading channels. After serial-to-parallel (S/P) conversion, the coded bits are QPSK modulated, and the pilot and data signals are multiplexed and scrambled with a PN code to reduce the PAPR. The OFDM time signal is generated by the IFFT operation and, after insertion of the cyclic extension, is transmitted over the frequency-selective, time-variant broadband radio channel. In this simulation, the OFDM symbol period is 5 μs and the guard interval is 1 μs (ε = 4), with L = 2 and 3 path Rayleigh fading; the maximum Doppler frequency is 10 Hz. At the receiver, the guard interval is removed and the received signal is S/P converted. The parallel sequences are passed to the FFT operator, which converts the signal back to the frequency domain. After descrambling and the IFFT operation, the impulse response for every combination of transmit and receive antenna elements is estimated by extracting and averaging the ζ impulse-response copies using the time windows of (14). When τ_{m,n,L} > T_g, the HTRCI-MIMO/OFDM pilot signal contains overlapped channel impulse responses from a different time window; the pilot signal within the GI, however, does not.
Using this GI pilot signal, the channel impulse responses of the different time window are eliminated from the overlapped channel impulse responses, as shown in Figure 2(b). The frequency-domain data signal is detected and demodulated with the MLD algorithm. Since the detected data signal contains ISI and ICI, equalization of both is necessary. The ISI equalization is performed with the previously detected symbol and the estimated ISI channel matrix as in (18), which eliminates the ISI; however, the orthogonality remains destroyed. To reconstruct the orthogonality, the ICI equalization is performed by replica-signal insertion as in (20), and the data signal is finally detected as in (21). The packet consists of N_p = 1 or 2 pilot symbols and N_d = 20 data symbols; Table 1 shows the simulation parameters. Figures 3 and 4 show the BER of the conventional and proposed methods for the 2 × 2 and 4 × 4 MIMO systems at a Doppler frequency of 10 Hz; the maximum delay spread is 4 and 16 in the 2 × 2 and 4 × 4 systems, respectively. From the simulation results, without ISI and ICI compensation the BER is increased about 22 and 300 times over the no-ISI/ICI case in the 2 × 2 and 4 × 4 systems, respectively. The conventional method shows an error floor at high E_b/N_0, because residual ISI and ICI remain. The proposed method shows approximately the same BER performance as the no-ISI/ICI case, because the channel impulse responses of the different time window are eliminated by using the channel impulse responses of the GI. From the simulation results, the 2-path model shows better BER performance than the 3-path model at low E_b/N_0.
This means that the CSI degrades owing to the processing that eliminates the overlapped channel impulse responses in the 3-path model. Figures 5 and 6 show the BER of the proposed method for the 2 × 2 and 4 × 4 MIMO systems at a Doppler frequency of 10 Hz, with maximum delay spreads of 4 and 16, respectively. For the proposed method without noise, no noise is added to the channel impulse responses of the GI. From the simulation results, the 2- and 3-path models without noise show approximately the same BER performance; the 3-path model with noise therefore degrades because of the noise in the GI. In the 2 × 2 MIMO system the BER is increased about 5 times, whereas the BER of the proposed method is increased only about 2 times, so the proposed method can suppress the ISI and ICI caused by the change in the maximum delay spread. In the 4 × 4 MIMO system, the BER without ISI and ICI compensation and with the conventional method is increased about 170 and 80 times, whereas the BER of the proposed method is increased about 4 times; the proposed method thus suppresses the ISI and ICI as in the 2 × 2 system. Figure 9 shows the throughput of the conventional and proposed methods for the 2 × 2 and 4 × 4 MIMO systems at a Doppler frequency of 10 Hz, with maximum delay spreads of 4 and 16, respectively. The throughput T_tp is given by (27), where R is the coding rate and P_per is the packet error rate (PER). Extending the GI length is a simple way to avoid the large-delay-spread problem [13], but the throughput is degraded with this method, because the number of carriers N_c in (27) decreases when the OFDM symbol length T is kept the same as in the no-ISI/ICI case. In this simulation, N_c decreases from 64 to 48, and the GI length T_g increases from 16 to 32.
On the other hand, the proposed method maintains the number of carriers N_c. From (27), the maximum throughput in the no-ISI/ICI case is about 48.8 and 93.1 Mbps in the 2 × 2 and 4 × 4 MIMO systems, respectively. The proposed method therefore shows the best throughput performance and achieves the same maximum throughput as the no-ISI/ICI case.
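The quoted maxima can be reproduced from (27) under stated assumptions: counting M · N_c · log2(C) · N_d bits per packet over a packet duration of (N_d + N_p) · T with P_per = 0 matches 48.8 and 93.1 Mbps; the effective R = 1 and the exact prefactor are our reconstruction, not spelled out in the source.

```python
from math import log2

def throughput_mbps(M, Nc, C, Nd, Np, T, per=0.0, R=1.0):
    # hedged reconstruction of (27): packet payload bits, derated by (1 - PER),
    # divided by the packet duration (Nd + Np) * T
    return M * Nc * log2(C) * Nd * R * (1 - per) / ((Nd + Np) * T) / 1e6

t22 = throughput_mbps(M=2, Nc=64, C=4, Nd=20, Np=1, T=5e-6)   # 2x2, QPSK
t44 = throughput_mbps(M=4, Nc=64, C=4, Nd=20, Np=2, T=5e-6)   # 4x4, QPSK
```

The GI-extension alternative of [13] shrinks N_c from 64 to 48 at fixed T, which in this formula cuts the ceiling by a factor of 48/64 regardless of the PER.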

Conclusion
In this paper, we have focused on large-delay-spread channels and proposed ISI and ICI compensation methods for HTRCI-MIMO/OFDM. The proposed method performs time-domain ISI compensation together with replica-signal-based ICI compensation. The simulation results show that the proposed method achieves approximately the same BER performance as the case with no ISI and ICI, suppresses the ISI and ICI caused by changes in the maximum delay spread, and attains the best throughput performance, achieving the same maximum throughput as the no-ISI/ICI case.

Introduction
Multiple-input multiple-output (MIMO) techniques in combination with high constellation orders have been identified as a promising approach to high-spectral-efficiency systems. Efficient detection is essential for the performance of spatial-multiplexing systems, and maximum-likelihood (ML) detection is the optimal method. Sphere detection (SD) methods [1][2][3] compute the ML solution by considering only the lattice points inside a sphere of a given radius. Because a soft-in soft-out error-correction-code (ECC) decoder [4] can achieve better error-correction performance than a hard ECC decoder, a soft-output MIMO detector, which cooperates better with ECC, is needed for coded communication systems. Soft-output MIMO detector algorithms such as the list sphere decoder (LSD) [5] and the soft single-tree-search sphere decoder (STS-SD) [6] offer quasi-optimal performance with ECC. However, LSDs have huge complexity when the list is extended, and the STS-SD has variable throughput as the channel condition changes. The high computational intensity and the variable throughput of these iterative methods prevent practical implementations from meeting requirements on chip area, latency, and power consumption. The fixed-complexity sphere decoder (FSD) algorithm [7] is another practical solution to MIMO detection. The FSD is similar to SD but with different search criteria: it performs a fixed tree search, visiting all nodes in the top level and just one node in each other level, to simplify the tree search of SD. The complexity of the FSD can therefore be lower than that of SD. The FSD algorithm offers quasi-ML performance and fixed complexity in an uncoded system; however, it is not compatible with powerful soft-input ECC decoders and needs some modifications. The work in [8] provides a good reference.
In this paper, we reduce the number of visited nodes through a different node-visiting distribution and find the local minima needed for soft output, trading off performance against complexity with a modified FSD scheme. Furthermore, a novel method to simplify the Schnorr-Euchner (SE) enumeration [9] is proposed for the FSD, so the proposed FSD does not need to sort all points of the constellation.

To achieve more power saving, a full-pipelined parallel architecture is proposed according to the modification of the algorithm. The parallel design and adaptive voltage scaling technique can provide a power-efficient ASIC solution.
In this paper, a robust and power-efficient solution for spatial-multiplexing MIMO detection is proposed. The proposed soft MIMO detector can provide LLR output to an ECC decoder, enabling a turbo-MIMO solution, and has the following features. (b) Soft-Output Performance. With a modified tree-search algorithm, the proposed MIMO detector provides soft-valued outputs and is compatible with soft-in-soft-out ECC decoders to attain enhanced detection performance.
(c) Parallel and Pipelined Architecture. Iterative sphere decoding methods hinder efficient hardware implementation. The proposed soft-output fixed-complexity sphere decoder (SFSD) retains the advantages of the fixed-complexity sphere decoder (FSD) [7] and can therefore be implemented as a parallel, fully pipelined design to increase hardware efficiency.
(d) Fixed Throughput. General soft-output sphere decoding solutions provide only variable throughput, which makes imperfect use of hardware resources owing to their sequential nature. The SFSD provides fixed throughput and achieves better performance than usual sphere decoders at a fixed throughput. The maximum throughput of the proposed decoder is 120 Mbps.
(e) Error-Recovered Adaptive Voltage Scaling. A novel adaptive voltage scaling method is applied to the detector to reduce power dissipation. With double-sampling circuitry, a timing error is detected and recovered at run time, so an optimal voltage can be reached while keeping the processing free of functional violations. With these techniques, a 48.8% power saving is achieved.

(f) Configurable, Complexity-Efficient, and Power-Efficient Hardware Implementation. The configurable ASIC provides fixed throughput of 45 Mbps in the 64-QAM configuration, 120 Mbps in the 16-QAM configuration, and 60 Mbps in the QPSK configuration. The normalized power efficiency is 1.56 Mbps/mW and 2.53 Mbps/mW for the 64-QAM and 16-QAM configurations, respectively, and the complexity efficiency is 1.19 Mbps/K-gate for the 16-QAM configuration.
The remainder of the paper is organized as follows. Section 2 reviews the conventional sphere decoding algorithm. The proposed SFSD algorithm with related simulation results is introduced in Section 3. Section 4 presents the logic design. Section 5 reports the hardware implementation. Finally the paper is concluded in Section 6.

Conventional Sphere Decoding Algorithm
Let us consider a MIMO system with M transmit and N receive antennas (N ≥ M). M streams of Q bits of data, x_{j,b}, j = 1, 2, . . . , M, b = 1, 2, . . . , Q, are mapped to an M-dimensional transmitted symbol vector s = [s_1 s_2 · · · s_M]^T using 2^Q-QAM modulation. The received complex vector y is given by (1), where H is an N × M channel matrix, assumed to be known in advance, and n is the complex Gaussian noise vector.
The a posteriori log-likelihood ratio (LLR) of the bit x_{j,b}, conditioned on the received symbol vector y, provides soft information for the soft-in-soft-out detector and can be expressed as (2). By Bayes' theorem, the max-log approximation, and the proofs in [5,6], (2) can be rewritten as (3), where S_{j,b}^{(−1)} and S_{j,b}^{(+1)} are the search spaces {s | x_{j,b} = −1} and {s | x_{j,b} = +1}, respectively. The maximum-likelihood (ML) minimum in (3) is expressed as (4), where C^M is the set of constellation symbols in the M-dimensional complex space. Letting x̄^{ML}_{j,b} be the binary complement of the bth bit in the jth data symbol of s^{ML}, the other minima in (3) can be expressed as (5). By means of the QR decomposition of the channel matrix, H = QR, (4) and (5) can be reformulated as (6) and (7), respectively, where ỹ = Q^H y = Rs + ñ, Q is an N × M matrix with orthogonal unit-norm columns, R is an M × M upper-triangular matrix, and (·)^H denotes the Hermitian operator. Note that the noise term ñ = Q^H n keeps the same statistical properties, as Q is a unitary matrix [1]. In addition, sorted QR decomposition (SQRD) [10] is applied, and the Euclidean distance (ED) ‖ỹ − Rs‖² in (6) and (7) is computed recursively as in (8), where ỹ_i and R_{i,j} are elements of ỹ and R. Lattice points with partial Euclidean distance (PED) d_i(s^{(i)}) greater than the square of a given radius r are invalid; the candidate search is therefore confined to lattice points within a sphere, reducing the detection complexity. Nevertheless, the sequential search property and the variable computational complexity prevent the conventional SD from efficient hardware design.
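The recursive ED computation of (8) can be sketched as follows; this is a minimal numpy example with randomly drawn H, s, and noise as stand-ins, verifying that the accumulated PEDs reproduce ‖y − Hs‖² exactly (Q is square and unitary here since N = M).

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4
H = rng.normal(size=(M, M)) + 1j * rng.normal(size=(M, M))
s = rng.choice([-3, -1, 1, 3], M) + 1j * rng.choice([-3, -1, 1, 3], M)  # 16-QAM
y = H @ s + 0.1 * (rng.normal(size=M) + 1j * rng.normal(size=M))

Q, R = np.linalg.qr(H)
yt = Q.conj().T @ y                       # y~ = Q^H y

# d_i = d_{i+1} + |y~_i - sum_{j>=i} R_ij s_j|^2, accumulated from i = M-1 down
d = 0.0
for i in range(M - 1, -1, -1):
    d += abs(yt[i] - R[i, i:] @ s[i:]) ** 2
```

Because R is upper triangular, the increment at level i depends only on s_i, …, s_M, which is what allows the tree search to prune a whole subtree once a PED exceeds r².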

Proposed Soft-Output FSD (SFSD)
3.1. Algorithm Description. The FSD algorithm [7] is another solution to MIMO detection; it is similar to SD but has two major differences.
(i) A fixed tree search is performed: the FSD visits all nodes in the top level and only one node in each of the other levels, simplifying the tree search of SD.
(ii) The channel matrix ordering [11] is modified so that the column of the channel matrix H with the smallest norm is assigned to the first level of the tree.
The FSD algorithm presents quasi-ML performance and fixed complexity in an uncoded system. However, the FSD is not compatible with powerful soft-input ECC decoders.
For soft-output MIMO detection, the quasi-ML solution in (6) and the local minima in (7) are essential to calculate the LLR. Therefore, the proposed SFSD modifies the FSD to also find the local minima in (7) for soft output. Accordingly, a soft-output search requires more branches than the hard-output FSD. Monte Carlo simulations of the number of visited branches for the SD and FSD algorithms were performed. Tables 1 and 2 show the mean and variance of the number of branches for a visited node at each layer in a 4 × 4 MIMO system using 16-QAM and 64-QAM, respectively. The distribution helps the SFSD fix the number of visited branches n_i per level, as illustrated in Figure 1. The means in Tables 1 and 2 are in descending order from the third level to the first level, which suggests the property

n_1 ≤ n_2 ≤ n_3 ≤ n_4.    (9)

The SFSD branch distribution can thus be fixed to, for example, n = [2, 3, 4, P]^T or n = [3, 3, 3, P]^T, where P equals the number of constellation points (2^Q), or to other distributions that follow (9). Figure 2 shows the FER performance of different SFSD distributions with LLR maximum L_max = 16 in a 4 × 4 MIMO 16-QAM system. The simulated distributions n = [n_1, n_2, n_3, n_4]^T in the figure are selected based on (9) and the mean/variance of the visited branches in Tables 1 and 2. SD can be considered the optimal performance here. The number of branches per node affects the performance of the SFSD: more branches directly yield better performance, but a too-large distribution n increases the tree-search complexity. Table 3 presents the total visited nodes for different SFSD distributions with 16-QAM modulation. The distribution [1, 1, 1, 16]^T is the one used in the hard-output FSD algorithm. The total visited nodes represent the computational complexity of the SFSD. The n = [1, 2, 2, 16]^T SFSD needs the least branch expansion and shows only 0.5-dB performance degradation compared to SD.
The n = [2, 3, 4, 16]^T SFSD has better performance, degrading only about 0.1 dB, but has the highest complexity; it is a tradeoff between performance and complexity. Therefore, the low-complexity branch distribution n = [1, 2, 2, 2^Q]^T is adopted for the tree search in the proposed SFSD and its fully pipelined, parallel hardware architecture. The top layer, n_4, is set to the number of constellation points.
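The fixed tree search with a branch distribution n = [n_1, n_2, n_3, n_4]^T can be sketched as below. This is an illustrative model only: it uses QPSK instead of 16-QAM and omits the SQRD/FSD column ordering, but it shows how the per-layer branch counts fix the number of candidate paths:

```python
import numpy as np

rng = np.random.default_rng(1)
C = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])   # QPSK, for brevity
M = 4
H = (rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))) / np.sqrt(2)
s_true = rng.choice(C, M)
y = H @ s_true + 0.01 * (rng.standard_normal(M) + 1j * rng.standard_normal(M))

Q, R = np.linalg.qr(H)
y_p = Q.conj().T @ y

# Branch distribution n = [n_1, n_2, n_3, n_4]^T: the top layer
# (detected first, index M-1) expands every constellation point.
n_dist = [1, 2, 2, len(C)]

def sfsd(y_p, R, n_dist):
    paths = [(0.0, [])]                     # (accumulated PED, symbols so far)
    for i in range(M - 1, -1, -1):          # layer M-1 (top) down to layer 0
        new_paths = []
        for d, part in paths:
            s_tail = np.array(part[::-1])   # symbols of layers i+1 .. M-1
            interf = R[i, i + 1:] @ s_tail if part else 0.0
            incs = np.abs(y_p[i] - R[i, i] * C - interf) ** 2
            # Keep only the n_i best branches at this layer.
            for k in np.argsort(incs)[: n_dist[i]]:
                new_paths.append((d + incs[k], part + [C[k]]))
        paths = new_paths
    best = min(paths, key=lambda p: p[0])
    return np.array(best[1][::-1]), len(paths)

s_hat, leaves = sfsd(y_p, R, n_dist)
# Fixed complexity regardless of noise: 1 * 2 * 2 * 4 candidate paths.
assert leaves == int(np.prod(n_dist))
```

The candidate count is a constant determined by n alone, which is exactly the property that enables a fully pipelined hardware mapping.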
The usual sphere decoders adopt SE enumeration [9], which sorts all of the constellation points by their Euclidean distances from the received signal y_i. The point with the smallest distance has the first priority, and the others are enumerated in ascending order of distance. Because the proposed [1, 2, 2, P]^T SFSD only needs to enumerate the two closest constellation points, the SE enumeration can be simplified: the proposed SFSD does not need to sort all points. The simplified enumeration is presented next.
Only a few comparisons are needed for the SFSD to achieve the SE enumeration. Note that the PED computations of the real part and the imaginary part are equivalent to the PED computations of two points in the complex-valued constellation; the simplified SE enumeration therefore costs just two PED computations.

A sequential tree-search design has variable throughput due to the unpredictable number of visited nodes, which causes hardware inefficiency. Early termination (ET) [6] solves this problem. It confines the frame run time to N_frame · D_avg and lets every frame have the same delay during N_frame MIMO detections. The constraint on the number of visited nodes can be written as

Σ_{i=1}^{N_frame} D(i) ≤ N_frame · D_avg,    (10)

where D(i) denotes the number of visited nodes for the ith MIMO symbol vector.

The proposed SFSD is evaluated in terms of the number of visited parent nodes in the tree search and the frame-error-rate (FER) performance. All simulations are performed in a 4 × 4 MIMO system with Gray-mapped QAM constellations. Data are coded with a rate-R = 1/2 convolutional code with constraint length 7 and generator polynomials [133_o 171_o]. A soft-input Viterbi decoder (max-log BCJR algorithm [12]) is employed in the receiver. A frame consists of N_frame = 64 MIMO symbols; in other words, a frame consists of N_frame · M · Q randomly interleaved bits after outer encoding. Figure 4 illustrates the FER performance of the optimal SD, the STS-SD [6], and the proposed SFSD. At 2% FER, only a 0.5-dB loss is observed between the SFSD and the optimal SD. The performance gap between the SFSD and the STS-SD is even smaller.
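The idea of replacing a full SE sort by a few comparisons can be illustrated for a square 16-QAM grid: the nearest point follows from per-axis slicing, and the second-nearest is always one grid step along a single axis, so a couple of distance comparisons suffice. This is a sketch under that assumption, not the paper's exact circuit:

```python
import numpy as np

levels = np.array([-3, -1, 1, 3])   # per-axis 16-QAM amplitudes

def two_nearest(z):
    # Nearest constellation point: slice real and imaginary axes.
    re = levels[np.argmin(np.abs(levels - z.real))]
    im = levels[np.argmin(np.abs(levels - z.imag))]
    first = re + 1j * im
    # Second-nearest: one grid step (+-2) along one axis; a few
    # comparisons replace the exhaustive sort of SE enumeration.
    cands = []
    for d in (-2, 2):
        if re + d in levels:
            cands.append((re + d) + 1j * im)
        if im + d in levels:
            cands.append(re + 1j * (im + d))
    second = min(cands, key=lambda c: abs(c - z))
    return first, second

z = 0.4 - 1.2j
f, s = two_nearest(z)

# Brute-force check against sorting all 16 points.
all_pts = np.array([a + 1j * b for a in levels for b in levels])
order = all_pts[np.argsort(np.abs(all_pts - z))]
assert f == order[0] and s == order[1]
```

The full sort touches all 16 points; the shortcut touches at most four, matching the paper's claim that only two PED computations are needed.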

Simulation
Recursive computation of the PED as in (8) for the tree search leads to considerable computational complexity and latency. Hence, the number of visited parent nodes in the tree search is an indicator of search complexity and latency. Figure 5 compares the average number of visited parent nodes of the STS-SD algorithm [6] and the proposed SFSD. The proposed method reduces the average number of visited parent nodes by about 20%, demonstrating the low-complexity advantage of the SFSD over the STS-SD.

Logic Design
The proposed MIMO detector, shown in Figure 6, consists of a preprocessing block and an SFSD block. The preprocessing block performs sorted QR decomposition (SQRD) and FSD ordering. The matrix R produced by the SQRD consists of real-valued diagonal elements, R_r, and complex-valued off-diagonal elements, R_c. The computation of y′ = Q^H · y is also performed in this block. FSD ordering chooses the column with the second smallest column norm in each iteration of the SQRD algorithm instead of the smallest one. The column of the channel matrix with the smallest norm is assigned to the top level of the tree; the remaining columns are ordered in decreasing norm from right to left. In other words, the level just below the top has the greatest column norm, and the leaf level has the second smallest. The SFSD block implements the proposed SFSD algorithm and adopts a parallel design for high throughput and a full-pipeline design for a high clock rate. Figure 7 shows the detailed block diagram of the proposed SFSD. The SFSD block consists of several PED modules (PED4, PED3, PED2, and PED1), a list administration unit (LAU), and an LLR module. The PED4, PED3, PED2, and PED1 modules calculate the PEDs for layers 4, 3, 2, and 1, respectively. The LAU compares and sorts the Euclidean distances (EDs) accumulated by the PED modules. After 2^Q sequential update checks (2^Q is the constellation size), the LAU outputs the calculated local minima λ for each layer and the ML solution x^ML (Figure 7). The LLR computation module subtracts the smallest distance value λ^ML from each local minimum distance λ_k, which corresponds to the local minimum with the binary complement of the kth bit of x^ML, and computes each LLR value. The order of the LLR values differs from the order of the original input data because the SQRD preprocessing reorders the matrix R.
Hence, the order recovery is also implemented in the LLR computation module.

Adaptive Voltage Scaling with Double Sampling. Dynamic voltage scaling (DVS) [13] is a common technique to adjust the system voltage and reduce power consumption. Conventional methods are pessimistic designs that require large safety margins [14]. Instead of using a guard band derived from a safety margin, Razor [15, 16] proposed an in situ timing-error detection and correction mechanism, which overcomes PVT variation and significantly reduces power consumption. In Razor-based DVS, each flip-flop on the critical path is augmented with a shadow latch triggered by a delayed clock. By comparing the data sampled by the main flip-flop and the shadow latch, a timing error can be detected, and the corrupted value is replaced with the correct value held in the shadow latch. An adaptive voltage scaling scheme with double sampling is applied to the proposed SFSD. As shown in Figure 8, the main flop is clocked by clk and the delayed flop is clocked by clk_del, which is delayed with respect to clk. The delay between the clock edges of clk and clk_del is designed such that the correct value is captured at the next edge of clk_del, even in the presence of unpredictable wire delay. Therefore, the delayed flop operates in an error-free manner. An XOR gate compares the data captured by the main flop and the delayed flop. When the outputs of the two flops differ, the signal errq is asserted to indicate a timing error and switch the main flop into error mode. The correct value captured by the delayed flop is then emitted by the main flop in the next cycle, incurring a one-cycle penalty for error recovery. Figure 9 shows the timing diagram of the error detection and recovery.
At each pipeline stage, an error-control circuit, presented in Figure 10, is added to generate the proper control signals when an error is detected. Let the number of bit lines in the pipeline be w. The XOR outputs (errq signals) generated at all w bit lines of each pipeline stage are ORed together and fed to the error-control circuit. In the error OR-tree, the error signals from all pipeline flip-flops are ORed together to generate a unified error signal. If the design contains many double-sampling flip-flops, the fan-in of the OR gate can become very large, requiring the OR tree to be pipelined. For error stabilization, the error signal may in some cases become metastable, so two flip-flops are added at the global error output to overcome this problem.

Modified Cell-Based Flow. The timing paths of each submodule are analyzed after top-down synthesis. The critical paths of the target design are in the PED1 and PED2 modules. Therefore, double-sampling flip-flops are inserted into the critical paths, and all the error signals on each double-sampling pipeline stage are ORed together. Figure 11 shows the block diagram of the SFSD after double-sampling pipeline insertion.

Hardware Implementation
The proposed DVS soft-output MIMO detector, implemented in TSMC 0.18-μm single-poly six-metal (1P6M) CMOS technology, is shown in Figure 12. Table 4 lists the ASIC specifications; all power and throughput figures in Table 4 are measured at 1.8 V and 120 MHz. This design surpasses previous work [6, 17] in power efficiency. A comparison (with the proposed design's postlayout simulation results) is given in Table 5, where the power efficiency is normalized using (11) to mitigate the impact of different process factors and throughputs. When normalizing the power efficiency with (11), the nominal V_DD of the proposed design, 1.8 V, is used instead of the scaled voltage, to show the power-saving effect of the applied voltage-scaling technique.
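The normalization formula (11) itself is not reproduced in the text above; a common form in detector comparisons scales power by throughput and by process/voltage factors (assumed here: linear in feature size, quadratic in supply voltage). A hedged sketch with illustrative numbers, not the figures of Table 5:

```python
# Hypothetical normalization, NOT the paper's exact equation (11):
# power is scaled linearly by feature size and quadratically by V_DD,
# then divided by throughput (mW per Mbps).
def normalized_power_efficiency(power_mw, throughput_mbps,
                                tech_um, vdd,
                                ref_tech_um=0.18, ref_vdd=1.8):
    scale = (ref_tech_um / tech_um) * (ref_vdd / vdd) ** 2
    return power_mw * scale / throughput_mbps

# Illustrative numbers only (a 0.18-um, 1.8-V design needs no scaling).
e = normalized_power_efficiency(200.0, 120.0, 0.18, 1.8)
assert abs(e - 200.0 / 120.0) < 1e-12
```

The quadratic V_DD term reflects dynamic power's P ∝ C·V²·f dependence, which is why the nominal 1.8 V, rather than the scaled voltage, is used when crediting the voltage-scaling technique.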
By scaling down the supply voltage, the power consumption of the design can be reduced, as shown in Figure 13. The lowest supply voltage for error-free operation at a 100-MHz clock rate is 1.44 V for 64-QAM and 1.26 V for 16-QAM, leading to 34.7% and 49.22% power reductions, respectively. When the voltage scales to 1.26 V for 64-QAM or 1.08 V for 16-QAM, timing errors occur on the critical path, which is pipelined with the double-sampling circuitry. The duplicate circuitry (the delayed flip-flop) is triggered by the error signal and recovers the critical path from the timing error. As the supply voltage is scaled down further, some subcritical paths may suffer timing errors that cannot be recovered. When the supply voltage decreases to 1.08 V for 64-QAM modulation, the design fails due to incorrect data transferred over subcritical paths. With the proposed adaptive voltage scaling with double sampling, the supply voltage can scale down to 1.08 V and 1.26 V, and the power reduction approaches 48.8% and 61.25%, for 16-QAM and 64-QAM modulation, respectively. Figure 14 shows the maximum clock rate and the current of the design for 16-QAM modulation, and Figure 15 shows the power-consumption ratio of each module. Using the error-monitoring circuit, timing-error information is reported to the voltage controller at run time to command the voltage regulator to power the system up or down. If the supply voltage is inadequate for the required timing of the critical paths, the power controller sends a decision bit to indirectly control the voltage source provided by a DC-DC converter. These mechanisms complete the power management of the platform. The double-sampling scheme functions reliably, provides timing-error detection and recovery, and yields a robust DVS platform.

Conclusion
A power-efficient SFSD for MIMO detection has been presented. The configurable SFSD, with a novel tree-search algorithm, achieves soft-output decoding at a fixed throughput of 120 Mbps with 16-QAM modulation. The FER of the detector approaches that of the optimal sphere decoder, with only 0.5-dB degradation. A novel adaptive voltage-scaling method with double-sampling circuitry provides error detection/recovery and a 48.8% power saving.

Introduction
Over the last two decades, multicarrier modulation has received considerable interest for its use in wireless and wireline communication systems [1][2][3][4]. It has been adopted in many communication standards, including digital audio broadcasting (DAB) [5], digital video broadcasting (DVB) [6], high-speed modems over digital subscriber lines (xDSLs) [4], and local area mobile wireless broadband [7].
Most multicarrier systems use coherent detection of data symbols, which requires reliable channel estimation at the receiver. Channel state information is also necessary for techniques such as channel shortening [8], adaptive modulation/loading, and power control [9]. In applications such as discrete multitone (DMT) xDSL [4], the channel is estimated through an initial training process, and retraining is required to track channel variation. To avoid the system overhead of retraining and thus track the channel more efficiently, a correlation-matrix-based block recursive least-squares (CMB-RLS) algorithm is proposed in [10]. The algorithm exploits the inherent redundancy introduced by the cyclic prefix (CP) to blindly estimate the channel. In [11], the performance of the algorithm is analyzed considering both channel noise and decision errors. The algorithm is further explored in [12], where its performance is analyzed with respect to the exponential forgetting factor, the constellation size, and channel nulls. In [13], the method is applied to single-carrier (SC) modulation with frequency-domain equalization (FDE) to maintain both system performance and throughput.
While the CP-based CMB-RLS approach is standard compliant, two basic problems make it unsuitable for real-time implementation. First, it relies on computing the inverse of the correlation matrix Φ at every time update. The computational cost of performing this matrix inversion in real time can be prohibitively high for a system with a large channel length. (To reduce the computational complexity and thus the processing power, this inversion cannot be carried out recursively using the matrix inversion lemma, as in the conventional RLS (CRLS) algorithm [14].) Second, both direct and recursive inversion are known to severely limit the parallelism and pipelining that can be applied in a practical implementation.
Usually, to minimize round-off error, matrix inversions are performed on general-purpose digital signal processing (DSP) devices using floating-point arithmetic. A disadvantage of this approach, however, is a severe processing-power limitation due to the small number of floating-point units commonly available per device. Specialized hardware with high processing power is therefore required to execute the requisite computations in real time. An appealing alternative is not to perform this inversion explicitly, but to solve the problem through a computationally cheaper approach that works directly with the data matrix and is realizable on a systolic-array architecture, offering large amounts of parallelism for high-speed very-large-scale integration (VLSI) implementation. In VLSI, floating-point arithmetic units are more complex than fixed-point ones, involving extra hardware overhead and more clock cycles [15]. Hence, the bit-level systolic architecture must be implemented with fixed-point arithmetic.
QR-decomposition (QRD) approaches to the RLS problem have played an important role in adaptive signal processing, adaptive equalization, and adaptive spectrum estimation [16]. It is generally agreed that QRD-RLS algorithms are among the most promising RLS algorithms, owing to their numerical stability [17, 18] and suitability for VLSI implementation [19, 20]. There are three approaches to the QRD-RLS problem: the Givens rotation (GR), the modified Gram-Schmidt (MGS), and the Householder transformation (HT) method. These methods have been successfully applied to the development of QRD-RLS systolic arrays [16, 21-24]. Because HT generally outperforms the GR and MGS methods under finite-precision computation (see the references in [16]), and because in our application the channel must be updated for each block of input data, we focus on the QRD-RLS algorithm based on block HT. Note that HT is a well-known rank-v update approach and one of the most efficient methods for computing the QRD. (Rank-1-updating fast QRD-RLS algorithms [14], where the QRD is updated after a row or column is added to or deleted from the data matrix, are not suitable here, in particular because of the high throughput—here the term throughput indicates the total number of data vectors at the input of the RLS algorithm—and speed requirements.) In [24, 25], Liu et al. investigated one such QRD-RLS algorithm using block HT. The work in [24] describes the block HT implementation on a systolic array and its application to RLS, called systolic block HT-RLS (SBHT-RLS). So far, SBHT-RLS has been used in beamforming and linear prediction applications but has not been applied to channel tracking in high-throughput multicarrier applications. The algorithm is well known for its computational efficiency, very good numerical properties, and parallel-processing advantages.
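The elementary HT step, reflecting a column onto e_1, can be sketched as follows (standard construction, not specific to [24]; the example assumes x_1 ≠ 0):

```python
import numpy as np

# Householder reflection T = I - 2 v v^T / (v^T v) with
# v = x + sign(x_1) ||x||_2 e_1 reflects x onto a multiple of e_1 --
# the elementary step used to zero out columns in (S)BHT-RLS updates.
def householder(x):
    v = x.astype(float).copy()
    v[0] += np.sign(x[0]) * np.linalg.norm(x)   # sign choice avoids cancellation
    return np.eye(len(x)) - 2.0 * np.outer(v, v) / (v @ v)

x = np.array([3.0, 4.0, 0.0])
T = householder(x)
y = T @ x

# All entries below the first are annihilated; the norm is preserved.
assert np.allclose(y[1:], 0.0)
assert np.isclose(abs(y[0]), np.linalg.norm(x))
assert np.allclose(T @ T.T, np.eye(3))   # T is orthogonal (and symmetric)
```

Applying one such reflection per column annihilates a whole block of rows at once, which is why HT is a natural rank-v (block) update, in contrast to the rank-1 Givens rotation.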
In this paper, we develop two new CP exploiting SBHT-RLS approaches for adaptive channel estimation in multicarrier systems.
The first approach is based on the SBHT-RLS approach of Liu et al. In its original form, SBHT-RLS does not provide access to the channel weights, as its use has been limited to problems seeking an estimate of the output error signal. In the context of our application, the proposed approach finds the channel explicitly. To differentiate between the two techniques, the new method will be referred to as the CP-based Direct SBHT-RLS approach. The proposed scheme is computationally efficient and can be mapped to triangular systolic arrays for efficient parallel implementation. Unfortunately, it suffers from a major drawback, namely back substitution, which is a costly operation to perform in an array structure [26, 27].
The second approach relies on inverse factorizations to calculate the least-squares channel coefficients (weight vector) without back substitution. This approach also employs the SBHT to recursively update the channel coefficients and thus preserves the inherent stability of the SBHT-RLS approach. The inverse factorization method is derived here by generalizing the Extended QRD-RLS algorithm [28] to the block-RLS case. For this reason, the method will be referred to as the CP-based Extended SBHT-RLS approach. We underscore that this simple and straightforward derivation differs from the previous, more challenging work on block RLS using inverse factorizations in [29, 30]. The computational complexity of this scheme is equivalent to that of the first scheme, but unlike the first scheme it is fully amenable to VLSI implementation and also yields improved steady-state performance.
For the sake of brevity, in the rest of this paper, we refer to the CP based CMB-RLS as CPE1, Direct SBHT-RLS as CPE2, and Extended SBHT-RLS as CPE3. Also, for uniformity, we closely follow the notation that appears in [10].
The paper is organized as follows. The next section provides an overview of the DMT system model [10]. Section 3 explains the newly proposed algorithms, followed by a discussion of their computational complexity and systolic-array implementation in Section 4. Illustrative floating- and fixed-point simulations are presented in Section 5, and conclusions are drawn in Section 6. Some results contained in this paper have been presented, or accepted for presentation, in [31, 32].
Notation. (·)^T, (·)^*, and E[·] denote transpose, complex conjugate, and expectation, respectively. The Matlab notation X(:, m : m′) denotes the submatrix of X containing columns m through m′, and x(n : n′) denotes the subvector of x comprising entries n through n′. I_n denotes the identity matrix of size n, 0 denotes the all-zeros matrix of appropriate dimensions, and j = √−1. The meaning of other variables will be clear from the context.

System Model
We consider a high-speed DMT data transmission system over digital subscriber lines, shown in Figure 1. The system has m/2 complex parallel subchannels and illustrates the typical CP-based adaptive channel estimation task that is our main concern in this paper. Let {s_n} represent the data sequence to be transmitted over the channel. The input data are buffered into blocks, and each data block is divided into m/2 bit streams and then mapped to quadrature amplitude modulation (QAM) constellation points. The relationship between the cyclic-prefix part y^(f)_k of the received signal and the transmitted signal is given by the CP data model of [10], referred to below as (1). After the FFT operation on y_k = [y_{0,k}, y_{1,k}, . . . , y_{m−1,k}]^T, the CP removes the interblock interference (IBI) between the X_k's; each received symbol Y_{i,k} is then the transmitted symbol X_{i,k} scaled by the ith subchannel gain, plus noise. To estimate X_{i,k} from Y_{i,k}, a one-tap minimum mean-square error (MMSE) equalizer, w_{i,k}, is employed on the ith subchannel, giving the estimate X̂_{i,k} = Y_{i,k} w_{i,k}. The decision is then made on X̂_{i,k} to obtain the final output X̃_{i,k} = q(X̂_{i,k}), where q(·) is the decision operation.
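The CP mechanism behind this model can be sketched end to end: with a CP at least as long as the channel memory, linear convolution becomes circular, each subcarrier sees a flat gain, and a one-tap equalizer (zero-forcing here; MMSE in the noisy case) recovers the data. Toy sizes are used, not the ADSL parameters of Section 5:

```python
import numpy as np

rng = np.random.default_rng(2)
m, v = 16, 4                      # FFT size and CP length (toy values)
h = np.array([1.0, 0.5, 0.25])    # channel impulse response, shorter than the CP

# One DMT block: map data to subcarriers, IFFT, prepend the cyclic prefix.
X = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], m)
x = np.fft.ifft(X)
tx = np.concatenate([x[-v:], x])  # CP = last v samples copied in front

# Channel + CP removal turn linear convolution into circular convolution.
rx = np.convolve(tx, h)[v : v + m]
Y = np.fft.fft(rx)

# With the CP, each subcarrier sees a flat gain H_i, so a one-tap
# equalizer per subchannel recovers X_i exactly in the noiseless case.
H = np.fft.fft(h, m)
X_hat = Y / H
assert np.allclose(X_hat, X)
```

The same flat per-subcarrier gains H_i are exactly what the adaptive schemes of Section 3 track, which is why a one-tap equalizer per subchannel suffices once the channel estimate is available.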

CP-Based Direct SBHT-RLS Algorithm (CPE2).
Based on the CP data model (1), we define the nv × r weighted data matrix Ȧ_k and the nv × 1 weighted received vector in a recursive manner, where Λ is an nv × nv block-diagonal forgetting matrix with forgetting factor 0 < λ ≤ 1 across blocks. The forgetting factor λ is incorporated in the scheme to avoid overflow in the processors as well as to facilitate nonstationary data updating [25]. Suppose that at the (k − 1)th update we have the QRD of Ȧ_{k−1}. An n × n HT matrix T of the form T = I − 2vv^T/(v^T v) reflects a vector, when multiplied by T, in the hyperplane defined by span{v}^⊥. Choosing v = x ± ‖x‖_2 e_1, where e_1 = [1, 0, 0, . . . , 0]^T, x is reflected onto e_1 by T as Tx = ∓‖x‖_2 e_1.
A series of HTs is then used to zero out A_k on the right-hand side of (8).
H^(i)_k denotes the ith HT matrix (which zeroes out the ith column of the updated A_k), where H^(i)_{k,11} is an r × r identity matrix except for the ith diagonal entry.

Input: y and Y_k.
Known parameters: Γ_i and σ_i.
Selected parameter: λ (with 0 < λ ≤ 1).
Initialization: k = 0; an initial training process initializes h_0, and R_0 = δI, where 0 < δ ≪ 1 is a small positive scalar.
The optimal solution is thus obtained by solving the upper-triangular system R_k h_k = u_k via back substitution. The matrix Ȧ_{k−1} can be uniquely QR-factorized only if it has full column rank (i.e., rank Ȧ_{k−1} = r); therefore, the number of rows of Ȧ_{k−1} must be at least as large as the number of columns. To satisfy this requirement, and thus to reduce the number of received blocks needed by CPE2 (and CPE3 in Section 3.2), an appropriate initialization is set in step (10). Based on the above discussion, the CPE2 algorithm is summarized in Table 1.
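The back substitution step that closes CPE2 can be sketched directly (generic code; the recursive construction of R_k and u_k is omitted):

```python
import numpy as np

# Back substitution for the upper-triangular system R h = u, the final
# (and, on a systolic array, costly) step of the Direct SBHT-RLS (CPE2).
def back_substitute(R, u):
    r = len(u)
    h = np.zeros(r)
    for i in range(r - 1, -1, -1):          # solve from the last row upward
        h[i] = (u[i] - R[i, i + 1:] @ h[i + 1:]) / R[i, i]
    return h

rng = np.random.default_rng(3)
r = 5
R = np.triu(rng.standard_normal((r, r)))
np.fill_diagonal(R, np.abs(np.diag(R)) + 1.0)   # keep the diagonal away from zero
h_true = rng.standard_normal(r)
u = R @ h_true
assert np.allclose(back_substitute(R, u), h_true)
```

Each coefficient h_i depends on all later ones, which is the strictly sequential dependency chain discussed in Section 4 that runs against the direction of the triangular update on the array.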

CP-Based Extended SBHT-RLS Algorithm (CPE3).
In this section, we propose an alternative approach by appending one more column to the matrices of the CPE2 algorithm. To simplify the derivation, we combine the first column of (10) and the new column to construct the formula. We next state a lemma, known as the matrix factorization lemma [33], which is a very elegant tool in the development of QRD-RLS algorithms.

Lemma 1. Let A and B be any two N × M (N ≤ M) matrices. Then A^T A = B^T B if and only if there exists an N × N unitary matrix Q (Q^T Q = I) such that QA = B.

Applying Lemma 1 to (14), we obtain (15)–(17). This shows that the computed R^{−T}_k is the correct inverse transpose of R^T_k and can be updated using the same orthonormal transformation Q_k.
Next, we combine the second column of (10) and the new column to construct the formula. Applying Lemma 1 to (18) then yields (19). From (19), we establish a simple recursion to compute the channel vector; this recursion can be written in component form as in (21), where w_{i,k} is the ith column of the matrix W^T_k. Based on the above discussion, CPE3 is formulated in Table 2.
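The forward direction of Lemma 1 is easy to check numerically, and it is the property the recursions above rely on (illustrative dimensions; real matrices for brevity):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 4, 6                       # N <= M, as required by the lemma

# Forward direction of the matrix factorization lemma: if B = Q A for a
# unitary Q, then A^T A = B^T B, so an orthonormal update leaves the
# normal equations untouched without ever forming them explicitly.
A = rng.standard_normal((N, M))
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))   # random orthogonal Q
B = Q @ A

assert np.allclose(A.T @ A, B.T @ B)
assert np.allclose(Q.T @ Q, np.eye(N))
```

This invariance is what lets CPE3 carry R^{−T}_k and the channel recursion through the same transformation Q_k that triangularizes the data matrix.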
Remarks. (i) Both algorithms are initialized in a training mode; they then switch to a decision-directed mode for channel tracking. Note that in step (1), based on the previous channel estimate h_{k−1}, the previous frequency response H_{k−1} = [H_{0,k−1}, . . . , H_{m−1,k−1}] is computed. In step (2), H_{k−1} is used to compute the equalizer coefficients. The decision-directed data vector X_k is computed in step (3). In step (4), the symbol estimates are projected onto the finite alphabet (FA), and the estimated transmitted CP data x^(f)_k are obtained by performing a partial FFT on the projected decision-directed samples X_k. In steps (5) through (7), the new channel estimate is obtained by treating the resulting symbol estimates as known symbols. This alternation between channel estimation and symbol estimation is applied repeatedly.
(ii) In [29], Sakai derived a method for extracting the weight coefficients based on the inverse factorization method of Pan and Plemmons [34] and Liu's SBHT-RLS algorithm. The time-updating formula for the channel coefficients is obtained by first generalizing the inverse factorizations to the block case and then deriving an update formula. That complicated and challenging derivation avoids matrix operations by exploiting the relation between the a priori and a posteriori error vectors. Based on [35], and as suggested by its author, Sakai also presented a simpler derivation for updating the channel vector in [30]. In contrast, the discussion above obtains the same result in a straightforward manner by generalizing the Extended QRD-RLS algorithm of Yang and Böhme [28] to the block-RLS case.

Computational Complexity and Systolic Array Implementation. The symbol-estimation part of the proposed algorithms (steps (1) through (4)) is particularly simple, and many efficient systolic-array architectures have been proposed for it. We therefore limit our discussion to possible implementation architectures for the channel-estimation part of the proposed algorithms.
The systolic-array implementation of the channel-estimation section of CPE2 and its processing cells are shown in Figure 2, where the adaptive-filtering triangular update (comprising step (6)) is realized on a triangular vectorial systolic array, as in [24], for R_k and u_k extraction. It consists of two sections: the upper triangular array (part (a) of Figure 2), which stores and updates R_k, and the right-hand column of cells (part (b) of Figure 2), which stores and updates u_k. The input data are fed from the top and propagate to the bottom of the array. The rotation angles are calculated in the left boundary cells and propagate from left to right. The R_k and u_k updates from step (6) are subsequently used in the linear bidirectional systolic-array section [19] (part (c) of Figure 2) to obtain the channel estimate by back substitution. Unfortunately, a critical obstruction appears: the triangular update runs from the upper-left corner to the lower-right corner of the array, while the back substitution runs in exactly the opposite direction. Pipelining of the two steps (the triangular update and the back substitution) therefore seems impossible on a triangular array. Back substitution may instead be implemented as a separate operation on a parallel two-dimensional array [36]; however, such an array can become quite large for long channels, requiring substantial area for VLSI implementation. On the other hand, the comparatively simpler linear array structure shown in Figure 2 is highly sequential and thus incurs more delay, since additional clock cycles are needed to compute the channel coefficients. For these reasons, back substitution in CPE2 is a costly operation.
The CPE3 approach involves a time-recursive QR solution to compute the channel vector h. The channel-estimation part of the CPE3 algorithm can be implemented on a fully pipelined rhombic systolic array obtained by combining a lower triangular array with an upper triangular array. This implementation was performed by Sakai in [29, 30] and is reproduced in Figure 3. The components of R_k are updated in the upper triangular part (a) of Figure 3. The components of u_k are updated in part (b) in the same fashion as the off-diagonal components of R_k, with the input data y_k entering from the top of this column and the output v_k leaving from the bottom. The systolic implementation in the upper section of part (c) is similar to that in part (a), except that the array is now lower triangular, each element is divided by β = √λ before updating, and the input to the array is a zero vector supplied from the top. A systolic array performing (20) is shown in the lower portion of part (c) of Figure 3, where the cells in the bottom line, shown as small circles, perform (20) to calculate the tap coefficients. Each column of the lower triangular array, whose cells are shown as diamonds, performs the R^{−T}_k updating; these cells also calculate each column of U^T_k, which emerges from the last diamond cell. Due to the absence of back substitution, the CPE3 algorithm is rich in parallel operations and therefore leads to a more efficient and simpler implementation on systolic processors.

Simulation Results
In this section, floating-point and fixed-point simulation results are presented to examine and compare the performance of the CPE1, CPE2, and CPE3 approaches. All simulations were carried out in a typical asymmetric digital subscriber line (ADSL) environment with perfect block synchronization, FFT size m = 512, CP length v = 32, λ = 0.75, δ = 1e−3, and a 4-QAM constellation, unless otherwise stated. For a fair comparison, for CPE1 we set the forgetting factor across blocks μ_1 = λ and the forgetting factor within blocks μ_2 = 1. The mismatch performance is evaluated by the averaged mean-square error (MSE) per subchannel, err = Σ_{i∈U} ‖X_i − X̂_i‖² / |U|, where U is the set of indexes of the used subchannels and |U| is the number of used subchannels [10]. The transmit power of all used subchannels is the same (i.e., σ²_i = σ²), and the noise power was set such that SNR = 30 dB (a typical value in ADSL environments).
The discrete channel impulse response with transfer function H0(D) for carrier serving area (CSA) loop #1 was obtained from the Matlab DMTTEQ Toolbox [37] and sampled at 2.208 MHz. For simulation purposes, a shorter channel was generated by subsampling. H0(D) was perturbed to obtain another test channel H1(D) (to mimic a small variation of H0(D)). The corresponding frequency responses of the two test channels are shown in Figure 4. Initially, the channel transfer function is H0(D), which remains unchanged for the first 400 data blocks. At data block 401, the channel is switched from H0(D) to H1(D). For all adaptive schemes, only the first DMT symbol was sent as a pure training sequence to identify the initial channel for fast convergence. The inverse of the correlation matrix in CPE1 is initialized to a constant multiple of the identity matrix.
Example 1. Figure 5 shows typical learning curves of the three algorithms for two values of the adaptation factor λ (top and bottom plots). It can be seen that all the schemes converge and can track the channel variation. The learning curves of CPE1 and CPE2 overlap, and both algorithms converge faster than CPE3. Compared to CPE3, the two algorithms are also seen to have more uneven performance. Although CPE3 converges more slowly, it demonstrates superior steady-state performance. A close examination of the CPE2 algorithm shows that the back-substitution operation involves decision-feedback computation of the channel coefficients. If a channel coefficient suffers from an error, this error weighs heavily in the estimation of the next and subsequent channel coefficients. The erroneous channel estimate causes the next detection error, and this decision error propagates further, causing subsequent decision errors. Consequently, CPE2 incurs a performance loss. In CPE3, in contrast, the channel is recursively updated without back substitution, and CPE3 is therefore seen to yield better performance.
A close observation of the top and bottom plots of Figure 5 also indicates that the convergence rate and steady-state performance of the three algorithms can be improved by lowering the value of λ. The price paid is more uneven performance, which can be reduced (and thus numerical stability improved) by increasing the data block size (i.e., by increasing the CP length), at the cost of increased system latency.
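The λ trade-off described above (faster tracking versus more uneven steady-state performance) can be illustrated with a simple scalar exponentially weighted estimator. This is only an analogue of the forgetting-factor mechanism, not the CPE algorithms themselves; all names and values here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def ew_track(lam, h_true, n_blocks=400, noise=0.1):
    """Exponentially weighted estimate of a constant parameter h_true.
    Effective memory ~ 1/(1 - lam): a small lam tracks quickly but is noisier."""
    h_est, errs = 0.0, []
    for _ in range(n_blocks):
        obs = h_true + noise * rng.standard_normal()
        h_est = lam * h_est + (1 - lam) * obs
        errs.append((h_est - h_true) ** 2)
    return np.array(errs)

fast = ew_track(lam=0.5, h_true=1.0)   # converges quickly, larger fluctuation
slow = ew_track(lam=0.95, h_true=1.0)  # converges slowly, smoother steady state
```

With these values, the low-λ estimator has smaller error during the first few blocks but a larger mean-square error in the steady state, mirroring the convergence/misadjustment trade-off noted above.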

Example 2.
Without giving a rigorous stability analysis, we verify the stability of the CPE1, CPE2, and CPE3 algorithms experimentally through a long simulation with 5 × 10³ data blocks (a considerably large number of samples). The corresponding results in Figure 6 show that none of the three algorithms exhibits any sign of divergence; all have very stable performance.
Example 3. The more complex the modulation alphabet, the narrower the gaps between the symbol decision regions and the higher the probability of error in detecting the signal [38].
Since the three algorithms rely on the FA property of the source symbols, substantial performance degradation is expected as the constellation size increases; the three algorithms may therefore not be suitable for rate adaptation. To verify this, in this simulation example we repeat Example 1 with 16-QAM and 64-QAM constellations. The corresponding simulation results in Figure 7 show that the three algorithms take the same number of data blocks to converge. However, as expected, their performance degrades as the constellation size is increased.
Example 4. Owing to their inherent parallelism and thus suitability for fixed-point VLSI implementation, in this section we examine the fixed-point performance of the CPE2 and CPE3 algorithms with 16-, 24-, and 32-bit data word lengths W_L for both data and channel coefficients. These W_L values are a reasonable selection, as such data lengths are suitable for many applications. Note that for this choice of W_L, the quantizer thresholds are sufficiently larger than the signal values involved in both algorithms; the quantizer therefore always operates on values much smaller than the boundary values, and no saturation errors are expected. The only errors introduced by the finite-precision approximation are round-off errors. Figure 8 provides performance plots of CPE2 (top) and CPE3 (bottom) for the different W_L choices together with the floating-point performance. From the performance curves, we infer that both algorithms are able to track the channel without numerical stability issues for W_L = 24 and 32. The performance curves for W_L = 16 bits indicate unacceptable performance, or breakdown, caused by quantization errors in both algorithms. For both algorithms, increasing W_L above 24 bits does not yield any improvement, and the performance curves of the 24-bit and floating-point implementations overlap (there is no visible difference). A 24-bit finite-word implementation is therefore a reasonable approximation of the floating-point computation.
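The round-off behavior discussed above can be sketched with a simple fixed-point rounding model; the 4-bit integer/fraction split below is an assumption for illustration, not the paper's actual number format:

```python
import numpy as np

def quantize(x, word_length, int_bits=4):
    """Round x onto a signed fixed-point grid with `word_length` total bits.
    With frac_bits fractional bits the round-off step is 2**-frac_bits; values
    inside the representable range suffer only round-off, never saturation."""
    frac_bits = word_length - int_bits
    step = 2.0 ** (-frac_bits)
    return np.round(x / step) * step

x = np.array([0.123456789, -1.987654321])
err16 = np.max(np.abs(quantize(x, 16) - x))   # coarser grid, larger round-off
err24 = np.max(np.abs(quantize(x, 24) - x))   # finer grid, round-off <= 2**-21
```

The maximum round-off error is bounded by half a quantization step, so moving from 16 to 24 bits shrinks it by a factor of 2⁸, consistent with the 24-bit curves overlapping the floating-point curves.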

Conclusion
In this paper, using numerically robust block HTs, two CP-based adaptive channel estimation algorithms have been presented for multicarrier systems. Conceptually, the new schemes maintain the same spirit as the CP-based CMB-RLS channel tracking scheme: the basic idea is to utilize CP data from the data detection part for adaptive channel estimation. The new approaches achieve the same purpose by replacing the computationally expensive CMB-RLS channel estimation part with computationally cheaper SBHT-RLS alternatives. Of the two schemes, the first, called CP-based Direct SBHT-RLS, is based on Liu's algorithm in the channel estimation part but adaptively updates the channel vector instead of the error vector. The second, called CP-based Extended SBHT-RLS, is based on Sakai's algorithm in the channel estimation part but uses an independent and simpler derivation.
Floating-point performance curves indicate that all three schemes converge and can track channel variation without any stability problems. CPE1 and CPE2 exhibit identical stable performance, whereas CPE3 outperforms both CPE1 and CPE2. In contrast to CPE1, the CPE2 and CPE3 algorithms achieve this performance with lower computational complexity and with the parallelism and pipelining needed for systolic-array/VLSI implementation. All three algorithms converge faster and perform better with lower values of the forgetting factor λ; our simulation results suggest that these advantages come at the price of more uneven performance. Hence, moderate values of the forgetting factor are preferred where a balance between performance and stability is required. The three techniques also show reduced performance as the modulation constellation size increases; they are therefore most appealing when the constellation size is small and may not be suitable for rate adaptation. In terms of finite-word-length behavior, a 24-bit implementation is shown to be a reasonable approximation of the floating-point computation. (In practice, word lengths are optimized with respect to the actual system requirements (chip area, latency, power consumption, FFT size, throughput), noise, channel length, and the desired acceptable performance.)
Systolic array structures that allow efficient parallel implementation of the schemes in real time with VLSI technology were considered. The CPE2 approach is only partially concurrent because of the costly back-substitution operation, whereas the CPE3 approach is highly concurrent owing to the absence of back substitution and therefore leads to a more efficient implementation on systolic processors.
The methods proposed in this paper are well suited to applications where good numerical properties, computational savings, and parallel-processing implementation advantages (with improved performance, in the case of CPE3) are desired. Although the real baseband DMT case is the main focus of this paper, the proposed approaches can also be applied to the complex baseband case (wireless multicarrier systems); in that case, a further performance improvement is possible by including forward error correction (FEC) decoding in the reliable reconstruction of the transmitted symbols. Interesting future directions include studying hardware implementation problems, fine-grain implementation/architecture of the processing elements to work out the total cost of operators (adders, multipliers, dividers, memory/delay elements, etc.) and algorithm latencies, and modifications of the schemes to achieve reduced complexity, improved performance, and stable implementations with reduced word lengths.

Noncoherent detection schemes like 2SDD are not optimal and can be improved by multisymbol differential detection (MSDD), which is a maximum-likelihood procedure for finding a block of information symbols after observing a block of received symbols [3]. For very large numbers of observations, the performance of MSDD approaches that of ideal coherent detection of DE-QPSK, which is given in, for example, [4][5][6]. Noncoherent MSDD can also be used if channel coding is applied in a noniterative way, see [7, 8].
If MSDD is combined with iterative (turbo) processing (parallel concatenated systems were first described by Berrou et al. [9]; serial concatenation was developed by Benedetto and coworkers [10][11][12]), its complexity needs to be reduced to an acceptable level. We were motivated by a number of encouraging results on the serial concatenation of convolutional encoding followed by differential encoding with turbo-like decoding techniques, also referred to as Turbo-DPSK. Turbo-DPSK was investigated for single-carrier transmission over AWGN channels in [13][14][15][16][17][18], as well as over time-varying channels in [19][20][21][22][23]. The main objective of these papers was to reduce the complexity of the inner decoder. Two main methods can be distinguished: an explicit estimation of the channel phase followed by coherent detection, see [19, 20] and, for the 2D case, [24][25][26]; or directly calculating the a posteriori probabilities of the information symbols, as in [17, 18, 22] and, for the 2D case, [27][28][29].
In the present paper we focus on 2D processing, that is, processing in both the frequency and time domains. We will propose methods based on iteratively demodulating and decoding blocks of received symbols in a DAB transmission stream. First, however, we summarize other 2D approaches that are relevant to our work.
The work of ten Brink et al. [24] on 2D phase-estimation methods can be regarded as an extension of the results of Hoeher and Lodge [20] to the multicarrier case. Park et al. [25] improved the hard-decision approach of ten Brink et al. by considering soft decisions. Both [24, 25] rely on pilot symbols, which are unfortunately not present in DAB transmission [1]. Blind channel estimation techniques were proposed by Sanzi and Necker [26], who combined the iterative scheme of ten Brink [24] with a fast-converging blind channel estimator based on higher-order asymmetrical modulation schemes, which are not used within a DAB transmission [1].
To obtain a posteriori probabilities of the information symbols in a 2D setting, May, Rohling, and Haase [27][28][29] considered iterative decoding schemes for multicarrier modulation with the soft-output Viterbi algorithm (SOVA, [30]). The SOVA was used both for differential detection and for decoding the convolutional code. In the coherent setting, they used a phase estimate based on a block of three-by-three received symbols adjacent in the time and frequency directions, and they proposed using only the current received symbol to obtain a symbol metric for the SOVA inner decoder, effectively ignoring the differential encoding. For the incoherent case, they used a transition metric for the SOVA inner decoder based on the current and previously received symbols. These a posteriori detection schemes produce approximations of the a posteriori probabilities; procedures that focus on efficient computation of the exact values can be found in [5] for the coherent case and in [18] for the incoherent case.
To reduce complexity, in this contribution we accept a small performance loss due to channel-phase discretization (see, e.g., Peleg et al. [17] and Chen et al. [22]), but apart from that we determine the exact a posteriori probabilities of the information symbols in a 2D setting. Our starting point is the techniques proposed by Peleg et al. in [17]. We discretize the channel phase into a number of equispaced values but do not allow the "side-step" transitions that Peleg et al. proposed to track small channel-phase variations. We then calculate, in an efficient way, the a posteriori probabilities of the information symbols using the BCJR algorithm [31] in a 2D setting, see also [32]. We consider 2D blocks and trellis decomposition. Each 2D block consists of a number of adjacent subcarriers over a number of subsequent OFDM symbols. Focussing on 2D blocks is motivated by the fact that the channel coherence time is typically limited to a small number of OFDM symbols, and also by the fact that DAB transmissions use time-multiplexing of services, which limits the number of OFDM symbols in a codeword; extension in the subcarrier direction is then required to obtain reliable phase estimates. The trellis-decomposition method allows us to estimate the unknown channel phase efficiently. This phase is related to subtrellises whose a posteriori probabilities we can determine; with these probabilities, we can choose a dominant subtrellis, which results in a significant complexity reduction.
Franceschini et al. [33] also use the idea of trellis decomposition and subtrellises (multiple trellises) to estimate channel parameters. Variation of these parameters is handled by applying so-called intermix intervals, in which special manipulations (mix-metric techniques) on the forward and backward metrics are performed. Since we cannot track channel variations here, we apply a 2D approach based on the assumption that there are independent channel realizations within distinct blocks. We explain later, in Section 2.1, why we cannot track the channel phase.

Paper Outline.
In this paper, we focus on both iterative and noniterative decoding techniques for DAB-like systems. In the next section, we give a short outline of the DAB system. In Section 3, we start our analysis by considering noniterative methods for the single-carrier case and introduce trellis decomposition with a dominant-subtrellis approach. In Section 4, we extend our single-carrier methods to the multicarrier case and introduce a 2D-block approach for demodulation. Iterative methods based on the serial concatenation of convolutional codes (SCCC) and trellis decomposition with two dominant-subtrellis approaches are considered in Section 5. In Section 6, we generalize our iterative methods for the single-carrier case to the multicarrier case with 2D-block demodulation. Results of applying our approach to a practical case are shown in Section 7. Finally, Section 8 draws conclusions on our decoding procedures based on 2D blocks and trellis decomposition for DAB-like systems.

Overview.
Terrestrial digital broadcasting systems like DAB, DAB+, and T-DMB, all members of the "DAB family," comprise a combination of convolutional coding (CC), interleaving, and π/4-DE-QPSK modulation followed by OFDM, see Figure 1. Time multiplexing of the transmitted services allows the receiver to perform per-service symbol processing [1], see Figure 2, where 1536 is the number of "active" OFDM subcarriers for a DAB transmission in Mode I [1]; hence the receiver can decode a certain service without having to process the OFDM symbols that do not correspond to this service. (Note that there is an overlap between the services, since differential modulation is used.) Consequently, only at particular time instants within a DAB transmission frame does a small number (usually up to four) of OFDM symbols need to be processed. This results in "idle time" for the demodulation and decoding processes. Because of this "idle time," the mix-metric techniques of [33] cannot be applied to DAB receivers. However, if all the transmitted services are decoded and there is no idle time, mix-metric techniques could be a valuable extension to the 2D iterative processing methods based on trellis decomposition that we will develop here.
In the following subsections we will describe the transmit processes (convolutional encoding, differential modulation, and OFDM) in more detail.

Convolutional Coding and Interleaving.
The convolutional code used within DAB has basic code rate R_c = 1/4, constraint length K = 7, and generator polynomials g_0 = 133, g_1 = 171, g_2 = 145, and g_3 = 133 (octal). Larger code rates can be obtained by puncturing the mother code, see Hagenauer et al. [34]. The time and frequency interleavers in DAB perform bit and bit-pair interleaving, respectively. As a result, the code bits leaving the convolutional encoder are permuted and partitioned over the subcarriers of a number of subsequent OFDM symbols (in subsequent frames). The bits for each subcarrier are grouped in pairs, and each such pair is mapped onto a phase difference that can therefore assume four different values. The mapping used here is based on the Gray principle, that is, labels corresponding to adjacent phase differences differ in only a single bit position. A sequence b = (b_1, b_2, ..., b_N) consisting of N symbols (phase differences) b_n, n = 1, 2, ..., N, carries the information.
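The Gray bit-pair mapping can be sketched as follows, using the labeling b_1b_2 ∈ {00, 01, 11, 10} given later in the simulation section; the check at the end confirms that adjacent phase differences differ in exactly one bit position:

```python
import numpy as np

# Gray labeling of bit pairs onto the four QPSK phase differences:
# phase index k corresponds to the phase difference e^{j k pi/2}.
GRAY = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}

def phase(bits):
    """Map a bit pair to its phase difference e^{j k pi/2}."""
    return np.exp(1j * GRAY[bits] * np.pi / 2)

# Gray property: labels of adjacent phase indices differ in one bit position.
labels = sorted(GRAY, key=GRAY.get)
for i in range(4):
    a, b = labels[i], labels[(i + 1) % 4]
    assert (a[0] != b[0]) + (a[1] != b[1]) == 1
```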

Differential Modulation in Each Subcarrier. For each subcarrier, π/4-DE-QPSK modulation is applied: the sequence b = (b_1, b_2, ..., b_N) of information-carrying phase differences is differentially encoded as s_{m,n} = s_{m,n−1} b_n, where for the first symbol s_{m,0} = 1, and s_{m,n} is the nth differentially encoded symbol corresponding to the mth subcarrier, or equivalently the mth element of the nth frequency-domain OFDM symbol, see Figure 1. Note that the IFFT is a computationally efficient inverse discrete Fourier transform (IDFT) for values of B that are powers of 2. To prevent intersymbol interference (ISI) resulting from multipath reception, a cyclic prefix of length L_cp is added to the sequence s_n. This leads to the sequence s̃_n = (s_{B−L_cp+1,n}, ..., s_{B,n}, s_{1,n}, s_{2,n}, ..., s_{B,n}) that is finally transmitted. We assume that the channel is slowly varying, with an impulse response shorter than the cyclic-prefix length. Moreover, we assume that the channel coherence bandwidth and coherence time span multiple OFDM subcarriers and multiple OFDM symbols, so that the channel phase and gain may be assumed fixed over a number of adjacent subcarriers and consecutive symbols. This is the assumption on which we base our investigations: the channel phase and gain are assumed constant (yet unknown to the receiver) over a 2D block of symbols, see Figure 3.
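A minimal sketch of the differential-encoding recursion and the cyclic-prefix operation follows. Note that the paper applies the prefix to the time-domain IFFT output; here the prefix routine is shown on a generic vector, and the example phase differences are illustrative:

```python
import numpy as np

def pi4_deqpsk(b):
    """Differential encoding s_n = s_{n-1} * b_n with s_0 = 1, where the
    phase differences b_n are odd multiples of pi/4 (pi/4-DE-QPSK)."""
    s = [1.0 + 0.0j]
    for bn in b:
        s.append(s[-1] * bn)
    return np.array(s)

def add_cyclic_prefix(x, L_cp):
    """Prepend the last L_cp samples: (x[-L_cp:], x)."""
    return np.concatenate([x[-L_cp:], x])

b = np.exp(1j * np.pi / 4 * np.array([1, 3, 7, 5]))  # example phase differences
s = pi4_deqpsk(b)                 # unit-modulus differentially encoded symbols
tx = add_cyclic_prefix(s, L_cp=2)
```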

OFDM in DAB. OFDM modulation in DAB is performed by applying a B-point IFFT to each frequency-domain OFDM symbol, one element per subcarrier, producing the time-domain sequence s_n.
In the case of perfect synchronization, the receiver removes the (received version of the) cyclic prefix, after which OFDM reception can be regarded as parallel matched filtering corresponding to B complex orthogonal waveforms, one for each subcarrier. This results in a channel model, holding for a 2D block of symbols, given by r_{m,n} = |h|e^{jφ} s_{m,n} + n_{m,n}, for some subsequent values of n and m, where the channel gain |h| and phase φ are unknown to the receiver. Note that a phase rotation proportional to m, due to a time delay, is removed by linear phase correction (LPC); this technique modifies the phase of each OFDM subcarrier with an appropriate rotation based on the starting position (time delay) of the FFT window within the OFDM symbol.
In practice, this delay can be determined quite accurately.
In the next subsection, we focus on a single subcarrier.

Incoherent Reception, Channel Gain Known to Receiver.
The sequence s that is transmitted via a certain subcarrier is now observed by the receiver as sequence r = (r 0 , r 1 , . . . , r N ).
Note that, compared to the previous subsection, we have dropped the subscript m here. Since it is relatively easy to estimate the channel gain, we assume that it is perfectly known to the receiver, and to ease our analysis we take it to be one. The received sequence now relates to the transmitted sequence s as r_n = e^{jφ} s_n + n_n, for n = 0, 1, ..., N, where we assume that n_n is circularly symmetric complex Gaussian with variance σ² per component. Basically, we assume that the random channel phase φ is real-valued and uniform over [0, 2π). This channel phase is fixed over all N + 1 transmissions and unknown to the receiver. Accepting a small performance loss, as in, for example, Peleg et al. [17] and Chen et al. [22], we may assume that the channel phase is discrete and uniform over 32 levels, uniformly spaced over [0, 2π); hence Pr{φ = πl/16} = 1/32, for l = 0, 1, 2, ..., 31.
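The discretized-phase channel model can be sketched as follows (the function and variable names, and the noiseless example, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete uniform channel-phase prior: 32 levels pi*l/16, l = 0..31,
# each with probability 1/32, fixed over all N+1 transmissions.
phases = np.pi * np.arange(32) / 16
prior = np.full(32, 1 / 32)

def channel(s, sigma, l):
    """r_n = e^{j phi} s_n + n_n with phi = pi*l/16 and circularly
    symmetric complex Gaussian noise, variance sigma^2 per component."""
    noise = sigma * (rng.standard_normal(s.size) + 1j * rng.standard_normal(s.size))
    return np.exp(1j * phases[l]) * s + noise

s = np.ones(5, dtype=complex)
r = channel(s, sigma=0.0, l=8)   # l = 8 -> phi = pi/2, so r = j*s noiselessly
```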
We will first study the situation in which we consider a uniformly chosen channel phase in a single subcarrier. Later we will also investigate the setting in which a uniformly chosen channel phase is moreover constant over a number of (adjacent) subcarriers.

Equivalence between DE-QPSK and π/4-DE-QPSK.
It is well known and straightforward to show that the π/4-DE-QPSK modulation performed in each of the subcarriers is equivalent to DE-QPSK. To see this, we define, for n = 1, 2, ..., N, a_n = b_n e^{−jπ/4}, and, for n = 0, 1, ..., N, x_n = s_n e^{−jnπ/4}, y_n = r_n e^{−jnπ/4}, and w_n = n_n e^{−jnπ/4}.
Now we may conclude that also x n ∈ A for all n = 0, 1, . . . , N, and that w n , just like n n , is circularly symmetric complex Gaussian with variance σ 2 per component. Moreover, since b is Gray-coded with respect to the interleaved code bits, so is a = (a 1 , a 2 , . . . , a N ). From now on, we will therefore focus on DE-QPSK.
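This equivalence is easy to verify numerically; a minimal sketch with randomly drawn phase differences (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([1, 1j, -1, -1j])   # QPSK alphabet of the derotated symbols

N = 16
b = np.exp(1j * np.pi / 4) * A[rng.integers(0, 4, N)]    # pi/4-DE-QPSK differences
s = np.concatenate([[1.0 + 0j], np.cumprod(b)])          # s_n = s_{n-1} b_n, s_0 = 1

n_idx = np.arange(N + 1)
x = s * np.exp(-1j * n_idx * np.pi / 4)   # x_n = s_n e^{-j n pi/4}
a = b * np.exp(-1j * np.pi / 4)           # a_n = b_n e^{-j pi/4}

# After derotation the chain is plain DE-QPSK: x_n = x_{n-1} a_n with x_n in A.
assert np.allclose(x[1:], x[:-1] * a)
assert np.all(np.min(np.abs(x[:, None] - A[None, :]), axis=1) < 1e-9)
```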

Detection and Decoding: Single-Carrier Case, Noniterative
We will start by considering the single-carrier case. For some single subcarrier, we will discuss DE-QPSK modulation with incoherent reception. Based on trellis decoding techniques, we will determine the a posteriori symbol probabilities under the assumption that the (quantized) channel phase is uniform and unknown to the receiver. We also assume the transmitted symbols to be independent of each other and uniform.
The variables z n for n = 0, 1, . . . , N can now be regarded as states in a trellis, and the independent uniformly distributed (iud) symbols a 1 , a 2 , . . . , a N correspond to transitions between states. The resulting graphical representation of our trellis can be found in Figure 4.
Journal of Electrical and Computer Engineering

If we used the standard BCJR algorithm for computing the a posteriori symbol probabilities in the trellis in Figure 4, we would have to do 32 × 4 multiplications in the forward pass, 32 × 4 multiplications in the backward pass, and 4 × 32 × 2 multiplications and 4 normalizations in the combination pass, per trellis section, if the a priori probabilities are all equal. In total, this is 512 multiplications and 4 normalizations per trellis section. In this paper we focus only on multiplications and normalizations, since additions have a smaller complexity. (In the log domain, multiplications and normalizations are replaced by additions, and additions are typically approximated by maximizations. This would suggest counting multiplications, normalizations, and additions, but for simplicity we neglect the additions here.) An important observation for our investigations is that the trellis can be seen to consist of eight subtrellises T_0, T_1, ..., T_7 that are not connected to each other. A similar observation was made by Chen et al. [22]; we will discuss connections between our work on trellis decomposition and that of [22] later.
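The decomposition into eight disconnected subtrellises can be checked with a small reachability computation, assuming (as in the trellis of Figure 4) that the states are the 32 discretized phase levels and that QPSK transitions rotate the state by multiples of π/2, that is, by 8 steps of π/16; the state index modulo 8 is then invariant:

```python
# States 0..31 are the discretized phase levels; the four QPSK symbols
# {1, j, -1, -j} advance the state by 0, 8, 16, or 24 steps of pi/16.
states = range(32)
steps = [0, 8, 16, 24]

def component(z):
    """All states reachable from z under the QPSK transitions."""
    seen, frontier = {z}, [z]
    while frontier:
        u = frontier.pop()
        for d in steps:
            v = (u + d) % 32
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return frozenset(seen)

components = {component(z) for z in states}
# eight disconnected subtrellises, four states each
```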
Note that, for the likelihood γ_n(z_n) corresponding to a state z_n ∈ Z, n = 0, 1, ..., N, in the trellis T or in a subtrellis, the Gaussian channel model above gives γ_n(z_n) ∝ exp((1/σ²) Re{y_n e^{−jπz_n/16}}), where the proportionality factor is independent of z_n and is irrelevant after normalization.

Forward-Backward Algorithm, Subtrellises.
In this subsection, we would like to focus on computing the a posteriori symbol probabilities Pr{a n | y 0 , y 1 , . . . , y N } for all n = 1, 2, . . . , N and all values a n ∈ A. It will be demonstrated that it is a relatively simple exercise to do this. We will show that the resulting a posteriori probability is a convex combination of the a posteriori probabilities corresponding to the eight subtrellises. Computing the a posteriori probabilities for each subtrellis is simple and can be done without performing the BCJR algorithm, as was demonstrated by Colavolpe [5]. The coefficients of the convex combination do not depend on the trellis section index n and are quite easy to determine as we will show here.

Forward Recursion.
In our forward pass, we focus on subtrellis T_s, for some s ∈ {0, 1, ..., 7}, and first find out how to compute all the α's in that subtrellis. Starting from α_0(z_0) = 1/32 for all z_0 ∈ Z_s, we can compute the α's recursively for n = 1, 2, ..., N and z_n ∈ Z_s. The notation (z, a) → z′ stands for all states z and symbols a that lead to the next state z′.
This recursion yields α_n(z_n) = (1/32) ∏_{i=0}^{n−1} K_s(i), for n = 0, 1, ..., N and z_n ∈ Z_s. Proof. Our proof is based on induction. Clearly, for n = 0 the result holds (the empty product equals one). Now assume that α_{n−1}(z_{n−1}) = (1/32) ∏_{i=0}^{n−2} K_s(i) for z_{n−1} ∈ Z_s; then from (13) we obtain the claimed expression for all z_n ∈ Z_s.

Backward Recursion.
Also in the backward pass, we first focus only on subtrellis T_s for some s, in which we would like to compute the β's. Taking β_N(z_N) = γ_N(z_N) for z_N ∈ Z_s, we can compute all other β's recursively, where again n = 0, 1, ..., N − 1 and z_n ∈ Z_s.

Proof. Again, our proof is based on induction. Note first that for n = N the result holds. Now assume that β_{n+1}(z_{n+1}) = γ_{n+1}(z_{n+1}) ∏_{i=n+2}^{N} K_s(i) for z_{n+1} ∈ Z_s; the recursion then gives β_n(z_n) = γ_n(z_n) ∏_{i=n+1}^{N} K_s(i) for all z_n ∈ Z_s.
The right-hand side of this equation can be interpreted as a convex combination of a posteriori symbol probabilities Pr{a n | y, s}, one for each subtrellis, where the weightingcoefficients are the a posteriori subtrellis probabilities Pr{s | y}. An a posteriori subtrellis probability is the conditional probability that the discrete channel phase modulo 8 equals s for some s = 0, 1, . . . , 7 given y.
The demodulator that operates according to (25) has three tasks: first, the eight weighting coefficients (23) have to be computed; then, for each of the eight subtrellises, the a posteriori symbol probabilities have to be computed for all symbol values a_n ∈ A and all n ∈ {1, 2, ..., N}; finally, the weighting (25) has to be performed. Computing the weighting coefficients requires, for each subtrellis s ∈ {0, 1, ..., 7}, the computation of the factors K_s(n) for n = 0, 1, ..., N; these factors are then multiplied and normalized to form Pr{s | y}. For these computations, 8 multiplications per trellis section are needed. Computing the a posteriori symbol probabilities Pr{a_n | y, s} can be done efficiently by applying the technique of Colavolpe [5] to each subtrellis. As in Colavolpe, each such a posteriori symbol probability is based on only the two received symbols y_{n−1} and y_n, as shown in (24). This avoids using the BCJR method in full generality and leads to significant complexity reductions: only 8 × 4 × 4 = 128 multiplications and 8 × 4 = 32 normalizations are needed per trellis section. The weighting operation requires 8 × 4 = 32 multiplications; in total, this approach therefore requires 8 + 128 + 32 = 168 multiplications and 32 normalizations, considerably less than full BCJR. Equation (25) shows how the exact a posteriori symbol probabilities can be determined. If one of the a posteriori subtrellis probabilities dominates the others, then the weighting (25) can be approximated by Pr{a_n | y} ≈ Pr{a_n | y, ŝ}, with ŝ = arg max_s Pr{s | y}. (26) Observe that this approach involves the computation of the a posteriori symbol probabilities, as described in (24), only for the dominant subtrellis ŝ, which requires only 4 × 4 = 16 multiplications and 4 normalizations per trellis section.
Together with the computation of the weighting coefficients, 8 + 16 = 24 multiplications and 4 normalizations are needed. This reduces the number of multiplications with respect to full weighting by a factor of seven.
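The operation counts quoted above can be tallied directly:

```python
# Per-trellis-section multiplication counts from the text.
full_bcjr = 32 * 4 + 32 * 4 + 4 * 32 * 2   # forward + backward + combination = 512
full_weighting = 8 + 8 * 4 * 4 + 8 * 4     # coefficients + 8 subtrellises + weighting = 168
dominant = 8 + 4 * 4                       # coefficients + dominant subtrellis only = 24
```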

Simulations.
In our simulations we use, just like Peleg et al. [17], the de facto industry-standard R_c = 1/2 convolutional code with generator polynomials g_0 = 133 and g_1 = 171, which is equal to the convolutional code with puncturing index PI = 8 of Table 29 in [1, Section 11.1.2, page 131]. The DAB, DAB+, and T-DMB bit-reversal time interleaver and block frequency interleaver are modeled by a bitwise uniform block interleaver generated for each simulated code block of bits; hence any permutation of the coded bits is a permissible interleaver and is selected with equal probability, as is done in [17].
The demodulator calculates, for each OFDM subcarrier, the a posteriori probability given by (25) for N + 1 = 2, 4, 8, and 32. The demodulator is followed by a convolutional decoder, which needs soft-decision information about the coded bits as input. From the Gray mapping b_1b_2 ∈ {00, 01, 11, 10} ↦ a(b_1b_2) ∈ {1, e^{jπ/2}, e^{jπ}, e^{j3π/2}}, it follows that the desired metrics related to transmission n, that is, the log-likelihood ratios (LLRs) [30], can be expressed in terms of the symbol metrics m_φ = ln Pr{a_n = e^{jφ} | y}, where λ¹_n corresponds to bit b_1 and λ²_n to bit b_2. Figure 5 shows the bit-error rate (BER) performance with the so-called ideal LLRs for a decomposed trellis, for trellis lengths N + 1 = 2, 4, 8, and 32. On the horizontal axis is the signal-to-noise ratio E_b/N_0 = 1/(2σ²). The demodulator operates according to (25).
We compare the performance of this demodulator with that of two well-known procedures from the literature. The first is "classical" DQPSK [35, page 224], that is, two-symbol differential detection (2SDD), which leads to a posteriori symbol probabilities as in (9) of Divsalar and Simon [3], that is, Pr{a_n | y_n, y_{n−1}} ∝ I_0((1/σ²)|y_n a*_n + y_{n−1}|), for a_n ∈ A, where I_0(·) is the zeroth-order modified Bessel function of the first kind. Secondly, we compare our results to coherently detected DE-QPSK. We assume that the received sequence is perfectly derotated, that is, each y_n is replaced by y_n e^{−jφ}. The a posteriori symbol probabilities are then given by Pr{a_n | y} ∝ Σ_{x_{n−1}∈A} exp((1/σ²) Re{x*_{n−1}(y_n a*_n + y_{n−1})}), for a_n ∈ A, as described by Colavolpe [5]. Note that this expression (31) is similar to (24) for s = 0.
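Both reference detectors can be sketched directly from these expressions (numpy's `np.i0` provides I_0; the noise variance σ² and the example inputs below are assumptions for illustration):

```python
import numpy as np

A = np.array([1, 1j, -1, -1j])   # QPSK alphabet
sigma2 = 0.5                     # assumed noise variance per component

def p_2sdd(y_prev, y_cur):
    """2SDD: Pr{a | y_n, y_{n-1}} proportional to I_0(|y_n a* + y_{n-1}| / sigma^2)."""
    m = np.i0(np.abs(y_cur * A.conj() + y_prev) / sigma2)
    return m / m.sum()

def p_coherent(y_prev, y_cur):
    """Coherent DE-QPSK (Colavolpe): Pr{a | y} proportional to
    sum over x_{n-1} in A of exp(Re{x*_{n-1}(y_n a* + y_{n-1})} / sigma^2)."""
    m = np.array([sum(np.exp(np.real(np.conj(x) * (y_cur * np.conj(a) + y_prev)) / sigma2)
                      for x in A) for a in A])
    return m / m.sum()

# Noiseless example: x_{n-1} = 1 and a = j, so y_{n-1} = 1 and y_n = j.
p1 = p_2sdd(1.0 + 0j, 1j)       # both detectors should favor a = j (index 1)
p2 = p_coherent(1.0 + 0j, 1j)
```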
The simulation results, shown in Figure 5, demonstrate that the BER performance curves of 2SDD and of trellis length N + 1 = 2 are practically identical, as expected. Moreover, the coherent-detection curve and the curve for very large trellis sizes (N → ∞) are very close; the small performance loss is due to discretizing the channel phase with 32 levels. Furthermore, Figure 5 shows that (a) larger values of N + 1 result in performance closer to the coherent-detection performance, and (b) for N + 1 = 32, ideally computed LLRs for a decomposed trellis perform quite close to coherent detection, that is, the difference in signal-to-noise ratio (E_b/N_0) is less than 0.15 dB at a BER of 10^{−4}.
Next, in Figure 6, we turn to the dominant-subtrellis approach, denoted by "Max" in the legend. We compare, for trellis lengths N + 1 = 2, 8, and 32, the difference in performance between the ideal LLRs based on the a posteriori probabilities given by (25) and the approximated LLRs based on the dominant-subtrellis a posteriori probabilities specified in (26). It can be seen from Figure 6 that (a) for larger N + 1, the difference between the exact and approximated LLRs becomes smaller, and (b) for N + 1 = 32, the difference between the ideal and approximated LLRs is less than 0.1 dB.

Some Conclusions.
Our simulations demonstrate that for trellis length N + 1 = 32, the ideal LLRs and the approximated LLRs have a performance quite close to that of coherent detection. The difference in signal-to-noise ratio is less than 0.25 dB at a BER of 10^−4 for the dominant-subtrellis approach. Therefore, if we focus on a BER of 10^−4 for obtaining an acceptable performance with single-subcarrier transmission, we need a trellis length N + 1 ≥ 32.
With a trellis length of N + 1 = 32 received symbols, the channel coherence time needs to be on the order of T_c ≈ 32T_s, where T_s is the OFDM symbol time. (Figure 6: BER with ideal LLRs, computed as in (25), and approximated LLRs, computed as in (26), for different trellis lengths.) This imposes quite a strong restriction on the time-varying behavior of the channel. In practice, the channel may not stay coherent for that long, and therefore focussing on trellis length N + 1 = 32 might not be realistic. We will discuss this effect in more detail in Section 7, where we study a typical urban channel. There is a second reason for arguing that large values of N are undesirable. DAB systems support, for complexity reduction, per-service symbol processing. In such services, typically at most N + 1 ≤ 4 subsequent OFDM symbols are contained in a single convolutionally encoded word, see Figure 2, and this does not match processing more than four OFDM symbols in a demodulation trellis. Having concluded that we cannot make N too large, it makes sense to investigate the possibility of using a number of (adjacent) subcarriers to jointly determine the a posteriori symbol probabilities for the corresponding DE-QPSK streams. Instead of using a single trellis with length N + 1 = 32, we could find out whether a similar performance can be obtained with a 2D block of M = 8 trellises of length N + 1 = 4 corresponding to adjacent subcarriers, see Figure 3. This will be the subject of the next section.

Demodulation Procedures.
We have seen that the trellis length N + 1 needs to be as large as possible; for an acceptable performance, it should be at least 32. This may not always be feasible. Therefore, we want to investigate jointly decoding a (2D) block of received symbols. It would again be nice if we could decompose the computation of the a posteriori probabilities as in (25). For each subcarrier m, an information sequence (a_{m,1}, a_{m,2}, . . . , a_{m,N}) is conveyed using differential encoding. For the components of the transmitted sequence x_m = (x_{m,0}, x_{m,1}, . . . , x_{m,N}), we can write

x_{m,n} = a_{m,n} x_{m,n−1}, (32)

with x_{m,0} = 1. We assume that the channel phase is constant over the block of symbols; therefore, y_{m,n} = e^{jφ} x_{m,n} + w_{m,n}, where φ ∈ {lπ/16, l = 0, 1, . . . , 31} and uniform just as before, and the noise variables w_{m,n} are circularly complex Gaussian with variance σ^2 per component. The output sequence corresponding to subcarrier m is denoted by y_m = (y_{m,0}, y_{m,1}, . . . , y_{m,N}). Just as in the single-carrier case, we can determine the a posteriori subtrellis probabilities Pr{s | y_1, y_2, . . . , y_M} as in (38), where Pr{s} = 1/8 for s = 0, 1, . . . , 7 and where K_{m,s}(n) = Σ_{z_n∈Z_s} (1/4) γ_{m,n}(z_n). For the a posteriori symbol probabilities per subtrellis we obtain

Pr{a_{m,n} | y_m, s} = [Σ_{z_{n−1}∈Z_s} (1/4) γ_{m,n−1}(z_{n−1}) γ_{m,n}(z_{n−1} a_{m,n})] / [4 K_{m,s}(n − 1) K_{m,s}(n)],

for s ∈ {0, 1, . . . , 7} and a_{m,n} ∈ A. This suggests that the demodulator first determines the a posteriori subtrellis probabilities (weighting coefficients) using (38), for which the 8 × M × (N + 1) K-factors have to be computed first.
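To make the 2D block model concrete, here is a minimal simulation sketch of the per-subcarrier differential encoding x_{m,n} = a_{m,n} x_{m,n−1} and the block-constant channel phase; the function name and random-generator choice are illustrative assumptions:

```python
import numpy as np

A = np.exp(1j * np.pi / 2 * np.arange(4))     # QPSK information alphabet
rng = np.random.default_rng(7)

def transmit_block(a, phase, sigma):
    """Differentially encode one 2D block and add noise with a common phase.

    a: (M, N) array of information symbols a_{m,n} in A, one row per subcarrier.
    Returns y of shape (M, N + 1); x_{m,0} = 1 is the reference symbol.
    """
    M, N = a.shape
    x = np.ones((M, N + 1), dtype=complex)
    for n in range(1, N + 1):
        x[:, n] = a[:, n - 1] * x[:, n - 1]   # x_{m,n} = a_{m,n} x_{m,n-1}
    w = sigma * (rng.standard_normal((M, N + 1))
                 + 1j * rng.standard_normal((M, N + 1)))
    return np.exp(1j * phase) * x + w         # block-constant channel phase

phi = rng.integers(32) * np.pi / 16           # phase quantized to 32 levels
y = transmit_block(A[rng.integers(4, size=(8, 4))], phi, 0.1)
```

With zero noise, the ratio y_{m,n}/y_{m,n−1} recovers a_{m,n} exactly, since the common phase cancels out of the differential symbols.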

Simulations.
In the previous section, we analyzed and simulated the single-subcarrier approach. Here we discuss the simulations corresponding to the multi-carrier method. We again study the coded BER versus the signal-to-noise ratio E_b/N_0 = 1/(2σ^2). The BER performance for the ideal LLRs, based on a posteriori probabilities computed as in (37), is shown in Figure 7 for a fixed block size of M(N + 1) = 16. This fixed block size is realized by the parameter pairs (M, N + 1) = (1, 16), (2, 8), (4, 4), and (8, 2). The detector operates according to (37). The performance of 2SDD and coherently detected DE-QPSK is shown as reference curves. From Figure 7, it can be observed that a 2D decomposition with the shortest possible trellis length N + 1 = 2 and M = 8 adjacent subcarriers performs identically to the largest trellis length N + 1 = 16 with M = 1 subcarrier, that is, the single-carrier case. The intermediate cases also perform identically.
We do not show the results of the dominant-subtrellis approach for the multi-carrier case here, since these results are identical to the corresponding results for the single-carrier case shown in Figure 6 in the previous section.

Conclusions on Noniterative Decoding.
Our investigations for the noniterative 2D-case show that we are very close to the performance of coherent detection of DE-QPSK even for small values of the trellis length N + 1, by processing simultaneously several subcarriers. A next question is whether we can do better than this. In the literature, see, for example, Peleg et al. in [17] and Chen et al. [22], it is demonstrated that iterative decoding techniques lead to good results for differential encoding. Therefore, in the sequel of this paper we study iterative decoding techniques for DAB-like streams, with a special focus again on 2D blocks.

Iterative Detection and Decoding: Single-Carrier Case
In the following two sections, we consider iterative decoding procedures. Peleg and Shamai [13] first demonstrated that iterative techniques can improve the performance of demodulation procedures for DE-QPSK streams significantly. We specialize their approach to DAB systems and solve a problem connected to the (in practice quite small) length of the trellises for each subcarrier by turning to 2D blocks for iterative demodulation.

Serial Concatenation.
In the current section, we will investigate iterative decoding procedures for DAB-like systems, which are based on convolutional encoding, interleaving, and DE-QPSK modulation. If we consider DE-QPSK modulation as the inner code and convolutional encoding as the outer code, then it is obvious that we can apply techniques developed for serially concatenated coding systems here, see Figure 8. Serially concatenated turbo codes were proposed by Benedetto and Montorsi [10] and later investigated in more detail in Benedetto et al. [12]. Iterating between the DPSK demodulator and the convolutional decoder for the incoherent case was first suggested (for a single carrier) by Peleg and Shamai [13]. Hoeher and Lodge [20] also applied iterative techniques to the incoherent case, but focussed on channel estimation, to be able to use coherent detection. For an overview of related results, all for the single-carrier case, see Chen et al. [22]. We start in this section with the single-carrier case, and our aim is again to find out what we can gain from decomposing the trellis used in the demodulator into a part that corresponds to the channel phase and a part that relates to differential encoding. In the section that follows, we will consider the multi-carrier setting.

Peleg Approach.
In this subsection, we investigate the forward backward procedures where we drop the assumption that the symbols a n , n = 1, 2, . . . , N, are uniformly distributed. Interleaving should still guarantee the independence of the symbols, however.
Just like Peleg et al. [17], we focus on the entire trellis T. Note, however, that our trellis is different from that of [17], in which tracking of small channel-phase variations is made possible by adding "side-step" transitions. We do not have such transitions in our trellis, and therefore our trellis can be decomposed into eight unconnected subtrellises. In the next subsection, we take advantage of this decomposition; first, however, we consider the undecomposed trellis.
To determine the a posteriori symbol probability for symbol value a_n ∈ A, we compute the joint probability and density

Pr{a_n} p(y | a_n) = Σ_{z_{n−1}∈Z} α_{n−1}(z_{n−1}) Pr{a_n} γ_{n−1}(z_{n−1}) β_n(z_{n−1} a_n). (44)

This expression also tells us how the resulting extrinsic information can be determined. It can be checked, see Benedetto and Montorsi [10], that the multiplication by the factor Pr{a_n} in the a posteriori information (44) should be omitted to obtain the extrinsic information. The extrinsic information is then further processed by the convolutional decoder. The results of the iterative procedure are discussed in Section 5.5.
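The computation in (44) can be sketched as a scaled forward-backward (BCJR-style) pass over the 32-state phase trellis. The state likelihood used below, γ_n(z) = exp(Re{y_n z*}/σ²), is an illustrative matched-filter assumption, not necessarily the paper's exact definition of γ_n(z_n):

```python
import numpy as np

Z = np.exp(1j * np.pi / 16 * np.arange(32))   # 32 channel-phase trellis states
A = np.exp(1j * np.pi / 2 * np.arange(4))     # QPSK transition symbols

def symbol_posteriors(y, sigma2, priors):
    """A posteriori symbol probabilities Pr{a_n | y} in the spirit of (44).

    y: received symbols y_0..y_N; priors: (N, 4) a priori Pr{a_n}.
    Returns an (N, 4) array of normalized posteriors.
    """
    N = len(y) - 1
    gamma = np.exp(np.real(np.outer(y, np.conj(Z))) / sigma2)  # (N+1, 32)
    # transition z_l -> z_l * a_k lands on state index (l + 8k) mod 32
    nxt = (np.arange(32)[:, None] + 8 * np.arange(4)[None, :]) % 32

    alpha = np.zeros((N + 1, 32))
    alpha[0] = 1.0 / 32                        # uniform initial phase state
    for n in range(1, N + 1):
        prev = alpha[n - 1] * gamma[n - 1]
        for k in range(4):
            alpha[n, nxt[:, k]] += priors[n - 1, k] * prev
        alpha[n] /= alpha[n].sum()             # scale to avoid overflow

    beta = np.zeros((N + 1, 32))
    beta[N] = gamma[N]
    for n in range(N - 1, 0, -1):
        acc = np.zeros(32)
        for k in range(4):
            acc += priors[n, k] * beta[n + 1, nxt[:, k]]
        beta[n] = gamma[n] * acc
        beta[n] /= beta[n].sum()

    post = np.zeros((N, 4))                    # combination pass, cf. (44)
    for n in range(1, N + 1):
        for k in range(4):
            post[n - 1, k] = priors[n - 1, k] * np.sum(
                alpha[n - 1] * gamma[n - 1] * beta[n, nxt[:, k]])
        post[n - 1] /= post[n - 1].sum()
    return post
```

Dropping the leading factor priors[n−1, k] in the combination pass yields the extrinsic information instead of the a posteriori probabilities, as noted above.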
Using the standard BCJR algorithm to compute the extrinsic symbol probabilities in the trellis of Figure 4 (the a priori symbol probabilities now being nonuniform) leads to 32 × 4 × 2 multiplications in the forward pass, 32 × 4 × 2 multiplications in the backward pass, and 32 × 4 × 2 multiplications and 4 normalizations in the combination pass for computing the extrinsic information, per trellis section. In total, this is 768 multiplications and 4 normalizations per trellis section per iteration. In the next subsection, we investigate the decomposition of the demodulation trellis.

Trellis Decomposition.
Here we investigate whether we can decompose the entire trellis for the case where the a priori probabilities are nonuniform. We are interested in decomposing (44) in such a way that we can write

Pr{a_n | y} = Σ_s Pr{s, a_n | y} = Σ_s Pr{s | y} Pr{a_n | y, s}, (45)

for all a_n ∈ A. The question now is how to compute the a posteriori subtrellis probabilities Pr{s | y} for s = 0, 1, . . . , 7.
The a posteriori subtrellis probabilities Pr{s | y} can be obtained from a backward pass through the entire trellis T. For each subtrellis, we can then determine the a posteriori symbol probabilities using

Pr{a_n | y, s} ∝ Pr{a_n, s} p(y | a_n, s) = Pr{a_n, s} Σ_{z_{n−1}∈Z_s} α_{n−1}(z_{n−1}) γ_{n−1}(z_{n−1}) β_n(z_{n−1} a_n), (48)

and, by omitting the factor Pr{a_n, s} in (48), the corresponding extrinsic information. Note that this approach requires a backward pass through the entire trellis T first, to find the weighting probabilities Pr{s | y} for s = 0, 1, . . . , 7. This requires 32 × (4 + 1) = 160 multiplications per trellis section, observing that in (43) γ_n(z_n) can be put in front of the summation sign. Then, for all subtrellises T_s, we do a forward pass (requiring 8 × 4 × 4 × 2 = 256 multiplications per section) and combine the results to obtain the extrinsic symbol probabilities Pr{a_n | y, s} for each subtrellis (for which we need 8 × 4 × 4 × 2 = 256 multiplications and 8 × 4 = 32 normalizations per section). Finally, these probabilities have to be weighted as in (45), which requires 8 × 4 = 32 multiplications. In total, this results in 704 multiplications and 32 normalizations per trellis section per iteration. It should be noted that decomposition of the trellis does not result in a significant complexity reduction with respect to the Peleg approach. In the next subsection, however, we discuss an approach that does give a relevant complexity reduction.

Dominant Subtrellis Approaches.
To achieve a complexity reduction, we investigate a method that first finds, at the start of a new iteration, the dominant subtrellis and then does the forward-backward processing for demodulation only in this dominant subtrellis. Finding the dominant subtrellis for an iteration is based on the a posteriori subtrellis probabilities Pr{s | y} that are computed before starting this iteration. Now, assuming that one of the a posteriori subtrellis probabilities dominates the others, we can write

Pr{a_n | y} ≈ Pr{a_n | y, s}, with s = arg max_{s′} Pr{s′ | y}. (49)
This approach involves the computation of the a posteriori symbol probabilities (and corresponding extrinsic information), as described in (42), (43), and (44), only for the dominant subtrellis s. Computing the a posteriori subtrellis probabilities for each iteration and then performing only the forward pass and combination computations is less complex than following the Peleg procedure. For the best subtrellis T_s, we do a forward pass (4 × 4 × 2 = 32 multiplications per trellis section) and then combine the results to obtain the a posteriori (actually extrinsic) symbol probabilities Pr{a_n | y, s} for that subtrellis (4 × 4 × 2 = 32 multiplications and 4 normalizations per section). In total, we now need 224 multiplications and 4 normalizations per trellis section per iteration.
A second approach involves choosing the dominant subtrellis only once, before starting the iterations. Since before starting the iterations the a priori probabilities are all equal, Pr{a_n} = 1/4, the analysis in Section 3.4 applies. The a posteriori subtrellis probabilities can be computed as in (23). We then do the iterations only in the subtrellis that was chosen initially. This approach requires 84 multiplications and 4 normalizations per trellis section per iteration and is therefore essentially less complex than the Peleg technique. In our simulations, we will only use this last technique when we address dominant subtrellises.
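The weighting in (45) and the dominant-subtrellis shortcut in (49) amount to a convex combination versus an argmax selection over the eight subtrellises. A small sketch (function name and array shapes are our own illustrative choices):

```python
import numpy as np

def combine_subtrellises(p_s, p_a_given_s, dominant=False):
    """Combine per-subtrellis symbol posteriors Pr{a_n | y, s}.

    p_s: length-8 a posteriori subtrellis probabilities Pr{s | y}.
    p_a_given_s: (8, 4) array with row s holding Pr{a_n | y, s}.
    dominant=False: full weighting as in (45),
        Pr{a_n | y} = sum_s Pr{s | y} Pr{a_n | y, s}.
    dominant=True: keep only the dominant subtrellis, as in (49).
    """
    if dominant:
        return p_a_given_s[np.argmax(p_s)]
    return p_s @ p_a_given_s  # convex combination of the 8 rows
```

When one subtrellis probability is close to 1, the two variants coincide, which is exactly the regime in which the approximation (49) is accurate.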

Simulations.
We simulated the Peleg method described in [17] and determined the BER versus the signal-to-noise ratio E_b/N_0 = 1/(2σ^2). This BER performance is shown in Figure 9 for a practically infinite trellis length, that is, N → ∞, with ideal LLRs based on the a posteriori probabilities given by (44). The BER performance is shown for L = 1, 2, . . . , 5 iterations, where L = 1 stands for no iterations. Note that since we are using ideal LLRs and infinite trellis lengths, the corresponding curves shown in Figure 9 can be regarded as target curves for the iterative (single-carrier) case. In addition, also here, 2SDD and coherently detected DE-QPSK curves are shown for reference. Not shown in the figure are the curves corresponding to the approach based on decomposing the trellis and using weighting as in (45). As expected, the performance of this approach shows no difference from the Peleg approach based on (44). From Figure 9, it can be seen that for a BER of 10^−4 the improvement in required signal-to-noise ratio is ≈4.1 dB after L = 5 iterations. Figure 9 also shows that the improvement decreases with the number of iterations and that the first iteration yields the largest improvement. Similar results were obtained by Peleg et al. [17]. To see how the performance in the iterative case depends on the trellis length N, we simulated the Peleg approach for N + 1 = 2, 4, and 32, for L = 5 iterations. The results are in Figure 10. It can be seen that the "iterative coding gain" increases, as expected, with N and that, for N + 1 = 32, the performance is already quite close to that of N → ∞.
Finally, we compared, for N + 1 = 4 and 32, the difference in BER between the exact LLRs based on the a posteriori (extrinsic) probabilities given by (44) and the approximated LLRs obtained by selecting the dominant subtrellis before starting the iteration process, see Figure 11. We can conclude from Figure 11 that for larger N + 1, the difference in performance between the exact and approximated LLRs becomes smaller, and that for N + 1 = 32 the difference between the ideal LLRs and the approximated versions is less than 0.3 dB.

Iterative Detection and Decoding: Multi-Carrier Case
where Pr{a_{m,n} | y_m, s} is computed as given by (48) for s ∈ {0, 1, . . . , 7} and a_{m,n} ∈ A. From these a posteriori probabilities, we can compute the extrinsic information that is needed by the convolutional decoder. Computing the extrinsic information is actually a little easier, since it involves fewer multiplications. This suggests that, for each iteration, the demodulator first determines the a posteriori subtrellis probabilities using (50), for which a backward pass in each of the M trellises corresponding to the subcarriers is needed first.
Using the weighting coefficients, the convex combination in (52) leads to the a posteriori symbol probabilities. Finding the a posteriori symbol probabilities Pr{a m,n | y m , s} should be done in the standard way, taking into account that the backward passes were already carried out.
Again, this approach involves, in each iteration, the computation of the a posteriori symbol probabilities only for the dominant subtrellis s. If we determine the dominant subtrellis only once, before the start of the iteration process, we obtain a significant complexity reduction, since the analysis in Section 3.4 applies. All iterations are then done in the initially chosen subtrellis. The methods described here will be evaluated in the next subsection.

Simulations.
We have seen before that in the noniterative multi-carrier case the performance was more or less determined by the size M(N + 1) of the block. If the channel cannot be assumed to be constant for large values of N + 1, we can always increase the number of subcarriers M if the frequency selectivity allows this. Note that keeping N + 1 small also has advantages related to service symbol processing [1]. Here the situation is slightly different as is demonstrated in Figure 12. Increasing M has a positive effect on the performance; however, since the trellis-length N + 1 remains constant (and is quite small), the effect of iterating is limited. We see, however, that by increasing M from 1 to 8 we get an improvement of roughly 0.7 dB.
Finally, we compare for N + 1 = 4 and 32, for M = 8, the difference between the performance of exact LLRs based on the a posteriori (extrinsic) probabilities given by (52) and the approximated LLRs based on the a posteriori (extrinsic) probabilities given by (53), see Figure 13. We can observe from Figure 13 that, as expected, the larger N + 1 is, the smaller the difference between the exact and approximated LLRs becomes. For N + 1 = 4, the difference between the ideal LLRs and the approximation, by selecting the dominant subtrellis before starting to iterate, is roughly 0.3 dB.
Journal of Electrical and Computer Engineering

Performance for TU-6 Channel Model
So far we have used AWGN channels with unknown channel phase and fixed (unit) gain in our analysis and simulations.
To investigate the performance in practice, we have used the TU-6 (Typical Urban, 6 taps) channel model defined in [36], which is commonly used to test DAB, DAB+, or T-DMB transmission. Two maximum Doppler frequencies are chosen, f_d = 10 and 20 Hz, representing, for DAB transmission in Band-III, movement speeds between transmitter and receiver of ≈45 and ≈90 km/h, respectively. We use our methods for DAB transmission in Mode-I, where the inverse subcarrier spacing T_u = 1 ms and the cyclic-prefix period T_cp = 246 μs [1]. With these settings, the normalized Doppler rate f_d T_u is 0.01 and 0.02, respectively.
Note that to prevent ISI in an OFDM scheme, the delay differences on separate propagation paths need to be less than the cyclic-prefix period [2], that is, the channel impulse-response length τ_m must satisfy τ_m ≤ T_cp. Within the DAB system, T_cp = (63/256)T_u < T_u/4 [1], and therefore the coherence bandwidth B_c ≈ 1/τ_m > 4(1/T_u), which spans at least 4 OFDM subcarriers. For Doppler frequency f_d = 20 Hz, the coherence time T_c ≈ 1/(2 f_d) = 25 ms, which is ≈20 OFDM symbols (including cyclic prefix).
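These Mode-I numbers are easy to verify with a quick back-of-the-envelope script (all values taken from the text above):

```python
# Back-of-the-envelope check of the DAB Mode-I figures used above.
T_u = 1e-3                      # inverse subcarrier spacing: 1 ms
T_cp = 63 / 256 * T_u           # cyclic-prefix period, about 246 us
T_s = T_u + T_cp                # full OFDM symbol time incl. cyclic prefix
for f_d in (10.0, 20.0):        # maximum Doppler frequencies in Hz
    T_c = 1 / (2 * f_d)         # coherence time approximation
    print(f"f_d = {f_d:.0f} Hz: f_d*T_u = {f_d * T_u:.2f}, "
          f"T_c ~ {T_c * 1e3:.0f} ms ~ {T_c / T_s:.0f} OFDM symbols")
```

Running this reproduces the normalized Doppler rates 0.01 and 0.02 and the ≈20-symbol coherence time quoted for f_d = 20 Hz.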
The channel gain representative for a 2D block, over which it is assumed to be constant, is estimated similarly to (8) in Chen et al. [22]. The results of our simulations with the TU-6 model are shown in Figure 14, where the solid lines show the results for f_d T_u = 0.01 and the dashed lines for f_d T_u = 0.02. We show results for N + 1 = 18 with M = 1 and for N + 1 = 4 with M = 8. We considered iterative procedures with L = 5 iterations. In our simulations, we used the dominant-subtrellis approach, where the dominant subtrellis is chosen before starting the iterations.
The value N + 1 = 4 might be seen as a representative frame size for services broadcast by the DAB family in transmission Mode-I. In this mode, N + 1 = 18 is the maximum possible number of interleaved OFDM symbols. Note that N + 1 = 18 is close to the coherence time of our TU-6 channel for a Doppler frequency of 20 Hz.
It can be concluded from Figure 14 that for N + 1 = 18 and M = 1, reliable transmission is not possible over the TU-6 channel at movement speeds of ≈45 km/h and ≈90 km/h. For the 2D-decomposition approach with N + 1 = 4 and M = 8, however, a considerable improvement in required signal-to-noise ratio is possible compared to 2SDD: roughly 2.4 dB and 1.6 dB for 10 and 20 Hz, respectively.

Conclusions
We have investigated decoding procedures for DAB-like systems, focussing on trellis decoding and iterative techniques, with special attention to the advantage obtained from considering 2D blocks and trellis decomposition. These 2D blocks consist of the intersection of a number of subsequent OFDM symbols and a number of adjacent subcarriers. The idea to focus on blocks was motivated by the fact that the channel coherence time is typically limited to a small number of OFDM symbols, but also by the fact that per-service symbol processing is used, which limits the number of OFDM symbols in a codeword.
We have used trellis-decomposition methods that allow us to estimate the unknown channel phase modulo π/2. This channel phase relates to subtrellises for which we can determine the a posteriori probabilities. Using these probabilities, we can weight the contributions of all the subtrellises to compute the a posteriori symbol probabilities. We can also use them to choose a dominant subtrellis that provides these a posteriori symbol probabilities. Working with dominant subtrellises results in significant complexity reductions. A second important advantage of trellis decomposition is that it allows us to process several subcarriers simultaneously in an efficient way.
We have first investigated noniterative methods. The advantage of these methods is that forward-backward procedures turned out to be extremely simple since we could use Colavolpe processing [5]. The drawback of these noniterative methods is, however, that their gain, relative to the standard 2SDD technique, is modest. Iterative procedures result in a significantly larger gain, however. In this context we must emphasize that part of this gain comes from the fact that we can do 2D processing.
Simulations for the noniterative AWGN case show (a) that trellis lengths of N + 1 ≥ 32 are required and (b) that 2D dominant-subtrellis processing with M(N + 1) = 32 outperforms 2SDD by 0.7 dB at a BER of 10^−4.
For the iterative AWGN case with L = 5 iterations, simulations show that 2D dominant-subtrellis processing with M(N + 1) = 32, where N + 1 = 32 and M = 1, outperforms 2SDD by 3.7 dB at a BER of 10^−4. However, simulations also reveal that with M(N + 1) = 32, where N + 1 = 4 and M = 8, the iterative coding gain is reduced to 2.5 dB, which is caused by the smaller value of N + 1.
On the other hand, iterative simulations for a practical setting (i.e., the TU-6 model) show (a) that with trellis length N + 1 = 18 and M = 1 (one subcarrier) no reliable communication is possible, but (b) that with a modest trellis length N + 1 = 4 and M = 8 subcarriers, the iterative coding advantage is maintained and the gain is roughly 2.4 dB for a 10 Hz Doppler frequency and 1.6 dB for 20 Hz.

Introduction
With the ever-increasing demand for high data rates and high quality of service for end users, bandwidth-efficient transmission schemes such as orthogonal frequency-division multiplexing (OFDM) are being adopted in emerging wireless communication systems (e.g., WLAN 802.11a/g/n [1], WiMAX IEEE 802.16 [2], DVB-T [3], DVB-H [4], 3GPP LTE [5]). The physical-layer implementation of OFDM-based systems with a direct-conversion (zero-IF or homodyne) radio architecture represents a promising solution for future wireless systems. The direct-conversion architecture offers a simplified analog front end (FE), as it performs the frequency translation in one step and thus eliminates the need for bulky image-rejection filters [6, 7]. This yields an easy integration of the analog and digital components of the FE on a single chip and consequently results in lower cost and less power consumption. From the perspective of practical implementation, a trade-off exists between high integrability and performance. Direct-conversion transceivers are extremely vulnerable to the nonidealities of analog front-end components. The main impairments that degrade the system performance are in-phase/quadrature-phase (I/Q) imbalance, DC offset, and carrier frequency offset (CFO) [6, 7]. The adoption of higher-order modulation alphabets (such as 64-QAM) in OFDM systems makes them increasingly sensitive to any impairments in the underlying analog hardware. Rather than trying to improve the quality of individual analog modules, it is more cost-efficient to tolerate these RF impairments to a certain degree in the analog domain and afterwards compensate for them in the digital domain.
CFO is another important RF impairment, particularly associated with OFDM-based communication systems. It is caused by the instability of the local oscillator and by the mobility of users [25, 26]. OFDM systems divide the available bandwidth into many orthogonal subcarriers with a small subcarrier spacing. The subcarrier orthogonality is lost in the presence of CFO, leading to intercarrier interference. Because the CFO cannot be perfectly estimated in the presence of I/Q imbalances, most recent works (e.g., [18-24, 27]) treat them jointly. The algorithms introduced in [18-21] analyze the performance degradation and propose compensation schemes coping with CFO and receiver I/Q imbalance, but do not take into account the transmitter I/Q imbalance. Gil et al. [22], Chung and Phoong [23], and Tandur and Moonen [24] developed joint estimation algorithms for the I/Q imbalance, CFO, and channel; however, a frequency-independent I/Q imbalance model is assumed. Finally, in [27], compensation methods for the mitigation of frequency-selective transmitter and receiver I/Q imbalance combined with CFO and channel distortions were proposed by the authors of this article. However, the application of those algorithms is restricted to systems with a preamble pilot, and the computational complexity of the CFO estimation is very large.
In this paper, we consider DSP-based compensation of frequency-selective transmitter and receiver I/Q imbalance together with a frequency-selective channel and CFO in the OFDM system context. The central theme is to first systematically formulate the baseband equivalent of the received signal. Based on this signal model, we propose new low-complexity decoupled estimation and compensation algorithms. Unlike joint estimation algorithms (e.g., [22, 24]), where the compensator parameters have to be reestimated with channel variations, the decoupled schemes are advantageous in the sense that only the varying parameter has to be reestimated, while the other parameters remain the same. More precisely, at the estimation and compensation stage, we first compensate the generally time-invariant receiver I/Q imbalance in the time domain with blind methods proposed in the authors' earlier work [11, 13]. The subsequent estimation and compensation of the nonidealities, including CFO, channel distortions, and transmitter I/Q imbalance, is done for pilot symbol-assisted modulation (PSAM) OFDM systems in each transmission. With practical considerations, two pilot patterns are taken into account in this work, namely, the preamble pilot pattern, where one complete OFDM symbol in a transmission frame is assumed known, and the sparse pilot pattern, where pilot symbols are placed on different subcarriers of certain OFDM symbols. We propose two alternative estimation and compensation structures depending on the pilot pattern. The leading principle in CFO estimation and compensation for both pilot patterns is to use the already existing technique of [28] and apply it in the time domain.
For the preamble pilot-based estimation, we propose zero-forcing (ZF) and maximum-likelihood (ML) estimation methods for joint compensation of channel distortions and transmitter I/Q imbalance, whereas an algorithm for the successive compensation of channel distortions and transmitter I/Q imbalance is proposed for the sparse pilot-based structure. The performance of the algorithms is evaluated with extensive computer simulations as well as with a laboratory measurement setup.
The novelty of this paper is as follows. (1) We do not make any specific assumption about the location of pilots for the sparse pilot-based estimation and compensation structure. Existing algorithms, for example [29], allocate the pilots to mirror-frequency pairs, which is generally not valid in practical radio systems like LTE [5]. (2) In this work, instead of proposing comprehensive, brand-new algorithms to cope with multiple RF impairments at once, we reduce the complexity of the problem by reforming the overall receiver design and decoupling the effects of the individual RF impairments. Then, by incorporating already existing efficient algorithms with the newly proposed methods, a hybrid time- and frequency-domain compensation architecture with very reasonable complexity and good performance is achieved. (3) Practical RF measurements are used to verify the applicability of the algorithms in real-world receiver design, which, to the best of the authors' knowledge, has not been addressed so far in the literature. The performance under individual impairments has been evaluated before, for example in [9, 11], but RF performance evaluation in the presence of all the considered impairments is still new. (4) We use two pilot patterns widely deployed in current and upcoming OFDM-based radio systems (e.g., IEEE 802.11n [1], DVB-H/T [3, 4], and LTE [5]), namely, the preamble-based and sparsely located pilot structures, for parameter estimation. As a result, the proposed algorithms can be directly applied in the corresponding receiver designs without any modification.
An attractive feature of receiver I/Q imbalance compensation algorithm is that it is able to track the time-variation of I/Q imbalance and updates the coefficients of compensation filter appropriately.
The paper is organized as follows. I/Q imbalance model and its impact, OFDM signal model under frequencyselective I/Q imbalances, channel distortions, and CFO are described in Section 2. Section 3 presents the estimation and compensation techniques for the mitigation of impairments. Computer simulation results are shown in Section 4. In Section 5, the measurement setup and obtained results are presented. Finally, the conclusions are drawn in Section 6.
Preliminaries. The notations used in this paper are as follows. Scalar parameters are represented by lower-case letters a and frequency-domain quantities by upper-case letters A. We denote time-domain vectors/matrices by lower-case/upper-case bold face with over-bar and frequency-domain vectors/matrices by bold face letters a/A. Superscripts (·)^T, (·)^*, and (·)^{−1} denote the transpose, conjugate, and inverse of a vector, a scalar, or a matrix, respectively. The (i, j)th entry of a matrix is denoted by A_{ij} and the (i, i)th entry by A_i. The convolution operation is indicated by ∗. If a = [a_1, a_2, . . . , a_{N/2}, . . . , a_N

Frequency-Selective I/Q Imbalance Model.
The direct-conversion transmitter architecture is based on directly I/Q-upconverting the baseband signal to the RF frequency. The upconversion is performed in the analog domain by a quadrature mixer, which theoretically provides infinite image-signal attenuation. This eliminates the need for an image-rejection filter, relaxing the overall requirement for RF filtering. A perfectly balanced modulator corresponds to equal gain and a 90° phase difference between the quadrature branches. In practice, however, this requirement is not fully satisfied. In addition, other modem components such as the DACs and LPFs in the I- and Q-branches are not perfectly matched [7,8]. These effects, collectively called I/Q imbalance, result in limited suppression of the image signal [8,9]. In the wideband system context, the reconstruction filters in the modulator exhibit a frequency-dependent response, which causes the I/Q imbalance to vary as a function of frequency. Consequently, we characterize the I/Q imbalance with the transmitter gain and phase imbalance parameters g_T and φ_T, respectively [8,27]. In addition, the relative nonideal filter transfer function between the I- and Q-branches is modeled with the filter h_T(t). A conceptual block diagram is illustrated in Figure 1. To see the impact of I/Q imbalance on the transmitted signal, we denote the baseband-equivalent ideal transmit signal as z(t) = z_I(t) + j z_Q(t). The complex envelope of the transmitted RF signal with I/Q imbalance effects is [8]

z̃(t) = g_{1,T}(t) ∗ z(t) + g_{2,T}(t) ∗ z^*(t), (1)

where the complex filters g_{1,T}(t) and g_{2,T}(t) correspond to the imbalance filters and are defined as

g_{1,T}(t) = [δ(t) + g_T e^{jφ_T} h_T(t)]/2, g_{2,T}(t) = [δ(t) − g_T e^{jφ_T} h_T(t)]/2. (2)

The baseband-equivalent signal in the frequency domain is

Z̃(f) = G_{1,T}(f) Z(f) + G_{2,T}(f) Z^*(−f). (3)

Equation (3) indicates that the imbalanced baseband signal is a weighted sum of the desired signal Z(f) and the undesired image signal Z^*(−f). The undesired image term is produced by the I/Q imbalance and results in mirror-frequency interference.
On the receiver side, the RF signal is downconverted to baseband using a quadrature demodulator. The downconverted baseband equivalent of the received signal r(t) is of the form [11]

x(t) = g_{1,R}(t) ∗ r(t) + g_{2,R}(t) ∗ r^*(t), (4)

with the receiver I/Q imbalance filters g_{1,R}(t) and g_{2,R}(t) given as

g_{1,R}(t) = [δ(t) + g_R e^{−jφ_R} h_R(t)]/2, g_{2,R}(t) = [δ(t) − g_R e^{jφ_R} h_R(t)]/2. (5)

The Fourier transform of (4) is

X(f) = G_{1,R}(f) R(f) + G_{2,R}(f) R^*(−f). (6)

Again, the mirror-frequency interference due to I/Q imbalance is evident from (6). The image rejection ratio (IRR) quantifies the image-signal suppression and is defined as the ratio of the desired-signal power to the image-signal power, expressed in dB as

IRR(f) = 10 log_{10}( |G_{1,R}(f)|^2 / |G_{2,R}(f)|^2 ). (7)

With careful analog design, IRRs on the order of 25-40 dB are considered feasible [6-8].
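As a numeric sanity check of the IRR definition above, the following sketch applies the imbalance model x = G1·r + G2·r* to a single tone and verifies that the image leakage measured from the DFT matches the closed-form IRR. The frequency-flat imbalance values (g = 1.02, φ = 3°) are hypothetical and not taken from the paper.

```python
import numpy as np

# Hypothetical imbalance values (not from the paper): 2% gain, 3 deg phase error.
g, phi = 1.02, np.deg2rad(3.0)
G1 = (1 + g * np.exp(1j * phi)) / 2
G2 = (1 - g * np.exp(1j * phi)) / 2

# Apply the imbalance to a complex tone on bin +64; energy leaks to bin -64.
N = 1024
n = np.arange(N)
r = np.exp(2j * np.pi * 64 * n / N)
x = G1 * r + G2 * np.conj(r)          # imbalanced signal
X = np.fft.fft(x) / N

irr_measured = 10 * np.log10(np.abs(X[64])**2 / np.abs(X[N - 64])**2)
irr_formula = 10 * np.log10(np.abs(G1)**2 / np.abs(G2)**2)
print(irr_measured, irr_formula)
```

For these example values the IRR comes out at roughly 31 dB, consistent with the 25-40 dB range quoted for careful analog design.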

OFDM Signal Model with RF Impairments.
We consider an OFDM system with N subcarriers. At the transmitter, OFDM symbols are generated by computing the N-point inverse discrete Fourier transform (IDFT) of the data symbols u. A cyclic prefix (CP) of length L_CP is appended ahead of each OFDM symbol and removed at the receiver before the DFT. Referring to Figure 2 and considering the transmission of the ith symbol of size N × 1, the frequency-domain transmitted signal corresponding to the ith OFDM symbol is given by

s_i = G_{1,T} u_i + G_{2,T} u_i^#, (8)

where G_{1,T} and G_{2,T} are N × N frequency-domain diagonal matrices whose diagonal elements contain the frequency responses of the transmitter I/Q imbalance filters (g_{1,T}, g_{2,T} in (2)). We consider transmission over a block-fading channel whose impulse response is denoted by h̄ = [h(0), h(1), . . . , h(L − 1)]^T with channel length L. The received frequency-domain signal can then be expressed as

r_i = H s_i + n = H (G_{1,T} u_i + G_{2,T} u_i^#) + n, (9)

where H is the N × N frequency-domain matrix that contains the channel frequency response on its main diagonal and n is additive white Gaussian noise (AWGN). Assume now that the local oscillators of the transmitter and receiver are not synchronized and a CFO Δf is present. If we denote the time-domain received signal vector by r̄_i, then the signal in the presence of CFO can be modeled as

Ω̄_i r̄_i, with Ω̄_i = diag( e^{j2πΔf nT_s} ), n = 0, 1, . . . , N − 1. (10)

Here Δf is the frequency offset in Hertz and T_s is the sampling interval. The frequency-domain N × N CFO matrix Ω_i = F Ω̄_i F^{−1} is circulant in nature and has c(0) on its main diagonal and c_i(1) and c_i(N − 1) on the first sub- and superdiagonals [30]. Including the CFO effect in (9), we obtain

r_i = Ω_i H (G_{1,T} u_i + G_{2,T} u_i^#) + n. (11)

Finally, taking into account the receiver I/Q imbalance, the signal at the downconversion stage appears as

x_i = G_{1,R} r_i + G_{2,R} r_i^#, (12)

where again G_{1,R} and G_{2,R} are N × N frequency-domain diagonal matrices whose diagonal elements are the DFTs of the receiver I/Q imbalance filters (g_{1,R}, g_{2,R} in (5)).
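The CP insertion and the CFO model above can be illustrated with a small self-contained sketch (illustrative parameters, not the paper's: N = 64, L_CP = 16, normalized CFO 0.1). It shows that without CFO the receiver DFT recovers the subcarriers exactly, while the CFO phase ramp destroys orthogonality and leaves residual ICI:

```python
import numpy as np
rng = np.random.default_rng(0)

N, L_cp = 64, 16
u = (rng.choice([-1, 1], N) + 1j * rng.choice([-1, 1], N)) / np.sqrt(2)  # QPSK data
eps = 0.1                                   # CFO normalized to the subcarrier spacing

s = np.fft.ifft(u) * np.sqrt(N)             # OFDM modulation (IDFT)
tx = np.concatenate([s[-L_cp:], s])         # cyclic prefix insertion

n = np.arange(-L_cp, N)
rx = tx * np.exp(2j * np.pi * eps * n / N)  # CFO as a time-domain phase ramp

r = np.fft.fft(rx[L_cp:]) / np.sqrt(N)      # CP removal + DFT at the receiver

evm_no_cfo = np.max(np.abs(np.fft.fft(tx[L_cp:]) / np.sqrt(N) - u))
evm_cfo = np.max(np.abs(r - u))
print(evm_no_cfo, evm_cfo)                  # CFO destroys orthogonality (ICI)
```

With zero CFO the residual is at floating-point level; with the 0.1 normalized CFO every subcarrier is rotated, attenuated, and polluted by leakage from its neighbors.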
After some simple mathematical manipulations, the complex baseband equivalent of the received signal in terms of the transmitter/receiver I/Q imbalance, channel, and CFO is given by

x_i = (G_{1,R} Ω_i H G_{1,T} + G_{2,R} Ω_i^# H^# G_{2,T}^#) u_i + (G_{1,R} Ω_i H G_{2,T} + G_{2,R} Ω_i^# H^# G_{1,T}^#) u_i^# + v, (13)

where Ω_i^#, H^#, G_{1,T}^#, and G_{2,T}^# are the mirrored matrices of Ω_i, H, G_{1,T}, and G_{2,T}, respectively. The noise term becomes v = G_{1,R} n + G_{2,R} n^#, whose elements are still complex circular Gaussian, but with correlated mirror subcarriers. In the ideal case of perfect transmitter/receiver modulator/demodulator matching as well as perfect synchronization between the transmitter and receiver local oscillators, the model in (13) reduces to x_i = H u_i + n, which can be equalized with one complex multiplication per subcarrier in each OFDM symbol. As is evident from (13), both the I/Q imbalances and the CFO introduce ICI and severely degrade the system performance; thus, digital compensation is needed.
In this paper, we assume that the length of the cascade of impulse responses of the transmitter I/Q imbalance filters (g 1,T (t) and g 2,T (t) in (1)), the multipath radio channel, and the receiver I/Q imbalance filters (g 1,R (t) and g 2,R (t) in (4)) does not exceed the length of CP. It is reasonable to make such an assumption as the frequency-selective transceiver I/Q imbalance is rather moderate and can be estimated with sufficient accuracy with only a few taps [8,11,27].

Estimation and Compensation of Nonidealities
The estimation and compensation of radio-frequency impairments in OFDM systems can be performed blindly and/or with the aid of training (a.k.a. pilot) symbols. Forming such pilot-symbol-assisted OFDM systems involves inserting known symbols into the stream of data symbols. In practical system design, two types of pilot patterns are widely used: the preamble pilot pattern, which places an entirely known OFDM symbol at the beginning of the frame, and the sparse pilot pattern, where pilot symbols are sparsely inserted at some subcarriers of specific OFDM symbols. Figure 3 illustrates these two ways of inserting pilots among the data symbols.
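The two pilot patterns can be sketched as boolean masks over the time-frequency grid. The grid dimensions and the LTE-like subcarrier stagger below are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

# Illustrative dimensions (not from the paper): 12 subcarriers x 14 OFDM symbols.
n_sc, n_sym = 12, 14

# Preamble pattern: the entire first OFDM symbol consists of known pilots.
preamble = np.zeros((n_sc, n_sym), dtype=bool)
preamble[:, 0] = True

# Sparse (LTE-like) pattern: every 6th subcarrier on the 1st and 5th symbol
# of each 7-symbol slot, with a staggered frequency offset between the two.
sparse = np.zeros((n_sc, n_sym), dtype=bool)
for slot_start in range(0, n_sym, 7):
    sparse[0::6, slot_start] = True
    sparse[3::6, slot_start + 4] = True

print(int(preamble.sum()), int(sparse.sum()))
```

The preamble mask spends all its pilot energy up front (good when the channel is static over the frame), while the sparse mask spreads the same role over time so the estimates can track slow variation.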
In the following, we discuss receiver-based estimation and compensation algorithms for mitigating all the considered RF impairments. With the developed signal model and the given pilot patterns, we are able to isolate the impairments individually and then estimate and compensate them in the reverse order of their appearance in the transmitter and receiver front ends. We first discuss receiver I/Q imbalance estimation and compensation, which is performed in the time domain and is independent of the pilot pattern. Then, two novel low-complexity structures are proposed for CFO, channel, and transmitter I/Q imbalance compensation with both the preamble and sparse pilot structures. The CFO estimation and compensation is also carried out in the time domain, while the channel and transmitter I/Q imbalance are compensated in the frequency domain.

Blind Receiver I/Q Imbalance Compensation.
For receiver I/Q imbalance estimation and compensation, we propose to utilize the statistical signal-processing-based blind I/Q imbalance compensation algorithms developed in the authors' earlier work [11,13,14]. In general, most complex communication waveforms based on M-PSK and M-QAM (with M > 2) are proper, or circular [31]. An OFDM signal constructed from these alphabets is thus also circular. I/Q imbalance destroys this property and makes the signal noncircular. The strategy used in [11] is to recover the circularity using a second-order-statistics-based approach, under the assumption that the received signal (without receiver I/Q imbalance) is circular. The circularity of the received signal is inherently lost in the presence of transmitter I/Q imbalance, potentially hindering the applicability of the I/Q imbalance compensation algorithms of [11]. However, it is shown in [13] that either a CFO or a fading channel restores the circularity of the received signal (without receiver I/Q imbalance), thus making it possible to utilize the circularity-based algorithms of [11]. These blind algorithms are also beneficial in that they do not affect the subsequent signal processing for CFO, channel, and transmitter I/Q imbalance estimation and compensation.
For notational simplicity, we drop the noise term in (13) and continue the analysis. Notice that even though the actual compensation processing is done partly in the time domain (for receiver I/Q imbalance and CFO), the following analysis is carried out entirely in the frequency domain. From (12), we recognize that the signal x_i is a linear combination of the received signal r_i and its mirror conjugate r_i^#. For such a signal model, the natural compensator is of the form

y_i = x_i + W_R x_i^#. (14)

Substituting the observed signal x_i of (12) into the compensator in (14), the output can be written as

y_i = (G_{1,R} + W_R G_{2,R}^#) r_i + (G_{2,R} + W_R G_{1,R}^#) r_i^#. (15)

From (15), it follows that the optimum compensation filter W_R canceling the mirror-frequency interference is the frequency-domain N × N diagonal matrix W_OPT = −G_{2,R} (G_{1,R}^#)^{−1}, as derived in [11].
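The structure of the compensator in (14) and the optimality of W_OPT can be checked numerically. In this sketch the diagonal matrices G_{1,R} and G_{2,R} are collapsed to hypothetical frequency-flat scalars (not values from the paper), in which case the mirror conjugate reduces to a plain conjugate:

```python
import numpy as np
rng = np.random.default_rng(1)

# Hypothetical frequency-flat imbalance: scalars stand in for the diagonal
# matrices G_1,R and G_2,R of the paper's model.
G1, G2 = 0.98 + 0.01j, 0.03 - 0.02j

r = (rng.standard_normal(4096) + 1j * rng.standard_normal(4096)) / np.sqrt(2)
x = G1 * r + G2 * np.conj(r)            # received signal with Rx I/Q imbalance

w_opt = -G2 / np.conj(G1)               # W_OPT = -G_2,R (G_1,R^#)^(-1)
y = x + w_opt * np.conj(x)              # compensator of the form (14)

# The image coefficient (G_2,R + W G_1,R^#) vanishes, so y is proportional to r.
image_coeff = G2 + w_opt * np.conj(G1)
direct_coeff = G1 + w_opt * np.conj(G2)
print(abs(image_coeff), np.max(np.abs(y - direct_coeff * r)))
```

The image coefficient cancels to floating-point precision, leaving only a complex scaling of the clean signal, exactly the behavior the approximation leading to (17) relies on.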
Applying the optimum compensation filter to (15) cancels the mirror-conjugate term r_i^# and results in the output signal

y_i = (G_{1,R} − G_{2,R} (G_{1,R}^#)^{−1} G_{2,R}^#) r_i. (16)

It is well known that |G_{1;T,R}| > |G_{2;T,R}| and |G_{1;T,R}| ≈ 1 for any practical imbalance values; thus, the contribution of the second term inside the parentheses is very small compared to the first and can be omitted, leading to the following form:

y_i ≈ G_{1,R} r_i. (17)

Comparing (16) with (13) suggests that, given W_OPT, the compensator is able to fully suppress the image-signal term generated by a frequency-selective receiver I/Q imbalance in the presence of CFO, channel distortions, and transmitter I/Q imbalance. The estimation of W_OPT is described in [11,13] and is not reproduced here. Next, we discuss the pilot-based estimation and compensation schemes for CFO, channel distortions, and transmitter I/Q imbalance.

Estimation and Compensation with Preamble Pilot Structure. The preamble pilot structure utilizes one complete pilot OFDM symbol embedded in the OFDM frame. The preamble pilot symbol is transmitted at the beginning of the transmission to estimate the RF impairments. This kind of pilot pattern assumes that the channel remains static over several OFDM symbols and has the benefit of using the available bandwidth efficiently. OFDM-based systems such as IEEE 802.11n and IEEE 802.16d have a preamble at the beginning of the frame [25]. Figure 4 shows the system model with the RF impairment estimation and compensation blocks in place. In the following, we discuss the algorithms for the estimation and compensation of the RF impairments.

CFO Estimation and Compensation.
There exists abundant literature on CFO estimation (see, e.g., [18-25] and references therein), but most of these algorithms assume a specific pilot structure, which restricts their use in our problem. The subspace-based CFO estimation approach proposed in [26] utilizes only one OFDM symbol and can be used here directly, at the expense of increased computational complexity. We instead suggest using the CFO estimation algorithm of [28] as discussed in [25]. More specifically, the actual CFO estimation is performed in two stages: first, the fractional part of the CFO is estimated and corrected in the time domain using the CP correlation; the integer part is then estimated in the frequency domain in the second stage. Even though the presence of transmitter I/Q imbalance is not taken into account in this method, we show by simulations that the algorithm still achieves reasonably good CFO estimation performance. Assuming a perfect CFO estimate is obtained, we perform CFO compensation in the time domain by multiplying the time-domain equivalent of (17) by the complex conjugate of Ω̄_i, that is,

ȳ_i^c = Ω̄_i^* ȳ_i, (18)

whose frequency-domain description involves the effective receiver filter G_{1,R}^Δ = Ω_i^# G_{1,R}.
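The fractional-CFO stage of the two-step estimator exploits the fact that each CP sample repeats N samples later, rotated by 2πε. The following noise-free sketch (hypothetical ε = 0.23, not a value from the paper) recovers the fractional CFO from the angle of the CP correlation:

```python
import numpy as np
rng = np.random.default_rng(2)

N, L_cp = 256, 32
eps_true = 0.23                             # hypothetical fractional CFO (normalized)

u = np.exp(2j * np.pi * rng.random(N))      # unit-modulus data symbols
s = np.fft.ifft(u)
tx = np.concatenate([s[-L_cp:], s])         # CP-extended OFDM symbol

n = np.arange(N + L_cp)
rx = tx * np.exp(2j * np.pi * eps_true * n / N)   # apply CFO (noise-free sketch)

# CP correlation: rx[0:L_cp] repeats N samples later, rotated by 2*pi*eps.
corr = np.sum(np.conj(rx[:L_cp]) * rx[N:N + L_cp])
eps_hat = np.angle(corr) / (2 * np.pi)
print(eps_hat)
```

Because the angle is unambiguous only within ±π, this stage resolves ε modulo one subcarrier spacing; the integer part must come from the second, frequency-domain stage, as described above.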

Joint Transmitter I/Q Imbalance and Channel Equalization.
We now switch to the subcarrier model and write the CFO-compensated symbol for the kth subcarrier together with the complex conjugate of its mirror subcarrier. In the subsequent derivations, the subcarrier index −k refers to the physically opposite mirror subcarrier:

[X_k^i, (X_{−k}^i)^*]^T = G_{total,k} [U_k^i, (U_{−k}^i)^*]^T + [V_k^i, (V_{−k}^i)^*]^T. (20)

The 2 × 2 matrix G_{total,k} represents the joint transmitter I/Q imbalance and channel response. Assuming the coefficients of G_{total,k} are known, we can directly apply the ZF or ML detection principle on the mirror-subcarrier pair to estimate the transmitted symbols. With the ZF equalization scheme, the estimate of the original transmit symbols is obtained by solving (20) for each subcarrier k and its mirror subcarrier −k as

[Û_k^i, (Û_{−k}^i)^*]^T = G_{total,k}^{−1} [X_k^i, (X_{−k}^i)^*]^T. (21)

It is well known that ZF equalization suffers from noise enhancement, yet it gives good calibration performance, as will be shown in Section 4. Alternatively, the transmitted symbols can be estimated with the ML detection principle, which minimizes the cost function

J(Ŭ_k^i, Ŭ_{−k}^i) = ‖ [X_k^i, (X_{−k}^i)^*]^T − G_{total,k} [Ŭ_k^i, (Ŭ_{−k}^i)^*]^T ‖^2, (22)

where Ŭ_k^i in (22) denotes a trial value of the OFDM symbol U_k^i. The ML equalizer is able to exploit the frequency diversity induced by the transmitter I/Q imbalance and gives better performance than ZF, but its computational complexity is very large and grows with higher-order constellations and a large number of subcarriers.
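The ZF equalization of the mirror-pair model (20) amounts to inverting one 2 × 2 system per pair. The matrix entries below are hypothetical, chosen only to illustrate the mechanics:

```python
import numpy as np
rng = np.random.default_rng(3)

# Hypothetical 2x2 joint Tx-imbalance/channel matrix for one mirror pair (k, -k):
# [X_k, conj(X_-k)]^T = G_total @ [U_k, conj(U_-k)]^T  (noise omitted)
G_total = np.array([[0.9 + 0.4j, 0.05 - 0.02j],
                    [0.03 + 0.01j, 1.1 - 0.2j]])

qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
U = rng.choice(qpsk, 2)                     # symbols on subcarrier k and mirror -k
X = G_total @ np.array([U[0], np.conj(U[1])])

# ZF: invert the 2x2 system per mirror pair (noise enhancement is the known price).
est = np.linalg.solve(G_total, X)
U_hat = np.array([est[0], np.conj(est[1])])
print(np.max(np.abs(U_hat - U)))
```

An ML detector would instead evaluate the cost of every candidate pair from the constellation, which is why its complexity explodes for higher-order QAM.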

LS Identification of G_{total,k}.
We consider the LS estimation of the joint transmitter I/Q imbalance and channel filters G_D and G_M. The filters are estimated in the time domain, similarly to [32], where a time-domain ML channel estimator was proposed for OFDM. This approach implicitly utilizes the frequency-domain correlation of the channel without making any statistical assumptions about it, by limiting the length of the estimated channel response to N_g ≪ N. We switch to matrix-vector algebra and write the time-domain received symbol corresponding to the preamble pilot as ȳ, where N is the length of the pilot data and N_g < L_CP is the length of the estimated filters. The received symbol vector during the pilot symbol is then

ȳ = [Ū_p(n) Ū_p^#(n)] [ḡ_D^T ḡ_M^T]^T + n̄, (23)

where Ū_p(n) is the time-domain circular-convolution matrix formed from the IDFT of the pilot symbol u_p. The LS estimates of the imbalance filters can be computed as [33]

[ĝ_D^T ĝ_M^T]^T = Ū_{p,b}^†(n) ȳ, (24)

where Ū_{p,b}^†(n) represents the pseudoinverse of Ū_{p,b}(n) = [Ū_p(n) Ū_p^#(n)] and is given in [33]. Frequency-domain expressions of the estimated filters G_D and G_M, which are needed for equalization/detection in (20)-(22), are then obtained through the DFT.
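The LS identification above can be sketched by stacking the circular-convolution matrices of the pilot and its conjugate and solving via the pseudoinverse. The pilot length, filter length, and filter values below are illustrative assumptions:

```python
import numpy as np
rng = np.random.default_rng(4)

N, Ng = 64, 4                               # pilot length and filter length (Ng << N)

def circmat(v, cols):
    # First `cols` columns of the N x N circular-convolution matrix of v.
    idx = (np.arange(N)[:, None] - np.arange(cols)[None, :]) % N
    return v[idx]

u = np.exp(2j * np.pi * rng.random(N))      # known time-domain pilot symbol
g_D = rng.standard_normal(Ng) + 1j * rng.standard_normal(Ng)             # direct filter
g_M = 0.05 * (rng.standard_normal(Ng) + 1j * rng.standard_normal(Ng))    # mirror filter

# Received pilot: direct channel acting on u plus image channel acting on conj(u).
y = circmat(u, Ng) @ g_D + circmat(np.conj(u), Ng) @ g_M

A = np.hstack([circmat(u, Ng), circmat(np.conj(u), Ng)])   # stacked regressor
g_hat = np.linalg.lstsq(A, y, rcond=None)[0]               # LS via pseudoinverse
print(np.max(np.abs(g_hat - np.concatenate([g_D, g_M]))))
```

Because only 2·N_g unknowns are fitted against N equations, the short-filter constraint acts as the implicit frequency-domain smoothing mentioned above.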

Estimation and Compensation with Sparse Pilot Structure.
OFDM systems such as LTE and DVB-T/H do not include a preamble pilot in their frames; rather, pilot tones are inserted sparsely into the OFDM symbols. The DVB-T/H standard defines both continual and scattered pilots; LTE, on the other hand, includes pilot subcarriers only in specified OFDM symbols. Figure 5 shows the reference pilot structure for the LTE system, with pilot symbols at every sixth subcarrier during the first and fifth OFDM symbols of each slot.
One big challenge with the sparsely located pilot structure is that the pilots are most likely not allocated to mirror-frequency pairs, which is required by most of the pilot-based algorithms developed in the literature, for example [29]. The compensation algorithms proposed in this section consist of estimating the impairments at the pilot frequencies and then interpolating the estimates over all the subcarriers. Therefore, no pilot-structure modification is needed, and the algorithms can be directly applied to the development of, for example, an LTE receiver. The estimation and compensation structure is illustrated in Figure 6. Again, notice that the receiver I/Q imbalance has already been compensated with the blind method discussed in Section 3.1.

CFO Estimation and Compensation.
For the CFO estimation, we again apply the two-step time-domain approach of [25] as discussed in the previous subsection, resulting in a time-domain CFO-compensated signal analogous to (18).

Decoupled Transmitter I/Q Imbalance and Channel Equalization.
With the sparse pilot structure, we propose to compensate the channel and the transmitter I/Q imbalance successively. Algorithm 1 summarizes the estimation of the transmitted symbols.
Algorithm 1 (Estimation and compensation of channel distortions and transmitter I/Q imbalance).
(1) Estimate the joint filters G_D and G_M at the pilot subcarriers and interpolate the estimates over all active subcarriers. (2) Switch to the subcarrier model and write (25) in the frequency domain for subcarrier k as

X_k = G_{D,k} U_k + G_{M,k} U_{−k}^* + V_k. (26)

(3) Divide (26) by G_{D,k} to obtain the equalized subcarriers as

X_k / G_{D,k} = U_k + (G_{M,k}/G_{D,k}) U_{−k}^* + V_k / G_{D,k}. (27)

Conceptually, a conventional channel equalizer in the absence of CFO and I/Q imbalances operates on a per-subcarrier basis, requiring one complex multiplication per active subcarrier. A joint equalizer for the channel, CFO, and Tx/Rx imbalances, however, has to take into account the contributions of all active subcarriers; for such an equalizer with N_a active subcarriers, N_a^2 complex multiplications would be required. Comparing this with the complexity of the proposed decoupled schemes clearly shows the benefit of decoupled compensation.
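A minimal sketch of the decoupled idea, under the assumption of known G_D and G_M and ignoring CFO and noise: one complex division per subcarrier equalizes the channel, and the residual mirror leakage is then removed pairwise, keeping the cost linear in the number of subcarriers rather than quadratic:

```python
import numpy as np
rng = np.random.default_rng(5)

N = 8                                       # toy subcarrier count (illustrative)
mirror = (-np.arange(N)) % N                # index of the mirror subcarrier -k

G_D = 1 + 0.2 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))   # direct
G_M = 0.05 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))      # image
U = np.exp(2j * np.pi * rng.random(N))

X = G_D * U + G_M * np.conj(U[mirror])      # per-subcarrier model, cf. (26)

# Step 1: one complex division per subcarrier equalizes the channel part.
E = X / G_D
b = G_M / G_D                               # residual image-leakage ratio

# Step 2: cancel the mirror term pairwise (still O(N) work, not O(N^2)).
U_hat = (E - b * np.conj(E[mirror])) / (1 - b * np.conj(b[mirror]))
print(np.max(np.abs(U_hat - U)))
```

The pairwise cancellation works for the self-mirrored bins (DC and N/2) as well, since the formula degenerates gracefully to a 1/(1 − |b|²) correction there.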

Simulation Results
The performance of the proposed algorithms is illustrated in this section with computer simulations. We first evaluate the performance of the CFO estimator in the presence of transmitter I/Q imbalance and channel distortions, without receiver I/Q imbalance, for both the preamble and sparse pilot structures. Then, SER simulations are performed to evaluate the detection performance of the compensation schemes in the presence of transmitter and receiver I/Q imbalance, channel distortions, and CFO.

Performance of CFO Estimator.
The parameters considered for the CFO estimator simulations are as follows: an OFDM-based system with 64-QAM subcarrier modulation, a total of N = 1024 subcarriers of which 600 are active, 50 OFDM symbols, and a subcarrier spacing of 15 kHz. In the preamble pilot case, the first complete OFDM symbol is taken as the pilot and used for CFO estimation. In the sparse pilot example, pilot symbols are inserted at every sixth subcarrier. The multipath channel is 39 taps long with a maximum delay spread of 2.5 μs and the power delay profile described in [34]. The MSE of the CFO estimator of [25] is evaluated for the preamble and sparse pilot cases. For the sparse pilot example, we assume that the system under consideration is LTE, for which the normalized CFO can be larger than one [25], and we simulate the algorithm over the normalized CFO range [−5, 5]. As evident from the results, the algorithm delivers sufficiently reliable estimates for small to large imbalance values. At SNR = 20 dB, the MSE remains below 10^−7 with the preamble and on the order of 10^−5 when the pilots are inserted sparsely.

SER Performance.
Now, we perform the SER simulations and illustrate the results by plotting the mean SER as a function of the received SNR. A typical 64-QAM OFDM system is considered, whose parameters are given in Table 2. The OFDM signal parameters are similar to the 3GPP Long-Term Evolution (LTE) specifications [35]. In the following figures, the legend "No FE Distortion" refers to the case where no transmitter or receiver I/Q imbalance and no CFO are present and known channel estimates are used for channel equalization; "W/o Compensation" refers to the case where all radio impairments are present and only the channel is equalized with known estimates; and "W/Tx Rx IQ Imbalance only" to the case where only transmitter and receiver I/Q imbalance are present and the channel is equalized with known estimates. The legends "W/ZF Compensation", "W/ML Compensation", "W/LS Compensation", and "W/WLS Compensation" denote the schemes proposed in Section 3. The depicted results are averaged over 10^3 independent channel realizations.
For preamble pilot-based compensation, we transmit 50 OFDM symbols and use the first OFDM symbol as the pilot.
In each simulation run, we generate a random CFO in the range [−1, 1] and introduce it to the signal (together with the transmitter I/Q imbalance and channel distortions). The receiver I/Q imbalance is compensated with a 3-tap compensation filter built using the second-order statistics of the received signal. This is followed by CFO estimation and compensation, performed in the time domain. The ZF and ML estimation techniques are applied next. It can be seen in Figure 9 that the communication system becomes completely unusable when no compensation scheme is in place, and the SER remains very high when only I/Q imbalance is present at both the transmitter and the receiver. The ZF equalizer is able to reduce the SER to the level of a system with no front-end distortion. The ML equalizer, on the other hand, even outperforms the SER of the system with no RF impairments and known channel estimates: in the low-SNR region the SER is close to that of the ideal system, but as the SNR increases, the equalizer is able to extract the frequency-diversity gain induced by the transmitter I/Q imbalance. A similar diversity effect is reported in [29], where only transmitter I/Q imbalance and channel distortions are considered.
The OFDM signal model for the sparse pilot-based compensation is similar to the LTE frame structure, consisting of one frame composed of ten subframes; each subframe is further divided into two slots. A slot consists of seven OFDM symbols, with pilot symbols inserted as shown in Figure 5. The estimates obtained at the pilot subcarriers are interpolated in the frequency domain for channel-distortion and transmitter I/Q imbalance calibration. Figure 10 illustrates the simulation results, which demonstrate the excellent calibration properties of the proposed algorithms, with WLS achieving an SER close to that of a system with no FE distortions.

Experimental RF Measurements Results
To further validate the performance of the proposed compensation algorithms, real-world RF measurements have been performed. We first summarize the overall measurement setup and then describe the measurement procedure in detail.

Prototype Implementation.
The generic structure of the measurement setup is identical to that shown in Figures 4 and 6, and the corresponding block diagram is depicted in Figure 11. The prototype implementation integrates a Rohde & Schwarz (R&S) AFQ100A baseband I/Q signal generator, an R&S FSG spectrum and signal analyzer, an R&S SMJ vector signal generator, an HP E4422B RF signal generator, a LeCroy 534AL quad-channel oscilloscope, an Analog Devices (AD) AD8349 direct-conversion modulator evaluation kit, and a Maxim 2023 demodulator evaluation kit. All instruments are connected to a computer via the general-purpose interface bus (GPIB) for remote-control operation.

Measurement Setup.
A computer with MATLAB is used for the digital signal-processing tasks and for instrument control. The baseband OFDM signal, with parameters similar to the simulations of Section 4, is generated on the computer, and its samples are loaded into the memory of the R&S AFQ. The analog I/Q signal output by the R&S AFQ is upconverted to the RF frequency by the AD8349 modulator. In order to introduce the channel distortions and noise, we capture the baseband equivalent of the RF signal using the built-in I/Q subsystem of the R&S FSG. For front-end characterization, Z(n) and Z^*(n) denote the time-domain convolution matrix and conjugate convolution matrix formed from the reference signal z(n) and its conjugate z^*(n), respectively. Assuming perfect synchronization in time, frequency, and phase, the LS estimates of the front-end imbalance filters are then obtained as in [8,11]. In Figure 12, IRR curves for different I/Q imbalance filter lengths, utilizing 50,000 samples and ensemble-averaged over ten independent runs, are plotted. The LO frequency is 1.5 GHz, and the signal is a 16-QAM modulated signal with 18 MHz two-sided bandwidth, six-times oversampling, and root-raised-cosine pulse shaping. The results suggest that both chips have frequency-selective image rejection, varying between 27-29 dB for the AD8349 (transmitter modulator) and 30-33 dB for the Maxim 2023 (receiver demodulator). After these steps, we apply the compensation algorithm. For illustration purposes, we use a half-loaded OFDM signal with 300 active subcarriers (out of 1024 in total). The LO frequency is 1.5 GHz. Figure 13 shows the measured spectra before and after compensation, evidencing a clear improvement obtained with the compensation algorithm.

Characterization of Pilot-Based Compensation Schemes.
The performance analysis of the pilot-based compensation schemes is carried out by creating an OFDM waveform with the parameters given in Table 2. The samples are processed for delay estimation, timing synchronization, and LO-leakage removal. After this data conditioning, the channel distortions and noise are added before feeding the signal to the R&S SMJ. The CFO effect is also introduced in the R&S SMJ by changing the RF frequency. The RF signal is downconverted and sampled at 50 MSamples/s. These samples are extracted from the oscilloscope and again processed in MATLAB for resampling to 30.72 MHz, delay estimation, phase synchronization, and scaling. After these processing stages, we have a baseband-equivalent model of the signal that includes the transmitter/receiver I/Q imbalance, channel distortion, noise, and CFO effects.
The example waveform used for the preamble pilot-based compensation algorithms is an OFDM signal with the same parameters as in the simulations. In the measurements, we transmit 10 OFDM symbols and use one known OFDM symbol during the estimation and calibration phase. The CFO is 10 kHz, and a 3-tap compensation filter is used for receiver I/Q imbalance compensation as well as for transmitter I/Q imbalance estimation. Both the ML and ZF equalization schemes are applied to the signal after receiver I/Q imbalance and CFO compensation. With a fixed channel, the obtained results, ensemble-averaged over 10 independent measurements, are plotted. Figure 14 shows the measurement results of the CFO estimation over different SNR values. The results indicate that the estimator performance improves with increasing SNR; even for small SNR, the MSE remains very low. In Figure 15, the measured SER versus SNR is plotted. The figure also shows a reference curve ("No FE Distortion") obtained with a MATLAB simulation with no I/Q imbalances, perfect CFO compensation, and channel equalization with known estimates. As evident from Figure 15, the algorithms give very consistent performance, with the measured SER close to the simulated SER. Also, the ML technique performs better than the ZF technique. Notice that for SNR greater than 26 dB there is a slight performance degradation, most likely due to measurement noise. In Figure 16, we also plot the symbol constellation after compensation with the ZF equalization scheme; the excellent calibration property of the algorithms is visible here as well. In the second measurement example, the signal is an OFDM signal with 10 symbols and pilots located at every sixth subcarrier. The signal has altogether 600 active subcarriers out of 1024, with 15 kHz subcarrier spacing; in total, there are 1000 pilot symbols. The signal is oversampled by 2. The CFO is now 45 kHz, which is greater than the subcarrier spacing.
Again, the receiver I/Q imbalance is compensated with a 3-tap filter, and the LS and WLS approaches discussed in Section 3 are employed for transmitter I/Q imbalance and channel equalization. The CFO estimation performance, in terms of the MSE between the estimated and true CFO values versus SNR, is plotted in Figure 17 and shows behavior similar to the previous preamble pilot-based estimation example. The measured SER versus SNR is plotted in Figure 18 and the symbols after compensation in Figure 19. The results suggest that the proposed methods deliver very accurate estimates. An important implication of Figure 18 is that with more pilot data available over different symbols, we are able to reduce the effect of the measurement noise that was dominant in the previous measurement example (Figure 15).
In general, it can be concluded from the measurement results of Figures 15 and 18 that the proposed estimator and compensator structures provide considerable improvements in the achievable SER, giving confidence for deploying them in practical systems.

Implementation Challenges.
In practice, the measurement setup is exposed to a variety of errors, which must be corrected before introducing the channel distortion and noise, as well as before applying the compensation algorithms. We briefly review these errors now.
The downconversion of the RF signal (with transmitter I/Q imbalance) using the R&S I/Q subsystem is subject to unavoidable residual errors such as synchronization errors, LO leakage, delay between the original and measured signals, component nonlinearities, and noise of its own. Frequency synchronization between the instruments is achieved by connecting the instruments to the same reference clock of the R&S FSG, as shown in Figure 11. Symbol-timing synchronization is performed with an ML algorithm prior to downsampling the signal to the symbol rate. LO leakage results from the finite isolation between the LO and RF ports of the modulator/demodulator; consequently, when the RF signal is up-/downconverted, the LO leakage causes a DC offset, which can be mitigated by removing the sample mean of the measured signal. The delay estimation is performed in two steps: first, the integer delay is removed with an FFT-based correlator, after which a non-data-aided (NDA) ML algorithm is used for fractional delay estimation and compensation. A detailed review of synchronization algorithms can be found in [36]. As with the R&S FSG, the data captured from the oscilloscope must be DSP-conditioned before being used in the actual estimation and compensation. The data conditioning involves resampling to the original signal sample rate, delay estimation, and offset removal with the already-mentioned algorithms.
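The integer-delay step can be sketched with an FFT-based circular correlator (the fractional NDA ML stage is omitted here). The signal length and delay value are hypothetical:

```python
import numpy as np
rng = np.random.default_rng(6)

N, true_delay = 1024, 37                    # hypothetical values for illustration
ref = rng.standard_normal(N) + 1j * rng.standard_normal(N)   # reference signal
meas = np.roll(ref, true_delay)             # measured signal, circularly delayed

# FFT-based correlator: circular cross-correlation via the frequency domain.
xcorr = np.fft.ifft(np.fft.fft(meas) * np.conj(np.fft.fft(ref)))
delay_hat = int(np.argmax(np.abs(xcorr)))
print(delay_hat)
```

Computing the correlation through the FFT costs O(N log N) instead of O(N^2), which is what makes this the natural first stage before the finer fractional-delay search.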

Conclusion
In this paper, the performance of OFDM-based communication systems has been studied in the presence of frequency-selective I/Q imbalance, channel distortions, and CFO. Based on our signal model, generally applicable algorithms have been developed for the compensation of these impairments. The proposed methods combine time-domain and frequency-domain compensation. It is shown by simulations that the proposed estimator and compensator structures produce symbol error rates close to the ideal ones. A laboratory measurement setup is also proposed, and extensive measurements are carried out to prove the practical value of the algorithms. The measurement results show that the achievable SER is close to the ideal; hence, the compensation algorithms provide a significant improvement in the obtainable link performance and can be used in real-world radio transceivers. Future work will focus on generalizing the work to MIMO-OFDM and on a real-time prototype implementation using FPGAs for the digital parts and integrated RF ASICs for the analog circuitry.

Introduction
To increase the transmission capacity, multiple-input multiple-output (MIMO) systems have been proposed for next-generation wireless communication systems, and the need for high-performance, low-complexity MIMO detectors has therefore become an important issue. The maximum-likelihood (ML) detector is known to be optimal; however, it is impractical to realize owing to its prohibitive computational complexity. To address this problem, researchers have proposed tree-based search algorithms, such as sphere decoding [1] and K-best decoding [2], which reduce the complexity while attaining near-optimal performance. On the other hand, channel-matrix preprocessing techniques, such as lattice-reduction-aided (LRA) detection [3], have been proposed to improve MIMO detection performance.
Lattice reduction transforms the channel matrix into a more orthogonal one by finding a better basis for the same lattice, so as to improve the diversity gain of the MIMO detector. The Lenstra-Lenstra-Lovász (LLL) algorithm is a well-known lattice reduction algorithm owing to its polynomial execution time. In the literature [4], the LLL algorithm is widely employed to improve lattice-reduction MIMO detection or to reduce the MIMO detection complexity. However, the LLL algorithm contains many redundant check operations that have not been addressed in the literature. These redundant operations lead to many unnecessary computations and thus increase the processing latency and complexity. We therefore propose a look-ahead check technique to detect and avoid the unnecessary check operations in the LLL algorithm. This technique not only generates a lattice-reduced matrix that satisfies the size-reduction and LLL-reduction conditions of the original LLL algorithm but also applies to both the real- and complex-valued LLL algorithms [5].
The remainder of this paper is organized as follows. Section 2 briefly describes the signal model for MIMO detection. In Section 3, we introduce lattice-reduction-aided MIMO detection and the LLL algorithm. In Section 4, we present the proposed LR-LLL algorithm, and in Section 5 we present the simulation and analysis results. The corresponding hardware architecture and processing-cycle-count estimates are shown in Section 6. Finally, we summarize our conclusions in Section 7.

System Model
A narrow-band N r × N t MIMO system consisting of N t transmitters and N r receivers can be modeled by y = Hx + n, (1) where x ∈ A Nt is the transmitted signal vector, y ∈ C Nr is the received signal vector, H = [h 1 , h 2 , . . . , h Nt ] represents a flat-fading channel matrix, and n ∈ C Nr is white Gaussian noise with variance σ 2 n . All the vectors h i are independent and identically distributed complex Gaussian random vectors with zero mean and unit variance. The set A consists of the constellation points of the QAM modulation. We then reformulate the equivalent real-valued signal model as y r = H r x r + n r , (2) where y r = [Re(y) T , Im(y) T ] T , x r = [Re(x) T , Im(x) T ] T , n r = [Re(n) T , Im(n) T ] T , and H r = [ Re(H) −Im(H); Im(H) Re(H) ]. The dimension of H r thus becomes N × M, where M = 2N t and N = 2N r ; the vectors y r and n r belong to R N , and x r ∈ A M . QR decomposition is often applied in the preprocessing of MIMO detection because it improves decoding efficiency. The channel matrix H r can then be expressed as H r = Q r R r , (3) where Q r ∈ R N×M is an orthogonal matrix and R r ∈ R M×M is an upper triangular matrix. By multiplying Q T r on both sides of (2), we obtain Q T r y r = R r x r + Q T r n r , where Q T r n r is still white Gaussian. In addition, we adopt column-norm-based sorted QR decomposition (SQRD) [6] because it not only enhances the detection performance but also reduces the computational complexity of the lattice reduction [7].
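As a concrete sketch of this preprocessing, the code below builds the real-valued equivalent channel and performs a column-norm-based sorted QR decomposition via modified Gram-Schmidt with greedy minimum-residual-norm pivoting. The function names and interface are our own illustration of one common SQRD formulation, not the authors' implementation.

```python
import numpy as np

def real_equivalent(H):
    """Real-valued equivalent of a complex channel matrix (stacked Re/Im convention)."""
    return np.block([[H.real, -H.imag],
                     [H.imag,  H.real]])

def sorted_qr(H):
    """Column-norm-based sorted QR decomposition (SQRD) via modified Gram-Schmidt.
    Returns Q, R and the column permutation such that H[:, perm] = Q @ R."""
    H = np.array(H, dtype=float)
    n, m = H.shape
    Q = np.zeros((n, m))
    R = np.zeros((m, m))
    perm = np.arange(m)
    for i in range(m):
        # pick the remaining column with the smallest residual norm
        j = i + int(np.argmin(np.sum(H[:, i:] ** 2, axis=0)))
        H[:, [i, j]] = H[:, [j, i]]
        R[:, [i, j]] = R[:, [j, i]]
        perm[[i, j]] = perm[[j, i]]
        # normalize the pivot column and orthogonalize the rest against it
        R[i, i] = np.linalg.norm(H[:, i])
        Q[:, i] = H[:, i] / R[i, i]
        for k in range(i + 1, m):
            R[i, k] = Q[:, i] @ H[:, k]
            H[:, k] -= R[i, k] * Q[:, i]
    return Q, R, perm
```

Sorting by the smallest residual norm first is the ordering rule that helps both detection performance and the subsequent lattice reduction.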

Lattice Reduction
The lattice generated by the columns of H r is L(H r ) = {H r z : z ∈ Z M }, where the columns h i are the basis vectors. A lattice reduction algorithm aims to find a unimodular matrix T (| det T| = 1 and all elements of T are integers) such that a more orthogonal H̃ r = H r T generates the same lattice as H r . Then, the signal model becomes y r = H r x r + n r = (H r T)(T −1 x r ) + n r = H̃ r s + n r .
If x r ∈ Z M , then s = T −1 x r ∈ Z M . In practice, the transmitted signals x r do not belong to an integer set; however, we can still transform the signals x r ∈ A M into an integer set by linear operations such as scaling and shifting. Several lattice-reduction algorithms are described in the literature, and the LLL algorithm [11] is the most popular approach. Because QR preprocessing is often employed in the MIMO detector, the LLL algorithm has been modified to operate on the Q and R matrices [12], as shown in Algorithm 1. In the literature, lines (4) to (19) are often defined as a loop that can be decomposed into two parts: (1) lines (4) to (10) deal with the size reduction operations, and (2) lines (11) to (19) handle the LLL reduction operations. The number of iterations performed in the size reduction depends on the index k, and the LLL reduction operation may increase or decrease the index k depending on the result of the LLL reduction check (δ|R(k − 1, k − 1)| 2 > |R(k, k)| 2 + |R(k − 1, k)| 2 ). Therefore, the number of loops depends on the values in the R matrix, and thus the processing latency varies for different channel matrices. Moreover, we find that most of the computational complexity is contributed by the operations executed when the check conditions (μ ≠ 0 for the size reduction and δ|R(k − 1, k − 1)| 2 > |R(k, k)| 2 + |R(k − 1, k)| 2 for the LLL reduction check) are satisfied, that is, the size reduction and LLL reduction update operations. The statistics of the redundant decreases of the index k are listed in Table 1. We can see that the percentage of redundant decreases of k reaches 67% for 2 × 2 lattice reduction and converges to about 28% when the MIMO dimension is larger than 10 × 10. Therefore, we propose a look-ahead check technique to modify the index k and avoid unnecessary check operations in the original LLL algorithm.
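For reference, the QR-based LLL loop described above (size reduction, Lovász check, and a column swap followed by a Givens rotation that restores the triangular form) can be sketched as follows. This is our own illustration of the standard algorithm, not the paper's Algorithm 1 listing.

```python
import numpy as np

def lll_reduce(Q, R, delta=0.75):
    """LLL lattice reduction operating directly on a QR factorization.
    Input: Q, R with H = Q R.  Output: Q', R', T with H T = Q' R',
    T unimodular, and R' size reduced and satisfying the Lovasz condition."""
    Q, R = Q.copy(), R.copy()
    m = R.shape[1]
    T = np.eye(m)
    k = 1
    while k < m:
        # size reduction of column k against columns k-1, ..., 0
        for l in range(k - 1, -1, -1):
            mu = int(np.rint(R[l, k] / R[l, l]))
            if mu != 0:
                R[: l + 1, k] -= mu * R[: l + 1, l]
                T[:, k] -= mu * T[:, l]
        # LLL (Lovasz) reduction check
        if delta * R[k - 1, k - 1] ** 2 > R[k, k] ** 2 + R[k - 1, k] ** 2:
            # swap columns k-1 and k, then re-triangularize with a Givens rotation
            R[:, [k - 1, k]] = R[:, [k, k - 1]]
            T[:, [k - 1, k]] = T[:, [k, k - 1]]
            a, b = R[k - 1, k - 1], R[k, k - 1]
            r = np.hypot(a, b)
            G = np.array([[a, b], [-b, a]]) / r  # zeroes the subdiagonal entry
            R[k - 1 : k + 1, k - 1 :] = G @ R[k - 1 : k + 1, k - 1 :]
            Q[:, k - 1 : k + 1] = Q[:, k - 1 : k + 1] @ G.T
            k = max(k - 1, 1)   # the decrease of k that may be redundant
        else:
            k += 1
    return Q, R, T
```

The `k = max(k - 1, 1)` step is exactly the decrease of k whose redundancy the proposed look-ahead check targets.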

Look-Ahead Check
The number of loops is often treated as a benchmark for computational complexity and latency in the literature on LRA MIMO detection [7]. In order to eliminate the redundant check operations in the LLL algorithm, we propose a look-ahead check technique by classifying the original loops into a forward loop and a back loop. The resulting loop-reduction LLL algorithm is shown in Algorithm 2, and the corresponding flow chart of the proposed algorithm and each loop is shown in Figure 1.

Back Loop.
We define the back loop as the loop that contains only the LLL reduction check and the LLL violation processing, as shown in Figure 1. We find that the size reduction constraint will not be violated after k is decreased, because R(k, k) has already been size reduced in the previous processing and is not changed by the column-swapping operation. Moreover, the Givens rotation only changes rows k and k − 1, while the rows above row k − 1 remain size reduced. That means only the LLL reduction check is required in the back loop. This check is named the back-state LLL reduction check to differentiate it from the original one. Because only the LLL reduction part needs to be executed in the back loop, we use a while loop in our algorithm to avoid the redundant size reduction operations. Nonetheless, there is one case in which the original LLL algorithm would still perform the size reduction in the back loop: when the division result exactly equals 0.5. Our algorithm skips this size reduction, which can produce a different lattice-reduced matrix; however, either choice yields a matrix that obeys the LLL lattice reduction criterion. Although we cannot prove mathematically that the performance is identical, we will show through Monte-Carlo simulation in a later section that it is. Notice that if the stay-state size reduction condition (μ ≠ 0) does not hold, the LLL reduction constraint cannot be violated, because R(k, k) has already been LLL reduced in the previous processing and all values remain unchanged. Therefore, we can also perform the stay-state size reduction check ahead: if the stay-state size-reduction constraint is not violated, we simply increase the index k by 2 to skip a redundant LLL reduction check and enter the forward loop.
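The back-loop idea can be sketched as follows: after a swap, the algorithm retreats using only the back-state LLL reduction check and performs no size reduction until it returns to forward processing. This is our own illustrative reconstruction of the loop-reduction behavior, not the authors' Algorithm 2 listing; the 0.5-tie size reductions are skipped as described above, and the stay-state "increase k by 2" look-ahead is omitted for brevity.

```python
import numpy as np

def lr_lll_reduce(Q, R, delta=0.75):
    """Loop-reduction LLL sketch: the back loop runs only the back-state
    LLL check, skipping the size reductions proven redundant above."""
    Q, R = Q.copy(), R.copy()
    m = R.shape[1]
    T = np.eye(m)

    def lovasz_fails(k):
        return delta * R[k - 1, k - 1] ** 2 > R[k, k] ** 2 + R[k - 1, k] ** 2

    def swap_rotate(k):
        # swap columns k-1, k and restore the triangular form via a Givens rotation
        R[:, [k - 1, k]] = R[:, [k, k - 1]]
        T[:, [k - 1, k]] = T[:, [k, k - 1]]
        a, b = R[k - 1, k - 1], R[k, k - 1]
        r = np.hypot(a, b)
        G = np.array([[a, b], [-b, a]]) / r
        R[k - 1 : k + 1, k - 1 :] = G @ R[k - 1 : k + 1, k - 1 :]
        Q[:, k - 1 : k + 1] = Q[:, k - 1 : k + 1] @ G.T

    k = 1
    while k < m:
        # forward loop: full size reduction of column k
        for l in range(k - 1, -1, -1):
            mu = int(np.rint(R[l, k] / R[l, l]))
            if mu != 0:
                R[: l + 1, k] -= mu * R[: l + 1, l]
                T[:, k] -= mu * T[:, l]
        if lovasz_fails(k):
            swap_rotate(k)
            # back loop: retreat using only the back-state LLL check
            while k > 1 and lovasz_fails(k - 1):
                k -= 1
                swap_rotate(k)
        else:
            k += 1
    return Q, R, T
```

When the back-state check passes, the outer loop resumes forward processing at the current index, so the output still satisfies the LLL reduction criterion.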

Simulation Results
To verify the proposed LLL algorithm, we simulate LLL-aided MIMO detections based on the MIMO system described in Section 2, and we employ sorted QR decomposition in all MIMO detectors. The LLL-reduction parameter δ equals 0.75, as suggested in [11]. The simulation settings and performance results are given in Table 2 and in Figures 2 and 3, respectively. The performance of our algorithm and the original LLL algorithm is exactly the same.
We also analyze the computational complexity and latency of our algorithm. The results for 4 × 4 and 8 × 8 MIMO systems are listed in Tables 3 and 4. The computation is divided into four operation types: addition, multiplication, division, and Givens rotation. Our algorithm has lower total computational complexity, especially in division, which tends to cost the most computation time; the reduction ratio matches the loop-reduction ratio. However, average computational complexity alone cannot fully show the advantage of our algorithm. Since the original LLL algorithm contains many redundant check operations that cannot be processed in parallel, it incurs a long average processing time to complete the lattice reduction. We therefore estimate the latency by parallelizing all possible operations. The latency counts are as follows: lines (5) to (9) in our algorithm are counted as one division, one addition, and one multiplication; the LLL reduction check operation contains four multiplications and two additions; the column swap operation is counted as no operation delay; the Givens rotation counts one at each back loop; and the stay-state size reduction is counted as a division operation. The latency is shown in the tables after the dashed line. The saving is about 22%∼29% and grows with the antenna number.

Hardware Architecture
6.1. Top Structure. In this paper, we propose an intuitive architecture for our LR-LLL algorithm, shown in Figure 4. The center controller updates the index k according to the LLL violation results and sends control signals that select the specific matrix elements fed to the input of the combinational circuit. We refer to the size reduction part and the LLL reduction part as the size reduction loop and the LLL reduction loop, respectively. The update circuits for the remaining R, T, and Q matrices are omitted for simplicity. In this architecture, the CORDIC circuit has two pipeline stages, so one cycle is required for a size reduction loop and four cycles for an LLL reduction loop. The traditional LLL algorithm always processes the forward loop, which exercises the whole circuit, whereas with the LR-LLL algorithm some forward loops are replaced by back loops. The average cycle counts for the LLL algorithm and our LR-LLL algorithm are listed in Table 5. We can see that as the antenna number grows, the reduction in average cycle count also grows. The FPGA results are shown in Table 6. In [5], the complex-valued LLL lattice reduction algorithm is shown to have lower computational complexity than the real-valued one, mainly because the real-valued system has twice the dimension of the complex-valued one; hence, our hardware cost and cycle counts may be larger than those of the previous two complex-valued designs.
6.2. Division Architecture. In Figure 5, we show a four-stage long-division architecture for the five-bit-output divr circuit. The size reduction update circuit is composed of multiplication and addition circuits. Instead of calculating the squared norm for the LLL reduction comparison, we use a CORDIC vectoring-mode circuit to calculate the square root of the norm, which also serves as the output if the LLL check is violated. The square root of δ is set to 0.875 to approximate the square root of 0.75. CORDIC rotation mode is used to perform the Givens rotation of the algorithm.
The output of the comparison circuit is the LLL reduction violation check result, which drives the center controller and also enables the update circuits. The LLL reduction update block contains multiple CORDIC rotation circuits that apply the Givens rotations to the remaining row elements of the R and Q matrices. It also contains a swap circuit for the T matrix.
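For intuition, the two CORDIC modes used in this architecture can be sketched as a floating-point behavioral model (the hardware uses fixed-point pipeline stages; the constant K below is the usual CORDIC gain, and the iteration count is our own choice):

```python
import math

CORDIC_ITERS = 16
# CORDIC gain accumulated over the chosen number of micro-rotations
K = math.prod(math.sqrt(1 + 2.0 ** (-2 * i)) for i in range(CORDIC_ITERS))

def cordic_vectoring(x, y):
    """Vectoring mode: drives y to 0 and returns (vector norm, angle).
    Assumes x > 0, as for a positive QR diagonal element."""
    z = 0.0
    for i in range(CORDIC_ITERS):
        d = -1.0 if y > 0 else 1.0          # rotate toward the x-axis
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)       # accumulate the rotated angle
    return x / K, z

def cordic_rotation(x, y, theta):
    """Rotation mode: rotates (x, y) by theta (convergence range about +/-99 deg)."""
    for i in range(CORDIC_ITERS):
        d = 1.0 if theta > 0 else -1.0      # drive the residual angle to 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        theta -= d * math.atan(2.0 ** -i)
    return x / K, y / K
```

Vectoring mode yields the square root of the norm directly (no explicit square root circuit), and rotation mode realizes the Givens rotation applied to the remaining rows of R and Q.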

Conclusion
In this paper, we propose a look-ahead check technique to eliminate unnecessary check operations in the LLL algorithm. The proposed algorithm not only reduces the average number of loops in the LLL algorithm but also reduces its computational complexity and latency. We also propose an intuitive architecture to estimate the clock-cycle saving of our algorithm; the saving increases dramatically as the antenna number grows. Therefore, we believe that the proposed loop-reduction LLL algorithm benefits lattice-reduction-aided MIMO detection.

Introduction
Numerous research results have shown that spatial multiplexing-based multiple-input multiple-output (MIMO) transceivers can boost the data rate of wireless communication systems without using additional power or channel bandwidth. When both transmitter and receiver are equipped with multiple antennas, multiple independent data streams can be concurrently launched into the air by the transmitter and separately decoded at the receiver with the aid of certain signal processing techniques [1]. By combining MIMO with multicarrier schemes such as orthogonal frequency division multiplexing (the so-called MIMO-OFDM), the channel capacity can be tremendously increased, as the degrees of freedom in the frequency and spatial domains are jointly utilized.
By assuming that channel state information (CSI) is available at both ends of the link, singular value decomposition (SVD) [2] is commonly considered as the approach to extract spatial links within a MIMO channel. With SVD, the MIMO channel matrix can be transformed into a bank of parallel scalar air-pipes (widely termed eigenmodes in the literature) by applying appropriate beam-forming matrices at both transmitter and receiver. Since the gains of the eigenmodes realized by SVD are the singular values of the MIMO channel matrix, their link strengths can be quite different. Therefore, proper bit-allocation and power-distribution (such as water-filling) algorithms must be employed to optimize the system capacity. This, however, increases the complexity of the adaptive control mechanism, as bit/power allocations among eigenmodes may have to be updated frequently in rapidly time-varying channels. For the sake of convenience, one may simply apply the same modulation/coding format or distribute the power uniformly over all eigenmodes. Nevertheless, this may degrade the error performance of the system, especially when the magnitude of the minimum channel singular value is too small [3].
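As a minimal illustration of the water-filling power distribution mentioned above, the following generic sketch allocates p_i = max(μ − 1/g_i, 0) over parallel eigenmodes with power gains g_i; the function name and interface are ours, not tied to any particular standard.

```python
import numpy as np

def water_filling(gains, total_power):
    """Water-filling over parallel scalar channels with power gains g_i:
    p_i = max(mu - 1/g_i, 0), with the level mu chosen to spend total_power."""
    inv = 1.0 / np.asarray(gains, dtype=float)   # noise-normalized inverse gains
    order = np.argsort(inv)                      # strongest eigenmodes first
    for k in range(len(inv), 0, -1):
        mu = (total_power + inv[order[:k]].sum()) / k
        if mu > inv[order[k - 1]]:
            break                                # water level covers the k best modes
    return np.maximum(mu - inv, 0.0)
```

For example, with gains [1.0, 0.1] and unit total power, the weak eigenmode receives no power at all, which is exactly the behavior that motivates the equal-gain decomposition discussed next.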
Geometric mean decomposition (GMD), proposed by Jiang et al. [4], has emerged as an alternative method for designing MIMO transceivers. A set of parallel spatial links with identical gains intrinsic to a MIMO channel can be realized via GMD; remarkably, the identical gain of these parallel links equals the geometric mean of the channel singular values (eigenmodes). This property makes GMD useful for adaptive MIMO-OFDM systems, as complicated joint bit/power allocation algorithms in the frequency and spatial domains can be significantly simplified. In recent years, much attention from both academia and industry has been paid to the GMD-based MIMO transceiver. For instance, GMD has been deemed a prospective MIMO architecture in some next-generation cellular broadband networks, including two of the most important standards in this context: LTE-Advanced [5] and WiMAX (IEEE 802.16m) [6]. Note that Jiang et al. have put forward another MIMO transceiver, dubbed uniform channel decomposition (UCD) [7], which also provides spatial links with identical gains. The scope of this paper is focused on GMD; the related work on UCD remains an open problem for future research.
In order to calculate performance metrics such as the bit error rate and outage probability of a GMD-based MIMO transceiver operating in mobile fading channels, analytical results on the statistical properties of the MIMO eigenmode geometric mean are required. To the best of our knowledge, however, only a handful of papers in the existing literature offer the relevant analysis. In [8], the asymptotic behavior of the geometric mean of MIMO channel eigenvalues has been examined. Also, the probability density function (PDF) of the link gain of MIMO-GMD in the special cases with two antennas at both transmitter and receiver can be found in [9], but results for more general cases are not available. This motivated us to derive closed-form expressions of the PDFs (either exact or approximate) for systems with an arbitrary number of antennas in Rayleigh fading channels. Specifically, apart from the univariate PDFs, we are also interested in the bivariate PDFs of the MIMO eigenmode geometric mean and channel capacity, since the statistical behavior relating to channel variation is crucial for systems incorporating adaptive transmission techniques. For instance, knowledge of how the channel quality evolves with time allows engineers to determine the adaptation/feedback rate for MIMO-GMD in a more judicious manner. Further, for the successive interference cancelation (SIC) detection algorithm in conventional MIMO-GMD transceivers [4], the reliability of a data stream depends on the previously decoded data streams; that is, if the receiver fails to detect a data stream correctly, the data on the remaining streams may not be decoded successfully due to error propagation. Thus, it is possible to gauge how frequently an error-propagation event would occur if the time-varying characteristics are known.
Correspondingly, this paper applies first-order finite-state Markov chains (FSMCs) [10, 11] to model the fluctuations of both the link gain and the channel capacity under MIMO-GMD in baseline Rayleigh fading environments. Note that FSMCs have been used to emulate the capacity process of general MIMO wireless channels [12], but channels with a MIMO-GMD transceiver have not been considered. Also, some researchers have designed feedback mechanisms for adaptive modulation systems with channels modeled by FSMCs [13]. Based on the univariate and bivariate PDFs developed in this paper, the transition probabilities among the discrete states of an FSMC are analytically computed.
In a nutshell, this paper encompasses the following contributions.
(1) Derivation of the exact PDFs for link gains and channel capacities of MIMO-GMD systems with two antennas at either the transmitter or the receiver and an arbitrary number of antennas on the other side.
The rest of the paper is organized as follows. In Section 2, the assumptions on the channel model and the background of the MIMO-GMD architecture are briefly reviewed. Then, in Section 3, closed-form expressions of both the univariate and bivariate PDFs for the eigenmode geometric mean and channel capacity of MIMO-GMD schemes are derived. Based on the derived analytical results, FSMCs are constructed in Section 4 to study the time variation of the MIMO channel under GMD. Finally, the paper ends with a concise conclusion drawn in Section 5.

Background and Assumptions
In this section, we first describe the considered system model. Then, the concept of GMD and the fundamental structure of an FSMC are reviewed subsequently.

Channel Model.
In general, this paper mainly studies an (N t , N r ) single-user, point-to-point MIMO system with N t transmit antennas and N r receive antennas. The MIMO channel, H, is therefore an N r × N t matrix. Here, we denote m = min(N t , N r ) and n = max(N t , N r ). We presume this MIMO system is coupled with an OFDM setting, so the channel response per subcarrier is essentially flat in the frequency domain. Therefore, the signal model for each subcarrier can be written as y = Hx + z, where y is an N r × 1 vector of the received signal, x is an N t × 1 vector of the transmitted data symbols, and z is an N r × 1 vector of additive complex zero-mean Gaussian noise. Since Rayleigh fading is considered, all entries of H are modeled as complex Gaussian random variables with zero mean and unit variance, CN (0, 1). With SVD, the MIMO channel H can be decomposed as H = UΣV † , where U and V are unitary matrices and Σ contains the singular values. The statistics of the eigenmode geometric mean involve f (λ 1 , . . . , λ m ), the joint PDF of all m eigenvalues of the channel correlation matrix H † H, as given in (4). To make this paper more self-contained, we have tabulated the numerical values of E(g), var(g), k, and θ for the (4,4), (4,8), and (8,8) cases in Table 1; hence, one may construct the PDFs for g in these MIMO cases accordingly. The approximations based on Gamma distributions (17) can be further leveraged, using (15), to find the PDFs for channel capacity in MIMO-GMD schemes. To justify the suitability of Gamma approximations for eigenvalue geometric means, comparisons similar to those in Section 3.1 are carried out for the (4,4), (4,8), and (8,8) cases. In Figure 4, we can see that Gamma distributions provide excellent approximations for eigenvalue geometric means. Although not shown here, Gamma distributions have also been found to fit the simulation data in the m = 2 cases. Hence, we can claim that, for practical MIMO systems (m ≤ 8) at least, the distributions of link gains in MIMO-GMD can be accurately approximated by Gamma random variables.
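The Gamma moment-matching behind Table 1 can be reproduced numerically. The sketch below draws i.i.d. Rayleigh channels, computes the GMD link gain (the geometric mean of the singular values), and fits k and θ by matching mean and variance; the function names and sample counts are our own illustration.

```python
import numpy as np

def gmd_link_gain_samples(nt, nr, n_samples=5000, seed=0):
    """Monte-Carlo samples of the GMD link gain, i.e. the geometric mean
    of the singular values of an i.i.d. CN(0,1) Rayleigh channel matrix."""
    rng = np.random.default_rng(seed)
    m = min(nt, nr)
    g = np.empty(n_samples)
    for i in range(n_samples):
        H = (rng.standard_normal((nr, nt))
             + 1j * rng.standard_normal((nr, nt))) / np.sqrt(2.0)
        s = np.linalg.svd(H, compute_uv=False)   # channel singular values
        g[i] = s.prod() ** (1.0 / m)
    return g

def fit_gamma_moments(g):
    """Moment-matched Gamma parameters: k = E^2/Var, theta = Var/E."""
    mean, var = g.mean(), g.var()
    return mean ** 2 / var, var / mean
```

Plotting a histogram of `g` against the fitted Gamma density reproduces the kind of comparison shown in Figure 4.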
For the channel capacity, we compare the distributions of simulation samples with the capacity PDF (19), which was derived based on the Gamma approximation of g. The comparisons are shown in Figure 5, and it is clear that the simulation data and the computed results are closely aligned. An important observation is that, in contrast to conventional MIMO systems, the Gaussian approximation for the capacity distribution [16] does not seem to be appropriate for GMD-based MIMO transceiver schemes.
Then, ρ can be calculated by substituting (24) into (21). Apparently, a closed-form expression for (24) is difficult, if not impossible, to obtain; however, it can still be evaluated via numerical integration. Once the value of ρ is acquired, (20) can be extended to obtain the bivariate PDF for the MIMO-GMD channel capacity process, and f (C, C) follows by a Jacobian transform. With the PDFs derived in this section, one may construct an FSMC to model the fluctuation of the MIMO-GMD link gain and capacity, as explained in Section 4.

FSMC Construction
In order to investigate the time-varying behavior of MIMO-GMD systems, an FSMC is adopted to model both the link gain and the channel capacity. Based on the FSMC structure elaborated in Section 2.3, we delineate how the transition probabilities can be analytically calculated using the PDFs derived in Section 3. The transition probability from S i to S j , denoted as P i, j , for the link gain process g(t) can be computed as P i, j = Prob{g(t) ∈ S i , g(t + τ) ∈ S j } / Prob{g(t) ∈ S i }, (27) where Prob{g(t) ∈ S i , g(t + τ) ∈ S j } = ∫ S j ∫ S i f (g, g̃) dg dg̃ and Prob{g(t) ∈ S i } = ∫ S i f (g) dg. (28) Note that f (g) has been derived as (14) or (17), depending on the value of m. Analytical calculation of (27) can be quite difficult, so numerical integrations are used in this work instead. For the sake of simplicity, this paper assumes that the channel variation is slow enough that the process only transits to one of the adjacent states (from S i to S i−1 or S i+1 ) or stays in the same state (from S i to S i ); that is, P i, j = 0 for |i − j| > 1 and P i,i = 1 − P i,i−1 − P i,i+1 . For the channel capacity in the MIMO-GMD scheme, the transition probabilities can be computed in a similar fashion. In order to verify that an FSMC is an appropriate tool to model the fluctuation of the MIMO-GMD channel, several Monte Carlo simulations have been carried out. In particular, simulation results on the transition probabilities for both the MIMO eigenvalue geometric mean and the channel capacity are compared with our calculations by (26) and (29). Note that the threshold levels for state quantization are set arbitrarily in this work. In practical adaptive modulation schemes, for example, the threshold levels could be set based on the minimum required channel gain for a target error performance. In all simulations, we have f D = 30 Hz, τ = 0.001 sec, and γ = 10 dB.
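The adjacent-state FSMC described above can also be checked empirically by quantizing a sample path and counting transitions. The following generic estimation sketch is our own illustration (the toy sinusoidal process and thresholds are placeholders, not the paper's measured fading data):

```python
import numpy as np

def fsmc_transition_matrix(path, thresholds):
    """Estimate FSMC transition probabilities from one sample path.
    States are the quantization cells defined by the sorted thresholds."""
    states = np.digitize(path, thresholds)        # state index per sample
    n = len(thresholds) + 1
    counts = np.zeros((n, n))
    for a, b in zip(states[:-1], states[1:]):     # count one-step transitions
        counts[a, b] += 1.0
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

# a slowly varying toy process quantized into four states, mirroring the
# four-state examples used in the paper
path = 2.0 + 2.0 * np.sin(np.linspace(0.0, 20.0, 5000))
P = fsmc_transition_matrix(path, [1.5, 2.5, 3.5])
```

For a slowly varying process, the estimated matrix is diagonally dominant with only adjacent-state transitions, consistent with the assumption P i, j = 0 for |i − j| > 1.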
In Figure 6, the link gain of a (2,4) MIMO-GMD scheme is modeled as an FSMC consisting of four states with {T 0 , T 1 , T 2 } = {1.5, 2.5, 3.5}, and transition probabilities from both simulation and calculations are plotted. Similarly, we approximate the MIMO-GMD capacity process in a (2,2) system using a four-state FSMC with {T 0 , T 1 , T 2 } = {4, 7, 9} bps/Hz in Figure 7. In both cases, it is apparent that our calculations can provide very accurate approximations.

Conclusion
By applying GMD to a MIMO-OFDM system, multiple spatial links with identical gains can be realized within each subcarrier. This eases the complexity at the transmitter side, as the spatial-domain bit/power allocation can be simplified. To seek potential opportunities to further decrease the system complexity, this paper has investigated the statistical properties of MIMO systems using GMD. In particular, we have derived PDFs for the link gains (the MIMO eigenvalue geometric mean) and capacities. Although exact results are only available for cases with two parallel links, Gamma approximations can be used to model eigenvalue geometric means for general MIMO cases with arbitrary antenna configurations. Moreover, these results are extended to derive the bivariate PDFs, which are important for the analysis of time-varying behavior. These results are employed to model the channel variation in MIMO-GMD schemes by constructing FSMCs; specifically, the transition probabilities among states can be computed using the PDFs derived in this paper. This paper thus proposes a potential approach to predict the fluctuation of MIMO channels under GMD, which may allow engineers to relax the feedback rate and hence reduce the system burden. The analytical PDF results presented in this paper are for baseline cases of Rayleigh flat fading; the statistical properties of MIMO-GMD schemes in scenarios involving spatial correlation and Ricean fading remain open problems and should be addressed in the future.