In this paper, a configurable superimposed training (ST)/data-dependent ST (DDST) transmitter and an architecture based on array processors (APs) for DDST channel estimation are presented. Both architectures, designed under a full-hardware paradigm, were described in Verilog HDL, targeted to a Xilinx Virtex-5 FPGA, and compared with existing approaches. The synthesis results show an FPGA slice consumption of 1% for the transmitter and 3% for the estimator, with operating frequencies of 160 and 115 MHz, respectively. The signal-to-quantization-noise ratio (SQNR) of the transmitter is about 82 dB while supporting 4/16/64-QAM modulation. A Monte Carlo simulation demonstrates that the mean square error (MSE) of the channel estimator implemented in hardware is practically the same as that obtained with the floating-point golden model. The high performance and reduced hardware of the proposed architectures lead to the conclusion that the DDST concept can be applied in current communications standards.
1. Introduction
Presently, there is a need to develop communications systems capable of transmitting and receiving various types of information (data, voice, video, etc.) at high speed. Designing these systems is an extremely difficult task, so the system must be broken down into several stages, each with a specific task. The complexity of each stage is higher when the system operates in a wireless environment because of the additional challenges posed by the complex nature of the channel and its susceptibility to several types of interference.
Since the influence of the channel on the transmitted data cannot be avoided, one option is to characterize the channel parameters with enough precision that their effects can be reverted in the receiver. For that reason, the channel estimation stage is a key part of any reliable wireless system: a correct channel estimate leads to a reduction of the bit error rate (BER). The channel estimator must deal with multiple phenomena, such as multipath propagation and Doppler frequency shift (due to the mobility of the users). To deal with these problems, current communication standards specify the transmission of pilot signals that are known at the receiver, allowing an easy estimation of the communication channel. The ways of transmitting such pilot signals can be classified into two major branches: pilot-assisted transmission (PAT), where pilot and data signals are multiplexed in time, frequency, code, space, or a combination of these domains, and implicit training (IT), a recently proposed technique where the pilot signal is hidden in the transmitted data. PAT is the technique implemented in current standards, such as WiMAX, WiFi, and Bluetooth. It has the advantage that pilots and data lie in orthogonal subspaces, allowing a simple separation of them in the receiver; however, part of the bandwidth available for data must be given up in order to transmit the pilot signal. IT overcomes this problem because data and pilot signals are transmitted all the time; nevertheless, it leads to the transmission of such signals in nonorthogonal subspaces. Despite this, IT has been recognized as a feasible alternative for future communication standards [1].
The simplest way to carry out IT is to add (superimpose) the pilot signal to the data. This approach is known as superimposed training (ST), first proposed in [2] and enhanced by diverse authors whose results are summarized in [3, Ch. 6]. A refinement of ST known as data-dependent superimposed training (DDST) was presented in [4–8]. This technique makes it possible to null the interference of the data during the estimation process by adding a new training sequence, which depends on the transmitted data, to the data and the ST sequence.
Because of the benefits that ST/DDST offer, it is necessary to develop efficient implementations of these algorithms. Although these techniques have been widely studied, few practical implementations have been reported in the literature to date. In fact, almost all of them are approximations based on floating point and software. In [9], the algorithms are programmed in a digital signal processor (DSP) for a low-rate communication system, while in [10] the proposed implementation is developed on an embedded microprocessor with hardware accelerators inside an FPGA. At ReConFig 2011, we presented a full-hardware architecture (with high throughput, low hardware consumption, and a high degree of reusability) for the channel estimation stage of an ST/DDST receiver [11]. Its novelty was that a single systolic array processor (AP) performed the entire estimation process instead of two separate signal processing modules. This paper is an extended version of that work. A hardware-efficient architecture for a configurable ST/DDST transmitter that supports 4/16/64-QAM constellations complements the results presented in [11]: all transmitted data in each Monte Carlo trial are now generated by the proposed transmitter hardware instead of the transmitter simulation model programmed in Matlab.
The rest of the paper is organized as follows. Section 2 presents the system model under consideration, the ST/DDST transmitter structure, the channel estimation algorithm, and the reformulation of the cyclic mean for systolic APs. Section 3 describes in detail the full-hardware architecture of the configurable ST/DDST transmitter. Section 4 proposes an architecture based on a systolic AP for the DDST channel estimator. In Section 5, the performance of the proposed architectures is evaluated. Conclusions are given in Section 6.
Notation 1. Lowercase (uppercase) bold letters denote column vectors (matrices). The operators $(\mathbf{A})^H$, $(\mathbf{A})^T$, and $(\mathbf{A})^{-1}$ denote the Hermitian, transpose, and inverse operations of matrix $\mathbf{A}$, respectively. $\mathbf{1}_n$ represents a column vector of length $n$ with all its elements equal to one; similarly, $\mathbf{0}_n$ represents an all-zeros column vector of length $n$. $\mathbf{I}_n$ is the identity matrix of size $n \times n$. $[\mathbf{a}]_k$ denotes the $k$th element of vector $\mathbf{a}$. $[\mathbf{a}]_{m:n}$ denotes the vector formed with the elements of $\mathbf{a}$ as $[[\mathbf{a}]_m, [\mathbf{a}]_{m+1}, \ldots, [\mathbf{a}]_n]^T$. $\otimes$ represents the Kronecker product. Finally, $E(\cdot)$ represents the expectation operator.
2. System Model
This section introduces the DDST algorithm mentioned previously. Consider a single-carrier, baseband communication system based on DDST such as the one presented in Figure 1. The transmitted signal $s(k)$ is the sum of the data sequence $b(k)$, the training sequence $c(k)$, and the data-dependent training sequence $e(k)$. The index $k$ enumerates the samples of these signals, which are transmitted at a rate equal to $1/T$. $c(k)$ is a periodic sequence with period $P$ and power $\sigma_c^2$ [12]. It is assumed that the data sequence is a zero-mean, stationary stochastic process with power $\sigma_b^2$, whose symbols come from an equiprobable alphabet. The sequence $e(k)$ is constructed as described in [5]. $s(k)$ is propagated through the communication channel $h(k)$, whose impulse response is the convolution of the impulse responses of the system filters and of the propagation medium (all of them assumed to be time-invariant). Such a channel can be modeled as a finite impulse response (FIR) filter with at most $L$ time-invariant coefficients. Finally, the signal distorted by the channel is contaminated with the noise $n(k)$ to form the received signal $x(k)$. $n(k)$ is zero-mean white Gaussian noise with variance $\sigma_n^2$. The transmission of blocks of $N$ symbols, each preceded by a cyclic prefix of length $CP \geq L$, is assumed. Perfect block synchronization is also assumed, which allows fixing $P = L$. For ease of implementation, it is assumed that $N$ is a multiple of $P$ and that $P$ is a power of two.
Figure 1: Digital communication system model considered.
Thus, the received signal after removing the cyclic prefix can be expressed in a matrix form as follows:
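(1) $\mathbf{x} = \mathbf{H}(\mathbf{b} + \mathbf{c} + \mathbf{e}) + \mathbf{n},$
where $\mathbf{H}$ is a circulant matrix whose first row is given by $[\mathbf{h}^T, \mathbf{0}_{N-L}^T]$, and $\mathbf{h}$ is a vector containing the coefficients of the channel impulse response (CIR). Similarly, $\mathbf{x}$, $\mathbf{b}$, $\mathbf{c}$, $\mathbf{e}$, and $\mathbf{n}$ are vectors with elements $[\mathbf{x}]_k = x(k)$, $[\mathbf{b}]_k = b(k)$, $[\mathbf{c}]_k = c(k)$, $[\mathbf{e}]_k = e(k)$, and $[\mathbf{n}]_k = n(k)$, respectively, with $0 \leq k \leq N-1$.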
2.1. Digital Transmitter with ST/DDST Included
Figure 2 depicts the discrete-time baseband block diagram of the (data-dependent) superimposed training transmitter. This is a modified version of the IT transmitter presented in [3]. From Figure 2, it can be noted that the key component of the transmitter is the sequence transformation block. It serves to implicitly embed the training sequences onto data sequence b by the affine transformation expressed as
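(2) $\mathbf{s} = \mathbf{A}\mathbf{b} + \mathbf{c},$
where $\mathbf{s}$ represents the complex baseband discrete-time transmitted signal, $\mathbf{A}$ is a precoding matrix, and $\mathbf{c}$ is the vector obtained by replicating $N_P$ times one period of the training signal $\mathbf{c}_{OCI}$ of size $P$, that is,
(3) $\mathbf{c} = \mathbf{1}_{N_P} \otimes \mathbf{c}_{OCI},$
where $N_P = N/P$, $[\mathbf{c}_{OCI}]_n = c_{OCI}(n)$, and this sequence is given by [12]:
(4) $c_{OCI}(n) = \sigma_c e^{j(\pi/P)(n(n+v))}$
with $v = 1$ when $P$ is odd, $v = 2$ when $P$ is even, and $n = 0, \ldots, P-1$.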
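As an illustration of (3) and (4), the following sketch generates one period of the training sequence and replicates it over a block. It assumes NumPy; the function name `oci_sequence` and the sizes `P = 8`, `N = 64` are chosen here for illustration and are not taken from the hardware design.

```python
import numpy as np

def oci_sequence(P, sigma_c=1.0):
    """One period of the training sequence defined in (4)."""
    n = np.arange(P)
    v = 1 if P % 2 else 2              # v = 1 for odd P, v = 2 for even P
    return sigma_c * np.exp(1j * np.pi / P * n * (n + v))

P, N = 8, 64                           # illustrative sizes, N a multiple of P
c_oci = oci_sequence(P)
c = np.tile(c_oci, N // P)             # (3): N_P copies of one period

# The sequence has constant modulus, and the DFT of one period has a
# flat magnitude, so every periodic frequency component carries equal energy.
spectrum = np.abs(np.fft.fft(c_oci))
print(np.round(spectrum, 6))
```

For these parameters, every DFT bin of one period has magnitude $\sigma_c\sqrt{P}$, which is the property that makes the sequence convenient for channel estimation.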
Figure 2: Block diagram of the digital baseband (data-dependent) superimposed training transmitter.
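The precoding matrix selects the training technique according to
(5) $\text{Training} = \begin{cases} \mathbf{A} = \mathbf{I}_N,\ \mathbf{c} \neq \mathbf{0} & \text{for ST}, \\ \mathbf{A} = \mathbf{I}_N - \mathbf{G},\ \mathbf{c} \neq \mathbf{0} & \text{for DDST}, \end{cases}$
with
(6a) $\mathbf{G} = \frac{1}{N_P}\left(\mathbf{1}_{N_P \times 1} \otimes \mathbf{K}\right),$
(6b) $\mathbf{K} = \mathbf{1}_{1 \times N_P} \otimes \mathbf{I}_P,$
where $\mathbf{G}$ and $\mathbf{K}$ are matrices of sizes $N \times N$ and $P \times N$, respectively.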
In the DDST case, the $N$-length vector $\mathbf{e}$ containing the data-dependent sequence (DDS) can be obtained from (2), using (5)–(6b), as follows:
(7) $\mathbf{e} = \mathbf{G}\mathbf{b}.$
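The effect of the precoding matrix in (5) can be checked numerically. The sketch below builds $\mathbf{K}$ and $\mathbf{G}$ from (6a) and (6b) with Kronecker products and forms the ST and DDST blocks of (2). It assumes NumPy, small illustrative sizes, and random 4-QAM-like data; the per-position averaging operator `J_avg` is introduced here only to inspect the result.

```python
import numpy as np

P, N = 4, 16                                   # illustrative sizes
N_P = N // P

K = np.kron(np.ones((1, N_P)), np.eye(P))      # (6b), size P x N
G = np.kron(np.ones((N_P, 1)), K) / N_P        # (6a), size N x N
J_avg = K / N_P                                # averages position r over the N_P sub-blocks

rng = np.random.default_rng(0)
b = rng.choice([-1, 1], N) + 1j * rng.choice([-1, 1], N)   # 4-QAM-like data
n = np.arange(P)
c = np.tile(np.exp(1j * np.pi / P * n * (n + 2)), N_P)      # (3)-(4), P even

s_st = b + c                                   # A = I_N      (ST)
s_ddst = (np.eye(N) - G) @ b + c               # A = I_N - G  (DDST)

# With DDST, the per-position average of the block contains only the training.
print(np.round(J_avg @ s_ddst - c[:P], 12))
```

In ST mode the same average still contains a data-dependent residue, which is the interference that the DDST precoding removes.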
2.2. Channel Estimation Using DDST
Due to the periodicity of $c(k)$, $s(k)$ contains an embedded periodic signal with period $P$. Taking advantage of this characteristic, an estimate of the cyclic mean of the received signal is used to estimate the channel. The cyclic mean estimator is defined as
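(8) $\mathbf{y} = \mathbf{J}\mathbf{x},$
where $\mathbf{y}$ is a column vector of length $P$ whose elements are the estimated coefficients of the cyclic mean of $\mathbf{x}$, and $\mathbf{J}$ is given by
(9) $\mathbf{J} = \frac{1}{N_P}\left(\mathbf{1}_{N_P}^T \otimes \mathbf{I}_P\right).$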
According to [4], the estimation of the CIR is given by
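(10) $\hat{\mathbf{h}} = \boldsymbol{\Gamma}\mathbf{y},$
where $\hat{\mathbf{h}}$ is a vector containing the estimated CIR coefficients, $\boldsymbol{\Gamma}$ is a matrix formed by the first $L$ rows of $\mathbf{C}^{-1}$, and $\mathbf{C}$ is a circulant matrix of size $P \times P$ formed by the vector $[c(0), c(1), \ldots, c(P-1)]^T$.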
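A minimal floating-point sketch of the estimation chain (8)–(10) is given below. It is not the golden model used in this work: it assumes NumPy, a noise-free channel, circular convolution in place of the cyclic prefix, and a circulant $\mathbf{C}$ built with the training period as its first column. Under these assumptions the estimate reproduces the channel exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
P = L = 8
N = 512
N_P = N // P

n = np.arange(P)
c_oci = np.sqrt(0.2) * np.exp(1j * np.pi / P * n * (n + 2))      # (4), P even
b = np.sqrt(0.4) * (rng.choice([-1, 1], N) + 1j * rng.choice([-1, 1], N))

# DDST block: remove the per-position mean of the data, then add the training.
b_blocks = b.reshape(N_P, P)
s = (b_blocks - b_blocks.mean(axis=0) + c_oci).ravel()

# Random channel; circular convolution models the block after CP removal.
h = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2 * L)
x = np.fft.ifft(np.fft.fft(h, N) * np.fft.fft(s))

y = x.reshape(N_P, P).mean(axis=0)                               # (8)-(9)
C = np.array([[c_oci[(i - j) % P] for j in range(P)] for i in range(P)])
Gamma = np.linalg.inv(C)[:L, :]                                  # first L rows of C^{-1}
h_hat = Gamma @ y                                                # (10)

print(np.max(np.abs(h_hat - h)))
```

With additive noise, the same code yields an estimate whose error depends on the SNR, which is the quantity evaluated in Section 5.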
2.3. Cyclic Mean Algorithm Using Array Processors and Partitioning
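The following analysis describes how the cyclic mean is obtained using a systolic array that computes a matrix-vector multiplication (MVM). In (8), the MVM operation cannot be performed directly because of the Kronecker product involved. To avoid this cumbersome operator, the same equation can be reformulated as
(11) $\mathbf{y} = \frac{1}{N_P}\left(\aleph \mathbf{1}_{N_P}\right),$
where $\aleph$ is a matrix of size $P \times N_P$ defined as
(12) $\aleph = \begin{bmatrix} [\mathbf{x}]_0 & [\mathbf{x}]_P & [\mathbf{x}]_{2P} & \cdots & [\mathbf{x}]_{(N_P-1)P} \\ [\mathbf{x}]_1 & [\mathbf{x}]_{P+1} & [\mathbf{x}]_{2P+1} & \cdots & [\mathbf{x}]_{(N_P-1)P+1} \\ [\mathbf{x}]_2 & [\mathbf{x}]_{P+2} & [\mathbf{x}]_{2P+2} & \cdots & [\mathbf{x}]_{(N_P-1)P+2} \\ \vdots & \vdots & \vdots & & \vdots \\ [\mathbf{x}]_{P-1} & [\mathbf{x}]_{2P-1} & [\mathbf{x}]_{3P-1} & \cdots & [\mathbf{x}]_{N_P P-1} \end{bmatrix}.$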
An architecture based on an AP for computing (11) would be impractical in terms of hardware consumption because it would need $N_P$ processor elements (PEs). This is known as the problem-size-dependent array problem, where the algorithm requires a systolic AP whose size depends on the size of the problem to be solved. However, the cyclic mean algorithm can be mapped to a smaller systolic AP using the partitioning method [13, Ch. 12]. If $\aleph$ is partitioned into blocks whose size matches a systolic array of size $P$, then (12) becomes
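(13) $\aleph = \left[\mathbf{B}_0 \mid \mathbf{B}_1 \mid \cdots \mid \mathbf{B}_{(N_P/P)-1}\right],$
where
(14) $\mathbf{B}_i = \begin{bmatrix} [\mathbf{x}]_{iP^2} & [\mathbf{x}]_{iP^2+P} & \cdots & [\mathbf{x}]_{iP^2+(P-1)P} \\ [\mathbf{x}]_{iP^2+1} & [\mathbf{x}]_{iP^2+P+1} & \cdots & [\mathbf{x}]_{iP^2+(P-1)P+1} \\ [\mathbf{x}]_{iP^2+2} & [\mathbf{x}]_{iP^2+P+2} & \cdots & [\mathbf{x}]_{iP^2+(P-1)P+2} \\ \vdots & \vdots & & \vdots \\ [\mathbf{x}]_{iP^2+P-1} & [\mathbf{x}]_{iP^2+2P-1} & \cdots & [\mathbf{x}]_{iP^2+P^2-1} \end{bmatrix}, \quad \text{for } i = 0, \ldots, \frac{N_P}{P} - 1.$
In a similar way, $\mathbf{1}_{N_P}$ is partitioned into $N_P/P$ all-ones vectors $\mathbf{1}_P$. Substituting (13) in (11), the cyclic mean with partitioning is concisely expressed as
(15) $\mathbf{y} = \frac{1}{N_P}\left(\mathbf{B}_0\mathbf{1}_P + \mathbf{B}_1\mathbf{1}_P + \cdots + \mathbf{B}_{N_P/P-1}\mathbf{1}_P\right).$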
Therefore, the array of PEs processes one pair of $\mathbf{B}_i$ and $\mathbf{1}_P$ blocks after another in a sequential manner, accumulating the partial results.
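The equivalence between the direct form (11) and the partitioned form (15) can be verified with a few lines of NumPy. The sizes below are illustrative, and the sketch assumes that $N_P$ is a multiple of $P$, as the partitioning in (13) implies.

```python
import numpy as np

P, N = 4, 64                                   # illustrative sizes
N_P = N // P
rng = np.random.default_rng(2)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# (12): column m of aleph holds x[mP], ..., x[mP + P - 1]
aleph = x.reshape(N_P, P).T                    # size P x N_P
y_direct = aleph @ np.ones(N_P) / N_P          # (11)

# (13)-(15): process one P x P block at a time and accumulate partial results
acc = np.zeros(P, dtype=complex)
for i in range(N_P // P):
    B_i = aleph[:, i * P:(i + 1) * P]          # (14)
    acc += B_i @ np.ones(P)
y_part = acc / N_P

print(np.allclose(y_direct, y_part))
```

Only a $P \times P$ block is active at any time, which is why an array of $P$ PEs is sufficient regardless of the block length $N$.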
3. A Configurable ST/DDST Transmitter Architecture
Based on the description in Section 2.1, the architecture shown in Figure 3 is proposed for the transmitter. It is composed of five hardware modules: the symbol adecuator, the mapper, the data sequence transformer, the Tx_control, and the Tx_AGU. The reconfigurability of the architecture allows switching between two operating modes, ST or DDST, in order to send data blocks with a cyclic prefix attached. In both modes, the transmitter hardware supports 4/16/64-QAM constellations.
Figure 3: Digital architecture of the configurable ST/DDST transmitter.
The following subsections give additional details about the main transmitter modules.
3.1. Symbol Adecuator
The design of this module is largely conditioned by the features of the mapper. A key aspect exploited in the mapper design is that the 4-QAM and 16-QAM constellations are contained in the Gray-coded 64-QAM constellation, as shown in Figure 4. The symbol adecuator is therefore necessary because the same point number is not mapped to the same complex output symbol in the three constellations. For example, point number 2 is mapped to the symbol -1+j in the 4-QAM constellation, to 3-3j in 16-QAM, and to 3+5j in 64-QAM.
Figure 4: Gray-coded 64-QAM constellation used in the mapper module. (a) The 4-QAM and 16-QAM constellations are delimited with dashed lines; (b) label guide for identifying the constellation point number.
3.2. Mapper
As stated in Section 3.1, only the 64-QAM constellation is required in the mapper design. In this work, a memory-efficient scheme is proposed to build that constellation: the eight possible values (1, 3, 5, 7, −1, −3, −5, and −7) of the I and Q axes are stored in the constellation LUT. Additionally, the mapper has to normalize the complex symbols based on two criteria: the constellation order and the power assigned to each of the sequences involved. Thus, a normalization constant Norm_Mapp_Cte that combines the two criteria is given by
(16) $\text{Norm\_Mapp\_Cte} = \sigma_b \times \text{Norm\_QAM\_Cte},$
where
(17) $\text{Norm\_QAM\_Cte} = \begin{cases} \frac{1}{\sqrt{2}}, & \text{for 4-QAM}, \\ \frac{1}{\sqrt{10}}, & \text{for 16-QAM}, \\ \frac{1}{\sqrt{42}}, & \text{for 64-QAM}, \end{cases}$
with
(18a) $\sigma_b^2 + \sigma_c^2 = 1 \quad \text{for ST},$
(18b) $\sigma_{b+e}^2 + \sigma_c^2 = 1 \quad \text{for DDST}.$
The mapper architecture is depicted in Figure 5. The constellation LUT was implemented with a dual-port ROM with a depth of eight memory locations. In contrast, the ROM of the normalization LUT has a depth of 16 locations.
Figure 5: Hardware-efficient 4/16/64-QAM mapper with ST/DDST incorporated.
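The normalization in (16)–(18a) can be illustrated in floating point as follows. The sketch assumes NumPy, square constellations with odd-integer levels on each axis, and a training power of 0.2 as used in Section 5; the helper names `qam_points` and `norm_mapp_cte` are hypothetical and do not correspond to hardware signals.

```python
import numpy as np

NORM_QAM_CTE = {4: 1 / np.sqrt(2), 16: 1 / np.sqrt(10), 64: 1 / np.sqrt(42)}  # (17)

def qam_points(order):
    """All points of a square QAM constellation with odd-integer levels."""
    m = int(np.sqrt(order))
    levels = np.arange(-(m - 1), m, 2)         # e.g. -3, -1, 1, 3 for 16-QAM
    i, q = np.meshgrid(levels, levels)
    return (i + 1j * q).ravel()

def norm_mapp_cte(order, sigma_b):
    """Normalization constant of (16)."""
    return sigma_b * NORM_QAM_CTE[order]

sigma_c2 = 0.2                                 # training power (assumed, ST mode)
sigma_b = np.sqrt(1.0 - sigma_c2)              # from (18a)

powers = {
    order: np.mean(np.abs(norm_mapp_cte(order, sigma_b) * qam_points(order)) ** 2)
    for order in (4, 16, 64)
}
print(powers)
```

For the three constellation orders the average symbol power equals $\sigma_b^2$, so the data and training powers add up to one as required by (18a).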
3.3. Data Sequence Transformer
The data sequence transformer is the most complex module of the transmitter. Its design was therefore broken down into three submodules, whose individual architectures are described in the following paragraphs.
3.3.1. Training Sequence Generator
From (4), it can be noticed that the parameters $\sigma_c^2$, $N$, and $P$ needed to generate the training sequence are known in advance and remain constant during transmitter operation. Hence, the $P$ values of the training sequence can be calculated off-line, quantized, and stored in an LUT. This LUT is read $N_P$ times in order to expand the training sequence length, as indicated in (3), so that it can be superimposed, element by element, on the data sequence by the complex adder.
3.3.2. ST Cyclic Prefix Insertion Submodule
Two problems arise from the way in which the cyclic prefix is generated and the position where it is attached to the ST sequence.
(i) Since the cyclic prefix consists of the last $P$ data of the ST sequence, it can only be generated once this sequence has been completely processed.
(ii) Given that, among the $N+P$ data to be transmitted, the first $P$ data correspond to the cyclic prefix, a memory buffer is needed to store the remaining $N$ data (the ST sequence) and thus prevent data loss.
A dual-port RAM (RAM_CP) of depth $N$ was used in the design of the ST cyclic prefix insertion submodule. RAM_CP has two independent address buses, one for reading data (addr_rd_st_cp) and one for writing data (addr_wr_st_cp). This feature allows data to be read from and written to RAM_CP simultaneously. The process for generating and attaching a cyclic prefix to the ST sequence can be summarized in the following steps (Figure 6).
(i) When the $(N-P+1)$th datum is stored in RAM_CP, the previously stored datum is addressed by the addr_rd_st bus.
(ii) During $P$ clock cycles, the ST sequence is stored in and read from RAM_CP simultaneously.
(iii) The storing of the ST sequence in RAM_CP stops. However, data reading continues for $N$ cycles.
Figure 6: ST cyclic prefix generation and insertion process.
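The steps above can be sketched with a simple behavioral model of a buffer of depth $N$ with independent write and read pointers. This is a functional illustration only: it assumes that a datum written in a given cycle can be read in the same cycle, which simplifies the exact read latency of RAM_CP, and the function name is hypothetical.

```python
def insert_cyclic_prefix(s, P):
    """Behavioral model: emit the last P data of s followed by all N data of s."""
    N = len(s)
    ram_cp = [None] * N            # model of the depth-N dual-port RAM
    out = []
    rd = N - P                     # read pointer starts at the first CP datum
    for cycle in range(2 * N):
        if cycle < N:              # write phase: one datum per cycle
            ram_cp[cycle] = s[cycle]
        if cycle >= N - P:         # read phase: P overlapped cycles, then N more
            out.append(ram_cp[rd])
            rd = (rd + 1) % N      # wrap around to the start of the block
    return out

print(insert_cyclic_prefix(list(range(16)), 4))
```

In this model, writing and reading overlap for exactly $P$ cycles and reading alone continues for $N$ cycles, so $N+P$ data leave the buffer for every block of $N$ data that enters it.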
3.3.3. Data-Dependent Sequence Generator
The operation of this submodule is based on (6a)–(7), which involve two computationally demanding operations: an MVM and a Kronecker product. Moreover, as in the cyclic prefix insertion case, the DDS can only be generated once the data sequence $b(k)$ has been completely processed. Consequently, the following adaptations are made to the original equations in order to ease their mapping to hardware.
(i) The $b(k)$ sequence is rearranged into a matrix $\mathbf{B}$ of size $P \times N_P$, according to
(19) $[\mathbf{B}^T]_{i,j} = [\mathbf{b}]_{iP+j} = b(iP+j) \quad \text{for } i = 0, \ldots, N_P-1,\ j = 0, 1, \ldots, P-1.$
(ii) The mean of each row of the matrix $\mathbf{B}$ is obtained.
(iii) The $P$ mean results are replicated $N_P+1$ times in order to obtain the $\mathbf{e}$ vector and the $P$ data needed for the DDST cyclic prefix.
Figure 7 shows the hardware architecture of the DDS generator. Its design avoids rearranging the $b(k)$ sequence by means of the loop-back shift register lb_delay_dds. This register introduces a delay of $P$ symbols in order to align the data of each row of $\mathbf{B}$. In this way, the rows can be added “on the fly” by the complex adder without stopping the input data stream. The sums are stored in RAM_DDS; afterwards, its entire contents are read $N_P+1$ times and each datum is divided by $N_P$ in the shifter block. Finally, the results are sent to the DDS generator outputs.
Figure 7: Data-dependent sequence generator submodule. (a) Simplified architecture; (b) pictorial representation of the e(k) sequence generation.
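The streaming computation described above can be modeled in floating point as follows. The sketch assumes NumPy and uses an array indexed modulo $P$ in place of the loop-back shift register and RAM_DDS; the function name `dds_generator` is hypothetical, and the result is compared against $\mathbf{G}\mathbf{b}$ from (6a)–(7).

```python
import numpy as np

def dds_generator(b, P):
    """Accumulate the data 'on the fly' per position, then scale and replicate."""
    N = len(b)
    N_P = N // P
    ram_dds = np.zeros(P, dtype=complex)
    for k, sample in enumerate(b):         # one datum per clock cycle
        ram_dds[k % P] += sample           # running sum of row (k mod P) of B
    means = ram_dds / N_P                  # a right shift when N_P is a power of two
    return np.tile(means, N_P + 1)         # P data for the CP plus the N-length sequence

P, N = 4, 32                               # illustrative sizes
N_P = N // P
rng = np.random.default_rng(5)
b = rng.choice([-3, -1, 1, 3], N) + 1j * rng.choice([-3, -1, 1, 3], N)

K = np.kron(np.ones((1, N_P)), np.eye(P))
G = np.kron(np.ones((N_P, 1)), K) / N_P    # (6a)-(6b)
out = dds_generator(b, P)
print(np.allclose(out[P:], G @ b))
```

Because the output is periodic with period $P$, its first $P$ data serve directly as the cyclic prefix, which is why the memory is read $N_P+1$ times.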
4. Systolic Channel Estimator Architecture
This section introduces an architecture for the DDST-based channel estimation process. Its design is based on the MVM operation, which is carried out in a systolic way in an AP. The main idea of the design is to reuse the same systolic array for computing the cyclic mean of the received data. The proposed architecture, called in this paper the “systolic DDST channel estimator” (SYSDCE), is depicted in Figure 8(a). Four functional units can be identified: a modified systolic matrix-vector multiplier (MSYSMVM), a data input feeder (DATINF), an inverse C look-up table (ICLUT), and a control unit (CU). Broadly speaking, the SYSDCE operation can be divided into three phases: input sequence storage, cyclic mean computation, and CIR estimation.
Figure 8: Systolic channel estimator for DDST receiver. (a) Simplified architecture; (b) the PE module.
As soon as the start signal is asserted, $N+P$ data samples (the vector $\mathbf{x}$ and the cyclic prefix) can be read from the input port IN. After the samples corresponding to the cyclic prefix are excluded, the remaining samples are rearranged and stored in the memory bank of DATINF. When this process is finished, the CU configures the MSYSMVM unit, which during $N_P$ cycles reads $P$ parallel data per cycle from DATINF and computes the cyclic mean $\mathbf{y}$. Once this phase is finished, the obtained vector $\mathbf{y}$, together with the ICLUT data, is fed to the MSYSMVM again to perform the product expressed in (10). Finally, after $P+1$ cycles, the done flag is asserted and the coefficients of the estimated channel $\hat{\mathbf{h}}$ are sent one by one to the bus H_OUT. It is worth mentioning that the SYSDCE can be configured to compute only the cyclic mean by setting the mode input control signal to zero. In this case, the cm_flag output is asserted to indicate that valid results are available on the CM_OUT bus. The channel estimator is then ready to process another data sequence. Each component of the SYSDCE architecture is explained in more detail in the following subsections.
4.1. Modified Systolic Matrix-Vector Multiplier (MSYSMVM)
The fundamental operation performed by the SYSDCE is a matrix-vector multiplication, which is highly demanding in processing time. The hardware design for this operation is the most critical part of the architecture. The obvious strategy for accelerating the MVM consists of computing as many operations as possible in parallel, at the cost of a large consumption of FPGA resources. Therefore, this paper proposes a modification of the systolic MVM presented in [14, Ch. 3] in order to obtain good performance with a reasonable consumption of resources. This modification allows the cyclic mean to be computed using the partitioning method with the same systolic array. Figure 8(b) shows the processor element (PE), which is the atomic digital signal processing module in the MSYSMVM. It processes three flows: the data flow from the ICLUT or DATINF, the values of the input registers, and the data produced by the previous adjacent PE.
In the MSYSMVM design, the number of PEs needed (the AP size) is $P$, which matches the dimensions of the matrix $\boldsymbol{\Gamma}$ and the vector $\mathbf{y}$. The projection vector $\mathbf{d} = [1\ 0]^T$ (see details in [14]) was used with the schedule vector $\mathbf{s} = [1\ 1]^T$. The pipelining period of this design is equal to 1, and the computing time for the full MVM is $2P-1$ clock periods.
For computing the cyclic mean with the MSYSMVM module, the original structure of the PE was modified with an additional multiplexer. With it, the PE can perform all trivial multiplications by bypassing the data from the input of the complex multiplier directly to the complex adder.
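A cycle-level sketch of a linear systolic MVM is given below to illustrate the $2P-1$ computing time and the multiplier bypass. It reflects one reading of the projection and schedule vectors stated above (PE $j$ holds element $j$ of the vector, and row $i$ reaches PE $j$ at time $i+j$); it assumes NumPy and is not the RTL of the MSYSMVM.

```python
import numpy as np

def systolic_mvm(A, v, bypass=False):
    """Simulate a linear array of P PEs computing A @ v in 2P - 1 cycles."""
    P = len(v)
    pipe = np.zeros(P, dtype=complex)          # output register of each PE
    result = np.zeros(P, dtype=complex)
    cycles = 0
    for t in range(2 * P - 1):
        new_pipe = np.zeros(P, dtype=complex)
        for j in range(P):                     # PE j keeps v[j] in its input register
            i = t - j                          # matrix row handled by PE j at time t
            if 0 <= i < P:
                partial_in = pipe[j - 1] if j > 0 else 0.0
                # bypass=True models the multiplexer that skips the multiplier
                product = A[i, j] if bypass else A[i, j] * v[j]
                new_pipe[j] = partial_in + product
        pipe = new_pipe
        i_done = t - (P - 1)                   # row leaving the last PE at time t
        if 0 <= i_done < P:
            result[i_done] = pipe[P - 1]
        cycles += 1
    return result, cycles

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
v = rng.standard_normal(8) + 1j * rng.standard_normal(8)
res, cyc = systolic_mvm(A, v)
print(np.allclose(res, A @ v), cyc)
```

With `bypass=True` and an all-ones vector, the same array returns the row sums of a block $\mathbf{B}_i$, which is the operation needed by the partitioned cyclic mean in (15).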
4.2. Data Input Feeder (DATINF)
As with almost any systolic array, the data fed to each PE of the MSYSMVM must be arranged in a defined order before processing. In the proposed approach, the module DATINF is responsible for this task. It is made up of an array of $P$ memories, each with a depth of $N_P$, organized as a memory bank as shown in Figure 9. DATINF reads $N+P$ data from the IN bus and identifies and removes the first $P$ data, which correspond to the CP. Subsequently, this module rearranges the remaining sequence (corresponding to $x(k)$) into $N_P/P$ blocks of size $P \times P$ in order to form $\mathbf{B}_0, \mathbf{B}_1, \ldots, \mathbf{B}_{N_P/P-1}$. Therefore, the $N$ stored data can be viewed as an $N_P \times P$ matrix, where each individual memory in the bank stores one column of each block and the blocks are stored consecutively one after another, as depicted in Figure 9.
Figure 9: Data block organization in the DATINF.
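Each datum of the input sequence $\mathbf{x}$ has three associated addresses that define its location inside the memory bank: block number (blk_num), memory number (mem_num), and memory address (mem_addr). The DATINF must generate these addresses using the following expressions:
(20a) $\text{blk\_num} = \left\lfloor \frac{k \times N_P}{N \times P} \right\rfloor, \quad k = 0, 1, \ldots, N-1,$
(20b) $\text{mem\_num} = \left\lfloor \frac{k \times N_P - \text{blk\_num} \times N \times P}{N} \right\rfloor,$
(20c) $\text{mem\_addr} = (k \bmod P) + (P \times \text{blk\_num}),$
where $k$ is the index of the $k$th element of $\mathbf{x}$ and $\lfloor \cdot \rfloor$ denotes the floor operator.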
In order to minimize the hardware consumption, a “hard-wired” addressing approach was built for the memory bank. As shown in Figure 10, the $\log_2(N)$ bits of the DATINF address bus are split into three parts. The $\log_2(N_P/P)$ most significant bits (MSBs) select the block, the next $\log_2(P)$ bits select a particular memory in the bank, and the remaining $\log_2(P)$ bits address the individual locations within the selected memory, in agreement with (20a)–(20c).
Figure 10: Hard-wired addressing for memory bank.
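The address expressions (20a)–(20c) and the hard-wired bit slicing can be compared with a short sketch. It uses the parameters of Section 5 ($N = 512$, $P = 8$), for which the 9-bit sample index splits into three 3-bit fields; the function names are hypothetical.

```python
N, P = 512, 8
N_P = N // P
LOG2_P = P.bit_length() - 1                    # 3 bits for P = 8

def datinf_address(k):
    """Addresses computed with the arithmetic expressions (20a)-(20c)."""
    blk_num = (k * N_P) // (N * P)                       # (20a)
    mem_num = (k * N_P - blk_num * N * P) // N           # (20b)
    mem_addr = (k % P) + P * blk_num                     # (20c)
    return blk_num, mem_num, mem_addr

def datinf_address_bits(k):
    """Same addresses obtained by slicing the bits of the sample index k."""
    blk_num = k >> (2 * LOG2_P)                          # most significant field
    mem_num = (k >> LOG2_P) & (P - 1)                    # middle field
    row = k & (P - 1)                                    # least significant field
    mem_addr = (blk_num << LOG2_P) | row                 # block and row concatenated
    return blk_num, mem_num, mem_addr

locations = {datinf_address(k)[1:] for k in range(N)}
print(len(locations), datinf_address(200), datinf_address_bits(200))
```

Every sample maps to a distinct (memory, address) pair, and the $P$ samples stored at the same address of the $P$ memories share the same value of $k \bmod P$, so one parallel read delivers one row of a block $\mathbf{B}_i$.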
4.3. Inverse C Look-Up Table (ICLUT)
The values of the circulant matrix $\mathbf{C}^{-1}$ are constants that can be precomputed once off-line and stored in a LUT. Only the values of the first column are necessary because the remaining columns are shifted versions of the first one. Consequently, the number of ROM locations required for the LUT is just $P$. With a traditional design, the LUT would be described as a multiport ROM of $P$ locations, but the synthesis tool would implement it as an array of $P$ single-port ROMs, increasing the number of memory locations to $P^2$. A novel solution was designed instead with an array of $P$ registers operating as a circular buffer. This is called the “inverse C look-up table” (ICLUT), and it saves $P(P-1)$ memory locations. The values of the first row of $\mathbf{C}^{-1}$ are stored in the registers. Then, one rotation is applied at each clock tick to change the register outputs, as indicated in Figure 11.
Figure 11: Simplified architecture of inverse C look-up table (ICLUT) and its corresponding output values.
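The circular-buffer idea can be illustrated with a small model in which $P$ registers hold the first row of a circulant matrix and are rotated once per clock tick. The rotation direction and the function name `iclut_outputs` are assumptions made for this sketch, which uses NumPy and a random vector in place of the actual values of $\mathbf{C}^{-1}$.

```python
import numpy as np

def iclut_outputs(first_row, ticks):
    """Return the register contents seen at each tick of a rotating buffer."""
    regs = np.array(first_row)
    outputs = []
    for _ in range(ticks):
        outputs.append(regs.copy())
        regs = np.roll(regs, 1)                # one rotation per clock tick
    return np.array(outputs)

P = 8
rng = np.random.default_rng(6)
g = rng.standard_normal(P) + 1j * rng.standard_normal(P)

# Reference circulant matrix whose first column is g
M = np.array([[g[(i - j) % P] for j in range(P)] for i in range(P)])

rows = iclut_outputs(M[0], P)
print(np.allclose(rows, M))
```

After $P$ ticks the buffer has produced every row of the matrix from only $P$ stored values, instead of the $P^2$ locations required by $P$ separate ROMs.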
5. Results
In this section, the proposed architectures are evaluated. First, the hardware utilization and throughput of the ST/DDST transmitter implementation are presented. Then, its functional performance is analyzed in terms of the signal-to-quantization-noise ratio (SQNR). Next, the FPGA resource consumption and throughput of the SYSDCE implementation are reported. Finally, the functional results of the SYSDCE, in terms of the MSE of the estimated channel and the SQNR, are obtained by Monte Carlo simulations using the transmitter hardware in DDST mode.
5.1. Implementation and Simulation of the Transmitter
The configurable ST/DDST transmitter architecture was implemented at the RTL level using the Verilog hardware description language. It is able to transmit ST or DDST data blocks of length $N$ with $CP = P$. The power of the training sequence is set to $0.2\sigma_s^2$ with a period $P = 8$. The configurable transmitter was synthesized and targeted to a Xilinx Virtex-5 XC5VLX110T FPGA. Default settings and no “user constraints” were selected in the EDA tool Xilinx ISE v11. No IP cores or predesigned components were used. All signals are represented in signed fixed-point two's complement format, and no rounding scheme was applied.
Table 1 summarizes the synthesis results for the proposed ST/DDST transmitter. The table shows an operating frequency of 160 MHz with less than 1% of the slice registers and slice LUTs of the FPGA in use. A good area-frequency balance is therefore achieved.
Table 1: Synthesis results of the ST/DDST transmitter.

| FPGA resource | Used | Available | Utilization |
|---|---|---|---|
| Frequency | 160.12 | MHz | — |
| Slice registers | 141 | 69120 | <1% |
| Slice LUTs | 437 | 69120 | <1% |
| Fully used LUT-FF pairs | 134 | 444 | 30% |
| IOBs | 46 | 640 | 7% |
| BRAMs | 4 | 148 | 2% |
On the other hand, it is difficult to compare the proposed transmitter and channel estimator directly with those previously presented in [9, 10] because of the differences in technology, design paradigm, and testing conditions. In [9], a DDST communication system was implemented under a full-software philosophy on a TMS320C6713 DSP with a 300 MHz external clock. A hybrid software-hardware FPGA implementation of the DDST receiver is described in [10]. A comparison against our transmitter was not possible for either of these DDST implementations: in the former, the transmitter was fully software based, and in the latter, only the DDST receiver was implemented.
The correct operation of the transmitter is illustrated in Figure 12. The first graph (Figure 12(a)) clearly shows that the transmitter hardware has embedded the training sequence $c(k)$ into $b(k)$. It can be noted that the energy of the data sequence is spread over all frequency components. In contrast, the energy of the training sequence is concentrated in only $P$ equispaced frequency components. Similar behavior occurs in the DDST mode (Figure 12(b)), but now the pilot components also have the same energy. This shows that the transmitter architecture is properly superimposing $c(k)$ and $e(k)$ onto $b(k)$.
Figure 12: Discrete Fourier transform of the s(k) sequence generated by the transmitter architecture. (a) ST mode; (b) DDST mode.
The SQNR obtained over 100 Monte Carlo trials was monitored in order to quantify the difference between the $s(k)$ sequence produced by the hardware transmitter and that of the floating-point transmitter golden model. The histogram in Figure 13 summarizes the results of this test. Most of the occurrences are concentrated around 84 dB.
Figure 13: SQNR performance histogram of the ST/DDST architecture for 100 Monte Carlo trials.
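For reference, the way an SQNR figure of this kind can be computed is sketched below. The definition used (reference signal power over quantization error power, in dB) is the usual one and is assumed here, as are the word lengths and the Gaussian test signal; the actual fixed-point formats of the transmitter are not reproduced.

```python
import numpy as np

def quantize(x, frac_bits):
    """Truncate real and imaginary parts to frac_bits fractional bits (no rounding)."""
    scale = 2.0 ** frac_bits
    return (np.floor(x.real * scale) + 1j * np.floor(x.imag * scale)) / scale

def sqnr_db(reference, quantized):
    """Ratio of reference power to quantization error power, in dB."""
    noise = reference - quantized
    return 10 * np.log10(np.sum(np.abs(reference) ** 2) / np.sum(np.abs(noise) ** 2))

rng = np.random.default_rng(3)
s = (rng.standard_normal(520) + 1j * rng.standard_normal(520)) / np.sqrt(2)

sqnr_12 = sqnr_db(s, quantize(s, 12))          # hypothetical 12 fractional bits
sqnr_16 = sqnr_db(s, quantize(s, 16))          # hypothetical 16 fractional bits
print(round(sqnr_12, 1), round(sqnr_16, 1))
```

Each additional fractional bit improves the SQNR by roughly 6 dB in this model, which gives a quick way to relate a target SQNR to a word length.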
5.2. Implementation and Simulation of the Channel Estimator
The SYSDCE architecture was implemented using the same considerations and design parameters as the transmitter. The systolic channel estimator was also synthesized and targeted to the same FPGA.
Table 2 summarizes the synthesis results for the proposed estimator. The values in parentheses next to each feature indicate the total corresponding resources available in the FPGA. The results in Table 2 reveal an operating frequency of 115.247 MHz with a minimal consumption (except for the DSP48Es) with respect to the total resources of the FPGA.
Table 2: Synthesis results of the SYSDCE.

| Feature (available resources) | SYSDCE |
|---|---|
| Input length without CP (N) | 512 |
| Frequency (MHz) | 115.247 |
| Slice registers (69120) | 1370 (1%) |
| Slice LUTs (69120) | 2587 (3%) |
| Fully used LUT-FF pairs (3348) | 609 (18%) |
| Block RAMs (148) | 8 (5%) |
| DSP48Es (64) | 32 (50%) |
Again, it was not possible to compare the SYSDCE against the existing approaches. In [10], within the module corresponding to channel estimation, only the arithmetic mean was accelerated by a dedicated coprocessor. In that work, the input sequence length was assumed (although not explicitly stated) to be $N = 512$ symbols, and the MVM operation described in (9) was implemented in software. Also, no results in terms of the mean square error (MSE) of the estimated channel or the SQNR performance are presented.
Another important parameter of the proposed estimator is the number of cycles required to perform the estimation tasks. In particular, the cyclic mean requires
(21) $\text{cycles}_{\hat{\mathbf{y}}} = (N + P) + (N_P + P - 1).$
The first term in (21) corresponds to the input storage phase and the second to the $N_P/P$ MVM operations involved in the cyclic mean task. Furthermore, the number of cycles required for the CIR estimator is
(22) $\text{cycles}_{\hat{\mathbf{h}}} = \text{cycles}_{\hat{\mathbf{y}}} + 2P - 1.$
The set of metrics listed in Table 3 is used to compare the performance of the SYSDCE system. The processing time (PT) is the time elapsed from the beginning of the cyclic mean or channel estimation process until its computation has finished. The throughput (TP) per area is another useful metric; a higher value of this ratio indicates a better system implementation. As can be seen from Table 3, the proposed architecture requires fewer cycles and provides a higher throughput than the arithmetic mean coprocessor used in [10].
Table 3: Channel estimator throughput comparison.

| Channel estimator | Input length | Cycles/estimation | PT (µs) | TP (MS/s) | TP/area (S/s/slice) |
|---|---|---|---|---|---|
| SYSDCE (cyclic mean mode) | 512 | 591 | 5.128 | 101.40 | 25.625e3 |
| SYSDCE (channel estimator mode) | 512 | 606 | 5.258 | 98.91 | 24.996e3 |
| Arithmetic mean coprocessor in [10] | 512 | 2238 | 20 | 26.39 | NA |
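The cycle counts and processing times reported for the SYSDCE follow directly from (21), (22), and the synthesized clock frequency, as the sketch below shows. Taking the throughput as the $N+P$ input samples divided by the processing time is an assumption made here because it reproduces the value listed for the cyclic mean mode; the function name is hypothetical.

```python
def sysdce_cycles(N, P):
    """Cycle counts of the cyclic mean and CIR estimation, from (21) and (22)."""
    N_P = N // P
    cycles_mean = (N + P) + (N_P + P - 1)      # (21)
    cycles_cir = cycles_mean + 2 * P - 1       # (22)
    return cycles_mean, cycles_cir

F_MHZ = 115.247                                # synthesized operating frequency
N, P = 512, 8

cycles_mean, cycles_cir = sysdce_cycles(N, P)
pt_mean_us = cycles_mean / F_MHZ               # processing time in microseconds
pt_cir_us = cycles_cir / F_MHZ
tp_mean = (N + P) / pt_mean_us                 # MS/s, counting the N + P input samples

print(cycles_mean, cycles_cir, round(pt_mean_us, 3), round(pt_cir_us, 3), round(tp_mean, 2))
```

The same function can be used to explore how the latency scales with other block lengths and training periods before synthesizing a new configuration.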
The validity of the proposed architectures is verified by comparing their results with the floating-point golden model programmed in Matlab, in terms of channel estimation error versus signal-to-noise ratio (SNR). The following scenario (similar to that used in [6]) was considered. The hardware transmitter was configured in DDST mode in order to send data blocks of $N = 512$ symbols obtained from a 4-QAM constellation. The channel is randomly generated at each Monte Carlo trial and is assumed to be Rayleigh with length $L = 8$. The power of the training sequence is set to $0.2\sigma_s^2$ with a period $P$ equal to $L$.
Figure 14 shows the MSE of the estimated channel, averaged over 300 Monte Carlo simulations for each value of SNR. Note that the MSE of the hardware estimator is very close to the theoretical line [4] and almost indistinguishable from that of the golden model. On the other hand, Figure 15 presents the probability density function (PDF) of the SQNR of the SYSDCE hardware, obtained for the same Monte Carlo trials. From this PDF, it can be noted that the fixed-point performance is about 68 dB on average in terms of SQNR.
Figure 14: MSE performance of the SYSDCE hardware for 300 Monte Carlo trials.
Figure 15: SQNR probability density function of SYSDCE architecture for 300 Monte Carlo trials.
6. Conclusions
In this paper, digital architectures for the transmitter and channel estimation stages of ST/DDST communications systems have been presented. These architectures represent the first implementations under the full-hardware philosophy for wireless systems based on ST/DDST. Both architectures present high throughput and reduced FPGA resource consumption, achieving a good trade-off between performance and area utilization. The proposed transmitter architecture can be configured to generate two types of training using three constellation orders. The SYSDCE hardware offers great flexibility and reusability because the same systolic array is used for two different tasks: cyclic mean computation and channel estimation. Also, the SYSDCE design can be easily modified (by means of the partitioning strategy) to process channels of different lengths. The validity and performance of these approaches have been verified by Monte Carlo simulations, where average SQNRs of 82 dB and 68 dB are achieved for the transmitter and the SYSDCE, respectively. At the same time, both architectures show insignificant differences in performance when compared with their respective floating-point golden models. The results show that ST/DDST concepts can be effectively used in current and future wireless communications standards.
Acknowledgments
This work was supported by PROMEP ITSON-92, CONACYT-181962, and Mixbaal 158899 Research Grants.
References
[1] A. Goljahani, N. Benvenuto, S. Tomasin, and L. Vangelista, “Superimposed sequence versus pilot aided channel estimations for next generation DVB-T systems.”
[2] B. Farhang-Boroujeny, “Pilot-based channel identification: proposal for semi-blind identification of communication channels.”
[3] S. Haykin and K. J. Ray Liu.
[4] E. Alameda-Hernandez, D. C. McLernon, A. G. Orozco-Lugo, M. M. Lara, and M. Ghogho, “Frame/training sequence synchronization and DC-offset removal for (data-dependent) superimposed training based channel estimation.”
[5] M. Ghogho and A. Swami, “Improved channel estimation using superimposed training,” in Proceedings of the IEEE 5th Workshop on Signal Processing Advances in Wireless Communications (SPAWC '04), pp. 110–114, July 2004.
[6] M. Ghogho, D. McLernon, E. Alameda-Hernandez, and A. Swami, “Channel estimation and symbol detection for block transmission using data-dependent superimposed training.”
[7] O. Longoria-Gandara, R. Parra-Michel, M. Bazdresch, and A. G. Orozco-Lugo, “Iterative mean removal superimposed training for SISO and MIMO channel estimation.”
[8] R. Carrasco-Alvarez, R. Parra-Michel, A. G. Orozco-Lugo, and J. K. Tugnait, “Enhanced channel estimation using superimposed training based on universal basis expansion.”
[10] V. Najera-Bello, F. Martín del Campo, R. Cumplido, R. Perez-Andrade, and A. G. Orozco-Lugo, “A system on a programmable chip architecture for data-dependent superimposed training channel estimation.”
[11] E. Romero-Aguirre, R. Parra-Michel, R. Carrasco-Alvarez, and A. G. Orozco-Lugo, “Architecture based on array processors for data-dependent superimposed training channel estimation,” in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (RECONFIG '11), pp. 303–308, December 2011.
[12] A. G. Orozco-Lugo, M. M. Lara, and D. C. McLernon, “Channel estimation using implicit training.”
[13] N. Petkov.
[14] S. Kung.