Future 4th Generation (4G) wireless multiuser communication systems will have to provide advanced multimedia services to an increasing number of users, making good use of the scarce spectrum resources. Thus, 4G system design should pursue both higher transmission bit rates and higher spectral efficiencies. To achieve this goal, multiple-antenna systems are expected to play a crucial role. In this contribution we address the implementation in FPGAs of a multiple-input multiple-output (MIMO) decoder embedded in a prototype of a 4G mobile receiver. This MIMO decoder is part of a multicarrier code-division multiple-access (MC-CDMA) radio system, equipped with multiple antennas at both ends of the link, that is able to handle up to 32 users and provides raw transmission bit rates up to 125 Mbps. The task of the MIMO decoder is to appropriately combine the signals simultaneously received on all antennas to construct an improved signal, free of interference, from which to estimate the transmitted symbols. A comprehensive explanation of the complete design process is provided, including architectural decisions, floating-point to fixed-point translation, and a description of the validation procedure. We also report implementation results using FPGA devices of the Xilinx Virtex-4 family.
The aim of the 4MORE Project (4G
MC-CDMA Multiple Antenna System-on-Chip for Radio Enhancements) is to
complement worldwide research efforts on MIMO systems, MC-CDMA, and other
advanced signal processing techniques that will provide the high data rates
and spectral efficiencies expected from 4G wireless multiuser communication
systems. In order to investigate the real performance and feasibility of
implementation of these technologies, a complete hardware demonstrator of a
broadband mobile terminal (MT) has been designed and is being constructed
within the 4MORE project [
Multi-carrier CDMA, based on the serial combination of
direct sequence CDMA and OFDM, has been considered for the physical layer in
the downlink because it derives benefits from both technologies: OFDM, with
appropriate carrier spacing and guard interval, provides robustness against
multipath, avoiding intersymbol interference; whereas the use of CDMA with
orthogonal spreading codes provides frequency diversity and multiple-user
flexibility [
The use of multiple antennas
is another enabling technology for 4G systems, which helps to exploit spatial
diversity, to increase capacity and to mitigate the effects of fading. In our
system the space-time block code for two transmit antennas designed by Alamouti
[
To achieve good bit error rate (BER) performance,
state-of-the-art channel coding techniques, including duo-binary turbo codes [
The joint use of all these sophisticated technologies greatly increases the complexity of the transceiver. To deal with the constraints of VLSI design, the demonstrator includes ASICs as well as FPGAs. From the outset of the project it was clear that the demonstrator would make use of some well-established algorithms that could be implemented on ASICs, whereas the flexibility provided by FPGAs was required to accommodate the more innovative algorithms under investigation, bearing in mind that design and implementation tasks would partially overlap in time.
The rest of the paper describes the design and implementation in FPGAs of the hardware module that performs MIMO decoding in the MT, and is organized as
follows. In Section
A simplified diagram of the transmitting BS is shown
in Figure
Simplified diagram of the BS transmitter.
Each modulated symbol is multiplied by the spreading
code of the corresponding user, and the spread symbols of the
An OFDM symbol consists of
Data is prepared for multiantenna transmission by the
MIMO encoding module. According to the Alamouti scheme [
Before OFDM modulation, the framing module interleaves
pilot symbols in the data stream, in order to aid channel estimation at the
receiver. One IFFT operation per transmit antenna is required for OFDM
modulation, to convert data to the time domain. The IFFT size is
Each stream of complex OFDM symbols is finally IQ-modulated, power amplified by independent RF front-ends, and radiated in the 5-GHz band.
A simplified diagram of the MT receiver is depicted in
Figure
Simplified diagram of the MT receiver.
One FFT operation per antenna branch is required to recover the symbols in the frequency domain (OFDM demodulation).
Next, pilots are split from information symbols by the de-framing module. By interpolation of pilot symbols in time and frequency, the MIMO channel estimator provides the MIMO decoder with channel state information (CSI), which is combined with two contiguously received OFDM symbols to build the improved signal from which to estimate the modulated symbols.
However, the output stream of the MIMO decoder further
requires module equalization [
The fact that during each symbol period both antennas simultaneously transmit different information implies that a linear combination of symbols, affected by the channel frequency response of the different paths, will be received at each antenna of the MT. Due to the intelligent way in which spatial diversity is introduced, a simple linear processing of the signals received by the two antennas during a space-time block eliminates the co-antenna interference (CAI) artificially created by MIMO transmission.
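As an illustration of this linear processing, the following minimal sketch applies the standard Alamouti combiner to a single subcarrier in a noiseless channel. It is an assumption-laden example, not the demonstrator's implementation: the function names, the channel values, and the sign and conjugation convention of the encoder are chosen here for illustration and may differ from the convention used in the paper's equations.

```python
# Sketch of Alamouti combining for one subcarrier (2 transmit, N receive antennas).
# All names and numeric values are illustrative assumptions, not taken from the paper.

def alamouti_encode(s1, s2):
    """Return the (antenna 1, antenna 2) symbols for each of the two periods."""
    return (s1, s2), (-s2.conjugate(), s1.conjugate())

def channel(tx, h):
    """Noiseless flat channel; h[j] = (h1j, h2j) are the gains to receive antenna j."""
    return [h1 * tx[0] + h2 * tx[1] for (h1, h2) in h]

def alamouti_combine(r_first, r_second, h):
    """Combine both received periods over all receive antennas."""
    y1 = 0j
    y2 = 0j
    for ra, rb, (h1, h2) in zip(r_first, r_second, h):
        y1 += h1.conjugate() * ra + h2 * rb.conjugate()
        y2 += h2.conjugate() * ra - h1 * rb.conjugate()
    return y1, y2

# Two receive antennas with arbitrary (assumed) channel gains.
h = [(0.8 + 0.3j, -0.2 + 0.5j), (0.1 - 0.9j, 0.6 + 0.4j)]
s1, s2 = 1 + 1j, -1 + 1j

period1, period2 = alamouti_encode(s1, s2)
y1, y2 = alamouti_combine(channel(period1, h), channel(period2, h), h)

# Each output equals the transmitted symbol scaled by the total channel power,
# with no contribution from the other symbol (no co-antenna interference).
gain = sum(abs(h1) ** 2 + abs(h2) ** 2 for (h1, h2) in h)
```

In this sketch the combined outputs are scaled by the sum of the squared channel magnitudes, so a later amplitude normalization is still needed before symbol decisions are taken. Adding receive antennas only adds terms to the two accumulations.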
For each space-time block, the MIMO decoder must perform the following linear combination:
The MIMO decoder must implement (
The memory of the Alamouti scheme is one OFDM symbol.
Throughout the paper we have used the pair
Architecture for the MIMO decoder (real part
We do not show the full details of the architectures
used to evaluate
This architecture can be easily and efficiently adapted to a different number of antennas at the receiver. To this end, the arithmetic
blocks surrounded by dotted lines in Figure
The fixed-point translation of the architectural design described in the previous section was accomplished following three steps.
(1) Determine the range of each input, output, and intermediate signal involved in the MIMO decoder.
(2) Obtain the number of bits (precision) required for each signal.
(3) Test the robustness of the design by performing BER simulations.
Following this
process, similar to that described in [
This task was accomplished with the help of the SystemC-based floating-point software simulator that has been developed within the 4MORE Project, which accurately models the behaviour of all the modules in the demonstrator and includes a realistic MIMO channel model. It is possible with this simulator to obtain traces of the signals at any point in the communication link.
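Once a trace of a signal is available, its range translates directly into the size of the integer part of a fixed-point format. The sketch below shows one possible way to do this for a real-valued trace, assuming a signed two's-complement format and a symmetric range; the function name and the sample values are hypothetical and are not taken from the 4MORE simulator.

```python
import math

def integer_bits(trace):
    """Integer bits (sign bit included) of a signed two's-complement format
    whose range [-2**(n-1), 2**(n-1)) strictly contains every sample.

    The range is treated as symmetric, so a peak of exactly 2**k is given
    one extra bit even when it only occurs with negative sign.
    """
    peak = max(abs(x) for x in trace)
    if peak == 0:
        return 1  # only the sign bit is needed
    # frexp returns (m, e) with peak = m * 2**e and 0.5 <= m < 1,
    # so 2**e is the smallest power of two strictly above the peak.
    _, exponent = math.frexp(peak)
    return max(0, exponent) + 1

# Hypothetical traces: real and imaginary parts would be handled separately.
small = integer_bits([0.3, -0.9, 0.45])   # fits in [-1, 1)
large = integer_bits([-3.5, 2.0, 1.25])   # needs [-4, 4)
```

A range obtained this way depends on the traces used, so it is only as representative as the simulated modes and channel conditions that produced them.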
We show in Table
Parameters of the modes implemented in the demonstrator.
Modulation | Channel coding rate ( | Number of users ( |
---|---|---|
QPSK ( | 1 to 32 | |
16-QAM ( | 1 to 32 | |
64-QAM ( | 1 to 32 |
Once the ranges for input signals were known, those of
intermediate and output signals could be obtained taking into account the
theoretical margins that result when operating with inputs whose range is
already known. Nevertheless, this would lead to an overdimensioned module, due
to the existence of hidden correlations between the inputs. After all, each of
the received signals
Fixed-point quantization rules.
Signal | Q1 | Q2 | Q3 | ||||
---|---|---|---|---|---|---|---|
Range | Bits | Range | Bits | Range | Bits | ||
Inputs | |||||||
Output of | M1 | 14 | |||||
A1 | |||||||
A2 | |||||||
Output of | M1 | ||||||
A1 | |||||||
A2 | |||||||
Global outputs | |||||||
9 |
To ease this task we developed a simple software model of the MIMO decoder, identical to the module included in the floating-point SystemC simulator of the whole chain, but much faster and more practical, since all unnecessary overhead was removed. This new software model can be quickly modified to include fixed-point conversion effects in any of its parts.
As a performance metric we used the signal-to-quantization noise ratio (SQNR) at the outputs of the MIMO decoder, measured by comparing the outputs of the floating-point version of the module with those obtained after including quantization effects in one signal, or in all of them. By doing so we seek to keep the power of the quantization noise much lower than that of the additive white Gaussian noise (AWGN), hence guaranteeing a negligible effect of the former on performance.
Fixed-point conversion effects were introduced one signal at a time, and simulations were run in parallel with both versions of the MIMO decoder. The number of bits assigned to the fractional part of the signal under study was then adjusted and simulations repeated until a target value for the SQNR was reached.
Next, fixed-point effects were removed from that point, and we proceeded to optimize the word-length of another signal in the module.
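The per-signal search described above can be sketched as follows. This is a simplified illustration rather than the authors' software model: it quantizes a single real-valued trace and measures the SQNR on that same trace, whereas the text measures it at the decoder outputs. The 40 dB target and the sinusoidal test trace are assumptions made for the example.

```python
import math

def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (no saturation)."""
    step = 2.0 ** -frac_bits
    return round(x / step) * step

def sqnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB between two traces."""
    signal = sum(r * r for r in reference)
    noise = sum((r - q) ** 2 for r, q in zip(reference, quantized))
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)

def min_frac_bits(reference, target_db, max_bits=32):
    """Smallest number of fractional bits that reaches the target SQNR."""
    for bits in range(max_bits + 1):
        quantized = [quantize(x, bits) for x in reference]
        if sqnr_db(reference, quantized) >= target_db:
            return bits
    raise ValueError("target SQNR not reachable within max_bits")

# Assumed stand-in for a floating-point trace of the signal under study.
trace = [0.7 * math.sin(0.37 * n) for n in range(2000)]
bits = min_frac_bits(trace, target_db=40.0)
```

Each extra fractional bit improves the SQNR by roughly 6 dB, so the search ends after a few iterations. Repeating it for one signal at a time, with the other signals kept in floating point, mirrors the procedure in the text.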
Nevertheless, for those signals that share the same
statistics, quantization effects were simultaneously analysed. For instance,
optimization of the number of bits at the output of all multipliers M1 in
Figure
Following this procedure we obtained three sets of
quantization rules, to which we will refer as Q1, Q2, and Q3 from now on, each
of them established aiming at a different goal. The final parameters for these
quantization rules are shown in Table
Quantization rule Q1 was deliberately overdimensioned to ensure that it would work with every mode of the demonstrator. Quantization rule Q2, slightly less resource-consuming than Q1, was tried for 64-QAM, but the final results were not good enough. As will be shown in the next section, the 64-QAM constellation is very sensitive to even small noise increments. Finally, Q3 was designed to work only with QPSK modulation, using the minimum number of resources.
Signal traces to run the tests were obtained from the
complete SystemC simulator, always setting
As will be shown later (see Figure
BER degradation comparing the floating-point version of the MIMO decoder (solid lines with marker “o”) and its fixed-point counterpart implementation Q1 (dashed lines with marker “x”).
At the end of the word-length optimization process we
ran a final simulation to compare the floating-point version with the optimized
fixed-point one, including all quantization effects simultaneously. The measured
SQNR value was about
As a final step, the SystemC simulator was used to
validate in terms of BER performance the final decisions concerning signal
ranges and word-length optimization. For this purpose a complete fixed-point
software model of the MIMO decoder was developed, which is bit-accurate with
the VHDL source code to be implemented in the FPGAs. By substitution of the
original floating-point MIMO decoding module by its fixed-point counterpart in
the complete SystemC simulation chain, and including appropriate
floating/fixed-point interfaces to the neighbouring modules, we verified the
degradation in BER performance introduced by the fixed-point MIMO decoder. This
can be checked in Figures
BER degradation comparing the floating-point version of the MIMO decoder (solid lines with marker “o”) and its fixed-point counterpart implementation Q2 (dashed lines with marker “x”).
BER degradation comparing the floating-point version of the
MIMO decoder (solid lines with marker “o”) and its fixed-point counterpart
implementation Q3 (dashed lines with marker “x”). In the zoomed area, results
for the fixed-point implementation Q2 are also shown for comparison (dotted
lines with marker “
As it can be seen in Figure
The following tools were used during the design:
Xilinx ISE 7.1 and the XST engine were used for VHDL synthesis and
place-and-route, while Mentor ModelSim SE 6.0d was used to run functional and
post place-and-route simulations. The target FPGAs considered for the
implementation are Xilinx Virtex-4, since they are most suitable for
implementation of wireless systems [
Table
Synthesis results for the MIMO decoding module.
Quantization | DSP48 | Flip-flops | Slices | LUTs | Logic | Route-through | Shift registers | DSP slices | Min. clock cycle (ns) |
---|---|---|---|---|---|---|---|---|---|
Q1 | Auto | ||||||||
Q1 | Yes | ||||||||
Q2 | Yes | ||||||||
Q2 | Auto | ||||||||
Q2 | No | ||||||||
Q3 | Auto |
The second column, labelled “DSP48,” refers to an option of the synthesis tool which can take three different values: “no” means that no DSP blocks are allowed; “yes” tells the synthesis tool to use as many of them as required; and “auto” triggers a free use of the DSP blocks, depending on the best trade-off found by the tool.
The value of that option has a very significant effect on the column “DSP slices,” since the architecture of the MIMO decoder needs 24 multipliers. When using “auto” for the “DSP48” option, these are mapped to DSP blocks by the synthesis tool, whereas when the “yes” option is selected, the tool also maps the 21 adders (including 15 adders, 4 subtractors, and 2 programmable adders/subtractors) and other elements to DSP blocks, finally using 49 DSP slices and consequently reducing the number of LUTs in the column “Logic” (from 3163 to 92 for Q2, while the shift registers keep the same size).
The column “LUTs” can be obtained by adding the following three: “Logic,” LUTs used for logic functions and arithmetic; “Route-through” for routing paths between slices; and “Shift registers.” The data in this last column are very relevant for our design, since shift registers are large components in the architecture and consume the greatest part of the resources (except in the case of value “no” for “DSP48”). They affect the slice count, since the width of the registers is reduced when changing to more severe quantizations (from Q1 to Q3).
Considering the total number of slices, there is a reduction of 23% from quantization Q1 to Q2 (“auto”), while it is only 7.5% from Q2 to Q3.
The column “Flip-flops” includes the registers needed in the control unit and also those used for the pipeline. This excludes the registers that follow the arithmetic units mapped to DSP blocks, since they are directly taken from the blocks, and not from the slices.
The last column is the minimum clock cycle inferred by
the synthesis tool with a timing constraint of
Quantized outputs of the deframing and channel
estimation modules (see Figure
We have presented the design methodology used in the implementation of a MIMO decoder within a 4G radio system. The architecture of the system has been optimized to comply with the throughput requirements while reducing implementation area.
Given the random nature of the inputs, the design of wireless systems demands a simulation-based fixed-point translation approach for word-length optimization. A robust simulation framework, able to deal with both floating-point and fixed-point descriptions, has proven to be essential in the design.
Several quantization versions have been developed, synthesized with different options, in order to check the trade-offs between accuracy and use of resources in different conditions.
Our implementation results using Xilinx Virtex-4 devices show that the MIMO decoder requires a limited number of FPGA resources, while achieving high performance.
This work has been supported by European FP6 IST 2002 507039 Project 4MORE and by the Spanish Ministry of Science and Technology under Project TEC2006-13067-C03-03.