Universal filtered multicarrier (UFMC) is a promising low-complexity waveform that provides quasi-orthogonality among subcarriers. In addition, it achieves much better out-of-band emission performance than an orthogonal frequency division multiplexing (OFDM) system. This paper proposes a hardware platform that implements a UFMC transmitter. Highly reduced-complexity schemes for IFFT, filtering, and spectrum shifting are realized on actual hardware, which yields a complete transmitter architecture at minimal FPGA resource cost. The overall design uses only 1038 slice registers, 1154 slice LUTs, and 64 multipliers of a Xilinx Virtex-7 XC7VX330t device. A throughput of 773.5 Msamples/s at an operational frequency of 364 MHz is achieved. This throughput is adequate for processing the 50 physical resource blocks (PRBs) of LTE 10 MHz channelization in the required time. Owing to pipelining at different levels, the presented architecture has a latency of only 2% of one LTE 10 MHz channelization symbol. Although the presented hardware design in its current form meets the LTE 10 MHz channelization throughput requirements, a further increase in throughput is possible owing to the scalable nature of the architecture. To the best of our knowledge, this work is the first FPGA solution for a UFMC transmitter presented in the literature.

The advent of 5G mobile telecommunication technology has sparked a new era of research in the field of telecommunications. 5G standardization aims to address requirements related to three main communication scenarios: enhanced mobile broadband (eMBB), massive machine-type communications (mMTC), and ultra-reliable low-latency communications (URLLC) [

Apart from performance, hardware complexity is another key factor affecting the acceptability of any proposed waveform. In this regard, a reduced-complexity UFMC transmitter architecture is proposed in [

In this paper, we fill this gap, namely the lack of an actual hardware implementation of a UFMC transmitter. We present the first real-time FPGA implementation of a UFMC transmitter that complies with the timing requirements of LTE 10 MHz channelization. The contributions made to achieve this real-time hardware are summarized as follows:

Proposal of reduced-complexity implementation solutions for all constituent building blocks of the transmitter, avoiding computation and storage of redundant information

Hardware design of the constituent building blocks that achieves the highest possible operational frequency at low area overhead

Resource allocation to the constituent blocks and their scheduling in order to meet the established timing requirements for LTE 10 MHz channelization

Comparison of the UFMC FPGA implementation results with OFDM transmitter implementation results

The rest of the paper is organized as follows: the next section presents the system model. Section

UFMC is a filtered variant of zero-padding-based OFDM (ZP-OFDM). The symbols obtained from the mapper are divided into groups of carriers. These groups of carriers, i.e., PRBs, go through IFFT and filtering, after which they are summed to generate the final UFMC waveform. The UFMC model shown in Figure

Classical UFMC transmitter scheme.

Low complexity UFMC scheme.

As shown in Figure

In case of transforming an OFDM based waveform into UFMC waveform with the same system overhead in time domain, filter length will be equal to

Unlike the classic model, the modified UFMC model proposed in [

We have taken the simplest model of UFMC proposed in [

To obtain low computational complexity in the IFFT, the Radix-2 algorithm is selected, because it offers opportunities to avoid redundant computation when UFMC is implemented through the transmission scheme of Figure

Consider the portion of the Radix-2 decimation-in-time (DIT) implementation of a 64-point IFFT with bit-reversed input format, as shown in Figure

Part of Radix-2 implementation of 64-point IFFT.

In order to exploit this simplification, the 12 data samples must be copied to multiple locations before the computation of the useful butterflies starts; e.g., the first sample of the PRB, X(0), is also copied to locations 32, 16, and 48 of the input memory of the third stage of butterflies, as shown in Figure

To implement 64-point IFFT, Radix-4 architecture is also considered. The partial data flow of 64-point IFFT using Radix-4 implementation is shown in Figure

Part of Radix-4 implementation of 64-point IFFT.

When the Radix-2 and Radix-4 implementation schemes are compared for the simplified IFFT computation, Radix-2 has a clear advantage: it requires less hardware, uses the placed hardware at a 100% utilization rate, and is simpler to implement. In addition, the Radix-2 solution lends itself to a scalable architecture in which more Radix-2 butterflies (BFs) can be placed in parallel to achieve higher throughput. Hence, fewer hardware resources, a higher utilization rate of the placed hardware, lower architectural complexity, and support for scalability are the key reasons for selecting the Radix-2 implementation to exploit the simplified computation of a 64-point IFFT during UFMC waveform synthesis.
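The saving obtained by skipping butterflies whose inputs are zero can be checked numerically. The following is a minimal sketch, not the authors' hardware schedule: it uses a recursive Radix-2 DIT formulation (the function names `pruned_ifft` and `direct_idft` are illustrative), assumes the 12 PRB samples occupy inputs 0 to 11 of the 64-point IFFT, and omits the 1/64 scaling. Whenever the odd branch of a stage carries only zeros, the stage reduces to copying data, which corresponds to the data copying described above.

```python
import cmath

def pruned_ifft(x, counter):
    """Unnormalized radix-2 DIT inverse FFT that skips butterflies fed by zeros.

    counter is a one-element list that accumulates the butterflies executed.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = pruned_ifft(x[0::2], counter)
    if not any(x[1::2]):
        # Odd branch is all zeros: both output halves equal the even branch,
        # so this stage is a pure copy and needs no arithmetic.
        return even + even
    odd = pruned_ifft(x[1::2], counter)
    half = n // 2
    counter[0] += half  # one butterfly per output pair
    out = [0j] * n
    for m in range(half):
        t = cmath.exp(2j * cmath.pi * m / n) * odd[m]
        out[m] = even[m] + t
        out[m + half] = even[m] - t
    return out

def direct_idft(x):
    """Reference: unnormalized inverse DFT computed from the definition."""
    n = len(x)
    return [sum(x[k] * cmath.exp(2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]

# One PRB: 12 arbitrary nonzero samples followed by 52 zeros (assumed layout).
prb = [complex(k + 1, -k) for k in range(12)]
x = prb + [0j] * 52

counter = [0]
y = pruned_ifft(x, counter)
ref = direct_idft(x)
max_err = max(abs(a - b) for a, b in zip(y, ref))
print("butterflies executed:", counter[0])
print("matches direct IDFT:", max_err < 1e-9)
```

Under these assumptions the count comes out at 112 butterflies, against 6 × 32 = 192 for an unpruned 64-point Radix-2 IFFT.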

As shown in Figure

FIR filter of classical UFMC.

Simplified filtering scheme.

Once a sample enters the filter and the memory elements are shifted, the filter coefficients are multiplexed one by one over the next 16 cycles to generate 16 outputs of the filter. For our case study, this has two consequences. First, only the 64 samples from the IFFT operation are required; i.e., no actual zero padding is needed. Second, only 5 multipliers, 4 shift registers, 4 adders, and 5 16-to-1 multiplexers are used in place of 73 to 80 multipliers and 72 to 79 adders and shift registers (depending on the number of filter taps). Moreover, the same hardware can be used for the considered tap lengths of 73 or 80 by changing the values of the filter coefficients at the inputs of the multiplexers.
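The coefficient-multiplexing idea can be sketched in software. The sketch below assumes that the filter input is the 64-sample IFFT output interpolated by 16 through zero insertion; this is our reading of the 16-to-1 multiplexers and is not stated explicitly in the text above. The names `UP`, `BRANCH_TAPS`, `direct_filter`, and `multiplexed_filter` are illustrative, and the coefficient values are arbitrary placeholders rather than the actual filter.

```python
UP = 16          # assumed interpolation factor (16 outputs per input sample)
BRANCH_TAPS = 5  # multipliers used per output sample

def direct_filter(x, h):
    """Reference: explicit zero insertion followed by a full-length FIR."""
    xu = []
    for s in x:
        xu.append(s)
        xu.extend([0.0] * (UP - 1))
    y = [0.0] * (len(xu) + len(h) - 1)
    for i, s in enumerate(xu):
        for j, c in enumerate(h):
            y[i + j] += s * c
    return y

def multiplexed_filter(x, h):
    """Five multipliers whose coefficients are switched 16 times per input."""
    h = list(h) + [0.0] * (UP * BRANCH_TAPS - len(h))  # pad 73 taps to 80
    reg = [0.0] * BRANCH_TAPS  # current sample plus 4 shift registers
    y = []
    for n in range(len(x) + BRANCH_TAPS - 1):
        reg = [x[n] if n < len(x) else 0.0] + reg[:-1]
        for p in range(UP):  # multiplexer select value
            y.append(sum(h[p + UP * j] * reg[j] for j in range(BRANCH_TAPS)))
    return y

def max_mismatch(a, b):
    """Largest absolute difference after zero-extending the shorter list."""
    n = max(len(a), len(b))
    a = list(a) + [0.0] * (n - len(a))
    b = list(b) + [0.0] * (n - len(b))
    return max(abs(u - v) for u, v in zip(a, b))

x = [complex(n % 7 - 3, n % 5) for n in range(64)]  # stand-in IFFT output
h80 = [1.0 + 0.01 * k for k in range(80)]           # placeholder coefficients
h73 = h80[:73]

err80 = max_mismatch(direct_filter(x, h80), multiplexed_filter(x, h80))
err73 = max_mismatch(direct_filter(x, h73), multiplexed_filter(x, h73))
print(err80 < 1e-9, err73 < 1e-9)
```

Both tap lengths pass through the same five-multiplier loop; only the coefficient table changes, which mirrors the statement that the hardware is reused for 73 or 80 taps.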

In the scheme of [

In expression (

LUT based solution for storing frequency shifting coefficients in 256-location ROM for LTE 10 MHz channelization specifications.

There is another possibility of reduction in the size of LUT which is shown in Figure

LUT based solution for storing frequency shifting coefficients in 128-location ROM carrying 128 values of one-half of a cosine wave.

Using this scheme, the LUT size is reduced by a factor of 4 compared with the scheme presented in Figure
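One way to recover all frequency shifting coefficients from a half-cosine table is sketched below. It assumes 256 phase steps per full turn, taken from the 256-location ROM, and a ROM holding the cosine for the first 128 steps; the address mapping and the names `ROM`, `cos_lut`, `sin_lut`, and `shift_coeff` are our own illustration, not the addressing logic of the presented design.

```python
import math

PHASES = 256           # assumed phase steps per full turn
HALF = PHASES // 2     # 128 stored values: one half of a cosine wave
ROM = [math.cos(2 * math.pi * k / PHASES) for k in range(HALF)]

def cos_lut(idx):
    """Cosine for any phase index, using cos(theta + pi) = -cos(theta)."""
    idx %= PHASES
    return ROM[idx] if idx < HALF else -ROM[idx - HALF]

def sin_lut(idx):
    """Sine from the same ROM, using sin(theta) = cos(theta - pi/2)."""
    return cos_lut(idx - PHASES // 4)

def shift_coeff(idx):
    """Complex coefficient exp(j * 2*pi * idx / 256) built from the ROM."""
    return complex(cos_lut(idx), sin_lut(idx))

# Compare every phase index against a direct evaluation.
worst = max(abs(shift_coeff(i) - complex(math.cos(2 * math.pi * i / PHASES),
                                         math.sin(2 * math.pi * i / PHASES)))
            for i in range(PHASES))
print(len(ROM), worst < 1e-12)
```

If the larger ROM is read as holding a cosine word and a sine word at each of its 256 locations, 512 stored words shrink to 128, which is consistent with the stated factor of 4.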

In order to achieve a hardware design for LTE 10MHz channelization, we need to find the required time to process all PRBs. Based on this information we can scale and schedule hardware resources.

The useful OFDM symbol time is 66.7

The UFMC waveform synthesis is performed in 5 processes as shown in Figure

Locations of data copying from external to internal memory.

| External memory location | Data in external memory | Internal memory locations |
|---|---|---|
| 0th | X(0) | 0th, 32nd, 16th, 48th |
| 1st | X(1) | 1st, 33rd, 17th, 49th |
| 2nd | X(2) | 2nd, 34th, 18th, 50th |
| 3rd | X(3) | 3rd, 35th, 19th, 51st |
| 4th | X(4) | 4th, 36th, 20th, 52nd, 12th, 44th, 28th, 60th |
| 5th | X(5) | 5th, 37th, 21st, 53rd, 13th, 45th, 29th, 61st |
| 6th | X(6) | 6th, 38th, 22nd, 54th, 14th, 46th, 30th, 62nd |
| 7th | X(7) | 7th, 39th, 23rd, 55th, 15th, 47th, 31st, 63rd |
| 8th | X(8) | 8th, 40th, 24th, 56th |
| 9th | X(9) | 9th, 41st, 25th, 57th |
| 10th | X(10) | 10th, 42nd, 26th, 58th |
| 11th | X(11) | 11th, 43rd, 27th, 59th |
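The copy pattern in the table follows a regular rule, which the short sketch below reproduces. The helper name `internal_locations` and the offset lists are illustrative; only the resulting location lists are taken from the table.

```python
BASE_OFFSETS = [0, 32, 16, 48]  # offsets applied to every PRB sample

def internal_locations(k):
    """Internal memory locations that receive sample X(k), for k = 0..11."""
    offsets = list(BASE_OFFSETS)
    if 4 <= k <= 7:
        # Samples 4..7 have no partner 8 positions ahead (X(12)..X(15) are
        # zero), so they are additionally copied 8 locations further on.
        offsets += [8 + o for o in BASE_OFFSETS]
    return [k + o for o in offsets]

copy_map = {k: internal_locations(k) for k in range(12)}
writes = [loc for k in range(12) for loc in copy_map[k]]

for k in range(12):
    print(f"X({k}) -> {copy_map[k]}")
print("total writes:", len(writes))
```

The 12 samples produce 64 writes in total, and every one of the 64 internal locations is written exactly once.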

In order to perform this data transfer, a Finite State Machine with Datapath (FSMD) is designed, for which the Algorithmic State Machine (ASM) chart is shown in Figure

ASM Chart for data fetching process.

A total of 68 clock cycles are required to complete this process. Hence, 451 clock cycles are left from 519 clock cycles to execute remaining four processes related to one PRB. In order to copy data related to PRBs other than first PRB in IFFT process RAM, the developed mechanism is shown in Figure

Block diagram of data fetching process.

64-point IFFT process block diagram.

Using this process, the reduced number of butterflies, i.e., 112 for

Architecture for FIR filter.

Spectrum shifting is achieved by using the LUT architecture explained in Section

Processes 3, 4, and 5.

Before storing the data in the memory, previous data is read first. Then spectrum shifted data is added to data read and saved in the memory. In total, data level pipelining is established in three processes of filtering, spectrum shifting, and final addition of all processed PRBs. The whole process is shown in Figure

In the overall design, the serial execution of the first group (processes 1 and 2) and the pipelined execution of the second group (processes 3, 4, and 5) each meet the timing requirements individually. Hence, by executing the two groups in a pipelined fashion, the overall system-level timing requirements can be met.
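The timing argument can be restated as a small budget check. The sketch below uses only figures quoted in this paper: 519 clock cycles available per PRB, 68 and 336 cycles for processes 1 and 2, 516 cycles for processes 3 to 5 as quoted in the timing discussion, and a 364 MHz clock. The variable names are illustrative.

```python
CLOCK_MHZ = 364          # post-place-and-route operating frequency
BUDGET_CYCLES = 519      # clock cycles available per PRB

PROCESS_1 = 68           # data fetching
PROCESS_2 = 336          # 64-point IFFT
GROUP_A = PROCESS_1 + PROCESS_2   # executed serially
GROUP_B = 516            # filtering, spectrum shifting, accumulation (pipelined)

# With the two groups pipelined against each other, the slower group
# sets the rate at which PRBs can be accepted.
bottleneck = max(GROUP_A, GROUP_B)
meets_budget = bottleneck <= BUDGET_CYCLES

# A single PRB still passes through both pipeline slots back to back.
latency_cycles = 2 * bottleneck
latency_us = latency_cycles / CLOCK_MHZ

print("group A cycles:", GROUP_A)
print("meets per-PRB budget:", meets_budget)
print("latency cycles:", latency_cycles)
print("latency (us):", round(latency_us, 3))
```

Because each group fits within the per-PRB budget on its own, overlapping them keeps the sustained rate at one PRB per slot while the latency spans two slots.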

To materialize the pipelining of two groups of processes, the concept of ping-pong buffering/memory [

Sequence of execution of all processes.

In order to obtain implementation results for the FPGA, the whole project was created in the Xilinx ISE design suite. The Virtex-7 XC7V2000t is selected as the implementation platform. The synthesis results of the different building blocks are summarized in Table

Post-place and route results.

| Process | Slice registers | Slice LUTs | Multipliers | BRAMs | Clock cycles | Frequency (MHz) |
|---|---|---|---|---|---|---|
| Process 1 | 27 | 28 | 0 | 0 | 68 | 540 |
| Process 2 | 33 | 233 | 4 | 3 | 336 | 440 |
| Process 3 | 714 | 720 | 40 | 0 | 514 | 364 |
| Processes 4 & 5 | 136 | 152 | 20 | 0 | 514 | 372 |

It can be seen that very few resources are required to enable the simplifications in the IFFT operation using processes 1 and 2. A total of 27 slice registers are required to implement the address register and state register of process 1. Similarly, 28 LUTs are required to complete the FSMD of process 1. Process 2, the 64-point IFFT operation, requires a few slice registers along with 233 slice LUTs and 4 multipliers. These FPGA resources are used to build one complex multiplier, one complex adder, and one complex subtractor, which implement one Radix-2 DIT butterfly structure, along with the address generator and control logic required to execute this butterfly 112 times. Two BRAMs are required to work as ping-pong memories, whereas one BRAM is used to create the LUT for twiddle factors. As far as process 3 is concerned, 40 DSP48 slices are used in the filtering part because a real-valued filter coefficient must be multiplied with a complex-valued output of the IFFT operation; i.e., two real multipliers are required for each multiplier of the filter. Hence, 40 real multipliers are used to create 20 filter multipliers. For processes 4 and 5, 20 multipliers are used for spectrum shifting. Of these 20 multipliers, 16 are used in 4 complex multiplications, i.e., multiplication of 4 simultaneous complex-valued outputs of the filter with 4 complex-valued spectrum shifting coefficients. Finally, the last 4 multipliers are used in the address generation mechanism of the spectrum shifting coefficient LUT, i.e., multiplication of PRB number

In the presented architecture of the UFMC transmitter, serial execution of processes (processes 1 and 2), data-sample-level pipelining (processes 3, 4, and 5), and finally process-group-level pipelining (processes 1 and 2 with processes 3, 4, and 5) are used. Processes 1 and 2 jointly take 404 clock cycles and processes 3, 4, and 5 take 516 clock cycles in total. Taking 516 clock cycles as the critical path to process a single PRB and the post-place-and-route frequency of 364 MHz, the processing delay (latency), i.e., the time from the start of data input of 1 PRB until generation of the UFMC waveform for 1 PRB, is (1032 clock cycles) 2.835

In order to assess the out-of-band emission results achieved on the FPGA, a prototype system is realized. In this system, the input test vectors were generated from a fixed-point reference model of the UFMC transmitter. These test vectors are saved in FPGA memories, which are accessed by the presented UFMC transmitter. The output of the proposed transmitter was then saved in FPGA internal block RAMs and read out by a PC over a serial link. Finally, the frequency response of the FPGA output was plotted along with the results of the software-based golden reference model, as shown in Figure

Out-of-band emission results of FPGA and floating point golden reference model.

Zoomed view of out-of-band emission results of FPGA and floating point golden reference model.

Complexity analyses based on number of computations are presented in [

While comparing OFDM and UFMC, the IFFT complexity for 10MHz channelization of OFDM has 5120 butterfly executions (single 1024-point IFFT) while in UFMC we have computed 5600 butterflies for 50 PRBs (112 actual butterflies in 64-point IFFT per PRB). Hence, clock cycles required to perform 1024 points through our hardware will take

Keeping in view the different parameters, i.e., different number of carriers selected for FBMC transmitter in the work presented through [

Result comparison.

| Design | Device | Slice registers | Slice LUTs | Multipliers | BRAMs | | Latency |
|---|---|---|---|---|---|---|---|
| [ | Zynq-7 | 2720 | 3711 + (876 for RAM) = 4587 | 20 | - | 2.33 | 3.65 |
| [ | Kintex-7 | 11300 | 7990 | 30 | 19 | 66.7 | - |
| Our Solution | Virtex-7 | 910 | 1133 | 64 | 3 | 1.417 | 2.835 |

In this paper, we have targeted a real-time FPGA implementation of a UFMC transmitter in order to compare it with other popular waveforms. To this end, a hardware architecture for UFMC is proposed, taking LTE 10 MHz channelization as a case study. By selecting efficient methods of implementing the constituent blocks of a UFMC transmitter, we have achieved a solution that consumes very few FPGA resources. Through the use of data-level and process-level pipelining, our design achieves a high operational frequency. This elevated frequency increases the overall throughput; hence, the timing requirements of 10 MHz channelization are met with few FPGA resources. To the best of our knowledge, the work presented in this paper is the first effort to present a dedicated hardware solution for a UFMC transmitter and to compare it with OFDM and FBMC on the basis of actual FPGA implementations.

The authors reached their conclusions by modeling the whole system in VHDL and then synthesizing and implementing the design using the ISE tool from Xilinx. These results are presented in the article. The VHDL models cannot be shared because they have commercial value for the funding agencies.

The authors declare that there are no conflicts of interest regarding the publication of this paper.

This research was partly funded by the EPSRC Global Challenges Research Fund, the DARE Project EP/P028764/1. This work is funded by both Bahria University, Islamabad, Pakistan, and the University of Glasgow, Scotland, UK.