Novel Receiver Architecture for LTE-A Downlink Physical Control Format Indicator Channel with Diversity

The physical control format indicator channel (PCFICH) carries the control information about the number of orthogonal frequency division multiplexing (OFDM) symbols used for transmission of control information in the long term evolution-advanced (LTE-A) downlink system. In this paper, two novel low complexity receiver architectures are proposed to implement the maximum likelihood (ML) based algorithm which decodes the control format indicator (CFI) value in a field programmable gate array (FPGA) at the user equipment (UE). The performance of the proposed architectures is analyzed in terms of the timing cycles, operational resource requirement, and resource complexity. In LTE-A, the base station and the UE have multiple antenna ports to provide transmit and receive diversity. The proposed architectures are implemented in the Virtex-6 xc6vlx240tff1156-1 FPGA device for various antenna configurations at the base station and the UE. When multiple antenna ports are used at the base station, transmit diversity is obtained by applying the concept of space frequency block code (SFBC). It is shown that the proposed architectures use fewer operational units in the FPGA than the traditional direct method of implementation.


Introduction
The goal of the third generation partnership project (3GPP) long term evolution-advanced (LTE-A) wireless standard is to increase the capacity and speed of wireless data communication. The LTE-A physical layer is a highly efficient means of conveying both data and control information between an enhanced base station, popularly known as eNodeB, and mobile user equipment (UE). It supports both frequency division duplex (FDD) and time division duplex (TDD) configurations in uplink and downlink operations. Further, it provides a wide range of system bandwidths in order to operate in a large number of different spectrum allocations [1].
The LTE-A standard has six physical channels for downlink. They are the physical broadcast channel (PBCH), physical downlink shared channel (PDSCH), physical multicast channel (PMCH), physical downlink control channel (PDCCH), physical hybrid automatic repeat request (ARQ) indicator channel (PHICH), and physical control format indicator channel (PCFICH). PBCH carries the basic system information for the other channels to be configured and operated in the LTE-A grid. The PDSCH is the main data-bearing channel. PMCH is defined for future use. In LTE-A, the control signals are transmitted at the start of each subframe in the LTE-A grid. PDCCH is used to carry scheduling information of different types such as downlink resource scheduling and uplink power control instructions. PHICH is used to send the acknowledgement/negative acknowledgement bit to UEs to indicate whether the uplink user data is correctly received or not. PCFICH carries the control information about the number of orthogonal frequency division multiplexing (OFDM) symbols used for transmission of downlink control information. The high data rate in LTE-A places high processing demands on all layers of the system, including demanding digital signal processing (DSP) hardware in the physical layer. Further, the hardware implementation of receiver structures of various physical channels in LTE-A becomes a challenging task as the computational complexity increases.

VLSI Design
In [2], receivers were designed for a 2 × 2 antenna system and for quadrature phase shift keying (QPSK) modulation and quadrature amplitude modulation (16-QAM and 64-QAM). Though the successive interference cancellation (SIC) receiver meets the timing requirements in the LTE system, it is complex, and the K-best list sphere detector (K-LSD) receiver has high latency. In [3], field programmable gate array (FPGA) and application specific integrated circuit (ASIC) implementations of receivers based on the linear minimum mean-square error (LMMSE), the K-LSD, the iterative SIC detector, and the iterative K-LSD algorithms are carried out for a spatial multiplexing based LTE-A system. The SIC algorithm is found to perform worse than the K-LSD when the multiple input multiple output (MIMO) channels are highly correlated, while the performance difference diminishes when the correlation decreases. The ASIC receivers are designed to meet the decoding throughput requirements in LTE, and the K-LSD is found to be the most complex receiver although it gives the highest reliable data transmission throughput. It is shown that a receiver architecture which could be reconfigured to use a simple or a more complex detector as the channel conditions change would achieve the best performance while consuming the least amount of power in the receiver. FPGA implementation of a MIMO detector based on two typical sphere decoding algorithms, namely, the Viterbo-Boutros (VB) algorithm and the Schnorr-Euchner (SE) algorithm, is carried out in [4]. In this implementation method, three levels of parallelism are explored to improve the decoding rate: the concurrent execution of the channel matrix preprocessing on an embedded processor and the decoding functions on customized hardware modules, the parallel decoding of real/imaginary parts for complex constellation, and the concurrent execution of multiple steps during the closest lattice point search. The implementation of a low-complexity codebook searching engine is proposed to support both LTE and LTE-A operations [5]. In [6], VLSI implementation of a low-complexity MIMO symbol detector based on a novel MIMO detection algorithm called modified fixed-complexity soft-output (MFCSO) detection is presented. It includes a microcode-controlled channel preprocessing unit, separate channel memory, and a pipelined detection unit. A MATLAB-based downlink physical-layer simulator for LTE, intended only for research applications, is presented in [7]. In [8], maximum likelihood (ML) based receiver structures are developed for decoding the downlink control channels PCFICH and PHICH in the LTE wireless standard, and the performance of the receivers has been analyzed for various configurations. The analytical results were validated against computer simulations, but hardware implementation of the structures was not coded or synthesized. In [9], direct implementation of receive algorithms was carried out in FPGA for downlink control channels in LTE. However, most of these works either propose architectures for FPGA implementation or analyze the performance of various receiver structures in a generalized manner. The objective of this paper is to propose novel architectures for FPGA implementation of transmit and receive processing of the downlink PCFICH channel in the LTE-A standard in particular.
The 32-bit code word corresponding to the value of CFI is scrambled and QPSK modulated. The resultant 16 QPSK complex symbols are mapped to the resource elements of the first OFDM symbol of every subframe after layer mapping and precoding to obtain transmit diversity when two or more antenna ports are used at eNodeB [10]. The 32-bit code words for the four possible values of CFI are given in Table 1.
A general block diagram of the transmitter and receiver processing of PCFICH is shown in Figure 1.
The OFDM signal is transmitted through a frequency selective fading channel. It is assumed that the number of receive antenna ports at UE is $N_R$. At each receive antenna port of the UE, resource-element demapping follows the cyclic prefix removal and fast Fourier transform (FFT). The 16 × 1 receive signal vector at each antenna port is equalized in the frequency domain at each subcarrier using the corresponding 16 × 1 channel frequency response vector. The outputs of the frequency domain equalizer from each antenna port are summed up. The resultant 16 × 1 complex vector is applied to the maximum likelihood (ML) detector for detecting the CFI value. The objective of this paper is to synthesize and implement the receiver architecture for PCFICH.
The paper is structured as follows. Section 2 explains the system model and basic implementation architectures for single input single output (SISO) and single input multiple output (SIMO) configurations. The system model and basic implementation architecture for multiple input single output (MISO) and multiple input multiple output (MIMO) configurations are described in Sections 3 and 4, respectively. The proposed implementation architectures using folding and superscalar methods are given in Section 5 for SISO, SIMO, MISO, and MIMO configurations. Section 6 analyzes the performance of the proposed architectures and Section 7 concludes the paper with remarks on future work.

System Model and Implementation Architecture for SISO and SIMO Configurations
The received signal model for the SISO configuration of PCFICH is given by

$$\mathbf{y} = \mathbf{h} \circ \mathbf{d}^{(m)} + \mathbf{w}, \qquad (1)$$

where $\mathbf{y} = [y_0, y_1, \ldots, y_{15}]^T$ is a 16 × 1 received signal vector, $\mathbf{h} = [h_0, h_1, \ldots, h_{15}]^T$ is a 16 × 1 channel frequency response vector, $\mathbf{d}^{(m)} = [d_0^{(m)}, d_1^{(m)}, \ldots, d_{15}^{(m)}]^T$ is a 16 × 1 complex QPSK symbol vector corresponding to the CFI value $m$ from the set {1, 2, 3, 4}, "∘" represents element-by-element multiplication, and $\mathbf{w}$ is a 16 × 1 additive white noise vector whose elements are zero-mean Gaussian random numbers with unit variance. The objective is to detect the value of CFI from the received signal vector $\mathbf{y}$, assuming the channel frequency response vector $\mathbf{h}$ to be known. Using the maximum likelihood (ML) principle, CFI is detected as

$$\widehat{\mathrm{CFI}} = \arg\min_{m \in \{1,2,3,4\}} \left\| \mathbf{y} - \mathbf{h} \circ \mathbf{d}^{(m)} \right\|^2. \qquad (2)$$

Figure 2 shows the basic architecture for estimating CFI using (2) in the SISO configuration. The received signal vector $\mathbf{y}$ and the channel frequency response vector $\mathbf{h}$ are provided as input to the four receiver processing blocks (RPB) along with the precomputed data vectors $\mathbf{d}^{(1)}$, $\mathbf{d}^{(2)}$, $\mathbf{d}^{(3)}$, and $\mathbf{d}^{(4)}$. The internal diagram for RPB CFI-1 is shown in Figure 3. It computes the expression $\| \mathbf{y} - \mathbf{h} \circ \mathbf{d}^{(1)} \|^2$ assuming CFI = 1. In RPB$m$, the precomputed data vector $\mathbf{d}^{(m)}$ is multiplied element by element with the channel frequency response vector. The resultant 16 × 1 vector is subtracted from the 16 × 1 received signal vector $\mathbf{y}$. The sum of the squared magnitudes of the elements in the resultant vector is the output of the RPB.
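The ML rule described above can be sketched in a few lines of floating-point Python. This is only an illustrative model, not the FPGA design: the function names are hypothetical, the code words are assumed to follow the repeating 3-bit patterns of 3GPP TS 36.212 (standing in for Table 1), the QPSK mapping is assumed to be the standard LTE one, and scrambling is omitted.

```python
import random

def cfi_codeword(m):
    """Assumed 32-bit code word for CFI value m (1..4), before scrambling."""
    pattern = {1: (0, 1, 1), 2: (1, 0, 1), 3: (1, 1, 0), 4: (0, 0, 0)}[m]
    return [pattern[i % 3] for i in range(32)]

def qpsk_map(bits):
    """Map bit pairs (b0, b1) to ((1 - 2*b0) + j*(1 - 2*b1)) / sqrt(2)."""
    scale = 2 ** -0.5
    return [complex(scale * (1 - 2 * bits[2 * k]), scale * (1 - 2 * bits[2 * k + 1]))
            for k in range(len(bits) // 2)]

def rpb_metric(y, h, d):
    """One receiver processing block: sum over k of |y_k - h_k * d_k|^2."""
    return sum(abs(yk - hk * dk) ** 2 for yk, hk, dk in zip(y, h, d))

def detect_cfi(y, h):
    """Evaluate the four hypotheses; return (detected CFI, per-hypothesis metrics)."""
    metrics = {m: rpb_metric(y, h, qpsk_map(cfi_codeword(m))) for m in (1, 2, 3, 4)}
    return min(metrics, key=metrics.get), metrics

if __name__ == "__main__":
    rng = random.Random(0)
    h = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(16)]
    d = qpsk_map(cfi_codeword(2))
    # Lightly noisy observation of CFI = 2 (noise level chosen arbitrarily).
    y = [hk * dk + 0.1 * complex(rng.gauss(0, 1), rng.gauss(0, 1))
         for hk, dk in zip(h, d)]
    print(detect_cfi(y, h)[0])
```

Each call to `rpb_metric` plays the role of one RPB, and the final `min` plays the role of the CFI detector.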
The inputs to the CFI detector are the 16-bit RPB outputs $r_1$, $r_2$, $r_3$, and $r_4$. The CFI detector determines which RPB output has the minimum value. The internal diagram of the CFI detector circuit, which has 4 comparator modules (CM), is shown in Figure 4. In CM-1, input $r_2$ and the one's complement of input $r_1$ are added. If a carry is generated, then $r_1$ is less than $r_2$. The outputs of CM-1 are denoted $Cr_1$ and $Sr_1$. In CM-2, input $r_4$ and the one's complement of input $r_3$ are added.
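The carry-based comparison can be checked with a small bit-accurate sketch. The code below is an assumption-laden illustration: it models a 16-bit comparator that adds one input to the one's complement of the other and inspects the carry, and it arranges three such comparisons in a tree to pick the minimum. The exact wiring of the four CMs in Figure 4 is not reproduced, and the function names are hypothetical.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1

def comparator(a, b):
    """Add b to the one's complement of a; carry out is 1 exactly when a < b.

    Returns (carry, smaller value); on a tie the second input is passed on.
    """
    total = b + (~a & MASK)
    carry = total >> WIDTH
    smaller = a if carry else b
    return carry, smaller

def cfi_detector(r):
    """r = [r1, r2, r3, r4] as unsigned 16-bit metrics; returns the CFI (1..4)."""
    c1, s1 = comparator(r[0], r[1])   # r1 versus r2
    c2, s2 = comparator(r[2], r[3])   # r3 versus r4
    c3, _ = comparator(s1, s2)        # winner of each pair
    if c3:
        return 1 if c1 else 2
    return 3 if c2 else 4
```

The identity behind the carry test is that, for $n$-bit unsigned values, $b + (2^n - 1 - a) \ge 2^n$ holds exactly when $a < b$.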
In SIMO, the 16 × 1 receive signal vector at the $j$th receive antenna is modeled as

$$\mathbf{y}^{(j)} = \mathbf{h}^{(j)} \circ \mathbf{d}^{(m)} + \mathbf{w}^{(j)}, \quad j = 0, 1, \ldots, N_R - 1, \qquad (5)$$

where $N_R$ represents the number of receive antennas at UE, $\mathbf{h}^{(j)}$ is the 16 × 1 channel frequency response vector between the transmit antenna and the $j$th receive antenna, and $\mathbf{w}^{(j)}$ is the 16 × 1 noise vector at the $j$th receive antenna. Now, the objective is to detect the value of CFI from the received signal vectors at each receive antenna, assuming the channel frequency response vectors at each receive antenna are known. Maximal ratio combining is carried out at the receiver. Using the maximum likelihood (ML) principle, CFI is estimated as [9]

$$\widehat{\mathrm{CFI}} = \arg\min_{m \in \{1,2,3,4\}} \sum_{j=0}^{N_R-1} \left\| \mathbf{y}^{(j)} - \mathbf{h}^{(j)} \circ \mathbf{d}^{(m)} \right\|^2. \qquad (6)$$

The basic architecture for estimating CFI using (6) in the 1 × 2 SIMO configuration, shown in Figure 5, is similar to the basic architecture of the SISO configuration. The received signal vector $\mathbf{y}^{(j)}$ and the channel frequency response vector $\mathbf{h}^{(j)}$ are provided as input to the four receiver processing blocks (RPB-CFI$_m^{(j)}$) at the $j$th receive antenna, along with the precomputed data vectors $\mathbf{d}^{(1)}$, $\mathbf{d}^{(2)}$, $\mathbf{d}^{(3)}$, and $\mathbf{d}^{(4)}$. The outputs from the $m$th RPB at the 0th receive antenna, $r_m^{(0)}$, and at the 1st receive antenna, $r_m^{(1)}$, are added to get the $m$th input $r_m$ of the CFI detector circuit.
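The per-antenna summation of RPB outputs can be sketched as follows. This is a minimal floating-point illustration with hypothetical function names; the hypothesis vectors are passed in by the caller, so any placeholder symbol vectors can be used in place of the actual PCFICH code words.

```python
def rpb_metric(y, h, d):
    """One RPB: sum over k of |y_k - h_k * d_k|^2 for a single receive antenna."""
    return sum(abs(yk - hk * dk) ** 2 for yk, hk, dk in zip(y, h, d))

def detect_cfi_simo(ys, hs, hypotheses):
    """ys, hs: one 16-element list per receive antenna.

    hypotheses: dict mapping each CFI value m to its symbol vector.
    The RPB outputs of all antennas are added per hypothesis before
    the minimum is selected.
    """
    totals = {m: sum(rpb_metric(y, h, d) for y, h in zip(ys, hs))
              for m, d in hypotheses.items()}
    return min(totals, key=totals.get), totals
```

With two receive antennas, `totals[m]` corresponds to the sum of the two RPB outputs that feeds the $m$th input of the CFI detector.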

System Model and Implementation Architecture for MISO Configuration
In MISO and MIMO configurations, space frequency block code (SFBC) based layer mapping and precoding are carried out to obtain transmit diversity when two or more antenna ports are used at eNodeB as per the 3GPP LTE wireless standard [1,11]. It is assumed that 2 antenna ports are used at eNodeB. The 16 × 1 complex symbol vector output of the modulation mapper is applied to the layer mapper, and the layer-mapped symbols are precoded according to the SFBC in the LTE-A standard. The precoder output at antenna port 0 and antenna port 1 is shown in Figure 6.
The PCFICH receiver architecture for the 2 × 1 MISO configuration is shown in Figure 7. The receiver decoding block (RDB) gets the 16 × 1 received signal vector $\mathbf{y}$ and computes the decoder output vector using (10), assuming that the channel frequency response vectors $\mathbf{h}^{(0)}$ and $\mathbf{h}^{(1)}$ are known. The detailed internal architecture of RDBM is shown in Figure 11.

System Model and Implementation Architecture for MIMO Configuration
In the MIMO system, the signals at the $i$th and $(i+1)$th subcarriers in the receive array are modeled in pairs for $i = 0, 2, 4, 6, 8, 10, 12, 14$, where $\mathbf{h}^{(lj)}$ represents the channel frequency response vector between the $l$th transmit antenna and the $j$th receive antenna and $w_i^{(j)}$ represents the noise in the $i$th subcarrier at the $j$th receive antenna. In vector form, the decoder applies $\mathbf{H}_{\mathrm{eff},i}^{H}$, the Hermitian of the 4 × 2 channel transmission matrix, to the received signals; this is expanded for $i = 0, 2, 4, 6, 8, 10, 12, 14$ in (16) to give the decoder outputs. The PCFICH receiver architecture of the 2 × 2 MIMO configuration is shown in Figure 10.
The receiver decoding block (RDBM) gets the 16 × 1 received signal vector $\mathbf{y}$ and computes the decoder output vector using (14), assuming that the channel frequency response vectors $\mathbf{h}^{(00)}$, $\mathbf{h}^{(01)}$, $\mathbf{h}^{(10)}$, and $\mathbf{h}^{(11)}$ are known. The 16 × 1 precomputed data vectors for CFI = 1, 2, 3, and 4 are represented as $\mathbf{s}_1^{(0)}$, $\mathbf{s}_2^{(0)}$, $\mathbf{s}_3^{(0)}$, and $\mathbf{s}_4^{(0)}$, respectively, for antenna 0, and as $\mathbf{s}_1^{(1)}$, $\mathbf{s}_2^{(1)}$, $\mathbf{s}_3^{(1)}$, and $\mathbf{s}_4^{(1)}$, respectively, for antenna 1. The received signal vectors $\mathbf{y}^{(0)}$ and $\mathbf{y}^{(1)}$ are multiplied with the four channel estimation vectors to give the decoded output vector $\mathbf{z}$, which is sent to the processing block (PB) shown in Figure 9. The decoder outputs $\mathbf{z}_i$, $i = 0, 2, 4, \ldots, 14$, are stacked as the 16 × 1 vector $\mathbf{z} = [\mathbf{z}_0^T, \mathbf{z}_2^T, \ldots, \mathbf{z}_{14}^T]^T$. Similarly, RDBM1 gives the output vector $\mathbf{z}^{(1)}$ using the precomputed data vectors $\mathbf{y}_1^{(0)}$ and $\mathbf{y}_1^{(1)}$ and the channel estimation vectors. The architecture of the PBs and the CFI detection architecture are similar to those of the MISO system. The sum of the squared magnitudes of the differences between each element in the decoded output vector $\mathbf{z}$ and its precomputed counterpart in the vector $\mathbf{z}^{(1)}$ is the output $r_1$ of PB1. Similarly, $r_2$, $r_3$, and $r_4$ are computed for the other CFI values. The outputs $r_1$, $r_2$, $r_3$, and $r_4$ are compared by the CFI detector shown in Figure 4 to determine the minimum value.

PCFICH Receiver Implementation Methods
The PCFICH receiver architectures can be implemented directly based on the basic architectures developed in Sections 3 and 4. However, in order to utilize the FPGA resources effectively, the basic architectures are implemented using modified novel architectures based on two VLSI DSP techniques, namely, the folding and superscalar processing approaches.

Direct Implementation with Multiplicands Rearranged Method
In the receiver architecture for SISO and SIMO, the 16 × 1 received signal vector is directly subtracted from the precomputed data vector for a given CFI. This requires fewer multipliers and adders than MISO and MIMO. In MISO and MIMO configurations, complex multiplications are necessary for the multiplication of $\mathbf{H}_{\mathrm{eff},i}^{H}$ with the received signal vector. This increases the number of multiplications in the CFI detection process. Hence, optimum rearrangement of the terms is carried out to minimize the number of multiplications. Further, the intermediate products are reused in the calculation of the real and imaginary parts. Consider the multiplication of two complex numbers $\mathrm{Re}\{h\} + j\,\mathrm{Im}\{h\}$ and $\mathrm{Re}\{y\} + j\,\mathrm{Im}\{y\}$.
Since the terms $\mathrm{Re}\{y\}\,\mathrm{Im}\{h\}$ and $\mathrm{Im}\{y\}\,\mathrm{Re}\{h\}$ appear in (19), the product requires only three multiplications but five additions. This kind of rearrangement of the multiplicands is employed in the processing blocks at the cost of increased additions, as shown in Figure 12.
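One well-known rearrangement with exactly three real multiplications and five additions is sketched below. It is offered as an illustration of the idea and may differ in its grouping of terms from the arrangement drawn in Figure 12; the function name and intermediate names `k1`, `k2`, `k3` are hypothetical.

```python
def complex_mul_3(h_re, h_im, y_re, y_im):
    """Product (h_re + j*h_im) * (y_re + j*y_im) using 3 real multiplications.

    Additions/subtractions: three before the multipliers, two after (five total).
    """
    k1 = y_re * (h_re + h_im)   # multiplication 1, shared by both outputs
    k2 = h_re * (y_im - y_re)   # multiplication 2
    k3 = h_im * (y_re + y_im)   # multiplication 3
    real = k1 - k3              # = h_re*y_re - h_im*y_im
    imag = k1 + k2              # = h_re*y_im + h_im*y_re
    return real, imag
```

The shared product `k1` is what allows the fourth multiplier of the direct form to be dropped, at the price of the extra pre-additions.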

Proposed Architecture Using Folding Method
Folding architecture systematically determines the control circuits in DSP architectures where multiple algorithmic operations are time-multiplexed to a single functional unit [12]. It is used for synthesis of DSP architectures that can be operated at single or multiple clocks. It reduces the number of hardware functional units (FUs) by a factor of $N$ at the expense of increased computation time.
The folding architecture is introduced in the receiver structure of RPB in SISO and SIMO configurations and of RDB and PB in MISO and MIMO configurations as shown in Figures 13 and 14, respectively. For the SISO RPB, there are 16 hardware lines to calculate the value of $r_1$, each requiring two multipliers. Hence the number of multipliers used in one RPB is 32. In order to reduce the number of multipliers and adders, the folding architecture is proposed. This architecture uses only two multipliers and performs the operation of a single hardware line 16 times in a sequential way. The difference between the received signal vector and the product of the channel frequency response vector with the precomputed data vector is stored in registers. At any one time, one real/imaginary pair of this difference is processed by the two multipliers to obtain the squared magnitude of that element. Four switches operating at the system clock speed are involved in the architecture: two switches pass the real part of the signal to one multiplier, while the other two switches pass the imaginary part of the signal to the other multiplier. The multipliers pass their products to the first adder, which forms the squared magnitude of the element. The output of the first adder is passed to the second adder with a delay to accumulate the 16 squared magnitudes into a register in subsequent clock cycles. This process requires 16 clock cycles and the CFI is detected at the 17th clock cycle. Although more clock cycles are needed to obtain the output, the resources are minimized in this method.
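The time-multiplexing of the two multipliers can be mimicked with a short behavioral model. This is a sketch under simplifying assumptions (floating point instead of fixed point, switches and register timing abstracted into a loop); the function name is hypothetical.

```python
def folded_rpb(y, h, d):
    """Behavioral model of a folded RPB: two multipliers reused once per cycle.

    Returns (accumulated metric, number of clock cycles used).
    """
    # Difference vector y - h*d, held in registers before folding starts.
    registers = [yk - hk * dk for yk, hk, dk in zip(y, h, d)]
    accumulator = 0.0
    cycles = 0
    for e in registers:                # one hardware line per clock cycle
        p_re = e.real * e.real         # multiplier 1: real part squared
        p_im = e.imag * e.imag         # multiplier 2: imaginary part squared
        element_energy = p_re + p_im   # first adder
        accumulator += element_energy  # second adder feeding the register
        cycles += 1
    return accumulator, cycles
```

For 16-element inputs the loop runs 16 times, matching the 16 accumulation cycles described above, while only two multiplications are active in any cycle instead of 32.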
The folded architecture of the decoding block of MISO and MIMO, involving complex multiplication of the channel frequency response vector and the receive signal vector, is shown in Figure 14. There are 2 complex multiplications and one addition in each of the 16 hardware lines. Hence the total resources used are 32 complex multiplications and 16 additions. The folded architecture, which reduces this to just 2 complex multiplications and one addition, requires five switches. Two switches are used to pass the first element of the receive signal vector and its corresponding channel frequency response element to one multiplier, and the other two switches are used to pass the second element of the receive signal vector and its channel frequency response element to the other multiplier. These four switches operate at the system clock speed. The multipliers pass their products to the adder through the fifth switch before moving to the PB. This process requires 16 clock cycles and the CFI is detected at the 17th clock cycle.

Proposed Architecture Using Superscalar Method
The superscalar approach is another low resource utilizing VLSI DSP technique. The superscalar processing method includes parallel processing and pipelining strategies. In this case, parallel operation for the 16 pairs of hardware lines is arranged with pipelining of the subtraction and square magnitude operations for each CFI. The SISO configuration does not have complex multiplications and it has only square magnitude operations. Hence the RPB of SISO has 16 hardware lines each having 2 multipliers, which results in a total of 32 multipliers. This setup requires more hardware resources than folding, but the output is obtained at every 4th clock cycle as shown in Figure 15. The SIMO configuration, which involves signal processing for two receive antennas, requires twice the number of multiplications of SISO, and the output is obtained at every 4th clock cycle. The delay block in the figure represents the delay element introduced to buffer the values and produce the outputs at the same time instant.
For the MISO configuration, the RDB has 16 hardware lines, with 2 complex multiplications each. Since each complex multiplication requires four real multiplications, the RDB can be executed in two clock cycles by reusing 64 multipliers. A further 32 multipliers are required for the PB, which takes 4 clock cycles. Hence 96 multipliers are required in the MISO configuration. For the MIMO configuration, the RDB requires reuse of 128 multipliers taking 2 clock cycles, and an additional 32 multipliers are required for the PB taking 4 clock cycles. Hence 160 multipliers are required for the MIMO configuration and the output is obtained at every 6th clock cycle as shown in Figure 16. The delay block in the figure represents the delay element introduced to buffer the values and produce the outputs at the same time instant.
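The multiplier counts quoted for the superscalar method follow from simple bookkeeping, which the sketch below reproduces. The function and constant names are hypothetical, and the model assumes 2 complex multiplications per hardware line per receive antenna in the decoding block, which is consistent with the 64 and 128 multiplier figures above.

```python
LINES = 16                    # hardware lines (one per QPSK symbol)
REAL_MULTS_PER_COMPLEX = 4    # direct complex multiplication
PB_MULTS = LINES * 2          # two squaring multipliers per line
PB_CYCLES = 4
RDB_CYCLES = 2                # decoding block multipliers are reused over 2 cycles

def superscalar_budget(n_tx, n_rx):
    """Return multiplier count and output latency (cycles) for an n_tx x n_rx setup."""
    if n_tx == 1:
        # SISO/SIMO: only squared-magnitude processing, one RPB set per receive antenna.
        return {"multipliers": PB_MULTS * n_rx, "cycles": PB_CYCLES}
    complex_mults = LINES * 2 * n_rx
    rdb_mults = complex_mults * REAL_MULTS_PER_COMPLEX // RDB_CYCLES
    return {"multipliers": rdb_mults + PB_MULTS, "cycles": RDB_CYCLES + PB_CYCLES}

if __name__ == "__main__":
    for name, cfg in [("SISO", (1, 1)), ("SIMO", (1, 2)), ("MISO", (2, 1)), ("MIMO", (2, 2))]:
        print(name, superscalar_budget(*cfg))
```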

Results and Discussion
The proposed receiver architectures for PCFICH in SISO, SIMO, MISO, and MIMO configurations are implemented using the Xilinx PlanAhead tool on the Virtex-6 FPGA xc6vlx240tff1156-1 device board. The target Virtex-6 device has only 768 DSP elements. Table 2 compares the performance of the proposed architectures using the folding and superscalar methods with the direct implementation of the PCFICH receiver, in terms of resource utilization, speed, and power, for the SISO, SIMO, MISO, and MIMO configurations. The proposed architectures based on the folding and superscalar processing methods require fewer resource elements.
In the folding approach, resource utilization is lower than in the direct and superscalar approaches at the cost of reduced speed of operation, but it is still suitable for real-time frame timings. When the LTE-A system operates at 1.4 MHz bandwidth, the maximum time available for detection at each subcarrier is 992.063 ns, since each slot of 0.5 ms duration in a frame (10 ms radio frame duration) consists of 7 OFDM symbols and there are 72 subcarriers along one OFDM symbol. The total delay in the receiver architecture is within the LTE time constraint. The dynamic power consumption is lower in the folding method than in the superscalar method due to the decrease in block arithmetic. The direct method does not require sequential execution and clocking, and hence its total power consumption is due to static power. Hence, it is inferred that the proposed architecture based on the folding method is more suitable for CFI detection. The simulation waveform of the proposed architecture based on the folding method is shown in Figure 17 for SISO, SIMO, MISO, and MIMO configurations.
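The 992.063 ns figure can be reproduced directly from the slot duration, the number of OFDM symbols per slot, and the number of subcarriers. The sketch below does this arithmetic and adds a hypothetical helper, `fits_budget`, which checks a cycle count against that budget for a clock period supplied by the caller; the clock period is an assumed input, not a value taken from the synthesis results.

```python
SLOT_DURATION_S = 0.5e-3   # one slot
SYMBOLS_PER_SLOT = 7       # OFDM symbols per slot
SUBCARRIERS = 72           # subcarriers at 1.4 MHz bandwidth

def per_subcarrier_budget_ns():
    """Time available per subcarrier, in nanoseconds."""
    return SLOT_DURATION_S / (SYMBOLS_PER_SLOT * SUBCARRIERS) * 1e9

def fits_budget(cycles, clock_period_ns):
    """True if `cycles` clock cycles of the given period fit within the budget."""
    return cycles * clock_period_ns <= per_subcarrier_budget_ns()

if __name__ == "__main__":
    print(round(per_subcarrier_budget_ns(), 3))
```

For example, `fits_budget(17, 10.0)` asks whether the 17 cycles of the folded design would fit at an assumed 10 ns clock period.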
A general architecture based on the folding method, which operates in all four SISO, SIMO, MISO, and MIMO configurations, has also been developed. In this architecture, a control variable is used to enable or disable the submodules SISO, SIMO, MISO, or MIMO according to the selection input "diversity." CFI is detected at every 17th clock cycle. The synthesis results of the general architecture based on folding show that it utilizes minimum resources in the XC6VLX240TFF1156-1 Virtex-6 device (768 DSPs). This is summarized in Table 3. Dynamic power consumption is due to internal switching contributed by the clock (246 mW), logic (670 mW), and the block arithmetic (103 mW).
Figure 19 shows the resource utilization graph, which gives the percentage of registers, lookup tables (LUTs), slices, DSP elements, and buffers used.
Figure 20 shows the implemented device in FPGA editor with the implemented components and interconnections between the components configured into the FPGA device.

Conclusion
In this paper, low complexity, low resource CFI detection for single- or multi-antenna receiver systems has been proposed and analyzed using ModelSim and implemented in the Virtex-6 device using the Xilinx PlanAhead tool. In the receiver, computational complexity and resource utilization are minimized by employing arithmetic operational rearrangement and a suboptimal sequential DSP algorithm called the folding approach. The proposed architecture using the folding method complies with the LTE frame timing constraint in SISO, SIMO, MISO, and MIMO configurations. It is a suitable solution for the area optimized hardware implementation of receiver structures for PCFICH. In future, a complete hardware design accommodating all the physical downlink control channels of 3GPP LTE-A with low resource utilization could be synthesized and implemented.

Figure 1: Block diagram of transmitter and receiver processing.

Figure 12: Multiplicands rearrangement for a single complex multiplication block.

Figure 15: Illustration of superscalar method for SISO and SIMO (with no complex multiplications and operating over clock cycles 1 to 4).

Figure 16: Illustration of superscalar method for MISO and MIMO (with complex multiplications and operating over clock cycles 1 to 6).

Figure 18: RTL schematic for combined PCFICH architecture with diversity.
Figure 14: Illustration of proposed architecture for RDB and PB in MISO and MIMO. Note: receiver decoding block (RDB) in MISO is termed as RDBM in MIMO.

Table 2: Performance of proposed architectures based on folding and superscalar method.

Table 3: Resource requirements of proposed architecture using folding method.
Figure 20: Implemented device in FPGA editor.