High Real-Time Design of Digital Pulse Compression Based on FPGA

Because in-place fast Fourier transforms have poor real-time performance, a reconfigurable radix-4 FFT processor based on decimation-in-time and single-precision floating-point computation is studied and designed. The proposed method adopts a "pipeline and parallel" structure for accessing multiple memories to improve the FFT processing speed, and it is then applied to digital pulse compression. The experimental results show that the proposed radix-4 FFT can implement digital pulse compression rapidly without additional hardware resources. The proposed method can also be applied to FFTs of other radices.


Introduction
The concept of pulse compression dates from the Second World War. Because of technological difficulties, pulse compression signals were not applied to long-range surveillance and long-range tracking radars until the early 1960s. From the 1970s, as the theory matured and the technology improved, pulse compression became widely applied to 3D, phased array, reconnaissance, and fire control radars, among others, and the performance of these radars improved markedly. As novel technologies and new devices have progressively been adopted in radar systems, the capability of radar systems has improved considerably. In particular, the emergence of the fast Fourier transform (FFT) laid a solid foundation. High real-time pulse compression is now a hotspot in radar design [1][2][3]. Pulse compression is used to obtain high range resolution.
There are two methods for digital pulse compression (DPC): convolution in the time domain and matched filtering in the frequency domain [4]. In engineering design, DPC is mainly implemented in the frequency domain. The matched filtering of a linear frequency modulation (LFM) signal is realized as shown in Figure 1.
The procedure of matched filtering in the frequency domain is (1) using the FFT to transform the discrete time signal into a discrete spectrum, (2) multiplying by the frequency response function of the filter, and (3) using the IFFT to return to a time series, namely, obtaining the time domain signal of DPC.
Suppose that the transmitted signal is $s(t)$ and the corresponding signal spectrum is $S(f)$; thus the transfer function of the matched filter is $H(f) = S^{*}(f)$. Given that the received signal is $s_r(t)$, the output of pulse compression $y(t)$ is

$$y(t) = \mathrm{IFFT}\{\mathrm{FFT}[s_r(t)] \cdot S^{*}(f)\}. \tag{1}$$

It is seen that the FFT/IFFT is still the main concern in implementing DPC.
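The three-step frequency-domain procedure can be illustrated with a minimal Python sketch. It is not the FPGA design: a direct O(N^2) DFT stands in for the FFT hardware, and the chirp parameters, the 64-point frame, and the function names (`dft`, `pulse_compress`, `lfm_chirp`) are assumptions chosen only for illustration.

```python
import cmath

def dft(x, inverse=False):
    """Direct O(N^2) DFT/IDFT; a software stand-in for the FFT/IFFT hardware."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[i] * cmath.exp(sign * 2j * cmath.pi * i * k / n)
                  for i in range(n))
        out.append(acc / n if inverse else acc)
    return out

def pulse_compress(received, reference):
    """Matched filter in frequency: IFFT(FFT(received) * conj(FFT(reference)))."""
    n = len(received)
    ref = list(reference) + [0j] * (n - len(reference))  # zero-pad the reference
    spectrum = dft(received)
    matched = [v.conjugate() for v in dft(ref)]           # H(f) = S*(f)
    return dft([a * b for a, b in zip(spectrum, matched)], inverse=True)

def lfm_chirp(length, bandwidth_fraction=0.5):
    """Hypothetical unit-amplitude baseband LFM pulse (normalized frequency)."""
    rate = bandwidth_fraction / length
    return [cmath.exp(1j * cmath.pi * rate * (t - length / 2) ** 2)
            for t in range(length)]

# Place a 16-sample chirp at delay 20 inside a 64-sample receive window.
chirp = lfm_chirp(16)
delay, n = 20, 64
received = [0j] * n
for i, v in enumerate(chirp):
    received[delay + i] = v

out = pulse_compress(received, chirp)
peak = max(range(n), key=lambda i: abs(out[i]))
print(peak)  # index of the compressed peak, equal to the delay
```

In this noiseless example the compressed output peaks at the sample index equal to the echo delay, with a magnitude equal to the pulse energy.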
In the early days, general-purpose high-speed signal processors were mainly adopted to implement DPC, but this approach has gradually been phased out as new radar systems demand high-resolution real-time processing [5]. Meanwhile, programmable hardware logic has become increasingly mature, and an FPGA (field-programmable gate array) [6,7] can implement DPC while meeting precision requirements and increasing speed.
Usually, FFT processors adopt the constant geometry [8], pipeline [9], or in-place [10,11] structure. The constant geometry FFT requires double the memory. Pipeline architectures have high throughput, but they consume a large amount of resources. Therefore, the in-place architecture is employed to implement the FFT/IFFT, which is the main computation in DPC. A reconfigurable FFT processor based on a "pipeline and parallel" structure is proposed, and it is applied to DPC.

Basic Theory
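Given a length-$N$ complex sequence $x(n)$, $n = 0, 1, \ldots, N-1$, its DFT $X(k)$ is also a length-$N$ complex sequence defined by

$$X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{2}$$

where $W_N = e^{-j2\pi/N}$ and $j = \sqrt{-1}$. In the proposed algorithm, the sequence length $N$ is a composite number equal to $LM$; therefore, $n$ can be expressed as

$$n = M n_1 + n_0, \tag{3}$$

where $n_1 = 0, 1, \ldots, L-1$ and $n_0 = 0, 1, \ldots, M-1$. $n_0$ and $n_1$ are the column and row indices, respectively, and $M$ and $L$ are the numbers of columns and rows, respectively. In a similar way, the frequency index $k$ of the output sequence is expressed as

$$k = L k_1 + k_0, \tag{4}$$

where $k_1 = 0, 1, \ldots, M-1$ and $k_0 = 0, 1, \ldots, L-1$. $k_1$ indexes the columns and $k_0$ indexes the rows. Equation (2) can be rewritten as

$$X(L k_1 + k_0) = \sum_{n_0=0}^{M-1} \sum_{n_1=0}^{L-1} x(M n_1 + n_0) W_N^{(M n_1 + n_0)(L k_1 + k_0)}. \tag{5}$$

Since $W_N^{ML n_1 k_1} = 1$, $W_N^{M n_1 k_0} = W_L^{n_1 k_0}$, and $W_N^{L n_0 k_1} = W_M^{n_0 k_1}$, we have

$$X(k_1, k_0) = \sum_{n_0=0}^{M-1} \left[ W_N^{n_0 k_0} X_{n_0}(k_0) \right] W_M^{n_0 k_1}, \tag{6}$$

where

$$X_{n_0}(k_0) = \sum_{n_1=0}^{L-1} x(n_1, n_0) W_L^{n_1 k_0}. \tag{7}$$

From the above, $x(n)$ can be mapped to $x(n_1, n_0)$; that is, $x(n)$ can be decomposed into $M$ sets of data with $L$ points each. The data of the $M$ sets are independent of each other.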
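The decomposition can be checked numerically with a short Python sketch. The names are assumptions made for this illustration: `num_sets` is the number of independent sets, `points` is the length of each set, the input index is split as n = num_sets * n1 + n0, and the output index as k = points * k1 + k0. A direct DFT stands in for the hardware FFT.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT used as the reference transform."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]

def two_stage_dft(x, num_sets):
    """Length-N DFT from num_sets independent sub-DFTs plus a combining stage."""
    n = len(x)
    points = n // num_sets
    assert points * num_sets == n, "length must be divisible by num_sets"
    # Stage 1: set n0 holds x[n0], x[n0 + num_sets], ...; the sets are
    # independent, so each one gets its own points-long DFT.
    sub = [dft(x[n0::num_sets]) for n0 in range(num_sets)]
    # Stage 2: for every k0, apply the twiddle W_N^(n0*k0) and take a
    # num_sets-point DFT across the sets (one radix-4 butterfly if num_sets=4).
    out = [0j] * n
    for k0 in range(points):
        column = [sub[n0][k0] * cmath.exp(-2j * cmath.pi * n0 * k0 / n)
                  for n0 in range(num_sets)]
        for k1, value in enumerate(dft(column)):
            out[points * k1 + k0] = value
    return out

# Small-scale check: 16 points split into 4 sets of 4 points.
samples = [complex(i, -0.5 * i) for i in range(16)]
reference = dft(samples)
result = two_stage_dft(samples, 4)
max_error = max(abs(a - b) for a, b in zip(reference, result))
print(max_error < 1e-9)  # True
```

With 4096 points and four sets, stage 1 corresponds to four 1024-point FFTs and stage 2 to 1024 radix-4 butterflies.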

Proposed FFT Design
3.1. Structure. A 4096-point radix-4 FFT is mainly discussed. The design is based on a "pipeline and parallel" structure. In the proposed design, only a single radix-4 butterfly unit is needed; meanwhile, four dual-port memories are used to reduce the processing time.
Before discussing the structure, a counter should be designed. It is kept consistent with the input data: whenever one datum is input, the counter is incremented by 1. So the counter ranges from 0 to 4095.
First, the 4096 input data should be distributed into the four memories. According to the counter value modulo 4, the 4096 input data are assigned to four memories of depth 1024, that is, RAM0∼RAM3. Then each set of 1024 data can be computed by a radix-4 FFT. Because the four sets of data are independent of each other according to (6), they can be computed in parallel. At the same time, the four sets undergo the same computation; therefore, the four FFTs can be implemented in a pipeline structure. The pipeline structure is shown in Figure 2, and only one radix-4 butterfly is needed to implement the four 1024-point FFTs.
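The distribution step can be sketched in Python as follows. The bank choice by the counter modulo 4 comes from the text; the write address `counter // 4` and the function name `distribute` are assumptions made for this sketch.

```python
NUM_BANKS = 4  # RAM0..RAM3

def distribute(samples):
    """Write sample number `counter` to bank counter % 4 (address assumed counter // 4)."""
    depth = len(samples) // NUM_BANKS
    banks = [[None] * depth for _ in range(NUM_BANKS)]
    for counter, sample in enumerate(samples):
        banks[counter % NUM_BANKS][counter // NUM_BANKS] = sample
    return banks

banks = distribute(list(range(4096)))
print([len(bank) for bank in banks])  # [1024, 1024, 1024, 1024]
print(banks[1][:4])                   # [1, 5, 9, 13]
```

Each bank then holds every fourth input sample, which is exactly the data set of one 1024-point FFT.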
Detailed explanation of each unit is as follows.
(i) RAM0∼RAM3: four memories are used for storing the 4096 data. For high precision, the data are stored in single-precision floating-point format. Therefore, the storage size of each memory is 64 Kbits (1024 complex words of 2 × 32 bits).
(ii) Control unit for memories: it controls the order in which the memories are accessed.
(iii) Cache unit 1: its function is to cache the data loaded in pipeline from each memory. Before the data are computed, they must be presented in parallel.
(iv) Radix-4 butterfly unit: it is the basic radix-4 unit [12] and computes in decimation-in-time. Considering hardware implementation, the butterfly unit mainly consists of floating-point adders and multipliers [13].
Of the four operands, the first one does not need a multiplier, so there are three complex multipliers.
(v) Cache unit 2: its function is to cache the parallel results from the radix-4 butterfly unit. Because the outputs are stored back in pipeline, one cache unit is needed to hold them.
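The arithmetic of a radix-4 decimation-in-time butterfly with three twiddle multiplications can be sketched in Python. This is a software illustration of the standard butterfly equations, not the floating-point datapath of the design; the function names are hypothetical.

```python
def mul_neg_j(z):
    """Multiply by -j using only a swap and a sign change (no multiplier)."""
    return complex(z.imag, -z.real)

def radix4_dit_butterfly(a0, a1, a2, a3, w1, w2, w3):
    """Radix-4 DIT butterfly: twiddle three operands, then add/subtract."""
    # Only three complex multiplications: the first operand is used as is.
    t1, t2, t3 = a1 * w1, a2 * w2, a3 * w3
    s02, d02 = a0 + t2, a0 - t2
    s13, d13 = t1 + t3, t1 - t3
    rot = mul_neg_j(d13)  # -j * (t1 - t3)
    return (s02 + s13, d02 + rot, s02 - s13, d02 - rot)

# With unit twiddles the butterfly is a plain 4-point DFT.
print(radix4_dit_butterfly(1, 2, 3, 4, 1, 1, 1))
# ((10+0j), (-2+2j), (-2+0j), (-2-2j))
```

The multiplications by ±j reduce to swapping real and imaginary parts with a sign change, which is why only adders are needed after the three twiddle multipliers.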
After the computation above, there is another stage to compute. In this stage, the four memories are accessed in parallel and the same butterfly unit is used. The structure is shown in Figure 3. After 1024 radix-4 computations, the data are the results of the 4096-point FFT.
Detailed explanation of each unit is as follows. Before being input into the butterfly unit, four data are accessed from one memory. Then, the four data read in pipeline are cached and output in parallel. Last, the four parallel data are input into the butterfly unit. The whole computing process of the butterfly unit is as follows.
(1) According to the accessing addresses of the operands and the twiddle factors [14], the operands and the twiddle factors can be obtained. The four operands can be represented by Op (0)
The timing diagram between the input and the output of the radix-4 butterfly unit is shown in Figure 4.
From the timing diagram above, three cycles are idle in one radix-4 butterfly computation.
In the proposed method, these idle cycles are put to use, so one butterfly unit serves the four 1024-point FFTs. It is important to arrange the order of loading data from the four sets, and the corresponding timing diagram is shown in Figure 5.
Detailed explanations are as follows.

Reconfigurable FFT Design
Because DPC needs IFFT computation, a configurable FFT is designed, which can be configured as an FFT or an IFFT.
According to the relationship between the FFT and the IFFT, the designed FFT can be configured as an IFFT. The inverse transformation is

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) W_N^{-nk}, \quad n = 0, 1, \ldots, N-1. \tag{8}$$

From (2) and (8), the main differences are the coefficients, which are conjugated, and the scaling of the results by $1/N$.
Therefore, the FFT is modified as follows. First, the real and imaginary parts of the input data are exchanged and the FFT is computed. Then, the real and imaginary parts of the output are exchanged and the result is divided by $N$. In this way the IFFT results are obtained.
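The swap-based reconfiguration can be verified with a small Python sketch. A direct DFT stands in for the FFT core, and the names `transform`, `swap`, and the boolean `forward` flag (playing the role of a mode control signal) are assumptions for illustration.

```python
import cmath

def fft(x):
    """Direct DFT used here as a stand-in for the forward FFT core."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]

def swap(z):
    """Exchange the real and imaginary parts of a complex value."""
    return complex(z.imag, z.real)

def transform(x, forward=True):
    """Forward FFT, or IFFT obtained from the same FFT by swapping parts."""
    if forward:
        return fft(x)
    n = len(x)
    y = fft([swap(v) for v in x])   # swap the inputs, reuse the forward FFT
    return [swap(v) / n for v in y]  # swap the outputs, then scale by 1/N

signal = [complex(i, 2 - i) for i in range(8)]
round_trip = transform(transform(signal, forward=True), forward=False)
error = max(abs(a - b) for a, b in zip(signal, round_trip))
print(error < 1e-9)  # True
```

The trick works because exchanging real and imaginary parts equals multiplying the conjugate by j, so conjugating the twiddle factors is never needed.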
Therefore, a control signal is set. When it is high, FFT computation is done; when it is low, IFFT computation is done. Thus, the proposed FFT processor is set as a reconfigurable FFT.

DPC Design
To meet the demands of the actual engineering application, a 4096-point DPC must be processed. The DPC process is to compute the FFT, multiply by the coefficients, and finally compute the IFFT. A radix-4 FFT is used to compute the 4096-point FFT, and four memories are accessed in the "pipeline and parallel" structure. First, the 1024-point FFTs are implemented by one radix-4 butterfly unit, with the four sets processed in pipeline; then the four memories are accessed in parallel. The results of the 4096-point FFT are stored in the RAMs in reversed order.
Second, the results of the above 4096-point FFT are read from the memories and multiplied by the matched filter coefficients, and the products are stored in the memories in pipeline. (In the example, a Hamming window is used for the matched filter coefficients.) Last, the IFFT is applied to the data after multiplication with the coefficients. This step uses the reconfigurable FFT to compute the IFFT, followed by division by 4096. Because floating-point arithmetic is used, the division can be replaced by subtracting 12 from the exponent.

Table 2: Resources of the FFT processors.
FFT processors         Number of FFT points    LUTs
Pipeline FFT [9]       4096                    16561
FFT design [15]        4096                    12467
The proposed design    4096                    10427
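Replacing the division by 4096 with an exponent subtraction can be illustrated on IEEE 754 single-precision values in Python. This is a sketch assuming normalized inputs whose results stay normalized; zero is passed through, and other special cases are rejected rather than modeled. The function name is hypothetical.

```python
import struct

def div_by_4096_float32(value):
    """Divide a float32 by 4096 = 2**12 by subtracting 12 from its exponent field."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    if bits & 0x7FFFFFFF == 0:
        return value  # +0.0 or -0.0 stays zero
    exponent = (bits >> 23) & 0xFF
    if exponent <= 12 or exponent == 0xFF:
        # Denormal result, infinity, or NaN: outside the scope of this sketch.
        raise ValueError("value not supported by the exponent shortcut")
    bits -= 12 << 23  # the sign and mantissa bits are untouched
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(div_by_4096_float32(4096.0))  # 1.0
print(div_by_4096_float32(-3.5) == -3.5 / 4096)  # True
```

Since 4096 is a power of two, the mantissa is unchanged and no divider or multiplier is required for the final IFFT scaling.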

Implementations and Analyses
An FPGA is a large-scale programmable logic device with flexible logic cells, high integration, short development time, and low development cost; it is widely used in prototype design and in the early stages of new product development. A Xilinx FPGA is used to verify the proposed FFT and DPC. The timing simulations are done in ModelSim, and the resources are estimated for the Xilinx Virtex-6 device XC6VLX240T (speed grade −1).

Implementation of the FFT Processor.
The processing times and resources of the proposed FFT are listed in Tables 1 and 2, respectively. From Table 1, the computation time of the proposed FFT is shorter than that of the compared methods. If the 4096-point FFT is implemented with one memory and one butterfly, the processing time is 24259 clock cycles, which is about 4 times that of the proposed scheme. The running times of the other two methods, which are based on the in-place architecture although on different platforms, are longer than that of the proposed one. Therefore, the proposed design, which adopts "pipeline and parallel" memory access to implement the FFT, indeed reduces the processing time of the in-place architecture. From Table 2, the proposed design uses fewer LUTs than the two compared methods. Because the accessing addresses for the operands and the twiddle factors are generated for a 1024-point FFT, not for a 4096-point FFT, resource waste is reduced. Therefore, the proposed method with the "pipeline and parallel" structure not only uses fewer resources but also computes the FFT faster, and the above results show its advantage.

Implementation of DPC.
The resources of DPC are listed in Table 3. The maximum frequency is 122.128 MHz and satisfies the engineering demand.
The timing simulation of DPC is shown in Figure 7. The data source of 4096 numbers is generated with Matlab, and the matched filtering coefficients are stored in ROM. Figure 7 shows that the FFT or the IFFT takes 6292 cycles each, and the multiplication with the coefficients takes 4110 cycles. So the total number of processing cycles of DPC is 2 × 6292 + 4110 = 16694. If the clock frequency is set to 100 MHz, the running time of the 4096-point DPC computation is 166.94 μs. Table 4 shows the processing time of DPC with different schemes. If one butterfly is used with in-place structure and one memory, the running time is 52656 clock cycles, and if Xilinx FFT IP core is used to compute FFT and IFFT in DPC, the
Because the processing delays of the operations are not the same in the first two schemes, their processing times differ. As the processing delay of the arithmetic operations of the proposed scheme is the same as that of the first one, the results show that the proposed method takes less processing time than the first one. Therefore, the proposed FFT processor can improve the processing speed of DPC and has advantages in engineering.
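The cycle budget quoted above can be reproduced with a few lines of Python. The cycle counts and the 100 MHz clock are taken from the text; the variable and function names are only illustrative.

```python
FFT_CYCLES = 6292          # one FFT or one IFFT
MULTIPLY_CYCLES = 4110     # multiplication with the matched filter coefficients
SINGLE_MEMORY_CYCLES = 52656  # one butterfly, in-place, one memory
CLOCK_HZ = 100e6

def dpc_cycles():
    """Total DPC cycles: FFT + coefficient multiplication + IFFT."""
    return 2 * FFT_CYCLES + MULTIPLY_CYCLES

def cycles_to_microseconds(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles / clock_hz * 1e6

total = dpc_cycles()
print(total)                                     # 16694
print(round(cycles_to_microseconds(total), 2))   # 166.94
print(round(SINGLE_MEMORY_CYCLES / total, 2))    # 3.15
```

The last line gives the ratio between the single-memory, single-butterfly scheme and the proposed scheme in clock cycles.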

Conclusions
The proposed FFT structure based on "pipeline and parallel" access can improve processing speed and can be applied to DPC design for single-channel data. According to the processing flow of DPC, the FFT and IFFT stages adopt the proposed FFT structure to achieve high speed with fewer resources, so the DPC system can obtain high real-time performance. Meanwhile, the proposed method can also be applied to multiple-channel data. The data of each channel can be stored in one memory, so that the multiple sets of multiple-channel data are stored in multiple memories.
Furthermore, the proposed approach can be used for FFTs of other radices and for simple mixed-radix FFTs. When a radix-$r$ FFT adopts the scheme, usually $r$ memories are used, the data are divided among the $r$ memories, and a single radix-$r$ butterfly is employed. A simple mixed-radix FFT, for example, a radix-2/4 FFT, can also use the proposed method. The data can be stored in four memories, so four memories are available for the radix-2 butterfly computation, and two radix-2 butterflies can be computed in parallel. Because a radix-4 butterfly can be reconfigured into two radix-2 butterflies, a radix-2/4 FFT needs only one radix-4 butterfly. Therefore, the proposed method can have wide applications.