Efficient FPGA Hardware Reuse in a Multiplierless Decimation Chain

In digital communications, an usual reception chain requires many stages of digital signal processing for filtering and sample rate reduction. For satellite on board applications, this need is hardly constrained by the very limited hardware resources available in space qualified FPGAs. This short paper focuses on the implementation of a dual chain of 14 stages of cascaded half band filters plus 2 : 1 decimators for complex signals (in-phase and quadrature) with minimal hardware resources, using a small portion of an UT6325 Aeroflex FPGA, as a part of a receiver designed for a low data rate command and telemetry channel.


Introduction
In digital receivers used in satellite applications [1], after conversion to an intermediate frequency (IF) and filtering, a common approach is to shift the signal to baseband [2].Since the IF spectrum is not symmetrical, the translation to baseband creates a complex signal [3] composed of inphase (I) and quadrature (Q) components; this complex signal is then band-limited by successive steps of filters and decimators before data detection.Many alternatives exist for filters and decimators, like cascaded integrator comb (CIC or Hogenauer) filters [4][5][6], cyclotomic polynomial filters [7][8][9], or polyphase decimators [10,11].
These algorithms are difficult to be implemented using space qualified FPGAs.These devices are designed to support high levels of ionizing radiation, composed by alpha particles, protons, and heavy ions, so they use technologies three or more times bigger in size details than current commercial FPGA devices, and main datasheet parameters are total accumulated radiation (TID) dose in krads and tolerance to single event upsets (SEU) and single event latchup (SEL) in MeV⋅cm 2 /mg; packages are also limited in size and weight to tolerate high accelerations and wide temperature operating ranges.
As a consequence of these constraints, space-qualified FPGAs are much slower and many times smaller in number of logical cells than commercial devices, and implementation efforts are directed to low complexity solutions [11,12], trying to avoid multipliers or complex arithmetic operations.
The design described in this paper is focused on the implementation of a block of two chains of 14 filters and decimators used for the reception of the low data rate (4 kbps) spaceflight tracking and data network command channel [13], transmitted from ground to a low earth orbit satellite (LEO), and the paper details the hardware improvements used for efficient filtering processes.

The Communications Link
A LEO satellite with a circular orbit at 700 km over the ground moves at high speed in reference to the earth station completing one orbit in 100 minutes.Each time it gets over the horizon, during the 6 minutes of line of sight time, it establishes a communication link with the ground station.The RF signal transmitted from earth is created by linear phase modulation of a 2.025 to 2.110 GHz carrier with a 16 kHz subcarrier, which is in turn BPSK modulated by

Building Blocks of Receiver
In the satellite receiver, the incoming RF signal is first filtered, amplified, and down-converted to a 70 MHz IF.Due to oscillators mismatch, thermal drifts, and Doppler, the carrier can be located with an offset of  = ±140 kHz from the central IF; therefore, the signal is filtered by a 500 kHz bandwidth SAW filter, much wider than the signal spectrum.As shown in Figure 1, this IF signal is undersampled at  = 6.222Msps ( is obtained by division of the FPGA 56 MHz oscillator XO, as  = XO/9). is related to the IF by a factor 11.25, so  (the lowest replica of the IF signal) becomes located at /4 = 1.555MHz with offset .Using a quarter rate translator (QRT) [14], this replica is translated to baseband, giving a complex signal (I0, Q0), yet shifted from 0 Hz by .
At this point the (I0, Q0) signals are decimated 8 : 1 to filter "out of band" noise and reduce sample rate.This decimation is performed by cascading 4 half band filters (HBFs) for each I/Q component (HBF is a lowpass filter with cut frequency in /4) with 2 : 1 decimators inserted in between; after F4 the sampling rate is 777.7 ksps, and noise over 194 kHz has been filtered.The signal is still affected by the  offset, so a CORDIC rotator (ROT) [15] driven by a numerically controlled oscillator (NCO) is used to shift it towards 0 Hz.
The rotator outputs I5/Q5 are fed to 14 stages of HBFs and decimators (referred as HBF6 to HBF19), with each filter running at half the rate of the previous one.The use of HBFs instead of CIC, polyphase, or other high-order filters is due to two facts: This point is the innovation described in this paper: the implementation of the complete chain of 14 cascaded complex filters and decimators HBF6 to HBF19 (the circuit area bounded by a dashed rectangle in Figure 1) as a very compact common block, using only one dual port RAM, one simple arithmetic block, one 15 bit-counter, and one control state machine.

The Selected Half Band Filter
When a signal sampled at  rate is decimated 2 : 1, the spectrum at /2 is folded over the origin.If the signals of interest (SOI) are located between 0 Hz and  SOI , a filter with high attenuation between (/2 −  SOI ) and (/2) must be applied before decimation.The architecture of the filter used, described in [16], is shown in Figure 2.
The transfer function of this filter in  domain is The Aeroflex UT6325 FPGA [17] has no multipliers, and the two products that seem to be needed (the  gain and the 0.5x multiplier) are replaced by shift and add operations.The filter behavior was evaluated for simple  values: for F1 to F3 and HBF6 to HBF19 the chosen gain  is 3/8 = 0.375 = 1/2 − 1/8, obtained by two fixed shifts and one subtraction; for F4 two cascaded HBFs (F4a/F4b) with gains  = 0.5 and  = 0.375 are used.The response for  = 0.375 (thin line) and the combined effect of F4a/F4b (thick line) are shown in Figure 3.
For gain 0.375 the filter provides a convenient behavior: (i) flat response within -0.1 dB from the origin to 0.15; (ii) attenuation greater than −39 dB from 0.4 to /2.
Besides it good filtering performance, the main reason to select this filter is that it can be easily synthesized as a sequential machine, using two temporary registers T and IX, and five RAM positions to store the input/output value FIO, and the four delay-line values B, C, D, and E.  The complete filter can be evaluated in 12 XO clock cycles, reserving two clock cycles for every addition, which leaves enough time for ripple carry propagation in the adder circuit.

Time Requirements to Implement Each Decimated Filter
In Section 3, it was written that each filter from HBF6 to HBF19 runs at half the rate of the previous one, and although each filter must be computed at its own rate to upgrade the delay-line values B, C, D, and E, only one half of its output values is used by the next filter, due to decimation.Since samples at I5/Q5 come at /8 = XO/72 rate, there are 72 XO clock cycles available to compute the whole chain.
If  cycles are needed for HBF6, the summation of time requirements for the dual chain of filters is 2 *  * (1 + 1/2 + 1/4 + 1/8 + 1/16 + ⋅ ⋅ ⋅ ) which is always less than 4 *  (this is the key feature), so if the available 72 cycles are divided by 4, it results that  = 18 cycles to solve each filter, a 50% margin over the time needed for the sequential solution as shown in Section 4.
In the common multiplexed circuit designed to evaluate the 2 × 14 = 28 filters, a 15-bit counter updated at XO/ rate is used to decide when to compute each filter.Calling "s" the LSB of the counter and using "x" to denote a "do not care" value, the time schedule is a function of the counter value: Table 1 shows how the different stages are computed as the counter increases; the grey shaded are those whose values are used in the next stage, and those not shaded are computed, but their output is ignored (decimated).It must be noted that the value 00000000000000s is unused, giving place to add more filters and decimators, if desired.

Hardware Proposal
The hardware used for solving the 28 filters is shown in Figure 5 and is composed of the following blocks.

Arithmetic Block (AB).
Together with the RAM, it includes the IX-and T-registers, one parallel adder, and some multiplexers.Although the inputs I5/Q5 to filter HBF6 are 14-bit wide (2 integer + 12 fractional), the arithmetic operations are solved with 17-bit precision, adding 3 guard bits to consider the filter gain and to avoid overflow problems.The block checks these 3 upper bits and saturates the output when it is over 12 bits.

Sequential Control Machine (SCM).
A simple circuit, based on a 5-bit counter, generates the RAM address for reading and writing within each page, the write strobe for the RAM, the enable for the registers, and the selection signals for the multiplexers.It is synchronized by a pulse coming from the rotator block.
RAM. RAM is a 256 × 16 dual port RAM built using two UT6325 RAM blocks, with asynchronous read.

Slot Time Counter (STC)
. STC a 15-bit binary counter updated at XO/18 rate, as described in Section 5.

First One Position (FOP).
It detects the position of the first "1" from right to left in the upper 14 bits of STC.Together with the LSB of the STC, the FOP output is used to address 28 pages of RAM, one for each filter; eight RAM positions are assigned to each page, although only five are used.Combined with the SCM it generates the strobes I8, Q9, I14/Q14, and I19/Q19 to identify when the filter computation is ready for different filters.FOP and STC have 18 XO cycles for update, so both circuits are implemented with serial arithmetic to minimize the use of hardware resources.

Compilation Results
The benefits of this design are evident when compilation reports are evaluated.The hardware proposed, compiled for an UT6325PQ208-

Comparison with Other Solutions
On [18], it is described a similar approach for a 3-tap filter, where different filters are multiplexed in time, but each filter is realized using distributed logic: 14 registers, 4 multipliers, 4 adders, and 11 multiplexers; this solution requires the redesign and growth of all multiplexers' size as the number of taps increases.No reports are given about effective implementation of [18] using an ASIC or FPGA.Instead, for the smaller regular structure shown in Figure 5 (5 MUX, 2 registers, one adder, no multiplier, and 16 bytes of RAM memory), the addition of a new decimation step only requires 16 new bytes of RAM, one new bit on STC, andsometimes-another bit on FOP.In fact, since each filter runs at one half the rate of the previous one, this same hardware can be used for a chain of half band filters plus 2 : 1 decimators as large as desired, provided that enough RAM is available.The solution showed on [19] for the filter is similar to the implementation proposed on this paper in what [19] calls combinatorial-sequential implementation; its synthesis results report a higher number of registers (246 instead of 99), for a single filter.However, since no filters are multiplexed on time, it cannot be compared in relation to RAM usage or control and scheduling complexity.
Automated solutions, like those obtained with DSP Builder [20] and Matlab/Silulink tools only look for performance and speed, and they make no efforts on hardware multiplexing and processing dead time usage as the decimation chain reduces the sampling rate.The number of registers, RAM and multipliers used by these solutions is huge, orders of magnitude higher than the solution proposed in this paper.
The main emphasis of this paper is on hardware reuse: at a comfortable speed, a complete chain of 28 second-order filters and decimators is implemented using only one parallel adder for the arithmetic operations, no multipliers, and a common RAM to store the state of the full set of filters; if the total hardware is divided by the 28 filters, FPGA requirements are less than 7 combinatorial cells and a mean of 3.5 flipflops to compute each second-order filter with 16 bits of precision.

Conclusion
Most DSP application reports put emphasis on new algorithms or levels of parallelism, with abstraction of real hardware requirements; it is a reasonable approach since modern commercial devices seem to have unlimited resources.However, this constraint is mandatory for space applications.
The key improvement of this solution is described in Section 5: instead of wasting processing speed capabilities running hardware for slower filters at slow speeds, this hardware is always running at the full speed to support all filters, and time slots assigned by the scheduler to slower filters are more distanced than those assigned to faster ones.
(a) the need to have access to intermediate decimated results.Different stages of the filter chain are used for different purposes as follows:(i) I8 (BW: 48.6 kHz), to recover 2x subcarrier frequency; (ii) Q9 (BW: 24.3 kHz), is used to recover BPSK modulated subcarrier, and the modulation data; (iii) I14/Q14 (BW: 759 Hz), are used by the frequency control loop, which in turn controls the NCO.This tap is selected because the frequency error after the sweep process is bounded to ±500 Hz; (iv) I19/Q19 (BW: 23.7 Hz), are used to control the phase error, adjusting the NCO to track the Doppler drift.(b)A common hardware structure simplifies hardware reuse.

Figure 4
Figure4shows a flow diagram of the filter, and the subscript letter on the T-register indicates the storage function fulfilled by T in successive steps of the sequential evaluation.The complete filter can be evaluated in 12 XO clock cycles, reserving two clock cycles for every addition, which leaves enough time for ripple carry propagation in the adder circuit.

Table 1 :Figure 5 :
Figure 5: Hardware resources for the complete filter.