We propose strategies to achieve a high-throughput FPGA architecture for quasi-cyclic low-density parity-check codes based on circulant-1 identity matrix construction. By splitting the node processing operation in the min-sum approximation algorithm, we achieve pipelining in the layered decoding schedule without utilizing additional hardware resources. High-level synthesis compilation is used to design and develop the architecture on the FPGA hardware platform. To validate this architecture, an
The year 2020 is slated to witness the first commercial deployment of the 5th generation of wireless technology. 5G is expected to deliver a uniform Quality of Service (QoS) of 100 Mb/s and peak data rates of up to
Illustration of our research methodology for the design and development of the channel coding architecture.
To accomplish this, in addition to the use of FPGA-based implementation, we use a High-Level Synthesis (HLS) compiler built in
Illustration of the HLS compile flow.
QC-LDPC codes or their variants (such as accumulator-based codes [
The remainder of this article is organized as follows. Section
LDPC codes are a class of linear block codes that have been shown to achieve near-capacity performance on a broad range of channels. Invented by Gallager [
A Tanner graph where the variable nodes (VN), representing the code bits, are shown as circles and the check nodes (CN), representing the parity-check equations, are shown as squares. Each edge in the graph corresponds to a nonzero entry (
The first LDPC codes by Gallager [
The construction of QC-LDPC codes relies on an
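As an illustrative sketch of the circulant-1 construction (the shift values and dimensions below are hypothetical, not those of the code in this work), each base-matrix entry expands into a Z × Z submatrix: an entry of −1 expands into the all-zero matrix, and a shift value s ≥ 0 expands into the identity matrix with every row cyclically shifted by s.

```python
def circulant(shift, Z):
    """Expand one base-matrix entry into a Z x Z submatrix.

    shift == -1 yields the all-zero matrix; shift >= 0 yields the
    identity matrix with each row cyclically shifted by `shift`.
    """
    if shift == -1:
        return [[0] * Z for _ in range(Z)]
    return [[1 if c == (r + shift) % Z else 0 for c in range(Z)]
            for r in range(Z)]


def expand_base_matrix(B, Z):
    """Expand a base matrix of shift values into the full binary PCM H."""
    rows, cols = len(B) * Z, len(B[0]) * Z
    H = [[0] * cols for _ in range(rows)]
    for i, brow in enumerate(B):
        for j, s in enumerate(brow):
            sub = circulant(s, Z)
            for r in range(Z):
                for c in range(Z):
                    H[i * Z + r][j * Z + c] = sub[r][c]
    return H
```

For example, expanding a 2 × 2 base matrix with Z = 2 yields a 4 × 4 binary parity-check matrix.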
Base matrix of the QC-LDPC code, indexed by layers (rows) and blocks (columns); entries of −1 denote all-zero submatrices. [Numeric table entries were lost in extraction.]
LDPC codes can be suboptimally decoded using the BP method [
For
The steps of the scaled-MSA are given below.
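As a minimal sketch of the scaled min-sum check-node update (an illustration of the usual formulation; the scaling factor 0.75 is a typical choice, not necessarily the one used in this work), each check-to-variable message takes the product of the signs and the minimum of the magnitudes over all the *other* incoming variable-to-check messages, scaled by α:

```python
def scaled_msa_cn_update(vtc, alpha=0.75):
    """Scaled min-sum check-node update.

    vtc: variable-to-check messages arriving at one check node.
    Returns one check-to-variable message per edge: the sign product
    and minimum magnitude over all other incoming messages, times alpha.
    """
    n = len(vtc)
    ctv = []
    for i in range(n):
        others = [vtc[j] for j in range(n) if j != i]
        sign = 1
        for m in others:
            if m < 0:
                sign = -sign
        mag = min(abs(m) for m in others)
        ctv.append(alpha * sign * mag)
    return ctv
```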
It is well known that since the MSA is an approximation of the sum-product algorithm (SPA) [
The standard BP algorithm is based on the so-called
The HLS compiler in
Loop unrolling on FPGA platforms is a well-known compiler optimization used to exploit parallelism [
Ineffective loop unrolling. Shown on the left are representative schematics of the
Timing diagram before unrolling
Timing diagram after unrolling
However, if unrolling is performed only when it improves throughput, a trade-off between throughput and resource consumption can be achieved in the implementation. An illustrative example is provided in Figure
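Conceptually, unrolling by a factor U replicates the loop body so that U independent iterations can be scheduled concurrently on replicated hardware. The sketch below is purely illustrative (on the FPGA the transformation is applied by the compiler, not written by hand); it shows the replicated body and the remainder loop needed when the trip count is not a multiple of U:

```python
def saxpy_unrolled(a, x, y, factor=4):
    """Element-wise a*x + y with the loop body replicated `factor` times.

    On an FPGA each replicated body can map to its own multiplier and
    adder, so the `factor` operations per trip execute in parallel;
    here the replication only illustrates the transformation.
    """
    n = len(x)
    out = [0.0] * n
    i = 0
    while i + factor <= n:
        # unrolled body: `factor` independent operations per trip
        for k in range(factor):
            out[i + k] = a * x[i + k] + y[i + k]
        i += factor
    while i < n:  # remainder loop when n is not a multiple of `factor`
        out[i] = a * x[i] + y[i]
        i += 1
    return out
```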
Throughput improvement using access pattern analysis. Shown on (a) is the representative schematic of the
Algorithm description (application diagram)
Without access pattern analysis
With access pattern analysis
The memory access pattern analysis in
Loop unrolling may not be effective if the memory access speed cannot keep up with the data-throughput requirement set by the user. This is particularly true for processing-intensive applications like the ones studied and implemented in this work.
Memory blocks on modern FPGAs typically have only two ports, one of which is generally read-only. Implementing memories with more ports can become very resource intensive and can drastically reduce the clock rate of the design. The limited number of memory ports often causes accesses to be serialized. These serialized memory access requests often leave computational cores idle, thus reducing the system throughput [
In many applications, memory access is sequential and predictable. When multiple accesses to a memory can be computed in parallel, the values can be fetched together in a single wide access rather than as many separate narrower accesses. We refer to this as
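The idea can be sketched as packing several narrow values into one wide memory word, so that a single port access serves several consumers per cycle (illustrative sketch; the 8-bit lanes and 4-lane word width are hypothetical choices, not the widths used in this design):

```python
def pack_words(values, lanes=4, bits=8):
    """Pack `lanes` narrow values into each wide memory word."""
    mask = (1 << bits) - 1
    words = []
    for i in range(0, len(values), lanes):
        word = 0
        for k, v in enumerate(values[i:i + lanes]):
            word |= (v & mask) << (k * bits)
        words.append(word)
    return words


def unpack_word(word, lanes=4, bits=8):
    """Recover the narrow values from one wide word (one port access)."""
    mask = (1 << bits) - 1
    return [(word >> (k * bits)) & mask for k in range(lanes)]
```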
All of the above techniques have been successfully employed by
The authors would like to emphasize that the algorithmic compiler
To understand the high-throughput requirements for LDPC decoding, let us first define the decoding throughput
Let
Even though
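As a rough first-order model (our illustration, not necessarily the exact expression used in this work), the information throughput of an iterative decoder can be estimated from the number of information bits per codeword, the clock rate, the cycles consumed per iteration, and the iteration count:

```python
def decoder_throughput_mbps(info_bits, clock_mhz, cycles_per_iter, iterations):
    """First-order throughput estimate for an iterative decoder.

    One codeword carrying `info_bits` information bits is produced
    every `cycles_per_iter * iterations` clock cycles, so the
    information throughput (Mb/s for a clock in MHz) is
    info_bits * f_clk / (cycles_per_iter * iterations).
    """
    return info_bits * clock_mhz / (cycles_per_iter * iterations)
```

The model makes explicit why cutting cycles per iteration (e.g., via pipelining) or the iteration count directly scales throughput.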
As noted in Section
Careful observation reveals that, among (
To achieve linear complexity
Initialization: let … Comparison: for … [Algorithm listing lost in extraction; only the step labels survive.]
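A standard way to realize the comparison step in linear time is a single pass that tracks the smallest and second-smallest incoming magnitudes (a common min-sum implementation device; this sketch is our reconstruction, not the paper's exact listing):

```python
def two_minima(mags):
    """Single-pass computation of the two smallest magnitudes.

    Returns (min1, min2, idx), where min1 <= min2 and idx is the
    position of min1.  With these three values, every outgoing
    check-node magnitude is either min1 (for all edges except idx)
    or min2 (for edge idx), giving O(n) total complexity instead of
    recomputing a minimum per edge.
    """
    min1 = min2 = float("inf")
    idx = -1
    for i, m in enumerate(mags):
        if m < min1:
            min2, min1, idx = min1, m, i
        elif m < min2:
            min2 = m
    return min1, min2, idx
```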
The CN message computation given by (
An arbitrary submatrix
This implies that no CN in this set of
Arbitrary submatrix: a circulant permutation matrix, containing a single 1 in each row and each column with all other entries 0. [Full table not recoverable from extraction.]
In the flooding schedule discussed in Section
From the perspective of CN processing, two or more CNs can be processed at the same time (i.e., they are independent of each other) if they do not have one or more VNs (code bits) in common.
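This independence condition can be stated as a simple check on the corresponding rows of the parity-check matrix (minimal sketch; rows are given as dense 0/1 lists):

```python
def independent_checks(h_row_a, h_row_b):
    """Two check nodes may be processed in parallel iff their rows of
    the parity-check matrix share no variable node, i.e. no column
    holds a 1 in both rows."""
    return all(not (a and b) for a, b in zip(h_row_a, h_row_b))
```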
The row-layering technique used in this work essentially relies on the condition in Fact
Observing that
From the VN or column perspective,
Illustration of message passing in row-layered decoding in a section of the PCM
[Table illustrating the layer-by-layer message flow across blocks; entries not recoverable from extraction.]
The idea of parallelizing
Construction of … : for … set … if … [Algorithm listing lost in extraction; only fragments of the control structure survive.]
Let
The block index (shift) matrix
Block index matrix, indexed by layers and blocks. [Table entries, apart from scattered −1 values, were lost in extraction.]
Block shift matrix, indexed by layers and blocks. [Table entries, apart from scattered −1 values, were lost in extraction.]
Let
The compaction ratio
In the QC-LDPC code in our case study,
In Section
If a block column of
It follows directly by applying Fact
In other words,
To accomplish this, we rearrange the
Rearranged block index matrix, indexed by layers and blocks. [Table entries, apart from scattered −1 values, were lost in extraction.]
We call the set of layers
Block-level view of the pipeline timing diagram. (a) General case for a circulant-
Layer-level view of the pipeline timing diagram. (a) General case for a circulant-
Within a superlayer, while the LNPU processes messages for the blocks
It follows directly from the layer independence condition in Fact
Figure
Without loss of generality, the pipelining efficiency
For the case of pipelining, two layers are shown in Figure
Since two layers are processed in the pipeline at any given time,
Given a QC-LDPC code,
Choice of
In our work,
Layer-level view of the pipeline timing diagram for the GNPU and LNPU arrays when two NPU arrays are employed to process four layers. Due to the requirement of two NPU arrays, this method is inefficient compared to the two-layer pipelining method. Moreover, this method is not adopted for the implementation as the number of layers in a parallel run is limited by the number of ports in the shared memory.
High-level decoder architecture showing the
The techniques for improving throughput in an efficient manner, described in Section
To evaluate the proposed strategies for achieving high-throughput, we have implemented the scaled-MSA based decoder for the QC-LDPC code in the
We represent the input LLRs from the channel and the CTV and VTC messages with 6 signed bits and 4 fractional bits (10 bits in total). Figure
LDPC decoder IP FPGA resource utilization and throughput after mapping onto the Xilinx
| | |
---|---|---|
Device | | |
Throughput (Mb/s) | 337 | 608 |
FF (%) | 9.1 | 5.3 |
BRAM (%) | 4.7 | 6.4 |
DSP48 (%) | 5.2 | 5.2 |
LUT (%) | 8.7 | 8.2 |
BER performance comparison between uncoded BPSK (rightmost), rate = 1/2 LDPC with 4 iterations using fixed-point data representation (second from right), rate = 1/2 LDPC with 8 iterations using fixed-point data representation (third from right), and rate = 1/2 LDPC with 8 iterations using floating-point data representation (leftmost).
The clock rate selected in the HLS compiler generally determines the pipeline depth of each primitive operation. For example, a higher target clock rate results in a deeper pipeline, which requires more FPGA resources and a relatively longer compile time. Various
Performance and resource utilization comparison, after mapping onto the FPGA, for versions with varying number of cores of the QC-LDPC decoder implemented on the
Cores | 1 | 2 | 4 | 5 | 6 |
---|---|---|---|---|---|
Throughput (Mb/s) | 420 | 830 | 1650 | 2060 | 2476 |
Clock rate (MHz) | 200 | 200 | 200 | 200 | 200 |
Time to VHDL (min) | 2.08 | 2.08 | 2.08 | 2.02 | 2.04 |
Total compile (min) | | | | | |
Total slice (%) | 28 | 44 | 77 | 85 | 97 |
LUT (%) | 18 | 28 | 51 | 62 | 73 |
FF (%) | 10 | 16 | 28 | 33 | 39 |
DSP (%) | 5 | 11 | 21 | 26 | 32 |
BRAM (%) | 11 | 18 | 31 | 38 | 44 |
On account of the scalability and reconfigurability of the decoder architecture in [ , the multicore design features fixed latency of decoding the frames across all cores, time-staggered operation of cores, and tightly controlled execution of the round-robin serial-parallel-serial conversion process.
To validate the multicore decoder architecture, in this case study, we chose the
High-level system schematic illustrating the fixed latency, parallel processing of the decoder cores.
The multicore decoder was developed in stages. The first stage is the aforementioned pipelined decoder core to which additional cores were added incrementally as per the scheme depicted in Figure
Hybrid-ARQ (HARQ) is a transmission technique that combines Forward Error Correction (FEC) with ARQ. In HARQ, a suitable FEC code protects the data and error-detection code bits. In its simplest form, the FEC encoded packet—referred to as a Redundancy Version (RV) in this context—is transmitted as per the ARQ mechanism protocol. If the receiver is able to decode the data, it sends an acknowledgement (ACK) back to the transmitter. However, if it fails to recover the data, the receiver sends a negative acknowledgement (NAK) or retransmission request to the transmitter. In this scenario, the FEC simply increases the probability of successful transmission, thus reducing the average number of transmissions required in an ARQ scheme. HARQ has two modes of operation: Type-I and Type-II. In Type-I, a current retransmission is chase-combined [
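Chase combining can be sketched as LLR addition across retransmissions of the same coded bits (minimal illustration; buffer management, quantization, and scaling details are omitted):

```python
def chase_combine(llr_buffers):
    """Chase combining: retransmissions carry the same coded bits, so
    their received LLRs add coherently, raising the effective SNR seen
    by the decoder on each retransmission."""
    combined = list(llr_buffers[0])
    for buf in llr_buffers[1:]:
        combined = [c + b for c, b in zip(combined, buf)]
    return combined
```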
To study the performance of the two HARQ schemes (Type-I and Type-II), we have implemented a baseband bidirectional link with two transceiver
HARQ system schematic for one node. Overall, the system simulation uses two nodes (BS and UE).
At the receiver node, header bits are decoded and the RV combiner uses the information in the header to combine the received signal values for Type-I mode or Type-II mode. CRC values from the header and the decoded data are compared to generate a feedback for the initiator node. The feedback (
The HARQ system comprises subsystems that can be classified into two main categories based on the nature of the processing they perform. The
Realization of an example of
RTL block diagram: realization of an example of
Schematic depiction of the description of a
Schematic depiction of the description of a
The HARQ system has been implemented on the
Performance and resource utilization, after mapping onto the FPGA, for the HARQ system (that supports both Type-I and Type-II mode of operation) on the
Utilization | |
---|---|
Clock rate (MHz) | 80 |
Time to generate VHDL (min) | 5 |
Total slice (%) | 54 |
LUT (%) | 32 |
FF (%) | 19 |
DSP (%) | 12 |
BRAM (%) | 30 |
FER performance of Type-I and Type-II schemes. Note that the FER of Type-I and Type-II overlap as expected.
Throughput performance of Type-I and Type-II schemes.
On a host machine, a
A survey of the state of the art for channel code architectures and their implementation using HLS technology reveals that insightful work has been done on the topic. While there are a myriad of LDPC architecture designs implemented on the FPGA platform, here we restrict ourselves to the subset of those works that utilize HLS technology, and list some of the notable contemporary works in this category.
The performance of an implementation depends on a host of factors, such as the vendor-specific device(s) with its associated HLS technology and the type of channel code under consideration. Thus, the intent of the authors is not to claim an all-encompassing performance comparison demonstrating gains or losses with respect to each other, but to provide the reader with a qualitative survey of the state of the art. Table
Comparative survey of the state of the art. Note that, while there are multiple implementation case studies in [
Work | Andrade et al. [ | Pratas et al. [ | Andrade et al. [ | Scheiber et al. [ | This work |
---|---|---|---|---|---|
HLS Technology | | | | | |
Standard | | | | | |
LDPC Parameters | | | | | |
BP Decoding Schedule | flooding | flooding | flooding | layered | serial and layered |
Throughput (Mb/s) | 103.9 | 540 | 21 | 13.4 | 608 |
Decoding Iterations | 10 | 10 | 10 | 3 | 4 |
Development | n.a. | ~weeks | n.a. | n.a. | ~days |
FPGA Device | | | | | |
Fixed-point Precision (total bits) | 8 | n.a. | n.a. | n.a. | 10 |
Clock Rate (MHz) | 222.6 | 150 | 157 | 122 | 200 |
LUT (%) | 42.9 | n.a. | 41 | 3 | 8.2 |
FF (%) | 42.3 | n.a. | 36 | 2 | 5.3 |
BRAM (%) | 75.3 | n.a. | 67 | 20.9 | 6.4 |
DSP (%) | 3.8 | n.a. | 0 | 0 | 5.2 |
n.a.: not available (i.e. not reported in the cited work).
We use an HLS compiler that enables us, without expert-level hardware domain knowledge, to reliably prototype our research in a short amount of time. With techniques such as timing estimation, pipelining, loop unrolling, and memory inference from arrays,
Here, we briefly discuss the operation of the protocol-sensitive subsystems (Section
Flowchart showing the
Flowchart showing the
At the receiver, for the Type-I scheme of HARQ, the
In Section
Pipeline timing diagram from the block processing perspective for (a) the
The authors declare that there are no conflicts of interest regarding the publication of this paper.
The authors would like to thank the Department of Electrical and Computer Engineering, Rutgers University, NJ, USA, and the National Instruments Corporation, Austin, TX, USA, for their continual support for this research work.