Flexible channel decoding is gaining significance as the number of wireless standards, and of modes within each standard, increases. A flexible channel decoder provides inter-standard and intra-standard support without changes in hardware. However, the efficient implementation of flexible low-density parity-check (LDPC) code decoders that satisfy area, speed, and power constraints is a challenging task and still requires considerable research effort. This paper provides an overview of the state of the art in the design of flexible LDPC decoders. The published solutions are evaluated at two levels of architectural design: the processing element (PE) and the interconnection structure. A qualitative and quantitative analysis of different design choices is carried out, and a comparison is provided in terms of achieved flexibility, throughput, decoding efficiency, and area (power) consumption.
With the word flexibility regarding channel decoding, we mean the ability of a decoder to support different types of codes, enabling its usage in a wide variety of situations. Much research has been done in this direction following the great increase in the number of standards, standard complexity, and code variety witnessed in recent years. Next-generation wireless standards such as DVB-S2 [
This work gives an overview of the most remarkable techniques in the context of flexible channel decoding. We discuss the design and implementation of the two major functional blocks of flexible decoders: the processing element (PE) and the interconnection structure. Various design choices are analyzed in terms of achieved flexibility, performance, design novelty, and area (power) consumption.
The paper is organized as follows. Section
LDPC codes [
Next-generation wireless communication standards adopt structured LDPC codes, which have good interconnection, memory, and scalability properties at the decoder implementation level. In these codes, the parity check matrix
The nature of LDPC decoding algorithms is mainly iterative. Most of these algorithms are derived from the well-known belief propagation (BP) algorithm [
As given in (
Check node update for LDPC decoding algorithms.
Algorithm  Formulation
SP
MS
OMS
NMS
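As an illustration of the rules named in the table, the following Python sketch gives the textbook forms of the four check node updates (sum-product, min-sum, offset min-sum, normalized min-sum). It is a minimal sketch and not the formulation of any specific decoder surveyed here: the function names, the `offset` and `scale` parameters, and their example values are assumptions chosen for illustration.

```python
import math


def cn_update_sp(msgs):
    """Sum-product (tanh rule): extrinsic output for every incoming edge."""
    out = []
    for i in range(len(msgs)):
        prod = 1.0
        for j, m in enumerate(msgs):
            if j != i:
                prod *= math.tanh(m / 2.0)
        # Clip so that atanh stays finite for very reliable inputs.
        prod = max(min(prod, 1.0 - 1e-12), -(1.0 - 1e-12))
        out.append(2.0 * math.atanh(prod))
    return out


def cn_update_ms(msgs, offset=0.0, scale=1.0):
    """Min-sum family: sign product times the minimum magnitude of the other edges.

    offset > 0 gives offset min-sum (OMS), scale < 1 gives normalized
    min-sum (NMS), and the defaults give plain min-sum (MS).
    """
    out = []
    for i in range(len(msgs)):
        others = msgs[:i] + msgs[i + 1:]
        sign = -1.0 if sum(m < 0 for m in others) % 2 else 1.0
        mag = min(abs(m) for m in others)
        out.append(sign * scale * max(mag - offset, 0.0))
    return out


if __name__ == "__main__":
    incoming = [2.0, -0.5, 1.5]          # illustrative VN-to-CN messages (LLRs)
    print("SP :", cn_update_sp(incoming))
    print("MS :", cn_update_ms(incoming))
    print("OMS:", cn_update_ms(incoming, offset=0.25))
    print("NMS:", cn_update_ms(incoming, scale=0.75))
```

The min-sum variants replace the transcendental functions of the sum-product rule with comparisons, and the offset or scaling factor compensates for the resulting overestimation of the message magnitude.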
Modifying the VN update rule (
After the CN update is finished for one block row, the results are immediately used to update the VNs, whose results are then used to update the next layer of check nodes. Therefore, updated information is available to the CNs at each subiteration. Based on the same concept, the authors in [
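The layered schedule can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: each layer is a single check row rather than a block row, the check node rule is plain min-sum, and the function name `layered_min_sum`, the row-list representation of the parity check matrix, and the (7,4) Hamming example are hypothetical choices, not taken from any of the surveyed decoders.

```python
def layered_min_sum(h_rows, llr, max_iter=10):
    """Layered min-sum decoding.

    h_rows: list of layers, each a list of the VN indices checked by that CN.
    llr:    channel LLRs, positive meaning bit 0.
    Returns the hard decisions and the final posterior LLRs.
    """
    post = list(llr)                              # posterior LLR per VN
    r = [[0.0] * len(row) for row in h_rows]      # stored CN-to-VN messages
    bits = [1 if x < 0 else 0 for x in post]
    for _ in range(max_iter):
        for layer, row in enumerate(h_rows):
            # VN-to-CN messages: remove this layer's previous contribution.
            q = [post[v] - r[layer][k] for k, v in enumerate(row)]
            for k, v in enumerate(row):
                others = q[:k] + q[k + 1:]
                sign = -1.0 if sum(m < 0 for m in others) % 2 else 1.0
                r[layer][k] = sign * min(abs(m) for m in others)
                # The refreshed posterior is visible to the next layer at once.
                post[v] = q[k] + r[layer][k]
        bits = [1 if x < 0 else 0 for x in post]
        if all(sum(bits[v] for v in row) % 2 == 0 for row in h_rows):
            break                                 # all parity checks satisfied
    return bits, post


if __name__ == "__main__":
    # (7,4) Hamming code used only as a small test case.
    hamming = [[0, 2, 4, 6], [1, 2, 5, 6], [3, 4, 5, 6]]
    noisy = [-1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]   # all-zero codeword, bit 0 unreliable
    print(layered_min_sum(hamming, noisy)[0])
```

Because each layer reads posteriors already refined by the previous layers of the same iteration, fewer iterations are typically needed than with a two-phase schedule.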
The standard TPMP algorithm described in the previous section exploits the bipartite nature of the Tanner Graph: since no direct connection is present between nodes of the same kind, all CN (or VN) operations are independent from each other and can be performed in parallel. Thus, a first broad classification of LDPC decoders can be done in terms of the degree of parallelism. The hardware implementation of LDPC decoders can be serial, partially parallel, and fully parallel.
Serial LDPC decoder implementation is the simplest in terms of area and routing. It consists of a single check node, a single variable node, and a memory. The variable nodes are updated one at a time, and then the check nodes are updated in a serial manner. Maximum flexibility can be achieved by uploading new check matrices into memory. However, each edge of the graph must be handled separately: as a result, throughput is usually very low and insufficient for most standard applications.
A fully parallel architecture is the direct mapping of the Tanner graph to hardware. All node operations (CNs and VNs) are directly realized in hardware PEs and connected through dedicated links. This results in huge connection complexity that in extreme cases dominates the total decoder area and causes severe layout congestion; in theory, however, the maximum throughput can be reached. In [
Hardware implementation of LDPC decoders is mainly dictated by the nature of application. LDPC codes have been adopted by a number of communication (wireless, wired, and broadcast) standards and storage applications: a few of them are briefly summarized in Table
LDPC codes applications.
Application  Standard  Code length  Code rates  Throughput 

WMAN  IEEE 802.16e  576–2304  1/2–5/6  70 Mb/s 
WLAN  IEEE 802.11n  648–1944  1/2–5/6  450 Mb/s 
Broadcast  DVB-S2  16200, 64800  1/4–9/10  90 Mb/s 
Wired  10GBASE-T  2048  Arbitrary  6.4 Gb/s 
In the wireless communication domain, LDPC codes are adopted in IEEE 802.16e WiMAX, a wireless metropolitan area network (WMAN) standard, and in IEEE 802.11n WiFi, a wireless local area network (WLAN) standard. Both standards have adopted LDPC codes as an optional channel coding scheme with various code lengths and code rates. LDPC codes are also used in the digital video broadcast via satellite (DVB-S2) standard, which requires very large code lengths of 64800 bits and 16200 bits with 11 different code rates, and a 90 Mb/s decoding throughput. In the wireline communication domain, LDPC codes are adopted in the 10 Gbit Ethernet copper (10GBASE-T) standard, which specifies a high code rate LDPC code with a fixed code length of 2048 bits and a very high decoding throughput of 6.4 Gb/s.
There is no standard for magnetic recording hard disks; however, they demand high code rates, a low error floor, and high decoding throughput. In [
The varied nature of applications makes the selection of a suitable hardware platform an important choice. Typical platforms for LDPC decoder implementation include programmable devices (e.g., microprocessors, digital signal processors (DSPs), and application-specific instruction set processors (ASIPs)), customized application-specific integrated circuits (ASICs), and reconfigurable devices (e.g., FPGAs).
General purpose microprocessors and DSPs use their strong programmability to achieve highly flexible LDPC decoding, allowing various code parameters to be modified at run time. Programmable devices are often used in the design, test, and performance comparison of decoding algorithms. However, they usually contain a limited number of PEs that execute in a serial manner, which greatly limits the available computational power. An LDPC decoder implemented on TMS320C64xx could yield 5.4 Mb/s throughput running at 600 MHz [
Reconfigurable hardware platforms like FPGAs are widely used for several reasons. First, they speed up the empirical testing phases of decoding algorithms, which are not possible in software. Second, they allow rapid prototyping of the decoder. Once verified, the algorithm can be deployed on the same reconfigurable hardware. FPGAs also allow easy handling of different code rates and SNRs, power requirements, block lengths, and other variable parameters. However, FPGAs are suited to datapath-intensive designs and have a programmable switch matrix (PSM) optimized for local routing. High parallelism and the intrinsically low adjacency of the parity check matrix lead to long and complex routing, not fully supported by most FPGA devices. Some designs [
Customized ASICs are a typical choice that yields a dedicated, high-performance IC. ASICs can be used to fulfill the high computational requirements of LDPC decoding, delivering very high throughputs with reasonable parallelism. The resulting IC usually meets area, power, and speed metrics. However, ASIC designs are limited in their flexibility and usually intended for single-standard applications only: flexibility, if reached at all, comes at the cost of very long design time and nonnegligible area, power, or speed sacrifices. An alternative or parallel approach is the usage of ASIPs, which largely overcome the limitations of general purpose microprocessors and DSPs. A fully customized instruction set, pipeline, and memory achieve efficient, high-performance decoding: ASIP solutions are able to provide inter- and intra-standard flexibility through limited programmability, guaranteeing average to high throughput.
A partially parallel architecture becomes mandatory to realize flexible LDPC decoding. Generally, the functional description of a generic LDPC decoder can be broken down into two parts:
node processors;
interconnection structure.
A partially parallel decoder with parallelism P consists of P node processors, while an interconnection structure allows various kinds of message passing according to the implemented architecture. The datapath can be optimized according to the decoding schedule, that is, TPMP or layered decoding. Figure
Generalized datapath of LDPC Decoder. (a) TPMP Decoding, (b) Layered Decoding.
In the TPMP structure depicted in [
The layered decoding datapath described in [
In both datapath architectures described above, assignment of PEs to nodes (VNs and CNs) is determined by a given code structure and can be done efficiently by designing LDPC codes using permuted identity matrices. Considering parallelism
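Codes built from permuted identity matrices can be illustrated by expanding a small base matrix of shift values into a full binary parity check matrix. The sketch below assumes a common convention (an entry of -1 denotes an all-zero block, and a nonnegative entry s denotes the identity matrix cyclically shifted by s columns); the function name `expand_qc` and the toy base matrix are hypothetical and do not come from any standard.

```python
def expand_qc(base, z):
    """Expand a base matrix of shift values into a binary parity check matrix.

    base: list of rows; -1 means a z-by-z zero block, s >= 0 means the
          identity matrix with each row's 1 moved s columns to the right
          (cyclically).
    z:    expansion factor (size of each square block).
    """
    n_rows = len(base) * z
    n_cols = len(base[0]) * z
    h = [[0] * n_cols for _ in range(n_rows)]
    for bi, block_row in enumerate(base):
        for bj, shift in enumerate(block_row):
            if shift < 0:
                continue                       # all-zero block
            for row in range(z):
                h[bi * z + row][bj * z + (row + shift) % z] = 1
    return h


if __name__ == "__main__":
    toy_base = [[0, 1],
                [2, -1]]                       # illustrative shift values only
    for line in expand_qc(toy_base, 3):
        print(line)
```

With this structure, the z rows of one block row never share a variable node within a block, so they can be assigned to z PEs working in parallel and the routing between memories and PEs reduces to cyclic shifts.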
In order to realize an efficient LDPC decoder, optimization is required at both the PE and interconnection levels. The overall complexity and performance of the decoder are largely determined by the characteristics of these two functional units. In the next two sections we will discuss them in detail and analyze various design choices aimed at realizing a high-performance flexible LDPC decoder.
The PE is the core of the decoding process, where the algorithm operations are performed. Its design is an important step that heavily affects the overall performance, complexity, and flexibility of the decoder. The PE can be designed to be serial, with internal pipelining to maximize throughput, or parallel, processing all data concurrently. Depending on this initial choice, critical design issues can arise in either latency and memory requirements or complex interconnection structures and extended logic area.
As described in Section
Min-Sum PE block scheme. (a) Serial Approach, (b) Parallel Approach.
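A serial Min-Sum PE is commonly organized around two running minima, so that one incoming message is consumed per step and all outputs are generated afterwards from a handful of stored values. The following Python model is a behavioral sketch of that idea under our own assumptions; the function name `serial_min_sum_pe` and the optional `scale` factor are hypothetical and are not claimed to match the block scheme of the figure.

```python
def serial_min_sum_pe(msgs, scale=1.0):
    """Behavioral model of a serial min-sum check node.

    Input phase: one message per step, tracking the smallest magnitude
    (min1), its position (idx), the second smallest magnitude (min2) and
    the running sign product.
    Output phase: each edge receives min1, except the edge that supplied
    min1, which receives min2.
    """
    min1 = min2 = float("inf")
    idx = -1
    sign_prod = 1
    signs = []
    for k, m in enumerate(msgs):                 # one message per clock cycle
        s = -1 if m < 0 else 1
        signs.append(s)
        sign_prod *= s
        mag = abs(m)
        if mag < min1:
            min2, min1, idx = min1, mag, k
        elif mag < min2:
            min2 = mag
    # Multiplying by the edge's own sign removes it from the sign product.
    return [scale * sign_prod * signs[k] * (min2 if k == idx else min1)
            for k in range(len(msgs))]


if __name__ == "__main__":
    print(serial_min_sum_pe([2.0, -0.5, 1.5]))
```

Only two magnitudes, one index, and the signs need to be stored regardless of the check node degree, which is what makes the serial organization attractive when many different row degrees must be supported.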
Table
Code rate 
12  8  6  4  
WiMax 
WiFi 
Flexible LDPC decoder ASIC implementations. CMOS technology process (Tech), area occupation (A), A_{norm} (normalized area @ 130 nm), scheduling (sched. TDMP/TPMP), code type (C.T), block length
Design  Tech. (nm)  A (mm²)  A_norm (mm²)  Sched.  C.T  Block length  DM  Flex.  It.  T.P (Mb/s)  Freq. (MHz)  PE  Dp  TAR  DE  FE 
1.337  5.348  QCWiMAX  576–2304  114  25, 20  48–333  24–96  1245.3  16.7  356.0  
[ 
65  3.861  15.444  TDMP  DVBS2  64800  20  D.T  50, 15  60–708  400  Se  90  687.6  26.6  34.4 
1.023  4.092  QCWiFi  648–1944  12  25, 20  54–281  27–81  1373.4  14.1  41.4  


[ 
65  0.51  2.04  TDMP  WiMedia  1200–1320  8  R.T  5, 3  1120–1220  264  Pa  90  1794.1  13.9  54.5 
[ 
90  9.60  20.03  TDMP  DVBS2  64800  20  R.T  15  181–998  320  Se  360  747.4  46.8  46.7 
[ 
90  0.42  0.87  TDMP  QCWiFi  1944  12  R.T  30  43  294  Pa  324  1482.7  4.4  60.7 
[ 
180  3.39  1.768  TDMP  QCWiMAX  576–2304  114  R.T  10  68  100  Se  24  384.6  6.8  438.5 
[ 
130  6.3  6.3  TDMP/TPMP  BlockLDPC  576–2304  114  R.T  15  205  260  Se  24–96  487.5  11.8  213.5 
[ 
130  2.46  2.46  Overlapped TDMP  QCWiMAX  576–2304  114  R.T  15  248–287  150  Se  96  1750.0  28.7  1330.0 
[ 
130  3.843  3.843  TDMP/TPMP  QCWiMAX  576–2304  114  D.T  10,15  83–610  333  Se  24–96  2380.9  18.3  4.8 
[ 
90  0.679  1.416  TDMP  QCWiMAX  576–2304  114  R.T  8–12  200  400  Pa  16  1694.9  4.0  322.0 
[ 
90  6.25  13.04  TDMP  QCWiMAX  576–2304  114  R.T  20  105  150  Se  24–96  161.0  14.0  122.4 
[ 
130  8.29  8.29  TDMP  QCWiMAX  576–2304  19  R.T  2–8  222  83.3  Pa  4–8  214.2  5.3  12.1 
[ 
130  4.94  4.94  TPMP  Arbitrary  1536  Arbitrary  R.T  2–8  86  125  Pa  Arbitrary  139.3  1.37  N/A 
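The tabulated figures appear consistent with a few simple relationships, which we infer here from the rows themselves rather than from a stated definition: the normalized area scales the measured area by the square of the technology ratio with respect to 130 nm, TAR matches throughput times iterations divided by normalized area, and DE matches throughput times iterations divided by clock frequency. The Python sketch below encodes these inferred relationships; the function names are hypothetical, and for rows listing ranges the match is obtained by pairing the highest throughput with the lowest iteration count. No relationship for FE is attempted.

```python
def normalized_area(area_mm2, tech_nm, ref_nm=130.0):
    """Scale an area to the reference node by the square of the feature-size ratio."""
    return area_mm2 * (ref_nm / tech_nm) ** 2


def tar(throughput_mbps, iterations, a_norm_mm2):
    """Throughput-to-area ratio as inferred from the table rows."""
    return throughput_mbps * iterations / a_norm_mm2


def decoding_efficiency(throughput_mbps, iterations, freq_mhz):
    """Decoded bits per clock cycle and iteration, as inferred from the table rows."""
    return throughput_mbps * iterations / freq_mhz


if __name__ == "__main__":
    # 65 nm WiMedia row: A = 0.51, It. = 3, T.P = 1220 Mb/s, 264 MHz.
    a_norm = normalized_area(0.51, 65)
    print(round(a_norm, 2),
          round(tar(1220, 3, a_norm), 1),
          round(decoding_efficiency(1220, 3, 264), 1))
```

Checking a second row (180 nm, A = 3.39, 10 iterations, 68 Mb/s, 100 MHz) with the same functions reproduces the tabulated 1.768, 384.6, and 6.8, which supports the inferred forms without proving that they are the authors' exact definitions.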
Realizing high-throughput decoders (supporting data rates up to a few hundred Mb/s) requires either massive parallelism or a high clock frequency, resulting in significant area and power overhead. However, parallelism at the CN level can bring a significant increase in throughput with affordable complexity. A parallel PE manages all VN-CN messages in parallel and writes back the results simultaneously to all connected VNs. This results in lower update latency and consequently higher throughput. A parallel Min-Sum PE for
Flexibility as a design parameter is not always addressed as an important figure of merit, but various design techniques have been reported in the literature which can be compared in terms of throughput, complexity, and number of supported decoding modes, thus evaluating the obtainable degree of flexibility.
The partially parallel decoder presented by Kuo and Willson in [
In [
When designing an efficient multimode decoder, a typical approach is to find similarities between different modes and then implement the common parts as reusable hardware components. Controlling the data flow between reusable components guarantees multimode flexibility. One such effort is the work by Brack et al. [
As discussed in Section
An interesting way to tackle the flexibility issue is proposed in [
A single design flow is exploited in [
A classical layered scheduling is used in the DVB-S2 decoder proposed in [
The work in [
In addition to serial check node architectures, the state of the art for flexible LDPC decoders also reports some solutions utilizing parallel check nodes. The work in [
The work in [
The WiMedia standard [
A more technological point of view is given in [
Future mobile and wireless communication standards will require support for seamless service and heterogeneous interoperability: convolutional, turbo, and LDPC codes are established channel coding schemes for almost all upcoming wireless standards. ASIPs are potential candidates to provide this flexibility. The state of the art reports a number of design efforts in this domain, thanks to their good performance and acceptable degree of flexibility.
The work portrayed in [
A possible dual turbo/LDPC decoder architecture is described in [
In [
The solution proposed in [
Tables
Flexible LDPC decoder ASIP implementations. CMOS technology process (Tech), area occupation (A), normalized area (A_{norm}) @ 130 nm, code type (C.T), flexibility (Flex.): design time (D.T) or run time (R.T), maximum throughput (T.P), maximum iterations (It.), number of datapaths (Dp), operating frequency
Design  Tech. (nm)  A (mm²)  A_norm (mm²)  C.T  Flex.  It.  T.P (Mb/s)  Freq. (MHz)  Dp  PE  TAR  DE 
Multicore ASIP [ 
90  2.6  5.42  LDPC{WiMAX,WiFi} 
R.T  10 

500  24  Se 


2D NOC ASIP [ 
130  N/A  N/A  Turbo 
R.T  8  86.5 
200  16  Se  N/A  3.46 
FlexiCHAP [ 
65  0.62  2.48  LDPC{WiMAX, WiFi} 
R.T  10–20 

400 (max.)  27  Se 


Bin/nonBin [ 
65  3.4  13.6  Bin LDPC{WiMAX, WiFi} 
R.T  10 
90 
400  96  Se  66.2 
2.25 
To evaluate the effective flexibility of each decoder, and its cost, a metric called
As shown in Table
Among the ASIP solutions (Table
As shown in the previous sections, the great majority of current LDPC decoders require some kind of intra-decoder communication. Except for very few single-core implementations based on the single-instruction single-datapath (SISD) paradigm, the need for message routing or permutation is a constant throughout the wireless communication state of the art. As a first classification, two scenarios can be roughly distinguished:
Decoding of structured LDPC codes, regardless of the implementation, often requires shift or shuffle operations to route information between PEs or to/from memories. This is particularly true for some kinds of LDPC codes, such as QC-LDPC and
The barrel shifter (BS) is a well-known circuit designed to perform all the permutations of its inputs obtainable with a shift operation, thus being well suited for the circularly shifted structure of QC-LDPC
Rovini et al. in [
In [
Barrel shifters, though providing the most immediate implementation of the shift operation, often lack the necessary flexibility to directly tackle multiple block sizes. For this reason, they are usually combined with more complex structures.
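The limitation can be made concrete with a behavioral model. The Python sketch below first models a logarithmic barrel shifter of fixed width P and then shows one possible workaround for a block size z smaller than P, based on two full-width rotations merged lane by lane. This workaround is only an example of the kind of additional structure that may be required, not the scheme adopted by any specific work discussed here, and the function names are hypothetical.

```python
def barrel_shift(data, shift):
    """Logarithmic barrel shifter: stage k rotates left by 2**k when bit k of shift is set."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("width must be a power of two")
    out = list(data)
    k = 0
    while (1 << k) < n:
        if (shift >> k) & 1:
            step = 1 << k
            out = out[step:] + out[:step]
        k += 1
    return out


def flexible_shift(data, shift, z):
    """Cyclically shift only the first z lanes of a P-lane word (z <= P).

    Two full-width rotations are computed, by shift and by shift - z
    (modulo P), and a per-lane selection merges them.
    """
    p = len(data)
    shift %= z
    a = barrel_shift(data, shift)
    b = barrel_shift(data, (shift - z) % p)
    merged = [a[i] if i + shift < z else b[i] for i in range(z)]
    return merged + list(data[z:])               # unused lanes pass through


if __name__ == "__main__":
    lanes = list(range(8))                       # P = 8 lanes, block size z = 6
    print(barrel_shift(lanes, 2)[:6])            # wraps at P, not at z
    print(flexible_shift(lanes, 2, 6))           # wraps at z as required
```

The plain shifter wraps around at its physical width, so lanes beyond the active block leak into the result, while the merged version costs a second shifter (or a second pass) plus a row of multiplexers.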
One of the most common implementations among the simplest interconnection structures is the Benes network (BeN). This kind of network is a rearrangeable nonblocking network frequently used as a permutation network. Defining
In [
To best exploit the properties of
The BeN is used to define the links between VNUs and CNUs: these links are static, once the parameters
The flexible decoder is able to achieve 3.6 Gb/s with an area of
In [
Not every supported standard requires all 12 datapaths to be active: the chosen parallelism is the minimum necessary for throughput compliance, and the same can be said for the working frequency. The implementation results show full compliance with WiMAX, WiFi, 3GPP-HSDPA, and DVB-SH, at the cost of 0.9 mm
One of the limitations of the traditional BeN is that the number of its inputs and outputs is bound to be a power of 2. However, LDPC decoders often need permutations of different sizes: for example, WiMAX codes require shift permutations of sizes corresponding to the possible expansion factors, that is, from 24 to 96 in steps of 4. In [
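The size mismatch can be quantified with a short calculation. The sketch below lists the WiMAX expansion factors mentioned above, rounds each one up to the next power of two, and counts the 2x2 switches of a Benes network of that size using the standard figure of 2·log2(N) − 1 stages with N/2 switches each. The function names are hypothetical, and the switch count is meant only as a rough complexity indicator.

```python
def benes_ports(n_inputs):
    """Smallest power-of-two port count that can host n_inputs."""
    return 1 << (n_inputs - 1).bit_length()


def benes_switches(ports):
    """Number of 2x2 switches in a Benes network with ports = 2**k inputs."""
    k = ports.bit_length() - 1
    return (ports // 2) * (2 * k - 1)


# WiMAX expansion factors: 24 to 96 in steps of 4.
wimax_z = list(range(24, 97, 4))

if __name__ == "__main__":
    exact = [z for z in wimax_z if benes_ports(z) == z]
    print(len(wimax_z), "expansion factors; exact power-of-two sizes:", exact)
    print("network needed for z = 96:", benes_ports(96), "ports,",
          benes_switches(benes_ports(96)), "switches")
```

Only two of the expansion factors are themselves powers of two, so a traditional network sized for the largest case is oversized for almost every mode.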
In [
Similar to the BeN is the Banyan network (ByN) [
The work described in [
Networks-on-Chip (NoCs) [
In [
The performance of another topology, the 2D toroidal mesh, is evaluated in [
The work in [
ZONoC routing element.
The NoC approach guarantees a very high degree of flexibility and, in theory, a NoC-based decoder can reach very high throughput. The achievable throughput is proportional to the number of PEs, but increasing the size of the network raises the latency and thus degrades performance again. Very few state-of-the-art solutions have managed to solve this problem, and those that do suffer from large complexity and power consumption. We have tried to overcome these shortcomings in some recent works.
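The trade-off between network size and latency can be illustrated with a simple hop-count calculation on a 2D torus. The sketch below is a rough proxy only: it assumes minimal routing, uniform traffic between all node pairs (a node sending to itself counts as zero hops), and no contention, none of which is claimed for the decoders discussed here. The function name `torus_avg_hops` and the example sizes are illustrative.

```python
def torus_avg_hops(k):
    """Average shortest-path hop count between node pairs of a k-by-k 2D torus."""
    def ring_distance(a, b):
        d = abs(a - b)
        return min(d, k - d)                     # wrap-around links halve the distance

    nodes = [(x, y) for x in range(k) for y in range(k)]
    total = sum(ring_distance(x1, x2) + ring_distance(y1, y2)
                for (x1, y1) in nodes for (x2, y2) in nodes)
    return total / (len(nodes) ** 2)


if __name__ == "__main__":
    for side in (4, 5, 8):
        print(side * side, "nodes:", torus_avg_hops(side), "hops on average")
```

Adding nodes increases the number of PEs working in parallel, but every message also travels further on average, which is the effect that limits the net throughput gain.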
The solution described in [
In order to comply with WiMAX standard throughput requirements, the size of the 2D torus mesh has been increased from 16 nodes to 25. As detailed in [
While the former decoder is compliant with WiMAX in the worst case, that is, when the maximum allowed number of iterations is performed, a codeword is on average corrected with fewer iterations: the unnecessary iterations contribute significantly to the high power consumption of the NoC. In [
Various combinations of the two methods have been tried, together with different parallelisms of the 2D torus mesh. Implementation of the ES criterion requires a dedicated processing block with minimal PE modifications, while MS requires a threshold comparison block for each PE and switching to online dynamic routing. This is necessary since stopping a message invalidates the statically computed communication pattern. While the ES method guarantees an average
In [
The performance of a wide set of topologies (ring, spidergon, toroidal meshes, honeycomb, De Bruijn, and Kautz) has been evaluated in terms of achievable throughput and complexity, considering different parallelisms. Exploiting a modified version of the cycle-accurate simulation tool described in [
The simulations revealed the Kautz topology [
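For readers unfamiliar with the Kautz topology, the following Python sketch builds a Kautz graph from its textbook definition and measures its diameter by breadth-first search. Nodes are strings of a given length over an alphabet of degree + 1 symbols in which adjacent symbols differ, and each node links to the strings obtained by dropping its first symbol and appending a new one. The function names and the example parameters are hypothetical and are not the configurations evaluated in the simulations.

```python
from collections import deque
from itertools import product


def kautz_graph(degree, length):
    """Return the adjacency lists of a Kautz graph with the given out-degree."""
    alphabet = range(degree + 1)
    nodes = [s for s in product(alphabet, repeat=length)
             if all(s[i] != s[i + 1] for i in range(length - 1))]
    # Shift left by one symbol and append any symbol different from the last one.
    return {s: [s[1:] + (c,) for c in alphabet if c != s[-1]] for s in nodes}


def diameter(edges):
    """Longest shortest path over all ordered node pairs (breadth-first search)."""
    worst = 0
    for src in edges:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in edges[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst


if __name__ == "__main__":
    for length in (2, 3):
        g = kautz_graph(2, length)
        print(len(g), "nodes, out-degree 2, diameter", diameter(g))
```

The node count grows geometrically with the string length while the diameter grows only linearly and the node degree stays fixed, which is the property that makes this family attractive when router complexity must remain constant.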
A complete overview of LDPC decoders, with particular emphasis on flexibility, has been drawn. Various classifications are presented, according to the degree of parallelism and implementation choices, focusing on common design choices and elements for flexible LDPC decoders. An in-depth view is given of the PE and interconnection parts of the decoders, with a comparison against the current state of the art. The latest work by the authors on NoC-based decoders is briefly described.