Reduced-Complexity LDPC Decoding for Next-Generation IoT Networks

Low-density parity-check (LDPC) codes have become a leading choice for next-generation Internet of Things (IoT) networks. This correspondence proposes an efficient decoding algorithm, dual min-sum (DMS), to estimate the first two minima from a set of variable nodes for the check-node update (CNU) operation of a min-sum (MS) LDPC decoder. The proposed architecture entirely eliminates the large multiplexing system of the sorting-based architecture, which results in a marked reduction in hardware complexity and critical delay. Specifically, the DMS architecture eliminates a large number of comparators and multiplexors while keeping the critical delay equal to that of the most delay-efficient tree-based architecture. Based on experimental results, when the number of inputs is 64, the proposed architecture saves 69%, 68%, and 52% area over the sorting-based, the tree-based, and the low-complexity tree-based architectures, respectively. Furthermore, simulation results show that the proposed approach provides excellent error-correction performance in terms of bit error rate (BER) and block error rate (BLER) over an additive white Gaussian noise (AWGN) channel.

decoding phases, i.e., check-node update and variable-node update. Among the various decoding algorithms, the sum-product (SP) algorithm [28] provides decoding performance close to the Shannon capacity. However, it suffers from high complexity because of the logarithmic and multiplicative functions involved in the CNU operation. For hardware implementation of the decoder, an area-efficient approximation of SP called the min-sum (MS) algorithm [29] was proposed, which provides implementation advantages over the SP algorithm by computing two minimum values from the set of messages arriving at each check node, at the cost of some performance degradation. The normalized min-sum (NMS) and offset min-sum (OMS) algorithms [30], modified versions of MS, significantly improve the performance of MS by introducing additional normalization and offset factors, respectively.
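The normalization and offset factors mentioned above can be illustrated with a minimal sketch. It uses the commonly cited forms of NMS (scaling the min-sum magnitude) and OMS (subtracting an offset and clamping at zero); the function names and the default values `alpha = 0.75` and `beta = 0.5` are illustrative assumptions, not parameters taken from this correspondence.

```python
def nms_magnitude(min_value: float, alpha: float = 0.75) -> float:
    """Normalized min-sum: scale the min-sum magnitude by a factor alpha.

    alpha is a hypothetical normalization factor in (0, 1].
    """
    return alpha * min_value


def oms_magnitude(min_value: float, beta: float = 0.5) -> float:
    """Offset min-sum: subtract an offset beta and clamp the result at zero.

    beta is a hypothetical non-negative offset.
    """
    return max(min_value - beta, 0.0)
```

Both corrections shrink the min-sum magnitude, which tends to overestimate the reliability computed by the SP algorithm.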
In the hardware implementation of an MS decoder, each iteration involves two operations, i.e., CNU and variable-node update (VNU). For CNU, a minimum-value unit (mvu), also called a minimum finder, is required to estimate the first two minima $(\mathrm{Min}_1, \mathrm{Min}_2)$ and the index of the first minimum value. For the large block-length LDPC codes required in high data rate applications, a huge number of minimum-value units are needed to estimate the first two minima and the index of $\mathrm{Min}_1$, which significantly increases the complexity of the CNU operation. Existing methods require circuitry with high complexity in terms of comparators, multiplexors, latency, and area-time. Thus, a low-cost algorithm is greatly desired to reduce the complexity of the CNU operation of the MS decoder.
Recently, some attempts have been made to estimate the first two minima from a set of messages arriving at a check node. In [31], a single minimum min-sum (smMS) algorithm was proposed which computes only the absolute minimum value; the second minimum value is obtained by adding a corrective constant to the first minimum. The smMS algorithm provides a significant reduction in the hardware complexity of the CNU processor, but it suffers from performance degradation. Wang et al. proposed a modification factor min-sum (mfMS) algorithm in [32]; the mfMS algorithm improves the performance of smMS by introducing a modification factor in the absolute minimum value. Zhang et al. used the mfMS approach to design a flexible LDPC decoder for multigigabit per second applications [33]. A variable-weight min-sum (vwMS) algorithm was proposed by Angarita et al. in [29] by introducing a variable iteration-based correction factor; the performance of vwMS is better than that of smMS and mfMS. A simplified variable-weight min-sum (svwMS) algorithm is also proposed in [29], which requires low computational cost to determine whether more than one input message shares the same first minimum value. In [29, 31-33], the absolute minimum value is calculated first, and then the second minimum is estimated by applying a modification or correction factor to the absolute minimum value. Researchers have also investigated various problems on other related topics in communications [34-44].
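The single-minimum idea described above can be sketched as follows. This is only an illustration of "second minimum = first minimum plus a corrective constant"; the function name and the default `correction = 0.5` are hypothetical, and the actual constant used in [31] is not given in this text.

```python
from typing import Sequence, Tuple


def smms_two_min(magnitudes: Sequence[float], correction: float = 0.5) -> Tuple[float, float]:
    """Return the exact first minimum and an estimated second minimum.

    The second value is not searched for; it is approximated by adding a
    hypothetical corrective constant to the first minimum.
    """
    min1 = min(magnitudes)
    return min1, min1 + correction
```

Because only one minimum is searched, the second output is an approximation and can differ from the true second minimum, which is the source of the performance degradation noted above.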
Besides the single minimum-based algorithms, some efforts have been made to propose architectures which compute the two minima from a set of messages for the CNU operation [45-49]. A sorting-based architecture was proposed by Xie et al. in [46] for finding two minima, but it suffers from a large critical delay. Chen-Long et al. proposed a tree-based architecture in [47] which requires some additional complexity but provides a critical delay less than that of the sorting-based architecture. A low-complexity tree-based architecture [48] was proposed by Lee et al. which reduces some of the hardware complexity of the tree-based structure while keeping the critical delay between those of the sorting-based and tree-based architectures. This manuscript presents an efficient approach, known as the dual min-sum (DMS) architecture, for finding the first two minima ($\mathrm{Min}_1$ and $\mathrm{Min}_2$) from a set of variable nodes participating in the CNU operation. Compared to the existing sorting-based and tree-based architectures, the proposed scheme eliminates a large number of comparators and multiplexors while keeping the critical delay almost equal to that of the tree-based architecture. Based on experimental results, when the number of inputs is 64, the proposed architecture saves 69%, 68%, and 52% area over the sorting-based, tree-based, and low-complexity tree-based architectures, respectively. Furthermore, the simulation results show that the proposed approach outperforms its counterparts by providing error-correction performance close to that of the NMS algorithm over an additive white Gaussian noise (AWGN) channel.
The remainder of this correspondence is arranged as follows. Section 2 gives the basic concepts of LDPC codes and min-sum decoding. Section 3 reviews the state-of-the-art architectures for finding the first two minima. Section 4 presents the proposed architecture for finding the first two minima for the CNU operation of the min-sum LDPC decoder. The performance analysis and hardware implementation of the proposed architecture are given in Section 5, and Section 6 concludes this correspondence.

Min-Sum LDPC Decoding
An $(N, K)$ LDPC code can be described by the null space of an $M \times N$ sparse parity-check matrix $H$, where $M$ denotes the number of parity-check equations (check nodes). Let $N_m = \{n : H_{mn} = 1\}$ denote the set of variable nodes involved in check node $c_m$ and $M_n = \{m : H_{mn} = 1\}$ denote the set of check nodes connected to variable node $v_n$. Also, let $N_{m \setminus n}$ represent the set $N_m$ excluding variable node $n$, and let $M_{n \setminus m}$ represent the set $M_n$ excluding check node $m$. The log-likelihood ratio (LLR) for a random variable is defined as $\ln((1 - \gamma)/\gamma)$, where $\gamma$ represents the probability of the transmitted bit being equal to zero. In addition, let $\varphi^{(j)}_{n \to m}$ denote the LLR message for bit $n$, sent from variable node $v_n$ to check node $c_m$ in the $j$th iteration. Similarly, $\psi^{(j)}_{m \to n}$ denotes the LLR message for bit $n$, sent from check node $c_m$ to variable node $v_n$ in the $j$th iteration. Finally, $w = [w_1, w_2, \ldots, w_N]$ and $r = [r_1, r_2, \ldots, r_N]$ denote the transmitted and the received codewords, respectively. Also, let $\ell = [\ell_1, \ell_2, \ldots, \ell_N]$ denote the intrinsic reliability provided by the channel. The MS decoding consists of the following steps:
(5) Hard decision: apply a hard decision to compute the estimate of the transmitted sequence, $\hat{W} = (\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_N)$. If $\hat{W} H^T = 0$ or the maximum number of iterations $J_{\max}$ is reached, move to Step 6; otherwise, set $j = j + 1$ and go back to Step 3.
(6) Output: declare the estimated sequence $\hat{W}^{(j)}$ as the decoder output.
Although the performance of the MS algorithm is lower than that of the conventional SP and NMS algorithms, it requires much simpler hardware circuitry for the CNU operation performed in the check-node update processor. In a practical implementation of the MS decoder, instead of finding the minimum value in (2) separately for each outgoing message, two minimum values are computed from the set of messages arriving at the check node, and the suitable one is selected depending upon the information received at the check node.
Thus, the MS decoder reduces the hardware complexity and provides implementation advantages in terms of area and delay. In the next section, we introduce some existing architectures to find the first two minima for CNU operation of MS decoder.
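The two-minima selection described above can be sketched in software. The following is a minimal model of a standard min-sum check-node update, assuming the usual form in which each outgoing magnitude is the minimum over all other incoming magnitudes and the sign is the product of the other incoming signs. The function name and the list-based message representation are illustrative assumptions; the text's own equation (2) is not reproduced here.

```python
from typing import List


def check_node_update(incoming: List[float]) -> List[float]:
    """Min-sum check-node update using only Min1, Min2, and the index of Min1.

    incoming holds the variable-to-check LLRs of one check node (at least two).
    """
    mags = [abs(v) for v in incoming]
    idx1 = min(range(len(mags)), key=mags.__getitem__)  # index of Min1
    min1 = mags[idx1]
    min2 = min(m for i, m in enumerate(mags) if i != idx1)  # Min2

    # Product of all signs, tracked as +1 / -1.
    total_sign = 1
    for v in incoming:
        if v < 0:
            total_sign = -total_sign

    outgoing = []
    for i, v in enumerate(incoming):
        # The edge that supplied Min1 receives Min2; every other edge receives Min1.
        magnitude = min2 if i == idx1 else min1
        # Removing this edge's own sign from the total gives the product of the others.
        sign = total_sign * (-1 if v < 0 else 1)
        outgoing.append(sign * magnitude)
    return outgoing
```

For example, `check_node_update([2.0, -0.5, 3.0, -1.0])` yields `[0.5, -1.0, 0.5, -0.5]`: only the edge carrying the smallest magnitude needs the second minimum, which is why two minima and one index suffice.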

Related Architectures
Generally, the hardware circuit used to find the first two minima from a set of messages arriving at a check node is known as a search module (SM). For a given set of $m$ $w$-bit messages received at a check node, $X = \{x_0, x_1, \ldots, x_{m-1}\}$, the SM generates three outputs: (1) the first minimum value of the set $X$, (2) the second minimum value of $X$, and (3) the index of the first minimum value. For hardware realization, two 2-input units, $\mathrm{mvu}_{2-1}$ and $\mathrm{mvu}_{2-2}$, are used as the fundamental units of a search module. $\mathrm{mvu}_{2-1}$, as shown in Figure 1(a), consists of one comparator and one $w$-bit 2-to-1 multiplexor, and it returns the smaller of its two inputs. $\mathrm{mvu}_{2-2}$ consists of one comparator and two $w$-bit 2-to-1 multiplexors, and it returns both the smaller and the larger values, as depicted in Figure 1(b). Also, assume that the number of SM inputs $m$ is a power of 2, i.e., $m = 2^k$. If $m$ is not a power of 2, then such an SM can be obtained by pruning some leaf nodes of the balanced SM having $2^k$ inputs, as described in the previous literature [45-47]. Next, we present some state-of-the-art architectures to find the first two minima and the index of the first minimum value.
The sorting-based SM architecture for eight inputs is depicted in Figure 2. The overall process of the sorting-based SM is partitioned into two steps: (1) $\mathrm{Min}_1$ is computed with the binary search tree and (2) an index-controlled multiplexing system is used to compute $\mathrm{Min}_2$. As shown in Figure 1(c), the index of $\mathrm{Min}_1$ can be estimated from the comparison results. A set of candidates, $Y = \{y_1, y_2, y_3\}$, is computed by the multiplexing system, which employs three 8-to-1 multiplexors. Once the set $Y$ is in hand, two $\mathrm{mvu}_{2-1}$ units are required to compute $\mathrm{Min}_2$. Consequently, the sorting-based SM requires nine 2-to-1 multiplexors, nine comparators, and three 8-to-1 multiplexors for processing eight inputs. However, it incurs a long critical delay due to the serially connected multiplexing system.
The sorting-based architecture is not feasible for high-speed applications because it induces a large critical delay due to the serially connected multiplexing system. A tree-based architecture, as depicted in Figure 3, was proposed in [47] for high-speed realization. In the tree-based SM, $\mathrm{Min}_1$ and $\mathrm{Min}_2$ have almost the same processing time due to the hierarchical tree architecture. Compared to the sorting-based SM, it requires more comparators and multiplexors for finding $\mathrm{Min}_2$: three $\mathrm{mvu}_{2-1}$ units and one 2-to-1 multiplexor are additionally required for combining two subtrees. However, the serially connected multiplexing system is completely removed, which reduces the critical delay.
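The hierarchical idea behind the tree-based SM can be modeled functionally: each subtree reports its own (smallest, second smallest) pair, and two subtrees are combined into one pair. The sketch below is a behavioral model only, under the assumption of a power-of-two number of inputs; the names `merge` and `tree_two_min` are hypothetical, the index output is omitted, and the comparison count per merge is not meant to match the gate-level structure of [47].

```python
from typing import List, Tuple


def merge(a: Tuple[int, int], b: Tuple[int, int]) -> Tuple[int, int]:
    """Combine the (min1, min2) pairs of two subtrees into one pair."""
    a1, a2 = a
    b1, b2 = b
    if a1 <= b1:
        # a1 is the overall minimum; the runner-up is b1 or a's own second value.
        return a1, min(b1, a2)
    return b1, min(a1, b2)


def tree_two_min(values: List[int]) -> Tuple[int, int]:
    """Exact first and second minima via a balanced tree (power-of-two inputs)."""
    n = len(values)
    assert n >= 2 and n & (n - 1) == 0, "number of inputs must be a power of 2"
    # Leaf level: each pair of inputs is ordered, as an mvu_2-2 unit would do.
    level = [
        (min(values[i], values[i + 1]), max(values[i], values[i + 1]))
        for i in range(0, n, 2)
    ]
    while len(level) > 1:
        level = [merge(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

Because both outputs travel through the same number of tree levels, $\mathrm{Min}_1$ and $\mathrm{Min}_2$ become available at almost the same time, which matches the delay argument given above.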

Wireless Communications and Mobile Computing
The tree-based architecture provides implementation advantages over the sorting-based architecture in terms of critical delay, but it is not cost-effective for large block-length LDPC codes because of the hardware complexity that arises from its large number of comparators and multiplexors. A low-complexity tree-based architecture was proposed in [48] which reduces the number of comparators while keeping the critical delay between those of the sorting-based and tree-based architectures. A low-complexity tree-based SM for eight inputs, referred to as $\mathrm{SM}_{\mathrm{pro}}$, is depicted in Figure 4, where a $\mathrm{PRO}_8$ unit provides a candidate set, $Y = \{y_1, y_2, y_3\}$, for finding $\mathrm{Min}_2$. A tree structure composed of two $\mathrm{mvu}_{2-1}$ units is required to find $\mathrm{Min}_2$ from the candidate set $Y$. $\mathrm{SM}_{\mathrm{pro}}$ requires nine comparators and twenty 2-to-1 multiplexors to process eight inputs. The existing sorting-based and tree-based search modules are therefore not cost-effective for large block-length LDPC codes, and a low-cost SM architecture is greatly needed for hardware implementation of the MS-LDPC decoder. Next, we present an SM, known as the DMS architecture, which reduces the hardware complexity of the MS decoder for large block-length LDPC codes.

Proposed Architecture
The complexity of comparators and multiplexors is considerable for hardware realization of the MS-LDPC decoder. A DMS-based SM is presented which eliminates a large number of comparators and multiplexors while keeping the critical delay almost equal to that of the tree-based architecture. The proposed SM is conceptually similar to the sorting-based SM, and the hardware required to find $\mathrm{Min}_1$ is the same in both architectures. However, the serially connected multiplexing system for finding $\mathrm{Min}_2$ is completely removed. Instead, the proposed DMS-based SM estimates the second minimum, denoted $\widetilde{\mathrm{Min}}_2$, using a logical unit, as depicted in Figure 5. The complexity and delay of the logical unit are much less than those of the serially connected multiplexing system, which reduces both the hardware complexity and the critical delay.
The DMS-based SM for eight inputs is depicted in Figure 5, where seven comparators and seven 2-to-1 multiplexors are required to find $\mathrm{Min}_1$. The logical unit, as depicted in Figure 6, requires two adders, one right-shift register, and one AND gate for estimating $\widetilde{\mathrm{Min}}_2$. The first step of the DMS approach is to replace the CNU function in (2) with one in which the sign and output magnitudes are estimated from all $N_m$ variable nodes arriving at check node $c_m$. The next step is to find the first two minimum values for the CNU operation. Let $\lambda^{(j)}_{\min}$ and $\lambda^{(j)}_{\mathrm{sub}}$ denote $\mathrm{Min}_1$ and $\mathrm{Min}_2$, respectively. The magnitude of the check-node output is computed from these two values, where $a$ and $b$ denote the variable nodes participating in the last $\mathrm{mvu}_{2-1}$ of the DMS architecture. The DMS procedure is summarized in Algorithm 1.

Algorithm 1: DMS algorithm.
Input: a set $X$ of $m$ positive values.
for $j = 1 : m$ do
  Step 1: Partition set $X$ into pairs of values and find the minimum value of each pair. Continue partitioning, and find $\mathrm{Min}_1$ from the last pair of values.
  Step 2: Input the last pair of values in Step 1 to the logical unit, and estimate $\widetilde{\mathrm{Min}}_2$.
end for
Output: $X_{\min} = \{\mathrm{Min}_1, \widetilde{\mathrm{Min}}_2\}$

As an illustrative example, assume a set $X$ of eight input values, $X = \{2, 8, 1, 6, 5, 3, 7, 4\}$. Based on Step 1 of the DMS algorithm, set $X$ is partitioned into pairs of values as $R = \{\{2, 8\}, \{1, 6\}, \{5, 3\}, \{7, 4\}\}$. Finding the minimum value of each pair, a subset is obtained as $X_1 = \{2, 1, 3, 4\}$. Subset $X_1$ is again partitioned into pairs of values as $R_1 = \{\{2, 1\}, \{3, 4\}\}$. Finding the minimum value of each pair, we obtain the last pair of values as $\{1, 3\}$, which returns the first minimum value as $\mathrm{Min}_1 = 1$. According to Step 2 of the DMS algorithm, the last pair of values, $\{1, 3\}$, is passed to the logical unit for finding $\widetilde{\mathrm{Min}}_2$. Based on (5), $\widetilde{\mathrm{Min}}_2$ can be estimated as $\lceil (1/2)(1 + 3) \rceil = 2$. Afterward, the DMS algorithm returns the output as $X_{\min} = \{1, 2\}$. It is important to mention that the $\mathrm{Min}_1$ returned by the DMS algorithm is always the first minimum value of set $X$, whereas $\widetilde{\mathrm{Min}}_2$ is only an estimate of the second minimum value among the values of $X$; it may or may not be the exact second minimum value. Consequently, the DMS algorithm provides an efficient architecture which is more cost-effective for large block-length LDPC codes.
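The DMS procedure and the worked example can be checked with a short behavioral sketch. It assumes, based only on the worked example, that the logical unit outputs the ceiling of half the sum of the last pair; the expression `(a + b + 1) >> 1` is one integer way to compute that ceiling and is not claimed to be the gate-level design of Figure 6. The function name `dms_two_min` and the power-of-two input restriction are likewise assumptions of this sketch.

```python
from typing import List, Tuple


def dms_two_min(values: List[int]) -> Tuple[int, int]:
    """Return (Min1, estimated Min2) for a power-of-two number of positive integers."""
    n = len(values)
    assert n >= 2 and n & (n - 1) == 0, "number of inputs must be a power of 2"

    # Step 1: keep the minimum of each pair until a single pair remains.
    level = list(values)
    while len(level) > 2:
        level = [min(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    a, b = level
    min1 = min(a, b)

    # Step 2: estimate the second minimum from the last pair as ceil((a + b) / 2).
    # Assumed integer form: add, add one for rounding up, then shift right by one.
    min2_estimate = (a + b + 1) >> 1
    return min1, min2_estimate
```

For the example set, `dms_two_min([2, 8, 1, 6, 5, 3, 7, 4])` returns `(1, 2)`. For `[1, 2, 9, 9, 9, 9, 9, 9]` it returns `(1, 5)`, although the true second minimum is 2, which illustrates that the second output is an estimate rather than an exact value.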

Performance Analysis.
In this section, the error-correction performance of the proposed DMS approach in terms of bit error rate (BER) and block error rate (BLER) is compared with that of its counterparts under the same conditions. The standard IEEE 802.16e LDPC codes with code rates 0.5 and 0.75 and a block length of 2304 are used for evaluating the performance of the proposed and some existing algorithms. The performance of the proposed approach is compared with the NMS, mfMS, svwMS, and exMin-n [49] algorithms with the maximum number of decoding iterations set to 50. Binary phase-shift keying (BPSK) transmission is assumed over an AWGN channel. Figures 7 and 8 depict the performance analysis for the rate-0.5 (2304, 1152) and rate-0.75 (2304, 1728) IEEE 802.16e LDPC codes. Figure 7 compares the error-correction performance of the proposed DMS algorithm with that of its counterparts, including NMS and svwMS.

Table 1: Hardware complexity and critical delay of the sorting-based [46], tree-based [47], low-complexity tree-based [48], and proposed architectures.

Similarly, the error-correction performance of the DMS algorithm is also compared with NMS, mfMS, and exMin-n, for $n = 3$, for the IEEE 802.16e standard LDPC code with code rate 0.75 and a code length of 2304. Figure 8 reveals that the DMS algorithm performs close to the NMS algorithm, with a degradation of 0.06 dB at a BER of $10^{-5}$, whereas the exMin-3 and mfMS algorithms incur performance losses of 0.22 dB and 0.26 dB, respectively. As a result, the proposed DMS algorithm outperforms its counterparts under the same conditions by providing an error-correction performance very close to that of the NMS algorithm.

Complexity and Speed Performance.
Compared with the state-of-the-art architectures [46-48], the proposed DMS architecture reduces the computational complexity of the CNU operation of the MS-LDPC decoder. Table 1 compares the hardware complexity and critical delay of the DMS architecture with those of the sorting-based and tree-based architectures, where $\tau_c$, $\tau_{M2}$, $\tau_{Mk}$, and $\tau_{Lu}$ denote the delay of a comparator, a 2-to-1 multiplexor, a $2^k$-to-1 multiplexor, and the logical unit, respectively. The sorting-based [46] and low-complexity tree-based [48] architectures require $2^k + k - 2$ comparators, and the tree-based [47] architecture requires $2^{k+1} - 3$ comparators to find the first two minima. As the DMS architecture completely removes the multiplexing system that is unavoidable in the sorting-based SM, it requires only $2^k - 1$ comparators for finding the two minima. The sorting-based SM requires $2^k + k - 2$ 2-to-1 multiplexors and $k$ $2^k$-to-1 multiplexors, whereas the tree-based and low-complexity tree-based architectures require $3 \cdot 2^k - 4$ 2-to-1 multiplexors to find the first two minima. In contrast, the DMS architecture requires $2^k - 1$ multiplexors for finding $\mathrm{Min}_1$ and $\widetilde{\mathrm{Min}}_2$. The DMS architecture additionally requires two adders, one right-shift register, and one AND gate for the implementation of the logical unit, but it keeps the critical delay almost equal to that of the tree-based architecture. For example, if the number of input values is equal to 16, the DMS architecture eliminates 16.66% of the comparators compared with the sorting-based and low-complexity tree-based architectures and 48.27% of the comparators compared with the tree-based architecture. The proposed architecture also requires 65.90% fewer multiplexors than the tree-based and low-complexity tree-based architectures.
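The component counts quoted above can be evaluated directly for any $k$. The sketch below simply encodes the stated formulas for comparators and 2-to-1 multiplexors; the dictionary keys and function names are illustrative, and the $2^k$-to-1 multiplexors of the sorting-based SM and the logical unit of the DMS architecture are deliberately left out of the counts.

```python
from typing import Dict


def comparator_counts(k: int) -> Dict[str, int]:
    """Comparators needed for 2**k inputs, per the formulas stated in the text."""
    return {
        "sorting": 2**k + k - 2,
        "tree": 2 ** (k + 1) - 3,
        "low_complexity_tree": 2**k + k - 2,
        "dms": 2**k - 1,
    }


def mux2_counts(k: int) -> Dict[str, int]:
    """2-to-1 multiplexors needed for 2**k inputs, per the formulas in the text."""
    return {
        "sorting": 2**k + k - 2,
        "tree": 3 * 2**k - 4,
        "low_complexity_tree": 3 * 2**k - 4,
        "dms": 2**k - 1,
    }


def saving_percent(baseline: int, proposed: int) -> float:
    """Relative reduction of the proposed count with respect to a baseline."""
    return 100.0 * (baseline - proposed) / baseline


if __name__ == "__main__":
    k = 4  # 16 inputs, the case discussed in the text
    comps, muxes = comparator_counts(k), mux2_counts(k)
    for name in ("sorting", "tree", "low_complexity_tree"):
        print(name,
              f"comparators saved: {saving_percent(comps[name], comps['dms']):.2f}%",
              f"2-to-1 muxes saved: {saving_percent(muxes[name], muxes['dms']):.2f}%")
```

For $k = 4$ the formulas give 18, 29, and 15 comparators for the sorting-based, tree-based, and DMS architectures, and 44 versus 15 multiplexors for the tree-based designs versus DMS, which agrees with the percentages quoted in the text up to rounding.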
For a fair comparison, four types of architectures are implemented in a 6-bit CMOS standard cell library process: the sorting-based [46], tree-based [47], low-complexity tree-based [48], and proposed DMS architectures. Figure 9 depicts the critical delay of the four architectures against different numbers of inputs. To the best of our knowledge, the tree-based architecture [47] is the best architecture in the literature for high-speed realization. Figure 9 shows that the critical delay of the DMS architecture is almost the same as that of the tree-based architecture [47].
Among the existing designs, the most area-efficient architecture was proposed by Lee et al. in [48]. Figure 10 shows that when $k$ is equal to 6, the proposed architecture saves 69%, 68%, and 52% area over the sorting-based, tree-based, and low-complexity tree-based architectures, respectively. Consequently, the proposed architecture is shown to be the most area-efficient architecture for high-speed realization.

Conclusion
An efficient approach has been proposed to find the first two minima for the CNU operation of the MS-LDPC decoder. The proposed architecture is conceptually similar to the sorting-based architecture, but it completely removes the large multiplexing system, which results in a marked reduction in hardware complexity and critical delay. The proposed architecture estimates the second minimum value using a logical unit whose complexity and delay are less than those of the multiplexing system. Based on the experimental results, the proposed architecture provides a critical delay almost the same as that of the tree-based architecture. More specifically, the proposed SM eliminates a large number of comparators and multiplexors for the CNU operation of the MS-LDPC decoder. As a result, the DMS architecture saves 69%, 68%, and 52% area over the sorting-based, tree-based, and low-complexity tree-based architectures, respectively. Furthermore, simulation results show that the proposed approach outperforms its competitors in terms of bit error rate (BER) and block error rate (BLER) by providing excellent error-correction performance over an AWGN channel.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.