Embedded Parallel Implementation of LDPC Decoder for Ultra-Reliable Low-Latency Communications

Ultra-reliable low-latency communication (URLLC) is designed for applications such as self-driving cars and telesurgery that require a response within milliseconds and are very sensitive to transmission errors. To match the computational complexity of LDPC decoding algorithms to URLLC applications on IoT devices with very limited computational resources, this paper presents a new parallel, low-latency software implementation of an LDPC decoder. First, a decoding algorithm optimization and a compact data structure are proposed. Next, a parallel software implementation is performed on ARM multicore platforms in order to evaluate the latency of the proposed optimization. The synthesis results highlight a 50% reduction in the memory size requirement and a three-fold speedup in processing time compared to previous software decoder implementations. The decoding latency reached on the parallel processing platform is 150 μs for 288 bits, with a bit error ratio of 3.4 × 10⁻⁹.


Introduction
Initially introduced by Gallager in 1962 [1] and then reworked by MacKay and Neal in 1996 [2], low-density parity-check (LDPC) codes have recently been used in several wireless standards such as WiMAX (IEEE 802.16e), WiFi (IEEE 802.11n), 5G New Radio (NR), and DVB-S2.
LDPC codes are linear block codes defined by a sparse binary parity-check matrix H. They are typically represented by bipartite graphs, also called Tanner graphs [3], formed by variable nodes and check nodes connected by bidirectional edges. LDPC codes have attracted considerable attention because of their superior error-correction capability based on the iterative log-likelihood-ratio belief propagation algorithm (LLR BP) [4].
However, LDPC codes also have the disadvantage of high decoding complexity, which makes it significantly challenging to meet the low-latency requirements of communication systems. For example, ultra-reliable low-latency communication needs to deliver 32 octets in less than 1 ms with a bit error ratio of 10⁻⁵ [5].
There are two approaches to implementing an LDPC decoder. The first one is hardware, based on application-specific integrated circuit (ASIC) or field-programmable gate array (FPGA) circuits. This approach achieves low latency and high throughput [6-9] but brings a high development cost. This is a limitation for applications requiring fast time-to-market or for technologies with multiple, fast-evolving standards.
The second approach is a software implementation: the algorithm is coded in a programming language and compiled into a binary program that is loaded into the memory of the target architecture and executed by its processor. The software solution considerably reduces the hardware resources and development time necessary for deployment and allows functionalities to be modified and updated by uploading and running a new software version without changing the hardware.
Several studies have recently focused on software implementations of LDPC decoders on multicore devices, using three implementation strategies for parallel processing: some works use GPUs or x86 multicore processors to parallelize the processing [10, 11], others use SIMD architectures [12] to accelerate computation, and a third group exploits multicore systems on chip (SoCs) to combine hardware acceleration with software flexibility [13, 14].
In [10], a comparison of parallelization strategies for the min-sum decoding algorithm of irregular LDPC codes demonstrated that a GPU can achieve decoding with higher throughput than a general-purpose processor. However, a GPU-based solution works well only for large code lengths because of the time spent transferring data between the host and the GPU. For medium or small code lengths, the data transfer times turn out to be higher than the time needed for the decoding process itself.
In [12], we proposed a multi-Gbps LDPC decoder on a GPU, using single-instruction multiple-data processing to parallelize the decoding of many received packets. This approach helps to achieve higher throughput on embedded mobile devices. However, the time spent on data transfers between the host and the GPU device remains too high for low-latency applications.
In [13], Kharin et al. performed an implementation on a multicore system on chip (SoC) with 8 DSP cores and 4 ARM cores. This solution takes advantage of the DSP's flexibility and efficiency in signal processing tasks, but the interprocessor communication framework for ARM-DSP cooperation takes about 2.75 ms for the DSP data loading, consequently increasing the decoding latency.
Therefore, our choice is oriented towards a software solution based on a newly proposed algorithmic optimization of the LDPC decoding process, which reduces the computational complexity and, consequently, the decoding latency. This choice is also motivated by the emergence of general-purpose processors with high computing power and of multicore embedded systems that allow very powerful parallel computation, approaching the performance of hardware solutions at lower cost and with easier adaptation to various application contexts.
The rest of the paper is organized as follows. Section 2 reviews LDPC codes, the LLR BP, the min-sum (MSA), and the normalized min-sum (NMSA) algorithms. In Section 3, the proposed optimization algorithm and data structure are introduced, the complexity of the proposed algorithm in terms of the number of computational operations is discussed, and the parallel computational model is presented. Section 4 is dedicated to the simulation results in terms of CPU run time, error-correction performance, and the latency of the parallel software implementation on x86 and ARM multicore platforms. Section 5 concludes the paper.

LDPC Codes.
A binary LDPC channel code is a linear (N, K) block code used to correct transmission errors. At the transmitter, the encoder generates an N-bit code word from a K-bit information message by adding M = N − K parity bits. The decoding process uses a binary sparse parity-check matrix H of size M × N; if x is a valid code-word vector, then H · x^T = 0, where x^T is the transpose of the code-word vector and "·" represents matrix multiplication modulo 2.
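The parity-check condition H · x^T = 0 (mod 2) can be illustrated with a minimal Python sketch. The tiny 3 × 7 matrix below is purely illustrative (a real LDPC H is far larger and sparser):

```python
# Toy parity-check matrix: 3 rows (check equations), 7 columns (code bits).
# Illustrative only; not an actual LDPC matrix from the paper.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

def is_codeword(H, x):
    """x is a valid code word iff every parity check sums to 0 modulo 2,
    i.e. H . x^T = 0 (mod 2)."""
    return all(sum(h * b for h, b in zip(row, x)) % 2 == 0 for row in H)

# The first three columns of H sum to zero mod 2, so this x is valid:
valid = [1, 1, 1, 0, 0, 0, 0]
# Flipping one bit breaks two parity checks:
corrupted = [0, 1, 1, 0, 0, 0, 0]
```

Here `is_codeword(H, valid)` holds while `is_codeword(H, corrupted)` does not, which is exactly the syndrome test used as the decoder's stopping criterion.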
If the parity-check matrix contains the same number of ones per row (noted d_c) and the same number of ones per column (noted d_v), the code is called a regular LDPC code; otherwise, it is called an irregular code. In this paper, we limit our focus to irregular codes for their good convergence [15] in terms of bit error ratio (BER) versus signal-to-noise ratio (SNR).
An example of an H matrix with N = 12 variable nodes v_j and M = 9 check nodes c_i is shown in Figure 1(a). This matrix can be represented by a Tanner graph (Figure 1(b)) containing M check nodes (c_i), N variable nodes (v_j), and E bidirectional edges connecting check node c_i to variable node v_j when the value of the matrix element H_ij is 1.
The structured positions of the nonzero elements in the parity-check matrix H allow a reduction in the LDPC encoding complexity. Quasi-cyclic LDPC (QC-LDPC) codes are a class of structured codes with good error-correction performance [17, 18]. For these codes, the H matrix is composed of a set of Z × Z submatrices, where Z is called the expansion factor; each Z × Z submatrix is obtained by a right circular permutation of the Z × Z identity matrix. The permutation value defines the different submatrices. These values are called shift coefficients, and the set of shift coefficients is collected in a matrix called the expansion matrix, noted H_bm.
For an H matrix of size M × N, the related expansion matrix H_bm has size m × n, where m = M/Z and n = N/Z. The expansion matrix H_bm is expanded to the H matrix by replacing each negative shift coefficient with a Z × Z all-zero matrix, each zero shift coefficient with the Z × Z identity matrix, and each positive shift coefficient with a right circular permutation of the Z × Z identity matrix. Figure 2 shows an example of an H_bm expansion matrix of size (24, 12) and expansion factor Z = 96, expanded to the WiMAX H matrix of size (2304, 1152). Each positive shift coefficient is replaced by a right circular permutation of the 96 × 96 identity matrix, each "0" element is replaced by the 96 × 96 identity matrix, and each "−1" element is replaced by a 96 × 96 all-zero matrix.
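The expansion rule above can be sketched in a few lines of Python (illustrative only; the base matrix below is a made-up 2 × 2 example, not the WiMAX H_bm):

```python
def expand_Hbm(Hbm, Z):
    """Expand a base matrix of shift coefficients into the binary H matrix:
    -1 -> Z x Z all-zero block; shift s >= 0 -> Z x Z identity matrix
    circularly right-shifted by s columns."""
    m, n = len(Hbm), len(Hbm[0])
    H = [[0] * (n * Z) for _ in range(m * Z)]
    for i in range(m):
        for j in range(n):
            s = Hbm[i][j]
            if s >= 0:
                for r in range(Z):
                    # Row r of the right-shifted identity has its single 1
                    # in column (r + s) mod Z.
                    H[i * Z + r][j * Z + (r + s) % Z] = 1
    return H

# Hypothetical 2 x 2 base matrix with Z = 3: one identity block,
# one all-zero block, and two shifted-identity blocks.
H = expand_Hbm([[0, -1], [1, 2]], 3)
```

Each nonnegative coefficient contributes exactly Z ones, so the sparsity of H is controlled entirely by the base matrix.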

LLR BP Decoding for LDPC Codes.
The LLR BP decoding is based on the belief propagation of log-likelihood ratio (LLR) messages between connected nodes. The LLR value evaluates the ratio between the probabilities of a binary random variable being 0 or 1.
The C2V messages m_{c_i→v_j} propagated from c_i to v_j are initialized to zero, and the V2C messages m_{v_j→c_i} propagated from v_j to c_i are initialized as follows:

m_{v_j→c_i} = L(v_j) = ln( p(v_j = 0 | y_j) / p(v_j = 1 | y_j) ),    (1)

where y_j denotes the channel information of the j-th variable node, v_j denotes the j-th code-word bit, and L and p denote, respectively, the LLR value and the conditional probability. After initialization, each iteration mainly consists of two intensive processing blocks, horizontal and vertical. In the horizontal processing, the algorithm updates the messages propagated from each check node to each variable node. The updated C2V messages m_{c_i→v_j} are generated according to

m_{c_i→v_j} = 2 tanh⁻¹( ∏_{v_a ∈ N(c_i)\v_j} tanh( m_{v_a→c_i} / 2 ) ),    (2)

where N(c_i)\v_j denotes the neighboring variable nodes connected to check node c_i, excluding variable node v_j. In the vertical processing, the updated V2C messages m_{v_j→c_i} are calculated as follows:

m_{v_j→c_i} = L^(0)(v_j) + Σ_{c_a ∈ N(v_j)\c_i} m_{c_a→v_j},    (3)

where N(v_j)\c_i denotes the neighboring check nodes connected to variable node v_j, excluding check node c_i.
After the generation and propagation of all the updated V2C and C2V messages, a hard decision on the variable node v_j is made, based on the new LLR value L(v_j) calculated as follows:

L(v_j) = L^(0)(v_j) + Σ_{c_a ∈ N(v_j)} m_{c_a→v_j},    (4)

where N(v_j) denotes all the neighboring check nodes connected to variable node v_j. The iterative decoding process does not stop until all the check equations are satisfied, i.e., H · v^T = 0, or the predefined maximum number of iterations is reached.
The decoding complexity can be significantly reduced thanks to various algorithms that simplify the C2V message updates m_{c_i→v_j}. The most widely used in recent works are the min-sum (MSA) and normalized min-sum (NMSA) algorithms [19-21]. For the MSA, the update equation becomes

m_{c_i→v_j} = ∏_{v_a ∈ N(c_i)\v_j} sign(m_{v_a→c_i}) · min_{v_a ∈ N(c_i)\v_j} |m_{v_a→c_i}|.    (5)

For the NMSA, this value is normalized by a factor α, where α < 1:

m_{c_i→v_j} = α · ∏_{v_a ∈ N(c_i)\v_j} sign(m_{v_a→c_i}) · min_{v_a ∈ N(c_i)\v_j} |m_{v_a→c_i}|.    (6)

These two algorithms are mainly based on the determination of the first and second minima of the magnitudes of the V2C messages, noted min1 and min2. The NMSA improves the correction performance compared to the MSA approximation.
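The min1/min2 update can be illustrated with a small, self-contained Python sketch (illustrative only; the paper's implementation targets C on ARM). For each outgoing edge, the magnitude is min1, or min2 for the edge that itself holds the overall minimum, and the sign product excludes the edge's own contribution; setting alpha = 1 recovers the plain MSA of equation (5):

```python
def nmsa_check_update(v2c, alpha=0.8):
    """NMSA check-node update (equation (6)) for one check node.
    v2c: incoming V2C messages on the node's edges.
    alpha: normalization factor < 1 (0.8 is an arbitrary illustrative value).
    Returns the outgoing C2V message for each edge."""
    mags = [abs(m) for m in v2c]
    min1 = min(mags)                       # smallest magnitude
    idx1 = mags.index(min1)                # edge holding the minimum
    min2 = min(mags[:idx1] + mags[idx1 + 1:])  # second-smallest magnitude
    total_sign = 1
    for m in v2c:
        if m < 0:
            total_sign = -total_sign       # product of all incoming signs
    c2v = []
    for k, m in enumerate(v2c):
        mag = min2 if k == idx1 else min1  # exclude the edge's own magnitude
        sign = total_sign * (1 if m >= 0 else -1)  # divide out its own sign
        c2v.append(alpha * sign * mag)
    return c2v
```

For example, with incoming messages [2.0, −1.0, 4.0] and alpha = 1, the outgoing messages are [−1.0, 2.0, −1.0]: each output uses the minimum magnitude and sign product of the *other* two edges.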
LLR BP decoding is typically performed with the flooding schedule, where all check-to-variable (C2V) messages are updated in the horizontal step and, subsequently, all variable-to-check (V2C) messages are updated in the vertical step.
However, convergence is slowed down because the latest information updated in the current iteration can only be used in the next iteration. To speed up convergence and increase error-correction performance, sequential scheduling methods have been proposed, with a predetermined and fixed sequence of updates. This sequential scheduling strategy differs from flooding in that the last updated information is used within the current iteration. Shuffled [22, 23] and dynamic [16, 24] scheduling are two variants of this strategy and allow decoding to converge about twice as fast as flooding scheduling.
In order to obtain better BER convergence performance, the NMSA algorithm is combined with shuffled scheduling, which accelerates decoding convergence. Algorithm 1 depicts the pseudocode of this combination. Initialization of the C2V and V2C messages is carried out (line 1) according to equation (1), and the maximum number of iterations is set (line 2). The C2V computation is performed in the horizontal processing according to equation (6), followed by the V2C and new LLR calculations in the vertical processing according to equations (3) and (4). The hard decision is made based on the new LLR value (line 22): if the sign of the LLR value is positive, the code-word bit is set to 1; otherwise, it is set to 0. Once the estimated code word is obtained, the syndrome is computed to evaluate whether a valid code word has been found (line 24); otherwise, a new decoding iteration is started (line 2), and the iterative decoding process does not stop until a valid code word is found or the predefined maximum number of iterations is reached.

Proposed Parallel Software Implementation
Before going further, a profiling procedure using the Valgrind profiler [25] is first executed to determine the total execution CPU time of each block of the LDPC decoder. The blocks that require the most computing time are then identified and considered as the most suitable candidates for optimization.
The memory size requirement and the run time spent in write/read memory accesses depend on the data structure adopted for both the H matrix and the message storage. The data structure used in previous works [10-14] is used in this run-time analysis. In this representation, the H matrix is stored as two separate two-dimensional arrays: the first contains the column indexes of each matrix row and is used for the horizontal processing, and the second contains the row indexes of each matrix column and is used for the vertical processing. Some values in these two-dimensional arrays are set to −1 because the arrays represent an irregular LDPC code, which has different column and row weights. Therefore, the size of the first two-dimensional array is M × d_cmax and the size of the second one is N × d_vmax, where d_cmax and d_vmax are, respectively, the maximum number of ones per row (d_c) and the maximum number of ones per column (d_v).
For the message storage, two arrays of size E are used for the C2V and V2C message updates, and two other arrays of size N are used for the initial and updated LLR values.
Therefore, the memory size required for this data structure is calculated as below (equation (7)). The first part of the equation concerns the H matrix storage, and the second part concerns the message storage:

Memory_size = (M × d_cmax + N × d_vmax) × sizeof(int) + (2E + 2N) × sizeof(float),    (7)

where E is the total number of edges in the entire Tanner graph, N is the number of variable nodes, M is the number of check nodes, and sizeof() is the operator that gives the amount of storage, in bytes, required to store an object of the operand's type.

The profiling results of Tables 1 and 2 are reported in Figure 3 in terms of CPU percentage cycles. Figure 3(a) illustrates that the V2C message update block is the most time-consuming module (66%), followed by the C2V message update block, which takes 28.3% of the execution time. The higher percentage of run time taken by the V2C block is explained by the larger number of V2C update operations. Figure 3(b) shows that 62% of the CPU time is spent on memory accesses (49% for data reads and 13% for data writes), 36% on instructions, and only 2% on memory accesses outside the cache memory. The high percentage of memory accesses can be explained by the separated processing of the horizontal and vertical stages: for the same data, memory is accessed in read mode in the horizontal stage using the column-mapped matrix table and in write mode in the vertical stage using the row-mapped matrix table, and vice versa.

(1) Initialize all m_{c_i→v_j} = 0, m_{v_j→c_i} = L^(0)(v_j), and Itermax
(2) for i = 1 to Itermax do
    Horizontal processing (C2V computation):
(3)   for each check node c_i
(4)     for each variable node v_j connected to c_i
(5)       Calculate min1 & min2
(6)     end for line 4
(7)     for each variable node v_j connected to c_i
(8)       Update m_{c_i→v_j} according to equation (6)
(9)     end for line 7
    Vertical processing (V2C computation) and LLR update:
(10)    tmp = 0
(11)    for each variable node v_j connected to c_i
(12)      for every check node c_a connected to v_j
(13)        tmp = tmp + m_{c_a→v_j}
(14)      end for line 12
(15)      L(v_j) = L^(0)(v_j) + tmp
(16)      for every check node c_a connected to v_j
(17)        m_{v_j→c_a} = L(v_j) − m_{c_a→v_j}
(18)      end for line 16
(19)    end for line 11
(20)  end for line 3
    Hard decision:
(21)  for each variable node
(22)    if sign(L(v_j)) > 0 then v_j = 1; else v_j = 0
(23)  end for line 21
    Parity check equations (syndrome) and stopping criterion:
(24)  if H · v^T = 0 then break else i = i + 1
(25) end for line 2

ALGORITHM 1: Horizontal shuffled NMSA.
According to these analysis results, the V2C computation block is chosen for optimization in order to decrease the memory accesses and instructions required in the decoding process.
Unlike previous works [10-14], in this paper we propose an optimized version of the horizontal shuffled scheduling NMSA algorithm that merges the horizontal and vertical steps into a single step, thereby decreasing the memory accesses and minimizing the number of arithmetic operations.
We also propose a compact data structure representing the H matrix that is organized by processing order and is suitable for parallel implementation, in order to take advantage of multicore platforms for lower decoding latency.

Proposed Optimization.
Separate data processing increases the number of memory accesses to the same data. To solve this problem, an optimized algorithm is proposed that computes all the message updates (C2V, V2C, and L(v_j)) corresponding to the data already loaded into the CPU in the same step, before loading the next data. This takes advantage of the temporal locality of the cache memory and performs all computations on the currently fetched data.
From equations (3) and (4), we can observe that the initial LLR value of the variable node and the current iteration's V2C message are both contained in the current iteration's LLR value, so the V2C message can be calculated as the difference between the variable node's LLR value and the connected C2V message:

m_{v_j→c_i} = L(v_j) − m_{c_i→v_j}.    (8)

Also, as presented in Figure 4, once one C2V message is updated, L(v_j) can be updated directly by replacing the old C2V value with the new one. Since the difference between the old LLR value and the old C2V value equals the old V2C value (equation (8)), the updated L(v_j) is calculated according to the following equation:

L(v_j)_new = L(v_j)_old − m_{c_i→v_j}^old + m_{c_i→v_j}^new.    (9)

Algorithm 2 depicts the pseudocode of the proposed optimization. Initialization of the C2V messages and LLR values is carried out (line 1) according to equation (1); then, for each check node, the processor loads the old C2V and LLR values of each variable node connected to the check node (lines 3 to 4). The min1 and min2 values are calculated from the differences between L(v_j) and the loaded C2V messages (lines 5 and 6); then the C2V and L(v_j) updates are computed in the same loop (lines 9, 10, and 11) before passing to the next check node. The hard-decision and syndrome blocks are the same as in Algorithm 1.
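The merged per-check-node step can be sketched as follows. This is an illustrative Python version (the paper's actual implementation is C/OpenMP); it recovers the V2C messages by subtraction (equation (8)), runs the NMSA min1/min2 update, and patches each LLR in place (equation (9)) without ever storing V2C:

```python
def process_check_node(llr, c2v_old, cols, alpha=0.8):
    """One check node of the merged (single-step) NMSA update.
    llr: current LLR per variable node, updated in place;
    c2v_old: previous-iteration C2V messages on this node's edges;
    cols: variable-node indexes of the edges (the Col table);
    alpha: normalization factor (0.8 is an arbitrary illustrative value).
    Returns the new C2V messages for this check node."""
    # Equation (8): recover V2C on the fly instead of storing it.
    v2c = [llr[j] - c for j, c in zip(cols, c2v_old)]
    mags = [abs(m) for m in v2c]
    min1 = min(mags)
    idx1 = mags.index(min1)
    min2 = min(mags[:idx1] + mags[idx1 + 1:])
    total_sign = 1
    for m in v2c:
        if m < 0:
            total_sign = -total_sign
    c2v_new = []
    for k, (j, old) in enumerate(zip(cols, c2v_old)):
        mag = min2 if k == idx1 else min1
        sign = total_sign * (1 if v2c[k] >= 0 else -1)
        new = alpha * sign * mag
        llr[j] += new - old      # equation (9): swap old C2V for new C2V
        c2v_new.append(new)
    return c2v_new
```

A full iteration would simply call this once per check node (in shuffled order), so every LLR read is immediately followed by all the work that depends on it, which is the cache-locality argument made above.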

Proposed Data Structure.
The proposed data structure is generated by scanning the H matrix in row-major order and sequentially recording the column index of each nonzero element of H. These column indexes are collected and stored in consecutive memory positions inside a table of size E, noted Col.
To take advantage of the spatial locality of the cache memory, the C2V messages are also mapped in memory in row-major order, in consecutive positions. In this way, each element of the Col table records the address of the corresponding C2V value. With the proposed algorithm, the V2C messages are not stored, because they are calculated during the decoding process using equation (8).
Figure 5 shows the different arrays required by the proposed algorithm, using as an example the matrix shown in Figure 1. The memory accesses to the C2V and Col tables are direct, because the C2V messages and column indexes are stored in memory following the execution order (row-major order); the memory access to the L(v_j) values goes through the Col table. For example, the check node c_1 contains 4 nonzero elements: the first one is connected to the third variable node (Col[1] = 3), the second one is connected to the sixth variable node (Col[2] = 6), and so on. The memory size required for this data structure is calculated by the following equation:

Memory_size = E × sizeof(int) + (E + N) × sizeof(float).    (10)

The first part of the equation concerns the H matrix storage (the Col table), and the second part concerns the message storage (C2V[] and L(v_j)). It is clear from equations (7) and (10) that the memory size required for the proposed data structure is about 50% lower than that of the data structure used in [10-14]; consequently, the run time required for memory accesses is greatly reduced.
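Building the Col table amounts to a row-major scan of H; a minimal Python sketch (the `row_ptr` offsets are my addition, a standard compressed-sparse-row device, so each check node can find its own slice of Col and C2V):

```python
def build_col_table(H):
    """Scan H in row-major order and collect the column index of every
    nonzero entry into one flat Col table of size E. row_ptr[i]:row_ptr[i+1]
    delimits the slice of Col (and of the parallel C2V array) belonging
    to check node i."""
    col, row_ptr = [], [0]
    for row in H:
        for j, h in enumerate(row):
            if h:
                col.append(j)
        row_ptr.append(len(col))
    return col, row_ptr

# Tiny illustrative matrix: 2 check nodes, 6 variable nodes, E = 4 edges.
col, row_ptr = build_col_table([
    [0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0],
])
```

Storage is then one integer array of size E (Col) plus one message array of size E (C2V) and one of size N (L(v_j)), which is the count behind equation (10).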

Parallel Computational Models.
Applied Computational Intelligence and Soft Computing

The complexity can also be greatly reduced depending on the multicore platform and the algorithm parallelism level, which is correlated to the data dependencies during the decoding process, allowing parallel memory access.

Figure 4: LLR value update.

(1) Initialize all m_{c_i→v_j} = 0, L(v_j) = y_j, and Itermax
(2) for i = 1 to Itermax do
    Horizontal processing:
(3)   for each check node c_i
(4)     for each variable node v_j connected to c_i
(5)       Calculate m_{v_j→c_i} = L(v_j) − m_{c_i→v_j} (equation (8))
(6)       Calculate min1 & min2
(7)     end for line 4
(8)     for each variable node v_j connected to c_i
(9)       m_{c_i→v_j}^new = α · ∏ sign(m_{v_a→c_i}) · (min1 or min2)
(10)      L(v_j) = L(v_j) − m_{c_i→v_j}^old + m_{c_i→v_j}^new (equation (9))
(11)      m_{c_i→v_j} = m_{c_i→v_j}^new
(12)    end for line 8
(13)  end for line 3
    Hard decision:
(14)  for each variable node
(15)    if L(v_j) > 0 then v_j = 1; else v_j = 0
(16)  end for line 14
    Parity check equations (syndrome) and stopping criterion:
(17)  if H · v^T = 0 then break else i = i + 1
(18) end for line 2

ALGORITHM 2: Proposed optimized NMSA.
The proposed data structure allows parallel execution because related data are grouped into consecutive memory locations; each check node has its own independent slice of the Col and C2V arrays, in consecutive order. However, the L(v_j) array is shared between all check nodes, and several kernels may read or write the same L(v_j) at the same time, which generates expensive synchronization overhead for memory accesses.
Therefore, parallel processing is allowed between k check nodes only if they do not share any variable nodes, i.e.,

N(c_p) ∩ N(c_q) = ∅ for all p ≠ q,    (11)

where N(c_i) denotes the set of variable nodes connected to check node c_i. The value k represents the algorithm parallelism level. For quasi-cyclic LDPC codes, each row of the H_bm expansion matrix is expanded to Z rows by replacing each shift coefficient with a Z × Z identity matrix or a circularly shifted Z × Z identity matrix, so the Z rows generated from one row of H_bm satisfy equation (11), because there is no data dependency among different rows of an identity matrix. Therefore, as depicted in Figure 6, Z threads are launched in parallel; each thread processes one H matrix row, the transition to the next Z rows is performed sequentially, and so on, until all M rows are processed. Since all Z rows generated from one row of the H_bm expansion matrix have the same d_c value, the Z threads processing the same H_bm row have almost exactly the same run time when calculating the message updates. Consequently, no synchronization is required between the Z threads. Table 3 reports the overall computation operations and memory accesses needed for both algorithms. Because of the separate horizontal and vertical processing used in previous works, the complexity of those algorithms is O(9E + 6·d_v·E), with a significant number of arithmetic operations and memory accesses in the vertical step, which increases the decoding complexity, while the complexity of the optimized algorithm is about O(9E). The d_v value is always at least two, because a variable node is connected to at least two check nodes. For the WiMAX H matrix shown in Figure 2, d_v is 4. Therefore, the proposed optimization clearly reduces complexity compared to previous works.
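The independence condition of equation (11) can be checked mechanically; a small Python sketch (illustrative, with a made-up matrix — in the QC-LDPC case the Z rows of one expanded H_bm row satisfy it by construction):

```python
def can_run_in_parallel(H, rows):
    """Equation (11): a set of check nodes (rows of H) may be processed
    in parallel only if no two of them share a variable node, i.e. their
    supports (sets of nonzero column positions) are pairwise disjoint."""
    seen = set()
    for r in rows:
        support = {j for j, h in enumerate(H[r]) if h}
        if seen & support:
            return False   # two check nodes touch the same L(v_j)
        seen |= support
    return True

# Hypothetical 3 x 4 example: rows 0 and 2 both touch variable node 0.
H = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
```

Rows 0 and 1 can form one parallel layer, while scheduling rows 0 and 2 together would create the read/write conflict on the shared L(v_j) described above.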

Experimental Results
4.1. CPU Run-Time Cycles Evaluation. Figure 7 compares the proposed LDPC decoder, in terms of CPU cycles, with the non-optimized horizontal shuffled NMSA algorithm for three codes. IR, DR, and DW correspond, respectively, to the CPU cycles spent on decoding instructions, on read memory accesses, and on write memory accesses. The proposed optimization reduces the total CPU cycles from 1.05 × 10⁷ to 3.45 × 10⁶ for the (9216, 4608) code, from 2.6 × 10⁶ to 8.5 × 10⁵ for the (2304, 1152) code, and from 6.5 × 10⁵ to 2.1 × 10⁵ for the (576, 288) code. The CPU run-time reduction percentages of the proposed optimization are presented in Table 4. The reduction is about 66%, which corresponds to a three-fold speedup in processing time compared to previous software LDPC decoders using separate horizontal and vertical steps.
The lower data read (DR) and data write (DW) memory access counts compared to previous work confirm that the memory footprint is compact and that the proposed optimization minimizes repeated accesses to the same data.

Latency Results on Multicore Platforms.
The proposed parallel implementation is launched on two different parallel processing platforms. The first is the Marwan HPC platform with an Intel Xeon Gold 6130 with 8 CPUs at 2.10 GHz [26]. The second is a quad-core Cortex-A72 (ARM v8) at 1.5 GHz; this platform runs a Linux distribution and is used in IoT solutions.
Figures 8 and 9 report the latency in milliseconds for serial and parallel processing for different code sizes. On the first platform, the speedup of parallel over serial processing is significant for the larger code sizes: approximately 2.8 for the (9216 × 4608) code and 2 for the (4608 × 2304) code.
However, for matrix sizes below (1152 × 576), parallel decoding performs worse than serial decoding. An important reason is that these matrices do not have many edges, so the computation of each thread takes less time than the run time required for OpenMP's thread creation and synchronization.
On the quad-core Cortex-A72 (ARM v8) platform, parallel processing shows a significant speedup over serial processing for matrix sizes larger than 576 × 288. In contrast to the first platform, even for the 1152 matrix we can notice a speedup of 2. This is mainly due to the clock speed of 1.5 GHz, lower than that of the first platform (2.1 GHz): the computation time on the quad-core Cortex-A72 platform is always higher than the time needed for thread creation and synchronization.
According to the results reported in Figures 8 and 9, we define thresholds for enabling parallelism, and the number of CPUs used is chosen dynamically depending on the code size.
Table 5 reports the latency results for several code sizes, obtained by a dynamic choice between serial and parallel processing using the "if" clause in OpenMP. The latency is 0.11 ms for the (576, 144) code and 0.74 ms for the largest code, with a bit error ratio between 3.4 × 10⁻⁹ and 7 × 10⁻¹⁰. On the quad-core Cortex-A72 platform, the latency is 0.15 ms for the (576 × 288) code (equivalent to 32 bytes), with a bit error ratio of 3.4 × 10⁻⁹, which clearly shows that the obtained latency amply meets the latency and reliability requirements of ultra-reliable low-latency communications.

Conclusion
In this paper, we presented a new software algorithm optimization aimed at decreasing the decoding latency while keeping the performance of the horizontal shuffled NMSA algorithm. In the optimized algorithm, the separate horizontal and vertical steps are replaced by a single step without storage of the V2C message updates, together with a compact data-structure representation of the matrix, allowing a net reduction of the memory size requirement by about 50% and an implementation on a multicore platform. The proposed algorithm reaches a bit error ratio of 3.4 × 10⁻⁹, and its complexity is reduced by 66%. In addition, the latency reaches 150 μs for 288 bits on the ARM quad-core Cortex-A72 system on chip (SoC) used in IoT projects and applications, showing that the proposed optimization behaves well in terms of BER, decoding complexity, and latency for ultra-reliable low-latency communication, even on IoT devices with very limited computing resources.
Based on this work and the obtained results, we can identify the following future perspectives: (i) increase the expansion coefficient per parity-check matrix in order to make the parallel sections wider and reduce the number of transitions between parallel sections, since threads are created at each entry of a parallel section, which increases the time required for thread creation; (ii) use a codesign methodology to implement the optimized NMSA algorithm in hardware (an FPGA circuit) and the shuffled scheduling and variable memory organization in software.

Figure 1: (a) H matrix with N = 12 variable nodes v_j and M = 9 check nodes c_i. (b) Corresponding Tanner graph. Figure 1 is reproduced from Benhayoun et al. [16] (under the Creative Commons Attribution License/public domain).

Figure 3: (a) CPU percentage cycles spent by code block. (b) CPU percentage cycles spent by memory accesses and instructions.

3.5. Complexity Analysis.
The complexity is evaluated for each iteration of the LDPC decoding, where one iteration means the process of selecting and updating all edges of the Tanner graph. The total number of edges in the entire Tanner graph is E = d_c·M = d_v·N, where d_c and d_v denote the average degrees of the check nodes and variable nodes, respectively.
As depicted in Algorithm 1, the NMSA horizontal shuffled algorithm, the horizontal processing involves the calculation of the min1 and min2 values (line 5), which requires 2·d_c·M = 2E comparisons and 2·d_c·M = 2E read memory accesses. The calculation of the C2V message updates (line 8) requires d_c·M = E multiplications and d_c·M = E write memory accesses. The vertical processing involves the summation of the C2V messages (line 13), which requires d_v·d_c·M = d_v·E additions and d_v·d_c·M = d_v·E read memory accesses; the LLR update (line 15), which requires d_c·M = E additions, d_c·M = E read memory accesses, and d_c·M = E write memory accesses; and, finally, the V2C message updates (line 17), which require d_v·d_c·M = d_v·E subtractions, 2·d_v·d_c·M = 2·d_v·E read memory accesses, and d_v·d_c·M = d_v·E write memory accesses.
As depicted in Algorithm 2, for each check node, the processor loads the (i−1)-th iteration's C2V messages and L(v_j) values, and all update calculations corresponding to the loaded data are done before passing to the next check node. The min1 and min2 values are calculated from the differences between L(v_j) and the loaded C2V messages (lines 5 and 6), which requires d_c·M = E subtractions, 2·d_c·M = 2E read memory accesses, and 2·d_c·M = 2E comparisons. Then, the C2V and L(v_j) updates are calculated in the same loop (lines 9, 10, and 11), which requires d_c·M = E multiplications by the normalization factor α (line 9), d_c·M = E additions for the L(v_j) update (line 10), d_c·M = E write memory accesses for writing the L(v_j) value (line 10), and d_c·M = E write memory accesses for the C2V value (line 11).
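Summing those per-line counts gives the two complexity expressions quoted above, which can be encoded in a trivial sketch (function and parameter names are mine, not the paper's):

```python
def ops_per_iteration(E, dv, merged):
    """Approximate per-iteration total of arithmetic operations plus
    memory accesses, following the counts in the complexity analysis:
    separated horizontal/vertical algorithm -> 9E + 6*dv*E
    (6E horizontal + 3E + 6*dv*E vertical);
    merged single-step algorithm            -> 9E."""
    return 9 * E if merged else 9 * E + 6 * dv * E

# For a WiMAX-like column degree dv = 4, the ratio (9 + 24)/9 ≈ 3.7,
# the same order as the measured ~3x run-time speedup.
ratio = ops_per_iteration(1000, 4, merged=False) / ops_per_iteration(1000, 4, merged=True)
```

The ratio grows with d_v, so the gain of the merged step is larger for codes with higher-degree variable nodes.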

Figure 8: Latency comparison between serial and parallel processing on the first platform for different code sizes.

Table 1 reports the overall CPU run time in cycles spent in the different algorithm blocks depicted in Algorithm 1. Table 2 reports the same time broken down by arithmetic operations, data memory accesses in read/write mode, and the cycles spent searching for data outside the cache.

Table 1: CPU run time in cycles spent by algorithm block.

Table 2: CPU run time in cycles spent by instructions and memory accesses.

Table 3: Computation operations and memory accesses per iteration.

Table 4: CPU run-time reduction compared to previous works.