LDPC Decoding on GPU for Mobile Device

This paper proposes a flexible software LDPC decoder that exploits data parallelism to decode multiple code words simultaneously on a mobile device, using multithreading on an OpenCL-based graphics processing unit (GPU). By dividing the check matrix into several parts to make full use of both the local memory and the private memory of the GPU, and by properly adjusting the code capacity of each decoding cycle, our implementation on a mobile phone achieves throughputs above 100 Mbps with a decoding delay of less than 1.6 ms, which makes high-speed communication such as video calling possible. Efficient software LDPC decoding on the mobile device could replace the LDPC decoding function of the communication baseband chip, saving cost and making it easier to upgrade the decoder for compatibility with a variety of channel access schemes.


Introduction
Low-density parity-check (LDPC) codes are linear block error-correcting codes, proposed by Gallager in 1962 [1] and rediscovered by MacKay and Neal in 1996 [2]. They take their name from their sparse parity-check matrix. LDPC codes are capacity-approaching codes: practical constructions exist that allow the noise threshold to be set very close to the Shannon limit for a symmetric memoryless channel.
The good performance of LDPC codes comes at the cost of a very large amount of computation. LDPC decoding, however, is highly parallel. Current commercial LDPC decoders are implemented in hardware, which supports only a few specific LDPC codes at a time and is difficult to upgrade. Many studies use FPGAs to realize efficient LDPC decoders [3,4]. With the rapid development of desktop graphics processing units (GPUs), there has also been much research on LDPC decoding with the CUDA framework [5,6]. LDPC codes are widely used in the fourth generation of mobile telecommunications technology, which makes it significant to develop efficient software LDPC decoding on the mobile device. At the same time, a software LDPC decoder can dynamically change its parameters, including code length, code rate, and number of iterations, to adapt quickly to all kinds of network environments.
Open Computing Language (OpenCL) [7] is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, DSPs, FPGAs, and other processors or hardware accelerators. The technical specification was reviewed by the Khronos members and approved for public release in 2008. Compute Unified Device Architecture (CUDA) [8] also enables developers to write parallel computing programs for desktop GPUs. OpenCL appeared later, but it supports more scenarios. With the rapid development of mobile devices, many of them, especially mobile phones, now have their own high-performance GPUs. Vendors such as Qualcomm, Imagination (PowerVR), ARM, and Vivante are beginning to support OpenCL on their mobile GPUs [9], which makes it easier to develop GPU-based parallel computing programs on mobile devices. In this article, we develop an LDPC decoder on a mobile GPU based on OpenCL. The global memory of a mobile GPU is limited, however, so the performance is not as good as on a desktop GPU. We improve the decoding by making full use of the local memory of each compute unit and the private memory of each processing unit. At the same time, we properly reduce the number of threads per code word and add code words to the decoding process, which yields better performance. In our experiments, the best throughput of the decoder reached 160 Mbps, which can satisfy current mobile wireless communication in many cases, and the delay is less than 2 milliseconds (ms), which can satisfy many real-time applications such as video calling.

MSA for LDPC Decoding
The belief propagation (BP) algorithm is an important message-passing algorithm, often used in the field of artificial intelligence [9]. The algorithm passes belief information between nodes. For example, the belief information sent from bit node $\mathrm{BN}_n$ to check node $\mathrm{CN}_m$ depends on the observation of $\mathrm{BN}_n$ and on all the check nodes connected to $\mathrm{BN}_n$ except $\mathrm{CN}_m$. Similarly, the belief information sent from check node $\mathrm{CN}_m$ to bit node $\mathrm{BN}_n$ depends on the observation of $\mathrm{CN}_m$ and on all the bit nodes connected to $\mathrm{CN}_m$ except $\mathrm{BN}_n$. As a BP algorithm, the Min-Sum Algorithm (MSA) is a very efficient LDPC decoding algorithm [10]. It is based on belief propagation between the nodes connected by the edges of the Tanner graph [11]. Figure 1 shows the Tanner graph of a particular 4 × 8 $H$ matrix. The MSA, proposed by Gallager, operates in the logarithmic probability domain.
An LDPC code is a special form of linear $(N, K)$ block code, defined by a sparse binary parity-check matrix $H$ of dimension $M \times N$, where $M = N - K$. We assume that the channel is an additive white Gaussian noise (AWGN) channel with mean 0 and variance $\sigma^2$. BPSK modulation maps a code word $c = (c_1, c_2, \ldots, c_N)$ onto the sequence $x = (x_1, x_2, \ldots, x_N)$ according to $x_n = (-1)^{c_n}$. The received sequence is $y = (y_1, y_2, \ldots, y_N)$, with $y_n = x_n + w_n$, where $w_n$ is the noise sample. Given the received value $y_n$, the logarithmic a priori probability of $x_n$ is denoted $Lp_n$. The MSA is shown in Figure 2. Before entering the loop iteration, we use the received sequence $y$ to initialize the prior probabilities of $\mathrm{BN}_n$. In this algorithm, we do not compute the posterior probabilities of $\mathrm{BN}_n$ and $\mathrm{CN}_m$ directly; instead, we compute the messages passed between the bit nodes and check nodes, as well as the posterior probabilities before hard decoding.
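As a minimal sketch of the channel model above, the following Python fragment maps a code word to BPSK symbols and initializes the prior log-likelihood ratios. The expression 2y/σ² is the standard log-likelihood ratio for BPSK over an AWGN channel and is assumed here rather than taken from the paper; the function names are illustrative only.

```python
import random

def bpsk_map(codeword):
    """Map bits to BPSK symbols: x_n = (-1)^(c_n), so 0 -> +1 and 1 -> -1."""
    return [1.0 if bit == 0 else -1.0 for bit in codeword]

def awgn(symbols, sigma, rng=random):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [x + rng.gauss(0.0, sigma) for x in symbols]

def channel_llr(received, sigma):
    """Prior log-likelihood ratios, assuming the standard form 2*y/sigma^2."""
    return [2.0 * y / (sigma * sigma) for y in received]

# Illustrative run with a fixed seed so that the noise is reproducible.
x = bpsk_map([0, 1, 1, 0])
y = awgn(x, sigma=0.5, rng=random.Random(1))
lp = channel_llr(y, sigma=0.5)
```

A positive prior value favors bit 0 and a negative value favors bit 1, which matches the mapping $x_n = (-1)^{c_n}$.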
In the step that updates the message from $\mathrm{CN}_m$ to $\mathrm{BN}_n$, in the $i$th iteration, $H$ is accessed in row-major order, and $Lr^{(i)}_{m,n}$, the message sent from $\mathrm{CN}_m$ to $\mathrm{BN}_n$, is updated from all the bit nodes connected to $\mathrm{CN}_m$ in the Tanner graph except $\mathrm{BN}_n$. This update is called the minimum step. Using the $H$ matrix and Tanner graph in Figure 1, for instance, $Lr^{(i)}_{0,0}$ is updated by $\mathrm{BN}_1$ and $\mathrm{BN}_2$, as in Figure 3. The posterior probability of $\mathrm{BN}_n$, denoted $LQ^{(i)}_n$, is updated from the prior probability of $\mathrm{BN}_n$ and all the check nodes connected to $\mathrm{BN}_n$. Similarly, in the step that updates the message from $\mathrm{BN}_n$ to $\mathrm{CN}_m$, in the $i$th iteration, $Lq^{(i)}_{n,m}$, the message sent from $\mathrm{BN}_n$ to $\mathrm{CN}_m$, is updated from all the check nodes connected to $\mathrm{BN}_n$ in the Tanner graph except $\mathrm{CN}_m$. This update is called the sum step.
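For reference, the min-sum updates are commonly written as below. This is the textbook form, offered as an assumption and not as the paper's exact formulation. The notation is introduced here: $Lr_{m,n}$ is the message from $\mathrm{CN}_m$ to $\mathrm{BN}_n$, $Lq_{n,m}$ the message from $\mathrm{BN}_n$ to $\mathrm{CN}_m$, $Lp_n$ the prior, $LQ_n$ the posterior, $N(m)$ the set of bit nodes connected to $\mathrm{CN}_m$, and $M(n)$ the set of check nodes connected to $\mathrm{BN}_n$.

```latex
\begin{align}
Lr^{(i)}_{m,n} &= \prod_{n' \in N(m)\setminus n} \operatorname{sign}\!\left(Lq^{(i-1)}_{n',m}\right)
                  \cdot \min_{n' \in N(m)\setminus n} \left| Lq^{(i-1)}_{n',m} \right|
                  && \text{(minimum step)} \\
LQ^{(i)}_{n}   &= Lp_n + \sum_{m \in M(n)} Lr^{(i)}_{m,n}
                  && \text{(posterior)} \\
Lq^{(i)}_{n,m} &= Lp_n + \sum_{m' \in M(n)\setminus m} Lr^{(i)}_{m',n}
                  && \text{(sum step)}
\end{align}
```

For the example in the text, where the message from $\mathrm{CN}_0$ to $\mathrm{BN}_0$ is updated by $\mathrm{BN}_1$ and $\mathrm{BN}_2$, the first line reduces to $Lr^{(i)}_{0,0} = \operatorname{sign}(Lq^{(i-1)}_{1,0})\,\operatorname{sign}(Lq^{(i-1)}_{2,0})\,\min(|Lq^{(i-1)}_{1,0}|, |Lq^{(i-1)}_{2,0}|)$.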
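Using the Tanner graph in Figure 1, for instance, $Lq^{(i)}_{0,0}$, the message from $\mathrm{BN}_0$ to $\mathrm{CN}_0$ in the $i$th iteration, is updated by $\mathrm{CN}_2$ alone, as in Figure 4. The steps of updating the messages $Lq^{(i)}_{n,m}$ and the posterior probabilities $LQ^{(i)}_n$ can be exchanged: if $LQ^{(i)}_n$ is updated first, its result can be reused to update $Lq^{(i)}_{n,m}$, which avoids repeated computation. The final hard decoding is performed at the end of an iteration.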
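The saving from computing the posterior first can be made explicit. In the sketch below, $LQ_n$ denotes the posterior of $\mathrm{BN}_n$, $Lr_{m,n}$ the message from $\mathrm{CN}_m$ to $\mathrm{BN}_n$, and $Lq_{n,m}$ the message from $\mathrm{BN}_n$ to $\mathrm{CN}_m$; these names are notation assumed here. The hard-decision rule shown is the usual one for the mapping $x_n = (-1)^{c_n}$ and is likewise an assumption, not a quotation from the paper.

```latex
\begin{align}
Lq^{(i)}_{n,m} &= LQ^{(i)}_{n} - Lr^{(i)}_{m,n} \\
\hat{c}_n &= \begin{cases} 1, & LQ^{(i)}_{n} < 0 \\ 0, & \text{otherwise} \end{cases}
\end{align}
```

Each outgoing bit-node message then costs one subtraction instead of a fresh sum over all neighboring check nodes.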
The iteration procedure stops when the decoded word $\hat{c}$ satisfies all parity-check equations, $\hat{c} H^{T} = 0$, or when the maximum number of iterations is reached.
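The stopping test can be sketched in a few lines of Python. The function name and the toy matrix are illustrative assumptions; the check itself is the parity condition stated above, evaluated over GF(2).

```python
def satisfies_parity_checks(H, word):
    """Return True when word * H^T = 0 over GF(2), i.e. every check row has even parity."""
    return all(sum(h * c for h, c in zip(row, word)) % 2 == 0 for row in H)

# Toy 2 x 4 parity-check matrix, chosen only for illustration.
H = [[1, 1, 0, 1],
     [0, 1, 1, 1]]

valid = satisfies_parity_checks(H, [1, 0, 1, 1])    # both rows sum to an even number
invalid = satisfies_parity_checks(H, [1, 0, 0, 0])  # the first row has odd parity
```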
The decoder is implemented with a flooding schedule [12]. It guarantees that bit nodes do not interfere with each other while they are updated, and that check nodes do not interfere with each other either. This principle allows truly parallel execution of the MSA for LDPC decoding based on the stream-based computing method.
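To illustrate the flooding schedule, here is a minimal sequential sketch of a min-sum decoder in Python. It is not the paper's OpenCL kernel: all names are hypothetical, the update rules are the textbook min-sum form, and every check node is assumed to have at least two neighbors. Within each step, every node reads only values produced by the previous step, which is the property that permits parallel execution.

```python
def min_sum_decode(H, lp, max_iter=10):
    """Flooding min-sum decoder. H: list of 0/1 rows; lp: prior LLR per bit node."""
    rows = [[n for n, h in enumerate(row) if h] for row in H]  # bit nodes of each check
    lq = {(n, m): lp[n] for m, cols in enumerate(rows) for n in cols}
    word = [1 if v < 0 else 0 for v in lp]
    for _ in range(max_iter):
        # Horizontal step: each check node uses only the previous bit-to-check messages.
        lr = {}
        for m, cols in enumerate(rows):
            for n in cols:
                others = [lq[(k, m)] for k in cols if k != n]
                sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
                lr[(m, n)] = sign * min(abs(v) for v in others)
        # Posterior first, then bit-to-check messages by subtraction (vertical step).
        posterior = list(lp)
        for (m, n), v in lr.items():
            posterior[n] += v
        for (m, n), v in lr.items():
            lq[(n, m)] = posterior[n] - v
        # Hard decision and parity check; stop early when all checks are satisfied.
        word = [1 if v < 0 else 0 for v in posterior]
        if all(sum(word[n] for n in cols) % 2 == 0 for cols in rows):
            break
    return word

# (7,4) Hamming parity-check matrix used as a small stand-in for an LDPC matrix.
H = [[1, 1, 0, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]
# All-zero code word with one unreliable, wrongly signed sample at position 2.
decoded = min_sum_decode(H, [2.0, 2.0, -0.5, 2.0, 2.0, 2.0, 2.0])
```

In this run the two checks that involve bit 2 both push its posterior back to a positive value, so the error is corrected in the first iteration.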

OpenCL for Mobile GPU
Modern GPUs offer very high parallel computing ability and a programmable pipeline. The stream processors of a GPU can perform general-purpose computation [13]. A GPU delivers higher floating-point performance than a CPU, especially for single instruction, multiple data (SIMD) workloads and compute-intensive tasks in which data processing takes far more time than data scheduling and data transmission [14].
Unlike the dedicated GPU for desktop computers, a mobile GPU is typically integrated into an application processor, which also includes a multicore CPU, an image processing engine, DSPs, and other accelerators [15].Recently, modern mobile GPUs such as the Qualcomm Adreno GPU [16], the Imagination PowerVR GPU, ARM Mali, and GPGPU on Vivante tend to integrate more compute units in a chip.Mobile GPUs have gained general-purpose parallel computing capability thanks to the multicore architecture and emerging frameworks such as OpenCL, and they are likely to offer flexibility similar to vendor specific solutions designed for desktop computers, such as CUDA of Nvidia.
In OpenCL, parallel jobs are divided into work-groups, each consisting of many work-items, which are the basic processing units that execute a kernel in parallel. OpenCL defines a hierarchical memory model: a large global memory with long latency, and a small but fast local memory shared by the work-items of the same work-group. In addition, each work-item has its own private memory, which is not shared with other work-items and is the fastest to access.

Mobile Information Systems
To use the limited computation resources of a mobile processor efficiently and fully, we partition the tasks between the CPU and the GPU, explore the algorithmic parallelism, and carefully consider memory access optimization.
Handling a wide variety of tasks on embedded platforms is becoming a trend. The OpenCL specification describes an embedded profile, a subset of the full specification for handheld and embedded platforms.
The OpenCL embedded profile has some restrictions; for instance, support for 3D images is optional, and 64-bit integers are not required to be supported. The details of the OpenCL embedded profile can be found on the Khronos website [17].
Despite these restrictions, OpenCL can be used to accelerate programs on mobile devices. When compute-intensive work on the mobile device is transferred to the GPU or to other devices supporting OpenCL, not only can these tasks be performed more efficiently, but the CPU can also handle more of the tasks that it is good at. LDPC decoding is a typical compute-intensive task.

Parallel MSA LDPC Decoding on Mobile GPU
The MSA is computationally intensive and should be processed on a dedicated high-performance computing engine or on a highly parallel programmable device. On the mobile device, the GPU is a good choice. In this general model, supported by the GPU through OpenCL, kernels execute in parallel on several multiprocessors. Each multiprocessor is composed of several cores that dispatch multiple threads. This section shows a parallel scheme that stores the information of matrix $H$ in the work-items. To save private memory, each work-item keeps only the compressed information related to its own computation. After that, the specific parallel algorithm of the OpenCL kernel is introduced. Given an $(N, K)$ LDPC code, it is important to organize the computation so as to reduce the overhead of parallel programming. Because each work-item updates the messages of a whole row, $H_{BN}$, the compact data structure used in the horizontal processing step, does not need to be accessed by any other work-item.
The $H_{CN}$ data structure is used in the vertical processing step. It can be defined as a sequential representation of the edges associated with the nonnull values in $H$. It is generated by scanning the $H$ matrix in column-major order. $H_{CN}$ is also saved in private memory. Because each work-item updates the messages in its $N/M$ neighboring rows, $H_{CN}$ does not need to be accessed by other work-items either.
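The two compact structures can be sketched as adjacency lists. This is a simplified assumption about their layout, since the exact memory format of the row-wise structure (called $H_{BN}$ here) and the column-wise structure (called $H_{CN}$ here) is not reproduced in the text; the function names are hypothetical.

```python
def compact_bn(H):
    """Row-major scan: for each check node (row), the bit-node indices of its nonzero entries."""
    return [[n for n, h in enumerate(row) if h] for row in H]

def compact_cn(H):
    """Column-major scan: for each bit node (column), the check-node indices of its nonzero entries."""
    return [[m for m, row in enumerate(H) if row[n]] for n in range(len(H[0]))]

# Toy 2 x 4 matrix, for illustration only (not the matrix of Figure 1).
H = [[1, 1, 0, 1],
     [0, 1, 1, 1]]
h_bn = compact_bn(H)  # [[0, 1, 3], [1, 2, 3]]
h_cn = compact_cn(H)  # [[0], [0, 1], [1], [0, 1]]
```

Only the positions of the nonzero entries are stored, so a work-item that owns one check node needs just one short row of the first structure in its private memory.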

4.2. Programming the MSA on the OpenCL Grid. Each work-group contains $M$ work-items, which represent threads. Instead of the whole matrix $H$ or the whole compact structure $H_{BN}$, each work-item saves only the part of $H_{BN}$ that it needs in private memory, which makes the accesses performed during the update faster. The same principle applies to the update of the $Lq_{n,m}$ messages. According to the LDPC code length, the CPU of the mobile device allocates memory on the GPU: global memory for storing the check matrix $H$, the input data, and the output data, and local memory for the messages sent from bit nodes to check nodes, denoted $Lq_{n,m}$, and from check nodes to bit nodes, denoted $Lr_{m,n}$ (Algorithm 3). In step (2), the compact $H_{BN}$ and $H_{CN}$ are generated in private memory by Algorithms 1 and 2.
As in the normal MSA, the loop that starts at step (3) ends when the output code word is correct or when the maximum number of iterations is reached.
Steps (5)-(9) execute a horizontal processing, a vertical processing, and a synchronization of all threads. Generally, all threads should be synchronized after both the horizontal and the vertical processing. In this algorithm, however, every work-item takes charge of its own check node, and its check-to-bit messages $Lr_{m,n}$ are not shared with other work-items, so the synchronization after the horizontal processing can be removed to improve performance. The bit-to-check messages $Lq_{n,m}$ are still shared among all work-items, so the synchronization after the vertical processing is retained.
Algorithm 3, step (1): Initialize the work-group size (the number of work-items per work-group).
After the synchronization, the posterior probabilities of $\mathrm{BN}_n$ are calculated, with every work-item handling $N/M$ bit nodes, as in steps (10)-(13). After the second synchronization, hard decoding is performed on the posterior probabilities, according to the method described in Section 2.
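The division of the posterior step among work-items can be emulated sequentially. The sketch below assumes that each work-item handles a contiguous block of $N/M$ bit nodes and that $N$ is a multiple of $M$; both points, like the function names, are assumptions made for illustration, and the loop over work-items stands in for what runs in parallel on the GPU.

```python
def bit_nodes_for_work_item(local_id, n_bits, n_checks):
    """Bit nodes handled by one work-item, assuming contiguous blocks of N/M."""
    per_item = n_bits // n_checks
    start = local_id * per_item
    return range(start, start + per_item)

def posterior_and_hard_decision(lp, lr_by_bit, n_checks):
    """Emulate the posterior step: each work-item covers its own N/M bit nodes."""
    n_bits = len(lp)
    posterior = [0.0] * n_bits
    for local_id in range(n_checks):  # these iterations run in parallel on the GPU
        for n in bit_nodes_for_work_item(local_id, n_bits, n_checks):
            # Prior plus all incoming check-to-bit messages of bit node n.
            posterior[n] = lp[n] + sum(lr_by_bit[n])
    bits = [1 if v < 0 else 0 for v in posterior]
    return posterior, bits

# Two work-items (check nodes) and four bit nodes: each work-item covers two bit nodes.
posterior, bits = posterior_and_hard_decision(
    [1.0, -1.0, 0.5, 2.0], [[0.5], [-0.5], [-1.0], [1.0]], n_checks=2)
```

Because the blocks do not overlap, no two work-items write the same posterior value, so no synchronization is needed inside this step.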
Execution is truly parallel, and the overall processing time required to decode a code word is significantly reduced as a result, as will be seen in the next section. More data parallelism can be exploited by decoding several code words simultaneously, but it was not considered in this work.

Implementation and Experimental Results
The experimental setup used to evaluate the proposed parallel LDPC decoder consists of a PowerVR G6200 GPU with 256 MB of global memory and 4 KB of local memory, programmed in the C language with the OpenCL programming interface (version 1.1). In this algorithm, each code word is decoded in one work-group. Because of the limited local memory, only small LDPC codes can be used on the test mobile phone. However, the number of work-groups can be large thanks to the relatively large global memory of the GPU.
We decode a batch of code words whose original size is 1 Mbit. The variation in performance is minimal, so Figure 5 shows only the best results achieved. For the 144 × 576 matrix, the number of work-items per work-group equals the number of rows, which means that we use 144 work-items per work-group and 1000 work-groups in this experiment.
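The quantities implied by the 144 × 576 matrix can be worked out with a short sketch. It assumes that the matrix has full rank, so that the number of information bits is $N - M$, and it uses the standard OpenCL relation between a flat global id, the work-group id, and the local id when no global offset is set. The function names are hypothetical.

```python
def code_parameters(m_rows, n_cols):
    """Derived quantities for an M x N parity-check matrix (full rank assumed)."""
    k = n_cols - m_rows
    return {
        "info_bits": k,
        "rate": k / n_cols,
        "work_items_per_group": m_rows,           # one work-item per check node
        "bit_nodes_per_work_item": n_cols // m_rows,
    }

def locate(global_id, work_items_per_group):
    """Split a flat global id into (work-group index = code word, local id = check node)."""
    return divmod(global_id, work_items_per_group)

params = code_parameters(144, 576)
position = locate(145, params["work_items_per_group"])
```

For this matrix the sketch gives 432 information bits, a code rate of 0.75, and 4 bit nodes per work-item; global id 145 falls in the second work-group (index 1) as local work-item 1.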
The decoding times reported in Figure 5 are global processing times, including data transmission time and decoding time. The decoding time increases linearly with the number of iterations. The computation capacity of the GPU is fully used. With the size of the data to decode unchanged, the throughput decreases as the number of iterations increases.
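The linear relation and its effect on throughput can be written as a simple model. The symbols $t_0$ (fixed data transmission and setup time), $t_{it}$ (time per iteration), $I$ (number of iterations), and $B$ (number of bits decoded per batch) are introduced here for illustration; they are not measured values from the experiments.

```latex
\begin{align}
T(I) &= t_0 + I \cdot t_{it} \\
\text{throughput}(I) &= \frac{B}{T(I)} = \frac{B}{t_0 + I \cdot t_{it}}
\end{align}
```

With $B$ fixed, the throughput therefore falls as $I$ grows, and it becomes roughly inversely proportional to $I$ once $I \cdot t_{it}$ dominates $t_0$.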
On the mobile device, we attach as much importance to the delay as to the throughput. Figure 6 shows the decoding delay for speeds from 10 Kbps to 100 Mbps. As the speed increases exponentially, the delay increases only slowly.
At low speed, the amount of data decoded on the GPU in a decoding cycle is so small that part of the GPU capacity is wasted and the benefit of parallelism is not obvious. The delay naturally increases when more code words are decoded on the GPU. However, the time of a decoding cycle, which is the most important part of the delay, increases only slowly thanks to fuller use of the computation capacity. The mean time per code word decreases at higher speed. Thus, the decoder can be applied to high-speed mobile services such as large file transmission, and to delay-sensitive services such as video calling.

Conclusion
This paper proposes a parallel LDPC decoder that decodes multiple code words simultaneously on the GPU of a mobile device using OpenCL. LDPC codes are widely used in the fourth generation of mobile telecommunications technology, so it is significant to realize high-speed LDPC decoding on mobile devices.

Figure 5: LDPC decoding times on the GPU and corresponding throughput using MSA.
Figure 1: A 4 × 8 $H$ matrix and its Tanner graph representation.
Instead of using $M \times W_c$ work-items ($W_c$ is the maximum column weight of matrix $H$), the model uses $M \times 1$ work-items in each work-group, and each work-item updates the messages of one check node, which means that $M$ work-items work for $M$ check nodes, respectively.
4.1. Compact Representation of the Tanner Graph. The Tanner graph of an LDPC code is defined by $H$. We represent it in two separate data structures, namely, $H_{BN}$ and $H_{CN}$. This is because one iteration of the LDPC decoder can be decomposed into horizontal and vertical processing, which update the messages from $\mathrm{CN}_m$ to $\mathrm{BN}_n$ and the messages from $\mathrm{BN}_n$ to $\mathrm{CN}_m$, respectively. The data structure used in the horizontal step is $H_{BN}$. It is generated by scanning the matrix $H$ in row-major order and mapping only the bit nodes' edges associated with nonnull elements in $H$ used by a single check node equation in the same row. Algorithm 1 describes this procedure in detail for a matrix having $M$ rows and $N$ columns. $H_{BN}$ is saved in private memory.
Algorithm 2: Generating compact $H_{CN}$ from matrix $H$.