High-Throughput Fast-SSC Polar Decoder for Wireless Communications

Polar codes have been proven to achieve the symmetric capacity of memoryless channels. However, the successive cancellation decoding algorithm is inherently serial, which leads to high latency and low throughput. To obtain high throughput, we design a deeply pipelined polar decoder and optimize the processing elements and storage structure. We also propose an improved fixed-point nonuniform quantization scheme whose performance is close to that of floating-point decoding. A two-level control strategy is presented to simplify the controller. In addition, we adopt a FIFO structure to implement the α memory and β memory and propose a 348-stage pipelined decoder.


Introduction
Wireless communication is changing our lives and has been applied to many scenarios [1][2][3][4][5], and error-correcting codes are used to improve its transmission efficiency and reliability. Polar codes are a class of error-correcting codes proposed by Arikan [6]. Within the ongoing 5th-generation wireless systems (5G) standardization, polar codes have been adopted as the channel coding for the enhanced mobile broadband (eMBB) communication service because of their excellent error-correction performance. In particular, ultra-reliable low-latency communications (URLLC) require a throughput of several tens of Gbps [7], which poses a challenge for polar decoders. Since Arikan presented the successive cancellation (SC) decoding algorithm, great research effort has been devoted to decoding algorithms and hardware architectures for polar codes. SC has the advantages of low complexity and a simple decoding structure. Although polar codes can theoretically achieve channel capacity when the code length is infinite, the performance of SC is mediocre for codes of short and moderate lengths. To address this issue, the successive cancellation list (SCL) decoding algorithm was proposed in [8]. Unlike the SC algorithm, SCL does not focus on a single candidate codeword; it keeps the L most reliable candidate codewords at every step, which significantly improves the decoding performance. K. Chen and K. Niu proposed the CRC-aided SCL (CA-SCL) algorithm [9], based on the observation that the correct codeword passes the CRC check. They also proposed the successive cancellation stack (SCS) decoding algorithm in [10] and the successive cancellation hybrid (SCH) decoding algorithm in [11]. Unlike SCL decoding, which preserves the L most reliable paths in each layer, SCS always extends the most reliable path. The performance of SCS is the same as that of SCL, but its time complexity is lower and its space complexity is higher. The actual time complexity of SCS decoding is far below that of SCL in the high-SNR regime and is close to that of SC decoding. The SCH algorithm combines the advantages of SCL and SCS, and its performance is close to that of maximum likelihood (ML) decoding [12]. Researchers at Huawei proposed the adaptive CA-SCL (aCA-SCL) decoding algorithm [13] based on CA-SCL. aCA-SCL improves the decoding performance by gradually expanding the search width L and can reduce the software complexity significantly. The above decoding algorithms were proposed to improve error-correction performance, but their throughput is not ideal. Thus A. Alamdar-Yazdi and F. R. Kschischang proposed the simplified successive cancellation (SSC) decoding algorithm in [14], based on the locations of frozen and information bits. SSC decoding reduces the computational complexity and improves the decoding parallelism by merging some leaf nodes, such as a Rate-1 node, whose leaf nodes are all unfrozen bits. Simulation results show that its performance is similar to that of SC. G. Sarkis and W. J. Gross divide the nodes into Rate-0, Rate-1, and Rate-R nodes and propose the maximum likelihood SSC (ML-SSC) decoding algorithm in [15], which mainly improves the decoding of Rate-R nodes in SSC decoding. Compared with the semi-parallel SC decoding in [16], ML-SSC decoding improves the decoding throughput by 25 times. To further reduce the decoding complexity and improve the throughput, G. Sarkis proposed the fast simplified successive cancellation (Fast-SSC) decoding algorithm, which mainly improves the decoding rules of Rate-R nodes and gives the specific operation for each constituent node [17]. Fast-SSC decoding divides the Rate-R nodes into repetition (REP), single-parity-check (SPC), and REP-SPC nodes and improves the throughput.
In addition, for the decoding hardware architecture of polar codes, a semi-parallel architecture is proposed in [16]. To improve resource utilization, this method reuses the processing elements (PEs), which effectively reduces the hardware complexity. The overlapped architectures proposed in [18] have advantages in both latency and throughput; they use a precalculation function to compute the possible results first and then select the corresponding result according to the decoded bits. It is shown that the decoding latency is reduced by 50% when the code length is larger than $2^7$. B. Yuan then proposed an SCL decoder with multibit decision, which effectively reduces the decoding latency [19]. An unrolled hardware polar decoder based on Fast-SSC is proposed in [20]. This decoder loads one frame of channel data and outputs one frame of codeword estimates each clock cycle. The PEs are no longer reused; dedicated PEs are assigned to each stage. Graphics processing units (GPUs) provide flexibility and massively parallel units, and GPU-based polar decoders obtain high throughput [21][22][23].
In this paper, we investigate the characteristics of the LLRs in different stages of the Fast-SSC polar decoder and propose an improved nonuniform fixed-point quantization method. It adopts the (6,5,1) quantization scheme, whose decoding performance is close to that of floating-point decoding. The proposed decoder employs a deeply pipelined architecture and optimizes the REP decoder and the G operation. To simplify the control of the deep pipeline, the controller is divided into a global controller and local controllers. The α memory and β memory use a FIFO architecture to reduce the control logic. Finally, a 348-stage pipelined architecture is devised and implemented on an Altera Stratix V 5SGXEA7N2F45C2. To test the decoding performance, we design a test platform based on FPGA.
The remainder of this paper is organized as follows. A brief review of the Fast-SSC decoding algorithm and an analysis of the quantization schemes are given in Section 2. Section 3 describes the deeply pipelined architecture and the PEs. Performance is evaluated in Section 4, and conclusions are drawn in Section 5.

Review of Fast-SSC
2.1. Polar Codes. A polar code can be represented by $(N, K)$, where $N$ denotes the code length and $K/N$ is the code rate. A polar code of length $N$ can be constructed by concatenating two polar codes of length $N/2$. The construction can be denoted by $x = u F^{\otimes n}$, where $u = \{u_0, u_1, \ldots, u_{N-1}\}$ is the input sequence to be encoded and $x = \{x_0, x_1, \ldots, x_{N-1}\}$ denotes the codeword. $F^{\otimes n}$ is the $n$-th Kronecker power of the generator matrix $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$, with $n = \log_2 N$. Polar codes select the $K$ most reliable channels to transmit information bits, and the other $N-K$ channels transmit frozen bits.
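As an illustration of the construction above, the following minimal Python sketch encodes an input sequence by multiplying it with the n-th Kronecker power of the 2×2 kernel [1 0; 1 1] over GF(2). The function names `kronecker_power` and `polar_encode` are hypothetical, bit-reversal permutation is not applied, and the selection of frozen positions is left to the caller; it is a sketch under these assumptions, not the encoder used in the paper.

```python
def kronecker_power(kernel, n):
    """Return the n-th Kronecker power of a 0/1 matrix given as a list of rows."""
    result = [[1]]
    for _ in range(n):
        # Row (i, k) of result (x) kernel is every a*b with a from row i, b from row k.
        result = [
            [a * b for a in row_r for b in row_k]
            for row_r in result
            for row_k in kernel
        ]
    return result


def polar_encode(u):
    """Encode the bit list u (length a power of two) as x = u * F^(n) mod 2."""
    length = len(u)
    n = length.bit_length() - 1
    if length == 0 or (1 << n) != length:
        raise ValueError("input length must be a power of two")
    g = kronecker_power([[1, 0], [1, 1]], n)
    # x[j] is the XOR of all u[i] whose row i of the matrix has a 1 in column j.
    return [sum(u[i] * g[i][j] for i in range(length)) % 2 for j in range(length)]


if __name__ == "__main__":
    # Frozen bits are assumed to be set to zero by the caller.
    print(polar_encode([0, 1, 0, 1]))
```

The explicit matrix product costs on the order of N² operations; a hardware encoder would normally use the equivalent XOR butterfly network instead.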

Fast-SSC Decoding Algorithm.
The binary decoding tree of Fast-SSC contains four types of constituent nodes: Rate-0, Rate-1, REP, and SPC. Compared with the SC decoding tree, the Fast-SSC tree has fewer leaf nodes. Since a polar decoder traverses the entire binary tree during decoding, the smaller tree gives the Fast-SSC decoding algorithm lower latency. Figure 1 shows the SC decoding tree and the corresponding Fast-SSC decoding tree for a (16, 8) polar code. For instance, the REP node consists of leaf nodes {4, 5, 6, 7} and the SPC node includes leaf nodes {8, 9, 10, 11}. The leaf nodes of a Rate-0 node are all frozen bits; therefore, its output is the zero vector. The leaf nodes of a Rate-1 node are all information bits. The decoding result of such a node is obtained by the hard decision

$$\beta_v[i] = h(\alpha_v[i]) = \begin{cases} 0, & \alpha_v[i] \ge 0 \\ 1, & \text{otherwise.} \end{cases} \qquad (1)$$

For the REP node, only the last bit is an information bit, and the others are frozen bits. The REP node first adds all the inputs $\alpha_v$ and then makes a hard decision as

$$\beta_v[i] = h\left(\sum_{j=0}^{N_v-1} \alpha_v[j]\right), \qquad (2)$$

where $N_v$ denotes the code length of the node. The SPC node, of which only the first leaf node is a frozen bit, first performs threshold detection on the input LLRs by (3). The parity of all the inputs is calculated by (4). Then the least reliable bit is found and flipped if the parity constraint is not satisfied. The threshold detection can be written as

$$HD[i] = h(\alpha_v[i]). \qquad (3)$$

The parity of the input is calculated as

$$\text{parity} = \bigoplus_{i=0}^{N_v-1} HD[i]. \qquad (4)$$

Finally, the output of the SPC node is

$$\beta_v[i] = \begin{cases} HD[i] \oplus \text{parity}, & i = \arg\min_j |\alpha_v[j]| \\ HD[i], & \text{otherwise.} \end{cases} \qquad (5)$$

In addition to the above four node types, the remaining nodes, colored in grey in Figure 1, are referred to as other nodes. Other nodes are decoded with the standard SC algorithm, as in Figure 2. When node $v$ is activated, it receives $\alpha_v$ from its parent node and then calculates the soft-valued input to its left child, $\alpha_l$, using the F operation.
Once $\beta_l$ of the left child node is estimated, it is used to calculate the input to the right child node, $\alpha_r$, with the G operation.
Finally, $\beta_l$ and $\beta_r$ are combined to calculate $\beta_v$ as

$$\beta_v[i] = \begin{cases} \beta_l[i] \oplus \beta_r[i], & i < N_v/2 \\ \beta_r[i - N_v/2], & \text{otherwise.} \end{cases} \qquad (6)$$

Table 1 lists the number of constituent nodes of the decoding tree for a (1024, 512) polar code. The Fast-SSC decoding tree has 104 constituent nodes in total, down from 1024 in the SC decoding tree. The decoder does not need to traverse the entire decoding tree; it traverses only the pruned tree. Thus the Fast-SSC algorithm improves the decoding efficiency and throughput and decreases the latency.
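The three operations used for other nodes can be sketched in a few lines of Python. This is a floating-point illustration only: the function names are hypothetical, the F operation uses the common min-sum form, and element i is assumed to be paired with element i + N_v/2, as in the description of the G module.

```python
def f_op(alpha_v):
    """Left-child LLRs: product of the signs, minimum of the magnitudes."""
    half = len(alpha_v) // 2
    alpha_l = []
    for i in range(half):
        a, b = alpha_v[i], alpha_v[i + half]
        sign = -1 if (a < 0) != (b < 0) else 1
        alpha_l.append(sign * min(abs(a), abs(b)))
    return alpha_l


def g_op(alpha_v, beta_l):
    """Right-child LLRs: add when the left bit is 0, subtract when it is 1."""
    half = len(alpha_v) // 2
    return [alpha_v[i + half] + (1 - 2 * beta_l[i]) * alpha_v[i] for i in range(half)]


def c_op(beta_l, beta_r):
    """Parent bits: first half is beta_l XOR beta_r, second half is beta_r."""
    return [l ^ r for l, r in zip(beta_l, beta_r)] + list(beta_r)


if __name__ == "__main__":
    alpha_v = [2.0, -3.0, -1.0, 4.0]
    alpha_l = f_op(alpha_v)
    beta_l = [0 if a >= 0 else 1 for a in alpha_l]  # treat the left child as Rate-1
    alpha_r = g_op(alpha_v, beta_l)
    beta_r = [0 if a >= 0 else 1 for a in alpha_r]  # treat the right child as Rate-1
    print(c_op(beta_l, beta_r))
```

In the example, both children are decoded by hard decision purely to keep the demonstration short; in a real tree each child would be decoded according to its own node type.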

Quantization Scheme.
Quantization schemes are divided into uniform and nonuniform quantization. Uniform quantization is simple, but it consumes more resources than a nonuniform scheme. Nonuniform quantization employs different numbers of quantization bits in different decoding stages and uses less storage, but the memory structure is not regular [27]. Unlike the conventional SC decoder, in which memory is shared by the nodes of different stages, the PEs of each stage of the deeply pipelined decoder are equipped with a separate memory. To reduce the memory consumption, the nonuniform scheme is adopted to quantize the channel LLRs and internal LLRs. The decoder in [20] adopts an all-integer quantization method, where the channel LLR is 4 bits and the internal LLR is 5 bits. In this paper, an improved quantization scheme is proposed based on the LLR distribution of different stages. In the early stages, the internal LLRs are small, and they are quantized with the same number of bits as the channel LLRs. To avoid catastrophic overflow, the internal LLRs of later stages are quantized with more bits. Let $(Q_i, Q_c, Q_f)$ denote the quantization scheme, where $Q_c$ is the number of quantization bits of the channel LLRs and of the LLRs in the early stages of the decoder, $Q_i$ is the number of bits of the internal LLRs in the other stages, and $Q_f$ is the number of fractional bits. Figure 3 shows the block error rate (BLER) performance of SC, the Fast-SSC algorithm, and different quantization schemes. The floating-point decoding performance of Fast-SSC is close to that of SC. Among the quantization schemes of Fast-SSC, the performance of (6,5,1) quantization is close to the floating-point performance, whereas (6,4,1) quantization results in a performance loss of less than 0.2 dB at high $E_b/N_0$. Therefore, this paper adopts the (6,5,1) quantization scheme.
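The effect of the bit widths can be illustrated with a small saturating fixed-point quantizer. The sketch below is an assumption-laden illustration, not the paper's implementation: the helper `quantize` is hypothetical, the total bit count is assumed to include the sign bit and the fractional bits, and a two's-complement range is assumed.

```python
def quantize(llr, total_bits, frac_bits):
    """Round an LLR to a fixed-point grid and saturate it (assumed two's-complement range)."""
    scale = 1 << frac_bits                    # 2**frac_bits steps per unit
    max_code = (1 << (total_bits - 1)) - 1    # largest representable integer code
    min_code = -(1 << (total_bits - 1))       # smallest representable integer code
    code = round(llr * scale)
    code = max(min_code, min(max_code, code))
    return code / scale


if __name__ == "__main__":
    # Under the (6,5,1) scheme, 5 bits are assumed for channel and early-stage LLRs
    # and 6 bits for later internal LLRs, each with 1 fractional bit.
    for value in (3.3, 9.0, 20.0, -20.0):
        print(value, quantize(value, 5, 1), quantize(value, 6, 1))
```

With one fractional bit, the 5-bit format saturates at 7.5 and the 6-bit format at 15.5, which shows why the wider format is reserved for the later stages where LLR magnitudes grow.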

Architecture of Fast-SSC Decoder
The decoder is implemented in a deeply pipelined architecture to improve the decoding throughput. This paper optimizes the PEs, storage, and control modules to lower the latency. After the result of the last constituent node is computed, its conversion by $G_N$ takes one separate clock cycle. According to the above analysis, the deeply pipelined architecture has a total of 332 stages.
To implement the deeply pipelined architecture, we unfold the overall decoder. The G-0R, R0-R1, and R0-SPC operations are introduced to reduce the number of stages. The decoder can directly activate the right child when the left child is a Rate-0 node, which reduces the decoding latency and the storage capacity. Moreover, to balance the pipeline across all stages and to lower wire routing congestion, when the inputs are large we refine the F and G operations into F with front complement, G without complement, G without front complement, and G without latter complement operations. When $\beta_l[i] = 1$, the G operation subtracts $\alpha_v[i]$ from $\alpha_v[N_v/2 + i]$; its structure is shown in Figure 7.

F Module. The F operation is used to calculate $\alpha_l$ of the left child node.
When the input size of G is larger than 256, the delay of the stage becomes high. When the input length $N_v$ is long, the stage after G is usually an F operation. Since the F operation is less complex than the G operation, the complement operation of G is moved into the following F operation to balance the delay of the two stages. The optimized architecture is depicted in Figure 8. The left and right sides of the dotted line are the PEs of two stages, respectively. The complement operation of G, colored in grey, is performed in the next stage.

C Module.
The C module combines $\beta_l$ of the left child and $\beta_r$ of the right child to calculate $\beta_v$ of the parent node. According to (6), the first half of $\beta_v$ is obtained by $\beta_l$ XOR $\beta_r$, and the second half is equal to $\beta_r$ directly. The structure of the C module is shown in Figure 9.

REP Module.
The number of input ports of the REP module is 4, 8, 16, 32, or 64. The 4-input module is the basic REP; other types can be decomposed into 4-input modules. The hardware architecture of the 8-input REP is presented in Figure 10.
When the input length of a REP node increases, its delay also increases. To improve the working frequency, the 8-input REP is divided into two stages and thus uses two clock cycles. The first stage converts the input data to complement form and generates four internal results. The second stage adds them, and the REP module outputs the decision over all 8 inputs. In Figure 10, the dotted line indicates this division into two stages. Similarly, REP modules of other lengths are divided into stages. For instance, the 16-, 64-, and 128-input REP modules are divided into 2, 3, and 4 stages, respectively.
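The two-stage split of the 8-input REP can be modelled in software as two functions, one per pipeline stage. The sketch below rests on assumptions that the text does not state: inputs are taken to be sign-magnitude pairs, the four internal results are taken to be sums of adjacent pairs, and all names are hypothetical.

```python
def to_signed(sign_bit, magnitude):
    """Convert an assumed sign-magnitude LLR to a signed integer (complement form)."""
    return -magnitude if sign_bit else magnitude


def rep8_stage1(inputs):
    """Stage 1: convert the 8 inputs and produce four pairwise partial sums."""
    if len(inputs) != 8:
        raise ValueError("the 8-input REP expects exactly 8 LLRs")
    values = [to_signed(sign_bit, magnitude) for sign_bit, magnitude in inputs]
    return [values[i] + values[i + 1] for i in range(0, 8, 2)]


def rep8_stage2(partial_sums):
    """Stage 2: add the partial sums and take the hard decision on the total."""
    total = sum(partial_sums)
    return 0 if total >= 0 else 1


if __name__ == "__main__":
    llrs = [(0, 3), (1, 5), (0, 1), (0, 1), (1, 2), (0, 2), (1, 4), (0, 1)]
    partial = rep8_stage1(llrs)   # would be registered between the two clock cycles
    print(partial, rep8_stage2(partial))
```

The single decision bit returned by the second stage is the value of every bit in the REP node's output vector.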
3.9. SPC Module. To improve the frequency, the length of the SPC node is constrained to 4, as shown in Figure 11. min23_flag denotes the index of the minimum of $|\alpha[2]|$ and $|\alpha[3]|$. If $|\alpha[0]|$ is less than $|\alpha[1]|$, min01_flag is set to zero; otherwise min01_flag is one. If min01 is less than min23, sel is set to zero; otherwise sel is one. D0∼D3 are determined by the judge module. For instance, if sel and min01_flag are both zero, then D0 is set to one and the others are set to zero. If sel is zero and min01_flag is one, then D1 is set to one and the others are set to zero.
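The selection logic of the 4-input SPC module can be sketched as follows. The handling of the sel = 1 cases (D2 and D3) is an assumption made by symmetry with the two cases given in the text, and the function name `spc4` is hypothetical.

```python
def spc4(alpha):
    """Decode a 4-input SPC node: hard decisions, then flip the least reliable bit on parity failure."""
    if len(alpha) != 4:
        raise ValueError("the SPC module is constrained to 4 inputs")
    hd = [0 if a >= 0 else 1 for a in alpha]        # threshold detection
    parity = hd[0] ^ hd[1] ^ hd[2] ^ hd[3]          # 1 means the parity check failed
    mag = [abs(a) for a in alpha]

    min01_flag = 0 if mag[0] < mag[1] else 1        # index of the smaller of inputs 0 and 1
    min23_flag = 0 if mag[2] < mag[3] else 1        # index of the smaller of inputs 2 and 3
    min01 = mag[min01_flag]
    min23 = mag[2 + min23_flag]
    sel = 0 if min01 < min23 else 1                 # which pair holds the overall minimum

    # Judge module: one-hot D0..D3 marking the least reliable input
    # (the sel = 1 cases are assumed to mirror the sel = 0 cases).
    least_reliable = 2 * sel + (min23_flag if sel else min01_flag)
    d = [1 if i == least_reliable else 0 for i in range(4)]

    return [hd[i] ^ (parity & d[i]) for i in range(4)]


if __name__ == "__main__":
    print(spc4([2.0, -1.0, 0.5, 3.0]))
```

In the example the hard decisions have odd parity, so the bit with the smallest magnitude (input 2) is flipped.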

Kronecker Power Module.
The decoding result of a constituent node must be converted by the n-th Kronecker power to obtain the final result, by (9). The Kronecker power module shown in Figure 12 has code length 8, where ⊕ denotes the XOR operation and • denotes that the data is connected directly. If an input $v_i$ is equal to zero, the XOR operations that take it as an operand can be bypassed. As shown in Figure 12, the three XOR operations colored in red can be removed because $v_0$ is zero. As depicted in Figure 14, addr_α_stage2 denotes the address bus of the α memory in stage2. When stage2_en is asserted, the corresponding address is incremented by one; namely, an internal LLR is stored into the α memory of stage2. Compared with a one-level controller that generates all the control signals, the two-level controller reduces the implementation complexity.

Performance Analysis
4.1. Test Platform. To test the high throughput of the Fast-SSC decoder, we implement a test platform based on FPGA. The whole platform, including test data generation, runs on the FPGA to reduce the communication cost with the host computer. As depicted in Figure 15, the platform consists of a random number generator, CRC check, polar encoder, BPSK AWGN channel, Fast-SSC decoder, and statistics module. PCIe is responsible for the communication between the host computer and the FPGA platform. In addition, this paper designs a software platform based on C++ that compares the results of the hardware test and the software simulation. At the beginning, the host computer generates the random number seeds, Gaussian noise seeds, the number of test frames, and the start signal and transmits them to the FPGA. When the decoding is completed, the statistics module uploads the number of error frames; the host computer then calculates the BLER and displays the test parameters. For the (1024, 512) polar code, the test platform takes 19.18 s at 300 MHz to test $1.4 \times 10^{10}$ bits of data.
It can be observed from Table 3 that the proposed decoder uses more memory than other FPGA-based decoders because 6 bits are used to quantize part of the LLRs. However, it uses fewer registers than [20]. The decoder in [24] uses fewer resources because it does not adopt a deeply pipelined architecture.
For the (1024, 512) polar decoder in this paper, the working frequency reaches 300 MHz. The decoder requires 348 clock cycles to decode one frame. By (10) and (11), the latency is 1.16 μs and the throughput is 307.2 Gbps. Table 4 compares the proposed decoder with other polar decoders. In [20], a deeply pipelined decoder based on FPGA achieves a throughput of over 237 Gbps for a (1024, 512) polar code. The latency of that decoder is more than twice that of this work, and the throughput of the proposed decoder is 1.3 times higher. In either latency or throughput, this work is better than the decoders in [20, 24, 25]. O. Dizdar and E. Arikan proposed a deeply pipelined polar decoder based on the SC decoding algorithm [26]. That decoder operates at a lower clock frequency and consumes less dynamic power. The proposed decoder has three times higher latency but is over 119 times faster than that in [26].
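The reported figures can be checked with a few lines of arithmetic. The sketch below assumes that latency is the number of pipeline clock cycles divided by the clock frequency and that the throughput counts coded bits with one frame leaving the pipeline per clock cycle; the function names are hypothetical.

```python
def latency_seconds(frame_decoder_clocks, freq_decode_hz):
    """Time for one frame to pass through the whole pipeline."""
    return frame_decoder_clocks / freq_decode_hz


def throughput_bps(code_length, freq_decode_hz):
    """Coded bits per second, assuming one frame completes every clock cycle."""
    return code_length * freq_decode_hz


if __name__ == "__main__":
    freq = 300_000_000                       # 300 MHz
    print(latency_seconds(348, freq) * 1e6)  # latency in microseconds
    print(throughput_bps(1024, freq) / 1e9)  # throughput in Gbps
```

Under these assumptions, 348 cycles at 300 MHz give 1.16 μs, and 1024 bits per cycle give 307.2 Gbps, matching the values quoted above; the ratio to the 237 Gbps of [20] is about 1.3.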

Conclusions
In this paper, a deeply pipelined decoder based on the Fast-SSC decoding algorithm has been presented. The proposed decoder can output 1024 bits each clock cycle. To optimize the critical path, the PEs are decomposed and recombined to balance the delay of adjacent stages. The fixed-point nonuniform quantization scheme lowers the storage capacity and obtains good decoding performance. A two-level mode is proposed to reduce the complexity of the controller. Moreover, we build a platform based on FPGA to test its performance. Numerical results show that the decoder can achieve high throughput.

Figure 3: The BLER performance of different decoding schemes.
In the F operation for the left child node, according to (4), the sign bit of $\alpha_l[i]$ is obtained by an XOR operation, and the magnitude bits are the minimum of the magnitudes of the two paired inputs of $\alpha_v$, found by a compare operation. The structure is shown in Figure 6.

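3.6. G Module. The G operation calculates $\alpha_r$ of the right child based on $\beta_l$ of the left child and $\alpha_v$ of the parent node. According to (5), $\alpha_v[i]$ is added to $\alpha_v[N_v/2 + i]$ when $\beta_l[i] = 0$ and subtracted from it when $\beta_l[i] = 1$.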

Figure 6: Structure of F module.

Figure 7: Structure of G module.

Figure 10: The dotted line indicates that the original 8-input REP module is divided into two stages.
$|\alpha[i]|$ denotes the absolute value of $\alpha[i]$, and sign($\alpha[i]$) is the sign bit of $\alpha[i]$. The 4MIN1 module selects the smallest LLR through compare operations. min01_flag denotes the index of the minimum of $|\alpha[0]|$ and $|\alpha[1]|$.

4.2. Resource Consumption. The (1024, 512) polar decoder is implemented on an Altera Stratix V 5SGXEA7N2F45C2 in Quartus II 15.0. The resources used by the FPGA-based decoder are shown in Table 3.

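4.3. Performance. The latency and throughput are the main performance parameters of the polar decoder. Let freq_decode be the clock frequency of the decoder and frame_decoder_clocks be the number of clock cycles needed to decode one frame. The latency and throughput are then

$$\text{latency} = \frac{\text{frame\_decoder\_clocks}}{\text{freq\_decode}}, \qquad (10)$$

$$\text{throughput} = N \cdot \text{freq\_decode}. \qquad (11)$$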

Table 1: The number of constituent nodes.
3.1. Architecture. The structure of the Fast-SSC decoder is depicted in Figure 4; it consists of PEs, memory, and a controller. The PEs implement various functions, such as the F and G functions and the Kronecker power module. Memory is divided into α memory and β memory, which store the channel and internal LLRs and the hard decisions of each constituent node, respectively. Because the decoding result of every constituent node needs to be multiplied by $G_N$, the Kronecker power module contains the $G_N$ ($N$ = 4, 8, 16, 32, 64) matrices for leaf nodes of different lengths. The entire decoding process is managed by the controller module. When the channel LLR enable signal (en_cha_alpha) is valid, the decoder starts to load one new frame, and it outputs the codeword estimates. If the results of the current stage are not used immediately by the next stage, they are stored in α memory or β memory.
3.2. Deeply Pipelined. The deeply pipelined architecture for a (1024, 512) polar code is illustrated in Figure 4. The dotted-line rectangles represent the PEs, such as REP$_{128}$, where REP denotes the operation type and the subscript 128 represents the input length of the node. The spotted rectangles represent the RAMs, which store internal results for later pipeline stages when the current results are not used immediately and the amount of data is larger than 16. The solid-line rectangles represent registers. When the two stages that use a given piece of data are close together and the amount of data is small, registers store the data temporarily to reduce the memory control signals. The deeply pipelined architecture is designed according to the node activation order of the decoding tree and the operation order of the local decoding of each node. The software simulation consists of 368 operations, whereas the hardware implementation has a total of 330 operations. To achieve high throughput, this paper splits the stages with larger delay. For example, REP$_{128}$ can be split into four stages. The final architecture has 348 stages; thus the decoding delay is 348 clock cycles. Each stage in the pipeline contains one PE.
3.3. Memory. For node $v$ in Figure 2, the input data $\alpha_v$ are used twice during the decoding process: first to calculate $\alpha_l$ of the left child node and then to calculate $\alpha_r$ of the right child node. Similarly, the local decoding result $\beta_l$ of node $v_l$ is input into the Kronecker power module to obtain the final decoded words, and it is also combined with the local decoding result $\beta_r$ of its sibling node $v_r$ by the C operation to obtain the result $\beta_v$ of node $v$. Since the internal $\alpha$ and $\beta$ are used twice in different stages, they need to be stored. Since the bit widths of $\alpha$ and $\beta$ are different, they are stored separately: the memory is divided into α memory and β memory, as shown in Figure 4. If a node produces $n$ internal data at clock $t_1$ and uses them at clock $t_2$, then with $d = t_2 - t_1$ it requires $n \cdot (d + 2)$ memory units. When fewer than 16 memory units are needed, the data can be stored in registers; otherwise, RAMs are used to store the internal results. The access timing of the RAM is shown in Figure 5, where Adi denotes the i-th address of the RAM. The read sequence is the same as the write sequence, and the two differ only by $d$ clock cycles. Therefore, a FIFO can replace the RAM.
In addition to the 330 operations, there are two further stages, one at the beginning and one at the end of the pipeline. The first stage caches the channel LLRs and occupies one clock cycle. The results of the local nodes need to be multiplied by the $G_N$ matrix to recover the local codeword.
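Because the read order equals the write order and the two differ by a fixed number of clock cycles, the memory behaves as a fixed-depth delay line. The following minimal Python sketch models that behaviour; the class name `DelayFifo` is hypothetical, and the model ignores the extra buffering implied by the memory-size expression in the text.

```python
from collections import deque


class DelayFifo:
    """Fixed-depth FIFO: a value written at clock t is read back at clock t + depth."""

    def __init__(self, depth):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        # Pre-fill with None so that reads before the first valid data return None.
        self._slots = deque([None] * depth, maxlen=depth)

    def clock(self, value):
        """Advance one clock cycle: read the oldest entry, then write the new one."""
        oldest = self._slots[0]
        self._slots.append(value)   # maxlen drops the oldest entry automatically
        return oldest


if __name__ == "__main__":
    fifo = DelayFifo(depth=3)
    print([fifo.clock(frame) for frame in range(6)])
```

No address counters are needed in this model, which reflects why a FIFO simplifies the control logic compared with an addressed RAM.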

Table 2: The number of each operation for one frame.
The controller uses a two-level mode to generate control signals. As shown in Figure 13, the first level generates the global control signals, which assign an enable signal to each stage to determine whether the corresponding stage works. For instance, if stage1_en is asserted, stage1 is working; otherwise, stage1 is idle. The second level generates only the local control signals for each stage, such as the memory address of that stage.

Table 3: Statistics of resources.

Table 4: Comparison with other polar decoders.