XOR-FREE Implementation of Convolutional Encoder for Reconfigurable Hardware

This paper presents a novel XOR-FREE algorithm to implement the convolutional encoder using reconfigurable hardware. The approach completely removes the XOR processing of a chosen nonsystematic, feedforward generator polynomial of larger constraint length. The hardware (HW) implementation of the new architecture uses a lookup table (LUT) for storing the parity bits. The design implements architectural reconfigurability by modifying the generator polynomial of the same constraint length and code rate to reduce the design complexity. The proposed architecture reduces the dynamic power by up to 30% and improves the hardware cost and propagation delay by up to 20% and 32%, respectively. The performance of the proposed architecture is validated in MATLAB Simulink and tested on a Zynq-7 series FPGA.


Introduction
Data transmission reliability and the stringent QoS requirements of modern 3GPP, 3GPP2, and LTE standards over unreliable wireless channels require efficient, cost-effective forward error correction (FEC) codes for mobile equipment [1][2][3][4]. The usage and application of such FEC codes depend mainly on their error correcting capabilities and power efficient implementation. Convolutional codes are preferred over block codes because of their economical soft decoding capability and because they yield higher coding gain [5]. While the error correction capabilities are related to polynomial strength, the codes are also designed to achieve a higher free distance [6]. The conventional convolutional encoder is usually realized with a shift register (SR) using delay elements and modulo-2 adders (XOR gates). The key operation in the process of convolution is multiplication, which is implemented using shifts and additions [7]. The addition operation dominates the complexity in comparison to the shift operations and consumes a significant amount of dynamic power while encoding. Hence, the optimization of adder utilization becomes a key factor when implementing with reconfigurable hardware. A few attractive heuristic approaches are reported in the literature to minimize the adder count by eliminating redundant computations [8][9][10][11][12][13]. Such approaches are mostly based on the Common Subexpression Elimination (CSE) method, which depends upon the commonalities of the used polynomials, still involves modulo-two adders, and consumes more power.
Convolutional codes find extensive usage as channel codes in popular wireless standards like 3GPP-WCDMA, 3GPP2-CDMA2000, LTE, and IEEE-802.11 and in software defined radio and cognitive radio applications. They are the key building blocks in many powerful concatenated codes, such as Turbo codes (parallel), Woven codes (serial) [14], and the Ungerboeck code (a.k.a. Trellis Coded Modulation (TCM)) [15]. This paper describes a new algorithm to implement an XOR-FREE architecture of a chosen generator polynomial having constraint length $K \leq 9$ for popular wireless standards. The approach reduces the standard polynomial into a LUT comprising parity bits. The previous state of the encoder and the input bit jointly form the decoding address of this LUT. The ROM based architecture eases FPGA implementation and adds the powerful feature of dynamic reconfiguration. FPGAs have large resources of logic gates and ROM blocks. Hence, synthesis tools can infer the hardware realization of this architecture in the desired manner, which gives greater flexibility. The paper is organized into five sections. In Section 2, the preliminaries of the convolutional code are presented with two different approaches. Section 3 covers in depth the new algorithm with an example of the GSM-900 standard generator polynomial. Section 4 states the results and discussion, whereas Section 5 provides a conclusion.

Preliminary
For the realization of convolutional codes, the theory of groups and finite fields is used. Two approaches are found in the literature. The first approach is taken by Massey [16] and McEliece and Onyszchuk [17]. The method first defines the code as a $k$-dimensional subspace of an $n$-dimensional vector space over a suitable field and then defines the encoder as a $k \times n$ matrix whose rows are a basis for the code. Such a convolutional code can be described by an "infinite matrix," as shown in (1). This matrix depends on the $k \times n$ submatrices $\{G_l\}$, $l = 0, \ldots, K$:

$$G = \begin{pmatrix} G_0 & G_1 & \cdots & G_K & & \\ & G_0 & G_1 & \cdots & G_K & \\ & & \ddots & & & \ddots \end{pmatrix}. \tag{1}$$

Here, $K$ is the constraint length and $G_l$ are the generator sequences of the encoder. $u_i = [u_i^1, \ldots, u_i^k]$ is the $i$th block of $k$ information bits, and $x_i = [x_i^1, \ldots, x_i^n]$ is the block of $n$ coded bits at the output. This is similar to block coding, $x = uG$, as shown in (2):

$$(x_0, x_1, x_2, \ldots) = (u_0, u_1, u_2, \ldots)\, G. \tag{2}$$

The second approach is based on Forney's approach, which defines the encoder as a $k$-input, $n$-output linear sequential circuit [18,19] that can be realized by a shift register as follows:

$$x_i = \sum_{l=0}^{K} u_{i-l}\, G_l, \tag{3}$$

where $g^{(l)}_{\alpha\beta}$ are the elements of matrix $G_l$. Equation (3) can be expanded explicitly in $n$ components, that is, $x_i^1, x_i^2, x_i^3, \ldots, x_i^n$, as shown in (4):

$$x_i^{\beta} = \sum_{l=0}^{K} \sum_{\alpha=1}^{k} u_{i-l}^{\alpha}\, g^{(l)}_{\alpha\beta}, \quad \beta = 1, \ldots, n. \tag{4}$$

Obviously, a shift register of length $K$ will have $2^{K}$ different internal configurations or states. The behavior of such a convolutional coder depends upon the present input $u_i$ and the $K$ previous input blocks $u_{i-1}, u_{i-2}, \ldots, u_{i-K}$. $x_i$ can be calculated by memorizing $K$ input values in shift registers as expressed in (5). Here, one register $\alpha \in \{1, \ldots, k\}$ is used for each of the $k$ input bits, and the memories for which $g^{(l)}_{\alpha\beta} = 1$ are connected to adder $\beta$. Such a realization of the convolutional encoder can be captured by a Deterministic Finite State Machine (DFSM) or deterministic automaton [20].
Using (5), a shift-register-based realization of the convolutional encoder for a given generator polynomial can be obtained, as shown in Figure 1. This realization requires a combination of sequential logic and combinational logic. The convolutional encoder, as a sequential machine, can be represented as an FSM with a finite number of states. For the optimal synthesis of sequential circuits, approaches such as state minimization and state assignment exist, whereas for combinational circuits, logic minimization based on different topologies, namely, AND-OR or AND-XOR decomposition of functions, is available.
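The shift-register realization described above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' implementation: the function name `encode_shift_register` is hypothetical, and the taps `G0` and `G1` are assumed example polynomials ($1 + D^3 + D^4$ and $1 + D + D^3 + D^4$, rate 1/2) that are not given in the text.

```python
from typing import List, Sequence, Tuple

# Assumed example taps (not stated in the text). Index l is the delay of the
# tap: position 0 is the current input, position l the input delayed by l.
G0 = (1, 0, 0, 1, 1)  # 1 + D^3 + D^4
G1 = (1, 1, 0, 1, 1)  # 1 + D + D^3 + D^4


def encode_shift_register(
    bits: Sequence[int],
    generators: Sequence[Sequence[int]] = (G0, G1),
) -> List[Tuple[int, ...]]:
    """Conventional encoder: a delay line plus one modulo-2 adder per output."""
    length = len(generators[0])
    window = [0] * length  # window[0] = current input, window[l] = delayed input
    encoded = []
    for u in bits:
        window = [u] + window[:-1]  # shift the register by one position
        parities = []
        for taps in generators:
            parity = 0
            for tap, value in zip(taps, window):
                parity ^= tap & value  # modulo-2 addition of the tapped cells
            parities.append(parity)
        encoded.append(tuple(parities))
    return encoded
```

Feeding a single 1 followed by zeros reads the tap coefficients back out of the encoder, which is a quick sanity check of the model. Every output bit here costs a chain of XOR operations, which is the cost the proposed architecture aims to remove.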
State minimization recognizes the equivalent states of a machine and then minimizes its internal states [21]. Popular state minimization techniques are row based, state implication table based, and heuristic, which follow the rule of thumb of state equivalency [22] while minimizing. These methods may not lead to a fruitful result, as reducing the number of states does not always reduce the hardware: the number of eliminated states may not reduce the total state count by a power of two. Moreover, the convolutional encoder is a state-minimal machine, and hence state minimization techniques will not reduce its hardware. The other front is state assignment, or state encoding, which assigns a binary representation to the control states. The state assignment problem can also be viewed as the decomposition of an FSM. Some popular approaches towards area minimization are Decomposition and Factorization [23], Modify and Restore [24], and Decomposition as Constrained Covering [25,26]. Among the different viable approaches towards state assignment, the heuristic seems to be the most promising [21]. The optimal state encoding methods require state assignment tools like NOVA [27] for area minimization of PLA implementations, JEDI [28] for multilevel implementations, Decomp, One Hot [23], and many more. These approaches are not promising for a state-minimal convolutional encoder, because the significant source of dynamic power dissipation is the modulo-two adders rather than the shift registers. This orients the approach towards combinational logic minimization instead of sequential minimization.
The generator polynomials, that is, $G_0$ and $G_1$ as shown in Figure 1, can be realized with modulo-two adders for constraint length $K = 4$ and code rate 1/2. For higher QoS and error correcting capabilities, polynomials with larger $K$ are required, which increases the logical depth of the adder. This adder logic uses XOR gates, which have a higher switching probability (more transitions) than the basic gates; note that power dissipation depends on the data, the switching probability, and the activity of the functional topology. The switching probability with respect to the number of inputs has been computed for XOR and the basic gates from Table 1 and is depicted in Figure 2. The XOR gates have a constant transition probability of 25%, whereas the others follow logarithmic means. In Table 1, the input signal probabilities are assumed to be 0.5 for static CMOS gates [11].
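The constant 25% figure for XOR can be reproduced with a short calculation. The sketch below assumes the common static-CMOS activity model, in which the probability of a 0-to-1 output transition is $P_0 \cdot P_1$ for independent inputs; Table 1 is not reproduced here, so the model and the function names are assumptions for illustration only.

```python
def output_one_probability(gate: str, n_inputs: int, p: float = 0.5) -> float:
    """Probability that the gate output is 1 for independent inputs,
    each being 1 with probability p."""
    if gate == "and":
        return p ** n_inputs
    if gate == "or":
        return 1.0 - (1.0 - p) ** n_inputs
    if gate == "xor":
        # Output is 1 when an odd number of inputs are 1.
        return 0.5 * (1.0 - (1.0 - 2.0 * p) ** n_inputs)
    raise ValueError(f"unsupported gate: {gate}")


def switching_probability(gate: str, n_inputs: int, p: float = 0.5) -> float:
    """Assumed activity model: P(0 -> 1) = P(output = 0) * P(output = 1)."""
    p_one = output_one_probability(gate, n_inputs, p)
    return (1.0 - p_one) * p_one


if __name__ == "__main__":
    for n in range(2, 9):
        row = {g: switching_probability(g, n) for g in ("and", "or", "xor")}
        print(n, row)
```

Under this model the XOR value stays at 0.25 for any number of inputs, while the AND and OR values fall as inputs are added, which is why a deep XOR tree is comparatively expensive in dynamic power.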

The Construction Technique
The proposed algorithm implements an XOR-FREE processing architecture for a chosen polynomial that is nonsystematic (the input bit itself is not a part of the encoded output) and feedforward (no feedback path exists in the design). The algorithm consists of five major steps, which are summarized as follows.
The deterministic finite automaton of a convolutional transducer is a six-tuple, $(\Sigma, \Gamma, S, s_0, \delta, \omega)$, consisting of a finite set of input symbols called the input alphabet ($\Sigma$), an output alphabet ($\Gamma$, a finite, nonempty set of symbols), a finite set of states ($S$) with start state ($s_0 \in S$), a transition function ($\delta : S \times \Sigma \to S$), and an output function ($\omega$) [20].
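To make the six-tuple concrete, the following sketch models the encoder as a Mealy machine with an explicit transition function and output function. The names `delta`, `omega`, and `run`, the state encoding (bit $l-1$ of the state holds the input delayed by $l$), and the taps in `G` are assumptions chosen for illustration and are not taken from the text.

```python
# Assumed example taps, rate 1/2, constraint length K = 5 (not from the text).
G = ((1, 0, 0, 1, 1), (1, 1, 0, 1, 1))
K = 5
STATE_MASK = 2 ** (K - 1) - 1  # the state holds the K - 1 previous input bits
START_STATE = 0                # s_0


def delta(state: int, u: int) -> int:
    """Transition function: shift the new input bit into the state."""
    return ((state << 1) | u) & STATE_MASK


def omega(state: int, u: int) -> tuple:
    """Output function: one parity bit per generator polynomial."""
    parities = []
    for taps in G:
        parity = taps[0] & u  # tap on the current input
        for delay in range(1, K):
            parity ^= taps[delay] & ((state >> (delay - 1)) & 1)
        parities.append(parity)
    return tuple(parities)


def run(bits, state: int = START_STATE):
    """Drive the machine and collect its outputs and final state."""
    outputs = []
    for u in bits:
        outputs.append(omega(state, u))  # Mealy: output depends on state and input
        state = delta(state, u)
    return outputs, state
```

Because the output depends on both the state and the current input, this is a Mealy machine; fixing the input to 0, as Step 1 of the algorithm does, makes the output a function of the state alone.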

Algorithm 1 (for XOR-FREE processing of convolutional encoder architecture).
Step 1 (conversion). Convert the Mealy machine to a Moore machine by assuming the input bit to be logic "0", that is, $\Sigma = \{0\}$, for all possible encoder states $S$. The next state follows the transition function ($\delta : S + 1 \to S$), like the behavior of a $(K - 1)$-bit counter. Find the output responses using (3) and arrange them in tabular form.
Step 3 (decomposition). Split the encoded bits used for the state representation into two subparts, row tag (RT) and column tag (CT), using (6).
Step 4 (Isomorphs). Find the Isomorphs based on similar encoder state and output.
Step 5 (restoring). Restore the Mealy machine from the Moore machine; that is, reconsider logic "1." By applying the Even-Odd concept of XOR to the parity bits, the system regains its full functionality.
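Steps 1 and 5 can be illustrated with a small design-time sketch. It rests on one reading of the Even-Odd concept, which is an assumption: a parity bit is 1 exactly when an odd number of tapped cells hold a 1, so an input of logic "1" on a tapped position complements the parity found for input "0". The function names and the taps in `G` are hypothetical and not taken from the text.

```python
# Assumed example taps, rate 1/2, constraint length K = 5 (not from the text).
G = ((1, 0, 0, 1, 1), (1, 1, 0, 1, 1))
K = 5


def zero_input_table(generators=G):
    """Step 1 sketch: parity bits for input 0 over all 2^(K-1) states.

    Bit (delay - 1) of the state holds the input delayed by `delay`.
    """
    table = []
    for state in range(2 ** (K - 1)):
        row = []
        for taps in generators:
            ones = sum(taps[d] & ((state >> (d - 1)) & 1) for d in range(1, K))
            row.append(ones % 2)  # odd count of ones -> parity 1
        table.append(tuple(row))
    return table


def restore_input_one(table, generators=G):
    """Step 5 sketch: derive the rows for input 1 from the rows for input 0.

    If the generator taps the current input, an even count becomes odd
    (and vice versa), so the stored parity bit is complemented.
    """
    return [
        tuple(p if taps[0] == 0 else 1 - p for p, taps in zip(row, generators))
        for row in table
    ]
```

All of this arithmetic happens while the table is prepared, not while encoding; the hardware would only read the stored parity bits.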
Algorithm 1 is explained with an elaborated example of the GSM-900 convolutional generator polynomial. The first step of the algorithm finds the output response of the chosen polynomial using a C-code. The code assumes the input bit to be static logic "0" for all possible combinations of states, as listed in Table 2. This results in four groups, each having $2^2$ states, rearranged as shown in Table 3. The shown states are $2^2$ instead of $2^3$ since logic "1" has not yet been considered as input. This can be seen from the MSB of Tables 2 and 3, which is logic "0" throughout the computations.
Step 3 splits the encoder state assignment bits into two subparts, RT and CT, as shown in Table 3. Splitting is done using (6) and is highlighted in color (orange) in the first column of Table 3. After splitting, Table 4 is prepared, where color highlights the common rows, the so-called Isomorphs. The Isomorph states are those which share the same RT and CT and result in the same parity bits. The Isomorph group is formed as shown in Table 5. Finally, the ROM is prepared by merging the common rows of Table 4 and assigning a new reduced RT with the same CT. Table 6 is used as the ROM for hardware implementation, where the current state of the SR and the input bit are used for decoding its addresses. The XOR-FREE architecture is then implemented using the SR and the ROM formed from Table 6. The addresses of the ROM are RT and CT, formed by the current input bit and the previous state of the shift register, as shown in Figure 3. The outputs of the ROM are the encoded sequences, which can be taken in serial or parallel as required.
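The decomposition, Isomorph merging, and ROM lookup can be sketched as follows. The split rule (6) and Tables 3 to 6 are not reproduced here, so the split used below (the two least significant address bits form CT, the remaining bits form RT) is a hypothetical choice, as are the taps in `G` and all function names.

```python
# Assumed example taps, rate 1/2, constraint length K = 5 (not from the text).
G = ((1, 0, 0, 1, 1), (1, 1, 0, 1, 1))
K = 5
CT_BITS = 2  # hypothetical split: low 2 address bits are CT, the rest are RT


def parity_bits(address: int) -> tuple:
    """Design-time reference for address = (input bit, K - 1 state bits)."""
    u = address >> (K - 1)
    window = [u] + [(address >> (d - 1)) & 1 for d in range(1, K)]
    return tuple(sum(t & v for t, v in zip(taps, window)) % 2 for taps in G)


def build_rom():
    """Group addresses into rows by RT and merge rows with equal contents."""
    seen = {}    # row contents -> reduced RT
    rt_map = []  # original RT -> reduced RT
    rom = []     # one entry per distinct row, indexed by reduced RT then CT
    for rt in range(2 ** (K - CT_BITS)):
        row = tuple(parity_bits((rt << CT_BITS) | ct) for ct in range(2 ** CT_BITS))
        if row not in seen:
            seen[row] = len(rom)
            rom.append(row)
        rt_map.append(seen[row])
    return rt_map, rom


def encode_rom(bits):
    """Run-time path: shift register plus table lookups only."""
    rt_map, rom = build_rom()
    state, encoded = 0, []
    for u in bits:
        address = (u << (K - 1)) | state
        rt, ct = address >> CT_BITS, address & (2 ** CT_BITS - 1)
        encoded.append(rom[rt_map[rt]][ct])
        state = ((state << 1) | u) & (2 ** (K - 1) - 1)
    return encoded
```

With these assumed taps the eight rows collapse into two distinct rows, so the stored table shrinks from 32 entries to 8, at the cost of the small `rt_map` decoding step. The encoding loop itself performs shifts and lookups only.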

Results and Discussion
The hardware implementation of the proposed architecture requires a shift register and a ROM in addition to its decoding logic. Reconfigurable hardware such as an FPGA infers these elements from its standard library and synthesizes the architecture in the desired manner. While keeping the constraint length $K$ and the code rate constant, any change in the generator polynomial results in a new ROM with the same elements at different RT and CT positions. This gives the additional advantage of architectural reconfigurability with lower design complexity. Further, this feature reduces the overhead in partial bit files and saves reconfiguration time. The proposed architecture is implemented in VHDL and synthesized for the Zynq-7 device "XC7Z020-CLG484," that is, the Zedboard [29], using the Xilinx ISE v14.7 tool. Further, the system structural model can be generated from the behavioral model of Figure 4. Such a system structural model consists of a variety of system components such as processors, IPs, memories, and arbiters. The model uses different bus protocols to connect its components [30,31]. Such a system level implementation structure on FPGA, along with the ROM, is depicted in Figure 5. This figure presents the schematic of the system level approach; however, the detailed implementation is not considered at present. The present paper mainly focuses on the new XOR-FREE algorithm and its behavioral implementation on FPGA.
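The reconfigurability argument can be illustrated in software: for a fixed constraint length and code rate, swapping the generator polynomial only rewrites the contents of a table of fixed size. The sketch below uses a flat ROM without the RT/CT merging; the class `RomEncoder`, the helper `build_flat_rom`, and both tap sets are hypothetical and not taken from the text.

```python
from collections import Counter

# Two assumed rate-1/2, K = 5 tap sets; neither is taken from the text.
TAPS_A = ((1, 0, 0, 1, 1), (1, 1, 0, 1, 1))
TAPS_B = ((1, 1, 1, 0, 1), (1, 0, 0, 1, 1))


def build_flat_rom(generators):
    """One entry per address; address = (input bit, K - 1 state bits)."""
    k_len = len(generators[0])
    rom = []
    for address in range(2 ** k_len):
        window = [address >> (k_len - 1)]
        window += [(address >> (d - 1)) & 1 for d in range(1, k_len)]
        rom.append(
            tuple(sum(t & v for t, v in zip(taps, window)) % 2 for taps in generators)
        )
    return rom


class RomEncoder:
    """Shift register plus ROM; reconfiguration only rewrites the ROM."""

    def __init__(self, generators):
        self.k_len = len(generators[0])
        self.rom = build_flat_rom(generators)
        self.state = 0

    def reconfigure(self, generators):
        if len(generators[0]) != self.k_len or len(generators) != len(self.rom[0]):
            raise ValueError("constraint length and code rate must stay the same")
        self.rom = build_flat_rom(generators)
        self.state = 0

    def step(self, u):
        out = self.rom[(u << (self.k_len - 1)) | self.state]
        self.state = ((self.state << 1) | u) & (2 ** (self.k_len - 1) - 1)
        return out


if __name__ == "__main__":
    same = Counter(build_flat_rom(TAPS_A)) == Counter(build_flat_rom(TAPS_B))
    print("same entries, different positions:", same)
```

For these two tap sets the tables hold the same parity pairs the same number of times, only at different addresses, which mirrors the observation about a new ROM with the same elements at different positions.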
The dynamic power of the system is estimated with the Xilinx XPower Analyzer tool. For estimating dynamic power, an additional test stimulus is written which generates a random sequence of 1000 bits, clocked at 200 MHz. The functional verification is done in the MATLAB System Generator environment [32], where the proposed architecture is validated against the convolutional encoder block of the Simulink library, as shown in Figure 4. The delay element in Figure 4 compensates for the one clock cycle consumed during the reset state of the proposed design. The same can be seen in the waveforms of Figure 6. Hardware cosimulation is performed to test the proposed design, and its functional verification waveform in the ModelSim environment is shown in Figure 7.
Finally, it is worth mentioning an interesting observation. The complicated state diagram of $2^{K-1}$ states is now reduced into two substate diagrams due to the RT and CT split, as depicted in Figure 8. The algorithm has been tested for various standard polynomials of communication, as given in Table 7.
The resource utilization for the two different architectures has been estimated and is summarized in Table 8. The dynamic activity at different levels, namely, signal, IOs, logic, and clock domain, is calculated and compared for the conventional and the proposed encoder architecture, as depicted in Figure 9. The power estimation and resource estimation are done for the highest $K$, that is, 9, for the WCDMA/CDMA2000 standard. The algorithm completely removes the delay consumed in computing the logical XOR while convolving and hence improves the propagation delay by up to 32%, the dynamic power by up to 30%, and the hardware cost by up to 20% as compared to the conventional architecture. Usually, the size of the hardware depends upon the nature of the used polynomial, the constraint length, and the code rate. With the proposed algorithm, the size of the multiplexer sometimes grows with the constraint length of the polynomial. Moreover, in a few cases with $K \geq 12$, the algorithm fails to find commonalities in the chosen polynomials. This tends to increase the ROM size and degrades the maximum operating frequency. However, the present algorithm is validated up to $K = 9$ for various standard polynomials, as shown in Table 7, and shows a significant improvement in performance.

Conclusions
The paper addresses the problem of optimizing modulo-2 adders, which are a significant source of dynamic power, and proposes a novel algorithm to implement an XOR-FREE architecture for nonsystematic, feedforward convolutional encoders. The approach reduces the standard polynomial into a ROM, which eases FPGA implementation. The ROM-based hardware implementation gives the benefit of reconfiguration by just modifying the polynomial of the same constraint length and code rate. The algorithm is tested for known polynomials of wireless standards with constraint length up to 9. The proposed architecture is novel and promises to improve the propagation delay, dynamic power, and hardware cost for reconfigurable hardware.

Figure 5: A system level implementation structure on FPGA along with ROM.

Figure 6: Functional verification of the proposed architecture in MATLAB Simulink.

Figure 7: Functional verification of the proposed architecture in ModelSim cosimulation environment.

Table 2: The convolved output for input logic "0" and $2^{K-1}$ SR states with parity bits as output.

Table 3: Step 1 output is rearranged and then decomposed into RT and CT.

Table 4: The decomposed RT and CT arrangement.

Table 6: The ROM formed for Table 4.

Table 7: The proposed algorithm is tested for the following wireless standards.

Table 8: Resource utilization comparison for the architectures.