This paper presents a novel approach to implementing multiplication of Galois Fields with
Galois Field arithmetic is widely used in applications such as error-correcting codes and cryptography. Generally, the Galois Fields used are GF(
Conventionally, these polynomials are represented as binary numbers, where the
Addition of GF(
Do a polynomial-multiply of two inputs.
Do a polynomial-remainder of the product modulo a third input, the prime polynomial.
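The two steps above can be sketched in Python on ordinary right-aligned integers. This is only an illustrative model: the helper names `poly_mul`, `poly_mod` and `gf_mul` are hypothetical, and the operands and the prime polynomial x^6 + x^3 + 1 are chosen purely as an example.

```python
def poly_mul(a: int, b: int) -> int:
    """Carry-less product of two GF(2) polynomials held as right-aligned bit masks."""
    result = 0
    while b:
        if b & 1:          # coefficient of the current term of b is 1
            result ^= a    # polynomial addition over GF(2) is xor
        a <<= 1
        b >>= 1
    return result


def poly_mod(dividend: int, prime: int) -> int:
    """Remainder of dividend modulo prime, by long division over GF(2)."""
    deg_p = prime.bit_length() - 1
    while dividend.bit_length() - 1 >= deg_p:
        shift = dividend.bit_length() - 1 - deg_p
        dividend ^= prime << shift   # cancel the current leading term
    return dividend


def gf_mul(a: int, b: int, prime: int) -> int:
    """Galois Field multiply: polynomial-multiply, then polynomial-remainder."""
    return poly_mod(poly_mul(a, b), prime)


# Example in GF(2^6) with the illustrative prime x^6 + x^3 + 1.
product = poly_mul(0b101100, 0b011011)
result = gf_mul(0b101100, 0b011011, 0b1001001)
```

Keeping the two steps as separate functions mirrors the split into two operations that the rest of the paper relies on.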
In software, GF multiplications are usually performed using Look-up Tables (LUTs). For large N, the LUT becomes prohibitively large. The processing time also becomes prohibitive at high data rates.
To further complicate issues, processors need to be able to handle Galois Fields of different lengths. Consequently, several processors have added instructions for Galois Field Multiplication (GFM). The most representative is the TI C64x DSP.
A general-purpose GFM instruction needs 4 inputs, at least 3 of which must be in registers: the two multiplicands and the prime polynomial. Since most instruction sets do not provide 4 input fields, a GFM instruction generally uses a special-purpose register that provides the length, the prime, or both. For instance, the GMPY4 operation in the TI C64x DSP uses the GFPGFR special-purpose register to specify the length and prime polynomial [
In this paper, we describe an approach that implements GFM using two instructions, one of which implements polynomial-multiply over GF(2), and the other of which implements the polynomial-remainder over GF(2). In the Sandblaster 2.0 architecture [
The Sandblaster 2.0 has a 16-way SIMD unit; consequently, we also have 16-way variants of the polynomial-multiply and polynomial-remainder instructions, called rgfmul and rgfnorm. Additionally, the SIMD unit supports polynomial-multiply-and-add and polynomial-multiply-and-reduce instructions called rgfmac and rgfmulred.
It turns out that it is possible to specify the gfmul and gfnormi operations in such a fashion that we can use almost identical hardware to implement both functions. Consequently, there is no hardware overhead to split the GFM operation into two operations.
The Galois Field sum of several GFM operations can be simplified to the polynomial sum (i.e., xor) of several polynomial-multiplies, followed by a single polynomial-remainder. This pattern is common in algorithms that use Galois Field arithmetic. In those cases, we can implement the sum of N GFM operations using N polynomial-multiplies and 1 polynomial-remainder, so the split implementation costs only one extra instruction.
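The identity behind this saving follows from the linearity of the remainder over GF(2). The sketch below checks it numerically; the function names are hypothetical, and the prime 0x11D (x^8 + x^4 + x^3 + x^2 + 1) is an assumption picked for illustration, not a value taken from the text.

```python
from functools import reduce
from operator import xor

PRIME = 0x11D  # assumed GF(2^8) prime polynomial, for illustration only


def poly_mul(a: int, b: int) -> int:
    """Carry-less product of two right-aligned GF(2) polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(dividend: int, prime: int) -> int:
    """Long-division remainder over GF(2)."""
    deg_p = prime.bit_length() - 1
    while dividend.bit_length() - 1 >= deg_p:
        dividend ^= prime << (dividend.bit_length() - 1 - deg_p)
    return dividend


def dot_per_term(xs, ys, prime=PRIME):
    """N poly-multiplies and N poly-remainders (one full GFM per term)."""
    return reduce(xor, (poly_mod(poly_mul(x, y), prime) for x, y in zip(xs, ys)), 0)


def dot_deferred(xs, ys, prime=PRIME):
    """N poly-multiplies, xor of the raw products, then a single poly-remainder."""
    raw_sum = reduce(xor, (poly_mul(x, y) for x, y in zip(xs, ys)), 0)
    return poly_mod(raw_sum, prime)


xs = [0x53, 0xCA, 0x01, 0xFF]
ys = [0xCA, 0x02, 0x80, 0xFF]
same = dot_per_term(xs, ys) == dot_deferred(xs, ys)
```

Both functions return the same field element; the deferred form simply postpones the reduction until all products have been xor-ed together.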
Section
The Sandblaster scalar unit has 16 32-bit general purpose registers. Like most RISC architectures, at most 2 registers can be read and 1 written per integer operation. An integer operation has fields to specify up to 3 registers. An extended immediate variant of the instruction can additionally provide up to 12 bits of immediate data.
In the customary binary representation of polynomials over GF(2), the bits are right-aligned, with the LSB representing the coefficient of term x^0, and bit i representing the coefficient of term
Note that in this representation, without knowing the length of the polynomial, we cannot be sure which polynomial is represented by a specific number. For instance, 0xb000_0000 could be interpreted as x3
We picked this representation to make it easier to compute the remainder. Since the high-order term of the divisor and dividend is left-aligned, we can start subtraction without requiring any shifting to line up the start of the polynomials.
For correctness, it is assumed that all unused bits in the register are 0. Both polynomial-multiply and -remainder are implemented so that they leave their results left-aligned with unused bits as 0.
There is one wrinkle about the representation. We assume that the polynomial remainder is performed with a left-aligned divisor so that the MSB is always 1. In this case, representing the leading coefficient is redundant. So, we do not represent the leading bit of a divisor polynomial. Instead the MSB represents the coefficient of the second highest term
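A small sketch of this left-aligned encoding may help. The helper names (`left_align`, `encode_divisor`, `decode`) are hypothetical, and the degree-6 prime x^6 + x^3 + 1 is used only as an illustrative value.

```python
WORD = 32
MASK32 = 0xFFFFFFFF


def left_align(poly: int, length: int) -> int:
    """Place a `length`-bit polynomial in the top bits of a 32-bit word, rest zero."""
    return (poly << (WORD - length)) & MASK32


def decode(word: int, length: int) -> int:
    """Recover the right-aligned polynomial, given its length in bits."""
    return word >> (WORD - length)


def encode_divisor(prime: int, k: int) -> int:
    """Drop the leading coefficient (always 1) of a degree-k prime polynomial
    and left-align the remaining k coefficient bits."""
    assert prime >> k == 1, "prime must have degree exactly k"
    return left_align(prime & ((1 << k) - 1), k)


# The same register value encodes different polynomials for different lengths.
as_len4 = decode(0xB0000000, 4)   # 0b1011   -> x^3 + x + 1
as_len6 = decode(0xB0000000, 6)   # 0b101100 -> x^5 + x^3 + x^2

# Divisor register for the illustrative prime x^6 + x^3 + 1 (0b1001001).
rp = encode_divisor(0b1001001, 6)
```

The length has to be tracked outside the register, which is why the decoding helper takes it as an explicit argument.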
The poly-multiply instruction in the Sandblaster architecture, gfmul, has the following format:
It performs a polynomial multiplication of the upper 8 bits of ra with the upper 8 bits of rb, and writes the 15-bit result of the poly-multiply to the upper bits of the target register rc. The remaining bits of rc are zeroed.
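The behaviour described here can be modelled functionally as follows. This is a sketch of the architectural effect only, not of the hardware; the name `gfmul_model` is hypothetical, and registers are modelled as 32-bit Python integers.

```python
def gfmul_model(ra: int, rb: int) -> int:
    """Poly-multiply the upper 8 bits of ra and rb.

    The 15-bit product is returned left-aligned in a 32-bit word
    (bits 31..17); the remaining 17 bits are zero.
    """
    a = (ra >> 24) & 0xFF
    b = (rb >> 24) & 0xFF
    product = 0
    while b:
        if b & 1:
            product ^= a   # xor is addition over GF(2)
        a <<= 1
        b >>= 1
    return (product << 17) & 0xFFFFFFFF


# Inputs shorter than 8 bits stay left-aligned, so the product does too.
rc = gfmul_model(0xB0000000, 0x6C000000)
```

Because both inputs are left-aligned with zero padding, a pair of 6-bit operands yields an 11-bit product that is still left-aligned in the result.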
The poly-remainder instruction in the Sandblaster architecture, gfnormi, has the format
The dividend is the 32-bit number formed by the upper 16 bits of rc right padded with 0. The divisor is the 17 bit number formed by prepending a 1 to the upper 16 bits of rp. J is an immediate operand ranging from 0 to 7. The gfnormi instruction performs J
Implementing a GFM over GF(
the product inputs are stored in the upper K bits of two registers, ra, rb,
the leading bit of P is dropped and the remaining K bits are stored in the upper K bits of a register, rp,
all unused bits are set to 0.
After executing the following code sequence, the final result of the GFM will be the upper K bits of rt:
Table
Galois field multiply over GF(2^6).
ra | 0xb000_0000 | ||
rb | 0x6c00_0000 | ||
011_1101_0100 | rc | 0x7a80_0000 | |
% | rp | 0x2400_0000 | |
rt | 0xa800_0000 |
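The remainder step of this example can be reproduced with a short behavioural model. The name `gfnormi_model` is hypothetical, and the number of divide steps is passed directly as `steps` rather than derived from the instruction's immediate; five steps are assumed here because a degree-10 dividend and a degree-6 divisor need 10 - 6 + 1 = 5 of them.

```python
MASK32 = 0xFFFFFFFF


def gfnormi_model(rc: int, rp: int, steps: int) -> int:
    """Left-aligned polynomial long division over GF(2).

    rc    : dividend in its upper 16 bits
    rp    : divisor without its leading 1, in the upper 16 bits
    steps : number of divide steps to perform (assumed parameter)
    """
    res = rc & 0xFFFF0000
    div = rp & 0xFFFF0000
    for _ in range(steps):
        leading = res >> 31
        res = (res << 1) & MASK32   # drop the leading coefficient
        if leading:
            res ^= div              # subtract the rest of the divisor
    return res


# Values from the table: product rc and prime register rp.
rt = gfnormi_model(0x7A800000, 0x24000000, steps=5)
```

Shifting before the xor is equivalent to xor-ing with the full divisor (leading 1 included) and then shifting, since the leading terms always cancel.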
The SIMD unit in the Sandblaster 2.0 architecture has eight 16
The SIMD unit supports GFM through the rgfmul and rgfnorm instructions, which have the following format:
These instructions perform 16 poly-multiplies or poly-remainders in parallel. Since the SIMD register elements are 16 bits wide, rgfmul uses the upper 8 bits of each element, while rgfnorm uses the entire 16 bits of the element. Other than that, their behavior is identical to that of the gfmul/gfnormi instructions.
Up to three SIMD registers can be read per cycle; we use the extra read port to implement a poly-multiply-and-add instruction with the format:
The rgfmac instruction performs 16 poly-multiplies of the 16 elements of va and vb, and then poly-adds (xors) the products with the corresponding elements of vs.
The SIMD unit has an idiom where the 16 results of element-wise operations (such as rgfmul) are combined together to form a single value that is written to the accumulator. The poly-multiply-and-sum-reduce instruction follows this idiom
The 16 elements of va and vb are poly-multiplied together, and the 16 resulting products are poly-summed (xor-ed) together to form a single 16-bit value that is written to the accumulator register.
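A lane-wise model of these SIMD operations is sketched below. The function names are hypothetical, vectors are modelled as Python lists of sixteen 16-bit integers, and the placement of each 15-bit product in the upper bits of its 16-bit lane is an assumption carried over from the scalar gfmul description.

```python
LANES = 16


def lane_mul(a: int, b: int) -> int:
    """Poly-multiply the upper 8 bits of two 16-bit elements.

    The 15-bit product is left-aligned in the 16-bit result (assumed layout).
    """
    x, y, product = a >> 8, b >> 8, 0
    while y:
        if y & 1:
            product ^= x
        x <<= 1
        y >>= 1
    return (product << 1) & 0xFFFF


def rgfmul_model(va, vb):
    """16 independent poly-multiplies."""
    return [lane_mul(a, b) for a, b in zip(va, vb)]


def rgfmac_model(va, vb, vs):
    """16 poly-multiplies, each xor-ed with the matching element of vs."""
    return [lane_mul(a, b) ^ s for a, b, s in zip(va, vb, vs)]


def rgfmulred_model(va, vb):
    """16 poly-multiplies whose products are xor-ed into one 16-bit value."""
    acc = 0
    for a, b in zip(va, vb):
        acc ^= lane_mul(a, b)
    return acc


va = [0xB000] + [0] * (LANES - 1)
vb = [0x6C00] + [0] * (LANES - 1)
reduced = rgfmulred_model(va, vb)
```

No remainder is taken in any of these operations; the reduction modulo the prime polynomial is left to a later poly-remainder.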
The gfnormi and gfmul instructions can be implemented by the same block with very little overhead. As can be seen from the pseudocode in Algorithm
gfop (ismul, ra, rb, J)
The gfnormi instruction computes the remainder using polynomial long division. Since the values are left-aligned, we start the process at bit 31 of the dividend value. The divisor consists of a leading 1 and the upper 16 bits of the divisor register. The immediate argument to the gfnormi instruction specifies the number of divide steps executed, J
At each step, the result is xor-ed with 0 or with the divisor, depending on the leading bit being 0/1. The result is then left-shifted by 1 to ensure that the remainder after the division step is left-aligned. Note that xor-ing with 0 is the identity operation; this results in just a left-shift. This is done when the intermediate remainder is smaller than the divisor.
Each poly-multiply step needs to follow the same pattern as the poly-remainder so that much of the hardware is common. If we are going to J
the partial result is initialized to all 0s,
the “divisor” at each step is one of the multiply inputs prepended with J
the control to select whether the xor is with the divisor or 0s are the bits of the second multiply inputs starting with the upper-most bit.
The example below multiplies 101100 and 011011 using 6 steps. 10110 is used as the control input
The gfmul instruction always does 8 steps of multiply. Consequently, in the implementation, the “divisor” is prepended by 8 zeroes.
The unified block that implements gfnormi and gfmul in the SB3500 consists of some setup followed by 8 stages of a computational kernel. This computation in each stage is an xor-select, as shown in Figure
xor-select block.
In the case of the polynomial remainder operation,
For the polynomial multiply operation, res is set to 0 and div is set to the value of the rb register prepended by 8 zeroes; the count N is always 8. The top 8 bits of the ra register are used as controls to the 8 xor-shift stages; if the corresponding bit is 1, res is xor-ed with the value in div and shifted by 1; otherwise it is just shifted by 1.
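The shared structure of the two operations can be sketched as a single routine built around one xor-select-shift step. This is a behavioural illustration under stated assumptions, not the authors' pseudocode: the `steps` argument stands in for the divide-step count, the multiply control bits are taken from the first operand, and the signature only loosely follows the gfop caption.

```python
MASK32 = 0xFFFFFFFF


def gfop(ismul: bool, ra: int, rb: int, steps: int) -> int:
    """Unified poly-multiply / poly-remainder model on 32-bit words."""
    if ismul:
        res = 0
        div = (rb & 0xFF000000) >> 8   # multiplicand prepended by 8 zeroes
        steps = 8                      # multiply always runs 8 stages
    else:
        res = ra & 0xFFFF0000                         # left-aligned dividend
        div = 0x80000000 | ((rb & 0xFFFF0000) >> 1)   # 17-bit divisor: 1, then rb[31:16]
    for i in range(steps):
        if ismul:
            select = (ra >> (31 - i)) & 1   # control bits, upper-most first
        else:
            select = res >> 31              # leading bit of running remainder
        # One xor-select stage: xor with div or with 0, then shift left by 1.
        res = ((res ^ (div if select else 0)) << 1) & MASK32
    return res


rc = gfop(True, 0xB0000000, 0x6C000000, 8)
rt = gfop(False, rc, 0x24000000, 5)
```

Only the setup and the source of the select bit differ between the two modes, which is the property that lets one datapath serve both instructions.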
From the diagram shown in Figure
The SB3500 implementation targets a 1.6-nanosecond clock. It has a 2-stage execute pipeline, so the gf-op block is pipelined across 2 stages. This gives the synthesis tool 3.3 nanoseconds to implement the block. The synthesis tool used this relaxed timing to pick a power- and area-optimized implementation. In this implementation the gf-op block occupies approximately 2018
It is possible to implement various look-ahead schemes that would reduce the critical path at the expense of extra logic. Since we have ample slack, we did not investigate any area/speed tradeoffs.
We have implemented an RS encoder/decoder that is designed to run on a SIMD architecture. The numbers presented in this section are tuned for the DVB (digital video broadcasting) standard. This standard uses RS(204,188) encoding; that is, it adds 16 check symbols to a 188-byte packet, resulting in a total code word of length 204 bytes.
The RS encoder used in this study does the following steps [
append N zeroes to a data block,
perform successive Horner reduction of the polynomial whose coefficients are the data block plus zeroes to obtain remainders,
multiply remainders by pre-calculated coefficients and sum.
All operations are over GF(2^8). The RS decoder [
Our implementation combines several techniques to improve the error decoding capability.
Correct codeword using Peterson-Gorenstein-Zieler (PGZ) [
If that fails to correct the errors, successively apply 2, 4, 6, 8 erasures, deriving an error locator polynomial, until an error locator polynomial of correct degree is derived [
If an error locator polynomial is identified, attempt to decode the word using the Forney-Massey-Berlekamp (FMB) method [
Again, all operations are over GF(2^8). The details of this approach have been published previously [
We started off with an original version of the code that was designed to use Galois Field operations. This base code was then rewritten to use SIMD forms of polynomial-multiply and -remainder operations.
The experiments encode one RS(204,188) packet, artificially introduce enough errors to require 8 erasures, and then decode the packet. Note that this is the worst-case decode situation; in practice 98% of all packets have all syndromes equal to zero, so no error decoding is required.
Table
In the decoder, by contrast, the number of poly-remainder operations is 25% of the number of poly-multiplies. This allows us to get an 11.5x speedup.
Polynomial operations in the RS(204,188) encoder/decoder.
Operation | Instruction | Encoder | Decoder (PGZ) | Decoder (FMB, one iteration)
Galois field multiply | | 3384 | 8864 | 9267
Polynomial multiply | gfmul | | |
 | rgfmul | 203 | 134 | 134
 | rgfmac | 16 | 56 | 0
 | rgfmulred | | 423 | 529
 | Total | 219 | 613 | 663
Polynomial remainder | gfnormi | | 2 | 2
 | rgfnorm | 204 | 152 | 152
 | Total | 204 | 154 | 154
Total polynomial operations | | 423 | 767 | 817
For the DVB-T case, at the highest bit rate of 31.67 Mbps, the decoder is called 21 763 times per second. The processor cycles spent in vector mode on the GF operations alone amount to less than 18 MHz (a fraction of the SBX processor's capabilities), compared to 277 MHz in scalar mode. The iterative decoding algorithm was tested in the end-to-end DVB-T/H simulated system specified by ETSI EN 300 744 V1.4.1 (2001-01). The simulations were performed using the SBX simulation tools. Using our GF instructions, the total number of cycles per second consumed in the SBX processor, for the highest bit rates specified in the standards and assuming that every packet has eight errors and eight erasures, is 29 MHz for the 31.67 Mbps DVB-T and 9 MHz for the 4.4 Mbps DVB-H, including the optional second RS decoder at the link level.
The method we have described implements a GFM using two instructions, a poly-multiply and a poly-remainder. This allows GFM to be added to a standard architecture without introducing a special-purpose register for the GFM. Further, both of these instructions can be implemented using the same hardware block.
We have shown that, for some applications, it is not necessary to execute both a poly-multiply and a poly-remainder for each GFM. In the cases where the results of several GFMs are added together, the products of the corresponding poly-multiplies are summed and then a single poly-remainder is used. In one specific case, only 25% of the poly-multiplies required a poly-remainder. Our simulation results indicate a speedup of 11.5x for the extended processor over the standard processor.