We propose an energy-efficient error control scheme for on-chip interconnects capable of correcting a combination of multiple random and burst errors. The scheme combines an iterative decoding method for two-dimensional Hamming product codes, interleaving, and a simplified type-II hybrid ARQ. It achieves several orders of magnitude improvement in residual flit-error rate for multiwire errors and up to 45% improvement in throughput in high noise environments. For a given system reliability requirement, the proposed error control scheme yields up to 50% energy improvement over other error correction schemes. The low overhead of our approach makes it suitable for implementation in on-chip interconnect switches.

On-chip interconnect errors, exacerbated by very deep submicron (VDSM)
technology [

Unfortunately, interconnect links have tight speed, area, and
energy constraints, making the use of complex but powerful codes unsuitable. Product
codes [

There are
three classes of interconnect errors—permanent, intermittent, and transient
[

Previous work in this area has often focused on one- or two-bit error
scenarios. In [

Product codes [

Concept of two-dimensional product codes.

The simplest two-dimensional product
codes are single-parity check (SPC) product codes, guaranteed to correct only
one error by inverting the intersection bit in the erroneous row and column [

In our approach, we use
two-dimensional product codes, in which an SEC-DED (e.g., extended Hamming) code

Proposed encoding process.

The
simplest strategy of product code decoding is the two-step row-column (or
column-row) decoding algorithm [

Realization of status vectors.

Compared to two-step row-column decoding,
our method properly addresses the rectangular four-error pattern problem by
recording the behavior of the row and column decoders in row and
column status vectors. Rather than passing only coded data between the row and
column decoders, the scheme also passes the row and column status
vectors between stages and uses them to help make decoding decisions. The realization of the row and column status vectors can be described by two rules, shown in
Figure

Row decoding of the received encoded matrix. If the
error is correctable, the error bit
indicated by the syndrome is flipped. The corresponding
row status vector position is set according to the
mapping in Figure

Column
decoding of the updated matrix. First, the individual syndromes of each column are
calculated in parallel. If a column syndrome is nonzero, there are two possible
scenarios, depending on the error position indicated by that syndrome

If the row
state corresponding to

If the
corresponding row state value is “00”,

Row decoding
the matrix after changes from
Step

Figure

Decoding of rectangular four-error pattern using the proposed iterative decoding scheme.

A comprehensive simulation of all possible error patterns consisting of five random errors or fewer was performed, verifying that the proposed iterative decoding method operates correctly.
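To make the rectangular four-error pattern concrete, the following Python sketch builds a small product code and decodes that pattern. It is a simplified stand-in, not the status-vector rules of the proposed method: it assumes an extended Hamming (8,4) code on a 4×4 data block, and it replaces the status vectors with a plain per-row and per-column "double error detected" flag, flipping the bits where flagged rows and flagged columns intersect. All function names are illustrative.

```python
def encode_word(d):
    """Extended Hamming (8,4): index 0 is overall parity, 1/2/4 are Hamming parity."""
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) % 2
    return c


def sec_ded(word):
    """Return (word, status); status is 'ok', 'corrected', or 'detected'."""
    word = list(word)
    syndrome = 0
    for i, bit in enumerate(word):
        if bit:
            syndrome ^= i          # XOR of the indices of all set bits
    if sum(word) % 2:              # odd overall parity: treat as one error
        word[syndrome] ^= 1
        return word, "corrected"
    if syndrome:                   # even parity, nonzero syndrome: two errors
        return word, "detected"
    return word, "ok"


def encode_product(data):
    """Encode a 4x4 bit matrix into an 8x8 product codeword (rows, then columns)."""
    rows = [encode_word(r) for r in data]
    cols = [encode_word([rows[r][c] for r in range(4)]) for c in range(8)]
    return [[cols[c][r] for c in range(8)] for r in range(8)]


def decode_product(matrix):
    """Row decode, column decode, then flip intersections of flagged rows/columns."""
    m = [list(row) for row in matrix]
    row_status = []
    for r in range(8):
        m[r], status = sec_ded(m[r])
        row_status.append(status)
    col_status = []
    for c in range(8):
        col, status = sec_ded([m[r][c] for r in range(8)])
        col_status.append(status)
        for r in range(8):
            m[r][c] = col[r]
    bad_rows = [r for r, s in enumerate(row_status) if s == "detected"]
    bad_cols = [c for c, s in enumerate(col_status) if s == "detected"]
    for r in bad_rows:             # assumption: simplified stand-in for status vectors
        for c in bad_cols:
            m[r][c] ^= 1
    return m


data = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1]]
codeword = encode_product(data)
received = [row[:] for row in codeword]
for r in (1, 4):                   # rectangular four-error pattern
    for c in (2, 6):
        received[r][c] ^= 1
recovered = decode_product(received)
```

In this sketch each affected row and column sees exactly two errors, so neither SEC-DED decoder can correct on its own; only the combination of row and column flags locates the four bits. The simplified rule is meant for this pattern only and says nothing about the other error patterns covered by the simulation described above.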

In this section, we present the realization of the proposed error correction scheme and the iterative decoding of Hamming product codes.

Figure

Block diagram of proposed transceiver: (a) transmitter; (b) receiver.

Flow chart of proposed transceiver operation: (a) transmitter; (b) receiver.

Figure

Figure

Implementation of the proposed transmitter design and interleaver mapping algorithm: (a) transmitter design; (b) interleaving for 64-bit input information.
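The effect of interleaving on burst errors can be illustrated with a generic round-robin interleaver. This is a minimal sketch under stated assumptions, not the mapping shown in the figure: the row count of 4, the row length of 22, and the function names are hypothetical choices made only for illustration.

```python
def interleave(rows):
    """Round-robin mapping: wire w carries bit (w // n_rows) of row (w % n_rows)."""
    n_rows, n_cols = len(rows), len(rows[0])
    return [rows[w % n_rows][w // n_rows] for w in range(n_rows * n_cols)]


def deinterleave(wires, n_rows):
    """Inverse mapping: rebuild the row-major matrix from the wire order."""
    n_cols = len(wires) // n_rows
    return [[wires[c * n_rows + r] for c in range(n_cols)] for r in range(n_rows)]


def errors_per_row(burst_start, burst_len, n_rows):
    """Count how many bits of a burst on adjacent wires land in each row."""
    counts = [0] * n_rows
    for w in range(burst_start, burst_start + burst_len):
        counts[w % n_rows] += 1
    return counts


# Assumed geometry (hypothetical): 4 encoded rows of 22 bits each.
N_ROWS, N_COLS = 4, 22
labels = [[r * N_COLS + c for c in range(N_COLS)] for r in range(N_ROWS)]
wires = interleave(labels)
```

With this assumed geometry, any burst of up to `N_ROWS` adjacent wires places at most one error in each row, which is the property that lets a per-row SEC-DED code handle short bursts.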

The interleaved
row encoder outputs are transmitted to the receiver and fed into

Figure

Implementation of proposed receiver and row decoder: (a) receiver design; (b) row decoder.

In the proposed receiver design, if a retransmission is required, the total row
and column parity check bits are used for
iterative decoding of Hamming product
codes. The block diagram of the proposed
three-stage iterative decoding is described in
Figure

Proposed iterative decoding method of two-dimensional Hamming product codes.

Figure

Implementation of the column decoder in the proposed iterative decoding method: (a) column decoder; (b) combinational ones counter.

Implementation of combinational ones counter: (a) four-bit ones counter; (b) merge circuit.
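A behavioral model of a ones counter built from four-bit counters and merge stages can be sketched as follows. This is only an assumption about the structure suggested by the figure captions (four-bit counters feeding a merge tree); the function names and the zero-padding of short inputs are illustrative, and the sketch says nothing about the gate-level design.

```python
def count4(bits):
    """Four-bit ones counter: number of ones among exactly four input bits."""
    assert len(bits) == 4
    return bits[0] + bits[1] + bits[2] + bits[3]


def merge(a, b):
    """Merge stage: add two partial counts."""
    return a + b


def ones_count(bits):
    """Count ones by grouping into nibbles and merging partial counts pairwise."""
    padded = list(bits) + [0] * (-len(bits) % 4)   # pad to a multiple of four
    partial = [count4(padded[i:i + 4]) for i in range(0, len(padded), 4)]
    while len(partial) > 1:
        merged = [merge(partial[i], partial[i + 1])
                  for i in range(0, len(partial) - 1, 2)]
        if len(partial) % 2:                        # carry an unpaired count forward
            merged.append(partial[-1])
        partial = merged
    return partial[0] if partial else 0
```

The pairwise merge mirrors a combinational adder tree, whose depth grows with the logarithm of the number of four-bit groups rather than with the input width.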

The performance
of the proposed coding scheme was evaluated in terms of complexity, reliability,
throughput, power, and energy consumption, for a 64-bit input message arranged
into a

Parameters used for link model.

Width ( | Space ( | Thickness ( | Height ( | Dielectric constant |
---|---|---|---|---|
0.31 | 0.31 | 0.83 | 0.14 | 2.1 |

Simulation
results were compared to forward error correction
(FEC) schemes using Hamming code H(71,64), ARQ schemes using standardized
CRC-5 with
generator polynomial

Bus widths, delay, and equivalent gate number of different error control schemes.

Coding scheme | Bus width | Delay | Equivalent gates |
---|---|---|---|
Hamming (71,64) | 71 | 0.53 ns | 2.6 kgates |
ARQ (CRC-5) | 69 | 0.47 ns | 2.8 kgates |
Hybrid ARQ (extended Hamming (72,64)) | 72 | 0.57 ns | 4.7 kgates |
The proposed scheme | 88 | 0.61 ns | 11.9 kgates |
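As a reading aid for the bus widths above, the relative wire overhead of each scheme with respect to the 64-bit input message can be computed directly. The short sketch below only restates the table's numbers; the dictionary and function names are illustrative.

```python
INFO_BITS = 64  # input message width used in the evaluation

# Bus widths copied from the table above.
BUS_WIDTH = {
    "Hamming (71,64)": 71,
    "ARQ (CRC-5)": 69,
    "Hybrid ARQ (extended Hamming (72,64))": 72,
    "The proposed scheme": 88,
}


def wire_overhead(bus_width, info_bits=INFO_BITS):
    """Extra wires as a fraction of the information bits."""
    return (bus_width - info_bits) / info_bits


overhead = {name: wire_overhead(w) for name, w in BUS_WIDTH.items()}
for name, value in overhead.items():
    print(f"{name}: {value:.1%} extra wires")
```

The proposed scheme uses 24 extra wires (37.5% of the information width), compared with 5 to 8 extra wires for the other schemes, which is the wiring cost against which its reliability and energy results are weighed.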

Table

Table

On-chip
communication errors can be attributed to voltage perturbations induced by
noise from many sources. A simple
model proposed in [

The
residual flit-error rate, which is the probability of decoding failure or decoding error, was used
to evaluate the reliability of the different error control schemes. Hamming codes can correct only one error per codeword; if more than one error occurs
in a codeword, it leads to uncorrected errors.
In ARQ and HARQ schemes, an encoded message is accepted by the receiver only if
it either contains no errors or contains an undetectable error pattern. The residual flit-error rate of the ARQ and HARQ schemes can be expressed as (see [

Figure

Residual flit-error rate comparison of different error control schemes: (a) multiple independent errors; (b) the combination of multiple independent errors and burst errors.

Figure

Residual flit-error rate when burst errors of up to seven bits are modeled.

Another main
concern in on-chip communication is the throughput. In our simulations, go-back-

The throughput
of the different error control schemes is compared in Figure

Throughput comparison of different error control schemes: (a) multiple independent errors; (b) a combination of multiple independent errors and burst errors.

Figure

Codec power consumption comparison.

The link power

Link swing voltage for different error control schemes.

In a network-on-chip (NoC) architecture, the link length is the distance
between two switches, which is determined by the tile block size.
In mesh- or torus-shaped
NoC designs, the links between two switches are generally
wires a few millimeters long [

Link power comparison of different error control schemes for different noise environments: (a) low noise environment; (b) high noise environment.

The average energy to successfully transmit one flit,

For a given reliability requirement, the proposed method consumes the
largest codec energy but the least link energy. We evaluated whether such a link
energy reduction was beneficial in terms of average energy consumption

Energy comparison of different error control schemes for different link lengths: (a) link length 1 mm; (b) link length 2 mm; (c) link length 3 mm.

In the proposed method, the encoded message is separated into two
transmissions. The reliability of the proposed method depends on both the error
detection capability in the first transmission and the error correction capability
of the iterative decoding method. In the first transmission, the error
patterns with single errors in different rows are corrected and the error
patterns with two errors in a row are detected. The proposed method is capable of
detecting 75% of all random independent five-error patterns and 100% of error
patterns consisting of two burst errors of up to three bits each (e.g., one three-bit
burst error and another single-bit random error) in the first transmission. The
iterative decoding algorithm can correct up to five-bit errors once the row and
column parity check bits are received. Also, our method can correct permanent
errors that are distributed in different rows. More complex codes can be used
to increase the reliability of on-chip communication. Our primary concern is
energy efficiency; combining error correction with retransmission is a good
approach that balances energy and reliability. The codec area overhead is
relatively small when compared to the millions of transistors integrated in a system-on-chip
(SoC) [

In this paper, we presented an error control scheme combining Hamming product codes with a simplified type-II hybrid ARQ for on-chip interconnects. The efficient combination of powerful product codes with retransmission shows a good balance between the reliability and energy efficiency in error scenarios where a combination of multiple random and burst errors is considered. Moreover, an efficient iterative decoding method of Hamming product codes is proposed. The proposed decoding algorithm is easily realized in a three-stage pipelined architecture by modifying the conventional row-column decoding algorithm, with a small increase in delay and complexity.

The performance of the proposed method was evaluated in terms of reliability, throughput, and energy consumption. A reduction of several orders of magnitude in residual flit-error rate can be achieved using the proposed method when multiple errors are considered. Compared to an ARQ scheme using CRC codes and an HARQ scheme using extended Hamming codes, the proposed method achieves about 45% and 10% improvement in throughput, respectively, in high noise environments. The high reliability of the proposed method permits a reduction in the link swing voltage and, consequently, a reduction in communication energy. The decreased link energy counterbalances the overhead of codec energy. For a given reliability requirement, the proposed error control scheme can achieve up to 50% reduction in energy consumption compared to other error correction schemes in high noise environments.