
Multivariate polynomial interpolation is a key computation in many areas of science and engineering and, in our case, is crucial for solving the reverse engineering of genetic networks modeled over finite fields. Faster implementations of such algorithms are needed to
cope with the increasing quantity and complexity of genetic data. We present a new algorithm based on Lagrange interpolation for multivariate
polynomials that not only identifies redundant variables in the data and generates polynomials containing only nonredundant variables, but
also computes exclusively on a reduced data set. Implementing this algorithm on an FPGA led us to identify a systolic array-based architecture
useful for performing three interpolation subtasks: Boolean cover, distinctness, and polynomial addition. We present a generalization of these tasks
that simplifies their mapping to the systolic array, as well as control and storage considerations that guarantee correct results for input sequences
longer than the array. The subtasks were modeled and implemented on an FPGA using the proposed architecture, then used as building blocks to
implement the rest of the algorithm. Speedups of up to 172× over a software implementation were obtained.

Recent years have seen a significant increase in methods and tools to collect genetic data from which important information can be extracted using a number of techniques [

Our research group focuses on multivariate finite field gene network (MFFGN) models, in which multiple genes are monitored at each time step and their expression levels are discretized to a predefined set of values

Fast algorithms and implementations are needed to sustain rapidly increasing bioinformatics computational demands [

In this paper, we discuss the interpolation algorithm and our proposed architecture. We emphasize several computational substructures that appear repeatedly throughout the design and which can be implemented effectively using a hardware systolic array-based architecture. These tasks are of the type where we perform a certain reduction or rearrangement of a sequence of elements from multivariate polynomials or Boolean expressions. The systolic array concurrently manages data receipt and parallel processing, making it well suited to streamed data. The simplicity of the array cells, storage, and control unit allows the instantiation of multiple cells while maintaining competitive clock frequencies, thus achieving high performance. Several tasks critical to interpolation were modeled and implemented on an FPGA using the proposed architecture, obtaining speedups of up to 172× when compared to a software implementation, while achieving low resource utilization. These implementations were used as components to develop a complete, high-performance multivariate polynomial interpolation methodology in hardware.

Section

Discrete models, in particular finite field models, have been proposed for regulatory processes such as genetic networks [

Given a sequence of

When

In the context of genetic networks modeled by polynomials over a finite field, the resulting polynomials give a sense to the biologist about the interactions between genes. Nevertheless, results from (

Let us consider the function

Using (

Table for

0 | 1 | 2 | 0
1 | 1 | 2 | 0
1 | 2 | 0 | 1
0 | 2 | 1 | 2
1 | 1 | 0 | 2

Redundant variables are undesirable as they introduce complexity into the polynomials without adding information valuable to the biologist. Furthermore, empirical data suggest that genetic networks are sparsely connected [

For any set

Let

If

Let

For any function

Our goal is to determine interpolating polynomials over finite fields in terms of the variables of any of its bases. To this end, let us first review an algorithm of Sasao [

Let

The set of variables appearing in each disjunct of

Thus, the algorithm consists of two stages. First, determine

The following lemma, whose proof is immediate, is useful in simplifying expressions such as

Let

In what follows we abbreviate

Let

Applying Algorithm

0 | 2 | 1 | 2 | 3 | 1
0 | 2 | 1 | 3 | 1 | 1
2 | 1 | 1 | 0 | 1 | 1
0 | 1 | 1 | 2 | 0 | 2
2 | 1 | 2 | 1 | 0 | 2
2 | 1 | 2 | 0 | 1 | 2
2 | 1 | 2 | 2 | 1 | 2
0 | 1 | 2 | 1 | 1 | 3
2 | 2 | 2 | 2 | 2 | 3

Express

Given any partially defined function

Let

In other words,

Let

Let

For each

Note that for each of the factors

Equation (

Using (

Let us assume that the

Applying (

Thus, a polynomial which depends only on the variables

As can be deduced from the previous section, an implementation of the multivariate polynomial interpolation algorithm must be able to represent and compute on disjunctive normal form (DNF), conjunctive normal form (CNF), and multivariate polynomial expressions. The following subsections establish our method of representation, which ultimately determines the techniques used for implementing the algorithms.
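Purely as an illustration (the encodings and names below are hypothetical, not the paper's exact formats), these three kinds of expressions could be represented in software as follows:

```python
# Hypothetical encodings: a literal is (variable_index, negated);
# a DNF is a list of conjuncts, a CNF a list of clauses, and a
# multivariate polynomial over GF(p) maps exponent tuples to
# nonzero coefficients.

dnf = [frozenset({(0, False), (2, True)}),   # (x0 AND NOT x2)
       frozenset({(1, False)})]              #   OR x1

cnf = [frozenset({(0, False), (1, False)}),  # (x0 OR x1)
       frozenset({(2, True)})]               #   AND (NOT x2)

P = 3                                        # working over GF(3)
poly = {(2, 0, 0): 1,                        # x0^2
        (1, 1, 0): 2,                        # + 2*x0*x1
        (0, 0, 0): 1}                        # + 1

def eval_poly(poly, point, p=P):
    """Evaluate the polynomial at `point`, reducing modulo p."""
    total = 0
    for exps, coeff in poly.items():
        term = coeff
        for x, e in zip(point, exps):
            term = (term * pow(x, e, p)) % p
        total = (total + term) % p
    return total
```

For instance, `eval_poly(poly, (1, 1, 0))` computes 1 + 2 + 1 = 4 ≡ 1 (mod 3).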

A Boolean expression

The DNF expression

A Boolean expression

The CNF expression

A multivariate polynomial

The multivariate polynomial

Our algorithm for determining interpolating polynomials that do not contain any redundant variables consists of two stages. The first stage identifies the dependent variables using Sasao's algorithm, while the second uses this information along with (

To take advantage of the FPGA's fine-grained parallelism, and considering its limitations in I/O bandwidth, these computations were implemented in a pipelined manner. In other words, computational blocks were designed to sustain stream processing as allowed by the FPGA's resources, rather than accumulating all needed data and then performing block processing.
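The contrast between the two styles can be sketched in Python using generators (an illustrative analogy only; the actual design is in hardware):

```python
def stage(xs, f):
    """A pipeline stage: transforms elements as they stream through,
    yielding each result as soon as it is available."""
    for x in xs:
        yield f(x)

# Stream processing: each element flows through all stages without
# buffering the whole input first, mirroring pipelined hardware blocks.
source = iter(range(5))
pipeline = stage(stage(source, lambda x: x + 1), lambda x: x * 2)
result = list(pipeline)   # elements are consumed one at a time
```

A block-processing equivalent would first collect the entire input into memory and only then compute, which is what the pipelined design avoids.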

Figure

Block diagram of the implementation, highlighting the subtasks that are implemented using the systolic array architecture.

The highlighted blocks in Figure

The following sections describe a generalized algorithm and systolic array-based architecture for the type of reduction operations common throughout our interpolation methodology. This is followed by a description of how they are utilized as part of the architectural data path.

The subtasks that deal with reduction in our interpolation algorithm can be generally described as follows. Given the sequence

The tasks of distinctness, polynomial addition, and redundant Boolean term elimination (hereafter referred to as

The task of identifying the distinct elements of a sequence

Assume

We provide a correctness proof for this operation. Proofs of the other reduction operations given in this section are similar.
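The distinctness operation can be modeled functionally in Python (a sketch of the array's end-to-end behavior under the no-overflow assumption relaxed below, not a cycle-accurate model; the function name is ours):

```python
def systolic_distinct(seq, depth):
    """Functional model of a linear systolic array of `depth` cells
    computing the distinct elements of a sequence.  Each cell stores
    the first element it receives; later equal elements are absorbed,
    and unequal elements are forwarded to the next cell.  Elements
    forwarded past the last cell land in an overflow buffer."""
    cells = [None] * depth
    overflow = []
    for x in seq:
        for i in range(depth):
            if cells[i] is None:   # empty cell captures the element
                cells[i] = x
                x = None
                break
            if cells[i] == x:      # duplicate: absorb and stop
                x = None
                break
        if x is not None:          # fell off the end of the array
            overflow.append(x)
    return [c for c in cells if c is not None], overflow
```

With a sufficiently deep array, `systolic_distinct([1, 2, 1, 3, 2, 4], 8)` yields the distinct elements `[1, 2, 3, 4]` with an empty overflow; with `depth=2`, the elements 3 and 4 overflow and would require a further pass.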

For an input sequence

We show by induction that after

After

Hence after

Assume that each of the elements in

Assume

Assume that each of the elements in

Assume

Algorithm

This section explains our proposed hardware structure and outlines the mapping for the reduction tasks. We first discuss the systolic array and cells with the assumption that the array is deep enough to process the input sequence without overflowing. We proceed by discussing the additional components that must be added to guarantee correct processing even if the overflow condition does not hold.
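Functionally, the array plus its overflow handling behaves like the following Python sketch, parameterized by the match/combine pair of binary operations each cell implements (function names are ours, and this models only the input/output behavior, not the systolic timing):

```python
def one_pass(seq, depth, match, combine):
    """One pass of `seq` through an array of `depth` cells.  A cell
    captures an element (if empty), combines it with its stored value
    (if `match` holds), or forwards it; elements forwarded past the
    last cell go to the overflow FIFO."""
    cells = [None] * depth
    overflow = []
    for x in seq:
        for i in range(depth):
            if cells[i] is None:
                cells[i] = x
                break
            if match(cells[i], x):
                cells[i] = combine(cells[i], x)
                break
        else:                      # no cell took the element
            overflow.append(x)
    return [c for c in cells if c is not None], overflow

def reduce_stream(seq, depth, match, combine):
    """Drive repeated passes until the overflow FIFO empties, as the
    control unit must for sequences longer than the array."""
    out, pending = [], list(seq)
    while pending:
        kept, pending = one_pass(pending, depth, match, combine)
        out.extend(kept)
    return out
```

For polynomial addition, for example, `match` would compare exponent tuples and `combine` would add coefficients; for distinctness, `match` is equality and `combine` keeps the stored value.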

The proposed structure is a linear systolic array of

Block diagram of systolic array.

More formally, our linear systolic array computation can be modeled as a simplified version of the generic systolic array presented in [

Figure

Illustration of index generation by Algorithm

Systolic array implementing the distinctness operation for

The functionality of the basic cells mandates their implementation in user logic inside the FPGA (rather than in embedded units, such as block RAMs). Depending on the application, the characteristics of the data sequence, and the FPGA model, the FPGA might not have enough resources to instantiate a systolic array whose depth

Pipeline with added overflow FIFO and control unit to support sequences longer than

The

The

Contents of the systolic array and overflow FIFO after a first pass of the sequences through (a) sorting and (b) polynomial addition.

Contents of systolic array and overflow FIFO for the shown sequence

In the case of a nonreducing operation like sorting (Figure

When adding polynomials (Figure

Figure

Aside from the reduction tasks, the computational blocks in our implementation can be classified as performing one of two functionalities: element pair generation or multiple term multiplication. Element pair generators, such as the

Element pair generator.

Computation of the disjunction of literals

Computation of the terms of a binomial for (

Multiple-term multiplication/conjunction, such as needed in blocks

Illustration of a step in the conversion of CNF

Illustration of a step in the multiplication of binomials

The architecture depicted in Figure

This section presents and discusses results for the individual reduction tasks as well as the complete interpolation algorithm implementation.

The systolic array cells for the subtasks of distinctness, polynomial addition, and Boolean cover were modeled using Verilog HDL, based on their function definitions presented in Section

For each of the three subtasks, several systolic array depths (

Several important observations can be highlighted from the results in Table

Experimental results for subtask blocks.

Task | Depth | Slices | Slice FFs | 4-input LUTs | FIFO16/RAMB16 | Freq (MHz) | Speedup
---|---|---|---|---|---|---|---
Cover | 64 | 4141 (4.65%) | 4348 (2.44%) | 7708 (4.33%) | 9 (2.68%) | 176 | 11.51×
Cover | 128 | 8279 (9.29%) | 8700 (4.88%) | 15054 (8.45%) | 9 (2.68%) | 176 | 21.93×
Cover | 256 | 16554 (18.58%) | 17404 (9.77%) | 30796 (17.28%) | 9 (2.68%) | 176 | 40.33×
P-add | 64 | 2967 (3.33%) | 3516 (1.97%) | 4119 (2.31%) | 7 (2.08%) | 200 | 49.17×
P-add | 128 | 5885 (6.61%) | 7455 (4.18%) | 7156 (4.02%) | 7 (2.08%) | 200 | 93.99×
P-add | 256 | 11741 (13.18%) | 14880 (8.35%) | 14228 (7.99%) | 7 (2.08%) | 200 | 172.28×
Distinct | 64 | 2680 (3.01%) | 3741 (2.10%) | 3173 (1.78%) | 7 (2.08%) | 190 | 46.87×
Distinct | 128 | 5342 (6.00%) | 7465 (4.19%) | 6261 (3.51%) | 7 (2.08%) | 190 | 89.60×
Distinct | 256 | 10659 (11.96%) | 14903 (8.36%) | 12437 (6.98%) | 7 (2.08%) | 190 | 164.25×

Part of our purpose here is to argue that the reduction tasks can be implemented efficiently using the systolic array design. With Table

Experimental results for stage 1.

Vars | Depth | Slices | Slice FFs | 4-input LUTs | FIFO16/RAMB16 | HW time | SW time | Speedup
---|---|---|---|---|---|---|---|---
8 | 32 | 1651 (1.85%) | 1908 (1.07%) | 3047 (1.71%) | 25 (7.44%) | 0.04 | 1.27 | 29.59×
9 | 64 | 3011 (3.38%) | 3596 (2.02%) | 5515 (3.1%) | 27 (8.04%) | 0.19 | 2.64 | 13.87×
10 | 128 | 5860 (6.58%) | 7201 (4.04%) | 11186 (6.28%) | 33 (9.82%) | 0.49 | 7.22 | 14.78×
11 | 128 | 6191 (6.95%) | 7728 (4.34%) | 11755 (6.60%) | 43 (12.80%) | 1.01 | 22.97 | 22.74×
12 | 256 | 11880 (13.34%) | 14907 (8.37%) | 22573 (12.67%) | 55 (16.37%) | 2.54 | 82.28 | 32.46×
13 | 256 | 14101 (15.83%) | 16981 (9.53%) | 25676 (14.41%) | 95 (28.27%) | 8.89 | 596.9 | 67.13×

Experimental results for stage 2.

Basis Vars | Class Reps | | Slices | Slice FFs | 4-input LUTs | FIFO16/RAMB16 | HW time | SW time | Speedup
---|---|---|---|---|---|---|---|---|---
8 | 2 | 23 | 7745 (8.69%) | 8335 (4.68%) | 13846 (7.77%) | 28 (8.33%) | 0.10 | 1.59 | 16.67×
8 | 3 | 63 | 7950 (8.92%) | 8352 (4.69%) | 14205 (7.97%) | 28 (8.33%) | 0.44 | 7.76 | 17.68×
8 | 4 | 90 | 7955 (8.93%) | 8359 (4.69%) | 14213 (7.98%) | 30 (8.93%) | 0.84 | 16.60 | 19.71×
9 | 2 | 25 | 8520 (9.56%) | 9220 (5.17%) | 15205 (8.53%) | 28 (8.33%) | 0.10 | 1.85 | 17.64×
9 | 3 | 77 | 8539 (9.58%) | 9226 (5.18%) | 15239 (8.55%) | 28 (8.33%) | 0.69 | 11.20 | 16.13×
9 | 4 | 94 | 8582 (9.63%) | 9260 (5.2%) | 15340 (8.61%) | 34 (10.12%) | 1.14 | 20.26 | 17.81×
10 | 2 | 24 | 9188 (10.31%) | 10098 (5.67%) | 16710 (9.38%) | 31 (9.23%) | 0.10 | 1.78 | 17.95×
10 | 3 | 65 | 9192 (10.32%) | 10106 (5.67%) | 16714 (9.38%) | 31 (9.23%) | 0.47 | 8.30 | 17.69×
10 | 4 | 92 | 9201 (10.33%) | 10130 (5.69%) | 16742 (9.40%) | 33 (9.82%) | 0.86 | 18.43 | 21.36×
11 | 2 | 25 | 10130 (11.37%) | 10978 (6.16%) | 18024 (10.12%) | 34 (10.12%) | 0.11 | 2.01 | 18.80×
11 | 3 | 75 | 10156 (11.4%) | 11001 (6.17%) | 18067 (10.14%) | 36 (10.71%) | 0.58 | 12.80 | 22.22×
11 | 4 | 92 | 10115 (11.35%) | 11008 (6.18%) | 18025 (10.12%) | 37 (11.01%) | 1.05 | 23.00 | 21.93×
12 | 2 | 25 | 10861 (12.19%) | 11861 (6.66%) | 19077 (10.71%) | 36 (10.71%) | 0.10 | 2.05 | 19.93×
12 | 3 | 70 | 10806 (12.13%) | 11881 (6.67%) | 19012 (10.67%) | 38 (11.31%) | 0.52 | 10.68 | 20.41×
12 | 4 | 93 | 10877 (12.21%) | 11895 (6.68%) | 19121 (10.73%) | 38 (11.31%) | 1.03 | 19.90 | 19.26×
13 | 2 | 25 | 11405 (12.8%) | 12747 (7.15%) | 20240 (11.36%) | 39 (11.61%) | 0.10 | 2.13 | 20.74×
13 | 3 | 71 | 11469 (12.87%) | 12760 (7.16%) | 20358 (11.43%) | 40 (11.90%) | 0.53 | 12.50 | 23.68×
13 | 4 | 91 | 11416 (12.81%) | 12771 (7.17%) | 20274 (11.38%) | 41 (12.20%) | 0.85 | 21.38 | 25.21×

We modeled the proposed interpolation algorithm in Verilog HDL using the reduction tasks and the architectures described in previous sections as building blocks. The behavioral simulation and synthesis tools as well as the software compilation parameters and platform were the same as in the reduction task experiments.

The results are shown separately for each of the two interpolation stages since, once the bases have been computed by the first stage, a scientist's intervention would be necessary to choose the most appropriate before proceeding to the second stage. For both stages, the synthesis tool determined maximum operating frequencies above 150 MHz, thus we chose 150 MHz for the experiments. The Xilinx ISE 11.1 place and route tools (with default effort level) were able to meet the timing requirements reported for Stages 1 and 2. We attribute this to the fact that the great majority of connections in our design are local and grow only linearly with

Table

Although the ultimate goal of reverse engineering might be understanding complete complex biological systems, recently reported models limit their exploration to specific subsystems or particular mechanisms, for example, the cell-cycle regulatory network of fission yeast and apoptosis (programmed cell death) [

Speedup is maintained at the cost of increasing resource utilization. For the considered cases, BRAMs, which are used to implement the various FIFOs and RAMs of Stage 1, are the fastest-growing resource. This could pose a challenge to maintaining performance at even higher

The increase in user logic (Slices, LUTs) in higher

As explained in Section

The timing results shown in Table

Resource utilization in Stage 2 is split roughly evenly between user logic (Slices) and BRAMs. Furthermore, the increment in resource utilization for increasing values of

This paper presents a new methodology based on Lagrange interpolation with two important properties: (1) it identifies redundant variables and generates polynomials containing only nonredundant variables, and (2) it computes exclusively on a reduced data set. The analysis of the methodology for its hardware implementation led us to the identification of several reduction tasks which were generalized to a simple algorithm. The generalized algorithm can be efficiently mapped to a systolic array in which each processing cell implements a pair of binary operations between an incoming and a stored value. The tasks of Boolean cover, distinctness, and multivariate polynomial addition were implemented and served as building blocks to the rest of the application. The FPGA implementation of the reduction operations and the complete application achieved speedups of up to 172× and 67×, respectively, as compared to software implementations run on a contemporary CPU, with moderate resource utilization.

An earlier version of this paper appeared as “A systolic array based architecture for implementing multivariate polynomial interpolation tasks” in the Proceedings of the 2009 International Conference on ReConFigurable Computing and FPGAs (ReConFig’09) [