We present a software toolchain for constructing large-scale

Regular expression matching (REM) has many applications ranging from text processing to packet filtering. In the narrow sense, each regular expression defines a regular language over the alphabet of input characters. A regular language applies three basic operators on the alphabet:

Improving large-scale REM performance has been a research focus in recent years [

Basic architectures of RE-NFA (a) and RE-DFA (b). Our work focuses on regular expression matching using the RE-NFA architecture.

In an RE-NFA approach [

In an RE-DFA approach, several regular expressions are grouped (

Due to the matching power of regular expressions and the complexity of the strings being matched, the REM process can be the slowest bottleneck of a system. To match a regular expression of length

Modern FPGAs offer large amounts of reconfigurable logic (LUTs) and on-chip memory (BRAM). We developed a compact and high-performance RE-NFA architecture for REM that utilizes both on-chip logic and memory resources on modern FPGAs [

Automatic conversion from regular expression parse tree [

Automatic generation of RTL code in VHDL for each RE-NFA. The resulting circuit is spatially stacked a configurable number of times for multicharacter matching.

Allocation of centralized character classification in BRAM for up to 256 REMEs using a simple heuristic.

Automatic construction of up to 16 pipelines in a two-dimensional structure.

A benchmark generator of regular expressions with configurable pattern complexity parameters (

The rest of this paper is organized as follows. The background and prior work of RE-NFA on FPGA are discussed in Section

Hardware implementation of regular expression matching (REM) was first studied by Floyd and Ullman [

Automatic REME construction on FPGAs was first proposed in [

A multi-character decoder was proposed in [

The main purpose of our software toolchain is to automate the construction and optimization of large-scale RE-NFA circuits on FPGA. With a single command, the toolchain generates the whole RTL circuit matching thousands of regular expressions in a matter of seconds. Such a toolchain not only avoids tedious and error-prone manual circuit construction, but also produces a large-scale regular expression matching engine (REME) ready for implementation in a short amount of time.

Figure

Overview of our toolchain for large-scale REME construction.

In practice, the two paths of REME Construction in Figure

In addition to the basic operators of concatenation, union (

REM operators supported by our software.

| Op. | Name | Example | Description |
|---|---|---|---|
| - | Concatenation | | |
| | Union | | Either |
| * | Kleene closure | | |
| + | Repetition | | |
| ? | Optionality | | |
| { | Constrained rep. | | |
| [ | Character class | | Either |
| | Inv. char. class | | Neither |
| | Match beginning | | |
| | Match ending | | |

The REME Construction is performed in three steps: (1) parse the regular expressions into tree structures, (2) use the

PROCEDURE

BEGIN

The first step is to represent each regular expression as a corresponding parse tree using a standard compiler technique. This step is the same as that described in [

Parse-tree representation of “

Graphical representation of the modified McNaughton-Yamada (MMY) construction. Note that unlike the original construction, no

The resulting parse tree always consists of three types of internal nodes,

Unlike previous work in [

A formal definition of the construction mechanism is given in Algorithm

Two special entities are used in Algorithm

The second entity is the pseudostate

The MMY construction algorithm produces an NFA that is highly modular and easy to map to HDL code. For example, using the modified construction algorithm, the regular expression “

A modular NFA for “

To translate the RE-NFA (like Figure

The REM circuit for Figure

REM circuits constructed by mapping Figure

The construction of a 2-character matching circuit.

Our REM architecture in [

Furthermore, if two states (either within the same regular expression or across different regular expressions) match the same character class, then they can share the same BRAM column output. We use a two-phase procedure to aggregate the matching outputs of identical character classes.
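As a rough software model of the column sharing described above, the following sketch builds a 256-row table in which each column holds one unique character class. Reading the row addressed by the input character yields one match bit per class, and any number of states can read the same column. The function names `build_classifier` and `classify`, and the representation of a character class as a set of byte values, are illustrative assumptions and not part of the toolchain.

```python
from typing import FrozenSet, List, Sequence


def build_classifier(classes: Sequence[FrozenSet[int]]) -> List[int]:
    """Return a 256-entry table; bit `col` of table[ch] is 1 iff byte ch is in classes[col]."""
    table = [0] * 256
    for col, cls in enumerate(classes):
        for ch in cls:
            table[ch] |= 1 << col
    return table


def classify(table: List[int], ch: int, col: int) -> bool:
    """Model of one BRAM column output for input byte ch."""
    return bool((table[ch] >> col) & 1)


# Two hypothetical character classes: column 0 = [0-9], column 1 = [a-z].
digits = frozenset(range(ord("0"), ord("9") + 1))
lower = frozenset(range(ord("a"), ord("z") + 1))
table = build_classifier([digits, lower])
```

In this model, every state that matches `[0-9]` would read column 0, regardless of which regular expression the state belongs to.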

In phase 1, the software collects the set of unique character classes from a regular expression. Each unique character class is associated with a floating-point

if the character class appears only once in the regular expression, then the sorting key is its (only) position index within the regular expression;

if the character class appears multiple times in the regular expression, then the sorting key is the average of all its position indexes within the regular expression.

In phase 2, the unique character classes are sorted according to their sorting keys and instantiated as BRAM columns. Each BRAM column is also associated with the identifier of the instantiated character class. The output of each BRAM column is then connected to the character matching inputs with the same identifier.
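The two phases can be sketched in a few lines of Python. This is a minimal illustration, assuming that each state of a regular expression is labeled with the textual form of its character class, that position indexes are zero-based, and that ties between equal sorting keys are broken by first appearance; the names `sorting_keys` and `assign_columns` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def sorting_keys(char_classes: Sequence[str]) -> Dict[str, float]:
    """Phase 1: map each unique character class to the mean of its position indexes."""
    positions = defaultdict(list)
    for index, cls in enumerate(char_classes):
        positions[cls].append(index)
    return {cls: sum(p) / len(p) for cls, p in positions.items()}


def assign_columns(char_classes: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Phase 2: sort unique classes by key and wire each state to its column."""
    keys = sorting_keys(char_classes)
    ordered = sorted(keys, key=lambda cls: keys[cls])  # stable sort keeps first-seen order on ties
    column_of = {cls: col for col, cls in enumerate(ordered)}
    wiring = [column_of[cls] for cls in char_classes]
    return ordered, wiring


# Four states; "[a-z]" appears at positions 0 and 3, so its key is 1.5.
columns, wiring = assign_columns(["[a-z]", "[0-9]", "x", "[a-z]"])
```

Here `columns` lists the instantiated BRAM columns in order, and `wiring[i]` is the column that drives the character matching input of state `i`.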

The two-phase procedure allows our software to use the minimum number of BRAM columns for character class matching. It also minimizes routing distance by exploiting the natural ordering (the sorting keys) of the character classes within the regular expressions. The aggregation of character classes and their distribution to the RE-NFA states take

After constructing REMEs individually for all regular expressions, the software applies two architectural optimizations [

In contrast to the NFA-level

The time complexity to construct an

The spatial stacking approach can generate an MCM REME of any natural number

In practice,

The program code to construct any

In general, to construct an

remove state register

disconnect state output

disconnect state output

the combined circuit receives
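The functional effect of spatial stacking can be modeled in software as applying the single-character transition function several times per clock cycle. The sketch below is only a behavioral model under stated assumptions (an unanchored NFA given as a list of transitions, with hypothetical names `step`, `step_m`, and `run`); it does not reproduce the circuit-level transformation itself.

```python
from typing import FrozenSet, List, Set, Tuple

# A transition is (source_state, character_class, destination_state).
Transition = Tuple[int, FrozenSet[str], int]


def step(active: Set[int], ch: str, transitions: List[Transition], start: int) -> Set[int]:
    """One single-character NFA step; the start state is always active (unanchored match)."""
    sources = active | {start}
    return {dst for src, cls, dst in transitions if src in sources and ch in cls}


def step_m(active, chars, transitions, start, accept):
    """Apply `step` once per character, as stacked copies would within one clock cycle."""
    matched = False
    for ch in chars:
        active = step(active, ch, transitions, start)
        matched = matched or accept in active  # a match may end at any of the positions
    return active, matched


def run(text, m, transitions, start, accept):
    """Feed `text` to the model m characters per cycle; return final states and per-cycle hits."""
    active, hits = set(), []
    for i in range(0, len(text), m):
        active, matched = step_m(active, text[i:i + m], transitions, start, accept)
        hits.append(matched)
    return active, hits


# Hypothetical NFA for "ab+": state 0 is the start, state 2 is accepting.
nfa = [(0, frozenset("a"), 1), (1, frozenset("b"), 2), (2, frozenset("b"), 2)]
```

Running the same input with different values of `m` should leave the model in the same set of active states; only the number of cycles changes.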

With a straightforward implementation, the BRAM-based character classifier (Section

A

Structure of a 2D staged pipeline with total

Marshaling REMEs into this staged pipeline structure, however, is painstaking and error-prone when done manually. This is mainly due to the buffering and distribution of the character matching signals (the thick vertical arrows in Figure

(1) First calculate the average number of states per pipeline,

(2) Add any of the

(3) Add the most compatible REME to the pipeline. Recompute the compatibility of all remaining REMEs.

(4) Repeat step 3 until the total number of states in the pipeline is greater than

(5) Go back to step 2 to work on a new pipeline until all REMEs are exhausted.
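The marshaling steps above can be sketched as a greedy loop. Several details are assumptions made only for illustration: each pipeline is seeded with the first remaining REME, compatibility is taken to be the number of character classes an REME shares with those already collected in the pipeline, and a pipeline is closed once its state count exceeds the per-pipeline average. The type `Reme` and the function `marshal` are hypothetical names.

```python
from typing import FrozenSet, List, NamedTuple


class Reme(NamedTuple):
    name: str
    num_states: int
    classes: FrozenSet[str]  # character classes used by this REME


def marshal(remes: List[Reme], num_pipelines: int) -> List[List[Reme]]:
    """Greedily group REMEs into pipelines of roughly equal total state count."""
    average = sum(r.num_states for r in remes) / num_pipelines
    remaining = list(remes)
    pipelines = []
    while remaining:
        seed = remaining.pop(0)  # assumption: seed with the first remaining REME
        pipeline, states, classes = [seed], seed.num_states, set(seed.classes)
        while remaining and states <= average:
            # assumption: compatibility = number of shared character classes
            best = max(remaining, key=lambda r: len(r.classes & classes))
            remaining.remove(best)
            pipeline.append(best)
            states += best.num_states
            classes |= best.classes
        pipelines.append(pipeline)
    return pipelines


remes = [
    Reme("A", 12, frozenset({"[0-9]", "[a-z]"})),
    Reme("B", 10, frozenset({"x"})),
    Reme("C", 12, frozenset({"[0-9]"})),
    Reme("D", 10, frozenset({"x", "y"})),
]
groups = marshal(remes, num_pipelines=2)
```

In this example, `A` and `C` end up in one pipeline because they share `[0-9]`, while `B` and `D` share `x` and form the other.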

After marshaling the REMEs into different pipelines, the REMEs within each pipeline are marshaled into different stages in a similar manner. When an REME is added to a pipeline, a function is called to compare each character class in the REME to the character classes previously collected in BRAM. If an identical character class is found, then the proper connections are made from the BRAM output to the inputs of the respective states.

The time complexity of this procedure is

Matching outputs from all REMEs are prioritized. Currently, the software assigns higher priority to lower-indexed pipelines and stages, although the priority can be programmed in any other way with little additional complexity.
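A behavioral model of this default priority scheme is shown below. It assumes that the pipeline index takes precedence over the stage index, which the text does not state explicitly; `highest_priority_match` is a hypothetical name.

```python
from typing import List, Optional, Tuple


def highest_priority_match(match_bits: List[List[bool]]) -> Optional[Tuple[int, int]]:
    """Return (pipeline, stage) of the winning match, or None if nothing matched.

    match_bits[p][s] is True when stage s of pipeline p reports a match.
    Lower-indexed pipelines win first, then lower-indexed stages (assumed order).
    """
    for p, stages in enumerate(match_bits):
        for s, hit in enumerate(stages):
            if hit:
                return (p, s)
    return None
```

A different priority policy would only change the order in which the two loops visit the match bits.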

We developed a regular expression

Structure of the regular expressions from the benchmark generator.

A state transition

The time taken to translate a set of parsed regular expressions to VHDL was roughly proportional to the product of the

REME construction time for various numbers of regular expressions and multi-character matching parameters.

These results show that the software proposed in this paper is suitable for large-scale REME construction. Since it takes only a few seconds to translate a thousand regular expressions into structural VHDL, the software can be used to reconstruct a large-scale REME quickly in response to dictionary changes. Because of the large amount of logic resources used, however, the synthesis and place-and-route times are on the order of several tens of minutes.

We first used the benchmark generator described in Section

Figure

Clock frequency and LUT usage of a group of 6 identical synthetic REMEs versus the length of each REME. Solid lines (left scale) are clock frequencies; dashed lines (right scale) are numbers of LUTs.

Series

In Figure

Clock frequency and LUT usage of a group of 64-state synthetic REMEs versus the number of REMEs implemented. Solid lines (left scale) are clock frequencies; dashed lines (right scale) are numbers of LUTs.

As shown in Figure

Above 16 REMEs, however, the staged pipeline came into effect, keeping the clock rates at slightly above 300 MHz. This evidently shows that the staged pipeline proposed in [

As expected, a higher

Figure

Clock frequency versus state fan-in of the synthetic REMEs.

The clock frequency was found to decline sublinearly with respect to the state fan-in, at a rate consistent with the findings in Section

Overall our experiments show that the REME construction algorithms proposed in [

We presented a software toolchain that automates the construction and optimization of regular expression matching engines (REMEs) on FPGA. The software accepts a potentially large number of regular expressions as input and generates RTL code in VHDL as output, which can be accepted directly by FPGA synthesis and implementation tools. The automated REME optimizations include centralized character classification, multi-character matching, and staged pipelining. We also developed a benchmark generator that produces REMEs of configurable pattern complexity to evaluate the performance of the software.

On a 2 GHz Athlon 64 PC, our software generates a compact and high-performance REME circuit matching over a thousand regular expressions in just a few seconds. Extensive studies showed that the two-dimensional staged pipeline effectively localized signal routing and achieved a clock rate over 300 MHz while processing hundreds of REMEs in parallel.

This work was supported by the U.S. National Science Foundation under Grant CCR-0702784.