
We propose an FPGA design for the relevancy computation part of a high-throughput real-time search application. The application matches terms in a stream of documents against a static profile, held in off-chip memory. We present a mathematical analysis of the throughput of the application and apply it to the problem of scaling the Bloom filter used to discard nonmatches.

The focus on real-time search is growing with the increasing adoption and spread of social networking applications. Real-time search is equally important in other areas, such as analysing emails for spam or searching web traffic for particular patterns.

FPGAs have great potential for speeding up many types of applications and algorithms. By performing a task in a fraction of the time of a conventional processor, large energy savings can be achieved. Therefore, there is a growing interest in the use of FPGA platforms for data centres. Because of the dramatic reduction in the required energy per query, data centres with FPGA search solutions could operate at a fraction of the power of current data centres, eliminating the need for cooling infrastructure altogether. As the cost of cooling is actually the dominant cost in today’s data centres [

Real-time search, in information retrieval parlance called “document filtering,” consists of matching a stream of documents against a fixed set of terms, called the “profile.” Typically, the profile is large and must therefore be stored in external memory.

The algorithm implemented on the FPGA can be expressed as follows.

A

The profile

In this work we are concerned with the computation of the document score, which indicates how well a document matches the profile. The document has been converted to the bag-of-words representation in a separate stage. We perform this stage on the host processor using the Open Source information retrieval toolkit Lemur [

Simplifying slightly, to determine if a document matches a given profile, we compute the sum of the products of term frequency and term weight

The weight is typically a high-precision word (64 bits) stored in a lookup table in the external memory. If the score is above a given threshold, we return the document identifier and the score by writing it into the external memory.
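The scoring step just described can be sketched in Python (an illustrative model only; the function and parameter names are ours, and a dictionary stands in for the weight lookup table in external memory):

```python
# Minimal sketch of document scoring: the sum of term frequency x term
# weight over the bag of words, thresholded. Terms absent from the
# profile contribute a zero weight.
def score_document(bag_of_words, weights, threshold):
    """bag_of_words: term id -> term frequency; weights: term id -> weight."""
    score = sum(tf * weights.get(term, 0.0) for term, tf in bag_of_words.items())
    return score if score >= threshold else None
```

On the FPGA, the threshold test selects which (document identifier, score) pairs are written back to external memory.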

The target platform for this work is the Novo-G FPGA supercomputer [

To simplify the discussion, we first consider the case where terms are scored sequentially and where, as in our original work, a Bloom filter is used to limit the number of external memory accesses.

For every term in the document, the application needs to look up the corresponding profile term to obtain the term weight. As the profile is stored in the external SDRAM, this is an expensive operation (typically 20 cycles per access). The purpose of document filtering is to identify a small number of relevant documents in a very large document set. As most documents are not relevant, most of the lookups will fail (i.e., most terms in most documents will not occur in the profile). Therefore, it is important to discard the negatives first. For that purpose we use a “trivial” Bloom filter implemented using the FPGA’s on-chip memory.

A Bloom filter [

Our Bloom filter is a “trivial” edge case of this more general implementation; our hashing function is the identity function

The internal block RAMs of the Altera Stratix-III FPGA that support efficient single-bit access are limited to 4 Mb; on a Stratix-III SE260, there are 864 M9K blocks that can be configured as 8 K
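The behaviour of such a trivial Bloom filter can be sketched as follows. This is an illustrative software model, not the hardware design; in particular, how term identifiers wider than the bit array are folded onto it is our assumption, shown here as a modulo:

```python
# Sketch of a "trivial" Bloom filter: a single bit array whose only
# hash is the identity function. Folding wide term ids with a modulo
# is an assumption for illustration; n_bits must be a multiple of 8.
class TrivialBloomFilter:
    def __init__(self, n_bits):
        self.n_bits = n_bits
        self.bits = bytearray(n_bits // 8)

    def _index(self, term_id):
        return term_id % self.n_bits  # identity hash, folded to array size

    def add(self, term_id):
        i = self._index(term_id)
        self.bits[i // 8] |= 1 << (i % 8)

    def contains(self, term_id):
        i = self._index(term_id)
        return bool(self.bits[i // 8] & (1 << (i % 8)))
```

Note that the folding introduces false positives (two term ids that differ by a multiple of the array size map to the same bit), which is acceptable: a false positive merely costs one redundant external lookup.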

The document stream is a list of

To mark the start of a document we insert a header word (identified by

In the current implementation, the lookup table that stores the profile is implemented in the most straightforward way: as the vocabulary size is 2^{24} and the weight for each term in the profile can be stored in 64 bits, a profile consisting of the entire vocabulary requires 128 MB, which is less than the 256 MB of fixed SDRAM on the PROCStar-III board. Consequently, there is no need for hashing: the memory simply contains zero weights for all terms not present in the profile.

The diagram for the sequential implementation of the design is shown in Figure

Sequential document term scoring.

Using the lookup table architecture and document stream format described above, the actual lookup and scoring system is quite straightforward: the input stream is scanned for header and footer words. The header word action is to set the document score to 0; the footer word action is to collect and output the document score. For every term in the document, the Bloom filter is first used to discard negatives, and only then is the profile term weight read from the SDRAM. The score is computed and accumulated for all terms in the document, and finally the score stream is filtered against a threshold before being output to the host memory. The threshold is chosen so that only a few tens or hundreds of documents in a million are returned.
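The sequential scanning and scoring process can be sketched as follows. The marker values, the placement of the document identifier after the header, and the bit layout of a term word are assumptions for illustration; a Python set stands in for the on-chip Bloom filter:

```python
# Sketch of the sequential scoring loop. HEADER/FOOTER values and the
# 24-bit-id/8-bit-frequency packing are illustrative assumptions.
HEADER = 0xFFFFFF00  # hypothetical header marker
FOOTER = 0xFFFFFF01  # hypothetical footer marker

def score_stream(stream, bloom, weights, threshold):
    results = []
    score = 0.0
    doc_id = None
    it = iter(stream)
    for word in it:
        if word == HEADER:
            doc_id = next(it)   # assume the document id follows the header
            score = 0.0
        elif word == FOOTER:
            if score >= threshold:
                results.append((doc_id, score))
        else:
            term_id, tf = word >> 8, word & 0xFF
            if term_id in bloom:            # discard negatives first
                score += tf * weights.get(term_id, 0.0)
    return results
```

Only terms that pass the Bloom filter incur the (expensive) weight lookup, mirroring the hardware data path.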

If we simply looked up every term in the external memory, the maximum achievable throughput would be

The scoring process as described above is sequential. However, as in the bag-of-words representation all terms are independent, there is scope for parallelisation. In principle, all terms of a document could be scored in parallel, as they are independent and ordering is of no importance.

In practice, even without the bottleneck of the external memory access, the amount of parallelism is limited by the I/O width of the FPGA, in our case 64 bits per memory bank. A document term can be encoded in 32 bits (a 24-bit term identifier and an 8-bit term frequency). As it takes at least one clock cycle of the FPGA clock to read in two new 64-bit words (one per bank), the best case for throughput would be achieved if 4 terms per document were scored in parallel in a single cycle. However, in practice scoring requires more than one cycle; to account for this, the process can be further parallelised by demultiplexing the document stream into a number of parallel streams. If, for example, scoring took 4 cycles, then by scoring 4 parallel document streams the application could reach the maximum throughput.

Obviously, the above solution would be of no use if there were only a single, single-access Bloom filter. The key to parallelising the lookup is that, because the Bloom filter is stored in on-chip memory, accesses to it can be parallelised by partitioning the Bloom filter into a large number of small banks. The combined concepts of using parallel streams and a partitioned Bloom filter are illustrated in Figure

Parallelizing lookups using parallel streams and a multibank Bloom filter.

Every stream is multiplexed to all
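The bank-partitioning idea can be sketched as follows. Selecting the bank by the low-order bits of the term identifier, and the simple serialisation model (concurrent accesses to the same bank cost one cycle each), are our assumptions for illustration:

```python
# Sketch of a partitioned Bloom filter address split: the low bits of
# the term id pick the bank, the remaining bits the offset within it.
# Two concurrent lookups that select different banks proceed in
# parallel; lookups to the same bank are serialised.
from collections import Counter

def bank_and_offset(term_id, n_banks, bits_per_bank):
    bank = term_id % n_banks                   # low-order bits pick the bank
    offset = (term_id // n_banks) % bits_per_bank
    return bank, offset

def contention_cycles(term_ids, n_banks, bits_per_bank):
    """Cycles for one group of concurrent lookups under the simple
    serialisation model: the size of the largest same-bank group."""
    banks = Counter(bank_and_offset(t, n_banks, bits_per_bank)[0] for t in term_ids)
    return max(banks.values())
```

For example, with 16 banks, term ids 0 and 16 collide on bank 0 and are serialised, while ids 0, 1, 2, 3 all proceed in a single cycle.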

In this section, we present the mathematical throughput analysis of the Bloom filter-based document scoring system. The analysis consists of four parts.

In Section

In Section

In Section

In Section

We need to calculate the probability of contention between

To do so, we need first to compute the

A

For each partition, we can compute the probability of it occurring as follows: if there are

In our case, each event has the same probability

This gives the probability for a sequence of

The actual sequence will consist of numbers

Finally, we must consider the permutations as well; for example, for

First we create an ordered set

That is,

Thus the final probability for each partition of

We observe that
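The partition probabilities derived above can be cross-checked with a small Monte Carlo simulation: throw n concurrent accesses uniformly into B banks and tally the resulting integer partitions (the sorted bank-occupancy counts). A sketch, with illustrative parameter names:

```python
# Monte Carlo sanity check of the partition probabilities: n uniform
# accesses into B banks; the observed partition is the multiset of
# bank-occupancy counts, sorted in decreasing order.
import random
from collections import Counter

def sample_partition(n_accesses, n_banks, rng):
    counts = Counter(rng.randrange(n_banks) for _ in range(n_accesses))
    return tuple(sorted(counts.values(), reverse=True))

def estimate_partition_probs(n_accesses, n_banks, trials=100_000, seed=0):
    rng = random.Random(seed)
    tally = Counter(sample_partition(n_accesses, n_banks, rng)
                    for _ in range(trials))
    return {part: c / trials for part, c in tally.items()}
```

For two concurrent accesses and 16 banks, the simulation reproduces P(1,1) = 15/16 and P(2) = 1/16, the no-contention and contention probabilities used later in the evaluation.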

In the next section we derive an expression for the access time for a given partition, depending on the number of accesses that will result in an external memory lookup.

The time to perform

First, we will consider the case of 0 hits, that is, the most common case. In this case, the average access time for a given partition

In practice, a small number of Bloom filter lookups will result in a hit, and consequently there is a chance of having one or more hits for concurrent accesses.

Consider the case of a single hit (out of

Ferrers diagram for the partition

From this graph it is clear that the probability of finding the hit on the first cycle is 6/16; on the second to fourth cycles, 2/16; and on the fifth to eighth cycles, 1/16. Consequently, the average time to encounter a hit will in this case be

To generalise this derivation, we observe first that the transposition of the Ferrers diagram of an integer partition

Ferrers diagram for the conjugate partition

We observe that the time it takes to reach a hit in part

The term in

And of course, as the hit results in an external access, the average access time is

For the case of

If there are two or more hits, the exact derivation would require enumerating all possible ways of distributing

If that is the case, a good approximation for the total elapsed time is the time until the

Conversely, we could consider the time until the

Therefore, we are only interested in these two cases, that is, the lowest, respectively, highest part of the partition with at least one hit. We need to compute the probability that the lowest (resp. highest) part will contain a hit, and the next but lowest (resp. highest) one, and so forth. For simplicity, we leave off

The number of all possible cases is

These are all the possible cases for not having a hit in

We now do the same for

Obviously, there must be enough space in the remaining parts to accommodate

To obtain the weight of a hit in

Finally, the average time it takes to reach a part in a given

With the above assumption, the average access time for

We observe that for

The upper bound is given by the probability that the highest part is occupied, and so forth, so the formula is the same as (

As we will see in Section

The chance that a term will occur in the profile depends on the size of the profile

This is actually a simplified view; it assumes that the terms occurring in the profile and the documents are drawn from the vocabulary in a uniform random way. In reality, the probability depends on how discriminating the profile is. As the aim of a search is of course to retrieve only the relevant documents, we can assume that actual profiles will be more discriminating than the random case. In that case (

The probability of

That is, there are

We can now compute the average access time over all

Finally, using (

In this section the expression obtained in Section

To evaluate the accuracy of the approximations introduced in Section

Accuracy of the approximation for

Next, we consider a more radical approximation; we assume that, for

From Figure

Accuracy of the single-hit approximation for

The throughput depends on the number of hits in the Bloom filter. Let us consider the case where the Bloom filter contains no hits at all. This is the maximum throughput the system could achieve, and it corresponds to a profile for which no document in the stream has any matches. We can use (

Note that for

The results are shown in Figure

Best case (0 hits) average access time for a Bloom filter with

Figure

Average access time for a Bloom filter with

A further illustration of the impact of

Impact of Bloom filter access time on throughput.

The final figure (Figure

Impact on throughput of hit probability and external memory access time.

We implemented our design on the GiDEL PROCStar-III development board (Figure

Block diagram of FPGA platform and photograph of experimental hardware.

Each board contains four Altera Stratix-III 260 E FPGAs running at 125 MHz. Each FPGA supports a five-level memory structure, with three kinds of memory blocks embedded in the FPGA:

5,100 MLAB RAM blocks (320 bit),

864 M9K RAM blocks (9 Kbit), and

48 M144K blocks (144 Kbit)

and two kinds of external DRAM memory:

256 MB DDR2 SDRAM onboard memory (Bank A) and

two 2 GB SODIMM DDR2 DRAM memories (Bank B and Bank C).

The embedded FPGA memories run at a maximum frequency of 300 MHz, Bank A and Bank B at 667 MHz, and Bank C at 360 MHz. The FPGA board is connected to the host platform via an 8-lane PCI Express interface. The host system consists of a quad-core 64-bit Intel Xeon X5570 CPU with a clock frequency of 2.93 GHz and 3.5 GB of DDR2 DRAM memory; the operating system is 32-bit Windows XP. The host computer transfers data to the FPGA using 32-bit DMA channels.

FPGA-accelerated applications for the PROCStar board are implemented in C++ using the GiDEL PROC-API libraries for interacting with the FPGA. This API defines a hardware abstraction layer that provides control over each hardware element in the system; for example, Memory I/O is implemented using the GiDEL MultiFIFO and MultiPort IPs. To achieve optimal performance, we implemented the FPGA algorithm in VHDL (as opposed to Mitrion-C as used in our previous work). We used the Altera Quartus toolchain to create the bitstream for the Stratix-III.

Figure

Overall block diagram of FPGA implementation.

Using a bag-of-words representation (see Section

As described in Section

As explained in Section

Implementing profile lookup and scoring.

The implementation above leverages the advantages of an FPGA-based design, in particular the memory architecture of the FPGA; on a general-purpose CPU-based system, it is not possible to create a very fast, very low-contention Bloom filter to discard negatives. Also, a general-purpose CPU-based system only has a single, shared memory. Consequently, reading the document stream will contend for memory access with reading the profile terms, and as there is no Bloom filter, we have to look up each profile term. We could of course implement a Bloom filter, but, as it will be stored in main memory as well, there is no benefit; looking up a bit in the Bloom filter is as costly as looking up the term directly. Furthermore, the FPGA design allows for lookup and scoring of several terms in parallel.

Our implementation used only 11,033 of the 203,520 logic elements (LEs), a 5% utilisation of the logic in the FPGA, and 4,579,824 of the 15,040,512 RAM bits, a 30% utilisation of the RAM. Of the 11,033 LEs utilised by the whole design on the FPGA, the actual document filtering algorithm occupied only 1,655 LEs, less than 1% utilisation; the rest was used by the GiDEL Memory IPs. The memory utilised by the whole design (4,579,824 bits) was mainly for the Bloom filter, which is mapped onto embedded memory blocks. The Quartus PowerPlay Analyzer tool estimates the power consumption of the design to be 6 W. The largest contribution to the power consumption is from the memory I/O.

In this section we discuss our evaluation results. We present our experimental methodology and the data summarising the performance of our FPGA implementation, compare it with non-FPGA-accelerated baselines, and conclude with the lessons learned from our experiments.

To accurately assess the performance of our FPGA implementation, we need to exercise the system on real-world input data; however, it is hard to get access to such data: large collections such as patents are not freely available and are governed by licenses that restrict their use. For example, although the researchers at Glasgow University have access to the TREC Aquaint collection and a large patent corpus, they are not allowed to share these with a third party. In this paper, therefore, we use synthetic document collections statistically matched to real-world collections. Our approach is to leverage summary information about representative datasets to create corresponding language models for the distribution of terms and the lengths of documents; we then use these language models to create synthetic datasets that are statistically identical to the original datasets. In addition to addressing IP issues, synthetic document collections have the advantages of being fast to generate, easy to experiment with, and compact on disk.

We analysed the characteristics of several document collections—a newspaper collection (TREC Aquaint) and two collections of patents from the US Patent Office (USPTO) and the European Patent Office (EPO). These collections provide good coverage on the impact of different document lengths and sizes of documents on filtering time. We used the Lemur (

Summary statistics from representative real-world collections that we used as templates for our synthetic data sets.

Collection | No. docs. | Avg. Doc. Len. | Avg. Uniq. Terms |
---|---|---|---|

Aquaint | 1,033,461 | 437 | 169 |

USPTO | 1,406,200 | 1718 | 353 |

EPO | 989,507 | 3863 | 705 |

It is well known (see, e.g., [

Special purpose texts (scientific articles, technical instructions, etc.) follow variants of this distribution. Montemurro [

We determine the coefficients

Document lengths are sampled from a truncated Gaussian. The hypothesis that the document lengths in our template collections have a normal distribution was verified using a

Once the models for the distribution of terms and document lengths are determined, we use these models to create synthetic documents of varying lengths. Within each document, we create terms that follow the fitted rank-frequency distribution. Finally, we convert the documents into the standard bag-of-words representation, that is, a set of unordered
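The generation process can be sketched as follows. This is a minimal illustration; the Zipf exponent value and the resampling-based truncation of the Gaussian are our assumptions, and the fitted coefficients of a real template collection would replace them:

```python
# Sketch of the synthetic-collection generator: document lengths from
# a truncated Gaussian, terms from a Zipf-like rank-frequency
# distribution, output as a bag of words (term id -> frequency).
import random
from collections import Counter

def zipf_weights(vocab_size, s=1.1):
    # rank-frequency weights ~ 1 / rank^s (s would be fitted in practice)
    return [1.0 / (rank ** s) for rank in range(1, vocab_size + 1)]

def synthetic_document(rng, weights, mean_len, sd_len):
    length = 0
    while length < 1:                 # truncate: resample nonpositive lengths
        length = int(rng.gauss(mean_len, sd_len))
    terms = rng.choices(range(len(weights)), weights=weights, k=length)
    return Counter(terms)             # bag-of-words representation
```

Generating a collection is then a loop over `synthetic_document`, with `mean_len` and `sd_len` taken from the template collection's summary statistics (e.g., an average length of 437 for Aquaint).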

Statistically, the synthetic collection will have the same rank-frequency distribution for the terms as the original data sets. Consequently, the probability that a term in the collection matches a term in the profile will be the same in the synthetic collection and the original collection. The performance of the algorithm on the system now depends on

the size of the collection,

the size of the profile,

the “hit probability,” that is, the probability that the profile corresponding to a term has a nonzero weight.

To evaluate these effects, we studied a number of different configurations, with different document sizes, profile lengths, and profile constructions. Specifically, we studied profile sizes of 4 K, 16 K, and 64 K terms; the first two are of the same order of magnitude as the profile sizes for TREC Aquaint and EPO as used in our previous work [

We evaluated four ways of creating profiles. The first way (“Random”) is to select random documents from the collection until the desired profile size is reached. These documents were then used to construct a relevance model. The relevance model defined the profiles against which each document in the collection was matched (as if it were being streamed from the network). The second type of profile (“Selected”) was obtained by selecting terms that occur in very few documents (fewer than ten in a million). For our performance evaluation purposes, the main difference between these profiles is the hit probability, which was

The performance of the FPGA was measured using a cycle counter. The latency between starting the FPGA and the first term score is 22 cycles. For the subsequent terms, the delay depends on a number of factors. We considered three different cases:

“Best Case”: no contention on the Bloom filter access and no external memory access

“Bloom Filter Contention”: contention on the Bloom filter access for every term but no external memory access

“External Access”: no contention on the Bloom filter access, external memory access for every term

These cases were obtained by creating documents with contending/noncontending term pairs and by setting all Bloom filter bits to 0 (no external access, which corresponds to an empty profile) or 1 (which corresponds to a profile containing all terms in the vocabulary).

The results are shown in Table

FPGA Cycle counts for different cases.

Case | No. cycles/2 terms | Probability |
---|---|---|

Best case | 1 | .9375 |

Bloom filter contention | 5 | .0625 |

External access | 37 | <0.00001 |
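As a quick cross-check, weighting the cycle counts in the table by their probabilities gives the expected cost per pair of terms, and hence the per-FPGA throughput at the 125 MHz clock:

```python
# Expected cycles per pair of terms from the measured cases; the
# external-access probability is taken at its table upper bound.
CASES = [   # (cycles per 2 terms, probability)
    (1, 0.9375),     # best case
    (5, 0.0625),     # Bloom filter contention
    (37, 0.00001),   # external access
]

expected_cycles = sum(c * p for c, p in CASES)   # about 1.25 cycles
throughput = 2 / expected_cycles * 125e6         # terms per second
```

This works out to roughly 200 M terms/s per FPGA, consistent with the measured performance reported below.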

The most interesting result in Table

As explained in Section

Table

Throughput of document filtering application (M terms/s) for (a) 128 K documents of 2048 terms and (b) 512 K documents of 512 terms.

Profile | System1 | System2 | FPGA board |
---|---|---|---|

Empty, 4 K | 31 | 48 | 800 |

Empty, 16 K | 31 | 48 | 800 |

Empty, 64 K | 31 | 48 | 800 |

Random, 4 K | 25 | 42 | 800 |

Random, 16 K | 24 | 41 | 800 |

Random, 64 K | 24 | 41 | 800 |

Selected, 4 K | 21 | 37 | 792 |

Selected, 16 K | 18 | 35 | 792 |

Selected, 64 K | 18 | 25 | 792 |

Profile | System1 | System2 | FPGA board |
---|---|---|---|

Empty, 4 K | 30 | 53 | 800 |

Empty, 16 K | 32 | 53 | 800 |

Empty, 64 K | 32 | 53 | 800 |

Random, 4 K | 26 | 47 | 800 |

Random, 16 K | 26 | 46 | 800 |

Random, 64 K | 25 | 46 | 800 |

Selected, 4 K | 20 | 40 | 796 |

Selected, 16 K | 19 | 38 | 792 |

Selected, 64 K | 17 | 27 | 796 |

To compare the FPGA performance against a conventional CPU, we ran the experiments discussed in Section

The results are summarised in Table

In the above sections we have used a preliminary implementation of our proposed design to validate the analytical model. The design does indeed behave in line with the model, for the case of two parallel terms and a 16-bank Bloom filter. The performance is 200 M terms/s. This design is not optimal for several reasons. On the one hand, the original aim was to support four parallel terms, but an issue with the access to one of the memories prevented this. On the other hand, as is clear from the model, a 16-bank implementation does not result in operation close to I/O rates. For four parallel terms, this would require 64 banks; even for two parallel terms, the performance is 80% of the I/O rate. Our aim was not so much to achieve optimal performance as to implement and evaluate our novel design and compare it to the analytical model. We therefore decided to limit the number of banks to 16 to reduce the complexity of the design, as the implementation was undertaken as a summer project.

This means that there is a lot of scope for improving the current implementation.

We will deploy our design on a PROCStar-IV board which does not have this issue, and thus we will be able to score 4 terms in parallel rather than 2 terms.

Even with a single SDRAM, we can be more efficient; the SDRAM I/O rate is 4 GB/s (according to the PROCStar-III databook); our current rate is only 1 GB/s. By demultiplexing the scoring circuit, it should be possible to increase this rate to 4 GB/s.
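The arithmetic behind this claim, as a quick check: one 64-bit word per cycle at the 125 MHz FPGA clock is 1 GB/s, a quarter of the quoted SDRAM rate:

```python
# Bandwidth headroom of the current design versus the SDRAM interface.
word_bytes = 8                            # one 64-bit memory word per cycle
clock_hz = 125e6                          # FPGA clock
current_rate = word_bytes * clock_hz      # bytes/s achieved now
sdram_rate = 4e9                          # databook I/O rate, bytes/s
demux_factor = sdram_rate / current_rate  # demultiplexing needed to saturate
```

A fourfold demultiplexing of the scoring circuit would thus saturate the memory interface.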

Combining both improvements, an improved design could score 16 terms in parallel. This will of course require a Bloom filter with more banks to reduce contention, but considering the current resource utilisation that is not a limitation. Consequently, the improved design should be able to operate up to 8× faster than the current design.

In terms of the analytical model itself, there is some scope for further refinement, in particular for the external access; we currently use a single access time for one or more hits. Just as for the Bloom filter, we can include a fixed cost for concurrent accesses to the external memory. We also want to refine the model to include the effect of grouping terms: that is, the

In this paper we have presented a novel design for a high-performance real-time information filtering application using a low-latency “trivial” Bloom filter. The main contribution of the paper is the derivation of an analytical model for the throughput of the application. This combinatorial model takes into account the access times to the Bloom filter and the external memory, the access probability, and the probability and cost of contention on the Bloom filter. The approach followed and the intermediate expressions are applicable to a large class of resource-sharing problems.

We have implemented our design on the GiDEL PROCStar-III board. The analysis of the system performance clearly demonstrates the potential of the design for delivering high-performance real-time search; we have shown that the system can in principle achieve the I/O-limited throughput of the design. Our current, suboptimal implementation works at 80% of its I/O rate, and this already results in speedups of up to a factor of 20 at 125 MHz compared to a CPU reference implementation on a 3.4 GHz Intel Core i7 processor. Our analysis indicates how the system should be dimensioned to achieve I/O-limited operation for different I/O widths and memory access times.

Our future work will focus on achieving higher I/O bandwidth by using both memory banks on the board and time-multiplexing the memory access. Our aim is to achieve an additional 8× speedup.

The authors acknowledge the support from HP, who hosted the FPGA board and provided funding for a summer internship. In particular, we’d like to thank Mitch Wright for technical support and Partha Ranganathan for managing the project.

We’d like to acknowledge Anton Frolov who implemented the synthetic document model.

Wim Vanderbauwhede wants to thank Dr. Catherine Brys for fruitful discussions on probability theory and counting problems.