A Novel Accuracy and Similarity Search Structure Based on Parallel Bloom Filters

In high-dimensional spaces, exact and similarity search at low computing and storage cost remains a difficult research topic, and efficiency must be balanced against accuracy. In this paper, we propose a new structure, Similar-PBF-PHT, to represent the items of a high-dimensional set and to retrieve exact and similar items. The Similar-PBF-PHT contains three parts: parallel bloom filters (PBFs), parallel hash tables (PHTs), and a bitmatrix. Experiments show that the Similar-PBF-PHT is effective in membership query and K-nearest neighbors (K-NN) search. For exact querying, the Similar-PBF-PHT has a low hit false positive probability (FPP) and acceptable memory costs. For K-NN querying, the average overall ratio and rank-i ratio of the Hamming distance are accurate, and the ratios of the Euclidean distance are acceptable. Retrieving exact and similar items costs CPU time rather than I/O operations, and the structure can handle different data formats, not only numerical values.


Introduction
In high-dimensional spaces, exact search methods, such as kd-tree approaches and Q-gram, are only suitable for small vectors because they demand huge computation resources. Similarity search algorithms, in contrast, can drastically improve the search speed while maintaining good precision [1]; they include VA-files, best-bin-first, space filling curves, K-means (see [2] and references therein), NV tree [3], K-nearest neighbors (K-NN), and locality-sensitive hashing (LSH) [4]. Most K-NN methods adopt the Euclidean distance; they assume that all coordinates are numerical and share the same units and semantics. In some applications, however, a dimension may be a string or a category, which makes the Euclidean distance questionable and artificial.
Among query tools, the bloom filter (BF) [5], a space-efficient random data structure with constant query delay, has been widely applied to represent big sets and answer membership queries [6]. The BF, however, can only represent 1-dimensional elements of a set; references [7][8][9] extended it to represent high-dimensional sets and dynamic sets. These methods can only answer the membership query, not the similarity query. In [10,11], LSH functions replace the random hash functions of the BF to implement the similarity search, but [10,11] can only deal with numerical coordinates and return the elements whose distance from the query is at most CR in Euclidean spaces, which leads to a false negative probability (FNP).
Here, by computing the Hamming distance, we propose a new structure, called Similar-PBF-PHT, based on BFs and hash tables (HTs), to search for membership as well as the K-NN regardless of the radius CR. The Similar-PBF-PHT includes PBFs, PHTs, and a bitmatrix. The PBFs and PHTs apply BFs and HTs to store the dimensions, and the bitmatrix stores the dependences among the dimensions. The experiments show that the Similar-PBF-PHT performs better in Hamming spaces than other methods. Meanwhile, for K-NN searching, it achieves a balanced performance and can process different data formats, while other LSH-based methods can only deal with numerical values.

Related Work
There are different kinds of approximate search algorithms, and we divide them into three categories to discuss.

Computational Intelligence and Neuroscience
The best known is the space partition method, including IDistance [12] and MedRank [13]. The IDistance [12] clusters all high-dimensional elements into multiple spaces and converts them into a 1-dimensional space. It costs linear space and supports data insertion and deletion; however, if the data are distributed uniformly or the dimensions are anisotropic, space partition and center selection become difficult. The MedRank [13] is a rank aggregation and instance-optimal algorithm, which aggregates the given dataset into sorted lists, where every element has an entry of the form (Id, key). The number of lists equals $\log n$, where $n$ is the number of elements, and by probing the lists, the MedRank finds approximate NN items. The MedRank needs only linear space, but an element insertion or deletion requires updating the lists, and every list must be sorted again.
The LSH and its variants are other well-known K-NN search algorithms [14], such as Rigorous-LSH [15], E2LSH [4,16], Adhoc-LSH [17], LSB-tree [18], LSB-forest [18], BLSH [11], and others [19][20][21]. Let $D$ be a set of points in $d$-dimensional space. The Rigorous-LSH [15] applies a C-approximate ball cover: a ball with radius $R$ centered at the query point $q$ is denoted as $B(q, R)$. If $B(q, CR)$ contains at least one point in $D$ ($C \ge 1$ is a constant), it returns a point that is at most CR distance from $q$; otherwise it returns nothing. The Rigorous-LSH is theoretically sound, but its query and space costs are expensive. In Euclidean spaces, E2LSH [4,16] implements the Rigorous-LSH through the p-stable distribution [22], which greatly reduces the CPU and memory costs of the Rigorous-LSH. The Adhoc-LSH [17] addresses the drawbacks of the Rigorous-LSH with a heuristic approach. Given a query $q$ and a magic radius RM, the Adhoc-LSH returns the points within the radius RM. If RM equals the distance between $q$ and the exact NN, the Adhoc-LSH works well. If not, an improper RM may lead to FNP. The beyond locality sensitive hashing (BLSH) [11] scheme uses a two-level hashing algorithm to overcome the lower bound of the FNP [19] and finds the CR-NN in Euclidean spaces. Different from other LSH methods, the second-level BLSH, parameterized by different center points, is a data-aware scheme. The outer hash table partitions the data set into buckets of bounded diameter. For each bucket, the BLSH constructs an inner hash table, which uses the center of the minimum enclosing ball of the points in the bucket as a center point. However, the BLSH still has high memory costs and introduces FNP. The LSB-tree and LSB-forest [18] implement the K-NN search by space mapping and Z-order coding. Using the LSH functions, the LSB-tree [18] first maps each $d$-dimensional point to a point in a lower-dimensional space. The LSB-tree then computes the Z-order [23] value of the mapped point, which is indexed by a conventional B-tree. Multiple LSB-trees form an LSB-forest, which can be updated efficiently and satisfies query accuracy, but its space costs are expensive.
BFs have also been introduced into high-dimensional search, for example, multidimensional dynamic BFs (MDDBFs) [7], PBF-BF [8], PBF-HT [8], similarity sets [24], distance-sensitive BFs (DSBF) [10], and others [25]. The MDDBFs [7] apply parallel standard BFs (PBFs) to represent a d-dimensional dynamic dataset. By searching the PBFs, the MDDBFs answer the membership query, but they lack a way to verify the dependency among the multiple dimensions of an item, which causes high FPPs in membership retrieval. To reduce the FPP, PBF-BF and PBF-HT [8] add another BF or a hash table (HT), respectively, to the PBFs to store the verification values of the different dimensions. However, the above BF-based methods can only answer the membership query, not the similarity query.
Distance-sensitive BFs (DSBF) [10] replace the uniform hash functions in the BF with LSH functions to find similar strings. But the DSBF can only differentiate a query string that differs from all strings in the dataset on a constant fraction of bits. The locality-sensitive bloom filter (LSBF) [26] uses two-level BFs to implement the approximate item query. The first-level BF replaces the random hash functions with locality-sensitive hash (LSH) functions, which are based on the p-stable distribution [22], and maps all items to bit arrays. To keep the integrity and reduce the FPP, the second-level BF stores the hash verification signature formed by the LSH functions in the first-level BF. To reduce the FNP, the LSBF needs to probe the neighboring bits in the first-level BF, which costs more query time. Meanwhile, since the LSH function concentrates most points around the mean and maps some neighboring points to remote bits, it brings bigger FPP and FNP.

Structures
A standard BF [5] applies an array of $m$ bits (initially all set to 0) and $k$ independent hash functions $h_i$ to represent a set $S = \{x_1, x_2, \ldots, x_n\}$ of $n$ elements, as shown in Figure 1(a). If an element $x$ is mapped into the BF by $h_i$, the corresponding bit $h_i(x) \bmod m$ is set to 1. Given a query $q$, by mapping it with the hash functions $h_i(q) \bmod m$, the BF answers whether $q$ is a member of $S$, with a FPP. To support element deletion, the counting bloom filter (CBF) [27,28] replaces the array of bits with counters.
In this paper, parallel BFs (PBFs) and parallel hash tables (PHTs) are proposed to represent high-dimensional elements with $d$ dimensions. At the same time, a bitmatrix is introduced to keep the inherent dependency among the dimensions and reduce the FPP, as shown in Figure 1(b).
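The standard BF described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the class name `BloomFilter` and the use of salted SHA-256 digests to simulate the independent hash functions are assumptions made only for this example.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter: an array of m bits and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # all bits are initially 0

    def _positions(self, item: str) -> list[int]:
        # Assumption: the k hash functions are simulated by salting SHA-256
        # with the function index i; each value is reduced modulo m.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.m)
        return positions

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # True means "possibly a member" (false positives can occur);
        # False means "definitely not a member".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
for word in ("alpha", "beta", "gamma"):
    bf.add(word)
print("alpha" in bf)  # an inserted item is always reported as a member
```

Because bits are only ever set, an inserted element can never be missed, while an element that was not inserted may still be reported when all of its positions happen to be set by other elements.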

PBFs.
To store the $d$ dimensions, this paper introduces $d$ BFs (Figure 1(b)), and every BF has $k$ independent hash functions [29].

PHTs.
To find out which dimensions, and how many dimensions, of the elements in the set $S$ are similar to the query $q$, this paper utilizes parallel hash tables (PHTs) and hash links to store the identifications (IDs) of the elements. Each hash table, denoted as HT, is in effect a link array with 2 length.

Bitmatrix.
Since the dimensions are stored in the BFs and HTs separately, the integrity of the elements is destroyed, which leads to query confusion. Thus, an auxiliary structure, called the bitmatrix, is added to record the dimensions hit in the PBFs and PHTs. After the dimensions are checked in the PBFs and PHTs, the hit dimensions are summed up per element in the bitmatrix, and the maximum of these sums over all elements is taken. If the maximum equals the number of dimensions $d$, the query is a member of the set, with a FPP, as shown in (1). If it equals 0, no dimension of the query is in the set, for example, (2). If it lies strictly between 0 and $d$, the query is similar to elements of the set, with a FPP, as shown in (3).

Query.
Only when a dimension returns 1 in its BF are the attenuated hash values summed up and located in the corresponding HT. The IDs of the hit elements are then found, and the corresponding bits in the bitmatrix are set to 1. After all dimensions are mapped, the columns of the bitmatrix are summed up. If a summation lies in the range between 1 and the number of dimensions $d$, the membership or the similar elements are obtained.

Element Deletion.
Since clearing bits in a BF would introduce errors for other elements that share those bits, the Similar-PBF-PHT leaves the BFs unchanged and only needs to delete the hash node in the corresponding HT.
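The insertion, query, and deletion steps described above can be put together in a small sketch. It is an illustration under stated assumptions, not the authors' implementation: the class `SimilarPbfPht`, the helper `hash_positions`, the salted SHA-256 hashes, and the form of the attenuated key (each hash value divided by an increasing power of 2) are all hypothetical choices made for this example.

```python
import hashlib


def hash_positions(value: str, k: int, m: int, salt: int) -> list[int]:
    """k salted SHA-256 hashes reduced modulo m (stand-in hash family)."""
    out = []
    for i in range(k):
        digest = hashlib.sha256(f"{salt}:{i}:{value}".encode()).digest()
        out.append(int.from_bytes(digest[:8], "big") % m)
    return out


class SimilarPbfPht:
    def __init__(self, d: int, m: int = 4096, k: int = 3) -> None:
        self.d, self.m, self.k = d, m, k
        self.bfs = [[0] * m for _ in range(d)]  # one BF per dimension
        self.hts = [dict() for _ in range(d)]   # one HT per dimension: key -> IDs
        self.ids = set()

    def _key(self, positions: list[int]) -> float:
        # Assumed form of the attenuated hash sum used as the HT key.
        return sum(p / 2 ** (i + 1) for i, p in enumerate(positions))

    def insert(self, elem_id: int, item: list[str]) -> None:
        for j, value in enumerate(item):
            pos = hash_positions(str(value), self.k, self.m, salt=j)
            for p in pos:
                self.bfs[j][p] = 1
            self.hts[j].setdefault(self._key(pos), set()).add(elem_id)
        self.ids.add(elem_id)

    def delete(self, elem_id: int, item: list[str]) -> None:
        # BF bits are left untouched; only the HT nodes are removed.
        for j, value in enumerate(item):
            pos = hash_positions(str(value), self.k, self.m, salt=j)
            self.hts[j].get(self._key(pos), set()).discard(elem_id)
        self.ids.discard(elem_id)

    def query(self, item: list[str], top_k: int) -> list[tuple[int, int]]:
        """Return up to top_k (element ID, number of hit dimensions) pairs."""
        ids = sorted(self.ids)
        col = {e: c for c, e in enumerate(ids)}
        bitmatrix = [[0] * len(ids) for _ in range(self.d)]  # d rows, n columns
        for j, value in enumerate(item):
            pos = hash_positions(str(value), self.k, self.m, salt=j)
            if not all(self.bfs[j][p] for p in pos):
                continue  # dimension j misses in its BF, so its HT is skipped
            for e in self.hts[j].get(self._key(pos), ()):
                bitmatrix[j][col[e]] = 1
        # Column sums: number of dimensions on which each element matches.
        hits = [(e, sum(row[col[e]] for row in bitmatrix)) for e in ids]
        hits = [h for h in hits if h[1] > 0]
        hits.sort(key=lambda h: (-h[1], h[0]))
        return hits[:top_k]


index = SimilarPbfPht(d=3)
index.insert(0, ["red", "42", "cat"])
index.insert(1, ["red", "42", "dog"])
index.insert(2, ["blue", "7", "dog"])
print(index.query(["red", "42", "cat"], top_k=3))
```

In this sketch a column sum equal to `d` signals a member (up to false positives), and `d` minus the column sum is the estimated Hamming distance, so sorting by hit count yields the nearest neighbors first. Dimensions are converted to strings before hashing, which is why mixed data formats pose no difficulty.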

Performance Analysis
Since the BF only has FPP but not FNP [24], we evaluate the performance of the Similar-PBF-PHT by the quality of results, FPP, query time, and space consumption.
Proof. In this paper, BFs are used to represent the dimensions of the elements, and each BF has $k$ independent hash functions [19]. Let each random variable $h_i(x)$ follow the uniform distribution on $\{1, \ldots, m\}$, with expected value $(1 + m)/2$ and variance $(m - 1)^2/12$; then $v(x) = \sum_{i=1}^{k} h_i(x)$ ranges over $\{k, \ldots, km\}$. According to the central limit theorem [1], if $k$ is big enough, the random variable $v(x)$ approximately follows a normal distribution with expected value $k(1 + m)/2$ and variance $k(m - 1)^2/12$. Because the sum of the attenuated hash values, $v(x) = \sum_{i=1}^{k} h_i(x)/2^{i}$, can reduce the FPP, we store the attenuated value of the $j$th dimension of an element in the $j$th HT. It is difficult to estimate the probability density function of $v(x)$; according to the birthday attack [2], collisions are minimized when $v(x)$ is distributed uniformly. For simplicity, we suppose that $v(x)$ follows a normal distribution in order to estimate the upper bound of the false positive probability.
If all dimensions are misdetected simultaneously, the hit FPP attains its minimum value. If only one dimension is falsely detected, the hit FPP attains its maximum value. The hit FPP of the Similar-PBF-PHT is thus bounded from below and from above by expressions in the FPP of the BF-HT.
Figure 2: The maximum and minimum value of the logarithm of the hit FPP (vertical axis: false positive probability, log scale).
In Figure 2, the FPP of the BF-HT is set to 0.5; as the number of dimensions increases, the maximum and minimum of the hit FPP of the Similar-PBF-PHT decay exponentially.

Average Overall Ratio.
We evaluate the quality of a K-NN search result by the rank-i ratio and the average overall ratio (AOR) [18], which are used in most experiments. The rank-i ratio is denoted by $R_i(q)$ and defined as
$$R_i(q) = \frac{\| o_i, q \|}{\| o_i^{*}, q \|},$$
where $i \in [1, K]$, $\| o_i, q \|$ is the distance of the queried $i$th neighbor to $q$, and $\| o_i^{*}, q \|$ is the distance of the actual $i$th neighbor to $q$. The overall approximation ratio is the mean of the ratios of all ranks, namely, $\left(\sum_{i=1}^{K} R_i(q)\right)/K$.
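The two quality measures can be computed directly from the distances of the returned and the actual neighbors. The following sketch assumes both distance lists are already sorted by rank; the function names and the handling of zero distances are assumptions introduced for this example, and the distance values are made up.

```python
def rank_ratio(returned: float, exact: float) -> float:
    """Rank-i ratio: distance of the returned i-th neighbor over the exact one."""
    if exact == 0:
        # Assumption: a 0/0 pair counts as a perfect ratio of 1;
        # a nonzero returned distance against an exact 0 is unbounded.
        return 1.0 if returned == 0 else float("inf")
    return returned / exact


def rank_ratios(returned: list[float], exact: list[float]) -> list[float]:
    if len(returned) != len(exact):
        raise ValueError("both lists must contain K distances")
    return [rank_ratio(r, e) for r, e in zip(returned, exact)]


def average_overall_ratio(returned: list[float], exact: list[float]) -> float:
    """Mean of the rank-i ratios over all K ranks."""
    ratios = rank_ratios(returned, exact)
    return sum(ratios) / len(ratios)


# Hypothetical distances for a 3-NN query.
returned_dist = [2.0, 3.0, 6.0]
exact_dist = [2.0, 3.0, 4.0]
print(rank_ratios(returned_dist, exact_dist))            # [1.0, 1.0, 1.5]
print(average_overall_ratio(returned_dist, exact_dist))  # 3.5 / 3
```

A ratio of 1 at every rank means that the returned neighbors are as close as the actual ones.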

Storage Space.
The storage space of the Similar-PBF-PHT contains three parts:
(i) When the FPP of a BF is not greater than $\varepsilon$ and the number of hash functions is optimal, the size $m$ of the BF array needed to express a set of $n$ elements must satisfy $m \ge n \log_2(1/\varepsilon)/\ln 2 = -n \log_2 e \times \log_2 \varepsilon$. The $d$ parallel BFs then require $d \cdot m$ bits.
(ii) A HT needs to store all IDs of the elements and the next node. Let be the length of the HT (0 < ), a node takes up 1 bits, and the HT requires spaces with a range from When = 0, space gets the minimum value. and when 1 , , and are constants, the space complexity of the Similar-PBF-PHT is
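The bound in part (i) is easy to evaluate numerically. The sketch below computes the minimum BF size for a target FPP and multiplies it by the number of parallel BFs; the function names are hypothetical, and the optimal number of hash functions, $(m/n)\ln 2$, is the standard BF result rather than a formula quoted from this section.

```python
import math


def bloom_bits(n: int, eps: float) -> int:
    """Smallest integer m with m >= n * log2(1/eps) / ln 2."""
    return math.ceil(n * math.log2(1 / eps) / math.log(2))


def optimal_hashes(m: int, n: int) -> int:
    """Standard optimum k = (m / n) * ln 2, rounded to an integer >= 1."""
    return max(1, round(m / n * math.log(2)))


def pbf_bits(n: int, eps: float, d: int) -> int:
    """Bits needed by d parallel BFs, one per dimension."""
    return d * bloom_bits(n, eps)


n, eps, d = 1000, 2 ** -10, 5   # eps is about 0.00098
m = bloom_bits(n, eps)
print(m)                        # 14427 bits per BF
print(optimal_hashes(m, n))     # 10 hash functions
print(pbf_bits(n, eps, d))      # 72135 bits for the 5 parallel BFs
```

The size grows linearly in the number of elements and only logarithmically in $1/\varepsilon$, which is why a small target FPP remains affordable.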

Search Time.
When querying, the PBFs need $d \times k$ hash calculations ($d$ BFs with $k$ hash functions each), and the PHTs require 2 times to search the IDs, in which 2 is the average length of the HT bucket links.
During the membership searching, the IDs of all hit elements need to be recorded, and the time complexity is The bitmatrix traverses at most times. So the time complexity is

Dataset and Setting.
The BF is designed to represent a set, and there is no benchmark for it. Here we choose four datasets used in most experiments: Color [13], Mnist [12], Varden [18], and Reuters 21578 [31]. The data formats in Reuters 21578 are varied, including digits, characters, symbols, and their combinations. We use it to generate a 49396-item dataset with 1000 dimensions to test the performance of the Similar-PBF-PHT, including the query latency and the ability to process data. The experiments run on a computer with a 2.5 GHz Intel dual-core processor and 8 GB of RAM.

Membership Query.
In this section, we discuss the performance of different methods in membership query. Let = 3, $m$ = 320 (640, 1280, 2560), = 5, = 1.1, and V = 32, where $m$ is the size of the BF array and V is the number of bits of every verification value in the PBF-HT and Similar-PBF-PHT. Figure 3 displays the FPPs of the SBF [5], PBF [7], PBF-HT [8], PBF-BF [8], and Similar-PBF-PHT on the Reuters 21578 data. As $m$ increases, the FPPs decrease, and for a constant $m$, the FPPs increase as the number of elements $n$ grows, especially for the SBF and PBF. When the number of items exceeds a threshold ($n \ge m$), the FPPs of the SBF and PBF are nearly equal to 1, which is consistent with the BF theory. For every $m$, the Similar-PBF-PHT gets the lowest FPP; even when $m$ = 320, its biggest FPP does not exceed 0.01, while the FPPs of the others are almost 1.
Figure 4 demonstrates the memory usage of the PBF, PBF-BF, PBF-HT, and Similar-PBF-PHT on the Reuters 21578 dataset, when FPP = 0.00098, = 4, = 5, $n$ = [3000, 30000], and V = 32. According to formula (13), to fit a constant FPP, the memory usage grows as the number of items increases. The hash tables and the bitmatrix reduce the FPP at the cost of memory, while the bit arrays of the BFs take up 1/4 of the space of the CBFs used in the other three schemes. Together, these make the space consumption of the Similar-PBF-PHT just a little higher than that of the PBF but lower than that of the PBF-HT.

K-NN Search.
To evaluate the accuracy of the K-NN search, we compare the average overall ratios of the Rigorous-LSH [15], MedRank [13], Adhoc-LSH [17], LSB-tree, and LSB-forest [18] with those of the Similar-PBF-PHT on the Color and Mnist datasets, as shown in Figure 5. The workload is set to 50, and 1-100 nearest neighbors are searched. In Hamming spaces, the ratios of the Similar-PBF-PHT are almost equal to 1. In Euclidean spaces, the ratios of the Similar-PBF-PHT are not stable; the overall ratios on Mnist are almost as good as those of the LSB-forest, but the ratios on Color are a little higher and increase with the number of nearest neighbors. The main reason is that the dimensions of Mnist are sparse (most values are 0), so most Hamming distances are 0, whereas the dimensions of Color are dense, so a small distance (0.0001) in Euclidean spaces is recognized as 1 in Hamming spaces. All these make the accuracy decrease, but the ratios are still above 0.98.
Figure 6 displays the average rank-i ratios of the Euclidean and Hamming distances. For the Hamming distance, because of the FPP of the BF, the distance obtained by the query can be smaller than the actual distance, so the rank-i ratio is at most 1. In Figure 6(a), as the rank increases, the rank-i ratios of the Hamming distance are stable and not lower than 0.985. On Mnist (Figure 6(b)), the rank-i ratios of the Euclidean distance of the Similar-PBF-PHT are the smallest, almost equal to 1, while, on Color (Figure 6(c)), the ratios of the Euclidean distance increase slowly and are higher than those of the LSB-tree and LSB-forest; when the rank $i \ge 7$, they become lower than those of the MedRank.
Table 1  increases with and . When = 2 the memory costs of the LSB-forest are almost as big as those of the Adhoc-LSH. The Similar-PBF-PHT can deal with dimensions of different formats and lengths, and the length of the dimensions and the number of samples affect the query time.
In Figure 7(a), we set the number of dimensions to 100, 500, and 1000, respectively, and every dimension contains 20 characters (big enough for most applications) to search the 10-NN. As the number of dimensions grows, the average query latency of the Similar-PBF-PHT increases linearly. Let = 5000, = 10, and = 10; Figures 7(b) and 7(c) demonstrate the effects of different dimension lengths on the query delay for 10-NN searching. The average query latency increases with the number of characters and dimensions, because most of the CPU time is spent on computing the hash values of the characters.
In Figure 8, we analyze the effects of the parameters and BF on the AORs and FPPs of the Similar-PBF-PHT. Let = 50000, = 3, V = 32, and ∈ [20-100]-NN and let test , even small (0.01), and big BF (0.5), the Similar-PBF-PHT gets good query results and low FPPs. This means that the PHTs and the bitmatrix can effectively improve the detection accuracy. affects the query accuracy much more than the FPP of the BF. With increasing, the FPP decreases and the AOR increases; at the same time, the space consumption increases.
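The contrast between dense and sparse dimensions noted in the K-NN results can be illustrated with a toy comparison of the two distances. The vectors below are made up for this illustration and are not taken from the Color or Mnist datasets.

```python
import math


def hamming(a: list, b: list) -> int:
    """Number of dimensions on which the two items differ."""
    return sum(x != y for x, y in zip(a, b))


def euclidean(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Dense data: every dimension differs by only 0.0001.
dense_q = [0.5000, 0.2500, 0.7500]
dense_x = [0.5001, 0.2501, 0.7501]

# Sparse data: most dimensions are 0 and therefore identical.
sparse_q = [0, 0, 0, 9]
sparse_x = [0, 0, 0, 7]

print(hamming(dense_q, dense_x), euclidean(dense_q, dense_x))
print(hamming(sparse_q, sparse_x), euclidean(sparse_q, sparse_x))
```

On the dense pair the Euclidean distance is tiny while the Hamming distance takes its largest possible value of 3, so a ranking by Hamming distance can disagree with the Euclidean ranking. On the sparse pair only one dimension differs, and the two distances rank neighbors in a more similar way.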

Conclusions
In this paper, we propose a comprehensive structure, called Similar-PBF-PHT, to represent and search for members and similar elements of a big dataset in high-dimensional spaces by computing the Hamming distance. We analyze its working mechanism, FPP, and space and time complexity in detail. The experiments show that, for membership searching, the Similar-PBF-PHT achieves a lower hit FPP than the PBF, PBF-HT, and PBF-BF at a low memory cost. The Similar-PBF-PHT costs less storage than the schemes based on locality sensitive hashing, including the Rigorous-LSH, LSB-forest, Adhoc-LSH, and BLSH. For K-NN querying, it costs CPU time, not I/O operations, which gives it a lower query latency. Meanwhile, the Similar-PBF-PHT computes the hash values of all characters in each dimension, so it can deal with different data formats (characters, numbers, symbols, and so on), and the number of characters affects the query time. The average overall ratios (query accuracy) and the average rank-i ratios of the Hamming distance are accurate. All these advantages make it appropriate for representing and searching items in high-dimensional spaces, such as database and document similarity search. Although the Similar-PBF-PHT performs well in Hamming spaces, its memory costs and its FPP in Euclidean spaces for K-NN searching are still a little high.