
Bloom filters are space-efficient randomized data structures for fast membership queries that allow false positives. Counting Bloom Filters (CBFs) support the same queries on dynamic sets, which can be updated via insertions and deletions. CBFs have been extensively used in MapReduce to accelerate large-scale data processing on large clusters by reducing the volume of datasets. The false positive probability of a CBF should be made as low as possible so that more redundant data can be filtered out. In this paper, we propose a multilevel optimization approach to building an Accurate Counting Bloom Filter (ACBF) that reduces the false positive probability. ACBF is constructed by partitioning the counter vector into multiple levels. We propose an optimized ACBF that maximizes the first level size in order to minimize the false positive probability while maintaining the same functionality as CBF. Simulation results show that the optimized ACBF reduces the false positive probability by up to 98.4% at the same memory consumption compared to CBF. We also implement ACBFs in MapReduce to speed up the reduce-side join. Experiments on realistic datasets show that ACBF reduces the false positive probability by 72.3% and the map outputs by 33.9%, and improves the join execution times by 20%, compared to CBF.

A Bloom filter [

Because standard Bloom filters do not support deleting elements, many variants have been proposed. One well-known variant is the Counting Bloom Filter (CBF) [

Large-scale data processing has been extensively used in the Cloud. MapReduce [

However, large-scale data processing poses a significant performance challenge to MapReduce. First, data-intensive applications such as web search engines and log processing involve large amounts of data. For example, China Mobile has to process 5–8 TB of phone call records per day, and Facebook gathers almost 6 TB of new log data per day. For such applications, it is time consuming to distribute the data across hundreds or thousands of low-end machines for computation. Second, the join operation is very inefficient in the MapReduce framework. The join is one of the fundamental query operations; it combines records from two different datasets based on a cross product. The main problem of the MapReduce join is that both entire datasets must be processed and distributed among a large number of machines in the cluster. This causes high communication cost, and even a performance bottleneck, when only a small fraction of the data is relevant to the join. Third, there are no auxiliary data structures such as indexes and filters in the MapReduce framework. This is because MapReduce was initially designed to process a single large dataset as its input; as all the records within a time period are analyzed together, dataset scans are preferable to index scans in the MapReduce framework.

To address this challenge, CBFs have been widely used to accelerate large-scale data processing in MapReduce. In the reduce-side join [

There are three performance metrics of a CBF: processing overhead, memory consumption, and false positive probability. The processing overhead is the number of memory accesses per primitive operation, which dominates the CBF throughput. The memory consumption is the size of the counter vector; four bits per counter are typically used to support insertions and deletions. As the counters blow up the available memory space, several variants [

This paper presents a multilevel optimization approach to building an Accurate Counting Bloom Filter (ACBF). The goal of ACBF is to reduce the false positive probability. ACBF is constructed by partitioning the counter vector into multiple levels that are organized by offset indexing. In ACBF, the first level is used to perform set membership queries, while other levels are used to calculate the counters on insertions and deletions. In order to minimize the false positive probability, we propose an optimized ACBF by maximizing the first level size while maintaining the same functionality as the standard CBF. Simulation results show that ACBF outperforms CBF in false positive probability at the same memory consumption. We also implement ACBFs in MapReduce to improve the reduce-side join performance by filtering out more redundant records shuffled. Experiments in Hadoop show that ACBF reduces the false positive probability by 72.3% as well as the map outputs by 33.9% and improves the total execution times by 20% compared to CBF.

This paper makes the following main contributions.

We propose a novel multilevel optimization approach to building a variant of CBF called ACBF for reducing the false positive probability. ACBF is built by partitioning the counter vector into multiple levels. We propose an optimized ACBF by maximizing the first level size, in order to minimize the false positive probability.

We show that ACBF outperforms CBF in false positive probability at the same memory consumption. Simulation results show that the optimized ACBF reduces the false positive probability by up to 98.4% compared to CBF and performs the same functionality as CBF.

We implement ACBFs in MapReduce to improve the join performance. ACBF is constructed in a distributed fashion and distributed via broadcast to all map tasks for filtering out more redundant records transferred during the shuffle phase.

Experiments on realistic datasets show that ACBF reduces the false positive probability by 72.3% as well as the map outputs by 33.9% and improves the join execution times by 20% compared to CBF.

The rest of this paper is organized as follows. Section

Bloom filters are space-efficient randomized data structures to perform approximate membership queries. A Bloom filter represents a set

A Bloom filter may yield false positives, but false negatives are not possible. With a vector of $m$ bits, $k$ hash functions, and $n$ inserted elements, the false positive probability is calculated as follows:

$$f = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}.$$
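This approximation can be checked numerically. The sketch below (function and parameter names are ours, not the paper's) evaluates $(1 - e^{-kn/m})^k$ together with the well-known optimal hash count $k = (m/n)\ln 2$:

```python
import math

def bloom_false_positive(m: int, n: int, k: int) -> float:
    """Approximate false positive probability of a Bloom filter
    with m bits, k hash functions, and n inserted elements."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m: int, n: int) -> int:
    """Number of hash functions that minimizes the false positive
    probability: k = (m / n) * ln 2, rounded to an integer."""
    return max(1, round((m / n) * math.log(2)))
```

For example, with 9.6 bits per element (m = 9600, n = 1000), `optimal_k` gives k = 7 and a false positive probability of roughly 1%.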

The standard Bloom filter allows insertion but not deletion. Deleting elements from the Bloom filter cannot be done simply by changing ones back to zeros. This is because a single bit in the vector may correspond to multiple elements inserted. The Counting Bloom Filter (CBF) [
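The counter-based scheme just described can be sketched as follows. This is a minimal illustrative model, not the paper's implementation: the hashing (salted SHA-256) and the use of plain Python integers in place of 4-bit counters are our own simplifications.

```python
import hashlib

class CountingBloomFilter:
    """Minimal CBF sketch: one small counter per position, so
    deletions are supported by decrementing instead of clearing bits."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m  # 4-bit counters in a real implementation

    def _positions(self, item: str):
        # Derive k positions by salting the item with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str):
        for p in self._positions(item):
            self.counters[p] += 1

    def delete(self, item: str):
        for p in self._positions(item):
            if self.counters[p] > 0:
                self.counters[p] -= 1

    def query(self, item: str) -> bool:
        # Positive iff every hashed counter is nonzero (may be a false positive).
        return all(self.counters[p] > 0 for p in self._positions(item))
```

After deleting every copy of an element, its counters return to zero, which is exactly what the bit vector of a standard Bloom filter cannot do.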

CBFs have been widely used in a variety of applications such as networking [

A key disadvantage of CBFs is their wasteful fourfold memory overhead. Several improvements on CBF have recently been proposed to minimize the memory consumption. The

Moreover, other variants of Bloom filters have recently been proposed to improve the false positive probability. The power of two choices [

In this section, we describe a multilevel optimization approach to building an Accurate Counting Bloom Filter called ACBF. We first present the construction of ACBF and then describe the query and insertion/deletion algorithms. Next, we describe optimized ACBFs and then analyze the false positive probability. Finally, we show simulation results to compare optimized ACBFs with the standard CBF.

The basic idea of ACBF is to use a multilevel approach that partitions the counter vector into multiple levels for higher accuracy. Using this approach, we separate the query operation from the insertion/deletion operations of ACBF. This separation achieves a lower false positive probability while still supporting updates on dynamic sets. The design is motivated by an observation about CBF: its counter vector supports quick updates, by incrementing or decrementing counters, at the cost of the false positive probability. Figure

CBF with

We see that there are only three elements inserted into the filter, and the false positive probability is dominated by the number

ACBF has a hierarchical structure which is composed of

(Pseudocode listing: the ACBF query operation. ACBF is composed of multiple bitmaps organized by offset indexing; the query computes the hash indexes into the first-level bitmap and returns whether the corresponding bits are set.)

ACBF with four levels and an idle space.

ACBF is organized by using offset indexing to span the counters over different levels. We assume that each level
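Offset indexing can be illustrated with a small sketch (our own Python model, with plain lists standing in for bitmaps; the authors' memory layout may differ). A set bit at the current index means the counter extends into the next level, and the bit's rank among the preceding set bits gives its index there:

```python
def popcount_before(bitmap, pos):
    """Number of set bits in bitmap[0:pos] -- the offset index."""
    return sum(bitmap[:pos])

def read_counter(levels, i):
    """Read the counter at position i of the first level by descending
    through the levels via offset indexing.  Each set bit adds one to
    the counter value and redirects the lookup into the next level."""
    count, idx = 0, i
    for bitmap in levels:
        if idx >= len(bitmap) or bitmap[idx] == 0:
            break
        count += 1
        idx = popcount_before(bitmap, idx)
    return count
```

In this model, level j+1 holds one slot per set bit of level j, so a rank computed by `popcount_before` is always a valid index into the next level.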

In order to insert or delete an element from ACBF, we must increment or decrement the counters to which the element hashes. This is done by expanding or shrinking the relevant levels of the hierarchy. Algorithm

(Pseudocode listing: the ACBF insertion operation. The algorithm traverses the levels along the element's counter chain, computes offset = popcount(…) to locate the continuation in the next level, expands that level's bitmap with expand_bitmap(…), and continues with index = offset.)

(Pseudocode listing: the ACBF deletion operation. The algorithm traverses the levels to the deepest set bit of the element's counter chain with index = offset, clears it, and shrinks the corresponding bitmap with shrink_bitmap(…).)
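The expand/shrink logic of the insertion and deletion operations, including the popcount-based offsets and the expand_bitmap/shrink_bitmap steps, can be sketched as follows. This is our own model, maintaining the invariant that level j+1 holds one slot per set bit of level j; the authors' bitmap layout may differ.

```python
def popcount_before(bitmap, pos):
    """Number of set bits in bitmap[0:pos] (the offset index)."""
    return sum(bitmap[:pos])

def increment(levels, i):
    """Raise the counter at position i: walk down its chain of set
    bits, set the first unset bit, and expand the next level with a
    fresh zero slot at that bit's offset (expand_bitmap)."""
    j, idx = 0, i
    while levels[j][idx] == 1:
        idx = popcount_before(levels[j], idx)
        j += 1
    levels[j][idx] = 1
    offset = popcount_before(levels[j], idx)
    if j + 1 == len(levels):
        levels.append([])            # open a new level on demand
    levels[j + 1].insert(offset, 0)  # expand_bitmap

def decrement(levels, i):
    """Lower the counter at position i: clear the deepest set bit of
    its chain and shrink the level below by removing the slot that
    belonged to that bit (shrink_bitmap)."""
    path = []
    j, idx = 0, i
    while j < len(levels) and idx < len(levels[j]) and levels[j][idx] == 1:
        path.append((j, idx))
        idx = popcount_before(levels[j], idx)
        j += 1
    if not path:
        return                       # counter already zero
    j, idx = path[-1]
    offset = popcount_before(levels[j], idx)
    levels[j][idx] = 0
    levels[j + 1].pop(offset)        # shrink_bitmap
```

The list insertions and removals model the bitmap expansion and shrinking that keep the popcount-based offsets of all other counters consistent.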

From Algorithms

To attain a lower false positive probability, we propose two optimization methods to improve the ACBF construction by increasing the first level size. We assume that ACBF consists of

The basic idea of the first optimization method is to simply increase the first level size by a multiplicative factor. Let

To achieve this goal, we propose the second optimization method for improving the ACBF construction. This method is designed based on the following observation: if up to

Figure

Optimized ACBF with

Next, we compare the theoretical false positive probability of ACBFs with that of CBF for the three configurations ACBF_{1}, ACBF_{2}, and ACBF_{3}. We note that ACBF_{1} has the same false positive probability as CBF. The optimized ACBF reduces the false positive probability by up to 99.8% and up to 69.6% compared to ACBF_{2} and ACBF_{3}, respectively.

Theoretical false positive probability of CBF and ACBFs.

We conduct simulation experiments to test the performance of ACBFs on synthetic datasets. As a standard CBF has the same memory consumption as its previous variants, we mainly compare ACBFs with the standard CBF in the experiments. We compare ACBF_{1}, ACBF_{2}, and ACBF_{3} in terms of false positive probability, query overhead, and update overhead, at the same memory consumption. In the experiments, both CBF and ACBFs have the same parameters such as

For each synthetic experiment, we synthesize a data set and a query set. The data set contains 100 K unique strings that we represent with CBF and ACBFs, while the query set contains 1000 K strings that are tested through the filters. During an update period, 20 K elements are deleted from the filters, and another 20 K elements are inserted into the filters, maintaining constant 100 K elements in the filters. We do ten experimental trials and average the results.

Figure compares the false positive probability of CBF with that of ACBF_{1}, ACBF_{2}, and ACBF_{3}. As shown in the figure, the false positive probability is reduced by ACBF_{1}, ACBF_{2}, and ACBF_{3}, respectively. ACBF_{3} has the same false positive probability as

False positive probability of CBF, ACBF_{1}, ACBF_{2}, ACBF_{3}, and

Figure compares the query overhead of CBF with that of ACBF_{1}, ACBF_{2}, and ACBF_{3}. The reason is that all the filters have the same number

Query overhead of CBF, ACBF_{1}, ACBF_{2}, ACBF_{3}, and

Figure compares the update overhead of CBF with that of ACBF_{1}, ACBF_{2}, and ACBF_{3}, and

Update overhead of CBF, ACBF_{1}, ACBF_{2}, ACBF_{3}, and

In this section, we implement ACBFs in MapReduce to accelerate reduce-side joins for large-scale data processing. We first present the MapReduce overview and then describe the optimized reduce-side join with ACBF in MapReduce. Finally, we report experimental results on realistic datasets.

MapReduce [

A MapReduce program provides

When a MapReduce job is launched, a job tracker creates a total of
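As a concrete picture of this dataflow, here is a tiny in-process model (our own sketch, not Hadoop's API) of the map, shuffle, and reduce phases, with word count as the example program:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Tiny in-process model of the MapReduce dataflow:
    map -> shuffle (group by key) -> reduce."""
    # Map phase: each input record yields (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count, the canonical MapReduce example.
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return sum(counts)
```

In a real deployment the map calls run in parallel across map tasks and the shuffle moves data over the network, but the key-grouping contract is the same.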

The join operation is one of the fundamental query operations. It combines records from two different datasets based on a cross product [

There are two main join implementations in MapReduce: the map-side join and the reduce-side join. As their own names imply, the map-side join implements the join during the map phase, while the reduce-side join implements the join during the reduce phase [

The reduce-side join is the most general join approach implemented in MapReduce. The basic idea behind the reduce-side join is that a map task tags each key-value pair with its source and uses the join keys as the map output keys, so that the pairs with the same key are grouped for a reduce task. Figure
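The tagging scheme just described can be sketched as follows. The record format (a dict with a "key" field) and the source tags "L" and "R" are our illustrative choices, not the paper's:

```python
def join_map(record, source):
    """Tag each record with its source and emit the join key as the
    map output key, so matching records meet at one reduce task."""
    return [(record["key"], (source, record))]

def join_reduce(key, tagged_values):
    """Produce the cross product of the records from the two sources
    that share this join key."""
    left = [rec for src, rec in tagged_values if src == "L"]
    right = [rec for src, rec in tagged_values if src == "R"]
    return [(l, r) for l in left for r in right]
```

Because the join key is the map output key, the shuffle phase delivers all records with the same key, from both datasets, to the same reduce task.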

Reduce-side join with ACBF in MapReduce.

To mitigate the I/O cost of the reduce-side join, a CBF is widely used in the map phase to filter the map outputs shuffled across the network. We use ACBF in place of CBF in the reduce-side join to minimize the amount of traffic during the shuffle phase. ACBF has a much lower false positive probability, so more redundant map outputs can be filtered out. As the figure shows, a reduce task is often used to construct an ACBF in a distributed way. The job tracker then broadcasts the ACBF to all the map tasks by an efficient facility. Finally, the reduce tasks perform the joins and produce the final results.

The figure illustrates the ACBF construction. First, each map task reads an input split of one of the two datasets. A local hash table is created in each map task by adding the unique keys of its file split. Note that each local hash table is a chained hash table that has the same number of buckets and uses the same hash function in every map task. The local hash tables are then sent to a single reduce task, which creates a global hash table: all the local hash tables are merged by a union function that eliminates duplicated keys. Next, the reduce task creates an ACBF by adding the keys of the global hash table to the filter. Finally, the ACBF is written to local storage in DFS and submitted via broadcast to all other map tasks for performing the reduce-side join (see Figure
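The construction just described (per-split unique-key tables, a union at one reduce task, then a filter built from the merged keys) can be modeled as follows. A plain Bloom bit vector stands in for the ACBF here, and the salted-SHA-256 hashing is our illustrative choice:

```python
import hashlib

def filter_positions(key, m, k):
    """k hash positions for a key, derived from salted SHA-256."""
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % m

def build_filter_from_splits(splits, m, k):
    """Model of the distributed construction: each 'map task' collects
    the unique join keys of its split (a local hash table), a single
    'reduce task' unions them into a global table, and the filter is
    built from that table."""
    local_tables = [set(split) for split in splits]   # map side
    global_table = set().union(*local_tables)         # reduce side (union)
    bits = [0] * m
    for key in global_table:
        for p in filter_positions(key, m, k):
            bits[p] = 1
    return bits

def maybe_member(bits, key, k):
    """Membership test against the broadcast filter (no false negatives)."""
    return all(bits[p] for p in filter_positions(key, len(bits), k))
```

Each map task of the join job would then probe the broadcast filter with a record's join key and drop the record when `maybe_member` is false, shrinking the shuffled map outputs.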

ACBF construction in MapReduce.

To evaluate the optimized reduce-side join, we implement ACBFs in Hadoop, an open-source Java implementation of MapReduce. We obtain the NBER US patent citations data files [

Table

Reduce-side join performance comparisons in Hadoop.

Filter parameters | False positive probability | Map inputs (MB) | Map outputs (MB) | Filter construction times (s) | Total execution times (s)
---|---|---|---|---|---
Join + CBF | 0.01772 | 252.5 | 45.1 | 62 | 115
Join + ACBF | 0.00491 | 252.5 | 29.8 | 68 | 92

We propose a multilevel optimization approach to building an accurate CBF called ACBF for reducing the false positive probability. ACBF is constructed by partitioning the counter vector into multiple levels. We propose an optimized ACBF named

We implement ACBFs in MapReduce to improve the reduce-side join performance. ACBF is used in the map phase to filter out redundant records shuffled. ACBF is constructed in a distributed way by merging local hash tables of all map tasks. Experiments on realistic patent citations data files show that

This work was supported in part by the National Basic Research Program (973 Program) of China under Grant no. 2012CB315805, the National Natural Science Foundation of China under Grant no. 61173167, no. 61100171, and no. 61272546, and the Scientific and Technological Project of Hunan Province, China, under Grant no. 2013SK3149.