Application of Filters to Multiway Joins in MapReduce

Joining multiple datasets in MapReduce may amplify the disk and network overheads because intermediate join results have to be written to the underlying distributed file system, or map output records have to be replicated multiple times.This paper proposes a method for applying filters based on the processing order of input datasets, which is appropriate for the two types of multiway joins: common attribute joins and distinct attribute joins. The number of redundant records filtered depends on the processing order. In common attribute joins, the input records do not need to be replicated, so a set of filters is created, which are applied in turn. In distinct attribute joins, the input records have to be replicated, so multiple sets of filters need to be created, which depend on the number of join attributes. The experimental results showed that our approach outperformed a cascade of two-way joins and basic multiway joins in cases where small portions of input datasets were joined.


Introduction
Join processing in MapReduce [1] has attracted the attention of researchers in recent years.This is because MapReduce does not support join operations directly, although it is a useful framework for large-scale data analysis.In particular, joining multiple datasets in MapReduce has been a challenging problem because it may amplify the disk and network overheads.Multiple datasets can be joined in the following two ways: (1) using a cascade of two-way (or smaller multiway) joins and (2) with a single multiway join.However, both methods have some drawbacks.A cascade of two-way joins has to write the intermediate join results to the underlying distributed file system, which generally replicates multiple records to ensure high availability and fault tolerance.To process multiway joins in a single MapReduce job, the map output records have to be replicated multiple times, instead of writing only the final join results to the distributed file system.
Previous studies have attempted to improve join performance using filtering techniques [2][3][4][5], including our previous study [6,7].These studies focused on reducing the number of map output records that are not joined.This may be more beneficial with multiway joins.The map output records are replicated multiple times, so filtering out redundant records removes multiple copies of the record in multiway joins.Figure 1 shows an example of basic multiway join processing in MapReduce.In this example, three input datasets, that is, R(a,b), S(b,c), and T(c,d), are joined with two attributes b and c.To join the three datasets simultaneously, some datasets need to be replicated, that is, R and T in this example.Replication may degrade the join performance, so it is important to reduce the number of redundant records, which are marked with strikethroughs in Figure 1.
In this study, we extend the concept of filtering techniques to multiway joins.Multiway joins can be classified into two types: common attribute joins and distinct attribute joins [8].A common attribute join combines datasets based on one or more shared attributes, whereas some relations do not have join attributes in a distinct attribute join.The example shown in Figure 1 illustrates a distinct attribute join, because R does not have the attribute c and T does not have the attribute b.Thus, we propose methods for filter application that are suitable for both cases.We do not limit the filtering technique used and various techniques can be applied.Simple hash filters were used for the evaluation.The contributions of this study are as follows.(i) We propose methods for applying filters to multiway joins.For common attribute joins, a set of filters is created and applied in turn.For distinct attribute joins, multiple sets of filters are created, according to the number of join attributes.The filters can be applied in various patterns, according to the processing order of the input datasets.
(ii) We provide specific details of our prototype implementation.We briefly introduce MFR-Join, which is our general join framework with filtering techniques implemented in Hadoop [9].Based on this framework, algorithms that perform multiway joins with filters are described.
(iii) Experimental results obtained using our basic framework are presented.Our proposed techniques were evaluated based on comparison with a cascade of twoway joins and a multiway join without filters.The results showed that our approach outperformed the other techniques in cases where small portions of the input datasets were joined.
The remainder of this paper is organized as follows.Section 2 reviews background information and previous research related to this study.Section 3 explains our proposed methods for applying filters to multi-way joins.Section 4 describes the implementation details of our methods.Section 5 discusses our experimental results.Finally, we conclude this paper in Section 6.

Background and Related Work
This section presents the basic concepts and previous research related to our study.Section 2.1 explains the basis framework, MapReduce.Next, two-way and multiway join algorithms in MapReduce are reviewed in Sections 2.2 and 2.3.

MapReduce.
MapReduce [1] is Google's programming model for large-scale data processing, which is run on a shared-nothing cluster.MapReduce liberates users from the responsibility of implementing parallel and distributed processing features by providing them automatically.Thus, users only have to write MapReduce programs with two functions: map and reduce.The map function takes a simple key/value pair as its input and it produces a set of intermediate key/value pairs.The reduce function takes an intermediate key and a set of values that correspond to the key as its input, and it generates the final output key/value pairs.
A MapReduce cluster comprises one master node and a number of worker nodes.When a MapReduce job is submitted, the master node creates the map, reduces tasks, and assigns each task to idle workers.A map worker reads the input split and executes the map function submitted by the user.The map output records are grouped and sorted by the key and then stored in partitions for each reduce worker.A reduce worker reads its corresponding partitions from all the map workers, merges the partitions, and executes the reduce function.When all of the tasks are complete, the MapReduce job is finished.
Hadoop [9] is a popular open-source implementation of the MapReduce framework.In Hadoop, the master node is called the jobtracker and the worker node is called the tasktracker.Tasktrackers run one or more mapper and reducer processes, which execute map and reduce tasks, respectively.The proposed method was implemented using Hadoop, so the Hadoop terminology is used in the remainder of this paper.

Join Processing in MapReduce.
Join algorithms in MapReduce are classified roughly into two categories: map-side joins and reduce-side joins [10].Map-side joins produce the final join results in the map phase and do not use the reduce phase.They do not need to pass intermediate results from mappers to reducers, which means that map-side joins are more efficient than reduce-side joins, although they can only be used in particular circumstances.Hadoop's map-side join [11], referred to as the map-merge join [10], merges input datasets that are partitioned and sorted on the join keys in the same manner, which is similar to the merge join in traditional DBMS.An additional MapReduce job is required if the input datasets are not partitioned and sorted in advance.The broadcast join [12] distributes the smaller of the input datasets to all of the mappers and performs the join in the map phase.This approach is efficient only if the input dataset is small.
Reduce-side joins can be used in more general cases, but they are inefficient because large intermediate records are sent from mappers to reducers.The repartition join [12] is the most common join algorithm in MapReduce, but all of the input records have to be sent to reducers, including redundant records that are not relevant to the join.This may lead to a performance bottleneck.The semijoin in MapReduce [12] works in a similar manner to semijoin in traditional DBMS.This approach may reduce the size of the intermediate results by filtering out the unreferenced records with unique join keys.Therefore, it is efficient when small portions of records participate in joins.However, the semijoin requires three MapReduce jobs, which means that the results of each job are written and read in the next job.This incurs additional I/O overheads.
Recent studies have attempted to adapt the bloomjoin [13], which filters out tuples that do not participate in a join using Bloom filters [14], to the MapReduce framework.Reduce-side joins with a Bloom filter were proposed previously [2,4,5], but they create the filter via an independent job.Therefore, they have to process the input datasets multiple times.Koutris [3] theoretically investigated join methods using Bloom filters within a single MapReduce job but did not provide specific technical details.

Multiway Joins in MapReduce.
Several datasets can be joined simultaneously in a single MapReduce job by replicating some input datasets, as shown in Figure 1.Previous studies have also attempted to optimize the number of input records replicated in multiway joins [15,16] and they use similar methods for minimizing the number.Figure 2 shows a partial replication of the input records for a join example with three datasets and nine reducers.Unlike the naive multiway join that replicates some input datasets fully, the input records of R and T are replicated by only three reducers, depending on the hash values of the join attributes b and c.This optimization problem can be formulated as a problem of minimizing the total number of records that are sent to reducers.Afrati and Ullman [15] solved the minimization problem using a method based on Lagrangian multipliers to find the optimal solution.Jiang et al. [16] used a heuristic approach to find an approximate solution.These studies can be used in combination with our approach, which uses filtering techniques to facilitate more efficient multi-way join processing.

Application of Filters to Multiway Joins
This section presents the basic methods used to create and apply filters to common and distinct multiway joins.To simplify the discussion, joins between three datasets are considered in the following subsections.We then consider the processing of general joins between multiple datasets.

Common Attribute Joins.
In common attribute joins, all of the input datasets share join attributes.In these cases, the input records do not need to be replicated and they can be processed in a similar manner to two-way joins.A set of filters is created and probed in turn, depending on the processing order of the input datasets.Figure 3 shows an example of a common attribute join between three input datasets, that is, R(a,b), S(a,c), and T(a,d), based on the attribute a. Similar to two-way joins, the input records of the first dataset, that is, R in the figure, are not filtered out and they are used to create a set of filters for the next dataset.The input records of the second dataset, that is, S in the figure, are processed using the filters and some redundant records can be filtered out.In addition, another set of filters is created using the map output records from S for the next dataset.The map output records are contained in the first set of filters, which means that the second set of filters is automatically the same as the intersection of the filters that are created independently using the first and second dataset.Finally, the third input dataset, that is, T in the figure, is processed with the second set of filters.Another set of filters does not need to be created because this is the final input dataset for the join attribute.
The input datasets are processed in the order of R, S, and T in this example, but any order can be processed in the same way.The join cost depends on the number of input records, the ratio of the joined records, and the false positive rate of the filters.

Distinct Attribute Joins.
In distinct attribute joins, the input datasets may not have some join attributes.Thus, some of the datasets with missing attributes need to be replicated because their records may be joined to the input records of other datasets with any values of the missing attributes.Let us consider the join example shown in Figure 1, which is a join between three input datasets, that is, R(a,b), S(b,c), and T(c,d), based on two attributes, that is, b and c.We assume that R and T are replicated by two reducers in this example.
The example join can be processed in 3! = 6 different orders of the input datasets.Depending on the processing order, the filters can be applied in three patterns: chain, star-fact, and star-dim.

Chain.
The chain pattern creates and probes filters in turn, in a similar manner to common attribute joins, except that each set of filters is created for a different join attribute.This is analogous to the indirect partitioning method proposed by Kemper et al. [17].Two processing orders correspond to this pattern, that is, R-S-T and T-S-R.Figure 4(a) illustrates an example of a distinct attribute join with the chain pattern.The first dataset R is replicated by reducers and a set of filters is created with the values of the join attribute, b in R. The second dataset S is processed using the filters and some redundant records may be filtered out.Meanwhile, another set of filters is created using the map output records from S based on the other join attribute c.Next, the third dataset in the figure, T, is replicated and processed with the second set of filters.

Star-Fact.
The star-fact pattern creates filters using the dataset with both join attributes and uses the filters to process the other datasets.In database terms, a fact table in a star join is used to create the filters.Two processing orders correspond to this pattern, that is, S-R-T and S-T-R.Figure 4(b) shows an example of a distinct attribute join with the star-fact pattern.The first dataset S, which has both join attributes, is processed and two sets of filters for each join attribute, that is, b and c, are created.Next, the other datasets, that is, R and T, are replicated and processed using the set of filters that correspond to the join attribute.

Star-Dim.
The star-dim pattern creates filters using the datasets with missing join attributes and uses the filters to process the other dataset.In database terms, the dimension tables in a star join are used to create the filters.The remaining two cases, that is, R-T-S and T-R-S, correspond to this pattern.Figure 4(c) shows an example of a distinct attribute join with the star-dim pattern.The first and second datasets, that is, R and T, are replicated and processed without filters.Each set of filters for the join attributes is created using their join attribute values.Next, the third dataset S is processed using both filters and some redundant records are filtered out.The star-dim pattern appears to be inefficient in this example, but it is efficient if the number of records in the third dataset is much larger than those in the other datasets.

General Multiway Joins.
The basic filtering patterns are presented in Sections 3.1 and 3.2.Multiway joins of more than three datasets can be processed by combining these patterns.
In general, multiple datasets can be joined in any processing order for input datasets using the following rules.
For each join attribute, (i) create a set of filters if the dataset is not the last one with the attribute; (ii) probe the existing set of filters if the dataset is not the first one with the attribute.
All combinations of these patterns can be summarized using these rules.The filters can be applied in any processing order, but the processing order must be selected carefully because it affects the join cost.

Cost Analysis.
The number of intermediate map output records is the most important factor that influences the overall cost.For multi-way joins, the number is affected by the replication factors for each join attribute.Let   be the number of records in the th input dataset and let   be the replication factor for the th join key.Next, let    be the ratio of joined records in the th dataset relative to a previously processed th dataset, and let    be the false positive probability of the previously created filters for the th join attributes when the  th dataset is processed.In addition, let (, ) be a binary value function that returns whether the th dataset contains the th join attribute as follows: 1, if th dataset contains the th join attribute, 0, otherwise.
Assuming that the attribute values of the input datasets are independent, the number of intermediate map output records   inter in a multi-way join between  datasets based on  join attributes can be expressed as follows: where We need to find the replication factors and processing order for input datasets that minimizes the number of intermediate map output records.Note that the factors   ,   , and    depend on the processing order of the input datasets.These equations can be used to select the processing order and to estimate the join cost, but there may be a large search space if the numbers of reducers and the input datasets are large.In these cases, the factors have to be selected using heuristics.For example, the replication factors can be computed using the method proposed by Afrati and Ullman [15], which does not consider filters, or they can be predefined by users or determined by optimizer modules.The brute force approach was used in the experiments conducted with ten reducers in the present study.

Implementation Details
In this section, we briefly introduce our basic framework, which is called MFR-Join.The two major issues when processing multi-way joins simultaneously in the framework are replicating the records for the corresponding reducers and processing multiple join attributes for filtering.The following subsections describe the specific implementation details that address these issues.

MFR-Join
. MFR-Join is a general join framework with filtering techniques in MapReduce, which is proposed in our previous studies [6,7].MFR-Join, which was implemented in Hadoop, has two major differences from the original Hadoop ⊳ : the number of join attributes ⊳ : an array of join attribute values (some values may be missing) ⊳ : an array of replication factors for each join attribute (1) procedure FindTargetReducers([1, . . ., ],  [1, . .
(1) procedure coordtoreducer( [ system.First, the map tasks are scheduled according to the order of the dataset, whereas Hadoop assigns map tasks based on the order of the input split size.This allows us to apply database techniques such as tuple filtering and join ordering.Second, the filters are constructed dynamically within a single MapReduce job.The filters are created based on the first input dataset in a distributed manner to filter out the second input dataset.In this way, MFR-Join can reduce the communication cost for redundant records by processing the input datasets only once.Our previous studies [6,7] provide further details of these processes.

Partition Assignment.
For each input dataset, its corresponding reducers for replication are determined as shown in Figure 2. The replication of input records for their corresponding reducers can be implemented in a similar manner to the data partitioning method described by Zhang et al. [18].Algorithm 1 demonstrates how to find the target reducers that correspond to an input record.Depending on the number of join attributes , we may assume that there is an -dimensional space with integer coordinates, where each dimension represents each join attribute and a position represents a reducer.Note that  is the number of join attributes, rather than the number of input datasets, which was the case in a previous study [18] that aimed to process theta joins.Then, the corresponding positions of an input record can be obtained by partitioning the join attribute values of the record in a range from zero to the corresponding replication factor-1.A coordinate for a missing join attribute can be expressed using a special character, that is, " * ." This indicates that the record corresponds to the reducers with all possible values for the coordinate.Next, the positions are converted into integer identifiers of the reducer by adding up the values of each position, which are multiplied by the replication factors for the preceding dimensions.Algorithms 2 and 3 show the pseudocodes for the conversion process.Now, we consider an example of three-way joins between R(a,b), S(b,c), and T(c,d) using the two join attributes shown in Figure 2. We assume that the number of reducers is nine and that there are three replication factors for both R and T. The positions that correspond to each record of R, S, and T in the figure are ⟨1, * ⟩, ⟨1, 2⟩, and ⟨ * , 2⟩, respectively.⟨1, * ⟩ represents the positions ⟨1, 0⟩, ⟨1, 1⟩, and ⟨1, 2⟩.If we assign each reducer with an integer identifier from zero to eight incrementally, starting from the top-left cell in a vertical direction, the identifiers of the reducers that correspond to the records for R are 1, 4, and 7. Similarly, because ⟨2⟩ represents the positions ⟨0, 2⟩, ⟨1, 2⟩, and ⟨2, 2⟩, the identifiers of the reducers that correspond to the records of T are 6, 7, and 8.The position of the record of S is ⟨1, 2⟩, so the identifier of its corresponding reducers is 7. Thus, these records are gathered and joined by the reducer with the identifier 7.

MapReduce Functions.
A prototype MFR-Join framework has been implemented to create and probe filters using the keys of map output records.To process distinct attribute multi-way joins with multiple join attributes, the keys need to be separated with a delimiter, which is configured using the additional parameter mapred.filter.key.delimiter.The target reducers for a record can be found using Algorithm 1, which is described in Section 4.2.Algorithm 4 is the pseudocode for the map function used in multiway joins.
The records generated by the map function are then processed by the MFR-Join framework, as explained in Section 3. Some redundant records will be filtered out, depending on the processing order of the input datasets.The map output records that passed the filters have been gathered in the corresponding reducers by their reducer identifiers.Using the reduce function, the records are classified based on the tag representing their original dataset and they are joined with traditional join algorithms.Algorithm 5 is the pseudo-code for the reduce function in multi-way joins.

Performance Evaluation
In this section, we present our experimental results for common and distinct attribute joins.All of the experiments were run on a cluster of 11 machines, which comprised one jobtracker and 10 tasktrackers.Each machine had a 3.1 GHz quad-core CPU, 4 GB RAM, and a 2 TB hard disk.The operating system was 32-bit Ubuntu 10.10 and the Java version used was 1.6.0 26.
Our proposed framework was implemented in Hadoop 0.20.2.The Hadoop distributed file system (HDFS) was set to use 128 MB blocks and to replicate them three times.Each tasktracker could run three map tasks and one reduce task simultaneously.The I/O buffer was set to 128 KB and the memory used to sort the data was set to 200 MB.

Common Attribute Joins.
We used TPC-H benchmark [19] datasets with a scale factor of 100 for evaluation.The scale factor was the size of the entire database in gigabytes.We performed a join between three tables in the database, that is, part, partsupp, and lineitem, which had a common attribute, that is, partkey.The sizes of the datasets are shown in Table 1.
To control the amount of joined records, the selection predicate ps.supplycost <   was added to the query.The attribute ps.supplycost had a decimal value in the range of 1.0 to 1000.0.We ran the query and changed the predicate value   in the predicate with increments of one hundred.Thus, the ratio of records that satisfied the predicate in partsupp   was changed by about 10%.We compared the performance of our join method with a repartition join without filters, because common attribute joins can be processed without replication in a single MapReduce job.In our method, simple hash filters with a size of 8 Mb were used and the input datasets were processed in the following order: part, partsupp, and lineitem.
We compared the performance of our techniques to that of the existing repartition join [12] because input records do not need to be replicated.Although there are a few existing techniques such as the semijoin [12] with Bloom filters [2], we did not compare them as our previous paper showed that MFR-Join outperforms them [7], and they do not support multiway joins directly.Figure 5 shows the execution times and the sizes of the intermediate results for the test queries.The results showed that our method significantly outperformed the repartition join for all of the test cases in Figure 5(a).This is because large numbers of redundant intermediate results from the lineitem dataset are filtered out using our method, as shown in Figure 5(b).The lineitem dataset has no selection predicate, so the repartition join has to generate the entire dataset as intermediate results.

Distinct Attribute Joins.
For distinct attribute joins, the TPC-H benchmark [19] datasets were also used, but with a scale factor of 300.We performed the following join query, which was extracted from TPC-H Q2, between the following five tables: nation, region, supplier, part, and partsupp, where the sizes are as shown in Table 2.The two tables, nation and region, contained just a few records, so we treated the joins of the tables as in-memory hashing and excluded the two tables from Table 2    Similar to the test query for common attribute joins, we added a selection predicate, p.size <=   , to control the amount of joined records.The attribute p.size had an integer value in the range of 1 to 50 and we ran the query by changing the predicate value   with increments of five.Thus, the ratio of records that satisfied the predicate in part   was changed by about 10%.Distinct attribute joins required the replication of some input datasets, so we selected the best from the results using all possible combinations of the replication factors.
We compared the performance of our multi-way join method (denoted as 3-way MFR-Join) with that of the basic multi-way join (denoted by 3-way replicate join) and with that of the cascade of two-way joins with and without filters (denoted by Cascade 2-way MFR-Join and Cascade 2-way join, resp.).Simple hash filters with a size of 8 Mb were also used for the multi-way join and the cascade of two-way MFR-Join.In the two-way joins, supplier and partsupp were joined first, before the intermediate join results and part were joined.Figure 6 shows the execution times and intermediate result sizes for the test queries.The results of our three-way MFR-Join with the star-dim filtering pattern had the best performance with the queries.The multi-way joins outperformed two-way joins for the test queries shown in Figure 6(a).The cascade of two-way joins processes the join queries in two MapReduce jobs, which means that they must write the intermediate results of the first join to HDFS, before reading them from HDFS.Furthermore, there are additional costs of initializing and cleaning up a job.In twoway and multi-way joins, our MFR-Join methods with filters delivered better performance than the basic join methods without filters.This was because large numbers of redundant intermediate results from the partsupp dataset, which had no selection predicate, were filtered out by the MFR-Join, as shown in Figure 6(b).
Figure 7 shows the experimental results obtained with our three-way MFR-Join using the filtering pattern.In Figure 7(a), the star-dim pattern delivered the best performance compared with the other patterns using the test queries.This was because the fact table, partsupp, was much larger than the dimension tables, supplier and part, in the test datasets.As shown in Figure 7(b), the number of intermediate results decreased most with the star-dim pattern.In particular, it should be noted that the star-fact pattern did not decrease the number of intermediate results at all.The join attributes in the queries were the foreign keys in the databases.Furthermore, no selection predicate was specified for the partsupp dataset in our test queries, so the records in partsupp did not play a role in filtering.The increase in the amount with a   value of 15 was caused by the difference in the replication factors with the best execution time.Thus, the execution times were increased slightly by creating, merging, and probing the filters needlessly.We consider that each filtering pattern may be effective in different cases, depending on the sizes of the input datasets and the ratios of joined records.Therefore, it is important to apply the filters using an advantageous pattern.If our methods are combined with upper-layer data warehouse systems, such as Hive [20], this Mathematical Problems in Engineering   could be determined using its optimizer module based on statistical information related to the stored tables.This task will be addressed in our future work.

Conclusions
In this study, we developed methods to improve the performance of multi-way joins by applying filters.A set of filters is created and applied in turn to achieve common attribute joins and multiple sets of filters are used in various patterns, which depend on the processing order of input datasets, thereby producing distinct attribute joins.We also provide specific details for assigning reducers and writing map/reduce functions using our basic framework.We compared our proposed approach with basic multiway joins and the cascade of two-way joins.The experimental results showed that our approach improves the execution time significantly by reducing the amount of intermediate results when small portions of input datasets are joined.In future work, we plan to integrate our framework with data warehouse systems that provide query languages, such as Hive, and we aim to modify the optimizer module that supports multi-way joins to exploit filters using an appropriate pattern.
Star-dim pattern

Figure 5 :
Figure 5: Performance of common attribute joins.
replicate join 3-way MFR-Join Cascade 2-way join Cascade 2-way MFR-Join (b) Intermediate result sizes

Figure 7 :
Figure 7: Performance of distinct attribute joins with the filtering pattern.
⊳ V: an input record from the th dataset ⊳ : replication factors that are pre-computed or pre-defined in the  phase (1) procedure map(, V) (2) extract the join attribute values [1, . . ., ] by parsing the input record V

Table 1 :
Test datasets for common attribute joins. .

Table 2 :
Test datasets for distinct attribute joins.