MapReduce-Based Dynamic Partition Join with Shannon Entropy for Data Skewness

Join operations of data sets play a crucial role in obtaining the relations of massive data in real life. Joining two data sets with MapReduce requires a proper design of the Map and Reduce stages for different scenarios. The factors affecting MapReduce join efficiency include the density of the data sets and data transmission over clusters like Hadoop. This study aims to improve the efficiency of MapReduce join algorithms on Hadoop by leveraging Shannon entropy to measure the information changes of data sets being joined in different MapReduce stages. To reduce the uncertainty of data sets in joins through the network, a novel MapReduce join algorithm with dynamic partition strategies, called dynamic partition join (DPJ), is proposed. Leveraging the changes of entropy in the partitions of data sets during the Map and Reduce stages revises the logical partitions by changing the original input of the Reduce tasks in the MapReduce jobs. Experimental results indicate that the entropy-based measures can capture the entropy changes of join operations. Moreover, the DPJ variant methods achieved lower entropy than the existing joins, thereby increasing the feasibility of MapReduce join operations for different scenarios on Hadoop.


Introduction
Join algorithms between two data sets in stand-alone relational databases have been optimized for years; meanwhile, the increasing needs of big data analysis have resulted in the emergence of various types of parallel join algorithms [1]. In the era of big data, such join operations on large data sets should be performed in existing distributed computing architectures, such as Apache Hadoop; that is, efficient joins must follow the scheme of the programming model and require an extended revision of conventional joins for these architectures [2]. At present, the Hadoop system is able to process big data rapidly [3]. Joining two data sets with MapReduce requires a proper design of the Map and Reduce stages for different scenarios. The factors affecting MapReduce join efficiency include the density of the data sets and data transmission over clusters like Hadoop. Join algorithms should be revised to utilize the parallel computing ability of Hadoop and to account for the sizes of the data sets being joined when performing joins on large data sets. However, only a few of these algorithms have considered the data skewness of scientific data sets to optimize join operations. There exists difficulty in estimating the amount of information within heterogeneous data sets for each join. The skewness causes uneven distributions and a severe load-imbalance problem; join operations on big data can even make the skewness grow exponentially [4]. Since Shannon's theory indicates that information is a measurable commodity, the MapReduce stages and the transmission of data sets over Hadoop clusters can be treated as a message channel between senders (mappers) and receivers (reducers). A data set to be joined is considered a memoryless message source that contains a set of messages, each with its own probability, to send to receivers.
This study aims to improve the efficiency of MapReduce join algorithms on Hadoop by leveraging Shannon entropy to measure the information changes of data sets being joined in different MapReduce stages. Our study highlights that leveraging the changes of entropy in the partitions of data sets during the Map and Reduce stages revises the logical partitions by changing the original input of the Reduce tasks in Hadoop MapReduce jobs. Such an action is particularly beneficial when different operations are required to handle different customized join operations for data with the same key on Hadoop. The remainder of this paper is organized as follows. The next section summarizes the related work in this research field. Section 3 first leverages Shannon entropy to model the data intensity of data sets in a MapReduce join operation and then proposes a novel join algorithm with dynamic partition strategies for two-way joins by optimizing the partitions of data sets with skewed data intensity for Reduce tasks. Section 4 conducts a series of experiments on different settings of data set sizes to compare the proposed method with the methods in the literature. Then, Section 5 discusses our work in practice. Finally, the work is concluded in Section 6.

Related Work
MapReduce is the most widely used programming model in Hadoop. It uses the Map, Combine, Shuffle, and Reduce stages to divide numerous data sets into small partitions and reduces the merged and sorted partitions on a cluster of computer nodes [5]. The Hadoop MapReduce architecture has contributed immensely to existing big-data-related analyses. Joining large data sets, which is complex and time consuming, is one of the most important uses of Hadoop [6]. Applications of join operations include privacy-preserving similarity [7], parallel set-similarity [8], and fuzzy joins [9], which focus on similarity estimation.
Join operations in MapReduce are time consuming and cumbersome. During a join operation in a Hadoop cluster, a mapper reads a partition from data sets and writes pairs of join keys and their corresponding values to intermediate files.
Thereafter, the intermediate results are merged and sorted to be fed into numerous reducers. Eventually, each reducer, which performs multiple Reduce tasks, joins two sets of records with the same key. Join optimization has encountered considerable bottlenecks in reducing time cost and increasing the feasibility of dynamically partitioning data for different purposes. The main reason is that the Shuffle stage on Hadoop is performed over a cluster network. With hardware upgrades, the GPU is also very promising for improving the performance of join operations in a hybrid framework [10]. Existing join processing using the MapReduce programming model includes k-NN [11], Equi- [12], Theta- [13], Similarity [14], Top-K [15], and filter-based joins [16]. To handle the problem that existing MapReduce-based filtering methods require multiple MapReduce jobs to improve the join performance, an adaptive filter-based join algorithm has been proposed [17]. The MapReduce framework is also suitable for handling large-scale high-dimensional data similarity joins [18]. For example, spatial join is able to perform data analysis in geospatial applications, which contain massive geographical information in a high-dimensional form [19]. The various types of join algorithms, classified according to the side on which the MapReduce join operation is actually performed on Hadoop, include the map-side and reduce-side joins. Such algorithms are also known as repartition, broadcast, and replicated joins in accordance with their features [20]. Join algorithms utilize the parallel computing ability of Hadoop and the sizes of the data sets being joined when performing joins on large data sets. In general, the Bloom filter is able to reduce workload, such as by minimizing nonjoining data and reducing communication costs [21]. However, only a few of these algorithms have considered the data skewness of scientific data sets to optimize join operations.
The reason is the difficulty in estimating the amount of information within heterogeneous data sets for each join. Chen et al. [22] promoted metric similarity joins by using two sampling-based partition methods to enable equal-sized partitions. In general, a data set contains different amounts of relations among data fields, thereby causing differences in the workloads of Reduce tasks fed into a reducer, which differs from traditional joins. To reduce the complexity of constructing the index structure by using see-based dynamic partitioning, a MapReduce-based kNN-join algorithm for query processing has been proposed [23]. The partition phase is also a focus of existing research, to produce balanced and meaningful subdivisions of the data sets [24]. Without effectively estimating such a difference of workload in two data sets with proper measures, joining data sets with skewed data intensity affects join performance and its feasibility.
Shannon entropy in information theory, which initially measured the information of communication systems, characterizes the impurity of arbitrary collections of examples [25]. The entropy of a variable is used to estimate the degree of information or uncertainty inherent in the scope of the variable's outcomes. It was originally used to estimate the information value of a message between different entities. However, entropy is also useful for measuring diversity. At present, this theory has shaped virtually all systems that store, process, or transmit information in digital forms (e.g., performing knowledge reduction on the basis of information entropy using MapReduce) [26]. Chaos of data is the complete unpredictability of all records in a data set and can be quantified using Shannon entropy [27]. Therefore, a MapReduce-based join is naturally regarded as an entropy-based system for storing, processing, and transmitting data sets. With the use of hyperparameter optimization with automated tools, MapReduce-based joins can be easily applied in big data production environments [28]. In the future, automated join optimizer systems will be able to optimize the hyperparameters of join operations and obtain the most efficient join strategy for various big data applications [29].

Materials and Methods
We first illustrate the concept of Shannon entropy of data sets to be joined using MapReduce and then present a framework to evaluate the entropy of data sets for existing join algorithms using MapReduce. A novel join algorithm with dynamic partition strategies is also proposed.

Shannon Entropy of Data Sets.
Shannon's theory indicates that information is a measurable commodity. Considering the entropy H of a discrete random variable X = {x_1, . . ., x_n} with probability mass function P(X), we obtain H(X) = E[I(X)] = E[−log(P(X))] according to Shannon's theory. Therefore, MapReduce stages and the transmission of data sets over Hadoop clusters can be treated as a message channel between senders (mappers) and receivers (reducers). A data set to be joined is a memoryless message source that contains a set of messages, each with its own probability, to send to receivers. Figure 1 illustrates a toy example of two data sets to be joined in MapReduce-based joins. A message in a data set is assumed to be a record (a row) with data fields. For example, given two data sets (i.e., left data set (L) with data fields D and T, and right data set (R) with data fields C and T), T is the join key between L and R. A two-way join operation discussed in this study joins the two data sets into a new data set containing data fields D, C, and T, where T ∈ L and T ∈ R.
DS is a data set with two data fields (D and T), similar to the left data set in Figure 1, where each record stands for an instance of a relation between D and T. The amount of information of a record (r) is regarded as the self-information (I) of the record in entropy. Thus, I(r) = −log(p_r), where p_r stands for the probability of r within DS. Therefore, the total information of DS is obtained as follows:

H(DS) = −∑_{i=1}^{N} p(r_i) log p(r_i) = −∑_{i=1}^{N} (1/N) log(1/N) = log N,  (1)

where r_i represents the ith of the unique records from DS and N is the number of unique records. H(DS) is called the first-order entropy of DS, which is also regarded as the degree of uncertainty within DS. Equation (1) is ideally an extreme case without skewed data intensity. In general, supposing that p_i is the probability of a relation i in DS, we have the following equation:

H(DS) = −∑_{i=1}^{N} p_i log p_i,  (2)

where p_i is the number of occurrences of the records divided by N. The maximum entropy of DS is H(DS) = log N if all records have equivalent p_i. The difference between log N and H(DS) in (2) is called the redundancy of the data set. Equations (1) and (2) leverage Shannon entropy to estimate the uncertainty of data sets or their partitions using MapReduce on Hadoop clusters. Figure 2 illustrates the information changes of data sets in the different stages of MapReduce. Four data accessing points, namely, mapper input (MI), mapper output (MO), reducer input (RI), and reducer output (RO), are evaluated using Shannon entropy. Two data sets are input into separate mappers and output to an intermediate file. MI and MO are the input and output, respectively, of a mapper. After the Shuffle stage, the merged and sorted outputs of a mapper are fed into numerous reducers. RI and RO are the input and output, respectively, of a reducer.
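The entropy and redundancy of equations (1) and (2) can be sketched in a few lines of Python. This is an illustrative reading, not the paper's code; here redundancy is computed against log2(N) with N taken as the number of unique records, and the example data are made up:

```python
import math
from collections import Counter

def entropy(records):
    """First-order Shannon entropy of a data set, equation (2):
    H(DS) = -sum_i p_i * log2(p_i), where p_i is the relative
    frequency of each distinct record."""
    counts = Counter(records)
    total = len(records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def redundancy(records):
    """Difference between the maximum entropy log2(N) (the uniform,
    skew-free case of equation (1)) and the observed entropy."""
    n_unique = len(set(records))
    return math.log2(n_unique) - entropy(records)

# A skew-free data set of 4 unique (D, T) records reaches log2(4) = 2 bits.
uniform = [("d1", "t1"), ("d2", "t2"), ("d3", "t3"), ("d4", "t4")]
print(entropy(uniform))     # 2.0
print(redundancy(uniform))  # 0.0
```

A skewed data set (repeated records) yields an entropy below log2(N) and hence a positive redundancy, which is the quantity the proposed measures exploit.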
Supposing that a data set is divided into m partitions after MI, the entropy of a partition (Partition_i) can be defined as follows:

H(Partition_i) = −∑_j p_{i,j} log p_{i,j},  (3)

where p_{i,j} = m(i, j)/m(i) is the probability that a record of partition i belongs to relation j, m(i, j) is the number of records in partition i with relation j, and m_i is the number of records in partition i. The total entropy H(PS) of a group of partitions PS after MI is defined as follows:

H(PS) = ∑_i (m_i/m) H(Partition_i),  (4)

where m_i is the number of records in partition i and m is the total number of records in the data set. Equations (3) and (4) also represent how the total entropy of a group of partitions is calculated after MO.
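Equations (3) and (4) can be sketched as follows; this is a minimal illustration in which a "relation" is identified by the join-key field of a record, and the toy partitions are our own:

```python
import math

def partition_entropy(partition, relation_of):
    """Equation (3): H(Partition_i) = -sum_j p_ij * log2(p_ij), where
    p_ij = m(i, j) / m(i) is the fraction of partition i's records
    that belong to relation j."""
    m_i = len(partition)
    counts = {}
    for record in partition:
        j = relation_of(record)
        counts[j] = counts.get(j, 0) + 1
    return -sum((c / m_i) * math.log2(c / m_i) for c in counts.values())

def total_entropy(partitions, relation_of):
    """Equation (4): the record-weighted sum of (m_i / m) * H(Partition_i)."""
    m = sum(len(p) for p in partitions)
    return sum(len(p) / m * partition_entropy(p, relation_of)
               for p in partitions)

# Relation = the join-key field T of a (D, T) record.
parts = [[("d1", "t1"), ("d2", "t1")], [("d3", "t2"), ("d4", "t3")]]
key = lambda r: r[1]
print(partition_entropy(parts[0], key))  # 0.0: a single relation only
print(total_entropy(parts, key))         # 0.5
```

A partition that holds records of a single relation has zero entropy; mixing relations within a partition raises the weighted total.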
To measure the impurity of a group of partitions from a data set, the information gain (G) is used to estimate the gain of entropy when a join algorithm divides the data set into a group of partitions (PS) following the pipeline of MI, MO, RI, and RO. For example, if the gain of entropy from MI to MO increases, then the aggregated output of the mappers contains high uncertainty of the data sets. Accordingly, the reducers need more capability to complete further join operations. If PS is a group of partitions, then we have the following equation:

G(DS, PS) = H(DS) − ∑_y (|DS_y|/|DS|) H(DS_y),  (5)

where DS_y is a partition of DS and |DS| represents the number of records in DS. Therefore, the information change between the input and output of the mappers is denoted as G(MI, MO).
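The information gain of equation (5) follows directly from the entropy definition; a minimal sketch with a made-up data set:

```python
import math
from collections import Counter

def entropy(records):
    """H(DS) = -sum_i p_i * log2(p_i) over distinct records."""
    n = len(records)
    return -sum((c / n) * math.log2(c / n) for c in Counter(records).values())

def information_gain(ds, partitions):
    """Equation (5): G(DS, PS) = H(DS) - sum_y (|DS_y| / |DS|) * H(DS_y)."""
    weighted = sum(len(p) / len(ds) * entropy(p) for p in partitions)
    return entropy(ds) - weighted

ds = ["a", "a", "b", "b"]
# Partitioning by value removes all uncertainty: the gain equals H(DS) = 1 bit.
print(information_gain(ds, [["a", "a"], ["b", "b"]]))  # 1.0
# The trivial partitioning (the data set itself) gains nothing.
print(information_gain(ds, [ds]))  # 0.0
```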

MapReduce-Based Evaluation for Join Algorithms.
Evaluating the information changes of data sets during MapReduce join operations is difficult when handling data sets with massive contents. Therefore, a five-stage

Scientific Programming
MapReduce-based evaluation framework for MapReduce join algorithms is proposed (see Figure 3). The framework with Stages 1 to 5 is designed by using the MapReduce programming model to overcome the difficulty posed by the data set sizes. In Stage 1, the entropy of the two data sets is calculated from the Left Mapper for the left data set and the Right Mapper for the right data set. Thereafter, the merged and sorted results are fed into the reducers. Eventually, H(left data set) and H(right data set) are obtained, and H(MI) is defined as in equation (6):

H(MI) = H(left data set) + H(right data set).  (6)

In Stage 2, the entropy of the intermediate output of the mappers for the left and right data sets in a join algorithm is obtained as H(MO of left data set) and H(MO of right data set). Thereafter, the total entropy H(MO) is obtained in the same way, as in equation (7):

H(MO) = H(MO of left data set) + H(MO of right data set).  (7)

In Stage 3, the intermediate results after the Shuffle stage in the join algorithm are evaluated as H(RI). In Stage 4, the entropy of RO in the join algorithm is obtained. Lastly, all entropy values of Stages 1 to 4 are summarized in Stage 5.
The entropy-based metrics representing the information changes of the data sets in a join operation are defined as H(MI), H(MO), H(RI), and H(RO). Comparing these measures illustrates the performance of different MapReduce join algorithms in terms of change of entropy and accordingly reflects the efficiency of their join operations.
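The four access points can be traced in a single-process simulation of a repartition join; this is an illustrative sketch (toy records, source tags "L"/"R" of our own choosing), not the framework's actual MapReduce jobs:

```python
import math
from collections import Counter, defaultdict

def H(items):
    """First-order entropy of a list of hashable items."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

# Toy repartition join; records are (join key, value) pairs.
left = [("t1", "d1"), ("t1", "d2"), ("t2", "d3")]
right = [("t1", "c1"), ("t2", "c2")]

mi = left + right                          # MI: mapper input
mo = ([(k, ("L", v)) for k, v in left]     # MO: source-tagged mapper output
      + [(k, ("R", v)) for k, v in right])
groups = defaultdict(list)
for k, v in mo:                            # Shuffle: merge and sort by key
    groups[k].append(v)
ri = [(k, tuple(vs)) for k, vs in sorted(groups.items())]  # RI: reducer input
ro = [(k, lv, rv)                          # RO: per-key cross-product
      for k, vs in ri
      for side_l, lv in [v for v in vs if v[0] == "L"]
      for side_r, rv in [v for v in vs if v[0] == "R"]]

for name, data in [("H(MI)", mi), ("H(MO)", mo), ("H(RI)", ri), ("H(RO)", ro)]:
    print(name, round(H(data), 3))
```

Comparing the four printed values for different join strategies is exactly the kind of side-by-side comparison the five-stage framework automates at scale.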

MapReduce Join Algorithms.
The existing join algorithms have advantages in different application scenarios [30]. The entropy-based framework in Figure 3 can be integrated into these algorithms to quantify the uncertainty of the intermediate results produced in the Map and Reduce stages in Hadoop. To simplify the descriptions of these algorithms, Figure 4 illustrates toy examples of two data sets to be joined in different MapReduce join algorithms. In the figure, T 1 and T 2 are two toy data sets, where the values of the join key are denoted in the figure. A reduce-side join, which is also called sort-merge join, in Figure 4(a) preprocesses T 1 and T 2 to organize them in terms of join keys.
Thereafter, all tuples are sorted and merged before being fed into the reducers on the basis of the join keys, such that all tuples with the same key go to one reducer. This join works only for equi-joins. A reducer in this join eventually receives a key and its values from both data sets. The number of reducers is the same as the number of keys generated from the mappers. Thereafter, the Hadoop MapReduce framework sorts all keys and passes them to the reducers. Tuples from one data set come before those of the other data set because the sorting is done on composite keys. Lastly, a reducer performs a cross-product between records to obtain the joining results.
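The reduce-side join described above can be condensed into a single-process sketch; the shuffle is modeled by an in-memory dictionary, which is an assumption of this illustration rather than Hadoop's actual sorted intermediate files:

```python
from collections import defaultdict

def reduce_side_join(left, right):
    """Sketch of a reduce-side (sort-merge) equi-join: mappers tag each
    record with its source, the shuffle groups tagged records by join
    key, and each reducer cross-products the two sides for its key."""
    shuffled = defaultdict(lambda: ([], []))
    for key, value in left:            # Left Mapper output
        shuffled[key][0].append(value)
    for key, value in right:           # Right Mapper output
        shuffled[key][1].append(value)
    joined = []
    for key in sorted(shuffled):       # one Reduce call per key
        l_values, r_values = shuffled[key]
        joined += [(key, l, r) for l in l_values for r in r_values]
    return joined

left = [("t1", "d1"), ("t1", "d2"), ("t2", "d3")]
right = [("t1", "c1"), ("t3", "c2")]
print(reduce_side_join(left, right))
# [('t1', 'd1', 'c1'), ('t1', 'd2', 'c1')]
```

Note how every value for key t1 meets in one reduce call; this is precisely the behavior that concentrates workload on a single reducer under skew.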
A map-side join in Figure 4(b) is used when one data set is relatively smaller than the other, assuming that the smaller data set can fit into memory easily [31]. The map-side join is also called memory-backed join [32]. In this case, the algorithm initially replicates the small data set through the distributed cache of Hadoop to all the mapper sides, where the data from the smaller data set is loaded into the memory of the mapper hosts before the Map tasks are executed. When a record from the large data set is input into a mapper, the mapper looks up the join key in the cached data set and fetches the corresponding values from the cache. Thereafter, the mapper performs the join operation inside the mapper and writes the output directly to HDFS without using reducers.
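A minimal sketch of the map-side join follows; the in-memory dict stands in for Hadoop's distributed cache, and the generator stands in for a mapper streaming over its split (both are assumptions of this illustration):

```python
def map_side_join(small, large):
    """Sketch of a map-side (memory-backed) join: the small data set is
    broadcast to every mapper (Hadoop's distributed cache) and loaded
    into a hash map, so each large-side record is joined locally with
    no shuffle and no reducers."""
    cache = {}                          # replicated small data set
    for key, value in small:
        cache.setdefault(key, []).append(value)
    for key, value in large:            # streamed through the mappers
        for cached in cache.get(key, []):
            yield (key, value, cached)

small = [("t1", "c1"), ("t2", "c2")]
large = [("t1", "d1"), ("t2", "d2"), ("t3", "d3")]
print(list(map_side_join(small, large)))
# [('t1', 'd1', 'c1'), ('t2', 'd2', 'c2')]
```

Because no intermediate data crosses the network, this join avoids the Shuffle stage entirely, which is why it wins whenever the small side fits in memory.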
When the small data set cannot fit into memory, the cached table in the framework is replaced by a hash table or a Bloom filter [33] to store the join keys from the aforementioned data set. Eventually, join performance is improved by reducing the unnecessary transmission of data through the network, which is called semi-reduce-side join (see Figure 4(c)). This join is used when the join keys of the small data set can fit into memory. Thereafter, a reducer is responsible for the cross-product of records, similar to the reduce-side join. The Bloom filter is used to solve the problem in which Hadoop is inefficient at performing the join operation because the reduce-side join constantly processes all records in the data sets, even when only a small fraction of the data sets is relevant. The probability of a false positive (p) after inserting n elements into the Bloom filter can be calculated as in [34]:

p = (1 − e^{−kn/m})^k,  (8)

where n is the number of elements, k is the number of independent hash functions, and m is the number of bits. The size of the Bloom filter is fixed regardless of the number of elements n. When using the Bloom filter as shown in Figure 4(d), the parameters of the filter should be predefined to ensure that all keys from the small data set have sufficient space for indexing by the Bloom filter.
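Equation (8) can be used directly when predefining the filter parameters. The sketch below also adds the standard optimal choice k = (m/n)·ln 2, which is textbook Bloom filter analysis rather than something stated in this paper:

```python
import math

def false_positive_rate(n, m, k):
    """Equation (8): p = (1 - e^(-k*n/m))^k for a Bloom filter with
    m bits, k independent hash functions, and n inserted join keys."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(n, m):
    """The k minimizing p for fixed n and m (standard result, not from
    the paper): k = (m / n) * ln 2, rounded to a whole number."""
    return max(1, round(m / n * math.log(2)))

n, m = 10_000, 100_000          # 10 bits per join key
k = optimal_k(n, m)
print(k, false_positive_rate(n, m, k))  # 7 hash functions, p ~ 0.008
```

With roughly 10 bits per key, under 1% of nonjoining records slip past the filter, so almost all irrelevant tuples are dropped before the Shuffle stage.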

Join with Dynamic Partition Strategies.
Although the Hadoop framework internally divides the data sets into partitions for different mappers to process, the existing join algorithms, such as the reduce-side join, cannot perform dynamic partitioning on the basis of the data intensity of the data sets to be joined. In the existing algorithms, the output key of a mapper is consistently the join key of the reducers, thereby forcing all values belonging to one key to be fed into a single reducer. This scenario imposes a heavy workload on a reducer when it receives numerous values, whereas other reducers receive considerably fewer values.
To address the aforementioned problem, a novel join algorithm with dynamic partition strategies over the number of records of the data sets, called dynamic partition join (DPJ), is proposed. DPJ aims to dynamically revise the number of Reduce tasks after a mapper's output by introducing a parameter that determines the number of partitions into which the data set should be divided. The join enables users to specify the number of partitions over the data sets to change the number of reducers. A two-phase MapReduce job on Hadoop is required to perform this join. A toy example based on the data sets in Figure 1 is demonstrated as follows.
In the first stage, a counter job calculates the number of records within the two data sets. After the job is completed, the total numbers of records within the data sets are determined and fed into the second job as parameters to perform the join operations. The counter job uses Hadoop counters to obtain the numbers of records; thus, its time cost can be ignored.
In the second stage, a join job is performed to generate the joined results.
This job requires three parameters, namely, the number of partitions (N), the total record number of the left data set (RN_L), and the total record number of the right data set (RN_R). The steps are as follows.
Step 1. Calculate the number of partitions of the left (P_L) and right (P_R) data sets.
After N is determined, the P_L and P_R parameters are determined by using equations (9) and (10) and are passed to the mappers and reducers of the job during the job configuration.
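Equations (9) and (10) are not reproduced in this text, so the sketch below is purely an assumption: it splits the N partitions between the two sides in proportion to their record counts, which is one plausible way to keep the per-partition workload roughly even. The function name and formula are ours, not the paper's:

```python
def partition_counts(n, rn_l, rn_r):
    """Hypothetical stand-in for equations (9) and (10): allot the N
    partitions to the left and right data sets in proportion to their
    record counts RN_L and RN_R, with at least one partition per side."""
    p_l = max(1, round(n * rn_l / (rn_l + rn_r)))
    p_r = max(1, n - p_l)
    return p_l, p_r

# With the S2 record counts from the experiments:
print(partition_counts(10, 205_083, 2_570_810))  # (1, 9)
```

Whatever the exact formulas, the point is that P_L and P_R are fixed once from the counter job's totals and then broadcast through the job configuration.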
Step 2. Set up mappers for the left and right data sets. The left data set is input into the Left Mapper defined in Algorithm 1, and the right data set is input into the Right Mapper defined in Algorithm 2. In the algorithms, [key, value] stands for a record consisting of a specified join key and its values in a data set, L(i) stands for the ith partition of a key from the left data set, R(i) stands for the ith partition of a key from the right data set, LN is a Hadoop counter that counts the number of records in the left data set, and RN is another counter that counts the number of records in the right data set.
Step 3. Set up reducers for the output of the combiners. The reducer defined in Algorithm 3 joins the partitioned records with the same key. Meanwhile, the job reduces the workload of each reducer because the number of Reduce tasks increases. In the algorithm, L stands for the left data set, R stands for the right data set, and k is the original join key located in the third part of the partition key. The proposed DPJ on Hadoop is defined in Steps 1 to 3. Table 1 illustrates a split data analysis based on the toy data sets, and Table 2 shows the corresponding inputs of the reducers.
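Steps 2 and 3 can be sketched in a single-process simulation. The function names mirror Algorithms 1 to 3, but the composite-key replication scheme below is our reading of DPJ (each left sub-partition is paired with every right sub-partition), not the paper's exact code:

```python
from collections import defaultdict

def left_mapper(records, p_l, p_r):
    """Algorithm 1 (sketch): assign each left record to one of P_L
    sub-partitions of its key and replicate it across all P_R right
    sub-partitions, so each composite key (k, i, j) sees a full pair."""
    for count, (k, value) in enumerate(records):
        i = count % p_l
        for j in range(p_r):
            yield ((k, i, j), ("L", value))

def right_mapper(records, p_l, p_r):
    """Algorithm 2 (sketch): symmetric for the right data set."""
    for count, (k, value) in enumerate(records):
        j = count % p_r
        for i in range(p_l):
            yield ((k, i, j), ("R", value))

def reducer(pairs):
    """Algorithm 3: cross-product the two sides of each composite key;
    k is the original join key inside the partition key."""
    groups = defaultdict(lambda: ([], []))
    for (k, i, j), (side, value) in pairs:
        groups[(k, i, j)][0 if side == "L" else 1].append(value)
    for (k, i, j), (ls, rs) in sorted(groups.items()):
        for l in ls:
            for r in rs:
                yield (k, l, r)

left = [("t1", "d1"), ("t1", "d2"), ("t2", "d3")]
right = [("t1", "c1"), ("t1", "c2")]
pairs = list(left_mapper(left, 2, 2)) + list(right_mapper(right, 2, 2))
print(sorted(reducer(pairs)))  # the full per-key cross-product, once each
```

Each join key is now handled by up to P_L x P_R Reduce tasks instead of one, which is how DPJ spreads the workload of a skewed key.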
Various variant methods based on DPJ include DPJ with combiner, DPJ with Bloom filter, and DPJ with both combiner and Bloom filter. The combiner is used to reduce the time cost of the Shuffle stage. The Bloom filter utilizes the join keys of the small data set to reduce the number of unnecessary join operations between the two data sets.
The experimental results are examined under different settings of data set sizes for the existing join algorithms and the variant methods.

Experimentation and Results
To empirically evaluate the aforementioned join algorithms, several experiments are conducted in a Hadoop cluster to join two data sets with different settings of sizes. This section first introduces the detailed environment settings and the data set configurations in our experiments. Then, eight types of join algorithms on Hadoop are defined and used in the performance evaluation. Finally, comparisons of running time cost, information changes of data sets, and performance change over hyperparameters are conducted.

Environment Settings.
Our experiments are performed on a distributed Hadoop architecture. The cluster consists of one master node and 14 data nodes to support parallel computing. Its total configured capacity is 24.96 TB, where each data node has 1.78 TB of distributed storage and 8.00 GB of random-access memory. The performance metrics from the MapReduce job history files are collected for analysis. The process of evaluating the proposed method is as follows. First, the data sets needed for performing MapReduce join operations are uploaded to the Hadoop Distributed File System (HDFS). The data sets have different sizes so as to simulate different workloads of join operations in big data analysis. Then, the proposed join algorithms are revised following the MapReduce scheme in order to run on the Hadoop MapReduce environment. Hyperparameters such as the input and output directories of the data sets and the partition number can be configured flexibly within the join algorithms, so we do not need to revise the same join algorithm for different cases of join operations. After a join algorithm finishes running, the running time cost from its job history file is collected and analysed. Other measures related to CPU, virtual memory, physical memory, and heap use can be collected from the log file of the running MapReduce jobs. Finally, a performance comparison from the perspectives of running time cost, information changes, and tuning performance is conducted.

Algorithm 3 (Reducing for both data sets): for each key k, List(L) = [value] from the left data set and List(R) = [value] from the right data set; for each l in List(L), for each r in List(R), write (k, (l, r)).

Data Sets.
The data sets used in the experiments are synthesized from real-world data sets. The settings cover three types of combinations of data set sizes, namely, Setting 1 (S1), in which both are relatively small data sets; Setting 2 (S2), in which one is a relatively large data set and the other is a relatively small data set; and Setting 3 (S3), in which both are relatively large data sets. The small and large data sets referred to here are relative concepts. For example, a small data set may contain several hundred thousand records, while the larger one may contain over one million records. The data sets have the same width in number of fields for a proper comparison.
In S1, one data set with 15,490 records and another with 205,083 records are used. In S2, one data set with 205,083 records and another with 2,570,810 records are used. In S3, one data set with 1,299,729 records and another with 2,570,810 records are used. The same input data sets and joined results across the experiments ensure that the measures are comparable.

Use of Join Algorithms in the Experiments.
To simplify the references to the algorithms in the following experiments, the abbreviations in Table 3 are used. The join algorithms include map-side join (MSJ), reduce-side join (RSJ), semi-reduce-side join (S-RSJ), semi-reduce-side join with Bloom filter (BS-RSJ), dynamic partition join (DPJ), dynamic partition join with combiner (C-DPJ), dynamic partition join with Bloom filter (B-DPJ), and dynamic partition join with combiner and Bloom filter (BC-DPJ).
Here, C-DPJ, B-DPJ, and BC-DPJ are variant methods of DPJ.
Each join algorithm can configure its hyperparameters in a flexible manner, so we can evaluate the same join algorithm under different settings of hyperparameters. Our objective is to use multiple existing join algorithms and the proposed variant methods to examine the performance change of join operations in different cases.

Comparison of Time Cost between Join Algorithms.
This section presents the comparison of time cost between the algorithms listed in Table 3 in the different cases. Figure 5 illustrates the comparison of the time cost of the joins in S1, S2, and S3. With increasing data set sizes, the time costs of all joins increase. In S1, the difference in time cost is relatively small, but the existing joins perform slightly better than the variant methods of DPJ. With increasing data set sizes, the time cost varies. In S2, MSJ, BS-RSJ, RSJ, and S-RSJ perform better than all DPJ-related variant methods, but B-DPJ is the best among the variant methods. In S3, the performances of BS-RSJ and RSJ are better than those of the other joins. However, the performance of C-DPJ is approximately similar to that of S-RSJ. MSJ performs well in S1 and S2 but fails to run in S3. BS-RSJ performs best in S2 and S3.
Furthermore, we examine the time cost in the different MapReduce stages over the different settings of data sets. Table 4 summarizes the time costs in the different MapReduce stages in S1, S2, and S3. BS-RSJ in S1 and S2 requires the largest setup time cost before its launch, although B-DPJ and BC-DPJ, which also use the Bloom filter, do not take substantial time for setup. However, the DPJ-derived methods need additional time during the cleanup. In S1, all time costs of the Map tasks are larger than those of the Reduce tasks. By contrast, the difference between them varies in S2 and S3. The time costs of the Map stage of B-DPJ and RSJ in S1 are larger than those of the Reduce stage in S2 and S3. The time cost of the Shuffle and Merge stages in S3 is less than those in S1 and S2. The time cost of MSJ in S3 is substantially larger than that of any other method owing to its failure to run in the cluster.

Comparison of Information Changes over Data Set Sizes.
To investigate the amount of information changes for different data set sizes during join operations, we used the proposed MapReduce-based entropy evaluation framework to estimate information changes and compare entropy in the different MapReduce phases. The information changes of each stage in the MapReduce join algorithms, measured with the entropy-based measures, are estimated in S1, S2, and S3 (see Figure 6). The inputs and outputs of the different joins are comparable because all MIs and ROs in S1, S2, and S3 are equivalent. Figure 6 shows that the entropy in MO and RI varies. Therefore, the joins change the amount of information during the Map and Reduce stages owing to their different techniques. The majority of the DPJ variant methods have lower entropy values than the other existing joins. Hence, the uncertainty degree of the data sets and partitions has changed, thereby reducing the workload of the Reduce tasks during MO and RI.

Comparison of Performance in Different DPJ Variants.
To compare the performances of the different DPJ variant methods, we conduct several experiments to examine DPJ, C-DPJ, B-DPJ, and BC-DPJ in terms of different performance measures, such as cumulative CPU, virtual memory, physical memory, and heap usage, over the data set settings. The usage of cumulative CPU, virtual memory, physical memory, and heap is summarized in Figure 7. From the perspective of resource usage in the cluster, Figure 7 shows that B-DPJ using the Bloom filter in Figure 7(b) uses fewer CPU resources than DPJ in Figure 7(a). However, the virtual memory of B-DPJ is larger than that of DPJ in the Reduce tasks. C-DPJ using combiners in Figure 7(c) uses more CPU resources than DPJ, but the other metrics remain similar. BC-DPJ using the Bloom filter and combiner in Figure 7(d) uses larger physical memory than DPJ, B-DPJ, and C-DPJ. However, the usage of the other resources remains similar.

Table 3: Abbreviations of the join algorithms used in this study.

Name of join algorithm                                 | Abbreviation
Map-side join                                          | MSJ
Reduce-side join                                       | RSJ
Semi-reduce-side join                                  | S-RSJ
Semi-reduce-side join with Bloom filter                | BS-RSJ
Dynamic partition join                                 | DPJ
Dynamic partition join with combiner                   | C-DPJ
Dynamic partition join with Bloom filter               | B-DPJ
Dynamic partition join with combiner and Bloom filter  | BC-DPJ

Discussion
This study introduces Shannon entropy from information theory to evaluate the information changes of the data intensity of data sets during MapReduce join operations. A five-stage evaluation framework for evaluating entropy changes is successfully employed in evaluating the information changes of joins. The current research likewise introduces DPJ, which dynamically changes the number of Reduce tasks and accordingly reduces the entropy of joins. The experimental results indicate that the DPJ variant methods achieve the same joined results with relatively smaller time costs and a lower entropy compared with the existing MapReduce join algorithms.
DPJ illustrates the use of Shannon entropy in quantifying the skewed data intensity of data sets in join operations from the perspective of information theory. The existing common join algorithms, such as MSJ and BS-RSJ, perform better than the other joins in terms of time cost (see Figure 5). Gufler et al. [35] proposed two load balancing approaches, namely, fine partitioning and dynamic fragmentation, to handle skewed data and complex Reduce tasks. However, the majority of the existing join algorithms have failed to address the problem of skewed data sets, which may affect the performance of joining data sets using MapReduce [36]. Directions for future research related to join algorithms include exploring indexing methods to accelerate join queries and designing optimization methods for selecting appropriate join algorithms [37]. Afrati and Ullman [38] studied the problem of optimizing shares given a fixed number of Reduce processes to determine the map-key and the shares that yield the least replication. A key challenge of ensuring a balanced workload on Hadoop is to reduce partition skew among reducers without detailed distribution information on the mapped data. Shannon entropy has been used in handling the heterogeneity of nodes in virtualized clusters and new data sets [39]. The MapReduce stages (see Figure 2) are considered a channel between senders and receivers. Therefore, the information change among the different stages reflects the efficiency of the information processing tasks in the mappers and reducers. A large entropy of input producing a small entropy of output often stands for a low uncertainty degree of the generated output data, which is confirmed in the comparison in Figure 6.
DPJ utilizes the number of partitions as a parameter to adjust the number of Reduce tasks; this process is based not on the join keys of a data set but on a dynamic parameter. Wang et al. [40] create many micropartitions and gradually gather statistics on their sizes during mapping. The parameter governs logical rather than physical partitions. In particular, this parameter enables the proper processing of skewed data sets to be joined. DPJ does not change the number of reducers set during job configuration because a Reduce task is merely an instance of a reducer. Figure 8 indicates that the DPJ variant methods should be optimized by increasing the number of reducers to reduce their time cost and achieve a time cost similar to that of S-RSJ. Therefore, the workload of each Reduce task using DPJ can be divided into several subtasks for one join key (see Tables 1 and 2). Such a division is particularly beneficial when different operations are required to handle different customized join operations for small partitions with the same key.

Our methods do not aim to replace the existing MapReduce join algorithms for the various application scenarios on Hadoop. The common join algorithms, namely, MSJ, RSJ, S-RSJ, and BS-RSJ (see Figure 4), work in different types of join operations for data sets without prior knowledge; evidently, so does our method. However, if the schema and knowledge of the data sets are known, then the Trojan Join running on Hadoop++ has advantages for joining data sets in this specific scenario [41]. This join copartitions the data sets at load time and groups them on any attribute other than the join attributes. Otherwise, if we are working on two-relation joins, then the most cost-effective approach is a broadcast join, such as S-RSJ and BS-RSJ (see the results in Figure 5). The hash table in S-RSJ and the Bloom filter in BS-RSJ are used to optimize the broadcast join, but this optimization requires additional setup time in accordance with Table 4. The three variant methods of DPJ use the combiner, the Bloom filter, and both of them, respectively. Figure 5 shows that DPJ performs worse than C-DPJ and BC-DPJ. Therefore, the combiner and the Bloom filter play a substantial role in MapReduce joins. Overall, the criteria for selecting a proper join algorithm rely on prior knowledge of the schema and join conditions, transmission over the network, and the number of tuples. We introduce Shannon entropy to evaluate the information changes between the input and output of mappers and reducers during join operations. The entropy-based measures fit naturally into the existing join operations because they match the definitions of senders and receivers in Shannon's theory: the mapper output becomes the input of the Reduce tasks through the Shuffle stage across the network.
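The logical-partition idea of splitting one join key's workload into several subtasks can be sketched as follows (our illustrative reading of the dynamic partition strategy, not the paper's implementation; the function name and relation tags are hypothetical). A hot key of the large relation is scattered over several composite sub-partition keys, while the small relation's tuples for that key are replicated to every sub-partition so each Reduce task can still complete its share of the join:

```python
import random

def dpj_map_key(relation, key, num_subpartitions, hot_keys):
    """Rewrite a join key into composite (key, sub-partition) keys.

    Tuples of the large relation "L" with a hot key are scattered randomly
    over num_subpartitions logical partitions; tuples of the small relation
    "S" with that key are replicated to all sub-partitions, so the union of
    the sub-partition joins equals the original join for that key.
    """
    if key not in hot_keys:
        return [(key, 0)]
    if relation == "L":  # scatter the large side
        return [(key, random.randrange(num_subpartitions))]
    return [(key, i) for i in range(num_subpartitions)]  # replicate small side

# Hot key "k1": six L-tuples scattered, one S-tuple replicated to 3 subtasks.
hot = {"k1"}
l_out = [k for _ in range(6) for k in dpj_map_key("L", "k1", 3, hot)]
s_out = dpj_map_key("S", "k1", 3, hot)
```

Because the number of sub-partitions is just a parameter, the partitioning stays logical: the configured number of reducers is untouched, and only the keys that route tuples to Reduce tasks change.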
There are threats that can result in different outcomes in our study, and we adopted countermeasures to address them. First, because distributed computing environments like Hadoop are dynamic and complex, the running time of the same join algorithm varies slightly between runs, and Hadoop clusters with different data node settings can vary in job performance. Therefore, we ran the same experiment several times on our experimental cluster and used the mean running time for robust analysis. Second, the data set sizes used for evaluating MapReduce join performance comprise three groups of settings, namely, small, middle, and large sizes. We cannot examine the performance change of every combination of data sizes; therefore, the three groups of settings are used to simulate different join scenarios.
Third, the different data sets we used have various degrees of data skewness that may affect the performance of join operations on Hadoop; we therefore use entropy to evaluate the data skewness for convenient comparison in the experiments. Finally, several measures of MapReduce job performance are used to ensure a comprehensive understanding of job performance. With the aforementioned methods, the threats that exist in our experiments are addressed. The limitations of this study are as follows. First, the different data set sizes used to evaluate two-way join operations serve only as a case study; evidently, these data sets are not sufficiently large to optimize the joining of large data sets in real life. Big data applications in various scenarios based on our methods should be further studied. Second, multiway joins involving more than two data sets, such as the Reduce-Side One-Shot Join, Reduce-Side Cascade Join, and Theta-Join, are not considered [42]. Such multiway joins will be increasingly important in future big data analysis. Third, although the entropy-based measures are appropriate for evaluating the information change of each MapReduce stage, the time cost of transmission, which has a considerable influence on join performance over the network, should be considered in the future. Thus, the measures only reflect the information change of the data intensity during the stages of MapReduce for two-way joins. Multiway joins with different entropy theories should be examined in the future. In addition, multiway join algorithms that consider data skewness in other distributed computing architectures, such as Apache Spark [43], can be further studied on the basis of our research. Nonetheless, this study provides a novel method using MapReduce to achieve logically flexible partitions for join algorithms on Hadoop.

Conclusions
This study aims to improve the efficiency and enhance the flexibility of MapReduce join algorithms on Hadoop by leveraging Shannon entropy to measure the information changes of data sets being joined in different MapReduce stages. To reduce the uncertainty of data sets joined through the network, a novel MapReduce join algorithm with dynamic partition strategies, called DPJ, is proposed. The experimental results indicate that the DPJ variant methods on Hadoop achieve the same joined results with a lower entropy than the existing MapReduce join algorithms. Our study highlights that leveraging the changes of entropy in the partitions of data sets during the Map and Reduce stages revises the logical partitions by changing the original input of the Reduce tasks in Hadoop MapReduce jobs. Such an action is particularly beneficial when different operations are required to handle different customized join operations for data with the same key on Hadoop. In the future, we will examine MapReduce-based multiway join algorithms with different entropy theories and their hyperparameter optimization methods on skewed data sets in distributed computing environments.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.