Research Article
Query Execution Optimization in Spark SQL

Spark SQL is a big data processing tool for structured data query and analysis. However, during the execution of Spark SQL, intermediate data is written to disk multiple times, which reduces the execution efficiency of Spark SQL. Targeting these issues, we design and implement an intermediate data cache layer between the underlying file system and the upper Spark core to reduce the cost of random disk I/O. Using a query pre-analysis module, we dynamically adjust the capacity of the cache layer for different queries, and an allocation module assigns the proper amount of memory to each node in the cluster. Exploiting the sharing of intermediate data in the Spark SQL workflow, this paper proposes a cost-based correlation merging algorithm, which effectively reduces the cost of reading and writing redundant data. This paper develops the SSO (Spark SQL Optimizer) module and integrates it into the original Spark system to achieve the above functions. This paper compares the query performance with the existing Spark SQL on experimental data generated by the TPC-H tool. The experimental results show that the SSO module can effectively improve query efficiency, reduce disk I/O cost and make full use of cluster memory resources.


Introduction
With the increasing popularity of e-commerce, social networks, artificial intelligence and other new Internet applications, the amount of data stored and processed by governments, enterprises and research institutions has increased dramatically. A substation in the power system generates 100,000 alarm records per minute, and Facebook generates more than 400 TB of logs every day. Storing such large-scale data persistently on disk and retrieving the information required by users in a short time is an urgent problem for enterprises and research institutions, so data storage and processing systems for the big data environment have received more and more attention in recent years. The early big data processing systems mainly revolved around the Hadoop platform. To solve the problem that the Hadoop platform frequently read and wrote intermediate data in HDFS, Shi et al. [1] proposed caching Hadoop's Shuffle data in memory. Although this method could effectively reduce the large random disk I/O cost caused by reading and writing intermediate data, it was inflexible: the cache size it created was fixed for all applications, so a certain amount of memory was wasted on applications with a small amount of Shuffle data. To address these problems of the Hadoop platform, the memory-based distributed computing framework Apache Spark emerged.
Spark is a high-speed and versatile big data processing engine. It is built on RDDs [2] (Resilient Distributed Datasets), which implement data distribution and fault tolerance. As shown in Figure 1, the current Spark ecosystem consists of three layers: the bottom layer, the middle layer, and the top layer. The bottom layer can read input data from HDFS [3], Amazon S3, HyperTable and HBase [4]. The middle layer uses resource scheduling and management platforms such as Standalone, Yarn [5] and Mesos [6] to complete the analysis and processing of applications. The top layer consists of a series of advanced tools, including Spark Streaming [7], GraphX [8], BlinkDB [9], MLlib [10] and Spark SQL [11]. Spark SQL evolved from Shark [12]; it provides Hive-like [13] functionality that allows users to process structured data directly by entering SQL statements. Catalyst, which generates and optimizes the execution plan of Spark SQL, performs algebraic optimization on the SQL query statements submitted by users, generates Spark workflows and submits them for execution.
However, the Spark SQL system currently faces two problems. One is that the frequent reading and writing of intermediate data during data interaction between Spark tasks leads to serious random disk I/O cost. The number of intermediate files produced by a simple program on Spark is shown in Figure 2.
The other is that there are no suitable optimization rules for Spark workflows: Spark jobs with intermediate data correlation need to read the same input data from disk repeatedly, resulting in redundant disk I/O cost.
Against this background, this paper aims to improve the execution efficiency of Spark SQL. The main contributions are as follows.
(1) Reduce the random disk I/O cost in the Shuffle phase by adding an intermediate data cache layer between the Spark core layer and the underlying distributed file system.
(2) Reduce the read/write cost of identical intermediate data shared between Spark jobs with a cost-based correlation merging algorithm, further improving the performance of the data analysis system.

System Architecture
The SSO (Spark SQL Optimizer) prototype system is designed and developed; its architecture is shown in Figure 3. The existing Spark system runs in master-slave mode: the Driver process on the master node receives query requests submitted by the Client, and the Worker processes on the slave nodes execute specific query tasks. The SSO system optimizes and extends the original Spark by adding a dynamic Shuffle optimizer module, named Dynamic Shuffle Optimizer, on the master node. Spark's original Catalyst framework is also modified to include a cost model module and is renamed the SQL workflow optimizer. On the slave nodes, reading and writing interfaces to a distributed memory file system are added for the Shuffle Map Task and Shuffle Reduce Task. The process by which the SSO system optimizes user queries is described below.
Queries submitted by users are first parsed by the parser in the SQL workflow optimizer module to form a logical execution plan. The optimized execution plan is then submitted to the Dynamic Shuffle Optimizer and the DAG scheduler. The Dynamic Shuffle Optimizer calculates the size of the intermediate data generated by the optimized SQL queries using the query pre-analysis module, and the allocation module of the cache layer then performs buffer allocation on the distributed memory file system. Finally, the workflow generated by the execution plan is submitted to the DAG scheduler for task assignment on the slave nodes.

Spark Shuffle Intermediate Data Cache Layer
In the existing Spark system, data interaction between Stages frequently generates disk I/O overhead. Therefore, a strategy for caching intermediate data is proposed: the dynamic Shuffle optimizer creates buffers of different sizes on the distributed memory file system dynamically for different Spark SQL workflows, and the random disk I/O cost of the Shuffle phase is reduced by caching Shuffle intermediate data in memory. The example query performs join operations on p_partkey = l_partkey and finally writes the query results to disk. Analysis of this query shows that its execution workflow in Spark can be divided into two kinds of stages: an S-Stage for projection, selection and aggregation operations, and a J-Stage for join operations.

Query Pre-Analysis
For an S-Stage, the following rules can be used to calculate the size of the output data of the stage.
(1) Projection. The projection selectivity is β_pro = L_pro / L_D, where L_pro is the length of the projection attributes and L_D is the total length of all attributes.
(2) Selection. The selection selectivity β_fil is the fraction of tuples that satisfy the selection condition; its value depends on the specific selection conditions and the data distribution of the original table.
(3) Aggregation. An aggregation operation returns the sum or average value of an attribute, so its output size is negligible.
For example, Figure 5 is the running state diagram of this SQL query on Spark. The input data size read by Stage 0 is 233.2 MB, with L_partkey = 10 and L_D = 194.
Through the above analysis, the output of this Stage can be calculated as |D_out| = β_pro·|D_in| = 10/194 * 233.2 MB ≈ 12 MB, which approximates the actual output of 10 MB generated by Stage 0. Similarly, the input data size read by Stage 1 is 7.3 GB, with L_partkey = 10, L_quantity = L_extendedprice = 15 and L_D = 231. Sampling shows that the tuples with l_quantity > 40 account for about 60% of the total. Therefore the output of this Stage is |D_out| = β_pro·β_fil·|D_in| = (10 + 15 + 15)/231 * 0.6 * 7.3 GB ≈ 776 MB, which approximates the actual output of 824 MB generated by Stage 1.
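The three S-Stage rules above can be sketched as follows. The helper names are our own illustration; the numbers reproduce the worked example for Stages 0 and 1.

```python
# Sketch of the S-Stage output-size estimation rules; attribute lengths and
# the 60% selection selectivity are taken from the worked example in the text.

def projection_ratio(proj_len, total_len):
    """beta_pro = L_pro / L_D: fraction of each tuple kept by projection."""
    return proj_len / total_len

def estimate_output_mb(input_mb, beta_pro, beta_fil=1.0):
    """|D_out| = beta_pro * beta_fil * |D_in|; aggregation output is negligible."""
    return input_mb * beta_pro * beta_fil

# Stage 0: project l_partkey (length 10) out of tuples of total length 194.
stage0 = estimate_output_mb(233.2, projection_ratio(10, 194))               # ~12 MB

# Stage 1: keep three fields (10 + 15 + 15 of 231) and filter ~60% of tuples.
stage1 = estimate_output_mb(7.3 * 1024, projection_ratio(40, 231), beta_fil=0.6)  # ~776 MB
```

Both estimates land close to the actual Stage outputs reported above (10 MB and 824 MB).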

Cost Analysis of Join Operation.
For the J-Stage, the following rules are used to calculate the output data size of the stage. Let c denote the join selectivity. If there is no join condition C, the join degenerates into a Cartesian product and c = 1; if no tuple satisfies the join condition C, c = 0; normally 0 ≤ c ≤ 1. If the join condition is D1.A = D2.B, there are three special cases depending on whether A or B is a primary key. When B is not the primary key of D2 and the attributes A and B do not obey a uniform distribution, c needs to be estimated by sampling according to the specific data set, table structure and data size.

For the cost analysis of join operations under non-uniform distributions, the histogram method mentioned in [14] is adapted to Spark SQL. Assume that R is a relation, field C is an attribute of R, and the value range of C is [min, ..., max], where min and max are the minimum and maximum values of field C in R respectively. The range [min, ..., max] is divided into several intervals, called buckets, and the number of tuples whose C attribute values fall into each interval is counted.

To analyze the cost of a join operation with the histogram method, the first step is to construct a variable-width distribution histogram; the flow of the algorithm is shown in Algorithm 1. The input of the algorithm is a frequency distribution histogram that records each attribute value and its number of occurrences, with the attribute values sorted in ascending order. The output is a variable-width distribution histogram consisting of histogram buckets. Each bucket records the starting and ending attribute values and the sum of the frequencies within that range, and there is no intersection between buckets. In the algorithm, a variable max is declared to record the maximum frequency in the current histogram bucket.
For each subsequent attribute-frequency pair, if the difference between the frequency value and max accounts for 5% or less of the frequency value, the attribute is considered to obey the same distribution as the attributes in the current histogram bucket and is included in it. Otherwise, a new histogram bucket is created whose starting value is the current attribute value, and max is reset to the frequency of that attribute value. These steps continue until all attribute-frequency pairs have been traversed.
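A minimal sketch of the bucket-construction rule of Algorithm 1; the dict layout and the `tol` parameter name are our own illustration of the 5% rule.

```python
def build_vw_histogram(freq_pairs, tol=0.05):
    """Build a variable-width histogram from (value, frequency) pairs sorted
    by value. A pair joins the current bucket when its frequency differs from
    the bucket's maximum frequency by at most tol (5%) of the frequency."""
    buckets = []
    cur = None
    cur_max = None                  # maximum frequency in the current bucket
    for value, freq in freq_pairs:
        if cur is not None and abs(freq - cur_max) <= tol * freq:
            # Same distribution as the current bucket: extend it.
            cur["end"] = value
            cur["count"] += freq
            cur_max = max(cur_max, freq)
        else:
            # Start a new bucket at this attribute value.
            cur = {"start": value, "end": value, "count": freq}
            cur_max = freq
            buckets.append(cur)
    return buckets
```

For example, `build_vw_histogram([(1, 100), (2, 102), (3, 99), (4, 500), (5, 498)])` produces two buckets, one covering values 1-3 and one covering values 4-5, since the jump to frequency 500 violates the 5% rule.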
After constructing the variable-width distribution histogram, the total number of tuples after joining is calculated by Algorithm 2.
The input of the algorithm is the variable-width distribution histograms of relations R and S; the output is the total number of tuples produced by the join of the two relations.
The algorithm looks for overlaps between histogram buckets, starting from the first buckets of the R and S histograms. As the values within a bucket are uniformly distributed, templeft and tempright can be obtained by dividing the size of the overlap by the width of the bucket and multiplying the result by the total frequency of the bucket; templeft and tempright represent the frequency of the joining attribute within the overlap for relations R and S respectively. Multiplying templeft by tempright and dividing by the size of the overlap gives the number of tuples generated by the join within that overlap. If there is no overlap between two buckets and the range of bucket h_i^r lies below the range of bucket h_j^s, bucket h_{i+1}^r is compared with bucket h_j^s next. These steps continue until all the histogram buckets of relation R or relation S have been traversed.
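The overlap computation of Algorithm 2 can be sketched as follows, assuming integer-valued join attributes and the bucket layout from the previous sketch (both are our own illustration).

```python
def estimate_join_size(hist_r, hist_s):
    """Estimate |R join S| on the histogrammed attribute. Buckets are dicts
    with integer 'start', 'end' and a 'count'; values within a bucket are
    assumed uniformly distributed, as in the text."""
    total = 0.0
    i = j = 0
    while i < len(hist_r) and j < len(hist_s):
        r, s = hist_r[i], hist_s[j]
        lo = max(r["start"], s["start"])
        hi = min(r["end"], s["end"])
        if lo <= hi:                              # the two buckets overlap
            overlap = hi - lo + 1
            temp_left = overlap / (r["end"] - r["start"] + 1) * r["count"]
            temp_right = overlap / (s["end"] - s["start"] + 1) * s["count"]
            total += temp_left * temp_right / overlap
        # Advance whichever bucket ends first (buckets are disjoint and sorted).
        if r["end"] < s["end"]:
            i += 1
        else:
            j += 1
    return total
```

As a sanity check: 100 R-tuples and 50 S-tuples, each uniform over values 1-10, give 10 * (10 * 5) = 500 joined tuples, which the estimator reproduces exactly.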

Cache Layer Allocation Module.
After calculating the total size of the cache layer, the next question is how much memory to allocate to each node in the cluster. Owing to Spark's data locality principle, when reading HDFS files Spark assigns Executors closest to where the data is stored, so nodes that store more input data should be allocated more memory. Input data in HDFS is organized as blocks, with a default block size of 128 MB. After parsing the user's query statements to obtain the required tables, the distribution of each table's blocks in the cluster must be analyzed. Since every HDFS block has the same size, the memory size M_i allocated to node i can be obtained from the ratio of that node's block count B_i to the total block count B_total, i.e. M_i = M_total · B_i / B_total.
For the Q1 query mentioned before, the size of table lineitem is 7.2 GB, comprising 60 blocks, and the size of table part is 233 MB, comprising 2 blocks. The distribution of these blocks in the cluster is shown in Figure 6. After obtaining the distribution of the input data in the cluster, according to the total cache-layer size calculated by the query pre-analysis module and the formula of the cache layer allocation module, the memory required by the node holding 2 blocks is 790 MB * 2/62 ≈ 25.48 MB. In Spark, computations are all carried out in memory before or after Shuffle operations.
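The proportional allocation formula can be sketched as follows; the node names are hypothetical, and the numbers follow the Q1 example above (790 MB of cache, 62 blocks in total).

```python
def allocate_cache(total_cache_mb, blocks_per_node):
    """M_i = M_total * B_i / B_total: give each node cache memory in
    proportion to the number of HDFS blocks of the input tables it stores."""
    b_total = sum(blocks_per_node.values())
    return {node: total_cache_mb * b / b_total
            for node, b in blocks_per_node.items()}

# Worked example from the text: the node holding 2 of the 62 blocks
# receives 790 * 2 / 62, i.e. about 25.48 MB of the cache layer.
alloc = allocate_cache(790, {"node1": 2, "rest_of_cluster": 60})
```

The allocations always sum to the total cache size, so no cache-layer memory is left unassigned.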
If the cluster has enough memory, then even after a certain amount of memory has been allocated to the intermediate data cache layer, the remaining memory is sufficient for Spark tasks to perform their calculations. If the memory resources of the cluster are limited, however, the query pre-analysis module calculates the Shuffle cache size before the query runs and the cache layer allocation module allocates the memory, which means the memory is occupied before the Shuffle operation is actually carried out. This is obviously not a reasonable approach, so the cache layer allocation module adopts a delayed allocation scheme. The Spark program allocates as much memory as possible to Spark workers in the non-Shuffle phases. In the Shuffle phase, Spark workers do not need memory resources for calculation, so part of the memory of the Spark workers and the remaining memory in the cluster are claimed for the intermediate data cache layer.

Cost-Based Correlation Merging Algorithm
The optimization process applied to SQL queries by the Catalyst framework in the traditional Spark system is shown in Figure 7. First, the query statements entered by users are parsed by the SQL parser to form a logical execution plan tree; the plan tree is then algebraically optimized to some extent by the SQL optimizer. In the current Spark SQL workflow, the same intermediate data may be read and written repeatedly: multiple tasks at the nodes of the logical execution plan tree output the same intermediate data, resulting in additional disk I/O cost. Therefore, an optimization rule can be added to the original SQL optimizer to merge identical intermediate data. Although merging reduces the cost of writing output data to disk, subsequent tasks will then read intermediate data they do not need, which brings extra disk reading cost. A cost calculation is therefore required to decide whether to merge tasks with intermediate data correlation. A cost model module is introduced into the original Catalyst framework: when merging tasks with intermediate data correlation, the SQL optimizer determines whether to apply the optimization rule by cost calculation. The optimized Catalyst framework is named the SQL workflow optimizer, and its structure is shown in Figure 7.

Cost Model.
A cost model for Spark task execution should be established in order to merge Spark tasks with intermediate data correlation. The benefits and extra costs of merging can then be calculated from the model, and the comparison decides whether to merge.
To establish the task execution cost model in Spark, we extend the method proposed by Singhal and Singh [15] by adding the cost generated by the sorting operation. When calculating the cost of a Stage, reading the input data, merging and sorting the intermediate data, and writing the output data are considered, that is, C(Stage) = C_read(Stage) + C_sort(Stage) + C_write(Stage). Since both C_read(Stage) and C_write(Stage) are I/O costs, the general I/O cost formula is C_I/O = C_0·T + C_1·x, where T is the number of I/O occurrences and x is the amount of data transferred. The definition of each parameter is given in Table 2.
The cost of the reading phase of a Stage can be calculated as C_read(Stage) = C_0·T + C_1·(1 + α)·|D_in|, where |D_in| is determined by the size of the source input data or by the output data of other Stages. The value of α is 0.3; it is determined by the default three-copy storage strategy of HDFS, with one copy stored locally, one on the same rack and one on a remote rack. The number of I/O occurrences T depends on the specific Stage. For an S-Stage reading source input data, T = 1, as the source data is stored contiguously; if the Stage reads intermediate data, T is determined by the number of intermediate files. For a J-Stage, the input must be the output of two other Stages, because two tables are joined. Assuming the two previous Stages produce a total of |D_out| of output data and the two-way merge-sort join algorithm is used, ⌈log2(|D_out|/B)⌉ scans are needed, and the number of I/O occurrences is (|D_out|/B)·⌈log2(|D_out|/B)⌉, where B is the block size.
For the cost of the writing phase C_write(Stage), since the intermediate data is written to the local disk, the process involves no network transmission cost. Here |D_out| can be calculated by the query pre-analysis method described above, and the number of I/O occurrences T is determined by the number of intermediate files, |D_out|/B. The writing cost of a Stage is therefore C_write(Stage) = C_0·|D_out|/B + C_1·|D_out|. Each task in a Stage needs to sort and merge all the intermediate data files it generates. If a total of |D_out|/B intermediate files are generated, each of the m tasks in the Stage can be considered to generate |D_out|/(B·m) intermediate files. With the number of sorting passes P = (|D_out|/(B·m))·⌈log2(|D_out|/(B·m))⌉, the cost of the sorting phase is C_sort(Stage) = C_0·P + C_1·|D_out|, and the execution cost of a Stage in Spark is the sum C(Stage) = C_read(Stage) + C_sort(Stage) + C_write(Stage).

In the Shuffle phase, Spark SQL stores the data as <key, value> pairs and outputs them to the local disk. Take Stage 1 in Figure 8 as an example; the corresponding query is select l_partkey, l_quantity, l_extendedprice from lineitem, part where p_partkey = l_partkey.
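A rough sketch of the combined cost model as reconstructed above. The constants C_0 and C_1 are placeholders (the paper takes its parameter values from Table 2), and the treatment of the sorting term follows our reading of the formulas, so this is an illustration rather than the paper's exact implementation.

```python
import math

def c_io(t, x, c0=1.0, c1=0.01):
    """Generic I/O cost C_IO = C0*T + C1*x: T is the number of I/O
    occurrences, x the amount of data moved; c0, c1 are placeholder
    constants standing in for the parameters of Table 2."""
    return c0 * t + c1 * x

def stage_cost(d_in, d_out, block, m, t_read, alpha=0.3):
    """C(Stage) = C_read + C_sort + C_write (sketch).

    t_read -- I/O count of the reading phase: 1 for a contiguous source scan,
              or (d_out_prev/B)*ceil(log2(d_out_prev/B)) for a J-Stage
              fed by a two-way merge-sort join.
    alpha  -- remote-read factor from the HDFS three-copy strategy.
    """
    c_read = c_io(t_read, (1 + alpha) * d_in)
    files = d_out / block                    # number of intermediate files
    c_write = c_io(files, d_out)
    per_task = d_out / (block * m)           # files produced by each of m tasks
    p = per_task * math.ceil(math.log2(max(per_task, 2)))
    c_sort = c_io(p, d_out)
    return c_read + c_sort + c_write
```

As expected from the model, a Stage that emits more intermediate data costs more, all else being equal.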
The parsed workflows all join at the Reduce Task in Spark SQL. So, in the Map Task phase, the data is read from the lineitem table row by row; l_partkey is saved as the key, and l_quantity and l_extendedprice are saved as the value. To facilitate the join operation on the Reduce side, the actual <key, value> pairs take the form <l_partkey, l_quantity|l_extendedprice>.
Similarly, for Stage 3 in Figure 8, the query is select l_partkey, l_quantity from lineitem, supplier where s_suppkey = l_suppkey and s_nationkey = 2, and the <key, value> pairs generated in the Shuffle phase are <l_suppkey, l_partkey|l_quantity>. The field l_partkey appears in the intermediate data output of both Stage 1 and Stage 3, once as the key and once in the value; l_quantity also appears in the value of both Stages. Since both sub-queries involve only join and projection operations and no selection operations, the output data from the Map Tasks of Stage 1 and Stage 3 contains all the rows of the lineitem table. The two kinds of pairs can therefore be considered to contain all the l_partkey and l_quantity values in the lineitem table, so they can be merged. Although merging reduces the writing cost of the current Stage, it increases the reading cost of subsequent Stages. For example, Stage 2 joins the part table and the lineitem table but reads the unwanted l_suppkey field after merging; similarly, Stage 5 joins the supplier table and the lineitem table but reads the unwanted l_extendedprice field after merging. Therefore, the writing cost of the saved fields and the reading cost of the added fields should be weighed to determine whether to merge the Shuffle intermediate data. This problem is solved by the shared field merging algorithm below.

Shared Field Merging Algorithm.
Assume that the SQL statement of Stage i is select i_1, i_2, ..., i_n from table_i, so the <key, value> pairs it generates have the format <i_1, i_2|i_3|...|i_n>, and the pairs generated by Stage j have the format <j_1, j_2|j_3|...|j_m>. Let S_i = {i_2, i_3, ..., i_n} and S_j = {j_2, j_3, ..., j_m}; then the format of the <key, value> pairs after merging is <i_1 ∪ j_1, (S_i ∪ S_j) − (i_1 ∪ j_1)>. The keys of the two pairs are compared first: they are unified into one field if the field names are the same; otherwise, the two fields are joined together, separated by "|". Next, the value sets of the two pairs are merged and the fields that already appear in the key are removed. In summary, the data format after merging the intermediate data of Stage 1 and Stage 3 is <l_partkey|l_suppkey, l_quantity|l_extendedprice>. To weigh the writing benefit of the saved fields against the reading cost of the added fields, the cost model is defined with |D_write| = |D_out|·L_save_cols/L_total and |D_read| = |D_out|·L_add_cols/L_total, where L_save_cols is the total length of the fields whose writing is saved, L_add_cols is the total length of the fields whose reading is added, and L_total is the total length of all fields; |D_out| can be calculated by the query pre-analysis method described above. For example, after merging into the above pairs, the repeated writing of l_partkey and l_quantity is eliminated, so L_save_cols = L_l_partkey + L_l_quantity; similarly, redundant reading of l_extendedprice and l_suppkey is introduced, so L_add_cols = L_l_extendedprice + L_l_suppkey.
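The merge decision can be sketched as follows. The function name and the rate parameters are our own; the 80 MB/s default roughly matches the disk throughput reported in the experiments later in the paper.

```python
def merge_benefit(d_out_mb, l_save_cols, l_add_cols, l_total,
                  write_rate_mb_s=80.0, read_rate_mb_s=80.0):
    """Return (saved write seconds, extra read seconds) for merging two
    correlated Shuffle outputs: |D_write| = |D_out|*L_save_cols/L_total is
    the writing avoided, |D_read| = |D_out|*L_add_cols/L_total the redundant
    reading introduced. Merge when the first exceeds the second."""
    d_write = d_out_mb * l_save_cols / l_total
    d_read = d_out_mb * l_add_cols / l_total
    return d_write / write_rate_mb_s, d_read / read_rate_mb_s

# Illustrative (hypothetical) field lengths: saving 30 of 100 bytes per tuple
# beats re-reading 20 extra bytes, so merging pays off here.
saved, extra = merge_benefit(1000, l_save_cols=30, l_add_cols=20, l_total=100)
```

The decision rule is simply `saved > extra`; with asymmetric read and write throughputs the two rates diverge and the break-even point shifts accordingly.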

Experiments
We verify the performance of the SSO system in this section. The test data was generated by the TPC-H benchmarking tool. A subset of the query statements provided by TPC-H was selected to test the performance of the system in two parts: an intermediate data caching test and an intermediate data correlation merging algorithm test.

Experimental Environment and Data Set.
The hardware environment was as follows. The experimental cluster included one master node and 10 slave nodes, connected through Gigabit Ethernet. Each node was configured with a 2.7 GHz CPU (8 cores, 16 threads, 20 MB cache, with Turbo Boost), 64 GB of DDR3-1600 memory and a 500 GB SAS mechanical hard disk.
As for the software environment, the operating system on the master node and the slave nodes was CentOS 6.7. Each node was installed with JDK 1.8.0, Hadoop 2.6.0, Spark 2.0.1 and Scala 2.1.0. The programming languages used in the experimental development were Java and Scala, with the IntelliJ IDEA IDE. The experimental data set was generated by the TPC-H benchmarking tool. TPC-H is a standard for DBMS performance testing developed by the Transaction Processing Performance Council (TPC), a non-profit organization; it simulates real-life business application environments and is widely used to evaluate the comprehensive performance of decision support systems. The test data was generated using the command dbgen -s x provided by TPC-H, where x represents the data size in GB to be generated. In addition to the test data set, TPC-H also provides 22 query statements, Q1-Q22, which users generate for performance testing with the qgen command provided by TPC-H.

Spark Shuffle Intermediate Data Caching Test.
After completing the memory allocation of the Spark Shuffle cache layer on each node, several representative TPC-H queries were chosen: Q1, Q5, Q9, Q18 and Q19. Q1 and Q19 have larger input data and less intermediate data; Q5, Q9 and Q18 have less input data and larger intermediate data. The queries were submitted to the SSO system and the original Spark system, and the query execution time was recorded.
The experimental results are shown in Figure 9.
The results show that for Q5, Q9 and Q18, whose intermediate data was much larger than the input data, the intermediate data cache layer solved the problem of high random disk I/O cost and the optimization effect was obvious. But for Q1 and Q19, whose input data was larger than the intermediate data, reading the input data accounted for the vast majority of the disk I/O, so the optimization effect of the intermediate data cache layer was relatively limited. The change of disk I/O rate over the course of a query is shown in Figure 10. As can be seen from the figure, since the source data was stored sequentially, reading the input data produced concentrated disk I/O; in both systems, the disk read rate was around 80 MB/s, the stable peak performance of the mechanical hard disk. As the program entered the Shuffle phase of reading and writing intermediate data, the original Spark system generated more random disk I/O, its disk read rate fluctuated greatly, and it took a long time to complete the Shuffle process. The SSO system used the intermediate data cache layer, so the intermediate data was read and written in memory, the Shuffle phase generated no disk I/O cost, and the query task completed faster than in the original Spark system. Figure 11 shows that the memory usage of the two systems was similar in the first half of the query, both increasing steadily. On entering the Shuffle phase, the SSO system calculated the required memory size with the query pre-analysis module and allocated the memory on each node with the cache layer allocation module, so its memory usage increased sharply at this point; this memory was released after the query ended. Throughout the query, the master node had enough memory (64 GB), while the intermediate data generated by query Q9 was only 14.1 GB.
It was therefore possible to set aside a region of memory as the intermediate data buffer. The SSO system had a higher memory usage rate, which was one of the main reasons for its higher execution efficiency.

Intermediate Data Correlation Merging Algorithm Test.
By merging intermediate data with the same fields, the size of the generated intermediate data can be further reduced, thus reducing the amount of memory used by the intermediate data cache layer. To verify the effectiveness of the algorithm, query Q17 of TPC-H was submitted to the SSO system and the original Spark SQL system respectively. In addition, without the intermediate data correlation merging algorithm, the original Spark SQL system was tested in two configurations: with and without the intermediate data cache layer. The query execution times were recorded and are shown in Figure 12.
The results showed that the execution times of the SSO system and of the Spark SQL system with the intermediate data cache layer were both shorter than that of the unoptimized Spark SQL system, while the execution times of the former two tended to be the same. To compare those two further, the cache size used by each node was monitored, with the results shown in Table 3. The cache usage of the SSO system with the intermediate data correlation merging algorithm was lower on every node than that of the Spark SQL system without the algorithm. This was because, by sharing the same fields l_partkey and l_quantity, 1.5 GB of intermediate data could be eliminated; writing that data would have caused a disk cost of 19.2 seconds. Redundant reads of the fields l_extendedprice and l_suppkey were also generated, and reading the 1.26 GB of data caused disk I/O of 16.4 seconds. The comparison shows that the saved cost was greater than the generated cost, so in this case the proposed algorithm merged the shared intermediate data. However, after adopting the cache layer, the intermediate data was read and written in memory at a rate of tens of GB per second; provided the cache layer held all the intermediate data, the amount of cache used had little effect on task execution efficiency. That was the main reason why the query times of the first two experiments tended to be consistent. But when cluster memory resources are tight, the intermediate data correlation merging algorithm reduces the cache usage and thus saves memory resources effectively.

Conclusion
With the maturity of in-memory computing frameworks, Spark, as their representative, has attracted more and more attention from enterprises and research groups, and Spark SQL plays an important role as a bridge between data analysts and Spark systems. To optimize Spark SQL query performance, the existing Spark SQL was improved and the SSO prototype system was developed. By adding the Spark Shuffle intermediate data cache layer, the high disk I/O cost caused by random reading and writing of intermediate data in the Shuffle phase was reduced. To solve the problem of redundant reads and writes of intermediate data in Spark SQL, a cost-based correlation merging algorithm was proposed; it determines whether to merge correlated tasks by weighing the benefits against the extra costs of merging, thus improving query execution efficiency. The experimental platform was built, the SSO system was developed, the TPC-H benchmarking tool was used to generate test data, and a performance comparison with the existing Spark SQL system verified the effectiveness of the work in this paper.

Data Availability
All data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.