Optimizing Hadoop Performance for Big Data Analytics in Smart Grid

The rapid deployment of Phasor Measurement Units (PMUs) in power systems globally is leading to Big Data challenges. New high performance computing techniques are now required to process an ever increasing volume of data from PMUs. To that end, the Hadoop framework, an open source implementation of the MapReduce computing model, is gaining momentum for Big Data analytics in smart grid applications. However, Hadoop has over 190 configuration parameters, which can have a significant impact on the performance of the Hadoop framework. This paper presents an Enhanced Parallel Detrended Fluctuation Analysis (EPDFA) algorithm for scalable analytics on massive volumes of PMU data. The novel EPDFA algorithm builds on an enhanced Hadoop platform whose configuration parameters are optimized by Gene Expression Programming. Experimental results show that the EPDFA is 29 times faster than the sequential DFA in processing PMU data and 1.87 times faster than a parallel DFA, which utilizes the default Hadoop configuration settings.


Introduction
Phasor Measurement Units (PMUs) are being rapidly deployed throughout global electricity networks, facilitating the development and deployment of Wide Area Monitoring Systems (WAMS). WAMS provide a far more immediate and accurate view of the power grid than traditional Supervisory Control and Data Acquisition (SCADA) monitoring [1,2], collecting real-time synchronized measurements at a typical rate of 1 sample per cycle of the system frequency. This brings new data management challenges that need to be addressed to fully realize the benefits of the technology.
The devices transmit 4.32 million measurements per parameter per day for a 50 Hz system. This is orders of magnitude more data than traditional monitoring solutions produce and requires fast-acting, scalable algorithms, combined with novel visualization techniques, to turn the growing datasets into actionable information for network operators and planners alike.
The authors' previous research has focused on the detection of transient events in PMU datasets using Detrended Fluctuation Analysis (DFA), for the purpose of triggering steady-state estimators [3] and on determining events suitable for system inertia estimation [4]. However, processing an ever increasing volume of PMU data in a timely manner necessitates a high performance and scalable computing infrastructure. For this purpose we have parallelized the work presented in [3] using the MapReduce computing model [5] and implemented a parallel DFA (PDFA) [6] using the Hadoop MapReduce framework [7].
The MapReduce model has become a de facto standard for Big Data analytics by capitalizing on clusters of inexpensive commodity computers. The Hadoop framework is an open source implementation of the MapReduce model and has been widely adopted due to its remarkable features such as high scalability, fault-tolerance, and computational parallelization [8,9]. In addition, the Hadoop framework has also been applied in the power system domain for power grid data analysis [6,10-13].
Despite its remarkable features, Hadoop is a complex framework, which has a number of components that interact with each other across a cluster of nodes. The execution times of Hadoop jobs are sensitive to each component of the framework including the underlying hardware, network infrastructure, and configuration parameters. It is worth noting that the Hadoop framework has more than 190 tunable configuration parameters, some of which have a significant impact on the execution of a Hadoop job [14]. Manually tuning these parameters is a time consuming and ineffective process and is highly challenging when attempting to ensure that Hadoop operates at an optimal level of performance. In addition, the Hadoop framework has a black-box-like feature, which makes it extremely difficult to find a mathematical model or an objective function that represents a correlation among the parameters. The large parameter space together with the complex correlation among the configuration parameters further increases the complexity of a manual tuning process. Therefore, an effective and automatic approach to tuning Hadoop's parameters has become a necessity.
In this paper, we present an Enhanced Parallel Detrended Fluctuation Analysis (EPDFA) algorithm for scalable analytics on massive PMU datasets. EPDFA is based on an enhanced Hadoop platform whose configuration parameters are optimized by Gene Expression Programming (GEP) [15]. The EPDFA employs GEP to construct an objective function based on a historical profile of the execution of jobs. The objective function represents a mathematical correlation among the core Hadoop parameters. The EPDFA then makes use of the constructed objective function to find a set of optimal values of the core Hadoop parameters for performance enhancement. It should be noted that, in the proposed optimization process, the entire parameter search space is considered in order to maintain the interdependencies among the configuration parameters. The performance of the EPDFA is evaluated on an experimental Hadoop cluster configured with 8 Virtual Machines (VMs) and is compared with both the original sequential DFA and the PDFA that only utilizes the default Hadoop configuration settings. The PMU data used in the evaluation was collected from the WAMS of the Great Britain (GB) transmission system.
The remainder of the paper is organized as follows. Section 2 reviews the related work on Hadoop configuration tuning. Section 3 introduces a set of core Hadoop configuration parameters, which are considered in this work. Section 4 presents in detail the design and implementation of the EPDFA for scalable analysis on PMU data. Section 5 compares the performance of the EPDFA with that of the sequential DFA and the PDFA, respectively, using an experimental Hadoop cluster. Section 6 concludes the paper and proposes some further work.

Hadoop Parameter Tuning
In this section the related work on autotuning Hadoop configuration parameter settings is reviewed.
In the relevant literature, there are several Hadoop performance models that focus on tuning Hadoop configuration parameters in order to enhance the execution of Hadoop jobs [14,16-20]. Wu and Gokhale proposed the Profiling and Performance Analysis-Based System (PPABS) [17], which automatically tunes the Hadoop configuration parameter settings based on executed job profiles. The PPABS framework consists of Analyzer and Recognizer components. The Analyzer trains the PPABS to classify the jobs having similar execution times into a set of equivalent classes. The Analyzer uses k-means++ to classify the jobs and simulated annealing to find optimal settings. The Recognizer classifies a new job into one of these equivalent classes using a pattern recognition technique. The Recognizer first runs the new job on a small dataset using default configuration settings and then applies the pattern recognition technique to classify it. Each class has the best configuration parameter settings. Once the Recognizer determines the class of a new job, it then automatically uploads the best configuration settings for this job. However, PPABS is unable to determine the fine-tuned configuration settings for a new job that does not belong to any of these equivalent classes.
Herodotou et al. proposed Starfish [14,16], which employs a mixture of a cost model [21] and a simulator to optimize a Hadoop job based on previously executed job profile information. However, the Starfish model is based on simplifying assumptions [20], which indicate that the obtained configuration may be suboptimal.
Liao et al. [18] proposed a search based model that automatically tunes the configuration parameters using a Genetic Algorithm (GA). One critical limitation is that it does not have a fitness function implemented in the GA. The fitness of a set of parameter values is evaluated by physically executing a Hadoop job using the tuned parameters, which is an exhaustive and time consuming process.
Liu et al. [19] proposed two approaches to optimize Hadoop applications. The first approach optimizes the compiler at run time, and a new Application Programming Interface (API) was developed on top of a Java Bytecode Optimization Framework [22] to reduce the overhead of iterative Hadoop applications. The second approach optimizes a Hadoop application by tuning Hadoop configuration parameters. This approach divides the parameter search space into subsearch spaces and then searches for optimum values by trying different values for parameters iteratively within the range. However, neither approach provides a sophisticated search technique or a mathematical function that represents the correlation of the Hadoop configuration parameters.
Li et al. [23] proposed a performance evaluation model for the whole system optimization of Hadoop. The model analyzes the hardware and software levels and explores the performance issues in both levels. The model mainly focuses on the impact of different configuration settings on job execution time instead of tuning the configuration parameters.
Yu et al. [20] proposed a performance model, which employs a combination of Random-Forest and GA techniques. The Random-Forest approach is used to build performance models for the map phase and the reduce phase, and a GA is employed to search for optimum configuration parameter settings within the parameter space. It should be noted that a Hadoop job is executed in overlapping and nonoverlapping stages [24], which are ignored in the proposed performance model. As a result, the performance estimation of the proposed model may be inaccurate. Furthermore, the proposed model uses a dynamic instrumentation tool (BTrace) to collect the timing characteristics of tasks. BTrace utilizes extra CPU cycles that generate extra overheads, especially for CPU-intensive applications. As a result, the proposed model overestimates the execution time of a job.

Hadoop Parameters
The Hadoop framework has more than 190 tunable configuration parameters that allow users to manage the flow of a Hadoop job in different phases during the execution process. Some of them are core parameters and have a significant impact on the performance of a Hadoop job [14,18,25].
Considering all of the 190 configuration parameters for optimization purposes would be unrealistic and time consuming.
In order to reduce the parameter search space and effectively speed up the search process, we consider only core parameters in this research. The selection of the core parameters is based on previous research studies [14,17,18,25,26]. The core parameters listed in Table 1 are briefly described as follows.
io.sort.factor. This parameter determines the number of files (streams) to be merged during the sorting process of map tasks. The default value is 10, but increasing its value improves the utilization of the physical memory and reduces the overhead in IO operations.
io.sort.mb. During job execution, the output of a map task is not directly written into the hard disk but is written into an in-memory buffer which is assigned to each map task. The size of the in-memory buffer is specified through the io.sort.mb parameter. The default value of this parameter is 100 MB. The recommended value for this parameter is between 30% and 40% of the Java Opts value, and it should be larger than the output size of a map task, which minimizes the number of spill records [27].
io.sort.spill.percent. The default value of this parameter is 0.8 (80%). When an in-memory buffer is filled up to 80%, the data of the in-memory buffer (io.sort.mb) should be spilled into the hard disk. It is recommended that the value of io.sort.spill.percent should not be less than 0.50.
mapred.reduce.tasks. This parameter can have a significant impact on the performance of a Hadoop job [24]. The default value is 1. The optimum value of this parameter is mainly dependent on the size of an input dataset and the number of reduce slots configured in a Hadoop cluster. Setting a small number of reduce tasks for a job decreases the overhead in setting up tasks on a small input dataset, while setting a large number of reduce tasks improves the hard disk IO utilization on a large input dataset. The recommended number of reduce tasks is 90% of the total number of reduce slots configured in a cluster [28].
When the number of accumulated map output files reaches the threshold value, the system initiates the process of merging the map output files and spilling them to disk. A value of zero for this parameter means there is no threshold and the spill process is controlled by the mapred.reduce.shuffle.merge.percent parameter [27].
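The recommendations quoted above can be turned into a small calculation. The following is a minimal sketch only: the function name, its arguments, and the choice of 35% as the midpoint of the 30% to 40% band are assumptions made for illustration, not settings taken from the paper.

```python
# Defaults quoted in the text for the core parameters described above.
DEFAULTS = {
    "io.sort.factor": 10,
    "io.sort.mb": 100,              # MB
    "io.sort.spill.percent": 0.80,
    "mapred.reduce.tasks": 1,
}


def rule_of_thumb_settings(java_heap_mb, total_reduce_slots):
    """Apply the quoted rules of thumb (hypothetical helper).

    java_heap_mb: heap size given by the Java Opts value, in MB (assumed input).
    total_reduce_slots: reduce slots configured in the cluster (assumed input).
    """
    settings = dict(DEFAULTS)
    # io.sort.mb: 30% to 40% of the Java Opts value; 35% is an arbitrary midpoint.
    settings["io.sort.mb"] = java_heap_mb * 35 // 100
    # mapred.reduce.tasks: 90% of the reduce slots, but never fewer than 1.
    settings["mapred.reduce.tasks"] = max(1, total_reduce_slots * 90 // 100)
    return settings


if __name__ == "__main__":
    for name, value in rule_of_thumb_settings(1024, 16).items():
        print(name, "=", value)
```

Such values are only starting points; the text also requires io.sort.mb to exceed the output size of a map task, which has to be checked against the actual job.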

The Optimization of Hadoop Using GEP
The automated Hadoop performance tuning approach is based on a GEP technique, which automatically searches for Hadoop optimum configuration parameter settings by building a mathematical correlation among the configuration parameters. In this section we first describe the GEP technique and then present the implementation of the EPDFA algorithm with the Hadoop performance enhancement.
GEP [15] is a new type of Evolutionary Algorithm (EA) [29]. It is developed based on concepts that are similar to Genetic Algorithms (GA) [30] and Genetic Programming (GP) [31]. Using a special representational format of the solution structure, GEP overcomes some limitations of both GA and GP. GEP uses a combined chromosome and expression tree structure [15] to represent a targeted solution of the problem being investigated. The factors of the targeted solution are encoded into a linear chromosome format together with some potential functions, which can be used to describe the correlation of the factors. Each chromosome generates an expression tree, and the chromosomes containing these factors are evolved during the evolutionary process.

$$\text{Execution Time} = f(x_0, x_1, \ldots, x_n). \quad (1)$$

In this research, we consider 9 core Hadoop parameters and, based on the data types of these Hadoop configuration parameters, the functions shown in Table 2 can be applied in the GEP method. A correlation of the Hadoop parameters can be represented by a combination of the functions. Figure 1 shows an example of mining a correlation of 2 parameters ($x_0$ and $x_1$), which is conducted in the following steps in the proposed GEP method:
(i) Based on the data types of $x_0$ and $x_1$, find a function, which has the same input data type as either $x_0$ or $x_1$ and has 2 input parameters.
(ii) Calculate the estimated execution time of the selected function using the parameter setting samples.
(iii) Find the best function between $x_0$ and $x_1$, which produces the closest estimate to the actual execution time. In this case, the Plus function is selected.
Similarly, a correlation of $x_0, x_1, \ldots, x_n$ can be mined using the GEP method. The chromosome and expression tree structure of GEP is used to hold the parameters and functions. A combination of functions, which takes $x_0, x_1, \ldots, x_n$ as inputs, is encoded into a linear chromosome that is maintained and developed during the evolution process. Meanwhile, the expression tree generated from the linear chromosome produces a form of $f(x_0, x_1, \ldots, x_n)$, based on which an estimated execution time is computed and compared with the actual execution time. A final form of $f(x_0, x_1, \ldots, x_n)$ will be produced at the end of the evolution process whose estimated execution time is the closest to the actual execution time.

Algorithm 1 (excerpt):
(10) Translate chromosome into expression tree;
(11) FOR each of the training samples DO
(12) evaluate the estimated execution time for the sample
(13) IF ABS (timeDiff) < bias window THEN
(14) fitness ...
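Steps (i) to (iii) can be illustrated with a minimal sketch. The candidate function set, the sample values, and the use of the summed absolute error as the closeness measure are all assumptions made for this example; only the Plus function is named in the text.

```python
import math

# Candidate two-input functions. "plus" is named in the text; the others
# are illustrative assumptions.
CANDIDATES = {
    "plus": lambda a, b: a + b,
    "minus": lambda a, b: a - b,
    "times": lambda a, b: a * b,
    "fmod": lambda a, b: math.fmod(a, b),
}


def best_binary_function(samples):
    """Return (name, total_abs_error) of the closest candidate function.

    samples: list of (x0, x1, actual_execution_time) tuples.
    """
    best = None
    for name, fn in CANDIDATES.items():
        # Step (ii): estimate the execution time for every sample.
        error = sum(abs(fn(x0, x1) - actual) for x0, x1, actual in samples)
        # Step (iii): keep the function with the closest overall estimate.
        if best is None or error < best[1]:
            best = (name, error)
    return best


# Hypothetical parameter setting samples: (x0, x1, actual time in seconds).
SAMPLES = [(10, 100, 112.0), (20, 150, 168.0), (50, 200, 251.0)]

if __name__ == "__main__":
    print(best_binary_function(SAMPLES))  # prints ('plus', 5.0)
```

In GEP itself the function is not chosen by exhaustive comparison; the comparison here only shows what "closest estimate" means for a single pair of parameters.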
In the GEP method, a chromosome can consist of one or more genes. For computational simplicity, each chromosome has only one gene in the proposed method. A gene is composed of a head and a tail. The elements of the head are selected randomly from the set of Hadoop parameters (listed in Table 1) and the set of functions (listed in Table 2). However, the elements of the tail are selected only from the Hadoop parameter set. The length of a gene head is set to 20, which covers all the possible combinations of the functions. The length of a gene tail can be computed using

$$\text{Length}(\text{Gene}_{\text{Tail}}) = \text{Length}(\text{Gene}_{\text{Head}}) \times (n - 1) + 1, \quad (2)$$

where $n$ is the largest number of input arguments of a function. In the following section, we present how the GEP method evolves when mining a correlation from the Hadoop configuration parameters in order to construct an objective function.
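As a worked example of (2), the sketch below computes the tail length and builds one random gene. The function arities and the symbols x0 to x8 are assumptions: the function names follow Table 2 and the Plus function of Figure 1, and a maximum arity of 2 is assumed because that is the largest arity visible in the text.

```python
import random

# Function name -> number of input arguments (assumed arities).
FUNCTIONS = {"plus": 2, "fmod": 2, "pow10": 1, "inv": 1, "abs": 1, "neg": 1}
# Nine core Hadoop parameters, written here as x0..x8 (assumed symbols).
TERMINALS = ["x%d" % i for i in range(9)]


def gene_tail_length(head_length, max_arity):
    """Equation (2): tail = head * (n - 1) + 1."""
    return head_length * (max_arity - 1) + 1


def random_gene(head_length, functions, terminals, rng):
    """Build one gene: the head mixes functions and parameters, the tail
    holds parameters only, as described in the text."""
    tail_length = gene_tail_length(head_length, max(functions.values()))
    head_pool = list(functions) + list(terminals)
    head = [rng.choice(head_pool) for _ in range(head_length)]
    tail = [rng.choice(terminals) for _ in range(tail_length)]
    return head + tail


if __name__ == "__main__":
    print(gene_tail_length(20, 2))  # prints 21
    gene = random_gene(20, FUNCTIONS, TERMINALS, random.Random(0))
    print(len(gene))                # prints 41
```

Under these assumptions, a head of length 20 gives a tail of length 21, so every gene has 41 elements and the tail always supplies enough parameters to complete any expression encoded in the head.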

Mining Hadoop Parameter Correlation with GEP.
Algorithm 1 shows the implementation of the GEP method in order to construct an objective function that represents the correlation between the Hadoop configuration parameters and estimates the execution time of a job. The input of Algorithm 1 is a set of Hadoop job execution samples, which are used as a training dataset.
In Algorithm 1, Lines (1) to (5) initialize the first generation of 500 chromosomes, which represent 500 possible correlations between the Hadoop parameters. Lines (8) to (29) implement an evolution process in which a single loop represents a generation of the evolution process. Each chromosome is translated into an expression tree. Lines (11) to (17) calculate the fitness value of a chromosome. For each training sample, GEP produces an estimated execution time of a Hadoop job and compares it with the actual execution time of the job. If the difference is less than a predefined bias window, the fitness value of the current chromosome is increased by 1.
The size of the bias window is set to 50 seconds, which allows a maximum of 10% of the error space taking into account the actual execution time of a Hadoop job sample. Line (18) states that the evolution process terminates in an ideal case when the fitness value is equal to the number of training samples. Otherwise, the evolution process continues and the chromosome with the best fitness value is retained, as shown in Lines (20) to (23). At the end of each generation, as shown in Lines (24) to (25), a genetic modification is applied to the current generation to create variations of the chromosomes for the next generation.
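The fitness calculation described for Lines (11) to (17) can be sketched as follows. This is an illustration under assumptions: the chromosome is represented by a plain Python callable standing in for its expression tree, and the training samples are hypothetical.

```python
def fitness(estimate, samples, bias_window=50.0):
    """Count the training samples whose estimated execution time lies
    within the bias window (in seconds) of the actual execution time."""
    score = 0
    for params, actual_time in samples:
        time_diff = estimate(params) - actual_time
        if abs(time_diff) < bias_window:
            score += 1
    return score


def quality(estimate, samples, bias_window=50.0):
    """Ratio of the fitness value to the number of training samples."""
    return fitness(estimate, samples, bias_window) / len(samples)


# Hypothetical stand-in for the expression tree of one chromosome.
def example_estimate(params):
    return params["x0"] + params["x1"]


# Hypothetical training samples: (parameter setting, actual time in seconds).
TRAINING = [
    ({"x0": 10, "x1": 100}, 120.0),
    ({"x0": 20, "x1": 150}, 200.0),
    ({"x0": 50, "x1": 200}, 600.0),
]

if __name__ == "__main__":
    print(fitness(example_estimate, TRAINING))  # prints 2
```

The ideal termination case of Line (18) corresponds to `fitness(...) == len(samples)`, that is, a quality of 1.0.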
We varied the number of generations from 20000 to 80000 in the GEP evolution process and found that the quality of a chromosome (the ratio of the fitness value to the number of training samples) was finally higher than 90%. As a result, we set 80000 as the number of generations. The genetic modification parameters were set using the classic values [15] as shown in Table 3.
After 80000 generations, GEP generates an objective function as described in (3) representing a correlation of the Hadoop parameters listed in Table 1.

Optimizing Hadoop Settings with GEP.
The correlation mined in the previous section describes each Hadoop parameter's contribution to the execution time. In GEP optimization, each chromosome represents a Hadoop configuration setting. Based on the objective function represented by (3), GEP finds the best chromosome that leads to the shortest execution time of a Hadoop job in each generation. GEP uses a range for each parameter that is involved in the evolution process, as shown in Table 4. The range of each involved parameter is selected based on the values used in the training dataset for the corresponding parameters. Initially, default values were set for the involved parameters, and the values were then updated to obtain an optimal solution. Updating the configuration values for the involved parameters is dependent on a number of factors such as input data block size, available physical memory, the number of CPUs, and the type of applications. For example, we set a range of 10 to 100 for $x_1$. The value of $x_1$ is based on the data block size, and its value must be greater than the input data block size in order to reduce the number of spill records. In this work the size of the data block is 5 MB. In the PDFA, the computation is mainly conducted in the map phase and very little work is performed in the reduce phase. Therefore, we set a larger range of values for $x_4$ (i.e., 1 to 8) than for $x_5$ (i.e., 1 to 2).
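The second stage, searching the bounded parameter space for the setting with the shortest estimated execution time, can be sketched as below. Two points are assumptions and are not the authors' method: a plain random search is swapped in for the GEP evolution, and `toy_objective` is a hypothetical stand-in for the objective function (3). Only the three ranges quoted in the text are used.

```python
import random

# Ranges quoted in the text; the remaining parameters are omitted here.
RANGES = {"x1": (10, 100), "x4": (1, 8), "x5": (1, 2)}


def toy_objective(setting):
    """Hypothetical stand-in for the mined objective function (3);
    returns an estimated execution time in seconds."""
    return (1000.0 / setting["x4"] + 200.0 / setting["x5"]
            + abs(setting["x1"] - 60))


def search_best_setting(objective, ranges, iterations, seed=0):
    """Random search (not GEP): sample settings inside the ranges and
    keep the one with the shortest estimated execution time."""
    rng = random.Random(seed)
    best, best_time = None, float("inf")
    for _ in range(iterations):
        candidate = {name: rng.randint(lo, hi)
                     for name, (lo, hi) in ranges.items()}
        estimated = objective(candidate)
        if estimated < best_time:
            best, best_time = candidate, estimated
    return best, best_time


if __name__ == "__main__":
    setting, estimated = search_best_setting(toy_objective, RANGES, 500)
    print(setting, estimated)
```

Because every candidate is scored by the objective function rather than by running a Hadoop job, each evaluation is cheap, which is the advantage the text attributes to having a mined objective function.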

The Implementation of EPDFA.
The EPDFA algorithm proposed in this paper optimizes the authors' previous work [3,4,6], in which a dataset of PMU frequency measurements is detrended on a sample-by-sample sliding window. The window was configured to be 50 samples long in order to detect changes or fluctuations in the power system's state over a 1-second period (at 50 Hz), looking for a specific loss shape in frequency following an instantaneous loss in generation. A root mean square (RMS) value of the fluctuation, $F$, is then taken for every window, as shown in (4); this value is then compared with a threshold value of $0.2 \times 10^{-3}$, predetermined through a number of previous baseline studies, to detect the presence of an event.

$$F = \sqrt{\frac{1}{N} \sum_{i=1}^{N} y(i)^2}, \quad (4)$$

where $N$ is the size of the window (50 samples), $i$ is the sample number, and $y(i)$ is the detrended signal.
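The windowed detection step can be sketched as follows. The detrending method is an assumption: the text does not specify it, so a least-squares straight-line fit per window is used here, and any integration step of a full DFA is omitted. Function names and the test signal are hypothetical.

```python
import math


def detrend_linear(window):
    """Subtract the least-squares straight line from the window
    (assumed detrending method)."""
    n = len(window)
    mean_t = (n - 1) / 2.0
    mean_y = sum(window) / n
    s_tt = sum((t - mean_t) ** 2 for t in range(n))
    s_ty = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(window))
    slope = s_ty / s_tt
    return [y - (mean_y + slope * (t - mean_t)) for t, y in enumerate(window)]


def rms(values):
    """Root mean square of the detrended window, as in (4)."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def detect_events(freq, window=50, threshold=0.2e-3):
    """Slide sample by sample; return the start index of every window
    whose RMS fluctuation exceeds the threshold."""
    hits = []
    for start in range(len(freq) - window + 1):
        fluctuation = rms(detrend_linear(freq[start:start + window]))
        if fluctuation > threshold:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Hypothetical signal: steady 50 Hz followed by a 0.01 Hz step drop.
    signal = [50.0] * 100 + [49.99] * 100
    print(detect_events(signal))
```

Only windows that straddle the step are flagged; windows over a flat stretch of the signal have an RMS fluctuation of essentially zero.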
Figure 2 shows the software architecture of EPDFA for an off-line analysis of PMU data in an enhanced Hadoop cluster. OpenPDC [32] was installed to collect measurements from the installed PMUs, which are then stored in the OpenPDC data historian. OpenPDC was configured in such a way that when the data historian size reaches 100 MB, a new data file is created in .d format with a corresponding timestamp.
A data agent application has been developed in the Java programming language, which automatically detects the new data file and moves it to the Hadoop cluster. A portion of the PMU measurements was processed by PDFA with different configuration settings in order to create a historical job profile (training dataset). Once the historical job profile was created, EPDFA invoked the GEP optimizer. The GEP optimizer has been implemented as a two-stage process. In the first stage, it utilizes the job profile and constructs the objective function as presented in Section 4.2. In the second stage, the GEP optimizer searches for an optimal configuration setting within the parameter search space, which is then configured in a physical Hadoop cluster for performance enhancement.

Experimental Results
The performance of the EPDFA was extensively evaluated from the aspects of both computational speedup and scalability. For this purpose, an experimental Hadoop cluster was set up using an Intel Xeon server machine configured with 8 VMs. In this section, we first give a brief introduction to the experimental environment that was used in the evaluation process and then present the experimental results.

Experimental Setup.
The experiments were performed on a Hadoop cluster using a high performance Intel Xeon server machine comprising 4 Intel Nehalem-EX processors running at 2.27 GHz each with 128 GB of physical memory. Each processor has 10 CPU cores with hyper-threading technology enabled in each core. The specification details of the server and the software packages are listed in Table 5. The Name Node was also used as a Data Node. The data block size of the Hadoop Distributed File System (HDFS) was set to 5 MB and the replication level of data blocks was set to 2. We varied the number of PMU data samples in the experiments.

Experimental Results.
In this section a comparison of the computational efficiency of the EPDFA is presented with that of the authors' previous work on the DFA [3] and PDFA [6] algorithms, respectively. A set of PMU data samples provided by National Grid, the National Electricity Transmission System Operator (NETSO) for GB, was used in the evaluation. The data comprised 6000 samples of frequency measurements at 50 Hz from a PMU, equating to 2 minutes' worth of system data. The data contained a known system event, the loss of a generator exporting approximately 1000 MW. In order to create a massive PMU data scenario, this dataset was replicated a number of times to provide up to 86.40 million PMU data samples. In order to evaluate the computational performance of the EPDFA, a number of experiments were carried out that varied the number of PMU data samples from 8.64 million to 86.40 million. The sequential DFA was run on a single VM, whereas both the PDFA and EPDFA were executed on 8 VMs. Furthermore, the PDFA was run using the default Hadoop configuration settings as shown in Table 1. The EPDFA was run on the GEP optimized Hadoop configuration settings. Table 6 lists a portion of the optimized Hadoop settings. Both the PDFA and the EPDFA were run 3 times each and the average execution time was obtained. Figure 3 shows the execution times of the sequential DFA, the PDFA, and the EPDFA on different numbers of PMU data samples. From Figure 3 it can be observed that the EPDFA performs better than both the DFA and PDFA. For example, the sequential DFA took 1690 minutes when processing 86.40 million samples, whereas the PDFA and EPDFA took 103 minutes and 58 minutes, respectively, when processing the same number of samples. It is worth noting that, because of the long execution times of the sequential DFA, it is hard to distinguish the PDFA from the EPDFA in Figure 3. For this purpose, Figure 4 is plotted over the number of data samples in order to clearly show that the EPDFA is computationally faster than the PDFA.
Based on the results presented in Figure 4, the computational speedup of the PDFA and the EPDFA when compared to the DFA can be calculated using, respectively,

$$\text{Speedup}_{\text{PDFA}}(n) = \frac{T^{n}_{\text{DFA}}}{T^{n}_{\text{PDFA}}}, \qquad \text{Speedup}_{\text{EPDFA}}(n) = \frac{T^{n}_{\text{DFA}}}{T^{n}_{\text{EPDFA}}},$$

where $n$ represents the number of PMU data samples in millions, $n \in [8.64, 86.40]$, $T^{n}_{\text{EPDFA}}$ is the execution time of the EPDFA on $n$ data samples, $T^{n}_{\text{DFA}}$ is the execution time of the DFA on $n$ data samples, and $T^{n}_{\text{PDFA}}$ is the execution time of the PDFA on $n$ data samples. The speedup results are shown in Figure 5. From Figure 5 it can be observed that, compared with the sequential DFA, the EPDFA achieved a maximum speedup of 29.03 when processing 86.40 million samples. By comparison, the PDFA achieved a maximum speedup of 17.33 when processing 69.12 million samples. On average, the EPDFA is 26 times faster than the DFA, whereas the PDFA is 16 times faster. Furthermore, the EPDFA is at most 1.87 times faster than the PDFA, as shown in Figure 6, and at least 1.08 times faster, the latter when processing 8.64 million samples.
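The speedup definition can be checked against the execution times quoted in the text for 86.40 million samples (1690, 103, and 58 minutes). This is a small sketch; the quoted times are whole minutes, so the ratios computed from them need not match the reported figures exactly.

```python
def speedup(t_reference, t_candidate):
    """Speedup of a candidate over a reference: ratio of execution times."""
    return t_reference / t_candidate


# Execution times in minutes quoted in the text for 86.40 million samples.
T_DFA, T_PDFA, T_EPDFA = 1690.0, 103.0, 58.0

if __name__ == "__main__":
    print(round(speedup(T_DFA, T_PDFA), 2))    # prints 16.41
    print(round(speedup(T_DFA, T_EPDFA), 2))   # prints 29.14
    print(round(speedup(T_PDFA, T_EPDFA), 2))  # prints 1.78
```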
The computational scalability of the EPDFA was also evaluated from the aspects of both VMs and PMU data samples. Figure 7 shows the execution times of the EPDFA in processing the 5 sets of PMU data samples with the number of VMs varied from 1 to 8. It can be observed that the execution time of the EPDFA decreases continuously as the number of VMs increases. Compared with the performance of the EPDFA running on a single VM, the EPDFA achieves the highest speedup on the largest number of data samples, which is also clearly indicated by Figure 8. For example, when processing 8.64 million samples, the EPDFA running on 6 VMs is 3.43 times faster than on a single VM, whereas it achieves a speedup of 4.83 on 8 VMs. Furthermore, when processing the 86.40 million samples, the EPDFA achieves a speedup of 4.68 on 6 VMs and 5.93 on 8 VMs.

Conclusion
Executing a Hadoop job using default parameter settings can lead to performance issues. In this paper we have presented EPDFA to improve Hadoop performance by automatically tuning its configuration parameters. The optimized Hadoop framework can be utilized for scalable analytics on massive PMU data. The EPDFA achieved a maximum computational speedup of 29.03 over the sequential DFA and was up to 1.87 times faster than a parallel DFA. At present the Hadoop framework is highly applicable to off-line scalable data analytics. However, the high processing overhead associated with input and output files limits the application of Hadoop to on-line analysis of PMU data streams. Further research will apply in-memory processing techniques [34] in order to enable real-time data stream analytics for power system applications.

Figure 1: An example of parameter correlation mining.

Figure 2: The software architecture of EPDFA.

Figure 5: Speedup of both the PDFA and EPDFA over DFA.

Figure 8: Speedup of the EPDFA over a single VM.
is executed in overlapping and nonoverlapping stages [24], which are ignored in the proposed performance model. As a result, the performance estimation of the proposed model may be inaccurate. Furthermore, the proposed model uses a dynamic instrumentation tool (BTrace) to collect the timing characteristics of tasks. BTrace utilizes extra CPU cycles that generate extra overheads, especially for CPU-intensive applications. As a result, the proposed model overestimates the execution time of a job.

Table 2: Mathematical functions used in GEP.
Function | Description | Input data type
... | ... | Positive integer or float
fmod(x, y) | returns the floating-point remainder of x/y (rounded towards zero) | Integer or float
pow10(x) | returns base 10 raised to the power exponent x | Integer or float
inv(x) | = 1/x | Integer or float
abs(x) | returns absolute value of parameter x | Integer
neg(x) | = −x | Integer or float

for Hadoop optimum configuration parameter settings by building a mathematical correlation among the configuration parameters. In this section we first describe the GEP technique and then present the implementation of the EPDFA algorithm with the Hadoop performance enhancement.

Table 4: Range of parameters.

VMs as the Data Nodes. The Name Node was also used as a Data Node. The data block size of the Hadoop Distributed File System (HDFS) was set to 5 MB and the replication level of data blocks was set to 2. We varied different numbers of PMU data samples in the experiments.

5.2. Experimental Results. In this section a comparison of the computational efficiency of the EPDFA is presented with

Table 6: GEP recommended configuration parameter settings.