As one of the most effective function mining algorithms, Gene Expression Programming (GEP) has been widely used in classification, pattern recognition, prediction, and other research fields. Through self-evolution, GEP is able to mine an optimal function for dealing with complicated tasks. However, in big data research, GEP suffers from low efficiency due to its long mining process. To improve the efficiency of GEP in big data research, especially for processing large-scale classification tasks, this paper presents a parallelized GEP algorithm using the MapReduce computing model. The experimental results show that the presented algorithm is scalable and efficient for processing large-scale classification tasks.
In recent years, Gene Expression Programming (GEP) [
As an effective data analysis approach, classification has been studied extensively. Classification algorithms, especially supervised ones such as artificial neural networks (ANNs), show remarkable classification abilities. However, ANNs are fundamentally also function-fitting algorithms, although they cannot output the fitted functions explicitly as GEP does. This observation suggests that GEP can also be employed to deal with supervised classification tasks using the following idea:
(1) let the training data be …;
(2) train the GEP algorithm using …;
(3) input the to-be-classified data … .
It also motivates us that, based on the works [
Unfortunately, several works [
The rest of the paper is organized as follows: Section
As an effective function mining algorithm, GEP has been widely applied in numerous research fields. Sabar et al. [
However, several works [
Another effective way of solving the efficiency issue of GEP is to use parallel computing or distributed computing. Du et al. [
The improvement of GEP presented in this paper mainly focuses on parallelizing the GEP algorithm for large-scale classification. Our algorithm first employs the Hadoop framework [
Based on selection, crossover, mutation, and fitness evaluation, GEP is able to mine a function from the given dataset; a minimal sketch of this evolutionary loop is given below. Therefore, let
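To make the evolutionary loop concrete, the following is a minimal Python sketch of a GEP-style mining loop. It is an illustration under simplifying assumptions, not the paper's implementation: the chromosome encodes polynomial coefficients rather than a Karva-expressed gene string, and the helper names (`express`, `fitness`, `evolve`) are hypothetical. The mutation and one-point recombination rates follow the GEP parameter table given later.

```python
import random

POP_SIZE, GENERATIONS, CHROM_LEN = 100, 50, 3
MUTATION_RATE, RECOMBINATION_RATE = 0.044, 0.4  # rates from the GEP parameter table

def express(chrom, x):
    """Decode a chromosome into a function value at x (simplified: a polynomial)."""
    return sum(c * x ** i for i, c in enumerate(chrom))

def fitness(chrom, data, threshold=0.5):
    """Hits-based fitness: count training pairs predicted within the threshold."""
    return sum(1 for x, y in data if abs(express(chrom, x) - y) <= threshold)

def evolve(data):
    """Mine a function from the dataset via selection, recombination, and mutation."""
    pop = [[random.uniform(-1, 1) for _ in range(CHROM_LEN)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        scored = sorted(pop, key=lambda c: fitness(c, data), reverse=True)
        nxt = [chrom[:] for chrom in scored[:2]]              # elitism: keep the two best
        while len(nxt) < POP_SIZE:
            a, b = random.sample(scored[:POP_SIZE // 2], 2)   # selection from the fitter half
            cut = random.randrange(1, CHROM_LEN)              # one-point recombination
            child = a[:cut] + b[cut:] if random.random() < RECOMBINATION_RATE else a[:]
            child = [g + random.gauss(0, 0.1) if random.random() < MUTATION_RATE else g
                     for g in child]                          # mutation
            nxt.append(child)
        pop = nxt
    return max(pop, key=lambda c: fitness(c, data))           # the mined function

# Example: mine an approximation of y = x^2 on a toy dataset.
best = evolve([(x, x * x) for x in range(-5, 6)])
```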
The MapReduce computing model contributes two main functions, Map and Reduce, to facilitate the development of distributed computing applications. The Map function executes the main computation, and the Reduce function collects the intermediate outputs of the Maps and generates the final output. Each Map processes the data instances one by one in the form of key-value pairs, as simulated in the sketch below.
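As an illustration of this key-value flow, the small Python simulation below runs Map over each record, groups the intermediate pairs by key (the role of Hadoop's shuffle), and feeds each group to Reduce. The word-count job is a generic stand-in, not the paper's job, and Hadoop performs the grouping across machines rather than in memory.

```python
from collections import defaultdict

def map_fn(key, value):
    # Emit intermediate (key, value) pairs; here: one pair per word.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # Aggregate all intermediate values that share the same key.
    yield key, sum(values)

def run_mapreduce(records):
    grouped = defaultdict(list)
    for k, v in records:                      # Map phase
        for ik, iv in map_fn(k, v):
            grouped[ik].append(iv)            # shuffle: group by intermediate key
    return [out for k, vs in grouped.items()  # Reduce phase
            for out in reduce_fn(k, vs)]

print(run_mapreduce([(0, "a b a"), (1, "b c")]))  # -> [('a', 2), ('b', 2), ('c', 1)]
```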
Hadoop framework [
The structure of Hadoop framework.
In the training phase, let
In the classification phase, the testing dataset
Bootstrapping is based on the idea of controlling the number of times that the training instances appear in the bootstrapping samples. The procedure works as follows (a sketch is given after this list):
(1) construct a string of the instances …;
(2) generate a random permutation …;
(3) repeat step … .
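One plausible reading of this controlled-repetition procedure, with the elided symbols filled in as assumptions (`k` repetitions of every instance, `m` resulting samples), is sketched below; `controlled_bootstrap` is a hypothetical helper name.

```python
import random

def controlled_bootstrap(train, k, m):
    """Build m samples in which every training instance appears exactly k times."""
    pool = list(train) * k            # step 1: a string of the instances, repeated k times
    random.shuffle(pool)              # step 2: a random permutation of the string
    size = len(pool) // m             # step 3: slice the permutation into m samples
    samples = [pool[i * size:(i + 1) * size] for i in range(m)]
    samples[-1].extend(pool[m * size:])   # keep any remainder in the last sample
    return samples
```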
Based on bootstrapping, the instance distribution of the original dataset can be approximated in the samples, so that more of the original data information is preserved. Majority voting is a commonly used combination technique. The ensemble classifier predicts, for a test instance, the class that is predicted by the majority of the base classifiers [
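A minimal sketch of the majority voting combination, assuming the base classifiers output plain class labels:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class predicted by the majority of the base classifiers."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote([1, 3, 1, 2, 1]))  # -> 1
```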
By employing bootstrapping and majority voting, the parallelized GEP classification algorithm works as follows.
In the training phase:
(1) the algorithm first generates a number of bootstrapping samples as data chunks;
(2) each mapper initializes a sub-GEP and inputs one data chunk from HDFS;
(3) each mapper trains its sub-GEP according to step …;
(4) as each mapper's sub-GEP mines its function individually, a number of mined functions are finally obtained.
A simulated sketch of this phase follows.
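Reusing the hypothetical `evolve` and `controlled_bootstrap` helpers from the earlier sketches, the training phase can be simulated as below; a real deployment would run one Hadoop mapper per data chunk on the DataNodes (each reading its chunk from HDFS) rather than a local process pool.

```python
from multiprocessing import Pool

def train_mapper(chunk):
    # Each mapper initializes its own sub-GEP and mines a function
    # independently from its bootstrapping chunk.
    return evolve(chunk)

def parallel_train(train_data, k=4, m=11):
    chunks = controlled_bootstrap(train_data, k, m)  # one data chunk per mapper
    with Pool(processes=m) as pool:                  # local stand-in for m mappers
        return pool.map(train_mapper, chunks)        # m independently mined functions
```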
Figure
Training phase in the classification.
In the classification phase:
(1) each mapper retrieves one function …;
(2) all the mappers then input the same instance …;
(3) in each mapper, when …, the …;
(4) one reducer collects all the intermediate outputs from all the mappers ….
A simulated sketch of this phase follows.
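The classification phase can be simulated in the same style. The nearest-coded-class decision rule below is an assumption inferred from the "Threshold" and "Coded classes" entries in the GEP parameter table, and `express` is the hypothetical helper from the earlier sketch.

```python
from collections import Counter

def classify_mapper(fn, instance_id, x, classes=(1, 2, 3)):
    # Each mapper applies its mined function to the same test instance and
    # emits an intermediate (instance_id, predicted_class) pair.
    value = express(fn, x)
    label = min(classes, key=lambda c: abs(value - c))  # nearest coded class
    return instance_id, label

def classify_reducer(instance_id, labels):
    # One reducer collects the intermediate outputs and majority-votes them.
    return instance_id, Counter(labels).most_common(1)[0][0]

def parallel_classify(functions, test_set):
    results = {}
    for i, x in enumerate(test_set):
        labels = [classify_mapper(fn, i, x)[1] for fn in functions]
        results[i] = classify_reducer(i, labels)[1]
    return results
```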
Figure
The classification of the presented algorithm.
In order to evaluate the algorithm performance, a physical Hadoop cluster consisting of one NameNode and four DataNodes is established. The details of the cluster are listed in Table
The cluster details.
Item | Details |
---|---|
NameNode | CPU: Core i7 @ 3 GHz |
DataNodes | CPU: Core i7 @ 3.8 GHz |
Network bandwidth | 1 Gbps |
Hadoop version | 2.3.0 |
The datasets employed in the experiments are the Iris dataset [
The dataset details.
Type | Iris | Wine |
---|---|---|
Dataset characteristics | Multivariate | Multivariate |
Instance number | 150 | 178 |
Attribute number | 4 | 13 |
Class number | 3 | 3 |
The parameters of GEP used in the experiments are listed in Table
The details of the parameters of GEP.
Parameter | Value |
---|---|
Number of genes | 2 (Iris) / 4 (Wine) |
Linking function | + |
Head length | 6 |
Function set | |
Fitness | |
Terminal set | |
Population size | 100 |
Mutation rate | 0.044 |
IS transposition rate | 0.1 |
RIS transposition rate | 0.1 |
Gene transposition rate | 0.1 |
One-point recombination rate | 0.4 |
Two-point recombination rate | 0.2 |
Gene recombination rate | 0.1 |
Threshold | 0.5 |
Coded classes | 1, 2, and 3 |
Let
In the following tests, increasing numbers of instances are selected from the two datasets as training instances, whilst the remaining instances serve as testing instances. The bootstrapping number is four, and the algorithm starts eleven mappers for executing the parallelized GEP classification. The experimental results are shown in Figures
The precision comparison for classifying Iris dataset.
The precision comparison for classifying Wine dataset.
The precision comparison for classifying Iris dataset using parallel GEP and BPNN.
The precision comparison for classifying Wine dataset using parallel GEP and BPNN.
Precision of the classification with increasing bootstrapping numbers.
Algorithm efficiency comparison of increasing training data sizes.
Comparison of the processing time of the parallel BPNNs and the parallel GEP with increasing training data sizes.
The classification result of the standalone GEP using Iris dataset.
The Iris dataset classification results of the eleven sub-GEPs.
The classification result of the standalone GEP using Wine dataset.
The Wine dataset classification results of the eleven sub-GEPs.
Figure
The Wine dataset has also been employed to evaluate the classification accuracy. Compared to the Iris dataset, each instance of the Wine dataset has 13 attributes, which may impact the classification accuracy. The experimental result is shown in Figure
Figure
To further evaluate the effectiveness of the proposed algorithm, we also implemented a backpropagation neural network (BPNN). The comparisons of the classification accuracy are shown in Figure
Figure
Figure
It should be noted that the bootstrapping number, which represents the number of times the training instances appear in the bootstrapping samples, also impacts the algorithm accuracy. Therefore, Figure
In Figure
In this section, the Wine dataset is selected as the experimental dataset, and the algorithm processing time for increasing training data sizes has been evaluated. In the following tests, the bootstrapping number is initially 4, which means each training instance appears 4 times. The number of training instances is 118, whilst the remaining 60 instances are used for testing. The training data size is then duplicated from approximately 0.5 MB to 1024 MB. It should be pointed out that, because of the duplication, the bootstrapping number will change from 4 to
Figure
To further compare the classification efficiency with other classification algorithms, the MapReduce-based parallel backpropagation neural network algorithms (MRBPNN 1, 2, and 3) [
Figure
This paper presents a MapReduce and ensemble techniques based parallel Gene Expression Programming algorithm for enabling large-scale classification. The parallelization of GEP mainly focuses on parallelizing the training phase (the function mining phase), which is the most time-consuming and computationally intensive process. The experimental results show that the presented algorithm outperforms the standalone GEP and BPNN in terms of classification accuracy. In the execution time evaluations, the presented parallel GEP also shows remarkable performance compared to the standalone GEP. Although the parallel GEP works more slowly than MRBPNN 1 and 2, it supplies higher classification accuracy, which makes the presented parallel GEP an effective tool for dealing with large-scale classification.
In the appendix, the details of classifying the Iris and Wine datasets are listed. Figure
Figure
The Iris dataset classification results of the eleven sub-GEPs employed by parallel GEP are shown in Figure
Figure
Figure
In this case, the mined function
The Wine dataset classification results of the eleven sub-GEPs employed by parallel GEP are shown in Figure
The eleven mined functions are listed as follows.
The authors declare that there is no conflict of interest regarding the publication of this article.
The authors would like to acknowledge the support of the National Natural Science Foundation of China (no. 51437003).