Incremental Gene Expression Programming Classifier with Metagenes and Data Reduction



Introduction
Learning from the environment through data mining remains an important research challenge. Numerous approaches, algorithms, and techniques have been proposed in recent years to deal with data mining tasks. An important part of these efforts focuses on mining big datasets and data streams. Barriers posed by the sheer size of real-life datasets, on one side, and constraints on the resources available for performing the data mining task, including time and computational resources, on the other, are not easy to overcome. Additional complications, apart from the above-mentioned complexity issues, are often encountered due to nonstationary environments.
One of the most effective approaches to mining big datasets and data streams is using online or incremental learners. Online learning assumes dealing strictly with data streams. Online learners should have the following properties [1]:
(i) Single pass through the data.
(ii) Each example is processed very fast and in a constant period of time.
(iii) Any-time learning: the classifier should provide the best answer at every moment of time.
Incremental learning is understood as a slightly wider concept than online learning. Incremental learners can deal not only with data streams but also with big datasets stored in databases, for which the "one-by-one" or "chunk-by-chunk" approach could be more effective than traditional "batch" learners, even if no concept drift has been detected. An important feature of incremental learners is their ability to update the currently used model using only newly available data instances, without having to reprocess all of the past instances. In fact, using incremental learners is quite often the only possible way to extract any meaningful knowledge. Contemporary databases usually face a constant inflow of new data instances. Hence, the knowledge discovered in databases needs to be constantly updated, which is usually infeasible for classic learners. Data streams, and even stored datasets, may be affected by the so-called concept drift. In the above cases, online or incremental learners are needed.

Complexity
In the paper, we propose a new version of the incremental classifier based on Gene Expression Programming (GEP) with data reduction and a metagene as the final, upper-level classifier. Classifiers using GEP-induced expression trees are known to produce satisfactory or very good results in terms of classification accuracy. Our approach uses GEP-induced expression trees to construct learners able to deal with large datasets and with the concept drift phenomenon. The rest of the paper is organized as follows. In Section 2 a brief survey of the related results is offered. In Section 3 we describe a new version of the proposed approach. Section 4 contains a detailed description of the validating computational experiment and a discussion of its results, including suggestions on how to deal with real-life datasets through the Orthogonal Experiment Design technique. Section 5 includes conclusions and ideas for future research.

Related Work
To meet the required properties of online learners, several approaches and techniques have been proposed in the literature. The most successful ones include sampling, windowing, and drift detection. Sampling assumes using only some data instances, or some part of the instances, out of the available dataset. In [14] a random sampling strategy with a probabilistic removal of some instances from the training set was proposed. Later on, the idea was extended in [15]. Some more advanced sampling strategies were proposed in [16]. Effects of the sampling strategy on classification accuracy were investigated in [17].
As observed in the review [18], data sampling methods for machine learning have been investigated for decades. According to that review, recent progress has been made in methods that can be broadly categorized into random sampling, including density-biased and nonuniform sampling methods; active learning methods, which are a type of semisupervised learning; and progressive sampling methods, which can be viewed as a combination of the above two approaches.
Closely related to sampling is the sliding window model. A sliding window can be seen as a subset that runs over an underlying collection. Several versions of the approach can be found in [19][20][21]. The idea is that analysis of the data stream is based on recent instances only, and a limited number of data instances, usually equal to the window size, is used to induce a classifier. In machine learning, the concept can be used for incremental mining of association rules [22]. Another interesting application of the sliding window technique is known as high utility pattern mining [23].
For noisy environments or environments with a concept drift, the key question is when and how the current model should be adapted. Possible solutions include explicit drift detection models (see the survey by Ditzler et al. [24]) or explicit partitioning approaches (see, for example, [25]).
One of the most successful approaches to incremental mining of data streams is using drift detection techniques. The aim of drift detection is to identify changes in the statistical properties of the data distribution over time. Such changes are often referred to as concept drift. To minimize the deterioration of a learner's accuracy caused by concept drift, one can apply change detection tests and modify or replace the learner upon discovering the drift (see, for example, [26,27]). The above-described approach is known as an active solution, as opposed to a passive one, where the model is constantly retrained based on the most recent sample. More recently, several Extreme Learning Machine (ELM) approaches to incremental learning have been discussed. For example, [28] proposed a forgetting parameters concept named FP-ELM. Recent surveys on data stream mining can be found in [24,29].
Among incremental models, there are also those that exploit the power of ensemble classifiers. Ensemble learners involve a combination of several models. Their predictions can be combined in some manner, for example by averaging or voting, to arrive at the final prediction. Ensemble learners for data stream mining were proposed, among others, in [30][31][32][33][34].
One of the techniques used to construct incremental classifiers is Gene Expression Programming (GEP), which was introduced in [35]. In GEP, programs are represented as linear character strings of a fixed length, called chromosomes, which, in the subsequent fitness evaluation, evolve into expression trees without any user intervention. This feature makes GEP-induced expression trees a convenient model for constructing classifiers [36].
An improvement over the basic GEP classifiers can be achieved by combining GEP-induced weak classifiers into a classifier ensemble. In [37] two well-known ensemble techniques, bagging and boosting, were used to enhance the generalization ability of GEP classifiers. Yet another approach to building GEP-based classifier ensembles was proposed in [38]. The idea was to construct weak (base) classifiers from different subsets of attributes, controlling the diversity among these subsets by applying a variant of the niching technique. Further extensions and variants of GEP-induced ensemble classifiers were discussed in [39], where ideas of incremental learning and cluster-based learning were proposed. Approaches to constructing ensemble classifiers from GEP-induced weak classifiers were also studied in [40].

The Proposed Incremental GEP-Based Classifier
In this paper, we extend and improve the incremental GEP-based classifier proposed in [41]. In that paper, GEP was used to induce base classifiers, which serve to construct an ensemble of classifiers. Such an ensemble requires the application of some integration technique, for instance majority voting, bagging, or boosting. A review of the ensemble construction methods for online learning can be found in [42]. Alternatively, a metaclassifier can be constructed following the idea of stacked generalization [43]. In our case, such a metaclassifier is called a metagene. Our approach follows the steps proposed in [41] as far as the construction of base classifiers and the respective metagenes is concerned. The algorithm for learning the best classifier using GEP works as follows. Suppose that a training dataset is given and each vector in the dataset has a correct label representing the class. In the initial step, the minimal and maximal values of each attribute are calculated and a random population of chromosomes is generated. Each chromosome is composed of a single gene divided into two parts, as in the original head-tail method [35]. The size of the head ($h$) is determined by the user, with the suggested size not less than the number of attributes in the dataset. The size of the tail ($t$) is computed as $t = h + 1$. The size of the chromosome is $h + t = 2h + 1$.
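The tail size used here can be read as a special case of the general GEP rule for a function set whose largest arity is $n_{\max}$. The short derivation below is a sketch under that assumption; the symbol $n_{\max}$ is introduced only for illustration, and the numeric example with $h = 3$ is hypothetical.

```latex
% General GEP tail length for maximum function arity n_max:
t = h\,(n_{\max} - 1) + 1
% With Boolean functions of arity at most 2 (n_max = 2):
t = h\,(2 - 1) + 1 = h + 1, \qquad h + t = 2h + 1
% Hypothetical example: h = 3 gives t = 4 and a chromosome of 7 symbols.
```

The rule guarantees that, even if every head position holds a two-argument function, enough terminals remain in the tail to complete every branch.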
For each gene, the symbols in the head part are randomly selected from the set of functions AND, OR, NOT, XOR, and NOR and the set of terminals of type $(op, attrib, const)$, where the value of $const$ is in the range of attribute $attrib$ and $op$ is a relational operator. The symbols in the tail part are all terminals. In Figure 1 an example of a gene is given. The start position (position 0) in the chromosome corresponds to the root of the expression tree (OR, in the example). Then, below each function, branches are attached, and there are as many of them as the arity of the function, 2 in our case.
The following symbols in the chromosome are attached to the branches on a given level. The process is complete when each branch ends with a terminal. The number of symbols from the chromosome used to form the expression tree is called the termination point. For the discussed example, the termination point is 4; therefore further symbols are not meaningful and are denoted by ⋅ ⋅ ⋅ in Figure 1. The rule corresponding to the chromosome from Figure 1 is IF (attribute 1 > 0.57) OR NOT (attribute 10 ≤ 0.16) THEN Class 1.
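The level-by-level decoding described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the tuple encoding of terminals, the function names `decode` and `evaluate`, and the three filler terminals after the termination point are assumptions made for the example.

```python
import operator

# Arity of each Boolean function; terminals implicitly have arity 0.
ARITY = {"AND": 2, "OR": 2, "XOR": 2, "NOR": 2, "NOT": 1}
RELATIONS = {">": operator.gt, ">=": operator.ge,
             "<": operator.lt, "<=": operator.le}

def decode(chromosome):
    """Build the expression tree level by level.

    Returns the root node and the termination point (number of symbols used).
    """
    nodes = [{"symbol": s, "children": []} for s in chromosome]
    queue, used = [nodes[0]], 1
    while queue:
        node = queue.pop(0)
        for _ in range(ARITY.get(node["symbol"], 0)):
            child = nodes[used]
            used += 1
            node["children"].append(child)
            queue.append(child)
    return nodes[0], used

def evaluate(node, x):
    """Evaluate the tree for attribute values x (a mapping: attribute -> value)."""
    symbol = node["symbol"]
    if isinstance(symbol, tuple):  # terminal (op, attrib, const)
        op, attrib, const = symbol
        return RELATIONS[op](x[attrib], const)
    args = [evaluate(child, x) for child in node["children"]]
    if symbol == "NOT":
        return not args[0]
    if symbol == "AND":
        return args[0] and args[1]
    if symbol == "OR":
        return args[0] or args[1]
    if symbol == "XOR":
        return args[0] != args[1]
    return not (args[0] or args[1])  # NOR

# Head of size 3 followed by a tail of 4 terminals; the last three
# terminals are hypothetical filler that the decoding never reaches.
chromosome = ["OR", (">", 1, 0.57), "NOT", ("<=", 10, 0.16),
              (">", 2, 0.5), ("<", 3, 0.1), (">=", 4, 0.9)]
root, termination_point = decode(chromosome)
fires = evaluate(root, {1: 0.2, 10: 0.5})  # True means "Class 1"
```

Because decoding simply stops once every branch ends in a terminal, any symbols past the termination point are ignored without special handling.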
To introduce variation in the population, the following genetic operators are used: (i) mutation, (ii) transposition of insertion sequence elements (IS transposition), (iii) root transposition (RIS transposition), (iv) one-point recombination, and (v) two-point recombination.
Mutation can occur anywhere in the chromosome. We consider one-point mutation, which means that with a probability called the mutation rate, one symbol in a chromosome is changed. A functional symbol is replaced by another randomly selected function; otherwise, for a terminal $(op, attrib, const)$, a random relational operator $op'$, an attribute $attrib'$, and a constant $const'$ in the range of $attrib'$ are selected. Note that mutation can change the respective expression tree, since a function of one argument may be mutated into a function of two arguments or vice versa.
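A minimal sketch of the one-point mutation just described is given below. The operator set `OPERATORS`, the tuple encoding of terminals, and the helper name `mutate` are assumptions made for illustration.

```python
import random

FUNCTIONS = ["AND", "OR", "NOT", "XOR", "NOR"]
OPERATORS = ["<", "<=", ">", ">="]  # assumed set of relational operators

def mutate(chromosome, ranges, rate, rng):
    """With probability `rate`, change one randomly chosen symbol.

    `ranges[a]` is the (min, max) pair of attribute `a`.
    """
    child = list(chromosome)
    if rng.random() >= rate:
        return child
    pos = rng.randrange(len(child))
    if isinstance(child[pos], str):
        # A function is replaced by another randomly selected function.
        child[pos] = rng.choice([f for f in FUNCTIONS if f != child[pos]])
    else:
        # A terminal gets a new operator, attribute, and in-range constant.
        attrib = rng.randrange(len(ranges))
        low, high = ranges[attrib]
        child[pos] = (rng.choice(OPERATORS), attrib, rng.uniform(low, high))
    return child

rng = random.Random(42)
ranges = [(0.0, 1.0), (0.0, 10.0)]
parent = ["OR", (">", 0, 0.57), "NOT", ("<=", 1, 0.16),
          (">", 0, 0.3), ("<", 1, 4.0), (">=", 0, 0.9)]
child = mutate(parent, ranges, rate=1.0, rng=rng)
```

Since functions are only replaced by functions and terminals by terminals, the tail still contains terminals only and the mutated chromosome remains syntactically correct.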
Transposition stands for moving part of a chromosome to another location. Here we consider two kinds of transposable elements. In the case of transposition of an insertion sequence (IS), three values are randomly chosen: a position in the chromosome (the start of the IS), the length of the sequence, and the target site in the head, which is a bond between two positions. Then a cut is made in the bond defined by the target site and the insertion sequence is copied into the site of the insertion. The sequence downstream from the copied IS element loses, at the end of the head, as many symbols as the length of the transposon. Observe that since the target site is in the head, the newly created individual is always syntactically correct, though the tree can be reshaped quite dramatically. In the case of root transposition, a position in the head is randomly selected and the first function following this position is chosen; it is the start of the RIS element. If no function is found, then no change is performed. The length of the insertion sequence is chosen. The insertion sequence is copied at the root position and, at the same time, the last symbols of the head (as many as the RIS length) are deleted.
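Both transposition operators can be sketched on plain Python lists. The function names, the `is_function` predicate, and the single-letter stand-in symbols in the first example are illustrative assumptions.

```python
def is_transpose(chromosome, head_size, start, length, target):
    """Copy chromosome[start:start+length] into the head at bond `target`.

    The head keeps its size: symbols pushed past its end are dropped.
    """
    seq = chromosome[start:start + length]
    head, tail = chromosome[:head_size], chromosome[head_size:]
    new_head = (head[:target] + seq + head[target:])[:head_size]
    return new_head + tail

def ris_transpose(chromosome, head_size, position, length, is_function):
    """Copy a sequence starting at the first function at or after `position`
    in the head to the root; the last symbols of the head are deleted."""
    head, tail = chromosome[:head_size], chromosome[head_size:]
    starts = [i for i in range(position, head_size) if is_function(head[i])]
    if not starts:
        return list(chromosome)  # no function found: no change
    seq = chromosome[starts[0]:starts[0] + length]
    return (seq + head)[:head_size] + tail

# Letters stand in for arbitrary symbols; the head has 4 positions.
moved = is_transpose(list("abcdefghi"), head_size=4, start=5, length=2, target=1)

gene = ["t1", "OR", "t2", "NOT", "t3", "t4", "t5", "t6", "t7"]
rooted = ris_transpose(gene, head_size=4, position=2, length=2,
                       is_function=lambda s: s in {"OR", "NOT"})
```

In both cases the tail is left untouched and the head length is preserved, which is why the resulting chromosome stays syntactically correct.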
For both kinds of recombination, two parent chromosomes $p_1$, $p_2$ are randomly chosen and two new child chromosomes $c_1$, $c_2$ are formed. In the case of one-point recombination, one position is randomly generated and both parent chromosomes are split at this position into two parts. Child chromosome $c_1$ (respectively, $c_2$) is formed as containing the first part from $p_1$ (respectively, $p_2$) and the second part from $p_2$ (respectively, $p_1$). In two-point recombination, two positions are randomly chosen and the symbols between the recombination positions are exchanged between the two parent chromosomes, forming two new child chromosomes. Observe that again, in both cases, the newly formed chromosomes are syntactically correct no matter whether the recombination positions were taken from the head or the tail.
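The two recombination operators reduce to list slicing, as the following sketch shows. The function names and the letter-based parents are illustrative assumptions.

```python
def one_point(p1, p2, point):
    """Swap everything from `point` onward between the two parents."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def two_point(p1, p2, left, right):
    """Exchange the symbols between positions `left` and `right`."""
    c1 = p1[:left] + p2[left:right] + p1[right:]
    c2 = p2[:left] + p1[left:right] + p2[right:]
    return c1, c2

p1, p2 = list("AAAAAAA"), list("bbbbbbb")
c1, c2 = one_point(p1, p2, point=3)
d1, d2 = two_point(p1, p2, left=2, right=5)
```

Each position of a child is filled from the same position of one of the parents, so head positions stay in the head and tail positions stay in the tail.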
During GEP learning, individuals are selected and copied into the next generation based on their fitness, using roulette wheel sampling with elitism, which guarantees the survival and cloning of the best chromosome in the next generation.
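Selection with elitism can be sketched as below. The function name `next_generation`, the uniform fallback for an all-zero population, and the toy fitness (string length) are assumptions made for illustration.

```python
import random

def next_generation(population, fitness, rng):
    """Roulette wheel sampling; the best individual is always cloned."""
    scores = [fitness(ind) for ind in population]
    best = population[scores.index(max(scores))]
    # Selection probability is proportional to fitness; fall back to
    # uniform sampling if every individual has zero fitness.
    weights = scores if sum(scores) > 0 else None
    sampled = rng.choices(population, weights=weights, k=len(population) - 1)
    return [best] + sampled

population = ["a", "bbb", "cc", "dddd"]
new_population = next_generation(population, fitness=len, rng=random.Random(0))
```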
Further details on GEP operators and GEP learning can be found in [39,40,44].
For a fixed training set $TR$ and a fixed gene $g$, the fitness function counts the proportion of vectors from $TR$ classified correctly:

$$\mathrm{fit}_{TR}(g) = \frac{|\{x \in TR : g \text{ classifies } x \text{ correctly}\}|}{|TR|}, \quad (1)$$

where $g$ classifies $x$ correctly if $g(x)$ is true exactly when $x$ belongs to Class 1.

Having generated a population of genes, it is possible to create a population of metagenes, which corresponds to creating an ensemble classifier. The idea is as follows. Let $Pop$ be a population of genes, with each gene identified by its identifier. To create metagenes from $Pop$, we define the set of functions again as the Boolean ones above and set the terminals equal to the gene identifiers. For a fixed attribute vector $x$, each terminal (i.e., gene) has a Boolean value, and thus the value of the metagene can be computed. Figure 2 shows a metagene $mg$ that can be evaluated in this way for, e.g., $x = (1.2, 0.8, 2.5)$. Similarly as in (1), for a fixed set $TS$ and a fixed metagene $mg$, the fitness function counts the proportion of vectors from $TS$ classified correctly:

$$\mathrm{FIT}_{TS}(mg) = \frac{|\{x \in TS : mg \text{ classifies } x \text{ correctly}\}|}{|TS|}. \quad (2)$$

The incremental GEP classifier with metagenes works in rounds. In each round, one chunk of data is used to induce genes and another chunk to induce metagenes. Chunk size is one of the incremental classifier parameters. Its role is to control the frequency with which the model is updated, with a view to adapting to a possible concept drift. The main assumptions for such an approach are as follows:
(i) Class labels of instances belonging to the first and second chunks are known at the outset.
(ii) Class labels of instances belonging to chunk number 3, and to all the following chunks, are revealed immediately after the class of each instance has been predicted.
(iii) All instances except those belonging to the first two chunks are classified one by one in the "natural" order.

Based on the above assumptions, in [2], the following procedure was implemented. In each round, a chunk of training data $D_1$ is used to create a population of genes, the next chunk of data $D_2$ is used to create the population of metagenes and to choose the best-fitted metagene, denoted $mg$, and the following chunk $D_3$ is tested by metagene $mg$. In the next round, $D_1 := D_2$, $D_2 := D_3$, and the next chunk is used as $D_3$. For further comparisons, the incremental classifier from [2] is denoted as Inc-GEP1.
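The round structure of Inc-GEP1 can be sketched as a loop over three consecutive chunks. This is an illustration only: the GEP evolution of genes and metagenes is replaced by toy stand-ins (`induce_genes` returns three threshold rules and `induce_metagene` simply picks the best of them), and all names and data are hypothetical.

```python
def accuracy(classifier, chunk):
    """Proportion of (x, label) pairs classified correctly."""
    return sum(classifier(x) == label for x, label in chunk) / len(chunk)

def inc_gep1(chunks, induce_genes, induce_metagene):
    """Genes from one chunk, metagene from the next, test on the third;
    then the window of three chunks moves forward by one chunk."""
    accuracies = []
    for d1, d2, d3 in zip(chunks, chunks[1:], chunks[2:]):
        genes = induce_genes(d1)
        mg = induce_metagene(genes, d2)
        accuracies.append(accuracy(mg, d3))
    return accuracies

# Toy stand-ins for the GEP inducers (assumptions, not the paper's method).
def induce_genes(chunk):
    return [lambda x, t=t: x > t for t in (0.25, 0.5, 0.75)]

def induce_metagene(genes, chunk):
    return max(genes, key=lambda g: accuracy(g, chunk))

values = [[0.4, 0.6, 0.1, 0.9], [0.4, 0.6, 0.2, 0.8],
          [0.4, 0.6, 0.3, 0.7], [0.4, 0.6, 0.15, 0.85]]
chunks = [[(x, x > 0.5) for x in row] for row in values]
accuracies = inc_gep1(chunks, induce_genes, induce_metagene)
```

Note that the model is rebuilt in every round regardless of how well the current metagene performs, which is the source of the computational cost discussed next.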
Computational experiments confirmed that Inc-GEP1 performs quite well. Comparison with the state-of-the-art incremental classifiers showed that the approach outperforms, in the majority of cases, the existing solutions in terms of classification accuracy. Unfortunately, Inc-GEP1 suffers from a high demand on computational resources which, in many situations, might prevent it from mining data streams and datasets from the big data environment. One of the reasons is that Inc-GEP1 has not been equipped with any adaptation mechanism that would update the model only upon detecting a concept drift. Instead, the model is induced anew each time after classifying a chunk of instances.
To offer more flexibility and to shorten the computation time as compared with Inc-GEP1, we propose two measures. The first is an extensive data reduction option, and the second is an adaptation mechanism intended to decrease the number of required learner updates during computations. Following the idea of random sampling proposed for the classic (nonincremental) learners [41], in the proposed incremental learner the user can set the values of the following main parameters: the chunk size, the number of base classifiers, the number of attributes used to induce the base genes, the percent of instances used to induce the base genes, and the percent of instances used to induce metagenes. Each of the above options can be used to control, and effectively decrease or increase, the computation time of the whole process, including learning models and predicting class labels of the incoming instances. The chunk size determines how often the learner is updated. A smaller size results in a larger number of updates. In our case, this number can be decreased through the proposed adaptation mechanism described later in this section. The number of base classifiers used to induce metagenes influences the computation time needed to perform the job. A smaller number of base classifiers may, however, decrease the accuracy of the resulting metagenes. The number of attributes used to induce base genes should be smaller than the number of original attributes in each instance of the considered dataset. Once set, it results in randomly selecting as many attributes as required from the set of all data attributes. The random draw of attributes takes place each time one of the base classifiers is induced, so each base classifier is induced from its own randomly drawn combination of attributes. Setting the percent of instances used to induce the base genes and metagenes results in randomly sampling the chunks used to induce the base genes and metagenes, respectively. Such filtering diminishes the number of instances used to induce each of the base classifiers and each of the metagenes to the given percentage.
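The two reduction steps, drawing a subset of attributes and a percentage of instances, can be sketched as follows. The helper name `reduce_chunk`, the data layout (a list of `(attribute_tuple, label)` pairs), and the synthetic chunk are assumptions made for illustration.

```python
import random

def reduce_chunk(chunk, n_attributes, percent, rng):
    """Randomly keep `percent` % of the instances and draw `n_attributes`
    attribute indices; meant to be called anew for each base classifier."""
    n_rows = max(1, round(len(chunk) * percent / 100))
    rows = rng.sample(chunk, n_rows)
    n_total = len(chunk[0][0])  # number of attributes per instance
    attributes = sorted(rng.sample(range(n_total), n_attributes))
    return rows, attributes

# Synthetic chunk: 100 instances with 5 attributes and a Boolean label.
chunk = [((i, i + 1, i + 2, i + 3, i + 4), i % 2 == 0) for i in range(100)]
rows, attributes = reduce_chunk(chunk, n_attributes=2, percent=10,
                                rng=random.Random(1))
```

A gene induced from this reduced view would then use only terminals that refer to the drawn attribute indices.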
Apart from the data reduction measures, we also propose a simple adaptation mechanism that reduces unnecessary learner updates. After the first two data chunks have been used to induce the initial set of base classifiers and the current metagene ($mg$), the following scheme is used. Class labels of instances belonging to the third chunk $D_3$ are predicted using $mg$, and the average accuracy of class prediction for that chunk ($V_3$) is recorded. In the next step, $mg$ is used to predict class labels of the fourth chunk $D_4$, and the average accuracy of prediction $V_4$ is calculated and recorded. If $V_4 < V_3$, then the learner is updated using $D_3$ and $D_4$, producing a new current $mg$. Otherwise, the current metagene is used to predict class labels of instances belonging to the next incoming chunk. The procedure is repeated until instances in all chunks have been classified. Wherever the inequality $V_i < V_{i-1}$ holds, the current metagene is replaced by a new one induced using chunks $D_i$ and $D_{i-1}$. The above adaptation mechanism is denoted as ADAPT1. Alternatively, the second version of the adaptation mechanism, denoted as ADAPT2, can be used. Under ADAPT2 the current metagene is replaced by a newly induced one only after the average classification accuracy for two consecutive chunks is worse than the accuracy produced by the metagene induced for their predecessor chunk. The procedure using ADAPT1 is shown as Algorithm 3; the case of ADAPT2 is omitted, as it is similar. The incremental classifier with data reduction and the ADAPT1 mechanism is further on referred to as Inc-GEP2. The same classifier equipped with the ADAPT2 mechanism is further on referred to as Inc-GEP3.
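The ADAPT1 rule can be sketched as a loop that retrains only when the accuracy on the current chunk drops below that of the previous one. The callables `induce` and `accuracy` are hypothetical placeholders for the GEP inducers and the per-chunk accuracy, and the accuracy values in the example are synthetic, not results from the experiment.

```python
def run_adapt1(chunks, induce, accuracy):
    """induce(d_prev, d_curr) -> metagene; accuracy(mg, chunk) -> float."""
    mg = induce(chunks[0], chunks[1])  # initial model from the first two chunks
    history, updates = [], 0
    for i in range(2, len(chunks)):
        v = accuracy(mg, chunks[i])
        if history and v < history[-1]:
            # Accuracy dropped: rebuild from the previous and current chunk.
            mg = induce(chunks[i - 1], chunks[i])
            updates += 1
        history.append(v)
    return history, updates

# Synthetic per-chunk accuracies, keyed by chunk index, for illustration.
observed = {2: 0.90, 3: 0.80, 4: 0.85, 5: 0.85}
chunks = list(range(6))  # placeholders standing in for data chunks
history, updates = run_adapt1(chunks,
                              induce=lambda a, b: (a, b),
                              accuracy=lambda mg, c: observed[c])
```

In this toy run only the drop from 0.90 to 0.80 triggers an update, so four chunks are classified with a single retraining step instead of four.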
Procedures for inducing base classifiers and metagenes are shown as Algorithms 1 and 2, respectively. In both cases, the fitness function is the accuracy of the class label prediction calculated over the respective chunk of data.

Computational Experiment Results
To evaluate the performance of the proposed approach, we have carried out a computational experiment over a representative group of publicly available two-class benchmark datasets, including large datasets and datasets often used to test incremental learning algorithms. Datasets used in the experiment are shown in Table 1.
In Table 2 the experiment settings used in Inc-GEP2 and Inc-GEP3 are shown. There are 4 main parameters affecting the performance of the proposed classifiers. Chunk size refers to the number of instances classified one by one without interruption using the current metagene. The number of attributes refers to the number of randomly selected attributes used to induce each gene. Reduction rate reflects the percent of both instances used to induce genes and instances used to induce metagenes. Number of classifiers refers to the number of base classifiers (genes). The method of setting the values of the above parameters is explained later. Other settings, including the number of iterations in GEP (set at 100) and the probabilities of applying genetic operators (set as in [2]), have been the same throughout the whole experiment.
In Table 3 the mean classification accuracy of Inc-GEP1, Inc-GEP2, and Inc-GEP3 is shown. Accuracy and standard deviation have been calculated as mean values obtained over 20 runs with parameter settings as shown in Table 2. For Inc-GEP1 the chunk size and the number of attributes are the same as in the case of Inc-GEP2 and Inc-GEP3. In Inc-GEP1, however, there is no reduction with respect to the percentage of genes used to induce base classifiers and metagenes. Additionally, in Inc-GEP1 base classifiers and metagenes are induced using the full set of attributes.
Parameter values shown in Table 2 have been selected through the Orthogonal Experimental Design (OED) method. Since there are four main factors affecting classifier performance, it has been decided to use an L9 orthogonal array to identify the influence of 4 different independent variables on classifier performance. For each variable, 3 level values have been set. Selection of the level values was arbitrary, albeit based on common sense.
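For readers unfamiliar with the L9 array, the sketch below lists the standard L9 layout for four three-level factors and checks its defining property. The factor names follow the four parameters named above; the helper `is_orthogonal` and the mapping to `settings` are illustrative assumptions, and the concrete level values of the experiment are not reproduced here.

```python
from itertools import combinations

FACTORS = ["chunk size", "number of attributes",
           "reduction rate", "number of classifiers"]

# Standard L9 orthogonal array: 9 runs, 4 factors, levels 1..3.
L9 = [
    (1, 1, 1, 1), (1, 2, 2, 2), (1, 3, 3, 3),
    (2, 1, 2, 3), (2, 2, 3, 1), (2, 3, 1, 2),
    (3, 1, 3, 2), (3, 2, 1, 3), (3, 3, 2, 1),
]

def is_orthogonal(array):
    """Every pair of columns must contain each level pair exactly once."""
    for a, b in combinations(range(len(array[0])), 2):
        pairs = [(row[a], row[b]) for row in array]
        if len(set(pairs)) != len(pairs):
            return False
    return True

# One dictionary of factor levels per experimental run.
settings = [dict(zip(FACTORS, row)) for row in L9]
```

The array covers the four factors in 9 runs instead of the 81 runs a full factorial design with three levels per factor would need.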
The decision to use the OED method has been preceded by a comparison of mean classification accuracy values for each dataset and each of the 9 combinations of main factors under analysis. Thus, for each dataset, we had 9 groups of samples, each containing 10 classification accuracies obtained by running the considered classifier 10 times for each combination of factors. The one-way ANOVA, with the null hypothesis stating that samples in all groups are drawn from populations with the same mean values, has shown that, for all considered datasets with the exception of the Bank Marketing dataset, the null hypothesis should be rejected. This finding justifies searching for the best combination of factor values for each of the considered datasets.
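The F statistic behind such a one-way ANOVA can be computed directly from the group samples, as in the sketch below. The function name and the small integer groups are illustrative only; in the experiment described above there would be 9 groups of 10 accuracies each, and the resulting F would be compared with the critical value of the F distribution for the corresponding degrees of freedom.

```python
def one_way_anova_f(groups):
    """Return the F statistic: between-group over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy data, not experimental results.
f_value = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```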
The procedure of the orthogonal experiment and the selection of the parameter values are shown below using the example of the Sea dataset. A similar procedure has been applied to all considered datasets.
In Table 4 factor (term) levels for the orthogonal array used in the experiment with the Sea dataset are shown. In Table 5 response values representing classification accuracy using the Inc-GEP2 classifier are displayed. The first column shows factor level numbers. The next ten columns contain response values. The last column contains the average of the responses.
The response table for signal-to-noise ratios shown in Table 6 indicates that the number of attributes plays the key role in maximizing the ratio, while the response table for classification accuracy means shown in Table 7 indicates that the key factor in maximizing accuracy is the number of classifiers, followed by the number of attributes. The response table for signal-to-noise ratios contains a row for the average signal-to-noise ratio for each factor level, Delta, and rank. Delta is the difference between the maximum and minimum average response for the factor. The response table for means shows the size of the effect by taking the difference between the highest and lowest characteristic average for a factor. Ranks in a response table allow one to quickly identify which factors have the largest effect. All factors, however, have statistically significant effects on the response. This is confirmed by the main effects plot for means shown in Figure 3. A main effects plot is constructed by plotting the means for each value of a variable. A line connects the points for each variable. When the line is horizontal (parallel to the x-axis), there is no main effect present: the response mean is the same across all factor levels. On the other hand, when the line is not horizontal, there is a main effect present and the response mean is not the same across all factor levels. The steeper the slope of the line, the greater the magnitude of the main effect. As the data in Table 5 indicate, the best combination of factor levels for the Sea dataset is a window (chunk) size of 2500, 2 attributes for inducing base classifiers, 10% of instances used to induce genes and, respectively, metagenes, and 30 classifiers. A similar analysis has been performed for all considered datasets with a view to finding the best combination of parameter (factor) values.
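A response table of the kind described above can be computed from an orthogonal array and one response per run, as sketched below. The responses are generated by a synthetic formula purely to make the outcome predictable; they are not values from Table 5. The "larger is better" form of the signal-to-noise ratio is an assumption, since the text does not state which form was used.

```python
import math

L9 = [
    (1, 1, 1, 1), (1, 2, 2, 2), (1, 3, 3, 3),
    (2, 1, 2, 3), (2, 2, 3, 1), (2, 3, 1, 2),
    (3, 1, 3, 2), (3, 2, 1, 3), (3, 3, 2, 1),
]

def sn_larger_is_better(values):
    """Taguchi signal-to-noise ratio when larger responses are better."""
    return -10 * math.log10(sum(1 / v ** 2 for v in values) / len(values))

def response_table(array, responses):
    """Per factor: mean response per level, Delta = max - min, and rank
    (rank 1 is the factor with the largest Delta)."""
    table = []
    for f in range(len(array[0])):
        by_level = {}
        for row, r in zip(array, responses):
            by_level.setdefault(row[f], []).append(r)
        means = {lv: sum(v) / len(v) for lv, v in sorted(by_level.items())}
        table.append({"means": means,
                      "delta": max(means.values()) - min(means.values())})
    order = sorted(range(len(table)), key=lambda f: table[f]["delta"],
                   reverse=True)
    for rank, f in enumerate(order, start=1):
        table[f]["rank"] = rank
    return table

# Synthetic responses: dominated by factor 4, slightly affected by factor 2.
responses = [10 * row[3] + row[1] for row in L9]
table = response_table(L9, responses)
```

The per-level means in `table[f]["means"]` are exactly the points that a main effects plot connects with a line for factor `f`.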
Orthogonal array analysis can also be carried out with respect to computation times. For example, in the case of the Sea dataset, the respective response table for computation time means indicates that the window size and the number of classifiers used to construct the ensemble play the key role in minimizing computation time. The respective main effects plot, displaying how the considered factors affect computation times for the Sea dataset, is shown in Figure 4. In this figure, "means" refers to times in seconds needed to classify a single instance. In Table 8 a comparison between mean computation times for all the considered datasets and for the parameter settings from Table 2 is shown. The respective values refer to times in seconds needed to classify 100 instances by the considered algorithms run on a Dell Precision 3520 workstation with a Xeon processor and 16 GB RAM. Columns Speed-up1 and Speed-up2 contain speed-up factors comparing Inc-GEP1 with Inc-GEP2 and Inc-GEP1 with Inc-GEP3, respectively. As can be observed from Table 8, there are significant differences in the computation times needed to run the algorithms under comparison. On average, the proposed Inc-GEP2 classifier is over 2 times faster than the incremental Gene Expression Programming classifier with metagenes without data reduction (Inc-GEP1). Moreover, the proposed Inc-GEP3 classifier is, on average, over 7 times faster than the control algorithm Inc-GEP1. To properly evaluate both Inc-GEP2 and Inc-GEP3, one also has to evaluate their performance in terms of classification accuracy. Assuming equal variances, one-way ANOVA shows that, at the 0.05 significance level, the null hypothesis stating that all three mean accuracies are equal cannot be rejected. Hence, there is no statistical evidence that the considered means differ. The above finding is confirmed by Fisher and Tukey tests.
From Table 9 it can be seen that the proposed classifiers perform well and are competitive with several other approaches. In several cases, GEP-based incremental classifiers outperform earlier available solutions.

Conclusions
The main contribution of the paper is the incremental Gene Expression Programming classifier with metagenes and data reduction. The concept of metagenes increases the classification accuracy, while data reduction allows controlling the computation time. The proposed approach extends the earlier incremental GEP-based classifier [2]. Additionally, the extended version contains a simple drift detection mechanism that allows dealing more effectively with data streams.
Another important novelty introduced in the paper is using the Orthogonal Experimental Design principles to set the classifier parameter values. The approach makes it easy to evaluate the statistical importance of the main parameters (factors): main effects plots and the respective response tables show the key factors and their influence on classifier performance and signal-to-noise ratios.
An extensive computational experiment confirms that the proposed classifier offers better performance with respect to the required computation times as compared with its earlier version. At the same time, it provides similar results in terms of classification accuracy. The algorithm also offers scalability through the possibility of adjusting computation times to the user's needs, which might be a useful feature even at the cost of a possibly slightly lower classification accuracy.
Comparison of the proposed GEP-based incremental classifiers with state-of-the-art incremental classifiers reported in the literature, in terms of the mean classification accuracy, indicates that our approach offers quite satisfactory solutions, outperforming the existing methods in many cases. The proposed approach can be useful in data analytics and big data processing, where single-pass limited-memory models enabling the treatment of big data within a streaming setting are increasingly needed [45].
Future research will concentrate on incorporating more sophisticated drift detection mechanisms and on further improving efficiency by implementing the algorithm in a parallel environment.
(i) Chunk size ($ch$)
(ii) Number of the base classifiers
(iii) Number of attributes used to induce the base genes
(iv) Percent of instances used to induce the base genes
(v) Percent of instances used to induce metagenes

Figure 3: Main effects plot for classification accuracy means: Sea dataset.

Figure 4: Main effects plot for classification time means: Sea dataset.
Input: chunk $D$, number of base classifiers, number of attributes, percent of instances
Input: dataset $DS$, chunk size $ch$, number of base classifiers
Output: overall prediction accuracy
/* induce base classifiers using the first chunk and the best metagene using the second chunk */
(1) $D_1$ ← first $ch$ rows from $DS$
(2) $D_2$ ← next $ch$ rows from $DS$
(3) apply Algorithm 1 to $D_1$ to induce the base classifiers
(4) apply Algorithm 2 to $D_2$ and the base classifiers to induce metagene $mg$
(5) $D_3$ ← next $ch$ rows from $DS$
(6) $V_3$ ← accuracy of classification performed on $D_3$ by metagene $mg$

Table 1: Benchmark datasets used in the experiment. The table is reproduced from [2] (under the Creative Commons Attribution License/public domain).

Table 6: Response table for signal-to-noise ratios: Sea dataset.

Table 7: Response table for means: Sea dataset.
In Table 9 a comparison of the proposed GEP-based incremental classifiers with state-of-the-art incremental classifiers reported in the literature, in terms of the mean classification accuracy, is shown. The abbreviations used for incremental classifiers are as follows: FTDD, Fisher Test Drift Detection; IncSVM, Incremental SVM; EDDM, Early Drift Detection Method; IncN-B, Incremental Naïve Bayes; KFCM, Online distance based classifier with Kernel Fuzzy C-means; IncEnsemble, Incremental Ensemble; and FISH, Unified Instance Selection Algorithm.

Table 9: Comparison of the proposed GEP-based incremental classifiers with incremental classifiers reported in the literature in terms of the mean classification accuracy.