Using Genetic Programming with Prior Formula Knowledge to Solve Symbolic Regression Problem

A researcher can quickly infer the mathematical expressions of functions by using professional knowledge (called Prior Knowledge). However, the results may be biased and restricted to the researcher's own field because of the limits of that knowledge. In contrast, the Genetic Programming (GP) method can discover fitted mathematical expressions in a huge search space by running evolutionary algorithms, and its results can be generalized to accommodate different fields of knowledge. However, because GP has to search a huge space, it finds results rather slowly. Therefore, in this paper, a framework connecting Prior Formula Knowledge and GP (PFK-GP) is proposed to reduce the space that GP searches. The PFK is built on the Deep Belief Network (DBN), which can identify candidate formulas that are consistent with the features of experimental data. By using these candidate formulas as the seed of a randomly generated population, PFK-GP finds the right formulas quickly by exploring the search space of data features. We have compared PFK-GP with Pareto GP on the regression of eight benchmark problems. The experimental results confirm that PFK-GP can reduce the search space and significantly improve the quality of symbolic regression (SR).


Introduction
Symbolic regression (SR) is used to discover mathematical expressions of functions that fit the given data according to the criteria of accuracy, simplicity, and generalization. Unlike linear or nonlinear regression, which efficiently optimizes the parameters of a prespecified model, SR seeks appropriate models and their parameters simultaneously in order to gain better insight into the dataset. Without any prior knowledge of physics, kinematics, or geometry, some natural laws described by mathematical expressions, such as Hamiltonians, Lagrangians, and other laws of geometric and momentum conservation, can be distilled from experimental data by applying the Genetic Programming (GP) method to SR [1].
Since SR is an NP-hard problem, several evolutionary algorithms have been proposed to find approximate solutions, such as Genetic Programming (GP) [2], Gene Expression Programming (GEP) [3], Grammatical Evolution (GE) [4,5], Analytic Programming (AP) [6], and Fast Evolutionary Programming (FEP) [7]. Moreover, recent research on the SR problem has taken machine learning (ML) algorithms into account [8][9][10]. All of the above algorithms generate the candidate population randomly, and none of them can use the features of known functions to construct mathematical expressions adapted to the features of the given data. Therefore, these algorithms may have to explore a huge search space that consists of all possible combinations of functions and their parameters.
Nevertheless, a researcher always analyzes data, infers mathematical expressions, and obtains results according to his professional knowledge. After getting experimental data, he observes the data distribution and its features and analyzes them with his knowledge. Then, he tries to create some mathematical models based on natural laws. He can obtain the values of the coefficients in these models through regression analysis or other mathematical methods, and he evaluates the resulting formulas, that is, the mathematical models with these coefficient values, by using various fitness functions. If the researcher finds formulas that fit the experimental data, he can transform and simplify them and then obtain the final formula that represents the data. Furthermore, his rich experience and knowledge help him to reduce the complexity of the search space so that he can find the best-fit mathematical expression rapidly.
Computational Intelligence and Neuroscience
Because researchers use their knowledge to discover the best-fitted formulas, methods that inject domain knowledge into the process of solving SR problems have been proposed to improve performance and scalability on complex problems [11][12][13]. The domain knowledge, which is created manually from the researcher's intuition and experience, consists of various formulas that are prior solutions to particular problems. If the domain knowledge could automatically generate fitted formulas for use in evolutionary search, without the researcher's involvement, SR problems would be solved faster. A key challenge is how to build and utilize the domain knowledge just as the researcher does.
In this paper, we present a framework connecting Prior Formula Knowledge and GP (PFK-GP) to address this challenge:
(i) We divide a researcher's domain knowledge into the PFK Base (PFKB) and inference ability, after analyzing the process by which a researcher discovers formulas from experimental data (Section 2). PFKB contains two primary functions: classification and recognition. The aim of the two functions is to generate feature functions that can represent the features of experimental data.
(ii) In order to implement classification, we use the deep learning method DBN [14,15], which, compared with shallow learning methods (Section 3.1), can classify experimental data into a specific mathematical model that is consistent with the data features. However, classification alone may lead to overfitting, because it can only categorize experimental data into known formula models that come from the set of training formula models.
(iii) Therefore, recognition is used to overcome the overfitting. It can extract mathematical models of functions that show all or part of the features of experimental data. Three algorithms, GenerateFs, CountSamePartF, and CountSpecU (see Algorithms 2, 3, and 4), are designed to implement recognition. For example, from the dataset generated by $f(x) = \exp(\sin(x) + x^3/(8 \times 10^5))$, the basic functions sin, exp, and cube can be found by the above three algorithms. In Figure 1, the function sin shows the periodicity of the data, and exp or cube shows the growth rate of the data. Therefore, these basic functions (called feature functions) can describe some features of the dataset.
(iv) The inference ability corresponds to the search ability of the evolutionary algorithm. Just as researchers infer mathematical models, GP is used to combine, transform, and verify these models. The feature functions generated by PFKB are selected and combined into the candidate population by the algorithm randomGenP (see Algorithm 5).
With this candidate population, GP can converge quickly because it searches for answers in a limited space that is composed of various feature functions. Experiments on eight benchmark problems (models 1-8 in Table 5) demonstrate that PFK-GP, compared with Pareto optimization GP (PO-GP) [16,17], shows a significant improvement in accuracy and convergence.

Definition and Representation of Mathematical Expression.
In this section, we define the concepts of the SR problem and show how to represent them with BNF expressions. For the SR problem, the word "formula" is the general term for a mathematical expression that fits the given data. We define a formula model as a special mathematical model in which the formulas have the same relationships and variables and differ only in their coefficient values. Relationships can be described by operators, such as algebraic operators, functions, and differential operators (http://en.wikipedia.org/wiki/Mathematical_model). Therefore, a formula model is a set in which each element is a formula. For example, the two formulas $0.1 \sin(x) + 0.7 \log(x)$ and $0.3 \sin(x) + 0.9 \log(x)$ belong to the formula model $a_1 \sin(x) + a_2 \log(x)$. Because the formulas in one formula model have the same relationships, the data they represent may have similar features, such as data distributions, relationships between different variables, and laws of data change.
$B$ is a set of binary functions, while $U$ is a set of unary functions. A further set contains the atomic functions, which do not contain any subfunctions.
The last set contains the complex functions, each of which is built from other complex functions and atomic functions. With the above definitions, any formula and its corresponding model can be expressed in BNF. For instance, the formula $\exp(\sin(x) + x^3/(8 \times 10^5))$ is represented as a complex function, and its subfunction $\sin(x)$ is represented as an atomic function. The constants 8, 3, and $10^5$ are elements of the constant set. With these expressions, a formula model can be transformed into a tree, and the tree is a candidate individual in the population of GP when solving an SR problem. Every subtree of the tree is a subformula that can show particular data features. A subtree that shows features of the experimental data is called a feature-subtree. The more feature-subtrees a tree has, the more likely it is to fit the data. How to construct a tree consisting of feature-subtrees is a key step in our method, which is implemented by the algorithm randomGenP (see Algorithm 5).
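The tree representation described above can be illustrated with a minimal sketch. The nested-tuple encoding, the operator tables, and the helper names `evaluate` and `subtrees` are assumptions made for this example and are not taken from the paper's implementation.

```python
import math

# Assumed operator tables: B holds binary functions, U holds unary functions.
B = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
     "*": lambda a, b: a * b, "/": lambda a, b: a / b}
U = {"sin": math.sin, "exp": math.exp, "cube": lambda a: a ** 3}

# exp(sin(x) + x^3 / (8 * 10^5)) encoded as (operator, child, ...) tuples.
FORMULA = ("exp", ("+", ("sin", "x"),
                   ("/", ("cube", "x"), ("*", 8.0, 1e5))))


def evaluate(tree, x):
    """Evaluate an expression tree at the point x."""
    if tree == "x":
        return x
    if isinstance(tree, (int, float)):
        return float(tree)
    op, *args = tree
    values = [evaluate(arg, x) for arg in args]
    return U[op](*values) if op in U else B[op](*values)


def subtrees(tree):
    """Yield every non-leaf subtree, the tree itself included."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from subtrees(child)
```

Each tuple yielded by `subtrees(FORMULA)` is a subformula in the sense used above; a feature-subtree would be one of these that matches a feature of the data.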

The Process of Researcher Analyzing Data.
The process by which a researcher tries to solve SR problems is shown in Figure 2. He depends heavily on his experience, which is obtained through long-term accumulation of study and research. After collecting experimental data, the researcher discovers regular patterns in the data by using methods of data analysis and visualization. He then constructs formula models that are consistent with these regular patterns according to his experience. After that, he computes the coefficient values in the formula models by using appropriate regression methods and obtains some formulas from the different formula models. According to the results of evaluating these formulas, he chooses the formula that best fits the data. If the formula cannot represent the data features, he selects a new formula model and repeats the above steps until a fitting formula is found.
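The workflow above (propose formula models, fit their coefficients, keep the best formula) can be sketched in a few lines. This is a minimal illustration, assuming two-coefficient models of the form y = a*g1(x) + b*g2(x) and an ordinary least-squares fit; the names `MODELS`, `fit`, and `best_model` are hypothetical.

```python
import math

# Assumed candidate formula models y = a*g1(x) + b*g2(x), as basis-function pairs.
MODELS = {
    "a*x + b*sin(x)": (lambda x: x, math.sin),
    "a*sin(x) + b*log(x)": (math.sin, math.log),
}


def fit(basis, xs, ys):
    """Least-squares fit of (a, b) via the 2x2 normal equations; returns (a, b, sse)."""
    g1, g2 = basis
    p = [g1(x) for x in xs]
    q = [g2(x) for x in xs]
    spp = sum(u * u for u in p)
    sqq = sum(v * v for v in q)
    spq = sum(u * v for u, v in zip(p, q))
    spy = sum(u * y for u, y in zip(p, ys))
    sqy = sum(v * y for v, y in zip(q, ys))
    det = spp * sqq - spq * spq
    a = (spy * sqq - sqy * spq) / det
    b = (sqy * spp - spy * spq) / det
    sse = sum((a * u + b * v - y) ** 2 for u, v, y in zip(p, q, ys))
    return a, b, sse


def best_model(xs, ys):
    """Fit every candidate model and return the one with the smallest squared error."""
    fits = {name: fit(basis, xs, ys) for name, basis in MODELS.items()}
    name = min(fits, key=lambda n: fits[n][2])
    return name, fits[name]


# Data generated by 0.3*sin(x) + 0.9*log(x), sampled on positive x.
xs = [1.0 + 0.5 * i for i in range(40)]
ys = [0.3 * math.sin(x) + 0.9 * math.log(x) for x in xs]
name, (a, b, sse) = best_model(xs, ys)
```

The loop over `MODELS` plays the role of the researcher reselecting a formula model; in PFK-GP that choice is delegated to PFKB and GP rather than to a fixed list.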
We think the researcher's experience and knowledge play two roles in solving an SR problem. One role is Prior Formula Knowledge (PFK), which helps a researcher to quickly find fitted formulas that match the features of experimental data. Through study and work, the researcher accumulates domain knowledge of the various characteristics of formula models. When the researcher observes experimental data, he can apply this domain knowledge to recognize and classify the data. The other role is the ability of inference and deduction, which helps the researcher to combine, transform, and verify mathematical expressions. We conclude that the PFK contains two primary functions: classification and recognition.
Figure 3: Six families of curves generated by the formula models y = z + (a * sin(x)), y = z + (a * log(x)), y = z - (a * tan(b * x)), y = a * x + b * x^2 + c * x^3, y = a * x + b * sin(x), and y = a * sin(x) + b * log(x).
Classification. When the features of experimental data accord with the characteristics of one formula model in PFK, the dataset can be categorized into that model. The prerequisite of classification is that different formula models have different characteristics in the PFK Base. As shown in Figure 3, six families of curves are generated by six formula models taking different coefficient values. The curves in the same family show similar data features, while the curves in different families show different data features. Therefore, we can infer that the curves (including surfaces and hypersurfaces) generated by different formula models can be classified according to their data features. Although many machine learning algorithms, such as linear regression [18], SVM [19], Boosting [20], and PCVMs [21], can be used to identify and classify data, it is difficult for these algorithms to classify these curves. The reason is that these algorithms depend on features that are extracted manually from the data, and the features of different complex curves are difficult to represent by a feature vector built from the researcher's experience. In contrast to these algorithms, deep learning (DL) can extract features automatically and performs well in recognizing complex patterns, such as images [15], speech [22], and natural language [23]. The GenerateFs algorithm (see Algorithm 2), which is based on DBN, is designed to classify the data.
Recognition. Some formulas can represent remarkable features of the curves generated by a formula model. For example, after observing the curve in Figure 1, a researcher can easily infer that sin or cos is one of the formulas that constitute the curve, because the data in the curve show periodicity. Such formulas are called feature functions, and they can be recognized or extracted by PFK. The algorithms CountSamePartF and CountSpecU (see Algorithms 3 and 4) are built to recognize the feature functions.
Recognition helps the researcher to overcome the overfitting of the results generated by classification: classification can only identify formula models from the training set, whereas recognition can identify subformula models that are consistent with local data features.
The ability of inference and deduction is one of the main measures for evaluating the performance of artificial intelligence methods. For the SR problem, GP, compared with other methods such as logical reasoning, statistical learning, and genetic algorithms, is a revolutionary method of searching for fitting formulas, because it can seek the appropriate formula models and their coefficient values simultaneously by evolving a population of formula individuals. Therefore, in this paper, we use GP as the method of inferring and deducing formulas.
To optimize GP, researchers have proposed various approaches. For example, optimal parsimony pressure [24], Pareto front optimization [17], and its age-fitness method [25] are used to control bloat and premature convergence in GP. In order to reduce the space complexity of searching for formulas, arithmetic methods [26] and machine learning methods have been injected into GP. In this paper, with the population-generating algorithm randomGenP (see Algorithm 5) and the method of Pareto front optimization, PFK-GP can search for the formula model in an appropriate space and find the right formulas quickly.

Formula Prior Knowledge Base.
The PFK needs to be able to identify and classify the formula model based on data features. Although the features of different formula models differ, it is difficult to extract features from the data generated by these models, because different formula models can show seemingly similar but different features. Based on the above definitions of the formula model, the features of the functions within one set differ from one another, whereas the features of two functions drawn from different sets may be similar if one function is the parameter of the other. As shown in Figure 4, the functions sin(x), cube(x), and exp(x) [...] Figure 5.
In this paper, DBN is used to classify data into a specific formula model according to features that are extracted automatically from the data. Generally, a DBN is made up of multiple layers, each represented by a Restricted Boltzmann Machine (RBM). As the RBM is a universal approximator of discrete models [27], we can speculate that a DBN with many RBM layers can recognize the features of data generated by complex functions, just as a Convolutional Neural Network (CNN) classifies images [28]. The process by which DBN recognizes formulas is illustrated in Figure 4. DBN extracts features from data layer by layer, so the lower RBM layers can represent simple functions and the higher RBM layers can transform the simple functions into more complex functions.
We use the data generated by the different formula models (see Table 7) as training samples, and DBN is trained on these samples. The model finally obtained by the DBN training methods is PFKB, which is aimed at identifying the formula model that can represent the features of the data. The process of DBN training is outlined in the algorithm TrainDBN (Algorithm 1), which uses the same steps as described in [14,15].
PFKB changes only when the formula models change. If an application has no new formula models to train, the algorithm TrainDBN is not executed. When the number of trained formula models is large enough, few new formula models will appear, and PFKB will seldom change. In this paper, TrainDBN is performed exactly once in order to generate PFKB.

Classification and Recognition with PFKB.
To classify and recognize the formula model from data, we consider two cases. In the first case, the data can be represented by a specific formula model from PFKB; in the second case, the data cannot be represented by any formula model from PFKB. In the first case, we exploit PFKB to identify the formula models of the data by DBN classification. Based on the ordered results of the DBN classification, we obtain a set Fs of the formula models that are most similar to the features of the data. The process that deals with the first case is outlined in the algorithm GenerateFs. The algorithm is fast because PFKB has already been built by TrainDBN and the number of formula models in Fs is a small integer.
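The first case amounts to keeping the highest-ranked classes of the classifier. The sketch below assumes that the DBN output is available as one score per formula model; the function name `generate_fs`, the labels, and the score values are hypothetical stand-ins, and no DBN is trained here.

```python
def generate_fs(scores, k=5):
    """Return the k formula-model labels with the highest classifier scores."""
    ranked = sorted(scores, key=lambda label: scores[label], reverse=True)
    return ranked[:k]


# Assumed class scores standing in for the DBN output on one dataset.
scores = {"f3": 0.06, "f11": 0.10, "f15": 0.25,
          "f18": 0.40, "f34": 0.15, "f7": 0.04}
fs = generate_fs(scores, k=5)
```

Because only a sort over the class scores is needed once PFKB exists, this step is cheap compared with the evolutionary search that follows.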
In the second case, when a researcher observes the laws that are hidden in experimental data, he often tries to find formulas that are consistent with partial features of the data. Therefore, we propose the following two assumptions.
Assumption 1. The more formula models in the set Fs (the result of running GenerateFs) share the same subformula model pf, the more strongly pf can express features of the data.
In order to find the shared pf in Fs, we express each formula model as an expression string and seek the parts they have in common by intersecting pairs of strings (without considering the elements in sets X and A). The intersection of two expressions is the set of parts that both expressions contain. For example, for $f_1 = a + b \cos(x) + \tan(x)/(\exp(x) + \log(x))$ and $f_2 = a + b \cos(x) + \mathrm{abs}(x)/(\exp(x) + \log(x))$, we have $f_1 \cap f_2 = \{a + b \cos(x),\ \exp(x) + \log(x)\}$. The method that obtains every pf whose frequency of occurrence in Fs is larger than the threshold t is described as the algorithm CountSamePartF.
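A minimal sketch of this counting step follows. The paper intersects expression strings; this example swaps in nested-tuple expression trees instead, ignores leaves (standing in for the elements of X and A), and keeps only the largest shared parts. The function names and the tree encoding are assumptions of the sketch.

```python
from collections import Counter
from itertools import combinations


def normalize(tree):
    """Replace every leaf (variable or coefficient) with "_" so only operators are compared."""
    if isinstance(tree, tuple):
        return (tree[0],) + tuple(normalize(child) for child in tree[1:])
    return "_"


def subtrees(tree):
    """Return the set of all non-leaf subtrees of a tree."""
    found = set()
    if isinstance(tree, tuple):
        found.add(tree)
        for child in tree[1:]:
            found |= subtrees(child)
    return found


def intersection(f1, f2):
    """Common subtrees of two models that are not contained in a larger common subtree."""
    common = subtrees(normalize(f1)) & subtrees(normalize(f2))
    return {s for s in common
            if not any(s != other and s in subtrees(other) for other in common)}


def count_same_part_f(fs, t):
    """Count shared parts over all pairs in fs; keep those seen more than t times."""
    counts = Counter()
    for f1, f2 in combinations(fs, 2):
        counts.update(intersection(f1, f2))
    kept = [(part, v) for part, v in counts.items() if v > t]
    return sorted(kept, key=lambda item: (-item[1], repr(item[0])))


# The two example models from the text, with assumed coefficient names a and b.
head = ("+", "a", ("*", "b", ("cos", "x")))
tail = ("+", ("exp", "x"), ("log", "x"))
f1 = ("+", head, ("/", ("tan", "x"), tail))
f2 = ("+", head, ("/", ("abs", "x"), tail))
shared = count_same_part_f([f1, f2], t=0)
```

For the two example models, the shared parts are the normalized forms of a + b*cos(x) and exp(x) + log(x), each counted once.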

Assumption 2. If a function $u \in U$ exists in Fs obtained by GenerateFs and the number of occurrences of the same $u$ is larger than a threshold, we can conclude that $u$ can show some local data features.
A function in $B$ is a common function, which has a high probability of occurring in mathematical expressions; therefore, it can hardly express special data features. By comparison, a function $u \in U$ can show obvious features of the data. For instance, sin(x) presents the periodicity of data, and log(x) represents data features of extreme increase or decrease. The method that obtains the special functions $u$ showing local data features is outlined as the algorithm CountSpecU.
To verify Assumption 2, we also choose the dataset generated from formula model 0 (see Table 5) as the testing data and apply the CountSpecU algorithm to calculate the special $u$ among the formula models Fs = {18, 15, 34, 11, 3}. The result of the algorithm is shown in Table 1. We find that the result specU = {tan, cos, sqrt, exp, sin} (sin and cos are operators of the same kind) is part of formula model 0. Hence, the set of $u$ obtained by the algorithm CountSpecU can show local features of the dataset.
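The counting behind Assumption 2 can be sketched as follows. The unary function set, the tuple encoding of formula models, and the name `count_spec_u` are assumptions made for illustration; the models in `fs` are made-up examples, not the entries of Table 7.

```python
from collections import Counter

# Assumed unary function set U.
U = {"sin", "cos", "tan", "exp", "log", "sqrt", "abs"}


def unary_ops(tree):
    """Yield the operator of every unary-function node in a tree."""
    if isinstance(tree, tuple):
        if tree[0] in U:
            yield tree[0]
        for child in tree[1:]:
            yield from unary_ops(child)


def count_spec_u(fs, t):
    """Return the unary functions occurring more than t times in fs, most frequent first."""
    counts = Counter(op for f in fs for op in unary_ops(f))
    kept = [(u, v) for u, v in counts.items() if v > t]
    return [u for u, _ in sorted(kept, key=lambda item: (-item[1], item[0]))]


# Hypothetical formula models returned by a GenerateFs-like step.
fs = [
    ("+", ("sin", "x"), ("exp", "x")),
    ("*", "a", ("sin", ("log", "x"))),
    ("+", ("sin", "x"), ("*", "b", ("exp", "x"))),
]
spec_u = count_spec_u(fs, t=1)
```

Here sin occurs three times and exp twice, so both pass the threshold, while log, which occurs once, is dropped.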

GP with Prior Formula Knowledge Base.
To deal with the SR problem, GP is executed to compose and evolve mathematical expressions automatically. The process of GP is similar to the process by which a researcher transforms formula models and obtains fitting formulas based on his knowledge. The algorithms in PFKB, which were designed by analyzing how a researcher infers fitted formulas, can recognize formula models that are consistent with data features. We therefore feed the formula models recognized by PFKB into the process of GP in order to reduce the search space and find the right solutions faster. When initializing the GP algorithm, we select candidate formulas from Fs, C, and specU as individuals in the population of GP. The sets Fs, C, and specU are obtained by the above algorithms in PFKB, so PFKB is injected into the process of population generation. This population helps to preserve data features as much as possible and to reduce the search space, because these individuals commonly have good fitness values. With this population, the GP algorithm can converge faster and produce more accurate SR results. However, it may lead to biased results. To overcome this problem, some random individuals must be added to the population. The process of creating the population is as follows.
Firstly, the elements of the sets Fs and C are inserted into the population. Then, the set specU and the candidate function sets B and U are merged into a new candidate function queue Q. Because specU is merged into Q in addition to B and U, each element of specU appears in Q twice as often as the other elements. The elements of specU are therefore more likely to become part of individuals in the population when the method traditionalRandomIndividual [16], which is designed to generate individuals randomly from a given function set, is applied to Q. Finally, the rest of the individuals of the population are created by traditionalRandomIndividual with the sets B and U. The process of population generation is described as the algorithm randomGenP.

Algorithm 3 (CountSamePartF), as far as the listing survives:
Input: Fs, t
Output: C (ordered set of local expressions, sorted by frequency of occurrence, each frequency larger than t)
(1) ... = ... = 0
(2) for each pair ⟨..., ...⟩ in ...
    ...
    change the element V to V + 1 // V indicates the number of times that the expression appears
(7) else ...

Algorithm 4 (CountSpecU), as far as the listing survives:
Input: Fs and a threshold
Output: specU (ordered set of special functions, sorted by frequency of occurrence, each frequency larger than the threshold)
    ...
    change the element V to V + 1 // V indicates the number of times that the function appears
(6) else ...

Algorithm 6 (PFK-GP), as far as the listing survives:
Input: data, PFKB, ..., interval
Output: candidate formula set
    ...
    ... = ParetoOptimise(...) // prevent the formula model from becoming too complex
(9) fitness = EvaluatePopulation(...)
(10) bestFitness, ... = Selectbest(..., fitness, ...) // choose the best individuals and get the best fitness value from them
(11) if ... mod interval ...
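The population-generation procedure described in this section can be sketched as below. This is an illustration under stated assumptions, not the paper's randomGenP: the tree encoding, the fixed depth, the choice to draw half of the remaining individuals from Q and the other half from B and U only, and the helper `random_individual` (a stand-in for traditionalRandomIndividual) are all assumptions.

```python
import random


def random_individual(queue, rng, depth=2):
    """Grow a random expression tree from (name, arity) entries in queue."""
    if depth == 0:
        return "x"
    name, arity = rng.choice(queue)
    return (name,) + tuple(random_individual(queue, rng, depth - 1)
                           for _ in range(arity))


def random_gen_p(fs, c, spec_u, b, u, size, seed=0):
    """Seed the population with fs and c, then fill it with random trees."""
    rng = random.Random(seed)
    population = list(fs) + list(c)
    plain = [(name, 2) for name in b] + [(name, 1) for name in u]
    # specU is added on top of B and U, so its functions are drawn more often.
    q = [(name, 1) for name in spec_u] + plain
    remaining = size - len(population)
    for i in range(remaining):
        # Assumption: the first half of the remaining slots uses the biased queue Q.
        queue = q if i < remaining // 2 else plain
        population.append(random_individual(queue, rng))
    return population


population = random_gen_p(
    fs=[("sin", "x")], c=[("+", "x", "x")], spec_u=["sin"],
    b=["+", "*"], u=["sin", "exp"], size=10,
)
```

The seeded individuals carry the recognized data features, while the individuals drawn from B and U alone supply the random variation that counteracts biased results.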
Generally, |Fs| + |C| + |specU| is less than half the number of individuals in the population. Furthermore, in order to strengthen the influence of PFKB on the process of GP evolution, the method randomGenP is used to create new individuals every few generations of the evolutionary computation. Meanwhile, the Pareto front method [17] is introduced into the algorithm PFK-GP to balance accuracy against model complexity. The details of the algorithm PFK-GP are shown in Algorithm 6.
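The balance between accuracy and complexity can be illustrated with a non-dominated filter over (error, complexity) pairs. This is a generic Pareto-front sketch, not the ParetoOptimise routine of the paper; the dictionary keys and the use of a node count as the complexity measure are assumptions.

```python
def pareto_front(individuals):
    """Return the individuals not dominated in (error, complexity); lower is better for both."""
    front = []
    for cand in individuals:
        dominated = any(
            other["error"] <= cand["error"]
            and other["complexity"] <= cand["complexity"]
            and (other["error"] < cand["error"]
                 or other["complexity"] < cand["complexity"])
            for other in individuals
        )
        if not dominated:
            front.append(cand)
    return front


# Hypothetical individuals: error on the training data and number of tree nodes.
candidates = [
    {"name": "A", "error": 0.10, "complexity": 9},
    {"name": "B", "error": 0.30, "complexity": 3},
    {"name": "C", "error": 0.30, "complexity": 7},
    {"name": "D", "error": 0.05, "complexity": 15},
]
front_names = [ind["name"] for ind in pareto_front(candidates)]
```

Individual C is dropped because B reaches the same error with fewer nodes; A, B, and D each represent a different trade-off and are kept.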

Experiments
In the experiments, we employ the DBN in the DeepLearnToolbox [30] to classify formula models and build the algorithm PFK-GP on GPTIPS [29]. The 39 formula models in Table 7 consist of formulas from [31,32] and some formulas that we created ourselves. The data generated by these 39 formula models are used as the training data of the DBN algorithm to create PFKB. The formula models in Table 5 are used to generate the testing data for verifying the accuracy of the algorithms GenerateFs and PFK-GP. The formula models in Table 6 are devoted to validating the two algorithms CountSamePartF and CountSpecU (see Algorithms 3 and 4). For most formula models in Tables 5, 6, and 7, we sample parameter values at equal steps from the range [−49, 50]. For some particular formulas, we sample at a special equal step from a special numerical range. For example, the value of x in sqrt(x) is in the range [0, 99], and the value of x in log(x) ranges between 1 and 100. We create 500 groups of different parameter values for each formula model. The coefficients in these formula models are taken at equal steps from the range [−2.996, 3.0]. When all coefficients of a formula model take specific values, the formula model generates a formula, namely, a sample of the formula model. We create 7500 groups of different coefficients for each formula model. So, each formula model has 7500 samples, and each sample has 500 groups of different parameter values. We take 6000 of these samples as training data and the others as test data.
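The sampling scheme can be sketched for a single formula model. The helper names and the way coefficient groups are formed (a full grid over a few values) are assumptions of this example; only the ranges and the 500 points per sample come from the text, and the number of coefficient groups is reduced to keep the demonstration small.

```python
import math


def linspace(start, stop, count):
    """Return count equally spaced values from start to stop inclusive."""
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]


def sample_model(model, coefficient_sets, xs):
    """Evaluate a formula model at xs for each coefficient tuple; one row per formula."""
    return [[model(x, *coeffs) for x in xs] for coeffs in coefficient_sets]


def model(x, a, b):
    """Formula model y = a*x + b*sin(x)."""
    return a * x + b * math.sin(x)


xs = linspace(-49.0, 50.0, 500)       # 500 parameter values per sample
coeffs = linspace(-2.996, 3.0, 4)     # the paper uses 7500 coefficient groups; 4 values here
coefficient_sets = [(a, b) for a in coeffs for b in coeffs]  # assumed grouping
samples = sample_model(model, coefficient_sets, xs)
```

Each row of `samples` is one formula of the model evaluated on the shared grid, which is the form in which a sample would be presented to the classifier.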
We adopt DBN as the classification model and compare it with SVM as implemented by the tool libsvm [33]. The training and testing data for the two algorithms originate from formula models 1-39. The parameter values of DBN and SVM are listed in Table 2. We take the first five formulas from the set Fs generated by GenerateFs as the result set of recognition. If the test formula is included in this set, we consider the recognition result correct. The accuracy of the recognition results of DBN and SVM is shown in Figure 5.
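The correctness criterion above is a top-five accuracy. A minimal sketch, assuming each test sample comes with a ranked list of predicted formula-model labels; the labels and rankings below are hypothetical.

```python
def top_k_accuracy(ranked_predictions, true_labels, k=5):
    """Fraction of samples whose true label is among the first k predictions."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_labels)
               if truth in ranked[:k])
    return hits / len(true_labels)


# Hypothetical ranked classifier outputs for three test samples.
ranked_predictions = [
    ["f13", "f9", "f2", "f7", "f1", "f19"],
    ["f9", "f13", "f2", "f7", "f1", "f19"],
    ["f2", "f9", "f13", "f7", "f1", "f19"],
]
true_labels = ["f13", "f19", "f2"]
accuracy = top_k_accuracy(ranked_predictions, true_labels, k=5)
```

The second sample counts as a miss because its true label is ranked sixth, outside the first five.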

We find that, when processing the test data of formula model 13, PFK-GP found its best model in the first generation with a higher fitness, whereas PO-GP did not find its best model until the 718th generation, and its fitness was much lower than that of PFK-GP. PFK-GP can get the right formulas quickly because model 13, recognized by the algorithm GenerateFs, is inserted into the initial population of the evolutionary computation. Formula models whose characteristics are consistent with data features in PFKB can be recognized with high probability and combined into the population of PFK-GP. PFK-GP can then first search for the coefficients of these formula models and obtain a mathematical expression with a good fitness value. Therefore, the algorithm GenerateFs can speed up PFK-GP in dealing with SR and improve the accuracy of the SR results.
In order to test whether PFK-GP can overcome overfitting, a dataset is created by formula model 1, which does not exist among the training models of PFKB. The two algorithms PO-GP and PFK-GP are each applied to the dataset. The two algorithms, which are run for 100 and 1000 generations, respectively, have similar convergence curves in Figure 8. However, PFK-GP finds better fitness results than PO-GP, because PFK-GP searches for the fitted solution in a space that includes more functions whose data features accord with model 1. Since the initial population, which is generated by the algorithms CountSamePartF and CountSpecU in PFKB, contains subformulas of the formula models recognized by PFKB and represents the data features of these subformulas, PFK-GP can find formulas that fit the raw dataset better.
In order to observe the overall performance of PFK-GP, we select six datasets as the testing set. Three of them, generated by formula models (9, 13, 19) from Table 7, are involved in the process of training the DBN, while the other three, generated by formula models (1, 4, 6) from Table 5, are not involved in that process. The two algorithms PFK-GP and PO-GP are each executed ten times in order to obtain the right formulas from the six different datasets. The six results of mean training error obtained by the two algorithms are shown in Figure 9, and the average results over the six groups of mean training errors are given in Figure 7. PFK-GP(E) and GP(E) are the average results for models 1, 4, and 6, while PFK-GP(P) and PO-GP(P) are the average results for models 9, 13, and 19. Based on the results in Figures 7 and 9, we conclude that the overall performance of PFK-GP is better than that of PO-GP, because PFK-GP uses the method GenerateFs to find the fitted formula model directly and the methods CountSamePartF and CountSpecU to identify subformula models whose data features are consistent with the test set.
The best mathematical expressions PFK-GP and PO-GP found are listed in Table 4.
In order to measure the correlation between the experimental data and the predicted data, we use the Training Variation Explained (TVE). The higher the TVE value, the more valid the predicted data. PO-GP and PFK-GP are each run ten times on the datasets generated from eight prediction models (models 1-8 in Table 6). The eight results for the different datasets processed by the two algorithms are shown in Figure 10, and the maximum, minimum, and average TVE results are shown in Figure 11. From the results in the two figures, the formulas that PFK-GP finds are more closely related to the experimental formula models than those that PO-GP finds.
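As a rough illustration of such a measure, the sketch below assumes that TVE takes the usual variation-explained form, 100 * (1 - SSE / SST), expressed as a percentage. This form is an assumption of the example and is not confirmed by the text above; the function name `tve` is hypothetical.

```python
def tve(y_true, y_pred):
    """Assumed variation explained, in percent: 100 * (1 - SSE / SST)."""
    mean = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    sst = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 100.0 * (1.0 - sse / sst)


perfect = tve([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = tve([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Under this assumed form, a prediction that reproduces the data exactly scores 100, and a prediction that always returns the mean of the data scores 0.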

Related Work
The search space of SR is huge even for rather simple basis functions [31]. In order to avoid searching a space that is too far from the desired output range determined by the training dataset, interval arithmetic [34] and affine arithmetic [26], which can compute the bounds of a GP tree expression, have been introduced into SR. Although the method based on affine arithmetic generates tighter bounds on the expression than the interval arithmetic method, its accuracy often comes with high computational complexity [35]. Moreover, the search space is still huge, because plenty of candidate expressions fit the data bounds computed by the two arithmetic methods.
In addition to the above arithmetic methods, machine learning methods are used to compact or reduce the search space of SR. The FFX technique uses a pathwise regularized learning algorithm to rapidly prune a huge set of candidate basis functions down to a compact model based on the generalized linear model (GLM); hence it outperforms GP-SR in speed and scalability owing to its simplicity and deterministic nature [8]. However, it may discard correct expressions that are not in the space of the GLM. A hybrid deterministic GP-SR algorithm [36] was proposed to overcome the problem of missing correct expressions. The hybrid algorithm extracts candidate basis expressions by using FFX and feeds them into GP-SR. The hybrid algorithm thus uses candidate expressions generated by a linear regression method (pathwise regularization), while our algorithm obtains candidate expressions by applying the algorithms CountSamePartF, GenerateFs, and CountSpecU.
By applying the expectation-maximization (EM) framework to SR, the clustered SR (CSR) can identify and infer symbolic representations of piecewise functions from unlabelled, time-series data [9]. CSR can reduce the space of searching for piecewise functions, because EM can simultaneously search for the subfunction parameters and the latent variables that represent the information of the function segments. The abstract expression grammar (AEG) SR was proposed to perform the process of the genetic algorithm (GA) while allowing user control of the search space and the final output formulas [37]. Based on an understanding of the given application, users can specify the goal expression of SR and limit the size of the search space by using abstract expression grammars. Whereas AEG SR relies on manually assigning expressions and limiting the search space, the PFK methods in this paper automatically extract the candidate expressions from the dataset by using statistical methods and dynamically adjust the search space by using GP.

Figure 11: TVE results of PO-GP compared with PFK-GP in eight formula models.

The methods that inject prior or expert knowledge into evolutionary search [12,13] were introduced to find effective solutions that make the mathematical expressions more compact and interpretable. In these papers, the prior and expert knowledge consists of solutions, that is, mathematical expressions, from some applications. The knowledge is merged into GP by inserting randomized pieces of the approximate solution into the population. One of the major differences between these methods and our method is how the prior or expert knowledge is created. The knowledge in [12,13] is an existing formula model that comes from previous solutions and can be called static knowledge. In contrast, the knowledge in our method is the formula model that is consistent with the data features, which originates from the algorithms GenerateFs, CountSamePartF, and CountSpecU, and it can be called dynamic knowledge because it changes with the features of the test dataset. Therefore, our method can insert more suitable knowledge into GP.

Conclusion
In this paper, the PFK-GP method is proposed to deal with the problem of symbolic regression, based on an analysis of how a researcher constructs a mathematical model. The method can understand the features of experimental data and extract formulas that are consistent with these features. In order to understand data features, PFK-GP first uses the DBN method to create PFKB, which can extract features from the datasets generated by the training formula models. The experimental results confirm that, compared with SVM, DBN produces better results in extracting features from formula models and classifying test data into the corresponding formula model. Then, the methods of classification and recognition are implemented to find formula models that are as similar or as closely related to the features of the experimental data as possible. For classification, we use the algorithm GenerateFs, based on DBN, to match the experimental data with formula models in PFKB. For recognition, we propose the algorithms CountSamePartF and CountSpecU to obtain subformula models whose local features are consistent with the experimental data. Classification helps PFKB to find formula models that are consistent with the features of the whole data, while recognition helps PFKB to find subformula models that are consistent with local data features. Finally, the algorithm randomGenP is used to generate the individuals of the evolutionary population according to the results of the above three algorithms. By combining and transforming these individuals, GP can automatically obtain approximate formulas that best fit the experimental data.
Compared with Pareto GP, PFK-GP, which is built on PFKB with the functions of classification and recognition, can explore formulas in the search space of data features. It can therefore converge faster and improve the accuracy of the formulas obtained.
The efficiency of PFK-GP depends on the power of the classification and recognition methods based on PFKB. Therefore, improving the accuracy of these two methods is an important part of future work. The two methods depend on how the data features of a formula model are represented. In this paper, two assumptions based on statistics and counts are used to obtain the formulas that can show the data features. The features of a formula model are not defined explicitly, and the two assumptions are not supported by formal proofs, so some uncertainty remains in them. Therefore, new representations that can show the whole or local features of formula models will be studied in order to find formulas that fit experimental data better. In addition, rules for transforming and inferring formulas that resemble researchers' methods will be explored in the evolution of GP.