A New Dual-Mode GEP Prediction Algorithm Based on Irregularity and Similar Period

Gene expression programming (GEP) uses simple linear coding to solve complex modeling problems. However, its performance is limited by the effectiveness of the method chosen to evaluate individuals in the population, the breadth and depth of the search domain, and the accuracy with which solutions can be corrected from historical data. Therefore, a new dual-mode GEP prediction algorithm based on irregularity and similar period is proposed. It specially treats the original data, preserves elite individuals, reevaluates target individuals, and processes data and solutions via the similar period mode. These measures avoid the tendency to get stuck in a local optimum and the complicated accuracy correction that an insufficient search domain causes in complex modeling problems, so better convergence results are obtained. Simulation tests on leek price data and sunspot observation data compare the new algorithm with GEP; the results indicate that the new algorithm has stronger exploration ability and higher precision. Under the same accuracy requirements, the new algorithm finds the target individual faster. In addition, simulations on another set of sunspot observations, comparing the new algorithm with the ARIMA algorithm and the BP neural network prediction algorithm, show that the new algorithm performs better.


Introduction
In the field of predictive modeling, there are many models, and many scholars have conducted in-depth research in this area. Wang et al. [1][2][3] have made great efforts in optimizing prediction algorithms and achieved certain results. This paper mainly studies and improves the GEP model and uses the ARIMA and BP-ANN models for experimental comparison.
Time series prediction is a typical method in data mining, which is widely used in the fields of financial economy, meteorology, hydrology, signal processing, and disaster warning. The autoregressive integrated moving average model (ARIMA) is the most common model used for time series forecasting. Atanu et al. [4] used the ARIMA model to predict a country's GDP, and Thiruchelvam et al. [5] used the ARIMA model to determine the spatial effect of dengue fever cases on neighboring areas. The error backpropagation (BP) algorithm is another method currently used for time series forecasting. Li et al. [6] predicted the passive torque of the human shoulder joint based on BP-ANN (Artificial Neural Network), and Kianpour et al. [7] used BP-ANN to predict the acute oral toxicity of organophosphates. However, ARIMA is essentially a linear prediction model and requires the data to be stationary, and training a BP-ANN prediction model involves many manually chosen factors. GEP inherits the simple linear coding of genetic algorithms (GAs) and the expression tree of genetic programming (GP) and expresses the population through both genotype and phenotype [8]. Because of its simplicity, comprehensibility, and high efficiency, GEP has been widely used in data prediction. Oulapour et al. [9] used GEP to find the best equation for the relationship between the width and depth of the possible crack area and the geometric parameters of the valley cross section. Khan et al. [10] used GEP to predict the compressive strength of geopolymer concrete. Yang et al. [11] proposed a new spectral model for leaf area index estimation based on GEP. Mallick et al. [12] used GEP to evaluate the surface-averaged pressure coefficient of building surfaces. Ali et al. [13] used GEP to predict the air ratio parameters of mineral tailings. Deng et al. [14] used hybrid GEP to recognize numerical sensitive data in active distribution networks. Majidifard et al. [15] used GEP to develop a prediction model for the rutting depth of asphalt mixtures. Murad et al. [16] used GEP to predict the shear strength of internal reinforced concrete beam-to-column joints subjected to cyclic loading.
GEP is based on the principle of survival of the fittest in biological evolution, and its unique mechanism of decoding results into functions makes it stand out in the family of prediction algorithms. However, it has shortcomings, such as a tendency to fall into a local optimum [17], slow convergence in the later stage, and complicated accuracy correction for solutions of complex nonlinear problems [18,19]. In this regard, Zhang et al. [20] used regularization methods to enhance the generalization ability of GEP, increase gene diversity, and escape from local optima. Jiang et al. [21] accelerated GEP through measures such as adaptive parameters, population age stratification, and porting to the Spark framework. Wang et al. [22] introduced a multipreference-driven coevolutionary algorithm into GEP to improve the quality of the target solution while reducing the complexity of the algorithm. However, because the above improvements require additional constraint information or the integration of other algorithms, they do not generalize well. This paper proposes a new dual-mode GEP (DM_GEP) prediction algorithm based on irregularity and similar period. The first mode processes irregular data. The second mode processes data with similar periodic fluctuations. In general, if the data follow a similar periodic fluctuation law, the second mode is used; otherwise, the first mode is used. We compared the results with the basic GEP algorithm. The experimental results show that DM_GEP has a wider and deeper search area and higher convergence efficiency, and thus it can achieve higher prediction accuracy. DM_GEP is also compared with the ARIMA model and the BP neural network prediction model. The experimental results further verify that DM_GEP has better prediction performance.

Gene Expression Programming.
GEP is a new type of adaptive evolutionary algorithm proposed by the Portuguese scientist Candida Ferreira in 2001. GEP inherits the rapidity and usability of GAs and the variability and versatility of GP. It can use simple coding to solve complex problems [23]. Meanwhile, the separation of genotype and phenotype makes the evolutionary efficiency of GEP in solving practical problems 2-4 orders of magnitude higher than that of GA and GP [24].

Chromosome.
In the process of gene expression programming modeling, a random initial population is first generated, and the population is composed of chromosomes. The processing object of GEP is a chromosome composed of a single gene or multiple genes. Genes consist of linear, fixed-length strings of symbols, which can be divided into heads and tails. A chromosome generated according to a certain rule can be decoded according to the same rule to generate an expression tree, and the expression tree can be further transformed into a mathematical expression. So, a chromosome is in essence a series of mathematical expressions. If F is the set of function symbols and T is the set of terminal symbols, the head of a gene can be composed of any symbols in F and T, and the tail of a gene can only be composed of symbols in T. Let the length of a gene, the length of its head, and the length of its tail be L, H, and T, respectively, and let N be the maximum number of arguments of any function symbol contained in the gene.
The following formulas are established:
$$T = H \times (N - 1) + 1, \quad (1)$$
$$L = H + T. \quad (2)$$
Figure 1 shows a double-gene chromosome with a head length of 4 and a tail length of 5. The chromosome in Figure 1 has two open reading frames (ORFs), which correspond to the subtrees (sub-ETs) of Figure 2. In the multilevel structure tree, each subexpression tree is not only an independent evolutionary individual but also a part of the hierarchical evolution system. Figure 2 is the expression tree generated by decoding the chromosome in Figure 1; its subtrees are connected by "+".
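The decoding step described above can be illustrated with a minimal Python sketch. It assumes the usual breadth-first (Karva) reading of a gene and the four arithmetic operations as the function set; the names `FUNCS`, `tail_length`, `decode`, `evaluate`, and `evaluate_chromosome`, and the example genes, are hypothetical and are not taken from the paper.

```python
import operator

# Assumed function set: symbol -> (number of arguments, implementation).
FUNCS = {"+": (2, operator.add), "-": (2, operator.sub),
         "*": (2, operator.mul), "/": (2, operator.truediv)}

def tail_length(head_len, max_arity):
    """Tail length implied by the head length: T = H * (N - 1) + 1."""
    return head_len * (max_arity - 1) + 1

def decode(gene):
    """Build the expression tree of one gene, filling it level by level.

    Each node is a list [symbol, children]. Symbols that lie beyond the
    open reading frame are simply never attached to the tree.
    """
    nodes = [[sym, []] for sym in gene]
    queue, nxt = [nodes[0]], 1
    while queue:
        node = queue.pop(0)
        arity = FUNCS[node[0]][0] if node[0] in FUNCS else 0
        for _ in range(arity):
            child = nodes[nxt]
            nxt += 1
            node[1].append(child)
            queue.append(child)
    return nodes[0]

def evaluate(node, env):
    """Evaluate an expression tree given terminal values in env."""
    sym, children = node
    if sym in FUNCS:
        return FUNCS[sym][1](*(evaluate(c, env) for c in children))
    return env[sym]

def evaluate_chromosome(genes, env):
    """Multi-gene chromosome whose sub-ETs are linked by '+'."""
    return sum(evaluate(decode(g), env) for g in genes)

# Two hypothetical genes with head length 4 and tail length 5.
genes = ["*+-abbaab", "+a*bbaaab"]   # (a + b) * (b - a)  and  a + b * b
value = evaluate_chromosome(genes, {"a": 2, "b": 5})
```

Because the tail contains only terminals and has length H × (N − 1) + 1, the decoder can never run out of symbols, whatever the head contains.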

Fitness.
Fitness is an index that evaluates the ability of an individual to adapt to the environment. To compute fitness, the chromosome is first decoded into the corresponding expression tree, which is then turned into a mathematical expression. Finally, the value of the objective function is obtained by substituting the values of the variables into the mathematical expression. The smaller the gap between the objective function value and the actual value, the higher the fitness of the individual. The fitness value thus measures the quality of an individual during evolution. There are two classic evaluation models in GEP, absolute error (equation (3)) and relative error (equation (4)), which in Ferreira's standard form read
$$f = \sum_{i=1}^{n}\left(M - \left|\hat{Y}_i - Y_i\right|\right), \quad (3)$$
$$f = \sum_{i=1}^{n}\left(M - \left|\frac{\hat{Y}_i - Y_i}{Y_i} \times 100\right|\right), \quad (4)$$
where Y is the training dataset, which contains the n data needed for modeling, $Y_i$ is the actual value of the ith group of training data, $\hat{Y}_i$ is the predicted value for the corresponding ith group of data, and M is a constant representing the selection range.
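As a rough illustration of the two evaluation models, the following Python sketch follows the standard GEP formulation (the selection range M minus the absolute or percentage error, summed over the training cases). The function names and the default M = 100 are assumptions made for the example, not values stated in the paper.

```python
def absolute_fitness(actual, predicted, M=100.0):
    """Absolute-error fitness: sum of (M - |prediction - target|)."""
    return sum(M - abs(p - y) for y, p in zip(actual, predicted))

def relative_fitness(actual, predicted, M=100.0):
    """Relative-error fitness: sum of (M - |percentage error|).

    Note that this divides by the target value, so targets at or near
    zero need separate handling.
    """
    return sum(M - abs((p - y) / y * 100.0) for y, p in zip(actual, predicted))

targets = [10.0, 20.0]
outputs = [11.0, 18.0]
f_abs = absolute_fitness(targets, outputs)   # (100 - 1) + (100 - 2)
f_rel = relative_fitness(targets, outputs)   # (100 - 10) + (100 - 10)
```

Under both models a perfect fit on n training cases reaches the maximum fitness n × M, which is why a preset target fitness slightly below that maximum can be used as a stopping criterion.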

A Dual-Mode GEP Prediction Algorithm
If the observation data used for modeling show clearly similar waveform trends over intervals of similar span, without strict limits on the heights of peaks and troughs, we say that the data have a similar period (SP). This type of data is nonlinear and often contains strong ambient white noise. The basic GEP adopts different individual evaluation standards for different types of input data, so it does not generalize. However, if a unified evaluation system is used to evaluate individuals from different data types, it easily happens that the individual fitness value is high while the regression fit remains poor. If the search area of the algorithm is small, the algorithm tends to fall into a local optimum. When the individual fitness value approaches the theoretical value, the time consumed is out of proportion to the accuracy gained, which slows the convergence of the algorithm in the later stage. When processing data with SP characteristics, accuracy correction is difficult and the process is complicated. The DM_GEP proposed in this paper is based on the basic GEP algorithm and consists of the irregular prediction mode (IPM) and the similar period mode (SPM). IPM can be used for regression prediction problems on various types of data. SPM processes only SP data, without the complicated and difficult correction process, and achieves higher accuracy than IPM on such data.

General Mode-IPM.
On the basis of GEP evolution, a unified evaluation system is applied to deal with different types of modeling data, while expanding the search domain and accelerating the evolution efficiency. IPM is described as follows, and the pseudocode of the algorithm is shown in Algorithm 1.
① Appropriately lower the preset solution fitness value to reduce the large amount of time it takes to approach that value in the later stage of model evolution. This reduces the number of individuals that reach high fitness but whose regression fit does not meet the requirements and accelerates the acquisition of the target solution.
② When evaluating a chromosome, the few actual values that are close to zero are not used directly as denominators. These values are treated specially to reduce their impact on the individual when the prediction error is calculated.
③ If an individual has reached the preset fitness value, the target individual obtained by the model is reevaluated: the modeling data are used to calculate the average regression error of the individual. If the average error is less than the preset limit, the individual is returned as the model result. Otherwise, the entire model is restarted and evolved again, in combination with ① and ②, to accelerate evolution while ensuring the prediction accuracy of the target individual.
The calculation of BEST_FIT follows the special calculation principle in ② above. K is the number of best individuals retained for the next generation, and its value depends on the population size requirements. MAX_FIT is the ideal fitness. Discrete Dynamics in Nature and Society
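Measures ② and ③ can be sketched in Python as follows. The paper does not state how near-zero values are treated, so the fallback used here (switching to the absolute error when the actual value is within `eps` of zero) is purely an assumption, as are the names `ipm_relative_error` and `ipm_decide` and the default threshold.

```python
def ipm_relative_error(actual, predicted, eps=1e-3):
    """Relative error that avoids dividing by near-zero actual values.

    Assumption: when |actual| <= eps the denominator is replaced by 1,
    i.e. the absolute error is used for that point.
    """
    denom = abs(actual) if abs(actual) > eps else 1.0
    return abs(predicted - actual) / denom

def ipm_decide(best_fit, max_fit, actual, predicted, limit):
    """Re-evaluate a candidate that has reached the preset fitness.

    Returns 'continue' (keep evolving), 'accept' (average error within
    the limit) or 'restart' (fitness reached but regression is poor).
    """
    if best_fit < max_fit:
        return "continue"
    errors = [ipm_relative_error(y, p) for y, p in zip(actual, predicted)]
    ave_error = sum(errors) / len(errors)
    return "accept" if ave_error <= limit else "restart"

decision = ipm_decide(best_fit=100, max_fit=100,
                      actual=[2.0, 0.0], predicted=[2.5, 0.2], limit=0.3)
```

Separating the fitness threshold from the average-error check is what allows the preset fitness to be lowered (measure ①) without accepting individuals that fit poorly.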

Dedicated Mode-SPM.
IPM can process SP data, but with more difficulty. The SPM proposed in this paper builds on the larger search domain of IPM and its effective convergence in the later stage. Because SPM is designed specifically for SP data and uses compound individuals as solutions, there is no complicated and difficult accuracy correction process, and convergence efficiency and accuracy are further improved over IPM.

Original Data Processing Model
① The number of SP data in a group is P, and the average length L of its "period" is obtained. G = N × L is P rounded down to a whole number of periods, that is, there are N "periods" in this group of data. G, L, and N are positive integers.
② The sliding size of the SPM window is W, and we set W = L, that is, the coding parameter of a chromosome is W-dimensional.
③ A set of modeling data consists of two parts: L consecutive points form the W-dimensional calculation parameters, and the (L + 1)th point is used as the correction value. For example, the first set of modeling data consists of points 1 to L as calculation parameters and point L + 1 as the correction value. The first (G − L) data points constitute the modeling data, and the remaining L data points are reserved for observing the effect of simulation prediction.
④ From the modeling data sets, a group of (N − 2) modeling data sets spaced L apart is selected as the SP modeling data group of one target child chromosome. For example, the first target child chromosome modeling dataset consists of modeling data sets 1, 1 + L, 1 + 2L, and so on. In this way, the modeling data sets in ③ are assigned to the target child chromosomes for evolutionary modeling.
Future data are predicted in the same way: take the remainder i of the historical data coordinate of the point modulo L, select Com_Chromosme[i], and use it to predict the point to obtain the theoretical data.
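The data processing steps above can be sketched in Python as follows. The sketch assumes 0-based indexing (so the remainder convention may differ by an offset from the paper's coordinates), and the names `spm_partition` and `child_index` are hypothetical.

```python
def spm_partition(series, period):
    """Split an SP series into per-child modeling groups.

    Returns (samples, groups, holdout):
      samples - sliding windows of L inputs plus the next point as target
      groups  - groups[i] holds every L-th sample starting at sample i,
                i.e. the modeling data of child chromosome i
      holdout - the last L points, reserved for checking predictions
    """
    L = period
    N = len(series) // L              # number of whole "periods"
    G = N * L                         # usable length, P rounded down
    data = series[:G]
    train, holdout = data[:G - L], data[G - L:]
    samples = [(train[k:k + L], train[k + L]) for k in range(len(train) - L)]
    groups = [samples[i::L] for i in range(L)]
    return samples, groups, holdout

def child_index(t, period):
    """Child chromosome responsible for the point at (0-based) index t."""
    return t % period

series = list(range(100))             # stand-in for 100 observations
samples, groups, holdout = spm_partition(series, period=10)
```

With 100 observations and L = 10 this yields 80 modeling data sets, split into 10 groups of N − 2 = 8 sets each, and the target of every set in group i lies at an index whose remainder modulo L is i.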

Model Schematic.
The trend chart of an ideal periodic function is convenient for describing the basic principle of SPM, as shown in Figure 3, and the pseudocode of the algorithm is shown in Algorithm 2 below.
In Algorithm 2, L is the length of the SP period, W is the sliding size of the SPM window, and P is the number of SP data in a group.

Evaluation Standard.
In this paper, the following criteria are used to evaluate the algorithm model.

MSE and MAPE.
The MSE is the mean of the squared errors obtained when fitting the regression model, and its value range is [0, +∞). The closer the value is to 0, the better the data obtained from the model fit the original data:
$$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(Y_i - \hat{Y}_i\right)^2, \quad (6)$$
where m is the number of samples, i indexes the ith component of the quantity Y, $Y_i$ is the original value, and $\hat{Y}_i$ is the predicted value. The MAPE is the mean absolute percentage error, and its value range is [0, +∞). The closer the value is to 0, the better the data obtained from the model fit the original data:
$$\mathrm{MAPE} = \frac{100\%}{m}\sum_{i=1}^{m}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right|. \quad (7)$$
The coefficient of determination $R^2$ is
$$R^2 = 1 - \frac{\sum_{i=1}^{m}\left(Y_i - \hat{Y}_i\right)^2}{\sum_{i=1}^{m}\left(Y_i - \bar{Y}\right)^2}, \quad (8)$$
where $\bar{Y}$ is the sample average.
Algorithm 1: IPM (excerpt).
  BEST ⟵ find the best one from CHROMOSME [ ];
  BEST_FIT ⟵ calculate fitness of BEST;
  If (BEST_FIT ≥ MAX_FIT) then
    AVE_ERROR ⟵ calculate error of BEST;
    If (AVE_ERROR ≤ LIMIT) then
      Return BEST;
    Else Break; //restart evolutionary model
    End if;
  End if;
  SONS [ ] ⟵ produce empty population same as CHROMOSME [ ];
  SONS
Algorithm 2: SPM (excerpt).
  Step 2: genetic evolution process
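The error metrics of this subsection can be computed with a few lines of plain Python. The sketch below uses the textbook definitions of MSE, MAPE (in percent), and the coefficient of determination; the function names are illustrative only.

```python
def mse(actual, predicted):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    total = sum(abs((y - p) / y) for y, p in zip(actual, predicted))
    return 100.0 * total / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot

y_true = [2.0, 4.0, 6.0]
y_pred = [2.0, 5.0, 6.0]
scores = (mse(y_true, y_pred), mape(y_true, y_pred), r_squared(y_true, y_pred))
```

MAPE is undefined when an original value is zero, which is the same near-zero denominator issue that the IPM mode treats specially.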

Residual Diagram.
The residual diagram visually evaluates the performance of the model and reveals outliers by plotting the difference, or vertical distance, between the true value and the predicted value. For a good regression model, the errors are expected to be randomly distributed, and the residuals are scattered randomly near the center line.
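A residual diagram is often summarized by the share of points that fall within a given distance of the center line. The Python sketch below computes such shares; it assumes that a band such as "between 0 and 0.4" refers to the absolute residual, and the function names and sample values are hypothetical.

```python
def residuals(actual, predicted):
    """Signed residuals: predicted value minus true value."""
    return [p - y for y, p in zip(actual, predicted)]

def share_within(res, band):
    """Percentage of residuals whose absolute value is at most band."""
    return 100.0 * sum(1 for r in res if abs(r) <= band) / len(res)

res = residuals([1.0, 2.0, 3.0, 4.0], [1.1, 2.5, 2.9, 4.0])
within_04 = share_within(res, 0.4)
within_06 = share_within(res, 0.6)
```

Plotting `res` against the sample index, with a horizontal line at zero, gives the residual diagram itself.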

Experimental Data.
Data group 1 is the daily average price of leeks in the Jiangnan agricultural and sideline product market in Guangzhou City, Guangdong Province, from January 1, 2020 to April 20, 2020, obtained from the national agricultural product price database, with a total of 108 price data. This dataset represents a normal time series and is used to test the performance of the model. Data group 2 is taken from NCEI Sun-Geophysics in Space Weather [25]. It contains the sunspot observation values from 1770 to 1869, 100 in total, and its trend is shown in Figure 4. The sequence of sunspot observations is nonlinear and has multiple time scales [26]. At the same time, because of the strong environmental interference and noise when the values are observed and collected, it has become a classic test case for the effectiveness and prediction accuracy of predictive models applied to complex real-world problems.
Data group 3 consists of the sunspot data from 1919 to 2018 [25]. Because sunspot time series observation is a typical test case for detection and prediction models, it is used to test the performance of several typical algorithms.
For details of the data, please refer to the two documents in the supplementary materials (available here).

Experimental Setup.
In order to highlight the advantages of DM_GEP over GEP, the parameter settings of the simulation and comparison experiments are simplified as far as possible: the function set contains only the four basic arithmetic operations, and the chromosome structure and length are simplified. The conclusions are based on the average of multiple experiments. The parameter settings shared by GEP and DM_GEP are shown in Table 1.
The BP neural network is a multilayer feedforward neural network trained with the error backpropagation algorithm [27]. The gradient descent method is used to minimize the mean square deviation between the actual output value and the expected output value of the target network. The trained target network can process the input information of similar input samples and then output the information obtained from the linear transformation with the current minimum error. The ARIMA model has three key parameters: p, d, and q, which represent the autoregressive order, the order of differencing needed to transform the original sequence into a stationary sequence, and the moving average order, respectively [28]. This paper compares BP, ARIMA, and DM_GEP through a set of time series simulation experiments to verify the performance advantages of the DM_GEP algorithm. Table 2 shows the parameter settings of ARIMA and BP-ANN as the comparison algorithms. The number of hidden nodes in the BP neural network is chosen according to a long-established empirical formula:
$$h = \sqrt{m + n} + a,$$
where h is the number of hidden layer nodes, m is the number of input layer nodes, n is the number of output layer nodes, and a is a constant in the range [1,10]. In Table 2, logis and purelin refer to activation functions.
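The empirical rule for the hidden layer size is commonly written as h = sqrt(m + n) + a with a in [1, 10]. The Python sketch below lists the candidate sizes it produces; rounding to the nearest integer and the function name `hidden_node_candidates` are assumptions made for illustration.

```python
import math

def hidden_node_candidates(m, n, a_min=1, a_max=10):
    """Candidate hidden-layer sizes from h = sqrt(m + n) + a.

    m and n are the numbers of input and output nodes; each integer a
    in [a_min, a_max] gives one candidate, rounded to the nearest integer.
    """
    base = math.sqrt(m + n)
    return [round(base + a) for a in range(a_min, a_max + 1)]

# A sliding window of 10 inputs and one predicted output.
candidates = hidden_node_candidates(10, 1)
```

The rule only narrows the search; the final size is usually picked by comparing validation error across the candidates.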

GEP and IPM.
For the comparison between GEP and IPM, this paper uses data group 1, in which the first 103 data points constitute 98 sets of modeling data and the last 5 data points are retained for simulation prediction. The two target chromosomal individuals representing the average performance of GEP and IPM are as follows:
The predictions of the IPM target are consistent with the sample data, and there is no significant mutation point that deviates from the sample data curve. Although the GEP target individual reflects the fluctuation of the sample data in its curve trend, it clearly has a certain delay, and many data points deviate greatly from the sample data curve. The residual mean values of the GEP target and the IPM target on the sample data are plotted as a residual diagram, as shown in Figure 6, which supports the conclusion mentioned above. For the IPM target, the data points between 0 and 0.4 account for 96.2%, that is, most of the data points, and the proportion between 0 and 0.6 is 99.1%. The prediction data of the GEP target are scattered and unevenly distributed on the upper and lower sides of Y = 0, and the mutation points far away from the center line account for a relatively large share.
There are only 5 perfect fitting points, accounting for 3.9%. The proportion of data points between 0 and 0.2 is 38.9%, 56.4% lie between 0 and 0.4, which is barely more than half, and the proportion between 0 and 0.6 is 65.1%. Table 3 shows the averaged results of the experimental data. It can be seen that, with the same or even shorter evolution time, the MSE and calculation error of the IPM target are still better than those of the GEP target. After the IPM model increases the calculation accuracy by 10%, its time consumption is still close to that of GEP, and it achieves an MSE close to 0 with a large $R^2$. This shows that IPM overcomes the defects of local optimality and premature convergence better than GEP and finds better target individuals. Setting aside time constraints, the IPM target prediction error is only 2.71%, and the MSE and $R^2$ are 0.2 and 0.96, respectively. The prediction model basically fits the actual data. This indicates that, compared with GEP, IPM is better able to explore a wider and deeper search field and to approach the limits with high accuracy.

GEP, IPM, and SPM.
In this paper, data group 2 is selected as the dataset to compare GEP, IPM, and SPM. The data trend chart is shown in Figure 4. It can be seen that the data are composed of several waveforms with large peak fluctuations, similar shapes, and an average time span of 10, so they have obvious SP characteristics and are suitable for testing SPM performance. Of the 100 years of data, the first 90 years are used to train the model, and the remaining 10 years are used for simulation prediction. The target chromosomal individuals of GEP, IPM, and SPM are as follows:
Figure 7 shows the forecasts of these targets for this set of sunspot data, and it can be seen that all three closely follow the trend of the sample data. Although the GEP target reflects the fluctuation trend of the sample data, the troughs are not well fitted, and the prediction curve is abrupt and jagged in many places. In contrast, the SPM target has the best effect: the entire prediction curve shows a large abrupt change only once. Figure 8 shows the residual mean values of the three models, in which the GEP residual points are scattered. The mean error values at each data point of the prediction results of the three models are plotted in Figure 9 to support the above discussion. The averaged experimental results are given in Table 4, from which the performance gap between GEP and DM_GEP can be analyzed. Under the same experimental conditions, the performance of SPM and IPM is close. With only 31% and 25% of the time consumption of GEP, they achieve an accuracy more than 30% higher than GEP. At the same time, MSE improves by more than 700 and $R^2$ improves by more than 15%. Given that both DM_GEP modes are better than GEP, we next analyze the advantages of SPM over IPM on SP data. When the expected accuracy is 0.5, SPM is better than IPM by 56.28 in MSE, and the other indicators are close.
When the expected accuracy is 0.35, the experiment shows that this is the performance bottleneck of IPM. At this setting, SPM achieves an accuracy 5.68% higher than IPM, an MSE better by 80.41, and a slightly better $R^2$, with only 17% of the time consumed by IPM. This shows that, when processing SP data to obtain higher-precision targets, SPM has higher evolution efficiency and better target exploration ability than IPM, as well as a higher performance threshold. In exploring the SPM performance threshold, the experiment shows that the threshold is 0.2, and the average achieved accuracy is 17.73%.
In summary, SPM is more efficient than IPM and GEP in processing SP data and obtains individuals with a uniform residual distribution, larger $R^2$ values, lower MSE values, and higher accuracy thresholds.

ARIMA, BP-ANN, and DM_GEP.
In this paper, data group 3 is selected as the dataset to compare ARIMA, BP-ANN, and DM_GEP. The DM_GEP parameter settings are the same as in Table 1, except that the expected accuracy is changed to 0.2 and the expected fitness is set to 7850. The sliding window of the BP neural network and DM_GEP is 10, and the last 10 data points are reserved as simulation prediction data for all three models. Figure 10 shows the experimental fitting effects of the three algorithms. Table 5 shows the statistical analysis of the experimental results.
It can be seen from Table 5 that the mean absolute percentage error (MAPE), MSE, and R of the DM_GEP model are better than those of the BP neural network and the ARIMA model. It can be seen from Figure 10 that the regression prediction curve of the DM_GEP model basically fits the modeling sample data curve, and the MAPE is 16.27. The trend of the last ten predicted data points conforms to the future development trend, and the effect is the best of the three.
Some scholars have also addressed the shortcomings of ARIMA and BP neural networks: Min et al. [29] used SVM to map data to a high-dimensional space to weaken the interference caused by nonlinearity. He et al. [30] used the PSO algorithm to optimize the weights of each connection layer of a BP neural network. In contrast, DM_GEP has excellent linear and nonlinear analysis and modeling capabilities. It places no special requirements on the amount of historical data, and it achieves excellent regression prediction accuracy on modeling data with large environmental noise without noise reduction. These characteristics give DM_GEP better prospects and wider applicability in the field of forecasting.

Conclusion
In the leek price prediction experiment, the results show that IPM can find better individuals in a shorter time than ordinary GEP. In the experiment on predicting sunspot observations, SPM achieves higher accuracy in a shorter time than ordinary GEP and IPM. In addition, the comparison with ARIMA and BP-ANN on the prediction of sunspot observations also shows that the accuracy of SPM is higher. GEP has unique advantages in the family of prediction algorithms. However, it has shortcomings such as a tendency to fall into a local optimum and difficulty in regressing complex nonlinear data. In this regard, a new DM_GEP prediction algorithm was proposed in this paper, which combines the high efficiency of GEP on simple linear regression problems with excellent nonlinear data analysis and modeling capabilities. To avoid premature convergence, the algorithm expands the search space by relaxing the error judgment for true values close to 0 and by reevaluating target individuals. By changing the single mode of the modeling data and using dedicated methods, the complicated and difficult correction process for SP data is avoided. At the same time, the algorithm remedies the deficiencies of GEP, lowers the threshold for applying GEP to prediction, and enhances practicality and generalization.
In this paper, DM_GEP used only the four basic arithmetic operations as the function set and simply structured chromosomes, in order to make the experimental comparison clear. The next research direction is to add a richer set of functions and more diverse connection functions and to study a more general data preprocessing method for DM_GEP, so that the new GEP algorithm becomes more convenient and general and its accuracy improves.

Data Availability
The data used to support the findings of this study are included within the Supplementary Information file.