Application of Gene Expression Programming to Evaluate Strength Characteristics of Hydrated-Lime-Activated Rice Husk Ash-Treated Expansive Soil

Department of Civil and Mechanical Engineering, Kampala International University, Kampala, Uganda; Department of Civil Engineering, Shanghai Jiao Tong University, Shanghai, China; Department of Civil Engineering, Faculty of Engineering, University of Nigeria, Nsukka, Nigeria; Department of Environmental Technology, Federal University of Technology, Owerri, Nigeria; Department of Civil Engineering, Michael Okpara University of Agriculture, Umudike, Nigeria


Introduction
The design, construction, and monitoring of earthwork infrastructure are of utmost importance because of the failures that civil engineering facilities experience every day [1][2][3][4]. For this reason, composite materials with special properties have been developed to replace ordinary cement [5][6][7][8]. One such technique in the utilization of special binders is the introduction of activators to ash materials to form activated ash that can resist unfavorable conditions and factors that have proven to be adverse to constructed infrastructure [9][10][11][12][13][14]. Meanwhile, the evolution of soft computing in engineering has added to the efficiency of designing, constructing, and monitoring the performance of earthworks [15][16][17][18][19]. One such soft computing or machine learning method is gene expression programming (GEP). Genetic programming (GP), introduced by Cramer [20], and its extension, GEP, are branches of the genetic algorithm (GA), which is regarded as an evolutionary computing technique [20][21][22]. It is based on Darwin's principle of "survival of the fittest" and does not require prior assumptions about the solution structure [23]. The working procedure of GP comprises the following steps [24]: (1) create an initial population in accordance with the function and terminal settings; (2) assess the performance of the generated population using two key criteria, the fitness function and the maximum number of generations; (3) if the population meets the fitness requirement or the maximum number of generations is reached, terminate the program; otherwise, keep generating new populations using the three genetic operations of reproduction, crossover, and mutation until one of these termination criteria is met. The experimental database was separated into training, validation, and testing sets for the GEP analysis.
To confirm a consistent data division, many combinations of the training and testing sets were tried [25].
Figure 1 shows that the input data are fed either to GP or to a mathematical model that incorporates GP, which yields predicted values that are compared with the observed values. The differences between the two are the residual errors, which are reduced by continued iteration in the GEP tool until an optimum model is obtained.
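The working procedure outlined above (initial population, fitness assessment, termination test, then reproduction, crossover, and mutation) can be illustrated with a minimal Python sketch. It is a toy real-valued evolutionary loop, not the GEP implementation used in this study; the function name `evolve`, all parameter values, and the toy data are assumptions chosen for illustration.

```python
import random

def evolve(fitness, n_params, pop_size=30, max_generations=200,
           target=1e-6, mutation_rate=0.2, seed=0):
    """Minimise `fitness` with a toy evolutionary loop (illustrative only)."""
    rng = random.Random(seed)
    # (1) Create an initial population of candidate solutions.
    population = [[rng.uniform(-5, 5) for _ in range(n_params)]
                  for _ in range(pop_size)]
    best, history = None, []
    for _ in range(max_generations):
        # (2) Assess every individual with the fitness function.
        population.sort(key=fitness)
        best = population[0]
        history.append(fitness(best))
        # (3) Terminate when the fitness target is met; the loop bound
        #     enforces the maximum number of generations.
        if history[-1] <= target:
            break
        # Reproduction: the better half survives unchanged.
        survivors = population[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            # Crossover: one-point recombination of two parents.
            cut = rng.randint(1, n_params - 1) if n_params > 1 else 0
            child = mother[:cut] + father[cut:]
            # Mutation: random perturbation of some genes.
            child = [g + rng.gauss(0, 0.3) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = survivors + children
    return best, history

# Hypothetical toy problem: recover a and b in y = a*x + b.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def mse(params):
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

best, history = evolve(mse, n_params=2)
print("generations run:", len(history), "best error:", history[-1])
```

Because the better half of each population is carried over unchanged, the best error recorded in `history` never increases from one generation to the next.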

Preparation of Materials.
Expansive clay soil was prepared, and tests were conducted on both the untreated and the treated soils to obtain the datasets presented in Table 1, which were needed for the evolutionary predictive modeling. The hydrated-lime-activated rice husk ash (HARHA) is a hybrid geomaterial binder developed by blending rice husk ash with 5% by weight of an activator agent, in this case hydrated lime (Ca(OH)2), and leaving the blend for 48 hours. Rice husk is an agro-industrial waste derived from the processing of rice in rice mills and homes. Through the controlled direct combustion proposed by Onyelowe et al. [4], the rice husk mass is turned into rice husk ash (RHA). The HARHA was used in incremental proportions to treat the clayey soil, and the response of the different soil properties was tested, observed, and recorded (see Table 1). Figure 2 presents the flowchart of the gene expression programming method and its execution. The 121 input and output datasets were loaded into the GeneXpro software computing platform to generate the predicted outputs and the models. Several trials or iterations were carried out to achieve the best fit.

Pearson Correlation.
Pearson's correlation matrix [26] was generated from the given data, comprising seven input and three output parameters, using the data analysis capabilities of Microsoft Excel. The correlation matrix is defined as a square, symmetrical P × P matrix whose (i, j)th element equals the correlation coefficient R_ij between the ith and the jth variables. The diagonal members (correlations of variables with themselves) are always equal to one [27]. Thus, the left-hand nine columns of this correlation matrix represent qualitatively the correlations between the input soil hydraulic-prone properties (HARHA, w_L, w_P, I_P, w_OMC, A_C, δ_max) and the output soil strength properties, i.e., CBR, UCS_28, and R_Value (Table 2). The correlation coefficient ranges from −1 to 1 (0 represents no correlation, whereas ±1 indicates a perfect linear correlation). A positive value indicates that the two variables increase or decrease together. Table 2 indicates that CBR, UCS_28, and R_Value have correlation coefficients above 0.90 with all input parameters, with the exception of w_OMC for the last two outputs (0.134 and 0.363, respectively). Thus, a high correlation exists in this correlation matrix for the considered input and output parameters. Figure 3 presents the frequency histograms of the input variables, (a) HARHA, (b) w_L, (c) w_P, (d) I_P, (e) w_OMC, (f) A_C, and (g) δ_max, and of the output variables, (h) CBR, (i) UCS_28, and (j) R_Value.
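As a minimal sketch of how such a correlation matrix is built, the following Python code computes the Pearson coefficient for every pair of variables. The study itself used Microsoft Excel; the function names and the three toy columns below are hypothetical and are not taken from Table 1.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def correlation_matrix(columns):
    """Return the symmetric P x P matrix as a dict of dicts keyed by name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}

# Hypothetical toy columns (not the measured soil data).
toy = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 4.0, 6.0, 8.0, 10.0],   # rises with x1
    "y":  [5.0, 4.0, 3.0, 2.0, 1.0],    # falls as x1 rises
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, [round(value, 3) for value in row.values()])
```

The diagonal entries equal one and the matrix is symmetric, as stated in the definition above; perfectly linear pairs give +1 or −1 depending on the direction of the relationship.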

Gene Expression Programming.
The performance of a GEP model developed from a database is affected by the sample size and the distributions of its variables, which agrees with the findings of Gandomi and Roke [25]. Thus, the frequency histograms for all the input parameters (HARHA, w_L, w_P, I_P, w_OMC, A_C, and δ_max) and output values (CBR, UCS_28, and R_Value) are visualized in Figure 3. The bell-shaped curves indicate an even distribution of the data. This type of diagram is often used for the initial assessment of geochronological data, which involves relatively large datasets, according to Sircombe [28]. All the data exhibit even sample distributions and follow a symmetrical pattern, so the histograms are straightforward to interpret.
The descriptive statistics of the input and output parameters are tabulated in Table 3. This statistical summary shows the minimum and maximum values for all input and output parameters. The standard deviation (SD), kurtosis, and skewness are also given for each parameter, which agrees with Edjabou et al. [29]. A low SD means that most of the values are close to the average (w_P, w_OMC, A_C, δ_max, and R_Value), whereas a larger SD means that the values are more spread out (w_L, I_P, CBR, and UCS_28). Skewness quantifies the asymmetry of the probability distribution of a real-valued random variable with respect to its mean. It can be positive, zero, negative, or undefined [30]. Negative values generally suggest that the tail is extended on the left side of the distribution curve (w_L, w_P, I_P, w_OMC, A_C, δ_max, and R_Value), while positive skewness shows that the tail is on the right side (CBR and UCS_28), which is reflected in the frequency histograms given in Figure 3 and the variable importance presented in Figures 4-6. Like skewness, kurtosis describes the shape of a probability distribution [31]. The Pearson kurtosis of a univariate normal distribution is 3. Distributions with kurtosis below 3 are called platykurtic, meaning that they produce fewer and less extreme outliers than the normal distribution (a uniform distribution is one example), which is reflected in Figure 3.
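The statistics discussed above can be sketched in a few lines of Python. This sketch assumes the population (uncorrected) moment formulas, with skewness m3/m2^1.5 and Pearson kurtosis m4/m2^2, so that a normal distribution has a kurtosis of 3; statistical packages may apply sample-size corrections or report excess kurtosis instead, and it is not known which convention was used for Table 3. The function name `describe` and the toy samples are hypothetical.

```python
import math

def describe(values):
    """Min, max, mean, SD, skewness, and Pearson kurtosis (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "sd": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,   # < 0: left tail, > 0: right tail
        "kurtosis": m4 / m2 ** 2,     # < 3: platykurtic
    }

# Hypothetical toy samples (not the measured soil data).
symmetric = describe([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = describe([1.0, 1.0, 1.0, 2.0, 10.0])
print(symmetric)
print(right_tailed)
```

The evenly spaced sample has zero skewness and a kurtosis below 3 (platykurtic), while the second sample, with one large value, has positive skewness.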
To select the most appropriate GEP estimation model for HARHA-treated expansive soils, several models with a varying number of genes were generated by employing a set of genetic operators (mutation, transposition, and crossover). Initially, a model composed of two genes with additional linking functions and a head size of four (H = 4) was selected and run a number of times. After that, the parameters were altered in a stepwise order by increasing the number of genes to three, the head size to eight (H = 8), and the number of chromosomes to 50, and by adjusting the weights of the function sets. The program was run several times for the different models, and the resulting models were checked and compared with regard to their performance. Furthermore, parameters such as the mutation rate, inversion, and points of recombination were chosen on the basis of past studies [32][33][34] and then assessed to obtain their optimum impact. After several trials, the final mathematical model was obtained; its selected parameters, including detailed information on the general settings, numerical constants, and genetic operators, are listed in Table 4. The final prediction model was chosen on the basis of the best fitness and the lowest complexity of the mathematical formulation, while the expression trees (ETs) are illustrated in Figures 7-9 for the model outcomes CBR, UCS_28, and R_Value, respectively.
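To illustrate how a GEP gene relates to an expression tree and then to a closed-form expression, the sketch below decodes a gene written in Karva notation (a K-expression), in which the tree is filled level by level from left to right. The symbol set, the example gene, and the function names are assumptions for illustration; they are not the genes or function set of the models in Table 4.

```python
import math

# Assumed function set: symbol -> number of arguments (Q is square root).
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "Q": 1}

def decode(k_expression):
    """Build a nested-list expression tree from a K-expression, breadth-first."""
    symbols = list(k_expression)
    root = [symbols[0]]
    queue, position = [root], 1
    while queue:
        node = queue.pop(0)
        # Terminals have arity 0, so they receive no children.
        for _ in range(ARITY.get(node[0], 0)):
            child = [symbols[position]]
            position += 1
            node.append(child)
            queue.append(child)
    return root  # unused trailing symbols form the noncoding region

def evaluate(node, variables):
    """Evaluate the tree for given terminal values."""
    symbol = node[0]
    args = [evaluate(child, variables) for child in node[1:]]
    if symbol == "+":
        return args[0] + args[1]
    if symbol == "-":
        return args[0] - args[1]
    if symbol == "*":
        return args[0] * args[1]
    if symbol == "/":
        return args[0] / args[1]
    if symbol == "Q":
        return math.sqrt(args[0])
    return variables[symbol]

def to_infix(node):
    """Write the tree as a readable mathematical expression."""
    symbol = node[0]
    args = [to_infix(child) for child in node[1:]]
    if symbol == "Q":
        return "sqrt(" + args[0] + ")"
    if symbol in ARITY:
        return "(" + args[0] + " " + symbol + " " + args[1] + ")"
    return symbol

tree = decode("Q*+-abcd")
print(to_infix(tree))
print(evaluate(tree, {"a": 1.0, "b": 2.0, "c": 7.0, "d": 4.0}))
```

In a multigenic chromosome, each gene is decoded in this way and the resulting sub-expressions are joined by the chosen linking function.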
To formulate the three models for the respective output parameters, the input parameters were first selected from the extensive experimental study, as given below:

$$\mathrm{CBR},\ \mathrm{UCS}_{28},\ R_{\mathrm{Value}} = f\left(\mathrm{HARHA},\ w_L,\ w_P,\ I_P,\ A_C,\ w_{OMC},\ \delta_{\max}\right),$$

where CBR is the California bearing ratio, UCS_28 is the unconfined compression strength after 28 days, R_Value is the resistance value, HARHA is the hydrated-lime-activated rice husk ash, w_L is the liquid limit, w_P is the plastic limit, I_P is the plasticity index, w_OMC is the optimum moisture content, A_C is the activity value, and δ_max is the maximum dry density.

Applied Computational Intelligence and Soft Computing

As reported earlier, multilinear regression (MLR) was conducted to evaluate quantitatively the relationships between the input soil hydraulic-prone properties and the output soil strength properties, i.e., CBR, UCS_28, and R_Value. Each output value was defined as a combination of the seven input parameters (HARHA, w_L, w_P, I_P, w_OMC, A_C, and δ_max), and an MLR equation was derived for each output. These equations are useful tools for estimating the soil strength properties from easily determinable geotechnical indices for HARHA-treated expansive soils. However, the MLR equations can only be employed when the data points show linearly changing behavior [27]. They were derived for comparison with the developed GEP models for CBR, UCS_28, and R_Value.

Using the expression trees given in Figures 7-9 for evaluating the CBR, UCS_28, and R_Value of soils, respectively, decoding was done to derive three simple mathematical expressions (equations (5)-(7)). The comparisons between the predicted and the observed expansive soil parameters are shown in Figure 10. High accuracy can be observed for CBR, UCS_28, and R_Value, with higher R² values for the GEP-formulated models. This suggests that the predictions of the output parameters using the proposed models are in good agreement with the testing data.
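For readers unfamiliar with the MLR baseline mentioned above, the following Python sketch fits a multilinear model by ordinary least squares through the normal equations. It is a generic illustration and not the procedure or software used by the authors; the function name `fit_mlr` and the two-predictor toy data are hypothetical.

```python
def fit_mlr(rows, targets):
    """Least-squares coefficients [b0, b1, ...] for y = b0 + b1*x1 + ..."""
    design = [[1.0] + list(row) for row in rows]   # prepend intercept column
    p = len(design[0])
    # Augmented normal equations [X^T X | X^T y].
    system = [
        [sum(r[i] * r[j] for r in design) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(design, targets))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(system[r][col]))
        system[col], system[pivot] = system[pivot], system[col]
        for r in range(p):
            if r != col:
                factor = system[r][col] / system[col][col]
                system[r] = [a - factor * b
                             for a, b in zip(system[r], system[col])]
    return [system[i][p] / system[i][i] for i in range(p)]

# Hypothetical toy data generated from y = 1 + 2*x1 + 3*x2.
toy_rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
toy_targets = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in toy_rows]
coefficients = fit_mlr(toy_rows, toy_targets)
print([round(c, 6) for c in coefficients])
```

Such a model is linear in every predictor, which is why it is only appropriate when the data change linearly, whereas the GEP expressions can include nonlinear terms.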
It can be seen in Figure 11 that the range of the error distribution for CBR and R_Value is significantly narrower than that for UCS_28. This could be attributed to the larger SD value and data range of UCS_28, as reflected in Table 1. In addition, the proposed GEP models exhibit superior performance for the CBR and R_Value cases in comparison with the respective MLR plots. However, the GEP results are not better than those of the MLR model in terms of error distribution, as shown in Figures 7(c) and 7(d), respectively.
Finally, the summary of statistical performance is listed in Table 5. A variety of performance indices have been determined, including the root mean square error (RMSE), mean absolute error (MAE), relative squared error (RSE), Nash-Sutcliffe efficiency (NSE), relative root mean square error (RRMSE), coefficient of correlation (R), performance index (ρ), and objective function (OBF), to evaluate the performance of the developed CBR, UCS_28, and R_Value GEP models. In the RMSE, the errors are squared, so a much larger weight is assigned to the larger errors. High R values and low RRMSE values indicate a high degree of accuracy, which agrees with the results of Gandomi and Roke [25]. For the proposed models, the MAE, RMSE, RSE, and RRMSE values are significantly lower, while the NSE and R values are larger, for CBR and R_Value, which shows superior model performance. The opposite holds for UCS_28, which indicates lower performance. Similarly, the performance indices and OBF values are well within the allowable limits reported in the literature [32,35,36]. These results further show that the proposed GEP models for CBR and R_Value performed much better than the one for UCS_28, thereby achieving reliable and accurate results. The range of data for UCS_28 is several times greater than those of CBR and R_Value, which is also reflected in Table 2. So, GEP models were used to formulate simple mathematical equations which can be readily employed to predict CBR, UCS_28, and R_Value.

Figure 11: Error distribution diagram for CBR, UCS, and R value generated models using GEP and MLR, respectively.
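Several of the performance indices named above can be computed as in the following Python sketch. The definitions used here are commonly adopted forms and are assumptions, since the paper's own equations are not reproduced in this text: RSE is taken as the relative squared error, NSE as one minus that ratio, RRMSE as the RMSE divided by the absolute mean of the observed values, and ρ as RRMSE/(1 + R). The OBF is omitted because its definition depends on how the training and validation sets are combined. The function name and toy values are hypothetical.

```python
import math

def performance_indices(observed, predicted):
    """Common error and agreement measures (assumed definitions, see above)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    squared_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    squared_dev_o = sum((o - mean_o) ** 2 for o in observed)
    squared_dev_p = sum((p - mean_p) ** 2 for p in predicted)
    covariance = sum((o - mean_o) * (p - mean_p)
                     for o, p in zip(observed, predicted))
    rmse = math.sqrt(squared_error / n)
    mae = sum(abs(o - p) for o, p in zip(observed, predicted)) / n
    rse = squared_error / squared_dev_o
    nse = 1.0 - rse
    rrmse = rmse / abs(mean_o)
    r = covariance / math.sqrt(squared_dev_o * squared_dev_p)
    rho = rrmse / (1.0 + r)
    return {"RMSE": rmse, "MAE": mae, "RSE": rse, "NSE": nse,
            "RRMSE": rrmse, "R": r, "rho": rho}

# Hypothetical observed and predicted values (not from Table 5).
toy_observed = [10.0, 20.0, 30.0, 40.0, 50.0]
toy_predicted = [12.0, 18.0, 33.0, 39.0, 51.0]
for name, value in performance_indices(toy_observed, toy_predicted).items():
    print(name, round(value, 4))
```

Under these definitions, a better model has lower RMSE, MAE, RSE, RRMSE, and ρ, and NSE and R closer to one, which matches the interpretation given in the text.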