Prediction of California Bearing Ratio of Granular Soil by Multivariate Regression and Gene Expression Programming

This research presents an investigation into the California bearing ratio (CBR) of granular soils from the Qassim region, Saudi Arabia. Multilinear regression (MLR), pure quadratic (PQ) models, and gene expression programming (GEP) were used to develop mathematical models for estimating the CBR from basic soil index properties. Forty-three soil samples were collected from different borrow pits in the Qassim area and transferred to a laboratory for examination. Seven MLR models, seven PQ models, and four GEP models were investigated. The variables of each model were selected by grouping the soil indices into grain size distribution, Atterberg limits, and compaction parameters. The results showed that the PQ model had a higher accuracy [coefficient of determination (R²) = 0.89, root mean square error (RMSE) = 16.006, uncertainty (U95) = 16.17, and reliability = 57%] than the MLR model [R² = 0.811, RMSE = 20.791, U95 = 15.569, and reliability = 51%]. The best GEP model yields [R² = 0.776, RMSE = 22.552, U95 = 15.787, and reliability = 53%]. Furthermore, a sensitivity analysis was conducted to distinguish the influences of the different input variables on CBR; it was found that fines percentage (F200), maximum dry density (MDD), and optimum moisture content (OMC) are the most influential variables.


Introduction
Saudi Arabia has a vast and well-developed transportation network, with a road network of 73,171 km. Roads carry approximately 65 percent of freight and 80 percent of passenger transportation [11]. The total number of automobiles in Saudi Arabia exceeds 12 million, and the number of vehicles has grown by approximately 5% per annum over the last five years [20]. Geotechnical engineering is one of the most crucial subjects in the initial stages of infrastructure planning and design, because poor geotechnics may result in unnecessarily high costs if not addressed properly. The CBR value is a critical parameter in the structural design of pavements. It may be determined directly in the laboratory using AASHTO T 193 and ASTM D 1883.
Two broad correlations for predicting CBR were also presented in the Guide for Mechanistic and Empirical Design (2001). The models are constructed using empirical parameter data for the following soil properties: D60 (the particle diameter at 60% passing), P200 (percent passing the US no. 200 sieve), and PI (plasticity index). PI and the percent passing the US no. 200 sieve (0.075 mm) are the two parameters included in the suggested model for plastic fine-grained soils [15]; in the recommended equation, F200 is the percentage passing the US no. 200 sieve and PI is the plasticity index. For nonplastic coarse-grained soil, the proposed equation is

$$\mathrm{CBR} = 28\,D_{60}^{0.358}.$$
Determining the CBR value of each soil sample demands at least four days in the laboratory. Historically, civil engineers have struggled to determine a representative CBR value for pavement design [36]. The CBR value varies according to the kind of soil and its various qualities. Because soil engineering properties are variable, results obtained from soil samples gathered at a few places cannot be applied to the entire road path. To overcome this, a substantial number of specimens must be collected for testing, which makes the method costly and time consuming.

Black made the first attempt to predict CBR in 1962 [13]; his work was based on remolded soil and its friction angle. It was also shown that the suction of such a remolded soil and its true angle of friction can be inferred from data on the liquid limit, plastic limit, and water content of the soil. Agarwal and Ghanekar predicted the CBR depending on the grain size distribution as well as the consistency parameters, specifically the PI [2]. These attempts were followed by many others using more advanced prediction methods, each reflecting a particular soil classification, set of index properties, and study area [3,5,7]. Yildirim and Gunaydin [35] determined CBR using soft computing, with the soil index properties and compaction properties of subgrade soil in Turkey as input parameters. Their work revealed a high correlation (R² = 0.80-0.95) with the soil indices, and they recommend using the provided formulas in the early stages of the design process [35]. Kumar (2014) proposed a linear regression model for ML and MI soils on the basis of the variables PI, liquid limit (LL), plastic limit (PL), MDD, and OMC; for ML soils, CBR and PI have an inverse linear relation [32]. Several linear regression models for estimating soaked CBR in clay soils were compared by Ramasubbarao and Siva [26]. Tenpe et al. constructed several mathematical models to predict CBR from a variety of soil parameters as independent variables, using GEP and ANN models. They also conducted comparative research to demonstrate that there is little difference between the findings obtained from GEP and those from ANN [6]. Numerous scholars have presented such models [6,29,34], but there is still a knowledge gap concerning the assessment and comparison of their predictive capability. Aleksandra et al. investigated the effects of the compaction parameters, namely MDD and OMC, on CBR [10].
Interestingly, no correlation formula has been developed using the soil of the Qassim area. The region's geological formations are divided into two large landmasses: the Arabian Shield in the west and the Arabian Shelf in the east. The Arabian Shield is made up of igneous and metamorphic rocks, while the Arabian Shelf is made up of sedimentary rocks. The soils were generated from the weathering products of these rocks. Predicting the value of the CBR test would shorten the time required from 96 hours to minutes, which would reduce the overall cost of projects in various ways. The Al-Qassim region is one of the most important logistical areas, yet at present there is no published work correlating soil properties and CBR in the Qassim area and its environs. The purpose of this study is therefore to determine whether GEP and multivariate regression approaches are applicable for estimating the CBR of flexible pavement subbase layers in the Qassim area.

Samples Collection.
Qassim is a predominantly desert region with varied geology, and the so-called Arabian Shield lies in its southwestern part. It is located in the middle of the Kingdom of Saudi Arabia [4]. Qassim is considered a vital region linking the north with the south and the west with the east; thus, heavy carriers pass through its roads. In Saudi Arabia, roads are built in accordance with the Ministry of Transportation [8] (MOT, 1998) regulations. Tables 1 and 2 present the gradation of the subbase layer and the limits of the indices. During the building of new roads in the Qassim area, samples were taken from various places and borrow pits. These samples were sieved to find the percentage of aggregate (gravel), the percentage passing sieve no. 4 (4.75 mm) and retained on sieve no. 200 (0.075 mm) (sand), and the percentage passing sieve no. 200 (0.075 mm) (fines). The consistency tests were conducted on the fine soil to determine the liquid and plastic limits. The relationship between moisture content and dry density was then determined by the modified Proctor test. Then, specimens of each sample were prepared at OMC and at different densities, and the CBR was determined for each dry unit weight, up to the highest dry unit weight.

Experimental Work.
In this research, to study the factors that explain CBR characteristics and its prediction, soil samples were gathered from 43 borrow pits in the Qassim area, as shown in Figure 1, with a different number of specimens taken from each location. The test method used to determine the CBR follows the MOT (1998) specifications, which comply with MRDTM 213 (ASTM D 1883, AASHTO T 193) [31]. The compaction parameters were found in accordance with ASTM D 698 and D 1557 (AASHTO T 99 and T 180). The particle size distribution follows MRDTM 204. MRDTM 208/209 was used to determine the LL, PL, and PI. These limits were determined on the soil samples having a fines percentage greater than 5%. Table 3 shows that, in general, the average percentages of fines (silt and clay), sand, and gravel were 15.32%, 28.3%, and 56.421%, respectively. The averages of LL, PL, and PI were 7.8%, 6.18%, and 1.62%, respectively, while the average OMC was 7.73% and the average MDD was 2.15 g/cm³.
The AASHTO classification system (ASTM designation D 3282; AASHTO method M 145) was used to classify the soil samples; 34 samples were classified as A-1-a and A-1-b, and ten samples were classified as A-2.
Grain size analyses were performed first to classify the soil in the studied area and to quantify the fines percentage [F200], sand percentage [S], and gravel percentage [G]; the grain size distribution curve is shown in Figure 2. After the analysis of all samples, the liquid limit and plastic limit tests were conducted; the plasticity chart is shown in Figure 3. Thirty percent of the soil samples had a liquid limit larger than zero and were classed as inorganic clays of low to medium plasticity and inorganic silts of low compressibility. The majority of the samples plot above line A and below line U. The alignment of soil samples parallel to line A on the plasticity chart shows that they have the same geological origin [19,30]. Table 3 provides a full statistical summary of the soil. To show the distribution of the parameters more clearly, frequency histograms were created for each parameter in Figure 4.

Multivariate Regression.
Single and multivariate regression analyses are well-known predictive techniques that are frequently used to construct models for predicting desired output parameters from specified input factors, such as the physical and mechanical characteristics of rocks [17]. Single regression studies examine the connection between an input and an output variable using a linear, logarithmic, power, or exponential function. Multivariate regression studies are advantageous when several input parameters are present in complicated connections, and the common form of a multivariate regression equation is

$$\mathrm{CBR} = a + \sum_{i=1}^{n} b_i\,G_{pi} + \varepsilon,$$

where CBR is the California bearing ratio (dependent variable), G_pi is a soil index or mechanical property (independent variable), n is the number of variables, b_i is the coefficient of regression, a is the constant, and ε is the error estimate. CBR is used as the dependent variable, while the soil parameters in Table 4 are used as independent variables, either individually or as groups of parameters, depending on the relationship between these data and the correlation matrix shown in Table 5. In addition, the input variables of the multilinear equations are selected by grouping the soil indices into three categories. The first is the grain size distribution, whose parameters are gravel percent, sand percent, and fines percent. The second is the Atterberg limits, which are LL, PL, and PI. The third is the compaction variables, which include MDD and optimum water content. This grouping helps to find the parameters with the highest correlation with CBR.
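The multivariate regression form described above (a constant a plus one coefficient b_i per soil parameter) can be fitted by ordinary least squares. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names `fit_mlr` and `predict_mlr` are hypothetical, and the data are synthetic rather than the Qassim measurements.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares fit of y = a + sum_i b_i * x_i. Returns (a, b)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the constant a.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]

def predict_mlr(a, b, X):
    """Evaluate the fitted linear model on new rows of X."""
    return a + np.asarray(X, dtype=float) @ b

# Synthetic, noise-free illustration (assumed values, not survey data):
# 43 samples, three predictors standing in for soil indices.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(43, 3))
y_demo = 5.0 + X_demo @ np.array([2.0, -1.0, 0.5])
a_hat, b_hat = fit_mlr(X_demo, y_demo)
```

Because the synthetic response is exactly linear, the fit recovers the generating constant and coefficients; with real laboratory data the residual term ε would be nonzero.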

Multinonlinear Regression.
Nonlinear regression models the relationship between the dependent variable and a nonlinear function of the independent variables; nonlinear multivariable regression extends this to several independent variables. There are various modeling techniques and regression approaches for specific aspects, especially when data are limited [17].

The strongest association between the response and the independent variables may be identified by stepwise (step-by-step) regression. This algorithmic technique selects suitable model subsets by forward or backward filtering. First, an initial model is chosen, and model terms are then added or removed to optimize the fit. That is, stepwise regression is a forward selection procedure that re-examines the significance of all previously entered variables. Variables whose partial sums of squares do not meet the minimum criterion are removed by backward elimination, one at a time, until the criterion is met. Stepwise regression requires more computation than forward selection or backward elimination alone, but it yields better results [33]. Scheffe proposed a second-order polynomial for the multinonlinear regression (MNLR) analysis of mechanical characteristics [21]. In this study, MNLR in PQ form has been investigated.
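To make the PQ form concrete, the sketch below fits a pure quadratic model, meaning a constant, a linear term, and a squared term for each variable, with no cross-product (interaction) terms. This is an illustrative assumption about the model structure based on the usual meaning of "pure quadratic"; the helper names `pq_design`, `fit_pq`, and `predict_pq` are hypothetical, and the data are synthetic.

```python
import numpy as np

def pq_design(X):
    """Design matrix [1, x_1..x_n, x_1^2..x_n^2] (no interaction terms)."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(len(X)), X, X ** 2])

def fit_pq(X, y):
    """Least-squares coefficients of the pure quadratic model."""
    coef, *_ = np.linalg.lstsq(pq_design(X), np.asarray(y, dtype=float), rcond=None)
    return coef

def predict_pq(coef, X):
    """Evaluate the fitted pure quadratic model."""
    return pq_design(X) @ coef

# Synthetic, noise-free illustration with two predictors (assumed values).
rng = np.random.default_rng(1)
X_demo = rng.uniform(0.0, 2.0, size=(40, 2))
y_demo = (1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]
          + 0.5 * X_demo[:, 0] ** 2 + 4.0 * X_demo[:, 1] ** 2)
pq_coef = fit_pq(X_demo, y_demo)
```

The model remains linear in its coefficients, so it can be fitted with the same least-squares machinery as the multilinear model; only the design matrix changes.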

Gene Expression Programming (GEP).
GEP, an algorithm based on the genetic algorithm (GA) and genetic programming (GP), was first proposed by Ferreira [27]. The GEP technique integrates two evolutionary strategies that are broadly acknowledged in the scientific community. The first is the GA methodology, which simplifies complex relationships by representing them with linear structures of fixed length called chromosomes. The second is GP, which makes use of expression tree (ET) configurations that may take on a wide range of forms and dimensions [22]. By evolving program code recorded in linear chromosomes of fixed length, GEP exhibits some characteristics akin to biological evolution. GEP's primary goal is to create a mathematical function that fits a collection of data supplied to the GEP model. The GEP method performs symbolic regression on the mathematical equation using the majority of GA's genetic operators. GEP, GA, and GP differ in how individuals are represented. In GA, individuals are symbolic strings of fixed length (chromosomes); in GP, they are nonlinear entities of varying sizes and forms (parse trees); in GEP, each individual is represented by a fixed-length string that is used to build ETs of different sizes and shapes [9]. The primary GEP process is presented in Figure 5. Five components are employed in GEP analysis: function set, terminal set, fitness function, control parameters, and stop condition. After the problem is encoded into candidate solutions and the fitness function is specified, a randomly generated population of viable individuals (chromosomes) is constructed, and each is translated into an ET corresponding to a mathematical equation. Following that, the predicted values are compared to the actual ones, and the fitness score for each chromosome is calculated. The algorithm ends when a suitable fitness score is reached. If this is not the case, some chromosomes are selected and then mutated to generate new generations. This process is repeated until a suitable fitness score is produced, at which point the best chromosome is decoded to obtain the optimal solution to the problem [14].

GEP Elements.
The ET and the chromosome are the two most important components of the GEP. The chromosome is made up of one or more genes, each of which encodes a mathematical expression. To code any mathematical equation, a bilingual and unambiguous notation called the Karva language is used, comprising the language of the genes and the language of the ETs. This property of the GEP is important because it allows the genotype to be inferred more precisely than would otherwise be possible. The GEP genes are divided into two sections, referred to as the head and the tail. The head of a gene, which may contain mathematical functions, variables, and constants, is used to encode a particular function. The tail of a gene, which contains only variables and constants, provides supplemental terminal symbols in addition to the main terminal symbols. These symbols are used when the terminal symbols in the head are insufficient to express a particular function. Many functions, such as the fundamental arithmetic operators, trigonometric functions, and any other mathematical or user-defined functions, may be used in the head of a gene.
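The translation from a gene to an expression tree can be sketched in a few lines. The example below is an illustration under assumptions, not the software used in the study: it reads a gene written in Karva notation breadth-first, gives each function symbol as many children as its arity, and leaves any unused trailing symbols as a noncoding region. The symbol set, the use of `Q` for square root, and the names `decode_gene` and `evaluate_tree` are hypothetical choices for illustration.

```python
import math

# Arity of each function symbol; any symbol not listed is a terminal.
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "Q": 1}  # Q denotes square root

def decode_gene(gene):
    """Decode a Karva-notation gene breadth-first into a node list.

    Each node is [symbol, child_indices]. Symbols left over after the
    tree is complete form the noncoding region and are ignored.
    """
    nodes = [[gene[0], []]]
    next_symbol = 1  # position of the next unread symbol in the gene
    i = 0
    while i < len(nodes):
        for _ in range(ARITY.get(nodes[i][0], 0)):
            nodes.append([gene[next_symbol], []])
            nodes[i][1].append(len(nodes) - 1)
            next_symbol += 1
        i += 1
    return nodes

def evaluate_tree(nodes, values, idx=0):
    """Recursively evaluate the expression tree for given terminal values."""
    symbol, children = nodes[idx]
    args = [evaluate_tree(nodes, values, c) for c in children]
    if symbol == "+":
        return args[0] + args[1]
    if symbol == "-":
        return args[0] - args[1]
    if symbol == "*":
        return args[0] * args[1]
    if symbol == "/":
        return args[0] / args[1]
    if symbol == "Q":
        return math.sqrt(args[0])
    return values[symbol]  # terminal: look up its numeric value

# Head "Q*+-" (length 4) and tail "abcda" (length 5); decodes to sqrt((a+b)*(c-d)).
demo_gene = list("Q*+-abcda")
demo_tree = decode_gene(demo_gene)
demo_value = evaluate_tree(demo_tree, {"a": 1.0, "b": 2.0, "c": 5.0, "d": 2.0})
```

In this demo the final symbol of the gene is never read, which illustrates how a fixed-length gene can encode trees of different sizes. The tail must be long enough to supply terminals for every function in the head; in Ferreira's formulation the tail length is t = h(n - 1) + 1 for head length h and maximum arity n, which gives 5 here.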

GEP Operators.
The major operators in GEP are selection, mutation, transposition, and crossover (recombination), all of which are inspired by natural selection. By using these operators, the chromosomes may be modified to improve the fitness score of the following generation. The operator rates, which are specified at the outset of model construction, set the probability with which each operator is applied to a chromosome. In general, a mutation rate of 0.001 to 0.1 is suggested. The transposition and crossover rates, on the other hand, are recommended to be 0.1 and 0.4, respectively [24].
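As an illustration of how an operator rate acts on a chromosome, the sketch below applies point mutation to a single gene. It is a simplified assumption of how such an operator is commonly written, not the configuration used in this study: each position mutates with probability `rate`, head positions may receive any symbol, and tail positions may receive only terminals so that the gene remains decodable. The symbol sets and the name `mutate_gene` are hypothetical.

```python
import random

FUNCTIONS = ["+", "-", "*", "/"]   # assumed function set
TERMINALS = ["a", "b", "c"]        # assumed terminal set

def mutate_gene(gene, head_len, rate, rng):
    """Point mutation that preserves the head/tail structure of a gene."""
    mutated = []
    for pos, symbol in enumerate(gene):
        if rng.random() < rate:
            # Head: functions or terminals. Tail: terminals only.
            pool = FUNCTIONS + TERMINALS if pos < head_len else TERMINALS
            symbol = rng.choice(pool)
        mutated.append(symbol)
    return mutated

demo_rng = random.Random(0)
demo_parent = list("+*ab" + "abcab")          # head of 4, tail of 5
demo_child = mutate_gene(demo_parent, head_len=4, rate=0.05, rng=demo_rng)
```

With a rate in the suggested 0.001 to 0.1 range, most positions are copied unchanged, so each generation explores small variations of its parents.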

Model Performance.
To examine the performance of the prediction models, the RMSE, the coefficient of determination (R²), the mean absolute percentage error (MAPE), the scatter index (SI) [23], the Bias, and the index of agreement (IOA) are used, as shown in (4)-(9).
Here, n, CBR_m, and CBR_p indicate the total number of data points, the experimentally measured value, and the predicted value, respectively.
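The six performance measures can be computed directly from the measured and predicted CBR values. The sketch below uses widely used textbook definitions, which are an assumption here because the exact expressions of (4)-(9) are not reproduced in this text: R² is taken as one minus the ratio of residual to total sum of squares, SI as RMSE divided by the mean measured value, Bias as the mean of predicted minus measured, and IOA in Willmott's form. The function name `performance_metrics` is hypothetical.

```python
import numpy as np

def performance_metrics(measured, predicted):
    """Common error statistics for a regression model (assumed definitions)."""
    m = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    err = p - m
    ss_res = float(np.sum(err ** 2))                 # residual sum of squares
    ss_tot = float(np.sum((m - m.mean()) ** 2))      # total sum of squares
    rmse = float(np.sqrt(np.mean(err ** 2)))
    # Willmott's index of agreement denominator.
    ioa_den = float(np.sum((np.abs(p - m.mean()) + np.abs(m - m.mean())) ** 2))
    return {
        "RMSE": rmse,
        "R2": 1.0 - ss_res / ss_tot,
        "MAPE": float(100.0 * np.mean(np.abs(err / m))),  # percent
        "SI": rmse / float(m.mean()),
        "Bias": float(np.mean(err)),
        "IOA": 1.0 - ss_res / ioa_den,
    }

# Illustrative values only (not data from the study).
demo_metrics = performance_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 22.0, 32.0, 42.0])
```

Reporting several measures together is useful because, as noted later in the comparison of models, a model with a high R² can still have a large RMSE or MAPE.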
In addition, the Durbin-Watson statistic was used to check for the existence of multicollinearity [1]. The Durbin-Watson statistic ranges between 0 and 4, and the midpoint (that is, 2) indicates that there is no association between the input variables. Therefore, a value between 1.5 and 2.5 is sufficient for obtaining models that are unaffected by multicollinearity. This section covers the mathematical expressions and models. Each model's R² and P-value (p) are given in Tables 6 and 7. The statistic is computed as

$$\mathrm{DW} = \frac{\sum_{t=2}^{n} \left(e_t - e_{t-1}\right)^2}{\sum_{t=1}^{n} e_t^{2}},$$

where DW is the Durbin-Watson statistic and e_t and e_{t−1} are the least-squares regression residuals.
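The Durbin-Watson statistic is straightforward to compute from the regression residuals using its standard definition: the sum of squared differences between successive residuals divided by the sum of squared residuals. The following minimal sketch uses a hypothetical function name and illustrative residuals.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic from ordered least-squares residuals."""
    e = np.asarray(residuals, dtype=float)
    # np.diff gives e_t - e_(t-1) for t = 2..n.
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Illustrative residuals only: perfectly alternating signs push the
# statistic toward its upper range, identical residuals give 0.
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
dw_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])
```

Values near 2 arise when successive residuals are unrelated, which is why the 1.5 to 2.5 window is used as the acceptance range.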

Uncertainty and Reliability.
The main purpose of uncertainty analysis is to narrow the predicted range in which the real value of an experiment's result resides. This estimated range is expressed as an interval and is referred to as the uncertainty interval. It may be approximated using the calculated errors for the measuring method of the experiment in question [28]. U95 is one of the uncertainty measures used to calculate the uncertainty interval. The value of U95 for a specific experimental result may be read as follows: if the experiment is performed again, the real value of the experiment's outcome will fall within the specified uncertainty range around 95 times out of every 100 trials. Furthermore, the value of U95 is computed from these errors.

Relative Absolute Error and Reliability.
The relative absolute error (RAE) is a relative statistic that compares a predictive model's performance to that of a simple model. The predictive model's performance is defined as the total absolute difference between the actual and predicted values (i.e., the error). The simple model's performance is defined as the total absolute difference between the actual values and the average of all actual values [25]. In other words, the relative absolute error determines whether a model outperforms simply forecasting the average (i.e., the simple model). The relative absolute error formula is

$$\mathrm{RAE} = \frac{\sum_{i=1}^{n} \left|\mathrm{CBR}_{m,i} - \mathrm{CBR}_{p,i}\right|}{\sum_{i=1}^{n} \left|\mathrm{CBR}_{m,i} - \overline{\mathrm{CBR}}_{m}\right|}.$$

The meaning of the relative absolute error is straightforward: if the RAE is less than one, the model outperforms the simple model. The relative absolute error of a perfect model is 0, so the RAE should ideally be as near to zero as feasible.
A statistical technique known as reliability analysis may be used to evaluate a model's level of general consistency. To be more specific, it determines whether or not a proposed model keeps its error within a permissible threshold δ. For each data point, we check whether RAE_i is less than or equal to δ; if it is, we set k_i equal to 1; otherwise, we set it equal to 0. The sum of the k_i is therefore the number of times the value of RAE_i is less than or equal to δ. According to Chinese standards, the ideal value of δ is 0.2 [28].
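The relative absolute error and the reliability check can be sketched as follows. Two assumptions are made and should be read as such: the per-sample error RAE_i is taken as the absolute error divided by the measured value, and reliability is reported as the percentage of samples whose RAE_i does not exceed δ. The function names are hypothetical, and the numbers are illustrative only.

```python
import numpy as np

def relative_absolute_error(measured, predicted):
    """Total absolute error relative to that of predicting the mean."""
    m = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sum(np.abs(m - p)) / np.sum(np.abs(m - m.mean())))

def reliability_percent(measured, predicted, delta=0.2):
    """Percentage of samples with per-sample relative error <= delta.

    Assumption: RAE_i = |m_i - p_i| / |m_i|; k_i = 1 when RAE_i <= delta.
    """
    m = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    k = (np.abs(m - p) / np.abs(m)) <= delta
    return float(100.0 * np.mean(k))

# Illustrative values only (not data from the study).
demo_m = [10.0, 20.0, 30.0, 40.0]
demo_p = [11.0, 30.0, 29.0, 40.0]
demo_rae = relative_absolute_error(demo_m, demo_p)
demo_rel = reliability_percent(demo_m, demo_p, delta=0.2)
```

In this illustration three of the four predictions fall within 20% of the measured value, so the reliability is 75%, while the RAE below one indicates that the predictions beat simply forecasting the mean.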

Multivariate Models Development and Prediction of CBR.
The data were analyzed first for the relationship between CBR and each parameter individually, and then for multiple linear regression relationships between CBR and the grain size parameters, between CBR and the Atterberg limits, and between CBR and the compaction parameters. With the help of the correlation matrix, specific parameters are selected to obtain the model that best represents the data. Each model is evaluated based on the coefficient of determination (R²) and its level of significance (P ≤ 0.05). The correlation matrix in Table 5 clearly illustrates the nature of the relationship between the CBR and each soil parameter. As expected, the studied area is sandy in nature; therefore, all Atterberg limits appear to have a poor correlation with CBR. On the other hand, the grain size distribution parameters, especially F200, and the compaction parameter MDD show the highest correlations with CBR.

Multilinear Regression.
Several attempts were made to find the best combination of parameters for representing the correlation of CBR with the soil index and compaction parameters, based on the findings from the correlation matrix (Table 5). Of the models tried, only those that exceeded an R² of 0.5 and were, at the same time, statistically significant were retained. The multilinear regression (MLR) models are presented herein, and Table 6 presents their outcomes. MLR 1 and 3 in Table 6 are controlled by the grain size distribution; those models yielded R² values of 0.753 and 0.74. MLR 4 uses the compaction parameters MDD and OMC as input parameters to predict the corresponding CBR; it yielded an R² of 0.61. MLR 2, 5, and 7 incorporate the effect of all indices as predictors of CBR; these models perform better, with R² values of 0.811, 0.79, and 0.792, respectively. The combination of grain size parameters and compaction parameters in MLR 6 gives an R² of 0.775. The consistency parameters can therefore be regarded as making little contribution; this may be attributed to the fact that the soil in this study is nonplastic to low plasticity. The best multivariate linear regression model is shown in (8), Table 7.

PQ Models.
The PQ model, which is nonlinear, generally contains a constant, linear terms, and squared terms; it is effective when a pattern does not appear to be linear and the relationships of the variables tend to be curvilinear [16]. Table 8 illustrates the PQ models for the prediction of the CBR of granular soil. The models' significance (P) value is zero, signifying robust models. The determination coefficient (R²) of the best model is 0.888, which is better than that of the MLR CBR model in (8). The best PQ regression model is shown in (9).

GEP Model Development.
The primary objective of developing the GEP model was to provide mathematical functions that might be used to predict the CBR value. Numerous GEP models, each with a different number of input variables, were built in this study in order to estimate the CBR. The GEP models use the same input parameters as the multivariate models, which are also shown in Table 3. As for the multivariate models, all GEP data were arranged into a format with seven input parameters: percentages of gravel (Gr), sand (S), and fines (F200), as well as LL, PI, MDD, and optimum water content (OMC). To create the best GEP model, the number of chromosomes was varied between 20 and 35, the number of genes between 3 and 7, and the head size between 8 and 13. During the building of the GEP models, it was observed that the same GEP parameters were obtained for the four best models; Table 8 lists these parameters. The results of the best models, which have better statistical performance, are presented in (5). The IOA, the SI, and the Bias are the three indices computed in order to carry out a full statistical evaluation of the proposed GEP models. Table 9 displays the findings of the statistical analysis. According to Table 10, GEP 1 and GEP 4 have the highest coefficient of determination (R² of 0.776), although GEP 4 has the lowest RMSE, with a value of 22.552. According to Table 9, GEP 1 has the lowest SI of 0.252, which indicates that it performs better than the other GEP models. Good indexes of agreement can be seen across all GEP models, with values ranging from 0.928 for GEP 4 to 0.984 for GEP 3. In addition, the CBR predictions of GEP 1 had a higher reliability (53%) and a lower uncertainty (15.78) than the predictions of GEP 4 (45% reliability and 16.766 uncertainty).

Models Comparison.
In this section, the results acquired from the multilinear, PQ, and GEP models are examined, discussed, and compared. The coefficient of determination (R²), RMSE, MAPE, and Durbin-Watson (DW) statistic were employed as statistical verification tools to assess the accuracy of the created models' outputs. Table 10 summarizes the statistical performances of the multivariate and GEP models. According to Table 10, the majority of models exhibit acceptable agreement between the developed model and the experimental values. Table 10 further shows that the performance of the majority of models derived by the multilinear regression, PQ, and GEP techniques is acceptable in terms of R², RMSE, MAPE, uncertainty, and reliability.
This outcome suggests that the models' inputs were appropriately chosen.
However, in statistics, the overall error performance of a relationship between two groups may be assessed using the correlation coefficient (R). According to Benjamin [12], if a given model provides an R value greater than 0.80, there is a strong correlation between the measured and predicted values for the whole available database. In addition, the statistical performance of any model should be assessed using error criteria such as RMSE and MAPE together with the R value, because a model with a high R² value may still have a high RMSE or MAPE value. Taking these criteria into consideration, the best multivariate models are MLR2, PQ2, and PQ5. The best of the four GEP models has a lower R² (0.776) and a higher RMSE (22.552). The error magnitudes of these models are nevertheless sufficient for forecasting the CBR of soils. In addition, model PQ5 is clearly the best among both the GEP and multivariate models based on all statistical measures. In Figures 6 and 7, the estimated CBR values generated by the multivariate model PQ5 and by GEP1 are visually contrasted with the observed CBR values. Figures 6 and 7 show that both the multivariate model and the GEP model make good predictions based on the experimental data.
An importance analysis of the GEP models was also performed to assess the extent to which the input variables affect the output CBR (see Figures 8 and 9). Gravel% and Fines% are found to be the variables contributing most to CBR, followed by Sand%, as shown in the figure for model GEP 3. In model GEP 4, where the contribution of the consistency limits is incorporated into the model, the consistency parameters are found to affect the CBR significantly, which is in agreement with the study performed by El-Ashwah et al. [18].

Sensitivity Analysis.
To quantify the relative significance of each input variable for the CBR, the MLR, PQ, and GEP models were employed to carry out a sensitivity analysis. The analysis is executed such that one variable is varied at a time (see (6) and (8)); the results are summarized in Table 11.
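One common way to carry out such an analysis is a one-at-a-time scheme, sketched below. This is an assumption about the general approach rather than a reproduction of the procedure behind Table 11: every input is held at its mean, one input at a time is swept from its minimum to its maximum, and the resulting change in the predicted output is expressed as a percentage of the total change over all inputs. The function name `one_at_a_time_sensitivity` and the toy model are hypothetical.

```python
import numpy as np

def one_at_a_time_sensitivity(model, X):
    """Percent contribution of each input to the output range.

    model: callable taking a 1-D array of inputs and returning a number.
    X: 2-D array of observed inputs (rows = samples, columns = variables).
    """
    X = np.asarray(X, dtype=float)
    baseline = X.mean(axis=0)
    output_ranges = []
    for j in range(X.shape[1]):
        low, high = baseline.copy(), baseline.copy()
        low[j], high[j] = X[:, j].min(), X[:, j].max()  # sweep variable j only
        output_ranges.append(abs(model(high) - model(low)))
    output_ranges = np.array(output_ranges)
    return 100.0 * output_ranges / output_ranges.sum()

# Toy linear model standing in for a fitted CBR model (illustrative only).
toy_model = lambda x: 3.0 * x[0] + 1.0 * x[1]
toy_X = np.array([[0.0, 0.0], [1.0, 1.0]])
toy_sensitivity = one_at_a_time_sensitivity(toy_model, toy_X)
```

For the toy model the first input accounts for 75% of the output range and the second for 25%. The same function can be applied to any fitted MLR, PQ, or GEP expression wrapped as a callable.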

Conclusion
This study's primary objective was to investigate the applicability of multivariate regression analysis and gene expression programming (GEP) for predicting CBR. To do this, CBR test data for coarse-grained soils belonging to the A-1 and A-2 soil groups were obtained from various locations in the Qassim area. Seven MLR models, seven PQ nonlinear regression models, and four GEP models with different input variables were examined to determine the optimal relationship between the fundamental soil indices and the CBR. The models' performance was assessed using statistical verification criteria. The MLR, PQ, and GEP models with eight input parameters yielded the best results. It can be concluded that MLR, PQ, and GEP are capable of capturing the relationship between CBR and basic soil parameters and may be used to predict the CBR values of soils. The results of this analysis showed that the PQ model had a higher accuracy [coefficient of determination (R²) = 0.89, RMSE = 16.006, uncertainty (U95) = 16.17, and reliability = 57%] than the multilinear regression model [R² = 0.811, RMSE = 20.791, U95 = 15.569, and reliability = 51%]. The best GEP model yields [R² = 0.776, RMSE = 22.552, U95 = 15.787, and reliability = 53%]. Furthermore, an importance analysis revealed that gravel percent and fines percent have the greatest influence on CBR, followed by the consistency parameters (LL, PL, PI) and OMC. Given how hard, time-consuming, and expensive CBR soil tests are, using MLR, PQ, and GEP models to estimate the CBR of granular soils from soil parameters could be a useful method for the early stages of material development or as a way to judge the validity of CBR values. The compaction parameters (OMC and MDD) alone give a weaker correlation with the CBR of granular soils, but when used with the gravel and fines percentages, they play a significant role in the determination of the CBR of granular soil in the Qassim area.
The sensitivity analysis showed that F200 is the most influential variable in the multilinear regression model, MDD is the most influential variable in the PQ model, and the optimum moisture content (OMC) is the most significant variable in the best GEP model. The models used in this work to estimate CBR were based on granular soils with low to nonplastic properties. As a result, they cannot accurately estimate the CBR of fine soils or soils of high plasticity.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest or personal relationships that could have appeared to influence the work reported in this paper.