Extreme Gradient Boosting Algorithm for Predicting Shear Strengths of Rockfill Materials

Department of Civil Engineering, Faculty of Engineering, International Islamic University Malaysia, Jalan Gombak, Selangor 50728, Malaysia; Department of Civil Engineering, University of Engineering and Technology Peshawar (Bannu Campus), Bannu 28100, Pakistan; Department of Transport, Academy of Engineering, Peoples’ Friendship University of Russia (RUDN University), 6 Miklukho-Maklaya Street, Moscow 117198, Russia; Department of Civil Engineering, Thammasat School of Engineering, Thammasat University, Pathumthani 12120, Thailand; Peter the Great St. Petersburg Polytechnic University, Saint Petersburg 195251, Russia; Department of Civil Engineering, University of Engineering and Technology Peshawar, Peshawar 25000, Pakistan; Department of Physics, Mindanao State University-Iligan Institute of Technology, Iligan City 9200, Philippines


Introduction
Rockfill materials (RFM) are commonly used in the construction of high embankment dams in order to harness natural water resources. RFM comprises gravels, cobbles, and boulders obtained by blasting rock quarries or excavated from natural riverbeds. Material from riverbeds is rounded to subrounded, while material from quarries is angular to subangular. Mineral composition, particle size, shape, gradation, individual particle strength, void content, relative density (RD), and particle surface roughness all influence the behaviour of the RFMs used in the construction of rockfill dams. Therefore, it is essential to comprehend and characterise the behaviour of these materials for the analysis and safe construction of rockfill dams.
In engineering practice, the particle size of rockfill materials typically ranges from 400 to 600 millimetres and can exceed 1000 millimetres. Due to the constraints of laboratory testing equipment, rockfill materials that exceed the maximum permissible particle size must be scaled down. To determine the mechanical properties of rockfill materials on-site, analog simulation is used in laboratory testing to build test specimens with the same internal structure as the prototype rockfill materials, thus determining the engineering characteristics of the prototype. Several studies have investigated the behaviour of RFM: Abbas et al. [1], Gupta [2], Venkatachalam [3], Marsal [4], Mirachi [5], and Honkanadavar and Sharma [6] carried out laboratory experiments on different RFMs and revealed that their stress-strain behaviour is dependent on the stress level, nonlinear, and inelastic. They also reported that the angle of internal friction increases as the maximum particle size of riverbed RFM increases, while the opposite trend holds for quarry RFM. Frossard et al. [7] proposed a rational approach for estimating RFM shear strength based on size effects; Honkanadavar and Gupta [8] developed a power law relating the shear strength parameter to various riverbed RFM index properties, motivated by the difficulty of conducting large-scale strength testing and defining the mechanical behaviour of RFMs. Numerous methodologies have been developed to anticipate the behaviour of such soils. RFM with large particle sizes (up to 1200 mm) cannot be tested under laboratory conditions, since large-scale shear tests are time-consuming and complicated, and it is hard to predict the nonlinear shear strength function without an analytical method [8].
Over the last ten years, a newly developed approach based on machine learning (ML) algorithms has been widely applied to solve real-world problems, particularly in civil engineering. Numerous practical problems have been effectively addressed using ML techniques, paving the way for many promising opportunities in civil engineering and other fields such as environmental [9] and geotechnical engineering [10][11][12][13][14][15], including prediction of RFM shear strength [16][17][18]. In this context, the artificial neural network (ANN) approach was utilized by Kaunda [16] for estimating RFM shear strength. Cubist and random forest regression techniques were used by Zhou et al. [17], who found that both models are more accurate for RFM shear strength estimation than ANN and traditional regression models. Ahmad et al. [18] used support vector machine (SVM), random forest (RF), AdaBoost, and K-nearest neighbor (KNN) algorithms to estimate the shear strength of RFM and concluded that the SVM model achieved a better prediction performance than the RF, AdaBoost, and KNN models.
This field, however, is still being investigated. The article aims to provide the following contributions to the research field: (i) to evaluate the predictive capacity of the XGBoost algorithm for the shear strength of RFM; (ii) to compare the proposed model to the reference models used in the published literature; and (iii) to conduct a sensitivity analysis to assess the influence of each input parameter on the RFM's shear strength. The structure of the paper is as follows: the theory of extreme gradient boosting is explained in Section 2. Data collection and correlation analysis are presented in Section 3. Section 4 explains the performance measures employed. Section 5 presents the obtained results and a discussion of them. Finally, conclusions based on the achieved results are provided.

Extreme Gradient Boosting (XGBoost)
Chen and Guestrin [19] proposed extreme gradient boosting (XGBoost), a sophisticated supervised technique under the gradient boosting framework, which has received widespread recognition in Kaggle machine learning contests due to its high efficiency and considerable flexibility. XGBoost's loss function adds a regularization term to the objective function, which helps to smooth the final learnt weights and avoid over-fitting [19]. It also optimizes the loss function using first- and second-order gradient statistics, and supports row and column sampling to further prevent over-fitting. As a result of parallel and distributed computation, faster model exploration is possible. The XGBoost algorithm can be described as follows [20]: given a dataset with n examples and m features, D = {(x_i, y_i)} (|D| = n, x_i ∈ R^m, y_i ∈ R), K additive functions are used to predict the output of a tree ensemble model:

ŷ_i = Σ_{k=1}^{K} f_k(x_i), f_k ∈ F,

where F = {f(x) = ω_{q(x)}} (q: R^m → T, ω ∈ R^T) is the space of regression trees; q represents the structure of each tree, T represents the number of leaves in the tree, and f_k is a function corresponding to an independent tree structure q and leaf weights ω. To reduce the errors of the tree ensemble, the following objective function is minimized in the XGBoost model:

L^{(t)} = Σ_{i=1}^{n} l(y_i, ŷ_i^{(t)}) + Σ_{k} Ω(f_k),

where l is a differentiable convex loss function that measures the error between predicted and measured values; y_i and ŷ_i are the measured and predicted values, respectively; t indexes the iterations performed to minimize the errors; and Ω penalizes the complexity of the regression tree functions:

Ω(f) = γT + (1/2) λ‖ω‖²,

where ω is the vector of leaf scores, γ is the minimal loss reduction required to further split a leaf node, and λ is the regularization parameter.
In addition, γ and λ are parameters that control the complexity of the tree, and the regularization term helps to avoid overfitting by smoothing the final learnt weights. A second-order Taylor expansion is applied to the objective function in order to simplify it:

L^{(t)} ≈ Σ_{i=1}^{n} [l(y_i, ŷ_i^{(t−1)}) + g_i f_t(x_i) + (1/2) h_i f_t²(x_i)] + Ω(f_t),

where g_i and h_i are the first and second derivatives of the loss function, respectively. More detailed explanations of the XGBoost algorithm can be found in Chen and Guestrin's [19] paper.
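The quantities that follow from this second-order approximation, i.e., the optimal leaf weight w* = −G/(H + λ) and the gain of a candidate split, can be illustrated with a short numerical sketch. This is a minimal Python/NumPy illustration of the formulas, not the library implementation; for squared-error loss, g_i = ŷ_i − y_i and h_i = 1:

```python
import numpy as np

# For squared-error loss l(y, yhat) = (yhat - y)^2 / 2:
# first derivative g_i = yhat_i - y_i, second derivative h_i = 1.
def gradients(y, y_pred):
    return y_pred - y, np.ones_like(y)

def leaf_weight(g, h, lam=1.0):
    # Optimal leaf weight w* = -G / (H + lambda), with G = sum(g), H = sum(h).
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    # Improvement in the regularized objective from splitting a node into
    # left/right children, minus the penalty gamma for adding one more leaf.
    G_L, H_L = g[left_mask].sum(), h[left_mask].sum()
    G_R, H_R = g[~left_mask].sum(), h[~left_mask].sum()
    return 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                  - (G_L + G_R)**2 / (H_L + H_R + lam)) - gamma

y = np.array([1.0, 2.0, 3.0, 10.0])
g, h = gradients(y, np.zeros_like(y))   # start from a prediction of 0
w = leaf_weight(g, h)                   # weight if all samples share one leaf
gain = split_gain(g, h, y < 5.0)        # gain of separating the outlier
```

A positive gain here means the split reduces the penalized objective, which is exactly the criterion XGBoost uses when growing each tree.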

Dataset Collection and Correlation Analysis
In this study, a database of 165 samples of RFM shear strength tests was collected from Kaunda [16] and is presented in Appendix A and Table A1 in the supplementary file. All input parameters that might influence the shear strength of RFM were considered. The included parameters are D 10, D 30, D 60, and D 90, corresponding to the 10%, 30%, 60%, and 90% passing sieve sizes, respectively; C c and C u, the coefficients of curvature and uniformity, respectively; FM and GM, the fineness modulus and gradation modulus, respectively; R, the International Society for Rock Mechanics (ISRM) hardness rating; UCS min and UCS max (MPa), the lower and upper bounds of the uniaxial compressive strength; γ, the dry unit weight (kN/m 3); and σ n, the normal stress (MPa). The considered output is the shear strength of RFM, denoted τ (MPa). A summary of the database statistics, including the boundary and standard deviation values of all parameters used in this study, is presented in Table 1.
The Pearson correlation coefficient (ρ) was used to verify the strength of the correlation between different parameters (see Figure 1). For a given pair of random variables (m, n), ρ is computed as

ρ = cov(m, n) / (σ_m σ_n),

where cov denotes the covariance, and σ_m and σ_n denote the standard deviations of m and n, respectively. |ρ| > 0.8 represents a strong correlation between m and n, values between 0.3 and 0.8 represent a moderate relationship, and |ρ| < 0.3 represents a weak relationship [21]. As per Song et al. [22], a correlation is considered "strong" if |ρ| > 0.8. The relationships between input and output parameters are shown in Figure 1 in order from strong to weak. Consequently, no factors were removed from the model for estimating τ. The correlation coefficient has a maximum absolute value of 0.97, as shown in Figure 1.
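The coefficient and the thresholds above can be sketched directly; the following is a minimal Python/NumPy illustration of the formula (not the code used in the study):

```python
import numpy as np

def pearson(m, n):
    """Pearson correlation: rho = cov(m, n) / (sigma_m * sigma_n)."""
    m = np.asarray(m, dtype=float)
    n = np.asarray(n, dtype=float)
    cov = ((m - m.mean()) * (n - n.mean())).mean()  # population covariance
    return cov / (m.std() * n.std())

def strength(rho):
    # Thresholds from [21]: strong (> 0.8), moderate (0.3-0.8), weak (< 0.3).
    a = abs(rho)
    return "strong" if a > 0.8 else ("moderate" if a >= 0.3 else "weak")

rho_max = 0.97          # maximum |rho| reported in Figure 1 (for sigma_n)
label = strength(rho_max)
```

For perfectly linearly related variables the function returns ±1, and the reported maximum of 0.97 falls in the "strong" band.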

Evaluation and Prediction
To evaluate the predictive capacity of the XGBoost algorithm, we compared it with several other machine learning methods developed in the literature using performance measures.

Compared Machine Learning (ML) Methods.
The XGBoost model was compared with other prediction methods proposed in the literature: support vector machine, adaptive boosting, random forest, and K-nearest neighbor. A brief description of each technique is presented below; for a more in-depth discussion, the reader is referred to the relevant references.

Support Vector Machine (SVM).
The support vector machine (SVM) regression technique relies on feature classification: it generates an interclass hyperplane and minimizes the vector lengths and the variance between the features and the plane.
The SVM is compatible with most kernel types, including Euclidean, Gaussian, exponential, and Dirichlet kernels [23]. The objective function for SVM regression contains a coefficient derived from the cost analysis that helps determine the flatness of the created hyperplane [24]. This allows the user to adapt the SVM technique to particular datasets.

Adaptive Boosting (AdaBoost).
Adaptive boosting (AdaBoost) is a boosting machine learning technique in which weak learning algorithms are combined to form a strong one. AdaBoost requires the number of base learners (n) as a parameter [25]. During the training phase, AdaBoost develops learners of low accuracy that improve based on their predecessors [26]. Using this method, AdaBoost dynamically modifies the training weights based on the performance of the base learning algorithms [27].
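The dynamic reweighting idea can be sketched as follows. This is a simplified AdaBoost.R2-style update in Python/NumPy, named here for illustration; the exact per-sample loss and confidence factor vary between AdaBoost variants, so this is a sketch rather than the algorithm used in the cited work:

```python
import numpy as np

def update_weights(w, abs_err, beta):
    """One AdaBoost.R2-style reweighting step.

    w: current sample weights; abs_err: per-sample absolute errors of the
    current weak learner; beta: confidence factor in (0, 1) derived from the
    learner's weighted average loss (smaller beta = stronger learner).
    """
    rel_err = abs_err / abs_err.max()      # relative errors in [0, 1]
    w = w * beta ** (1.0 - rel_err)        # well-predicted samples shrink
    return w / w.sum()                     # renormalize so weights sum to 1

w0 = np.full(4, 0.25)
errs = np.array([0.1, 0.1, 0.1, 1.0])      # last sample predicted badly
w1 = update_weights(w0, errs, beta=0.5)
```

After the update, the badly predicted sample carries the largest weight, so the next weak learner focuses on it.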

Random Forest (RF).
Random forests are ensemble models that use many decision trees as base learners to obtain more precise outcomes. Individual trees are generated from training data drawn by the bootstrap sampling method, using random parameters at their roots and nodes [28]. Multiple decision trees are more stable than a single tree because they reduce overfitting and average the outcomes [26]. The number of trees in the forest, the number of randomly selected predictors at each binary node, and the minimum number of observations at the nodes of the trees are the three primary parameters of random forests [29].
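The bootstrap sampling step that generates each tree's training set can be sketched as follows (a minimal Python/NumPy illustration; the sample size of 165 matches this study's database, and the seed is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 165                                 # database size in this study

# One bootstrap sample: n indices drawn with replacement.
boot_idx = rng.integers(0, n, size=n)

# Out-of-bag (OOB) samples: never drawn for this tree; each tree's OOB set
# can serve as an internal validation set.
oob_idx = np.setdiff1d(np.arange(n), boot_idx)
```

On average roughly 37% (about 1/e) of the samples end up out-of-bag for any given tree, which is what makes OOB error estimates possible.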

K-nearest Neighbor (KNN).
The supervised KNN is a machine learning algorithm that can be used to tackle both classification and regression problems. In regression, the prediction for a query point is based on the k training samples most similar to it, and the outcome of KNN regression is the object's characteristic value, i.e., the mean of the values of its k nearest neighbors.
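A minimal sketch of KNN regression as described above (Python/NumPy; Euclidean distance is assumed here for illustration):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the mean target value of the k nearest training samples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of k closest
    return y_train[nearest].mean()

X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0.0, 1.0, 2.0, 10.0])
pred = knn_predict(X, y, np.array([0.6]), k=2)  # neighbors: x=1.0 and x=0.0
```

Because the prediction is a plain average over the neighborhood, KNN cannot extrapolate beyond the range of the training targets, which is one reason it underperforms on this dataset.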

Evaluation Measures.
Three quantitative statistical indices, i.e., the coefficient of determination (R 2), the ratio of the root mean square error to the standard deviation of the measured data (RSR), and the Nash-Sutcliffe efficiency coefficient (NSE), were employed to validate and compare the XGBoost model. The indices are defined as

R² = [Σ_{i=1}^{n} (y_i − ȳ)(ŷ_i − ŷ̄)]² / [Σ_{i=1}^{n} (y_i − ȳ)² · Σ_{i=1}^{n} (ŷ_i − ŷ̄)²],

NSE = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)²,

RSR = √(Σ_{i=1}^{n} (y_i − ŷ_i)²) / √(Σ_{i=1}^{n} (y_i − ȳ)²),

where n is the total number of data points; y_i and ŷ_i are the actual and predicted shear strengths, respectively; and ȳ and ŷ̄ are the means of the actual and predicted shear strengths, respectively.
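The three indices can be computed directly from the definitions above; the following is a minimal Python/NumPy sketch (not the code used in the study):

```python
import numpy as np

def r2(y, y_hat):
    """Coefficient of determination as the squared Pearson correlation."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    num = ((y - y.mean()) * (y_hat - y_hat.mean())).sum() ** 2
    den = ((y - y.mean()) ** 2).sum() * ((y_hat - y_hat.mean()) ** 2).sum()
    return num / den

def nse(y, y_hat):
    """Nash-Sutcliffe efficiency: 1 - residual variance / data variance."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def rsr(y, y_hat):
    """RMSE divided by the standard deviation of the observations."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    rmse = np.sqrt(((y - y_hat) ** 2).mean())
    return rmse / y.std()
```

For a perfect prediction these return R² = 1, NSE = 1, and RSR = 0, matching the ideal values discussed below.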
Values of the coefficient of determination (R 2) closer to 1 imply that the model better fits the data; when R 2 is greater than 0.8 and close to 1, the model is deemed robust [31]. The NSE is a normalized statistic that expresses the residual variance relative to the variance of the measured data [32]. The NSE ranges from −∞ to 1, with 1 denoting an ideal match; if the NSE value is greater than 0.65, a strong correlation exists [32, 33]. The root mean square error (RMSE)-standard deviation ratio (RSR) is computed by dividing the RMSE by the standard deviation of the observed data. The RSR ranges from the optimal value of 0 to a large positive value, and its classification ranges are expressed as very good (0.000 ≤ RSR ≤ 0.500), good (0.500 < RSR ≤ 0.600), acceptable (0.600 < RSR ≤ 0.700), and unacceptable (RSR > 0.700) [34]. The modelling procedure consisted of the following steps: (1) Data preparation and correlation analysis: in this first step, the laboratory sample data were used to build the training and testing datasets. The training dataset was constructed from 80% of the total data, while the testing dataset was built from the remaining 20%. (2) Model development: the models were trained on the training data and the resulting models were evaluated on the testing data. All training and testing operations were carried out in Orange software. (3) Validation of the proposed models: in this third step, the testing dataset was adopted for validating the proposed models. Statistical indices including R 2, NSE, and RSR were applied to validate the models, and the proposed model was compared to the reference models used in the published literature. Furthermore, a Taylor diagram was utilized to illustrate how close the models (including the proposed XGBoost) are to the reference/observed point.
(4) Sensitivity analysis: in the last step, a sensitivity analysis was used to evaluate the influence of the input factors on the shear strength of rockfill material.

Results and Discussion
The proposed model that estimates the RFM shear strength was developed using Orange software. The predictor variables were provided via an input set x = [D 10, D 30, D 60, D 90, C c, C u, GM, FM, R, UCS min, UCS max, γ, σ n], while the target variable (y) is the shear strength (τ) of the rockfill material. Every modelling stage requires selecting suitable sizes for the training and testing datasets. Consequently, 80% (132 cases) of the total data were employed to generate the models, while the remaining 20% (33 cases) were used to test the developed models in this study.
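The 80/20 split described above can be reproduced as follows (a minimal Python/NumPy sketch; the random seed is an arbitrary choice for illustration, since the study does not report one):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_total = 165                          # total cases in the database

indices = rng.permutation(n_total)     # shuffle the sample indices
n_train = int(0.8 * n_total)           # 132 training cases
train_idx = indices[:n_train]
test_idx = indices[n_train:]           # remaining 33 testing cases
```

The permutation guarantees the two subsets are disjoint and together cover all 165 cases.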
The XGBoost model was tuned through trial and error to obtain optimal hyperparameter values for an accurate estimate of the shear strength of rockfill materials. This study optimizes the essential XGBoost parameters and clarifies the definitions of these hyperparameters. The tuning parameters for the model were selected and then adjusted during the trials until the best metrics, reported in Table 2, were obtained. The predictive performance on the training and testing datasets is shown in regression form in Figure 3. In training, the XGBoost model produced the best prediction results (R 2 = 0.9707, NSE = 0.9701, and RSR = 0.1729) compared to SVM (R 2 = 0.9655, NSE = 0.9639, and RSR = 0.1899), RF (R 2 = 0.9545, NSE = 0.9542, and RSR = 0.2140), AdaBoost (R 2 = 0.9390, NSE = 0.9388, and RSR = 0.2474), and KNN (R 2 = 0.6233, NSE = 0.6180, and RSR = 0.6181). This is also verified by the R 2, NSE, and RSR results in Figure 4, as XGBoost produced a lower RSR and higher R 2 and NSE values than the SVM, RF, AdaBoost, and KNN models developed in the literature by Ahmad et al. [18]; the parameter optimization is presented in Table 2.
The comparison of study outcomes is meaningful because the data sets and inputs are the same. The XGBoost model beats the other models in terms of predictive performance and offered a balanced prediction across the training and testing data sets. In addition, due to the study's small data set, additional research on other data sets is necessary to establish the most generic model for predicting the shear strength of RFM. The difference between the actual and predicted shear strength of RFM is represented in Figure 5 by comparing the results of the training and testing sets. The proposed XGBoost model is satisfactory for predicting the RFM shear strength, barring a few noise points. A Taylor diagram (see Figure 6) is utilized to illustrate how similar the models (including the proposed XGBoost) are to the reference/observed point based on their correlation, their root-mean-square error difference, and the amplitude of their variations (represented by their standard deviations). The better the performance, the closer a model's point is to the position of the reference/observed point. In terms of predictive ability, the proposed XGBoost model beats the SVM, RF, AdaBoost, and KNN models developed in the literature by Ahmad et al. [18]. The sensitivity of the XGBoost model was evaluated using Yang and Zang's [35] approach for assessing the influence of the input factors on the shear strength of rockfill material.
This approach, which has been the topic of numerous studies [36][37][38][39][40][41], is as follows:

r_ij = Σ_{m=1}^{n} (y_im · y_om) / √(Σ_{m=1}^{n} y_im² · Σ_{m=1}^{n} y_om²),

where n represents the number of values (i.e., 132), and y_im and y_om denote the input and output variables, respectively. For each input parameter, the r_ij value ranges from zero to one, with the greatest r_ij values indicating the most influential input for the output variable (i.e., τ). Figure 7 shows the r_ij scores for all input variables and demonstrates that σ n (r_ij = 0.99) has the greatest effect on the shear strength of rockfill material. Furthermore, Figure 1 shows that the normal stress σ n has the highest ρ of 0.97 among all parameters, corroborating the sensitivity analysis results.
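This cosine amplitude formula can be sketched directly (a minimal Python/NumPy illustration of the expression above, not the code used in the study):

```python
import numpy as np

def cosine_amplitude(y_in, y_out):
    """r_ij = sum(y_in * y_out) / sqrt(sum(y_in^2) * sum(y_out^2))."""
    y_in = np.asarray(y_in, dtype=float)
    y_out = np.asarray(y_out, dtype=float)
    return (y_in * y_out).sum() / np.sqrt(
        (y_in ** 2).sum() * (y_out ** 2).sum())

# An input that is a positive multiple of the output yields r_ij = 1,
# the maximum sensitivity score.
tau = np.array([0.2, 0.5, 0.9, 1.4])
r_self = cosine_amplitude(2.0 * tau, tau)
```

Geometrically, r_ij is the cosine of the angle between the input and output vectors over the 132 training samples, so inputs whose variation tracks τ most closely score near 1, as σ n does here.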

Conclusions
Using an XGBoost algorithm, a new prediction model for RFM shear strength is proposed in the current study.
Comparisons reveal that the proposed XGBoost model provides the most accurate prediction of the RFM's shear strength when compared to the SVM, RF, AdaBoost, and KNN models. The important findings of this study are as follows: (1) In the test phase, the XGBoost model showed the highest predictive performance (R 2 = 0.9676, NSE = 0.9672, and RSR = 0.1812) compared to the other machine learning models. Furthermore, based on the scatter plots of actual and predicted values, the XGBoost model exhibited a better fit to the observed data, indicating its potential for broader application in predicting RFM material properties. (2) Compared to the SVM, RF, AdaBoost, and KNN models in the literature, the proposed XGBoost model has superior predictive capability. In addition, the proposed model is amenable to further modification, so the accumulation of additional data will considerably enhance its predictive potential. (3) The findings of the sensitivity analysis indicate that the normal stress (σ n), the 90% passing sieve size (D 90), the dry unit weight (γ), and the ISRM hardness rating (R) are the most sensitive and important factors for estimating the shear strength of rockfill materials. (4) The developed XGBoost model gives predictions with the same level of accuracy as existing soft computing methods.
Since the proposed XGBoost model produces predictions based on the input values, interpolation between the input variables is more accurate and reliable than extrapolation. Therefore, the model should not be used for input parameter values beyond the range covered in this study.

Data Availability
The data presented in this study are available in Appendix A,