Machine Learning-Based Prediction of Unconfined Compressive Strength of Sands Treated by Microbially-Induced Calcite Precipitation (MICP): A Gradient Boosting Approach and Correlation Analysis



Introduction
The inexorable growth of the global population has led to the use of every available piece of land for construction. Since not all soils have the compressive strength to support structural loads, loose sites are frequently stabilized to facilitate construction. Several approaches are used to enhance the compressive behavior of sand, each with advantages and disadvantages in terms of economics, environment, and practicality. In traditional soil improvement techniques, binding materials such as cement, lime, or other chemicals are usually added to soil to improve its strength. Although these approaches are effective in enhancing the strength of soils, their negative environmental impacts far outweigh their mechanical benefits [1][2][3]. To minimize the environmental impact associated with traditional soil improvement methods, some eco-friendly approaches have been developed.
Microbially-induced calcite precipitation (MICP) is a sustainable, cost-effective, and novel approach for enhancing the compressive strength of sand [4,5]. The biocementation process enhances the compressive strength of sand through biological activity that produces calcite minerals within the soil structure; thus, no cement or other chemical binders are included in the stabilization process, leading to an eco-friendly ground improvement approach. In most laboratory and in situ investigations, the unconfined compression test is commonly performed to obtain the compressive strength of cemented sands [6,7]. The unconfined compressive strength (UCS) of biocemented sand depends on several factors, such as soil properties, details of the MICP process, and environmental conditions, so the UCS of treated sand covers a wide range from 0.15 to 34 MPa [4]. This wide range of compressive strength can significantly influence the design and function of the overlying structures. A reasonable prediction of the compressive strength of biocemented sands can enhance the reliability of the preliminary design of overlying structures and clarify the suitability of the MICP method for a problematic site.
To predict the unconfined compressive strength of biocemented sand, Wang and Yin [8] developed a multiexpression programming algorithm combined with the Monte-Carlo method (MEP-MC) that relies on an evolutionary algorithm for developing mathematical expressions. A database of 351 data points drawn from previous studies was employed for developing the MEP-MC algorithms. Several MEP-MC models were developed for predicting the UCS of biocemented sand and were found to be reliable and accurate. However, considering the superiority of advanced soft-computing techniques, such as machine learning, the authors proposed that implementing these techniques can produce more reliable and straightforward models and algorithms for predicting the UCS of biocemented sand.
In the past few years, advances in analytical and computational studies have led to the development of soft-computing techniques, which are derived from mathematical and statistical algorithms. Machine learning is one such technique and can be used to identify linear and nonlinear relationships between variables. Machine learning is being utilized in geotechnical investigations as a means of solving problems, predicting disasters, or estimating soil characteristics [9][10][11][12][13]. Currently, several effective algorithms are commonly used for geotechnical issues, including artificial neural networks (ANNs), support vector machines (SVMs), k-nearest neighbors (KNNs), decision trees (DTs), and many others. In data-driven modeling, it is most common to construct only one strong prediction model. In an alternative approach, a group of models can be developed to address a particular learning objective. The ensemble learning method combines the predictions of several weak learners to improve predictive performance. Consequently, combining several simple learners generally yields higher predictive accuracy than an individual model. Furthermore, since ensemble machines contain several learners, ensemble methods are highly effective for both linear and nonlinear data [14,15].
Generally, ensemble methods can be classified into two groups based on their structure: parallel and sequential. Parallel algorithms run several learners simultaneously and then calculate the final prediction from all independent learners. Among the parallel ensemble methods, random forest is used most frequently in engineering and geotechnical problems [16][17][18]. On the other hand, a sequential process (also known as boosting) builds base estimators sequentially and attempts to reduce the bias of the combined estimator at each iteration. Gradient boosting (GB) is an ensemble algorithm that constructs additive regression models by sequentially fitting a weak learner at each iteration to the current pseudo-residuals [19]. Since soil behavior is often nonlinear, the gradient boosting approach is well suited to geotechnical problems. Numerous studies have found gradient boosting to be a robust approach for predicting geotechnical quantities, such as shear strength [20][21][22], slope stability [23][24][25], settlement [26][27][28], liquefaction [29][30][31], and other geotechnical concerns.
To develop functional and reliable models for predicting the UCS of sand treated with MICP, this study employs machine learning techniques. Given that gradient boosting is capable of analyzing datasets with nonlinear behavior, this method was employed as the main algorithm. In addition, five frequently used and straightforward algorithms were utilized for comparison with the gradient boosting method. In this paper, the mechanism of the MICP method is described in Section 2. Afterward, gradient boosting fundamentals are discussed in Section 3. Section 4 describes the dataset and provides a correlation analysis of the variables. This section also presents the procedures of the k-fold cross-validation and hyperparameter tuning. In Section 5, the results and discussion are outlined. To adjust the predicted UCS for environmental factors, Section 6 provides guidelines for applying the effect of temperature and pH to the final UCS. Lastly, conclusions derived from this study and potential future works are presented.

MICP
Microbially-induced calcite precipitation (MICP) is an interdisciplinary approach for enhancing the mechanical behavior of soil by microbial activity. The microorganisms produce calcite crystals (CaCO3) within the soil pores that bind soil grains together and improve stiffness and strength. There are three steps involved in the MICP process: (1) bacteria cultivation; (2) treatment; (3) curing. In the following subsections, each step is described separately. A schematic diagram of the MICP process is shown in Figure 1, depicting each step along with the factors that contribute to the final strength of the treated sand.

Bacteria Cultivation.
The primary function of the microorganisms in MICP is to break down urea and act as a catalyst for the formation of carbonate crystals between the sand grains. In MICP, Sporosarcina pasteurii (also known as Bacillus pasteurii) is most commonly used for ureolysis due to its high urease activity [33,34]. S. pasteurii, which grows in alkaline media, requires urea and ammonium to grow: urea provides nitrogen and carbon to the bacteria, and ammonium regulates the pH and allows substrates to pass through the cell membrane [35][36][37]. Therefore, S. pasteurii is cultivated in an aerobic batch culture containing nutrients that help the bacteria grow and increase urease activity. Most previous studies used ammonium-yeast extract [34][35][36][37] or trypticase soy agar [38,39] media for S. pasteurii cultivation.
The main characteristics of the harvested bacteria, measured before introducing the bacteria to the soil, are the optical density of the biomass at 600 nm, $OD_{600}$, and the urease activity, UA. The optical density of the biomass is correlated with bacterial concentration as well as bacterial size in a sample [38].
Urease activity is an indicator of the bacteria's capability to hydrolyze urea, which is greatly influenced by environmental factors such as cultivation and storage conditions [34]. Urease activity is measured in units of mM·h⁻¹ or U·mL⁻¹, which can be converted using the definition 1 U = 1 μmol of urea hydrolyzed per minute. Previous studies have shown that the performance of MICP and the final strength of treated soil are influenced by bacterial density and total activity [39][40][41][42][43]. A study by Hammad et al. [43] assessed the activity of S. pasteurii in an agar-urea medium and found that higher urease activity leads to faster crystallization of CaCO3. Cheng et al. [41] evaluated the performance of biocementation with three different urease activities (5, 10, and 50 U·mL⁻¹) and found that specimens with a low level of urease activity exhibited a greater UCS for the same CaCO3 content, which was due to differences in nucleation sites affecting precipitation patterns. Zhao et al. [40] revealed that sands treated with high values of $OD_{600}$ and urease activity could sustain greater unconfined compressive loads. As bacteria concentrations increase, more CaCO3 is precipitated. Based on a comparison of the precipitation patterns of three bacterial suspensions with different $OD_{600}$ (0.2, 1, and 3), Wang et al. [42] concluded that the density of bacteria greatly influences the stability of CaCO3 crystals.
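As a worked conversion between the two units, the following derivation assumes only the enzyme-unit definition quoted above (1 U = 1 μmol of urea hydrolyzed per minute) and that mM denotes mmol per litre; it is an illustrative calculation, not a reproduction of the original conversion equation.

```latex
\begin{align*}
1~\mathrm{U\,mL^{-1}}
  &= 1~\mu\mathrm{mol\,min^{-1}\,mL^{-1}}
   = 1~\mathrm{mmol\,min^{-1}\,L^{-1}}
   = 1~\mathrm{mM\,min^{-1}} \\
  &= 60~\mathrm{mM\,h^{-1}}
\end{align*}
```

Under these assumptions, the urease activities of 5, 10, and 50 U·mL⁻¹ tested by Cheng et al. [41] would correspond to 300, 600, and 3000 mM·h⁻¹, respectively.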

Treatment.
Following the cultivation of bacteria with the desired density and activity, the bacteria and the cementation solutions are introduced to the soil. The addition of bacteria and cementation solutions to sandy soils can be carried out using three methods: injection, surface percolation, and premixing. Injection is more commonly used than the other two techniques, which are less efficient and practical due to certain limitations. In surface percolation, the main issue is the restricted penetration depth. The treatment depth of surface percolation is limited to 2 m for coarse granular material and 1 m for fine sand [44]. Furthermore, the premixing method involves disturbing the soil mass in order to mix it with the solution.

Advances in Civil Engineering
Due to the intense mixing of cementing ingredients with soil, pseudo-stress emerges in the soil sample during this process [4]. On the other hand, injection is the most common method for MICP, which improves soils without disrupting the soil structure.
The injection of bacteria and cementation solutions takes place sequentially in several batches: first, the bacterial suspension is injected into the soil mass, followed by the injection of the cementation solution. However, a few drawbacks relating to the homogeneity of CaCO3 within the soil were observed when solutions were injected into the soil [34,45]. The uneven distribution of CaCO3 is ascribed to the linear reduction of microbial concentration along the injection path [46]. It is possible to resolve the unequal distribution of CaCO3 content by slowing down the injection rate of the bacterial suspension or by allowing a break between the injection of the bacteria and the cementation solution [47,48].
Cementation solutions are composed of urea and a calcium salt, accompanied by a small amount of nutrients or ammonium chloride to maintain microorganism activity [49][50][51]. The calcium salt solution supplies calcium ions for CaCO3 crystallization, and its composition can affect the cemented sand formation and calcium content [52][53][54]. Among the calcium sources, calcium chloride (CaCl2) has been used most commonly for MICP [34,39,45,49,51,55], owing to its ability to produce a greater amount of CaCO3 [54].
Moreover, the concentration of the cementation solution influences the performance of MICP and the final strength of cemented soil. Al-Qabany and Soga [49] observed a uniform distribution of CaCO3 in sands treated with low-concentration solutions (0.25 and 0.5 M). A number of studies have also indicated that sandy soils treated with 1 M of cementation solution exhibit a lower UCS than soils treated with lower concentrations [39,48,56]. Aside from the concentration of each solution, the ratio between the concentrations of urea and calcium salt can also influence the performance of MICP. As the urea content exceeds the calcium salt content, the bacteria consume more urea and become more active; as a result, the calcium content and the shear strength of the biocemented sands increase [57,58]. However, Mahawish et al. [58] evaluated the behavior of soil treated with equimolar (equal molarity of urea and calcium chloride) and nonequimolar (urea content twice the calcium chloride content) cementation solutions and found that nonequimolar solutions produce more uniform distributions of CaCO3 than equimolar solutions.

Curing.
The interaction of bacteria and cementation solutions within the soil matrix triggers reactions that result in the formation of calcite crystals within the sand pore space. The microscale representation of the chemical reactions is shown in Figure 1. The chemical reactions occur in the following order: the sequence starts with the decomposition of urea (CO(NH2)2 + H2O) by the microorganisms, followed by the production of calcite (CaCO3) crystals and ammonium (NH4+) ions. Chemical reactions in MICP are influenced by environmental factors, such as the temperature and pH of the sand, which influence the CaCO3 content and mechanical characteristics of the treated sands.
The temperature of the curing medium can significantly affect the MICP performance. Increasing the setting temperature up to 50°C raises urease activity, leading to the precipitation of a larger amount of CaCO3 in the MICP process. However, sands treated at room temperature (20-25°C) show greater strength than those cured at 50°C, which indicates that CaCO3 deposits produced at 50°C are less effective at strengthening biocemented sands than those produced at room temperature [4,41,58]. At a similar CaCO3 content, cemented sands treated at room temperature show higher UCS than those treated at a colder or warmer temperature. Cheng et al. [41] attributed this difference to the inability of the CaCO3 crystals to fill the gaps between the sand grains, which stems from the faster nucleation rate of CaCO3 precipitation at 50°C and the lower nucleation rate at 4°C [59,60]. Mahawish et al. [58] attributed the ineffective precipitation to the formation of loose CaCO3 crystals at elevated temperatures.
The initial pH level of the MICP environment has an impact on the activity of the microorganisms, which affects the precipitation of CaCO3 and the strength of the treated sand [47,61]. Soil media with high acidity or alkalinity are found to be inhospitable environments for microorganisms to form CaCO3 crystals [62,63]. Liu et al. [62] observed no effective CaCO3 crystals at the sand grain contacts when the treated sand was immersed in an acidic medium with a pH value of 3.5. It was also reported that the CaCO3 deposits were consumed by reacting with protons (H+) in the acidic solution. Overall, the optimum pH level for the MICP process was found to be around 7, that is, a neutral environment [63][64][65].

Gradient Boosting
Gradient boosting (GB) is a supervised machine learning algorithm that combines outputs of several weak learners sequentially to yield a robust model. A schematic representation of the GB mechanism is shown in Figure 2.
Boosting involves sequentially applying a weak learner, $f(x)$, to repeatedly modified versions of the data, resulting in a sequence of weak learners, $f_m(x)$, $m = 1, 2, \ldots, M$. The final prediction is obtained by combining the predictions of all learners, each multiplied by a weight $\alpha_m$ [66]:

$$\hat{f}(x) = \sum_{m=1}^{M} \alpha_m f_m(x)$$

To fit these models, the loss function, $L(y, f(x))$, is minimized over the training data:

$$\min_{f} \sum_{i=1}^{N} L\left(y_i, f(x_i)\right)$$

where $x$ denotes the input variables, $y$ is the target variable, and $N$ is the number of training data. The accuracy of the final prediction depends on the values of the weight factors obtained through the boosting algorithm. The weight of each learner is determined based on its accuracy, which is calculated by a loss function: the more accurate a learner is, the higher the weight factor it is assigned. In addition, by assigning unequal weights to the training data at each iteration, the learner is directed to focus on the erroneously predicted data at the next iteration.
In gradient boosting models, decision trees are used as weak learners, as they are relatively fast to construct and capable of robust predictions [67]. Decision trees split the training set into disjoint regions $R_j$, $j = 1, 2, \ldots, J$, according to the terminal nodes and then assign a constant $c_j$ to each region, so the predictive rule, based on the inputs $x$, can be defined as follows:

$$T(x) = \sum_{j=1}^{J} c_j \, I\left(x \in R_j\right)$$

where $I(\cdot)$ is the indicator function. Within the gradient boosting procedure, additive decision trees are constructed sequentially; at each iteration, a tree is fitted to the pseudo-residuals (the negative gradient of the loss function) computed for each training datum, so that the loss is progressively reduced. The gradient boosting algorithm is given in Algorithm 1.
In the first step, the model is initialized with a single-terminal-node tree. Then, in the boosting loop with $m = 1, 2, \ldots, M$, the best regression tree is fitted in four steps. First, the component of the negative gradient (pseudo-residual), $r_{im}$, for $i = 1, 2, \ldots, N$, is computed. Then, a regression tree partitions the training data into $L$ disjoint regions, $\{R_{jm}\}_{j=1}^{L}$, and assigns a distinct constant value to each terminal node. After that, the constant that minimizes the loss function within each region is located. Finally, the current approximation in each region is separately updated based on the previous iteration. The final GB model is the sum of all trees fitted at each iteration, each multiplied by its coefficient; in other words, the model constructed at the last iteration is the final model. The shrinkage parameter, $\nu$, in Algorithm 1 represents the learning rate of the additive procedure. An operation with a low shrinkage will have a higher degree of precision.

The squared error is more convenient than the other loss functions because its negative gradient is equal to the residual of the current model at each iteration ($r_{im} = y_i - f(x_i)$). Thus, for the squared-error loss function, a tree fitted to the current residuals is added to the expansion at each iteration, which simplifies the computation of the gradient boosting algorithm [19,67,69].
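The steps described above can be sketched in plain Python for the squared-error loss. This is a minimal illustration, not the authors' implementation: it assumes a single input feature, uses one-split trees (stumps) as weak learners, and the function names, toy data, and default settings are all hypothetical.

```python
from statistics import mean


def fit_stump(x, r):
    """Fit a one-split regression tree (stump) to targets r on a 1-D feature x."""
    best = None
    for t in sorted(set(x))[:-1]:  # thresholds at every distinct value but the largest
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        c_left, c_right = mean(left), mean(right)  # region constants c_j
        sse = (sum((ri - c_left) ** 2 for ri in left)
               + sum((ri - c_right) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, c_left, c_right)
    _, t, c_left, c_right = best
    return lambda xi: c_left if xi <= t else c_right


def gradient_boost(x, y, n_estimators=50, learning_rate=0.1):
    """Gradient boosting for squared-error loss with stumps as weak learners."""
    f0 = mean(y)  # initial model: a single constant (one terminal node)
    trees = []
    pred = [f0] * len(y)
    for _ in range(n_estimators):
        # Pseudo-residuals: for squared error these are the plain residuals.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        # Shrunken additive update of the current approximation.
        pred = [pi + learning_rate * tree(xi) for xi, pi in zip(x, pred)]
    return lambda xi: f0 + learning_rate * sum(tree(xi) for tree in trees)


# Toy, made-up data (not taken from the paper's dataset).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
model = gradient_boost(x, y)
mse = mean((yi - model(xi)) ** 2 for xi, yi in zip(x, y))
baseline = mean((yi - mean(y)) ** 2 for yi in y)  # error of the constant model
```

Comparing `mse` with `baseline` shows how much the boosted stumps improve on the initial constant model for the training data.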
An alternative to the squared-error loss function is the Huber loss function [70], which is a combination of the squared-error and absolute-error loss functions:

$$L(y, f) = \begin{cases} \frac{1}{2}(y - f)^2, & |y - f| \le \delta \\ \delta\left(|y - f| - \frac{\delta}{2}\right), & |y - f| > \delta \end{cases}$$

where $\delta$ denotes the threshold at which the loss function transitions from squared error to absolute error. The optimum value of $\delta$ depends on the distribution of $(y - f)$. It is suggested to choose $\delta$ as the $\alpha$-quantile of the distribution of $|y - f|$. In this case, $(1 - \alpha)$ corresponds to the breakdown point of the procedure. The breakdown point refers to the fraction of observations that can be arbitrarily modified without degrading the quality of the results [19].
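A minimal sketch of the Huber loss, its negative gradient (the pseudo-residual a boosting step would fit), and the quantile-based choice of the threshold is given below. The function names are hypothetical, and the nearest-rank quantile rule is an assumption made for illustration.

```python
import math


def huber_loss(y, f, delta):
    """Huber loss: quadratic for small residuals, linear beyond delta."""
    residual = abs(y - f)
    if residual <= delta:
        return 0.5 * residual ** 2
    return delta * (residual - 0.5 * delta)


def huber_negative_gradient(y, f, delta):
    """Pseudo-residual for the Huber loss: the residual, clipped to [-delta, delta]."""
    r = y - f
    if abs(r) <= delta:
        return r
    return delta if r > 0 else -delta


def quantile_delta(abs_residuals, alpha):
    """Pick delta as the alpha-quantile of |y - f| (nearest-rank rule, an assumption)."""
    ordered = sorted(abs_residuals)
    rank = max(1, math.ceil(round(alpha * len(ordered), 9)))
    return ordered[rank - 1]


# Toy absolute residuals (made up): one large outlier among small errors.
delta = quantile_delta([0.4, 0.1, 5.0, 0.3, 0.2], alpha=0.8)
```

With this toy input, the outlier of 5.0 lies above the chosen threshold, so it contributes linearly rather than quadratically to the loss.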

Additive Model.
As mentioned before, the GB algorithm fits decision trees sequentially, and the accuracy of the model on the training data increases after each iteration. The number of iterations (n_estimators) can influence the final model and its accuracy. There are three common methods for determining the optimal number of iterations in the gradient boosting method: an independent test set, out-of-bag estimation, and k-fold cross-validation. As Ridgeway [71] demonstrated, 5- or 10-fold cross-validation is more effective than the other approaches, although it may require more computing time. Moreover, the shrinkage parameter (learning_rate) has a significant impact on the performance of the GB algorithm. Since shrinkage represents the learning rate of the boosting procedure, lower values generally result in models with better predictive performance. However, models with lower shrinkage demand far more storage and CPU time, because a lower learning rate requires more iterations to reach the optimal model [71].

Decision Tree.
The configuration of the decision trees fitted within the gradient boosting procedure can affect the final accuracy. For a decision tree, properties such as the maximum depth that limits the growth of the tree (max_depth), the minimum number of samples required to split an internal node (min_samples_split), the minimum number of samples required to be at a leaf node (min_samples_leaf), and the number of features to consider when looking for the best split (max_features) determine the structure of the final decision tree [72].

Dataset.
Algorithm 1: Gradient boosting. Overview of the GB algorithm for regression [67].

The dataset consists of 402 unconfined compression test results for sands treated with MICP, which were reported in previous studies [39,41,45,49,55,56,58,73]. This literature-based database includes all studies on biocemented sands that properly reported test procedures and reliable results. Figure 3 shows the contribution of each reference along with its UCS distribution. The bar plot in Figure 3(a) shows the number of data points from each reference, with its share plotted above the column. As can be seen, the data are not equally distributed among the references; for instance, Cheng et al. [41], the most populated reference, constitutes 31.6% of the entire dataset, while another study by Cheng et al. [55] contributes only 1.7% of the dataset. Furthermore, the box plot in Figure 3(b) illustrates the UCS distribution for each referenced study. The bottom, middle, and top of each box are the first quartile, median, and third quartile of the UCS population, respectively. The lines extending from the top and bottom of each box indicate the minimum and maximum UCS. The outlier points for each reference are also shown in Figure 3(b). Similarly, the UCS distributions are not identical across all studies; however, most of the studies concentrate on UCS below 4000 kPa. Eight parameters are considered as inputs in the dataset: median sand particle size, $D_{50}$; uniformity coefficient of sand, $C_u$; initial void ratio of sand, $e_0$; calcium chloride concentration, $M_{Ca}$; urea concentration, $M_u$; optical density of bacteria, $OD_{600}$; urease activity of bacteria, UA; and calcite content, $F_{CaCO_3}$. Apart from the input parameters, some other variables are nearly constant throughout the dataset, so they are not included as inputs.
The source of calcium was calcium chloride in all studies. The treated sands were initially neutral (pH = 7) and were cured at room temperature (20-30°C). Table 1 presents statistical information on the dataset, including the mean, standard deviation (std), minimum, maximum, and quartiles for each variable. In addition, the distributions of the unconfined compressive strength and each input parameter are shown in Figure 4. The variables can be described as follows:
(i) The sands are classified as fine to medium sands with median grain sizes ranging from 0.14 to 1.60 mm (Figure 4(a)); however, most of them can be categorized as fine-grained sands ($D_{50}$ < 0.425 mm) [74]. Also, the majority of the sands have a uniform particle size distribution with a coefficient of uniformity ranging from 1 to 2 (Figure 4(b)).

Correlation of Variables.
Correlation analysis can efficiently reveal the relationship between variables in a dataset. In this study, the Pearson correlation coefficient is used to analyze the relationship between variables [75]. The Pearson correlation method determines the degree of the linear relationship between two variables. The Pearson correlation coefficient, $r_p$, ranges from −1 to 1. The larger the absolute value of $r_p$, the stronger the linear correlation between the two variables. In Figure 5, the heatmap of the Pearson correlation coefficient matrix of all features is depicted. It is evident that the UCS of biocemented sands correlates strongly with calcite content. By contrast, UCS is almost independent of the uniformity coefficient of sand.
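For illustration, the Pearson coefficient between two variables can be computed as sketched below. The function name is hypothetical, and the calcite-content and UCS values are made-up toy numbers, not entries from the paper's dataset.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical calcite contents (%) and UCS values (kPa), for illustration only.
calcite = [2, 4, 6, 8, 10]
ucs = [200, 450, 800, 1500, 2600]
r_p = pearson_r(calcite, ucs)
```

Because the coefficient captures only linear association, a curved but monotonic trend such as this toy one gives a value close to, but below, 1.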
To explore the relationship between UCS and $F_{CaCO_3}$, Figure 6 displays the distribution of UCS against $F_{CaCO_3}$. A linear regression line with a positive slope is also plotted in Figure 6, which confirms the direct relationship between UCS and $F_{CaCO_3}$. In other words, for sands cemented under similar test conditions, samples with a larger CaCO3 content sustain higher compressive stress. This strength enhancement mainly stems from the CaCO3 crystals in the sand pores, which bind the sand grains together. The initial void ratio of soil is a fundamental parameter for defining the density of soil. According to Figure 6, the void ratio shows the strongest correlation with CaCO3 content within the dataset. In order to explore the relationship between void ratio, CaCO3 content, and UCS, the gradient-colored scatterplot of $e_0$ and $F_{CaCO_3}$ is shown in Figure 7. The color bar displayed on the right side of Figure 7 gives the value of UCS for each datum. The color bar spans UCS values between 0 and 10 MPa, so data with UCS higher than 10 MPa are given the same color as data with a UCS of 10 MPa. In spite of the non-normal distribution of $e_0$, a correlation between $e_0$ and $F_{CaCO_3}$ can be derived from Figure 7: $F_{CaCO_3}$ reaches higher values for sands with greater $e_0$. In other words, sands with more void space have the potential to accommodate more calcite crystals among the sand particles. Furthermore, the UCS color mapping demonstrates that sands with $F_{CaCO_3}$ higher than 10% mostly have $e_0$ between 0.6 and 0.9, and these treated samples have UCS higher than 2 MPa. Therefore, enhancing the compressive strength through MICP is more efficient in sands with more pore space (0.6 < $e_0$ < 0.9) than in dense ones. High-strength treated sands (UCS > 10 MPa) are mainly found in sands with void ratios ranging from 0.6 to 0.8 in Figure 7.

K-fold Cross-Validation.
Validation of the models was carried out through a k-fold cross-validation approach, which produces reliable models through k rounds of validation. In k-fold cross-validation, the dataset is first divided into two sets: a training set and a test set. The test set is held out for the final evaluation of the model. The training set is divided into k subsamples (folds) of similar size. Then, a model is fitted on (k − 1) folds of the training data, and the remaining fold validates the constructed model. This procedure is repeated k times, so that each fold is used as the validation set exactly once. The evaluation of the model is obtained from the average over all k models. This study uses 10-fold cross-validation for model development, with 20% of the dataset held out as the test set. The test set is selected randomly from the whole dataset. Figure 8 illustrates the schematic procedure of the 10-fold cross-validation used in this study.
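The splitting scheme described above can be sketched with index lists only. The helper names and the random seed are hypothetical; the dataset size (402), the 20% hold-out, and k = 10 follow the text.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the n sample indices as the test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]  # (training, test)


def k_fold(indices, k=10):
    """Yield (fit, validation) index lists; each fold validates exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, validation


train_idx, test_idx = train_test_split_indices(402)
splits = list(k_fold(train_idx, k=10))
```

The test indices never enter any fold, so the cross-validation score is computed without touching the held-out data.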

Hyperparameter Tuning.
As stated previously, the gradient boosting algorithm incorporates three parts: the loss function, additive boosting, and the decision tree, each of which has its own configuration. The performance of a gradient boosting model on a dataset can fluctuate significantly with changes in the model architecture. Therefore, finding the optimal model is a key step for precise prediction.
Calibrating models with different configurations to find the optimal model is commonly known as hyperparameter tuning, and the configuration parameters are called hyperparameters. In this study, hyperparameter tuning is carried out using the RandomizedSearchCV module in the Scikit-learn package [76]. RandomizedSearchCV evaluates randomly sampled sets of hyperparameters, computes their scores, and returns the set of parameters that yields the best score. The optimal model is selected based on the cross-validated root mean squared error, which reduces the risk of overfitting. The hyperparameters and the optimal configuration of the GB model are described in Table 2.

Accuracy
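(ii) RMSE stands for the root mean squared error, a measure of the error produced in the model prediction. Therefore, the lower the RMSE, the higher the accuracy attained. The RMSE can be calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

(iii) MAPE denotes the mean absolute percentage error, which is a relatively intuitive measure. Model performance improves as MAPE approaches 0. MAPE can be computed as follows:

$$\mathrm{MAPE} = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

(iv) $R^2$ is the coefficient of determination in regression problems, which measures how well a model predicts the targets. $R^2$ typically ranges from 0 to 1, and a higher value represents better model performance. $R^2$ relates to the ratio of the residual sum of squares, $SS_{res}$, to the total sum of squares, $SS_{tot}$, and can be computed as follows:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}$$

where $y_i$ and $\hat{y}_i$ are the actual and predicted targets, $N$ is the number of data, and $\bar{y}$ is the average of the targets.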
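The three metrics described above (RMSE, MAPE, and the coefficient of determination) can be sketched from their standard textbook definitions, which are assumed here to match the ones used in the study. The function names and the toy values are hypothetical.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (targets must be nonzero)."""
    n = len(y_true)
    return 100.0 / n * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy actual and predicted UCS values in kPa (made up for illustration).
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
scores = (rmse(actual, predicted), mape(actual, predicted),
          r_squared(actual, predicted))
```

RMSE carries the units of the target, whereas MAPE and the coefficient of determination are scale-independent, which is why they are convenient for comparing models across datasets.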

[78], k-nearest neighbor (KNN) [79], support vector regression (SVR) [80], and decision tree (DT) [81]. Moreover, the results of this study are compared with those of Wang and Yin [8], who predicted the UCS of biocemented sands. They employed a multiexpression programming method combined with the Monte-Carlo method (MEP-MC) that relies on an evolutionary algorithm for developing mathematical expressions [82]. In the MEP-MC, five groups were constructed based on a database, and then a model was fitted for each group. The database used in their study was smaller than that of this study, containing 351 UCS test results. Wang and Yin [8], in contrast with this study, did not consider the urease activity of bacteria as an input variable. Table 3 summarizes the error metrics of the gradient boosting method and the other models for the training and testing sets. It is evident that gradient boosting (GB) outperforms the other algorithms in predicting the unconfined compressive strength of biocemented sands. Predictions made by GB produced an MAE of 34 kPa for the training set and 229 kPa for the testing set. In other words, when a test datum is introduced to the optimal GB model with the parameters presented in Table 2, its UCS can be predicted with an average error of 229 kPa. In the dataset, the mean value of UCS is 1328 kPa (Table 1); thus, the mean absolute error produced by GB is 17 percent of the mean value of UCS over the entire dataset. Furthermore, the RMSE of GB shows a similar trend and is equal to 404 kPa for the test set. MAPE, a scale-independent and interpretable error measure, better reveals the superiority of GB over the other algorithms. The UCS values estimated through GB show a MAPE of 25% for the test set, while the other algorithms have MAPE values in the range of 36 to 54%. Therefore, the GB algorithm is capable of predicting the UCS of biocemented sand with an average error of 25%.

Models Performance.
As stated in the literature review, random forest (RF) is an ensemble algorithm consisting of several parallel learners; in contrast, gradient boosting consists of several sequential learners. It can be seen from Table 3 that the GB technique is far more robust than RF in predicting the UCS of sands treated with MICP. The RF algorithm makes predictions with an MAE and RMSE that are 62 and 44% higher than those of GB, respectively. Moreover, the MAPE obtained with RF for the test set is equal to 44.8%, which is almost 20 percentage points greater than that of GB. According to these observations, the sequential use of weak learners is far more efficient than the parallel one for predicting the unconfined compressive strength of sands treated with MICP.
Moreover, the performance of the multiexpression programming method (MEP-MC) performed by Wang and Yin [8] is presented in Table 3. Gradient boosting is clearly superior to MEP-MC in all aspects of error metrics. Te MAE and RMSE of predictions obtained from MEP-MC were 409 and 652 kPa, respectively, which are 78 and 61 percent greater than those obtained from the GB model.
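The relative differences quoted in this comparison can be checked with a few lines of arithmetic. The sketch below uses only values stated in the text (the test-set MAE, RMSE, and MAPE of GB, the MAE and RMSE of MEP-MC, and the MAPE of RF); the function name is illustrative.

```python
def percent_greater(value, reference):
    """Relative excess of `value` over `reference`, in percent."""
    return 100.0 * (value - reference) / reference


# Test-set errors in kPa, as stated in the text.
gb_mae, gb_rmse = 229.0, 404.0
mep_mae, mep_rmse = 409.0, 652.0

mae_excess = percent_greater(mep_mae, gb_mae)     # about 78.6 %
rmse_excess = percent_greater(mep_rmse, gb_rmse)  # about 61.4 %

# MAPE values are already percentages, so their gap is in percentage points.
mape_gap_points = 44.8 - 25.0                     # about 19.8 points (RF vs. GB)

print(round(mae_excess, 1), round(rmse_excess, 1), round(mape_gap_points, 1))
```

The computed excesses of about 78.6% and 61.4% are consistent with the 78 and 61 percent quoted above.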
The distributions of predicted UCS versus actual UCS for the training and testing sets are shown in Figures 9 and 10, respectively. The predictions made for the training set are mostly close to or equal to the targets, and the majority of points in Figure 9(a) lie along the line of equality. The error distribution in Figure 9(b) shows that more than 200 of the training data have no error in their estimation. The distribution of the test set, shown in Figure 10(a), corroborates the reliability of the GB model. The test set predictions are well concentrated around the line of equality, demonstrating the strong correlation between predicted and actual UCS. According to Table 3, the coefficient of determination ($R^2$) for the test set of the GB model is equal to 0.95. Additionally, the errors produced for the test set are approximately normally distributed in Figure 10, with the majority being lower than 500 kPa.

Reliability Analysis.
In order to establish the effectiveness and dependability of the algorithms, a reliability analysis based on the Friedman test is performed [17]. In this method, the models are ranked for each datum according to their prediction errors, from 1 for the least error to z for the highest error, for z models. For a database containing N data, the average ranking for model j ($r_j$) can be calculated using the following formula:

$$r_j = \frac{1}{N} \sum_{i=1}^{N} r_i^j \tag{15}$$

where $r_i^j$ denotes the ranking of the i-th datum for model j. Using equation (15), the average rankings ($r_j$) of all utilized models are computed and plotted in Figure 11. This plot illustrates the superiority of the gradient boosting method, which has the lowest average ranking in comparison to the other models. This result confirms that GB outperforms the five other frequently used machine learning techniques in predicting the UCS of biocemented sands. To find out whether this variation in the models' performances is significant, the chi-square ($\chi_r^2$) of the average rankings over the test set is computed as follows:

$$\chi_r^2 = \frac{12N}{z(z+1)} \left[ \sum_{j=1}^{z} r_j^2 - \frac{z(z+1)^2}{4} \right] \tag{16}$$

where N is the number of test data and z is the number of algorithms, which is equal to 6 in this study. The chi-square test relies on a null hypothesis with (z − 1) degrees of freedom, which is rejected if the computed chi-square value is equal to or greater than the critical one at a prespecified level of significance [83]. The critical chi-square for this study, with 5 degrees of freedom and a significance level of 0.05 (95% confidence), is equal to 11.07. Using equation (16), the chi-square is equal to 38.65 for this study; thus, the null hypothesis is rejected, and the differences among the models' performances are found to be statistically significant.
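The ranking procedure described above can be sketched as follows. This is a minimal illustration, assuming that equation (16) is the standard Friedman statistic and that tied errors share the average of the ranks they span; the error matrix is hypothetical (three models on four samples) and is not taken from Table 3.

```python
def rank_errors(errors):
    """Rank one sample's absolute errors across models (1 = smallest error).

    Tied errors receive the average of the ranks they span.
    """
    order = sorted(range(len(errors)), key=lambda j: errors[j])
    ranks = [0.0] * len(errors)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and errors[order[end + 1]] == errors[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks


def friedman(abs_errors):
    """abs_errors[i][j] is the absolute error of model j on sample i.

    Returns the average ranking of each model and the Friedman chi-square.
    """
    n = len(abs_errors)
    z = len(abs_errors[0])
    rank_rows = [rank_errors(row) for row in abs_errors]
    avg_ranks = [sum(row[j] for row in rank_rows) / n for j in range(z)]
    chi2 = 12.0 * n / (z * (z + 1)) * (
        sum(r * r for r in avg_ranks) - z * (z + 1) ** 2 / 4.0
    )
    return avg_ranks, chi2


# Hypothetical absolute errors (kPa): 4 samples (rows) x 3 models (columns).
abs_errors = [
    [10.0, 20.0, 30.0],
    [5.0, 15.0, 25.0],
    [12.0, 18.0, 40.0],
    [8.0, 9.0, 50.0],
]
avg_ranks, chi2 = friedman(abs_errors)

# Critical chi-square for z - 1 = 2 degrees of freedom at the 0.05 level.
CRITICAL_CHI2_DF2 = 5.991
reject_null = chi2 >= CRITICAL_CHI2_DF2
print(avg_ranks, chi2, reject_null)
```

In this toy case the first model is always best and the third always worst, so the statistic reaches its maximum value N(z − 1) = 8. For the study itself the same comparison is 38.65 against the critical value of 11.07 for 5 degrees of freedom.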

Feature Importance.
The gradient boosting technique can also provide an importance score for each variable, indicating how valuable each feature is in the construction of the boosted decision trees. The feature importance score lies within a range of 0 to 100, and higher values indicate greater importance. The results of the feature importance analysis of this study are presented in Figure 12. Consistent with the heatmap in Figure 5, calcite content ($F_{\mathrm{CaCO_3}}$) is found to be the most influential feature for the gradient boosting algorithm. The second most important feature is the initial void ratio ($e_0$), which has a feature importance of 10%. The other features of the sands, bacteria, and cementation solutions have the lowest influence on the final UCS in the gradient boosting algorithm.
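A feature importance score on a 0 to 100 scale can be obtained by normalizing raw per-feature scores so that they sum to 100. The sketch below shows that normalization only; the raw scores are made up for illustration and do not reproduce Figure 12, and the feature labels simply reuse symbols from the Notation. In scikit-learn, for instance, the `feature_importances_` attribute of a fitted gradient boosting model already sums to 1, so multiplying by 100 gives the same scale; whether the study used that library is an assumption.

```python
def importance_percent(raw_scores):
    """Normalize raw importance scores to percentages, sorted in descending order."""
    total = sum(raw_scores.values())
    percent = {name: 100.0 * score / total for name, score in raw_scores.items()}
    return sorted(percent.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores (e.g., accumulated split gains); NOT results of the study.
raw_scores = {"e0": 2.0, "F_CaCO3": 6.0, "D50": 1.0, "UA": 1.0}

ranked = importance_percent(raw_scores)
for name, share in ranked:
    print(f"{name}: {share:.1f}%")
```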

Environmental Modification
According to the literature review, the UCS of biocemented sand is influenced by the surrounding temperature and the initial pH of the soil. However, given that all the test results included in the dataset were obtained from unconfined compression tests performed on neutral sand (pH = 7) at room temperature (20-30°C), these two variables are not included in training the models. The available data on the effect of temperature and pH are so scarce that building a model based on these variables is almost impossible. To address this limitation, this section provides guidelines for accounting for the effect of temperature and pH on the UCS of biocemented sands based on this small set of data. Given that these findings are based on a limited number of tests, the results from such analyses should be treated with considerable caution.
6.1. Temperature. Throughout the dataset used in this study, the temperature of the curing environment is close to room temperature (20-30°C). Research conducted by Cheng et al. [41] provides a basis for adjusting the predicted UCS to other temperatures. Cheng et al. [41] conducted a series of unconfined compression tests on sands treated with an identical treatment program but cured at three different temperatures (4, 25, and 50°C). It was reported that the strongest biocemented sands were those cured at 25°C. Since all samples were treated identically, the UCS values of the samples cured at 50 and 4°C can be normalized by the UCS at 25°C. Therefore, the temperature coefficient, $r_{t,25^\circ\mathrm{C}}$, is defined as follows:

$$r_{t,25^\circ\mathrm{C}} = \frac{\mathrm{UCS}_T}{\mathrm{UCS}_{25^\circ\mathrm{C}}}$$

where $\mathrm{UCS}_T$ and $\mathrm{UCS}_{25^\circ\mathrm{C}}$ are the unconfined compressive strengths of specimens cured at temperatures of T and 25°C, respectively. It should be mentioned that the calcite content is equal for both samples. The parameter $r_{t,25^\circ\mathrm{C}}$ is thus the ratio of the UCS of sands treated at a temperature of T to that of sands treated at 25°C. The distribution of $r_{t,25^\circ\mathrm{C}}$ in the study of Cheng et al. [41] is illustrated in Figure 13.

Figure 12: Results of feature importance analysis (axis: feature importance, %).

To estimate the UCS of biocemented sands cured in a hotter or colder environment, the trend lines in Figure 13 can be used to adjust the UCS values estimated using the GB model. The predicted UCS should be multiplied by the $r_{t,25^\circ\mathrm{C}}$ value corresponding to the temperature and CaCO3 content. Although the lack of experimental data at different temperatures restricts the temperature modification, these results can provide insight into the behavior at other temperatures.
6.2. pH. As stated previously, both acidity (pH < 7) and alkalinity (pH > 7) negatively impact the UCS of sands treated by MICP. The degree of UCS reduction cannot be accurately estimated due to the lack of high-quality literature with extensive datasets; however, the results of Cheng et al. [63] could provide an initial guideline. Cheng et al. [63] demonstrated that sands with pH levels of 9.5 and 3.5 exhibit lower UCS than neutral sand, even with high levels of CaCO3 crystals. Acidic sand showed a greater reduction than alkaline sand: the UCS of acidic sand was approximately 25% of that of neutral sand, whereas the UCS of alkaline sand was approximately 50% of that of neutral sand. Therefore, when estimating the UCS of acidic or alkaline sands treated by MICP, the final UCS can be taken as 25 and 50% of that of neutral sand, respectively.
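The two environmental modifications of this section can be written as simple multipliers applied to a predicted UCS. The following is a minimal sketch under stated assumptions: the function names are illustrative, the temperature coefficient must be read by the user from the trend lines of Figure 13 (the value used below is hypothetical), and the pH factors are the approximate 25% and 50% ratios reported for pH 3.5 and 9.5, applied here to any acidic or alkaline sand as the text suggests. The source does not state how to combine the two corrections, so they are kept separate.

```python
def apply_temperature(ucs_pred_kpa, r_t):
    """Scale a UCS predicted for room temperature by the temperature coefficient.

    r_t is the ratio UCS_T / UCS_25C, to be read from Figure 13 for the
    curing temperature and calcite content of interest.
    """
    return ucs_pred_kpa * r_t


def ph_factor(ph):
    """Approximate UCS ratio relative to neutral sand (based on tests at pH 3.5 and 9.5)."""
    if ph < 7.0:
        return 0.25  # acidic sand: about 25% of the neutral-sand UCS
    if ph > 7.0:
        return 0.50  # alkaline sand: about 50% of the neutral-sand UCS
    return 1.0


def apply_ph(ucs_pred_kpa, ph):
    """Scale a UCS predicted for neutral sand by the pH factor."""
    return ucs_pred_kpa * ph_factor(ph)


# Hypothetical prediction of 1000 kPa and a hypothetical r_t of 0.8.
ucs_pred = 1000.0
print(apply_temperature(ucs_pred, r_t=0.8))
print(apply_ph(ucs_pred, ph=3.5), apply_ph(ucs_pred, ph=9.5))
```

Because both corrections rest on a small number of tests, the adjusted values are rough guides rather than predictions of comparable accuracy to the GB model.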

Conclusions and Future Works
Given the environmental benefits and wide application of microbially-induced calcite precipitation in sandy soils, the unconfined compressive strength of sands treated with MICP was predicted using a gradient boosting technique in this study. Based on a dataset consisting of 402 data points extracted from previous studies, the findings can be summarized as follows:
(i) An acceptable performance of the gradient boosting algorithm was achieved in predicting the UCS of biocemented sands in neutral conditions (pH = 7) and at room temperature (20-30°C). For the test set, predictions made by gradient boosting had MAE and RMSE equal to 229 and 404 kPa, respectively. Also, MAPE and $R^2$ were 25% and 0.95, respectively. The comparison with five other frequently used machine learning techniques (ANN, SVR, KNN, RF, and DT) demonstrated that gradient boosting outperforms them in all error metrics.
(ii) The correlation analysis revealed that the UCS of biocemented sands mostly depends on the calcite content. Furthermore, a correlation was found between the void ratio and calcite content, suggesting that high levels of CaCO3 precipitation could occur in soils with a void ratio between 0.6 and 0.9.
(iii) Using the existing literature on the UCS of biocemented sands in harsh environments, guidelines were developed for modifying the predicted values. These analyses revealed a trend for low-calcite samples in cold (4°C) and hot (50°C) conditions. Furthermore, biocemented sands treated in alkaline and acidic environments showed lower UCS than neutral ones. These modifications were limited to a specific range of temperature and pH levels because few data are available for performing the analysis.
Overall, this study provides insights into the application of machine learning algorithms in predicting the UCS of biocemented sands treated with MICP, which can be useful for civil engineering applications. However, further experimental studies with clear and detailed treatment procedures (particularly injection details) can reinforce the database for further developing the models. MICP treatment of sands with varying void ratios can provide valuable insight into determining the optimal initial condition for the MICP treatment. Also, further research at a variety of temperatures and pH levels is needed to enhance the accuracy and feasibility of the environmental modifications.
Figure 13: Variation of temperature coefficient ($r_{t,25^\circ\mathrm{C}}$) for various CaCO3 content ($F_{\mathrm{CaCO_3}}$) at different temperatures (legend: 50°C, 4°C) [41].

Notation
$r_{t,25^\circ\mathrm{C}}$: Temperature coefficient for 25°C
$C_u$: Uniformity coefficient
$D_{50}$: Median sand particle size
$e_0$: Initial void ratio
$f$: Function estimate
$f(\,)$: Final boosted function at iteration M
$f_m(\,)$: Function for the m-th iteration
$F_{\mathrm{CaCO_3}}$: Calcite content
$k$: Number of folds in cross-validation
$L(\,)$: Loss function
$L_{\mathrm{Huber}}$: Huber function
$L_{\mathrm{lad}}$: Absolute error loss function
$L_{\mathrm{ls}}$: Squared error loss function
Learning_rate: Shrinkage parameter
MAE: Mean absolute error
MAPE: Mean absolute percentage error
max_depth: The maximum depth that limits the growth of trees
max_features: Number of features to consider when looking for the best split
$M_{ca}$: Calcium chloride concentration
min_samples_leaf: Minimum number of samples required to be at an internal or external node
min_samples_split: Minimum number of samples required to split an internal node
$M_u$: Urea concentration
n_estimators: Number of iterations
$\mathrm{OD}_{600}$: Optical density of biomass at 600 nm
$R^2$: Coefficient of determination
$R_j$: Region j in a tree
$R_{jm}$: Region j in a tree for the m-th iteration
$r_{im}$: Negative gradient at the m-th iteration
$r_i^j$: Ranking of the i-th data for model j
$r_j$: Average ranking for model j
RMSE: Root mean squared error
UA: Urease activity
UCS: Unconfined compressive strength
$\mathrm{UCS}_{25^\circ\mathrm{C}}$: Unconfined compressive strength at a temperature of 25°C
$\mathrm{UCS}_T$: Unconfined compressive strength at a temperature of T
$v$: Learning rate (shrinkage)
$x_i$: Input variables of the i-th sample
$\bar{y}$: Average of targets
$y_i$: Target variable of the i-th sample
$z$: Number of algorithms
$\alpha$: Breakdown point parameter in Huber loss function
$\alpha_m$: Weight factor for the m-th sample
$c_j$: Constant for terminal j
$c_{jm}$: Optimal constants in each region at the m-th iteration
$\delta$: Threshold of Huber loss function
$\chi_r^2$: Chi-square in Friedman analysis

Data Availability
The dataset used in this study is included within the supplementary materials.

Conflicts of Interest
The author declares that he has no conflicts of interest that could have appeared to influence the work reported in this paper.

Supplementary Materials
A supplementary material file is provided that contains the dataset used in this study. The dataset, provided as a text file, includes the values of the input and target parameters sorted by referenced source. The notation and reference numbers in the dataset follow those of the main article. (Supplementary Materials)