Estimation of Soil Cohesion Using Machine Learning Method: A Random Forest Approach

Soil cohesion (C) is one of the critical soil properties and is closely related to basic soil properties such as particle size distribution, pore size, and shear strength. Hence, it is mainly determined by experimental methods. However, the experimental methods are often time-consuming and costly. Therefore, developing an alternative approach based on machine learning (ML) techniques to solve this problem is highly recommended. In this study, machine learning models, namely, support vector machine (SVM), Gaussian process regression (GPR), and random forest (RF), were built based on a data set of 145 soil samples collected from the Da Nang-Quang Ngai expressway project, Vietnam. The database includes six input parameters, that is, clay content, moisture content, liquid limit, plastic limit, specific gravity, and void ratio. The performance of the models was assessed by three statistical criteria, namely, the correlation coefficient (R), mean absolute error (MAE), and root mean square error (RMSE). The results demonstrated that the proposed RF model could predict soil cohesion with high accuracy (R = 0.891) and low error (RMSE = 3.323 and MAE = 2.511), and its predictive capability is better than that of SVM and GPR. Therefore, the RF model can be used as a cost-effective approach for predicting the soil cohesion used in the design and inspection of constructions.


Introduction
The cohesion (C) of the soil is created by the bonds between the compounds, the particles, and the viscosity of the water-glue film that surrounds them. Along with the internal friction angle, the cohesion force is part of the shear resistance (slip resistance) of cohesive soil, used to calculate the load capacity of the ground soil. Cohesion force is usually measured based on the Mohr-Coulomb theory. In the plane of shear stress versus normal stress, the soil cohesion is the intercept on the shear axis of the Mohr-Coulomb shear resistance line [1][2][3]. The cohesion of the soil greatly depends on the composition of particles in the soil, soil texture, and moisture [4]. In the design of geotechnical constructions such as foundations, slopes, or open pits, the precise determination of the soil cohesion is of great concern [5]. This important parameter can be determined in the field or in laboratories [3]. Tests for soil cohesion determination are usually carried out as a direct shear test (slow shear, quick shear, and consolidated quick shear) or as an indirect soil shear test with a triaxial compression apparatus [6].
However, the experiments to determine this parameter are often cumbersome, expensive, and time-consuming [7]. Field estimation requires a team of skilled and experienced engineers [8][9][10]. To overcome these difficulties, technical design models have been proposed based on useful correlations that exist between indicator properties obtained from field tests. Several studies have employed models to predict different soil properties and characteristics, for example, Masada's [11] study for clay and silt embankments, Mofiz and Rahman [12] for Barind soils, Cola and Cortellazo [13] for peaty soils, and Hajarwish and Shakor [14] for mudrock. However, soil is an extremely complex material, and the geological conditions in each region are different, so these models cannot be applied universally across regions [15]. This confirms the need for a general method that can predict soil cohesion under different conditions.
More recently, machine learning (ML) and artificial intelligence (AI) techniques have gradually become popular and have been applied in many different fields [16][17][18]. ML has been widely applied in the construction industry, for example, to determine the critical force of steel [19]. Many variables affect the critical force of steel [20] and the mechanical properties of the soil [21]. Therefore, the application of artificial intelligence to determine soil cohesion is feasible. Kovačević et al. [22] used a support vector machine (SVM) to estimate the chemical and physical properties of soil and to classify soil types. Guo et al. [19] used an Artificial Neural Network (ANN) and a Generalized Linear Model (GLM) to predict soil aggregate stability. Moufiz and Rahman [12] compared different ML models, including Linear Regression (LR), ANN, SVM, random forest (RF), and M5 Tree (M5P), for prediction of the Standard Penetration Test (SPT) based N-value of soil in the state of Haryana, India. In general, ML models have proved to be promising and highly accurate tools for the prediction of soil properties [23,24]. The main aim of this study is to apply one of the most popular ML models, namely, random forest (RF) [25][26][27], to predict the cohesion force of the soil quickly, avoiding costly and time-consuming experiments. A database of soil properties was constructed from the experimental results of the Da Nang-Quang Ngai expressway project, Vietnam. Two other ML models, namely, support vector machine (SVM) and Gaussian process regression (GPR), were used for comparison.

Database Collection and Preparation
In this study, the test results of 145 soil samples collected from the Da Nang-Quang Ngai expressway project, located in the South Central part of Vietnam (Figure 1), were used to construct the database for modeling soil cohesion force prediction. In the modeling, we considered six input parameters, namely, clay content, moisture content, liquid limit, plastic limit, specific gravity, and void ratio, and one output parameter, the soil cohesion force. The input and output parameters are determined according to the formulas in published works [28,29]. The data in this study are randomly divided into two subsets using a uniform distribution, in which 70% of the data is used as the model training set and 30% is used to test the performance of the model. All data are scaled to the range [0, 1] to reduce numeric error while processing with ML algorithms, as Witten et al. [30] recommended. This process helps the training phase of the AI models achieve good generalization capability. The scaling is given by

$$x_n = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$

where $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of the considered variable and $x_n$ is the normalized value of the variable $x$.
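The scaling and the 70/30 random split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `min_max_scale` and `random_split`, the seed, and the sample moisture values are hypothetical and are not taken from the study, which used MATLAB.

```python
import random


def min_max_scale(values):
    """Scale numbers to [0, 1] with x_n = (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]


def random_split(n_samples, test_percent=30, seed=0):
    """Return (train_idx, test_idx), drawn uniformly at random without replacement."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = n_samples * test_percent // 100  # integer arithmetic avoids float rounding
    return indices[n_test:], indices[:n_test]


# Hypothetical moisture contents in percent (illustrative values only).
moisture = [15.53, 40.0, 115.41]
scaled = min_max_scale(moisture)

# With 145 samples this yields 102 training and 43 testing indices.
train_idx, test_idx = random_split(145)
```

In practice the minimum and maximum would normally be computed on the training subset and reused for the testing subset, so that no information from the test data leaks into training; the source does not state which convention was followed.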

Modeling Approaches
Random Forest.
Random forest (RF) is one of the most commonly used ML algorithms because of its simplicity and versatility. This is a supervised learning model used for classification and regression problems, proposed by Breiman in 2001 [30]. RF is an ensemble learning method that gathers results from single decision trees, thereby improving predictive efficiency through majority voting or averaging of results, depending on the specific problem. Suppose that there is an input data set $X = \{x_1, x_2, x_3, \ldots, x_n\}$, where n is the number of data dimensions or the number of predictive variables. An RF model is a set of T trees $T_1(X), T_2(X), T_3(X), \ldots, T_T(X)$. The prediction results of these decision trees are

$$\hat{Y}_1 = T_1(X), \quad \hat{Y}_2 = T_2(X), \quad \ldots, \quad \hat{Y}_T = T_T(X).$$

For the regression problem, the final result of the RF model is the average of the prediction results of all the above trees:

$$\hat{Y} = \frac{1}{T} \sum_{t=1}^{T} \hat{Y}_t.$$

Trees are grown by dividing the initial training set into smaller training sets, and in each split, only a few randomly selected predictive variables are considered. Decision trees are grown without pruning until a stopping criterion predetermined by the programmer is met. Commonly used criteria during tree growth are RMSE, the Gini Diversity Index, or Mean Square Error. Trees with low predictive performance are then discarded, and only trees with sufficient predictive value are kept in the final RF model. The random selection of predictor variables and the aggregation of the results of many decision trees mitigate the overfitting problem of the single decision tree model [30,31]. The structure of the random forest is depicted in Figure 2. In this study, the RF model was trained and validated using the tools in the MATLAB application.
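To make the averaging idea concrete, the following is a deliberately simplified Python sketch, not the MATLAB implementation used in the study. It assumes that each "tree" is a one-split regression tree (a stump) fitted on a bootstrap sample with one randomly chosen feature; real RF trees are much deeper. All names (`fit_stump`, `fit_forest`, `predict_forest`) and the toy data are hypothetical.

```python
import random


def fit_stump(X, y, feature):
    """Fit a one-split regression tree on one feature by minimizing squared error."""
    best = None
    values = sorted(set(row[feature] for row in X))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [t for row, t in zip(X, y) if row[feature] <= threshold]
        right = [t for row, t in zip(X, y) if row[feature] > threshold]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - mean_l) ** 2 for t in left) + sum((t - mean_r) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, mean_l, mean_r)
    if best is None:
        # The feature is constant in this sample: predict the sample mean everywhere.
        mean_y = sum(y) / len(y)
        return (feature, float("inf"), mean_y, mean_y)
    return (feature, best[1], best[2], best[3])


def predict_stump(stump, row):
    feature, threshold, mean_l, mean_r = stump
    return mean_l if row[feature] <= threshold else mean_r


def fit_forest(X, y, n_trees=25, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample with replacement
        feature = rng.randrange(n_features)         # random predictor for this split
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Regression output of the forest: the average of the individual tree outputs."""
    predictions = [predict_stump(stump, row) for stump in forest]
    return sum(predictions) / len(predictions)


# Toy, already-scaled inputs and cohesion-like targets (illustrative values only).
X = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.2], [0.9, 0.1]]
y = [5.0, 6.0, 20.0, 22.0]
forest = fit_forest(X, y)
prediction = predict_forest(forest, [0.15, 0.85])
```

Because every stump predicts a mean of training targets, the averaged forest output always stays within the range of the observed targets, which is one reason RF regression cannot extrapolate beyond the training data.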

Support Vector Machine.
Support vector machine (SVM), proposed by Vapnik in 1995 [32], is an effective and popular learning model for classification and for linear and nonlinear regression problems. The SVM model gives accurate and stable predictions, tolerates noise well, and is practical for high-dimensional feature spaces [33,34]. Many successful SVM applications to classification and regression problems have been published in different fields [35][36][37]. The basic theory of SVM is summarized as follows.
A training dataset $(x_i, y_i)$, $i = 1, 2, \ldots, N$, is selected for an SVM model as shown in Figure 3, where $x_i = [x_{1i}, x_{2i}, \ldots, x_{ni}] \in \mathbb{R}^{n_h}$ is the input data, $y_i \in \mathbb{R}^{n_m}$ is the output data corresponding to $x_i$, and N is the number of training samples. The SVM aims to find an optimal hyperplane function $f(x) = w \cdot x + b$ (determined by the weight vector w and the offset b) that fits all the data elements within the insensitive loss coefficient ε (based on two supporting hyperplanes, $w \cdot x + b \pm \varepsilon$). In the case of nonlinear regression, the function f(x) is determined as follows:

$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^{*}) K(x_i, x) + b, \qquad 0 \le \alpha_i, \alpha_i^{*} \le C,$$

where C is the penalty constant used to control the penalty error, $\alpha_i$, $\alpha_i^{*}$ are the Lagrange multipliers, and $K(x_i, x_j)$ is the kernel function defined as follows:

$$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j),$$

with Φ being a nonlinear mapping function. Linear, polynomial, sigmoid, and Gaussian functions are the most commonly used kernel functions.
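As an illustration, the four kernel families named above can be written in their common textbook forms, together with the kernel expansion f(x) = Σ (α_i − α_i*) K(x_i, x) + b. This is a sketch under assumptions: the parameter names `gamma`, `coef0`, and `degree`, and their default values, are conventional choices rather than values reported in the study, and the support vectors and coefficients in the example are made up.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=1.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


def gaussian_kernel(u, v, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def svr_predict(x, support_vectors, dual_coefs, bias, kernel):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b.

    dual_coefs[i] holds the difference alpha_i - alpha_i* for support vector i.
    """
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + bias


# Made-up support vectors and coefficients, only to show the call shape.
f_value = svr_predict(
    [2.0, 4.0],
    support_vectors=[[1.0, 0.0], [0.0, 1.0]],
    dual_coefs=[0.5, -0.25],
    bias=1.0,
    kernel=linear_kernel,
)
```

Finding the multipliers themselves requires solving the constrained optimization problem, which is omitted here; the sketch only shows how a fitted model is evaluated.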

Gaussian Process Regression.
Gaussian process regression (GPR) is a nonparametric, Bayesian approach applied to regression problems. GPR has several advantages: it works well on small datasets and provides uncertainty measurements on the predicted values. Consider the training data set $D = \{(x_i, y_i)\}_{i=1}^{N}$, where N is the number of training samples, $x_i \in \mathbb{R}^{D}$ represents the input data, and $y_i \in \mathbb{R}$ is the corresponding output value. In data set D, the random variables $f(x_1), f(x_2), \ldots, f(x_N)$ corresponding to the input data set are subject to a joint Gaussian distribution. For the simplest case, the relation between the latent function f(x) and the observed target y is

$$y = f(x) + \varepsilon, \qquad f(x) = x^{T} w, \qquad \varepsilon \sim N(0, \sigma_n^2), \qquad w \sim N(0, \Sigma_P),$$

where w denotes the weight, ε is the independent noise, $\sigma_n^2$ is the variance of the noise, and $\Sigma_P$ is the covariance of the weight. The distribution in the Gaussian process is represented by a mean function, denoted as m(x), and a covariance kernel function, denoted as K(x, x′) [38]:

$$m(x) = E[f(x)], \qquad K(x, x') = E[(f(x) - m(x))(f(x') - m(x'))], \qquad f(x) \sim GP(m(x), K(x, x')), \qquad (1)$$

where x and x′ ∈ $\mathbb{R}^{D}$ are random variables. For the basic GPR, m(x) is set to zero, and formula (1) can be rewritten as

$$f(x) \sim GP(0, K(x, x')),$$

where x is the learning sample whose measure in the GP is the finite-dimensional distribution of the GP. As defined by the GP, the finite-dimensional distribution is a joint normal distribution:

$$[f(x_1), f(x_2), \ldots, f(x_N)]^{T} \sim N(0, K(X, X)),$$

where X denotes the set of training inputs. The noise ε is independent of f(x) and follows a Gaussian distribution. Since f(x) follows a Gaussian distribution, y also follows a Gaussian distribution.
Then, the prior distribution of the observed target value y is inferred as

$$y \sim N(0, K(X, X) + \sigma_n^2 I_N).$$

Given test sample points $(x_*, y_*)$, the joint probability distribution of the observed target value y and the prediction value $y_*$ at the test points is expressed as

$$\begin{bmatrix} y \\ y_* \end{bmatrix} \sim N\left(0, \begin{bmatrix} K(X, X) + \sigma_n^2 I_N & K(X, x_*) \\ K(x_*, X) & K(x_*, x_*) \end{bmatrix}\right),$$

where $K(X, X)$ is the N × N matrix whose elements $K_{ij} = K(x_i, x_j)$ measure the correlation of $x_i$ and $x_j$; $K(X, x_*) = K(x_*, X)^{T}$ is the covariance matrix of the training set and the testing set; $K(x_*, x_*)$ is the covariance of the test points; and $I_N$ is the N × N identity matrix.
Applying the conditional distribution properties of the Gaussian distribution, the predictive distribution is obtained:

$$y_* \mid X, y, x_* \sim N(\bar{y}_*, \mathrm{cov}(y_*)),$$

where

$$\bar{y}_* = K(x_*, X)\,[K(X, X) + \sigma_n^2 I_N]^{-1} y, \qquad \mathrm{cov}(y_*) = K(x_*, x_*) - K(x_*, X)\,[K(X, X) + \sigma_n^2 I_N]^{-1} K(X, x_*).$$

The mean value $\bar{y}_*$ is the estimate of $y_*$; $\mathrm{cov}(y_*)$ is the covariance matrix of the test samples, which reflects the reliability of the estimate.
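The conditional-Gaussian step can be illustrated numerically. The sketch below assumes the standard zero-mean GPR predictive equations, mean = k*ᵀ (K + σ_n² I)⁻¹ y and variance = k(x*, x*) − k*ᵀ (K + σ_n² I)⁻¹ k*, with a squared-exponential kernel. The kernel choice, the hyperparameter values, and the helper names `sq_exp_kernel`, `solve`, and `gpr_predict` are assumptions for illustration and are not the settings used in the study.

```python
import math


def sq_exp_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between two input vectors."""
    d2 = sum((p - q) ** 2 for p, q in zip(a, b))
    return signal_var * math.exp(-0.5 * d2 / length_scale ** 2)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy, A is untouched
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z


def gpr_predict(X_train, y_train, x_star, noise_var=1e-2):
    """Return the predictive mean and variance at a single test input x_star."""
    n = len(X_train)
    # K(X, X) + sigma_n^2 * I
    K = [[sq_exp_kernel(X_train[i], X_train[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [sq_exp_kernel(x, x_star) for x in X_train]  # K(X, x*)
    alpha = solve(K, y_train)                              # (K + sigma_n^2 I)^-1 y
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)                                   # (K + sigma_n^2 I)^-1 k*
    var = sq_exp_kernel(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, var


# Two made-up one-dimensional training points and a test point between them.
mean_mid, var_mid = gpr_predict([[0.0], [1.0]], [1.0, 2.0], [0.5])
```

The variance shrinks near training points and returns to the prior variance far from them, which is the uncertainty measure mentioned at the start of this section.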

Model Evaluation.
The application of modeling tools in the field of geotechnical engineering is increasingly popular and effective. However, the ability of these models to make accurate predictions still needs to be tested with appropriate model evaluation indicators. In this study, three indicators are used to evaluate the quality of the model against the data collected from the experimental results: mean absolute error (MAE), root mean square error (RMSE), and correlation coefficient (R) [39,40].
MAE is calculated by Equation (2) and evaluates the difference between the actual values and the values calculated by the model [28]. However, it does not indicate the direction of the bias between the predicted and experimental values. When MAE = 0, the model output completely coincides with the actual values, and the model is considered "ideal." The MAE value is in the range [0, +∞).
RMSE is one of the basic quantities commonly used for evaluating the results of predictive models [41]. RMSE is often used to denote the mean magnitude of the error. In particular, the RMSE is extremely sensitive to large error values. Therefore, the closer the RMSE is to the MAE, the more stable the model error is. Just like MAE, RMSE does not indicate the direction of the deviation between the forecast value and the actual value. RMSE is determined by formula (3), and the value of RMSE is in the range [0, +∞).
R is the correlation coefficient representing how well the predictions agree with the data, a measure commonly used in ML algorithms [42]. The equation for calculating the value of R is presented in equation (4). The R values range from -1 to 1. An absolute value of R equal to 1 represents a perfect correlation between the simulated and real values, while a value of 0 indicates no correlation.

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_{0,i} - y_{t,i} \right| \qquad (2)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_{0,i} - y_{t,i} \right)^2} \qquad (3)$$

$$R = \frac{\sum_{i=1}^{n} (y_{0,i} - \bar{y}_0)(y_{t,i} - \bar{y}_t)}{\sqrt{\sum_{i=1}^{n} (y_{0,i} - \bar{y}_0)^2 \sum_{i=1}^{n} (y_{t,i} - \bar{y}_t)^2}} \qquad (4)$$

where n is the number of samples, $y_0$ and $\bar{y}_0$ are the actual experimental value and the average actual experimental value, and $y_t$ and $\bar{y}_t$ are the predicted value and the average predicted value, calculated according to the model forecast.
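The three criteria are straightforward to compute. The following Python sketch implements the usual definitions of MAE, RMSE, and the Pearson correlation coefficient R; the function names and the sample numbers are illustrative only.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def correlation(actual, predicted):
    """Pearson correlation coefficient R between actual and predicted values."""
    mean_a = sum(actual) / len(actual)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Made-up cohesion values in kPa, not taken from the study's database.
actual = [10.0, 12.0, 18.0, 25.0]
predicted = [11.0, 11.0, 20.0, 22.0]
scores = (mae(actual, predicted), rmse(actual, predicted), correlation(actual, predicted))
```

RMSE is never smaller than MAE on the same data, and the gap widens when a few errors are much larger than the rest, which is the sense in which the two being close indicates a stable error.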

Methodological Flowchart.
The process of implementing the methodology is depicted in Figure 4 and includes the following basic steps:
(i) Data acquisition: in this step, soil sample data collected from the Da Nang-Quang Ngai expressway project are used to build the model. On the basis of the collected data set, the input and output parameters are defined.
(ii) Database preprocessing: this is one of the most critical steps in ML and helps build a more accurate ML model. Some techniques are used to process the data, such as transforming data, ignoring missing values, and filling in missing values. After that, the data set is randomly divided into two parts: the training part and the testing part.
(iii) Selecting the model best suited to the data type: in this study, a random forest (RF) algorithm is used to estimate soil cohesion. The results of the RF model are also compared with the support vector machine (SVM) [32] and Gaussian process regression (GPR) [43].
(iv) Training and testing the model on data: in this step, the model is trained and its parameters are tuned using the "training database," and its performance is then tested on the unseen "testing database." An important point to note is that the test dataset is not used in the training process.
(v) Model evaluation: model evaluation is an indispensable part of the model development process and helps find the model that gives the best predictions.

Descriptive Statistics Analysis.
The statistical analysis of the data was performed (Table 1 and Figure 5). In the database, the clay content varies in the range of 4.09-47.96%, the natural moisture content is in the range of 15.53-115.41%, the liquid limit varies from 20.8 to 154.12%, the plastic limit ranges between 13.42 and 63.96%, the specific gravity varies from 2.59 to 2.75 g/cm³, and the void ratio ranges from 0.58 to 3.25. The soil cohesion values are in the range of 0.29 to 30.39 kPa. The histograms of the corresponding variables are presented in Figure 5, and the quantitative analysis of the input and output parameters is detailed in Table 1.

Prediction Performance of RF.
In this section, the effectiveness of the RF model is evaluated. The hyperparameters of the RF model are selected by trial and error and are presented in Table 2. The comparison between the experimental values of soil cohesion and those obtained from the RF model for the training and testing datasets is shown in Figure 6. The line representing the predicted soil cohesion is quite close to the line representing the experimental values. This good correlation is confirmed by the error diagram between the predicted and experimental soil cohesion for the training dataset (Figure 7(a)) and the testing dataset (Figure 7(b)). Of the 102 data samples of the training dataset and 43 data samples of the testing dataset, only a very few samples have an error in the range of [-7; 11] kPa. These errors show that prediction with the RF algorithm is feasible with small errors.
Finally, the relationship between the actual values and the predicted values is given as a regression graph in Figure 8. The quantitative values of the three criteria evaluating model performance are shown in Table 3. As shown in Table 3, the RF model attains a high correlation coefficient and low error values. Therefore, applying the RF model to predict soil cohesion is feasible with high accuracy and low error.

Analysis of Simulation Convergence of RF and Other ML Models.
In this work, the performance of the proposed model is assessed over a number of simulation runs. Several studies [44,45] have shown that the predictive performance of an algorithm depends on the random division of the data set into training and test sets. Therefore, the model's performance should be analyzed over a sufficient number of simulations to demonstrate the generality of the obtained results. In this study, a total of 200 simulations were conducted to study the performance of the proposed RF model. The hyperparameters of the other models are selected by trial and error and are presented in Table 2. Figures 9(a), 9(c), and 9(e) represent the normalized convergence values of RMSE, MAE, and R, respectively, whereas Figures 9(b), 9(d), and 9(f) represent the convergence values of the three respective criteria. As observed, after about 50 simulations, the oscillation of RMSE and MAE was within 1% for the training set (solid green line); see Table 4. In addition, 200 simulations with the SVM and GPR algorithms were performed and are presented in Figure 10.
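One way to examine convergence over repeated random splits is to track the running mean of a criterion and see when it settles. The sketch below is an assumption about how such a "normalized convergence" curve could be built (running mean divided by the mean over all runs); the source does not give the exact formula, and the function names and the four sample values are hypothetical.

```python
def running_mean(values):
    """Mean of the first k values, for k = 1..len(values)."""
    means, total = [], 0.0
    for k, value in enumerate(values, 1):
        total += value
        means.append(total / k)
    return means


def normalized_convergence(values):
    """Running mean divided by the mean over all runs (assumed normalization).

    Assumes the overall mean is nonzero, which holds for RMSE and MAE in practice.
    """
    means = running_mean(values)
    final = means[-1]
    return [m / final for m in means]


def runs_to_converge(values, tolerance=0.01):
    """First run index after which the normalized curve stays within the tolerance of 1."""
    last_outside = 0
    for k, ratio in enumerate(normalized_convergence(values), 1):
        if abs(ratio - 1.0) > tolerance:
            last_outside = k
    return last_outside + 1


# Hypothetical RMSE values from four repeated random splits.
rmse_per_run = [4.0, 2.0, 3.0, 3.0]
curve = normalized_convergence(rmse_per_run)
settled_at = runs_to_converge(rmse_per_run)
```

With 200 runs, as in the study, `rmse_per_run` would hold one RMSE value per random 70/30 split, and `settled_at` would indicate how many runs are needed before the 1% band is reached.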

Table 2 (rows for the comparison models).
SVM: fitted using the fitrsvm MATLAB function, with hyperparameter optimization that minimizes the 10-fold cross-validation loss. The 6 hyperparameters are box constraint, kernel function, kernel scale parameter, polynomial kernel function order, half the width of the epsilon-insensitive band, and the standardize method for data.
GPR: fitted using the fitrgp MATLAB function, with hyperparameter optimization that minimizes the 10-fold cross-validation loss. The 5 hyperparameters are basis function, kernel function, kernel scale, sigma value, and the standardize method for data.

Overall, the proposed RF algorithm is a better ML model than the other ML models (SVM, GPR) in predicting soil cohesion. This is reasonable because RF has many advantages, such as the following: (i) it can be effectively applied to large-scale datasets, as it provides a facility for size reduction without deleting unwanted variables from the training dataset; (ii) it can handle thousands of input features and variables at a time; (iii) it has an embedded efficient technique for estimating missing or null values, so it is possible to maintain a level of accuracy (i.e., consistent performance) even when a large portion of the data is missing; (iv) it parallelizes well because the trees are generated and computed completely independently of each other; and (v) it can minimize errors, as the results are synthesized from different "learners" (random forest trees) [46]. The results of this study are also comparable with other previously published works [46][47][48].

Sensitivity Analysis.
In this section, the feature importance of the input variables is estimated. For each simulation, the importance value is calculated by summing the changes due to the splits on the given predictor and dividing the sum by the number of branch nodes in the RF. Figure 11 shows the out-of-bag feature importance over 200 simulations (mean values) along with the standard deviation values. It can be seen that the void ratio is the most important variable in predicting soil cohesion. The moisture content is the second most important input, followed by the plastic limit, liquid limit, specific gravity, and clay content. These sensitivity results are reasonable and comparable with other published works [28,49,50].

Conclusion
In this study, a data set of 145 soil samples collected from the Da Nang-Quang Ngai expressway project was used to construct an RF model for soil cohesion prediction. The input data for model training include clay content, moisture content, liquid limit, plastic limit, specific gravity, and void ratio. Three statistical criteria, namely, correlation coefficient (R), mean absolute error (MAE), and root mean square error (RMSE), are used to evaluate the agreement between the values predicted by the RF model and the actual experimental values. The analysis results show that the built model can predict soil cohesion accurately and quickly, avoiding costly and difficult experiments that require complicated equipment. However, in ML problems, data is the key factor in creating a reliable predictive tool. Therefore, the next research direction is to collect additional data to further improve the algorithm, making the prediction more accurate and avoiding costly field experiments.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.