Influence of Data Splitting on Performance of Machine Learning Models in Prediction of Shear Strength of Soil

Thuyloi University, Hanoi 100000, Vietnam; University of Transport Technology, Hanoi 100000, Vietnam; Civil and Environmental Engineering Program, Graduate School of Advanced Science and Engineering, Hiroshima University, 1-4-1, Kagamiyama, Higashi-Hiroshima, Hiroshima 739-8527, Japan; Department of Civil, Environmental and Natural Resources Engineering, Lulea University of Technology, 971 87 Lulea, Sweden; Institute of Research and Development, Duy Tan University, Da Nang 550000, Vietnam; Bhaskaracharya Institute for Space Applications and Geo-Informatics (BISAG), Gandhinagar 382002, India


Introduction
Soil is a crucial material in civil engineering, as most structures are built on soil ground [1]. The failure of the ground and the collapse of buildings are often associated with soil shear strength. Under different loading conditions, the soil shear strength, or shear resistance, depends on the cohesion, friction, and interlocking between particles [1]. The mechanical properties of soil are complex because soil often contains different particle sizes, high water content, and large voids [1]. Soil shear strength is governed by basic parameters such as soil mineralogy, overburden pressure, water content, density, and voids. Commonly, the soil shear strength is calculated by determining the effective stress and soil parameters such as the internal friction angle and cohesion [1,2]. These soil parameters can be determined in the field by the Standard Penetration Test (SPT) or the shear vane test and in the laboratory by the direct shear test, ring shear test, triaxial test, and unconfined compression test [3,4]. These tests are time-consuming and costly when a large number of samples must be tested.
Over the last decades, many researchers have tried to improve existing methods and find alternative methods to determine the shear strength of soil [3,5-10]. Nam et al. [11] used a multistage direct shear test to determine the shear strength of unsaturated and saturated soils. Such a method could reduce some disadvantages of conventional direct shear tests and produced highly accurate results. Besides, many researchers have attempted to establish relationships between shear strength and soil indexes, such as clay fraction, liquid limit, plastic limit, and clay mineralogy [9,12]. Many efforts have also been made to evaluate the shear strength of soil through other soil parameters, for example by establishing a correlation between suction and shear strength [10,13]. In addition, several conventional procedures were introduced in which the relationship between water content and suction is employed to predict the shear strength of unsaturated soil [6,14-16]. Other work has estimated the soil shear strength in situ through shear wave velocity [16-18]. Overall, the conventional and traditional techniques have some disadvantages and limitations, such as relying only on basic soil parameters or covering a small range of soils. As an example, Kaya [2] indicated that the empirical formula suggested by Wright [19] is limited to soils with a clay fraction greater than 50%.
In recent years, Machine Learning (ML) techniques have developed rapidly and have been successfully applied in many fields of civil engineering [20-27] and Earth sciences [28-31], including geotechnical engineering problems such as landslide susceptibility [32-41] and the estimation of soil parameters [42-47], including the shear strength of soil [47-52]. In the work of Das et al. [53], the authors successfully applied an Artificial Neural Network (ANN) to estimate the residual friction angle of tropical soil in a specified area. Besides, it was found that the Support Vector Machine (SVM) performed better than ANN in estimating the shear strength of soil from basic soil parameters, such as liquid limit, plastic limit, and clay fraction. In another work, Besalatpour et al. [54] showed that Adaptive-Network-based Fuzzy Inference System (ANFIS) and ANN models had higher predictive ability than conventional regression methods. In another study, three new optimization techniques, namely, the Dragonfly Algorithm (DA), Invasive Weed Optimization (IWO), and Whale Optimization Algorithm (WOA), were employed to optimize the weights and biases of an ANN structure in estimating the shear strength of soil [50], where it was noticed that the learning error decreased significantly. Thus, the IWO-ANN hybrid algorithm was found to be a promising alternative to conventional methods for solving soil shear strength problems. Further, Moayedi et al. [49] used four neural-metaheuristic models for estimating the shear strength of soil and stated that the Salp Swarm Algorithm-Multilayer Perceptron (SSA-MLP) model is a potential alternative method for estimating the soil shear strength. In general, ML techniques have significantly improved the prediction ability compared to conventional methods.
Despite the significant growth of research applying ML algorithms in soil science, it is surprising how few of these studies investigate how performance depends on a combination of factors during the model development phase. These factors could be the choice of data splitting, the selection of the sampling technique, or the ML algorithm. For instance, a study comparing ML techniques in digital soil mapping found that sample design and model choice significantly affected the outputs [55]. With regard to data splitting, the data sample is often divided into two datasets: a training set for model training and a testing set for model validation. Many researchers proposed a ratio of 70/30 or 80/20 (training/testing set) for producing datasets in landslide susceptibility problems [56-61]. In studies on estimating the residual strength of soil using ML algorithms, previous works mainly used ratios of 70/30, 80/20, and 90/10 (training/testing) for generating datasets [22,43,47-49,51-53]. Recently, Pham et al. [47] estimated the shear strength of soil while varying the training dataset size from 30% to 90% using the Random Forest (RF) algorithm. The study revealed that increasing the size of the training dataset improved the training performance and made the model more stable. Increasing the training set size from 30% to 80% also enhanced the testing performance. However, when the training size increased from 80% to 90%, the opposite trend was found in the testing performance. In general, the training set size had an important effect on the prediction ability of the ML models [62]. The main objective of the present study is to evaluate the performance of ML models under different soil data splitting ratios for the prediction of soil shear strength.
In this research, three ML techniques, namely, ANN, Extreme Learning Machine (ELM), and Boosting algorithm, were adopted to estimate the soil shear strength based on different splitting ratios of input data for the training and testing phases. The main difference between this study and previously published works is that, for the first time, the influence of the strategy for splitting training and testing datasets in ML models was investigated for the prediction of soil shear strength. Results were evaluated using standard statistical measures, namely, Mean Absolute Error (MAE), Correlation Coefficient (R), and Root Mean Squared Error (RMSE), to select the best model for predicting the soil shear strength and to study the influence of different ratios of training and testing data on the performance of the models.

Research Significance
ML, which includes advanced soft-computing-based techniques, has been developed and applied successfully and efficiently to solve many real-world problems [63-68]. The main advantage of ML is that it can objectively analyze very large amounts of data and give reliable outcomes and assessments [69]. However, its performance depends significantly on the quality of the data and the strategy for using the data [70-72]. Therefore, assessing the influence of data splitting on the performance of ML models is highly significant, as it will show how to select a suitable data splitting for better ML-based modeling. In this study, we have selected three popular ML models, namely, ANN, ELM, and Boosted, for modeling. In addition, we have selected a research problem, "the prediction of soil shear strength," which is an important geotechnical engineering task [43,46,47,73]. This will help construction engineers and managers to quickly and accurately predict the soil shear strength, which can be used for the design and verification of construction projects.

Data Used
Soil investigation data of the Long Phu 1 power plant project, located in Soc Trang province, Vietnam (latitude of 9°59′07.3″N and longitude of 106°04′48.0″E), was used in this study for the development of the ML models.
The construction of this power plant started in June 2015 as a key project under the Vietnamese Government's 2011-2020 National Power Development Plan [73]. A database of 538 soil samples was used to build the training and testing datasets. Soil parameters, namely, clay content (%), void ratio, moisture content (%), liquid limit (%), plastic limit (%), and specific gravity, were used as input variables, whereas the soil shear strength (kg/cm²) determined by the direct shear test under the Unconsolidated Undrained (UU) scheme was used as the output variable.
Because the variables have different ranges (Figure 1), their values were scaled to the range [0, 1] to avoid unexpected jumps and reduce fluctuations within the datasets used for modeling.
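As a minimal sketch of the [0, 1] scaling described above, assuming ordinary min-max normalization applied variable by variable, each value x can be mapped to (x - min)/(max - min). The source does not state the exact formula, nor whether the minima and maxima were taken over the full dataset or over the training set only, so both are assumptions here. The function name and sample values are hypothetical.

```python
def min_max_scale(columns):
    """Scale each column (a list of floats) to the range [0, 1].

    Assumption: plain min-max normalization per variable; a constant
    column is mapped to 0.0 to avoid division by zero.
    """
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled.append([(v - lo) / span if span else 0.0 for v in col])
    return scaled


# Hypothetical moisture content values in %, not taken from the database.
moisture_content = [20.0, 35.0, 50.0]
print(min_max_scale([moisture_content]))  # [[0.0, 0.5, 1.0]]
```

If the scaling bounds are computed on the training set only, the same bounds would then be reused for the testing set, which avoids leaking test information into training.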

Artificial Neural Network (ANN)
ANN is a popular and powerful machine learning technique (computational model) [74,75] based on the structure and function of biological neural networks, that is, the nervous system of the human brain [20,76-78]. This method has been used successfully in solving a wide range of civil engineering problems, including geotechnical engineering problems. ANN is used to identify linear and nonlinear relationships between input and output neurons [21,22,79]. Thus, ANN can make decisions by analyzing patterns and relationships in the data by itself [2,43,80]. In this study, a multilayer perceptron neural network, a popular ANN [81], was employed as a regression technique to estimate the soil shear strength.

Boosting Trees (Boosted)
Boosted (Trees) is a hybrid method that combines decision trees with the boosting method. In this ensemble-type method, decision trees link input and output variables through recursive binary splits, while boosting combines many individual models to improve the performance of the hybrid model [82].
The Boosted method, having the merits of tree-based techniques, can overcome the disadvantages of a single tree model for the following reasons. First, the ensemble can choose proper variables to fit the appropriate functions. Second, it is suitable for various types of data using random boosting. Finally, it can mitigate both bias and variance via model averaging [83].

Extreme Learning Machine (ELM)
ELM, first suggested by Huang et al. [84,85], is a modern algorithm employed as a Single-hidden-Layer Feedforward Neural Network (SLFN) [86]. The ELM algorithm learns faster than conventional algorithms such as backpropagation and the least-squares support vector machine [61,84,87]. The main aim of ELM is to find the output weights with the smallest norm among those that reach the smallest training error, in order to optimize the model performance. A detailed description of the ELM algorithm is available in published papers [84,88-90].
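To make the idea concrete, the following is a minimal sketch of an ELM regressor and not the implementation used in this study. It assumes a sigmoid hidden layer whose weights and biases are drawn uniformly from [-1, 1] and then kept fixed, and it solves for the output weights by regularized least squares; the small ridge term is an assumption used here to approximate the minimum-norm solution in a numerically stable way. All function names are hypothetical, and the default of 8 hidden neurons follows the network size reported for ELM in this paper.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def hidden_layer(x_rows, weights, biases):
    """Sigmoid activations of the fixed random hidden layer."""
    return [
        [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(w_j, row)) + b_j)))
         for w_j, b_j in zip(weights, biases)]
        for row in x_rows
    ]


def train_elm(x_rows, y, n_hidden=8, ridge=1e-6, seed=0):
    """Draw random hidden weights, then fit only the output weights."""
    rng = random.Random(seed)
    n_in = len(x_rows[0])
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    h = hidden_layer(x_rows, weights, biases)
    # Regularized normal equations: (H^T H + ridge * I) beta = H^T y
    hth = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
            for j in range(n_hidden)] for i in range(n_hidden)]
    hty = [sum(row[i] * t for row, t in zip(h, y)) for i in range(n_hidden)]
    beta = solve_linear(hth, hty)
    return weights, biases, beta


def predict_elm(model, x_rows):
    """Predict outputs as a linear combination of hidden activations."""
    weights, biases, beta = model
    return [sum(b * a for b, a in zip(beta, row))
            for row in hidden_layer(x_rows, weights, biases)]
```

Because only the output weights are fitted, training reduces to a single linear solve, which is the source of the speed advantage over iterative backpropagation mentioned above.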

Monte Carlo Approach.
The Monte Carlo method has been widely used to solve problems relating to the variability of input parameters in various fields, including geotechnical engineering [45,91,92]. Monte Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. This technique can statistically characterize relationships in data for both linear and nonlinear problems [45,91]. It is implemented by repeatedly drawing random input variables according to their probability density distributions and computing the corresponding outputs via a simulated model [93,94]. The concept of the Monte Carlo method includes the following: (i) the variability of the input parameters can be fully propagated through predetermined models and (ii) the sensitivity to the inputs can be evaluated by statistical analysis of the output results.
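In the context of data splitting, the repeated random sampling described above can be sketched as repeated random train/test partitions at a fixed ratio, with the model refitted and scored on each partition. This is only an illustrative outline under stated assumptions: the helper names are hypothetical, the one-variable least-squares line stands in for the ML models of this study, and the synthetic data are not the soil database.

```python
import random


def random_split(n_samples, train_fraction, rng):
    """Return shuffled, disjoint training and testing index lists."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]


def monte_carlo_scores(x, y, train_fraction, n_runs, fit, evaluate, seed=0):
    """Refit and score a model on n_runs random splits of the same data."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        train_idx, test_idx = random_split(len(y), train_fraction, rng)
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(
            evaluate(model, [x[i] for i in test_idx], [y[i] for i in test_idx])
        )
    return scores


def fit_line(xs, ys):
    """Stand-in model (assumption): least-squares line y = slope * x + b."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((v - mean_x) ** 2 for v in xs)
    sxy = sum((v - mean_x) * (t - mean_y) for v, t in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def line_mae(model, xs, ys):
    """Mean absolute error of the stand-in model on held-out samples."""
    slope, intercept = model
    return sum(abs(slope * v + intercept - t) for v, t in zip(xs, ys)) / len(ys)


# Synthetic, noisy linear data used only to exercise the loop.
data_rng = random.Random(42)
xs = [data_rng.random() for _ in range(200)]
ys = [0.8 * v + 0.1 + data_rng.gauss(0.0, 0.05) for v in xs]
testing_mae = monte_carlo_scores(xs, ys, 0.7, 100, fit_line, line_mae)
```

The list of scores returned for each ratio is what allows the mean, standard deviation, and quantiles of a criterion to be compared across training/testing ratios.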

Performance Evaluation Criteria.
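In this paper, standard statistical measures, namely, Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Correlation Coefficient (R), were used to compare and validate the performance of ML models [47,95]. In general, RMSE is the square root of the mean squared difference between the estimated and actual values, while MAE is the mean magnitude of the errors. Lower values of RMSE and MAE mean higher prediction ability of the models. Besides, R is employed to evaluate the correlation between the estimated and actual values. These measures are computed as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{co,i} - y_{ac,i}\right)^{2}},$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_{co,i} - y_{ac,i}\right|,$$

$$R = \frac{\sum_{i=1}^{n}\left(y_{co,i} - \bar{y}_{co}\right)\left(y_{ac,i} - \bar{y}_{ac}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_{co,i} - \bar{y}_{co}\right)^{2}\sum_{i=1}^{n}\left(y_{ac,i} - \bar{y}_{ac}\right)^{2}}},$$

where $y_{co,i}$ and $\bar{y}_{co}$ represent the output value of the ith sample and the corresponding output mean value computed by the ML model, respectively; $y_{ac,i}$ and $\bar{y}_{ac}$ denote the measured value of the ith sample and the measured mean value, respectively; and n indicates the total number of samples.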
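The three criteria named above can be computed directly from paired lists of measured and computed values. The sketch below uses the standard textbook definitions of RMSE, MAE, and the Pearson correlation coefficient, which are assumed to match those used in this study; the function names and the sample values are hypothetical.

```python
import math


def rmse(actual, computed):
    """Root mean square error between measured and computed values."""
    n = len(actual)
    return math.sqrt(sum((c - a) ** 2 for a, c in zip(actual, computed)) / n)


def mae(actual, computed):
    """Mean absolute error between measured and computed values."""
    n = len(actual)
    return sum(abs(c - a) for a, c in zip(actual, computed)) / n


def correlation(actual, computed):
    """Pearson correlation coefficient R of measured vs computed values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_c = sum(computed) / n
    cov = sum((a - mean_a) * (c - mean_c) for a, c in zip(actual, computed))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_c = sum((c - mean_c) ** 2 for c in computed)
    return cov / math.sqrt(var_a * var_c)


# Hypothetical values: three exact predictions and one that is off by 2.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
print(rmse(measured, predicted))  # 1.0
print(mae(measured, predicted))   # 0.5
```

The example illustrates why both error measures are reported: a single large error raises RMSE more than MAE because the errors are squared before averaging.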

Results and Analysis
In this section, the prediction results of the soil shear strength obtained with the ML models (ANN, ELM, and Boosted) are presented. In the modeling, clay content, void ratio, moisture content, liquid limit, plastic limit, and specific gravity were considered as input variables, whereas soil shear strength was considered as the output variable. First, the influence of the training/testing ratio on the performance of the ML models is presented, followed by the random sampling effects on the performance of the ML models, and finally the different ML models are compared. The ANN results for the different training/testing ratios are shown in Figure 2.

Influence of Different Training and Testing Ratios on the Performance of ANN.
It can be seen that as the number of samples in the training datasets increased, the errors (RMSE and MAE) of the ANN model increased and its R values decreased, showing that the accuracy of ANN decreased (Figures 2(a), 2(c), and 2(e)). In contrast, as the number of samples in the testing datasets increased, the errors (RMSE and MAE) of ANN decreased and its R values increased, reflecting an increase in ANN accuracy (Figures 2(b), 2(d), and 2(f)). Based on the mean, standard deviation, and quantile levels of the three criteria, the ANN model performed best on both training and testing datasets at the training/testing ratio of 70/30.

Random Sampling Effects on the Performance of ANN.
To validate the random sampling effects on the performance of the ML models, the ANN model was trained on different training/testing ratios using Monte Carlo simulation. In this process, 1000 simulations were carried out to validate the statistical convergence of the model, as shown in Figure 3. It can be seen that the RMSE and MAE values were stable within 10% of the average values after only 10 iterations and within 5% of the average after 20 Monte Carlo iterations. Besides, the values of R were statistically stable within 2% of the average after 8 iterations and within 1% of the average after 50 iterations.
In addition, the probability density of the R, RMSE, and MAE values was analyzed to study the random sampling effects on the performance of the ANN model (Figure 4). It can be observed that the probability density distributions of the R, RMSE, and MAE values differed across training/testing ratios.
In general, it can be stated that the performance of the ANN model is sensitive to the random selection of data in the datasets used for training and validating the model. In this study, the ANN model converged after more than 700 Monte Carlo simulations, and the training/testing ratio of 70/30 was found to be the best option for ML modeling.
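The convergence check discussed in this subsection can be sketched as follows. The sketch assumes that convergence is judged from the running mean of a criterion, normalized by its mean over all simulations, and that a value is called stable once this ratio stays within a given tolerance of 1; the source does not give the exact formula, so this definition and the function names are assumptions.

```python
def normalized_running_mean(values):
    """Running mean after each run, divided by the mean over all runs."""
    final_mean = sum(values) / len(values)
    total = 0.0
    curve = []
    for k, v in enumerate(values, start=1):
        total += v
        curve.append((total / k) / final_mean)
    return curve


def runs_until_stable(values, tolerance):
    """Smallest run count from which the normalized running mean stays
    within `tolerance` of 1.0 for all remaining runs."""
    last_outside = 0
    for k, ratio in enumerate(normalized_running_mean(values), start=1):
        if abs(ratio - 1.0) > tolerance:
            last_outside = k
    return last_outside + 1


# Hypothetical criterion values from five runs (not results of this study).
scores = [2.0, 1.0, 1.0, 1.0, 1.0]
print(runs_until_stable(scores, 0.05))  # 4
print(runs_until_stable(scores, 0.20))  # 3
```

Applied to the 1000 RMSE, MAE, or R values of one training/testing ratio, such a function returns the number of simulations needed for a chosen tolerance, for example 0.10 or 0.05.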

Validation and Comparison of Different ML Models.
Validation and comparison of the three ML models (i.e., ANN, ELM, and Boosted) were conducted using the best training/testing ratio of 70/30. The ANN was trained with the parameters provided in Table 1, whereas ELM was trained with a network consisting of one input layer (6 neurons), one hidden layer (8 neurons), and one output layer (1 neuron). For the Boosted algorithm, the minimum leaf size was taken as 8, the number of learning cycles was 20, and the learning rate was set at 0.1. Values of R, RMSE, and MAE of the models on the testing dataset are shown in Table 2. Overall, it can be stated that the ANN model is the best and most stable model compared with the other models (Boosted and ELM) for the prediction of soil shear strength.

Discussion
ML models are known as advanced techniques for quick and accurate prediction in real-world problems. These models, based on objective computational algorithms, can handle complex relationships between input and output variables [97]. However, ML models are quite sensitive to the quality of the data and the way the data are used in the modeling process, especially the ratio used to divide the datasets for training and validating the ML models [98]. In this study, this problem is analyzed by investigating the influence of the training/testing ratio on the performance of three popular ML models, namely, ANN, ELM, and Boosted, in predicting the soil shear strength.
Overall, the results showed that the performance of the ML models changed significantly under different training/testing ratios. The results showed that the training/testing ratio of 70/30 was the most suitable one for training and validating the ML models. This finding is in line with other published works, such as Pham et al. [99], who investigated different training/testing ratios for training and validating various ML models (SVM, Logistic Regression, ANN, and Naïve Bayes) for spatial prediction of landslides and found that 70/30 was the training/testing ratio giving the best performance of the ML models. Other studies also confirmed the finding of this study [100-105]. In addition, it is noticed that when the percentage of data in the training dataset increased, the errors (RMSE and MAE) of the models increased and the R values decreased. Thus, an increase of data (or samples) in the training dataset might reduce the prediction accuracy and make the models more difficult to apply.
Besides, the validation and comparison results showed that all the ML models performed well, but ANN was the best model for the prediction of soil shear strength. It can be stated that the ANN model has been reaffirmed as the best single ML model for solving most real-world problems [106,107]. ANN has several advantages compared with other ML models: (i) it can extract the essential process information from data for analysis and prediction, (ii) it can generalize from data, (iii) it can correctly process information that only broadly resembles the original training data, and (iv) its essential features relate to nonlinearity, fault tolerance, independence from assumptions, and universality. Thus, the ANN algorithm is particularly suitable for extremely complex data. Last but not least, ANN is an adaptive algorithm, so the learning process can be more effective [108,109]. Therefore, it can be stated that ANN was the best predictor of soil shear strength.

Conclusions
Soil shear strength is one of the most critical geotechnical engineering properties used for designing and constructing civil engineering structures. Prediction of this parameter using advanced ML models might help in saving time and reducing cost for construction projects. In this study, three popular ML models, namely, ANN, ELM, and Boosted, were applied and compared to predict the soil shear strength using a database collected from the Long Phu 1 power plant project, Vietnam. In addition, the performance of these models was investigated under different training and testing ratios over 1000 Monte Carlo simulations.
Validation and comparison results showed that although all models performed well, ANN performed best. It can also be observed that the performance of the models changed significantly under the different training and testing ratios used for training and validating the models. Based on the statistical analysis, a ratio of 70/30 for training and testing datasets was considered the best ratio for training and validating the models. In addition, Monte Carlo simulations showed that the performance of the models varied under the random sampling effect over 1000 simulations. ANN was found to be the best and most stable method under the variability of the input space.
In short, civil engineers can use the results of this study for quick and accurate prediction of soil shear strength for design purposes, for instance, for roads, bridges, retaining walls, and other geotechnical and civil structures. Although the single group of data used in this study is sufficient for the development of the ML models, it is recommended that these ML models be applied and validated with various data from different regions for better justification and verification. However, these models are black-box models and do not provide equations for engineers' calculations; therefore, other ML models such as GEP, GMDH, and EPR, which can provide equations, can be considered for future application and comparison.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.