Liquefaction Evaluation Based on Shear Wave Velocity Using Random Forest

Evaluating earthquake-induced liquefaction of sands is important for engineers in seismic design. In this study, the random forest (RF) method is adopted to evaluate the seismic liquefaction potential of soils based on shear wave velocity. The RF model was developed using the Andrus database as the training dataset, comprising 225 sets of liquefaction performance and shear wave velocity measurements. Five training parameters are selected for the RF model: seismic magnitude (M_w), peak horizontal ground surface acceleration (a_max), stress-corrected shear wave velocity of soil (V_s1), sandy-layer buried depth (d_s), and a newly introduced parameter, the stress ratio (k). In addition, the optimal hyperparameters of the random forest model, namely the number of classification trees, the maximum depth of trees, and the maximum number of features, are determined based on the minimum error rate for the out-of-bag dataset (ERR_OOB). The established random forest model was validated using the Kayen database as the testing dataset and compared with the Chinese code method and the Andrus method. The results indicated that the random forest model established from the training dataset was credible. The random forest method gave a success rate for liquefied sites, and a total success rate for all cases, higher than 80%, which is acceptable. By contrast, the Chinese code method and the Andrus method gave a high success rate for liquefied sites but a very low one for nonliquefied sites, which leads to increased engineering cost. The developed RF model can provide a reference for engineers evaluating liquefaction potential.


Introduction
Soil liquefaction occurs when saturated sand loses strength and modulus because excess pore pressure builds up under strong earthquake loading. It was not until the 1964 Niigata earthquake in Japan and the Alaska earthquake in the United States that the harm of sand liquefaction was fully realized. These two earthquakes damaged many buildings and caused loss of life and property, which highlighted the need to study the liquefaction potential of saturated sand.
Many scholars have presented different methods for liquefaction evaluation based on in-situ tests such as shear wave velocity (V_s) [1][2][3][4][5], standard penetration test (SPT) [6][7][8][9], and cone penetration test (CPT) [10][11][12][13]. Compared with SPT-based and CPT-based liquefaction evaluation methods, the advantage of V_s-based methods is that they are less sensitive to soil characteristics [4]. In addition, shear wave velocity tests can be easily performed in soils, while SPT and CPT tests are limited by soil type, for example in gravelly sands.
These advantages make the V_s-based method difficult to replace and give it good prospects for liquefaction evaluation. At present, the most widely accepted liquefaction evaluation method based on shear wave velocity is the one proposed by Andrus and Stokoe [1], who established liquefaction assessment curves based on the correlation between CRR (liquefaction resistance) and V_s. Kayen et al. [2] further presented updated V_s-based liquefaction assessment curves based on 422 case histories and concluded that the correlation between CRR and V_s is insensitive to fines content (FC). Shen et al. [14] established updated liquefaction assessment curves based on 261 case histories, combining the Andrus database with newly collected cases from the Canterbury earthquake. However, these V_s-based methods are commonly empirical or semiempirical and are easily limited by local data. In addition, several factors influence liquefaction evaluation, such as sandy-layer buried depth (d_s), ground water table depth (d_w), and peak horizontal ground surface acceleration (a_max). Unfortunately, the existing V_s-based methods cannot clearly explain the relationship between these influencing factors and liquefaction potential.
With the continuous development of computer technology, machine learning methods such as the adaptive neuro-fuzzy inference system (ANFIS), artificial neural network (ANN), and support vector machine (SVM) make it possible to solve the above problems. The advantage of machine learning methods is that the relationship between input and output variables does not need to be specified in advance; accurate predictions can be obtained from the collected data alone. Xue and Yang [15] adopted the ANFIS model for the assessment of liquefaction potential, which provided more accurate results than traditional empirical methods, including the Seed simplified method [16] and the CPT-based method proposed by Stark and Olson [17]. Hanna et al. [18] established an ANN model for liquefaction evaluation utilizing 12 parameters related to soil and seismic characteristics. Zhao et al. [19] employed the SVM method to assess soil liquefaction based on SPT and CPT data, using particle swarm optimization (PSO) [20] to search for the kernel functions and training parameters. These methods give satisfactory results compared with traditional empirical methods, but some shortcomings remain. For example, the ANN and ANFIS approaches are time consuming when selecting optimal parameters because the number of training parameters is excessive, and the models are prone to overfitting. The SVM method is difficult to apply to large-scale training data and is sensitive to the choice of parameters and kernel function. At present, there is still no effective method to solve these problems. Moreover, machine learning models established from a single dataset may not perform well on other liquefaction datasets [21]. Kohestani et al. [22] also stated that machine learning methods have a limited domain of applicability and are mostly case dependent. Therefore, it is necessary to improve these existing methods or to seek other, more advanced methods for liquefaction evaluation.
The random forest (RF) is an ensemble learning algorithm developed by Breiman [23] based on a combination of a large set of decision trees. The advantage of the random forest is that its hyperparameters are simpler to select and it is less prone to overfitting [22,24]. The RF method has been successfully used to solve geotechnical engineering problems such as landslides [25], ground surface settlements [24,26], the prediction of soil shear strength [27], and the bearing capacity of foundations [28]. However, few studies have reported applications of the RF model to liquefaction evaluation. Kohestani et al. [22] reported the evaluation of liquefaction potential based on CPT data using the RF method. Nejad et al. [29] established an RF model for predicting the occurrence or nonoccurrence of liquefaction based on the shear wave velocity data collected by Kayen et al. [2]. However, these RF liquefaction models have not been well verified on other datasets, since each relied on a single limited dataset that was randomly split into training and testing sets.
In this study, liquefaction evaluation was conducted based on shear wave velocity by using the random forest. A total of 225 cases from Andrus et al. [1] and 336 cases from Kayen et al. [2], covering liquefaction and nonliquefaction histories, are used as the training dataset and the testing dataset, respectively. The performance of the RF model proposed in this study is compared with that of the Chinese code method [30] and the method proposed by Andrus et al. [1].

Random Forest.
The random forest (RF) is an intelligent recognition method based on statistical learning theory [23]. Many predictors are generated following the strategy of ensemble learning, and they can be applied to classification and regression problems through classification trees and regression trees, respectively. The purpose of this study is to evaluate the liquefaction potential of soil. There are two possible results of liquefaction evaluation, liquefaction or nonliquefaction; thus, the classification tree is adopted herein. Figure 1 shows the general architecture of the random forest for classification, where X is the training dataset, n is the number of classification trees (represented by n_estimators in the Scikit-learn machine learning library), and max_depth is the maximum depth of the tree, which plays an important role in controlling the complexity and size of the tree. The construction process of the random forest is summarized as follows.
(1) New training subsets X_n are generated by randomly drawing samples with replacement from the original training dataset (X). Each training subset (called a bootstrap sample) contains approximately two-thirds of the elements of X, and the remaining elements are called out-of-bag (OOB) samples. OOB samples can be used to evaluate the generalization ability of the random forest model through the OOB error rate (ERR_OOB), which can also be used to determine the optimal hyperparameters of the random forest model [28], such as n_estimators, max_depth, and max_features (the number of training parameters in the random subset considered at each node).
(2) For each subset X_n, a classification tree is grown. At each node of the tree, rather than choosing the best split among all training parameters, max_features training parameters are randomly selected and the best split is chosen among them based on the value of the Gini index [31,32]:
$$\mathrm{Gini}(X_n) = 1 - \sum_{i=1}^{t} P_i^{2}$$
where P_i represents the probability of the i-th class in X_n and t is the number of classes, which is 2 for liquefaction evaluation since the classification result has only two types: liquefaction or nonliquefaction.
(3) The random forest produces a classification result K_t for each classification tree. The final classification result (K) is obtained by majority voting over the trees [22].
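The three construction steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names `bootstrap_split`, `gini_index`, and `majority_vote` are hypothetical, and the Gini index is taken in its standard form, one minus the sum of squared class proportions.

```python
import random
from collections import Counter


def bootstrap_split(n_samples, rng):
    """Step (1): draw n_samples indices with replacement.

    Returns the in-bag (bootstrap) indices and the out-of-bag indices.
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    oob = [i for i in range(n_samples) if i not in drawn]
    return in_bag, oob


def gini_index(labels):
    """Step (2): Gini index 1 - sum(P_i ** 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def majority_vote(tree_predictions):
    """Step (3): return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
in_bag, oob = bootstrap_split(225, rng)      # 225 = size of the training set
unique_fraction = len(set(in_bag)) / 225     # share of distinct cases in the bag

print(f"distinct in-bag fraction: {unique_fraction:.2f}")
print("Gini of a 50/50 node:", gini_index([1, 1, 0, 0]))
print("vote of three trees:", majority_vote([1, 0, 1]))
```

A pure node (all cases liquefied or all nonliquefied) has a Gini index of 0, and an evenly mixed two-class node has 0.5, so a split that lowers the index separates the two classes better.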

Database.
The database published by Andrus et al. [1] (hereafter called the Andrus database) was used as the training dataset in this study to develop the liquefaction evaluation model using the random forest. Numerous scholars have used the Andrus database to develop new models for liquefaction evaluation, and its authority is widely recognized [5,33]. The fundamental parameters, such as M_w (moment magnitude), d_s (sandy-layer buried depth), d_w (depth of ground water table), V_s1 (overburden pressure-corrected shear wave velocity), and a_max (peak horizontal ground surface acceleration), are provided in Table 1. As shown in Table 1, the Andrus database contains 225 cases of shear wave velocity data from 26 earthquakes and more than 70 sites, including 96 liquefaction and 129 nonliquefaction cases. According to the Chinese seismic intensity table [34], the dataset is divided into seismic intensities VI, VII, VIII, and IX, as shown in Table 2. The database published by Kayen et al. [2] (hereafter called the Kayen database), given in Table 3, also consists of soil and seismic parameters including M_w, d_s, d_w, V_s1, and a_max. The Kayen database, used as the testing dataset, consists of 415 case studies collected from 256 sites of nine earthquakes mainly distributed in Asia, Greece, the United States, and China. After excluding 79 cases that duplicate the Andrus database, a total of 336 cases were used as the testing dataset; their classification is shown in Table 4.

Selection of Training Parameters.
The accuracy of the random forest method for liquefaction evaluation depends strongly on the selection of influencing parameters, which can be divided into three categories: the intensity of ground motion, the buried condition of the soil layer, and the compactness of the soil.
There are 9 liquefaction influencing parameters in the Andrus and Kayen databases. Among these parameters, the seismic magnitude (M_w), peak horizontal ground surface acceleration (a_max), and cyclic stress ratio (CSR) reflect the intensity of ground motion; the sandy-layer buried depth (d_s), depth of water table (d_w), total vertical stress of soil at the depth considered (σ_r), effective vertical stress of soil at the same depth (σ_r′), and shear stress reduction factor (r_d) reflect the buried condition of the soil layer; and the stress-corrected shear wave velocity (V_s1) characterizes the compactness of the soil. Some of these parameters are coupled and strongly correlated with each other, which reduces the accuracy of the model. Therefore, it is necessary to analyze the correlation of these parameters and select training parameters that are less correlated with the others to establish the random forest model. Figure 2 shows the correlation matrix of these parameters. It can be seen that the four parameters d_s, d_w, σ_r, and σ_r′ that reflect the buried condition of soils are strongly correlated, which will reduce the accuracy of the model. Therefore, a new parameter, k, is introduced herein to reflect the buried condition of the soil layer. Yao et al. [35] proposed the stress ratio, k, to establish a logistic regression model for liquefaction evaluation and achieved satisfactory results; the definition of k given in [35] is adopted here. Figure 2 shows that, compared with d_w, k is less correlated with σ_r and σ_r′. In addition, r_d depends only on the buried depth of sand, d_s. Thus, the stress ratio, k, and the sandy-layer buried depth, d_s, are selected as training parameters for the random forest model to reflect the buried condition of the soil layer. CSR and a_max are highly correlated, while a_max is more easily obtained than CSR.
Thus, a_max and M_w are selected to reflect the intensity of ground motion. Moreover, the corrected shear wave velocity of soil, V_s1, reflects the compactness of the soil and is of great significance for establishing the liquefaction evaluation model. In summary, five factors, d_s, k, a_max, M_w, and V_s1, are used as training parameters in the random forest model. Table 5 shows the statistical information of the training parameters used for model training; a normal distribution is assumed for these parameters.
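The correlation screening described above can be sketched in a few lines of Python. The sketch assumes the Pearson coefficient is the correlation measure, which the text does not state, and uses invented values rather than records from the Andrus or Kayen databases. The 0.8 threshold and the names `pearson` and `correlation_matrix` are likewise illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict of pairwise Pearson coefficients between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values for illustration only (not taken from either database).
columns = {
    "d_s": [3.0, 4.5, 6.0, 7.5, 9.0],            # buried depth, m
    "sigma": [55.0, 84.0, 110.0, 141.0, 168.0],  # total vertical stress, kPa
    "a_max": [0.16, 0.36, 0.12, 0.25, 0.20],     # peak acceleration, g
}

corr = correlation_matrix(columns)

# Flag pairs whose absolute correlation exceeds an assumed threshold of 0.8.
strong = [(a, b) for a in corr for b in corr if a < b and abs(corr[a][b]) > 0.8]
print(strong)
```

In this toy data the depth and the total stress are nearly collinear, so only one of them would be kept, which mirrors the reasoning used above to replace the correlated burial parameters.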

Optimizing the Random Forest Hyperparameters.
The best prediction accuracy of the random forest model can be obtained by tuning its hyperparameters. As presented in Section 2.1, three hyperparameters need to be optimized: n_estimators, max_depth, and max_features. The value of ERR_OOB was calculated for each combination of hyperparameters, and the combination corresponding to the minimum ERR_OOB was chosen as optimal [36].
The value of n_estimators was searched over the range 1-100 and max_depth over the range 1-20. Usually, max_features is smaller than M, the number of training parameters, which is 5 in this study as discussed above. The starting value of max_features is [log2(M) + 1] [36], which is then decreased and increased until the minimum ERR_OOB is obtained. The optimal hyperparameters were determined as max_features = 2, max_depth = 7, and n_estimators = 13. The same optimal hyperparameters were used in the random forest model to evaluate the liquefaction potential based on the new Kayen database [2]. The predictions are produced with Scikit-learn [37] in the Python programming language.
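A hedged sketch of this kind of search with Scikit-learn is given below. It runs on synthetic data from `make_classification`, not on the Andrus database, and it scans a small illustrative grid rather than the full ranges quoted above, so the selected values will not match those reported here. The helper name `oob_error` is hypothetical.

```python
import warnings
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with 225 cases and 5 features (NOT the Andrus database).
X, y = make_classification(n_samples=225, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)


def oob_error(n_estimators, max_depth, max_features):
    """Fit a forest and return its out-of-bag error rate, 1 - oob_score_."""
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        max_features=max_features,
        bootstrap=True,
        oob_score=True,
        random_state=0,
    )
    # Small forests can leave a few samples without any OOB prediction,
    # which makes Scikit-learn emit warnings; they are silenced here.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        model.fit(X, y)
    return 1.0 - model.oob_score_


# Illustrative grid: (n_estimators, max_depth, max_features).
grid = product((10, 25, 50), (3, 7, 11), (1, 2, 3))
best_error, best_params = min((oob_error(*params), params) for params in grid)

print("minimum OOB error:", round(best_error, 3))
print("n_estimators, max_depth, max_features:", best_params)
```

Because the OOB samples of each tree are never seen by that tree during training, this selection needs no separate validation split, which matters for a dataset of only a few hundred cases.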

Results and Analysis
The random forest model trained on the Andrus database is compared with the Chinese code method and the Andrus method using the same database. Accordingly, these two methods are briefly introduced first. The established random forest model is then validated by evaluating actual cases from the Kayen database, which was not used in the training process, and comparing the results with those of the two methods.

The Chinese Code Method for Liquefaction Evaluation.
The critical shear wave velocity, V_scri, is used as an index to evaluate soil liquefaction in the Chinese code [30]. Its expression includes V_s0, a reference value of shear wave velocity, which is 65 m/s, 90 m/s, and 130 m/s for seismic intensity VII, VIII, and IX, respectively. When the stress-corrected shear wave velocity, V_s1, is smaller than V_scri, the site is evaluated as liquefied; otherwise, it is nonliquefied. It should be noted that the Chinese code method does not evaluate sites in seismic intensity VI and X. Therefore, the cases in seismic intensity VI and X were excluded from comparisons among different methods.

Andrus et al. [1] proposed a method for liquefaction evaluation, called the Andrus method hereafter. In this method, CRR is the soil liquefaction resistance, V_s is the measured shear wave velocity, P_a is a reference stress of 100 kPa, and V*_s1 is the limiting upper value of V_s1. Once CSR > CRR, liquefaction is predicted to occur; otherwise, liquefaction is not predicted. More details about the method can be found in reference [1].
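For orientation, the CSR versus CRR check can be sketched in Python using the expressions commonly published for the Andrus and Stokoe procedure and the Seed-Idriss simplified CSR. The coefficients 0.022 and 2.8, the magnitude scaling exponent -2.56, and the clean-sand limit of 215 m/s are assumptions taken from that literature rather than from the text above, and all function names are hypothetical. The exact form used in this study should be checked against reference [1].

```python
def stress_corrected_vs(vs, sigma_v_eff, p_a=100.0):
    """V_s1 = V_s * (P_a / sigma'_v) ** 0.25, with stresses in kPa."""
    return vs * (p_a / sigma_v_eff) ** 0.25


def crr_andrus_stokoe(vs1, m_w, vs1_limit=215.0, a=0.022, b=2.8):
    """Liquefaction resistance from V_s1 (assumed published coefficients).

    CRR = {a (V_s1/100)^2 + b [1/(V*_s1 - V_s1) - 1/V*_s1]} * MSF,
    with MSF = (M_w / 7.5) ** -2.56.
    """
    if vs1 >= vs1_limit:
        return float("inf")  # at or above the limiting velocity
    msf = (m_w / 7.5) ** -2.56
    crr_75 = a * (vs1 / 100.0) ** 2 + b * (1.0 / (vs1_limit - vs1) - 1.0 / vs1_limit)
    return crr_75 * msf


def csr_simplified(a_max_g, sigma_v, sigma_v_eff, r_d):
    """Simplified cyclic stress ratio: 0.65 (a_max/g) (sigma_v/sigma'_v) r_d."""
    return 0.65 * a_max_g * (sigma_v / sigma_v_eff) * r_d


def predicts_liquefaction(csr, crr):
    """Liquefaction is predicted when the demand exceeds the resistance."""
    return csr > crr


# Hypothetical site, for illustration only.
vs1 = stress_corrected_vs(vs=140.0, sigma_v_eff=60.0)
crr = crr_andrus_stokoe(vs1, m_w=7.0)
csr = csr_simplified(a_max_g=0.25, sigma_v=95.0, sigma_v_eff=60.0, r_d=0.95)
print(predicts_liquefaction(csr, crr))
```

The resistance grows without bound as V_s1 approaches the limiting value V*_s1, which is why sites with V_s1 at or above that limit are treated as nonliquefiable in the sketch.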

Comparison among Different Methods.
It should be noted that the 34 cases in seismic intensity VI are removed in order to compare the random forest method with the Chinese code method. Table 6 shows the success rates of liquefaction evaluation based on the Andrus database for the different methods. The total success rate is defined as the ratio of the number of sites evaluated successfully to the total number of sites. Take the total success rate of the Chinese code method as an example. As displayed in Table 2, there are 96 liquefied sites and 95 nonliquefied sites in seismic intensities VII, VIII, and IX. Of these, 86 liquefied sites and 28 nonliquefied sites are successfully evaluated by the Chinese code method. Thus, the total success rate, S_total, is obtained as S_total = (86 + 28)/(96 + 95) × 100% = 59.7%.
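The same arithmetic can be written as a short Python check, using only the counts quoted above for the Chinese code method on the Andrus database. The function name `success_rate` is illustrative.

```python
def success_rate(correct, total):
    """Success rate in percent: correctly evaluated sites over all sites."""
    return 100.0 * correct / total


# Counts for the Chinese code method, seismic intensities VII to IX.
liquefied_total, nonliquefied_total = 96, 95
liquefied_correct, nonliquefied_correct = 86, 28

s_liquefied = success_rate(liquefied_correct, liquefied_total)
s_nonliquefied = success_rate(nonliquefied_correct, nonliquefied_total)
s_total = success_rate(liquefied_correct + nonliquefied_correct,
                       liquefied_total + nonliquefied_total)

print(f"liquefied: {s_liquefied:.1f}%")
print(f"nonliquefied: {s_nonliquefied:.1f}%")
print(f"total: {s_total:.1f}%")
```

The total rate is a count-weighted combination of the two class rates, so a method can score well on liquefied sites and still have a low total rate when it misjudges most nonliquefied sites.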
For liquefied sites, the success rates of the Chinese code method and the Andrus method are 89.6% and 97.9%, respectively. However, their success rates for nonliquefied sites are only 29.5% and 34.7%, respectively. With total success rates below 70%, these two methods tend to be conservative. The random forest method gives a satisfactory total success rate of 96.9%, along with 100% at liquefied sites. All three methods give satisfactory results for the liquefied sites presented in Table 6, with success rates close to or above 90%. The random forest method differs significantly from the other two methods in the prediction of nonliquefied sites. Many nonliquefied sites are misjudged as liquefied by the Chinese code method and the Andrus method, while the random forest method correctly classifies most nonliquefied sites. In conclusion, the other two methods are fairly conservative, whereas the random forest method can satisfactorily predict both liquefied and nonliquefied sites.
The success rates of liquefaction evaluation obtained by the random forest model on the Kayen database are given in Table 7 and compared with those of the Chinese code method and the Andrus method, noting that 7 cases were excluded from the Kayen database. The Chinese code method and the Andrus method give success rates greater than 90% at liquefied sites for most seismic intensities, except 78.3% for the Chinese code method at seismic intensity VII. The total success rates for liquefied sites obtained by the random forest method are more than 80%. However, the success rates for nonliquefied sites given by the Chinese code method and the Andrus method are too low. For example, the success rates given by these two methods at seismic intensities VIII and IX are both less than 15%. Therefore, using these two methods for seismic design tends to be conservative and increases the engineering cost.
For liquefied sites, the Chinese code method and the Andrus method give total success rates of more than 95% (regardless of seismic intensity), and the random forest method gives a value of 83.1%, which is an acceptable level. However, the total success rates for nonliquefied sites given by the Chinese code method and the Andrus method are 23.5% and 15.4%, respectively, while the random forest method gives 75.3%, which is far higher. Moreover, the total success rate given by the random forest method for all cases is 81.2%, which is the highest of all methods. It can be inferred that the random forest method is effective and reliable. Moreover, the random forest method developed in this study differs from conventional studies that divided a single database into a training dataset (70%) and a testing dataset drawn from the same source, which limited the generalization [21,22]. The results indicate that the random forest model developed in this study is applicable not only within the range of the training data but also to an independent database. The application of the random forest in engineering is thereby extended, which can provide a reference for similar projects.
The feature importance of the random forest model is used to output the importance of each training parameter; the importances sum to 100% [38]. Figure 3 shows the feature importance of each training parameter. It is worth noting that a_max and V_s1 are the two most important parameters for predicting soil liquefaction, which is in line with the conclusions of Nejad et al. [29] and Yang et al. [39]. The other three parameters also show 15% feature importance for liquefaction evaluation, which indirectly supports the choice of these five parameters as training parameters and the credibility of the liquefaction evaluation results.
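As a rough illustration of how such importances are obtained, the sketch below reads the `feature_importances_` attribute of a fitted Scikit-learn forest. The data are synthetic and the five parameter names are attached to random columns purely for labeling, so the ranking printed here says nothing about real liquefaction behavior and will not reproduce Figure 3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Labels borrowed from the text; the columns themselves are synthetic.
names = ["d_s", "k", "a_max", "M_w", "V_s1"]
X, y = make_classification(n_samples=225, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Hyperparameter values quoted in the text, applied to synthetic data.
model = RandomForestClassifier(n_estimators=13, max_depth=7, max_features=2,
                               random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1; convert to percent.
importance_pct = {name: 100.0 * value
                  for name, value in zip(names, model.feature_importances_)}
ranked = sorted(importance_pct, key=importance_pct.get, reverse=True)

for name in ranked:
    print(f"{name}: {importance_pct[name]:.1f}%")
```

These importances are computed from the impurity decrease accumulated over the trees, so they depend on the particular training set, which is consistent with the sensitivity to the training dataset noted in the limitations below.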
Even though the random forest method has been successfully applied to soil liquefaction evaluation, some limitations remain that should be addressed in further research. The accuracy of the random forest model may be further improved by tuning other hyperparameters. In addition, although the contribution of each training parameter to soil liquefaction was discussed through the correlation analysis, the feature importance of the training parameters for the random forest is, as with other machine learning methods, sensitive to the selection of the training dataset. Thus, more datasets need to be collected in order to further confirm the relationship among training parameters.

Conclusions
The random forest method is utilized to evaluate soil liquefaction. The selection of training parameters is discussed by adopting the Andrus database as the training dataset. Three hyperparameters of the random forest were optimized using the minimum OOB error rate, and the liquefaction evaluation model based on the random forest was established and examined using the Kayen database as the testing dataset. The success rates under different seismic intensities were obtained and compared with those of the Chinese code method and the Andrus method. The main conclusions can be drawn as follows.
(1) Based on the Andrus database, four parameters, the sandy-layer buried depth (d_s), peak horizontal ground surface acceleration (a_max), seismic magnitude (M_w), and stress-corrected shear wave velocity (V_s1), together with an introduced parameter, the stress ratio (k), were selected by correlation analysis as training parameters for the random forest model.
(2) The optimal hyperparameters were determined by the minimum OOB error rate; the number of classification trees, the maximum depth of classification trees, and the maximum number of features are 13, 7, and 2, respectively.
(3) The success rates for liquefaction evaluation of the three methods were compared based on the Andrus database, which indicated that the random forest model developed from the training dataset was credible.
(4) Based on the Kayen database which was not used in the training dataset, the success rates of liquefaction evaluation were compared among three methods including the Chinese code method, the Andrus method, and the random forest method established