Mapping Landslide Sensitivity Based on Machine Learning: A Case Study in Ankang City, Shaanxi Province, China

The main purpose of this research is to apply the logistic regression (LR) model, the support vector machine (SVM) model with a radial basis function kernel, the random forest (RF) model, and RF models coupled with the whale optimization algorithm (WOA) and the genetic algorithm (GA) to landslide susceptibility mapping in Ankang City, Shaanxi Province, China. To this end, a landslide inventory map consisting of 4278 identified landslides is randomly divided into training and test landslides in a ratio of 7:3. The 15 landslide influencing factors are selected as follows: slope aspect, slope degree, elevation, terrain curvature, plane curvature, profile curvature, surface roughness, distance to faults, distance to roads, landform, lithology, distance to rivers, rainfall, stream power index (SPI), and normalized difference vegetation index (NDVI), and the potential multicollinearity among these factors is checked using the Pearson correlation coefficient (PCC), variance inflation factor (VIF), and tolerance (TOL). We evaluate the performance of each model separately on the training and test datasets using statistical metrics, including sensitivity, specificity, accuracy, kappa, mean absolute error (MSE), root mean square error (RMSE), and area under the receiver operating characteristic curve. The training success rates of the LR, SVM, RF, WOA-RF, and GA-RF models are 0.7546, 0.8317, 0.8561, 0.8804, and 0.8957, respectively; the testing success rates are 0.7551, 0.8375, 0.8395, 0.8348, and 0.85007, respectively. The results show that the GA significantly improves the predictive power of the RF model. This study provides a scientific reference for disaster prevention and control in this area and its surrounding areas.


Introduction
Landslides are a common and highly destructive geological hazard. Because of the complex and diverse terrain and climate conditions and the frequent engineering activities in China, many kinds of geological hazards occur every year. According to a public report released by the Ministry of Natural Resources of China, there were 7840 geological disasters in 2020. Landslides accounted for about 61.35% of them, caused losses of $161 million, and killed 139 people. Landslide disasters are mainly concentrated in the central and western regions of China, and most heavily in southwest China. Ankang City is located in southern Shaanxi and is the area most severely affected by landslides.
According to the statistics of the local government, landslide disasters in Ankang City have left more than 500 people dead or missing since 1983, and the direct economic loss has amounted to 1.07 billion US dollars. The prevention and control of such frequent and destructive landslides is one of the important issues of disaster prevention and mitigation, and the key step is the evaluation of landslide susceptibility. Although there are many research results, the analysis methods and evaluation criteria differ considerably [1][2][3].
Owing to their powerful data processing and mapping capabilities, geographic information systems (GIS) have been widely applied to landslide susceptibility mapping over the past decades [4]. Currently, the widely used models for predicting landslide susceptibility mainly fall into qualitative, quantitative, and artificial intelligence attribution analysis [5,6], with quantitative attribution analysis accounting for the majority of studies. The results of qualitative attribution analysis, such as the analytic hierarchy process (AHP) and the entropy weight method [8], depend heavily on the researchers' knowledge and subjective judgment, which leads to large differences in their effects [7]. Statistical models and their coupled variants are also widely applied, such as the frequency ratio model [9][10][11], information value model [12,13], and evidence weight model [14][15][16]. However, the results of these models are not ideal.
In recent years, with the innovation of artificial intelligence algorithms and the improvement of data processing capabilities, machine learning methods have been widely studied in the field of landslide susceptibility. Machine learning models can handle large, high-dimensional geological, hydrological, and other datasets and achieve a higher prediction success rate [17][18][19]. Common models include random forest (RF), gradient boosting machine (GBM), support vector machine (SVM), artificial neural network (ANN), and logistic regression (LR) [20][21][22][23][24]. Among them, RF is widely employed and has achieved remarkable results in classification and accuracy. For example, Yu and Chen used the information model, AHP, and RF to predict landslide susceptibility in Helong City, and RF performed clearly better [25]. Some scholars compared RF with other common machine learning models to select the most suitable algorithm for their respective study areas [26,27]. The comparisons show that the training and testing accuracy of RF is relatively high. It has also been shown that the accuracy of the model can be improved by coupling RF with other models. Other traditional machine learning algorithms have also been widely applied, such as the logistic regression model, which is widely used in landslide susceptibility evaluation because it can directly generate decision boundaries from the original datasets [28]. The advantage of SVM lies in its ability to limit overfitting and noise problems in the modeling process, and it has a wide range of practicability [29]. Deep learning models use more hidden layers to model the complex relationships in the data; depth replaces breadth, which improves the fit and achieves higher precision.
Compared with the three baseline models (SVM, NBTree, and REPTree), the GA-optimized deep learning model based on ELM, DBN, and BP exhibits better prediction performance by performing hierarchical analysis on the original dataset to extract the most relevant features [30,31]. Additionally, the comparison and coupling of models can further improve the accuracy [32,33]. However, the large body of research results has not yet determined which model is the most suitable for landslide sensitivity evaluation.
According to the current research situation of landslide sensitivity analysis, statistical analysis of the temporal and spatial distribution of factors such as rainfall, geomorphology, and earthquake in historical landslide events shows that there is a great correlation between the occurrence of landslides and rainfall [34,35]. Furthermore, some scholars have also analyzed and compared landslide susceptibility through qualitative and quantitative aspects in Shaanxi Province, such as FR-AHP [36], certainty factor (CF), and index of entropy (IOE) [37] model. In addition, compared with other models, SVM model shows better prediction results [38]. However, there is no further study on quantitative analysis.
Firstly, this paper screens the landslide influencing factors by two methods: the Pearson correlation coefficient and multicollinearity analysis. The purpose of the study is to investigate the potential application of the RF model optimized by the whale optimization algorithm (WOA) and the genetic algorithm (GA) in landslide susceptibility analysis in Ankang City and to compare it with the LR, SVM, and RF algorithms. In the study area, the optimization of RF models with the genetic algorithm and the whale optimization algorithm is relatively novel. Each model is evaluated using various statistical indicators (sensitivity, specificity, accuracy, kappa index, mean absolute error, and root mean squared error), and GA-RF is found to be the optimal model.

Study Area
The study area is located in the southeast of Shaanxi Province, China, within 31°42′ N~33°49′ N and 108°01′ E~110°01′ E, and has a humid northern subtropical continental monsoon climate. Generally, the Ankang basin lies between two mountain ranges along one river: the Qinling Mountains to the north and the Bashan Mountains to the south. The Hanshui-Chihe-Yuehe-Hanshui river line forms the boundary between the Qinling Mountains and the Bashan Mountains.
The resident population of the city is more than 24.9 million. The city is composed of nine counties and one district: Hanbin District, Hanyin County, Shiquan County, Ningshan County, Ziyang County, Langao County, Pingli County, Zhenping County, Xunyang County, and Baihe County. The city is about 200 km wide from east to west and about 240 km long from north to south, with a total area of about 23529 km². The elevation of the study area ranges from 97 to 2897 m; the terrain is undulating, low in the center and high in the north and south.
Ankang City has a subtropical continental monsoon climate with an annual average temperature of 15~17°C and annual precipitation of about 1050 mm. As the longest tributary of the Yangtze River, the Hanshui River runs about 20 km across Ankang City, and its average annual runoff is 2.01 × 10¹¹ m³. River flow from July to October accounts for about 72.2% of the total annual flow.
Landslide event refers to the phenomenon that the mechanical balance of rock or soil disappears and falls down the connected shear surface due to factors such as rainfall, slope overflow erosion, and earthquake. The tectonic structure of the study area is composed of the Qinling Indo-China fold belt and the Caledonian fold belt of the Daba Mountain. Sedimentary environment varies greatly; the area of sedimentary rocks is 60.3% of the total area in Ankang City. Besides, a great number of folds and faults are developed in the area, which are the material basis of geological disasters such as landslides and collapses. Therefore, landslides occur frequently in Ankang City, and the disaster points are up to 4278, of which the central region is densely distributed, as shown in Figure 1.

Datasets
3.1. Landslide Inventory Map. The landslide inventory map is used to determine the number, location, and type of landslides, the accuracy of which affects the results of the landslide susceptibility analysis [39]. The landslide distribution data are from the Chinese Academy of Sciences' Data Center for Resource and Environmental Sciences, and the landslides are individually screened and studied through historical records analysis and satellite images. The 4278 landslide points are distributed in Ankang City. The landslide points are randomly divided into two types of datasets for training (70%) and verification (30%).

Landslide Conditioning Factors.
Different factors and data classification standards in the same region affect the output of the model to varying degrees. The selection of landslide factors directly affects the success rate of the landslide sensitivity model, because the number of factors and the collinearity between them affect the prediction ability of the model. Based on the 4278 collected landslide events, the human geography and environmental characteristics of Ankang City, and previous research results, 15 landslide factors are collected as follows: slope aspect, slope degree, elevation, terrain curvature, plane curvature, profile curvature, surface roughness, distance to faults, distance to roads, landform, lithology, distance to rivers, rainfall, stream power index (SPI), and normalized difference vegetation index (NDVI). The digital elevation model (DEM) with a 30 × 30 m cell size in the study area is obtained from the 30-m resolution ASTER GDEM (http://www.gscloud.cn). Elevation, slope aspect, slope degree, surface roughness, terrain curvature, SPI, plan curvature, profile curvature, and distance to rivers are extracted from the DEM data of the study area. NDVI is obtained from a Landsat 8 OLI image with a resolution of 30 × 30 m, which is processed in ENVI through data reading, radiometric calibration, image clipping, atmospheric correction, and image mosaic. The meteorological data are collected from the Shaanxi Meteorological Service. The other factors are derived from the collected data and vectorized from the topographic map and geological map.
In line with previous studies on topography, hydrogeological conditions, and human engineering activities, the specific influencing factors include (1) topography factors: elevation, slope aspect, slope degree, terrain curvature, plan curvature, profile curvature, distance to faults, surface roughness, landform, and lithology; (2) geological and hydrogeological factors: distance to rivers, rainfall, and SPI; and (3) human engineering activities factors: NDVI and distance to roads. After the factor data are preprocessed, the basic unit of data in the study area is set to 30 m × 30 m. The sources of the various factors are shown in Table 1.

Topographic Factors.
Elevation is a classical landslide factor extracted directly from DEM data. Ankang City is a mountainous area lying between the Daba Mountains and the Qinling Mountains, and its central part is low. According to the geographical characteristics of the study area, elevation is divided into five categories: <600 m, 600~1000 m, 1000~1400 m, 1400~1800 m, and >1800 m (Figure 2(a)).
Slope aspect describes the horizontal azimuth that a slope faces. Solar radiation, illumination time, temperature, rainfall, and wind speed differ between slope aspects. The resulting differences in slope weathering and erosion are reflected, to different degrees, in changes of rock and soil structure, groundwater occurrence conditions, and the interaction between sand and mudstone [40,41]. The slope aspect of the study area is divided into flat, north, north-east, east, south-east, south, south-west, west, and north-west (Figure 2(b)).
The slope degree reflects the steepness of a surface unit, and the stress distribution in rock and soil differs with slope degree [42]. The slope degree in the study area is divided into five categories: 0~10°, 10~20°, 20~30°, 30~40°, and >40° (Figure 2(c)).
Terrain curvature is a function of slope point, including profile curvature and plane curvature. It shows the structure and morphology of the terrain, not only reflects the aggregation and dispersion of surface flow, but also affects the acceleration of water flow, thereby affecting surface erosion and deposition [43]. Terrain curvature was extracted from DEM data in the study area using GIS surface tools such as curvature as shown in Figure 2(d). The distribution of topographic curvature in the study area is analyzed according to the classification standard of curvature value which are <0 (concave), =0 (flat), and >0 (convex), and then the distribution map of profile curvature and plane curvature is finally obtained (Figures 2(e) and 2(f)). In addition, three different classes of plane curvature values were defined as follows: <-0.5, -0.5~0.5, and >0.5.

Geofluids
Surface roughness is the ratio between the slope surface area and its horizontal projection area in a given region, which reflects the degree of surface fluctuation and erosion caused by solar radiation [44]. The roughness of the study area is divided into <1.05, 1.05~1.15, 1.15~1.25, and >1.25 (Figure 2(g)). Surface roughness of the study area is calculated using the "Raster Calculator" as

$$R = \frac{1}{\cos \alpha},$$

where R is the surface roughness and α is the slope degree (unit: radian).
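As a small illustration of the roughness definition above, the following sketch assumes the common form of the area ratio, 1/cos α, and bins the result into the four classes listed in the text. The function names are hypothetical, and the handling of values that fall exactly on a class boundary is an assumption, since the text does not specify it.

```python
import math

def surface_roughness(slope_deg):
    """Ratio of slope surface area to projected area, taken here as 1 / cos(slope)."""
    return 1.0 / math.cos(math.radians(slope_deg))

def roughness_class(r):
    """Map a roughness value to the four classes used in the text.

    Values exactly on a boundary go to the upper class (an assumption).
    """
    if r < 1.05:
        return "<1.05"
    if r < 1.15:
        return "1.05~1.15"
    if r < 1.25:
        return "1.15~1.25"
    return ">1.25"

# Flat ground has roughness 1; steeper slopes give larger values.
for slope in (0, 10, 30, 45):
    r = surface_roughness(slope)
    print(slope, round(r, 4), roughness_class(r))
```

Because the roughness computed this way depends only on slope, it is expected to be strongly correlated with slope degree, which is consistent with its later removal during factor screening.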
The study area consists mostly of high mountains in the north and south, with a valley basin in the middle. Different landforms lead to large morphological differences, and steep terrain affects slope stability [45]. The topography of the study area fluctuates greatly and is dominated by middle- and high-altitude mountains, which account for about 93.86%; low mountains, plains, and hills account for 6.14%. Through investigation and analysis (Figure 2(h)), the landforms of the study area can be divided into I (plain), II (hills), III-1 (small undulating low mountains), III-2 (medium undulating low mountains), IV-1 (small undulating high mountains), IV-2 (medium undulating high mountains), and V (large undulating high mountains).
Lithology is one of the most essential factors for landslide occurrence. The rock components and structures differ between parts of the same region, and so do the physical and mechanical properties of the rocks [46]. Under the same external force, rocks with different mechanical properties deform differently and vice versa. Sedimentary rock covers about 60% of the study area; its strength is low, it is easily eroded by heavy rainfall and rivers, and it is unstable. Therefore, more landslides occur in the sedimentary rock area, accounting for about 69.65% of the total landslide events. According to the genetic classification of rock, lithology is divided into six categories: sedimentary rocks, carbonate rocks, igneous clastic rocks, metamorphic rock, intrusive rock, and volcanic rocks, as shown in Table 2 and Figure 2(i).

Geological and Hydrogeological Factors.
The flow of rivers is mainly erosion and abrasion, which destroys the stress balance of dams and mountains along the rivers. Especially in areas where rivers and precipitation affect soluble rocks such as carbonate rocks, dissolution changes topography [47,48]. The study area is divided into <1500 m, 1500~3000 m, 3000~4500 m, 4500~6000 m, and >6000 m by distance to rivers (Figure 3(a)).
The research area has a northern subtropical continental monsoon climate; the average annual rainfall is 900~1000 mm, and the largest rainfall occurs in May, July, and September. Heavy and sudden rainfall within a short time significantly reduces the shear strength of the rock-soil interface on the slope. Melillo et al. established an independent algorithm to determine the rainfall threshold and improved the accuracy of rainfall-based landslide prediction by analyzing historical rainfall and landslide information and conducting practical verification [49]. According to the spatial distribution of rainfall in Ankang City, the regional rainfall is divided into <800 mm, 800~900 mm, 900~1000 mm, 1000~1100 mm, and >1100 mm (Figure 3(b)).
The stream power index (SPI) is related to rock lithology, particle size, and permeability. Rill erosion and sediment accumulation often occur on slope surfaces, and instability may occur when the slope shear stress exceeds the shear strength of the surface [50,51]. SPI in Ankang City is divided into three categories: <2.5, 2.5~5, and >5 (Figure 3(c)).
The study area is located in the border area between the Qinling geosyncline fold system and the Bashan orogenic belt, where fold and fault structures are relatively developed. The fold structural belt may produce a large number of interlayer fractures and transverse and longitudinal tensile fractures. The developed cracks lead to rock and soil breakage, increase the water richness of rock layers, and accelerate rock erosion. In addition, a large number of secondary small faults are often developed near the large fault structural belt [52,53]. Regional tectonics makes the geological conditions in the study area complex and increases the probability of landslide disasters. The distance to faults is divided into five classes: <1500 m, 1500~3000 m, 3000~4500 m, 4500~6000 m, and >6000 m (Figure 3(d)).

Human Engineering Activities Factors.
NDVI is obtained by preprocessing remote sensing image data in ENVI, including radiometric calibration, atmospheric correction, and removal of outliers. Generally, the higher the vegetation coverage, the greater the shear strength of the rock-soil interface. In areas where human activities are frequent or where the rock is hard, the lower NDVI creates conditions for landslides [54]. NDVI in Ankang City is divided into five classes by the natural breaks method: <0.18, 0.18~0.48, 0.48~0.63, 0.63~0.72, and >0.72 (Figure 4(a)).
Roads are one of the indirect factors inducing landslides [55,56]. The total highway mileage in the study area is 24500 km, the railway mileage is 900 km, and the traffic network is well developed. Mountain roads may change the original drainage channels and destroy the continuity of rock and soil, and some remote areas may have fewer protective measures or delayed repairs, all of which affect slope stability. GIS is used to create multiple buffers around roads, and the study area is classified into <1500 m, 1500~3000 m, 3000~4500 m, 4500~6000 m, and >6000 m by the distance to roads (Figure 4(b)).

Modeling Process.
(1) Random safety (nonlandslide) points are created by GIS, the number of which is roughly the same as the number of landslide points, and the landslide points and safety points are combined. The collected factor data are normalized, and the variables are classified using the Multipoint Extraction tool.
(2) We analyze and compare the 15 landslide factors using Pearson's correlation coefficient method and the multicollinearity method and eliminate terrain curvature, surface roughness, and SPI. The remaining 12 factors (slope aspect, slope degree, elevation, plane curvature, profile curvature, distance to faults, distance to roads, landform, lithology, distance to rivers, rainfall, and NDVI) are used. The Scikit-learn library in Python is used to randomly select 2994 points (70%) and 1284 points (30%) of the landslide points and safety points for training and model validation [57]. The RF, LR, and SVM models are determined after evaluating the AUC values.
(3) After the RF model is coupled with and optimized by the GA and the WOA, each model is evaluated by comparing the values of RMSE, MSE, and AUC, and the optimal parameter combination is selected.
(4) The grid data of the study area are converted into more than 30 million vector points through grid-to-point conversion, and the factor values at each point are used to predict the output probability of landslide occurrence in the study area. Using the machine learning algorithm in Python, the probability values are rearranged into the grid size, and the probability grid data of landslide occurrence in this area are obtained.
(5) The landslide susceptibility maps are drawn, and the performance of each model and the main controlling disaster factors are analyzed and discussed.
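The 70/30 split in step (2) can be sketched as follows. The paper performs this step with Scikit-learn; the version below uses only the Python standard library as a stand-in, and the point IDs, the seed, and the function name are hypothetical.

```python
import random

def split_points(points, train_fraction=0.7, seed=42):
    """Shuffle a copy of `points` and split it into training and test lists."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical stand-in for the inventory: point IDs only, no real coordinates.
landslide_ids = list(range(4278))
train_ids, test_ids = split_points(landslide_ids)
print(len(train_ids), len(test_ids))  # 2994 1284
```

Fixing the seed keeps the split reproducible, so that all five models are trained and tested on the same points.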

Selection of Landslide Conditioning Factors.
In landslide sensitivity evaluation, the selection of landslide conditioning factors should not be driven by quantity. Unnecessary factors produce excessive spatial data and add meaningless workload, which often has an adverse impact on the evaluation accuracy [58]. There is no fixed reference for the number of influencing factors to select. According to the relevant literature and the geographical characteristics of the study area, 15 factors are selected in this study. It is also necessary to ensure that the index factors in the evaluation system are independent of each other, so as to avoid mutual interference between strongly correlated factors, which would affect the accuracy of the prediction results. In order to remove strongly correlated and unimportant factors, the two most commonly used tests, namely, multicollinearity analysis and the Pearson correlation coefficient method, are used [59].

Logistic Regression.
On the basis of linear regression, logistic regression adds a sigmoid function (logistic function) to define the decision boundary of a classification problem. The principle is to map the range (-∞, +∞) of the linear regression output to the range (0, 1) through the sigmoid function [60]. The expression of the sigmoid function is as follows:

$$S(z) = \frac{1}{1 + e^{-z}},$$

where z is the output of the linear regression.

4.4. Support Vector Machine.
Support vector machine (SVM) classifies the data of the 12 landslide factors linearly with respect to an optimal boundary [61]. The samples of the study area that determine the boundary are the support vectors. The best boundary (hyperplane) is the one that maximizes the distance between the two classes of vectors, which ensures both the correctness and the separation of the classification (Figure 5). In the corresponding formula, m is the number of landslide impact factors, x_i is the landslide influencing factor in the study area, and y is the objective function. When an accurate optimal boundary cannot be found using a soft margin, the support vectors are generally mapped to a high-dimensional space to solve the problem. However, computation in the feature space is generally difficult. To address this, the kernel function K(x) is introduced to avoid the explicit form of the feature space mapping γ(x) [62]. It is only necessary to determine that the kernel function is the inner product of two infinite-dimensional vectors. In the kernel formula, y_i is the training output of the 12 factors and y is the training target. By comparing the performance of the five kernel classifiers through various statistical indicators, it is concluded that RBF-SVM performs relatively well [63], and RBF-SVM is therefore selected as one of the prediction models in this paper.
The radial basis function is

$$k(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right), \quad \gamma > 0,$$

where b and γ are parameters of the kernel functions.
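The two functions discussed above can be illustrated with a minimal sketch, assuming the standard sigmoid 1/(1 + e^(-z)) and the standard Gaussian form of the radial basis function kernel. The function names and the sample vectors are hypothetical.

```python
import math

def sigmoid(z):
    """Map a linear-regression output z in (-inf, +inf) to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow for very negative z.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def rbf_kernel(x_i, x_j, gamma):
    """Gaussian RBF kernel exp(-gamma * ||x_i - x_j||^2) with gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)

print(sigmoid(0.0))                               # 0.5, the decision boundary
print(rbf_kernel([1.0, 2.0], [1.0, 2.0], 0.5))    # 1.0 for identical points
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], 0.5))    # decays with distance
```

A larger γ makes the kernel decay faster with distance, so each support vector influences a smaller neighborhood of the factor space.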

Random Forest.
The random forest, which is based on the decision tree principle, is mainly built with the bagging algorithm. The bootstrap method is used to draw n samples at random with replacement from the n training samples, and this is repeated K times to obtain K decision trees. Then d (d < D) features are randomly selected from the D features of the training subset, the best of the d features is selected as the basis for splitting, and this procedure is repeated. The mode of the decision tree results is taken as the final classification result [64]. The classification result of RF is accurate, simple, and easy to understand, which makes it suitable for processing and analyzing big datasets (Figure 6). Based on the law of probability and statistics, the probability of each sample being drawn into a bootstrap set is P, and the formula of P is as follows:

$$P = 1 - \left(1 - \frac{1}{n}\right)^{n},$$

where n is large enough in this study that P is about 63%, and the remaining 37% of the data, which are not involved in the modeling, are called out-of-bag data. Compared with the bagging algorithm, the random forest adds a restriction on features. If the parameter d is too small or too large, the final fit is relatively poor, so choosing the optimal number of features is the main problem [65]. In the model, the Gini coefficient is used to test all split points of the feature subset in each tree, and the split with the smallest Gini value is selected. The model is then verified by comparing the out-of-bag errors [66].
[Table fragment: The kappa value is between -1 and 1. The closer the value is to 1, the better the prediction effect of the model is [78].]
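The figure of about 63% follows from the bootstrap: if n samples are drawn with replacement from n samples, a given sample is missed with probability (1 - 1/n)^n, which tends to 1/e. The short sketch below evaluates this numerically; the function name is hypothetical, and 2994 is used only because it is the training set size quoted in the text.

```python
import math

def prob_in_bootstrap(n):
    """Probability that a given sample appears at least once in a bootstrap set of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n

for n in (10, 100, 2994):
    print(n, round(prob_in_bootstrap(n), 4))

# Limiting value for large n: 1 - 1/e, about 0.632.
print("limit", round(1.0 - math.exp(-1.0), 4))
```

The complementary share, about 37%, is the out-of-bag data available for error estimation without a separate validation set.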

Genetic Algorithm.
The genetic algorithm is a stochastic global search optimization method that simulates the replication, crossover, and mutation that occur in natural selection and inheritance [67]. Candidate solutions in the solution space are encoded as genotype strings (chromosomes), and different combinations of these strings correspond to different points in the search space. Starting from an initial population, chromosomes with higher fitness are selected and subjected to crossover and mutation to generate individuals better adapted to the environment, so that the population evolves toward better and better regions of the search space [68]. The population continues to reproduce and evolve over generations and finally converges to the individuals best adapted to the environment, which yields a high-quality solution to the problem. The main steps of the genetic algorithm are as follows:
Step 1. Generation of the initial population: hyperparameter combinations are generated at random within the approximate hyperparameter range found by grid search.
Step 2. Fitness evaluation: the fitness of each individual is evaluated from the score of its random forest training result.
Step 3. Breeding selection: criteria are established to select individuals from the parent generation to participate in breeding. While elite individuals are selected as far as possible, the diversity of the population should also be maintained to prevent the algorithm from falling into a local optimum prematurely.
Step 4. Mutation: this process includes a series of biologically inspired operations, such as recombination and mutation, through which the individual codes of the parents are inherited and recombined to form the offspring group.
Step 5. Environmental selection: parents and offspring are regrouped into a new population. The offspring are reinserted into the parent population, replacing part or all of it, to form a new population of similar size to the previous generation.
Step 6. Stopping criterion: this determines when the algorithm stops. There are usually two situations: the algorithm has found the optimal solution, or it has fallen into a local optimum and cannot continue to search the solution space.
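The six steps above can be sketched as a small (mu + lambda) evolutionary loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the hyperparameter ranges are hypothetical, `toy_fitness` is a made-up stand-in for a random forest validation score, the loop starts from MU individuals rather than a larger initial population, and it stops after a fixed number of generations. The parameter names NGEN, MU, LAMBDA, CXPB, and MUTPB follow those reported later in the paper.

```python
import random

# Hypothetical search ranges for two random-forest hyperparameters.
BOUNDS = {"n_estimators": (50, 500), "max_features": (1, 12)}

def toy_fitness(ind):
    """Stand-in for a random-forest validation score (higher is better)."""
    return (-((ind["n_estimators"] - 300) / 450) ** 2
            - ((ind["max_features"] - 4) / 11) ** 2)

def random_individual(rng):
    return {k: rng.randint(lo, hi) for k, (lo, hi) in BOUNDS.items()}

def crossover(a, b, rng):
    """Uniform crossover: each gene is copied from one of the two parents."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in BOUNDS}

def mutate(ind, rng):
    """Resample one randomly chosen gene within its bounds."""
    child = dict(ind)
    key = rng.choice(sorted(BOUNDS))
    lo, hi = BOUNDS[key]
    child[key] = rng.randint(lo, hi)
    return child

def run_ga(ngen=40, mu=5, lam=10, cxpb=0.7, mutpb=0.2, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(mu)]      # Step 1
    for _ in range(ngen):                                         # Step 6: fixed budget
        offspring = []
        for _ in range(lam):                                      # Steps 3 and 4
            r = rng.random()
            if r < cxpb:
                a, b = rng.sample(population, 2)
                offspring.append(crossover(a, b, rng))
            elif r < cxpb + mutpb:
                offspring.append(mutate(rng.choice(population), rng))
            else:
                offspring.append(dict(rng.choice(population)))    # plain copy
        # Steps 2 and 5: score parents plus offspring and keep the best mu.
        merged = population + offspring
        population = sorted(merged, key=toy_fitness, reverse=True)[:mu]
    return population[0]

best = run_ga()
print(best, round(toy_fitness(best), 4))
```

Keeping parents and offspring in the same pool means the best score never decreases between generations, at the cost of reduced diversity, which is the trade-off noted in Step 3.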

Whale Optimization Algorithm.
The whale optimization algorithm (WOA) is a swarm intelligence optimization algorithm that imitates the predation behavior of whales in nature, which is mainly divided into three categories: encircling prey, foam-net attack, and searching for prey [69]. Therefore, before the WOA is used to solve a problem, these three types of predation behavior are expressed mathematically [70]. The purpose of predation is to capture prey. When a group of whales is looking for prey together, one whale must find the prey first, and the other whales then swim toward that whale to compete for the prey. In this paper, the whale population size is 10, and the termination condition is a maximum of 100 iterations. The main steps of the WOA are as follows:
4.7.1. Encircling Prey. The next position of an individual whale (X_j) under the influence of the best individual whale (X_{j+1}) is given by the encircling update equation, in which a decreases linearly from 2 to 0 as the number of iterations i increases, A is the distance adjustment factor, C is a random number, and D_k is the distance between the best whale and the current individual whale.
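The encircling move can be sketched as follows, assuming the form commonly given in the WOA literature: A = 2·a·r - a, C = 2·r with r uniform in [0, 1), D = |C·X_best - X|, and a new position X_best - A·D. This is an assumption about the update rule, not a reconstruction of the paper's own equation. The best position is held fixed, positions are clamped to hypothetical search bounds, and A and C are drawn once per whale rather than per dimension.

```python
import random

def encircle_step(x, best, a, rng, lo=-5.0, hi=5.0):
    """One encircling move of a whale at x toward the best whale."""
    A = 2.0 * a * rng.random() - a      # distance adjustment factor in [-a, a)
    C = 2.0 * rng.random()              # random coefficient in [0, 2)
    new = []
    for xb, xi in zip(best, x):
        D = abs(C * xb - xi)            # distance term between best and current whale
        new.append(min(hi, max(lo, xb - A * D)))  # clamp to the search bounds
    return new

rng = random.Random(0)
best = [1.0, -2.0]                      # hypothetical best position, held fixed
whales = [[rng.uniform(-5.0, 5.0) for _ in range(2)] for _ in range(10)]
max_iter = 100
for it in range(max_iter):
    a = 2.0 * (1.0 - it / max_iter)     # decreases linearly from 2 toward 0
    whales = [encircle_step(w, best, a, rng) for w in whales]

print(max(abs(v - b) for w in whales for v, b in zip(w, best)))
```

As a shrinks, the range of A shrinks with it, so the whales are pulled ever closer to the best position. In the full algorithm the best whale would be re-evaluated after every iteration.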

Search for Prey.
In the mathematical model of the encircling behavior, when the value of A lies within (-1, 1), the current whale approaches the best individual whale. However, when A is outside (-1, 1), the current whale does not approach the best whale but instead approaches a whale selected at random from the current population, which is the idea of searching for prey. Searching may make the current whale deviate from the target prey, but it increases the global search ability of the population.
$$\text{target whale} = \begin{cases} \text{best whale}, & |A| < 1, \\ \text{random whale}, & |A| \ge 1. \end{cases} \quad (7)$$

4.7.3. Spiral Position Update. The current whale approaches the current best individual whale in a spiral fashion, and its location is updated along this spiral. When whales hunt their prey, they not only shrink the encircling circle but also swim toward the prey along a spiral path, so the algorithm chooses between shrinking the encirclement and the spiral move with a probability of 50% each.

Figure 9: ROC curves of five landslide susceptibility models using (a) training and (b) testing.

The x-axis of the ROC curve represents the proportion of nonlandslide points classified as positive among all nonlandslide points; the y-axis represents the proportion of landslide points classified as positive among all landslide points. The farther the ROC curve is from the 45° diagonal line, the more accurate the model is [81].
The area under the ROC curve, namely the AUC value, is used for the quantitative comparison of prediction success rates [82]. The closer the AUC value is to 1.0, the more accurate the discrimination model is.
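The ROC construction and the AUC value can be sketched in a few lines of Python. This is a minimal illustration under assumed inputs: the labels (1 = landslide, 0 = nonlandslide) and the predicted probabilities below are made-up values, and the helper names `roc_points` and `auc` are hypothetical.

```python
def roc_points(labels, scores):
    """Sweep the decision threshold from high to low and collect (FPR, TPR) pairs."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Samples sharing the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC points by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 1, 0, 1, 0, 0]                # hypothetical ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]    # hypothetical model outputs
print(auc(roc_points(labels, scores)))
```

A model that ranks every landslide point above every nonlandslide point reaches an AUC of 1.0, while a random ranking stays near 0.5, which corresponds to the 45° diagonal.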

Elimination and Selection of Landslide Affecting Factors.
The results of the Pearson correlation coefficient and multicollinearity analysis [83] show that the tolerance (TOL) of surface roughness, terrain curvature, and SPI is less than 0.5, and their variance inflation factor (VIF) is more than 5 (Table 4). In addition, in terms of correlation, terrain curvature is strongly correlated with plane curvature and with profile curvature, and surface roughness is strongly correlated with slope degree. The comprehensive results show that it is best to eliminate these three factors (terrain curvature, surface roughness, and SPI) before modeling (Table 5). Therefore, in this paper, 12 factors, namely slope aspect, slope degree, elevation, plane curvature, profile curvature, distance to faults, distance to roads, landform, lithology, distance to rivers, rainfall, and normalized difference vegetation index (NDVI), are finally selected as the evaluation indicators for this landslide sensitivity study.
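The screening criteria can be illustrated with a small Python sketch. It relies on the standard identity that the VIF of factor j equals the j-th diagonal entry of the inverse correlation matrix, with TOL = 1/VIF. The three synthetic factors (a slope column, a roughness column built from slope plus noise, and an independent elevation column) are assumptions made for illustration and are not the study's data.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_and_tol(columns):
    """VIF_j is the j-th diagonal entry of the inverse correlation matrix."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inv = invert(corr)
    vif = [inv[j][j] for j in range(len(columns))]
    return vif, [1.0 / v for v in vif]


rng = random.Random(42)
slope = [rng.uniform(0, 45) for _ in range(200)]          # synthetic factor
roughness = [s + rng.gauss(0, 2) for s in slope]          # nearly collinear with slope
elevation = [rng.uniform(200, 2500) for _ in range(200)]  # independent factor
vif, tol = vif_and_tol([slope, roughness, elevation])
for name, v, t in zip(("slope", "roughness", "elevation"), vif, tol):
    print(f"{name}: VIF={v:.2f} TOL={t:.3f} flagged={v > 5 or t < 0.5}")
```

In this synthetic example the two nearly collinear columns are flagged by both thresholds, whereas the independent column keeps a VIF close to 1.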

Modeling Process and Evaluations.
After removing the factors with high correlation and strong collinearity, we combined the landslide pixels and nonlandslide pixels to extract the values of the 12 factors. In this paper, modeling and comparative analysis are carried out on the training and test datasets separately. The GA and WOA heuristic algorithms are used to optimize the RF model and find the optimal parameter combination. Then, the values of statistical indicators such as RMSE, MSE, and accuracy are used to verify the success rate and performance of the model. The optimal parameter combination has a maximum number of generations (NGEN) of 40, a population size (pop size) of 300, a number of individuals selected for the next generation (MU) of 5, a number of children produced in each generation (LAMBDA) of 10, a crossover probability (CXPB) of 0.7, and a mutation probability (MUTPB) of 0.2.
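The parameter names above suggest a (mu + lambda) evolutionary loop; that reading is an assumption, and the following Python sketch only illustrates how such a loop could search RF hyperparameters. The search ranges in `BOUNDS`, the two tuned hyperparameters, and in particular the `fitness` function (a stand-in for the cross-validated accuracy of an RF model) are hypothetical and not taken from the paper.

```python
import random

NGEN, MU, LAMBDA, CXPB, MUTPB = 40, 5, 10, 0.7, 0.2          # values quoted in the text
BOUNDS = {"n_estimators": (10, 500), "max_depth": (2, 30)}   # hypothetical search ranges


def fitness(ind):
    """Stand-in objective; a real run would train and score an RF model here."""
    return (1.0 - ((ind["n_estimators"] - 200) / 500) ** 2
            - ((ind["max_depth"] - 12) / 30) ** 2)


def random_individual(rng):
    return {k: rng.randint(lo, hi) for k, (lo, hi) in BOUNDS.items()}


def crossover(p1, p2, rng):
    """Uniform crossover: each hyperparameter comes from one of the two parents."""
    return {k: rng.choice((p1[k], p2[k])) for k in BOUNDS}


def mutate(ind, rng):
    """Resample one randomly chosen hyperparameter within its bounds."""
    child = dict(ind)
    key = rng.choice(sorted(BOUNDS))
    lo, hi = BOUNDS[key]
    child[key] = rng.randint(lo, hi)
    return child


def optimise(pop_size=300, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = [max(fitness(ind) for ind in population)]
    for _ in range(NGEN):
        offspring = []
        for _ in range(LAMBDA):
            u = rng.random()
            if u < CXPB:
                p1, p2 = rng.sample(population, 2)
                offspring.append(crossover(p1, p2, rng))
            elif u < CXPB + MUTPB:
                offspring.append(mutate(rng.choice(population), rng))
            else:
                offspring.append(dict(rng.choice(population)))
        # (mu + lambda) selection: keep the MU best of parents and offspring.
        population = sorted(population + offspring, key=fitness, reverse=True)[:MU]
        history.append(fitness(population[0]))
    return population[0], history


best, history = optimise()
print(best, round(history[-1], 4))
```

Because parents compete with their offspring for survival, the best fitness in `history` never decreases from one generation to the next.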
After selecting the optimal parameter values for each model, several metrics are computed separately for the training and test sets, including sensitivity, specificity, accuracy, kappa index, mean absolute error, and root mean square error. The following table compares the statistical index values of each model and shows that the prediction performance of the GA-RF model is higher than that of the LR, SVM, RF, and WOA-RF models.
For each model, the statistical measures of the test dataset are shown in Table 6. In summary, the GA-RF ensemble model had the highest sensitivity (0.

The value of each index reflects the improvement of the GA-RF model over its performance before optimization (Table 7).
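The statistical measures named in this section can be computed from a confusion matrix and the predicted probabilities, as in the following sketch. The 0.5 classification threshold, the function name `classification_metrics`, and the six sample values are illustrative assumptions; the error measure is computed as the mean absolute error, following the wording of the text.

```python
import math


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, accuracy, kappa, MAE, and RMSE for binary labels."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": accuracy,
        "kappa": (accuracy - p_e) / (1.0 - p_e),
        "mae": sum(abs(y - p) for y, p in zip(y_true, y_prob)) / n,
        "rmse": math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / n),
    }


y_true = [1, 1, 1, 0, 0, 0]               # hypothetical labels
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1]   # hypothetical predicted probabilities
metrics = classification_metrics(y_true, y_prob)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```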

Landslide Susceptibility Map of Multiple Models.
After successful testing, the RF models optimized by the GA and WOA algorithms with their best parameter combinations are used, together with the LR and SVM models, to predict the landslide probability of each pixel in the study area. To visualize the landslide susceptibility mapping, this paper employs the Jenks natural breaks method to divide the grid data output from each model into four categories: low sensitivity (LS), moderate sensitivity (MS), high sensitivity (HS), and very high sensitivity (VHS), as shown in Figure 7. According to the SVM model, the low sensitivity area is 38.40%, the moderate class is 21.21%, the high class is 18.25%, and the very high class is 22.14%. In the GA-RF model, the low sensitivity area is the largest (33.64%), followed by the moderate class (23.40%), the high class (22.34%), and the very high class (20.62%). The very high sensitivity area in the WOA-RF model is the smallest among all models (18.85%); its moderate class is 23.81% and its high class is 23.83%. In the LR model, the low class is 31.53%, the moderate class is 25.07%, the high class is 24.18%, and the very high class is 19.22%. In the RF model, the low class is 33.34%, the moderate class is 24.67%, the high class is 19.40%, and the very high class is 22.62% (Figure 8). In general, the five models show the same trend.
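The classification step can be sketched as follows. Natural breaks are commonly computed with the Fisher-Jenks dynamic program, which chooses class boundaries that minimise the within-class sum of squared deviations; whether the GIS software used in this study applies exactly this variant is an assumption. The ten probability values and the helper names `jenks_breaks` and `classify` are made up for illustration.

```python
from bisect import bisect_left


def jenks_breaks(values, k):
    """Upper bounds of k classes minimising the within-class squared deviation."""
    data = sorted(values)
    n = len(data)
    # Prefix sums give the squared deviation of any slice in constant time.
    ps, ps2 = [0.0], [0.0]
    for v in data:
        ps.append(ps[-1] + v)
        ps2.append(ps2[-1] + v * v)

    def ssd(i, j):
        """Sum of squared deviations from the mean of data[i:j]."""
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    split = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = cost[c - 1][i] + ssd(i, j)
                if cand < cost[c][j]:
                    cost[c][j] = cand
                    split[c][j] = i
    # Walk back through the stored split points to recover the class bounds.
    bounds, j = [], n
    for c in range(k, 0, -1):
        bounds.append(data[j - 1])
        j = split[c][j]
    return bounds[::-1]


def classify(value, breaks, names=("LS", "MS", "HS", "VHS")):
    """Map a probability to its sensitivity class using the upper bounds."""
    return names[min(bisect_left(breaks, value), len(breaks) - 1)]


probs = [0.02, 0.05, 0.08, 0.30, 0.33, 0.36, 0.60, 0.63, 0.90, 0.95]  # hypothetical
breaks = jenks_breaks(probs, 4)
print(breaks, [classify(p, breaks) for p in probs])
```

The dynamic program is cubic in the number of values for a fixed class count, so for a full raster it would typically be run on a sample of pixel values or replaced by an optimised library routine.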
The maps of all models show that the landslide sensitivity between the Daba Mountains and the Qinling Mountains is higher than that in the southern and northern regions. High and very high sensitivity areas have large terrain fluctuations. Overall, the sensitivity maps visualize the distribution of landslide susceptibility.

Model Prediction Evaluation.
To quantify each category, this study plots the proportion of each of the four sensitivity levels in the total area and the proportion of historical landslide events in the total number of landslides. Green represents the proportion of each level in the landslide sensitivity map, and red represents the proportion of historical landslide events. The high and very high sensitivity areas cover less than 50% of the total area in all models, but they contain more than 90% of the historical landslides. Landslide density (LD) is the ratio of the percentage of landslide events to the percentage of pixel area for each category of the sensitivity map. When comparing the LD values of several models, the higher the value for the high and very high sensitivity classes, the better the prediction performance of the model. Likewise, the higher the red column is, the better the prediction performance of the model is. The landslide density values of the very high sensitivity areas in the five models LR, SVM, RF, WOA-RF, and GA-RF are 2.82, 3.08, 3.32, 3.33, and 3.39, respectively. That is to say, GA-RF > WOA-RF > RF > SVM > LR, as shown in Table 8 and Figure 8.
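The landslide density defined above is a simple ratio, sketched here in Python. The pixel counts and landslide counts per class are hypothetical values chosen for illustration, not the figures behind Table 8.

```python
def landslide_density(area_pixels, landslide_counts):
    """LD per class: share of landslide events divided by share of area."""
    total_area = sum(area_pixels.values())
    total_events = sum(landslide_counts.values())
    return {
        name: (landslide_counts[name] / total_events) / (area_pixels[name] / total_area)
        for name in area_pixels
    }


# Hypothetical class areas (in pixels) and landslide counts per class.
area_pixels = {"LS": 4000, "MS": 2500, "HS": 2000, "VHS": 1500}
landslide_counts = {"LS": 10, "MS": 30, "HS": 60, "VHS": 100}
ld = landslide_density(area_pixels, landslide_counts)
for name, value in ld.items():
    print(f"{name}: LD = {value:.3f}")
```

An LD above 1 means that a class holds a larger share of the landslides than of the area, which is what a useful map should show for the high and very high classes.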
This research analyzes the prediction accuracy of the LR, SVM, RF, WOA-RF, and GA-RF models using the area under the ROC curve based on the training dataset and the test dataset. The training success rates of the LR, SVM, RF, WOA-RF, and GA-RF models are 0.7546, 0.8317, 0.8561, 0.8804, and 0.8957; the testing success rates are 0.7551, 0.8375, 0.8395, 0.8348, and 0.85007 (Figure 9). Combined with the analysis of the above raster histogram results, LR is the worst of the five models, and the prediction success rate of GA-RF is the highest, followed by RF, while SVM and WOA-RF are relatively low. Combined with the accuracy values of the training and test sets in Section 5.2, the comprehensive results show that the GA-RF model performs best.

Contribution Value of Disaster-Causing Factors.
The main controlling factors of landslide susceptibility still have no deterministic specification, and the contribution of each evaluation factor is of great significance to the prediction of landslide susceptibility. Some landslide factors are eliminated through the Pearson correlation coefficient and multicollinearity analysis. The results show that, after this elimination, the AUC and accuracy values of the landslide models improve significantly. Because different features contribute differently to the output, ranking the factor contributions improves the interpretability and visualization of the model. After landslide susceptibility prediction and evaluation, Python is used to rank the feature importance of the GA-RF model. The greater the decrease in out-of-bag accuracy, the greater the impact of the feature on the classification results. After the GA-RF model is trained, the feature importance is calculated based on the Gini coefficient, in which the overall importance of feature j is measured by its importance averaged over the trees.
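The Gini-based ranking can be illustrated with a minimal sketch. It assumes the usual definition: the importance of a feature in one tree is the total Gini impurity decrease of the splits that use it, and the forest-level importance is the average over trees, normalised to sum to 1. The per-tree decrease values and the helper names below are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def split_decrease(parent, left, right):
    """Impurity decrease achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


def forest_importance(trees):
    """Average the per-tree decreases of each feature and normalise to sum to 1."""
    features = sorted({name for tree in trees for name in tree})
    mean = {f: sum(tree.get(f, 0.0) for tree in trees) / len(trees) for f in features}
    total = sum(mean.values())
    return {f: v / total for f, v in mean.items()}


print(split_decrease([1, 1, 0, 0], [1, 1], [0, 0]))  # a perfect split removes all impurity

# Hypothetical summed impurity decreases per feature in two trees.
trees = [
    {"elevation": 0.30, "NDVI": 0.10},
    {"elevation": 0.20, "rainfall": 0.20},
]
importance = forest_importance(trees)
for name, value in sorted(importance.items(), key=lambda t: -t[1]):
    print(f"{name}: {value:.3f}")
```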
The contribution rates of each feature of the GA-RF model, which has the highest success rate, to the landslide prediction results are plotted in Figure 10. We consider that these factors all contribute positively to the simulation of landslide susceptibility. The five factors of elevation, NDVI, distance to roads, rainfall, and distance to rivers have the highest contribution rates. Previous studies in this area and its surroundings have also shown that elevation, rainfall, and roads are strongly correlated with the occurrence of landslide events.

The distribution of landslides in the high and very high sensitivity classes on the thematic maps of the five most important factors (elevation, NDVI, distance to roads, rainfall, and distance to rivers) is shown in Figure 11. Most areas with high and very high landslide susceptibility are closely related to the following conditions: the average elevation is below 1000 m, the NDVI is between 0.63 and 0.72, the distance to roads is below 1500 m, the rainfall is between 900 mm and 1000 mm, and the distance to rivers is also below 1500 m. Therefore, in areas where these conditions coincide, early warning of heavy rainfall should be strengthened and prevention and control measures increased to protect people's property and personal safety.

5.6. Discussion.
Compared with traditional qualitative analysis, some classical machine learning methods, such as the LR and SVM models, are more effective in predicting landslide susceptibility [84]. The LR model is directly based on the original dataset; multicollinearity is not a problem, and it has good prediction performance. SVM outperforms logistic regression in landslide susceptibility evaluation. SVM and RF models are widely used in landslide prediction with good results [85][86][87]. Because RF and SVM models can overfit, the number of landslide factors needs to be controlled within a certain range to ensure the performance of the landslide prediction model; 8-12 landslide factors are the best [88]. This study analyzes and compares 15 landslide influencing factors and excludes terrain curvature, surface roughness, and SPI based on the Pearson correlation coefficient, variance inflation factor, and tolerance. We find that the AUC value of each model improves to some extent.
We then computed the statistical measures on the training and test sets, and the results show that all models achieve good results. Coupling heuristic algorithms with machine learning can effectively improve the accuracy of landslide prediction models [89]. The GA and WOA can quickly and efficiently find hyperparameter combinations [90]. In this study, the WOA and GA search for the best parameter combination of the RF model to improve its prediction ability. We evaluate the performance of each model separately on the training and test datasets using sensitivity, specificity, accuracy, kappa, mean absolute error, root mean square error, and area under the receiver operating characteristic curve, and the results show that the GA-RF model has the best performance. The weight of each factor has been obtained based on the Gini index of the GA-RF model. The distribution of landslides in the high and very high sensitivity classes is shown on the thematic maps of the five most important factors (elevation, NDVI, distance to roads, rainfall, and distance to rivers). Based on these results, relevant preventive measures can be formulated to achieve early prevention and control of landslides.

Conclusion
The selection of landslide factors is a key step in landslide mapping. Since the main controlling factors of landslides have not been determined, the number of landslide factors to use has to be decided. Choosing too many factors may lead to overfitting of the model, while choosing too few may lead to insufficient accuracy. Many scholars have studied coupled optimization models, and this study applies two of these approaches, the GA and WOA. The research focuses on the optimization of the RF model by the GA and WOA, that is, on finding the best parameter combination of the RF model.
In this study, we use the GA and WOA not as separate models but to optimize the RF model by finding its optimal parameter combination. First, a landslide inventory map consisting of 4278 identified landslides is randomly divided into training and test landslides in a ratio of 7 : 3. Potential multicollinearity among the 15 landslide influencing factors is detected using PCC, VIF, and TOL. The performance of each model is evaluated separately on the training and test datasets using sensitivity, specificity, accuracy, kappa, MSE, RMSE, and AUC.
The evaluation shows that the genetic algorithm is better than the other algorithms at optimizing RF performance. The landslide susceptibility map produced can support decision-making and risk management for landslides. After comprehensive evaluation, the GA-RF model is the most suitable prediction model among the five models, which is of great reference significance for disaster prevention and control in Ankang City and its surrounding areas.
In terms of preventing overfitting and improving performance, further work could optimize and compare more deep learning models and examine whether the factor features need dimensionality reduction to improve accuracy, which requires further research and analysis.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare no conflict of interest.