Improving Voting Feature Intervals for Spatial Prediction of Landslides

In this study


Introduction
In recent years, population growth and development in unstable hilly areas have led to an increase in natural disasters such as landslides [1]. An analysis of 100 years of natural hazard data shows that, after floods and earthquakes, landslides are the most frequent and important natural disaster, causing casualties, financial losses, and adverse environmental impacts all over the world [2]. Landslides can destroy infrastructure, change land use, and cause erosion and a high volume of sediment production in watersheds [3][4][5]. Most landslides occur due to the action of gravity on groundmass as a result of rainfall, earthquakes, soil saturation, and excavation of slopes [6,7]. Landslide influencing factors include topography (slope, aspect, curvature, altitude, and elevation), geology (lithology, faults, and weathering crust), hydrology (rainfall and drainage), and land use [8][9][10][11]. Understanding the features and elements of landslide development and expansion helps in predicting risk and preventing landslide damage [12].
In landslide studies, it is important to identify and demarcate landslide susceptible zones [13,14]. Landslide zoning requires an assessment of the relationship between the prevailing conditions of the basin and the factors affecting the occurrence of landslides [15]. In general, there are several methods of landslide susceptibility mapping and zoning [16,17], which include mathematical/statistical and machine learning techniques [18,19]. A mathematical modeling approach for delineating landslide hazards in watersheds was discussed in detail by Simons and Ward (http://andrewsforest.oregonstate.edu/pubs/pdf/pub2055.pdf) [20]. Corominas et al. [21] reviewed the literature and recommended methodologies for the quantitative analysis of landslide hazard, vulnerability, and risk at different spatial scales. They have also used this method for the verification and validation of the results [21].
However, it is difficult to demarcate the natural boundaries of transitional/gradational geological units, as well as continuous topographic features and factors such as elevation, slope, and topographic indices, using traditional and statistical models [22,23]. Simplification of major landslide parameters, their classes, and the interactions between them can lead to incorrect results in the final map [24,25]. These concerns led to the use of machine learning (ML) and data mining techniques in landslide studies [1,26]. Nowadays, these methods are being used more widely for landslide susceptibility mapping due to their accuracy and speed [13,14,27]. Some of the prominent models used for mapping include artificial neural network (ANN) [28,29], boosted regression trees (BRTs) [30,31], random forest (RF) [32,33], rotation forest (ROF) [34,35], particle swarm optimization (PSO) [36,37], support vector machine (SVM) [28,38], binary logistic regression (BLR) [22], bagging [39,40], logistic regression (LR) [33,41], and canonical correlation forest (CCF) [42]. ML models have proven their relative superiority over bivariate and multivariate statistical models in several studies [43,44]. In addition to increasing accuracy in dealing with complex problems and uncertainties, these models have led to the development of new approaches to various problems [45,46]. Although a number of ML models have been used in landslide studies, no model is perfect for all geoenvironmental conditions. Therefore, there is always scope for improving the methodology by using different combinations of algorithms.
With this objective, two new ensemble ML models, namely, ABVFI and MBVFI, which combine a popular single ML model, voting feature intervals (VFI), with two effective ensemble techniques, namely, the AdaBoost and MultiBoost algorithms, were proposed for the development of landslide susceptibility maps. Muong Lay district, which is one of the most landslide affected areas of Vietnam, was selected as the study area. The main contribution of this study is the development and application of a novel hybrid approach for accurate landslide susceptibility mapping. Validation of these models was carried out using different quantitative statistical indices including area under the ROC curve and accuracy. Weka and ArcGIS software were used for processing the data, modeling, and mapping of landslide susceptibility.

Voting Feature Intervals.
Voting feature intervals (VFI) is a nonincremental classification method based on feature separation [47]. In the VFI method, the features are considered independently [48]. This method has been used successfully in various medical, computer, and natural sciences studies [49,50]. The primary purpose of this approach is to deal with very imbalanced datasets [51].
The VFI methodology involves two main steps: (1) training and (2) classification. In the training phase, feature intervals are constructed for each class from the lowest and highest values of each feature. In the classification stage, each feature casts a vote for each class based on the interval into which the feature value falls, and the votes of all features are then combined to produce one output [47]. One of the most important advantages of this algorithm is that it ignores missing feature values at both the training and classification stages [47].
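The two steps can be illustrated with a minimal sketch. It is a simplification written for this explanation and not the Weka implementation used in the study: it assumes numeric features, encodes missing values as NaN, and treats every interval as a half-open range instead of distinguishing point and range intervals. The function names `train_vfi` and `predict_vfi` are hypothetical.

```python
import bisect
import math

def known_values(X, y, j, c):
    """Non-missing values of feature j among the training rows of class c."""
    return [row[j] for row, lab in zip(X, y)
            if lab == c and not math.isnan(row[j])]

def train_vfi(X, y):
    """Training: build intervals and normalised class votes for every feature."""
    classes = sorted(set(y))
    model = []
    for j in range(len(X[0])):
        per_class = [known_values(X, y, j, c) for c in classes]
        # Interval end points: lowest and highest value of the feature per class.
        edges = sorted({e for vals in per_class if vals
                        for e in (min(vals), max(vals))})
        # counts[i][k]: share of class k's known values that fall in interval i.
        counts = [[0.0] * len(classes) for _ in range(len(edges) + 1)]
        for k, vals in enumerate(per_class):
            for v in vals:
                counts[bisect.bisect_right(edges, v)][k] += 1.0 / len(vals)
        # Normalise so that the votes of each interval sum to one.
        votes = [[x / sum(row) if sum(row) else 0.0 for x in row]
                 for row in counts]
        model.append((edges, votes))
    return classes, model

def predict_vfi(classes, model, x):
    """Classification: sum the votes of all features with a known value."""
    scores = [0.0] * len(classes)
    for (edges, votes), v in zip(model, x):
        if math.isnan(v):
            continue  # a missing value casts no vote
        for k, vote in enumerate(votes[bisect.bisect_right(edges, v)]):
            scores[k] += vote
    return classes[scores.index(max(scores))]
```

In this sketch a sample with a missing value is still classified, because the remaining features vote on their own, which mirrors the property described above.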

AdaBoost.
AdaBoost, or Adaptive Boosting, is an ML algorithm devised by Yoav Freund and Robert Schapire [52]. AdaBoost is an ensemble learning technique and the most well-known member of the boosting family of algorithms. In this algorithm, models are learned sequentially, so that one model is trained in each round. At the end of each round, incorrectly classified examples are identified and given more emphasis in a new training set, which is used to train a new model in the next round [53]. The idea is that new models should be able to compensate for the errors made by previous models. In fact, AdaBoost is a meta-algorithm that is used along with other learning algorithms to enhance their performance. This algorithm combines several weak classifiers to obtain a suitable boundary between two classes of data.
The AdaBoost algorithm is sensitive to noise and outliers, but it is less susceptible to the overfitting problem than other learning algorithms [52].
If the base classifier is better than a random classifier (50%), the performance of the algorithm improves with more iterations. Even classifiers with a higher error than a random classifier enhance overall performance, because they receive a negative coefficient [54]. In the AdaBoost algorithm, a weak classifier is added in each round. At each call, weights are assigned based on the importance of the samples. With each round, the weight of misclassified samples increases and the weight of correctly classified samples decreases, so the new classifier focuses on the more difficult-to-learn samples [55].
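The reweighting scheme can be sketched as follows. This is a minimal illustration of discrete AdaBoost for labels +1 and -1 and is not the configuration used in this study: the weak learner is assumed to be a one-feature threshold rule (a decision stump) and not VFI, and all function names are hypothetical.

```python
import math

def stump_predict(stump, row):
    """Predict +1 or -1 with a one-feature threshold rule."""
    _, j, thr, pol = stump
    return pol if row[j] <= thr else -pol

def best_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                cand = (0.0, j, thr, pol)
                err = sum(wi for wi, row, t in zip(w, X, y)
                          if stump_predict(cand, row) != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, rounds=10):
    """Train `rounds` stumps sequentially on reweighted samples."""
    n = len(X)
    w = [1.0 / n] * n                        # start from uniform weights
    ensemble = []
    for _ in range(rounds):
        stump = best_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # coefficient of this stump
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correctly classified ones lose it.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Sign of the coefficient-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1
```

The coefficient `alpha` is positive when the weighted error is below 0.5 and would become negative above it, which is the negative-coefficient behaviour mentioned above.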

MultiBoost.
MultiBoost is an ensemble learning method developed by combining two ensemble learning algorithms, namely, AdaBoost and Wagging [56][57][58]. Wagging trains on samples with differing weights, which can significantly reduce variance and thus complements the bias reduction of the AdaBoost algorithm [59]. The combination of the AdaBoost and Wagging techniques improves the learning of weak classifiers and transforms them into a robust classifier [56]. In the MultiBoost technique, training is done in three main stages: (i) a subset is randomly separated from the training data and used to build a model based on the initial classification; (ii) sample weights are adjusted according to the predictive ability of the model; and (iii) a new subset is selected according to the weighted samples and is used to train the new model [60].
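The Wagging part of this scheme can be sketched as a random reweighting step. The sketch assumes the formulation commonly given for Wagging, in which each training sample receives a weight drawn from a continuous approximation of the Poisson distribution (here `-log(U)` for uniform `U`) and the weights are rescaled to sum to the number of samples. The function name `wagging_weights` is hypothetical, and the snippet shows only the weight reset, not the full MultiBoost procedure.

```python
import math
import random

def wagging_weights(n, rng):
    """Draw n random sample weights and rescale them to sum to n."""
    # 1 - rng.random() lies in (0, 1], so the logarithm is always defined.
    raw = [-math.log(1.0 - rng.random()) for _ in range(n)]
    total = sum(raw)
    return [n * r / total for r in raw]

# Example: fresh weights for five training samples, seeded for repeatability.
weights = wagging_weights(5, random.Random(0))
```

In MultiBoost, a reset of this kind would be applied between groups of AdaBoost rounds, while the AdaBoost weight update is used within each group.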

Study Area
The study area of Muong Lay district is located in the northwest of Vietnam between 22°0′N and 22°5′N and 103°5′E and 103°10′E, covers 11403 km², and is highly prone to landslides (Figure 1). The area is located at the confluence of the Da, Nam Na, and Nam Lay Rivers in a long, narrow valley [3,4]. The elevation varies between 125 and 1778 m. The hill slopes are connected with sheer cliffs and marked by rapids. The area is tectonically active, structurally disturbed, and traversed by several faults including the Chay River fault, Red River fault, and Dien Bien-Lai Chau fault zones, within the Lai Chau-Dien Bien fault zone, and is thus vulnerable to natural disasters such as floods and landslides.
This area experiences an annual average temperature ranging between 21°C and 23°C, humidity of up to 84%, and an average of 1820 to 2035 sunshine hours per year [3,4].

Geospatial Database
Geospatial data of the landslide inventory were obtained from the official web portal of the Vietnam Academy of Geosciences and Minerals (http://canhbaotruotlo.vn) and updated from Google Earth images and field surveys. In total, 271 landslide events were recorded and studied for the development of models. Landslides in the area are of rotational, translational, debris, rock fall, and mixed types. Most of the landslides occur along and adjacent to the main roads connecting to the Muong Lay district, Highways 6 and 12 [3,4]. For developing the landslide prediction models, landslide conditioning or affecting factors such as topographical factors (aspect, slope, and curvature) were generated from a digital elevation model (DEM) of 12.5 m resolution available online (https://vertex.daac.asf.alaska.edu). Geological and topographical factors (distance to faults, distance to rivers, geology/lithology, focal flow, weathering rocks, and distance to roads) were generated and extracted from geology and topography maps (1 : 50000) collected from the General Department of Geology and Minerals of Vietnam. Maps of these conditioning factors are presented in Figure 2, while the spatial analysis of past and present landslides carried out on these conditioning maps is presented in Figure 3 [3,4]. A more detailed analysis of the individual influencing factors and the mechanism of landslides is presented in published works carried out in the same area [3,4].

Modeling Methodology
Major steps of the methodological framework include (1) data collection and preparation, (2) model development, (3) model validation, and (4) generation and validation of landslide susceptibility maps (Figure 4).

Data Collection and Preparation.
Landslide data of 271 past landslide events were generated by identifying landslides on Google Earth images in conjunction with available landslide records. Out of these, 70% of the landslide (152 locations) and nonlandslide (152 locations) data were used to generate the training dataset for building the models, whereas the remaining 30% (65 landslide locations and 65 nonlandslide locations) were used to create the testing dataset for model validation. The 70/30 ratio of training to testing data was selected based on the experience of the authors and other published work on similar studies [72][73][74][75]. The correlation-based feature selection method [76], which is known as one of the most effective feature selection methods for landslide susceptibility modeling [77,78], was used to select suitable factors for landslide modeling.
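As an illustration of the idea behind correlation-based feature selection, the merit heuristic usually given for this method rewards a subset of factors that correlate strongly with the class while correlating weakly with each other. The sketch below assumes that standard formulation; the function name and the example correlation values are hypothetical and are not taken from this study.

```python
import math

def cfs_merit(k, mean_factor_class_corr, mean_factor_factor_corr):
    """Merit of a subset of k factors.

    k                        number of factors in the subset
    mean_factor_class_corr   average correlation between factors and the class
    mean_factor_factor_corr  average correlation among the factors themselves
    """
    numerator = k * mean_factor_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_factor_factor_corr)
    return numerator / denominator

# Four uncorrelated factors score higher than four fully redundant ones.
independent = cfs_merit(4, 0.5, 0.0)   # 1.0
redundant = cfs_merit(4, 0.5, 1.0)     # 0.5
```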

Landslide Susceptibility Model Development.
The training dataset was used to construct the models (VFI, ABVFI, and MBVFI). In ABVFI, AdaBoost was used as an optimization technique to optimize the training dataset, which was then used as the input for classifying landslide and nonlandslide classes with VFI as the base classifier. Similarly, in MBVFI, MultiBoost was used to optimize the training dataset, which was then used as the input for classification with VFI as the base classifier.

Landslide Susceptibility Model Validation.
Validation of the models (VFI, ABVFI, and MBVFI) was carried out using the testing dataset and quantitative statistical indices, namely, AUC, ACC, SST, SPF, PPV, NPV, RMSE, and Kappa index.
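These indices can be computed from a confusion matrix and the model outputs, as in the following minimal sketch. It assumes the standard textbook definitions: PPV, NPV, SST (sensitivity), SPF (specificity), and ACC from counts of true and false positives and negatives, Cohen's kappa from observed and chance agreement, and AUC as the rank-based (Mann-Whitney) estimate. The function names and example counts are hypothetical, and the study itself used Weka for these calculations.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Indices derived from the counts of a two-class confusion matrix."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    # Agreement expected by chance, from the row and column totals.
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "SST": tp / (tp + fn),
        "SPF": tn / (tn + fp),
        "ACC": acc,
        "Kappa": (acc - p_exp) / (1 - p_exp),
    }

def rmse(predicted, actual):
    """Root mean square error between predicted and actual outputs."""
    m = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / m)

def roc_auc(scores, labels):
    """Probability that a landslide sample (label 1) outranks a nonlandslide one."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts: 40 TP, 30 TN, 10 FP, 20 FN.
example = confusion_metrics(tp=40, tn=30, fp=10, fn=20)
```

For the hypothetical counts above, ACC is 0.70 and the chance agreement is 0.50, so kappa is 0.40.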

Landslide Susceptibility Map Generation and Validation.
Landslide susceptibility index scores generated by the models were classified into very low, low, moderate, high, and very high susceptibility areas based on Jenks' natural break classification method [79] for map generation.
Thereafter, the performance of the generated maps was validated by frequency ratio analysis [80].
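The frequency ratio of a susceptibility class can be sketched as the share of landslides falling in that class divided by the share of the map area it occupies, assuming this common definition. The function name and the counts in the example are hypothetical and do not come from Table 4.

```python
def frequency_ratio(landslides_per_class, pixels_per_class):
    """Frequency ratio per class: landslide share divided by area share."""
    total_landslides = sum(landslides_per_class.values())
    total_pixels = sum(pixels_per_class.values())
    ratios = {}
    for name, pixels in pixels_per_class.items():
        landslide_share = landslides_per_class.get(name, 0) / total_landslides
        area_share = pixels / total_pixels   # classes are assumed non-empty
        ratios[name] = landslide_share / area_share
    return ratios

# Hypothetical two-class example: 90% of landslides on 50% of the area.
example_fr = frequency_ratio({"low": 10, "high": 90},
                             {"low": 500, "high": 500})
```

A ratio above 1 means that a class contains more landslides than its area alone would suggest, so a reliable map should show ratios that rise from the very low to the very high class.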

Validation and Selection of Important Factors.
Validation and selection of important factors was done using correlation-based feature selection [3,77], and the results are presented in Table 2. It can be observed that distance from rivers (AM = 0.437) is the most important factor, followed by distance from roads (AM = 0.404), distance from faults (AM = 0.336), aspect (AM = 0.226), weathering crust (AM = 0.126), geology (AM = 0.115), slope (AM = 0.076), focal flow (AM = 0.054), and curvature (AM = 0.029), respectively (Table 2).

Validation and Comparison of Landslide Susceptibility Models.
Validation and comparison of landslide susceptibility models were done using PPV, NPV, SST, SPF, ACC, Kappa, and RMSE scores. The ABVFI model achieved the highest accuracy on both training (ACC = 82.12%) and testing datasets (ACC = 81.54%) compared with the other models (VFI and MBVFI). This model also achieved the highest PPV (83.08%) on test data and the highest NPV on training (86.75%) and testing (80.0%) datasets. ABVFI was highly sensitive towards correctly predicting landslides in this area on both training (SST = 85.40%) and testing (SST = 80.60%) datasets. It achieved the highest SPF on the test (82.54%) dataset. ABVFI scored the highest kappa value on both training (0.624) and testing (k = 0.631) datasets. In addition, ABVFI achieved the smallest RMSE on both training (0.367) and testing (0.390) datasets (Table 3 and Figure 5).
The ABVFI model achieved the highest AUC on training (AUC = 0.897) and testing data (AUC = 0.859), followed by MBVFI on training (AUC = 0.895) and testing data (AUC = 0.839) and VFI on training (AUC = 0.845) and testing data (AUC = 0.814), respectively (Figure 6).
The area under the ROC curve is calculated as

$$\mathrm{AUC} = \frac{TP + TN}{P + N} \qquad (8)$$

where TP, TN, FP, and FN are the percentages of pixels classified correctly and incorrectly as landslide and nonlandslide classes, and P and N are the total numbers of landslide and nonlandslide samples, respectively. For the kappa index and the root mean square error, m is the total number of instances in the datasets; $V_p$ and $V_a$ are the predicted and actual values of the outputs; and $R_{ept}$ and $R_a$ are the expected agreement and the percentage of samples predicted correctly for landslide or nonlandslide classes [71].
In general, it is apparent that ABVFI scored the highest AUC, ACC, and kappa values and the lowest RMSE on both the training and testing data; therefore, this model can be selected as the best model in terms of predictability as well as robustness. MBVFI was the second-best model, followed by the VFI model.

Construction of Landslide Susceptibility Maps.
Landslide susceptibility maps generated by the models were classified into five classes: very low, low, moderate, high, and very high susceptibility areas (Figure 7). Based on the frequency analysis of each landslide susceptibility class for each model, we found that the VFI algorithm predicted very high and high landslide susceptible areas more correctly than the moderate and low classes (Table 4). Very low landslide areas could not be predicted by VFI. MBVFI predicted very high landslide susceptible areas more correctly, and it predicted high and moderate landslide susceptible areas equally well (Table 4). Like MBVFI, ABVFI was also found to be good at predicting very high landslide-sensitive areas. Overall, ABVFI could correctly predict most of the landslide susceptibility classes (Table 4).

Discussion
In this study, we have developed improved hybrid models that combine VFI with the AdaBoost and MultiBoost ensemble algorithms and applied them to the Muong Lay district, Dien Bien province, Vietnam, for landslide susceptibility mapping and prediction. To develop ML models, it is important to validate and select the most suitable conditioning factors for better landslide susceptibility assessment and mapping [81]. In this study, correlation-based feature selection was applied to validate the importance of the conditioning factors and accordingly select the best factors for landslide susceptibility modeling. The main principle of this method is based on the correlation analysis between the input and output variables and among the input variables [3,82]. It is a well-known feature selection method for ML applications [82]. The results indicated that distance from rivers (AM = 0.437), distance from roads (AM = 0.404), and distance from faults (AM = 0.336) had the highest impact on landslide susceptibility prediction in the models (Table 1), which corroborates the studies of earlier workers in this region [3,4]. The reason for the greater impact of rivers on landslide occurrence is that slopes close to rivers are generally saturated with water; moreover, erosion of toe support is likely at the bottom of the valleys through which rivers flow, thus causing more landslides in river valleys. Similarly, removal of toe support during the construction of roads in hilly and mountainous areas also creates instability of the groundmass. Road construction also disturbs the slope and the surrounding rock/ground mass, which causes landslides unless the slope is adequately protected. Faults are one of the prominent factors affecting slopes and may themselves cause landslides depending on their location, orientation, and the nature of the infilling material. Landslides generally occur in fault affected areas due to ongoing tectonic activity.
Validation and comparison results of the models showed that ABVFI is the most accurate and robust model on both the training and testing datasets (Table 2 and Figures 5 and 6). One of its advantages is that it is neither overtrained nor undertrained, particularly when compared with VFI. Kappa statistics are used to evaluate the robustness of machine learning models. ABVFI and MBVFI both scored kappa values greater than 0.61 on test data, which makes both models substantially robust [83,84]. However, VFI shows only a moderate kappa value of 0.446 on testing data [83,84]. Although the RMSE of all three models increased on testing data relative to training data, the increase was the lowest for the ABVFI model (0.023). MBVFI scored the second lowest RMSE on testing data, with an increase of 0.026 over its RMSE on training data. ABVFI achieved the highest AUC on testing data, which was 0.038 lower than its AUC on training data. In contrast, MBVFI, the second-best model in terms of AUC, achieved an AUC on testing data that was 0.056 lower than on training data, while the AUC of VFI on testing data was 0.021 lower than on training data. In addition, it can be seen from Table 4 that the frequency ratio values of the high and very high classes of the map produced by ABVFI are higher than those produced by the other models (MBVFI and VFI), which indicates that the landslide prediction probability of ABVFI is higher than that of the other models.
The main reason for the better performance of ABVFI in comparison with the other two models (MBVFI and VFI) is that it uses the AdaBoost ensemble technique, which has many advantages: (i) it analyses large amounts of data efficiently; (ii) it handles uncertainties and performs error analysis in a better way; (iii) it optimizes the training dataset, selects the informative features, and provides appropriate weights to features for better data interpretation; and (iv) it is mathematically insensitive to overtraining, and its training error converges to zero exponentially [85].
In general, ABVFI achieved the best performance in this study compared with the other models. It should be noted that this is the first time that AdaBoost and MultiBoost ensembles with VFI as the base classifier have been developed as hybrid models (ABVFI and MBVFI) and evaluated for the prediction of landslide susceptibility. A limitation of the study is that only the 271 available landslide events were used for the development of the models. Therefore, we suggest using a larger sample size in future studies to check and refine the performance of the models.

Concluding Remarks
In the present study, spatial landslide susceptibility prediction models, namely, ABVFI and MBVFI, with VFI as the base classifier, were developed as ensemble or hybrid models, which have emerged as better decision-making tools. The novel hybrid model ABVFI (AUC = 0.897) is the best model in comparison with the single VFI model (AUC = 0.845) and the other hybrid model developed, MBVFI (AUC = 0.895). Validation and statistical analysis results show that ABVFI is the most accurate and robust model on both the training and testing datasets. Accurate susceptibility maps generated by this model can be used for the safe and economic construction of roads, powerhouses, and other infrastructure. Thus, the ABVFI model can be used for the proper management of landslides in hilly areas, not only in Vietnam but also in other areas of the world. In future studies, it is proposed to consider excessive rainfall and drought factors due to climate change effects to further improve the prediction capability of landslide susceptibility models.

Data Availability
Some or all data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.