Evaluation of Different Machine Learning Models for Predicting Soil Erosion in Tropical Sloping Lands of Northeast Vietnam

Soil erosion induced by rainfall under prevailing conditions is a prominent problem for farmers in the tropical sloping lands of Northeast Vietnam. This study evaluates the possibility of predicting erosion status with machine learning models, including fuzzy k-nearest neighbor (FKNN), artificial neural network (ANN), support vector machine (SVM), least squares support vector machine (LSSVM), and relevance vector machine (RVM). Model evaluation employed a historical dataset consisting of ten explanatory variables and soil erosion records from four different land use managements on hillslopes in Northwest Vietnam. All 236 data samples representing soil erosion/nonerosion events were randomly split (80% for training and 20% for testing) to assess the robustness of the five models. This subsampling process was repeated for 30 rounds to reduce the effect of randomness in data selection. Classification accuracy rate (CAR) and area under the receiver operating characteristic curve (AUC) were used to evaluate the performance of the five models. Significant differences between algorithms were verified by the Wilcoxon test. The results showed that the RVM model achieves the best outcomes in both the training (CAR = 92.22% and AUC = 0.98) and testing phases (CAR = 91.94% and AUC = 0.97). The four other learning algorithms also demonstrated good performance, as indicated by CAR values surpassing 80% and AUC values greater than 0.9. Hence, these results strongly confirm the efficacy of applying machine learning models for soil erosion prediction.


Introduction
Water erosion often causes loss of soil from the field, breakdown of soil structure, and decline of organic matter and nutrients [1]. Erosion reduces cultivable soil depth and soil fertility, eventually reducing production. Furthermore, sedimentation downstream reduces the capacity of rivers, reservoirs, and drainage ditches, which shortens their design life. It also increases the risk of flooding and blocks irrigation channels [2]. The severity of soil erosion is highly variable, depending on the site's climate, soil, topography, cropping, and land management [3]. In particular, soil erosion potential in tropical areas is high due to heavy rainfall coupled with land management such as monocropping in the uplands of Northwest Vietnam [4,5]. Accelerated erosion is often observed at the beginning of the cropping season, when heavy rains coincide with poor ground cover [6]. Climatic conditions, soil characteristics, landform, and land management contribute to soil erosion with different weights, which need to be investigated.
Soil loss studies at the plot scale have been of crucial importance for identifying the mechanisms of the erosion processes. Erosion plot experiments can help to introduce new erosion prevention technologies, as they provide access to reliable and consistent erosion measurements and the large amounts of data necessary to test new models [7]. Most recent empirical models employ data from plot studies. Examples include USLE/RUSLE, based on the Universal Soil Loss Equation $A = R \cdot K \cdot L \cdot S \cdot C \cdot P$, where A is the computed soil loss, R is the rainfall-runoff erosivity factor, K is the soil erodibility factor, L is the slope length factor, S is the slope steepness factor, C is the cover management factor, and P is the supporting practices factor [8,9]; SWAT [10]; the physically based Water Erosion Prediction Project (WEPP) model [11]; and the Integrated Valuation of Ecosystem Services and Tradeoffs (InVEST) Sediment Delivery Ratio (SDR) model [12].
Machine learning approaches could provide a helpful alternative for dealing with the multivariate and complex nature of problems in soil science and geoscience [13][14][15][16]. Artificial neural networks (ANN) generally predict soil loss with acceptable results [17,18], or even better results than the WEPP model (2011) [19]. Kohonen Neural Networks (KNN), multivariate adaptive regression splines [20], and support vector classification coupled with metaheuristics [21], when used for runoff-erosion modeling, have shown results superior to the conventional multiple linear regression model [22]. Soil erosion is a complex and dynamic process, so its prediction requires comparison of various advanced machine learning algorithms. Machine learning has demonstrated great potential and effectiveness for solving complex soil science problems. This modern method can construct data-driven models from historical datasets and establish prediction models for various complex phenomena, including soil erosion [23][24][25].
This study elucidates the potential application of five capable machine learning models to predict soil erosion: artificial neural network (ANN), support vector machine (SVM), least squares support vector machine (LSSVM), relevance vector machine (RVM), and fuzzy k-nearest neighbor (FKNN), using a dataset containing ten explanatory variables collected from fields in Northwest Vietnam. The ANN method is inspired by the actual neural systems of the human brain; this method possesses the universal approximation capability and can accurately approximate any nonlinear function [26,27]. SVM is a robust machine learning model based on structural risk minimization [28]; therefore, SVM is less susceptible to overfitting than ANN. LSSVM and RVM can be considered variants of the original SVM. The former reformulates the model training procedure of SVM so that it only requires solving a linear system instead of the constrained nonlinear programming problem of SVM [29]. The latter takes advantage of the Bayesian framework to construct more robust and sparse models, which may result in fewer support vectors than the standard SVM [30]. The sparseness property of an RVM means that this approach can be resilient to noise and less susceptible to noisy data samples [31,32]. In addition, the FKNN [33] is an extension of the standard k-nearest neighbor (KNN) algorithm; this model incorporates fuzzy theory into the KNN model structure to enhance the flexibility of data modeling and to better construct the class decision boundary. Because of these characteristics and advantages, these five models were selected for this study.

The Dataset.
The erosion dataset was collected from two experiments that featured four different land use managements in Northwest Vietnam over three years. Details of the experiments have been described in [34]. In brief, erosion plots were arranged in a randomized complete block design with four treatments and three replicates. The treatments represent conventional local farmers' maize cropping practice based on slashing, burning, and ploughing with fertilization (1), and soil conservation practices such as grass barrier (Panicum maximum) (2), minimum tillage with cover crop (Arachis pintoi) (3), or relay cropping with Adzuki beans (Phaseolus calcaratus) (4).
The minimum tillage and/or cover crop options provided better land cover and less disturbed soil conditions, hence lowering soil loss. Each plot measures 72 m² (4 m wide and 18 m slope length) and lies on slopes of 24.8-34.8 degrees.
A system of buckets was installed to collect the sediment eroded from the plots above. Erosion data were recorded on a per-storm basis over the three years 2009-2011.

Description of Soil Erosion Data.
Climate, soil, topography, and land use factors affect rill and inter-rill soil erosion caused by raindrop impact and surface runoff. More precisely, soil erosion depends on the erosivity caused by the amount and intensity of rainfall and runoff, and on the resistance of the soil surface, that is, the degree of erodibility determined by intrinsic soil properties, the land use practices adopted, and the topography of the landscape as described by slope length and steepness. To represent these factors, a set of ten explanatory variables has been chosen as described in Table 1. Data distributions are shown in the histograms (Figure 1). Among the variables, OC denotes organic matter. In this study, we classify the dependent variable as either "erosion" or "nonerosion." When soil loss measured in the field is greater than 3 tons per hectare, it is considered significant erosion in tropical regions [36]; otherwise, the loss is negligible. A total of 236 data samples were collected, of which 118 records were classified as "erosion."

Artificial Neural Network (ANN).
ANN is a widely employed machine learning method inspired by biological neural networks. This method simulates the knowledge acquisition and reasoning processes occurring in the human brain [37][38][39][40][41]. Given that the learning task is to train a function $f: X \in R^{D} \rightarrow Y \in R^{1}$, where D denotes the number of input attributes, an ANN model employed to learn the function f typically includes input, hidden, and output layers.
Via a training process, the knowledge learnt by an ANN model is adapted and stored in the form of matrices of connection weights. Generally, the parameters of an ANN model are trained via a process that employs the framework of error backpropagation [42,43]. Overall, an ANN-based soil erosion classification model can be expressed as follows:
$$ f(X) = SM\left( b_1 + W_{L1L2} \times f_A\left( b_0 + W_{L0L1} \times X \right) \right), $$
where $b_0$ and $b_1$ denote the two bias vectors of the input and hidden layers, respectively, and $f_A$ represents the activation function. SM is the softmax activation function [44,45], $W_{L0L1}$ is the matrix of connection weights between the input and hidden layers, and $W_{L1L2}$ denotes that between the hidden and the softmax layers. The softmax activation function used to compute the class probability is expressed as follows:
$$ SM(z)_i = \frac{\exp(z_i)}{\sum_{j=1}^{CN} \exp(z_j)}, \quad i = 1, 2, \ldots, CN, $$
where $z$ is the input vector of the softmax layer and CN represents the number of output classes.
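The forward pass described above (a hidden layer followed by a softmax layer) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the names `ann_forward`, `affine`, `W01`, and `W12` are hypothetical, the hyperbolic tangent stands in for the unspecified activation function f_A, and the weights are made-up numbers rather than trained values.

```python
import math

def softmax(z):
    """Softmax over a list of scores (shifted by the max for numerical stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def affine(W, x, b):
    """Compute W x + b for a row-major weight matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def ann_forward(x, W01, b0, W12, b1, activation=math.tanh):
    """One-hidden-layer classifier: SM(b1 + W12 * f_A(b0 + W01 * x)).

    tanh is an assumed choice for f_A; the source does not name the activation.
    """
    hidden = [activation(v) for v in affine(W01, x, b0)]
    return softmax(affine(W12, hidden, b1))

# Toy example: 3 inputs, 2 hidden neurons, 2 classes (all weights are made up).
W01 = [[0.2, -0.1, 0.4], [0.7, 0.3, -0.5]]
b0 = [0.0, 0.1]
W12 = [[1.0, -1.0], [-1.0, 1.0]]
b1 = [0.0, 0.0]
probs = ann_forward([0.5, -1.2, 0.3], W01, b0, W12, b1)
predicted_class = probs.index(max(probs))
```

The output `probs` is a list of class probabilities that sums to one; the predicted class is simply the index of the largest entry.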

Support Vector Machine (SVM).
Proposed by [28], the SVM algorithm was originally a powerful method for linear binary classification. The algorithm aims at constructing a hyperplane that separates positive and negative samples with a margin as large as possible. SVM models are highly suitable for medium-size datasets and are less susceptible to overfitting than ANN models [46][47][48]. Given a training dataset $\{x_k, y_k\}_{k=1}^{N}$ with input data $x_k \in R^{n}$ and corresponding class labels $y_k \in \{-1, +1\}$, the SVM algorithm establishes a decision boundary so that the gap between classes is as large as possible. Moreover, SVM relies on the kernel trick to cope with nonlinear classification problems [49][50][51]. The formulation of the SVM training process can be described as the following optimization problem:
$$ \min_{w, b, e} \; J_p(w, e) = \frac{1}{2} w^{T} w + c \sum_{k=1}^{N} e_k \quad \text{subject to} \quad y_k \left( w^{T} \varphi(x_k) + b \right) \geq 1 - e_k, \;\; e_k \geq 0, \;\; k = 1, \ldots, N, $$
where $w \in R^{n}$ denotes a normal vector to the classification hyperplane and $b \in R$ represents the model bias; $e_k \geq 0$ denotes slack variables; c is a penalty constant; and $\varphi(x)$ represents a nonlinear mapping from the input space to the high-dimensional feature space. By solving the above constrained optimization problem, the final SVM model used for pattern classification is expressed as follows [52]:
$$ y(x) = \operatorname{sign}\left( \sum_{k=1}^{SV} \alpha_k y_k K(x_k, x) + b \right), $$
where $\alpha_k$ is the solution of the dual form of the above optimization problem, SV represents the number of support vectors (the number of $\alpha_k > 0$), and K(.) denotes the radial basis function (RBF) kernel [52]:
$$ K(x_k, x_l) = \exp\left( - \frac{\| x_k - x_l \|^{2}}{2 \sigma^{2}} \right), $$
where σ denotes the RBF parameter.
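To make the kernel-based decision rule concrete, the sketch below evaluates a trained SVM classifier on a new point. It assumes the common RBF form exp(-||x - z||^2 / (2σ^2)) and takes the support vectors, multipliers, and bias as given; the function names and the toy values are hypothetical and are not taken from the study.

```python
import math

def rbf_kernel(x, z, sigma):
    """RBF kernel, assuming the form exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, labels, alphas, bias, sigma):
    """Sign of sum_k alpha_k * y_k * K(x_k, x) + b, returned as +1 or -1."""
    score = sum(
        a * y * rbf_kernel(sv, x, sigma)
        for sv, y, a in zip(support_vectors, labels, alphas)
    ) + bias
    return 1 if score >= 0 else -1

# Made-up model with two support vectors (not fitted to any real data).
svs = [[0.0, 0.0], [1.0, 1.0]]
ys = [-1, 1]
alphas = [1.0, 1.0]
near_positive = svm_decision([0.9, 0.9], svs, ys, alphas, bias=0.0, sigma=1.0)
near_negative = svm_decision([0.1, 0.1], svs, ys, alphas, bias=0.0, sigma=1.0)
```

A smaller σ makes each support vector's influence more local, which is why σ controls the smoothness of the decision boundary.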

Least Squares Support Vector Machine (LSSVM).
LSSVM is a least squares version of the standard SVM in which the model structure is identified by solving a linear system instead of a nonlinear optimization problem [53,54]. Similar to the standard SVM, the LSSVM relies on kernel functions to deal with complex and nonlinear datasets [55][56][57]. The LSSVM formulation for pattern classification can be stated as follows [58]:
$$ \min_{w, b, e} \; J_p(w, e) = \frac{1}{2} w^{T} w + \frac{c}{2} \sum_{k=1}^{N} e_k^{2} \quad \text{subject to} \quad y_k \left( w^{T} \varphi(x_k) + b \right) = 1 - e_k, \;\; k = 1, \ldots, N, $$
where $w \in R^{n}$ is the normal vector to the classification hyperplane and $b \in R$ is the bias; $e_k \in R$ represents error variables; and $c > 0$ denotes a regularization constant.
By solving the above optimization problem, the LSSVM classification model can be expressed as follows:
$$ y(x) = \operatorname{sign}\left( \sum_{k=1}^{N} \alpha_k y_k K(x_k, x) + b \right), $$
where $\alpha_k$ and b are the solution of the linear system resulting from the above optimization problem. K(.) also denotes the RBF kernel function [54].

Relevance Vector Machine (RVM).
RVM, proposed by [59], is a Bayesian inference-based method that can be employed for solving classification problems. The functional form of RVM is similar to that of the support vector machine. Furthermore, an expectation maximization based method is utilized to construct the RVM prediction model [60]. Compared to the aforementioned SVM and LSSVM, the Bayesian-based RVM requires fewer tuning parameters; hence, the model construction phase of the RVM can be accomplished quickly [61,62]. Furthermore, an RVM model often achieves good predictive performance thanks to its sparseness property: it relies on a small number of relevance vectors extracted from the training samples to construct the classification model [31].
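The RVM-based classification model is presented compactly as follows [30]:
$$ y(x, w) = \sum_{i=1}^{M} w_i \varphi_i(x) + w_0 = w^{T} \varphi(x), $$
where $w = [w_0, w_1, \ldots, w_M]$ represents a vector of the model weights and $\varphi(x) = [1, \varphi_1(x), \ldots, \varphi_M(x)]$ denotes a vector of Gaussian basis functions.
The Gaussian basis function is given as follows:
$$ \varphi_i(x) = \exp\left( - \frac{\| x - x_i \|^{2}}{b^{2}} \right), $$
where b represents the width of the Gaussian basis function.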

Fuzzy k-Nearest Neighbor (FKNN).
Proposed by [33], the FKNN algorithm is an extension of the original k-nearest neighbor [63]. One major advantage of the FKNN is that it takes into account the distances among samples. The FKNN utilizes the concept of fuzzy logic to express the membership strength of data instances in each class. The membership degree of a data instance in a class is computed as a function of the distance to its nearest neighbors [64]. The FKNN classifier computes a fuzzy partition matrix $U = [u_{ij}]$ as follows [64]:
$$ u_{ij} = \begin{cases} 0.51 + 0.49 \, (n_i / k), & \text{if } i = c(x_j), \\ 0.49 \, (n_i / k), & \text{otherwise}, \end{cases} $$
where $n_i$ denotes the number of neighbors of the data instance $x_j$ that actually belong to the i-th class and $c(x_j)$ represents the class label of $x_j$.
Based on the matrix U, the fuzzy memberships of a new sample x in the different classes can be obtained, and the class label having the largest membership degree is selected as the output for x. The fuzzy memberships of x are computed as follows:
$$ u_i(x) = \frac{\sum_{j=1}^{k} u_{ij} \, \| x - x_j \|^{-2/(m-1)}}{\sum_{j=1}^{k} \| x - x_j \|^{-2/(m-1)}}, $$
where $i = 1, 2, \ldots, C$ indexes the classes and $j = 1, 2, \ldots, k$ indexes the nearest neighbors of x. Moreover, k is the number of nearest neighbors, and the parameter m is called the fuzzy strength coefficient.
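The two FKNN steps (building the fuzzy partition matrix, then weighting neighbor memberships by inverse distance) can be sketched as follows. This is a minimal illustration that assumes the widely used 0.51/0.49 membership initialization and Euclidean distance; the helper names, the small constant `eps` guarding against division by zero, and the toy data are all hypothetical.

```python
import math

def euclid(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def k_nearest(x, X, k, skip=None):
    """Indices of the k training points closest to x (optionally skipping one index)."""
    idx = [i for i in range(len(X)) if i != skip]
    return sorted(idx, key=lambda i: euclid(x, X[i]))[:k]

def fuzzy_partition(X, y, classes, k):
    """U[j][c]: membership of training sample j in class c."""
    U = []
    for j, xj in enumerate(X):
        nbrs = k_nearest(xj, X, k, skip=j)  # a sample is not its own neighbor
        row = {}
        for c in classes:
            n_c = sum(1 for i in nbrs if y[i] == c)
            row[c] = 0.49 * n_c / k + (0.51 if y[j] == c else 0.0)
        U.append(row)
    return U

def fknn_predict(x, X, U, classes, k, m=2.0, eps=1e-12):
    """Distance-weighted class memberships of x and the winning label."""
    nbrs = k_nearest(x, X, k)
    weights = [1.0 / (euclid(x, X[j]) ** (2.0 / (m - 1.0)) + eps) for j in nbrs]
    total = sum(weights)
    memb = {c: sum(w * U[j][c] for w, j in zip(weights, nbrs)) / total for c in classes}
    return max(memb, key=memb.get), memb

# Toy data: two well-separated clusters (values are made up).
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]
classes = [0, 1]
U = fuzzy_partition(X, y, classes, k=2)
label_a, memb_a = fknn_predict([0.2, 0.1], X, U, classes, k=2)
label_b, memb_b = fknn_predict([5.5, 5.5], X, U, classes, k=2)
```

With m = 2 the neighbor weights reduce to inverse squared distances; larger m flattens the weighting so that distant neighbors count almost as much as close ones.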

Performance Metrics.
The classification accuracy rate (CAR) is employed to measure and compare the performance of classifiers. CAR is the percentage of correctly classified cases, calculated by the following equation:
$$ CAR = \frac{N_c}{N_a} \times 100\%, $$
where $N_c$ and $N_a$ represent the number of correctly classified instances and the total number of instances, respectively. In addition to CAR, true positive rate (TPR) (the percentage of positive instances correctly classified), false positive rate (FPR) (the percentage of negative instances misclassified), false negative rate (FNR) (the percentage of positive instances misclassified), and true negative rate (TNR) (the percentage of negative instances correctly classified) are also utilized to quantify the performance of a classifier [65]. The four metrics are calculated as follows:
$$ TPR = \frac{TP}{TP + FN}, \quad FPR = \frac{FP}{FP + TN}, \quad FNR = \frac{FN}{TP + FN}, \quad TNR = \frac{TN}{TN + FP}, $$
where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively. The receiver operating characteristic (ROC) curve, a graphical plot that illustrates the performance of a binary classifier as its discrimination threshold is varied, can be applied to summarize TPR and FPR [65]. The area under the ROC curve, or AUC for short, can be calculated to quantify the classification performance of a model [66].
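The metrics above follow directly from the confusion-matrix counts. The sketch below computes them from label vectors and, as one common way of obtaining AUC, uses the rank interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one). The function names and the six-sample example are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def classification_metrics(y_true, y_pred, positive=1):
    """CAR in percent; TPR, FPR, FNR, and TNR as fractions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "CAR": 100.0 * (tp + tn) / (tp + fp + tn + fn),
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "FNR": fn / (tp + fn),
        "TNR": tn / (tn + fp),
    }

def auc_from_scores(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = erosion, 0 = nonerosion.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
metrics = classification_metrics(y_true, y_pred)
auc = auc_from_scores(y_true, scores)
```

In this example two of three erosion events and two of three nonerosion events are classified correctly, and eight of the nine positive-negative score pairs are ordered correctly.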

Selection of Model Parameters.
All samples representing soil erosion/nonerosion were randomly divided into two subsets: a training dataset (80%) used for model establishment and a test dataset (20%) used to measure the model generalization capability. The data were standardized with Z-score normalization, so that each variable has zero mean and unit standard deviation, by the following equation:
$$ X_n = \frac{X_o - \mu_x}{s_x}, $$
where $X_n$ and $X_o$ denote the normalized and the original input variables, respectively, and $\mu_x$ and $s_x$ are the mean and the standard deviation of the variable $X_o$, respectively. Five machine learning algorithms, ANN, SVM, LSSVM, RVM, and FKNN, were used to establish the soil erosion status prediction models. The FKNN model was coded in the MATLAB environment by the authors. The ANN and SVM models were implemented in MATLAB with the Statistics and Machine Learning Toolbox [67]. The LSSVM and RVM models were established via the toolboxes developed by [68,69], respectively.
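The split-then-normalize step can be sketched as below. Two details are assumptions rather than statements from the source: the normalization statistics are estimated on the training subset only and then reused for the test subset, and the sample standard deviation is used. The function names and the three-row example are hypothetical.

```python
import random
import statistics

def train_test_split(n_samples, train_fraction=0.8, seed=0):
    """Randomly split sample indices into training and test index lists."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * n_samples))
    return idx[:cut], idx[cut:]

def fit_zscore(rows):
    """Column-wise mean and sample standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.stdev(c) for c in cols]
    return means, stds

def apply_zscore(rows, means, stds):
    """Apply X_n = (X_o - mean) / std to every column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# 236 samples, as in the dataset described above, give 189 training and 47 test samples.
train_idx, test_idx = train_test_split(236)

# Tiny made-up feature table to show the transformation itself.
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = fit_zscore(toy)
toy_scaled = apply_zscore(toy, means, stds)
```

Changing `seed` yields a different random split, which is the mechanism a repeated subsampling procedure relies on.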
A fivefold cross-validation procedure coupled with grid search was carried out to identify appropriate free parameters for each model. Model training and prediction were repeated five times on five mutually exclusive groups separated from the whole dataset. Model selection was based on the set of free parameters that leads to the highest average CAR. Moreover, the grid search procedure employed for a model with two free parameters is described in Algorithm 1.
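A grid search over two free parameters with fivefold cross-validation can be sketched as follows. This is not the authors' Algorithm 1, only a generic illustration: `evaluate` is a hypothetical callback that trains a model with the given parameter pair on the training indices and returns its CAR on the validation indices, and `toy_score` is a made-up stand-in so the example runs without a real classifier.

```python
import itertools
import math
import random

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle indices and deal them into n_folds mutually exclusive groups."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def grid_search_cv(grid_a, grid_b, evaluate, n_samples, n_folds=5, seed=0):
    """Return the parameter pair with the highest mean validation score."""
    folds = kfold_indices(n_samples, n_folds, seed)
    best_params, best_score = None, float("-inf")
    for a, b in itertools.product(grid_a, grid_b):
        scores = []
        for f in range(n_folds):
            val_idx = folds[f]
            train_idx = [i for g in range(n_folds) if g != f for i in folds[g]]
            scores.append(evaluate(a, b, train_idx, val_idx))
        mean_score = sum(scores) / n_folds
        if mean_score > best_score:
            best_params, best_score = (a, b), mean_score
    return best_params, best_score

def toy_score(c, sigma, train_idx, val_idx):
    """Made-up score surface peaking at c = 10, sigma = 5 (no model is trained)."""
    return 100.0 - abs(math.log10(c) - 1.0) - abs(sigma - 5.0)

c_grid = [0.01, 0.1, 1, 10, 100, 1000]
sigma_grid = [0.01, 0.1, 1, 5, 10]
best_params, best_score = grid_search_cv(c_grid, sigma_grid, toy_score, n_samples=189)
folds = kfold_indices(189)
```

In practice `evaluate` would wrap the training and scoring of one of the five classifiers, and the grids would cover the parameter ranges reported for that model.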

A Preliminary Analysis on the Relevancy of Input Factors with Mutual Information.
A preliminary analysis of feature relevancy was carried out prior to the model training and prediction phases. This analysis may help to identify irrelevant input variables. In this study, the mutual information method [70] was utilized to quantify the dependence between each conditioning factor and the class label (erosion/nonerosion). Note that a large mutual information value indicates a strong relevancy between the conditioning factor and the class label. The analysis result is shown in Figure 2, which provides the mutual information values of all input variables. The input factor $X_8$ (topsoil texture-clay) obtains the highest mutual information value, followed by the input factors $X_1$ (EI30), $X_9$ (topsoil texture-sand), $X_{10}$ (soil cover), and $X_4$ (pH topsoil). The factors $X_3$ (OC topsoil), $X_2$ (slope), and $X_5$ (topsoil bulk density) receive comparatively low mutual information values. Since none of the mutual information values is zero, the subsequent model establishment phase should take into account all ten factors. This study shows that the mutual information of slope (factor 2) is lower than that of pH (factor 4). A possible explanation is that changes in pH alter the physical properties of the clayey soil, which is rich in Al, Ca, and Mg [34], and thereby promote soil water erosion [71].
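Mutual information between one input factor and the class label can be estimated in several ways; the source does not say which estimator was used. The sketch below assumes a simple histogram estimator: the continuous factor is cut into equal-width bins and the discrete mutual information (in nats) is computed from the joint counts. The function names, the bin count, and the toy vectors are hypothetical.

```python
import math
from collections import Counter

def discretize(values, n_bins=5):
    """Map continuous values to equal-width bin indices 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(feature_bins, labels):
    """I(X; Y) = sum_xy p(x, y) * log(p(x, y) / (p(x) * p(y))), in nats."""
    n = len(labels)
    joint = Counter(zip(feature_bins, labels))
    px = Counter(feature_bins)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        mi += (count / n) * math.log((count * n) / (px[x] * py[y]))
    return mi

# Made-up examples: one fully informative factor and one uninformative factor.
labels = [0, 0, 1, 1]
informative = discretize([0.0, 0.0, 1.0, 1.0])
uninformative = discretize([0.0, 1.0, 0.0, 1.0])
mi_high = mutual_information(informative, labels)
mi_zero = mutual_information(uninformative, labels)
```

For a balanced binary label the value is bounded above by ln 2, which the fully informative factor attains; a factor that is independent of the label scores zero.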

Model Calibration.
The ANN model requires the selection of the number of neurons in the hidden layer and the learning rate. In this experiment, we study the number of neurons within the range of 5 to 30 and the learning rate within the set {0.001, 0.01, 0.1, 1}. The performance of ANN with different numbers of neurons is reported in Figure 3(a). The best ANN model (CAR = 87.92%) consists of 15 neurons with a learning rate of 0.01.
For the case of SVM, the model performance corresponding to different sets of the penalty parameter c and the kernel function parameter σ is investigated. It is worth noting that the parameter c influences the model complexity and the parameter σ affects the smoothness of the classification boundary of SVM. These two parameters are varied within the range of 0.01 to 1000. The SVM model performance for each pair of c and σ is illustrated in Figure 3(b). The best values of the penalty parameter c and the kernel function parameter σ are 1000 and 10, respectively (with CAR = 90.11%). Figure 3(c) reports the model selection of LSSVM, in which the regularization (c) and the kernel function (σ) parameters are studied. The best LSSVM corresponds to c = 10 and σ = 5 with CAR = 88.50%. In the case of FKNN, the highest model accuracy obtained from the fivefold cross-validation corresponds to k = 3 nearest neighbors and a fuzzy strength m = 2 (see Figure 3(d)). In addition, the classification accuracy of RVM corresponding to different values of the Gaussian bandwidth (b) is provided in Figure 3(e), in which b = 0.015 is the most suitable value, leading to an average CAR = 91.45%.

Water Erosion Prediction Modeling.
A single run of the experiment cannot reliably exhibit the capability of the soil erosion status prediction models because of the randomness in data selection. Thus, a repeated subsampling process consisting of 30 runs was carried out. After 30 runs, the performance metrics of the five employed models are summarized in Table 2 and illustrated in Figure 4. The box plot shown in Figure 5 summarizes the CAR and AUC results of the five models obtained from the 30 runs.
In addition, the Wilcoxon signed-rank test [72] was employed to investigate whether the prediction performances of each pair of methods were statistically different. This is a nonparametric hypothesis test used for model comparison. The significance level of the test was set to 0.05: if the p value of the test was lower than 0.05, we could reject the null hypothesis that the performances of the two models of interest are not statistically different. The comparison of each pair of models is presented in Figure 6. In this figure, the symbols "++," "+," "− −," and "−" stand for a significant win, a win, a significant loss, and a loss, respectively. Observably, RVM attains four significant wins over the other benchmark models. LSSVM, as the second best approach, obtains a significant win over FKNN and two wins over ANN and SVM. FKNN receives three significant losses in the duels with ANN, LSSVM, and RVM, and one loss in the duel with SVM.
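The pairwise comparison can be sketched with a small implementation of the two-sided Wilcoxon signed-rank test. This version assumes the large-sample normal approximation without tie or continuity corrections, which is a simplification and may differ from the routine the authors used; the function name and the two CAR series are made up for illustration.

```python
import math

def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped; tied absolute differences get average ranks.
    Returns (W_plus, p_value).
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return w_plus, p_value

# Made-up CAR values (%) of two models over 30 runs.
car_model_a = [91.0 + (i % 5) for i in range(30)]
car_model_b = [88.0 + (i % 3) for i in range(30)]
w_plus, p_value = wilcoxon_signed_rank(car_model_a, car_model_b)
significant = p_value < 0.05
```

Because the test only uses the signs and ranks of the paired differences, it does not require the CAR values to be normally distributed, which suits results collected from repeated random subsampling.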
Based on the experimental results supported by the employed statistical test, it can be stated with confidence that the RVM is the best suited method for the current dataset. The outstanding performance of this machine learning approach can be explained by its advantages, including the ease of model establishment and improved generalization. The first advantage of the RVM may stem from the fact that this model requires only one hyperparameter, namely the width of the Gaussian basis function. The second advantage is based on model sparseness: the RVM selects only a small portion of the training samples as crucial data points to construct the classification model. Therefore, this advanced machine learning model is less susceptible to noisy data points than the other employed machine learning approaches. Based on these findings, the RVM is strongly recommended for soil erosion prediction problems under these prevailing tropical conditions. A broader spectrum of data collected under wider conditions is required for more comprehensive prediction in the future.
It should be noted that most conventional erosion prediction models, whether physically based, empirical, or both, face difficulties in model development and in predictive accuracy. Moreover, model parameters often need to be calibrated against observed data, creating problems with model identification and the physical interpretability of model parameters [73]. Developing concepts for erosion processes requires a considerable length of time because of the natural complexity of the systems in which erosion occurs.
Furthermore, the appropriateness of erosion concepts commonly employed in model structures is still questionable [74]. Although the physical processes of detachment, transport, and deposition in overland flow are well recognized and have been widely incorporated in erosion models, the experimental procedures to test conditions in which these processes occur concurrently have only recently been developed [75]. Our use of machine learning approaches therefore proves to be a promising alternative for erosion prediction, as it overcomes obstacles in parameterization, calibration, and validation, which are often considered the main difficulties in applying conventional models.

Conclusion
This study evaluated the performance of five machine learning algorithms, namely, FKNN, ANN, SVM, LSSVM, and RVM, using a historical dataset collected in tropical sloping fields and characterized by ten soil erosion conditioning factors. Experimental results supported by the Wilcoxon signed-rank test showed that RVM was best suited for the problem at hand. The RVM model achieved the best performance in the testing phase (CAR = 91.94% and AUC = 0.97). The four other learning algorithms also demonstrated good performance, as indicated by CAR values surpassing 80% and AUC values greater than 0.9. Thus, these results strongly confirm the efficacy of applying machine intelligence to the problem of interest. Furthermore, RVM can be a very promising tool to help landowners and managers quickly identify potential soil erosion areas and develop preventive measures. The good performance of RVM may be explained by the fact that this model utilizes Bayesian inference to obtain parsimonious solutions to the soil erosion prediction problem, which in this study is modeled as a pattern classification task. The Bayesian inference of RVM helps to produce a robust classification model that features a small number of relevance vectors. Therefore, the decision boundary constructed from such vectors has good generalization properties and resilience to noise. These facts explain why the predictive accuracy of RVM is better than that of the other machine learning models.
Future extensions of the current work may include the following:
(i) Investigation of the capabilities of other advanced machine learning models (such as tree ensemble, functional tree, gradient boosted regression tree, stochastic gradient tree boost, alternating decision tree, logistic model tree, boosted regression trees, random forest, and naive Bayes variants) in soil erosion prediction
(ii) Collection of more data samples to increase the current data size and therefore enhance the generalization as well as the applicability of the current data-driven models
(iii) Investigation of other influencing factors of soil erosion to improve the explanatory power of the current study

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.