Risk Assessment of Debris Flow in Huyugou River Basin Based on Machine Learning and Mass Flow

The Huyugou river basin is a typical debris flow river basin in Shanxi Province. Debris flows there cause great harm once they break out and seriously affect the safety of people's lives and property. Therefore, it is urgent to carry out a debris flow risk assessment. In this paper, a machine learning algorithm is implemented to assess the disaster susceptibility of each branch gully in the Huyugou river basin. The high-susceptibility branch gullies and the main gully were then selected as the starting points for the numerical simulation of the debris flow. The machine learning algorithm is implemented on a cloud-edge platform to minimize the model training and prediction times. Under the simulated rainfall conditions of major debris flow disasters, such as the one that occurred in 1996, the accuracy rate reached 84%. The results show that the debris flow susceptibility of each branch gully in the study area is mainly affected by the peak flow rate of the river basin, the length of the main gully, and the relative height difference of the river basin. The total risk area of debris flow is 1.91 × 10⁵ m², and the high-risk area accounts for 52.18% of the total area. It is mainly located in the upper part of the main gully accumulation area and at the confluence of each channel with the main gully. The middle-risk area accounts for 36.14% of the total area, and the low-risk area accounts for less. We also observed a significant reduction, of 34.68% to 36.98%, in the training and prediction times of the machine learning models when implemented over the proposed edge-cloud framework. The reproduction of the debris flow in the study area is relatively accurate, which provides a scientific basis for the risk assessment of debris flow in the future.


Introduction
Shanxi Province is located on the west side of the Taihang Mountains, in the middle of the Loess Plateau, and on the eastern edge of the Ordos Basin. The geological structure is complex and the altitude gap is relatively large, so many disasters are likely to happen [1]. Although debris flow disasters occur far less often than other disasters, they pose a serious threat to people's health and property safety, and the degree of risk is self-evident [2]. At the same time, the rise of state-of-the-art computational technologies such as big data, the Internet of things, and cloud computing can enhance safety by introducing some form of monitoring system. As big data and storage technologies become more and more mature, the acquisition of mining data becomes simple and convenient. As a result, the analysis of mining disaster datasets is no longer limited to simple statistical analysis. In fact, machine learning approaches such as logistic regression, decision trees, and XGBoost become more essential. The main problem with machine learning methods is the time required to train the model and then predict the disaster, which should be minimized. The new concept of edge computing can address these issues related to data analysis.
The Huyu River in Taiyuan City is the main river connected with the Xishan coal field. In 1996, a major debris flow disaster occurred there. The flood caused by a rainstorm mixed with solid deposits into a debris flow that rushed through the coal mine and the coal power group [3]. After the debris flow left the mountain area, it continued to move eastward along the street and finally entered the Fenhe River. The affected zone was 15 km long from east to west and covered about 8 square kilometers. The disaster killed many people, and more than 100 people were trapped underground for a very long time. The direct economic loss caused by the damaged houses and roads was as high as CNY 240 million [4]. Given the harm that debris flow does to modern society, the problem of Huyugou has seriously affected the urbanization of Taiyuan City, so it is of great significance to study the river basin promptly in order to monitor and prevent debris flow disasters [5,6].
In this paper, building on the traditional geographic information system (GIS) evaluation system and combining numerical simulation, field investigation, and multiparty data analysis, the random forest (RF) method in machine learning was used to evaluate the susceptibility of debris flow in the study area [7]. Moreover, we also used other methods to assess the correctness and precision of the proposed system. The suggested RF method has been implemented in different modules that run on different layers of the edge computing model. On this basis, the major debris flow that occurred in 1996 was reproduced by numerical simulation and compared with field data, and risk zoning was carried out to provide a practical basis for the monitoring, avoidance, and control of local debris flow disasters.
The major contributions of this study are as follows.
(1) A machine learning algorithm is implemented to assess the disaster susceptibility of each branch gully in the Huyugou river basin. (2) The high-susceptibility branch gullies and the main gully were selected as the starting points for the numerical simulation of the debris flow. (3) The machine learning algorithm is implemented on a cloud-edge platform to improve the training and prediction times.
The rest of the paper is structured as follows. In Section 2, we give an overview of the study area that was used in this research. In Section 3, the evaluation of debris flow susceptibility based on the random forest method is presented. The risk assessment of the debris flow in Huyugou based on mass flow is discussed in Section 4. The evaluation of the proposed methods, the findings, and a discussion are given in Section 5. Finally, Section 6 concludes this study and offers directions for future research.

Overview of the Study Area
The debris flow in the study area is located in the southwest mountainous area of the west mountain in Wanbailin District, Taiyuan City, Shanxi Province. It is a small river basin within the scope of the Huyugou river basin, with an area of about 12.1695 km², and is dominated by low-middle mountains. The terrain gradually descends from west to east. The highest elevation of the main gully in the study area is about 1585.6 m, the elevation at the gully mouth is about 1070 m, and the relative elevation difference is approximately 515.6 m. The study area is located in the interior of the continent, far from the ocean, and the monsoon climate is obvious. The maximum annual rainfall can reach 800 mm, and the spatial and temporal distribution of rainfall is uneven, mostly concentrated in summer, when the rainfall can reach 80% of the total annual rainfall. The average temperature in the region is low, about 2°C∼6°C; the lowest temperature can reach −7°C, and the highest temperature can reach 22.7°C. The study area is located in the Duerping-South Korea fault zone, which is about 26 km long. The fault zone is formed by the Duerping fault and the Yayadi fault. The fault generally strikes northeast and is a normal fault. It should be noted that the strata in the study area consist of Carboniferous and Permian clastic rocks, a small amount of Ordovician carbonate rocks, and Holocene gravel.
There are abundant material sources in the basin, and the main solid source is the product of weathering of the clastic rock layer. The solid material also comes from the loose accumulation produced by widespread slope instability, which is in turn caused by coal mining, road construction, bridge construction, and other activities in the region. Furthermore, it includes domestic waste, cinder, and stone slag, which provide a large amount of material for debris flow. In summary, the Huyugou mining area has been an important area for debris flow disaster prevention and mitigation in Taiyuan City. It is a major task to study, analyze, and simulate the debris flow and to promote the implementation of protection measures. Figure 1 shows the river basin diagram of the study area.

Evaluation of Debris Flow Susceptibility Based on Random Forest Method
In this section, we first illustrate the proposed edge intelligence framework that is used to implement the machine learning algorithms. The main purpose of edge computing is to bring computation closer to where the data is produced. In this way, the data can be preprocessed and made suitable for training. The entire framework is shown in Figure 2. The proposed framework has three layers, namely, the IoT layer, the edge layer, and the cloud layer. The IoT sensors may include cameras and other data collection devices. Once the data is gathered, it can be preprocessed on the edge devices, because the IoT devices have very low processing capabilities. The preprocessing may include data aggregation methods that remove duplicate and unnecessary data. Such duplication may occur when data from overlapping regions are collected. In general, the well-known Euclidean distance is used to identify whether two data points collected through sensors belong to the same region or to two different regions, which is the basis for data aggregation [8,9]. It should be noted that, because (i) the dataset contains no duplicate entries and (ii) the dataset is small, we do not use any aggregation technique in this work. The processed data is then moved to the cloud for long-term storage. In the proposed framework, machine learning algorithms can be used in three different manners: (i) perform the prediction at the edge; (ii) perform the prediction at the cloud; and (iii) train the model on the cloud and perform the prediction on the edge [10]. However, in case (i), different algorithms have different computational times, and the edge, with its limited resources, might not be able to compute quickly. In case (ii), the network is the bottleneck, and predictions will take quite a long time, depending on the data size.
In case (iii), the model is retrained at regular intervals to make sure that the prediction outcomes remain accurate. Figure 3 shows the flow of data between the edge and the cloud in terms of machine learning. The lower part illustrates the scenario in which edge computing is not taken into account. This type of setup might be helpful in offline learning, but it might not be a good option for real-time online learning. The upper part describes two situations: (i) machine learning methods are applied to the stored data while preprocessing happens at the edge, and (ii) machine learning is applied to the reproduced data on the cloud while preprocessing, along with data aggregation, is performed at the edge. The machine learning algorithm is then run in two different modules. The first module is the training module, which runs on the cloud. If enough data is not available, more data can be produced through synthesized workloads [11]. In addition, the IoT sensors continuously collect data and send it to the edge for preprocessing. Subsequently, the processed data is moved to the cloud for training. The second module runs on the edge and predicts unseen situations based on the stored data and the trained model. It should be noted that, to reduce the training time, the amount of data can also be reduced through data aggregation techniques such as those based on the Euclidean distance. In this work, we do not suggest any data reduction mechanism.
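Although no aggregation is applied in this work, the optional Euclidean-distance check described above can be illustrated with a minimal sketch. The function name `aggregate_readings`, the `radius` threshold, and the sample coordinates are all hypothetical and are not taken from the paper; the sketch only shows one simple way an edge node could discard readings that fall in an already covered region.

```python
import math

def aggregate_readings(readings, radius):
    """Greedy de-duplication of sensor readings (illustrative only).

    A reading is kept only if its Euclidean distance to every reading
    already kept is larger than `radius`; otherwise it is treated as
    coming from the same region and dropped.
    """
    kept = []
    for point in readings:
        if all(math.dist(point, other) > radius for other in kept):
            kept.append(point)
    return kept

# Hypothetical 2-D sensor positions; two pairs overlap.
readings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.05), (9.0, 1.0)]
unique = aggregate_readings(readings, radius=0.5)
print(unique)
```

The greedy pass is quadratic in the number of readings, which is acceptable for the small batches an edge device would typically handle; a spatial index would be a natural replacement for larger volumes.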

Principle of Random Forest Method.
The RF (random forest) is one of the most popular algorithms used to solve multiclassification and prediction problems [12][13][14]. It is an ensemble of binary decision trees that are trained independently. It was introduced by Breiman and has an obvious effect in classification and regression problems [15,16]. The RF can be defined as a set of random decision trees. For classification problems, each decision tree is trained on its own, and the final result is estimated by combining the results obtained by the individual trees. The random forest algorithm works as follows: (1) Resample the original data and repeat this several times. (2) In each resampling process, randomly select a group of disaster-pregnant factors as the feature values. (3) Fit a decision tree to each resample and its corresponding disaster-pregnant factor values to obtain the decision tree set. (4) Aggregate the estimated decision tree set in order to obtain a single prediction. Therefore, the basic notion of the RF procedure is to generate multiple decision trees on random subsets [17]. The performance of the RF method predominantly depends on the number of decision trees (Ntree), as well as the number of candidate features included in each subset (mtry) [18]. Larger Ntree values may increase the modeling time, while smaller Ntree values may cause prediction errors. The RF model can reduce the risk of overfitting without any pruning process. The training process involves creating many different bootstrap samples from the original data set. About one-third of the data is excluded from each sample and used as test cases to estimate an unbiased test error, known as the out-of-bag error, which represents the predictive ability of the RF model [19]. For the purpose of classification, the RF model exploits the high variance between individual trees.
This is achieved by letting each tree vote for a class and allocating the class value according to the majority vote. Furthermore, the RF classifier is more accurate and robust than a single classifier and has many advantages; for example, (i) it can handle large databases effectively, and (ii) it offers a way to calculate the proximity between pairs of cases, which can be used to locate outliers [20,21]. The RF algorithm also uses the Gini index as the attribute selection metric to measure the purity of attributes and classes. Assuming that the sample R# corresponding to the characteristic index in the preprocessed data set R* contains J categories, its Gini index is given by the following equation [22]:

$$\mathrm{gini}(R^{\#}) = 1 - \sum_{j=1}^{J} p_j^{2},$$

where p_j is the proportion of samples belonging to the j-th category. After one segmentation, the set R* is divided into m parts {N1, N2, . . ., Nm}. Then, the segmented Gini index ginisplit(T) is given by

$$\mathrm{gini}_{\mathrm{split}}(T) = \sum_{i=1}^{m} \frac{|N_i|}{|R^{*}|}\,\mathrm{gini}(N_i).$$

The final ginisplit(T) is the Gini index corresponding to each feature sample, and the set of these values is denoted G = {g1, g2, . . ., gj}.
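The Gini index and the segmented Gini index described above can be sketched in a few lines of Python, assuming the standard definitions (one minus the sum of squared class proportions, and the size-weighted average over the parts of a split). The function names and the susceptibility labels below are illustrative and do not come from the paper's data.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(parts):
    """Size-weighted Gini index of a split into parts N1..Nm."""
    total = sum(len(part) for part in parts)
    if total == 0:
        return 0.0
    return sum(len(part) / total * gini(part) for part in parts)

# Hypothetical susceptibility labels for four branch gullies.
labels = ["high", "high", "low", "low"]
impurity_before = gini(labels)
pure_split = gini_split([["high", "high"], ["low", "low"]])
mixed_split = gini_split([["high", "low"], ["high", "low"]])
print(impurity_before, pure_split, mixed_split)
```

A split that separates the classes perfectly drives the segmented index to zero, while a split that leaves each part as mixed as the parent gives no improvement, which is why a lower value indicates a better splitting feature.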

Influence Factors of Debris Flow Susceptibility.
The initiation of debris flow is caused by many factors, such as precipitation, topography, geomorphology, and human activity. In this paper, the selection of debris flow factors mainly considers these aspects. The following factors affecting the development of debris flow are selected: river basin area, average slope of the river basin, shape coefficient, channel length, longitudinal gradient of the main gully, relative height difference of the river basin, rainfall, vegetation coverage rate, and the peak flow of the river basin [23]. The actual values are given in Table 1.
Although the scope of the study area is small and the rainfall is basically the same, the stratigraphic lithology is still listed in Table 1 in order to ensure the integrity of the factor selection. The clear water flow of each river basin is calculated by the debris flow clear water flow formula, which is given by

$$Q_b = 0.278\, i\, r\, F,$$

where Q_b represents the clear water flow in the region (m³/s); F represents the river basin area (km²); i is the production flow coefficient, whose value is assumed to be i = 0.9; and r represents the hourly surface rainfall (mm/h). The critical rainfall value of debris flow within 24 hours in Shanxi Province is about 30 mm [24]. According to the characteristics of the Huyugou climate and the analysis of rainfall in Taiyuan City, it is concluded that the daily rainfall should be approximately 120 mm/d when a severe rainstorm triggers debris flow in the Huyu gully [25,26]. The calculation formula of the peak flow Q_c in the debris flow basin is given by

$$Q_c = (1 + \varphi)\, Q_b\, D_c,$$

where φ represents the sediment coefficient of the basin and D_c represents the blockage coefficient in the basin. Thus, the peak flow in each river basin of debris flow can be obtained.
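A minimal sketch of the two flow calculations follows, assuming the commonly used rational-type form Q_b = 0.278 · i · r · F for the clear water flow (0.278 converts mm/h · km² to m³/s) and Q_c = (1 + φ) · Q_b · D_c for the debris flow peak. Only the basin area and the coefficient i = 0.9 come from the text; the rainfall intensity, sediment coefficient, and blockage coefficient used in the example are hypothetical placeholders, and the function names are illustrative.

```python
def clear_water_flow(area_km2, rainfall_mm_per_h, runoff_coeff=0.9):
    """Clear water flow Q_b in m^3/s (assumed form: 0.278 * i * r * F)."""
    return 0.278 * runoff_coeff * rainfall_mm_per_h * area_km2

def debris_peak_flow(q_clear, sediment_coeff, blockage_coeff):
    """Debris flow peak Q_c in m^3/s (assumed form: (1 + phi) * Q_b * D_c)."""
    return (1.0 + sediment_coeff) * q_clear * blockage_coeff

# Basin area from the study-area description; other inputs are hypothetical.
q_b = clear_water_flow(area_km2=12.1695, rainfall_mm_per_h=30.0)
q_c = debris_peak_flow(q_b, sediment_coeff=0.6, blockage_coeff=1.5)
print(f"Q_b = {q_b:.1f} m^3/s, Q_c = {q_c:.1f} m^3/s")
```

In practice the calculation would be repeated for each branch gully with its own area and coefficients, which is how a per-basin peak flow of the kind listed in Table 1 would be obtained.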
According to the geological hazard risk assessment standard and related research results [27,28], the factors are divided into four levels: high (IV), middle (III), low (II), and very low (I), and the classification results are substituted into the random forest method to calculate the weight. e grading standards and weight calculation results are shown in Table 2, and the grading results are shown in Figure 4.

Risk Assessment of Debris Flow in Huyugou Based on Mass Flow
According to the evaluation results of debris flow susceptibility, area 4, area 7, and main gully in the high-susceptibility area are selected for evaluation.

Unit Weight of the Debris Flow.
The unit weight can be determined by roughly three methods, namely, (i) the field investigation method, (ii) the morphological investigation method, and (iii) the standard look-up table method. The debris flow unit weight used in this numerical simulation is mainly determined by the field investigation method, which can be expressed mathematically as follows:

$$\gamma_c = \frac{G_c}{V},$$

where γ_c is the unit weight of the debris flow fluid (t/m³); G_c is the slurry mass (t); and V is the mud volume (m³). As shown in Table 3, the field method is used to investigate the density of the debris flow. The slurry is mixed at the upstream reach of the channel, at the middle and lower reaches of the channel, and at the exit of the channel in the study area, respectively. Multiple experiments are carried out and the average value is finally obtained. The comprehensive analysis shows that the average unit weight of debris flow in the study area is γ_c = 1.602 t/m³; the density is moderate, and the flow belongs to dilute debris flow. At the same time, following the morphological investigation method, the fluid and motion characteristics of the debris flow were described by the affected villagers [29]. It is concluded that the fluid properties of the debris flow should lie between dense debris flow and dilute debris flow; that is, the unit weight is 1.60 t/m³, indicating that the field experiment is accurate.

Debris Flow and Flow Process Line.
The clear water flow and the debris flow in the river basin have already been calculated in the factor calculation above and are not described again here. The method used in this simulation is the widely recognized generalized pentagon method. The method takes 1/3 of the complete debris flow duration as the node and substitutes the peak flow calculated above into the boundary points at 1/3 and 1/4, respectively, so as to describe the flow process line of the debris flow outbreak [30]. Figure 6 shows the simulation results of the debris flow movement process in the study area under the actual rainfall conditions. Figure 7 shows that, under the rainfall conditions of the major debris flow in Huyugou in 1996, the maximum velocity of the debris flow is 5.53 m/s∼6.41 m/s and the maximum mud depth is 5.1 m∼6.5 m; these maxima are located in the middle and upper part of the gully accumulation area and at the confluence of each channel with the main gully. Note that the measured total risk area is approximately 2.28 × 10⁵ m², the numerical simulation risk area is 1.91 × 10⁵ m², and the accuracy is about 84%.
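The reported accuracy can be checked with a short worked calculation, assuming that it is defined as the ratio of the simulated risk area to the measured risk area (the text does not state the definition explicitly; the symbols A_sim and A_meas are introduced here only for illustration):

```latex
\[
\text{accuracy} \approx \frac{A_{\mathrm{sim}}}{A_{\mathrm{meas}}}
= \frac{1.91 \times 10^{5}\ \mathrm{m^2}}{2.28 \times 10^{5}\ \mathrm{m^2}}
\approx 0.838 \approx 84\%.
\]
```

Under this assumption, the simulation underestimates the measured extent by roughly 0.37 × 10⁵ m².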

Risk Assessment of Debris Flow.
According to the study of Xu [31] in Shanyang County (Table 4), the hazard zoning of the debris flow in Huyugou in 1996 is carried out, as shown in Figure 5.
The results of the debris flow hazard evaluation show that the total area of the debris flow hazard zone is 1.91 × 10⁵ m², and the high hazard zone accounts for 52.18% of the total area; it is mainly located downstream of the main gully and at the intersection of the branch gullies with the main gully. Furthermore, the medium-risk area is 0.69 × 10⁵ m², accounting for 36.14%, and the low-risk area is relatively small. In general, the study area is a relatively dangerous debris flow zone that needs strict prevention.
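The zoning figures above can be cross-checked with a few lines of arithmetic. This sketch assumes that the high-, medium-, and low-risk zones together make up the whole hazard zone, so that the low-risk share is the remainder; that share is derived here and is not reported in the text. The variable names are illustrative.

```python
# Reported values from the hazard evaluation.
total_area_m2 = 1.91e5
high_share = 0.5218
medium_share = 0.3614

# Assumption: the three classes partition the total hazard zone.
low_share = 1.0 - high_share - medium_share

areas_m2 = {
    "high": high_share * total_area_m2,
    "medium": medium_share * total_area_m2,
    "low": low_share * total_area_m2,
}

for level, area in areas_m2.items():
    print(f"{level:>6}: {area / 1e5:.2f} x 10^5 m^2")
```

The medium-risk area obtained this way rounds to the reported 0.69 × 10⁵ m², which suggests that the percentages and the areas in the text are mutually consistent.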

Machine Learning and Edge-Cloud Results.
In this section, we discuss the results of the machine learning techniques when the training and prediction models run on different platforms. From the algorithm perspective, we consider two different machine learning algorithms, namely, (i) random forest (RF) and (ii) CNN. Each algorithm runs in two phases: (i) training and (ii) prediction. From the platform perspective, we use different scenarios. In scenario A, we assume that both phases of each algorithm run on the edge. In this case, since the data is stored on the cloud, we assume that the required data is moved to the edge. Once the data has been used, it is deleted from the edge server. In scenario B, we assume that both phases run on the cloud. In scenario C, we assume that the training happens on the cloud while the prediction runs on the edge server. We report the durations of the training and prediction phases [32]. The results are given in Table 5. The findings suggest that, for the various algorithms, the response time can be significantly decreased (by 24.64% to 33.24%) using the proposed cloud-edge platform. Furthermore, we also noted a reduction of approximately 34.68% to 36.98% in the prediction durations. This improvement is possible at some cost of prediction duration. Finally, we observed that the RF method outperforms the classical CNN approach (by approximately 25.54%), but we believe these outcomes will change with the amount of data.

Discussion and Model Accuracy
In this section, we briefly discuss the findings of this research and the accuracy of the machine learning methods. After verification, this paper reaches the following conclusions: (1) The debris flow susceptibility of each branch gully in the study area is mainly controlled by the peak flow rate of the river basin, the length of the main gully, and the relative height difference of the river basin. There are 12 branch gullies in the region: 2 high-prone, 7 middle-prone, and 3 low-prone. (2) Through the preceding multifactor superposition analysis and parameter calculation, the motion state of the debris flow in the study area is reproduced by numerical simulation. The simulation results show that the mud depth is largest, about 6.5 m, at the accumulation area at the gully mouth and at the intersection of the branch gullies with the main gully, and that the maximum velocity is 6.41 m/s, at the middle and lower reaches of the gully and on steep terrain. By testing the goodness of fit of the simulation results, the accuracy is found to be about 84%. The high-risk areas of debris flow in the study area account for 52.18%. The reproduction of the debris flow in the study area under the rainfall conditions of the major debris flow in 1996 is relatively close to the observed event, which provides scientific support for the comprehensive evaluation and risk zoning of debris flow in the future.
The experimental findings were assessed using different evaluation metrics, i.e., (i) precision or accuracy, (ii) recall rate, (iii) F1-measure, and (iv) IoU. Accuracy is the proportion of correctly predicted samples among all predicted samples. The recall rate is the proportion of correctly predicted positive samples among all real positive samples. The F1 score is the harmonic mean of the recall rate and the precision (accuracy). Finally, the IoU is the intersection of the pixels labelled as building in the ground truth and in the predicted outcomes, divided by the union of the pixels labelled as building in the ground truth and in the predicted outcomes [8].
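The calculation formulas are as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{IoU} = \frac{TP}{TP + FP + FN},$$

where TP stands for the number of correctly extracted pixels, FP for the number of incorrectly extracted pixels, and FN for the number of missed pixels. The accuracy of the RF and CNN methods is shown in Figure 8. We can observe that the RF method is more accurate than the CNN approach in terms of all evaluation metrics.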
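The four metrics can be computed directly from the TP, FP, and FN counts. The sketch below assumes the standard definitions of precision, recall, F1, and IoU; the function name and the example counts are hypothetical and are not results from this study.

```python
def evaluation_metrics(tp, fp, fn):
    """Return precision, recall, F1 and IoU from pixel counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}

# Hypothetical counts, for illustration only.
metrics = evaluation_metrics(tp=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because IoU penalizes both false positives and false negatives in a single denominator, it is never larger than either precision or recall, which makes it the strictest of the four metrics.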

Conclusions and Future Work
Based on the investigation of debris flow disasters in Shanxi Province, this paper selects the debris flow in the study area as a representative river basin for analysis, in order to explore a relatively reasonable evaluation method for debris flow in the Loess Plateau, especially in Shanxi Province. The method in this paper is mainly based on the weight calculation of the random forest method and on the combination of multifactor superposition and numerical simulation. Through the evaluation of various factors in the river basin, namely, rainfall, topography, and geomorphology, the susceptibility of debris flow in each channel in the region is evaluated, and this is used to determine the main material source of the debris flow. Numerical simulation is combined with the results of the multifactor analysis to simulate the movement characteristics of debris flow under these conditions and to carry out risk zoning. The two approaches complement each other, and the evaluation of debris flow becomes a more detailed process. The results are more reasonable than those obtained in a single way.
In the future, we will take into account deep learning techniques that are better suited to mines and operational monitoring systems, such as graph convolutional networks (GCN), U-net, and attention networks. As we saw, however, not all neurons can be stimulated by the activation function used in this paper, which results in restricted precision and accuracy. As a result, finding the best activation function and improving the model's structure are ongoing research projects. Similarly, we will look into the effects of the activation functions employed in conjunction with deep learning techniques. To enhance the performance of the suggested system, robust data reduction or aggregation approaches should also be investigated.
Data Availability
The raw/processed data required to reproduce these findings cannot be shared at this time as the data also form part of an ongoing study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Mobile Information Systems