Prediction of Power Outage Quantity of Distribution Network Users under Typhoon Disaster Based on Random Forest and Important Variables



Introduction
Typhoon disasters may lead to large-area power outages for distribution network users. Predicting the power outage quantity of distribution network users under a typhoon disaster can effectively improve the accuracy of disaster prevention and reduction. It can also shorten the outage time of distribution network users, reduce power outage loss, and improve user satisfaction.
Under typhoon disaster, many factors affect the power outage of distribution network users, including meteorological factors, geographical factors, power grid factors, and so on [1]. If a traditional model-driven method is used to predict the power outage quantity, the model will be complex and difficult to solve. In addition, with the increase and normalization of power outage data of distribution network users, it has become possible to predict the power outage quantity of distribution network users using a data-driven method [2,3].
At present, some scholars have successfully used data-driven methods to assess the risk to power systems under typhoon disasters. Statistical learning models, such as linear models, were first applied to evaluate power outages in hurricane weather in [4]. However, that work mainly focused on the fitting effect of the model rather than on prediction accuracy. The impact of soil and terrain on the power outage of distribution network users was studied in [5] based on classification and regression trees (CART). However, it did not address the improvement of prediction accuracy. To make the models more comprehensive, many scholars have taken more influence factors into consideration. Considering influence factors such as maximum wind speed, wind speed duration, and rainfall, a cumulative time failure model was used to predict power grid outages under hurricanes in [6]. Considering meteorological, geographical, and social information, models of equipment failure rate under natural disasters were established in [7]. On this basis, data-driven methods are widely used to assess the risk to power systems under natural disasters. An ice cover risk assessment model for power systems based on fault trees was proposed in [8]. It involved the effective assessment of transmission line risk, line break, and tower collapse. Based on relevant public data affecting the power system, prediction models of power outage rate under disasters were established through data mining in [9][10][11]. In addition, a method for predicting the risk level of power outage in distribution networks that takes weather factors into account was presented in [12]. However, the risk level was classified, while factors such as region were not taken into consideration. Based on support vector machine (SVM) and grey prediction technology, a reliability prediction model for transmission line operation was proposed in [13]. It considered factors such as the running time of components and the region where the components are located. Considering storm, rainstorm, high temperature, and other weather factors comprehensively, a prediction model of the original parameters of a power system based on fuzzy clustering and similarity degree was proposed in [14]. This model considered most climatic factors but did not further evaluate the damage to the power grid.
In order to improve the prediction accuracy, the prediction area was first meshed in [15]. To carry out distribution network planning in a scientific and reasonable way, a multistage grid division method for distribution networks was proposed in [16]. Then, based on geographical grid division, a negative binomial regression model was used to predict the power outage quantity of distribution network users under hurricanes [17]. Based on weather and land cover data, the spatial distribution of power outages in a 2-kilometer grid was predicted using Boosted Trees [18]. In addition, a support vector machine was used to predict the number of distribution towers in a 3-kilometer grid [19]. However, because of the coarse grid division, the characteristic values of the variables vary considerably within a grid, which makes the resulting sample data inaccurate and degrades the final prediction accuracy.
In light of the above, this paper proposes a prediction method for the power outage quantity of distribution network users based on the Random Forest (RF) algorithm. The main innovative contributions of the paper can be summarized as follows: (1) A data sample space with twenty-six explanatory variables covering meteorological factors, geographical factors, and power grid factors is constructed. In addition, to better understand the relationship between the explanatory variables and the response variable, the correlation between each explanatory variable and the response variable is analyzed. (2) To take as many variables into account as possible, an RF-global variable model covering all twenty-six explanatory variables is established to predict the power outage quantity of distribution network users.
The remainder of this paper is organized as follows. The framework of the proposed prediction model is described in Section 2. In Section 3, the data sample space is introduced, and the relationship between each explanatory variable and the response variable is analyzed. The RF algorithm and the evaluation indicators are described in Section 4. In Section 5, the prediction model based on all 26 explanatory variables and RF is built. In Section 6, the prediction model based on 8 important explanatory variables and RF is built, and the errors of the No-model, LR, SVR, DTR, and the two proposed models are analyzed. Finally, Section 7 is the conclusion.

Prediction and Evaluation Framework of Power Outage Quantity of Distribution Network Users
The prediction framework of power outage quantity of distribution network users established in this paper is shown in Figure 1.
Firstly, create a data sample space. To consider as many as possible of the collectible variables that may have an impact on the results, twenty-six explanatory variables are collected. The explanatory variables include meteorological factors (such as maximum wind speed, wind direction, rainfall, etc.), geographical factors (such as altitude, slope, underlay type, etc.), and power grid factors (such as number of distribution network users, number of box transformers, line

Data Sample Space Construction
The power outage quantity of distribution network users under typhoon disaster is affected by many factors. Therefore, the data of the prediction model are first described and the sample data space is constructed.

Analysis of Explanatory Variables.
Similar to the spatial distribution of distribution network users' power outages [1], the factors affecting the power outage quantity of distribution network users under typhoon disaster include meteorological factors, geographical factors, and power grid factors. Among the power grid factors, the failure of the distribution network line mainly refers to the failure of the 10 kV overhead line. The cable is generally laid underground with insulation and protective layers, and its failure has little to do with the impact of typhoons and rainstorms. Therefore, this article only considers the power outage of the distribution network caused by the failure of the 10 kV overhead line exposed to the outdoor environment. In this paper, as many explanatory variables as possible are included to explore the relevant factors affecting the power outage quantity of distribution network users and to improve the accuracy of the prediction model. The selected explanatory variables of the prediction model are shown in Table 1 [20].
This paper establishes the prediction model of the power outage quantity of distribution network users on the basis of the sample data of three historical typhoons (Rammasun in 2014, Kalmaegi in 2014, and Mujigae in 2015) affecting Xuwen county, Guangdong province, China [21][22][23]. The data are provided by the meteorological bureau and the Electric Power Research Institute of Guangdong Power Grid Co., Ltd, China. In this paper, the study area is divided into 1641 samples; each sample represents a grid of 1 km × 1 km. The variable X1 is the maximum wind speed of each grid over the whole typhoon. Based on the regional grid division of 1 km × 1 km, each typhoon produced 1641 samples with a total of 28 characteristic variables. Hence, the size of the entire sample space is $\Phi = (X, y)_{4923 \times 28}$. The variables in the meteorological factors and geographical factors are provided in the form of 1 km × 1 km data points. The Inverse Distance Weighted interpolation method is used to transform the data into continuous area data, and the meteorological and geographic information is then extracted on this basis.
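The inverse distance weighting step mentioned above can be illustrated with a minimal sketch. The exponent `power=2.0`, the function name `idw_interpolate`, and the station values are assumptions made for illustration; the paper does not state the exponent or the implementation it used.

```python
import math

def idw_interpolate(points, query, power=2.0):
    """Inverse-distance-weighted estimate at `query` from (x, y, value) points.

    `power` is an assumed exponent; the paper does not specify one.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in points:
        dist = math.hypot(query[0] - x, query[1] - y)
        if dist == 0.0:
            return value  # query coincides with a data point
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical wind-speed observations (x in km, y in km, speed in m/s).
stations = [(0.0, 0.0, 30.0), (1.0, 0.0, 34.0), (0.0, 1.0, 32.0), (1.0, 1.0, 36.0)]
center_value = idw_interpolate(stations, (0.5, 0.5))
```

Because the query point in this example is equidistant from all four stations, every station receives the same weight and the estimate reduces to their plain average.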

Analysis of Response Variable.
In this paper, the power outage quantity of distribution network users under the typhoon disaster is predicted. Therefore, the power outage quantity of distribution network users Y1 is taken as the response variable. The descriptive statistics of the power outage quantity of distribution network users are shown in Table 2.
As shown in Table 2, the power outage quantity of distribution network users Y1 ranges from 0 to 6121. The average outage quantity is 70.51, and the standard deviation is 297.12. Three quartiles of 25%, 50%, and 75% are used to explore the distribution of results. It can be seen that the samples are mainly concentrated in the range of small data values. The probability distribution diagram of the response variable Y1 is shown in Figure 2.
In order to eliminate the influence of the wide range of the power outage quantity of distribution network users, this paper normalizes this value and converts the response variable into the proportion of power outage. The proportion of power outage Y2 is equal to the number of power outage users Y1 divided by the number of distribution network users X20, that is, Y2 = Y1/X20. Unless otherwise specified, the response variable in the following refers to the proportion of power outage Y2.
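As a small illustration of this normalization, the sketch below converts an outage count into a proportion and back. The function names are hypothetical, and assigning 0.0 to grids without users is an assumption; the paper does not say how such grids are treated.

```python
def outage_proportion(outage_users, total_users):
    """Y2 = Y1 / X20 for one grid.

    Assumption: a grid with no users is assigned a proportion of 0.0.
    """
    if total_users == 0:
        return 0.0
    return outage_users / total_users

def outage_quantity(proportion, total_users):
    """Map a proportion back to a user count (Y1 = Y2 * X20)."""
    return proportion * total_users
```

The inverse mapping is what allows a model trained on the proportion to report a predicted number of affected users for each grid.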

Correlation Analysis between Each Explanatory Variable and Response Variable.
In order to intuitively show the relationship between each explanatory variable and the response variable Y2, the scatter diagram between each explanatory variable and the response variable is visualized, as shown in Figure 3.
As can be seen from Figure 3, there is no significant linear relationship between each explanatory variable and the response variable, indicating that a linear model will perform poorly. In order to further explore the relationship between each explanatory variable and the response variable, the Pearson correlation coefficient is used for quantitative correlation analysis. Given two variables X and Y, the corresponding Pearson correlation coefficient [24] is calculated as follows:
$$r_{xy} = \frac{\mathrm{COV}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} \tag{1}$$
where COV represents covariance and Var represents variance. If $|r_{xy}| < 0.4$, then X and Y are weakly correlated; if $0.4 \le |r_{xy}| < 0.7$, then X and Y are significantly correlated; if $0.7 \le |r_{xy}| < 1$, then X and Y are strongly correlated. As can be seen from Figure 4, among the explanatory variables, there is a strong positive correlation between the distribution network users (X12), maximum wind speed (X1), wind speed duration (X6, X7), rainfall (X3), and the power outage proportion (Y2), while the correlation between the other explanatory variables and the power outage proportion is weak.
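The Pearson coefficient and the three strength bands described above can be sketched in plain Python as follows. The function names are illustrative; a coefficient of exactly ±1 is grouped with "strong" here for convenience.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The common 1/n factors of covariance and variances cancel, so sums suffice.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_strength(r):
    """Classify |r| using the 0.4 and 0.7 thresholds given in the text."""
    magnitude = abs(r)
    if magnitude < 0.4:
        return "weak"
    if magnitude < 0.7:
        return "significant"
    return "strong"

r_example = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear toy data
```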
In order to find out whether there is a correlation among the explanatory variables, the correlation heat map is shown in Figure 5.
As can be seen from Figure 5, there is a strong positive correlation between maximum wind speed (X1) and rainfall (X3), and between wind speed duration (X6, X7) and landing area (X11). That is, when a typhoon lands in the study area, it is accompanied by high wind speed and precipitation, and high wind speed makes the wind speed last longer.

Principle of Random Forest Algorithm.
The main objective of supervised learning is to estimate the unknown function f that relates the response variable Y (such as the power outage quantity) to the d-dimensional vector of relevant inputs X (such as meteorological features, geographical features, and power grid features). That is, Y = f(X) + e, where e is the irreducible error. By minimizing a loss function L that represents the deviation between the observed value and the predicted value, the best estimate of the unknown function f can be selected.
This is the idea of the supervised regression learning algorithm. Random Forest (RF) is a nonparametric, tree-based ensemble data mining algorithm. Unlike a single regression tree, which has high variance and low bias, RF overcomes the problem of high variance by model averaging. In addition, when the number of input variables is large, RF has better precision than other classical machine learning algorithms [7]. Hence, this paper establishes a prediction model for the power outage quantity of distribution network users based on the RF algorithm. The final RF output estimate is the average of the predictions of all the trees, expressed as follows:
$$\hat{f}_{RF}(x) = \frac{1}{M} \sum_{m=1}^{M} T_m(x) \tag{2}$$
where M is the number of regression trees in RF, and $T_m(x)$ represents the model constructed by the m-th regression tree. The advantage of this method is that it captures the nonlinear structure of data well and is robust to outliers and noise, with strong prediction accuracy.
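The averaging idea can be illustrated with a deliberately simplified sketch: bootstrap samples, one-split regression trees (stumps) on a single input, and an unweighted mean of the tree outputs. This is not the paper's implementation. A full RF grows deeper trees and also samples a random subset of features at each split, which is omitted here because the toy input is one-dimensional. All names and data are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D inputs by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all inputs identical: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def forest_predict(trees, x):
    """Average of all tree predictions."""
    return sum(tree(x) for tree in trees) / len(trees)

# Hypothetical data: wind-speed index vs. outage proportion.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0.0, 0.0, 0.1, 0.1, 0.8, 0.9, 0.9, 1.0]
trees = fit_forest(xs, ys)
```

Each stump is a crude, high-variance step function; the averaged prediction is smoother than any single stump, which is the variance-reduction effect described above.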

The Evaluation Indicators.
After the construction of the prediction model for the power outage quantity of distribution network users under typhoon disaster, it is necessary to evaluate the advantages and disadvantages of the model. In this paper, the evaluation indexes of the regression model are Mean Absolute Error (MAE), Mean Square Error (MSE), and Root Mean Square Error (RMSE). Suppose the data set is $(x_i, y_i)$, $i = 1, 2, \ldots, n$, and the prediction regression function is $f(x)$; then the error expressions are as follows:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - f(x_i) \right|, \quad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2, \quad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2} \tag{3}$$
In this paper, $y_i$ represents the actual power outage quantity of distribution network users in the i-th grid, and $f(x_i)$ represents the predicted power outage quantity of distribution network users in the i-th grid.
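The three error measures named above can be computed together, as in the following minimal sketch. The function name and the toy values are illustrative only.

```python
import math

def regression_errors(actual, predicted):
    """Return MAE, MSE, and RMSE for paired actual and predicted values."""
    n = len(actual)
    residuals = [y - f for y, f in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}

# Toy outage proportions for three grids (hypothetical values).
errors = regression_errors([0.0, 0.5, 1.0], [0.0, 0.0, 0.5])
```

Note that when errors are averaged over many random splits, the mean RMSE generally differs from the square root of the mean MSE, so the two reported values need not match exactly.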

RF-Global Variable Modeling and Analysis
To explore the potential relationship between each explanatory variable and the response variable as fully as possible, the global variables (all explanatory variables) are used in this section, and the importance of the variables is analyzed to identify their contribution to the prediction model.

Analysis of Prediction Results.
Firstly, 80% of the samples are randomly selected from the sample data for model training, and the remaining 20% are used for model testing. Then, this procedure is repeated 100 times. Finally, the average values of MAE, MSE, and RMSE are obtained, as shown in Table 3.
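The repeated random 80/20 evaluation described above can be sketched as follows. The helper names (`repeated_holdout`, `fit_mean`, `mae`) and the data are hypothetical, and the placeholder model simply predicts the training mean; it stands in for the RF model only to keep the example self-contained.

```python
import random

def repeated_holdout(samples, fit, score, n_repeats=100, train_fraction=0.8, seed=0):
    """Average a test-set score over repeated random train/test splits."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * len(samples)))
    scores = []
    for _ in range(n_repeats):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        model = fit(train)
        scores.append(score(model, test))
    return sum(scores) / len(scores)

# Placeholder model (NOT the paper's RF): always predict the training mean of y.
def fit_mean(train):
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def mae(model, test):
    return sum(abs(y - model(x)) for x, y in test) / len(test)

data = [(i, (i % 5) / 4.0) for i in range(50)]  # hypothetical (x, y) pairs
avg_mae = repeated_holdout(data, fit_mean, mae)
```

Averaging over many splits reduces the dependence of the reported error on any single random partition of the samples.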
As shown in Table 3, the prediction model of the power outage quantity is constructed with the proportion of power outage as the response variable. The MAE, MSE, and RMSE of the test errors are 0.1497, 0.0613, and 0.2474, respectively. To intuitively reflect the prediction effect, new model evaluation indexes ±100/±200/±300 (if the deviation between the predicted quantity and the actual quantity is within 100/200/300, the prediction is considered accurate) and ±10%/±20%/±30% (if the proportion of the deviation between the predicted quantity and the actual quantity is within ±10%/±20%/±30%, the prediction is considered accurate) are added. The accuracy analysis of the power outage quantity prediction model is shown in Table 4.
As shown in Table 4, the accuracy rate of prediction error within ±100/±200/±300 is higher than 90%. However, considering the small number of distribution network users in most actual grids, evaluating the model with a fixed error may overestimate the predictive effect of the model. Therefore, the evaluation index ±10%/±20%/±30% based on floating error is constructed, in which the accuracy of the error within ±10% is 0.7546, within ±20% is 0.8320, and within ±30% is 0.8660. As can be seen from Tables 3 and 4, the RF-based prediction method for the power outage quantity of distribution network users proposed in this paper performs well.
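The fixed-error and floating-error accuracy indexes can be sketched as below. The text does not state the denominator of the percentage deviation, so measuring it against the number of users in the grid is an assumption of this sketch, as are the function names and sample values.

```python
def accuracy_within_count(actual, predicted, tolerance):
    """Share of grids whose absolute deviation is at most `tolerance` users."""
    hits = sum(1 for y, f in zip(actual, predicted) if abs(f - y) <= tolerance)
    return hits / len(actual)

def accuracy_within_share(actual, predicted, users, tolerance):
    """Share of grids whose deviation relative to the grid's user count is within `tolerance`.

    Assumption: the relative deviation is |predicted - actual| / users in the grid.
    Grids without users fall back to the raw deviation.
    """
    hits = 0
    for y, f, n_users in zip(actual, predicted, users):
        deviation = abs(f - y) / n_users if n_users > 0 else abs(f - y)
        if deviation <= tolerance:
            hits += 1
    return hits / len(actual)

# Hypothetical values for four grids.
actual = [10, 200, 0, 50]
predicted = [12, 350, 0, 45]
users = [100, 1000, 0, 60]
acc_100 = accuracy_within_count(actual, predicted, 100)
acc_10pct = accuracy_within_share(actual, predicted, users, 0.10)
```

A small grid can pass the fixed ±100 test trivially even when the prediction is far off in relative terms, which is why the floating-error index is the stricter of the two for sparsely populated grids.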

Assessment of Variable Importance.
As many explanatory variables as possible were selected in the early stage of modeling. However, this may lead to a large workload of data collection and processing in the actual application of the model. In order to evaluate the contribution of each explanatory variable in the prediction model and reduce the pressure of data collection, the importance of explanatory variables is evaluated.
In the RF model, the importance ranking is by default calculated from the degree of chaos (impurity/Gini coefficient). That is to say, the importance of a feature is measured by how much the feature reduces the chaos while the decision trees of the random forest are built [25]. After aggregating all the trees, the feature with the greater average decrease is regarded as more important. The problem is that, when features are continuous or when categorical factors have many categories (high-cardinality categorical variables), this method tends to inflate the importance of these features. Thus, the Permutation Importance Measure is used in this paper to solve this problem. The specific method of variable importance evaluation based on RF is as follows: (1) The original error on the test data or OOB (out-of-bag) data in the random forest (such as the OOB data error, denoted as $err_{OOB1}$) is taken as the baseline. (2) One of the features to be measured is permuted; that is, its values are scrambled and rearranged. Then the model is run again on the same data set to calculate the new error, denoted as $err_{OOB2}$. (3) The difference between the new error and the baseline error is calculated. The larger the difference, the more important the feature is. Assuming that there are n trees in RF, the importance of the feature is $\frac{1}{n}\sum (err_{OOB2} - err_{OOB1})$, where the sum runs over the n trees.
In this process, the data do not need to be standardized, and the final importance values do not sum to 1 but give a relative ranking.
The importance analysis diagram of global variables is shown in Figure 6.
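The permutation procedure can be sketched for a single fitted model evaluated on one held-out data set. This is a simplification: the paper describes averaging the error difference over the trees using OOB data, whereas the sketch below shuffles one column at a time and measures the increase in mean squared error of a hypothetical `predict` function. The names `permutation_importance` and `toy_model` and the data are illustrative.

```python
import random

def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in mean squared error when each feature column is shuffled."""
    rng = random.Random(seed)

    def mse(data):
        return sum((predict(r) - y) ** 2 for r, y in zip(data, targets)) / len(targets)

    baseline = mse(rows)  # error before any permutation
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(permuted) - baseline)
    return importances

# Hypothetical fitted model that depends on feature 0 only.
def toy_model(row):
    return 2.0 * row[0]

rows = [[float(i), float((i * 7) % 10)] for i in range(10)]
targets = [2.0 * r[0] for r in rows]
scores = permutation_importance(toy_model, rows, targets, n_features=2)
```

Shuffling a feature that the model ignores leaves the error unchanged, so its importance is zero, while shuffling a feature the model relies on raises the error.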
As can be seen from Figure 6, the explanatory variables such as longitude, latitude, maximum wind speed, wind direction, rainfall, number of users of distribution network, line length, and altitude contribute greatly to the accuracy of the prediction model. However, the explanatory variables such as landing time, landing area (whether landing in the

Variable Dependency Analysis.
The classical Partial Dependence Plots (PDP) [26] help visualize the average relationship between the response variable and one or more of the characteristics. When a specified characteristic changes over its marginal distribution, the PDP plots the change in the average predicted value. With the help of the PDP, the trained supervised learning model can be better understood.
In order to formally define the PDP, let $S \subset \{1, \ldots, p\}$, let $C$ be the complement of $S$, and let $S \cup C = m$, where $m$ is the set of all characteristics. Then, the partial dependence function of the characteristic subset $x_S$ is as follows:
$$f_S(x_S) = \mathbb{E}_{x_C}\left[ f(x_S, x_C) \right] = \int f(x_S, x_C)\, dP(x_C) \tag{4}$$
Since $f$ and $dP(x_C)$ are unknown, equation (4) can be estimated by the following equation:
$$\hat{f}_S(x_S) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}(x_S, x_{Ci}) \tag{5}$$
where $n$ is the number of samples in the training set, and $x_{C1}, \ldots, x_{Cn}$ represent the different values of the characteristic set $x_C$ in the training set. When the characteristic set $x_S$ contains only one characteristic variable $x_j$, $j = 1, 2, \ldots, m$, the partial dependence function of $x_j$ is:
$$\hat{f}_j(x_j) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}(x_j, x_{Ci}) \tag{6}$$
where the PDP value $\hat{f}_j(x_j)$ of the characteristic variable $x_j$ represents the average output of the regression prediction function when $x_j$ is fixed and the remaining characteristics vary over their marginal distribution.
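The sample-average estimate of the partial dependence for a single characteristic can be sketched as follows: fix the characteristic at a grid value in every training row, predict, and average. The `predict` function, the name `partial_dependence`, and the data are hypothetical stand-ins for a fitted model and its training set.

```python
def partial_dependence(predict, rows, feature, grid):
    """For each grid value v, set `feature` to v in every row and average the predictions."""
    curve = []
    for v in grid:
        total = 0.0
        for r in rows:
            modified = r[:feature] + [v] + r[feature + 1:]  # other features keep observed values
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

# Hypothetical fitted model, not the paper's RF.
def toy_model(row):
    return 0.5 * row[0] + row[1] ** 2

rows = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
pdp = partial_dependence(toy_model, rows, feature=0, grid=[0.0, 2.0])
```

In this toy case the model is additive, so the curve for feature 0 is a straight line with slope 0.5, shifted by the average contribution of feature 1.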
To analyze the impact of the characteristics of variables on the response variable, this paper analyzes the nine most important explanatory variables for modeling (longitude X18, latitude X19, number of distribution network users X20, maximum wind speed X1, rainfall X3, line length X26, whether there are distribution users X12, wind direction X2, and altitude X13) based on the variable importance analysis. The partial dependency is shown in Figure 7.
It can be seen from Figure 7 that longitude and latitude have a positive influence on the power outage of distribution network users; that is, an increase in longitude and latitude leads to an increase in the influence on distribution network users. The main reason may be that the region studied in this paper is a coastal region. The closer a region is to the sea, the stronger the typhoon's attack on its distribution network users and the more serious the impact. However, the dependence of the model on the number of distribution network users is not obvious, and the influence is relatively stable. Moreover, the greater the maximum wind speed and rainfall of a typhoon, the greater the impact of the typhoon on distribution network users. In the geographic information, altitude is negatively correlated with the power outage of distribution network users; that is, the higher the altitude in the region, the smaller the influence on power outage of distribution network users, which is consistent with the influence trend of longitude and latitude. As for the line length, its influence is positive: the longer the line, the higher the probability of power outage of distribution network users. For the categorical variable indicating whether there are distribution network users, there is a relatively obvious positive correlation, because a power outage can occur under the typhoon disaster only if there are distribution network users in the grid. For the wind direction, no obvious correlation is shown in the PDP chart. The main reason may be that the wind direction data change rapidly and the model is not able to capture their characteristics. Besides, the wind direction is not constant under a typhoon disaster, so it is difficult to select an appropriate quantitative description. Thus, we decided not to take it as one of the variables in the RF-important variable model.
Since longitude and latitude, wind speed and direction, wind speed and rainfall often occur simultaneously, the characteristic dependence of the two variables of these combinations is analyzed, as shown in Figures 8-10.
As shown in Figure 8, the combination of longitude and latitude can locate an area. When the longitude is large and the latitude is small, it has a greater impact on power outage of distribution network users. The region is located in the southeast corner of the study area, closer to the landfall area of the typhoon.
In general, high wind speed tends to bring rain and aggravate the impact on power distribution network users. As shown in Figure 9, the greater the wind speed and greater the rainfall, the greater the impact on power distribution network users.
As shown in Figure 10, there is no obvious correlation between wind direction and power outage of distribution network users. It shows that the wind direction has little influence on the power outage of users, so it can be removed. In addition, the higher the wind speed, the greater the probability of power outage for distribution network users.

Analysis of Modeling Important Variables
In Section 5, global variables are used for modeling, and the prediction results of power outage quantity of distribution network users are evaluated and analyzed. Based on historical data, more explanatory variables could be mined to support the accuracy of power outage quantity prediction. However, in reality, some explanatory variables are difficult to obtain, such as wind speed duration of 20 m/s and 30 m/s. In addition, many variables contribute little to prediction accuracy. Therefore, this section analyzes and compares the prediction accuracy of models considering global variables and important variables, so as to increase the efficiency and availability of the model.

Model Training Test Analysis.
According to the analysis results of the above section, the eight explanatory variables that are most important to the predicted results are selected in this section to train the power outage quantity prediction model: longitude X18, latitude X19, maximum wind speed X1, rainfall X3, number of distribution network users X20, line length X26, whether there are distribution users X12, and altitude X13. Of all the samples, 80% are randomly selected as the training set and the remaining 20% as the test set, and this random split is repeated 100 times. The error results of the training and test sets are shown in Table 5. It can be seen from Table 5 that, when modeling with the important characteristic variables, the test set MAE is 0.1366, MSE is 0.0580, and RMSE is 0.2406, and the overall prediction effect is good. The prediction accuracy of the model under the alternative evaluation indexes is shown in Table 6. Table 6 shows that, when the eight important variables are used for prediction model training, the accuracy within ±100/±200/±300 reaches 0.9346, 0.9706, and 0.9852, and the accuracy within ±10%/±20%/±30% is 0.7582, 0.8345, and 0.8822, respectively. The prediction accuracy of the model is close to that of the RF-global variable model, indicating that building a prediction model with fewer, more important variables does not significantly reduce the accuracy of the model but makes the process of predicting and evaluating the power outage quantity simpler and faster (saving the time needed to collect and sort out the remaining variables). Furthermore, it accelerates the assessment of the power outage quantity of distribution network users under the typhoon disaster and prepares the conditions for further emergency decision-making. It can be seen from Figure 11 that, except for a few points, the difference between the actual value and the predicted value of most points is around 0. This indicates that the model for predicting the number of distribution network user outages fits the data well.

Comparative Analysis of Models.
In order to further analyze the model built on important variables in this paper, a No-model [27] and three other machine learning algorithms are compared with the trained RF models based on global variables and important variables, as shown in Table 7. The No-model uses the average value of the samples as the prediction value, and the three machine learning algorithms are LR, SVR, and DTR. At the same time, in order to visually demonstrate the prediction effect of each model, a histogram of the error analysis of each model is shown in Figure 12. Table 7 and Figure 12 show that the RF-based prediction model of power outage quantity of distribution network users in this paper has a better prediction effect. Whether based on global variables or important variables, its MAE, MSE, and RMSE are all smaller than those of the other three machine learning algorithms. This improves the efficiency of prediction and evaluation, and provides an effective basis for the early allocation of emergency repair resources, the reduction of power outage loss, and the improvement of distribution network user satisfaction.

Conclusion
In this paper, the prediction and evaluation method of power outage quantity of distribution network users under typhoon disaster is studied, and the prediction model of power outage quantity of distribution network users based on RF is proposed.
(1) In order to make the evaluation process more convenient, this paper selects the eight most important explanatory variables for model training. The results show that the model errors do not increase noticeably but instead decrease slightly, providing auxiliary guidance for rapid prediction. It is found that the RF-global variable model and RF-important variable model trained in this paper perform better than the other models compared, and their MAE, MSE, and RMSE are significantly reduced. The prediction effect of the RF-important variable model is slightly better than that of the RF-global variable model, which can provide an effective basis for disaster prevention and reduction of the power grid. (4) In the actual application process, the predicted maximum gust wind speed of 72 hours, 48 hours, and 24 hours before typhoon landing can be used as model inputs, respectively. The prediction results can provide some guidance for the formulation of pre-disaster emergency dispatching strategy.
Data Availability
The datasets used or analyzed during the current study are available from the authors upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.