An Early Warning Method of Distribution System Fault Risk Based on Data Mining

Accurate warning information about potential fault risk in the distribution network is essential to economic operation and to the rational allocation of maintenance resources. In this paper, we propose a fault risk warning method for a distribution system based on an improved RelieF-Softmax algorithm. Firstly, 24 fault features in four categories are determined through data investigation and preprocessing, and a risk classification method is presented that considers both the frequency of distribution system faults and their consequences. Secondly, the K-maxmin clustering algorithm is introduced to improve the random sampling process, and an improved RelieF feature extraction method is proposed to determine the optimal feature subset with the strongest correlation and minimum redundancy. Finally, the loss function of Softmax is improved to cope with the influence of sample imbalance on the prediction accuracy. The optimal feature subset and Softmax classifier are applied to forewarn the fault risk in the distribution system. A 191-feeder power distribution system in south China is employed to demonstrate the effectiveness of the proposed method.


Introduction
As the last link in the power industry, the distribution system is closely and directly connected to end-users [1,2]. The stable operation of the distribution system is crucial to the reliable power supply for users [3-5]. However, the distribution system has a complex network structure and a wide variety of equipment components, and, even worse, the external factors that cause distribution system faults are highly random. Since the causal relationship is nonlinear, conventional fault prediction methods based on the electrical mechanism are difficult to apply. Therefore, exploring the potential risks in the distribution system operation process and taking corresponding measures have become a severe challenge to power supply companies [6,7]. The concept of distribution system fault risk is put forward in [8], and a fault probability index system and a fault consequence index system are established from the two dimensions of possibility and severity, respectively. In [9], time series, gray theory, and statistical analysis are applied to analyse the distribution system's fault repair data in depth and extract fault characteristics. However, the above methods focus on the definition of failure risk and the establishment of an index system, while the correlation analysis of failure outage factors and the establishment of a fault early warning model are less studied.
Data mining has recently been widely applied in the field of power systems due to its excellent computing performance and adaptability [10-13]. A fuzzy classification algorithm is developed in [14] to identify the fault cause. In [15], a correlation mining method between load characteristics and an air temperature index is proposed based on linkage analysis theory. Association rule mining is implemented in [16] to conduct fault cause analysis, which can effectively identify the critical variables that have a substantial impact on faults. In [17], a novel framework for distribution system fault detection is presented to cope with power system complexity and bidirectional power flows, and the support vector data description method is adopted because the available fault data are limited. In [18], the fault line selection problem is transformed into a classification problem, including data preparation, training, classification, and evaluation.
For feature dimension reduction, the characteristics and main algorithms of filter, wrapper, and embedded dimension reduction methods are summarized in [19,20], and the robustness, prediction accuracy, and interpretability of different algorithms are compared. Literature [21] proposed a feature extraction method suitable for high-dimensional data. This method can extract feature vectors strongly related to classification but performs poorly in removing redundancy. In [22], principal component analysis is utilized to extract the main components from high-dimensional features through matrix transformation and to eliminate low-value information. Literature [23] conducts correlation recognition and redundancy removal for feature vectors based on correlation analysis theory.
Many researchers [24,25] have explored classification methods and applied them to distribution system fault prediction [26-28]. Literature [29] presented a fault risk warning method for the distribution system based on an improved support vector machine algorithm. Literature [30] focuses on meteorological factors and presents a distribution system fault classification prediction method combining AdaBoost and decision trees. Literature [31,32] introduces the overall structure of a distribution operation analysis system and extends the application of massive fault data to fault risk level prediction and weak spot identification. However, a distribution system fault is an accidental event, and the proportion of fault data samples is far smaller than that of normal operation data samples; that is, distribution system fault prediction is a typical unbalanced sample problem. Therefore, it is necessary to improve the model so that it adapts to the classification prediction of minority categories of samples.
Given the aforementioned considerations, a fault warning method for the distribution system based on an improved RelieF-Softmax algorithm is proposed. Firstly, data acquisition and data preprocessing are carried out to determine the initial associated feature set of distribution system faults, and a fault risk classification method for the distribution system is proposed. Secondly, because of the RelieF algorithm's deficiencies in initial sampling and redundancy removal, an improved RelieF feature extraction method combined with the correlation coefficient method is proposed to screen out the optimal fault feature vector with the strongest correlation and minimum redundancy. Finally, a fault risk level prediction model of the distribution system based on an improved Softmax classifier is established to enhance the prediction accuracy for unbalanced samples. Taking 191 feeder lines in south China as examples, the analysis results demonstrate the effectiveness of the proposed fault risk warning method, which can provide guidance for operation practice.
The remainder of this paper is organized as follows. Section 2 describes the preprocessing of distribution system data. The improved RelieF algorithm is presented in Section 3, and a fault risk warning method for the distribution system based on an improved Softmax loss function is proposed in Section 4. The numerical results of the presented method are detailed in Section 5. The final section gives the conclusions of this study.

Data Collection.
By investigating the widely used information management systems related to the distribution system of the State Grid Corporation of China, six systems, including the distribution production management system, distribution automation system, electricity information collection system, geographical information system, 95598 customer service system, and marketing business management system, are selected to collect the operation data, equipment data, and historical fault information of the distribution system. The data classification and data sources are detailed in Table 1.

Data Preprocessing.
Preprocessing the initial data is necessary to improve the prediction accuracy, generally including data cleaning, data transformation, data integration, and outlier diagnosis.
Data cleaning processes the missing and repeated values in the initial data to ensure the data set's integrity, consistency, and rationality. Missing values in the data samples can be eliminated or replaced by the mean or the median. Rules for recognizing repeated values can be set based on the logical relationship between the data samples. For example, if the feeder name and the power failure time of two samples are the same, the two samples are considered likely duplicates and one of them should be eliminated. Other rules can be defined according to the specific problem and decision-maker preferences.
Data transformation, mainly including standardization, data grading, and quantization, makes it easier to perform data analysis. The max-min method, z-score method, or decimal scaling method can be used for standardization. For data such as rainfall, thunderstorms, and wind, continuous numerical values should be discretized and classified to highlight data differences.
Data integration is to integrate, summarize, and correlate data from multiple sources. Due to the diversity of data sources, it is necessary to cross-verify the data. For example, based on the historical fault data, planned outages can be removed from the user-side power loss information so that only the power failures caused by faults remain. Afterwards, this power failure information can be verified against the power failure information recorded in the fault work orders.
Outlier diagnosis identifies and eliminates the wrong inputs and meaningless values that may appear in the initial data. Outliers may reduce the accuracy of the prediction results. Therefore, statistical, clustering, or graph-based methods should be applied to find and delete outliers.
Finally, the initial feature set could be obtained after a series of data preprocessing, as shown in Table 2.
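The cleaning and transformation steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the record fields (`feeder`, `outage_time`, `rainfall`), the grading edges, and the helper names are assumptions made for the example and are not taken from the studied data set.

```python
import statistics

def drop_duplicates(records):
    """Keep the first record for each (feeder, outage_time) pair."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["feeder"], rec["outage_time"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def min_max_scale(values):
    """Max-min standardization to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """z-score standardization using the population standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def discretize(value, edges):
    """Grade a continuous value: the number of edges it meets or exceeds."""
    return sum(value >= e for e in edges)

# Hypothetical records: the second one repeats the first (same feeder and time).
records = [
    {"feeder": "F001", "outage_time": "2018-03-02 14:10", "rainfall": 12.0},
    {"feeder": "F001", "outage_time": "2018-03-02 14:10", "rainfall": 12.0},
    {"feeder": "F002", "outage_time": "2018-03-05 09:30", "rainfall": None},
    {"feeder": "F003", "outage_time": "2018-03-09 21:45", "rainfall": 40.0},
]
clean = drop_duplicates(records)
rain = fill_missing_with_median([r["rainfall"] for r in clean])
scaled = min_max_scale(rain)
grades = [discretize(v, edges=[10.0, 25.0]) for v in rain]  # hypothetical edges
```

Which standardization to use (max-min or z-score) and where to place the grading edges are choices that would depend on the feature and the local data.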

Fault Risk Classification.
Distribution system fault risk consists of the frequency of failure and the consequences of failure. Given the assessment indexes of power grid companies, the failure rate per 100 km (the frequency of failure) and the number of households affected by power failure (the scope of influence of the fault) are selected as the basis for the classification of distribution system fault risk. The former is an important index used by power supply companies to assess each branch company's annual operation level, and it reflects the fault frequency of distribution network feeders per unit length. The latter is likewise a standard index for measuring the power supply reliability of the distribution system, and it reflects the scope of influence of a power failure.
Equation (1) gives the monthly failure rate of a feeder per 100 km. In equation (1), $S_i$ represents the monthly failure rate per 100 km of feeder $i$, $f_{ij}$ denotes the state of the $j$th power failure of feeder $i$, and $L_i$ is the length of feeder $i$. Equation (2) gives the mathematical definition of the number of households affected by the monthly power failure.
In equation (2), $C_i$ is the number of households affected by the monthly power failure of feeder $i$, $n_f$ is the total number of power failure accidents in that month, and $F_{ij}$ is the set of transformers affected in the $j$th power failure of feeder $i$. $n_{ij,k}$ and $t_{ij,k}$ represent the number of households and the power failure time of the $k$th affected transformer in the $j$th power failure of feeder $i$, respectively.
Based on the calculated monthly feeder failure rate per 100 km and the number of households affected by power failure, the power failure risk level of the distribution system can be divided into general, emergency, and severe. Taking a city in south China in 2018 as a reference, the annual failure rate is 2.502 times per 100 km per year, and the number of households affected by distribution system failures is 102,500. The fault risk classification is shown in Table 3.

Table 1 (recovered row).
Data classification: Parameter data
Data source: Geographical information system and distribution production management system
Data content: The total length of the feeder, the length of the overhead section, the length of the cable, the operation time, and the substation

Table 2 (recovered rows).
f4 Monthly cumulative rainfall
f5 Monthly mean temperature
f6 Monthly highest temperature
f7 Monthly extreme weather days
f8 Monthly mean humidity
f9 Monthly extreme humidity days
f10 Monthly gale days
f11 Monthly thunderstorm days
f12 Monthly snowy days
Operation data
f13 Maximum monthly load
f14 Average monthly load
f15 Month classification
Parameter data
f16 The geographical location classification
f17 Feeder construction mode
f18 Power supply area classification
f19 Length of overhead feeder section
f20 Length of cable feeder segment
f21 Total length of the feeder
f22 Number of feeder segment switches
f23 Number of feeder transformers
f24 Feeder operation time
It should be pointed out that the threshold value of each level can be adjusted according to the actual situation of the local distribution system, and the higher of the two risk levels given by the two indexes is taken as the result.
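The two indexes and the "take the higher level" rule can be illustrated with a small Python sketch. Several things here are assumptions rather than the paper's definitions: the failure rate is taken as the count of failure states divided by the feeder length in km and scaled to 100 km; the household index is taken as households weighted by outage duration and summed over affected transformers; and the threshold values are purely hypothetical, since the values of Table 3 are not reproduced in this text. Levels 1, 2, and 3 are assumed to stand for general, emergency, and severe.

```python
def failure_rate_per_100km(failure_states, length_km):
    """Assumed form of S_i: failures in the month per 100 km of feeder."""
    return 100.0 * sum(failure_states) / length_km

def affected_households(outages):
    """Assumed form of C_i.

    outages: one list per power failure, holding (households, hours)
    for each affected transformer.
    """
    return sum(n * t for outage in outages for n, t in outage)

def level(value, thresholds):
    """Map a value to level 1, 2, or 3 using ascending thresholds."""
    return 1 + sum(value >= th for th in thresholds)

def risk_level(s_i, c_i, rate_thresholds, household_thresholds):
    """The higher of the two index levels is taken as the result."""
    return max(level(s_i, rate_thresholds), level(c_i, household_thresholds))

# Hypothetical feeder: 25 km long, two failures in the month.
s = failure_rate_per_100km([1, 1], length_km=25.0)
c = affected_households([[(30, 2.0), (10, 1.5)], [(20, 0.5)]])
# Hypothetical thresholds, NOT the values of Table 3.
risk = risk_level(s, c, rate_thresholds=(5.0, 10.0),
                  household_thresholds=(50.0, 80.0))
```

In this example the failure rate alone would give level 2, while the household index gives level 3, so the feeder is assigned level 3.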

Fault Feature Extraction of Distribution System Based on Improved RelieF Algorithm
The dimension and quality of the input vectors directly affect the accuracy of the classification prediction results. Too many input vectors may lead to model overfitting and reduced operation efficiency, and collinearity among the inputs reduces the model's stability. Therefore, it is necessary to extract the optimal fault feature subset, that is, to screen out the vectors with the strongest correlation and the least redundancy, in order to improve the efficiency and accuracy of fault risk prediction.

RelieF Algorithm.
The RelieF algorithm is a typical filter feature selection method that extracts the feature vectors with a significant ability to distinguish the target classes. In the RelieF algorithm, a weight is assigned to each feature based on a distance measurement, so as to evaluate the ability of the feature vectors to distinguish the target categories. The specific definition is as follows: for a sample set $D$, randomly select a sample $s$ and find its $k$ nearest neighbors of the same class, defined as $H$, and its $k$ nearest neighbors from the other classes, defined as $M$. In this paper, the Euclidean metric is employed to measure the distance between samples. The characteristic difference between samples is written $\mathrm{diff}(a, X, Y)$, which represents the difference between sample $X$ and sample $Y$ on feature $a$. In the weight updating formula, $W_a$ denotes the weight value of feature $a$, $k$ is the number of nearest neighbor samples, $t$ is the number of samplings, $H_i$ and $M_i$ represent the $i$th nearest same-class neighbor and the $i$th nearest different-class neighbor of the sample $s$, respectively, and $\mathrm{class}(\cdot)$ is the ratio of the number of samples in a class to the total number of samples.
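The weight update can be sketched in Python as follows. This is a minimal sketch of the commonly stated ReliefF update, not necessarily the exact formulas of this paper: it assumes numeric features, a range-normalised `diff`, and a different-class contribution scaled by the class proportion divided by one minus the proportion of the sampled class. The function and variable names and the toy data are illustrative.

```python
import math
import random

def diff(a, x, y, ranges):
    """Range-normalised difference between samples x and y on feature a."""
    return abs(x[a] - y[a]) / ranges[a] if ranges[a] else 0.0

def relieff(X, y, n_samples, k, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    ranges = [max(r[a] for r in X) - min(r[a] for r in X) for a in range(d)]
    classes = sorted(set(y))
    prior = {c: y.count(c) / n for c in classes}   # class proportions
    w = [0.0] * d
    for _ in range(n_samples):
        i = rng.randrange(n)                       # sampling with replacement
        s, cs = X[i], y[i]
        for c in classes:
            # k nearest neighbours of s inside class c (Euclidean distance)
            pool = [j for j in range(n) if j != i and y[j] == c]
            pool.sort(key=lambda j: math.dist(s, X[j]))
            near = pool[:k]
            if not near:
                continue
            # same class lowers the weight, other classes raise it
            coef = -1.0 if c == cs else prior[c] / (1.0 - prior[cs])
            for a in range(d):
                mean_diff = sum(diff(a, s, X[j], ranges) for j in near) / len(near)
                w[a] += coef * mean_diff / n_samples
    return w

# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[0.0, 0.1], [0.1, 0.5], [0.2, 0.9], [0.8, 0.1], [0.9, 0.5], [1.0, 0.9]]
y = [0, 0, 0, 1, 1, 1]
weights = relieff(X, y, n_samples=10, k=2)
```

On this toy set the discriminating feature receives a positive weight and the uninformative one a negative weight, which is the behaviour the weight is meant to capture.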

Improved RelieF Algorithm.
Although the RelieF algorithm places no restrictions on data types and has relatively high operating efficiency, it still has the following disadvantages: (1) Because the initial random sampling is sampling with replacement, the set of selected samples may be too limited due to repeated sampling. Since a repeated sample provides no new information for the classification and is an invalid input, the accuracy of the model results may be affected. (2) The algorithm has a weak ability to distinguish redundant features, which leads to considerable noise in the input features.
Given the consideration mentioned above, the RelieF algorithm is improved from two aspects. On the premise that the algorithm flow remains unchanged, the clustering algorithm is introduced to cluster the initial data, and a hierarchical sampling algorithm based on clustering is proposed. Due to the deficiency of redundancy elimination in the RelieF algorithm, the feature extraction method combining RelieF and the correlation coefficient method is presented to identify and eliminate redundant features effectively.

Hierarchical Sampling Based on K-Maxmin Clustering Algorithm.
The K-maxmin distance method selects data points that are as far apart as possible as the clustering centers, based on the Euclidean distance, so as to avoid the situation in the k-means method where the initial clustering centers may be too close to each other. The K-maxmin algorithm is efficient and does not require the number of clusters to be specified in advance. Owing to space limits, the K-maxmin clustering process is not described here; see [33].
The K-maxmin clustering algorithm is introduced to cluster the initial feature set, and then stratified sampling is applied in line with the category proportions. The total sampling number $M$ is distributed over all categories proportionally, so that the number of sampling points of each category is determined by the proportion of that category in the total sample. In this way, the under-sampling of local regions that can occur with purely random sampling is effectively avoided. Besides, sampling is strictly without replacement, which ensures that each sampled point contributes a new weight update for the feature vectors and thus significantly improves the contribution of the sampled points to the classification results.
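The sampling scheme can be sketched as follows. Since the K-maxmin procedure itself is deferred to [33], the centre selection below follows the max-min distance rule as it is commonly described and should be read as an assumption: the first point is used as the first centre, and the stopping ratio `theta` is a hypothetical parameter. The rounding of per-cluster quotas is also a simplification.

```python
import math
import random

def maxmin_centers(points, theta=0.5):
    """Pick centres that are far apart; the number of centres is not fixed."""
    centers = [0]                                  # first point (assumption)
    second = max(range(len(points)),
                 key=lambda j: math.dist(points[0], points[j]))
    centers.append(second)
    ref = math.dist(points[0], points[second])
    while True:
        # distance of every point to its nearest existing centre
        nearest = [min(math.dist(p, points[c]) for c in centers)
                   for p in points]
        cand = max(range(len(points)), key=lambda j: nearest[j])
        if nearest[cand] <= theta * ref:           # no point is far enough
            return centers
        centers.append(cand)

def assign(points, centers):
    """Label each point with the index of its nearest centre."""
    return [min(range(len(centers)),
                key=lambda c: math.dist(p, points[centers[c]]))
            for p in points]

def stratified_sample(labels, total, seed=0):
    """Proportional sampling per cluster, without replacement."""
    rng = random.Random(seed)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    picked = []
    for members in clusters.values():
        # rounded quotas may not add up exactly to `total` in general
        quota = max(1, round(total * len(members) / len(labels)))
        picked.extend(rng.sample(members, min(quota, len(members))))
    return picked

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0),
          (5.1, 5.0), (10.0, 0.0), (10.0, 0.1)]
centers = maxmin_centers(points)
labels = assign(points, centers)
picked = stratified_sample(labels, total=4)
```

Because `random.sample` draws without replacement, no sample index can be selected twice, and every cluster contributes at least one point.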

Correlation Coefficient Method.
The Pearson coefficient is an indicator of the degree of linear correlation between variables and is widely used in statistics. Its value lies in $[-1, 1]$. A positive value indicates a positive correlation between two variables, and a negative value indicates a negative correlation. The higher the absolute value, the stronger the correlation. The Pearson correlation coefficient is calculated as

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

where $\mathrm{Cov}(X, Y)$ is the covariance of sample $X$ and sample $Y$, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of sample $X$ and sample $Y$, respectively. The correlation coefficient matrix is obtained by calculating the correlation coefficient between each pair of input vectors. It is generally considered that the correlation is strong if the absolute value of the correlation coefficient is higher than 0.7, and the corresponding pair of feature vectors is then put into the redundant set. If the features extracted by the RelieF algorithm contain redundant pairs, the feature with the smaller weight in each pair is eliminated, so that only one feature of the pair is retained.
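A short Python sketch of this redundancy check is given below. It is one possible reading, not the paper's procedure verbatim: features are visited in order of decreasing RelieF weight, and a feature is kept only if its absolute correlation with every feature already kept does not exceed the 0.7 threshold, so the lower-weight member of a redundant pair is the one dropped. The feature names, values, and weights are hypothetical, and constant columns are assumed to have been removed beforehand.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)   # undefined for constant columns

def drop_redundant(columns, weights, threshold=0.7):
    """columns: {name: values}; weights: {name: feature weight}."""
    kept = []
    for name in sorted(columns, key=lambda f: weights[f], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept

# Hypothetical monthly values for three features of one feeder.
columns = {
    "max_load": [1.0, 2.0, 3.0, 4.0, 5.0],
    "avg_load": [1.1, 2.0, 2.9, 4.2, 5.1],   # nearly collinear with max_load
    "gale_days": [2.0, 0.0, 3.0, 0.0, 1.0],
}
weights = {"max_load": 0.30, "avg_load": 0.20, "gale_days": 0.15}
kept = drop_redundant(columns, weights)
```

Here `avg_load` is removed because it is almost perfectly correlated with the higher-weight `max_load`, while `gale_days` is only weakly correlated with it and is retained.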

Optimal Feature Extraction Method Based on the Improved RelieF Algorithm.
The flow chart of the feature extraction method based on the improved RelieF algorithm is depicted in Figure 1. Firstly, the initial data set is clustered by the K-maxmin algorithm to realize stratified sampling without replacement. Secondly, the improved RelieF algorithm is applied to identify the feature vectors that can significantly distinguish the target classes. Finally, the correlation coefficient method is applied to eliminate the redundant feature vectors, which further reduces the dimension and yields the optimal feature subset.

Fault Risk Warning Method of Distribution System Based on Improved Softmax Loss Function

Improved Softmax.
Softmax classification [34] is an extension of binary logistic regression to multiclass problems. It is based on Softmax regression, and the category with the highest output probability is taken as the predicted category.
For a training set of $m$ samples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, where $x_i$ is the input vector and $y_i$ is the corresponding category label, there are $K$ categories; namely, $y_i \in \{1, 2, \ldots, K\}$. Softmax regression is used to estimate the probability that the input data belong to each category. For any input vector, the prediction function can be expressed as

$$P(y_i = j \mid x_i; \theta) = \frac{e^{\theta_j^{T} x_i}}{\sum_{l=1}^{K} e^{\theta_l^{T} x_i}}, \quad j = 1, 2, \ldots, K \qquad (6)$$

where $P(\cdot)$ represents the probability of the event within the parentheses, $\theta = [\theta_1, \theta_2, \ldots, \theta_K]$ is the $n \times K$ weight matrix, and $n$ is the number of sample features. The denominator $\sum_{l=1}^{K} e^{\theta_l^{T} x_i}$ is a normalizing term that guarantees that the probabilities sum to 1.
The Softmax loss function is based on the logarithmic cross entropy, which can be expressed as

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{K} \mathrm{Ind}(y_i = j) \log \frac{e^{\theta_j^{T} x_i}}{\sum_{l=1}^{K} e^{\theta_l^{T} x_i}} \qquad (7)$$

where $\mathrm{Ind}(y_i = j)$ is the 0-1 indicator function, whose value is 1 if the statement in parentheses is true and 0 otherwise. Combining equations (6) and (7), the classification prediction problem is transformed into solving for the prediction function parameters that minimize equation (7), so as to obtain the probability of each category for a sample. The physical meaning of the loss function defined in equation (7) is to make the proportion of correctly classified samples as large as possible. However, this function treats all categories equally and implicitly assumes that the sample data are balanced. In the problem of fault early warning for the distribution system, the loss incurred when a high risk sample is mistaken for a low risk one can be much larger than in the reverse case. In other words, the correct classification of high risk is more important than the correct classification of low risk, and the low risk data account for a large proportion of the studied samples. Therefore, the loss function of equation (7) is improved to adapt to the strategy proposed in this paper.
In the improved loss function, equation (8), the first term measures the classification error, where $\alpha_j$ is the category weight that adjusts for the degree of sample imbalance and increases the penalty when a minority class is mistaken for another category. The second term is the L2-norm regularization term, where $\lambda$ is the regularization parameter. The regularization term makes it easier to obtain the global optimal solution while avoiding overfitting of the training model and improving the model's generalization ability. The gradient descent method is a common optimization method for finding the maximum or minimum value of a function. Hence, it is employed to train the Softmax classifier. The partial derivative of equation (8) with respect to $\theta$ is derived and, according to equation (10), $\theta$ is updated in each iteration, where $\delta$ is the iteration step.
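A class-weighted Softmax classifier with L2 regularization trained by gradient descent can be sketched as below. The exact form of the improved loss is an assumption here: each sample's cross-entropy term is multiplied by the weight `alpha` of its true class, and `(lam / 2)` times the squared norm of the parameters is added. The parameters are stored as one row per class (K by n) for convenience, and the toy data and the weights `[1.0, 3.0]` are hypothetical.

```python
import math

def softmax_probs(theta, x):
    """Class probabilities for input x; theta holds one weight row per class."""
    scores = [sum(t * v for t, v in zip(row, x)) for row in theta]
    top = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def train(X, y, n_classes, alpha, lam=0.002, step=0.1, iters=500):
    m, n = len(X), len(X[0])
    theta = [[0.0] * n for _ in range(n_classes)]
    for _ in range(iters):
        # gradient of the assumed regularizer (lam / 2) * ||theta||^2
        grad = [[lam * t for t in row] for row in theta]
        for x, label in zip(X, y):
            p = softmax_probs(theta, x)
            for j in range(n_classes):
                # gradient of -alpha[label] * log p[label] w.r.t. theta_j
                g = alpha[label] * (p[j] - (1.0 if j == label else 0.0)) / m
                for a in range(n):
                    grad[j][a] += g * x[a]
        theta = [[t - step * g for t, g in zip(tr, gr)]
                 for tr, gr in zip(theta, grad)]
    return theta

def predict(theta, x):
    p = softmax_probs(theta, x)
    return p.index(max(p))

# Toy unbalanced data: first component is a bias term, labels are 0-based.
X = [[1.0, 0.0], [1.0, 0.2], [1.0, 0.4], [1.0, 0.1], [1.0, 0.3], [1.0, 0.5],
     [1.0, 2.0], [1.0, 2.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
theta = train(X, y, n_classes=2, alpha=[1.0, 3.0])
low, high = predict(theta, [1.0, 0.05]), predict(theta, [1.0, 2.1])
```

Raising `alpha` for a minority class makes each of its misclassifications cost more, which shifts the decision boundary in favour of recalling that class at the expense of some precision.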

Fault Risk Warning Method of Distribution System.
The data-driven fault warning method can be divided into three stages: data acquisition and preprocessing, feature extraction, and risk level prediction. The research idea and risk warning process of this paper are illustrated in Figure 2.
Data acquisition and preprocessing. Collect fault data, operation data, parameter data, and meteorological information, and perform data cleaning, integration, and outlier diagnosis as required. Determine the fault risk classification based on the failure rate per 100 km and the number of households affected by power failure. The initial sample set is thus obtained, where each row represents a sample, each column denotes a feature, and the last column represents the fault risk level.
Feature extraction. Feature extraction includes two aspects: removing weakly correlated features and removing redundancy. The improved RelieF algorithm and the correlation coefficient method are applied to obtain the optimal feature subset with the strongest correlation and minimum redundancy.
Risk level prediction. The improved Softmax classifier is trained on the training set to learn the mapping between the fault influencing factors and the fault risk level of the distribution system. Based on this learned model, the fault risk level of the test samples can be predicted.

Case Study
A total of 191 feeders and their data from January 2018 to December 2018 in a southern city are collected as training samples to predict the monthly feeder fault risk level from January 2019 to June 2019.

Data Preprocessing.
The failure data, operation data, ledger data, and meteorological data of 191 feeders collected from January 2018 to December 2018 are processed by the method in Section 2. Taking each feeder as a unit, 24 fault features in 4 categories are obtained, as shown in Table 2.
Among them, the fault risk level $f_1$ is determined comprehensively from $f_2$ and $f_3$. After preprocessing, the initial data set contains 2292 samples, including 2154 samples of class I, 95 samples of class II, and 43 samples of class III. The number of clusters obtained by the K-maxmin clustering algorithm is 5, and the sampling proportion of each cluster is determined by the ratio of the number of samples in the cluster to the total number of samples. Afterwards, the improved RelieF algorithm is applied to extract the key fault features, with a sampling number of 30, a nearest neighbor number of 8, and 20 iterations. The calculated feature weights are depicted in Figure 3. The dotted line in Figure 3 marks the average weight, 0.127, which is also the threshold for feature weight screening. As can be seen from Figure 4 Table 4.
The result of feature extraction in Table 4 shows that the features directly related to the fault are retained. Among them, the maximum monthly load, the average monthly load, the month classification, and the geographical location of the feeder correspond to the load characteristics, time characteristics, and geographical characteristics of the fault, respectively, and they have a relatively obvious and direct correlation with the feeder fault. In terms of meteorological data and ledger data, the monthly extreme weather days, monthly gale days, and the total length of the feeder are retained, and some redundant indexes are effectively eliminated. Depending on the distribution system's actual operation in different areas, the optimal feature set obtained may also be different.
It is necessary to conduct a sensitivity analysis. The sampling number and the nearest neighbor number are adjusted to 80 and 10, respectively. The calculation results show little difference from Table 4, indicating that the improved RelieF algorithm is relatively stable, and there is no

Early Warning of Distribution System Fault Risk.
Eight optimal features are extracted to train the Softmax classifier, and the monthly failure risk levels of 191 feeders from January to June 2019 are predicted. In order to measure the ability of the model to handle unbalanced samples, the commonly used definitions of accuracy and recall rate are slightly adjusted in this paper, and the classification accuracy $T_{pr}$ and recall rate $T_{re}$ for minority classes are proposed. $H_a$, $A_r$, and the confusion matrix are also introduced as evaluation indexes of the model. The classification accuracy $T_{pr}$ and recall rate $T_{re}$ are defined in terms of $T_2$ and $T_3$, the numbers of correct classifications in the target categories II and III, respectively, and $F_1$, $F_2$, and $F_3$, the numbers of mistaken classifications in the target category in the three categories, respectively. $H_a$ is the weighted harmonic mean of the precision rate and the recall rate, which evaluates the overall performance of the model with more emphasis on the classification performance for the minority categories:

$$H_a = \frac{(1 + \beta^2)\, T_{pr}\, T_{re}}{\beta^2\, T_{pr} + T_{re}}$$

where $\beta$ is a positive weight coefficient that represents the relative importance of model recall and precision. In this paper, $\beta$ is set to 1. $A_r$ is the proportion of correctly classified samples among all samples, which measures the model's overall classification performance.
For the initialization of the Softmax classifier, the weight decay parameter is 0.002, and the value of $\alpha$ for each category is inversely proportional to the category's share of the samples; the values are set to 1, 20, and 50, respectively. The gradient descent learning rate is set to 0.1, and the number of iterations is 500.
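The four evaluation indexes can be computed from a confusion matrix as sketched below. The pooled form of $T_{pr}$ and $T_{re}$ used here, with correct and mistaken counts summed over categories II and III, is an assumption about how the minority-class indexes are aggregated, and the confusion matrix values are hypothetical, not the results of this study.

```python
def minority_metrics(cm, minority=(1, 2), beta=1.0):
    """cm[i][j]: number of samples of actual class i predicted as class j.

    minority holds the 0-based indexes of the minority classes
    (categories II and III by default).
    """
    k = len(cm)
    correct = sum(cm[c][c] for c in minority)
    predicted = sum(cm[i][c] for i in range(k) for c in minority)
    actual = sum(cm[c][j] for c in minority for j in range(k))
    t_pr = correct / predicted              # pooled precision (assumed form)
    t_re = correct / actual                 # pooled recall (assumed form)
    h_a = (1 + beta ** 2) * t_pr * t_re / (beta ** 2 * t_pr + t_re)
    a_r = sum(cm[c][c] for c in range(k)) / sum(map(sum, cm))
    return t_pr, t_re, h_a, a_r

# Hypothetical confusion matrix: rows are actual I, II, III.
cm = [
    [90, 4, 1],
    [1, 8, 1],
    [0, 1, 4],
]
t_pr, t_re, h_a, a_r = minority_metrics(cm)
```

With $\beta = 1$ the index $H_a$ reduces to the ordinary harmonic mean of the two rates, so it stays low whenever either the precision or the recall of the minority classes is poor, even if the overall accuracy $A_r$ is high.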

Prediction Results.
The model prediction results are shown in Figure 4. Each row represents the actual category, and each column denotes the predicted category. The diagonal elements represent the numbers of samples correctly classified, and the off-diagonal elements represent the numbers of samples incorrectly classified. The last row and the last column represent the classification accuracy and recall rate of the various categories. The last element represents the overall prediction accuracy of the model. Figure 4 shows that, for the 1146 test samples, the recall rates of the three levels are 99.09%, 97.5%, and 100.00%, respectively, the accuracy rates are 99.91%, 82.98%, and 80.00%, and the overall classification accuracy is 99.04%. The classification accuracy of the first category is relatively high because this category, which corresponds to samples without faults or with little impact, contains a large number of samples and is therefore learned well. The classification accuracy of category III is low, mainly because the total number of such samples is small, so any misclassification greatly affects the result.
In the case of unbalanced samples, the model's recall rate is reasonable, and the recall rates of categories II and III, which have serious risk levels, are similar to that of category I, which indicates that the model can effectively identify high risk faults of the distribution system. It can be concluded from the confusion matrix that the misclassifications of the prediction model mainly consist of category I samples being classified as II or III, indicating that the model puts more emphasis on the recall rate of the high risk categories II and III, which is due to the higher cost assigned to misclassifying the high risk categories.

Comparison of Different Prediction Methods.
Four indexes, $T_{re}$, $T_{pr}$, $H_a$, and $A_r$, are utilized to compare the prediction results on the training set and test set of the improved RelieF-Softmax (denoted as A), the improved Softmax without feature extraction (denoted as B), and the improved RelieF without the improvement in the Softmax loss function (denoted as C). The prediction results are shown in Table 5.
It can be found from Table 5 that Case A, based on the improved RelieF-Softmax algorithm, performs well on both the training set and the test set, and its performance is better than that of Case B and Case C, indicating that the improved model has better generalization ability. Besides, comparing Case A and Case B shows that feature extraction can improve classification performance to a certain extent. Further, comparing the training set and the test set of Case B shows that the classification performance of the model on the training set is significantly better than that on the test set. This is because too many redundant input vectors make the model learn too much irrelevant information, and the resulting complex model has reduced generalization performance.
By comparing Case A and Case C, it can be found that the recall rate of Case A is considerably higher than that of Case C, indicating that the former can effectively identify high risk categories with better classification performance. For the distribution system fault prediction problem, there are various influencing factors and complex correlations. The distribution system fault data themselves are minority samples, so it is necessary to adopt the improved RelieF-Softmax algorithm to improve the model's adaptability and prediction efficiency.

Conclusions
In this paper, a fault risk warning method of distribution system based on improved RelieF-Softmax algorithm is proposed, and the following conclusions can be drawn.
Compared with a single dimension reduction method, the feature extraction method based on the improved RelieF algorithm overcomes the inability of the traditional RelieF algorithm to remove redundancy, reduces the dimension of the features, and therefore improves the classification performance.
The failure data of one year are analysed based on data mining technology, and the fault risk level of the distribution system is predicted. The model can effectively identify minority category samples and avoid the misclassification of high risk samples, which would lead to severe consequences. The proposed method can provide a scientific basis for configuring maintenance and repair resources for the distribution system.
This paper aims to put forward an approach to fault risk early warning for the distribution system. Due to the differences in regional information level, distribution system operational means, and distribution system data acquisition methods, regional features should be considered in concrete analysis. In the future, the relevant data of the distribution system should be extracted as thoroughly as possible, and the distribution system fault features need to be identified through association mining and other efficient means, such as deep learning, so that the accuracy and efficiency of fault early warning can be enhanced further.
Data Availability
The data can be accessed from the distribution system of State Grid Corporation Company (http://www.js.sgcc.com.cn/sz/).

Conflicts of Interest
The authors declare no conflicts of interest.