Assessment of Equipment Operation State with Improved Random Forest



Introduction
In recent years, the number of wind turbines has increased with the extensive application of wind energy. The generator is a key component of a wind turbine, so its operation state has a direct impact on power generation. Wind farms are generally built in remote and harsh places such as the Gobi. Wind turbine generators fail frequently under the influence of extremely harsh working environments, complex working conditions, and extreme weather [1,2], and maintenance is difficult and expensive. It is therefore crucial to assess the operation state of a wind turbine generator.
A growing number of scholars have recognized the importance of state assessment for the wind turbine generator. Fuzzy comprehensive evaluation, support vector machine (SVM), and neural networks (NN) are the main assessment methods. At the same time, the supervisory control and data acquisition (SCADA) system has been installed in many wind farms, and it is a powerful way to obtain operating data of wind turbines. Extensive research has shown that state parameter data from the SCADA system are closely related to the operation state of wind turbines and can be used to realize state assessment. Wang et al. established an evaluation model based on fuzzy mathematics to assess the design performance of wind turbines comprehensively, but the operating state was not assessed effectively [3]. Zheng et al. introduced randomness into the fuzzy method; combination weighting was applied to determine the index weights for higher accuracy, and the health state was assessed effectively [4]. An et al. comprehensively considered multisource information such as wind speed and rotational speed, verified the approach on experimental data of different faults based on SVM, and realized fault diagnosis of the wind turbine [5]. Liang and Fang considered the coupling relationship among components of the wind turbine and established a regression prediction model with SVM [6]. Lin et al. proposed an adaptive immune fruit fly optimization algorithm (AIFOA) to optimize the parameters of SVM. The feature index was predicted more accurately, and performance assessment was realized by comparing the deviation from the normal value [7]. The traditional SCADA alarm system has also made fault diagnosis more convenient. Li et al. established a normal behavior model based on NN to assess the wind turbine operation state. A health class was proposed to measure the difference between the operating state and the normal state, and the wind power generation system was effectively assessed [8].
Wang proposed a two-level NN recognition method in which the two levels are used for fault classification and fault diagnosis, respectively [9]. Zhao et al. proposed a deep learning method (DLM) based on a deep autoencoder (DAE) network; SCADA data were input to the DAE model, and early warning of faulty components was realized [10]. Yang et al. first established an assessment index system for wind turbines and then realized quantitative assessment of main components such as blades and generator bearings based on SCADA data [11]. Tautz and Watson reviewed state and fault monitoring of wind turbines based on SCADA data in five aspects, including clustering methods, normal behavior modeling, damage modeling, and expert systems [12]. Hu et al. proposed an evaluation method based on temperature characteristic parameters and a deterioration degree function, and the early deterioration of the wind turbine generator system was detected successfully [13]. Qian et al. proposed an online sequential extreme learning machine (OS-ELM) algorithm for wind turbine condition monitoring. The long-term deterioration characteristics and the short-term faults of the gearbox were detected efficiently based on SCADA data and the proposed method [14]. Hsu et al. used control charts based on an exponentially weighted moving average (EWMA) model as the main assessment method and set upper and lower limits to monitor state variables. The operation state of wind turbines could be reflected at all times, but the data were limited in this process [15].
These research findings provide the theoretical and experimental foundations for the assessment of wind turbines and their core components. However, some problems remain. Fuzzy comprehensive assessment is highly subjective in the determination of index weights. SVM is suitable for classification with small sample sizes but is difficult to apply to large ones. NN is simpler to learn and implement than the first two methods, but accurate results require a large number of data sets. Importantly, the results depend strongly on the parameters, and fine-tuning these parameters requires a lot of work and experience [16].
Ensemble learning is a learning paradigm in which many classifiers are combined to solve a problem. The generalization ability of a single classifier can be significantly improved by a classifier ensemble. For example, random forest (RF) is widely applied because of its strong classification ability, strong learning ability, and lack of strict requirements on samples. The method is more suitable for classification or regression problems with less noise, and it is insensitive to parameter adjustment. It works best when the classification attributes are not divided into too many levels and the data have no more than a few tens of dimensions. Wang et al. proposed a panoramic crack detection method based on structured RF to realize condition monitoring and fault diagnosis, and the surface cracks of a panoramic steel beam were found efficiently [17]. However, the same weight is given to decision trees with different classification capabilities in the final voting stage of RF, which weakens classification performance.
Therefore, a state assessment method based on improved random forest (IRF) is proposed in the paper. Undersampling and the Synthetic Minority Oversampling Technique (SMOTE) are introduced for imbalanced data sets. Weights are introduced to the final voting stage of RF: to improve accuracy, different weights are set according to the classification capabilities of the decision trees. 10-fold cross-validation and improved assessment criteria based on a confusion matrix are applied for model assessment. Finally, the method is verified on data sets from the SCADA system of wind turbines. The state of the generator is assessed correctly, and the effectiveness of the method is verified by comparison with traditional classifiers.
2. Method: Improved Random Forest
2.1. Processing for Imbalanced Data. Imbalanced data mean that some categories have a large number of samples and others have a small number, which creates an imbalance among the categories of the data sets. A category with a small number of samples is called a minority class. Minority samples are easily misclassified, and classification accuracy suffers because of the imbalance. At present, the imbalance is addressed at the data level and at the algorithm level. Data processing either increases the minority samples by oversampling or reduces the majority samples by undersampling. However, simple oversampling copies minority samples, which causes overlap and overfitting. Undersampling randomly deletes some data to balance the samples, which can lose important sample information and affect the result. Combining the two methods has been reported to give good results, so a combination of undersampling and the Synthetic Minority Oversampling Technique (SMOTE) is introduced in the paper [18].
SMOTE synthesizes new minority samples by linear interpolation between neighboring minority samples. That is, the k (usually 5) nearest neighbors are found for each sample x in the minority data set. According to the sampling magnification N, N samples of the same class are randomly selected from the k nearest neighbors and denoted $y_1, y_2, \dots, y_N$. Linear interpolation is performed between x and $y_i$ $(i = 1, 2, \dots, N)$ to synthesize a new minority sample $\mathrm{new}_{\mathrm{data}}$. The formula is expressed as follows:

$$\mathrm{new}_{\mathrm{data}} = x + \mathrm{rand}(0, 1) \times (y_i - x),$$

where $\mathrm{rand}(0, 1)$ is a random number in (0, 1) and $\mathrm{new}_{\mathrm{data}}$ is the new sample.
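The interpolation step of SMOTE can be illustrated with a minimal Python sketch (the paper's own simulation uses MATLAB; this is not the authors' code). The function names `k_nearest` and `smote`, the Euclidean distance, the choice of neighbors with replacement, and the two-feature sample values are all assumptions made for illustration.

```python
import math
import random

def k_nearest(x, samples, k):
    """Return the k samples closest to x (Euclidean distance), excluding x itself."""
    others = [s for s in samples if s is not x]
    return sorted(others, key=lambda s: math.dist(x, s))[:k]

def smote(minority, n_new_per_sample, k=5, seed=0):
    """Synthesize minority samples: new = x + rand(0, 1) * (y_i - x)."""
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        neighbours = k_nearest(x, minority, k)
        for _ in range(n_new_per_sample):
            # Assumption: neighbours are drawn with replacement for simplicity.
            y = rng.choice(neighbours)
            r = rng.random()
            synthetic.append([xi + r * (yi - xi) for xi, yi in zip(x, y)])
    return synthetic

# Hypothetical minority samples with two temperature-like features.
minority = [[60.0, 41.0], [62.0, 43.0], [61.0, 40.0], [65.0, 45.0]]
new_samples = smote(minority, n_new_per_sample=2, k=2)
print(len(new_samples), new_samples[0])
```

Every synthetic point lies on the segment between a minority sample and one of its neighbors, so the new samples stay inside the region already occupied by the minority class rather than duplicating existing points.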

Random Forest
2.2.1. Establishment of RF Model. RF is a classifier ensemble algorithm that combines the "random subspace method" and "bootstrap aggregating" to establish decision trees (DT). RF is established as follows [19]. The bootstrap resampling method is applied to allocate the training set and the testing set. The original sample set is sampled randomly with replacement N times to form a new training set of the same size as the original sample set. According to the probability distribution, about 63% of the original samples are collected (some repeatedly) into the training set, and the roughly 37% that are not collected form the testing set.
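The proportion of samples left out by bootstrap resampling can be checked numerically. The sketch below is illustrative only; the helper name `bootstrap_split` and the sample count of 10,000 are assumptions. The expected out-of-bag fraction is $(1 - 1/N)^N \approx e^{-1} \approx 0.368$ for large N.

```python
import random

def bootstrap_split(n_samples, seed=0):
    """Draw n_samples indices with replacement; return the draws and the indices never drawn."""
    rng = random.Random(seed)
    drawn = [rng.randrange(n_samples) for _ in range(n_samples)]
    in_bag = set(drawn)
    out_of_bag = [i for i in range(n_samples) if i not in in_bag]
    return drawn, out_of_bag

n = 10_000
drawn, oob = bootstrap_split(n)
oob_fraction = len(oob) / n
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```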
Each generated training set is used to establish a corresponding decision tree $C_1, C_2, \dots, C_n$. At each node of a decision tree, m (m ≤ M) attributes are extracted at random from the M attributes as candidates for the split attribute of the current node. During the growth of the entire forest, the split attribute is determined by the Gini index of each node. The Gini index indicates the impurity of a node: the lower the Gini index, the higher the purity. The formula is expressed as follows:

$$\mathrm{Gini} = 1 - \sum_{j} P_j^2,$$

where $P_j$ is the probability that the sample X contains the attribute j.
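A small Python sketch may make the Gini index concrete. It assumes that $P_j$ is estimated as the proportion of samples of class j at a node, and that a candidate split is scored by the size-weighted Gini index of its two child nodes; the function names and the label values are hypothetical.

```python
from collections import Counter

def gini_index(labels):
    """Gini = 1 - sum_j P_j^2, with P_j the share of class j among the labels."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gini(left, right):
    """Size-weighted Gini index of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini_index(left) + len(right) / n * gini_index(right)

node = ["good", "good", "attention", "attention"]
print(gini_index(node))                                     # mixed node
print(split_gini(["good", "good"], ["attention", "attention"]))  # pure children
```

A pure node has a Gini index of 0, so among the m candidate attributes the one whose split gives the lowest weighted Gini index would be preferred under these assumptions.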
The input testing set is classified by each decision tree, and the result is obtained from the number of votes. The RF model is expressed as follows [20]:

$$H(x) = \arg\max_{Y} \sum_{i=1}^{n} I(h_i(x) = Y),$$

where $h_i(x)$ is the output of the ith decision tree; $Y = 1, 2, \dots, c$ is the corresponding category; $i = 1, 2, \dots, n$; n is the number of decision trees in the random forest; and $I(\cdot)$ is the indicator function, whose value is 1 if its argument holds and 0 otherwise.
The number of decision trees in RF has a crucial influence on generalization ability. Let Q be the number of samples that are not extracted during bootstrap sampling (the out-of-bag samples). These samples are input to the above RF model, and the corresponding classification results are output. If the number of incorrect classifications is R, the out-of-bag error rate is $E_{\mathrm{OOB}} = R/Q$. The number of decision trees is then determined from this error rate [21].
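The majority vote and the out-of-bag error rate can be sketched as follows. This is an illustrative simplification: the per-tree outputs are hypothetical integers, the helper names are assumptions, and ties in the vote are not handled specially.

```python
from collections import Counter

def majority_vote(tree_outputs):
    """H(x): the category that receives the most votes from the decision trees."""
    return Counter(tree_outputs).most_common(1)[0][0]

def oob_error(votes_per_sample, true_labels):
    """E_OOB = R / Q: misclassified out-of-bag samples over all out-of-bag samples."""
    wrong = sum(majority_vote(v) != t for v, t in zip(votes_per_sample, true_labels))
    return wrong / len(true_labels)

# Hypothetical votes of three trees on four out-of-bag samples.
votes = [[1, 1, 2], [2, 2, 3], [3, 1, 3], [4, 4, 1]]
truth = [1, 2, 1, 4]
e_oob = oob_error(votes, truth)
print(e_oob)
```

Repeating this calculation for forests of increasing size gives the kind of error curve from which a suitable number of trees can be read off.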

Improvement Process of RF.
In traditional RF, every decision tree has the same weight, which lowers assessment accuracy. To avoid this, weights are introduced to the voting process: different weights are assigned to different decision trees, and generalization ability is improved. The weight of the ith decision tree is expressed as follows:

$$w_i = \frac{X'_{\mathrm{correct},i}}{X'},$$

where $X'$ is the number of pretested samples (they are part of the training set) and $X'_{\mathrm{correct},i}$ is the number of samples classified correctly by the ith decision tree. The improved RF model is expressed as follows:

$$H(x) = \arg\max_{Y} \sum_{i=1}^{n} w_i \, I(h_i(x) = Y).$$

The assessment process of the wind turbine generator with IRF is shown in Figure 1. The setting of weight is shown in Figure 2.
International Journal of Rotating Machinery
2.3. Online Assessment Strategy. To reduce assessment errors caused by noise, online assessment is introduced. The class of operation state is set to c; the data at time t are input to the IRF model $H(x)$. The voting results $Y_i$ of each decision tree are output, so the degree probability of the cth operation state is expressed as follows: The state degree of the generator changes only between adjacent degrees. Finally, the corresponding state degree of the data $x_t$ is expressed as follows:
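A rough Python sketch of accuracy-weighted voting is given here for illustration. It assumes that the weight of a tree is its share of correctly classified pretest samples and that the degree probability of a state is that state's share of the total weight; the second assumption in particular is a guess, since the corresponding formula is not available in the text. All names and numbers are hypothetical.

```python
from collections import defaultdict

def tree_weight(n_correct, n_pretest):
    """Assumed weight: fraction of pretest samples the tree classifies correctly."""
    return n_correct / n_pretest

def weighted_vote(tree_outputs, weights):
    """Return the category with the largest summed weight, plus all scores."""
    score = defaultdict(float)
    for y, w in zip(tree_outputs, weights):
        score[y] += w
    return max(score, key=score.get), dict(score)

def state_probabilities(tree_outputs, weights, classes):
    """Assumed degree probability: weight voted for a state over the total weight."""
    _, score = weighted_vote(tree_outputs, weights)
    total = sum(weights)
    return {c: score.get(c, 0.0) / total for c in classes}

classes = ["excellent", "good", "attention", "badness"]
outputs = ["excellent", "excellent", "excellent", "good", "good"]
weights = [tree_weight(50, 100)] * 3 + [tree_weight(90, 100)] * 2

label, _ = weighted_vote(outputs, weights)
probs = state_probabilities(outputs, weights, classes)
print(label, probs)
```

In this made-up case three weak trees vote "excellent" and two strong trees vote "good"; an unweighted majority would pick "excellent", whereas the weighted vote picks "good", which is the effect the weighting is meant to achieve.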

Confusion Matrix.
Generally, classification accuracy (ACC) is used as the assessment standard of a model. However, the performance on the minority samples of imbalanced data sets is then often ignored. So, a confusion matrix is introduced in the paper [22]. The confusion matrix describes the relationship between the true category of the samples and the classification result and provides the assessment standard of model performance.
The confusion matrix is shown in Table 1. N is the majority class; P is the minority class; TN and TP are the numbers of majority and minority samples classified correctly, respectively; and FP and FN are the numbers of majority and minority samples misclassified, respectively.
To assess the classification model more accurately, the harmonic average of minority class precision and recall (F-measure), the geometric average correct rate (G-mean), and the Matthews correlation coefficient (MCC) are adopted as the assessment standards of the model based on the confusion matrix. The specific formulas are expressed as follows:

$$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN},$$

$$G\text{-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}},$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$

However, F-measure, G-mean, and MCC are only applicable to binary problems. A "one vs. one" strategy is introduced to solve multiclass problems in the paper. That is, the classes are combined in pairs, so the multiclass problem is converted into binary problems, and the average over all pairs is taken as the result. So, the improved assessment standard is applied in the paper. The formula is expressed as follows:
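The three measures and the pairwise averaging can be sketched in Python using the standard definitions of F-measure, G-mean, and MCC. This is an illustrative reading of the "one vs. one" strategy, not the authors' formula: treating the first class of each pair as the positive class, the nested-dictionary layout of the confusion matrix, and the function names are all assumptions, and zero denominators are not guarded.

```python
import math
from itertools import combinations

def binary_metrics(tp, tn, fp, fn):
    """Standard F-measure, G-mean, and MCC from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return f_measure, g_mean, mcc

def one_vs_one_average(confusion, classes):
    """Average the binary metrics over every pair of classes.

    confusion[a][b] is the number of samples of true class a predicted as b.
    Assumption: the first class of each pair plays the role of the positive class.
    """
    pairs = list(combinations(classes, 2))
    totals = [0.0, 0.0, 0.0]
    for pos, neg in pairs:
        tp, fn = confusion[pos][pos], confusion[pos][neg]
        tn, fp = confusion[neg][neg], confusion[neg][pos]
        for k, value in enumerate(binary_metrics(tp, tn, fp, fn)):
            totals[k] += value
    return [t / len(pairs) for t in totals]

f_measure, g_mean, mcc = binary_metrics(tp=8, tn=9, fp=1, fn=2)
print(round(f_measure, 4), round(g_mean, 4), round(mcc, 4))
```

For c classes there are c(c − 1)/2 pairs, so a four-state problem such as the one in this paper would average six binary evaluations per measure under this reading.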

Preparation for Simulation
3.1.1. Data Collection. The data sets in the paper come from the SCADA system of wind turbines in a wind farm. SCADA is a computer-based distributed control system (DCS) and power automation monitoring system. By monitoring and controlling on-site equipment, namely, the wind turbines, it performs data collection, equipment control, measurement, and parameter adjustment for core components such as generators, gearboxes, and blades. The purpose is to grasp the state of the system and each component correctly, to make decisions quickly, to help diagnose fault states, and so on. The F7 wind turbine in the wind farm failed at 14:01 on July 1, 2017. The generator data sets collected before the failure are taken from the SCADA system. The data details are shown in Table 2, which includes the number of samples, features, classes, class distribution, and imbalance rate. The imbalance rate is the ratio of the number of samples in the largest class to that in the smallest class.
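As a brief illustration, the imbalance rate can be computed from a list of class labels as the size of the largest class divided by the size of the smallest class. The class counts below are invented for the example and are not the values of Table 2.

```python
from collections import Counter

def imbalance_rate(labels):
    """Ratio of the largest class size to the smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical class distribution, not taken from the paper.
labels = (["excellent"] * 300 + ["good"] * 150
          + ["attention"] * 100 + ["badness"] * 50)
rate = imbalance_rate(labels)
print(rate)
```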

The Setting of Assessment Features.
According to the generator-related features in the SCADA system, nine features with a greater impact on the generator are selected: $A_1$: front shaft temperature; $A_2$: rear bearing temperature; $A_3$: cooling water inlet temperature; $A_4$: u1 winding temperature; $A_5$: u2 winding temperature; $A_6$: v1 winding temperature; $A_7$: v2 winding temperature; $A_8$: w1 winding temperature; and $A_9$: w2 winding temperature.

The Setting of State Degree.
Generally, it is appropriate to divide the state into 3 to 5 degrees. In this paper, the state of the wind turbine generator is divided into 4 degrees, namely, "excellent," "good," "attention," and "badness."
3.1.4. The Setting of Optimal Characteristic. The mean decrease accuracy is calculated based on the Gini index (as shown in Figure 3). The number of optimal characteristics equals the number of characteristics above 70%. Namely, the number of optimal characteristics is set to 4 in the paper, and decision trees are branched accordingly. As shown in Figure 4, the out-of-bag (OOB) error rate decreases as the number of decision trees increases. After n > 150, the OOB error rate remains stable and below 4%. To assess more accurately, n is set to 200.
As Table 2 shows, the data sets are imbalanced for the classification problem. Part of the data sets is shown in Table 3.
The feature distribution is shown in Figure 5 after unbalanced data set processing based on a combination of undersampling and SMOTE.
In the paper, the simulation is run in MATLAB R2016a. First, the impact of sample size on model accuracy is analyzed. The total number of samples is set to 1400, 1000, and 600. The accuracy of the improved random forest model is verified by 10-fold cross-validation and the improved assessment criteria. The results are shown in Table 4. According to Table 4, the size of the training set has no significant effect on the accuracy of the assessment model, and the final assessment accuracy fluctuates around 95%-96%. Meanwhile, the assessment results of 10-fold cross-validation on 600 data sets are shown in Figure 6.
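The data partition behind 10-fold cross-validation can be sketched as below. This is a plain-Python illustration rather than the MATLAB procedure used in the paper; the helper name `k_fold_indices`, the shuffling seed, and the interleaved fold assignment are assumptions. With 600 samples, each fold holds 60 testing samples and leaves 540 for training.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists so that every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

splits = list(k_fold_indices(600, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

The model would be trained and scored once per split, and the ten scores averaged to give the reported accuracy.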
To further verify that IRF has higher generalization ability and classification ability in the state assessment of a generator, DT, RF, Probabilistic Neural Network (PNN), Learning Vector Quantization (LVQ), and SVM are each trained and tested under the same conditions. Taking the data set size of 600 as an example, the assessment accuracy of the different classifiers based on 10-fold cross-validation and the confusion matrix is compared in Table 5.
The results in Table 5 are analyzed comprehensively. The average accuracy of IRF is 1.67% higher than that of RF. The IRF model has the best performance, which shows that it has good prediction accuracy, extrapolation ability, and robustness. The definitions of all the symbols used in the paper are given in Table 6.
Meanwhile, 60 testing sets are used to calculate the probability of state classes. The results of probability are shown in Figure 7.
In Figure 7, the probability of "excellent" is the largest for the first 13 samples. From the 14th sample, the probability of "good" gradually increases. From the 14th sample to the 23rd sample, the probability of "good" is the largest. From the 24th sample, the probability of "good" gradually decreases, and the probability of "attention" gradually increases. From the 31st sample to the 37th sample, the probability of "attention" is the largest. From the 38th sample, the probability of "attention" gradually decreases, and the probability of "badness" gradually increases. From the 46th sample to the 60th sample, the probability of "badness" is the largest.
The assessed state class and the original class are shown in Figure 8. The assessment results of the 60 data sets are as follows: 1st-13th: excellent; 14th-30th: good; 31st-45th: attention; 46th-60th: badness. That is, original testing samples of the "excellent" class are mistakenly classified as "good" twice, and an original testing sample of the "good" class is mistakenly classified as "excellent" once. The accuracy of results reaches

Conclusion
A state assessment method for a wind turbine generator based on improved random forest (IRF) is proposed. First, data sets containing nine features of the generator are obtained from the supervisory control and data acquisition (SCADA) system. Undersampling and the Synthetic Minority Oversampling Technique (SMOTE) are introduced to solve the imbalanced data problem. Bootstrap is applied to resample the original data sets, and decision trees are then generated. The weight of each decision tree is determined according to its classification performance. The IRF model and the corresponding online evaluation strategy are established. Finally, 60 selected data sets are input to verify the established model based on 10-fold cross-validation and the confusion matrix. The state of the wind turbine generator is assessed correctly, and the same data sets are then applied to realize online assessment. The accuracy reaches 95.67%. The proposed method not only ensures the accuracy and effectiveness of assessment but also improves efficiency. Its accuracy is better than that of traditional classifiers. It provides a reference for the state assessment of wind power equipment.

Data Availability
All the data is in the manuscript. If the researchers are interested in obtaining the numerical solution files, please contact the email address: skyyangna@126.com.

Conflicts of Interest
The authors declared no potential conflicts of interest to the research, authorship, and publication of this article.