Abnormal Detection of Wind Turbine Based on SCADA Data Mining

In order to reduce the curse of dimensionality of massive data from SCADA (Supervisory Control and Data Acquisition) system and remove data redundancy, the grey correlation algorithm is used to extract the eigenvectors of monitoring data. The eigenvectors are used as input vectors and the monitoring variables related to the unit state as output vectors. The genetic algorithm and cross validation method are used to optimize the parameters of the support vector regression (SVR) model. A high precision prediction is carried out, and a reasonable threshold is set up to alarm the fault. The condition monitoring of the wind turbine is realized. The effectiveness of the method is verified by using the actual fault data of a wind farm.


Introduction
The SCADA system has changed the operation mode of wind farms, providing a healthier working environment and reducing the costs of operation and maintenance. However, the large volume of high-dimensional and heterogeneous data it collects is not fully utilized; its use typically stops at real-time monitoring and statistical reporting of historical data. Therefore, it is important to make full use of the data that the wind power centralized control center collects from large numbers of wind turbines, in order to monitor the state of the turbines and predict their condition and life [1,2]. Several surveys of wind turbine (WT) failures have been conducted in the last two decades to identify failure rates and associated downtime for different subassemblies. However, the different taxonomies used by different turbine manufacturers, wind farm operators, and researchers make comparisons between these surveys challenging. The evaluation of 15 years of data from the German "250 MW Wind" programme [3] and of >95% of all the turbines operating between 1997 and 2005 in Sweden [4] gave first insights into the reliability of the first onshore WTs. The German turbines had an average availability of about 98%. An average failure rate of 0.4 failures per turbine per year resulted in an average downtime of 130 hours per turbine per year for the Swedish turbines. A distinctive difference between failure rate and downtime distribution in subassembly groups was identified. The electrical and electronic control systems were identified as the most failure-prone, but gearbox and generator failures caused the longest downtime.
Many scholars have studied condition monitoring and fault diagnosis of large wind turbines [5]. In [6], a statistical learning method was used to detect abnormal situations through a weighted least squares support vector regression model of the wind turbine response to external conditions. The results show that the model is better than conventional forecasting methods. Pandit and Infield [7] carried out an in-depth analysis of commonly used stationary covariance functions for Gaussian process (GP) models of the wind turbine power curve: GP-based power curves were constructed using different stationary covariance functions, and a comparative analysis was then carried out to identify the most effective covariance function. The commonly used squared exponential covariance function is taken as the benchmark against which the other covariance functions are assessed.
The results show that the performance (in terms of model accuracy and uncertainty) of GP fitted power curve models based on rational quadratic covariance functions is almost the same as for the most commonly used squared exponential function. The studies of Astolfi et al. [8,9] provide a catalog of generalizable methods for studying wind turbine power curve upgrades. In particular, the selected test cases show that complex wind conditions might affect wind turbine operation such that the production improvement is nonnegligibly different from what can be estimated under the hypothesis of ideal wind conditions. Wan et al. [10] proposed a wavelet-based energy function for asymmetrical fault detection in doubly fed induction generators.
The proposed method not only detects the fault within one and a half cycles of the fundamental wave but also remains effective under time-varying conditions. Turbine condition monitoring (TCM) through vibration analysis has pros and cons: essentially, high diagnostic power against high cost and high complexity in elaborating the information from the data stream into knowledge [11]. Chun-yu et al. [12] put forward a dynamic prediction model of wind turbine blade failure based on the grey theory. The relative error between prediction and field investigation data is less than 5%, meeting the actual needs of engineering and verifying the effectiveness and applicability of the proposed algorithm. The main contribution of Chakkor et al. [13] is the design of an intelligent wireless remote monitoring and control system according to the features and requirements of wind turbines.
This system, based on IP communication, combines Web and database client/server technology to copy the measurements received from the different sensors installed in the wind turbines. Eggers et al. [14] used a Hotelling T² control chart and an automatic relevance neural network to analyze the wind turbine power and detect wind turbine faults. Zhang [15] combined AHP and variable weight theory to build a wind turbine performance evaluation model based on Grey Theory and a variable weight fuzzy comprehensive evaluation. However, these studies did not consider the correlation and coupling between the components of the unit, which makes the models inaccurate. Zhang et al. [16] established an SVR-based prediction model that takes the quantities monitored by the SCADA system as input and the active power of the unit as output. The disadvantages of this model are that feature extraction from the high-dimensional input vectors is not easy and that the power is used as the only criterion for diagnosing the state of the unit. In [17], a BP neural network is used to model and predict the gearbox and generator, and a multiagent method is used to synthesize the diagnosis results of the different components, giving the overall operation status of the unit. However, neural network modelling requires a time-consuming learning process, and the selection of learning samples lacks a clear basis.
Being based on statistical analysis, SCADA-based monitoring commonly requires vast datasets to provide meaningful indications; the most common opinion therefore is that SCADA can detect incipient faults at a late stage. Astolfi et al. [18] employed artificial neural networks, for their capability of reconstructing nonlinear dependencies between inputs and outputs, and formulated simple models for the diagnosis of faults occurring at the level of the gearbox. The datasets employed have the 10-minute sampling time of common SCADA control systems; the gearbox vibrations and the gearbox temperatures are selected as the target outputs to model. It was shown that the time resolution of SCADA is too coarse for reliable vibration analysis, which should rather be observed at its proper time scale (several Hz). At present, data mining methods such as clustering and statistical models are widely used in domestic and foreign enterprises, but their data cleaning process is complicated and the cleaning conditions are harsh [19]. Therefore, in order to make a reliable analysis of the power generation performance of wind turbines, an efficient and versatile cleaning method is urgently needed.
In view of this, this paper first extracts features from the massive and high-dimensional data collected by the SCADA system, removes the parameters that are irrelevant or redundant with respect to the operation state of the unit, and improves the monitoring accuracy of the wind turbine by improving the model input. A reasonable threshold is then selected to raise alarms on abnormal states of the wind turbine while avoiding false and late alarms. The rest of the paper is organized as follows. Section 2 discusses feature selection and sparse learning technology to reduce the dimension of the operation parameters of the SCADA system, remove the irrelevant and redundant parameters of the operating state of the wind turbine, and retain the related characteristic parameters. In Section 3, the multi-input and multi-output SVR model is established, which takes the active power, the rotor speed, and the pitch angle 1 as the output vector and the characteristic parameters as the input vector. Cross validation (CV) is combined with a genetic algorithm (GA) for parameter optimization. In Sections 4 and 5, the proposed method is applied to industrial data, and the performance of the proposed model is discussed. Section 6 concludes the paper.

Data Mining of Characteristic Parameters for Wind Turbines
The data collected and recorded by the SCADA system of the wind turbine are high-dimensional. In this paper, 74 monitored variables of the wind turbine components are selected. Data mining of the characteristic parameters reduces the number of features and mitigates the curse of dimensionality, so that the generalization ability of the model is stronger and overfitting is reduced. The commonly used methods for selecting characteristic parameters include principal component analysis (PCA) [20], the Pearson correlation coefficient [14], and the random forest method [21]. When the data are high-dimensional vectors, the calculation of PCA is complicated, and PCA is most suitable for linear data. The Pearson correlation coefficient is sensitive only to the most obvious linear relationships. The random forest method is prone to overfitting. Therefore, in this paper, a data mining algorithm based on the grey correlation degree [22] is proposed to overcome the above shortcomings and to improve the accuracy and effectiveness of wind turbine operation state assessments.

Extraction of the Characteristic Parameters Based on Grey Relational Grade.
There are 74 variables in the wind turbine information recorded by the SCADA system, with an acquisition interval of 10 minutes. Figure 1 shows the monitoring variables collected by the wind turbine SCADA system and their corresponding codes. This paper studies wind turbines operating in a healthy state without power limitation. Under these conditions, some monitored quantities recorded in the SCADA system, such as the control mode, the alarms of some parameters, the speed mode, and the states of the shaft 1, shaft 2, and shaft 3 converters, do not change. Variables in such an invariant state can be ignored. Table 1 lists part of the parameter alarm information of the GE wind turbine. In preprocessing the eigenvectors, these invariant variables must be removed to avoid the curse of dimensionality caused by too many features.
Wind turbine operation is mainly reflected in the active power, rotor speed, and pitch angle; these three parameters are used as the output vectors. We take the pitch angle 1 as an example for monitoring the pitch angle. The grey correlation with the other variables is calculated to reduce the dimension of the wind turbine data while ensuring the smallest loss of information. The concrete steps of extracting the operating characteristic parameters using the grey correlation degree are as follows: (1) The primary characteristic set D of the wind turbine operating parameters is determined. (2) According to the parameters in the primary set D, the corresponding values are extracted from the SCADA system as the sample set $\Omega$ of the grey correlation degree:
$$\Omega = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix},$$
where $x_{mn}$ is the $n$-th parameter of the $m$-th sample. The degree of correlation between the parameters is calculated as follows.
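First, determine the reference sequence $X_0$ and the comparison sequences $X_i$ ($i = 1, 2, \ldots, m$) according to the training sample $\Omega$. The absolute difference $\Delta_i(k) = |x_0(k) - x_i(k)|$ between $X_0$ and each comparison sequence $X_i$ is calculated from the sample set. The absolute differences are then used to calculate the two-level maximum difference $\Delta_{\max} = \max_i \max_k \Delta_i(k)$ and the two-level minimum difference $\Delta_{\min} = \min_i \min_k \Delta_i(k)$. Finally, the correlation coefficient $r_{0i}(k)$ between the reference sequence $X_0$ and the comparison sequence $X_i$ at moment $k$ is calculated from $\Delta_i(k)$, $\Delta_{\min}$, and $\Delta_{\max}$.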
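As a rough illustration of the steps above, the following sketch computes grey relational grades in plain Python. Because the coefficient formula itself is not reproduced in the text, the sketch assumes the standard form used in grey relational analysis, r_0i(k) = (Δ_min + ρ Δ_max) / (Δ_i(k) + ρ Δ_max), with an assumed distinguishing coefficient ρ = 0.5, and takes the grade as the mean of r_0i(k) over k. The function name, variable names, and sample values are hypothetical and are not taken from the paper's data.

```python
def grey_relational_grades(reference, comparisons, rho=0.5):
    """Return one grey relational grade per comparison sequence.

    reference   -- list of n values (X_0), already scaled to a common range
    comparisons -- list of sequences (X_1, X_2, ...), each with n values
    rho         -- distinguishing coefficient (0.5 is an assumed default)
    """
    # Absolute differences Delta_i(k) = |x_0(k) - x_i(k)|
    deltas = [[abs(r - c) for r, c in zip(reference, seq)] for seq in comparisons]
    # Two-level extrema: over all sequences i and all moments k
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    if d_max == 0:
        # Every comparison sequence coincides with the reference
        return [1.0 for _ in comparisons]
    grades = []
    for row in deltas:
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(coeffs) / len(coeffs))
    return grades


# Toy example with made-up, pre-normalized values (not SCADA data)
power = [0.0, 0.5, 1.0]
candidates = {
    "wind_speed": [0.0, 0.5, 1.0],    # tracks the reference exactly
    "ambient_temp": [1.0, 1.0, 0.0],  # follows a different pattern
}
grades = dict(zip(candidates, grey_relational_grades(power, list(candidates.values()))))
# Keep variables whose grade exceeds 0.5, the cut-off the paper applies later
selected = [name for name, g in grades.items() if g > 0.5]
```

In this toy case only `wind_speed` passes the 0.5 cut-off. The inputs are assumed to be normalized beforehand so that differences between variables with different units are comparable.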

Mining of Characteristic Parameter Data of Wind Turbines.
The grey correlation analysis mentioned above is applied to 1.6 MW wind turbines. By preprocessing the eigenvectors, 23 variables related to the operating state of the wind turbine are selected, and the values of these variables are extracted from the SCADA system. Based on operational data from January 1, 2015, to January 1, 2017, the features selected by grey correlation analysis and the grey correlation matrix of the characteristic parameters are shown as a color map in Figure 2. According to Figure 2, the grey correlation degrees between the primary parameters and the power, rotor speed, and pitch angle differ, which makes it feasible to mine the characteristic parameters of the generator set. In this paper, variables with a correlation degree greater than 0.5 are selected as inputs for the monitored variables. The sets of characteristic parameters are shown in Tables 2-4.
Table 1: Part of the parameter alarm information of the GE wind turbine. Alarm information: latest (= n) alarm coming from the frequency converter; n − 1 alarm coming from the frequency converter; n − 2 alarm coming from the frequency converter; n − 3 alarm coming from the frequency converter.

Prediction Model Based on Support Vector Regression
After the data validity analysis and dimensionality reduction are completed, regression is performed on the parameters of the wind turbine. The SVR algorithm [23], which is based on the structural risk minimization criterion, handles the practical problems of small samples, nonlinearity, and high dimension and overcomes the shortcomings of an indeterminate network structure, local minima, overlearning, and underlearning. Therefore, this paper chooses the SVR algorithm to build the regression prediction model. The specific algorithm is as follows.
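Let the training set be $(x_i, y_i)$, $i = 1, 2, \ldots, l$, with $x \in R^l$ and $y \in R$, where $x_i$ is the $i$-th input vector and $y_i$ is the $i$-th output. For the nonlinear SVR model, a nonlinear mapping $\varphi(\cdot)$ is used to map the samples into the feature space as $\varphi(x_i)$. The optimal decision function is as follows:
$$f(x) = w^T \varphi(x) + b,$$
where $w$ is the weight coefficient vector in the feature space and $b$ is the bias. It is assumed that all training samples can be fitted by the linear function with accuracy $\varepsilon$. According to the principle of structural risk minimization, the problem can be formulated as
$$\min_{w, b, \xi, \xi^*} \ \frac{1}{2}\|w\|^2 + c \sum_{i=1}^{l} (\xi_i + \xi_i^*) \quad \text{s.t.} \quad \begin{cases} y_i - w^T \varphi(x_i) - b \le \varepsilon + \xi_i, \\ w^T \varphi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \\ \xi_i, \xi_i^* \ge 0, \end{cases}$$
where $\xi_i$ and $\xi_i^*$ are the relaxation factors and $c$ is the penalty factor. Introducing the Lagrange function gives
$$L = \frac{1}{2}\|w\|^2 + c \sum_{i=1}^{l} (\xi_i + \xi_i^*) - \sum_{i=1}^{l} \alpha_i \left(\varepsilon + \xi_i - y_i + w^T \varphi(x_i) + b\right) - \sum_{i=1}^{l} \alpha_i^* \left(\varepsilon + \xi_i^* + y_i - w^T \varphi(x_i) - b\right) - \sum_{i=1}^{l} (\gamma_i \xi_i + \gamma_i^* \xi_i^*),$$
where $\alpha_i$, $\alpha_i^*$, $\gamma_i$, and $\gamma_i^*$ are the Lagrange multipliers. Setting the partial derivatives of $L$ with respect to $w$, $b$, $\xi_i$, and $\xi_i^*$ to zero and substituting the results back into the Lagrange function yields the optimization problem in the dual space:
$$\max_{\alpha, \alpha^*} \ -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, \varphi(x_i)^T \varphi(x_j) - \varepsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} y_i (\alpha_i - \alpha_i^*) \quad \text{s.t.} \quad \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le c.$$
Using the positive definite matrix theorem, the inner product $\varphi(x_i)^T \varphi(x_j)$ is replaced by a kernel function $k(x_i, x_j)$. Therefore, the SVR function can be obtained as follows:
$$f(x) = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*)\, k(x_i, x) + b.$$
Among the kernel functions, the radial basis function (RBF) kernel has a simple structure and good generalization ability. Based on this, the RBF kernel is selected as the kernel function of the model: $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$, where $\sigma$ is the kernel width.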
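To make the final SVR function concrete, the sketch below evaluates the RBF kernel and the kernel expansion of the prediction for a single input. It is only an illustration: it does not solve the dual problem, and the support vectors, coefficients, and bias are hypothetical values rather than a trained model. The function names are likewise assumptions introduced here.

```python
import math


def rbf_kernel(x, z, sigma):
    """k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def svr_predict(x, support_vectors, dual_coeffs, bias, sigma):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i*) k(x_i, x) + b.

    dual_coeffs holds the differences (alpha_i - alpha_i*),
    one per support vector.
    """
    return bias + sum(
        coef * rbf_kernel(sv, x, sigma)
        for sv, coef in zip(support_vectors, dual_coeffs)
    )


# Hypothetical values chosen for illustration only (not a trained model)
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coeffs = [0.5, -0.25]
prediction = svr_predict([0.0, 0.0], support_vectors, dual_coeffs, bias=0.1, sigma=1.0)
```

A smaller `sigma` makes each support vector influence only nearby inputs, which is one reason the kernel width has to be tuned together with the penalty factor.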

Genetic Algorithm (GA).
In this model, the penalty coefficient and the parameter of the kernel function affect the SVR precision. Therefore, GA is used to optimize the parameters of the SVR model. GA searches for the optimal solution by imitating the natural selection and genetic mechanisms of Darwin's theory of biological evolution [24]. The main process begins by encoding the solution to the problem. There are two ways to encode an individual solution, binary coding and real number coding, both of which essentially map the solution space to the chromosome space. Then, a reasonable initial population is generated in this space, individuals are evaluated with a fitness function, and selection, crossover, and mutation operations are applied. Individuals with high fitness are retained, and those with low fitness are eliminated, so that each new generation of offspring retains the advantages of the previous generation. This process is iterated many times until the optimal solution is obtained.
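The loop described above can be sketched as a small real-coded GA. This is a minimal illustration under stated assumptions, not the configuration used in the paper: tournament selection, arithmetic crossover, Gaussian mutation, elitism, and all default rates are choices made here for the example, and the fitness is minimized because the paper later uses the MSE as the fitness value.

```python
import random


def genetic_minimize(fitness, bounds, pop_size=20, generations=40,
                     crossover_rate=0.8, mutation_rate=0.1, seed=0):
    """Real-coded GA that minimizes `fitness` over the box `bounds`.

    Returns (best_individual, history), where history[g] is the best
    fitness value present in generation g.
    """
    rng = random.Random(seed)

    def clip(ind):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(ind, bounds)]

    def tournament(scored):
        # Pick the fitter of two randomly drawn individuals
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] <= b[0] else b[1]

    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scored = sorted(((fitness(ind), ind) for ind in population),
                        key=lambda s: s[0])
        history.append(scored[0][0])
        next_pop = [scored[0][1]]  # elitism: carry over the best individual
        while len(next_pop) < pop_size:
            p1, p2 = tournament(scored), tournament(scored)
            if rng.random() < crossover_rate:
                w = rng.random()  # arithmetic crossover of the two parents
                child = [w * a + (1 - w) * b for a, b in zip(p1, p2)]
            else:
                child = list(p1)
            # Gaussian mutation, scaled to each parameter's range
            child = [v + rng.gauss(0, 0.1 * (hi - lo))
                     if rng.random() < mutation_rate else v
                     for v, (lo, hi) in zip(child, bounds)]
            next_pop.append(clip(child))
        population = next_pop
    return min(population, key=fitness), history


# Toy objective standing in for the cross-validated MSE of an SVR
def toy_fitness(params):
    return (params[0] - 3.0) ** 2 + (params[1] - 0.5) ** 2


toy_bounds = [(0.0, 10.0), (0.0, 2.0)]
best, history = genetic_minimize(toy_fitness, toy_bounds)
```

Because the best individual is copied unchanged into the next generation, the best fitness recorded in `history` can never get worse from one generation to the next.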

Cross Validation (CV).
In machine learning, CV is mainly used for model performance evaluation and learning. The basic principle is that the original sample is divided into a training set and a validation set; the training set is used to train the model, and the trained model is then evaluated on the validation set. As a performance index for evaluating the model, CV considers the training error as well as the generalization error. The most common CV method is k-fold cross validation (K-fold CV), in which the sample set S of m samples is divided into k disjoint subsets of m/k samples each, denoted $S_1, S_2, \ldots, S_k$.
The SVR parameters are optimized by combining GA with K-fold CV as follows:
(1) The SVR parameters $(c, \sigma)$ are coded to form the initial population.
(2) The population is decoded, and the fitness of each individual is calculated with the K-fold CV method. In this paper, $\varepsilon_{S_j}(h_{ij})$, the minimum mean square error (MSE) of the samples for $(c, \sigma)$, is chosen as the fitness function value of the GA.
(3) Check whether the termination condition is met; if so, go to step (5); otherwise, proceed to step (4).
(4) Update the population by selection, crossover, and mutation; then, return to step (2).
(5) Output the optimal $(c, \sigma)$ and the optimal model.
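The fitness evaluation in step (2) can be sketched as a K-fold routine that returns the average validation MSE. This is an illustrative outline under assumptions: the `fit` and `predict` callbacks are hypothetical placeholders, and a trivial mean predictor stands in for an SVR trained with a candidate (c, σ). The folds are contiguous and unshuffled, which is also an assumption of this sketch.

```python
def k_fold_splits(n_samples, k):
    """Split indices 0..n_samples-1 into k disjoint, near-equal folds."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    return folds


def cross_val_mse(xs, ys, k, fit, predict):
    """Average validation MSE over k folds.

    fit(train_xs, train_ys) -> model; predict(model, x) -> float.
    """
    folds = k_fold_splits(len(xs), k)
    fold_errors = []
    for j, val_idx in enumerate(folds):
        # Train on every fold except fold j, validate on fold j
        train_idx = [i for f, fold in enumerate(folds) if f != j for i in fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        sq_errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in val_idx]
        fold_errors.append(sum(sq_errors) / len(sq_errors))
    return sum(fold_errors) / k


# Stand-in model that predicts the training mean; an SVR trained with a
# candidate (c, sigma) would take its place inside a GA fitness function.
def fit_mean(train_xs, train_ys):
    return sum(train_ys) / len(train_ys)


def predict_mean(model, x):
    return model


xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
score = cross_val_mse(xs, ys, k=3, fit=fit_mean, predict=predict_mean)
```

Wrapping `cross_val_mse` in a function of (c, σ) gives a fitness value that a GA can minimize, which is the role it plays in steps (1) to (5).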

Data Processing.
In this paper, the power, rotor speed, and pitch angle are taken as the output vectors and the other feature parameters are taken as the input vectors, and a multiple-input and multiple-output SVR model is then established. The accuracy of the proposed model is verified using four months of operating data from the wind turbine.
(1) According to the fault information recorded by the SCADA system, the samples recorded during maintenance shutdowns caused by wind turbine failures and the samples recorded under power-limited operation are eliminated.
(2) The cut-in wind speed is 3 m/s, the rated wind speed is 12 m/s, and the cut-out wind speed is 25 m/s. According to the actual power curve of the wind turbine, the wind speed range selected in this paper is 3 to 25 m/s.
(3) To eliminate the interference caused by the different magnitudes of the parameters, each parameter is normalized to [0, 1].
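Steps (2) and (3) above can be sketched as follows. The record layout and field names are hypothetical, the sample values are made up, and min-max scaling is assumed as the normalization to [0, 1], since the text does not state the exact formula.

```python
def filter_by_wind_speed(records, cut_in=3.0, cut_out=25.0):
    """Keep records whose wind speed lies within [cut_in, cut_out] m/s."""
    return [r for r in records if cut_in <= r["wind_speed"] <= cut_out]


def min_max_normalize(values):
    """Scale a list of numbers linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up records; the field names are hypothetical
records = [
    {"wind_speed": 2.0, "power": 0.0},
    {"wind_speed": 5.0, "power": 200.0},
    {"wind_speed": 12.0, "power": 1600.0},
    {"wind_speed": 26.0, "power": 0.0},
]
kept = filter_by_wind_speed(records)
power_scaled = min_max_normalize([r["power"] for r in kept])
```

In practice the minimum and maximum would be taken from the training data and reused for new samples, so that training and monitoring data share the same scale.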

Model Establishment.
To demonstrate the validity of the model, this paper selects four months of valid wind turbine data for prediction. In this model, the cross validation selection principle first considers the minimum MSE. Among parameter groups with similar MSE, the group with the smaller penalty parameter is selected as the best parameter to avoid overlearning. From Figure 4, we can see that the model combining 5-fold CV and GA performs best. The average fitness curve in Figure 4 indicates the average fitness of all the individuals in each generation. The best fitness curve represents the maximum fitness of all individuals in each generation. The fitness curve converges very quickly, and the final convergence level is relatively consistent, which reflects the optimization of the SVR parameters. With the power, rotor speed, and pitch angle as outputs, the best parameters are applied to the SVR model. The comparison between the actual and predicted values of the SVR model is shown in Figure 5. Table 5 lists the power, rotor speed, and pitch angle errors. The mean relative error values in Table 5 indicate the good prediction accuracy and stability of the SVR model optimized by the CV-GA algorithm.
If the threshold ν is selected too low, the algorithm is too sensitive to changes in the operating state of the wind turbine and is prone to misjudgment. If ν is selected too large, the prediction lead time will be reduced and the detection rate of the abnormal operating state will be affected. To solve this problem, it is necessary to select an appropriate threshold ν. In this paper, two years of SCADA data of the 1.6 MW wind turbine are analyzed, and the appropriate threshold is determined. From Table 5, it can be seen that an improper threshold selection will prevent the operating state of the wind turbine from being recognized correctly. To balance the detection rate against the misjudgment rate, the threshold is selected as ν = 0.16 (Table 6).
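The alarm rule can be illustrated with a short sketch. The text does not define the residual that is compared with ν, so the example assumes the Euclidean distance between the normalized actual and predicted output vectors; that choice, the function name, and the sample values are all assumptions made here. Only the value ν = 0.16 comes from the paper.

```python
import math


def flag_abnormal(actual, predicted, threshold=0.16):
    """Return a list of booleans, True where a sample is flagged abnormal.

    actual, predicted -- lists of normalized output vectors, e.g.
                         [power, rotor_speed, pitch_angle] per sample
    threshold         -- alarm threshold nu (0.16 is the paper's choice)
    """
    flags = []
    for a, p in zip(actual, predicted):
        # Assumed residual measure: Euclidean distance between the vectors
        distance = math.sqrt(sum((ai - pi) ** 2 for ai, pi in zip(a, p)))
        flags.append(distance > threshold)
    return flags


# Made-up normalized samples for illustration
actual = [[0.50, 0.60, 0.10], [0.20, 0.55, 0.40]]
predicted = [[0.52, 0.61, 0.10], [0.45, 0.60, 0.15]]
flags = flag_abnormal(actual, predicted)
```

Lowering `threshold` flags more samples and raises the risk of false alarms, while raising it delays detection, which is the trade-off described above.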

Example Analysis
At 5:32 on August 7, 2016, a wind turbine in Hebei Province shut down following a SCADA system failure alarm. Inspection showed that the shutdown was caused by a failure of the pitch gear of blade 1. The 950 sets of data recorded by the SCADA system before the alarm stopped the wind turbine were extracted and input into the model, as shown in Figure 6. As can be seen from the figure, close to 150 data points were detected as abnormal before the wind turbine shut down. The power restriction indicates that parts of the wind turbine had begun to deteriorate at these times, and the unit had issued abnormal alarms. The reduced power generated by the unit indicates that the model can give a warning before the failure occurs. Therefore, the proposed model is effective for the state monitoring and fault prediction of the wind turbine, and it can help avoid the continued deterioration of the fault and its influence on the safe operation of the power grid.

Conclusion
To extract state-relevant information from the massive and high-dimensional data of the SCADA system and realize condition monitoring of the wind turbine, a data mining method based on the grey correlation degree is proposed to extract the characteristic parameters of the wind turbine's operating state, which reduces the data dimension and the amount of computation. To improve the precision, GA and CV are combined to optimize the parameters of the regression model. To verify the validity of the model, the threshold of the SVR model is analyzed, and the model is applied to a wind farm. The results show the following:
(1) By establishing a data mining model of the characteristic parameters based on grey correlation analysis, we extract the parameters that are most closely related to the power, rotor speed, and pitch angle, effectively avoiding the curse of dimensionality.

Mathematical Problems in Engineering
(2) Comparing the outputs of the power, rotor speed, and pitch angle regression models with the measured values shows that the average relative error of the SVR model is low. The regression model has high accuracy and generalization ability and can be applied to wind turbine anomaly detection.
(3) The model was applied in practice, and its results were analyzed against the measured values recorded by the SCADA system. The results show that, when an appropriate distance threshold is chosen, the condition monitoring reflects the operating status of the wind turbine and provides a technical reference for online monitoring of wind turbines.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.