Rotating Machinery Fault Diagnosis for Imbalanced Data Based on Fast Clustering Algorithm and Support Vector Machine

To diagnose rotating machinery fault for imbalanced data, a method based on fast clustering algorithm (FCA) and support vector machine (SVM) was proposed. Combined with variational mode decomposition (VMD) and principal component analysis (PCA), sensitive features of the rotating machinery fault were obtained and constituted the imbalanced fault sample set. Next, a fast clustering algorithm was adopted to reduce the number of the majority data from the imbalanced fault sample set. Consequently, the balanced fault sample set consisted of the clustered data and the minority data from the imbalanced fault sample set. After that, SVM was trained with the balanced fault sample set and tested with the imbalanced fault sample set so the fault diagnosis model of the rotating machinery could be obtained. Finally, the gearbox fault data set and the rolling bearing fault data set were adopted to test the fault diagnosis model. The experimental results showed that the fault diagnosis model could effectively diagnose the rotating machinery fault for imbalanced data.


Introduction
With the development of modern large-scale production and the progress of science and technology, the structure of mechanical equipment has become more complex. During operation, a sudden equipment failure can lead to loss of service ability or even cause a serious accident [1,2]. To ensure equipment reliability and obtain greater economic and social benefits, timely and accurate diagnosis of the failure mode is particularly important for guaranteeing normal operation. Rotating machinery, such as bearings and gears, is widely used in numerical control machine tools, aeroengines, electric power systems, agricultural machinery, transport machinery, metallurgical machinery, and other modern industrial equipment [3][4][5]. In recent years, new technologies and theories such as artificial neural networks have been widely applied in mechanical equipment fault diagnosis, which has greatly improved diagnostic accuracy. Rotating machinery exhibits various kinds of faults; however, samples of some typical faults are difficult to obtain [6,7]. Therefore, it is necessary to study rotating machinery fault diagnosis under the condition of imbalanced data.
At present, SVM-based fault diagnosis is one of the most widely used fault diagnosis methods for mechanical equipment [8,9]. This method learns from process data of different operating states of the equipment and then classifies the data into different faults by constructing classification hyperplanes in a high-dimensional space. Imbalanced data means that, among the data used to train a classifier, some fault classes have many more samples than others. When imbalanced data are used as the training set, the classification hyperplane is offset from the true one, which reduces the validity of fault diagnosis. In SVM, the penalty factor indicates the error sensitivity of the classifier. Currently, one of the methods used to solve the imbalanced data problem assigns different penalty factors to positive and negative samples, increasing the penalty factor of the minority samples so that the classifier is more sensitive to them [10,11]. However, it is difficult to choose suitable penalty factors for different faults. Different values directly affect the performance of the classifier: small penalty factors often produce no obvious suppression effect, while larger penalty factors weaken the generalization ability of the classifier. Another method is to preprocess the data [12,13] by reducing the number of majority samples so that the data set becomes balanced. In that case, the selection of core data is the key to the performance of the SVM classifier for imbalanced data.
The purpose of a clustering algorithm is to group data according to their similarity. Therefore, we propose an approach based on a fast clustering algorithm to reduce the number of majority samples in the imbalanced data. This fast clustering algorithm was proposed by Rodriguez and Laio in 2014, based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities [14][15][16][17]. Based on these two assumptions, the fast clustering algorithm can be used to identify different clusters.
To diagnose rotating machinery faults for imbalanced data, a method based on the fast clustering algorithm and SVM is proposed. In the proposed method, original features of different faults are constructed by VMD. Next, PCA is applied to reduce the dimension of the original features so that sensitive features can be obtained. After that, the fast clustering algorithm is adopted to reduce the number of majority samples in the imbalanced sensitive features. Finally, SVM is trained with the data selected by the fast clustering algorithm, so that the fault diagnosis model for imbalanced data can be obtained.

SVM and Imbalanced Data Classification
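SVM.
As a machine learning algorithm developed from statistical learning theory, SVM maps inseparable learning samples from a low-dimensional space into a high-dimensional space through a kernel function to obtain an optimal hyperplane [18]. If the training set $\{(x_i, y_i),\ i = 1, 2, \ldots, n\}$ consists of two categories, then the computational goal can be expressed as

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i, \tag{1}$$

where $C$ is the penalty factor and $\xi_i$ is the slack variable. The constraint conditions can be defined as

$$y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1, 2, \ldots, n. \tag{2}$$

Then, the Lagrange function is constructed as

$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left[y_i\,(w \cdot x_i + b) - 1 + \xi_i\right] - \sum_{i=1}^{n}\beta_i\,\xi_i, \tag{3}$$

where $\alpha_i$ and $\beta_i$ are the Lagrange multipliers.
Then, the classification function can be formed as

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{n}\alpha_i\,y_i\,K(x_i, x) + b\right), \tag{4}$$

where $K(x_i, x)$ is the kernel function.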
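As an illustration of how the classification function combines support vectors, multipliers, and a kernel, the following minimal sketch evaluates the sign of the kernel expansion. The Gaussian (RBF) kernel, the `gamma` value, and the hand-picked support vectors, multipliers, and bias are all assumptions made for the example; they are not trained values and do not come from the experiments in this paper.

```python
import math


def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def svm_decide(x, support_vectors, labels, alphas, bias, kernel=rbf_kernel):
    """Return +1 or -1 from sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    score = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas)) + bias
    return 1 if score >= 0 else -1


# Hypothetical, hand-picked values (not the result of training).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]
bias = 0.0

near_negative = svm_decide((0.1, 0.0), support_vectors, labels, alphas, bias)
near_positive = svm_decide((1.9, 2.1), support_vectors, labels, alphas, bias)
```

A point close to the negative support vector receives a negative score and a point close to the positive one receives a positive score, because the kernel value decays with distance.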

Classification Boundary Migration of SVM.
The SVM classification algorithm assumes that the number of samples in each class is approximately equal. In fact, for rotating machinery, the acquisition of fault samples is full of randomness, so it is difficult to guarantee the balance among different fault samples. Figure 1 shows the skewing of the hyperplane.
From Figure 1, it can be seen that the hyperplane could easily distinguish the two classes in the balanced data set. However, the hyperplane obviously shifted towards the minority class if the data set was imbalanced. Since the number of samples in class 2 was small and the two classes adopted the same penalty factor, the overall error caused by class 2 was also small. As a result, the hyperplane was easily affected by outliers and moved in the direction of class 2, which caused a large classification error for the minority class. Therefore, to improve the classification performance of the SVM classifier for imbalanced data, a fast clustering algorithm was adopted to balance the data set.

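Imbalanced Data Classification Based on Fast Clustering Algorithm and SVM
With a cut-off kernel, the local density $\rho_i$ of data point $i$ can be calculated as

$$\rho_i = \sum_{j \ne i}\chi(d_{ij} - d_c),\qquad \chi(x) = \begin{cases}1, & x < 0,\\ 0, & x \ge 0,\end{cases} \tag{5}$$

where $d_{ij}$ is the distance between data point $i$ and data point $j$ and $d_c$ is the cut-off distance.
With a Gaussian kernel, the local density $\rho_i$ of data point $i$ can be calculated as

$$\rho_i = \sum_{j \ne i}\exp\left[-\left(\frac{d_{ij}}{d_c}\right)^2\right]. \tag{6}$$

In both (5) and (6), the local density $\rho_i$ reflects the number of data points whose distance to data point $i$ is less than $d_c$.
Distance $\delta_i$ is defined as

$$\delta_i = \begin{cases}\min\limits_{j \in I_S^i} d_{ij}, & I_S^i \ne \varnothing,\\[4pt] \max\limits_{j} d_{ij}, & I_S^i = \varnothing,\end{cases} \tag{7}$$

where set $I_S^i = \{k : \rho_k > \rho_i\}$. From (7), we know that distance $\delta_i$ is the minimum distance between point $i$ and any point with higher density, except when point $i$ itself has the highest density.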
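For each data point, we can calculate its local density $\rho_i$ and distance $\delta_i$. Then, the weight of clustering center $\gamma_i$ is constructed as

$$\gamma_i = \rho_i\,\delta_i. \tag{8}$$

Obviously, points with larger weights are clustering centers. The sequence $\{n_i\}$ is constructed as

$$n_{q_i} = \begin{cases}\arg\min\limits_{q_j,\ j < i} d_{q_i q_j}, & i \ge 2,\\[4pt] 0, & i = 1,\end{cases} \tag{9}$$

where sequence $\{q_i\}$ is the index number of local density $\rho_i$ sorted in descending order. The sequence $\{n_i\}$ represents the index number of the point closest to point $i$ among the points whose local density is larger than that of point $i$.
Then, the nonclustering center points can be categorized as

$$c_{q_i} = \begin{cases}k, & \text{if point } q_i \text{ is the } k\text{th clustering center},\\ c_{n_{q_i}}, & \text{otherwise},\end{cases} \tag{10}$$

where $k$ is the label of the clustering centers.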
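The three per-point quantities described above (local density, distance to the nearest denser point, and their product as the clustering-center weight) can be sketched as follows. This is a minimal illustration that assumes the Gaussian-kernel density and Euclidean distance; the toy points and the cut-off distance `d_c=0.5` are made up for the example, and for the densest point the distance is taken as its largest distance to any other point, which is the usual convention for this algorithm rather than something stated in the text.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def density_peak_scores(points, d_c):
    """Return (rho, delta, gamma) lists for the given points.

    rho   : Gaussian-kernel local density
    delta : distance to the nearest point of higher density
            (largest distance to any point for the densest point)
    gamma : rho * delta, used as the clustering-center weight
    """
    n = len(points)
    dist = [[euclidean(p, q) for q in points] for p in points]
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    gamma = [r * d for r, d in zip(rho, delta)]
    return rho, delta, gamma


# Toy data (illustrative only): two tight groups that are far apart.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (4.9, 5.0), (5.0, 4.9)]
rho, delta, gamma = density_peak_scores(points, d_c=0.5)

# The two points with the largest weights are taken as clustering centers.
centers = sorted(range(len(points)), key=lambda i: gamma[i], reverse=True)[:2]
```

In this toy set the middle point of each group has both the highest density and a large distance to any denser point, so the two middle points receive the largest weights.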
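The categorization of nonclustering center points can be sketched as a single pass over the points in descending order of local density, where each unlabeled point inherits the label of its nearest neighbor among the denser points. The function name `assign_labels`, the one-dimensional toy positions, and the density values below are hypothetical and chosen only to make the behavior easy to follow.

```python
def assign_labels(rho, dist, centers):
    """Label each point with the cluster of its nearest denser neighbor.

    rho     : local densities, one per point
    dist    : full pairwise distance matrix (list of lists)
    centers : indices of the chosen clustering centers
    """
    n = len(rho)
    labels = [-1] * n
    for k, c in enumerate(centers):
        labels[c] = k
    # Visit points from highest to lowest density so that the denser
    # neighbor of each point has always been labeled already.
    order = sorted(range(n), key=lambda i: rho[i], reverse=True)
    for pos, i in enumerate(order):
        if labels[i] != -1:
            continue
        if pos == 0:
            raise ValueError("the densest point must be a clustering center")
        nearest = min(order[:pos], key=lambda j: dist[i][j])
        labels[i] = labels[nearest]
    return labels


# Hypothetical 1-D example: positions and densities are made up.
positions = [0.0, 0.2, 0.4, 5.0, 5.2]
rho = [3.0, 2.0, 1.0, 2.5, 1.5]
dist = [[abs(a - b) for b in positions] for a in positions]
labels = assign_labels(rho, dist, centers=[0, 3])
```

Points 1 and 2 follow the chain of denser neighbors back to center 0, while point 4 attaches to center 3.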
For each cluster, the mean local density of this cluster is calculated. By comparing the mean local density, the points of this cluster can be divided into core points or halo points.
The synthetic point distributions data set [16] was adopted to test the effectiveness of the algorithm. Figure 2(a) shows the distribution before clustering, while Figure 2(b) shows the distribution after clustering. It is clear that the core points of the five classes were correctly chosen from the raw synthetic point distributions data set, which shows that the fast clustering algorithm can be applied to eliminate the halo points of the raw data.

Imbalanced Data Classification.
With the fast clustering algorithm, the imbalanced data set was preprocessed and the number of samples in the majority classes was reduced. In this way, the raw data set was reassembled into a balanced data set. Then, the SVM classification algorithm was adopted to learn the balanced data set. The movement of the SVM hyperplane during the process of clustering is shown in Figure 3.
As shown in Figure 3, because of the difference in the number of samples between the classes, the hyperplane was obviously biased towards the minority class. The purpose of the fast clustering algorithm was to search for the core points of the majority class and reconstruct a balanced data set so that the hyperplane could return towards the side of the majority class. Therefore, the classification accuracy of the SVM classifier could be improved.

Evaluation of Imbalanced Data Classification.
For imbalanced data, the proportion of minority samples is small, so the classification results of the minority samples have little effect on the overall classification accuracy. Therefore, dedicated evaluation indexes are used for imbalanced data [19,20]. Based on the confusion matrix, we defined the positive class (minority class) as P and the negative class (majority class) as N in the imbalanced data. As shown in Figure 4, TP and TN denote the correctly identified positive and negative samples, respectively. FP indicates the negative samples misclassified into the positive class, while FN indicates the positive samples misclassified into the negative class.
The recall of the positive class can be defined as

$$R_P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}. \tag{11}$$

The recall of the negative class is

$$R_N = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}. \tag{12}$$

The precision of the positive class can be formed as

$$P_P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}. \tag{13}$$

Then $G$-mean can be constructed as

$$G\text{-mean} = \sqrt{R_P \times R_N}, \tag{14}$$

and $F$-mean can be constructed as

$$F\text{-mean} = \frac{2\,R_P\,P_P}{R_P + P_P}. \tag{15}$$

As an evaluation index, $G$-mean takes into account the classification performance of both the positive and the negative class. If the classifier is biased towards one class, the classification accuracy of the other class suffers directly and the $G$ value becomes very small. From (15), we can see that $F$-mean considers the recall and precision of the positive class. Therefore, $F$-mean can comprehensively show the classification effect of the classifier on the positive class (minority class).
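The evaluation indexes above can be computed directly from the four confusion-matrix counts. The sketch below assumes that G-mean is the geometric mean of the two recalls and that F-mean is the harmonic mean of positive-class recall and precision (equal weighting), which is how the surrounding text describes them; the function name and the counts are hypothetical, not experimental results.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Compute recall, precision, G-mean, and F-mean from confusion counts.

    Positive class = minority class, negative class = majority class.
    Ratios with a zero denominator are reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    recall_pos = ratio(tp, tp + fn)
    recall_neg = ratio(tn, tn + fp)
    precision_pos = ratio(tp, tp + fp)
    g_mean = math.sqrt(recall_pos * recall_neg)
    f_mean = ratio(2 * recall_pos * precision_pos, recall_pos + precision_pos)
    return {
        "recall_pos": recall_pos,
        "recall_neg": recall_neg,
        "precision_pos": precision_pos,
        "g_mean": g_mean,
        "f_mean": f_mean,
    }


# Hypothetical counts: 10 positive and 150 negative test samples.
metrics = imbalance_metrics(tp=8, fn=2, fp=10, tn=140)
overall_accuracy = (8 + 140) / 160
```

With these made-up counts the overall accuracy is 0.925, yet the F-mean is only about 0.57, which illustrates why overall accuracy alone hides poor precision on the minority class.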

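Rotating Machinery Fault Diagnosis for Imbalanced Data
In VMD, the signal $f$ is decomposed into $K$ intrinsic mode functions (IMFs) $u_k$ with center frequencies $\omega_k$ by solving the constrained variational problem

$$\min_{\{u_k\},\{\omega_k\}}\ \sum_{k}\left\|\partial_t\left[\left(\delta(t) + \frac{j}{\pi t}\right) * u_k(t)\right]e^{-j\omega_k t}\right\|_2^2 \quad \text{s.t.}\quad \sum_{k}u_k = f. \tag{16}$$

The constrained problem is converted into an unconstrained one through the augmented Lagrangian

$$L(\{u_k\},\{\omega_k\},\lambda) = \alpha\sum_{k}\left\|\partial_t\left[\left(\delta(t) + \frac{j}{\pi t}\right) * u_k(t)\right]e^{-j\omega_k t}\right\|_2^2 + \left\|f(t) - \sum_{k}u_k(t)\right\|_2^2 + \left\langle\lambda(t),\ f(t) - \sum_{k}u_k(t)\right\rangle, \tag{17}$$

where $\alpha$ is the penalty factor and $\lambda$ is the Lagrange multiplier. The process of decomposing was as follows. First, $\{u_k\}$, $\{\omega_k\}$, $\lambda$, and $n$ were all initialized as 0. Then, $u_k$, $\omega_k$, and $\lambda$ were updated through the circulative iteration. $u_k$ was updated as

$$\hat{u}_k^{\,n+1}(\omega) = \frac{\hat{f}(\omega) - \sum_{i \ne k}\hat{u}_i(\omega) + \hat{\lambda}(\omega)/2}{1 + 2\alpha\,(\omega - \omega_k)^2}. \tag{18}$$

The center frequencies $\omega_k$ can be calculated as

$$\omega_k^{\,n+1} = \frac{\int_0^{\infty}\omega\,\left|\hat{u}_k(\omega)\right|^2 d\omega}{\int_0^{\infty}\left|\hat{u}_k(\omega)\right|^2 d\omega}. \tag{19}$$

The condition for convergence is the following:

$$\sum_{k}\frac{\left\|\hat{u}_k^{\,n+1} - \hat{u}_k^{\,n}\right\|_2^2}{\left\|\hat{u}_k^{\,n}\right\|_2^2} < \varepsilon, \tag{20}$$

where $\varepsilon$ is the discriminant accuracy. Finally, the original signal $f$ was decomposed into a number of IMFs, $u_k$. Then, the energy of each IMF was calculated to constitute the original feature vector, which was used to characterize the original signal.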
To test the validity of VMD, a pure harmonic signal affected by noise was adopted. Furthermore, we also conducted a comparison with empirical mode decomposition (EMD) based on the exact same testing signal. Here, the pure harmonic signal is denoted $f_1$. The noisy input signal was the pure harmonic signal affected by noise, with the expression as follows:

$$f = f_1 + 0.3\cos(1400\pi t) + 0.36\cos(576\pi t) + 0.7\,\eta, \tag{22}$$

where $\eta \sim N(0, \sigma)$ represents the additive Gaussian noise.
The signal waveforms of the pure harmonic signal and the noisy input signal are shown in Figure 5.
Figure 6 shows the decomposition of the noisy input signal. It was clear that the VMD algorithm correctly extracted the pure harmonic signal from the noisy input signal, whereas the EMD algorithm extracted seven IMFs from the noisy input signal, none of which was the pure harmonic signal.
With VMD, the vibration signal of the rotating machinery was decomposed into a number of IMFs. Then, the energy of each IMF was calculated to constitute the original feature vector. Since these original feature vectors were high-dimensional, a dimensionality reduction algorithm was applied to reduce the computational complexity.
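The energy-based feature construction can be sketched as follows, taking the energy of an IMF to be the sum of its squared samples. The decomposition itself is not implemented here: `imfs` is assumed to be the list of modes already returned by a VMD routine, and the optional normalization by total energy is an assumption added for illustration rather than a step stated in the text.

```python
def imf_energy_features(imfs, normalize=False):
    """Build a feature vector from the energy of each IMF.

    imfs      : list of IMFs, each a sequence of samples
    normalize : if True, divide by the total energy so features sum to 1
                (an illustrative option, not taken from the paper)
    """
    energies = [sum(sample * sample for sample in imf) for imf in imfs]
    if normalize:
        total = sum(energies)
        if total > 0:
            energies = [e / total for e in energies]
    return energies


# Toy "IMFs" (made-up samples standing in for a real decomposition).
toy_imfs = [
    [1.0, -1.0, 1.0, -1.0],
    [0.5, 0.5, -0.5, -0.5],
]
raw_features = imf_energy_features(toy_imfs)
relative_features = imf_energy_features(toy_imfs, normalize=True)
```

Each vibration record then contributes one feature vector whose length equals the number of IMFs.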

Feature Dimension Reduction.
PCA, a traditional dimensionality reduction algorithm, was adopted to reduce the dimension of the original feature vectors. PCA is a statistical method that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables. $X = [x_1, x_2, \ldots, x_n]$ expresses the $n$-dimensional original features, while $Y = [y_1, y_2, \ldots, y_m]$ is used to express the linearly uncorrelated sensitive features. With PCA, the contribution of the $i$th component $y_i$ can be defined as follows:

$$\eta_i = \frac{\lambda_i}{\sum_{j=1}^{m}\lambda_j}, \tag{23}$$

where $\lambda_i$ means the variance of $y_i$.
Then, the accumulated contribution of the first $p$ principal components can be calculated as follows:

$$\eta_{\Sigma}(p) = \frac{\sum_{i=1}^{p}\lambda_i}{\sum_{j=1}^{m}\lambda_j}. \tag{24}$$

Finally, the principal components with high contributions can be chosen as the sensitive features.
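The contribution of each principal component and the accumulated contribution of the leading components can be sketched as below. The component variances are assumed to be available already (for example, as eigenvalues of the feature covariance matrix); the variance values and the 0.85 threshold are hypothetical, since the text does not state a selection threshold.

```python
def contribution_rates(variances):
    """Contribution of each component: its variance over the total variance."""
    total = sum(variances)
    return [v / total for v in variances]


def select_components(variances, threshold=0.85):
    """Return (p, accumulated) for the smallest p whose accumulated
    contribution reaches the threshold (threshold is an assumed value)."""
    ordered = sorted(variances, reverse=True)
    accumulated = 0.0
    for p, rate in enumerate(contribution_rates(ordered), start=1):
        accumulated += rate
        if accumulated >= threshold:
            return p, accumulated
    return len(ordered), accumulated


# Hypothetical component variances (not the values behind Figure 11).
variances = [5.0, 3.0, 1.0, 0.6, 0.4]
rates = contribution_rates(variances)
n_selected, accumulated = select_components(variances, threshold=0.85)
```

Here the first two components reach only 0.8 of the total variance, so a third component is needed to pass the assumed threshold.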

Sample Selection and Fault Diagnosis Model.
The rotating machinery fault sample set (an imbalanced data set) is made up of several kinds of faults. Some faults are majority classes while others are minority classes. Each fault sample contains a number of sensitive features. The distance between the $i$th fault sample and the $j$th fault sample can be calculated as

$$d_{ij} = \sqrt{\sum_{k=1}^{K}\left(x_{ik} - x_{jk}\right)^2}, \tag{25}$$

where $x_{ik}$ are the sensitive features of the $i$th fault sample and $x_{jk}$ are the sensitive features of the $j$th fault sample. $K$ is the number of sensitive features for each fault sample.
According to (5)-(7), the local density $\rho_i$ and the distance $\delta_i$ were obtained. Then, based on (8), the weight $\gamma_i$ of each fault sample was calculated. With reference to the number of samples of the minority class, the same number of fault samples with the highest weights $\gamma_i$ were selected from the majority class. All samples of the minority class and the selected samples of the majority class constituted the balanced fault sample set. Finally, the SVM classification algorithm was adopted to learn the balanced fault sample set. The flowchart of building the fault diagnosis model is shown in Figure 7.
From Figure 7, it can be seen that the training samples chosen from the balanced fault sample set were used to train the SVM, while the testing samples chosen from the imbalanced fault sample set were applied to test the identification accuracy of the trained SVM. The SVM was retrained until its classification accuracy was acceptable; then this trained SVM could be adopted as the fault diagnosis model.
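The sample selection step can be sketched as follows: keep every minority sample, and keep only as many majority samples as there are minority samples, choosing those with the largest clustering-center weights. The function name, the label convention (-1 for the majority class, +1 for the minority class), and the toy samples and weights are assumptions for illustration; the weights are assumed to have been computed beforehand from local density and distance.

```python
def build_balanced_set(majority, minority, weights):
    """Keep all minority samples and the top-weighted majority samples.

    majority : list of majority-class samples
    minority : list of minority-class samples
    weights  : clustering-center weight of each majority sample
    Returns (samples, labels) with -1 for majority and +1 for minority.
    """
    if len(weights) != len(majority):
        raise ValueError("one weight is required per majority sample")
    n_keep = min(len(minority), len(majority))
    ranked = sorted(range(len(majority)),
                    key=lambda i: weights[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original sample order
    selected = [majority[i] for i in kept]
    samples = selected + list(minority)
    labels = [-1] * len(selected) + [1] * len(minority)
    return samples, labels


# Toy data: five majority samples, two minority samples (made-up values).
majority = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,)]
weights = [0.2, 0.9, 0.5, 0.1, 0.7]
minority = [(2.0,), (2.1,)]
samples, labels = build_balanced_set(majority, minority, weights)
```

The resulting set has equal numbers of samples per class and can be passed to any SVM trainer, while the untouched imbalanced set remains available for testing.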

The Experimental Results
To verify the viability and effectiveness of the proposed algorithm, the gearbox fault data set and the rolling bearing fault data set were adopted to test the proposed fault diagnosis model.

Gearbox Fault Diagnosis.
A wind turbine transmission chain fault simulation test bed is shown in Figure 8. The test bed mainly consisted of a motor driver, motor, gearbox, wind wheel, sensors, and data acquisition system. The wind wheel was driven by the motor through the gearbox, and the motor speed was controlled by the motor driver. An acceleration sensor was installed on the top of the gearbox, and its signal was acquired by the data acquisition system. The tested gearbox was a single-stage planetary transmission whose planetary gear had 20 teeth. In this test, two gearbox faults were simulated: half fracture and full fracture of the planetary gear. To simulate the real working condition of the wind turbine transmission chain, different wind wheel speeds were also considered. For each fault, three working conditions (wind wheel speed: 197 r/min, 237 r/min, and 277 r/min) were simulated. As shown in Table 1, the gearbox fault data set consisted of three condition modes. Figure 9 shows pictures of the planetary gears.
As shown in Figure 10, the vibration signal of the gearbox can be decomposed into a number of IMFs by VMD. Then, the original features can be obtained by calculating the energy of each IMF.
With PCA, the original features are projected onto a new space and replaced with the sensitive features. The sensitive features are sorted according to their contribution.
Figure 11 shows the first five principal component contributions of PCA. The accumulated contribution of the first three sensitive features was 89.28%; thus, sensitive feature 1, sensitive feature 2, and sensitive feature 3 were selected as the sensitive features. Figure 12 was obtained by plotting the three condition modes in the space formed by sensitive feature 1, sensitive feature 2, and sensitive feature 3. From Figure 12, it was clear that the distribution area of the normal planetary gear was separated from those of half fracture and full fracture; however, the distribution areas of half fracture and full fracture overlap in an aliasing region where it is difficult to distinguish them. The aliasing region can easily lead to misclassification of the different failures, especially for an imbalanced failure data set. Thus, half fracture data and full fracture data were used to construct an imbalanced data set to test the classification performance of the proposed fault diagnosis model.
Imbalanced data sets with different proportions were constructed, and their distributions are shown in Figure 13. Full fracture was defined as the positive class (minority class), while half fracture was the negative class (majority class). The number of positive samples varied from 10 to 100, while the number of negative samples was 150. $G$-mean and $F$-mean were adopted as the evaluation indexes.

Rolling Bearing Fault Diagnosis Based on Casing Vibration.
In the case of gearbox fault diagnosis, the gearbox fault data set was used to test the performance of the fault diagnosis model when the model was applied to distinguish two failure modes. In this case, the fault diagnosis model was applied to distinguish multiple failure modes in the imbalanced data set.
The rolling bearing fault simulation test bed is shown in Figure 16. The motor was connected to the shaft by a coupling, while the other end of the shaft carried the blades and the tested rolling bearing. The motor drove the blades, and a casing was installed around the blades. Two one-way accelerometers were installed on the surface of the casing at a 90-degree angle to each other. A data acquisition system was used to acquire the accelerometers' signals. The rotating speed was 1800 rpm and the sampling frequency was 16 kHz. The rolling bearing data set consisted of four modes: normal, rolling element failure, inner race failure, and outer race failure.
Table 4 shows the composition of the rolling bearing data set, in which the inner race failure and outer race failure were the positive classes (minority classes). With the proposed approach, the fault diagnosis model was obtained from the imbalanced rolling bearing data set. To test the classification accuracy of the model, a testing set of 400 samples (100 samples from each mode) was constructed.
Figure 17 shows the confusion matrices of the fault diagnosis model and other models. From Figure 17(a), the classification accuracy of the fault diagnosis model was 93.25%. Figure 17(b) shows that the FCA + BP model was unable to identify the inner race and outer race failures. From Figure 17(c), the classification accuracy of the SMOTE + SVM model was 90.25%, lower than that of the fault diagnosis model. Figure 17(d) shows the confusion matrix of the SVM model, which had difficulty with the inner race and outer race failures. The reason was that the SVM model was trained with the imbalanced data set. Since the inner race and outer race failures were minority classes, the hyperplane of the SVM model was biased towards them. Therefore, the trained SVM model found it difficult to identify the inner race and outer race failures in the testing samples. In conclusion, the fault diagnosis model could distinguish the modes of the rolling bearing and obtain good classification accuracy, which supports the validity of the proposed approach.
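For a multiclass test such as this one, the classification accuracy is the sum of the diagonal of the confusion matrix divided by the number of testing samples, and the per-class recall is each diagonal entry divided by its row total. The sketch below uses a made-up 4 x 4 matrix with 100 samples per row; it is not the matrix shown in Figure 17, and rows are assumed to be true classes and columns predicted classes.

```python
def accuracy_from_confusion(matrix):
    """Overall accuracy: correctly classified samples over all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix):
    """Recall of each class: diagonal entry over its row total."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


# Hypothetical confusion matrix (rows: true class, columns: predicted).
# Order: normal, rolling element, inner race, outer race.
confusion = [
    [97, 3, 0, 0],
    [4, 94, 1, 1],
    [0, 2, 91, 7],
    [0, 0, 10, 90],
]
accuracy = accuracy_from_confusion(confusion)
recalls = per_class_recall(confusion)
```

Reading the per-class recalls next to the overall accuracy shows whether the minority failure modes are recognized as well as the majority ones.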

Conclusions
In this paper, a data-based approach was proposed. The experimental results showed that the proposed approach achieved better classification accuracy than the other models. The following conclusions were obtained: (1) The signals of the accelerometers can be used to diagnose rotating machinery faults. (2) By combining VMD and PCA, sensitive features of the rotating machinery fault can be extracted. (3) To diagnose rotating machinery faults for imbalanced data, the fast clustering algorithm was adopted to reduce the number of majority samples in the imbalanced sensitive features. Then, the SVM was trained with the balanced data selected by the fast clustering algorithm and tested with the imbalanced data, so that the fault diagnosis model for imbalanced data was obtained. The fault diagnosis model showed very good classification capability on both the gearbox fault data set and the rolling bearing fault data set. Therefore, the approach is suitable for rotating machinery fault diagnosis with imbalanced data.

Figure 1: Hyperplane of SVM. (a) Balanced data set and (b) imbalanced data set.

Figure 4: Confusion matrix of the imbalanced data.

Figure 7: Flowchart of building the fault diagnosis model.
show a comparison of the evaluation indexes of the fault diagnosis model and other models. It is clear that the fault diagnosis model obtained good classification performance on the different data sets. It is particularly worth mentioning that the fault diagnosis model was less affected by the proportion of the data set. A good classification effect could still be obtained even with fewer samples of the positive class (minority class).

Figure 11: The first five principal component contributions.

Figure 12: Distribution of three condition modes after VMD and PCA.

Table 1: Three kinds of condition modes for gearbox.

Table 2: Evaluation indexes comparisons of fault diagnosis model and other models.