To diagnose rotating machinery faults from imbalanced data, a method based on a fast clustering algorithm (FCA) and a support vector machine (SVM) was proposed. Variational mode decomposition (VMD) and principal component analysis (PCA) were combined to extract sensitive fault features, which constituted the imbalanced fault sample set. Next, the fast clustering algorithm was adopted to reduce the number of majority-class samples in the imbalanced fault sample set. The balanced fault sample set therefore consisted of the clustered data and the minority-class data from the imbalanced fault sample set. The SVM was then trained with the balanced fault sample set and tested with the imbalanced fault sample set to obtain the fault diagnosis model of the rotating machinery. Finally, a gearbox fault data set and a rolling bearing fault data set were used to test the fault diagnosis model. The experimental results showed that the model could effectively diagnose rotating machinery faults from imbalanced data.
With the development of modern large-scale production and the progress of science and technology, the structure of mechanical equipment has become more complex. Sudden equipment failure during operation can lead to a loss of service capability and may even cause a serious accident [
At present, SVM-based fault diagnosis is one of the most widely used fault diagnosis methods for mechanical equipment [
The purpose of a clustering algorithm is to group data according to their similarity. Therefore, we proposed an approach based on a fast clustering algorithm to reduce the number of majority-class samples in the imbalanced data. This fast clustering algorithm was proposed by Rodriguez and Laio in 2014, based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities [
To diagnose rotating machinery faults from imbalanced data, a method based on the fast clustering algorithm and SVM was proposed. In the proposed method, the original features of the different faults are constructed by VMD. Next, PCA is applied to reduce the dimension of the original features so that sensitive features can be obtained. After that, the fast clustering algorithm is adopted to reduce the number of majority-class samples in the imbalanced sensitive features. Finally, the SVM is trained with the data clustered by the fast clustering algorithm, so that the fault diagnosis model for imbalanced data can be obtained.
As a machine learning algorithm developed from statistical learning theory, SVM maps samples that are inseparable in a low-dimensional space into a high-dimensional space through a kernel function to obtain an optimal hyperplane [
The constraint conditions can be defined as
Then, the Lagrange function is constructed as
Then, the classification function can be formed as
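For reference, the textbook soft-margin SVM follows the same three steps (objective with constraints, Lagrange function, classification function). The block below is a sketch of that standard formulation only; the symbols $w$, $b$, $\xi_i$, $C$, $\alpha_i$, $\mu_i$, $\varphi$, and $K$ are assumptions of this sketch and may not match the paper's own notation.

```latex
% Primal problem and its constraints
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad
y_i\bigl(w^{\mathsf T}\varphi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n

% Lagrange function with multipliers \alpha_i \ge 0 and \mu_i \ge 0
L(w, b, \xi, \alpha, \mu) = \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
 - \sum_{i=1}^{n}\alpha_i\Bigl[y_i\bigl(w^{\mathsf T}\varphi(x_i) + b\bigr) - 1 + \xi_i\Bigr]
 - \sum_{i=1}^{n}\mu_i\,\xi_i

% Classification function with kernel K(x_i, x) = \varphi(x_i)^{\mathsf T}\varphi(x)
f(x) = \operatorname{sign}\Bigl(\sum_{i=1}^{n}\alpha_i\,y_i\,K(x_i, x) + b\Bigr)
```

Because every sample contributes the same penalty $C\,\xi_i$, a class with many more samples dominates the objective, which is one way to see why the hyperplane drifts toward the minority class on imbalanced data.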
The SVM classification algorithm assumes that the classes contain approximately equal numbers of samples. For rotating machinery, however, fault samples are acquired largely at random, so it is difficult to guarantee balance among the different fault samples. Figure
Hyperplane of SVM. (a) Balanced data set and (b) imbalanced data set.
From Figure
In this paper, a fast clustering algorithm was used as the theoretical basis for balancing the original data set. Its basic idea is novel and simple, and it is well suited to searching for the core samples in an imbalanced data set. The algorithm assumes that cluster centers are surrounded by neighbors with lower local densities and that they are at a relatively large distance from any point with a higher local density [
With Gaussian kernel, the local density
From (
Distance
From (
For each data point, we can calculate its local density
Obviously, points with larger weights are clustering centers. The sequence
Then, the points that are not cluster centers can be categorized as
For each cluster, the mean local density of this cluster is calculated. By comparing the mean local density, the points of this cluster can be divided into core points or halo points.
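The clustering steps described above can be sketched in NumPy as follows. This is a minimal illustration of the density-peaks idea, not the authors' implementation: the function name `density_peaks`, the cutoff distance `d_c`, and the choice to fix the number of clusters in advance are assumptions, and the densest point is given the global maximum distance so that it is always selected as a center.

```python
import numpy as np


def density_peaks(X, d_c, n_clusters):
    """Density-peaks clustering sketch (idea of Rodriguez and Laio, 2014).

    X: (n, d) samples; d_c: assumed cutoff distance; n_clusters: assumed
    number of centers to pick. Returns (labels, centers, core_mask).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    # Local density with a Gaussian kernel; subtract 1 to drop the self term.
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0

    # delta: distance to the nearest point of higher density.
    order = np.argsort(-rho)  # indices sorted by decreasing density
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    # Simplifying assumption: the densest point gets the global maximum
    # distance, which guarantees it the largest rho * delta.
    delta[order[0]] = dist.max()
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]
        j = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, j]
        nearest_higher[i] = j

    # Points with the largest weight rho * delta are taken as centers.
    gamma = rho * delta
    centers = np.argsort(-gamma)[:n_clusters]

    # Remaining points inherit the label of their nearest higher-density
    # neighbor, processed in order of decreasing density.
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]

    # Core points: local density at or above the mean density of the cluster.
    core = np.zeros(n, dtype=bool)
    for k in range(n_clusters):
        members = labels == k
        core[members] = rho[members] >= rho[members].mean()
    return labels, centers, core
```

Calling `labels, centers, core = density_peaks(X, d_c=0.5, n_clusters=2)` on two well-separated groups of points should return one center per group, with the remaining points in each cluster flagged as halo points.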
The synthetic point distributions data set [
Synthetic point distributions data set: (a) before clustering and (b) after clustering.
With the fast clustering algorithm, the imbalanced data set was preprocessed and the number of the majority classes reduced. Therefore, the raw data set was reassembled into a balanced data set. Then, the SVM classification algorithm was adopted to learn the balanced data set. The movement of the SVM hyperplane during the process of clustering is shown in Figure
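One simple way to picture the rebalancing step is to keep only the densest majority-class samples and join them with all minority-class samples. The sketch below is a deliberately reduced stand-in for the clustering-based reduction, not the paper's exact procedure: the function name `balance_by_density`, the cutoff distance `d_c`, and the rule "keep as many majority samples as there are minority samples" are all assumptions.

```python
import numpy as np


def balance_by_density(X_major, X_minor, d_c):
    """Undersample the majority class by keeping its densest samples.

    Assumption: the retained majority samples are the ones with the highest
    Gaussian-kernel local density, and their count equals the minority count.
    Returns (X_balanced, y_balanced) with label 0 = majority, 1 = minority.
    """
    X_major = np.asarray(X_major, dtype=float)
    X_minor = np.asarray(X_minor, dtype=float)
    n_keep = min(len(X_minor), len(X_major))

    # Local density of every majority sample (self term removed).
    dist = np.linalg.norm(X_major[:, None, :] - X_major[None, :, :], axis=2)
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0

    # Indices of the n_keep densest majority samples.
    keep = np.argsort(-rho)[:n_keep]

    X_balanced = np.vstack([X_major[keep], X_minor])
    y_balanced = np.concatenate([
        np.zeros(n_keep, dtype=int),
        np.ones(len(X_minor), dtype=int),
    ])
    return X_balanced, y_balanced
```

The balanced arrays can then be passed to any SVM trainer. Keeping only high-density samples discards outlying majority points, which is the intended effect here but also means sparse regions of the majority class are no longer represented.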
Schematic diagram of SVM hyperplane movement.
As shown in Figure
For imbalanced data, the minority samples make up only a small proportion of the data set, so their classification results have little effect on the overall classification accuracy. Therefore, dedicated classification evaluation indexes are used for imbalanced data [
Confusion matrix of the imbalanced data.
The recall of the positive class can be defined as
As the evaluation index,
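To make the role of such indexes concrete, the sketch below computes several common ones from the four cells of a binary confusion matrix. It assumes the minority class is the positive class, and it uses the F-measure and G-mean as examples of indexes typically reported for imbalanced data; whether these are exactly the indexes tabulated later is an assumption. The function names are hypothetical.

```python
import math


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def imbalance_metrics(tp, fn, fp, tn):
    """Evaluation indexes from a binary confusion matrix.

    tp/fn: positive (assumed minority) samples classified right/wrong.
    tn/fp: negative (assumed majority) samples classified right/wrong.
    """
    recall_pos = safe_div(tp, tp + fn)   # recall of the positive class
    recall_neg = safe_div(tn, tn + fp)   # recall of the negative class
    precision = safe_div(tp, tp + fp)
    f_measure = safe_div(2 * precision * recall_pos, precision + recall_pos)
    g_mean = math.sqrt(recall_pos * recall_neg)
    accuracy = safe_div(tp + tn, tp + fn + fp + tn)
    return {
        "recall_pos": recall_pos,
        "recall_neg": recall_neg,
        "precision": precision,
        "f_measure": f_measure,
        "g_mean": g_mean,
        "accuracy": accuracy,
    }


# Hypothetical counts: 10 minority and 150 majority test samples.
m = imbalance_metrics(tp=2, fn=8, fp=5, tn=145)
```

With these hypothetical counts the overall accuracy stays above 0.9 even though only 2 of the 10 minority samples are recognized, while the F-measure and G-mean both drop well below 0.5, which is why they are more informative than accuracy for imbalanced data.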
For rotating machinery, the vibration signal is composed of multiple components. VMD, a novel adaptive signal decomposition method, was adopted to construct the original features in this study. The target of the VMD was to decompose the original signal
To solve this constrained variational problem, the augmented Lagrange function is introduced as
The process of decomposing was as follows. First,
The center frequencies
The condition for convergence is the following:
Finally, the original signal
To test the validity of VMD, a pure harmonic signal affected by noise was adopted. Furthermore, we also conducted a comparison with empirical mode decomposition (EMD) based on the exact same testing signal. Here, the pure harmonic signal was the following:
The noisy input signal was the pure harmonic signal affected by noise with the expression as follows:
The signal waveforms of the pure harmonic signal and the noisy input signal are shown in Figure
Signal waveforms. (a) Pure harmonic signal and (b) the noisy input signal.
Figure
Decomposition of the noisy input signal: (a) IMFs extracted by VMD and (b) IMFs extracted by EMD.
With VMD, the vibration signal of the rotating machinery was decomposed into a number of IMFs. Then, the energy of each IMF was calculated to form the original feature vector. Since these original feature vectors were high-dimensional, a dimensionality reduction algorithm was applied to reduce the computational complexity.
A traditional dimensionality reduction algorithm, PCA, was adopted to reduce the dimension of the original feature vectors. PCA is a statistical method that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables.
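The energy-feature step can be sketched as follows, taking the IMFs as given (the decomposition itself is not implemented here). The function name `imf_energy_features` is hypothetical, and normalizing the energies so that they sum to one is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np


def imf_energy_features(imfs, normalize=True):
    """Energy of each IMF as a feature vector.

    imfs: array of shape (n_imfs, n_samples), one IMF per row.
    normalize: if True (an assumption), divide by the total energy.
    """
    imfs = np.asarray(imfs, dtype=float)
    # Energy of an IMF: sum of its squared samples.
    energy = np.sum(imfs ** 2, axis=1)
    if normalize:
        total = energy.sum()
        if total > 0:
            energy = energy / total
    return energy


# Two synthetic "IMFs": unit-amplitude 5 Hz and double-amplitude 20 Hz sines.
t = np.arange(1000) / 1000.0
imfs = np.vstack([np.sin(2 * np.pi * 5 * t), 2.0 * np.sin(2 * np.pi * 20 * t)])
features = imf_energy_features(imfs)
```

Doubling the amplitude quadruples the energy, so the second component carries four fifths of the normalized energy in this toy example.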
Then, the contributions of the first
Finally, the principal components with high contributions can be chosen as the sensitive features.
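The PCA steps above (orthogonal transformation, contribution of each component, selection of the high-contribution components) can be sketched with an eigendecomposition of the covariance matrix. The function name `pca_contributions` and the 0.95 cumulative-contribution threshold are assumptions chosen for illustration.

```python
import numpy as np


def pca_contributions(F, threshold=0.95):
    """PCA via the covariance matrix.

    F: (n_samples, n_features) original feature vectors.
    threshold: assumed cumulative contribution used to choose k.
    Returns (scores, contributions, k).
    """
    F = np.asarray(F, dtype=float)
    centered = F - F.mean(axis=0)
    cov = np.cov(centered, rowvar=False)

    # eigh returns eigenvalues in ascending order; reverse to descending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    idx = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[idx], 0.0, None)  # guard tiny negative values
    eigvecs = eigvecs[:, idx]

    # Contribution of each principal component and the cumulative sum.
    contributions = eigvals / eigvals.sum()
    cumulative = np.cumsum(contributions)

    # Smallest k whose cumulative contribution reaches the threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    k = min(k, len(contributions))

    # Project the centered data onto the first k principal directions.
    scores = centered @ eigvecs[:, :k]
    return scores, contributions, k
```

The returned `scores` play the role of the sensitive features, and `contributions` is the quantity one would plot to compare the leading principal components.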
The rotating machinery fault sample set (an imbalanced data set) is made up of several kinds of faults. Some faults are majority class while others are minority class. Each fault sample contains a number of sensitive features. The distance between the
According to (
Flowchart of building the fault diagnosis model.
From Figure
To verify the viability and effectiveness of the proposed algorithm, the gearbox fault data set and the rolling bearing fault data set were adopted to test the proposed fault diagnosis model.
A wind turbine transmission chain fault simulation test bed is shown in Figure
Three kinds of condition modes for gearbox.
Condition mode  Planetary gear  Wind wheel speed (r/min)

Condition mode 1  Normal  197/237/277 
Condition mode 2  Half fracture  197/237/277 
Condition mode 3  Full fracture  197/237/277 
Wind turbine transmission chain fault simulation test bed.
Pictures of planetary gears: (a) half fracture and (b) full fracture.
As is shown in Figure
VMD decomposition of vibration signal (wind wheel speed: 237 r/min): (a) normal; (b) half fracture; and (c) full fracture.
With PCA, the original features are mapped to a new feature space and replaced with the sensitive features. The sensitive features are sorted by their contribution degree. Figure
The first five principal component contributions.
Figure
Distribution of three condition modes after VMD and PCA.
The imbalanced data sets under different proportions were constructed, while the distributions of imbalanced data sets are shown in Figure
Distributions of the imbalanced data set under different proportions: (a) 10 : 150; (b) 15 : 150; (c) 40 : 150; (d) 50 : 150; (e) 80 : 150; and (f) 100 : 150.
The proposed fault diagnosis model was adopted to classify the imbalanced data sets under different proportions. To test the validity of the proposed fault diagnosis model, the random undersampling (RU) algorithm, the synthetic minority oversampling technique (SMOTE) algorithm, the backpropagation (BP) neural network, and the radial basis function (RBF) neural network were introduced simultaneously. Table
Evaluation indexes comparisons of fault diagnosis model and other models.
Proportion of the data set  10 : 150  15 : 150  40 : 150  50 : 150  80 : 150  100 : 150 


0.94  0.95  0.96  0.94  0.92  0.88 

0.54  0.65  0.86  0.85  0.88  0.86 

0.61  0.69  0.68  0.87  0.89  0.87 

0.16  0.28  0.48  0.77  0.86  0.84 

0.92  0.93  0.92  0.90  0.91  0.89 

0.47  0.59  0.78  0.79  0.87  0.86 

0.93  0.91  0.90  0.91  0.91  0.90 

0.50  0.54  0.75  0.81  0.86  0.88 

0.85  0.91  0.92  0.92  0.92  0.91 

0.36  0.55  0.78  0.83  0.88  0.89 

0.60  0.68  0.85  0.87  0.92  0.89 

0.16  0.28  0.69  0.77  0.88  0.87 
Evaluation indexes comparisons: (a)
The testing set consisted of 300 samples, of which 150 were from the half fracture data and the other 150 were from the full fracture data. The classification accuracy comparisons of the fault diagnosis model and other models are shown in Table
Classification accuracy comparisons of fault diagnosis model and other models.
Proportion of the data set  10 : 150  15 : 150  40 : 150  50 : 150  80 : 150  100 : 150 

Fault diagnosis model  85.33%  81.67%  85.00%  87.67%  89.33%  88.67% 
FCA + BP  74.00%  76.67%  80.67%  83.33%  81.33%  85.33% 
FCA + RBF  73.67%  73.00%  74.67%  75.67%  80.33%  88.33% 
RU + SVM  79.67%  78.67%  80.00%  82.00%  84.00%  86.00% 
RU + BP  71.00%  70.33%  82.33%  86.67%  87.33%  87.33% 
RU + RBF  76.33%  75.67%  79.33%  78.67%  85.33%  85.00% 
SMOTE + SVM  75.67%  77.67%  78.67%  82.00%  83.33%  85.67% 
SMOTE + BP  72.00%  79.67%  82.33%  84.33%  85.33%  87.33% 
SMOTE + RBF  76.00%  78.67%  78.67%  80.00%  83.67%  82.33% 
SVM  68.00%  71.67%  76.33%  79.00%  83.00%  84.67% 
BP  66.00%  70.67%  81.00%  83.00%  82.00%  84.33% 
RBF  53.67%  65.33%  78.00%  81.67%  82.67%  84.67% 
Classification accuracy comparisons.
In the gearbox case, the gearbox fault data set was used to test the performance of the fault diagnosis model in distinguishing two failure modes. In the rolling bearing case, the fault diagnosis model was applied to distinguish multiple failure modes in the imbalanced data set.
The rolling bearing fault simulation test bed is shown in Figure
Rolling bearing fault simulation test bed.
Table
Composition of the rolling bearing data set.
Mode  Processing method  Fault size (width × depth) (mm)  Sample size 

Normal  -  -  100
Rolling element failure  Line cutting  0.3 × 1  100 
Inner race failure  Line cutting  0.3 × 0.5  5 
Outer race failure  Line cutting  0.3 × 0.5  5 
With the proposed approach, the fault diagnosis model was obtained from the imbalanced rolling bearing data set. To test the classification accuracy of the model, testing samples consisting of 400 samples (100 samples from each mode) were constructed.
Figure
Confusion matrix comparisons: (a) fault diagnosis model; (b) FCA + BP; (c) SMOTE + SVM; (d) SVM.
In this paper, a data-based approach was proposed. The experimental results showed that the proposed approach achieved better classification accuracy than the other models. The following conclusions were obtained:
The signals of the accelerometers could be acquired to diagnose the rotating machinery fault.
By combining VMD and PCA, sensitive features of the rotating machinery fault could be extracted.
To diagnose rotating machinery faults from imbalanced data, a data-based approach was proposed in this paper. The fast clustering algorithm was adopted to reduce the number of majority-class samples in the imbalanced sensitive features. Then, the SVM was trained and tested with the data clustered by the fast clustering algorithm, so that the fault diagnosis model for the imbalanced data was obtained. The fault diagnosis model showed very good classification capability on both the gearbox fault data set and the rolling bearing fault data set. Therefore, the approach is suitable for rotating machinery fault diagnosis with imbalanced data.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This research was supported by the National Natural Science Foundation of China (11572167).