Rolling Bearing Fault Diagnosis Using a Deep Convolutional Autoencoding Network and Improved Gustafson–Kessel Clustering

Deep learning (DL) has been successfully used in fault diagnosis. Training deep neural networks, such as convolutional neural networks (CNNs), requires plenty of labeled samples. However, in mechanical fault diagnosis, labeled data are costly and time-consuming to collect. A novel method based on a deep convolutional autoencoding network (DCAEN) and an adaptive nonparametric weighted-feature extraction Gustafson–Kessel (ANW-GK) clustering algorithm was developed for the fault diagnosis of bearings. First, the DCAEN, which is pretrained layer by layer with unlabeled samples and fine-tuned with a few labeled samples, is applied to learn representative features from the vibration signals. Then, the learned representative features are reduced by t-distributed stochastic neighbor embedding (t-SNE), and the low-dimensional main features are obtained. Finally, the low-dimensional features are input into ANW-GK clustering for fault identification. Two datasets were used to validate the effectiveness of the proposed method. The experimental results show that the proposed method can effectively diagnose different fault types with only a few labeled samples.


Introduction
Rolling bearings are crucial components widely used in modern machines [1]. They often operate in harsh working environments where they are frequently damaged. A sudden failure of the rolling bearings may result in unexpected downtime, significant economic losses, and casualties [2]. Therefore, it is meaningful to develop intelligent fault diagnosis methods for rolling bearings. In general, intelligent fault diagnosis of bearings comprises two parts: feature extraction and condition identification [3,4]. Traditional feature extraction methods based on EMD, entropy, and multifractal analysis have been successfully used in mechanical fault diagnosis. Chen et al. [5] used EMD to decompose the bearing vibration signals and calculated the permutation entropy (PE) of the first few IMFs as the characteristic vector, and an SVM was applied for operation status identification. Li et al. [6] applied a multifractal method to extract the generalized dimensional spectral features from the vibration signals of hydropower units, and a probabilistic neural network was used for fault diagnosis. However, these methods depend largely on prior knowledge of signal processing techniques and on expert diagnosis experience.
As a rising star in the field of intelligent fault diagnosis, deep learning has received much attention in recent years [7][8][9]. Deep learning [10][11][12][13] can learn representative features hidden in the original data and directly establish an accurate mapping relationship between a model and the operating state of devices. The stacked autoencoder (SAE), or stacked denoising autoencoder (SDAE), and the CNN are two typical deep learning models. The SDAE [14] is constructed by stacking multilayer denoising autoencoders. The representative features are extracted from unlabeled samples in a layerwise learning method. The original signal is fed into the first autoencoder to generate a latent representation, and this "code" is used to reconstruct the input signal. The output features of the first autoencoder are input into the second one to further extract hierarchical representative features. The high-level representations are generated in an unsupervised way, which avoids the dependence on knowledge and experience. Lu et al. [15] established an SDAE with a transmitting rule of greedy training to learn high-order feature representations from the input data, and a softmax regression algorithm was employed for multiclass classification. Chen and Li [16] proposed an SDAE to extract features from the frequency domain information of the vibration signal for fault detection.
These unsupervised feature extraction methods based on an SDAE have achieved remarkable results in fault diagnosis. However, the hidden layers in an SDAE are fully connected, which makes it difficult to train a large number of parameters with limited samples.
In contrast, a CNN has the characteristics of local connection, weight sharing, and a pooling structure, which reduces the parameters of the model and improves the training efficiency. Therefore, it is widely used in fault diagnosis. Janssens et al. [17] utilized a CNN model for the intelligent fault diagnosis of bearings and compared the advantages of a CNN with manually engineered features. Zhang et al. [18] proposed a deep WDCNN model to learn the deep representative features from the original bearing signals and achieved a high bearing fault recognition rate under variable loads and strong background noise. Jing et al. [19] constructed a CNN to learn deep features from the spectrum of vibration signals collected from a gearbox and realized the fault diagnosis of the gearbox. Jiang et al. [20] designed a multiscale CNN architecture for extracting multiscale features from the vibration signals and realized the fault diagnosis of wind turbine gearboxes. These CNN-based models all exhibited excellent performance for learning and fault recognition. However, plenty of labeled samples are required for training such models, and labeling samples is an expensive and time-consuming activity.
Cluster analysis is an effective method to deal with the classification of unlabeled data [21]. GK clustering [22] is a clustering algorithm based on the objective function, which can identify the extracted features. Benyounes et al. [23] applied GK clustering to identify the control parameters in a gas turbine and developed a reliable nonlinear mathematical model. Hua et al. [24] introduced a method for the driver intention classification by GK clustering. Li et al. [25] used GK clustering as one of four main clustering algorithms and introduced how to select the most suitable identification method for bearing fault diagnosis. Chen et al. [26] extracted multiscale permutation entropy and adopted GK clustering to detect rolling bearing faults. However, the inadequacy of the GK clustering method is that the different contributions of the data for clusters are not considered, and the number of cluster centers must be given in advance.
To overcome the abovementioned problems, we propose a novel method using the DCAEN and ANW-GK clustering for the intelligent fault diagnosis of bearings. Our contributions can be summarized as follows:
(1) A DCAEN that was pretrained with unlabeled samples and fine-tuned with a few labeled samples was constructed to learn the representative features from the original vibration signals.
(2) An improved ANW-GK clustering was proposed. The contribution of each sample to the clusters was redefined using nonparametric weighted-feature extraction (NWFE) [27]. The initial number of clusters of the ANW-GK was adaptively determined using the PBMF function [28].
(3) A novel bearing intelligent fault diagnosis method using the DCAEN and ANW-GK clustering was developed.
The rest of the paper is organized as follows. Section 2 briefly introduces the CAE, GK clustering, and NWFE. Section 3 describes the constructed DCAEN, improved ANW-GK clustering, and general procedure of the proposed method. In Section 4, the effectiveness of the proposed method is validated on two bearing datasets: one is an open benchmark dataset and the other is a laboratory-measured dataset. Section 5 discusses the proposed method. Finally, our findings and outcomes are summarized in Section 6.

Theoretical Background
Some basic theories of the CAE, GK clustering, and NWFE are briefly introduced in Sections 2.1, 2.2, and 2.3. In Section 2.4, three kinds of clustering evaluation indexes are introduced.

Convolutional Autoencoder (CAE).
The CAE [29,30] is an unsupervised learning model in which the convolutional structure is embedded into the basic encoder. When the convolutional structure replaces the fully connected structure of the basic encoder, the encoder gains the characteristics of local receptive fields and weight sharing. As depicted in Figure 1, the CAE consists of an encoder and a decoder network. The encoder comprises a convolutional layer and a pooling layer, which transform the input data from a high-dimensional space into a set of 1-D feature maps. The decoder comprises an unpooling layer and a deconvolutional layer, which reconstruct the input data from the 1-D feature maps. The parameters of the encoder and decoder are optimized by minimizing the difference between the reconstructed data and the input data. Therefore, the CAE is a data-driven unsupervised feature extraction model.

Encoder.
Given a high-dimensional input $X$, the encoder produces a set of 1-D feature maps $Y = \{y_1, y_2, \ldots, y_H\}$, where $H$ is the number of convolution kernels. The encoding process can be expressed as follows:

$$y_k = \psi\left(\sigma\left(X * K^{(k)} + b_k\right)\right), \quad k = 1, 2, \ldots, H, \tag{1}$$

where $*$ is the convolution operation; $K^{(k)}$ is the $k$th convolution kernel; $b_k$ is the bias of the $k$th convolution kernel; $\sigma(\cdot)$ is the exponential linear unit (ELU) function; and $\psi(\cdot)$ is the max pooling with step $s$.

Decoder.
The decoder is used to reconstruct the input data $X$ from the 1-D feature maps $y_k$. The reconstructed data $\hat{X}$ is expressed as follows:

$$\hat{X} = \sigma\left(\sum_{k=1}^{H}\left(\tilde{\psi}(y_k) * \tilde{K}^{(k)} + \tilde{b}_k\right)\right), \tag{2}$$

where $\tilde{\psi}(\cdot)$ is the unpooling, which expands the feature map according to the pooling step $s$ of the encoder, that is, it inserts $s - 1$ zeros between the elements; $\tilde{K}^{(k)}$ is the $k$th deconvolution kernel; and $\tilde{b}_k$ is the bias of the $k$th deconvolution kernel.

Training.
The parameter set $\{K, b, \tilde{K}, \tilde{b}\}$ of the CAE is optimized by minimizing the reconstruction error. The error is defined as follows:

$$E = \frac{1}{2}\left\| X - \hat{X} \right\|_2^2. \tag{3}$$
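As an illustration of the encoder and decoder operations described in this section, the following is a minimal NumPy sketch for a single 1-D signal. It is not the authors' implementation: the function names, the "valid" convolution, the non-overlapping pooling window, and the use of a full convolution as the deconvolution are all assumptions made for the example.

```python
import numpy as np

def elu(x, alpha=1.0):
    # Exponential linear unit: x for x > 0, alpha * (exp(x) - 1) otherwise.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def conv1d_valid(x, kernel, bias):
    # "Valid" 1-D convolution written as a sliding dot product
    # (the usual deep learning convention, assumed here).
    n = len(x) - len(kernel) + 1
    return np.array([np.dot(x[i:i + len(kernel)], kernel) for i in range(n)]) + bias

def max_pool(x, s):
    # Non-overlapping max pooling with window and step s (remainder dropped).
    n = len(x) // s
    return x[:n * s].reshape(n, s).max(axis=1)

def unpool(y, s):
    # Zero-insertion unpooling: each element is followed by s - 1 zeros,
    # so the output length is len(y) * s.
    out = np.zeros(len(y) * s)
    out[::s] = y
    return out

def encode(x, kernels, biases, s=2):
    # One feature map per convolution kernel: pool(ELU(conv(x) + b)).
    return [max_pool(elu(conv1d_valid(x, k, b)), s) for k, b in zip(kernels, biases)]

def decode(maps, kernels, biases, s=2):
    # Sum of deconvolved (full convolution) unpooled maps, then ELU.
    total = sum(np.convolve(unpool(y, s), k, mode="full") + b
                for y, k, b in zip(maps, kernels, biases))
    return elu(total)

def reconstruction_error(x, x_hat):
    # Half of the squared L2 distance between input and reconstruction.
    return 0.5 * float(np.sum((x - x_hat) ** 2))
```

With a length-16 input, a kernel of size 3, and `s = 2`, the encoder yields maps of length 7, and the decoder restores length 16. In practice the kernels and biases would be learned by gradient descent on the reconstruction error rather than fixed by hand.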

Gustafson-Kessel Clustering (GK Clustering).
The GK clustering algorithm [22] obtains the fuzzy membership matrix $U = [\mu_{ik}]_{c \times n}$ and the cluster centers $V = \{v_1, v_2, \ldots, v_c\}$ by minimizing an objective function. Here, $c$ is the number of clusters; $n$ is the number of samples; and $\mu_{ik}$ is the fuzzy membership degree of sample $k$ relative to cluster center $i$.
Given a clustering sample set $Z = \{Z_1, Z_2, \ldots, Z_n\}$, the objective function is defined as

$$J(Z; U, V) = \sum_{i=1}^{c}\sum_{k=1}^{n} (\mu_{ik})^m D_{ik}^2, \tag{4}$$

$$D_{ik}^2 = (Z_k - v_i)^{T} A_i (Z_k - v_i), \tag{5}$$

where $m$ is the fuzzy index, generally $m = 2$; $D_{ik}^2$ is the distance from any sample $Z_k$ to the cluster center $v_i$, which is a squared inner-product norm; $A_i$ is a positive definite symmetric matrix determined by the cluster covariance matrix $F_i$; and $P_i$ is the prior probability of the $i$th cluster.
The Lagrange multiplier method is used to optimize the objective function (equation (4)), and the necessary conditions for the minimum of equation (4) are

$$\mu_{ik} = \frac{1}{\sum_{j=1}^{c} \left(D_{ik}/D_{jk}\right)^{2/(m-1)}}, \quad 1 \le i \le c,\ 1 \le k \le n, \tag{8}$$

$$v_i = \frac{\sum_{k=1}^{n} (\mu_{ik})^m Z_k}{\sum_{k=1}^{n} (\mu_{ik})^m}, \quad 1 \le i \le c. \tag{9}$$

Then, the iteration algorithm of GK clustering is described as follows:
(1) The number of clusters $c$, the fuzzy index $m$, and the fuzzy membership matrix $U$ are initialized. The iteration counter is $l = 0, 1, \ldots$.
(2) The cluster center $v_i$ is calculated according to equation (9).
(3) The covariance matrix $F_i$ is calculated according to the following equation:

$$F_i = \frac{\sum_{k=1}^{n} (\mu_{ik})^m (Z_k - v_i)(Z_k - v_i)^{T}}{\sum_{k=1}^{n} (\mu_{ik})^m}. \tag{10}$$

(4) The distance norm $D_{ik}^2$ is calculated according to equations (5) and (6).
(5) The fuzzy membership matrix $U$ is updated according to equation (8).
For a given tolerance $\eta > 0$, the iteration terminates if $\|U^{(l+1)} - U^{(l)}\| < \eta$. Otherwise, $l \leftarrow l + 1$ and Steps 2 to 5 are repeated until the condition is satisfied.

Figure 1: CAE structure (encoder: convolution, max pooling; decoder: unpooling, deconvolution).
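The iteration described in this section can be sketched in NumPy as follows. This is the textbook Gustafson–Kessel scheme, not necessarily the exact variant used in the paper: the norm-inducing matrix is taken as $A_i = \det(F_i)^{1/d} F_i^{-1}$ (cluster volume fixed to 1), and the function name `gk_cluster`, the random initialization, and the small covariance regularizer are assumptions introduced for the example.

```python
import numpy as np

def gk_cluster(Z, c, m=2.0, eta=1e-4, max_iter=100, seed=0):
    """Textbook GK clustering sketch. Z has shape (n, d).
    Returns the membership matrix U (c, n) and the centers V (c, d)."""
    n, d = Z.shape
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                       # each column sums to 1
    for _ in range(max_iter):
        W = U ** m
        V = (W @ Z) / W.sum(axis=1, keepdims=True)          # cluster centers
        D2 = np.empty((c, n))
        for i in range(c):
            diff = Z - V[i]
            # Fuzzy covariance matrix of cluster i.
            F = (W[i][:, None] * diff).T @ diff / W[i].sum()
            F += 1e-8 * np.eye(d)            # regularizer (assumption, avoids singular F)
            # Norm-inducing matrix with unit cluster volume (assumption).
            A = np.linalg.det(F) ** (1.0 / d) * np.linalg.inv(F)
            D2[i] = np.einsum("nj,jk,nk->n", diff, A, diff)  # squared distances
        D2 = np.maximum(D2, 1e-12)
        inv = D2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0)        # membership update
        converged = np.linalg.norm(U_new - U) < eta
        U = U_new
        if converged:
            break
    W = U ** m
    V = (W @ Z) / W.sum(axis=1, keepdims=True)               # centers for the final U
    return U, V
```

A hard label for each sample can be obtained with `U.argmax(axis=0)`. Because the initialization is random, different seeds may converge to different local minima.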

Nonparametric Weighted-Feature Extraction (NWFE).
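The NWFE [27] is used for computing different weighted clustering centers for each sample. The distributed weighting matrix is defined by the Euclidean distance between the samples and the clustering centers. The nonparametric within-class scatter matrix $S_w$ is defined as follows:

$$S_w = \sum_{i=1}^{c} P_i \sum_{k=1}^{n_i} \frac{w_k^{(i,i)}}{n_i}\left(Z_k^{(i)} - M_i(Z_k^{(i)})\right)\left(Z_k^{(i)} - M_i(Z_k^{(i)})\right)^{T}. \tag{11}$$

The nonparametric between-class scatter matrix $S_b$ is defined as follows:

$$S_b = \sum_{i=1}^{c} P_i \sum_{\substack{j=1 \\ j \ne i}}^{c} \sum_{k=1}^{n_i} \frac{w_k^{(i,j)}}{n_i}\left(Z_k^{(i)} - M_j(Z_k^{(i)})\right)\left(Z_k^{(i)} - M_j(Z_k^{(i)})\right)^{T}, \tag{12}$$

where $Z_k^{(i)}$ is the $k$th sample in class $i$; $P_i$ is the prior probability of class $i$; $n_i$ is the number of samples in class $i$; and $w_k^{(i,j)}$ is the distributed weight of the $k$th sample in class $i$ with respect to class $j$, which is defined as follows:

$$w_k^{(i,j)} = \frac{\operatorname{dist}\left(Z_k^{(i)}, M_j(Z_k^{(i)})\right)^{-1}}{\sum_{t=1}^{n_i} \operatorname{dist}\left(Z_t^{(i)}, M_j(Z_t^{(i)})\right)^{-1}}, \tag{13}$$

where $M_j(Z_k^{(i)})$ is the weighted mean of the sample $Z_k^{(i)}$ with respect to class $j$. The weighted mean is defined as follows:

$$M_j(Z_k^{(i)}) = \sum_{t=1}^{n_j} \lambda_{kt}^{(i,j)} Z_t^{(j)}, \tag{14}$$

where $\lambda_{kt}^{(i,j)}$ is the weight of the local mean. The weight of the local mean is defined as follows:

$$\lambda_{kt}^{(i,j)} = \frac{\operatorname{dist}\left(Z_k^{(i)}, Z_t^{(j)}\right)^{-1}}{\sum_{l=1}^{n_j} \operatorname{dist}\left(Z_k^{(i)}, Z_l^{(j)}\right)^{-1}}, \tag{15}$$

where $\operatorname{dist}(a_1, a_2)$ is the Euclidean distance between vector $a_1$ and vector $a_2$.
In equations (14) and (15), $\lambda_{kt}^{(i,j)}$ is inversely proportional to the Euclidean distance between $Z_k^{(i)}$ and $Z_t^{(j)}$. Therefore, the greater the distance between $Z_t^{(j)}$ and $Z_k^{(i)}$, the smaller the contribution of sample $Z_t^{(j)}$ to the clustering.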
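The inverse-distance weighting of the local mean can be illustrated with a short NumPy sketch. The function names and the small constant `eps` (which guards against division by zero when two samples coincide) are assumptions for this example, not part of the NWFE definition.

```python
import numpy as np

def local_mean_weights(z, Zj, eps=1e-12):
    # Weight of each sample of class j as seen from sample z:
    # inverse Euclidean distance, normalized so that the weights sum to 1.
    inv = 1.0 / (np.linalg.norm(Zj - z, axis=1) + eps)
    return inv / inv.sum()

def weighted_mean(z, Zj):
    # Local (weighted) mean of class j with respect to sample z.
    return local_mean_weights(z, Zj) @ Zj
```

For example, for `z = [0, 0]` and class samples `[[1, 0], [3, 0]]`, the inverse distances are 1 and 1/3, so the weights are 0.75 and 0.25 and the local mean is `[1.5, 0]`: the nearer sample dominates.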

Evaluation Indexes of the Clustering Effect.
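In this section, three kinds of clustering evaluation indexes are introduced: the partition coefficient (PC), classification entropy (CE), and clustering accuracy (Acc). They are defined as follows:

$$\mathrm{PC} = \frac{1}{n}\sum_{i=1}^{c}\sum_{k=1}^{n}\mu_{ik}^{2}, \qquad \mathrm{CE} = -\frac{1}{n}\sum_{i=1}^{c}\sum_{k=1}^{n}\mu_{ik}\ln\mu_{ik}, \qquad \mathrm{Acc} = \frac{1}{n}\sum_{i=1}^{c}\theta_i, \tag{16}$$

where $n$ is the number of samples in the sample set and $\theta_i$ is the number of samples correctly partitioned into class $i$. The closer PC and Acc are to 1 and the closer CE is to 0, the better the clustering effect.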
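The three indexes are straightforward to compute from a membership matrix. The sketch below assumes `U` has shape (c, n) with columns summing to 1, uses the natural logarithm for CE, and adds a small `eps` inside the logarithm to avoid `log(0)`; the function names are illustrative.

```python
import numpy as np

def partition_coefficient(U):
    # Mean over samples of the sum of squared memberships (1 for a crisp partition).
    return float(np.sum(U ** 2) / U.shape[1])

def classification_entropy(U, eps=1e-12):
    # Mean over samples of the membership entropy (0 for a crisp partition).
    return float(-np.sum(U * np.log(U + eps)) / U.shape[1])

def clustering_accuracy(theta, n):
    # theta[i] is the number of samples correctly partitioned into class i.
    return sum(theta) / n
```

For a completely fuzzy two-cluster partition (all memberships 0.5), PC is 0.5 and CE is ln 2, which are the worst values for c = 2.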

Proposed Fault Diagnosis Method
In this section, a novel intelligent fault diagnosis method for bearings is discussed. The method includes three parts: DCAEN construction, improved ANW-GK clustering, and the general procedure.

DCAEN Construction.
The construction process of the DCAEN is shown in Figure 2.
The DCAEN is constructed by stacking CAEs. The output of the pooling layer of the previous CAE serves as the input of the current CAE. First, unlabeled data are used for pretraining the CAEs layer by layer. Then, a full connection layer and a softmax classifier are added to the coding part of the pretrained DCAEN, and a small number of labeled samples are used for the supervised fine-tuning of the network. Finally, the classification layer is removed from the fine-tuned network, and the trained DCAEN with better deep feature extraction capability is obtained.
In the process of pretraining the network, each layer of the DCAEN becomes a shallow neural network, which can make use of the advantages of convex optimization of the shallow neural network and reduce the risk of the network falling into a local optimum. The pretrained network is fine-tuned with a few labeled data to achieve a better feature learning ability. Essentially, the process of encoding layer by layer extracts abstract features step by step. As the number of layers increases, the features become more abstract and more global.

Improved ANW-GK Clustering.
In the GK clustering algorithm, the membership degrees of the samples are used to calculate the corresponding cluster centers. The different contributions of the samples, however, are not considered. Different samples should be given different feature weights, which can make the samples near the cluster center more typical.
Thereby, the contribution of the typical samples, which should play a leading role in the clustering process, is increased. That is to say, when calculating the membership degree of sample $Z_k$ belonging to class $i$, the samples near $Z_k$ should belong to the same class and be given larger weights, while those farther away from $Z_k$ should be given smaller weights.

Shock and Vibration
Different weights of samples can be assigned in the NWFE, and the importance of local information is emphasized. Therefore, in our method, the NWFE algorithm is integrated into GK clustering, and a new weighted cluster center $v_i$ is defined (equation (17)). The objective function of NW-GK clustering, $J_{NW}$, is defined accordingly. According to the Lagrange multiplier method, the weighted membership matrix is updated (equation (19)) until the iteration converges.
The cluster number $c$ must be given in advance for the traditional GK clustering algorithm, and it mainly depends on expert experience or relevant background knowledge. To enhance adaptivity, the clustering evaluation PBMF function [28] is integrated into the NW-GK algorithm. The optimal $c$ is selected according to the change of the PBMF value with the cluster number $c$. In the PBMF function (equation (21)), $E_1$ is the value of $J_{NW}$ when $c = 1$.
As can be seen from equation (21), the larger the PBMF value, the better the clustering effect, and the closer the corresponding $c$ is to the real number of clusters. The search range of $c$ is bounded by $c_{\max} \le \sqrt{n}$, where $n$ is the number of samples.
Based on the above analysis, an ANW-GK clustering algorithm was developed, and its flowchart is shown in Figure 3.
(1) The clustering parameters $c$, $c_{\max}$, $m$, and $\eta$ are initialized.
(2) The fuzzy membership matrix $U$ is initialized so that it satisfies $\mu_{ik} \in [0, 1]$ and $\sum_{i=1}^{c} \mu_{ik} = 1$.
(3) The weight matrix $\Lambda = [\lambda_{ik}]$ is calculated according to equation (15), and the cluster center $v_i$ is updated according to equation (17).
(4) The fuzzy membership matrix $U$ is updated according to equation (19).
(5) If $\Delta J_{NW} > \eta$, return to Step 3 and iterate until the clustering converges. Otherwise, the next step is performed.
(6) PBMF($c$) is calculated according to equation (21), and $c$ is set to $c + 1$. If $c \le c_{\max}$, return to Step 2; otherwise, the set of PBMF values is complete.
(7) The maximum value of PBMF($c$) is found from the set of PBMF values, and the corresponding $c$ is the optimal cluster number. Its corresponding $U$ and cluster centers $v$ are the best clustering results.
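The cluster-number search in Steps 6 and 7 can be sketched as follows. This example uses the commonly cited form of the PBMF index, ((1/c) · (E_1 / J_c) · D_c)^2, where D_c is the largest distance between two cluster centers; that form, the squared Euclidean distance used in `objective` (a stand-in for the weighted GK norm), and all function names are assumptions for illustration and may differ from the paper's equation (21).

```python
import numpy as np

def objective(Z, U, V, m=2.0):
    # Sum over clusters and samples of mu_ik^m times the squared
    # Euclidean distance (stand-in for the GK norm; assumption).
    d2 = ((Z[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # shape (c, n)
    return float(((U ** m) * d2).sum())

def pbmf(Z, U, V, m=2.0):
    c = V.shape[0]
    # E_1: objective value when all samples form a single cluster.
    e1 = objective(Z, np.ones((1, len(Z))), Z.mean(axis=0, keepdims=True), m)
    jc = objective(Z, U, V, m)
    # D_c: maximum separation between any two cluster centers.
    dc = max(np.linalg.norm(V[i] - V[j]) for i in range(c) for j in range(c))
    return ((1.0 / c) * (e1 / jc) * dc) ** 2

def select_cluster_number(scores):
    # scores maps each candidate c to PBMF(c); the largest value wins.
    return max(scores, key=scores.get)
```

In a full pipeline, `scores[c] = pbmf(Z, U_c, V_c)` would be filled in for every `c` from 2 to `c_max` after the clustering for that `c` has converged, and `select_cluster_number(scores)` would return the chosen cluster number.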

Fault Diagnosis Procedure.
According to the abovementioned discussion, the proposed fault diagnosis method based on the DCAEN and ANW-GK is shown in Figure 4. The general procedure of the fault diagnosis method can be summarized as follows:
(i) Step 1: the vibration signals are collected by a data acquisition system, and the collected signals are divided into training and test samples.

Experiment Verification and Analysis
Two rolling bearing datasets are discussed in this section. They were used to validate the availability and superiority of the proposed fault diagnosis method. The data used in the first case were from the Case Western Reserve University (CWRU) bearing data center [32]. The data were collected by accelerometers from a motor-driven mechanical system at a sampling frequency of 12 kHz. The motor bearings were seeded with faults using electro-discharge machining, as shown in Figure 5. The system was able to bear four kinds of loads: 0-3 hp. Besides the normal (NR) operating status, single-point faults with fault diameters of 0.007 in, 0.014 in, and 0.021 in were separately introduced at the rolling element (BF), inner raceway (IF), and outer raceway (OF). Therefore, there were 10 categories of health conditions in total under each load. In this experiment, data with a load of 1 hp were used to make a sample set; each state included 100 samples, and each sample contained 1024 data points. For each fault category, 80 samples were randomly selected as training samples, and 20 samples were selected as test samples. The details of all the datasets are described in Table 1.

Parameters of the Model.
The DCAEN parameters were adopted from previous studies [18,19]. The specific structural parameters were as follows: Conv(64, 16) ⟶ MaxPool(2) ⟶ Conv(3, 32) ⟶ MaxPool(2) ⟶ Conv(3, 64) ⟶ MaxPool(2) ⟶ Conv(3, 64) ⟶ MaxPool(2) ⟶ Conv(3, 64) ⟶ MaxPool(2) ⟶ FC(200), where Conv(k, n) denotes a convolutional layer with n convolution kernels of size k × 1 and a default step size of 1, and MaxPool(2) denotes a pooling layer of size 2 × 1 with a default step size of 1. When pretraining, the minibatch size was set to 80, the learning rate was set to 0.001, the number of epochs was set to 200, and the optimization algorithm was Adam. When fine-tuning, the learning rate was set to 0.005 to improve the efficiency. The fuzzy weighted exponent m was set to 2, and the iteration termination tolerance η was set to 0.0001.
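To make the parameter saving from weight sharing concrete, the following sketch counts the trainable parameters of the five convolutional layers listed above and compares them with the fully connected SDAE structure 1024-1024-96-192-192-192-200 used in the comparative experiment. It assumes a single-channel input signal and one bias per kernel or unit; the final FC(200) layer of the DCAEN is left out because its input size depends on padding and pooling details that are not fully specified here.

```python
def conv1d_params(kernel_size, in_channels, out_channels):
    # Weights (kernel_size * in_channels per kernel) plus one bias per kernel.
    return kernel_size * in_channels * out_channels + out_channels

def dense_params(n_in, n_out):
    # Fully connected layer: weight matrix plus one bias per output unit.
    return n_in * n_out + n_out

# (kernel size, number of kernels) for each convolutional layer of the DCAEN.
conv_layers = [(64, 16), (3, 32), (3, 64), (3, 64), (3, 64)]
in_channels, conv_total = 1, 0          # single-channel input (assumption)
for k, n in conv_layers:
    conv_total += conv1d_params(k, in_channels, n)
    in_channels = n

# Layer widths of the SDAE used for comparison.
sdae_sizes = [1024, 1024, 96, 192, 192, 192, 200]
sdae_total = sum(dense_params(a, b) for a, b in zip(sdae_sizes, sdae_sizes[1:]))

print(conv_total, sdae_total)
```

Under these assumptions the convolutional layers hold 33,520 parameters, whereas the SDAE holds 1,279,336, most of them in its first 1024 × 1024 layer.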
To determine the appropriate proportion of fine-tuning samples, 10%, 30%, 50%, and 70% of the training samples were used for fine-tuning, and each experiment was repeated 20 times. The statistical results are shown in Table 2. The performance indexes of each experiment are shown in Figure 6.
As can be seen from Table 2 and Figure 6, when the proportion of fine-tuning samples is 10%, the test accuracy of the model ranges from 88.5% to 97.5%, the PC value ranges from 0.75 to 0.875, and the CE value ranges from 0.32 to 0.51. This indicates that the clustering evaluation indexes fluctuate considerably and that the model stability is poor. As the proportion of fine-tuning samples increases, the stability of the model increases and the clustering indexes improve gradually, but the magnitude of the improvement decreases. So, under the premise of ensuring the performance of the model, the proportion of fine-tuning samples used in this paper was selected as 30%. The cluster number was set to $c \in [2, 14]$ ($c_{\max} = \sqrt{200}$), and the change of PBMF with $c$ is shown in Figure 7. Therefore, the optimal number of clusters is $c = 10$, which agrees with the actual number of bearing states. The results of clustering are shown in Figure 8.
In Figure 8, V1 to V10 correspond to the cluster centers of the 10 bearing states, and their coordinate values are shown in Table 3. It can be seen from Figure 8 that the 10 bearing states are clearly separated and gathered in the vicinity of their cluster centers. Each group of samples is tightly packed, the spaces between the classes are large, and no aliasing occurs. The average memberships of each group of samples are shown in Table 4. It can be seen that the average membership of the NR group samples for V3 is 0.982, which is much larger than that for the other 9 cluster centers. Therefore, the NR group samples belong to the V3 class. Similarly, the samples of the remaining nine groups belong to V9, V1, V4, V6, V7, V5, V8, V10, and V2, respectively. Therefore, the proposed method has obvious fault identification effects. It should be noted that the membership degree of the BF3 group belonging to the V4 class is 0.735, and its memberships belonging to V5, V9, and V7 are 0.094, 0.074, and 0.047, respectively, which are significantly higher than its membership degrees for the other cluster centers. Therefore, the BF3 group samples are mainly affected by the IF1, BF1, and IF2 samples when clustering. Similarly, it can be seen from the memberships of the OF3 group samples that this group of samples is greatly affected by the OF1 group. This is consistent with the conclusion of Figure 8.

Generalization Performance under Different Loads.
In practical applications of mechanical equipment, the loads of bearings are often variable. In this section, we discuss the generalization performance of the proposed method under different loads. The model was trained and fine-tuned using the training set of the 1 hp load. The test sets were under the loads of 0 hp, 2 hp, and 3 hp. The experimental results are shown in Figure 9. Under the three different loads, the clustering accuracies are 96%, 97%, and 95.5%; the PC values are 0.811, 0.853, and 0.879, respectively; and the CE values are 0.473, 0.399, and 0.312, respectively. The clustering results still maintained a high precision. To assess the stability of the model, each experiment was repeated 20 times, and the statistical results are shown in Figure 10. Under different loads, the clustering accuracy rate is above 96%, the PC value is above 0.8, and the CE value is within 0.5. These results show that the proposed method has a certain generalization ability when the load changes.

Comparative Experiment.
To illustrate the superiority of the proposed method, the following comparative experiments were conducted.
(1) We compared our proposed method with traditional signal processing and handcrafted feature methods. The original vibration signals were decomposed into several IMF components using the EMD, and permutation entropy (PE, m = 2, r = 0.1 SD) was employed to calculate the entropy value of each IMF component as feature vectors. For visualization, t-SNE was used for the dimension reduction. The 2-dimensional IMF-PE vectors were input into ANW-GK clustering for fault identification. The multifractal method was also used to extract the features of the original vibration signals, and the q-D(q) parameters (q = 10) of the signal were used as the feature vectors. For visualization, t-SNE was used for the dimension reduction. The 2-dimensional q-D(q) feature vectors were input into ANW-GK clustering for fault identification.
(2) We compared our proposed method with the SDAE. The SDAE was used to extract the features from the original vibration signal, and the extracted features were input into ANW-GK clustering for fault identification. To maintain consistency, the network structure of the SDAE was 1024-1024-96-192-192-192-200.
(3) We compared our proposed method with the GK clustering algorithm. The DCAEN was used to extract the features of the original vibration signal, and the extracted features were input into the GK clustering algorithm for fault identification.
The comparison results are shown in Table 5, Figure 8, and Figure 11.
In comparison to the manual feature extraction methods, EMD + PE and the multifractal method, features learned by the DCAEN have a better cluster recognition effect, as shown in Table 5. For features extracted using EMD + PE, "OF1" and "OF3" are seriously aliased and "BF2" and "BF3" are seriously aliased, as shown in Figure 11(a); for features extracted using the multifractal method, "IF3" and "OF3" are seriously aliased and "BF1" and "IF2" are seriously aliased, as shown in Figure 11(b). This is mainly because the features extracted manually are not comprehensive, and important sensitive features may be lost, which results in identification difficulties. As shown in Table 5, compared with the SDAE, the cluster recognition effect of the features learned using the DCAEN is also better. For features learned by the SDAE, "BF1" and "OF2" are aliased, and "OF3" and "IF3" are aliased, as shown in Figure 11(c). The SDAE uses full connections between network layers, which results in a large amount of redundancy in the network's structural parameters. This makes the features learned by the network more global, while the locality of the features may be ignored. The DCAEN uses convolution and pooling layers followed by a full connection layer. The convolution and pooling layers learn local features from the input data, and the full connection layer learns the global features. Thus, the features extracted by our method are more distinguishable.

Compared with GK clustering, the fault recognition effect of ANW-GK clustering is much better, as shown in Table 5. When GK clustering was used to identify the fault types from features learned by the DCAEN, "OF1," "IF1," and "OF3" are slightly aliased, as shown in Figure 11(d); when our method is used to identify the fault types from features learned by the DCAEN, no aliasing between the various types occurs, as shown in Figure 8. This is mainly because different weights are given to each sample in our method, which enhances the role of typical samples. The different importance of samples for each type is more effectively characterized, so that the clustering accuracy is improved.

Case 2: Laboratory-Simulated Bearing Fault Dataset.
To further verify the effectiveness of the proposed method, it was applied to analyze the laboratory-simulated bearing fault dataset.

Experimental Setup.
The laboratory-simulated bearing fault data were collected from a rotor test bench, which is shown in Figure 12.
The rotor test bench was used to simulate different operating states of the ball bearings. A three-phase inverter motor, a shaft, and a speed controller were used to vary the speed of the test bearings. The single-point faults were introduced on the bearings (NSK6308) using a wire electrical discharge machine and a file, as shown in Figure 13. A couple of accelerometers (HD-YD232) were placed vertically on the bearing seat to collect the vibration signals of the test bearings.
The dataset included five different operating states of the bearings: normal (NR), outer ring fault (OF), inner ring fault (IF), rolling element fault (BF), and fix fault (FF). In the experiment, the rotating speed of the shaft was 2600 rpm, the sampling frequency was 8 kHz, and 200 samples were collected in each operating state. For each operating condition, 160 samples were randomly selected as training samples, and 40 samples were selected as test samples. The samples used for training a deep network must contain at least one complete signal period; otherwise, fault features cannot be effectively learned. To meet this requirement, the sample length must be longer than the number of points contained in a complete period, and the latter can be calculated from the sampling frequency and the bearing speed. Since the number of data points in the collected raw data is constant, the sample length is inversely proportional to the number of samples. If the sample length is too long, on the one hand, the number of samples may be too small and affect the training of the model; on the other hand, it may increase the cost of computation and affect the training speed. On the premise of ensuring the training effect, and for the convenience of storage and calculation, the sample length was set to 2048. The details of all the datasets are described in Table 6.
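The sample-length requirement can be checked with a short calculation. The sketch below takes one shaft revolution as the "complete period" (an assumption made for this example) and uses the sampling frequency, shaft speed, and sample length stated above; the function name is illustrative.

```python
def points_per_revolution(fs_hz, speed_rpm):
    # Samples recorded during one shaft revolution:
    # sampling frequency (Hz) times the revolution period (60 / rpm seconds).
    return fs_hz * 60.0 / speed_rpm

ppr = points_per_revolution(8000, 2600)      # about 184.6 points per revolution
revolutions_per_sample = 2048 / ppr          # about 11.1 revolutions per sample
print(round(ppr, 1), round(revolutions_per_sample, 1))
```

A sample of 2048 points therefore spans roughly 11 shaft revolutions at 2600 rpm, comfortably more than one complete period.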

Parameters of the Model.
The parameters of the model in Case 1 were used. The cluster number was set to $c \in [2, 14]$ ($c_{\max} = \sqrt{200}$), and the change of PBMF with $c$ is shown in Figure 14. Therefore, the optimal cluster number is $c = 5$.

Results and Analysis.
The training samples were input into the established DCAEN for unsupervised pretraining, and 30% of the labeled training samples were used for fine-tuning. We then input the 200 (5 × 40) test samples into the trained model, and the clustering results are shown in Figure 15.
In Figure 15, V1, V2, V3, V4, and V5 are the cluster centers of FF, NR, BF, IF, and OF, respectively, and their coordinate values are shown in Table 7. As can be seen from Table 7 and Figure 15, the five kinds of samples are clearly separated and clustered near their cluster centers. Samples of the same type are gathered closely, no aliasing occurs, and the distances between classes are large. The average membership degrees of each group of samples are shown in Table 8.
The membership of the first group of samples for V2 is much larger than that for the other four cluster centers, which indicates that the first group of samples belongs to V2. Similarly, the other groups of samples belong to different classes. Therefore, the excellent fault identification effect of the proposed method is verified again.

Generalization Performance under the Different Rotating Speeds.
In actual mechanical equipment, the rotating speeds of bearings are often variable. Consequently, the generalization performance of the proposed method at different rotating speeds was tested. The model was trained with the training set at 2600 rpm. The test sets were at rotating speeds of 2800 rpm, 3000 rpm, and 3200 rpm. The fault diagnosis results are shown in Figure 16. At the three different speeds, the clustering accuracies are 97%, 96.5%, and 98.5%; the PC values are 0.912, 0.901, and 0.922, respectively; and the CE values are 0.194, 0.215, and 0.14, respectively. It can be seen that the proposed method still maintains a high fault diagnosis accuracy at variable rotating speeds. Therefore, our method has a certain generalization performance at different rotating speeds.
To avoid contingency, the experiments at different rotating speeds were performed 20 times. The final results are the averages of the clustering evaluation indexes over the 20 experiments, as shown in Figure 17. At different rotating speeds, the clustering accuracy was maintained above 96%, the PC value was above 0.85, and the CE value was below 0.25. These results show that the proposed method has a good fault identification ability at different rotating speeds.

Results and Discussion
As mentioned above, it is difficult and sometimes impossible to obtain a large number of labeled samples in the process of fault diagnosis. The insufficiency of labeled samples easily leads to lower diagnostic accuracy. Therefore, it is important to explore fault diagnosis methods that use fewer labeled samples to achieve higher accuracy. In this paper, an intelligent fault diagnosis method using the DCAEN and ANW-GK clustering is proposed. The method can identify fault types with a few labeled samples. The performance of the method is validated on two bearing datasets. However, some potential problems and research directions remain to be studied.

Figure 12: The test bench of the bearings (labels: shaft, accelerometer, three-phase inverter motor, bearing).
Figure 15: The cluster contour map of the proposed method.
To effectively use a small number of labeled samples to improve the feature extraction capability of the model, a labeled-sample fine-tuning technique is used during the construction of the DCAEN. In the experiments, 30% of the training samples are used to fine-tune the DCAEN, and the model obtains better diagnostic performance. However, this is only tested on two datasets with single faults, which is a limitation. The parameter optimization of the DCAEN also needs to be considered. The number of convolutional layers, the size of the convolution kernel, the pooling size, and the activation function have an important impact on the performance of the model. Choosing them from the empirical values in the references may limit the model performance. Therefore, the issue of how to choose the parameters of the DCAEN should be considered.
ANW-GK clustering has certain advantages compared to the existing method (i.e., GK clustering). The integration of NWFE and PBMF into GK clustering improves the algorithm's fault identification ability but also increases its complexity. Whether this affects the real-time performance of the fault diagnosis method needs to be studied in future work.

Conclusions
A method based on the DCAEN and ANW-GK clustering for rolling bearing fault diagnosis is proposed in this paper. In our method, the DCAEN fine-tuned with a few labeled samples was used to extract high-level features of the input signals, and the extracted features, reduced in dimension by t-SNE, were input into the improved ANW-GK clustering algorithm for fault identification. Our method was validated on a benchmark bearing dataset and a laboratory-measured bearing dataset. The diagnostic accuracies are 96.5% and 97.5%, the PC values are 0.848 and 0.915, and the CE values are 0.399 and 0.186. The experimental results show that the feature extraction is better than that of other models, such as the EMD + PE, multifractal, and SDAE models. The classification accuracies also show that ANW-GK clustering can identify the bearing faults effectively under various conditions.
In the future, we will focus on deep embedded clustering that directly adds a clustering layer to the top of the DCAEN. The deep embedded clustering will iteratively improve the weight parameters and clustering goals of the joint optimization network through soft allocation. Thus, the model will not need to be fine-tuned, and the operation efficiency can also be improved.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no potential conflicts of interest with respect to the publication of this article.