A Novel Cuckoo Search Optimized Deep Auto-Encoder Network-Based Fault Diagnosis Method for Rolling Bearing

To enhance the performance of the deep auto-encoder (AE) under complex working conditions, a novel deep auto-encoder network method for rolling bearing fault diagnosis is proposed in this paper. First, multiscale analysis is adopted to extract multiscale features from the raw vibration signals of rolling bearings. Second, the sparse penalty term and contractive penalty term are used simultaneously to regularize the loss function of the auto-encoder to enhance the feature learning ability of the network. Finally, the cuckoo search (CS) algorithm is used to find the optimal hyperparameters automatically. The proposed method is applied to experimental data analysis. The results indicate that the proposed method can distinguish fault categories and severities of rolling bearings under different working conditions more effectively than other methods.


Introduction
Rolling bearings are widely used in rotary machines and play an important role in determining the running state of equipment. Under complex working conditions, rolling bearings inevitably develop various faults. This may affect the operating performance of the whole machine and even lead to enormous losses and serious casualties. Therefore, state monitoring and fault diagnosis of rolling bearings has important practical significance [1].
At present, the vibration analysis method has been widely used in fault diagnosis of rolling bearings, and it generally contains three steps: (1) feature extraction, (2) feature selection, and (3) pattern recognition [2]. For (1), the effectiveness of feature extraction is related to the accuracy of fault diagnosis. The common feature extraction methods mainly include time-domain methods, frequency-domain methods, and time-frequency methods. In the time domain, the most popular methods are statistical analyses [3], such as root mean square, root amplitude, and maximum peak value. The frequency-domain methods mainly analyze characteristic frequencies [4], such as frequency standard deviation and root mean square frequency. In the time-frequency domain, well-known methods include short-time Fourier transformation (STFT) [5], wavelet transformation (WT) [6], and empirical mode decomposition (EMD) [7]. For (2), feature selection can reduce the input dimension of the classifier and improve classification efficiency. Existing feature selection methods include decision trees [6], random forests [8], and distance-based feature selection [9]. For (3), the selected features are classified by Bayesian classification [10], support vector machines (SVM) [11], artificial neural networks (ANN) [12], and other machine learning algorithms. The shallow structure of machine learning lacks powerful representation capability and has difficulty effectively learning the complex nonlinear relationships in mechanical fault diagnosis problems [13]. As a breakthrough in the field of machine learning, deep learning has received widespread attention from many fields [14,15]. It can automatically mine the representative information hidden in raw data and directly establish an accurate mapping between the data and the operating state of the equipment [16].
AE is a kind of deep learning model, which can automatically learn features from samples and is widely used in the field of mechanical fault diagnosis [17]. Jia et al. [18] built a deep neural network based on AE and applied it to bearing fault diagnosis. Shao et al. [19] proposed a new deep feature fusion method for rotating machinery fault diagnosis, which achieved satisfactory diagnostic results in bearing fault diagnosis and gear fault experiment analysis. Meng et al. [20] improved the performance of the denoising auto-encoder (DAE) and successfully identified rolling bearing faults. Sun et al. [21] successfully used a deep neural network based on the sparse auto-encoder (SAE) for fault classification of induction motors. In addition, Shen et al. [22] proposed a fault diagnosis method for rotating machinery based on the contractive auto-encoder (CAE). Zhang et al. [23] conducted mechanical fault diagnosis using the ensemble deep contractive auto-encoder (EDCAE) in a noisy environment. The deep structure can automatically extract more representative and discriminative high-level features from the input and obtain satisfactory fault diagnosis results.
However, the effective characteristics of the signal are often submerged in industrial background noise. This leads to a performance decline of the deep AE [24]. Moreover, the selection of hyperparameters in deep auto-encoder networks has always been a challenge, and it is difficult to find the most suitable parameters. Therefore, it is necessary to design a novel deep AE.
In order to further enhance the performance of the deep auto-encoder and select hyperparameters adaptively, a novel CS-optimized deep auto-encoder network-based fault diagnosis method for rolling bearings is proposed in this paper. The proposed method was applied to the experimental data analysis of rolling bearings under different working conditions. The results show that the proposed method is more effective and robust than other methods. The main contributions of this paper can be summarized as follows: (1) The multiscale features of vibration signals are extracted as the input of the network, which improves test accuracy and reduces training time. (2) The sparse penalty term and contractive penalty term are simultaneously used to regularize the loss function of the AE, which enhances the feature learning ability and robustness of the deep AE. (3) The CS algorithm is used to optimize the hyperparameters of the new AE network automatically, which makes the network better suited to the signal characteristics. The rest of the paper is organized as follows. The basic AE is described in Section 2. The proposed method is described in Section 3. In Section 4, the experimental diagnosis results for rolling bearings are analyzed and discussed. Finally, conclusions are given in Section 5.

Basic AE
Basic AE is a three-layer unsupervised neural network [25] that can learn useful features automatically through nonlinear dimension reduction. The structure of the basic AE is shown in Figure 1. It consists of two steps: encoding and decoding. The encoding procedure maps the input data to a hidden feature vector, and the decoding procedure maps the hidden feature vector back to a reconstructed vector.
Given the unlabeled sample data $x^d = [x_1, x_2, \ldots, x_M]^T$ $(d = 1, 2, \ldots, D)$, the encoding procedure is defined as

$$h(x^d) = s\left(W^{(1)} x^d + b^{(1)}\right). \tag{1}$$

The decoding procedure is defined as

$$\hat{x}^d = s\left(W^{(2)} h(x^d) + b^{(2)}\right), \tag{2}$$

where $s$ is the sigmoid function, $W^{(1)}$ and $W^{(2)}$ are the weight matrices, $b^{(1)}$ and $b^{(2)}$ are the bias vectors, $h(x^d)$ is the hidden-layer feature vector, $\hat{x}^d$ is the reconstructed output vector, $D$ is the number of training samples, and $d$ stands for the $d$-th sample. AE minimizes the reconstruction error by optimizing the parameters $W$ and $b$.
The loss function of AE is usually defined by the mean square error (MSE) as

$$L\left(x^d, \hat{x}^d\right) = \frac{1}{2}\left\| x^d - \hat{x}^d \right\|^2, \tag{3}$$

where $\|\cdot\|$ represents the L2 norm. Thus, the total loss function over the $D$ training samples is defined as

$$J(W, b) = \frac{1}{D} \sum_{d=1}^{D} L\left(x^d, \hat{x}^d\right). \tag{4}$$
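As a minimal illustration of equations (1)–(4), the encode/decode pass and MSE loss can be sketched in NumPy (this is not the authors' implementation; the layer sizes and random weights are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
M, hidden = 8, 3                          # input and hidden-layer sizes (arbitrary)
W1 = rng.normal(0, 0.1, (hidden, M)); b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, (M, hidden)); b2 = np.zeros(M)

x = rng.normal(0, 1, M)                   # one unlabeled sample x^d
h = sigmoid(W1 @ x + b1)                  # encoding, equation (1)
x_hat = sigmoid(W2 @ h + b2)              # decoding, equation (2)
loss = 0.5 * np.sum((x - x_hat) ** 2)     # MSE reconstruction error, equation (3)
print(h.shape, x_hat.shape, round(float(loss), 4))
```

In training, the gradients of this loss with respect to $W$ and $b$ would be computed by backpropagation; the sketch only shows the forward pass.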

DCSAE Network.
If the encoder and decoder are given too much capacity, the AE will simply perform the copying task without capturing any useful information about the data distribution. Regularization terms can be added to the AE loss function to encourage the model to learn useful features without restricting its capacity [26].
The sparse penalty is a common regularization term; added to the loss function of AE, it can effectively reduce the dimension of the input data and speed up network training. The sparse penalty term is defined as [27]

$$L_{\text{sparse}} = \sum_{j=1}^{s_2} \mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right) = \sum_{j=1}^{s_2} \left[\rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}\right], \tag{5}$$

where $j$ indexes the units in the hidden layer, $s_2$ is the number of units in the second (hidden) layer, $\rho$ is the sparsity parameter, a small artificially given value, and $\hat{\rho}_j$ is the average activation value of the $j$-th hidden unit.
The contractive penalty is added to the AE so that it can learn representative and robust hidden features in noisy environments. The contractive penalty term is the squared Frobenius norm of the Jacobian of the encoder, defined as [28]

$$L_{\text{contract}} = \left\| J_f\left(x^d\right) \right\|_F^2 = \sum_{j,i} \left(\frac{\partial h_j\left(x^d\right)}{\partial x_i}\right)^2. \tag{6}$$

The element $J_f(x^d)_{ji}$ in the $j$-th row and $i$-th column is defined as

$$J_f\left(x^d\right)_{ji} = \frac{\partial h_j\left(x^d\right)}{\partial x_i}. \tag{7}$$

When the activation function of the AE is the sigmoid function, equation (7) can be calculated by

$$J_f\left(x^d\right)_{ji} = h_j\left(x^d\right)\left(1 - h_j\left(x^d\right)\right) W^{(1)}_{ji}, \tag{8}$$

where $W^{(1)}_{ji}$ is the weight connecting unit $i$ of the input layer to unit $j$ of the hidden layer.
Based on equations (8) and (6), the contractive term can be further written as

$$L_{\text{contract}} = \sum_{j=1}^{s_2} \left[h_j\left(x^d\right)\left(1 - h_j\left(x^d\right)\right)\right]^2 \sum_{i=1}^{M} \left(W^{(1)}_{ji}\right)^2. \tag{9}$$

Considering the advantages of the sparse penalty and the contractive penalty, in this paper the sparse penalty term and contractive penalty term are added to the loss function of AE at the same time, and a contractive sparse auto-encoder (CSAE) is obtained. The CSAE loss function can be defined as

$$J_{\text{CSAE}}(W, b) = J(W, b) + \beta \sum_{j=1}^{s_2} \mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right) + \lambda \left\| J_f\left(x^d\right) \right\|_F^2, \tag{10}$$

where $\beta$ is the sparse penalty factor controlling the weight of the sparse penalty relative to the reconstruction error, and $\lambda$ is the contractive penalty parameter controlling the weight of the contractive penalty term relative to the reconstruction error. Two CSAEs and one softmax classifier are stacked to form a DCSAE network with several hidden layers, as shown in Figure 2. The detailed process of the DCSAE network is as follows:
Step 1: the input data and the first hidden layer make up the first CSAE for feature learning
Step 2: the hidden-layer features of the first CSAE are used as the input of the second CSAE; this process is repeated until all hidden layers are pretrained
Step 3: the parameters are updated by minimizing the CSAE loss function
Step 4: the hidden features of the last CSAE are used as the input of the softmax classifier for supervised training
Step 5: the backpropagation (BP) algorithm is used to fine-tune the entire network to obtain a well-trained DCSAE
Moreover, a batch normalization (BN) layer is added to the DCSAE, which is beneficial to training the network weight matrices.
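The CSAE loss of equation (10) can be sketched as follows (a NumPy illustration, not the paper's code; the default ρ, β, λ values are the CS-selected ones reported for experiment 1, and the layer sizes here are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_div(rho, rho_hat):
    """KL divergence between target sparsity rho and mean activations rho_hat, eq. (5)."""
    return rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))

def csae_loss(X, W1, b1, W2, b2, rho=0.15, beta=0.27, lam=0.13):
    """CSAE loss of equation (10): reconstruction + sparse + contractive penalties.
    X has shape (D, M): D samples of dimension M."""
    H = sigmoid(X @ W1.T + b1)                          # hidden activations, (D, s2)
    X_hat = sigmoid(H @ W2.T + b2)                      # reconstructions, (D, M)
    mse = 0.5 * np.mean(np.sum((X - X_hat) ** 2, axis=1))
    rho_hat = np.clip(H.mean(axis=0), 1e-6, 1 - 1e-6)   # average activation per unit
    sparse = np.sum(kl_div(rho, rho_hat))
    # closed-form contractive term for a sigmoid encoder, equation (9)
    contract = np.mean(((H * (1 - H)) ** 2) @ (W1 ** 2).sum(axis=1))
    return mse + beta * sparse + lam * contract

# Toy usage with random weights (sizes are illustrative, not the paper's [85 50 30 7])
rng = np.random.default_rng(0)
D, M, s2 = 32, 20, 8
W1 = rng.normal(0, 0.1, (s2, M)); b1 = np.zeros(s2)
W2 = rng.normal(0, 0.1, (M, s2)); b2 = np.zeros(M)
X = rng.normal(0, 1, (D, M))
print(round(float(csae_loss(X, W1, b1, W2, b2)), 4))
```

In a full DCSAE, this loss would be minimized layer by layer during pretraining (Steps 1–3) before softmax training and BP fine-tuning.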

Parameter Optimization.
The sparse parameter ρ, the sparse penalty factor β, and the contractive penalty parameter λ in equation (10) are the key hyperparameters of the DCSAE. Setting these parameters manually makes it hard to ensure that the parameter set is optimal. The CS algorithm is a simple and effective optimization approach [29]: the search path is randomized during the optimization process, and the optimal solution is easily obtained with few parameters to set [30].
Thus, the CS algorithm is used to select the hyperparameters of DCSAE. The process of the CS algorithm is as follows: Step 1: set the cuckoo parameters (number of nests, population size, discovery probability, and number of iterations), set the fitness function, and initialize the nest positions.
Step 2: calculate the value corresponding to each nest according to the fitness function and select the current best nest.
Step 3: retain the optimal value of the previous generation and its corresponding nest position, and then, update other bird nest positions and states.
Step 4: compare the current value with the retained optimal value of the previous generation; replace it if the current value is better, otherwise keep it unchanged.
Step 5: after updating the nest position, a new probability is randomly generated. If the new probability is less than the set discovery probability, the nest position remains unchanged; otherwise, the nest position is randomly changed. Finally, keep the best nest position.

Figure 1: Structure of the basic AE (input layer, hidden layer, and output layer; encoding and decoding steps).

Step 6: if the number of iterations does not reach the maximum or the error does not reach the minimum, then return to step 2; otherwise, the global best nest position is the output.
Step 7: the global optimal nest position is output as the optimization parameters and the optimization is finished.
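Steps 1–7 above can be sketched as follows (a simplified illustration: Lévy-flight steps use Mantegna's approximation, and a toy sphere function stands in for the DCSAE validation-error fitness; all parameter values are illustrative):

```python
import numpy as np
from math import gamma, sin, pi

rng = np.random.default_rng(1)

def levy_step(dim, beta=1.5):
    """Mantegna's algorithm for a Levy-distributed random step."""
    sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0, sigma, dim)
    v = rng.normal(0, 1, dim)
    return u / np.abs(v) ** (1 / beta)

def cuckoo_search(fitness, dim, lb, ub, n_nests=15, pa=0.25, n_iter=100):
    nests = rng.uniform(lb, ub, (n_nests, dim))           # Step 1: initialize nests
    fit = np.array([fitness(n) for n in nests])           # Step 2: evaluate fitness
    for _ in range(n_iter):
        best = nests[np.argmin(fit)]                      # Step 3: keep current best
        for k in range(n_nests):                          # generate new solutions
            cand = np.clip(nests[k] + 0.01 * levy_step(dim) * (nests[k] - best), lb, ub)
            f = fitness(cand)
            if f < fit[k]:                                # Step 4: replace if better
                nests[k], fit[k] = cand, f
        # Step 5: abandon a fraction pa of nests and rebuild them at random positions
        abandon = rng.random(n_nests) < pa
        nests[abandon] = rng.uniform(lb, ub, (int(abandon.sum()), dim))
        fit[abandon] = [fitness(n) for n in nests[abandon]]
    i = np.argmin(fit)
    return nests[i], fit[i]                               # Steps 6-7: output best nest

# Toy usage: minimize the sphere function over [0, 1]^3
# (a stand-in for tuning [rho, beta, lambda] by validation error)
best_x, best_f = cuckoo_search(lambda x: float(np.sum(x ** 2)), 3, 0.0, 1.0)
print(best_x, best_f)
```

For the DCSAE, the fitness function would train the network with candidate [ρ, β, λ] and return its validation error; that step is omitted here for brevity.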

Procedure of the Proposed Method.
A new fault diagnosis method for rolling bearings based on CS and DCSAE is proposed in this paper. The flowchart of the proposed method is given in Figure 3. The procedure of the proposed method is as follows:
Step 1: multiscale features are extracted from the rolling bearing vibration signals to constitute the feature set
Step 2: the multiscale features are divided into training samples and testing samples in a certain proportion as the input of the network
Step 3: the CS algorithm is used to optimize the key hyperparameters of DCSAE

Dataset Introduction of Experiment 1.
The test rig of experiment 1 is shown in Figure 4. The test rolling bearings are 6206-2RS1 SKF deep groove ball bearings, and single-point faults are seeded by wire cutting with depths of 0.2 mm, 0.3 mm, and 0.4 mm. Vibration signals from four kinds of rolling bearings were collected: normal bearing (NO) and bearings with inner race (IR), outer race (OR), and rolling ball (RB) faults. The motor speeds are 150 r/min, 300 r/min, 900 r/min, and 1500 r/min with loads of 0 kN, 5 kN, and 7 kN. The sampling frequency is 10.24 kHz, and the data length is 204800 points.

Dataset Introduction of Experiment 2.
In this experiment, rolling bearing data with the same fault degree under different loads were selected to verify the recognition ability of the DCSAE. The rolling bearing data set is collected at a motor speed of 900 r/min under loads of 5 kN and 7 kN. The operating conditions are listed in Table 2. Each condition also consists of 200 samples, and each sample also consists of 1024 data points. For each fault, 180 random samples are chosen as training samples, and the remaining 20 samples are chosen as testing samples.

Multiscale Feature Extraction.
It is common to perform multiscale analysis on vibration signals to extract their multiscale features and mine fault information more fully. Multiscale analysis based on signal coarse-graining is widely used. It mainly contains two steps: first, the coarse-grained sequences of the raw signals at different scales are calculated; then, the feature values of the coarse-grained sequence at each scale are calculated to obtain the multiscale features [31].
For a sample of a certain length, the coarse-grained sequence at different scales is given by [32]

$$x^{\tau}(j) = \frac{1}{\tau} \sum_{i=(j-1)\tau + 1}^{j\tau} u(i), \quad 1 \le j \le \frac{L}{\tau}, \tag{11}$$

where $x^{\tau}(j)$ is the coarse-grained sequence, $\tau$ is the scale, $u(i)$ is the sample, and $L$ is the sample length.
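Equation (11) amounts to averaging the signal over non-overlapping windows of length τ; a minimal sketch:

```python
import numpy as np

def coarse_grain(u, tau):
    """Coarse-grained sequence of equation (11): non-overlapping means of length tau."""
    L = len(u)
    n = L // tau                          # number of coarse-grained points, floor(L/tau)
    return u[:n * tau].reshape(n, tau).mean(axis=1)

# Toy usage: a 1024-point sample at scales 1..5, mirroring the setup used in the paper
u = np.sin(np.linspace(0, 40 * np.pi, 1024))
for tau in range(1, 6):
    x_tau = coarse_grain(u, tau)
    print(tau, x_tau.shape)
```

At scale τ = 1 the coarse-grained sequence is simply the raw signal; larger scales trade temporal resolution for noise suppression.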
To illustrate the effectiveness of multiscale features as network input, the raw bearing vibration signals, single-scale features, and multiscale features of the bearing vibration signals are each used as input for the DCSAE. In the multiscale feature extraction based on the data of experiment 1, each sample consists of 1024 data points, and 17 types of time-domain and frequency-domain features over the first 5 scales of the raw vibration signals are extracted to form a new feature set with a dimension of 1 × 85. The extracted features [33] are as follows: (1) Time-domain features: mean, root mean square, root amplitude, average amplitude, maximum peak value, standard deviation, skewness index, kurtosis index, peak value index, margin index, waveform index, and pulse index. (2) Frequency-domain features: center of gravity frequency, frequency standard deviation, root mean square frequency, skewness frequency, and kurtosis frequency. The formulas of some of the time-domain and frequency-domain features are shown in Table 3. The comparison results are shown in Figure 6, and the test accuracy and training time are listed in Table 4.
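A few of the listed features can be computed as follows (a sketch using standard textbook definitions, which may differ in detail from the exact formulas in Table 3; the 50 Hz test tone and function names are illustrative):

```python
import numpy as np

def time_domain_features(x):
    """A few of the time-domain statistics listed above (standard definitions)."""
    rms = np.sqrt(np.mean(x ** 2))
    peak = np.max(np.abs(x))
    return {
        "mean": np.mean(x),
        "rms": rms,
        "std": np.std(x),
        "peak": peak,
        "kurtosis_index": np.mean((x - np.mean(x)) ** 4) / np.std(x) ** 4,
        "peak_index": peak / rms,         # crest factor
    }

def frequency_domain_features(x, fs):
    """Center-of-gravity frequency and frequency standard deviation computed from
    a periodogram estimate of the power spectral density S(f)."""
    S = np.abs(np.fft.rfft(x)) ** 2
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    fc = np.sum(f * S) / np.sum(S)        # center of gravity frequency
    f_std = np.sqrt(np.sum((f - fc) ** 2 * S) / np.sum(S))
    return {"cog_frequency": fc, "freq_std": f_std}

# Toy usage on a 50 Hz tone sampled at 10.24 kHz (the paper's sampling rate)
fs = 10240
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 50 * t)
print(time_domain_features(x)["rms"], frequency_domain_features(x, fs)["cog_frequency"])
```

Applying such functions to each coarse-grained sequence at scales 1–5 yields the 17 × 5 = 85-dimensional multiscale feature vector per sample.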
As shown in Figure 6 and Table 4, compared with the other groups, the training time is longest and the average test accuracy is lowest in each trial when raw data are used as input for the DCSAE. This is because the raw data of rolling bearings are high-dimensional and contain a lot of redundant information, which is not conducive to weight training in the network and leads to long training time. The training time with single-scale features as the input for DCSAE is the shortest, but the recognition rate is far lower than that with multiscale features.
Multiscale features as the network input can fully mine fault information; therefore, multiscale features are used as input for all training networks in the experimental verification in Section 4.4.

Results and Analysis of Experiment 1.
To verify the effectiveness of the proposed method, it is compared with SVM, SAE, the stacked sparse auto-encoder (SSAE), and the stacked contractive auto-encoder (SCAE).
In this experiment, the basic parameters of the CS are set as given in Section 4.2, and the optimal hyperparameters [ρ, β, λ] found by the CS algorithm are [0.15, 0.27, 0.13]. The deep structure of DCSAE is selected as [85 50 30 7] by experimentation, the learning rate is 0.01, and the number of iterations is 200. The kernel function of SVM is the RBF function, and the penalty factor and kernel function parameter are obtained by the CS, which are 0.64 and 0.017, respectively. The structure of SAE is [85 30 85], the learning rate is 0.01, and the hidden-layer features are input to the softmax for classification. The structures of SSAE and SCAE are the same as that of the DCSAE, the learning rate is 0.01, and the number of iterations is 200; the sparse parameters ρ and β in SSAE and the contractive penalty parameter λ in SCAE are also obtained by the CS, and they are 0.08, 0.18, and 0.69, respectively.
Ten trials are carried out for the five methods to eliminate the influence of accidental errors, and the average result of the ten trials is used as the evaluation index. The comparison results of the five methods are shown in Figure 7 and Table 5. The following can be seen from Figure 7 and Table 5: (1) Comparison between SSAE, SCAE, and DCSAE: the average test accuracies of SSAE and SCAE are 96.43% and 95%, respectively, lower than that of DCSAE, which is 98.57%. The standard deviations of SSAE and SCAE are smaller than that of DCSAE. This indicates that applying the sparse penalty term and the contractive penalty term to the loss function of the stacked AE at the same time can obtain more satisfactory diagnosis results. (2) Comparison between SVM, SAE, and the three deep networks: the average test accuracies of SVM and SAE are 52.14% and 70.71%, respectively, much lower than those of the three deep networks SSAE, SCAE, and DCSAE. This illustrates that deep learning methods perform better than shallow models when dealing with large samples, because the deep learning method can mine more useful features from fault data.
The confusion matrix of the DCSAE for experiment 1 is shown in Figure 8. The columns stand for the true labels, and the rows stand for the predicted labels; the color bar on the right shows the correspondence between color and numbers (from 0 to 1). DCSAE obtains the best result of 100% on IR1-1, IR1-2, OR1-1, OR1-2, and NO1-1 in Figure 8. The only misclassification occurred on RB fault samples: about 10% of RB1-1 samples were misclassified as RB1-2.
A manifold learning method is used to visualize the data samples to further analyze the feature extraction capability of DCSAE. t-distributed stochastic neighbor embedding (t-SNE) [34] is an embedding model that maps data from a high-dimensional space to a low-dimensional space while retaining the local characteristics of the data set; it is mainly used for dimension reduction and visualization of high-dimensional data. Therefore, t-SNE is adopted to extract the first three components of the raw data and of the last layer of DCSAE, and the scatter plots are drawn in Figures 9 and 10. As shown in Figure 10, fault samples of the same category converge to a center, and fault samples of different categories can be easily distinguished. In conjunction with Figures 8 and 10, it can be seen that a small number of samples overlap between RB1-1 and RB1-2, which leads to the misclassification of samples with the RB fault condition. In general, DCSAE has achieved a high accuracy in distinguishing the different categories under the same motor speed and load. Its average test accuracy is higher than those of the existing deep learning methods and traditional machine learning methods.

Table 3: Formulas of the partial extracted features (frequency-domain statistical features such as the center of gravity frequency and frequency standard deviation; S(f) is the power spectral density).

Results and Analysis of Experiment 2.
The results of the 10 trials are shown in Figure 11, and the average test accuracy and standard deviation over the 10 trials are shown in Table 6. As shown in Figure 11 and Table 6, compared with the other three deep AE networks, the average test accuracy of DCSAE is 91.25% ± 0.0092, the highest test accuracy with the lowest standard deviation. This indicates that DCSAE can obtain better diagnosis accuracy when dealing with fault data under different conditions. The confusion matrix of the DCSAE for experiment 2 is shown in Figure 12. From the result, about 5% of NO2-1 samples are misclassified as NO2-2, about 25% of RB2-1 samples as RB2-2, and about 35% of RB2-2 samples as RB2-1; all other categories are classified correctly when processing bearing fault data with different loads.

Similar to the above process, t-SNE is used to extract the raw data and the first three components of the last layer of DCSAE in experiment 2 and to draw the scatter plots in Figures 13 and 14. As shown in Figure 14, fault samples of different categories under different loads can be distinguished. In conjunction with Figures 12 and 14, it can be seen that a large number of samples overlap between RB2-1 and RB2-2, and it is difficult for the softmax classifier to achieve a high-precision classification. This leads to the misclassification of samples with the RB fault condition. Overall, DCSAE has the robustness and generalization ability to distinguish the different categories under different loads.
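The t-SNE visualization step can be sketched with scikit-learn (assumed available); since the trained DCSAE is not part of this snippet, its last-layer features are mocked with random clustered data:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Mock "last hidden layer" features: 4 fault classes x 50 samples, 30-D
features = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 30)) for c in range(4)])
labels = np.repeat(np.arange(4), 50)

# Embed into 3 components for the scatter plots, as in Figures 10 and 14
emb = TSNE(n_components=3, perplexity=30, init="random",
           random_state=0).fit_transform(features)
print(emb.shape)  # (200, 3)
```

The resulting `emb` can then be drawn with a 3-D scatter plot colored by `labels`, e.g. via matplotlib's `Axes3D.scatter`.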

Conclusion
In this paper, a novel deep AE network is proposed for rolling bearing fault diagnosis. In the proposed network, the sparse penalty term and the contractive penalty term are adopted to regularize the loss function of the novel deep AE to enhance its feature learning ability. Moreover, the multiscale features of the rolling bearing vibration signal are extracted as the input of DCSAE to reduce the training time and data dimension. Furthermore, the CS algorithm is used to optimize the key hyperparameters of the deep auto-encoder automatically. The effectiveness of the proposed method is verified by two different rolling bearing fault diagnosis experiments. The results show that the proposed method is more effective and robust for fault diagnosis than other methods. Applying deep learning methods to other mechanical fault diagnosis fields is a worthy research topic, and the authors will continue to study it in future work.

Data Availability
The data used to support the findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 6/12 months after publication of this article, will be considered by the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest concerning the publication of this manuscript.