Rolling Bearing Fault Diagnosis Based on Stacked Autoencoder Network with Dynamic Learning Rate

Fault diagnosis is of great significance for ensuring the safe and reliable operation of rolling bearings in industry. Stacked autoencoder (SAE) networks have been widely applied in this field. However, model parameters such as the learning rate are usually fixed, which adversely affects the convergence speed and the accuracy of fault classification. Thus, this paper proposes a dynamic learning rate adjustment approach for the stacked autoencoder network. First, the input data are normalized and enhanced. Second, the structure of the SAE network is selected. According to the sign of the training error gradient, a learning rate reducing strategy is designed to be consistent with the current state of the network. Finally, fault diagnosis models with different learning rate adjustments are built to validate the better performance of the proposed approach. In addition, the influence of the quantity of labeled sample data on the backpropagation process is analyzed. The results show that the proposed method can effectively increase the convergence speed and improve classification accuracy. Moreover, it can reduce the labeled sample size and make the network more stable at the same classification accuracy.


Introduction
As an important part of rotating machinery, bearings play an important role in modern industry. Bearing faults can cause unbearable and unpredictable losses [1,2]. Therefore, many artificial intelligence (AI) fault diagnosis methods have been applied to keep bearings working properly and reliably. However, traditional AI methods are primarily based on shallow machine learning theory, which works on the original feature representation without creating new features during the learning process. It is therefore difficult to reveal the inherent nonlinear relationships of complex mechanical systems [3,4]. In addition, with the development of condition monitoring, the data from bearing working stations have become far more extensive than ever before, which brings new opportunities and challenges for bearing fault diagnosis. In view of the characteristics of big data, such as imperfection, multiple sources, and low value density, shallow machine learning methods need to be fundamentally improved. Consequently, deep learning methods have been adopted.
Deep learning was considered a top-ten breakthrough by MIT Technology Review. It attempts to assemble deep architectures with many layers and build more complex nonlinear functions to simulate the process of learning knowledge [5]. Due to its powerful feature extraction ability and unsupervised training pattern, deep learning has received a great deal of attention in many areas such as computer vision [6][7][8], speech recognition [9,10], text processing [11,12], medicine [13,14], finance [15,16], and driverless cars [17][18][19]. Especially in the field of machine fault diagnosis, many scholars have done extensive research covering bearings [20][21][22][23], wind turbines [24][25][26], pumps [27][28][29], and power transmission systems [30][31][32]. To date, there are four typical architectures: the stacked autoencoder (SAE), the deep belief network (DBN), the convolutional neural network (CNN), and the recurrent neural network (RNN). Among these models, SAE can learn rich representation features and reduce data dimensionality, and it has received a great deal of attention in the field of bearings. Sun et al. [33] used a sparse autoencoder to realize bearing fault diagnosis and analyzed the influence of dropout on the fault recognition rate. Pang and Yang [34] developed cross-domain stacked denoising autoencoders (CD-SDAE) with a new adaptation training strategy. Wang et al. [35] proposed a new activation function named ReLTanh to solve the problem of vanishing gradients.
This research gives SAE the potential to improve the diagnosis accuracy of bearings. However, some questions still need to be solved. For example, the learning rate is an important parameter of the iteration process. The traditional SAE adopts a fixed learning rate, which requires a lot of experience to set. If the learning rate is set too high, the network will have difficulty converging or will skip the optimal value. Conversely, if the learning rate is set too small, the convergence speed will be too slow. Some scholars have studied methods such as Adam and AdaDec [36], but these methods do not truly change the learning rate and are still affected by the initial learning rate. In addition, quantitative analysis of the number of labeled samples in the reverse fine-tuning process of SAE is rarely reported. If the amount of labeled sample data can be decreased, the cost of collecting and labeling data would be substantially reduced.
To solve the above problems, this paper proposes a novel dynamic learning rate method to replace the fixed learning rate in the pretraining and reverse fine-tuning processes of bearing fault diagnosis, making the following two main contributions: (1) according to the sign of the training error gradient, a learning rate reducing strategy is designed to be consistent with the current state of the network; the convergence speed and convergence accuracy are improved significantly. (2) The influence of the number of labeled samples on the accuracy and iterations of SAE is studied. The bearing data sets provided by Case Western Reserve University's (CWRU) Bearing Data Center were used to verify the performance of the proposed method. Compared with the fixed learning rate model, the results showed that the proposed method took less convergence time and had higher classification accuracy. At the same accuracy, the proposed method needed fewer labeled samples. The remainder of the paper is organized as follows: Section 1 briefly introduces the basic methods and the proposed approach. Section 2 details the data source and the method of data processing. In Section 3, experiments are conducted to evaluate the proposed method under different dynamic learning rates and different numbers of labeled samples; a visualization of the proposed method is also presented in Section 4. Finally, conclusions and future work are presented in Section 5.

Stacked Autoencoder
2.1.1. Autoencoder. The autoencoder (AE) is a single-hidden-layer neural network proposed by Rumelhart in 1986, whose structure is shown in Figure 1 [37]. This kind of network keeps the input and output as consistent as possible by means of unsupervised learning. Assume the input is n-dimensional data X and the output is n-dimensional data Y. The transfer of the raw data from the input layer to the hidden layer is called encoding, and the transfer from the hidden layer to the output layer is called decoding, which can be described by equations (1) and (2) [5]. In essence, AE aims to learn an approximate identity function of the input data by minimizing the error between the reconstructed data and the original data. The mathematical expressions of the autoencoder are

h = σ_a(W_a^T X + b_a),  (1)

Y = σ_s(W_s^T h + b_s),  (2)

where W_a ∈ R^(n×k), W_s ∈ R^(k×n), b_a ∈ R^k, b_s ∈ R^n are the weights and biases that need to be optimized, h is the k-dimensional hidden representation, and σ_a(·), σ_s(·) are the activation functions.
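As a concrete illustration of equations (1) and (2), the following minimal NumPy sketch runs one encode-decode pass. The dimensions, random initialization, and sigmoid activations are illustrative choices for this sketch, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 8, 3          # input dimension n, hidden dimension k
X = rng.normal(size=n)

# Randomly initialized parameters: W_a in R^{n x k}, W_s in R^{k x n}
W_a = rng.normal(scale=0.1, size=(n, k))
b_a = np.zeros(k)
W_s = rng.normal(scale=0.1, size=(k, n))
b_s = np.zeros(n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x):
    # h = sigma_a(W_a^T x + b_a): map the n-dim input to a k-dim code
    return sigmoid(W_a.T @ x + b_a)

def decode(h):
    # y = sigma_s(W_s^T h + b_s): reconstruct an n-dim output from the code
    return sigmoid(W_s.T @ h + b_s)

h = encode(X)
Y = decode(h)
# Reconstruction error that training would minimize
loss = 0.5 * np.sum((Y - X) ** 2)
```

Training would adjust W_a, b_a, W_s, and b_s by gradient descent on this reconstruction loss.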

Stacked Autoencoder. The SAE was developed by Hinton from the autoencoder, and its network structure is shown in Figure 2 [38]. The coding parts of the autoencoders are stacked; that is, the input of the first AE layer is the original data, and the input of each lower layer is the hidden-layer output of the layer above it. Finally, a classifier is added to the network. The training of SAE consists of a pretraining process and a reverse fine-tuning process. It uses a large amount of unlabeled data for unsupervised learning, extracts features autonomously, and then uses labeled data to fine-tune the network in reverse. Both the pretraining and the reverse fine-tuning process are based on the gradient descent algorithm.
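A toy sketch of the greedy layer-wise pretraining described above, assuming sigmoid activations, a squared reconstruction error, and plain gradient descent; the layer sizes, learning rate, and epoch count are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(data, n_hidden, lr=0.5, epochs=200):
    """Greedily train one autoencoder layer; return its encoder and codes."""
    n = data.shape[1]
    W = rng.normal(scale=0.1, size=(n, n_hidden))
    b = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, n))
    b2 = np.zeros(n)
    for _ in range(epochs):
        h = sigmoid(data @ W + b)           # encode
        y = sigmoid(h @ W2 + b2)            # decode
        # Backprop through the loss 0.5 * ||y - data||^2
        dy = (y - data) * y * (1 - y)
        dh = (dy @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ dy / len(data)
        b2 -= lr * dy.mean(axis=0)
        W -= lr * data.T @ dh / len(data)
        b -= lr * dh.mean(axis=0)
    return (W, b), sigmoid(data @ W + b)

# Stack: each trained layer's code becomes the next layer's input
X = rng.normal(size=(64, 16))
codes = X
params = []
for size in (8, 4):                         # toy analogue of 800-400-200-...
    layer, codes = pretrain_layer(codes, size)
    params.append(layer)
```

After pretraining, a classifier would be attached and the labeled data used to fine-tune the whole stack.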

Dynamic Learning Rate. The essence of the gradient descent method is to adjust the weights of the network iteratively according to the partial derivative of the loss function. The updated weight is calculated as

W_{i+1} = W_i − η (∂L_i/∂W_i),

where W_i and W_{i+1} are the weights at the i-th and (i+1)-th iterations, η is the learning rate, and L_i is the loss function. The learning rate is a very important parameter in the training process. If the learning rate is too large, it may lead to difficulty in convergence or skip the optimal solution. On the contrary, if the learning rate is too small, it may lead to a slow convergence speed, an increase in computation time, and difficulty in improving efficiency. To solve this problem, the learning rate should be nonconstant and adjusted adaptively.
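The effect of the learning rate in this update rule can be seen on a one-dimensional quadratic. The function, step sizes, and iteration count below are illustrative only: a too-small rate converges slowly, a moderate rate converges, and a too-large rate diverges.

```python
def gradient_descent(grad, w0, eta, steps):
    """Iterate W_{i+1} = W_i - eta * dL/dW for a scalar weight."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Minimize L(w) = (w - 3)^2, whose gradient is 2 * (w - 3); optimum at w = 3
grad = lambda w: 2 * (w - 3)
w_small = gradient_descent(grad, w0=0.0, eta=0.01, steps=50)  # too small: slow
w_good = gradient_descent(grad, w0=0.0, eta=0.3, steps=50)    # converges
w_big = gradient_descent(grad, w0=0.0, eta=1.1, steps=50)     # too big: diverges
```

With eta = 0.01 the iterate is still far from 3 after 50 steps, with eta = 0.3 it reaches the optimum, and with eta = 1.1 it oscillates away from it, which is exactly the trade-off the adaptive schedule tries to resolve.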

AdaDec Dynamic Learning Rate.
AdaDec is an improved form of AdaGrad proposed by Senior et al. in 2013 [36]. Its principle is shown in equations (6), (7), and (8). The gradient part of the denominator is determined by the gradient values of the previous round and the current round, which eliminates excessive historical gradient data. At the same time, it uses a power term with a downward trend as the numerator to ensure a stable decline of the learning rate.
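Since equations (6)-(8) are not reproduced here, the sketch below only illustrates the two ingredients the text describes: a leaky accumulation of squared gradients in the denominator (so old history is forgotten) and a decaying power term that keeps the schedule falling. All names and constants are assumptions for illustration, not AdaDec's exact formula.

```python
import math

def adadec_like(eta0, grads, gamma=0.9, K=1.0, c=0.75):
    """AdaDec-style schedule sketch: leaky squared-gradient accumulator
    plus a power-decay factor. Hyperparameter names are illustrative."""
    G = 0.0
    rates = []
    for t, g in enumerate(grads, start=1):
        G = gamma * G + g * g                       # forget old gradient history
        rates.append(eta0 / ((1 + t) ** c * math.sqrt(K + G)))
    return rates

rates = adadec_like(0.1, [0.5] * 20)
```

Note that every rate is proportional to eta0, which is why such schedules remain sensitive to the initial learning rate, as the text points out.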

Improved Dynamic Learning Rate.
The AdaDec method has the following disadvantages: it depends on the initial learning rate setting, and it can only decrease the learning rate, never increase it.
Because the relationship between the partial derivative of the loss function and the weights is complex, the training error is selected as the basis of the learning rate adjustment. In this manuscript, the learning rate adjustment strategy updates h(i), the learning rate at iteration i, to h(i + 1) at iteration i + 1 according to ΔL_i, the gradient of the training error at iteration i (i > 2). The forward difference quotient method is adopted to calculate the gradient of the reconstruction error:

ΔL_i = L_{i−1} − L_i.

Through this calculation, if the volatility of the training error is obvious, ΔL_i is negative, so the learning rate decreases slowly. Otherwise, if the training error decreases smoothly, ΔL_i is positive, so the learning rate decreases rapidly. Furthermore, to prevent the learning rate from becoming too small, its range is limited to 0.01-5. The flowchart of the dynamic learning rate method is shown in Figure 3.
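The strategy described above can be sketched as follows. The two decay factors `fast` and `slow` are illustrative assumptions, not the paper's exact update coefficients; only the sign test on the error difference and the clamping of the rate to [0.01, 5] come from the text.

```python
def update_learning_rate(h, losses, fast=0.95, slow=0.999,
                         h_min=0.01, h_max=5.0):
    """One step of a sign-based schedule: decay quickly while the
    training error falls smoothly, slowly when it fluctuates.
    `fast`/`slow` are illustrative constants."""
    if len(losses) < 2:
        return h                           # not enough history yet
    delta = losses[-2] - losses[-1]        # positive when the error fell
    h = h * (fast if delta > 0 else slow)
    return min(max(h, h_min), h_max)       # clamp to [0.01, 5]

# Smoothly falling error -> the rate shrinks at the fast decay factor
h = 0.2
for L_prev, L_cur in [(1.0, 0.8), (0.8, 0.6), (0.6, 0.5)]:
    h = update_learning_rate(h, [L_prev, L_cur])
```

A practical loop would append each epoch's training error to `losses` and call this update once per iteration.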

Data Source.
The proposed method was verified on the experimental data provided by Case Western Reserve University (CWRU) [39]. The test rig is shown in Figure 4. The bearing fault locations are the outer race, the inner race, and the rolling element. For each fault location, there are three fault diameters: 0.18 mm, 0.36 mm, and 0.54 mm. Together with the healthy state, there are 10 classes in this data set. Two accelerometers were installed at the drive end and the fan end of the motor casing to collect vibration signals at a sampling frequency of 12 kHz. 120,000 data points were used for each state, among which the first 80,800 points were taken as training data and the rest as test data.

Data Normalization.
The vibration data were normalized using

X′ = (X − X_min) / (X_max − X_min),

where X represents a sample, X′ represents the normalized result of X, X_max represents the maximum value of X, and X_min represents the minimum value of X.
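The min-max formula above maps every sample into [0, 1]; a minimal sketch with made-up example values:

```python
def normalize(x):
    """Min-max normalization: x' = (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(x), max(x)
    return [(v - x_min) / (x_max - x_min) for v in x]

# Illustrative values only: the minimum maps to 0, the maximum to 1
scaled = normalize([2.0, 4.0, 10.0])
```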

The Enhancement of the Data Set.
The data length of each sample was set to 800 points. To obtain more training samples, the overlapping sampling technique with a sliding window was adopted to enhance the data according to [40,41], as shown in Figure 5. The offset was set to 50, so the sample size of the training data in each state was 1600, while the test data were not overlapped, giving a test sample size of 49 in each state. Thus, the total number of training samples was 16,000, and the total number of test samples was 490.
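The overlapping sampling can be sketched directly; with the paper's settings (window 800, offset 50, 80,800 training points per state), this boundary handling yields 1,601 windows, essentially the 1,600 reported, and the non-overlapping split of the 39,200 test points yields exactly 49.

```python
def sliding_windows(signal, length=800, offset=50):
    """Overlapping sampling: windows of `length` points,
    each shifted by `offset` points from the previous one."""
    return [signal[i:i + length]
            for i in range(0, len(signal) - length + 1, offset)]

# Dummy signals standing in for one state's vibration record
train = sliding_windows(list(range(80800)))
test = sliding_windows(list(range(39200)), length=800, offset=800)  # no overlap
```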

Network Structure.
The constant pretraining learning rate was 0.1, the number of pretraining iterations was 600, the reverse fine-tuning learning rate was 0.01, the proportion of labeled samples was set to 10% (after calculation, when the ratio exceeds 10%, the change in accuracy is not obvious), the number of fine-tuning iterations was 600, and the batch size was 40. The number of hidden layer nodes was initially set to 95 according to the following empirical formula:

n_h = sqrt(m · n) + k,

where m is the length of the input data node, n is the length of the output data node, and k is a constant within [0, 10]. First, to study the influence of different numbers of hidden layers on bearing fault classification, a comparative experiment was conducted in which the number of hidden layers was set to 1, 2, 3, 4, 5, and 6, and the average of ten results was taken as the final result.
As shown in Table 1, when the number of hidden layers was three, the accuracy was the highest. If the number of hidden layers is too small, the network cannot extract the features of the input data effectively, which is why the accuracy was only 77.08% with a single hidden layer. On the other hand, a network with too many hidden layers also cannot achieve good accuracy, because information about the data features may be lost during the iteration process. As the number of hidden layers increased from three to six, the accuracy fell to 86.62%. Thus, it is important to choose an appropriate number of hidden layers. In the subsequent experiments, according to the above analysis, the number of hidden layers was fixed at three to improve the accuracy of the network. Subsequently, experiments with different numbers of hidden layer nodes were compared and analyzed; all results were again averages of ten experiments. As can be seen from Table 2, the network structure 800-400-200-100-10 had the highest accuracy, and this structure was selected for the follow-up study.

Dynamic Learning Rate Adjustment.
Experiments with fixed learning rates of 0.01, 0.1, 0.2, and 0.3 and with the dynamic learning rate were conducted simultaneously. In the pretraining process, the dynamic learning rate was initialized to 0.2. The other network parameters were consistent with those of the fixed learning rate networks. The training error curves are shown in Figure 6.
As can be seen from Figure 6, the training error decreased as the iterations increased. When the learning rate was 0.3, the training error was unstable and fluctuated at the beginning of the iterations and eventually converged to a local optimum. When the learning rate was 0.01, the training error was still unstable (not convergent) at the end of the iterations and showed an obvious downward trend. In contrast to the learning rates of 0.3 and 0.01, the training error decreased quickly and smoothly when the learning rate was set to 0.1, to 0.2, or to the dynamic learning rate. The training error of the dynamic learning rate, in particular, was the smallest, and its descent was the fastest. It is obvious that the network with the dynamic learning rate requires the fewest steps to reach a stable, convergent state. It can be seen from Figure 7 and Table 3 that the dynamic learning rate method has a faster convergence speed and better convergence accuracy than AdaDec. Table 3 lists the iteration numbers, the time consumption, and the training error of the various learning rates. Compared with the results for learning rates of 0.1 and 0.01, the performance of the dynamic learning rate was much better. Compared with the results for a learning rate of 0.2, the iteration number and convergence time of the dynamic learning rate increased by 24.4% and 25.5%, respectively, but the training error decreased dramatically by 47.6%, from 0.2868 to 0.1504. From a comprehensive perspective, therefore, the proposed dynamic learning rate method is a better choice for improving the effectiveness of bearing fault diagnosis. Figure 8 shows how the dynamic learning rate changed over the iteration process.
The learning rate increased at the beginning, reaching a maximum of 0.2863, and then gradually decreased to 0.0573. The initial learning rate given to AdaDec was 0.02; when the initial learning rate was greater than 0.03, the training error stayed around 3.6 and could not converge. Therefore, the AdaDec method is also affected by the initial learning rate. As can be seen from Table 4, the dynamic learning rate can converge to good accuracy regardless of the initial learning rate.

Different Number of Labeled Samples in the Process of Reverse Fine-Tuning. To explore the effect of the dynamic learning rate in the reverse fine-tuning process, three groups of experiments were set up, in which the fixed learning rate of the reverse fine-tuning process was 0.01 and the weights and biases were obtained by pretraining with either a fixed learning rate of 0.1 or the dynamic learning rate. The comparison of the results obtained through reverse fine-tuning is shown in Figure 9. It is obvious that the accuracy with the dynamic learning rate used in the reverse fine-tuning process also fluctuated. Compared with the fixed learning rate of 0.01, the amplitude of variation of the dynamic learning rate was larger at the beginning of the iterations, but after about 100 iterations its fluctuation became smaller than that of the fixed learning rate. This proves that the dynamic learning rate method not only has higher accuracy but also better convergence. It can also be seen from the figure that the results obtained by using the fixed learning rate in both pretraining and reverse fine-tuning are the worst in terms of stability and accuracy.
To explore the influence of the labeled samples on the accuracy in the reverse fine-tuning process, experiments with different percentages of labeled samples were conducted. The labeled samples were set to 1%, 2%, 3%, 4%, 5%, 6%, 8%, and 9% of the training data, respectively. The rules for adding labeled samples are listed in a table. The experimental results for the different percentages of labeled samples are shown in Figure 10. The reverse fine-tuning process was carried out using the weights and biases obtained from the same pretraining process, in which the dynamic learning rate was used; the reverse fine-tuning process also used a dynamic learning rate. As can be seen from Figure 11, the more labeled samples there were, the fewer iteration steps were required to achieve 90% accuracy. In general, the accuracy increased with the number of labeled samples.
To determine the most suitable number of labeled samples, experiments using the fixed learning rate of 0.1 and the dynamic learning rate in the pretraining process were conducted. The fault classification accuracy was the average of ten experiments. The results are shown in Figure 12. It is obvious that the network with the dynamic learning rate was more accurate for the same labeled sample size; in other words, fewer labeled samples were needed with the dynamic learning rate to achieve the same accuracy. When the percentage of labeled data ranged from 1% to 8%, the accuracy increased rapidly; when it exceeded 8%, the accuracy increased slowly. Taking the iteration numbers and accuracy into account simultaneously, it is recommended to set the percentage of labeled data to 8%.

Visualization.
To further verify the above conclusions, the visualization of the third hidden layer with different percentages of labeled samples is given in Figures 13-15. In this manuscript, the t-Distributed Stochastic Neighbor Embedding (t-SNE) method was adopted to extract two features for visualization. t-SNE was proposed by van der Maaten and Hinton in 2008 [42] and has achieved good results in dimension reduction, clustering, and visualization. The horizontal and vertical axes represent the first two components obtained by t-SNE. Figure 13 is the visualization for 1% labeled samples; only four kinds of faults can be distinguished. Figure 14 is the visualization for 5% labeled samples; seven kinds of faults can be distinguished clearly, but there are no clear boundaries among the remaining three faults. Figure 15 is the visualization for 8% labeled samples; all ten kinds of faults can be easily distinguished.

Conclusions
In this manuscript, a novel SAE model with a dynamic learning rate is developed for bearing fault diagnosis, which can effectively overcome the shortcomings of a fixed learning rate. To verify its performance, the proposed method is applied to a typical bearing fault data set. According to the sign of the training error gradient, different learning rate updating strategies are used. In addition, the optimal network structure and the optimal percentage of labeled samples are determined through comparative experiments. The results show that the dynamic learning rate method can improve the accuracy and convergence ability of the network, and that the influence of the initial learning rate is very small.

Data Availability
The data used to support the findings of this study have been deposited in http://csegroups.case.edu/bearingdatacenter/home.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors' Contributions
All of the authors contributed equally to the conception of the idea, the design of experiments, the analysis and interpretation of results, and the writing of the manuscript. W.T. and H.P. wrote the original draft; H.P., J.X., and M.B. reviewed and edited the article. All authors have read and agreed to the published version of the manuscript.