Deep Transfer Learning-Based Fault Diagnosis for Gearbox under Complex Working Conditions

In the large amount of available data, information in historical data that is insensitive to faults interferes with gear fault feature extraction. Furthermore, most fault diagnosis models are learned from offline data collected under a single, fixed working condition, which may lead to unsatisfactory performance under complex working conditions (including multiple and unknown working conditions) if not properly dealt with. This paper proposes a transfer learning-based method for gear fault diagnosis to reduce the negative effects of these problems. In the proposed method, a cohesion evaluation method is applied to select the features sensitive to the task, and a transfer learning-based sparse autoencoder transfers the knowledge learnt under a single working condition to complex working conditions. The experimental results on a wind turbine drivetrain diagnostics simulator show that the proposed method is effective under complex working conditions and that the achieved results are better than those of traditional algorithms.


Introduction
With the extensive application of technology in industrial production, fault diagnosis is playing an increasingly important role. In production, abnormal accidents can be avoided and economic losses and casualties can be reduced through timely detection of equipment faults [1]. Data-driven fault diagnosis achieves high accuracy in practical diagnosis tasks for complex systems, such as gear faults in rotating machinery, whose complex structure makes mathematical modeling difficult [2,3]. It consists of three main directions: signal processing, statistical analysis, and artificial intelligence-based methods [4]. In signal processing methods, the signals are analyzed by techniques such as wavelet filters and singular spectrum analysis to extract fault features [5,6]. Statistical analysis methods use techniques such as principal component analysis and partial least squares to analyze the historical data [7,8]. Artificial intelligence-based methods apply techniques such as neural networks, support vector machines, and fuzzy logic to fault diagnosis [9][10][11].
Among artificial intelligence-based fault diagnosis methods, deep learning has been broadly applied to detecting abnormal situations. Deep learning methods such as recurrent neural networks and convolutional neural networks are widely studied and applied in fault diagnosis of industrial systems owing to their self-learning ability and adaptivity [12,13].
Sparse autoencoder (SAE) is a type of neural network that can learn features from unlabeled data, which was proposed based on autoencoder (AE) in 2006 [14]. AE takes the input information as its learning target to extract features and reduce dimensionality through encoding and decoding [15,16]. In fault diagnosis, AE is trained to extract the features of input data, which is not suitable when the distribution of testing data is different from training data [17,18]. To fortify the adaptability and flexibility of the network model, the concept of transfer learning is proposed to apply the knowledge learned in a pretrained model into novel tasks [19,20].
Based on the above review of artificial intelligence-based fault diagnosis methods, two main difficulties remain in fault diagnosis of practical complex systems such as rotating machinery: (1) Insensitive Information. Insensitive information can be described as the components caused by irrelevant variables in the original signals [21,22]. Liu combined a 1D autoencoder and a convolutional neural network to detect faults of rotating machinery in a noisy environment [23]. Wang adopted conditional variational neural networks to extract the features of a planetary gearbox in a noisy environment [24]. Zhang proposed a deep learning model based on a convolutional neural network with wide first-layer kernels for fault diagnosis to withstand interference information [25]. The research works reviewed above did not consider the effect of insensitive information, such as features that contribute little to, or even interfere with, fault diagnosis performance. The other problem with these approaches is that they did not consider the performance of the proposed methods under different working conditions, which is discussed next.
(2) Complex Working Conditions. In actual production, system operational parameters give rise to complex working conditions such as multiple working conditions and even unknown working conditions. A model trained under a single working condition cannot adapt effectively to such complex conditions [26]. Moreover, a serious distribution discrepancy can be observed between training data and testing data when the structures of the two data sets are different [27]. To solve this problem, researchers have proposed many approaches. Wang discussed domain adaptation for different conditions in transfer learning for gear fault diagnosis [28]. Hasan adopted transfer learning and a convolutional neural network for bearings so that the model is adaptable to different working conditions [29]. Qian proposed a new transfer learning method to detect faults of rotating machines under variable working conditions [30]. However, these research works have limitations: first, none of them discusses or analyzes both multiple working conditions and unknown working conditions; second, the handling of insensitive information, which can affect fault diagnosis performance, is not comprehensive.
Based on the above literature review, no research work at present investigates the effects of these two difficulties simultaneously. The contributions of this paper are as follows: (1) the problem involving both complex working conditions and insensitive information is investigated; (2) a deep transfer learning-based fault diagnosis method that combines sensitive feature selection with a transfer learning-based SAE is proposed for the abovementioned problems. To reduce the difficulty of signal analysis under complex conditions, transfer learning is applied to improve the accuracy of the model under such conditions. Transfer learning refers to applying the prior knowledge learned from one task to a different but related task, and was first proposed at the NIPS-95 seminar on "Learning to Learn" in 1995 [31]. Transfer learning reduces the cost of model construction and the data requirement when the source and target data differ, and it has been applied in fields such as data mining, image recognition, language translation, and fault diagnosis [32][33][34][35]. The rest of this paper is organized as follows. In Section 2, the proposed algorithm is introduced in detail, including sensitive feature selection, SAE, and MMD. In Section 3, hardware experiments are conducted on a wind turbine drivetrain diagnostics simulator to show the effectiveness of the proposed method for five gear fault types. The conclusion is given in Section 4.

The Proposed Method
In order to increase the accuracy of fault diagnosis under complex working conditions, a transfer learning-based fault diagnosis method using sensitive features selection is proposed in this section. The relevant methods and algorithm details are explained in the following four subsections.

Sensitive Features Selection.
The signal properties in the time and frequency domains, including amplitude, probability distribution, and energy, change when a fault occurs. In rotating machinery, usually 11 time domain and 13 frequency domain characteristic parameters are analyzed for fault diagnosis [36]. However, the large number of characteristic parameters incurs the following two problems: (1) fault features may not be accurately extracted due to the random components in the signal; (2) high-dimensional data increase the modeling difficulty [37]. In this paper, cohesion evaluation is applied to select the sensitive features: it retains sensitive features and removes insensitive features by evaluating the cohesion of each feature [38].
Suppose that there is a feature set containing $H$ categories, with $q_{m,h,j}$ denoting the $j$-th feature of the $m$-th sample in the $h$-th category, as shown in
$$\{ q_{m,h,j} \},\quad m = 1, 2, \ldots, M_h;\ h = 1, 2, \ldots, H;\ j = 1, 2, \ldots, J, \tag{1}$$
where $M_h$ represents the number of samples in category $h$ and $J$ is the number of features in each category. Table 1 lists the steps to compute the cohesion factor $c_j$, which reflects feature sensitivity from the intercategory and intracategory cohesion. Cohesion indicates the relationship among categories based on standard deviation, which reflects the details of the difference in the overall data distribution. In Table 1, the difference of intracategory distance and the difference of intercategory distance are computed based on the average distance from step 1 to step 6. The distance factor of every category is computed from the intercategory and intracategory ratios in step 7 and step 8. The average intracategory standard deviation is calculated to obtain the difference from step 9 to step 11. Next, the average intercategory cohesion difference is defined and computed from step 12 to step 16. Finally, in order to obtain the cohesion factor, a weighting factor similar to the distance factor is defined in step 17 to measure the cohesion difference. It is used to evaluate the cohesion of each category: categories can be distinguished if the cohesion difference between intracategory and intercategory is large.
Two problems mainly cause the difficulty in accurate sensitive feature extraction: (1) large intracategory distance: in this situation, sorting features by the distance evaluation factor $\alpha_j$ alone is not accurate, and some sensitive features can be discarded as insensitive because a large intracategory distance $d^{\mathrm{inner}}_j$ reduces their priority; (2) large intracategory cohesion difference: in this case, categories overlap if the intercategory cohesion difference is small, resulting in inaccurate selection of sensitive features. These two problems can be solved by combining distance and cohesion evaluations, which prevents either of them from having an excessive effect on the result. The sensitive factor $\eta_j$, which combines the distance evaluation factor and the cohesion factor, is computed in equation (2), where $coe$ is a coefficient that modulates the proportion of distance and cohesion evaluation and $\beta_j$ is the sensitivity weighting coefficient. The sensitivity weighting coefficient $\beta_j$ is given in (3) in terms of the distance evaluation factor $\alpha_j$ of step 8 and the cohesion factor $c_j$ of step 18 in Table 1. The sensitive factor $\eta_j$ reflects the degree of influence of different features on the categories. The features are sorted by sensitivity in descending order of $\eta_j$. Through the sensitivity factor, features that are sensitive to classification are retained, while the insensitive information is discarded without the need to identify the type of each feature. This preprocessing reduces the complexity of subsequent computation and can help improve the classification accuracy.
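The distance part of this selection can be illustrated with a short sketch. It is not the 18-step procedure of Table 1 and it omits the cohesion factor $c_j$ and the coefficient $coe$; it only assumes that a feature is scored by the ratio of its average intercategory distance to its average intracategory distance, and that the features with the largest scores are kept. The function names `distance_factor` and `select_sensitive` and the synthetic data are hypothetical.

```python
import numpy as np


def distance_factor(q, labels):
    """Score each feature by intercategory / intracategory distance.

    q      : array of shape (n_samples, n_features)
    labels : array of shape (n_samples,) holding the category of each sample
    Assumes at least two categories and at least two samples per category.
    """
    q = np.asarray(q, dtype=float)
    labels = np.asarray(labels)
    categories = np.unique(labels)

    d_inner = np.zeros(q.shape[1])
    centers = []
    for h in categories:
        block = q[labels == h]
        m = block.shape[0]
        # Mean absolute difference over all ordered pairs of distinct samples.
        diffs = np.abs(block[:, None, :] - block[None, :, :])
        d_inner += diffs.sum(axis=(0, 1)) / (m * (m - 1))
        centers.append(block.mean(axis=0))
    d_inner /= len(categories)

    # Mean absolute difference between the centers of distinct categories.
    centers = np.stack(centers)
    n_cat = len(categories)
    c_diffs = np.abs(centers[:, None, :] - centers[None, :, :])
    d_outer = c_diffs.sum(axis=(0, 1)) / (n_cat * (n_cat - 1))

    return d_outer / d_inner


def select_sensitive(q, labels, keep):
    """Return the indices of the `keep` highest-scoring features and all scores."""
    scores = distance_factor(q, labels)
    order = np.argsort(scores)[::-1]
    return order[:keep], scores


# Synthetic example: feature 0 separates the categories, feature 1 is noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)
feature_0 = 5.0 * labels + rng.normal(scale=0.1, size=labels.size)
feature_1 = rng.normal(scale=1.0, size=labels.size)
q = np.column_stack([feature_0, feature_1])

selected, scores = select_sensitive(q, labels, keep=1)
```

In this toy case the separating feature receives a much larger score than the noise feature, so it is the one retained.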

Transfer Learning.
Transfer learning is a learning mode that applies prior knowledge learnt in one task to solving related but different tasks. The prior knowledge, including data features and labels, can assist the analysis of a related task that is difficult to process directly because of data collection difficulty, high modeling cost, or long training time. In transfer learning, a domain refers to a data set and its probability distribution. In particular, the domain containing prior knowledge is called the source domain, and the domain containing unknown knowledge is called the target domain [39]. The aim of transfer learning is to learn the target task with the help of the knowledge of the source task, such as features, parameters, and labels.
Transfer learning is effective when there is a connection between the source domain and the target domain. So far, the most studied scenario in transfer learning is to reduce the difference between the source data and the target data with the same tasks [40]. In this case, transfer learning maintains the reusability of the model by reducing the distribution differences between data sets. As research on transfer learning has deepened, a few studies have started to work on the scenario in which the tasks of the source and target domains are different but the data set is the same [41]. In this paper, the data collected under a single working condition and under complex working conditions are treated as source data and target data, respectively. Transfer learning preserves the reusability of the model trained on data from a single working condition by measuring and reducing the distribution difference between the source data and the target data.

Table 1: Steps to compute the cohesion factor $c_j$ (columns: step, process parameter, equation). The surviving entries name an intracategory quantity, the difference between categories, the distance weighting factor, the distance of each feature $cd_{n,r,s,j} = |q_{n,s,j} - q_{r,s,j}|$, the quadratic sum of feature distance, the standard deviation of feature distance (intracategory cohesion), and the cohesion weighting factor.

Transfer Learning-Based Sparse Autoencoder.
The sparse autoencoder is developed from the autoencoder, which is an unsupervised learning network with an encoder and a decoder. As shown in Figure 1, the encoder reduces the dimension of the input data for feature extraction, and the decoder recombines the encoded information and restores it to the original data [42,43]. SAE improves the ability of feature extraction by adding a sparsity limitation to the neurons in the hidden layer. In this paper, a three-layer SAE is applied as the network model, and its structure is introduced below. After the preprocessing, an $m \times n$ data set can be represented as $X = \{x_{ij}\}$, where $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$; $m$ is the number of samples, and $n$ is the dimension of each sample. The source features obtained by the encoder are denoted by $\xi$, and the output of the decoder $\hat{X}$ is close to $X$. The parameter set is represented by $\theta = \{W_e, B_e, W_d, B_d\}$, where $W_e$ and $W_d$ are the weights of the encoder and decoder, respectively, and $B_e$ and $B_d$ are the biases of the encoder and decoder, respectively. Based on the above introduction, the features $\xi$ and the output of the decoder $\hat{X}$ are given by
$$\xi = \sigma(W_e X + B_e), \tag{4}$$
$$\hat{X} = \sigma(W_d \xi + B_d), \tag{5}$$
where $\sigma$ is the sigmoid activation function
$$\sigma(z) = \frac{1}{1 + e^{-z}}. \tag{6}$$
To restrict the number of active nodes in the hidden layer, the Kullback-Leibler (KL) divergence is used as a sparsity penalty.
The average activation value of the $k$-th node in the hidden layer, $\rho_k$, is calculated in (7), where $w^e_{ik}$ and $b^e_k$ belong to $W_e$ and $B_e$. By using relative entropy and (7), KL is represented as
$$KL(\rho \,\|\, \rho_k) = \rho \log\frac{\rho}{\rho_k} + (1 - \rho)\log\frac{1 - \rho}{1 - \rho_k}, \tag{8}$$
where $\rho$ is a predefined sparsity parameter. To make the activation values in the hidden layer sparse, the value of $\rho$ should be close to 0. For this purpose, $\rho_k$ is adjusted toward $\rho$, since KL attains its minimum value of 0 when $\rho_k = \rho$. The cost function of SAE, $J_{SAE}(\theta)$, is expressed in (9), where $\hat{x}_i$ is the output, $L$ is the loss function of SAE, $W_e$ and $W_d$ are the weights of the encoder and decoder, and $\alpha$ and $\beta$ are the weight parameters. By minimizing $J_{SAE}(\theta)$, the features of data obtained offline, such as data from a single working condition, can be extracted as a priori knowledge for transfer learning.
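A minimal NumPy sketch of the forward pass and a cost of this kind is given below. It is an illustration under stated assumptions rather than the authors' implementation: samples are stored as rows (so the encoder is written `X @ W_e`), the loss $L$ is taken to be a squared reconstruction error, and the weight term is a sum of squared weights. The sizes (80 inputs, 60 hidden nodes) and the values $\rho = 0.3$ and $\alpha = 0.001$ are those reported in the experimental section; the function names and the random data are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def kl_divergence(rho, rho_k):
    """Relative entropy between the target sparsity rho and activations rho_k."""
    return rho * np.log(rho / rho_k) + (1 - rho) * np.log((1 - rho) / (1 - rho_k))


def sae_cost(X, W_e, B_e, W_d, B_d, rho=0.3, alpha=0.001, beta=3.0):
    """Return an SAE-style cost and the hidden features for a batch X (m x n)."""
    xi = sigmoid(X @ W_e + B_e)        # hidden features, shape (m, hidden)
    X_hat = sigmoid(xi @ W_d + B_d)    # reconstruction, shape (m, n)
    rho_k = xi.mean(axis=0)            # average activation of each hidden node

    # Assumed loss: half the mean squared reconstruction error per sample.
    reconstruction = 0.5 * np.mean(np.sum((X - X_hat) ** 2, axis=1))
    # Assumed weight term: sum of squared encoder and decoder weights.
    weight_term = alpha * (np.sum(W_e ** 2) + np.sum(W_d ** 2))
    sparsity_term = beta * np.sum(kl_divergence(rho, rho_k))
    return reconstruction + weight_term + sparsity_term, xi


# Random stand-in data with the layer sizes used in the experiments.
rng = np.random.default_rng(0)
X = rng.random((8, 80))
W_e = rng.normal(scale=0.1, size=(80, 60))
B_e = np.zeros(60)
W_d = rng.normal(scale=0.1, size=(60, 80))
B_d = np.zeros(80)

cost, xi = sae_cost(X, W_e, B_e, W_d, B_d)
```

The KL term vanishes for a hidden node whose average activation equals $\rho$ and grows as the activation moves away from it, which is what pushes the hidden layer toward sparse activity.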

Maximum Mean Discrepancy.
Maximum mean discrepancy (MMD) is a distance that measures the difference in probability distribution between two data sets $X$ and $Y$, and it is widely applied in transfer learning [44]. When the probability distributions of $X$ and $Y$ are different, the same classification model cannot be expected to achieve satisfactory performance on both [45]. In the problem discussed in this paper, a large probability distribution difference means that the model trained with data obtained under a single working condition is not applicable to complex working conditions. The accuracy of the model can be improved by minimizing MMD with a transformation function, which minimizes the distance between the transformed feature sets obtained in sensitive feature selection. Suppose that the probability distributions of data sets $X$ and $Y$ are $p$ and $q$, respectively; the expression of MMD is as follows:
$$D_H(X, Y) = \sup_{\|f\|_H \le 1} \left( E_{x \sim p}[f(x)] - E_{y \sim q}[f(y)] \right), \tag{10}$$
where $H$ represents the Reproducing Kernel Hilbert Space (RKHS). RKHS is a complete inner product space that can map a data set that is not linearly separable to a high dimensional space [46]. Equation (10) takes the supremum, over functions in the RKHS, of the difference between the expectations under the two distributions. For the convenience of computation, the square of MMD, $D_H^2(X, Y)$, is applied:
$$D_H^2(X, Y) = \left\| \frac{1}{N_X}\sum_{i=1}^{N_X} \phi(x_i) - \frac{1}{N_Y}\sum_{j=1}^{N_Y} \phi(y_j) \right\|_H^2, \tag{11}$$
where $\phi$ is the mapping into the RKHS and $N_X$ and $N_Y$ are the sample numbers of $X$ and $Y$. The smaller the value of MMD is, the smaller the probability distribution discrepancy between the two data sets is.

[Figure 1: Autoencoder structure with input layer, hidden layer, and output layer.]
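The squared MMD can be estimated from samples without forming the RKHS mapping explicitly, because the squared norm expands into averages of kernel evaluations. The sketch below assumes a Gaussian (RBF) kernel with a hand-picked width `gamma`; the kernel choice, the function names, and the synthetic data are assumptions made for illustration, since the text does not state which kernel is used.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dist = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def mmd_squared(X, Y, gamma=0.1):
    """Biased sample estimate of the squared MMD between X and Y.

    Equals the squared RKHS distance between the two empirical mean
    embeddings: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    """
    k_xx = rbf_kernel(X, X, gamma).mean()
    k_yy = rbf_kernel(Y, Y, gamma).mean()
    k_xy = rbf_kernel(X, Y, gamma).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Synthetic check: a shifted distribution should give a larger value.
rng = np.random.default_rng(0)
source = rng.normal(size=(100, 5))
target_same = rng.normal(size=(100, 5))
target_shifted = rng.normal(loc=2.0, size=(100, 5))

mmd_same = mmd_squared(source, target_same)
mmd_shifted = mmd_squared(source, target_shifted)
```

A data set compared with itself gives a value of zero, and the estimate grows as the two sample sets drift apart, which is the quantity the adaptation stage tries to reduce.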

The Proposed Algorithm.
In this subsection, an improved algorithm is proposed to transfer the knowledge learnt under a single working condition so that it is usable under complex working conditions; its architecture is plotted in Figure 2. In data collection, the data are separated into two parts: data under a single working condition (called source data) and data under complex working conditions (called target data), which consist of multiple and unknown working conditions. Before feature extraction, sensitive features are selected through cohesion evaluation to constitute a sensitivity parameter set as input data. In the network, after the training on the source parameter set is completed, the parameters of the SAE are reused for learning the target labels. To apply the knowledge learnt from source data to the target task through transfer learning, the distance between the source features and the target features is minimized by MMD. The trained target features are classified by a softmax classifier to obtain the target labels. The expected effect of the proposed algorithm is illustrated in Figure 3. In this figure, it is assumed that the source data and target data have a large probability distribution difference and that there are three features in both sets: feature 1 and feature 2 are sensitive to fault 1 and fault 2, respectively, while feature 3 is insensitive to both fault 1 and fault 2. The red patterns represent the incorrectly classified samples. The classification result without sensitive feature selection and transfer learning is shown in Figure 3(a): part of the sensitive features is classified incorrectly because of the large probability distribution difference, and feature 3 is kept and dispersed into the two fault types, which interferes with the accurate description of the faults. Figure 3(b) shows the classification result after insensitive feature 3 is discarded, but the large probability distribution difference still results in inaccurate classification of feature 1 and feature 2.
In Figure 3(c), after the probability distribution difference between the two data sets is minimized, feature 1 and feature 2 are classified correctly, but feature 3 is retained as useless information. Although the methods in Figures 3(b) and 3(c) bring improvements to some extent, they fail to solve all the problems shown in Figure 3.
The proposed algorithm, summarized in Table 2, consists of three parts: sensitive feature selection from step 1 to step 3, network training of source data in step 4, and network adaptation by transfer learning from step 5 to step 7. The detailed information of each part is explained as follows:
(1) Sensitive feature selection: first, the training data set $D_s$ and the testing data set $D_c$ are collected under single and complex working conditions of rotating machinery; second, features are computed and sorted by the cohesion evaluation shown in Table 1; third, sensitive features are chosen according to the sensitive factor $\eta_j$ in (2) and are retained to constitute the sensitivity parameter set as input data.
(2) Network training of source data: the total cost function of the proposed algorithm, $J(\theta)$ in (12), is made up of $J_{SAE}(\theta)$ and MMD, where $\alpha$, $\beta$, and $\tau$ are weight parameters, the first three terms are the cost function of SAE, and the last term is the square of MMD. In the training on $D_s$ only, the MMD coefficient $\tau$ is set to 0, and the network is trained to obtain the updated model parameter set $\theta$ and the data features.
(3) Network adaptation by transfer learning: the parameter set $\theta$ obtained from the source data is used as the initial parameter set for training on $D_c$. In the adaptation stage, to minimize the probability distribution difference between $D_s$ and $D_c$, the value of MMD is minimized by giving the coefficient $\tau$ different values. The model parameters and features under complex working conditions are obtained to classify the fault types. In this paper, a softmax classifier is chosen to solve this multiclass task; it maps the features obtained above to a vector of probabilities in (0, 1), and the class whose probability is closest to 1 is taken as the classifier output.
Shock and Vibration
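The role of $\tau$ and of the softmax classifier can be sketched in a few lines. The sketch only assumes what the text states: the total cost adds the squared MMD, weighted by $\tau$, to the SAE cost, so that $\tau = 0$ recovers source-only training, and the softmax output with the largest probability gives the predicted fault type. The function names and the numeric inputs are hypothetical stand-ins, not values from the experiments.

```python
import numpy as np


def total_cost(j_sae, mmd_squared, tau):
    """SAE cost plus the squared MMD weighted by tau (tau = 0: no adaptation)."""
    return j_sae + tau * mmd_squared


def softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


def predict(scores):
    """Index of the most probable class for each row of scores."""
    return softmax(scores).argmax(axis=1)


# Stand-in numbers: the same SAE cost and MMD value under two stages.
source_stage = total_cost(j_sae=1.2, mmd_squared=0.4, tau=0)
adaptation_stage = total_cost(j_sae=1.2, mmd_squared=0.4, tau=10)

# Stand-in classifier scores for two samples and five gear types.
scores = np.array([
    [2.0, 0.5, 0.1, 0.0, -1.0],
    [0.0, 0.0, 3.0, 0.0, 0.0],
])
probabilities = softmax(scores)
predicted = predict(scores)
```

With $\tau = 0$ the MMD term has no influence on the parameters, which is why the source-only network serves as the no-adaptation baseline in the experiments.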

Experiment Setup.
In this paper, wind turbine drivetrain diagnostics simulator (WTDS) is used to collect the experiment data to verify the effectiveness of the proposed scheme. WTDS consists of a planetary gearbox, a fixed axis gearbox, a magnetic brake, a motor, and 4 sensors as illustrated in Figure 4. Figure 5 shows the working diagram of WTDS, in which the speed and load of WTDS are controlled by computer to change the working conditions through speed and brake controllers. Data are collected to the computer by four sensors, including two vibration sensors, a pressure sensor, and a torque sensor, indicated in Figure 5.
In this research, experiments are conducted on WTDS with five different types of gears: normal gear, surface worn, missing tooth, chipped tooth, and root crack, as shown in Figure 6. Nine working conditions of WTDS under different load voltages and rotating speeds are considered, as shown in Table 3.

Table 2: Steps of the proposed algorithm.
Step 1: Collect the training data set $D_s$ and the testing data set $D_c$ under single and complex working conditions.
Step 2: Compute the features and sort them by the cohesion evaluation shown in Table 1.
Step 3: Calculate the sensitive factor $\eta_j$ in (2) and keep the features for which the value of $\eta_j$ is large. These parameters constitute the sensitivity parameter set used as input data.
Step 4: Let $\tau = 0$ in (12); train the network to obtain a suitable parameter set $\theta$ and the source features.
Step 5: Assign $\tau$ suitable values in (12) and adapt the network on the target data set $D_c$, using $\theta$ from step 4 as initial parameters, until the cost function in (12), which compares the distance between the target features and the source features, is minimized.
Step 6: After step 5 is done, record the parameters and features of testing.
Step 7: Send the features into the classifier to obtain the fault types.

Data Collection.
Signals from four sensors are collected with a sampling frequency of 5120 Hz and a sampling time of 6.4 s. To make the data more reliable, the original signals are the average value of the four sensors over 3 experiments. After sensitive feature selection, the training set and the testing set in every group each consist of 80 samples. Figure 7 illustrates the vibration signals in the time domain and the frequency domain, respectively.
The results under complex working conditions, consisting of multiple and unknown working conditions, are analyzed separately in this research. Nine data set groups are defined in Table 4; each contains a training set and a testing set. The details of multiple and unknown working conditions are explained as follows: (1) Multiple working conditions, group 1 to group 6: to simulate data collected under multiple working conditions, each testing set is composed of five parts, taken from the five working conditions other than the one used for training. The data with load voltages of 5 V and 8 V are selected as the 6 training sets, and the data under the other five working conditions shown in Table 3 (excluding the working condition of the training set) are randomly mixed into testing sets under multiple working conditions, labeled multi-A to multi-F. For example, in group 1, the testing set multi-A is the mixture of the data under 6 Hz-8 V, 10 Hz-5 V, 10 Hz-8 V, 14 Hz-5 V, and 14 Hz-8 V without 6 Hz-5 V, because 6 Hz-5 V is the condition of the training set of group 1. (2) Unknown working conditions, group 7 to group 9: data with load voltages of 8 V and 3 V are selected as training sets and testing sets, respectively, labeled single-G to single-I. In these three groups, the working conditions of the testing sets are entirely different from those of the training sets, so that the testing conditions are unknown to the trained model. The two voltages are chosen with a large difference in order to observe the performance of transfer learning under unknown working conditions.
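As a worked illustration of these acquisition settings, a 6.4 s record at 5120 Hz contains 5120 × 6.4 = 32768 samples. The sketch below computes that count and a few common time domain and frequency domain parameters for a synthetic signal. The particular parameters (RMS, kurtosis, spectral centroid, peak frequency), the function name `basic_features`, and the 100 Hz test tone are assumptions for illustration; the 11 time domain and 13 frequency domain parameters used in the paper are not listed in the text.

```python
import numpy as np

FS = 5120        # sampling frequency in Hz (from the text)
DURATION = 6.4   # sampling time in s (from the text)
N_SAMPLES = int(round(FS * DURATION))


def basic_features(signal, fs):
    """Return a few illustrative time and frequency domain parameters."""
    signal = np.asarray(signal, dtype=float)
    centered = signal - signal.mean()

    rms = np.sqrt(np.mean(signal ** 2))
    kurtosis = np.mean(centered ** 4) / np.mean(centered ** 2) ** 2

    spectrum = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    centroid = np.sum(freqs * spectrum) / np.sum(spectrum)
    peak_freq = freqs[np.argmax(spectrum)]

    return {
        "rms": rms,
        "kurtosis": kurtosis,
        "spectral_centroid": centroid,
        "peak_frequency": peak_freq,
    }


# Synthetic stand-in for one record: a unit-amplitude 100 Hz tone.
t = np.arange(N_SAMPLES) / FS
tone = np.sin(2.0 * np.pi * 100.0 * t)
features = basic_features(tone, FS)
```

The frequency resolution of such a record is FS / N_SAMPLES = 0.15625 Hz, so closely spaced spectral components can be resolved.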
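The construction of the six multiple-condition groups can be written down directly from this description. In the sketch below, the speeds (6, 10, and 14 Hz) and voltages (5 V and 8 V) are taken from the group 1 example; the order in which conditions are assigned to groups 2 to 6 and the function name `multi_condition_groups` are assumptions, since Table 4 is not reproduced here. The unknown-condition groups are left out because the pairing of training and testing speeds is not stated.

```python
from itertools import product

SPEEDS_HZ = (6, 10, 14)
VOLTAGES_V = (5, 8)
TEST_LABELS = "ABCDEF"


def multi_condition_groups():
    """Pair each training condition with the five remaining conditions."""
    conditions = [f"{s} Hz-{v} V" for s, v in product(SPEEDS_HZ, VOLTAGES_V)]
    groups = []
    for index, train in enumerate(conditions):
        groups.append({
            "group": index + 1,
            "train": train,
            "test_label": f"multi-{TEST_LABELS[index]}",
            # The testing set mixes every condition except the training one.
            "test": [c for c in conditions if c != train],
        })
    return groups


groups = multi_condition_groups()
for g in groups:
    print(g["group"], g["train"], g["test_label"], g["test"])
```

Each group thus trains on one condition and is tested on a mixture of the other five, so no testing sample shares its working condition with the training set.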

Experimental Results.
Before the experimental results are analyzed, the architecture of SAE is described: the value of $\eta_j$ is set as 80 to obtain the sensitive feature set. Accordingly, the input layer and the output layer of SAE each have 80 nodes, and the hidden layer has 60 nodes with a sparsity limitation of 0.3. To observe the effects of KL divergence and transfer learning in the cost function in (12), the values of $\beta$ and $\tau$ are varied to adjust the KL and MMD terms, and the value of $\alpha$ is predefined as a constant ($\alpha = 0.001$). The ranges of the KL weight parameter $\beta$ and the MMD weight parameter $\tau$ are set by experimentation and affect the results positively: $\beta$ is searched over $\{1, 2, 3, 4, 5\}$ and $\tau$ is searched over $\{0, 1, 5, 10, 15, 20\}$ to test the performance of the proposed algorithm. In particular, $\tau = 0$ implies no domain adaptation in the cost function of (12). The rest of this subsection discusses the performance of transfer learning, the classification results under multiple working conditions and unknown working conditions, and the performance of the proposed algorithm compared with other feature extraction methods.
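The search over $\beta$ and $\tau$ amounts to evaluating 5 × 6 = 30 combinations. A minimal sketch of such a grid search follows; the two value sets come from the text, while `grid_search` and `placeholder_score` are hypothetical names, and the placeholder scorer is an arbitrary function that stands in for training and testing the network, not experimental data.

```python
from itertools import product

BETAS = (1, 2, 3, 4, 5)
TAUS = (0, 1, 5, 10, 15, 20)


def grid_search(evaluate):
    """Evaluate every (beta, tau) pair and return the best pair and all scores."""
    results = {(beta, tau): evaluate(beta, tau) for beta, tau in product(BETAS, TAUS)}
    best = max(results, key=results.get)
    return best, results


def placeholder_score(beta, tau):
    """Arbitrary stand-in for a classification accuracy; peaks at (3, 10)."""
    return -((beta - 3) ** 2) - ((tau - 10) ** 2)


best, results = grid_search(placeholder_score)
```

In practice `evaluate` would train the network with the given weights and return the classification accuracy on the testing set of one group.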

Performance of Transfer Learning.
To observe the influence by transfer learning, the parameter β is fixed at 3 and only parameter τ is changed from 0 to 20. After testing the model with different τ values, the classification accuracies of the six data set groups are shown in Figure 8 and the MMD variation curve is shown in Figure 9.
From Figures 8 and 9, it is apparent that the classification accuracies of all the nine data sets can be improved with MMD term and the corresponding MMD values can be reduced to be around 0.1. When τ � 0, the network trained under single working condition has poor adaptability to complex working conditions, with the classification accuracies of nine data sets fluctuating around 85% which are represented in the blue line in Figure 8. With transfer learning by using MMD, all the classification accuracies of nine data sets are improved significantly with values fluctuating around 96%.
This result indicates that transfer learning has a positive effect on both multiple and unknown working conditions.

Classification Results.
The classification results for the multiple and unknown working conditions defined in Table 4 are presented in this part. To observe the effect of the proposed method on the experimental results when the KL divergence weight changes, the average results of multiple working conditions are displayed in Table 5 for different combinations of $\beta$ and $\tau$ over group 1 to group 6. It can be observed that the highest classification accuracy reaches 99.17% (when $\beta = 2$ and $\tau = 20$) and that the classification accuracies are all below 90% without domain adaptation (when $\tau = 0$). This analysis shows that the proposed method brings a significant improvement under multiple working conditions after the probability distribution difference between the training and testing sets is reduced. The average results of group 7 to group 9, which represent the unknown working conditions, are shown in Table 6. In the first row of Table 6, although the classification accuracy reaches 91.70% when $\beta = 5$ without domain adaptation, the results fluctuate widely, with the lowest being 83.33%. After domain adaptation, all the classification accuracies are higher than 95%, with the highest reaching 100% when $\beta = 2$ and $\tau = 15$. The above analysis indicates that, after the probability distribution difference between the unknown data set and the training data set is reduced, the trained network is adaptable to the unknown working conditions.

Comparison with Different Feature Diagnosis Methods.
Data-driven fault diagnosis methods contain three main directions: artificial intelligence-based methods, statistical analysis, and signal processing. To explore the performance of the proposed method, it is compared with the following three data-driven methods: traditional SAE, principal component analysis (PCA), and wavelet transform (WT), which are representative of the three directions mentioned above. Figure 10 shows the classification results of the nine experimental groups, and the visualization results are shown in Figure 11. As shown in Figure 10, the average accuracies of PCA are around 80%, the lowest of which drops to 70% in group 6. Both SAE and WT perform better, mostly fluctuating between 80% and 90%, with the highest accuracy of traditional SAE exceeding 90% in group 9. In contrast, the results of the proposed method are all over 95%, which is more accurate overall than the other methods in this figure. Figure 11 displays the distribution of the classification results of the four methods for the five types of gears in data set multi-D (when $\beta = 3$). In Figure 11(c), the surface worn gears are classified, but the other four types are not separated. In Figure 11(d), the result of WT shows that the root crack gears are classified well and that a small part of the missing tooth samples is wrongly classified as chipped tooth. Similar to PCA, the other three types are not separated. From Figures 10 and 11, it is clear that the proposed method performs better than the other three methods under both multiple and unknown working conditions.

Conclusion
In this paper, to investigate the fault diagnosis problem under complex working conditions, a fault diagnosis method for gearboxes based on transfer learning is introduced. The proposed method selects sensitive features to decrease the adverse impact of insensitive information and transfers the knowledge learnt under a single working condition to complex working conditions through transfer learning. To verify the performance of the model under complex working conditions, experiments are carried out on a wind turbine drivetrain diagnostics simulator, which simulates five gear fault types. The results are compared with those of traditional SAE, PCA, and WT, and the comparison indicates that the classification accuracy is significantly improved after sensitive feature selection and transfer learning. Future work can extend the current research to other working conditions and data sets.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest in the work reported in this paper.