Deep Transfer Learning Method Based on 1D-CNN for Bearing Fault Diagnosis

In mechanical fault diagnosis, it is rarely possible to collect massive labeled samples with the same distribution in real industrial settings. Transfer learning is a promising way to address this problem. However, as the number of samples increases, the interdomain distribution discrepancy measure used by existing methods has a higher computational complexity, which may worsen the generalization ability of the method. To solve this problem, we propose a deep transfer learning method based on 1D-CNN for rolling bearing fault diagnosis. First, a one-dimensional convolutional neural network (1D-CNN) is used as the basic framework to extract features from vibration signals. CORrelation ALignment (CORAL) is employed to minimize the marginal distribution discrepancy between the source domain and the target domain. Then, the cross-entropy loss function and the Adam optimizer are used to minimize the classification errors and the second-order statistics of the feature distance between the source domain and target domain, respectively. Finally, based on the bearing datasets of Case Western Reserve University and Jiangnan University, seven transfer fault diagnosis comparison experiments are carried out. The results show that our method performs better than the compared methods.


Introduction
As an essential component of mechanical systems, bearings are widely used in rotating machinery. Once a bearing fails, it can cause failure of the mechanical system, resulting in machinery shutdown and even casualties. Therefore, bearing fault diagnosis is of great significance to the health and safe operation of machinery and has attracted increasing attention from scholars and the manufacturing industry [1][2][3].
In the last decade, with the increasingly complex structure of mechanical equipment and the rapid development of sensor technology, the acquisition of vibration signals has become easy, which has brought new perspectives and challenges to the traditional intelligent fault diagnosis of rotating machinery [4,5]. Zhang et al. presented a fault diagnosis and location method based on artificial neural networks (ANNs) [6], and a support vector machine (SVM) method was shown to obtain a higher classification accuracy rate for machine diagnosis in [7]. A random forest (RF) classifier was used for rolling bearing fault diagnosis in [8], and Li et al. [9] used variational mode decomposition (VMD) and the kernel extreme learning machine (ELM) for bearing fault diagnosis. Shi et al. proposed an intelligent fault diagnosis method based on deep learning and a particle swarm optimization support vector machine [10]. He et al. reported an intelligent bearing fault diagnosis method based on a sparse autoencoder [11]. In [12], the authors proposed a CNN model based on dislocated time series for fault diagnosis. An ANN was used to model and identify fault signals in [13]. Although the traditional intelligent fault diagnosis methods mentioned above can achieve good results, they are all based on the following two assumptions: (1) a large number of labeled fault samples are available and (2) the training and testing samples share the same marginal probability distribution. However, in actual engineering, it is a luxury to collect massive labeled fault samples, and the data collected under unknown operating conditions are not drawn from the same marginal probability distribution [14].
In order to satisfy the actual needs of bearing fault diagnosis in engineering practice, transfer learning, a classification approach that applies the knowledge learned from a source domain to an unknown target domain, has attracted increasing attention in fault diagnosis [15,16]. Li et al. used a multilayer domain adaptation method to estimate the discrepancy between the source domain and target domain for fault diagnosis [17]. A transfer learning method based on hierarchical deep domain adaptation was proposed for fault diagnosis [18]. An instance transfer learning method based on the long-term memory recurrent neural network model was proposed to address the difficulty of obtaining a large number of labeled fault data [19]. In order to overcome the problems of a sparse feature space and few unlabeled fault data in some modes, Hao et al. proposed a multimodel transfer learning method for chemical process fault diagnosis [20]. An and Ai proposed a fault diagnosis method based on an end-to-end unsupervised domain adaptive Riemann CORrelation ALignment metric [21]. Xu et al. presented a transfer component analysis (TCA) method that learns transfer components across domains in order to alleviate insufficient data conditions [22].
Based on the literature analysis above, transfer learning has made a great breakthrough for problems with insufficient training data and data collected under varying conditions. However, the existing transfer learning-based methods mainly focus on how to measure the interdomain marginal distribution discrepancy of features in domain adaptation. MMD (maximum mean discrepancy), a well-known distance metric for domain adaptation, has been widely adopted in marginal distribution optimization and has achieved good performance [23][24][25]. Nevertheless, the limitation of MMD in domain adaptation is that the computation cost increases sharply as the number of samples increases, resulting in poor generalization ability. Thus, it is difficult to meet the real-time and generalization requirements of fault diagnosis methods in real-world industries.
Motivated by the analysis above, this paper proposes an intelligent bearing fault diagnosis method based on deep transfer learning with a CORAL loss metric, which is used to measure the interdomain marginal distribution discrepancy. First, as the basic feature representation-learning framework, a CNN is used to obtain a robust feature space from vibration signals. To estimate the marginal distribution discrepancy between the source domain and target domain, the nonlinearly transformed CORAL domain adaptation is exploited to minimize this discrepancy and, at the same time, to constrain the CNN parameters so that the CNN learns a more robust feature representation. Then, two objectives need to be optimized. One objective is a condition classifier based on CNN, which uses the cross-entropy loss function to minimize the classification error. The other is the second-order statistics distance between the features of the source domain and target domain, which is optimized by the Adam method.
Finally, twelve comparative experiments based on the Case Western Reserve University bearing dataset and six comparative experiments based on the Jiangnan University bearing dataset are carried out to verify the effectiveness of our method. The main contributions of our method are as follows:
(1) A one-dimensional (1-D) CNN is built to extract representation features from the original vibration signal, and domain adaptation is performed only in the last two layers, rather than in the last three layers as in other CNN-based methods, in order to reduce the computation cost.
(2) A differentiable loss function is constructed to extend CORAL metric domain adaptation so as to minimize the marginal distribution discrepancy based on the covariance of cross-domain representation features.
(3) Two objectives are optimized by the cross-entropy loss function and the Adam optimizer to minimize the classification error of the CNN and the second-order statistics distance between the features of the source domain and target domain, respectively.
The remainder of this paper is organized as follows. In Section 2, the theory of TL, CNN, and CORAL is briefly introduced. In Section 3, our method is described in detail. The comparison experiments for verifying the performance of the proposed method on the bearing datasets are presented in Section 4. Conclusions are given in Section 5.

Theoretical Background
In this section, we mainly introduce the model structures of TL and CNN, which are used to transfer learned knowledge from the source domain to the target domain and to classify faults, respectively. In addition, we introduce the relevant theoretical background of CORAL.

Transfer Learning.
TL, an important branch of machine learning, is usually used to tackle the problems of insufficient data and inconsistent marginal distributions by transferring knowledge learned from training data to testing data. It has been widely used in fault diagnosis [26]. The source domain $D_s$ contains abundant discriminative knowledge and is the main object to be transferred. The target domain $D_t$ receives the knowledge learned from $D_s$; transfer learning tries to apply the discriminative knowledge previously learned from $D_s$ to $D_t$. $x_i$ and $X$ represent the feature of the $i$th sample and the feature space of the samples, respectively. The indices $s$ and $t$ identify the source domain and the target domain, respectively. The class spaces of the source domain and target domain are denoted $Y_s$ and $Y_t$; meanwhile, $y_s$ and $y_t$ represent the categories of the source domain and target domain, respectively. The classification accuracy of traditional machine learning methods drops sharply when the source domain and target domain do not have the same marginal distribution. Thus, domain adaptation is used to weaken the influence of the marginal distribution inconsistency between the two domains [27]. As shown in Figure 1, the marginal distribution of the target domain data is quite different from that of the source domain before domain adaptation. After domain adaptive learning, the difference between the marginal distributions of the two domains is reduced, which achieves the transfer from the source domain to the target domain.
Given a labeled source domain $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ and an unlabeled target domain $D_t = \{x_i^t\}_{i=1}^{n_t}$, the marginal distributions of the two domains are $P(x_s)$ and $P(x_t)$ and satisfy $P(x_s) \neq P(x_t)$, but their feature spaces meet $X_s = X_t$, and their category spaces are the same, that is, $Y_s = Y_t$. MMD, a widely used distance in transfer learning for measuring the interdomain distribution discrepancy, is introduced as a new regularization term in the loss function to make the distribution discrepancy of the two domains as small as possible. It obtains a nonparametric distance between interdomain feature distributions without estimating intermediate densities. To measure the marginal distribution discrepancy by mapping the data into the reproducing kernel Hilbert space (RKHS), MMD is defined as follows:

$$\mathrm{MMD}_H(X_s, X_t) = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} \phi(x_i^s) - \frac{1}{n_t} \sum_{j=1}^{n_t} \phi(x_j^t) \right\|_H$$

where $\phi(\cdot)$ is the nonlinear mapping from the original feature space to the RKHS and $H$ indicates that the distance is measured in the RKHS. For a detailed introduction to MMD, please refer to [28].
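As a rough illustration of the MMD idea described above, the following Python sketch measures the distance between the mean embeddings of two sample sets. The identity map used as the default `phi` is an assumption made only to keep the example short (a kernel-induced map would normally be used), and the function names are hypothetical.

```python
import math

def mean_embedding(samples, phi):
    """Average of phi(x) over a list of feature vectors."""
    mapped = [phi(x) for x in samples]
    dim = len(mapped[0])
    return [sum(v[d] for v in mapped) / len(mapped) for d in range(dim)]

def mmd(source, target, phi=lambda x: x):
    """Empirical MMD: norm of the difference between the two mean embeddings."""
    mu_s = mean_embedding(source, phi)
    mu_t = mean_embedding(target, phi)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_s, mu_t)))

source = [[0.0, 1.0], [2.0, 3.0]]
target_same = [[2.0, 3.0], [0.0, 1.0]]     # same samples, different order
target_shifted = [[1.0, 2.0], [3.0, 4.0]]  # every coordinate shifted by 1

print(mmd(source, target_same))     # 0.0, the sample means coincide
print(mmd(source, target_shifted))  # sqrt(2), each mean coordinate differs by 1
```

With the identity map the measure only sees a shift of the mean; a richer feature map is what lets MMD react to differences in higher-order statistics.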

Convolutional Neural Network.
CNN, one of the most representative networks in deep learning, has been extensively used in civil structures, mechanical structures, and wind engineering [29,30]. It has three types of layers: the convolution layer, the pooling layer, and the fully connected layer. The convolution layer is the core layer of CNN and contains a set of trainable filters. Weight sharing is its most important characteristic: it reduces the number of network parameters, which avoids the overfitting caused by too many parameters and relieves the computational load. The convolution operation is expressed as follows:

$$x_j^{l} = f\left( \sum_{i \in M} x_i^{l-1} * k_{ij}^{l} + b_j^{l} \right)$$

where $x_j^{l}$ denotes the $j$th feature in layer $l$, $M$ and $k$ indicate the set of input features and the convolution kernel, respectively, and $f(\cdot)$ and $b$ are the nonlinear activation function and the bias term, respectively. The commonly used activation function is ReLU (rectified linear unit), which is expressed as follows:

$$f(x) = \max(0, x)$$

Generally speaking, the pooling layer (PL) performs the downsampling operation. The main purpose of the PL is to reduce the parameters of the neural network while retaining the representative features, which prevents overfitting and improves the generalization ability of the model. The PL operation is carried out as follows:

$$x_j^{l+1} = p\left(x_j^{l}\right)$$

where $x_j^{l+1}$ is the $j$th feature of layer $l + 1$ and $p(x_j^{l})$ represents the pooling operation. The fully connected layer plays the role of the classifier in the whole neural network. First, the output of the last pooling layer is flattened into a one-dimensional feature vector as the input of the fully connected layer. Then, the inputs and outputs are fully connected, and the activation function of the hidden layer is ReLU. Finally, the Softmax function is applied to the output layer. The calculation of the fully connected layer is given as follows:

$$x^{l+1} = f\left( w^{l} x^{l} + b^{l} \right)$$

where $w^{l}$ and $b^{l}$ indicate the weight and bias of the fully connected layer, respectively, and $f(\cdot)$ denotes the nonlinear activation function.
When layer $l$ is a hidden layer, ReLU is usually used as the activation function; when layer $l + 1$ is the output layer, the activation function is changed to Softmax, which is given by

$$p\left(y^{(i)} = j \mid x^{(i)}; \theta\right) = \frac{e^{\theta_j^{\top} x^{(i)}}}{\sum_{c=1}^{k} e^{\theta_c^{\top} x^{(i)}}}$$

where $p(y^{(i)} = j \mid x^{(i)}; \theta)$ is the probability that the $i$th input sample feature $x^{(i)}$ belongs to category $j$, $\theta_1, \theta_2, \ldots, \theta_k \in R^{n+1}$ are the parameters of the model, and $1 / \sum_{c=1}^{k} e^{\theta_c^{\top} x^{(i)}}$ normalizes the probability distribution so that the sum of all the probabilities is equal to 1.
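The layer operations described above can be sketched in a few lines of plain Python. This is only a toy forward pass under stated assumptions: a single hand-picked kernel, no padding, non-overlapping max pooling, and hypothetical function names; it is not the network used in the paper.

```python
import math

def conv1d(signal, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation, as in most CNN libraries)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(signal) - k + 1)]

def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

def softmax(logits):
    """Numerically stable softmax: probabilities that sum to 1."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

signal = [1.0, -2.0, 3.0, 0.5, -1.0, 2.0]
features = max_pool1d(relu(conv1d(signal, kernel=[1.0, -1.0])))
probs = softmax(features)

print(features)    # [3.0, 2.5]
print(sum(probs))  # approximately 1.0
```

Here the pooled features are fed straight into Softmax for brevity; in a full network a fully connected layer would sit between them.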

Correlation Alignment Metric.
CORAL is a simple and effective unsupervised adaptation method that was first proposed in [31] and is widely used to measure the discrepancy between the source domain and target domain in recognition tasks. It aligns the input feature distributions of the source and target domains by exploiting their second-order statistics. Therefore, the only computation it needs is the covariance statistics of each domain. When incorporated into a deep neural network, it can be summarized, up to a normalization constant, as follows:

$$L_{\mathrm{CORAL}} = \left\| \mathrm{cov}(X_s) - \mathrm{cov}(X_t) \right\|_F^2$$

where $X_s$ and $X_t$ are the feature matrices of the source and target domains, $\mathrm{cov}(X) = X^{\top} C_n X$ denotes the covariance matrix, $C_n = I_n - (1/n) 1_n 1_n^{\top}$ denotes the centering matrix, $I_n$ is the $n \times n$ identity matrix, and $1_n$ is an $n$-dimensional vector with all elements being one.
Compared with CORAL, MMD-based approaches usually apply the same transformation to both the source and target domains [31], whereas asymmetric transformations are more flexible and often yield better performance for domain adaptation tasks [32]. Therefore, we use CORAL instead of MMD as the measure of the difference between the two domains to get better results. Domain adaptation is achieved by minimizing the difference between the feature spaces of the source domain and the target domain as measured by CORAL. By including the CORAL loss in the optimization objective, the similarity of the feature spaces learned in the source domain and the target domain is maximized, which compensates for the insufficient ability of a plain CNN to learn a domain-invariant feature space.

Proposed Method

Condition Classifier Based on CNN.
The condition classifier based on CNN consists of a 10-layer one-dimensional CNN, including one input layer, four convolution layers, two pooling layers, two fully connected layers, and one output layer. In this 10-layer CNN, the first seven layers are called the feature extractor, which is used to extract condition-representative features from vibration signals. Meanwhile, the last layer is regarded as the condition classifier, which judges the condition of a test sample. The input layer takes a one-dimensional vibration signal with a length of 784. In the convolution layers, convolution kernels are applied to local regions of the input signal and generate the corresponding features, as shown in Figure 2.
In order to reduce the dimension of the convolution features while preserving representative features as much as possible, a pooling layer is connected after the first convolution layer and after the last convolution layer. After the four convolution layers and two pooling layers, the features are flattened at the first fully connected layer FC1. Then, in the two fully connected layers, the interdomain distribution discrepancy is estimated by the CORAL metric. Finally, the category of the input sample is recognized by the Softmax classifier. Feature extraction is mainly performed by the one-dimensional CNN, whose structure and parameters are given in Table 1.
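To make the effect of the convolution and pooling stack on the signal length concrete, the sketch below traces a 784-point input through four convolution layers and two pooling layers in the order described above. The kernel sizes and strides are hypothetical placeholders chosen for illustration; they are not the values of Table 1.

```python
def out_len(n, kernel, stride=1, padding=0):
    """Output length of a 1-D convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

# Hypothetical (kind, kernel, stride) settings, NOT the parameters of Table 1.
layers = [
    ("conv", 5, 1),
    ("pool", 2, 2),  # pooling after the first convolution layer
    ("conv", 5, 1),
    ("conv", 3, 1),
    ("conv", 3, 1),
    ("pool", 2, 2),  # pooling after the last convolution layer
]

length = 784  # input length stated in the text
trace = [length]
for _, kernel, stride in layers:
    length = out_len(length, kernel, stride)
    trace.append(length)

print(trace)  # [784, 780, 390, 386, 384, 382, 191]
```

The final length, multiplied by the number of channels of the last convolution layer, gives the size of the flattened vector that enters the first fully connected layer.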

Domain Adaptation Based on CORAL.
Domain adaptation is an important means to transfer knowledge from source domain to target domain when data marginal distribution is inconsistent between source domain and target domain, which determines the efficiency of knowledge transfer.
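The common MMD domain adaptation criterion has high computational complexity and low generalization ability as the data volume increases. In order to effectively measure the marginal distribution difference between the source domain and the target domain, we use a differentiable loss function to minimize this difference [31]. As shown in Figure 2, we introduce the domain adaptive learning module at the fully connected layers FC1 and FC2 to calculate the covariance distance between the source and target features at FC1 and FC2 and define it as the CORAL loss. The calculation formula is as follows:

$$\ell_{\mathrm{CORAL}} = \frac{1}{4d^2} \left\| C_S - C_T \right\|_F^2$$

where $\|\cdot\|_F^2$ is the squared Frobenius norm of the matrix, $d$ is the feature dimension, and $C_S$ and $C_T$ are the covariance matrices of the source domain and target domain. Their calculation formulas are as follows:

$$C_S = \frac{1}{n_S - 1} \left( F_S^{\top} F_S - \frac{1}{n_S} \left( \mathbf{1}^{\top} F_S \right)^{\top} \left( \mathbf{1}^{\top} F_S \right) \right)$$

$$C_T = \frac{1}{n_T - 1} \left( F_T^{\top} F_T - \frac{1}{n_T} \left( \mathbf{1}^{\top} F_T \right)^{\top} \left( \mathbf{1}^{\top} F_T \right) \right)$$

where $\mathbf{1}$ is the column vector with all elements equal to 1, $F_S$ is the output of the source domain data at the fully connected layer FC, $F_T$ is the output of the target domain data at the fully connected layer FC, and $n_S$ and $n_T$ are the numbers of samples of the source domain and the target domain, respectively. The gradients with respect to the features are calculated as follows:

$$\frac{\partial \ell_{\mathrm{CORAL}}}{\partial F_S^{ij}} = \frac{1}{d^2 (n_S - 1)} \left( \left( F_S^{\top} - \frac{1}{n_S} \left( \mathbf{1}^{\top} F_S \right)^{\top} \mathbf{1}^{\top} \right)^{\top} \left( C_S - C_T \right) \right)^{ij}$$

$$\frac{\partial \ell_{\mathrm{CORAL}}}{\partial F_T^{ij}} = -\frac{1}{d^2 (n_T - 1)} \left( \left( F_T^{\top} - \frac{1}{n_T} \left( \mathbf{1}^{\top} F_T \right)^{\top} \mathbf{1}^{\top} \right)^{\top} \left( C_S - C_T \right) \right)^{ij}$$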
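A minimal sketch of the covariance distance described above is given below in plain Python. It assumes the commonly used Deep CORAL scaling of 1/(4d^2), where d is the feature dimension; that scaling, the toy feature batches, and the function names are assumptions for illustration rather than details confirmed by the text.

```python
def covariance(features):
    """Sample covariance of a batch: rows are samples, columns are features."""
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    return [[sum((row[a] - means[a]) * (row[b] - means[b]) for row in features)
             / (n - 1) for b in range(d)] for a in range(d)]

def coral_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    d = len(source[0])
    c_s, c_t = covariance(source), covariance(target)
    squared = sum((c_s[a][b] - c_t[a][b]) ** 2
                  for a in range(d) for b in range(d))
    return squared / (4 * d * d)

source = [[0.0, 0.0], [2.0, 2.0]]          # features positively correlated
target = [[0.0, 0.0], [2.0, -2.0]]         # features negatively correlated
source_shifted = [[5.0, 5.0], [7.0, 7.0]]  # same spread, different mean

print(coral_loss(source, source))          # 0.0
print(coral_loss(source, target))          # 2.0
print(coral_loss(source_shifted, target))  # 2.0, a mean shift is ignored
```

The last line shows a property worth keeping in mind: because only covariances are compared, a pure shift of the mean between domains does not change the loss.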

Optimization Objective.
In this subsection, we describe the optimization objectives of the proposed method in detail. Two objectives need to be optimized: (1) minimizing the condition classification errors on the source domain dataset, as shown in Figure 2, and (2) minimizing the distance between the second-order statistics (covariances) of the source and target features. For the first objective, we minimize the classification errors of the health condition categories on the source domain dataset by reducing the cross-entropy loss function. The specific calculation formula is expressed as follows:

$$L_c = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} I\left[ y^{(i)} = j \right] \log \frac{e^{\theta_j^{\top} x^{(i)}}}{\sum_{c=1}^{k} e^{\theta_c^{\top} x^{(i)}}}$$

where $m$ is the batch size of training samples, $j$ is the fault category, $k$ is the number of categories, and $I[\cdot]$ is the indicator function. The second objective is the covariance distance between FC1S/FC1T and FC2S/FC2T in the fully connected layers. The covariance distance is written as follows:

$$L_{\mathrm{CORAL}} = \sum_{l = l_1}^{l_2} \ell_{\mathrm{CORAL}}\left( F_S^{l}, F_T^{l} \right)$$

where $\ell_{\mathrm{CORAL}}(F_S^{l}, F_T^{l})$ is the CORAL loss between the source and target features of layer $l$, and $l_1$ and $l_2$ are the 7th layer and 8th layer, respectively, which means that network adaptation is carried out from layer 7 to layer 8 and that no domain adaptation is applied to the earlier layers. Therefore, the loss function of our method is constructed as follows:

$$L = L_c + \lambda L_{\mathrm{CORAL}}$$

where $\lambda$ is the penalty parameter that weights the CORAL term. Let $\theta_f$ and $\theta_c$ be the parameters of the feature extractor and condition classifier, respectively. The loss function is then rewritten as follows:

$$L(\theta_f, \theta_c) = L_c(\theta_f, \theta_c) + \lambda L_{\mathrm{CORAL}}(\theta_f)$$

Based on equation (14) and the stochastic gradient descent algorithm [33], the parameters $\theta_f$ and $\theta_c$ are updated as follows:

$$\theta_f \leftarrow \theta_f - \eta \frac{\partial L}{\partial \theta_f}, \qquad \theta_c \leftarrow \theta_c - \eta \frac{\partial L}{\partial \theta_c}$$

where $\eta$ is the learning rate. In this way, the marginal distribution discrepancy between the source domain and the target domain is minimized by domain adaptation, and the unlabeled samples in the target domain can be classified correctly by the condition classifier.
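The way the two objectives combine can be sketched as a weighted sum of a cross-entropy term and the per-layer covariance distances. In the Python sketch below, the weighting symbol `lam`, the example probabilities, and the per-layer CORAL values are illustrative assumptions; only the structure (classification loss plus a penalty on the adapted layers) follows the description above.

```python
import math

def cross_entropy(probabilities, labels):
    """Mean negative log-probability of the true class over a batch."""
    return -sum(math.log(p[y]) for p, y in zip(probabilities, labels)) / len(labels)

def total_loss(probabilities, labels, coral_per_layer, lam):
    """Classification loss plus lam times the CORAL terms of the adapted layers."""
    return cross_entropy(probabilities, labels) + lam * sum(coral_per_layer)

# Softmax outputs for a batch of two source samples and their true labels.
probs = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]

# Hypothetical CORAL values measured at the two adapted layers (FC1 and FC2).
coral_terms = [0.2, 0.1]

loss = total_loss(probs, labels, coral_terms, lam=0.5)
print(round(loss, 4))  # 0.6404
```

Setting `lam=0` recovers a plain source-only classifier, while a larger `lam` pushes the fully connected features of the two domains toward matching covariances at the cost of classification fit.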

Overview of the Proposed Method.
In the training process of the proposed method, the Adam optimization algorithm is used for objective optimization [34], which can effectively accelerate training and handle optimization problems with a large number of parameters. First, we divide the bearing data from different domains into a training set (source domain) and a testing set (target domain). In the feature extraction process, forward propagation is used to feed samples from the source domain and target domain into the CNN so that the CNN can extract their features. Then, in the domain adaptation process, the multilevel covariance distance between the features of the two domains is calculated and minimized to increase the similarity between the two domains as much as possible. Finally, we optimize the loss function iteratively to constrain the parameters of the CNN and train the condition classifier until the iterations end. After training, the obtained condition classifier is used to classify samples in the target domain. The specific training process is shown in Figure 3.
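For readers unfamiliar with the Adam update used in the training loop above, the following sketch applies the standard Adam rule to a one-parameter toy problem. The quadratic objective, the step count, and the learning rate of 0.1 are assumptions chosen so that the example converges quickly; they are not the training settings of the paper.

```python
import math

def adam_minimize(grad, theta, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize a scalar function with Adam, given its gradient function."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g      # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g  # second-moment estimate
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy objective f(x) = (x - 3)^2 with gradient 2 (x - 3); the minimum is at x = 3.
theta = adam_minimize(lambda x: 2 * (x - 3.0), theta=0.0, lr=0.1)
print(abs(theta - 3.0) < 1e-2)  # True
```

In the actual method the same rule would be applied to every CNN parameter, with the gradient taken from the combined classification and CORAL loss.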

Experiment and Result Analysis
In order to test the performance of the proposed intelligent fault diagnosis method and verify its effectiveness, we conducted experiments using two bearing datasets. Comparative experiments are also carried out to compare the classification accuracy with existing methods, including the traditional CNN without transfer learning, the TCA-based method [22], the handcrafted feature-based CORAL [31], WD-DTL [35], DDC [36], and DAN [37].

CWRU Bearing Dataset.
The first bearing dataset comes from Case Western Reserve University (CWRU) [38], and the original vibration signal used in the experiments, collected from the standard platform, is shown in Figure 4. The vibration signal is collected on the bearing test platform, which consists of a motor (left), a dynamometer (right), and a control circuit, and the experimental data are collected from the test bearing (SKF6205). The data cover four health states, namely, normal (N), inner-race fault (IF), outer-race fault (OF), and ball fault (BF). Each fault type has three degrees of fault severity (0.007 inch, 0.014 inch, and 0.021 inch fault diameters). So, there are 9 fault conditions and 1 healthy condition, giving 10 classes in total. More details are given in Table 2.
In our experiments, we select the original vibration signal sampled at 12 kHz, and four different motor speeds (1797 rpm, 1772 rpm, 1750 rpm, and 1730 rpm) are applied to the bearing. We regard them as four different operating conditions (named A, B, C, and D), and each operating condition contains 1500 samples (150 samples for each of the 9 fault classes and 150 samples of the healthy condition). The waveforms of the various categories of original vibration signals used in the experiments are illustrated in Figure 5. Assuming that the source domain is workload 0 and the target domain is workload 1, we denote the transfer task from 0 to 1 as A ⟶ B.

JNU Bearing Dataset.
The second bearing dataset comes from the centrifugal fan system testbed for rolling bearing fault diagnosis at Jiangnan University (JNU) [39]. The bearings under test are single-row spherical roller bearings. The faults were artificially induced in the bearings with a wire-cutting machine. The vibration signals cover four categories of bearings: normal, outer-race defect, inner-race defect, and roller element defect. The dataset contains three operating conditions with rotation speeds of 600, 800, and 1000 rpm, and the sampling frequency is 50 kHz. In the experiments, we denote them as three different operating conditions: E, F, and G. Each operating condition contains 600 samples (150 samples for each of the three fault types and 150 healthy samples). Figure 6 shows the waveforms of all kinds of original vibration signals used in the experiments.
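The transfer tasks follow directly from the operating conditions of the two datasets: every ordered pair of distinct conditions is one source-to-target task. The short sketch below enumerates them; the function name and the `->` task notation are illustrative choices.

```python
from itertools import permutations

def transfer_tasks(conditions):
    """All ordered (source, target) pairs of distinct operating conditions."""
    return [f"{s} -> {t}" for s, t in permutations(conditions, 2)]

cwru_tasks = transfer_tasks(["A", "B", "C", "D"])  # four motor speeds
jnu_tasks = transfer_tasks(["E", "F", "G"])        # three rotation speeds

print(len(cwru_tasks), len(jnu_tasks))  # 12 6
print(cwru_tasks[:3])                   # ['A -> B', 'A -> C', 'A -> D']
```

This accounts for the twelve CWRU experiments and six JNU experiments, eighteen transfer tasks in total.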

Experimental Results of CWRU Bearing Dataset.
In order to present a comprehensive evaluation for our method, we first carry out the fault diagnosis experiment from source domain to source domain and from source domain transfer to target domain. Various condition classification accuracies of the proposed method are shown in Table 3. It can be found that our method achieves 100% classification accuracy for source domain to source domain and obtains 97.85% average test accuracy for source domain transfer to target domain. In addition, it can be seen from Table 3 that the classification accuracy of the target domain is slightly lower than that of the source domain due to the distribution inconsistency between different domains, but it is not very serious.
Furthermore, to analyze the classification accuracy of each category in more detail, the widely used confusion matrix is employed to assess the performance of the compared methods. Here, we randomly choose the task A ⟶ B to calculate the confusion matrix. The detailed results are illustrated in Figure 7, with rows denoting the actual health category and columns representing the predicted health category.
From Figure 7, we can find that our method achieves an average classification accuracy of 98.5% over the conditions and that the recognition differences among the ten conditions are very small. By contrast, the other compared approaches (CNN, DDC, and DAN) show much more confusion across different fault conditions than the proposed method, which illustrates its superiority.
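As a small illustration of how such a confusion matrix is read, the sketch below builds one with rows as actual classes and columns as predicted classes and derives the per-class accuracy from it. The three-class labels are made-up toy data, not results from the experiments.

```python
def confusion_matrix(actual, predicted, num_classes):
    """Rows index the actual class, columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def per_class_accuracy(matrix):
    """Fraction of each actual class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

# Toy labels for three classes (illustrative only).
actual = [0, 0, 1, 1, 2, 2, 2, 2]
predicted = [0, 0, 1, 2, 2, 2, 2, 1]

cm = confusion_matrix(actual, predicted, num_classes=3)
print(cm)                      # [[2, 0, 0], [0, 1, 1], [0, 1, 3]]
print(per_class_accuracy(cm))  # [1.0, 0.5, 0.75]
```

Off-diagonal entries show which conditions are mistaken for each other, which is the kind of confusion the comparison above refers to.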
To conduct a detailed comparative experiment, we first divide the compared algorithms into four categories according to their feature extraction; when the feature extraction is the same, they are further classified according to the transfer manner. The details of the various methods are introduced in Table 4, and the comparative experiments are carried out by category as follows.
(1) Comparison with CNN (without Domain Adaptation). For a fair comparison, the CNN framework used for comparison is identical to our feature extraction framework. The only difference is that the CNN method has no domain adaptation layer and is trained only on the source domain. It can be seen from Table 5 that the average classification accuracy of CNN in the target domain is 90.24%, which is better than that of the handcrafted feature methods TCA and CORAL. However, the classification accuracy of our method is 7.61% higher than that of CNN on average. These comparative experiments show that, for fault diagnosis with unlabeled target data, it is necessary to perform domain adaptation on the learned features to improve the classification accuracy.
(2) Comparison with TCA and CORAL. In contrast to the deep learning methods, traditional transfer learning methods rely on handcrafted features. In the TCA method, MMD is used to calculate the distribution discrepancy of the two domains, and a Gaussian kernel function is employed to minimize the MMD. The CORAL method aims to minimize the domain shift by aligning the second-order statistics (alignment of mean and covariance matrices) through a linear transformation. The classification accuracies of the various transfer tasks are shown in Table 5; the classification accuracies of TCA and CORAL are lower than that of our method by 25.06% and 30.25%, respectively. The main reason for this result is that TCA and CORAL rely on handcrafted feature extraction, which shows that replacing manual feature extraction with deep feature learning is very helpful.
(3) Comparison with Transfer Learning Based on Wasserstein. The Wasserstein distance, which measures the discrepancy between two probability distributions, was adopted to solve the training instability of GANs (generative adversarial networks) and to ensure the diversity of generated samples [40]. Its advantage over the KL (Kullback-Leibler) divergence and the JS (Jensen-Shannon) divergence is that it still reflects the discrepancy between two distributions even when they do not overlap. In that case, the JS divergence is a constant and the KL divergence may be meaningless.
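The difference between the JS divergence and the Wasserstein distance for non-overlapping distributions can be checked numerically. The sketch below uses toy histograms on a shared, evenly spaced grid; the histograms and the function names are assumptions for illustration, and the one-dimensional Wasserstein-1 distance is computed as the area between the two cumulative distributions.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (in nats)."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    mix = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def wasserstein_1d(p, q, bin_width=1.0):
    """Wasserstein-1 distance on a shared grid: area between the two CDFs."""
    cdf_gap, total = 0.0, 0.0
    for x, y in zip(p, q):
        cdf_gap += x - y
        total += abs(cdf_gap) * bin_width
    return total

p = [1.0, 0.0, 0.0, 0.0]
q_near = [0.0, 1.0, 0.0, 0.0]  # disjoint from p, one bin away
q_far = [0.0, 0.0, 0.0, 1.0]   # disjoint from p, three bins away

print(js_divergence(p, q_near), js_divergence(p, q_far))    # both equal log 2
print(wasserstein_1d(p, q_near), wasserstein_1d(p, q_far))  # 1.0 3.0
```

The JS divergence saturates at log 2 for both targets, so it gives no signal about how far apart the domains are, whereas the Wasserstein distance grows with the shift.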
Cheng et al. proposed an intelligent fault diagnosis method based on Wasserstein distance deep transfer learning, named WD-DTL [35]. The Wasserstein distance was used to measure the distance between the two domains and to reduce their discrepancy, which achieves transfer learning fault diagnosis across different motor speeds and sensor locations. From Table 4, the WD-DTL [35] method achieves the best classification accuracy in tasks B ⟶ C and D ⟶ C and an average accuracy of 95.75%. However, its average classification accuracy is 2.1% lower than that of our method.
(4) Comparison with Transfer Learning Based on MMD. As a widely used distribution discrepancy measure in domain adaptation, MMD estimates the distance between two marginal distributions in Hilbert space. Here, two transfer learning methods based on MMD, DDC [36] and DAN [37], are used in the comparative experiments. In order to ensure the fairness of the experiment, the CNN framework that automatically extracts features in both methods is consistent with that of our method. In the DDC method, a single kernel function is used to reduce the MMD of the two domains, and the domain adaptation is performed in only one layer of the CNN. In the DAN method, by contrast, multiple kernel functions are used to obtain the MMD, and the domain adaptation is executed in three layers of the CNN. The classification accuracy of the compared algorithms is illustrated in Figure 8, and it can be seen that the proposed method achieves the best classification performance over the 12 transfer tasks compared with the other approaches.

Experimental Results of JNU Bearing Dataset.
In order to verify the results obtained in the previous section, we use another dataset to perform experiments on each method. The experimental results are shown in Table 6, and the classification accuracy of the compared algorithms is illustrated in Figure 9. As the difference between working conditions increases, the accuracy of every method declines compared with the previous experiment, but our method is still significantly ahead of the other methods. Therefore, the experimental conclusions obtained in the previous section are supported.

Implementation Details.
In this subsection, we give a detailed description of our experimental setup. The experiments are implemented in Python, and the GPU is an NVIDIA GTX 1660 Ti. In each experiment, the Adam optimizer is used with a learning rate of 0.001 and a batch size of 128. The penalty parameter λ affects the performance of transfer fault diagnosis; it is tuned over {0.1, 0.2, 0.5, 1, 10}, and the value giving the best classification accuracy is selected. As an example, we take the transfer task A ⟶ B to show the training process of its loss function. Owing to the fast convergence of Adam, the loss value decreases rapidly in the first 30 iterations, tends to be stable at about 50 iterations, and approaches 0 at about 400 iterations; the detailed iteration process is illustrated in Figure 10.
We find that the features learned by our method through domain adaptation are not sensitive to domain variation and have strong fault identification ability.

Conclusion
In this paper, a deep transfer learning method based on the CORAL metric for bearing fault diagnosis is proposed. The key idea of the proposed method is to employ the nonlinear transform-based CORAL loss function to estimate the interdomain discrepancy. As a feature extractor and classifier, a CNN is used to train the condition classifier model, with its parameters constrained by the CORAL loss function. Eighteen transfer tasks on two different datasets are carried out to verify the classification performance of the proposed method, and five state-of-the-art architectures are compared with our method. The results illustrated above demonstrate that the proposed approach can achieve more satisfactory classification accuracy and domain adaptation capability. However, in the second experiment, the difference between the working conditions increased, and the results obtained by our method were not satisfactory. This may be due to the following two limitations of CORAL: (1) aligning covariances with the usual Euclidean metric is suboptimal and (2) second-order statistics have limited expressive power for non-Gaussian distributions [42]. Therefore, in the next step, the authors will improve the proposed method and focus on unsupervised transfer learning fault diagnosis for different machines with greater differences between domains.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.