Spacecraft Intelligent Fault Diagnosis under Variable Working Conditions via Wasserstein Distance-Based Deep Adversarial Transfer Learning



Introduction
Fault diagnosis uses data acquired by monitoring equipment to detect potential anomalies as early as possible, identify the root causes of failures, and support maintenance decisions that avoid catastrophic breakdown. It is therefore of great importance for spacecraft [1,2].
In the past few decades, advanced signal processing methods [3] and machine learning algorithms [4,5] have been employed for spacecraft fault diagnosis. Machine learning usually plays an essential role in exploring the mapping between features extracted by signal processing techniques and the health states of spacecraft [6,7]. However, these methods usually require abundant experience and expert knowledge [8], making them time-consuming and labor-intensive. With the rapid increase of test data and the fast development of data-driven technology, deep learning models such as the Deep Belief Network (DBN) [9,10], Sparse Auto-Encoder (SAE) [11,12], Convolutional Neural Network (CNN) [8,13-15], and Recurrent Neural Network (RNN) [16,17] can extract effective features and show superior learning capability in fault diagnosis. These deep models are becoming research hotspots in the fault diagnosis of industrial equipment and spacecraft [8,18,19]. However, they rest on the basic assumption that the training and test data are sampled from the same distribution. Unfortunately, this assumption often fails in aerospace application scenarios. For example, in spacecraft fault diagnosis, the training and test datasets are often collected under different working conditions, which shifts the data distribution. Consequently, a diagnosis model learned on the training dataset (which usually contains sufficient labeled samples) may not generalize well to the test dataset (which usually contains few or no labeled samples).
To tackle these challenges, transfer learning (TL) approaches [20] have been introduced into spacecraft fault diagnosis and show superior performance in cross-domain diagnosis scenarios [21-23]. TL leverages knowledge obtained in a source domain (with adequate labeled samples under one working condition) to train a model that generalizes to a target domain (with few labeled or only unlabeled data under a novel but related working condition). TL can be roughly classified into four categories: instance-based TL [24], parameter-based TL [25], relation-based TL [26], and feature-based TL [27,28]. Among them, feature-based TL is the most investigated because of its ability to correct cross-domain discrepancy. Transfer Component Analysis (TCA) [29] is a typical feature-based TL method, which adopts the Maximum Mean Discrepancy (MMD) [30-32] in a low-dimensional feature space as the distance metric to reduce distribution discrepancy.
Inspired by the generative adversarial network (GAN), more and more researchers have found that learning domain-invariant features with adversarial training is a promising approach in the field of TL [21,33-35]. Through adversarial training, domain-invariant features can be learned when the networks reach a state known as Nash equilibrium [36]. Training a GAN, however, raises difficulties such as unstable learning and mode collapse [37]. Arjovsky et al. [38] proposed the Wasserstein GAN (WGAN), which introduced the more sensible Wasserstein distance to replace the Jensen-Shannon (JS) and Kullback-Leibler (KL) divergences as the loss function in GAN. To alleviate the gradient vanishing or exploding problem when training the WGAN, Gulrajani et al. [39] proposed a WGAN with a gradient penalty term (WGAN-GP). Miyato et al. [40] proposed the Spectrally Normalized GAN (SNGAN), which constrains the discriminator by normalizing each weight matrix with its maximum singular value. The motivation of this paper is to explore how Wasserstein-based adversarial learning performs in the field of spacecraft fault diagnosis.
In this article, we propose a new WGAN-GP-based deep adversarial transfer learning (WDATL) model to tackle the problems of spacecraft fault diagnosis under various working conditions. The WDATL model contains four components: source domain feature extractor, target domain feature extractor, fault model classifier, and domain critic. The source and target domain feature extractors are deep convolutional neural networks which learn feature representations from the raw one-dimensional (1-D) input signals in the source and target domain, respectively. The domain critic minimizes the discrepancy between the source and target feature distributions using WGAN-GP. The fault model classifier is designed to map the features to their corresponding fault labels by optimizing the loss function. Through this adversarial learning process, domain-invariant and fault model discriminative feature representations can be learned to improve the diagnosis performance in the target domain. The main contributions of this paper can be summarized as follows:
(1) We propose a novel WGAN-GP-based deep adversarial transfer learning (WDATL) model, which can learn domain-invariant features and improve diagnosis performance under multiple working conditions. To the best of our knowledge, it is the first attempt of this kind in the field of spacecraft fault diagnosis, and it is expected to help overcome the obstacles of applying deep learning methods to real engineering scenarios
(2) We propose an improved 1-D convolutional neural network, which builds an end-to-end model to extract features automatically without requiring experienced experts and a high skill set. We design wide kernels in the first convolutional layer to suppress high-frequency noise and improve the robustness of our model. Moreover, we adopt exponential linear units (ELU) instead of the commonly used rectified linear units (ReLU) as the activation function to speed up learning
(3) The t-SNE is used to visualize the feature representations learned by the feature extractor for our proposed method and the other compared models, helping to validate the superior performance of the WDATL
(4) Experimental results on two public datasets demonstrate that our proposed WDATL model performs well under diverse working circumstances and achieves significantly better results than most of the baseline approaches

The rest of this work is organized as follows. Preliminaries including transfer learning and the generative adversarial network are introduced in Section 2. Section 3 details the proposed fault diagnosis method under various working circumstances. Extensive experiments are conducted, and results are discussed, in Section 4. Finally, conclusions are drawn in Section 5.

Preliminaries
2.1. Transfer Learning. TL could utilize knowledge from one or multiple related datasets called source domains to promote the model's diagnosis accuracy and performance in the current dataset called target domain.
There are two basic concepts in TL: domain and task [20]. A domain is denoted as $\mathcal{D} = \{\mathcal{X}, \mathbb{P}(X)\}$, which includes two components: a feature space $\mathcal{X}$ and a marginal probability distribution $\mathbb{P}(X)$, where $X = \{x_1, x_2, x_3, \cdots, x_n\} \in \mathcal{X}$ is an $n$-dimensional vector. There exist a source domain $\mathcal{D}_s$ and a target domain $\mathcal{D}_t$. If $\mathcal{X}_s \neq \mathcal{X}_t$ and/or $\mathbb{P}(X_s) \neq \mathbb{P}(X_t)$, then the two domains are considered to have different distributions, i.e., $\mathcal{D}_s \neq \mathcal{D}_t$. A task $\mathcal{T} = \{\mathcal{Y}, \mathbb{P}(Y \mid X)\}$ is also composed of two components: a label space $\mathcal{Y}$ and a prediction function, which is equal to the conditional probability distribution $\mathbb{P}(Y \mid X)$. Given a source domain $\mathcal{D}_s$ with learning task $\mathcal{T}_s$ and a target domain $\mathcal{D}_t$ with learning task $\mathcal{T}_t$, TL aims to improve the performance of the target predictive function in $\mathcal{D}_t$ by leveraging the knowledge in $\mathcal{D}_s$ and $\mathcal{T}_s$, where $\mathcal{D}_s \neq \mathcal{D}_t$ or $\mathcal{T}_s \neq \mathcal{T}_t$. Usually, there are sufficient labeled samples $\{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ in the source domain to train a diagnosis model, where $N_s$ is the size of the dataset and $y_i^s$ is the label of $x_i^s$. In the target domain, by contrast, samples are not labeled, and our aim is to construct a deep learning model that can diagnose or classify the unlabeled data by learning domain-invariant features.

International Journal of Aerospace Engineering

2.2. Generative Adversarial Network. GAN, proposed by Goodfellow et al. [41] in 2014, consists of two networks: a generator network $G(z; \theta)$ and a discriminator network $D(x; \phi)$, as shown in Figure 1, where $\theta$ and $\phi$ denote the parameters of the generator and discriminator, respectively. The generator aims to learn a distribution $p_\theta(x)$ from a randomly generated variable $z$, while the discriminator tries to differentiate between the real data distribution $p_r(x)$ and the fake data distribution $p_\theta(x)$ produced by the generator network. The stage at which the discriminator can no longer distinguish between the real and fake data distributions is called Nash equilibrium.
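In summary, the optimization objective of GAN is a min-max problem, and the loss function $L$ can be formulated as

$$\min_{\theta} \max_{\phi} L = \mathbb{E}_{x \sim p_r(x)}[\log D(x; \phi)] + \mathbb{E}_{x \sim p_\theta(x)}[\log(1 - D(x; \phi))] \tag{1}$$

where $\mathbb{E}_{x \sim p_r(x)}(\cdot)$ calculates the expectation when the variable $x$ follows the distribution $p_r(x)$. For the optimal discriminator, the objective reduces to $2\,\mathrm{JS}(p_r \,\|\, p_\theta) - 2\log 2$, where $\mathrm{JS}(\cdot)$ is the Jensen-Shannon divergence, which measures the difference between $p_r(x)$ and $p_\theta(x)$. When $p_r(x)$ and $p_\theta(x)$ do not overlap, the divergence is constant and the gradients of the loss function $L$ become zero. The Jensen-Shannon divergence is therefore not well suited as the loss function of GAN and may cause unstable training and mode collapse [37]. To address this problem, the Wasserstein distance was first used in [38] instead of the Jensen-Shannon divergence to evaluate the difference between the marginal distributions $p_r$ and $p_\theta$. The Wasserstein distance is defined as

$$W(p_r, p_\theta) = \inf_{\gamma \in \Gamma(p_r, p_\theta)} \mathbb{E}_{(x, y) \sim \gamma}[\lVert x - y \rVert] \tag{2}$$

where $\gamma$ denotes a joint probability distribution and $\Gamma(p_r, p_\theta)$ is the set of all joint distributions whose marginals are $p_r$ and $p_\theta$. However, Equation (2) is intractable to calculate directly, and it can be rewritten according to the Kantorovich-Rubinstein duality as

$$W(p_r, p_\theta) = \sup_{\lVert f \rVert_L \leq 1} \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_\theta}[f(x)] \tag{3}$$

where $f : \mathbb{R}^d \to \mathbb{R}$ is a 1-Lipschitz function subject to

$$|f(x_1) - f(x_2)| \leq \lVert x_1 - x_2 \rVert, \quad \forall x_1, x_2 \in \mathbb{R}^d \tag{4}$$

Thus, calculating the Wasserstein distance between $p_r$ and $p_\theta$ is converted into finding the maximum difference in expectation of a continuous 1-Lipschitz function under the distributions $p_r$ and $p_\theta$.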
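To make the contrast with the Jensen-Shannon divergence concrete, the following minimal sketch computes the Wasserstein-1 distance between two equal-size one-dimensional samples. It relies on the standard fact that in one dimension the optimal coupling pairs sorted values; the function name `wasserstein_1d` and the sample values are illustrative assumptions, not part of the WDATL model.

```python
def wasserstein_1d(xs, ys):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    # In 1-D the optimal transport plan matches the i-th smallest x
    # with the i-th smallest y, so the distance is a mean of gaps.
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Two samples with disjoint supports (hypothetical values).
real = [0.0, 1.0, 2.0]
fake = [10.0, 11.0, 12.0]
print(wasserstein_1d(real, fake))
```

Even though the two samples do not overlap, the distance still reflects how far apart they are, which is why it can provide a useful training signal where the Jensen-Shannon divergence saturates.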

Proposed Method
3.1. The Overall Structure of WDATL. Inspired by GAN, many researchers have introduced the adversarial learning strategy into the field of fault diagnosis [36]. We detail the newly proposed WDATL approach in this section to solve the difficulties of spacecraft fault diagnosis under diverse working circumstances. The WDATL model is composed of four parts: source domain feature extractor, target domain feature extractor, fault model classifier, and domain critic. The overall structure of WDATL is shown in Figure 2.
Unlike the generator in Figure 1, which aims to generate fake data, the generators in WDATL, called the source and target feature extractors, extract features from the source input data and target input data, respectively. The source and target feature extractors have the same network architecture and share identical parameters. The domain critic is designed to decrease the discrepancy between the feature distributions of the source and target data extracted by the feature extractors, and the fault model classifier predicts the corresponding labels. Through adversarial training, domain-invariant and fault model discriminative features can be learned, so the fault model classifier can not only classify the labeled data in the source domain but also generalize to the unlabeled data in the target domain with satisfactory accuracy. The feature extractor in the proposed WDATL contains five 1-D convolutional layer-pooling layer pairs with 1-D Batch Normalization (BN) layers to tackle internal covariate shift, and a dropout layer in the last layer to avoid overfitting. Moreover, there is a flatten layer behind the fifth convolutional layer-pooling layer pair, and the number of neurons in the flatten layer is equal to the size of the output features. The improved 1-D CNN-based feature extractor is illustrated in Figure 3.
The convolutional layer in WDATL convolves the raw input data with 1-D filter kernels and is followed by an activation function to generate output features. Usually, there are multiple kernels in each convolutional layer so that the layer can learn comprehensive features. The convolutional layer is formulated as

$$x^{l} = f(W^{l} * x^{l-1} + b^{l}) \tag{5}$$

where $W^{l}$ represents the kernels of the $l$th layer, $*$ denotes the convolution operation, $x^{l-1}$ and $x^{l}$ denote the input and output of the $l$th layer, respectively, $b^{l}$ is the bias term, and $f$ is the activation function, such as sigmoid, tanh, or ReLU, which gives the layer its nonlinearity. The pooling operation after a convolutional layer selects features and reduces their dimension, making the features more robust. Max pooling and average pooling are the two most commonly used pooling operations. In our WDATL, we use max-pooling after the first four convolutional layers and adaptive max-pooling after the last convolutional layer, both of which take the maximum value of the input features.
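The convolution and pooling operations described above can be sketched in a few lines of plain Python. This is a minimal single-kernel illustration under stated assumptions (no padding, stride 1, non-overlapping pooling, activation omitted); the helper names `conv1d` and `max_pool1d` and the toy signal are hypothetical and do not reproduce the layer sizes of the WDATL feature extractor.

```python
def conv1d(x, kernel, bias=0.0, stride=1):
    """Valid 1-D cross-correlation of a signal with one kernel,
    the operation CNN layers conventionally call convolution."""
    k = len(kernel)
    return [
        sum(kernel[j] * x[i + j] for j in range(k)) + bias
        for i in range(0, len(x) - k + 1, stride)
    ]


def max_pool1d(x, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


# Toy periodic signal and a difference kernel (both hypothetical).
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
feature = conv1d(signal, kernel=[1.0, 0.0, -1.0])
pooled = max_pool1d(feature, size=2)
print(feature)
print(pooled)
```

In a full layer, the activation function would be applied to `feature` before pooling, and many kernels would run in parallel to produce multiple feature channels.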
Specifically, we modify and improve the CNN-based feature extractor mainly in two aspects.
Firstly, we adopt large 64 × 1 kernels instead of small kernels in the first convolutional layer. In real industrial applications, equipment, especially spacecraft, often works under various conditions and in harsh environments, resulting in noisy data. A larger kernel can extract more comprehensive fault information because it has better characterization capability and is less sensitive to noise [42] than a small kernel. Small kernels are then used in the following layers to build a deeper network architecture and learn finer features.
Secondly, we utilize the exponential linear unit (ELU) instead of the commonly used ReLU as the activation function. A drawback of ReLU is that its gradient is zero whenever the input is negative, which may leave neural units inactive. This can greatly restrict the learning capability of the CNN [43]. To overcome this disadvantage, we propose to use ELU [44] in our WDATL model. The ELU is defined as

$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha(e^{x} - 1), & x \leq 0 \end{cases} \tag{6}$$

where $\alpha > 0$. In contrast to ReLU, the ELU keeps units active for all inputs and saturates to a negative value when the input is strongly negative, which decreases the forward propagated variation and makes it more robust to noise. Moreover, ELU can accelerate the training of deep neural networks and lead to higher classification accuracies.
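The difference between the two activations for negative inputs can be checked numerically. The sketch below uses the standard ELU definition with a hypothetical default of `alpha=1.0`; the function names are illustrative.

```python
import math


def relu(x):
    """Rectified linear unit: zero output (and zero gradient) for x <= 0."""
    return x if x > 0 else 0.0


def elu(x, alpha=1.0):
    """Exponential linear unit: identity for x > 0, smooth saturation to -alpha below."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_grad(x, alpha=1.0):
    """Derivative of ELU; strictly positive even for negative inputs."""
    return 1.0 if x > 0 else alpha * math.exp(x)


for value in (-5.0, -1.0, 0.5):
    print(value, relu(value), elu(value), elu_grad(value))
```

For negative inputs `relu` returns exactly zero while `elu` returns a bounded negative value with a nonzero gradient, so the unit can still be updated during backpropagation.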

3.2. Pretraining in Source Domain. The fault model classifier consists of two fully connected layers and is designed to directly map the feature representations learned by the feature extractor to their corresponding spacecraft fault states. A softmax activation function is added to the last layer to predict the labels of the input representations. The first step in our WDATL is to pretrain the source feature extractor and the fault model classifier on the labeled data $x^s$ in the source domain. The corresponding parameters are updated with the Adam [45] algorithm by minimizing the cross-entropy loss $l_c(x^s, y^s)$ between the predicted labels $\hat{y}_i^s$ and the ground-truth labels $y_i^s$:

$$l_c(x^s, y^s) = -\frac{1}{N_s}\sum_{i=1}^{N_s}\sum_{k=1}^{K} 1(y_i^s = k)\,\log \hat{y}_{i,k}^s \tag{7}$$

where $1(y_i^s = k)$ is the indicator function, which returns 1 when $y_i^s = k$ and 0 otherwise, $\hat{y}_{i,k}^s$ is the predicted probability that sample $x_i^s$ belongs to class $k$, and $K$ represents the number of fault models. The purposes of pretraining are not only to initialize the target domain feature extractor, which shares its parameters with the source domain feature extractor, but also to provide a reference model that accelerates adversarial training.
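As an illustration of the pretraining loss, the sketch below evaluates a softmax cross-entropy over a small batch of classifier outputs. The function names, the logits, and the labels are hypothetical; the actual model would obtain the logits from the feature extractor and fault model classifier.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vector of class scores."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Mean negative log-probability assigned to the true class."""
    losses = [
        -math.log(softmax(logits)[label])
        for logits, label in zip(batch_logits, labels)
    ]
    return sum(losses) / len(losses)


# Hypothetical classifier outputs for two samples and three fault classes.
logits = [[4.0, 0.5, 0.1], [0.2, 0.3, 3.5]]
print(cross_entropy(logits, labels=[0, 2]))  # small: predictions match labels
print(cross_entropy(logits, labels=[1, 0]))  # large: predictions disagree
```

Because the indicator function selects only the true class, the double sum in the loss collapses to one log-probability per sample, which is what `cross_entropy` computes.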

3.3. Adversarial Learning of the WDATL. The schematic of the WDATL model proposed in this article is illustrated in Figure 4. Adversarial learning in WDATL is essentially a min-max game among the feature extractors, domain critic, and fault model classifier. More specifically, the domain critic tries to decrease the distribution discrepancy between the source and target features by using the Wasserstein distance, the feature extractor is trained to extract target features that maximize the output of the domain critic, and the fault model classifier is trained to classify the source features with minimal cross-entropy loss. The fault model classifier trained in this adversarial manner can thereby be generalized to diagnose the target domain data (under different working conditions) without target labels.
Consider two minibatches of samples $\{x_i^s\}_{i=1}^{n}$ and $\{x_i^t\}_{i=1}^{n}$ drawn from the source domain $x^s$ and the target domain $x^t$, where $n$ is the minibatch size, with $n < N_s$ and $n < N_t$. The source features are $r^s = f_{\theta_e}(x^s)$ and the target features are $r^t = f_{\theta_e}(x^t)$, where $\theta_e$ denotes the parameters of the feature extractor. Let $p_s$ and $p_t$ be the marginal distributions of the source and target features, respectively. The domain critic $f_{\theta_d}$ consists of two fully connected layers and outputs a score for each input. Assuming that there exists a set of parameters $\theta_d$ for which $f_{\theta_d}$ satisfies the Lipschitz constraint, then according to Equation (3) the Wasserstein distance $l_{wd}$ between $p_s$ and $p_t$ can be computed by

$$l_{wd} = \frac{1}{n}\sum_{i=1}^{n} f_{\theta_d}(f_{\theta_e}(x_i^s)) - \frac{1}{n}\sum_{i=1}^{n} f_{\theta_d}(f_{\theta_e}(x_i^t)) \tag{8}$$

To estimate the distribution discrepancy between the source and target features, the domain critic attempts to output a large score for source features and a small score for target features.
Arjovsky et al. [38] adopted a weight-clipping approach that confines the parameters $\theta_d$ to a limited space to satisfy the Lipschitz constraint; however, this can cause gradient vanishing or exploding. Gulrajani et al. [39] proposed adding a gradient penalty term (WGAN-GP) to $l_{wd}$ as a regularizer, so the final optimization objective of the domain critic is

$$\min_{\theta_d} l_d = -l_{wd} + \rho\, l_{grad} \tag{9}$$

where $\rho$ denotes the balancing coefficient, and $l_{grad} = (\lVert \nabla_{h} f_{\theta_d}(h) \rVert_2 - 1)^2$ represents the gradient penalty term.
As for the target feature extractor, the aim is to maximize the output of the domain critic, and the loss function is computed by

$$\min_{\theta_e} l_e = -\frac{1}{n}\sum_{i=1}^{n} f_{\theta_d}(f_{\theta_e}(x_i^t)) \tag{10}$$

Moreover, the fault model classifier should be trained to minimize the cross-entropy loss in Equation (7). Through adversarial learning among Equations (7), (9), and (10), domain-invariant and fault model discriminative features can be learned. The flowchart of the proposed WDATL is illustrated in Figure 5. The classifier trained in the source domain can therefore be generalized to classify data in the target domain. In this way, the unlabeled target domain is expected to benefit from the knowledge in the source domain, which has sufficient supervised information.
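The interplay of the critic and feature extractor losses can be illustrated with a toy example. The sketch below makes several assumptions that are not taken from the paper: the critic is linear, f(h) = w·h + b, so that its gradient with respect to h equals w everywhere and the gradient penalty has a closed form; the critic minimizes the negative Wasserstein estimate plus ρ times the penalty; and the feature extractor minimizes the negative mean critic score on target features. The paper's critic has two fully connected layers and would need automatic differentiation.

```python
import math


def critic(h, w, b):
    """Toy linear critic score w . h + b (assumption: the real critic is an MLP)."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b


def mean_score(features, w, b):
    return sum(critic(h, w, b) for h in features) / len(features)


def wasserstein_estimate(src, tgt, w, b):
    """Mean critic score on source features minus mean score on target features."""
    return mean_score(src, w, b) - mean_score(tgt, w, b)


def gradient_penalty(w):
    """(||grad_h f(h)||_2 - 1)^2; for a linear critic the gradient is w."""
    return (math.sqrt(sum(wi * wi for wi in w)) - 1.0) ** 2


def critic_loss(src, tgt, w, b, rho=5.0):
    """Assumed sign convention: minimize -l_wd + rho * l_grad."""
    return -wasserstein_estimate(src, tgt, w, b) + rho * gradient_penalty(w)


def extractor_loss(tgt, w, b):
    """Negative mean critic score on target features."""
    return -mean_score(tgt, w, b)


# Hypothetical 2-D source and target features.
src = [[1.0, 0.0], [3.0, 0.0]]
tgt = [[0.0, 0.0], [0.0, 0.0]]
print(critic_loss(src, tgt, w=[1.0, 0.0], b=0.0))  # unit-norm gradient, no penalty
print(critic_loss(src, tgt, w=[2.0, 0.0], b=0.0))  # larger gap, but penalized
```

The default `rho=5.0` follows the balancing coefficient reported in the experimental setup. The second call shows how the penalty discourages the critic from inflating the score gap simply by scaling its weights.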
3.4. Epilog. The training strategy of our WDATL approach is summarized in Algorithm 1, which can be divided into 5 steps as follows:
(1) The feature extractor and fault model classifier are first pretrained in the source domain with sufficient labeled data, and the parameters $\theta_e$ and $\theta_c$ are updated using Adam
(2) [...] the distribution discrepancy between the source and target features according to Equation (9)
(3) The parameters $\theta_e$ are updated according to Equation (10), while $\theta_c$ are updated according to Equation (7)
(4) Repeat step (2) and step (3) [...]

[...] [32], Correlation Alignment (CORAL) [46], and Domain Adversarial Neural Network (DANN) [47] are investigated. The CNN method (without transfer) uses convolutional layers to extract features and fully connected layers to map the features to the corresponding fault models; it is trained in the source domain and then directly applied to diagnose data in the target domain.
Both DAN and JAN are deep adaptation methods, which try to reduce the distribution distance between the source and target domains by mapping the latent features of the fully connected layers into a Reproducing Kernel Hilbert Space (RKHS) and reducing the MMD distance there. DAN uses MK-MMD to tackle the kernel parameter selection problem and adds the MK-MMD loss to the classification loss to form the final objective function. JAN utilizes JMMD to measure the distance between the empirical joint distributions of the source and target domains and thereby align the domain shift. The parameter settings of DAN and JAN follow [31,32]. CORAL aims to decrease the distribution discrepancy between the source and target domains by aligning their second-order statistics.
DANN is a deep adversarial model, which trains a feature extractor to learn features from the input data and a domain discriminator to distinguish the source and target domains. DANN attempts to reduce the discrepancy between the marginal distributions of the source and target features.
The structure and hyperparameters of a CNN-based fault diagnosis model directly determine its classification and diagnosis performance, so it would be unfair and difficult to judge which method is better if the networks differed. Therefore, in this paper, the five compared deep models share the same CNN [...]

Require 1: source dataset $X_s$, the size of mini-batch $n$, the pre-training epochs Num1, learning rate of feature extractor $\alpha_1$, learning rate of fault model classifier $\alpha_2$.
Require 2: target dataset $X_t$, the adversarial learning epochs Num2, learning rate of domain critic $\alpha_3$, domain critic training steps $n_d$.
Pre-training:
1: Initialize the parameters of $\theta_e$ and $\theta_c$.
2: for i = 1, 2, 3, ⋯, Num1 do
[...]
As mentioned before, DAN, JAN, CORAL, and DANN are TL methods; each is pretrained for 60 epochs in the source domain before its transfer strategy is activated, and the maximum number of epochs is 360. We use minibatch Adam as the optimizer, and the minibatch size is set to 64 according to the size of the input samples. The learning rates $\alpha_1$, $\alpha_2$, and $\alpha_3$ of all the neural networks are initially set to 0.001 with a step decay (multiplied by 0.1) at epochs 200 and 300, which not only guarantees the speed of convergence but also avoids oscillation near the optimal point. The balancing coefficient $\rho$ is set to 5, while $\lambda$ is set to 1. All the experiments are executed using PyTorch 1.3 with a GPU.
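The step-decay schedule described above can be written as a small function. This is a sketch under the assumption that epochs are counted from 0 and that the decay takes effect at the start of epochs 200 and 300; the function name and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=0.001, milestones=(200, 300), gamma=0.1):
    """Step decay: multiply the base rate by gamma at each milestone reached."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= gamma
    return lr


for epoch in (0, 199, 200, 299, 300, 359):
    print(epoch, learning_rate(epoch))
```

In PyTorch, the same behavior is typically obtained with a multi-step learning rate scheduler attached to the optimizer rather than a hand-written function.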

Data Description and Preprocessing.
The CWRU bearing dataset is provided by the Electrical Engineering Laboratory of Case Western Reserve University [48] and has become one of the most well-known public datasets in the field of fault diagnosis. Many studies have conducted experiments on CWRU to compare the effectiveness of different diagnosis algorithms [36]. The test rig of CWRU is shown in Figure 6.
Following most published articles, we use the vibration data collected from the drive end at a sampling frequency of 12 kHz. One healthy condition and three kinds of single-point bearing faults, namely inner race fault (IF), outer race fault (OF), and ball fault (BF), are separated into 10 categories (one normal category and 9 [...] Table 3. In this paper, each working condition represents a task, so 0 ⟶ 1 denotes that the source domain contains data sampled at 0 hp/1797 rpm and the target domain contains data collected at 1 hp/1772 rpm. In total, there are 12 transfer scenarios in CWRU.
The data in CWRU are 1-D time series. We split them into slices without any overlap, and each slice forms a sample of length 1024. If the length of the raw signal is $L$ ($L \geq 120000$), we obtain $N$ samples, where $N = \lfloor L/1024 \rfloor$. We randomly select 80% of the samples as the training dataset and use the remaining 20% as the test dataset. The feature extractor directly processes the raw time-series slices, avoiding labor-intensive manual feature extraction.
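The slicing and splitting procedure can be sketched as follows. The helper names, the fixed random seed, and the synthetic all-zero signal are assumptions for illustration; real use would load a vibration record instead.

```python
import random


def slice_signal(signal, length=1024):
    """Cut a 1-D signal into non-overlapping slices; the remainder is discarded."""
    count = len(signal) // length
    return [signal[i * length:(i + 1) * length] for i in range(count)]


def split_samples(samples, train_ratio=0.8, seed=0):
    """Randomly split samples into training and test subsets."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = int(len(samples) * train_ratio)
    train = [samples[i] for i in indices[:cut]]
    test = [samples[i] for i in indices[cut:]]
    return train, test


signal = [0.0] * 120000          # placeholder for a raw vibration record
samples = slice_signal(signal)   # floor(120000 / 1024) = 117 slices
train, test = split_samples(samples)
print(len(samples), len(train), len(test))
```

With a record of 120000 points, 117 slices are produced and the last 192 points are dropped, which matches the floor operation in the text.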
Input normalization, which limits the input data to a certain range, can speed up the training procedure and is therefore vital for deep models. We use Z-score normalization to process the input data, formulated as

$$x_i' = \frac{x_i - \mu}{\sigma} \tag{11}$$

where $x_i$ is the input data and $\mu$ and $\sigma$ represent the mean value and standard deviation of $x_i$, respectively.
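A minimal sketch of Z-score normalization for one sample is given below. It assumes the population standard deviation (dividing by the number of points) and per-sample statistics; the paper does not state which variant is used, and the function name is illustrative.

```python
import math


def z_score(x):
    """Shift a sample to zero mean and scale it to unit standard deviation."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    if sigma == 0.0:
        raise ValueError("constant signal cannot be normalized")
    return [(v - mu) / sigma for v in x]


normalized = z_score([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

After normalization, slices recorded at different amplitudes share a common scale, which helps gradient-based training converge faster.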

Result Analysis.
In this paper, we use the classification accuracy defined below to compare the performance of different algorithms:

$$\text{accuracy} = \frac{|\{x : x \in D_t' \wedge \hat{y}(x) = y(x)\}|}{|\{x : x \in D_t'\}|} \tag{12}$$

where $D_t'$ represents the test dataset of the target domain, and $\hat{y}(x)$ and $y(x)$ are the predicted and true labels of sample $x$. We carry out the experiments six times and report the mean accuracy of each algorithm to reduce the effect of randomness. The classification accuracies of the 12 transfer scenarios in CWRU are provided in Table 4, from which we can draw the following conclusions:
(1) The proposed WDATL approach outperforms the other compared methods over the 12 transfer diagnosis tasks, with a mean diagnosis accuracy of 99.91%
(2) The 1-D convolutional neural network-based feature extractor proposed in this paper has strong feature extraction and generalization capabilities even without any TL. Therefore, the CNN method can reach high diagnosis accuracy in the target domain by directly applying the network trained in the source domain. For example, for the transfer task 1 ⟶ 2, the diagnosis accuracy on the target domain is as high as 99.87% using the CNN method alone

The radar diagram of diagnosis accuracy for different algorithms in the 12 transfer tasks is shown in Figure 7.
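The accuracy metric and the averaging over repeated runs can be sketched as follows. The function names and the label values are hypothetical and serve only to show the computation.

```python
def accuracy(predicted, true):
    """Fraction of samples whose predicted label equals the true label."""
    if len(predicted) != len(true) or not true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, t in zip(predicted, true) if p == t)
    return correct / len(true)


def mean_accuracy(runs):
    """Average accuracy over repeated experiments."""
    return sum(runs) / len(runs)


# Hypothetical predictions on four target-domain test samples.
single_run = accuracy([0, 1, 2, 2], [0, 1, 2, 1])
print(single_run)
print(mean_accuracy([1.0, 0.5]))
```

Reporting the mean over several runs, as done in the experiments, reduces the influence of random initialization and of the random train/test split.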

Visualization Analysis.
The WDATL algorithm proposed in this paper converges well. The loss functions of the domain critic $l_d$ and the feature extractor $l_e$ during training on the transfer task 3 ⟶ 0 are shown in Figure 8.
To further demonstrate the superior fault diagnosis performance of the WDATL under various working circumstances, the confusion matrix and t-SNE [49] are used to visualize the results. The confusion matrices of the test data in task 3 ⟶ 0 for the different methods are detailed in Figure 9.
The coordinates in Figure 9 indicate the different health states of the bearings. Figures 9(a), 9(c), 9(e), 9(g), 9(i), and 9(k) show the confusion matrices of the CNN, DAN, JAN, CORAL, DANN, and WDATL algorithms on the source test data, respectively. From Figure 9, we can conclude that all six methods reach an accuracy of 100% when diagnosing the source test dataset, verifying that the feature extractor based on the proposed improved 1-D convolutional neural network can extract effective feature representations. Figures 9(b), 9(d), 9(f), 9(h), 9(j), and 9(l) show the diagnosis results on the target test dataset. DAN erroneously classifies the 14 mils inner race fault as the normal state, indicating that the MMD-based method cannot distinguish these two types of data well. The classification accuracies of adversarial learning methods such as DANN and WDATL are significantly higher than those of the other algorithms. Among them, WDATL misclassifies one 21 mils ball fault sample as a 7 mils ball fault and one 21 mils ball fault sample as a 14 mils ball fault. Figure 9 demonstrates that the WDATL algorithm achieves the highest diagnosis accuracy in the target domain among all the compared methods.
To understand how the 1-D convolutional neural network processes the raw time-series signals, the features extracted by the feature extractor are visualized with t-SNE, as shown in Figure 10. The different colors in Figure 10 denote the different health states of the samples. Taking the transfer task 3 ⟶ 0 as an example, Figures 10(a), 10(c), 10(e), 10(g), 10(i), and 10(k) show the features learned by the source feature extractor on the source training datasets, and all six algorithms can distinguish the features well. For comparison, Figures 10(b), 10(d), 10(f), 10(h), 10(j), and 10(l) show the features learned by the target feature extractor on the target training datasets. CNN-based algorithms cannot distinguish category 2 from category 5 and category [...], and the distribution of features in each category is relatively scattered. Therefore, the CNN-based method diagnoses the source dataset well but performs poorly in the target domain. The WDATL algorithm divides the 10 categories into 10 nonoverlapping clusters that are relatively far apart, which makes them easier for the classifier to separate.

Data Description.
To further verify the performance of our proposed WDATL method, the aforementioned deep models have been applied to the SEU dataset [50] under various working circumstances. The SEU dataset provided by Southeast University includes a gear dataset and a bearing dataset. In our experiment, we use the dataset collected from channel 2. There are five fault models for each subdataset, one normal state and four failure states. Totally, there are nine health states in the SEU dataset. The description of the SEU dataset is depicted in Table 5. There are only two kinds of working circumstances denoted as 0 and 1 in the SEU dataset; therefore, there exist two transfer tasks 0 ⟶ 1 and 1 ⟶ 0.
We preprocess the data in the SEU dataset with the same method described in Section 4.2.1.

Comparative Analysis.
We again use the diagnosis accuracy to compare the performance of different methods in the transfer tasks. The diagnosis accuracies under different transfer tasks using various algorithms are listed in Table 6. The accuracies of the CNN-based method for the two transfer tasks are relatively low (33.65% on average), indicating that the distribution discrepancies between the source and target domains, caused by different working speeds and loads, are large. DAN, JAN, CORAL, DANN, and WDATL all improve the diagnosis accuracy through different transfer strategies; among them, WDATL performs best, raising the average accuracy over the two transfer tasks from 33.65% to 76.17%.

[...] discrepancies. However, the CNN could not separate the target features well, which explains why the CNN model performs poorly in diagnosing the target training data. The DAN method with MMD-based domain adaptation brings a slight improvement in diagnosing the target training data. The proposed WDATL method separates the target training data well, which demonstrates that the WDATL learns the most discriminative features and improves the diagnosis accuracy significantly.

Conclusion
In this article, a novel WGAN-GP-based deep adversarial transfer learning (WDATL) model is proposed to decrease the distribution discrepancy between source and target domains that results from different working conditions in aerospace fault diagnosis. Firstly, WDATL is pretrained in the source domain with sufficient labeled data to initialize the parameters of the feature extractor and fault model classifier. Then, domain-invariant and fault model discriminative feature representations are learned through adversarial training, in which the domain critic minimizes the Wasserstein distance between the source and target feature distributions. Lastly, the fault model classifier can be generalized to the target domain to diagnose faults with unlabeled or insufficiently labeled data. Extensive experiments conducted on two open datasets demonstrate that the proposed WDATL algorithm achieves better diagnosis performance than the other baseline approaches on transfer diagnosis tasks under diverse working conditions. In the future, we will concentrate on a more challenging and practical transfer scenario, i.e., single source and multiple targets in aerospace fault diagnosis.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.