A Hybrid Matching Network for Fault Diagnosis under Different Working Conditions with Limited Data

Intelligent fault diagnosis methods based on deep learning have achieved much progress in recent years. However, two major factors cause serious performance degradation of these algorithms in real industrial applications: limited labeled training data and complex working conditions. To address these problems, this study proposed a domain generalization-based hybrid matching network that uses a matching network to diagnose faults from features encoded by an autoencoder. The main idea was to regularize the feature extractor of the network with an autoencoder in order to reduce the risk of overfitting with limited training samples. In addition, a training strategy applying dropout with randomly changing rates to the inputs was implemented to enhance the model's generalization to unseen domains. The proposed method was validated on two different datasets containing artificial and real faults. The results showed that the proposed method achieved considerable performance on cross-domain tasks with limited training samples.


Introduction
Mechanical fault diagnosis plays a significant role in modern industry. Failures of machines are likely to result in an entire mechanical system collapse and production line downtime, as well as serious economic losses. Timely and accurate fault diagnosis has become an indispensable technology in modern industries to ensure the safe and reliable operation of mechanical systems [1][2][3].
Recently, deep learning has achieved considerable progress in computer vision [4,5], speech and natural language processing [6], product defect detection [7], and road planning [8]. Expectedly, an increasing number of researchers have applied deep learning techniques to fault diagnosis and proposed intelligent fault diagnosis methods [9][10][11][12][13][14][15][16]. Hasan et al. [17] proposed an explainable AI-based model for bearing fault diagnosis. Sun et al. [18] developed a sparse autoencoder-based deep neural network for the fault diagnosis of induction motors, which realized accurate fault prediction. Li et al. [19] designed a two-layer Boltzmann machine to develop representations of the statistical parameters of the wavelet packet transform for gearbox fault diagnosis. Ding et al. [20] applied a deep convolutional neural network (CNN) using wavelet packet energy as the input to develop a bearing fault diagnosis system, with which they obtained reasonable fault detection performance. Zhang et al. [21] proposed a deep learning method that uses raw temporal signals as input and achieved high accuracy under noisy conditions. Qiao et al. [22] built a dual-input model based on a CNN and a long short-term memory network and achieved satisfactory antinoise performance and load adaptability. These deep learning methods have replaced the traditional, time-consuming, and unreliable manual analysis, considerably improving the efficiency of fault diagnosis [23][24][25][26][27][28].
Traditional deep learning methods can only achieve satisfactory results when the training set (source domain) and the test set (target domain) follow the same data distribution. In practical applications, however, due to the complexity of the working conditions of mechanical systems (load, motor speed, etc.), the training set and the test set may have distinct distributions, which greatly degrades the predictive performance of deep learning models. To face this challenge, several transfer learning algorithms have been proposed to enhance the domain adaptability of the model. Zhang et al. [21] presented a novel deep learning algorithm to alleviate the performance degradation of intelligent fault diagnosis under noisy environments and different working loads. Yao et al. [29] designed a new model based on a stacked inverted residual convolutional neural network to ensure the accuracy of the model in noisy environments. Hu et al. [30] proposed a data augmentation algorithm and presented a self-adaptive neural network to boost the model's generalization ability. Lu and Yin [31] developed a transferable common feature space mining algorithm to extract common features from multidomain data. Wu et al. [32] constructed a few-shot transfer learning method for variable working conditions. Wei et al. [33] proposed multiple source domain adaptation methods to extract condition-invariant features for fault diagnosis.
Aside from the obstacle posed by cross-domain tasks, a limited training set is another challenge that restricts the practical application of deep learning fault diagnosis algorithms. Most deep learning methods require a large amount of labeled data for model training. However, in actual industrial scenarios, collecting a huge amount of labeled data for every type of failure under each working condition poses a considerable challenge. To address this problem, several studies on mechanical fault diagnosis with limited labeled training data have been conducted. Wang et al. [34] presented an integrated fault prognosis and diagnosis method for the predictive maintenance of turbine bearings, which achieved reasonable performance with limited labeled data. Zhang et al. [35] applied the few-shot approach to fault diagnosis and designed an artificial neural network based on a Siamese network, achieving interesting results with limited data. Li et al. [36] designed a meta-learning fault diagnosis (MLFD) framework using model-agnostic meta-learning, which performed excellently under complex working conditions. Hang et al. [37] applied a two-step clustering algorithm and principal component analysis to improve classification performance on unbalanced high-dimensional data. Li et al. [38] proposed a deep, balanced domain adaptation neural network, which achieved satisfactory results with limited labeled data. Duan et al. [39] proposed a novel deep learning-based support vector data description method for unbalanced datasets.
As two important research directions in fault diagnosis, improving a model's generalization to new domains and improving its performance with limited training samples have each made good progress. However, studies combining these two directions are relatively rare. In this study, to achieve domain generalization with limited training samples, we proposed a hybrid matching network (HMN), designed by connecting a prototypical network to the bottleneck of an autoencoder, for fault diagnosis on unseen domains with limited training samples.
Our model mainly consists of two parts: (1) the autoencoder, which regularizes the feature extractor of the model to reduce the risk of overfitting, and (2) the matching network, which measures sample similarity. In addition, a novel strategy is implemented in the training process to improve the model's domain generalization.
The main contributions of this study can be summarized as follows: (1) A novel fault diagnosis method based on a matching network and an autoencoder, known as HMN, was proposed for cross-domain scenarios. In these tasks, the model was trained on the source domain with limited data and tested on unseen target domains without access to their distributions. (2) Dropout on the input layer with randomly changing rates was employed to improve the generalization ability of the model, and an autoencoder was built to reduce the risk of overfitting with limited training samples by regularizing the feature extractor of the network. The rest of the paper is organized as follows. Autoencoders and prototypical networks are introduced in Section 2. Section 3 describes the proposed method in detail. Section 4 presents the experiments, results, and discussion. Finally, the conclusions are drawn in Section 5.

Autoencoder and Prototypical Network
2.1. Autoencoder. An autoencoder is an unsupervised learning method that uses a neural network to perform representation learning. Specifically, the network architecture imposes a bottleneck layer that forces the network to learn a compressed representation of the original input.
As shown in Figure 1, the autoencoder is mainly composed of two parts: an encoder and a decoder. The encoder function, denoted f_θ, computes a feature vector h = f_θ(x) from an input vector x. The dimension of h is usually lower than that of x. Another parameterized function g_θ, known as the decoder, maps the feature vector back to the input space, generating a reconstruction x̂ = g_θ(h).
A simplified autoencoder can be represented as a fully connected neural network with three layers, i.e., an input layer, a bottleneck layer, and an output layer. The parameter sets of the encoder and the decoder are trained simultaneously on the task of reconstructing the input as well as possible, i.e., minimizing a reconstruction error L(x, x̂), usually the mean squared error (MSE) over the training examples. For a training set {x^(i)}_{i=1}^n, the MSE reconstruction error is expressed as

L_MSE = (1/n) Σ_{i=1}^{n} ||x^(i) − x̂^(i)||².

If the input is normalized to [0, 1], the cost function can instead be the binary cross-entropy:

L_BCE = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{m} [x_j^(i) log x̂_j^(i) + (1 − x_j^(i)) log(1 − x̂_j^(i))],

where x_j^(i) and x̂_j^(i) represent the j-th element of x^(i) and x̂^(i), respectively, and n and m represent the batch size and the dimension of x, respectively.
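The reconstruction objective above can be sketched in a few lines of PyTorch. The layer sizes below (`dim=16`, `hidden=4`) are illustrative assumptions, not the paper's architecture; the point is only that the encoder and decoder are trained jointly on the MSE reconstruction error.

```python
import torch
import torch.nn as nn

# Minimal fully connected autoencoder with a bottleneck, as in Figure 1.
# Layer sizes are illustrative, not the paper's architecture.
class TinyAE(nn.Module):
    def __init__(self, dim=16, hidden=4):
        super().__init__()
        self.encoder = nn.Linear(dim, hidden)   # h = f_theta(x)
        self.decoder = nn.Linear(hidden, dim)   # x_hat = g_theta(h)

    def forward(self, x):
        return self.decoder(torch.relu(self.encoder(x)))

ae = TinyAE()
x = torch.rand(8, 16)                 # batch of 8 inputs normalized to [0, 1]
loss = nn.MSELoss()(ae(x), x)         # reconstruction error L(x, x_hat)
loss.backward()                       # trains encoder and decoder jointly
```

Because `hidden < dim`, the bottleneck forces a compressed representation, which is the regularizing effect exploited later in the HMN.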
By penalizing the reconstruction error, the network learns the most important attributes of the input data, i.e., how best to reconstruct the input from the feature vector.

2.2. Prototypical Networks. Prototypical networks [40] were proposed for few-shot learning, which requires only a small amount of training data with limited information, in contrast to traditional machine learning methods that need large amounts of data to achieve good results. As shown in Figure 2, classification is achieved by comparing the distances of a query to the mean representation of each class in the metric space produced by the network.

Specifically, in a few-shot task, given a support set of M labeled samples S = {(x_1, y_1), ..., (x_M, y_M)}, where y_i ∈ {1, ..., K} is the class label and S_k denotes the subset of S labeled with class k, a representation c_k, or prototype, of each class is computed by averaging the support points belonging to class k:

c_k = (1/|S_k|) Σ_{(x_i, y_i) ∈ S_k} f_θ(x_i),

where f_θ is an embedding function with learnable parameters θ. Given a distance function d: R^D × R^D → [0, +∞), the prototypical network computes a distribution for a query point x_q over its distances to the prototypes of all classes in the metric space:

p_θ(y = k | x_q) = exp(−d(f_θ(x_q), c_k)) / Σ_{k′} exp(−d(f_θ(x_q), c_{k′})).

The network is trained by minimizing L(θ) = −log p_θ(y = k | x_q), the negative log-probability of the true class k.
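The prototype computation and the softmax over negative distances can be sketched as follows; this is a minimal illustration of the equations above with toy 2-D embeddings, not the paper's full pipeline, and the helper name `prototype_log_probs` is our own.

```python
import torch
import torch.nn.functional as F

def prototype_log_probs(support, support_y, query, n_class):
    """Class prototypes c_k and query log-probabilities log p(y = k | x_q).

    support: (M, D) embedded support points f_theta(x_i); query: (Q, D).
    A minimal sketch of the prototypical-network equations.
    """
    # c_k: mean of the embedded support points of class k
    protos = torch.stack([support[support_y == k].mean(0) for k in range(n_class)])
    d = torch.cdist(query, protos) ** 2          # squared Euclidean d(., c_k)
    return F.log_softmax(-d, dim=1)              # softmax over negative distances

# toy embeddings: two support points per class, three classes
support = torch.tensor([[0.0, 0.0], [0.2, 0.0],
                        [1.0, 1.0], [1.2, 1.0],
                        [2.0, 2.0], [2.2, 2.0]])
support_y = torch.tensor([0, 0, 1, 1, 2, 2])
query = torch.tensor([[0.1, 0.05]])              # close to class 0's prototype
log_p = prototype_log_probs(support, support_y, query, 3)
loss = -log_p[0, 0]                              # L(theta) = -log p(y = 0 | x_q)
```

In the HMN, `support` and `query` would be the encoder outputs rather than raw coordinates.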

Methods
The proposed HMN for fault diagnosis is described in detail in this section. As shown in Figure 3, our model has a one-input, two-output configuration. One output is the reconstruction of the input, and the other is the prediction of the health condition by the prototypical network. The details of the model are given in Table 1.

Data Preprocessing.
The proposed model used the short-time spectrogram as a 2D input. First, as shown in Figure 4, samples were generated by a sliding window of 2048 points. Second, the short-time Fourier transform (STFT) slid a fixed-length nonzero window function along the time axis, truncating the source signal into segments of equal length. Assuming these segments are stationary, the Fourier transform can be used to obtain the local frequency spectrum of each segment. Finally, these local frequency spectra were recombined along the time axis to obtain a 2D time-frequency graph. The STFT is given by equation (5):

X(τ, ω) = ∫ x(t) g(t − τ) e^{−jωt} dt,

where x(t) is the original time-domain signal and g(t − τ) is the window function centered at time τ. In this study, the Hann window was used. To speed up the convergence of the model, we converted the 2D spectrogram into a grayscale image with values between 0 and 1. This process can be expressed as

X′(τ, ω) = (|X(τ, ω)| − |X(τ, ω)|_min) / (|X(τ, ω)|_max − |X(τ, ω)|_min),

where |X(τ, ω)| is the element magnitude and |X(τ, ω)|_min and |X(τ, ω)|_max represent the minimum and maximum magnitudes, respectively. Finally, the normalized spectrogram X′(τ, ω) was compressed into a 64×64 time-frequency graph as the input of the model.
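The preprocessing above can be sketched in NumPy. The window length and hop size below are illustrative assumptions (the paper's exact STFT parameters are not stated here), and the final compression to 64×64 is omitted; the sketch only shows windowing with a Hann window, per-segment Fourier transform, and min-max normalization to [0, 1].

```python
import numpy as np

def spectrogram_input(x, win_len=128, hop=32):
    """Short-time Fourier magnitude spectrogram, min-max normalized to [0, 1].

    A minimal sketch of the preprocessing described above; win_len and hop
    are illustrative choices, and the 64x64 resize is omitted.
    """
    win = np.hanning(win_len)                        # Hann window g(t - tau)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * win
                       for i in range(n_frames)])    # truncate into segments
    mag = np.abs(np.fft.rfft(frames, axis=1))        # |X(tau, omega)|
    # min-max normalization to a [0, 1] grayscale image
    return (mag - mag.min()) / (mag.max() - mag.min() + 1e-12)

x = np.sin(2 * np.pi * 50 * np.arange(2048) / 12000)  # toy 50 Hz tone
img = spectrogram_input(x)                            # rows = time, cols = frequency
```

With a 2048-point sample, 128-point window, and 32-point hop, `img` has 61 time frames and 65 frequency bins; a real pipeline would then resize this to 64×64.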

Random Dropout on Input.
Dropout is a technique proposed in [41] to prevent deep neural networks from overfitting. The key idea is to randomly deactivate units, along with their connections, with probability p during training, preventing units from co-adapting too much. Applying dropout amounts to sampling a "thinned" network from the original one during training. During the testing phase, dropout is disabled, which can be seen as averaging the predictions of many "thinned" networks. Networks trained with dropout usually generalize much better on supervised learning tasks. Deactivated units affect all downstream units in the network, so dropout applied in the lower layers can also be seen as providing noisy inputs to the higher layers; it can thus be interpreted as a form of data augmentation by noise injection.
Adding noise with a fixed distribution was not enough. Inspired by [21], we randomly changed the dropout rate during training so that the injected noise itself varied. Specifically, in each training batch, the dropout rate was a random value between 0.1 and 0.9. The operation is visualized in Figure 5.
The operation can be written as

r_i ~ Bernoulli(p),  x̃ = r * x,

where * denotes the elementwise product, r is a vector whose elements r_i follow independent Bernoulli random variables with probability p, and x and x̃ are the raw input and the perturbed output, respectively.
The purpose of applying dropout to the input layer was to add masking noise to the input, making the model insensitive to disturbances and improving its domain generalization.
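The random-rate input dropout can be sketched as follows. The helper name `random_rate_dropout` is our own, and no rescaling of the kept units is applied since the paper does not mention one; at test time the input would be passed through unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rate_dropout(x, rng):
    """Input-layer dropout with a per-batch rate drawn from Uniform(0.1, 0.9).

    Sketch of the training-time masking described above (hypothetical helper);
    at test time the input is passed through unchanged.
    """
    p = rng.uniform(0.1, 0.9)           # this batch's drop probability
    r = (rng.random(x.shape) >= p)      # mask: element dropped with probability p
    return x * r                        # elementwise product r * x

batch = np.ones((8, 64, 64))            # toy batch of normalized spectrograms
masked = random_rate_dropout(batch, rng)
```

Drawing a fresh `p` per batch is what distinguishes this from standard dropout: the model sees masking noise of varying intensity across batches.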

Feature Extraction.
To make full use of unlabeled information, an autoencoder was designed for feature extraction. In the encoding stage, the 2D time-frequency images first passed through a set of 2D convolutional layers, which capture the localized features of the image well owing to their translation invariance. To obtain more diverse features at the same feature level, the weights of each convolutional layer were designed as a series of 2D filters. In the forward pass, each filter convolves independently across the input feature map, producing one channel of the layer's output. In general, the computation of convolutional layer l is expressed as

Z_c^l = f^l(Σ_i Z_i^{l−1} * W_{i,c}^l + b_c^l),

where the * operator denotes the convolution of channel i of the feature matrix Z^{l−1} with the kernel W_{i,c}^l, producing the feature map Z_c^l of the c-th channel of layer l; b_c^l is the bias of the c-th channel in layer l; and f^l(·) is a nonlinear activation function, ReLU in this study, applied to the output. The encoder and decoder were designed symmetrically. To reconstruct the bottleneck coding to the same size as the input time-frequency image, transposed convolution layers were used in the decoder to upsample the feature maps. Following [42], the encoder contained four convolution layers and two fully connected layers, while the decoder contained four transposed convolution layers and two fully connected layers.
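A symmetric convolutional encoder/decoder of this kind can be sketched in PyTorch. The paper uses four convolution layers and two fully connected layers per side; only two convolution layers per side are shown here to keep the example short, and the channel counts and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Symmetric convolutional encoder/decoder (sketch; shallower than the paper's).
encoder = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1),   # 64x64 -> 32x32
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),  # 32x32 -> 16x16
    nn.ReLU(),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1),  # upsample x2
    nn.ReLU(),
    nn.ConvTranspose2d(8, 1, kernel_size=4, stride=2, padding=1),   # back to 64x64
    nn.Sigmoid(),   # outputs in [0, 1], matching the normalized input
)

x = torch.rand(2, 1, 64, 64)          # batch of time-frequency images
x_hat = decoder(encoder(x))           # reconstruction from the coding
```

The bottleneck feature map produced by `encoder` is what the prototypical branch would embed and compare in the metric space.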

Training of the Proposed Model.
The two outputs of the model correspond to two losses: the reconstruction loss L_r computed by the autoencoder and the classification loss L_c computed by the prototypical network. Both are minimized during training. The total training loss is

L = L_c + αL_r,

where the hyperparameter α is a weight coefficient used to balance the two losses. During training, the network is optimized with the Adam optimizer, which adapts the learning rate for each parameter. The steps of the proposed training procedure are listed in Algorithm 1.
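One joint optimization step on the combined loss can be sketched as follows. The two losses here are stand-in quadratic functions for illustration only; in the real model, L_c would come from the prototypical branch and L_r from the autoencoder branch.

```python
import torch

# One joint Adam step on L = L_c + alpha * L_r (stand-in losses for illustration).
theta = torch.randn(4, requires_grad=True)   # toy model parameters
opt = torch.optim.Adam([theta], lr=1e-4)
alpha = 0.5                                  # weight of the reconstruction loss

L_c = (theta ** 2).sum()                     # stand-in classification loss
L_r = ((theta - 1) ** 2).sum()               # stand-in reconstruction loss
loss = L_c + alpha * L_r

opt.zero_grad()
loss.backward()                              # grad = dL_c/dtheta + alpha * dL_r/dtheta
opt.step()                                   # Adam update with adaptive rates
```

Because both losses are summed before `backward()`, the gradient through the shared encoder automatically mixes the classification and reconstruction signals, which is how the autoencoder regularizes the feature extractor.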

Experiment Description.
To verify the validity of our method, experiments are carried out on two bearing datasets: the Case Western Reserve University (CWRU) bearing dataset [43] and the Paderborn bearing dataset [44]. We assume the source domain contains limited labeled samples and use 6, 10, 15, 50, 100, 200, 300, 500, and 600 training samples per class to test the performance of the proposed method. Five-fold cross-validation is applied in the experiments. The test platform runs Ubuntu 18.04 with Python 3.6 and PyTorch on an Intel® Core™ i7-9750H CPU and an Nvidia GTX 1080Ti GPU.

Comparison Methods and Evaluation Metrics.
To verify the advantages of the proposed model, several popular models, listed in Table 2, are compared: three with time-series inputs (Siamese-based CNN [35], PSDAN [45], and WDCNN [46]) and three with time-frequency inputs (SCNN, HCAE [42], and DeIN [47]). The Siamese-based CNN was designed in [35]. PSDAN is an adversarial domain adaptation method. WDCNN, which uses a wide convolution kernel at the front of the network, was proposed in [46]. DeIN was proposed in [47]. SCNN is a common CNN that shares the structure of the HMN encoder, followed by a softmax layer. HCAE was proposed in [42]. The HMN model was proposed by our team. All models are trained on the source domain and tested on the unseen target domain. For the sake of fair comparison, the hyperparameters of all models are carefully selected.

Several evaluation indicators are used to assess the performance of the proposed model: (1) accuracy, (2) precision, (3) F1 score (F1), and (4) average F1 score (αF1). Precision, F1, and αF1 are obtained as follows:

Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F1 = 2 × Precision × Recall / (Precision + Recall),
αF1 = (1/K) Σ_{k=1}^{K} F1_k,

where TP, FP, and FN represent true positives, false positives, and false negatives, respectively, and F1_k is the F1 score of class k among the K classes.
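These metrics can be computed directly from per-class counts; the sketch below uses hypothetical (TP, FP, FN) counts purely to illustrate the formulas.

```python
def per_class_f1(tp, fp, fn):
    """Precision, recall, and F1 from one class's counts (sketch)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# hypothetical (TP, FP, FN) counts for two classes
counts = [(20, 5, 0), (18, 2, 7)]
f1s = [per_class_f1(*c)[2] for c in counts]
macro_f1 = sum(f1s) / len(f1s)        # the averaged F1 score over classes
```

A production pipeline would also guard the divisions against zero denominators when a class has no predicted or true positives.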

Data Description.
In the CWRU bearing datasets [43], the 12k drive-end fault data were selected as the original experimental data. Four health types, i.e., normal, ball fault, inner race fault, and outer race fault, appear in these data, as shown in Table 3. Each fault type had three subtypes by damage size, i.e., 0.007, 0.014, and 0.021 inches; thus, there were 10 fault types in total. Signals of all fault types are shown in Figure 6. Each fault type was collected under three loads, i.e., 1, 2, and 3 hp (motor speeds of 1772, 1750, and 1730 RPM), as illustrated in Table 4. During data collection, each sample was taken from a vibration signal, as shown in Figure 7. Half of the signals were used to generate the training data, and the remaining signals were used to generate the test set. As shown in Figure 4, the training samples were generated using a 2048-point sliding window with an 80-point step; the test samples passed through a sliding window of the same size but without overlapping.

Algorithm 1: The proposed training algorithm.
Initialize: weight coefficient α = 0.5, batch size 8, learning rate η = 0.0001, 300 epochs.
for n = 0, ..., epoch do
  for i = 0, ..., steps do
    input a batch of samples from the source domain
    randomly sample p from Uniform(0.1, 0.9)
    apply dropout on the inputs with rate p
    compute prototypes c_k
    L = L_c + αL_r, θ ← θ − η(∂L_c/∂θ + α∂L_r/∂θ)
  end for
end for

Dataset composition (per class, 10 classes each):
Dataset A (load 1 hp): 600 training samples, 25 test samples
Dataset B (load 2 hp): 600 training samples, 25 test samples
Dataset C (load 3 hp): 600 training samples, 25 test samples
We set the data under different working conditions as the experimental data. Datasets A, B, and C correspond to working conditions with loads of 1, 2, and 3 hp, respectively. Each dataset contained 6000 training samples and 250 test samples. Figure 8 illustrates the accuracy of all methods when trained with various numbers of samples. HMN is evidently superior to the other approaches. The cross-domain task C to A is the most difficult: even with sufficient training samples, the accuracy of four of the compared methods does not reach 90%, yet the proposed model still achieves satisfactory results.

Results and Analysis.
The results of training with 6 samples per class were examined first. The classification accuracies of the cross-domain tasks are shown in Table 5. HMN achieved the best performance among all methods in all scenarios. Specifically, HMN achieved an accuracy of 92.65% on C-A, which was 34.61%, 21.57%, 26.38%, 19.32%, 40.21%, and 27.09% higher than DeIN, the Siamese-based CNN, and the remaining compared methods, respectively. To further evaluate the effectiveness of the proposed method, we examined the effects of the autoencoder and of random dropout on the model's performance through the loss curves. Figures 9 and 10 show the loss curves on cross-domain task C-A with 6 training samples per class.

As shown in Figure 9, the training loss comprises the reconstruction loss L_r and the classification loss L_c, following equation (9), while the testing loss is the classification loss L_c only. According to equation (9), when α is set to 0, the autoencoder is disabled; a greater α gives the autoencoder a higher weight during training. As α increases from 0 to 0.2, the testing loss converges to a smaller value, and the convergence is smoother when α equals 0.5. This demonstrates that the autoencoder branch can prevent overfitting and improve the model's performance.
As shown in Figure 10, when the HMN does not employ random dropout on the input, the testing loss converges to a value greater than 3; when random dropout is used, it converges to less than 1, and the curve descends more smoothly. This demonstrates the effect of random dropout on the input in improving the model's cross-domain generalization.

As shown in Figure 11, the test rig of the Paderborn bearing dataset [44] consists of five modules: (1) an electric motor, (2) a torque-measurement shaft, (3) a rolling bearing test module, (4) a flywheel, and (5) a load motor. Bearings with different state types were installed in the test module to obtain the experimental data; the fault types come from both artificial and real damage.
In the basic operating condition, the test platform ran at n = 1500 rpm with a load torque of M = 0.7 Nm and a radial force on the bearing of F = 1000 N. The other settings were obtained by changing these parameters one at a time to M = 0.1 Nm and F = 400 N (named D, E, and F, respectively), as shown in Table 8.
The bearings, with 32 different states, were operated under different working conditions: 14 states with natural damage from accelerated lifetime tests, 12 states with artificial damage, and 6 healthy states.
Each bearing under each load setting was measured as a vibration signal of about 4 s at a 64 kHz sampling rate. In the experiment, the datasets contained signals from healthy bearings, artificially damaged bearings, and naturally damaged bearings. All bearings of the different fault types ran under three different loads at a speed of 1500 rpm. The selected dataset filenames are shown in Table 9, and the details of the selected datasets are listed in Table 10. Each dataset contains 1800 training samples and 120 test samples.

Results and Analysis.
With the same implementation, Figure 12 compares our method with the other approaches in terms of accuracy for different numbers of training samples. The results show that our method outperformed the other six state-of-the-art methods in all scenarios. Table 11 shows the cross-domain task accuracies of the different methods with 6 training samples per class; the proposed method outperformed all comparative methods by 6.87%-41.26% on average. Tables 12 and 13 compare the methods in terms of precision, F1, and αF1 on the cross-domain task E-D with 6 training samples per class. These results also show that our method outperforms the alternatives.

Conclusions
A novel HMN was proposed for cross-domain fault diagnosis with limited training samples. We improved the model's diagnostic performance in two ways: (1) a novel deep learning structure combining an autoencoder and a matching network was built; (2) a random dropout strategy that adds random disturbance to the inputs during training was developed to enhance the model's domain generalization. In Section 4, we presented experimental results showing that the proposed method has better domain generalization ability with limited training samples than the state-of-the-art approaches.
However, the method proposed in this study still has some restrictions. For example, it is limited to cross-domain tasks between different working conditions of the same device, whereas cross-domain diagnosis across multiple devices would make intelligent fault diagnosis algorithms more practical. In addition, HMN can only perform classification tasks, limiting the model's potential for multitasking. In future work, we will further optimize HMN and apply it to more complex cross-domain fault diagnosis scenarios and multitask learning.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.