Sequential Network with Residual Neural Network for Rotatory Machine Remaining Useful Life Prediction Using Deep Transfer Learning

Deep learning has a strong feature learning ability, which has proved its effectiveness in fault prediction and remaining useful life (RUL) prediction of rotatory machines. However, training a deep network from scratch requires a large amount of training data and is time-consuming. In practical model training, it is difficult for a deep model to converge when the parameter initialization is inappropriate, which results in poor prediction performance. In this paper, a novel deep learning framework is proposed to predict the remaining useful life of rotatory machines with high accuracy. Firstly, the model parameters and feature learning ability of a pretrained model are transferred to the new network by means of transfer learning to achieve reasonable initialization. Then, the specific sensor signals are converted to RGB images, which serve as the task-specific data to fine-tune the parameters of the high-level network structure. The features extracted from the pretrained network are then input into a Bidirectional Long Short-Term Memory (LSTM) network to obtain the RUL prediction results. The ability of LSTM to model sequence signals and the dynamic learning ability of bidirectional propagation over time information contribute to accurate RUL prediction. Finally, the deep model proposed in this paper is tested on sensor signal datasets of bearings and a gearbox. The high-accuracy prediction results show the superiority of the transfer learning-based sequential network in RUL prediction.


Introduction
Rotatory machines serve as a significant element in mechanical systems, and their working conditions are directly related to production efficiency and production safety [1][2][3]. However, due to complex operating environments and component wear, machine failure is inevitable during practical operation. Once a failure occurs, it may have a negative effect on the whole mechanical system and even threaten the safety and stability of industrial production. Therefore, it is necessary to adopt an effective and efficient predictive maintenance strategy that helps ensure reliable service of the mechanical system [4,5]. Accurate prediction of the remaining useful life (RUL) of rotatory machines contributes to preventing possible failures, as potential faults are identified in advance and removed in a timely manner. Hence, an appropriate maintenance strategy can be designed based on the predicted results, improving system efficiency and quality [6,7].
Continued advancement in intelligent manufacturing has led to ever-increasing attention on system Prognostic and Health Management (PHM) in industry and academia [8]. Traditional fault prediction methods are mainly based on manual extraction of features, which requires prior knowledge. The performance of a traditional prediction model largely depends on the quality and applicability of the hand-crafted features. When the selected features are unsuitable for a given task, the predictive accuracy may fall dramatically [9,10]. In contrast, a data-driven RUL model is able to use historical data directly to build a prediction model without prior knowledge or manual feature extraction, and can therefore model the complex process of mechanical degradation [11][12][13]. Owing to their extraordinary feature learning ability, deep learning-based methods have gained great popularity in industrial applications, as they overcome the limitations of traditional prediction methods. By constructing a deep neural network with multiple hidden layers, the framework is able to learn hierarchical representations directly from the original data [14][15][16]. Deep neural networks automatically extract distinctive representations through model training and therefore obtain accurate prediction results [17][18][19]. By now, deep learning has various successful applications, including computer vision (CV) [20], natural language processing (NLP) [21], speech recognition [22], machine translation [23], and automatic driving [24].
Similarly, deep learning has achieved remarkable results in the field of machine RUL prediction. Deutsch et al. proposed an RUL prediction method for sensor vibration signals, which used the learning and prediction ability of the deep belief network to realize automatic feature extraction and RUL prediction without manual intervention [25]. Qin et al. proposed an attention-based Long Short-Term Memory (LSTM) network which uses attention coefficients to evaluate the importance of intermediate information; its superiority in RUL prediction was verified on a gear dataset [26]. Although deep learning-based methods have been successfully applied to mechanical system degradation modeling, there are still several deficiencies. Firstly, due to limited data availability, training samples are insufficient for adequate training of a deep model. Consequently, the depth of most deep models is limited, which directly affects the final prediction performance [27]. Secondly, as the number and scale of hidden layers increase, the number of model parameters also increases sharply, and training such a model from scratch is time-consuming. Besides, the selection of hyperparameters (learning rate, activation function, loss function, etc.) may also greatly influence the performance of the model. To overcome the difficulty of deep model training, transfer learning is applied.
To accelerate model training, model parameters that have been learned in advance are transferred to the new model, which greatly improves the efficiency of the feature learning procedure [28]. Transfer learning, which provides a reasonable initialization for the target model and simplifies the fine-tuning procedure, has been successfully applied in the field of fault prediction [29]. Sun used a sparse autoencoder to build a deep model pretrained on run-to-failure data with RUL information from a cutting tool. The trained network is then transferred to a new model to achieve accurate RUL prediction [30]. Shao et al. proposed a transfer learning network using the structure and parameters of VGG-16 and achieved accurate fault classification on different mechanical datasets [31].
In order to achieve accurate prediction from sequential sensor signals, deep learning-based sequential models are necessary. The LSTM network is commonly applied to time series and has a strong capability for discovering the underlying variation pattern. LSTM is a variant of the Recurrent Neural Network (RNN) proposed to overcome the problem of information redundancy caused by long inputs, thereby achieving the desired performance in signal prediction applications [32]. Yuan et al. discussed several applications of RNN, LSTM, and Gated Recurrent Unit (GRU) models in aeroengine fault diagnosis, and their experiments showed that the models based on LSTM and GRU had better prediction performance than RNN-based models [33]. Wang et al. designed a residual LSTM network framework to solve the degradation problem of the deep LSTM model and verified the superiority of the structure by experiments [34]. Zhang et al. proposed an LSTM-based approach that models the system degradation process and is able to learn specific patterns from time series [35].
Inspired by this prior research, a novel RUL prediction model based on Bidirectional LSTM and a transfer learning strategy is proposed. The proposed model uses the first two convolution blocks of the Residual Network (ResNet-50) as the sensor signal feature extractor and outputs the predicted RUL values through a Bidirectional LSTM. The main contributions of this paper are summarized as follows:
(1) A novel RUL prediction framework for rotatory machines based on a sequential network is proposed and combined with a transfer learning strategy to improve training efficiency. A network pretrained on the ImageNet dataset serves as the feature extractor at the first stage, and then the higher-level parameters of the whole framework are fine-tuned on specific sensor signals of the rotatory machine, which greatly reduces the difficulty of deep network training and realizes efficient RUL prediction.
(2) For accurate RUL prediction, the Bidirectional LSTM module is combined with the pretrained feature extractor. With the powerful feature learning capability of the CNN and the prediction capability of the Bidirectional LSTM, the RUL value of the rotatory machine can be accurately predicted.
(3) In current research, transfer learning has not been applied to remaining useful life prediction for rotatory machines. Different from the traditional fault classification problem, RUL prediction requires the network to output a specific value rather than a category label, which greatly increases the difficulty of model training. The method proposed in this paper extends the application scope of transfer learning.
The rest of the paper is organized as follows. Section 2 introduces the theoretical background of the proposed method, including convolutional neural networks, residual networks, transfer learning, and Bidirectional LSTM. In Section 3, the overall RUL prediction framework is illustrated in detail. In Section 4, experimental studies are carried out to verify the effectiveness of the proposed model, together with performance comparisons with other methods. Conclusions and future work are presented in Section 5.

Basic Theory of Convolutional Neural Network.
A convolutional neural network (CNN) is a feedforward neural network using local connections and a weight-sharing strategy. The weight-sharing strategy reduces the number of model parameters and effectively improves computing efficiency. Deep convolutional neural networks automatically learn latent abstract features from the input data and achieve accurate prediction by capturing high-level feature representations. A CNN contains three main types of building units: the convolutional layer, the pooling layer, and the fully connected layer. Figure 1 shows a typical CNN architecture.
The convolutional layer performs the convolution operation, and each convolutional layer generates multiple feature maps. The input data are mapped into different feature spaces through various convolutional kernels, where each convolutional kernel represents one particular feature extraction. Within one convolutional layer, only a certain part of the input is connected to the corresponding neuron, which is called local connection. The weights associated with different neurons within the same layer are the same, which is called weight sharing, and the weights and bias shared within each layer form a convolution kernel. This sharing strategy greatly reduces the number of model parameters and helps deep network training. Formula (1) gives the calculation of the feature mapping, where $\alpha_n$ is the characteristic value extracted by the $n$-th convolutional layer, $Z_k$ is the feature mapping value obtained after the activation function, $f(\cdot)$ represents the nonlinear activation function, for which the Rectified Linear Unit (ReLU) is widely used in deep architectures, $x$ is the input image, $W_n$ and $b_n$ represent the weights and bias values, and $\otimes$ is a two-dimensional convolution operation, which performs the dot product of the convolution kernels and the input.
A pooling layer usually follows the convolutional layer to reduce the feature dimension and remove redundant information. By compressing the features appropriately, the computational complexity of the network is reduced and the robustness of feature extraction is improved.
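The convolution, activation, and pooling steps described above can be illustrated with a small sketch. It assumes the common form $\alpha = W \otimes x + b$ followed by $Z = f(\alpha)$ with $f$ = ReLU, which agrees with the symbol definitions but is not quoted from formula (1). The function names, the toy 5 × 5 input, and the 2 × 2 kernel are hypothetical, and, as in most deep learning libraries, the "convolution" is computed as a sliding dot product without flipping the kernel.

```python
def conv2d_valid(x, w, b):
    """Slide kernel w over image x (no padding, stride 1) and add bias b."""
    kh, kw = len(w), len(w[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [
        [
            sum(w[i][j] * x[r + i][c + j] for i in range(kh) for j in range(kw)) + b
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Apply f(v) = max(0, v) element-wise."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool2x2(feature_map):
    """Keep the largest value in each non-overlapping 2 x 2 patch."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]


# Toy input and kernel, chosen only for illustration.
image = [
    [1.0, 2.0, 0.0, 1.0, 3.0],
    [0.0, 1.0, 2.0, 3.0, 1.0],
    [2.0, 1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 3.0, 2.0, 1.0],
    [0.0, 2.0, 1.0, 0.0, 2.0],
]
kernel = [[1.0, 0.0], [0.0, -1.0]]

alpha = conv2d_valid(image, kernel, b=0.5)  # pre-activation values, 4 x 4
z = relu(alpha)                             # feature map after activation
pooled = max_pool2x2(z)                     # reduced 2 x 2 feature map
```

The same kernel weights and bias are reused at every position of the input, which is the weight sharing that keeps the parameter count small.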

ResNet-50.
As the network deepens, adding more hidden layers may not improve performance and may even cause model degradation. Redundant hidden layers may slow model convergence during training and consequently reduce predictive accuracy. To address this issue, the Residual Neural Network (ResNet) was proposed, which adopts a skip-connection strategy to reinforce feature learning ability and therefore allows the depth of the network to be expanded effectively. A skip connection between stacked cells sends useful information directly to the next layer, which is an effective way to avoid vanishing gradients. Figure 2 shows the skip-connection idea of ResNet-50. The output $y$ is expressed as the sum of the input $x$ and a nonlinear transform $F(x)$ of the input, that is, $y = F(x) + x$. Instead of learning features directly from the input $x$, ResNet tries to learn the difference between the expected features and the input $x$, which is called the residual. When the so-called residual approaches 0, the stacked layer performs an identity mapping, which at least ensures that the stacked model will not degrade in performance as the network deepens. In practice, the residual part is rarely zero, and building on the representations learned by previous layers, it helps the model learn new features. Table 1 and Figure 3 show the detailed information of ResNet-50 [36]. ResNet-50 is mainly composed of four residual blocks and one fully connected layer. Each residual block consists of several convolutional layers with different convolution kernel sizes. A convolution operation is performed on the input, and then features are extracted through the different residual blocks. Finally, the fully connected layer outputs the corresponding targets.
Different from a conventional neural network, where the output of the (n-1)-th layer is connected only to the n-th layer as its input, the skip-connection structure of the Residual Network enables the output to cross several layers directly, which alleviates the vanishing gradient problem in the backpropagation process and makes the network easier to train.
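A minimal sketch of the skip connection may make the identity-mapping argument concrete. It works on plain Python lists instead of feature maps, and the names `residual_block`, `zero_transform`, and `small_transform` are hypothetical stand-ins for the stacked convolutional layers of a real residual block.

```python
def residual_block(x, transform):
    """Return y = F(x) + x for a vector x and a transform F of the same size."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x for the identity skip")
    return [f + xi for f, xi in zip(fx, x)]


def zero_transform(x):
    # A residual branch that has learned nothing: F(x) = 0 for every input.
    return [0.0 for _ in x]


def small_transform(x):
    # A toy residual branch: ReLU of a scaled copy of the input.
    return [max(0.0, 0.5 * v) for v in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_transform)  # equals x: the block is an identity mapping
refined_out = residual_block(x, small_transform)  # x plus a learned correction
```

When the residual branch outputs zero, the block passes its input through unchanged, so adding such a block cannot make the stacked model worse than the shallower one.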

Transfer Learning.
For a large deep neural network, training from scratch requires sufficient labeled data, and the training procedure is time-consuming. To overcome this problem, transfer learning is used to improve the training of large models when training data are limited. Overfitting may occur when the training samples are insufficient, thus limiting the generalization ability of the deep model. The transfer learning strategy pretrains the model parameters on a large dataset and transfers the well-trained model to a new specific task.
Through fine-tuning the weight parameters of the higher hidden layers with a small amount of new data, higher-level representations can be obtained. Whether the transfer learning strategy is effective depends on the difference between the data used in pretraining and the data used for fine-tuning. For datasets with similar tasks, only the fully connected layer parameters at the end of the network need to be fine-tuned, without changing the parameters of the rest of the model. For datasets with great differences, most of the convolution block parameters need to be appropriately updated. In general, compared with training from scratch, transfer learning reduces the number of parameters that need to be trained and enables the model to converge faster. Figure 4 shows the process of transfer learning. Model parameters are transferred from the source domain (image data in the figure) to the target domain (vibration signals in the figure), where parameter transfer helps the new model converge quickly. In general, training data in the source domain are sufficient, while training data for the target domain are limited. Based on the similarity between the target domain and the source domain, fine-tuning the corresponding weights and biases in the new model can greatly improve the predictive accuracy when samples are insufficient. The pretrained model leverages a public dataset to learn general features at the lower layers, and then the transfer model extracts abstract features at the higher layers from the limited task-specific data. By applying the model trained on sufficient training data to the target domain for prediction, the similarity between the two tasks can be fully exploited and the model can be trained efficiently.
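The choice of which layers to fine-tune can be sketched as follows. This is only an illustration of freezing and updating parameters: the layer names (`block1`, `block2`, `head`), the weights, the gradients, and the helper `sgd_step` are all hypothetical, and plain gradient descent stands in for whatever optimizer is used in practice.

```python
def sgd_step(params, grads, trainable, lr=0.001):
    """Update only the layers marked trainable; frozen layers keep pretrained values."""
    return {
        name: [w - lr * g for w, g in zip(weights, grads[name])]
        if trainable[name] else list(weights)
        for name, weights in params.items()
    }


# Hypothetical pretrained parameters, one short weight list per layer.
pretrained = {"block1": [0.2, -0.1], "block2": [0.4, 0.3], "head": [0.0, 0.0]}
grads = {"block1": [1.0, 1.0], "block2": [1.0, -1.0], "head": [2.0, -2.0]}

# Similar source and target tasks: fine-tune only the final layer.
similar_task = {"block1": False, "block2": False, "head": True}
# Very different tasks: also update the higher convolution block.
different_task = {"block1": False, "block2": True, "head": True}

after_similar = sgd_step(pretrained, grads, similar_task, lr=0.1)
after_different = sgd_step(pretrained, grads, different_task, lr=0.1)
```

In both settings the lowest block keeps its pretrained values, reflecting the idea that lower layers hold general features that transfer across domains.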

Bidirectional LSTM.
The RNN is an effective tool for analyzing time-series signals. However, it has difficulty handling long and complex time series because of its inherent limitations. The LSTM network effectively overcomes the problems of the traditional recurrent network through its standard recurrent layer and its unique internal gate structure. However, sensor signal data in machine health monitoring systems have strong time dependence, while the basic LSTM can only access the information at a specific time step and is unable to build an overall understanding of the sequence. As a variant of the traditional LSTM, Bidirectional LSTM (Bi-LSTM) improves the model's capability to deal with long sequences and has stable dynamic learning ability and strong robustness in extracting useful features from complex sequential data. A typical Bi-LSTM model is shown in Figure 5, where the Bi-LSTM model adopts a bidirectional connection. Each input sequence propagates forward and backward through two independent LSTMs, and the two outputs are concatenated. Bidirectional propagation allows each time-series sample to access the complete information as it travels in each direction, and the backward LSTM can further smooth the data and reduce the impact of noise.
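The bidirectional idea can be sketched with a single-unit LSTM written in plain Python. This is a toy illustration under stated assumptions, not the network used in this paper: the hidden size is 1, the parameter names (`wi`, `ui`, `bi`, and so on) and the helper names are hypothetical, and the forward and backward passes reuse one parameter set only for brevity, whereas a real Bi-LSTM trains two independent LSTMs.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_pass(sequence, p):
    """Run a one-unit LSTM over a scalar sequence; return the hidden state at each step."""
    h, c, outputs = 0.0, 0.0, []
    for x in sequence:
        i = sigmoid(p["wi"] * x + p["ui"] * h + p["bi"])    # input gate
        f = sigmoid(p["wf"] * x + p["uf"] * h + p["bf"])    # forget gate
        o = sigmoid(p["wo"] * x + p["uo"] * h + p["bo"])    # output gate
        g = math.tanh(p["wg"] * x + p["ug"] * h + p["bg"])  # candidate cell value
        c = f * c + i * g                                   # new cell state
        h = o * math.tanh(c)                                # new hidden state
        outputs.append(h)
    return outputs


def bilstm(sequence, fwd_params, bwd_params):
    """Pair the forward and backward hidden states at every time step."""
    forward = lstm_pass(sequence, fwd_params)
    # Process the reversed sequence, then flip back to the original time order.
    backward = lstm_pass(sequence[::-1], bwd_params)[::-1]
    return [(hf, hb) for hf, hb in zip(forward, backward)]


names = ("wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg")
params = {name: 0.5 for name in names}  # arbitrary illustrative values
signal = [0.1, 0.4, -0.2, 0.4, 0.1]
features = bilstm(signal, params, params)
```

Each time step ends up with a pair of values: one summarizing the past of the signal and one summarizing its future, which is what gives every sample access to the complete sequence.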

Proposed Method
In this paper, a high-precision RUL prediction framework for rotatory machines based on a deep neural network is proposed, which is able to automatically learn fault signatures and identify the degradation process directly from the original vibration signal. The proposed framework is able to achieve quick and accurate RUL prediction.
The overall pipeline of the prediction framework is shown in Figure 6. It contains five stages: data acquisition, data processing, data partitioning, pretraining of the feature extractor, and RUL prediction. The procedure used by the proposed model for RUL prediction is summarized in Algorithm 1.
Step 1. Data acquisition. The dataset used for RUL prediction consists of the vibration signals collected by an acceleration sensor mounted on the rotatory machine. The vibration signal records the whole run-to-fail process of the mechanical system. By analyzing the vibration signals in each time period, the RUL of the rotatory machine is predicted.
Step 2. Data processing. As the pretrained ResNet-50 network requires a two-dimensional image as input, the original one-dimensional signal data need to be processed and converted into two-dimensional images. Different from the time-frequency imaging method adopted by Shao et al. for fault classification tasks, the RUL prediction task explores the variation characteristics of the signal. It is difficult to predict the RUL value from the same signal when the time-frequency change is small, so the proposed method leverages the amplitude changes of the signal to process the one-dimensional sensor signal. The specific method is designed as follows. Assuming that the original sensor signal is $l = [x_1, x_2, \ldots, x_n]$ and the step length is $k$, the first sample extracted is $l_1 = [x_1, x_2, \ldots, x_a]$ and the next sample is $l_2 = [x_{1+k}, x_{2+k}, \ldots, x_{a+k}]$. Each sample is then arranged as a two-dimensional image, so that the amplitude of the original vibration signal is better retained, which is conducive to data analysis and RUL prediction. Since the converted image is a single-channel grey image, it is necessary to duplicate the grey image into three channels by means of channel enhancement, so as to obtain a two-dimensional image with three channels.
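The windowing and image conversion in Step 2 can be sketched as below. Several details are assumptions made for illustration rather than the paper's exact procedure: amplitudes are scaled to 0-255 by min-max scaling within each window, the window is filled into the image row by row, the step equals the window length, and the synthetic sine signal merely stands in for real vibration data. The function names are hypothetical.

```python
import math


def sliding_windows(signal, window, step):
    """Cut samples l_1 = signal[0:window], l_2 = signal[step:step + window], ..."""
    return [signal[s:s + window] for s in range(0, len(signal) - window + 1, step)]


def to_rgb_image(sample, side):
    """Reshape one sample to side x side grey pixels and copy it into three channels."""
    if len(sample) != side * side:
        raise ValueError("sample length must equal side * side")
    lo, hi = min(sample), max(sample)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant window
    # Assumed pixel mapping: min-max scaling of the amplitude to 0..255.
    grey = [round(255 * (v - lo) / span) for v in sample]
    rows = [grey[r * side:(r + 1) * side] for r in range(side)]
    return [[[p, p, p] for p in row] for row in rows]  # height x width x 3


# Synthetic stand-in for a vibration signal with slowly growing amplitude.
signal = [math.sin(0.01 * n) * (1 + 0.001 * n) for n in range(4096)]
samples = sliding_windows(signal, window=1024, step=1024)
images = [to_rgb_image(s, side=32) for s in samples]
```

With a window of 1024 points, each sample maps to a 32 × 32 × 3 image; a window of 10000 points would map to 100 × 100 × 3 in the same way.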

Shock and Vibration
Step 3. Data partitioning. Each image needs a specific RUL label. Compared with the amplitude of the vibration signal sequence, the real RUL values are relatively large, which makes it difficult for the deep neural network to model the mapping between input and output. In order to achieve quick convergence of the model and accurate RUL prediction, normalization is applied to the real RUL values. Through normalization, the output label corresponding to each real RUL value is obtained. After model training, the proposed framework is able to output predicted labels. To recover the RUL from the predicted values, the inverse normalization is applied. Through these transformations, effective model training and RUL prediction can be achieved. In addition, the processed images are divided into two parts: the training dataset and the test dataset. The training set is used to fine-tune the pretrained model and update the model parameters, while the test set is used only to verify the effectiveness of the proposed architecture.
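As an illustration of the label transformation in Step 3, the sketch below uses min-max scaling to the range [0, 1] and its inverse. The choice of min-max scaling is an assumption, since the exact normalization formula is not reproduced here, and the function names and RUL values are hypothetical.

```python
def normalize_rul(rul_values):
    """Scale RUL values to [0, 1]; also return the bounds needed for the inverse."""
    lo, hi = min(rul_values), max(rul_values)
    span = (hi - lo) or 1.0  # guard against a constant label list
    labels = [(v - lo) / span for v in rul_values]
    return labels, (lo, hi)


def denormalize_rul(labels, bounds):
    """Map normalized predictions back to the original RUL scale."""
    lo, hi = bounds
    return [lo + y * (hi - lo) for y in labels]


true_rul = [2500.0, 2000.0, 1000.0, 500.0, 0.0]  # illustrative values only
labels, bounds = normalize_rul(true_rul)
recovered = denormalize_rul(labels, bounds)
```

The bounds must come from the training labels and be stored with the model, so that predictions on test data are mapped back with the same scale.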

Step 4. Pretraining feature extractor. The pretrained model used in the proposed method consists of the first two convolution blocks of ResNet-50, which has obtained accurate classification results on the ImageNet data, indicating that ResNet-50 has effective and efficient feature learning ability. Detailed information about the selected convolution blocks is shown in Algorithm 1. Different from task-specific features, the general features extracted by the first two blocks of the pretrained model are suitable for various tasks.
Step 5. RUL prediction. As the ImageNet dataset differs greatly from the sensor vibration signal dataset, RUL prediction is much more difficult than simple image classification. Thus, the features extracted by the pretrained network are used as the input of the Bidirectional LSTM, and the sequential network is designed to produce the final RUL prediction. The whole network needs to be fine-tuned on the specific rotatory machine dataset during the training process. The fully connected layer is located at the end of the network to output the RUL results. In this paper, as shown in formulas (5) and (6), the ReLU function is used as the activation function, and the last layer outputs the specific value of the RUL.

Experiment and Analysis
To explore the performance of the model and verify the effectiveness of its prediction, several experiments were carried out on the bearing dataset and the gearbox dataset, respectively. The bearing dataset contains vibration data of various bearings collected at Xi'an Jiaotong University (XJTU) and the Changxing Sumyoung Technology (SY). The gear dataset was collected by Chongqing University on a contact fatigue testing machine. These two datasets contain the run-to-fail processes of bearings and gears operating under different conditions. By comparison with existing mainstream forecasting methods, the validity of the proposed method is verified.

Data Description.
As shown in Figure 7, the experimental data were collected by the bearing testbed. The testbed is designed for degradation tests of rolling bearings under different working conditions. The dataset contains run-to-fail sensor data of multiple rolling bearings, and the whole dataset was obtained through a number of accelerated degradation experiments. Table 2 shows the detailed description of the experimental data. Four different fault types are set, including inner race wear, cage fracture, outer race fracture, and outer race wear.

RUL algorithm for training is defined as follows:
Input: $l = [x_1, x_2, \ldots, x_n]$ # Historical run-to-fail sensor data of rotatory machine
For $i = 1, \ldots, 2500$ do:
Output: the trained proposed model.

Building and Training.
For data preparation, the time window is set to 1024, which means that a single sample contains 1024 data points. The sensor signal selected within the time window is processed into an amplitude image of size 32 × 32, which is suitable for the pretrained network. The format of the processed images is 32 × 32 × 3, obtained by duplicating the original grey-scale images. After that, the preprocessed image samples are separated into two parts for training and testing, respectively. There are 2000 training samples for each condition and 500 testing samples. The Adam optimizer is adopted for parameter updating with a learning rate of 0.001, and the batch size is set to 16.
For model comparison, the following methods are also conducted using the same dataset:

RUL Prediction for Bearing Dataset.
To verify the validity of the proposed architecture, we took two different fault datasets under three conditions as specific experimental datasets, and RMSE loss is adopted to evaluate the RUL prediction performance. Figure 8 shows the comparison of training efficiency between the proposed model and the model training from scratch, and the time needed for the two models to reach the specific loss value is compared. In these experiments, we set RMSE threshold as 65 to record the time required for model training.
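For reference, the RMSE used as the evaluation metric can be computed as in the following sketch. The function name `rmse` and the RUL values are hypothetical and serve only to show the calculation.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between predicted and actual RUL values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


actual_rul = [400.0, 300.0, 200.0, 100.0, 0.0]      # illustrative values only
predicted_rul = [430.0, 260.0, 200.0, 130.0, 40.0]  # illustrative values only
error = rmse(predicted_rul, actual_rul)
```

Because the errors are squared before averaging, a few large deviations raise the RMSE more than many small ones, and the result is expressed in the same unit as the RUL itself.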
From the results shown in Figure 8, the training time of the proposed ResSN-TL model is much lower than that of the network trained from scratch, which verifies the advantage of the transfer learning strategy in training efficiency. Table 3 shows the final RMSE loss of each model. Compared with the current mainstream prediction model RNN and its variants, the proposed model converts one-dimensional sensor signals into two-dimensional images, effectively leveraging the advantages of convolutional neural networks in feature learning, and is superior to most sequential networks in terms of RUL prediction. By means of transfer learning, the feature extraction module of the network is pretrained, which greatly improves the training efficiency of the model. Compared with the model trained from scratch, the effectiveness of the transfer learning method is verified in terms of both training time and prediction accuracy.
To provide convincing experimental verification and discuss the model learning ability, we conduct hold-out cross validation 100 times using the proposed framework and several comparison methods and generate RMSE confidence intervals at the 95% confidence level to evaluate the model learning ability for the RUL probability distribution, as shown in Table 4 and Figure 9.
In Figure 9, the color bars represent the average RMSE across the 100 cross-validation experiments, and the black lines denote their confidence intervals. Based on the estimated confidence intervals for RUL prediction, the proposed framework ResSN-TL has relatively superior performance. For the proposed model, there is a 95% likelihood that the RMSE of the predicted RUL ranges between 39.64 and 55.12. Compared with other models, the proposed framework has more accurate prediction results, which verifies the superior learning ability of the proposed method. Figure 10 shows the RUL prediction results of bearing dataset 2-4 by the proposed model. It can be seen from the results that the RNN model performs slightly worse than its variants in the analysis and prediction of complex signals with long sequences. Although the LSTM and GRU models are able to mitigate, to some degree, problems of the RNN such as exploding and vanishing gradients, their prediction performance is still unsatisfactory. This can be explained by the fact that a signal beyond a certain length may suffer information loss during long-distance transmission. Compared with the LSTM and GRU networks, the prediction performance of the Bi-LSTM model is significantly improved, which verifies that bidirectional propagation is effective for analyzing long time-series signals. Convolutional neural networks have advantages in feature extraction compared with time-series models. Combining the powerful feature extraction capability of the convolutional neural network with the ability of the Bidirectional LSTM to analyze sequence signals, the model proposed in this paper is superior to most time-series models in prediction performance, which demonstrates the effectiveness and efficiency of the proposed method.
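One common way to obtain such an interval from repeated runs is the normal approximation for the mean, sketched below. This is an assumption about the procedure, since the text does not state whether the reported interval is based on the standard error of the mean or on the spread of individual runs. The helper name is hypothetical, and the RMSE values are synthetic numbers generated for illustration, not results from this paper.

```python
import math
import random
import statistics


def mean_confidence_interval(values, z=1.96):
    """Normal-approximation interval for the mean: mean +/- z * s / sqrt(n)."""
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean - half_width, mean, mean + half_width


rng = random.Random(0)
# Synthetic stand-in for the RMSE of 100 repeated hold-out runs.
rmse_runs = [rng.gauss(47.0, 4.0) for _ in range(100)]
low, mean, high = mean_confidence_interval(rmse_runs)
```

Replacing `z * s / sqrt(n)` with `z * s` would instead describe the range in which a single run is expected to fall, which is wider and answers a different question.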

Model Generalization Ability.
In order to further investigate the model performance for a more general task, several additional experiments have been conducted. Datasets under different working conditions for training and testing are utilized to investigate the model generalizability. Table 5 shows various datasets defined for training procedure and testing prediction verification.
Experiments have been carried out in which the model parameters and hyperparameters are initialized in the same way as for the original model. The training datasets are used to train the proposed model, and the trained model is saved. After sufficient training, the proposed model has acquired the key feature learning ability and RUL prediction ability. Testing datasets under different working conditions are then applied to test the performance of the model in RUL prediction. Specifically, the mixture of sensor data from bearings 1-1, 1-2, 1-3, 1-4, and 1-5 is used as the training dataset for the proposed model, and the sensor data from bearings 2-1, 2-2, 2-3, 2-4, and 2-5 are used separately as testing data for the trained model. The RUL prediction results on the different bearing datasets are shown in Figure 11 and Table 6. Several comparisons are also carried out, and the experimental results are shown in Figure 12.
From the results, our model is able to achieve approximate RUL prediction across various working conditions and outperforms other frameworks. However, compared with the experimental results based on the same unit, model prediction accuracy using testing data from different working conditions is lower, which owes to the various key

In this case study, the gearbox sensor signal dataset is used to verify the performance of the proposed model. As shown in Figure 13, vibration signals are collected by sensors placed on a gearbox, and the sampling frequency is set to 50 kHz. The experiment stops when the amplitude of the collected vibration signals exceeds the given threshold. The specific description of the experimental data is shown in Table 7, including a total of four run-to-failure datasets under two different working conditions.

Building and Training.
Compared with the bearing signal dataset from XJTU, the vibration signals of the gearbox have no obvious trend of amplitude change, and the sampling frequency is also different from that of the bearing dataset. To effectively analyze the dataset and evaluate the prediction performance of each model, a time window containing 10000 data points is chosen as one sample, and every sample is converted to an image of size 100 × 100. The other settings are the same as those described in the previous section.
Figure 14 compares the training time of the proposed pipeline with that of the ResSN model trained from scratch. Similar to the conclusion of case study 1, the transfer learning-based method is able to accelerate the training procedure and therefore improve model training efficiency. Table 8 shows the RMSE loss on the four different datasets. It can be seen that the RNN has relatively poor predictive performance due to its insufficient processing capacity for long and complex sequence signals. The prediction performance of LSTM and GRU is better than that of the RNN. By stacking LSTMs or using bidirectional units, the predictive ability of the network is significantly improved.
Table 9 and Figure 15 show the confidence intervals of the RMSE, which are used to evaluate the RUL prediction ability of each model. The results show that, using the proposed framework, there is a 95% likelihood that the RMSE of the predicted RUL ranges between 64.32 and 78.64, which is smaller than for the other methods, whose RMSE ranges between 65.17 and 87.13. This verifies the ability of the model to learn from historical sensor data.
Figure 16 shows the RUL prediction results of Data A. The conclusion is consistent with the one drawn from the bearing dataset: the proposed model, which combines the pretrained feature extractor with the Bidirectional LSTM, achieves the best prediction performance, which further verifies the effectiveness of the method proposed in this paper.

Conclusion
In conclusion, an RUL prediction framework based on transfer learning is proposed in this paper. By means of transfer learning, certain model parameters are initialized reasonably, which mitigates the training instability caused by random initialization and greatly reduces the training burden of the deep architecture. By combining the pretrained feature extractor using residual blocks with the sequential model using the Bi-LSTM architecture, an efficient and accurate RUL prediction model for rotatory machines is established, which offers advantages in both training efficiency and prediction accuracy. The advantages of the proposed model are demonstrated on datasets of bearing and gear run-to-fail sensor signals. In the future, transfer learning can be expected to play a useful role in fault detection and RUL prediction across various mechanical systems. Besides, transfer guidelines will be further investigated.

Data Availability
The data used in this paper are downloaded from XJTU-SY Bearing Datasets at https://biaowang.tech/xjtu-sy-bearingdatasets/.

Conflicts of Interest
The authors declare that they have no conflicts of interest.