A Hybrid Deep Learning Prediction Method of Remaining Useful Life for Rolling Bearings Using Multiscale Stacking Deep Residual Shrinkage Network



Introduction
In the field of prognostics and health management (PHM), accurate prediction of remaining useful life (RUL) has always been a key and extremely challenging problem [1,2]. The rolling bearing is one of the most common and most crucial parts of modern machinery, and almost all kinds of mechanical equipment need it [3]. However, long-running and repetitive loads can cause wear and damage to bearings, which in turn produces noise. Rolling bearings in normal operating conditions typically produce low levels of noise, but once a bearing is damaged or worn, the noise level increases significantly. Noise is therefore a common indicator of rolling bearing failure and can provide engineers with critical information about the extent of bearing wear and damage [4]. Once a rolling bearing fails, it can affect the operation of the entire mechanical system and cause serious consequences for the equipment and the enterprise [5].
Premature failure of rolling bearings can therefore cause equipment accidents and even huge economic losses for the company. To avoid this, the remaining useful life of the rolling bearing must be predicted accurately so that a failure can be found in advance and the bearing replaced in time. An accurate RUL prediction not only reflects the health condition of the bearing in operation but also provides a sound theoretical basis for developing the health management plan of the equipment [6].
In general, RUL prediction methods can be divided into three categories: methods based on physical models, methods based on data-driven models, and hybrid methods [7][8][9][10][11][12][13][14][15][16][17][18][19]. The model-driven method builds physical and mathematical models and then estimates the model parameters from the monitoring data to predict the degradation trend of rolling bearings over their full life cycle. For example, Ding et al. [20] proposed a method that extracts time-domain features such as RMS and kurtosis from the vibration signals of rolling bearings and evaluates them with the proportional hazard model, which achieved a better result. Wang et al. [21] proposed a method that obtains the covariates of the Weibull proportional hazard model by KPCA, which achieved high accuracy in RUL prediction. However, this approach often requires extensive prior knowledge, which makes it difficult to establish an accurate degradation model under complex system structures and working conditions [22].
With the rapid development of sensor technology, computer science, and artificial intelligence theory, data-driven methods have become more and more attractive in prognostics and health management (PHM). The data-driven method uses machine learning and deep learning to learn autonomously from the data and then infer the degradation process of rolling bearings; it not only saves time and labor but can also predict RUL accurately [23,24]. Zhang et al. [25] proposed an improved CNN to predict RUL by exploiting the CNN's ability to learn autonomously. Xin and Weitang [26] proposed a bearing remaining life prediction method based on a multiscale convolutional neural network.
Although CNN-based deep learning algorithms have achieved many excellent results in the field of bearing RUL prediction, most of these methods have been validated only on laboratory datasets. In industrial production, however, it is difficult to capture bearing vibration signals with a high signal-to-noise ratio because of the various noise sources in the environment, so many RUL algorithms suffer a serious decline in accuracy or even fail in the industrial field. Therefore, to meet practical industrial requirements and enable the methods to complete the life prediction task in noisy environments, many researchers have focused on improving the robustness of the model, and fault diagnosis in noisy environments has been studied extensively. For example, Zhicheng et al. [27] used the empirical wavelet transform to reconstruct the signals and then used a minimum entropy deconvolution CNN to reduce the noise of the reconstructed signals, achieving better results. Zou et al. [28] automatically extracted features from the background noise with a 1DCNN by performing structure optimization. Su et al. [29] designed a class of hierarchical branching CNN structures and built a basic convolutional block with strong robustness by stacking one-dimensional small convolutional kernels, which improved the accuracy. Although there are many research results on fault diagnosis in noisy environments, there are still relatively few reports on life prediction in noisy environments.
CNN-based prediction models usually suffer from gradient disappearance and gradient explosion, which leads to inaccurate prediction results. Therefore, He et al. [30] proposed the residual network, which brings shortcut connections into the network to improve its linear transformation ability and avoids gradient explosion and gradient disappearance, thus allowing the network to be stacked to deeper layers. The residual network made deeper networks possible, and the deep residual network then appeared, which can reduce the number of model parameters and shorten the model training time by adding residual connections. Yu et al. [31] proposed a ResNet model that extracts features from 1D vibration signals.
When noisy or complex data are used for feature learning, the results are often unsatisfactory. Zhao et al. [32] proposed the deep residual shrinkage network (DRSN), an improved version of the deep residual network (DRN). DRSN inserts soft thresholding as a nonlinear transformation layer and learns the thresholds autonomously through an attention mechanism, so it can extract degradation information effectively.
Although hybrid methods can make full use of the advantages of both kinds of methods, the process is more complicated; therefore, this kind of method has rarely been reported. Aiming at the problem of noisy signal data in rolling bearing remaining useful life prediction, this paper focuses on the ability of the deep residual shrinkage network to handle noisy data and improves it.
The novelty of this paper can be summarized as follows:
(1) The deep residual shrinkage network is improved by introducing the idea of stacking integrated (ensemble) learning, which learns more features of the dataset with two layers of learners, so that the results are less biased and the prediction ability of the model is improved.
(2) Traditional DRSN models usually use a single convolution scale. This paper proposes to use DRSNs with convolution kernels of multiple scales as the base-learners, which can learn more features.
(3) Based on the above, this paper proposes a prediction method based on the multiscale stacking deep residual shrinkage network (MSDRSN), and the experimental results show that the method predicts RUL accurately under noisy and complex data.
The remainder of the paper is organized as follows. In Related Theory, the related techniques are introduced, and the basics of the methods used in this paper are presented, such as the basic deep residual shrinkage network, stacking integrated learning, and the SG smoothing algorithm. By combining these methods, the MSDRSN method is constructed. In Materials and Methods, the detailed prediction process of MSDRSN is given, which includes the data preprocessing stage and the training process of the model. In Experimental Verification, the dataset used in this paper is introduced, the parameters of the experiments are determined, and the results obtained are discussed. Finally, this paper concludes the research.

Related Theory
2.1. DRSN.
DRSN is a modified deep residual network that consists of three parts: the deep residual network, an attention mechanism, and a soft thresholding function. It is usually used to enhance deep neural networks so that more useful features can be extracted from noise-laden data and redundant information is eliminated more effectively [33].
Traditional convolutional neural networks have two major problems: first, deeper networks are prone to gradient disappearance and explosion; second, they adapt poorly to sample data collected in strongly noisy environments. DRSN mitigates both problems [34]. The main structure of DRSN is shown in Figure 1.
The residual building unit (RBU) is the basic component of ResNet. In Figure 1, each rectangle represents a feature map, where C is the number of channels, W is the width, and 1 is the height. A deep residual shrinkage network can contain two batch normalization (BN) layers, two rectified linear unit (ReLU) activation functions, two convolutional (Conv) layers, an identity shortcut, a global average pooling (GAP) layer, and some fully connected (FC) layers.

Stacking.
Stacking combines the predictions of multiple models trained at the same level to obtain the final prediction result. The stacking method has two stages of models. The first-stage models are trained on the original training set and are called base-learners (level-0 models); several base models can be chosen for training. The second-stage model, called the meta-learner (level-1 model), is trained on the base models' predictions for the original training set and tested on their predictions for the original test set [35].
Stacking is attracting attention because the base-learners produce diverse and more accurate predictions from the original data, and the meta-learner then relearns from these predictions as new features; the base-learners in the integrated learning thus complement each other and make the prediction result more accurate [35].
The stacking framework is shown in Figure 2.
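The two-stage idea can be illustrated with a minimal sketch in plain Python. The base-learners `h1` and `h2`, the toy target, and the closed-form linear meta-learner are all hypothetical choices made for illustration; they are not the learners used in this paper, where the base-learners are DRSNs and the meta-learner is a fully connected network.

```python
def build_meta_features(base_learners, samples):
    """Level-0 step: every base-learner predicts every sample.

    Returns one row per sample and one column per base-learner.
    """
    return [[h(x) for h in base_learners] for x in samples]


def fit_meta_learner(meta_features, targets):
    """Level-1 step: least-squares weights for exactly two base-learners.

    Solves the 2x2 normal equations in closed form (no intercept term).
    """
    a11 = sum(f[0] * f[0] for f in meta_features)
    a12 = sum(f[0] * f[1] for f in meta_features)
    a22 = sum(f[1] * f[1] for f in meta_features)
    b1 = sum(f[0] * y for f, y in zip(meta_features, targets))
    b2 = sum(f[1] * y for f, y in zip(meta_features, targets))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("base-learner outputs are collinear")
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


# Hypothetical base-learners with different systematic errors.
def h1(x):
    return x + 0.1      # biased upward


def h2(x):
    return 0.5 * x      # underestimates by half


samples = [0.0, 0.25, 0.5, 0.75, 1.0]
targets = list(samples)                         # toy target: y = x
meta_x = build_meta_features([h1, h2], samples)
w1, w2 = fit_meta_learner(meta_x, targets)
stacked = [w1 * f[0] + w2 * f[1] for f in meta_x]
```

In this toy case the meta-learner learns to ignore the biased learner and rescale the other one, so the stacked prediction recovers the target although neither base-learner does.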

Savitzky-Golay Algorithm.
The Savitzky-Golay (SG) algorithm, first published by Savitzky and Golay in 1964, is widely used for data smoothing and denoising. It is an important method in the field of signal processing [36]. In rolling bearing remaining useful life prediction, the model outputs a RUL prediction curve for the test set. However, because it is difficult to train the model to complete accuracy, some points in the predicted RUL curve may be predicted incorrectly and appear as sudden increases or decreases.
After the curve is smoothed, these points are corrected toward the local average trend, which enhances the prediction accuracy. The signal after smoothing by the SG filter is shown in Figure 3.
The blue line is the original signal, and the red line is the signal after smoothing by the SG filter.

Materials and Methods
For the RUL prediction problem, this paper uses CUSUM to find degradation points. In addition, to improve model performance in learning degradation features, this paper proposes a multiscale deep residual shrinkage network that incorporates the idea of stacking integrated learning. Applying integrated learning to DRSN enhances the ability to extract useful information. Using the residual connections of the residual network not only solves the gradient vanishing problem of deep networks but also improves the accuracy of the prediction results.

International Journal of Intelligent Systems
The complete RUL prediction process is shown in Figure 4.

CUSUM.
CUSUM is a sequential analysis method that was first proposed by E. S. Page of Cambridge University, UK, in 1954. Its basic idea is to accumulate the small offsets between the sample data and the target value during the process, which amplifies small offsets and increases the sensitivity of detection, so that anomalies in the data can be detected; this is called CUSUM change point detection [37,38]. When applied to the bearing vibration signal, it can be used to find the change point, which is taken as the fault point of the rolling bearing.
The RUL of a rolling bearing is usually treated as a monotonically decreasing line: the health value starts at 1 at the beginning of sampling, decreases uniformly as the bearing degrades, and reaches 0 when the rolling bearing stops working.
In this paper, the labeling method is different from this previous method. This paper divides the life cycle of a rolling bearing into a stable stage and a declining stage. Generally, the rolling bearing works stably during the stable stage and the signal features basically do not change, but when it enters the declining stage, the signal features change drastically until the rolling bearing breaks down. The advantage of this method is that the labels stay unchanged in the stable stage and start to decline gradually only from the declining stage. This avoids the influence of the stable stage on model training and improves the accuracy of the training results. By dividing the labels of the life cycle in this way, the fault point of a rolling bearing can be found quickly, and the RUL prediction result can be improved effectively.
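The two-stage labeling can be sketched in a few lines of Python. The sketch assumes the linear decline (m − i)/(m − p) given in Algorithm 1, with 1-based sample indices, m samples in total, and degradation starting at sample p; the function name `rul_labels` is hypothetical.

```python
def rul_labels(m, p):
    """Piecewise RUL labels for m samples with degradation starting at index p.

    Indices are 1-based: samples before p keep the label 1, after which the
    label falls linearly and reaches 0 at the last sample m.
    """
    if not 1 <= p < m:
        raise ValueError("need 1 <= p < m")
    return [1.0 if i < p else (m - i) / (m - p) for i in range(1, m + 1)]


labels = rul_labels(m=10, p=6)
# labels == [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.75, 0.5, 0.25, 0.0]
```

Because the label equals 1 at i = p, the stable and declining stages join without a jump.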
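The CUSUM algorithm is calculated in three steps, which are shown as follows.
Step 1: calculate the mean value of the series $x_1, \ldots, x_n$:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
Step 2: calculate the cumulative sum of the deviations from the mean, with $S_0 = 0$:
$$S_i = S_{i-1} + (x_i - \bar{x}), \quad i = 1, \ldots, n.$$
Step 3: take the maximum value of the cumulative sum in magnitude, $\max_i |S_i|$; the corresponding horizontal coordinate is the fault point.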
This step accepts the raw rolling bearing vibration signal, takes the mean absolute value of the vibration signal, and passes it to CUSUM for life-stage division and labeling.
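The three CUSUM steps can be sketched as follows in plain Python. Two points are assumptions of the sketch rather than statements from the text: the cumulative sum is taken over deviations from the series mean, and the peak is located by magnitude so that upward and downward shifts are both detected. The function name `cusum_change_point` is hypothetical.

```python
def cusum_change_point(series):
    """Return the 0-based index where the mean-centered cumulative sum peaks.

    Also returns the cumulative sums so that the curve can be inspected.
    """
    mean = sum(series) / len(series)            # Step 1: series mean
    cumulative, running = [], 0.0
    for value in series:                        # Step 2: cumulative sum
        running += value - mean
        cumulative.append(running)
    # Step 3: position of the largest cumulative deviation (by magnitude).
    peak = max(range(len(cumulative)), key=lambda i: abs(cumulative[i]))
    return peak, cumulative


# Toy health indicator: six stable samples followed by four degraded ones.
indicator = [1.0] * 6 + [3.0] * 4
index, sums = cusum_change_point(indicator)
```

For this toy series the peak falls on the last stable sample (0-based index 5), so the declining stage starts at the following sample.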

MSDRSN.
The network structure of DRSN mainly consists of a convolutional layer, residual shrinkage modules, and a pooling layer. MSDRSN is a new model that embeds DRSN into stacking integrated learning, using DRSNs of different scales as the base-learners and fully connected networks as the meta-learner. In RUL prediction, three residual shrinkage layers are stacked together as part of the model structure. Multiple residual shrinkage layers can better learn the mapping between input and output and alleviate gradient disappearance. However, the full-life data sequence of a bearing contains noise, and the residual network alone can reduce the noise only to a certain extent, so its training effect is ordinary. Therefore, this paper proposes MSDRSN by exploiting the ability of stacking to learn features. It not only solves the gradient disappearance caused by deepening the network but also enhances the training ability of the model and improves the training accuracy, all of which contribute to rolling bearing RUL prediction.
The DRSN model construction process is shown as follows.
Step 1: convolutional layer (feature preextraction). The calculation principle of the one-dimensional convolution layer is as follows:
$$y_i^k = f_{cov}\left(w^k \otimes x_i + b^k\right),$$
where $x_i$ is the input, $w^k$ is the weight of the $k$th convolution kernel, $\otimes$ is the dot product operation, $b^k$ is the bias of the $k$th convolution kernel, $f_{cov}$ is the activation function, and $y_i^k$ is the output vector of the $k$th convolution kernel. At this layer, the vibration signal is first processed with a one-dimensional convolutional kernel to extract the shallow features of the signal, which provides a basis for the deep feature extraction in the next step; padding is set in this layer to avoid the loss of boundary features. This layer completes feature preextraction.
Step 2: residual shrinkage module (feature extraction and denoising). Given an input feature $x_i$ and a soft threshold $t$, the soft thresholding formula is
$$x_t = \begin{cases} x_i - t, & x_i > t, \\ 0, & -t \le x_i \le t, \\ x_i + t, & x_i < -t, \end{cases}$$
where $x_t$ represents the output feature and $t$ is the soft threshold. Finally, the sum of the soft-thresholded feature map and the input feature map is the output feature map $x_j$ of the module; this step is the core of the identity mapping in the residual structure. In this layer, the network automatically extracts the features of the previous layer's data and automatically learns the threshold for denoising. After that, the data are transmitted to the next layer of the network.
Step 3: pooling layer (downsampling). The pooling layer maps the neurons $q_i^l(t)$ that fall in the $j$th pooling window to the value $P_i^{l+1}(j)$, where $W$ is the width of the pooling domain, $q_i^l(t)$ is the $t$th neuron of the $i$th feature vector in layer $l$, and $P_i^{l+1}(j)$ is the value of the corresponding neuron in layer $l + 1$. The network at this layer accepts the information output from the previous layer, automatically downsamples the data according to the set output size, and outputs the data to the next layer.
Step 4: splicing and fully connected layers (feature aggregation and relearning). The fully connected layer computes
$$y = Wx + b,$$
where $W$ is the weight of the fully connected layer, $x$ is the input, $b$ is the bias, and $y$ denotes the layer output. This layer plays a role similar to the meta-learner in stacking. After the features are spliced in this layer, they are sent to the fully connected layer as the sample information for feature training, and finally the results are output.
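The four steps above can be traced end to end with a small pure-Python sketch. It is a simplified illustration under several assumptions, not the network used in this paper: ReLU is used as the activation, the convolutional branch inside the shrinkage module is omitted, the threshold is a hand-fixed fraction `alpha` of the mean absolute feature instead of being learned by attention, max pooling over non-overlapping windows is assumed as the pooling operator, and all kernels and weights are hypothetical constants.

```python
def conv1d_same(signal, kernel, bias=0.0):
    """Step 1: zero-padded 1D convolution (odd kernel length) with ReLU."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    output = []
    for start in range(len(signal)):
        window = padded[start:start + len(kernel)]
        value = sum(w * x for w, x in zip(kernel, window)) + bias
        output.append(max(0.0, value))          # ReLU activation (assumption)
    return output


def soft_threshold(features, t):
    """Shrink each value toward zero by t; values within [-t, t] become 0."""
    shrunk = []
    for x in features:
        if x > t:
            shrunk.append(x - t)
        elif x < -t:
            shrunk.append(x + t)
        else:
            shrunk.append(0.0)
    return shrunk


def residual_shrinkage(features, alpha):
    """Step 2: soft-threshold the features, then add the identity shortcut."""
    t = alpha * sum(abs(x) for x in features) / len(features)
    return [x + s for x, s in zip(features, soft_threshold(features, t))]


def max_pool(features, width):
    """Step 3: non-overlapping max pooling (the operator is an assumption)."""
    return [max(features[i:i + width])
            for i in range(0, len(features) - width + 1, width)]


def fully_connected(weights, inputs, bias):
    """Step 4: single-output fully connected layer, W x + b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def forward(signal, kernels, alpha, pool_width, fc_weights, fc_bias):
    """Run every scale as one branch, splice the branches, apply the FC layer."""
    spliced = []
    for kernel in kernels:
        features = conv1d_same(signal, kernel)
        features = residual_shrinkage(features, alpha)
        spliced.extend(max_pool(features, pool_width))
    return fully_connected(fc_weights, spliced, fc_bias), spliced


signal = [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
kernels = [[1.0] * 3, [1.0] * 5]                # two hypothetical scales
prediction, spliced = forward(signal, kernels, alpha=0.5, pool_width=3,
                              fc_weights=[0.05] * 4, fc_bias=0.0)
```

Each kernel size acts as one scale, and splicing the pooled branch outputs before the fully connected layer mirrors the way base-learner outputs are handed to the meta-learner in stacking.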
We choose the mean square error (MSE), the loss function most commonly used for regression problems; the loss function is as follows:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{RUL}_{\mathrm{predict},i} - \mathrm{RUL}_{\mathrm{true},i}\right)^2,$$
where $\mathrm{RUL}_{\mathrm{predict}}$ and $\mathrm{RUL}_{\mathrm{true}}$ are the predicted and actual RUL, respectively, and $n$ is the number of samples.
MSDRSN plays the role of feature extraction and initial output of RUL prediction results. The model accepts the preprocessed data from the previous layer, extracts features automatically, outputs the predicted lifetime, and passes the output to the next layer for curve smoothing.

SG Filter.
Because noise inevitably appears in the training results, the SG filter is used to reduce the noise and smooth the curve, which increases the accuracy and curve fit of the prediction results. The SG filter is calculated in two steps, which are shown as follows.
Step 1: Let $i = -m, \ldots, 0, \ldots, m$ be the consecutive integer positions in the moving window and $x(i)$ the data at these positions; then, the width of the moving window is $M = 2m + 1$. Using an $n$th degree polynomial ($n \le M$) within the filter window, the data are fitted locally by least squares:
$$y(i) = \sum_{k=0}^{n} a_k i^k, \quad (9)$$
where $y$ are the output data after fitting, $x$ are the data to be fitted, and $a$ is the parameter to be solved. The curve is best fitted when the sum of squared residuals $E = \sum_{i=-m}^{m}\left(y(i) - x(i)\right)^2$ is minimal.
Step 2: setting the partial derivative of $E$ with respect to each coefficient $a_r$ to zero gives
$$\frac{\partial E}{\partial a_r} = 2\sum_{i=-m}^{m}\left(\sum_{k=0}^{n} a_k i^k - x(i)\right) i^r = 0,$$
where $r = 0, 1, \ldots, n$, $m$ is the number of single-sided points to be fitted, and $x(i)$ are the data to be fitted; then, we have
$$\sum_{k=0}^{n} a_k \sum_{i=-m}^{m} i^{k+r} = \sum_{i=-m}^{m} x(i)\, i^r.$$
Solving these equations gives the coefficients of the fitted polynomial, which is then used to find the center point estimate within that window. By constantly moving the window and repeating the operation, the estimated value of the center point of any window is obtained. Using the SG filter to process the output data of MSDRSN, the noisy data can be effectively filtered out, so that the output prediction value fits the real value more closely, the estimation accuracy is improved, and the model robustness is enhanced.
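The two SG steps can be sketched in plain Python by building the least-squares normal equations for each window and solving them with a small Gaussian elimination. The function names, the choice to leave the first and last m points unchanged, and the toy data are assumptions of this sketch; library implementations handle the edges in other ways.

```python
def solve_linear(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    size = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, size))
        solution[row] = (aug[row][size] - tail) / aug[row][row]
    return solution


def savgol_smooth(data, m, degree):
    """Savitzky-Golay smoothing with window width 2m + 1.

    The first and last m points are returned unchanged (assumption).
    """
    if degree > 2 * m:
        raise ValueError("degree must not exceed 2m")
    offsets = range(-m, m + 1)
    # Normal-equation matrix: entry (r, k) is the sum of i**(r + k) over the window.
    normal = [[sum(i ** (r + k) for i in offsets) for k in range(degree + 1)]
              for r in range(degree + 1)]
    smoothed = list(data)
    for center in range(m, len(data) - m):
        rhs = [sum(data[center + i] * i ** r for i in offsets)
               for r in range(degree + 1)]
        coefficients = solve_linear(normal, rhs)
        smoothed[center] = coefficients[0]      # polynomial value at i = 0
    return smoothed


spike = [0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0]
smoothed_spike = savgol_smooth(spike, m=2, degree=2)
parabola = [float(i * i) for i in range(9)]
smoothed_parabola = savgol_smooth(parabola, m=2, degree=2)
```

The isolated spike is pulled toward its neighbors, while data that already follow a quadratic are reproduced unchanged, which is the behavior wanted when correcting sudden jumps in a predicted RUL curve.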

Model Algorithms.
The algorithm for constructing the stacking-based prediction network is shown in Algorithm 1.

Experimental Verification
To demonstrate the validity of the proposed model, this paper uses the rolling bearing accelerated degradation dataset published in the IEEE 2012 PHM Data Challenge [39]. The feasibility and effectiveness of the proposed RUL prediction method are analyzed in detail by experiments and compared with other RUL prediction methods. The PRONOSTIA platform consists of three main parts: the rotating part, the degradation part, and the measurement part [40]. The data provided by PRONOSTIA describe the degradation of ball bearings throughout their service life (until complete failure), and each degraded bearing contains almost all types of defects (inner and outer rings, balls, and cage). In the experiment, a radial load force is applied to accelerate the degradation of the bearings. The vibration signals in the X and Y directions are acquired with a sampling frequency of 25600 Hz; 2560 data points are recorded in each 0.1 s sampling, and the recording interval is 10 s. When the vibration level of the measured data exceeds a certain threshold, the test is stopped [41]. The basic characteristics of the bearings are listed in Table 1, and Table 2 gives a detailed description of the dataset.
Different working conditions have some influence on the training of the model. This paper selects the datasets under three conditions for experiments, because these three datasets have long decline periods from which more information can be learned. In total, three groups of tests were carried out, and the detailed arrangement is shown in Table 3 [41].

Evaluation Metrics.
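In this paper, three metrics are used to evaluate performance: mean square error (MSE), mean absolute error (MAE), and coefficient of determination ($R^2$). The three evaluation metrics are defined as follows:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{RUL}_{\mathrm{predict},i} - \mathrm{RUL}_{\mathrm{true},i}\right)^2,$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\mathrm{RUL}_{\mathrm{predict},i} - \mathrm{RUL}_{\mathrm{true},i}\right|,$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(\mathrm{RUL}_{\mathrm{predict},i} - \mathrm{RUL}_{\mathrm{true},i}\right)^2}{\sum_{i=1}^{n}\left(\mathrm{RUL}_{\mathrm{true},i} - \overline{\mathrm{RUL}}_{\mathrm{true}}\right)^2},$$
where $\mathrm{RUL}_{\mathrm{predict}}$ and $\mathrm{RUL}_{\mathrm{true}}$ represent the predicted RUL and the actual RUL, respectively, $\overline{\mathrm{RUL}}_{\mathrm{true}}$ is the mean of the actual RUL, and $n$ is the length of the testing data. The smaller the MSE and MAE, the better the prediction performance. The closer $R^2$ is to 1, the better the prediction performance.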
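As a minimal sketch, the three metrics can be computed in plain Python as follows. The function names and the toy RUL values are hypothetical and are not results from the experiments.

```python
def mse(predicted, actual):
    """Mean square error between predicted and actual RUL."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def mae(predicted, actual):
    """Mean absolute error between predicted and actual RUL."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def r2(predicted, actual):
    """Coefficient of determination: 1 - residual sum / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only.
rul_true = [1.0, 0.75, 0.5, 0.0]
rul_predict = [0.9, 0.7, 0.5, 0.1]
scores = (mse(rul_predict, rul_true),
          mae(rul_predict, rul_true),
          r2(rul_predict, rul_true))
```

Note that `r2` is undefined when the actual RUL is constant, because the total sum of squares is then zero.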

Taking Bearing 1_1 as an example, the steps of sample labeling are as follows:
Step 1: The rolling bearing vibration signal measures the acceleration of the bearing in a certain direction, usually the horizontal direction. The sign of the signal indicates the direction of the vibration rather than its magnitude, and the direction is not of concern in the signal analysis; therefore, the absolute value of the signal is taken first. Because 2560 vibration values are sampled at each recording moment, this paper uses their mean value to represent the vibration signal at that moment. After processing, the curve of the absolute mean vibration signal is shown in Figure 6.
Step 2: After the signal processing is completed, the signal curve is smoothed by the SG filter for denoising, and then CUSUM is used to find the fault point. The curve before the fault point is regarded as the stable stage, where the RUL is 1. The curve after the fault point is regarded as the declining stage, where the RUL is labeled as decreasing from 1 to 0. After all the data are labeled with RUL, the bearing vibration signal samples also need to be normalized because the vibration signal values of different bearings have different ranges. The degradation points of each bearing are given in Table 4, which demonstrates the validity of the method.
Step 3: To verify the superiority of the MSDRSN model in handling noisy data, noise is added to the signals according to the signal-to-noise ratio (SNR); this paper adds noise at 2 dB, 4 dB, 6 dB, and 8 dB in turn. The SNR formula is as follows:
$$\mathrm{SNR(dB)} = 10\log_{10}\left(\frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}\right),$$
where $P_{\mathrm{signal}}$ and $P_{\mathrm{noise}}$ are the power of the signal and of the noise, respectively.

4.4. Model Building Experiments.
The structural parameters of MSDRSN are shown in Table 5. The hyperparameters, such as the number of residual blocks and the learning rate, need to be preset. The hyperparameters are shown in Table 6.
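The preprocessing operations of Steps 1 to 3 (absolute mean per recording, normalization, and noise addition at a target SNR) can be sketched in plain Python. Several choices are assumptions of this sketch: the noise is white Gaussian, it is rescaled so that the realized SNR matches the target exactly, min-max scaling is used for the normalization, and the sine wave and the window of 64 points only stand in for real vibration recordings of 2560 points.

```python
import math
import random


def absolute_mean(snapshot):
    """Mean of the absolute acceleration values in one recording."""
    return sum(abs(v) for v in snapshot) / len(snapshot)


def min_max_normalize(values):
    """Scale values to [0, 1]; min-max scaling is an assumption of this sketch."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def power(values):
    """Average power of a sequence."""
    return sum(v * v for v in values) / len(values)


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled so that the SNR equals snr_db."""
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    raw = [rng.gauss(0.0, 1.0) for _ in signal]
    scale = math.sqrt(target_noise_power / power(raw))
    noise = [scale * v for v in raw]
    return [s + n for s, n in zip(signal, noise)], noise


def measured_snr_db(signal, noise):
    """SNR(dB) = 10 log10(P_signal / P_noise)."""
    return 10 * math.log10(power(signal) / power(noise))


rng = random.Random(0)                      # fixed seed for repeatability
clean = [math.sin(0.1 * k) for k in range(256)]
noisy, noise = add_noise(clean, snr_db=4, rng=rng)
indicator = [absolute_mean(noisy[i:i + 64]) for i in range(0, 256, 64)]
normalized = min_max_normalize(indicator)
```

A lower SNR value in dB means stronger noise relative to the signal, so the 2 dB case is the hardest of the four noise levels.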
Some other settings are also examined in this experiment, such as the validity of CUSUM and the validity of SG smoothing. The number of residual blocks and the learning rate are likewise determined by experiment.
The validity of CUSUM is tested with an SVR machine learning model whose kernel function is RBF. The results of the comparison experiments are shown in Figure 7.
The experiments show that the MSE of bearing 1_3, bearing 2_3, and bearing 3_3 is lower when CUSUM is applied than when it is not, and thus it can be concluded that CUSUM significantly improves the prediction capability of the model.
For the validity of SG smoothing, the number of network layers in the experiment is 2, the pooling layer size is 80, and the comparison item is whether SG smoothing is applied or not. The results of the comparison experiment are shown in Figure 8.
The experiments show that the training results are better when SG smoothing is applied and worse when it is not, so SG smoothing greatly enhances the training ability of the model.
The setting of the learning rate has a significant impact on the MSDRSN network. A learning rate that is too low often leads to slow convergence of the model, while a learning rate that is too high can prevent the model from converging, so this experiment compares the best loss values reached with different learning rates and determines the optimal learning rate. The results of the comparison experiment are shown in Figure 10.

From the experiment, it can be concluded that when the learning rate is set as 0.000001, the loss value is the smallest and the model training result is the best, so the optimal learning rate of MSDRSN network is set as 0.000001.
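The effect of the learning rate described above can be reproduced on a toy problem. The sketch below runs plain gradient descent on the hypothetical loss L(w) = w², which is unrelated to the MSDRSN loss surface; the three learning rates are chosen only to show slow convergence, good convergence, and divergence.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize the toy loss L(w) = w**2 and return the final loss."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w          # gradient of w**2 is 2w
    return w * w


# Final loss after 50 steps for a low, a moderate, and a too-high rate.
losses = {rate: gradient_descent(rate) for rate in (0.001, 0.1, 1.1)}
```

With the low rate the loss has barely moved after 50 steps, with the moderate rate it is close to zero, and with the too-high rate every step overshoots and the loss grows, which is why the learning rate is selected by comparison experiments.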

RUL Prediction Results
The programming software used for the experiments is Python 3.6, and the central processing unit (CPU) of the workstation is an Intel i7-11800H.
During the experiment, the MSDRSN method was compared with common methods: two machine learning models, Random forest and SVR, and two deep learning models, BiLSTM [42] and MSCNN-FC [43]. BiLSTM combines two LSTM layers that read the input in the forward and backward directions, so it can effectively use the feature information from both. MSCNN-FC can be used for prediction problems in the same way as for classification problems, except that the last layer is usually a fully connected layer with only one neuron.
Table 7 compares the evaluation metrics of the MSDRSN method, BiLSTM, and MSCNN-FC in different noise environments.
Among them, MSDRSN has the smallest MSE and MAE, indicating the smallest prediction error, and the largest $R^2$, indicating the best fit between the predicted and actual values. The prediction accuracy for task 2_3 is lower, while the prediction results under condition 1 and condition 3 are better than those of the other methods. The feasibility and superiority of MSDRSN are demonstrated by the comparison with BiLSTM and MSCNN-FC. The results of the experiments conducted under different noise levels also demonstrate that the MSDRSN network is more suitable for prediction tasks with noisy data. Taking bearing 1_3 as an example, the RUL prediction curves under different noise levels are shown in Figure 11.
The prediction results of the method follow a trend similar to the actual RUL values, and the prediction accuracy is significantly higher than that of BiLSTM and MSCNN-FC, which verifies the effectiveness of the model in rolling bearing RUL prediction.
To further verify the superiority of the model, this paper compares it with two traditional machine learning methods, Random forest and SVR. Taking bearing 1_3 as an example, the RUL prediction curves of the three models under different noise levels are shown in Figure 12.
The prediction results show that the model has a higher prediction accuracy than the traditional machine learning methods; its curves agree best with the actual values and show no obvious fluctuation.
In summary, the method proposed in this paper predicts the remaining life of rolling bearings more effectively. The method captures the early degradation characteristics of the bearing more accurately. At the same time, it makes a significant improvement in the prediction of the later bearing life.

Discussion of RUL Prediction
(1) The MSDRSN method performs convolution with multiple convolution kernels of different scales to extract detailed features and increase the prediction accuracy of the model. Stacking integrated learning observes the features of the data from multiple perspectives, learns them with the base-learners, and relearns them after fusing the obtained features, which improves the accuracy of the prediction results. In addition, this paper uses the powerful noise reduction capability of DRSN to obtain more useful features from the dataset, which improves the feature learning and feature extraction capability of the model on complex datasets.

4.1. Data Description.
The PHM 2012 challenge dataset was provided by the PRONOSTIA platform of the FEMTO-ST Institute. It is shown in Figure 5.

(2) Validity of SG smoothing: a comparison test based on bearing 1_3 at 2 dB, 4 dB, 6 dB, and 8 dB noise.

ALGORITHM 1: Stacking-based algorithm for prediction network process construction.
Output: Predicted value after integration H
Process:
Step 1: Data preprocessing
    for i = 1 to m do
        x_i = mean(abs(x_i))
    end for
    p = CUSUM(x_1, ..., x_m)
    for i = 1 to m do
        if i < p then set y_i = 1
        else set y_i = (m − i)/(m − p)
    end for
    normalization(x_i)
Step 2: Training base-learner
    for t = 1 to T do
        learn h_t based on D
    end for
Step 3: Feature aggregation
    for i = 1 to m do
        D_h = {(x_i′, y_i)}, where x_i′ = (h_1(x_i), ..., h_T(x_i))

4.3. Data Preprocessing.
In this paper, CUSUM is used to detect the change point of the operating state of the rolling bearing vibration signal and to divide the bearing life cycle.

Table 1: Characteristics of the bearings.

Table 4: The degradation points of rolling bearing.

Table 7: Comparison of evaluation metric results. The bold values in the table are the values of the proposed method, which makes them easier to compare with the values of the other methods.