Remaining Useful Life Prediction of Rolling Bearings Based on Multiscale Convolutional Neural Network with Integrated Dilated Convolution Blocks

Remaining useful life (RUL) prediction is necessary for guaranteeing the safe operation of machinery. Among deep learning architectures, the convolutional neural network (CNN) has achieved good results in RUL prediction because of its strong representation-learning ability. Features from different receptive fields, extracted by convolution kernels of different sizes, can provide complete information for prognosis. With the single-size convolution kernel of a traditional CNN, it is difficult to learn comprehensive information from complex signals. In addition, a conventional CNN has limited ability to learn local and global features simultaneously. Thus, a multiscale convolutional neural network (MS-CNN) is introduced to overcome these problems. Convolution filters with different dilation rates are integrated to form a dilated convolution block, which can learn features in different receptive fields. Then, several stacked integrated dilated convolution blocks at different depths are concatenated to extract local and global features. The effectiveness of the proposed method is verified on a bearing dataset from the PRONOSTIA platform. The results show that the proposed MS-CNN has higher prediction accuracy than many other deep learning-based RUL methods.


Introduction
Prognostics and health management (PHM) is crucial for mechanical systems, and RUL prediction is one of its most important tasks in modern industry. Maintenance costs can be reduced if the remaining useful life of the machinery is known in advance. Bearings are critical parts of mechanical systems [1,2], and the failure of a bearing may lead to a severe accident. Thus, bearing RUL prediction has drawn more and more attention in PHM research. Bearing RUL prediction methods can be roughly divided into two types: model-based approaches and data-driven approaches [3]. With the development of modern industrial technology, an enormous amount of condition monitoring data is recorded, and data-driven methods such as deep learning (DL) have a powerful ability to process such massive data [4]. Since DL-based approaches can extract features from the input data without much prior knowledge, they have become more and more popular in RUL prediction and fault diagnosis [5]. A prediction framework built from deep autoencoders (AE) was proposed in Ref. [6], where the AE is used to retain sufficient information when compressing features. Shen et al. [7] proposed a contractive autoencoder-based rotating machinery fault diagnosis method in which robust features are learned automatically by the contractive autoencoder. A deep long short-term memory (DLSTM) network was proposed in Ref. [8], in which multisensor condition monitoring data are fused to obtain more useful information for accurate RUL prediction. Long short-term memory (LSTM) was also applied to discover potential patterns in Ref. [9]. Xiang et al. [10] proposed a novel LSTM framework in which attention-guided ordered neurons are applied to achieve accurate gear RUL prediction. A double-CNN model architecture was implemented in Ref. [11]: the fault occurrence time (FOT) is determined by the first CNN, and RUL prediction is accomplished by the second CNN. Guo et al. [12] proposed a health indicator (HI) construction method to monitor the health state of machinery, in which a convolutional neural network is designed to learn features and construct a mapping between the features and the HI. A recurrent convolutional neural network was designed in Ref. [13]; the temporal dependencies of different degradation states can be captured by its recurrent convolutional layers.
Among the variety of DL techniques, CNN has gained more attention because of two outstanding characteristics, i.e., spatially shared weights and local perception. CNN was first widely used in image recognition, where it achieved tremendous success. Nowadays, it is also popular in fault diagnosis [14] because it can accomplish feature extraction and fault classification automatically. However, there are still two shortcomings in traditional CNN. (1) The scale of the convolution kernel is very important for the performance of a CNN. Larger kernels extract features in a larger receptive field, whereas smaller kernels extract features in a smaller receptive field, so the performance of the network is directly affected by the kernel scale. Using convolution kernels of a single size can leave the extracted information incomplete. (2) A traditional CNN is not strong enough at extracting local and global features simultaneously. In a conventional CNN, only the feature maps of the final layer before flattening are treated as the final features, while features extracted in earlier low-level layers are omitted. Although the global features extracted by the high-level layers are more invariant than the features extracted by the low-level layers, the detailed local features extracted by the low-level layers also contribute to prognosis and classification [15,16].
To overcome these problems and learn more representative features, various multiscale CNN models have been proposed and applied in machinery fault diagnosis and RUL prediction [17]. In Ref. [18], the final convolutional layer and the final pooling layer are merged to construct a multiscale layer, so that global features from the final pooling layer and local features from the final convolutional layer are both utilized for classification. The results show that the network with the proposed multiscale layer can improve the recognition accuracy. Chang et al. [19], inspired by the inception model, proposed a concurrent convolution neural network method to enhance wind turbine fault diagnosis accuracy. Some multiscale CNN methods have also been applied in RUL prediction. A multiscale deep convolutional neural network (MS-DCNN) with three MS-blocks was applied in Ref. [20]. To extract features in different receptive fields, three distinct sizes of convolution kernels are implemented in parallel in each layer. Larger kernels extract features in a larger receptive field, but they also add more kernel weights to train, and a network with too many weights is generally hard to train. In Ref. [21], different moving steps (strides) in the convolution operations are used to obtain features at different scales. A multiscale convolutional neural network that merges the final convolutional layer and the final pooling layer was designed to extract local and global features for bearing RUL prediction [15], but the detailed information learned by the low layers is still lost in this architecture.
In this paper, a novel MS-CNN method for bearing RUL prediction is introduced. An integrated dilated convolution block is constructed to extract features in different receptive fields from the complex signal. Then, several stacked integrated dilated convolution blocks are concatenated to construct a multiscale feature extractor. The two advantages of the proposed MS-CNN are summarized as follows: (1) An integrated dilated convolution block is constructed to extract features in different receptive fields from complex signals without increasing the number of convolution kernel weights. (2) A multiscale feature extractor is constructed to avoid the loss of information at different levels. The multiscale feature extractor can make full use of the global features obtained from the higher layers and the local features obtained from the lower layers.
The rest of this paper is organized as follows. In Section 2, the relevant theoretical background is introduced, including convolutional neural networks and dilated convolution. The proposed MS-CNN bearing remaining useful life prediction method is introduced in Section 3. In Section 4, a public rolling bearing dataset collected from the PRONOSTIA platform is used to verify the superiority of the proposed method. Finally, the conclusions are summarized in Section 5.

Convolutional Neural Network.
CNN is a feed-forward neural network. The traditional CNN is mainly composed of three kinds of layers: convolutional layers, pooling layers, and fully connected layers [22].
In the convolutional layers, features are learned by several convolution kernels. The convolution operation is linear, so a nonlinear activation function is applied to its output to introduce nonlinearity between two adjacent layers. The output of the convolutional layer can be written as
$$Z_j^r = \varphi\Big(\sum_i Z_i^{r-1} * w_j^r + b_j^r\Big)$$
where $*$ denotes the convolution operation, $Z_j^r$ is the $j$th output of layer $r$, $Z_i^{r-1}$ is the $i$th input from the previous layer $r-1$, $w_j^r$ is the weight vector of the convolution kernel in the $r$th layer, $b_j^r$ is the bias of the $j$th output, and $\varphi(\cdot)$ is the nonlinear activation function.
In the pooling layer, the output of the convolutional layers is compressed to improve computational efficiency [14]; this is a form of down-sampling. Many pooling functions are available, such as $L_2$-norm pooling, max pooling, and average pooling. The pooling operation is written as
$$Z_m^n = \mathrm{pool}\big(Z_m^{n-1}, p, s\big)$$
where $Z_m^{n-1}$ is the $m$th input feature map in layer $n-1$, $Z_m^n$ is the $m$th output in layer $n$, $p$ is the pooling size, $s$ denotes the stride, and $\mathrm{pool}(\cdot)$ denotes the pooling function.
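As an illustration of the pooling operation, the following minimal sketch applies max pooling and average pooling to a one-dimensional feature map. The function names `pool1d` and `mean` and the toy input are assumptions made for this example; they are not taken from the implementation used in this paper.

```python
def pool1d(z, p, s, func=max):
    """Slide a window of size p with stride s over z and reduce each window with func."""
    return [func(z[i:i + p]) for i in range(0, len(z) - p + 1, s)]


def mean(window):
    """Average of one pooling window."""
    return sum(window) / len(window)


feature_map = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0]            # toy feature map (illustrative)
max_pooled = pool1d(feature_map, p=2, s=2)              # max pooling
avg_pooled = pool1d(feature_map, p=2, s=2, func=mean)   # average pooling
print(max_pooled, avg_pooled)
```

With pooling size 2 and stride 2, the six-element feature map is compressed to three elements, which is the down-sampling effect described above.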
After several convolutional and pooling operations, the feature maps of the previous layer are flattened and then sent to the fully connected layer. RUL prediction is a regression task. Thus, the fully connected layers can be regarded as the regression layer, which builds a connection from the feature maps learned by previous layers to the final result.
Compared with the ordinary convolution operation, the dilated convolution adds a hyperparameter called the dilation rate. As shown in Figure 1, different dilation rates can be seen as inserting holes of different sizes between adjacent convolution kernel parameters. When applied in a one-dimensional CNN, the dilated convolution can be calculated as
$$y_i = \sum_{k=1}^{K} x_{i + r \cdot k}\, \omega_k$$
where $y_i$ represents the $i$th element of the output, $x_i$ is the $i$th element of the input, $\omega_k$ are the weights of the filter, $K$ is the length of the filter, and $r$ is the dilation rate. A dilated convolution with $r = 1$ is equal to the ordinary convolution. When the dilation rate is 2, one zero is inserted between adjacent convolution weights.
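The relation between the dilation rate and the inserted zeros can be checked with a small sketch. It assumes zero-based kernel offsets and the cross-correlation convention common in deep learning frameworks; `dilated_conv1d` and `insert_zeros` are hypothetical helper names used only for illustration.

```python
def dilated_conv1d(x, w, r=1):
    """Valid 1-D dilated convolution: y[i] = sum_k x[i + r*k] * w[k] (zero-based k)."""
    span = (len(w) - 1) * r + 1          # receptive field of one output element
    return [sum(x[i + r * k] * w[k] for k in range(len(w)))
            for i in range(len(x) - span + 1)]


def insert_zeros(w, r):
    """Equivalent dense kernel: r - 1 zeros between adjacent weights."""
    dense = []
    for k, wk in enumerate(w):
        dense.append(wk)
        if k < len(w) - 1:
            dense.extend([0.0] * (r - 1))
    return dense


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
w = [1.0, 0.0, -1.0]
y_r1 = dilated_conv1d(x, w, r=1)                    # ordinary convolution, receptive field 3
y_r2 = dilated_conv1d(x, w, r=2)                    # receptive field 5, still 3 weights
y_dense = dilated_conv1d(x, insert_zeros(w, 2), 1)  # same result with an explicit zero-filled kernel
print(y_r1, y_r2, y_dense)
```

The receptive field grows as (K − 1)·r + 1 while the number of trainable weights stays at K, which is the property the integrated dilated convolution block relies on.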

Proposed Bearing RUL Method
In this section, the framework of the proposed multiscale convolutional neural network is introduced in detail. Machinery vibration signals are sent to the network directly as the input data. The procedure of the proposed MS-CNN-based RUL prediction method is introduced later. Four steps are implemented to obtain the bearing remaining useful life prediction: FOT determination, data preprocessing, RUL prediction, and smoothing. The proposed MS-CNN can establish a relationship between the monitoring signal and the remaining useful life without much prior knowledge, so it is easy to generalize to industrial applications.

Proposed MS-CNN Architecture.
The framework of the multiscale convolutional neural network is shown in Figure 2. The MS-CNN consists of two modules, a multiscale feature extractor and the regression layer. The multiscale feature extractor is constructed to improve the network's learning ability. Features in different receptive fields are taken into consideration by an integrated dilated convolution block. Then, the integrated dilated convolution blocks at different levels are all concatenated to establish the proposed multiscale feature extractor. The regression layer is designed to construct the mapping relationship between the features and the corresponding real lifetime.

The Integrated Dilated Convolution Block.
In the traditional CNN structure, features are extracted by the convolutional and pooling operations. The scale of the convolution kernel affects the learning ability of the network. A single-sized convolution kernel in each convolutional layer may leave the information learned by that layer incomplete. Inspired by the inception network, an integrated dilated convolution block is introduced, which is shown in Figure 3.
In this paper, convolution kernels with different dilation rates are concatenated to learn multiscale features in different receptive fields. Compared with convolution using kernels of different sizes, the integrated dilated convolution block can learn information at different scales without increasing the number of network parameters. Too large a dilation rate may cause the loss of detailed information, and too many dilation rates integrated into one block may lead to redundant information. Considering the characteristics of the vibration signal and inspired by inception networks, the integrated dilated convolution block is built with three dilation rates.
When the input data are sent to the integrated dilated convolution block, three kinds of convolution operations are performed on the input data in parallel. Features extracted by the n filters are then concatenated into a feature vector, which can be written as
$$g^l(x) = \big[g_1^l(x), g_2^l(x), \ldots, g_n^l(x)\big]$$
where $g_i^l(x)$ denotes the feature map learned by the $i$th convolution kernel in the $l$th layer.
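A minimal sketch of one integrated dilated convolution block is given here to make the parallel branches concrete. The dilation rates (1, 2, 3), the kernel values, the zero padding, and the ReLU activation are assumptions chosen for illustration; the actual settings of the network are those listed in Table 2.

```python
def dilated_conv1d_same(x, w, r):
    """Zero-padded dilated convolution that keeps the input length (odd kernel length)."""
    assert len(w) % 2 == 1, "sketch assumes an odd kernel length"
    pad = (len(w) - 1) * r // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(padded[i + r * k] * w[k] for k in range(len(w)))
            for i in range(len(x))]


def integrated_dilated_block(x, kernels, rates=(1, 2, 3)):
    """Run one branch per dilation rate on the same input, apply ReLU, stack the branches."""
    return [[max(0.0, v) for v in dilated_conv1d_same(x, w, r)]
            for w, r in zip(kernels, rates)]


x = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]          # unit impulse as a toy input
kernels = [[1.0, 1.0, 1.0]] * 3                   # same kernel size in every branch
features = integrated_dilated_block(x, kernels)   # three feature maps of length 7
n_weights = sum(len(w) for w in kernels)          # does not depend on the dilation rates
receptive_fields = [(len(w) - 1) * r + 1 for w, r in zip(kernels, (1, 2, 3))]
print(features, n_weights, receptive_fields)
```

Each branch responds to the impulse over a wider span as the dilation rate grows, while the total number of kernel weights is unchanged.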

3.1.2. The Proposed MS-CNN.
In the traditional CNN structure, the feature maps of the last pooling layer are treated as the final features for classification or regression. Local features extracted by previous layers are usually discarded. Although the global features extracted by high-level layers are much more representative and robust than those extracted by low-level layers, local features contain detailed information that is useful for prognosis.
In this paper, the outputs of the different integrated dilated convolution blocks are concatenated to extract local and global features simultaneously. The proposed MS-CNN is shown in Figure 2. The concatenated feature maps contain not only the invariant and stable global features but also the detailed information. The concatenated feature maps can be expressed as
$$G = g^1 \oplus g^2 \oplus \cdots \oplus g^B$$
where $G$ denotes the concatenated feature maps, $\oplus$ denotes the concatenation operation, $g^l$ indicates the feature maps of the $l$th integrated dilated convolution block, and $B$ is the number of blocks.
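The difference between using only the last block and concatenating every block can be sketched as follows. The stand-in block `smooth_and_downsample` is hypothetical and only mimics the way deeper blocks summarise the signal more coarsely; it is not the integrated dilated convolution block itself.

```python
def multiscale_features(x, blocks):
    """Pass x through the stacked blocks and concatenate every block's output."""
    concatenated = []
    current = x
    for block in blocks:
        current = block(current)
        concatenated.extend(current)   # keep the features of this level
    return concatenated


def smooth_and_downsample(signal):
    """Hypothetical stand-in block: average neighbouring pairs (halves the length)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]


x = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0]
blocks = [smooth_and_downsample] * 3
features = multiscale_features(x, blocks)           # local + intermediate + global features
last_only = blocks[2](blocks[1](blocks[0](x)))      # what a conventional CNN would keep
print(features, last_only)
```

The concatenated vector still ends with the global feature of the deepest block, but it also retains the detailed outputs of the shallower blocks.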

Shock and Vibration 3
To keep the predicted RUL in the range from 0 to 1, the sigmoid activation function is applied in the last fully connected layer. The sigmoid activation function can be expressed as
$$\mathrm{sigmoid}(X) = \frac{1}{1 + e^{-X}}$$
where $X$ represents the input feature map of the final fully connected layer. The MSE loss function is utilized to update the parameters of the whole network, which is expressed as
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\big(f(x_i) - y_i\big)^2$$
where $N$ denotes the number of training samples, $f(x_i)$ denotes the predicted RUL value of the input $x_i$, and $y_i$ is the true label.
The superiority of the MS-CNN is confirmed by testing samples. Finally, a smoothing operation is implemented to get a continuous RUL result.
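The sigmoid output and the MSE loss used by the regression layer can be written in a few lines. This is a plain-Python sketch with made-up inputs; it is not numerically hardened for very large negative arguments.

```python
import math


def sigmoid(x):
    """Map a real-valued activation to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mse(predictions, labels):
    """Mean squared error between predicted and true RUL labels."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)


activations = [-2.0, 0.0, 3.0]                     # illustrative pre-activation values
predicted_rul = [sigmoid(a) for a in activations]  # always inside (0, 1)
labels = [0.1, 0.5, 0.9]                           # illustrative normalized RUL labels
loss = mse(predicted_rul, labels)
print(predicted_rul, loss)
```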

3.2.1. Step 1: FOT Determination.
As shown in Figure 5, vibration signals recorded by the sensors usually undergo a stable period in the early stage, which means the monitored component is in a healthy stage. Data in the healthy stage carry information unrelated to RUL prediction. Thus, it is necessary to determine the FOT before RUL prediction. In this paper, kurtosis is applied to find the FOT. Kurtosis is calculated as
$$\mathrm{Kurtosis} = \frac{1}{N}\sum_{i=1}^{N}\frac{(x_i - \bar{x})^4}{\sigma^4}$$
where $\bar{x}$ is the mean of the signal, $\sigma$ is the standard deviation of the signal, and $N$ is the number of signal data points. Kurtosis is very sensitive to the amplitude value, so the degradation is well reflected by kurtosis. Thus, it is usually very helpful for detecting incipient faults [27].
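The sensitivity of kurtosis to impulsive amplitudes can be seen in a short sketch. The population (biased) variance is assumed here, and the two toy signals are illustrative only.

```python
def kurtosis(signal):
    """Fourth standardized moment of a signal (assumes a non-constant signal)."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((v - mean) ** 2 for v in signal) / n        # population variance
    return sum((v - mean) ** 4 for v in signal) / (n * var ** 2)


steady = [-1.0, 1.0] * 50            # regular oscillation without impulses
impulsive = [0.0] * 99 + [10.0]      # one large spike, as an incipient fault might cause
print(kurtosis(steady), kurtosis(impulsive))
```

A single spike raises the kurtosis by roughly two orders of magnitude in this toy case, which is why it is a convenient indicator of early faults.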
The Laida criterion (also known as the 3σ rule) is applied to detect the FOT. The signal in the early period is assumed to be in the healthy stage. The mean μ and the standard deviation σ of the kurtosis in the healthy stage are calculated, and the 3σ interval around μ is then used as the FOT indicator.
Although kurtosis is very effective as a health index for detecting early fault points, it is not stable enough because it is affected by noise and outliers. Thus, local linear regression is applied after calculating kurtosis from the vibration signal, which removes FOT misjudgments caused by outliers. After the local linear regression smoothing operation, the first time t at which the smoothed kurtosis falls outside the 3σ interval is regarded as the FOT, i.e., the first time at which
$$\big|k_u^i - \mu\big| > 3\sigma$$
where $k_u^i$ denotes the kurtosis of the $i$th data sequence. The FOT of bearing1_1 determined by the applied method is shown in Figure 5. The amplitude of the vibration signal stays stable before the FOT, and degradation starts after the FOT, which indicates that the FOT determination method is effective and accurate. FOT determination is performed only in the training process. In the online testing process, the vibration signal of the whole lifetime is not available, which makes FOT determination impossible.
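The FOT determination step can be sketched as follows. The window size, the number of samples assumed healthy, and the synthetic kurtosis series are assumptions for illustration, and the simple windowed least-squares fit below is only one possible form of local linear regression; it is not necessarily the variant used in this paper.

```python
def local_linear_smooth(values, half_window=2):
    """At each index, fit a least-squares line to the neighbours and evaluate it there."""
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half_window), min(len(values), i + half_window + 1)
        ts = list(range(lo, hi))
        ys = values[lo:hi]
        t_mean = sum(ts) / len(ts)
        y_mean = sum(ys) / len(ys)
        sxx = sum((t - t_mean) ** 2 for t in ts)
        sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
        slope = sxy / sxx if sxx else 0.0
        smoothed.append(y_mean + slope * (i - t_mean))
    return smoothed


def find_fot(kurtosis_series, healthy_count, half_window=2):
    """Return the first index whose smoothed kurtosis leaves the 3-sigma interval, or None."""
    smoothed = local_linear_smooth(kurtosis_series, half_window)
    healthy = smoothed[:healthy_count]                 # assumed healthy stage
    mu = sum(healthy) / len(healthy)
    sigma = (sum((k - mu) ** 2 for k in healthy) / len(healthy)) ** 0.5
    for i in range(healthy_count, len(smoothed)):
        if abs(smoothed[i] - mu) > 3 * sigma:
            return i
    return None


# Synthetic series: stable kurtosis for 30 samples, then a steady increase.
kurtosis_series = [3.0 + 0.1 * (-1) ** i for i in range(30)]
kurtosis_series += [3.0 + 0.5 * (j + 1) for j in range(10)]
fot_index = find_fot(kurtosis_series, healthy_count=20)
print(fot_index)
```

Because the smoothing window looks ahead, the detected index can fall slightly before the first degraded sample; this is acceptable here since the FOT is determined offline on complete training records.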

3.2.2. Step 2: Data Preprocessing.
To speed up the training convergence, normalizing the raw data is a common and effective operation. First, the selected sensor data are divided into time segments of length L. Each time segment is a sample that can be represented as $X = [x_1, x_2, x_3, \ldots, x_L]$. Max-min normalization is implemented to keep the data within [0, 1]:
$$\tilde{x}_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$
where $x_{\max}$ is the maximum value of the sample X, $x_{\min}$ is the minimum value of the sample X, and $\tilde{x}_i$ represents the value of $x_i$ after max-min normalization. The label of the training data is constructed as the reliability in the range [0, 1]. The FOT is regarded as the start of degradation, and the label is described as a linearly degrading process from the FOT to complete failure, as shown in Figure 6. Then, the samples and the corresponding labels are treated as the input and the output of the proposed MS-CNN.
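The preprocessing step can be sketched as segmentation, per-sample max-min normalization, and linear label construction. The helper names and the toy numbers are assumptions; in particular, labelling only the samples from the FOT onward is one reading of the description above, not a confirmed detail of the implementation.

```python
def segment(signal, length):
    """Split a signal into non-overlapping samples of the given length (tail dropped)."""
    return [signal[i:i + length] for i in range(0, len(signal) - length + 1, length)]


def min_max_normalize(sample):
    """Scale one sample into [0, 1] using its own minimum and maximum."""
    lo, hi = min(sample), max(sample)
    if hi == lo:                       # constant sample: avoid division by zero
        return [0.0 for _ in sample]
    return [(v - lo) / (hi - lo) for v in sample]


def linear_labels(n_samples, fot_index):
    """Labels for samples fot_index..n_samples-1: 1.0 at the FOT, 0.0 at failure."""
    span = n_samples - 1 - fot_index   # assumes the FOT precedes the last sample
    return [1.0 - (i - fot_index) / span for i in range(fot_index, n_samples)]


signal = [2.0, 4.0, 6.0, 8.0, 1.0, 3.0, 5.0, 9.0, 7.0]
samples = [min_max_normalize(s) for s in segment(signal, length=4)]
labels = linear_labels(n_samples=6, fot_index=2)
print(samples, labels)
```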

Step 3: RUL Prediction Based on the Proposed MS-CNN.
The proposed method based on MS-CNN has two processes: offline training and online testing. In the training period, the training dataset and the corresponding labels are used to train the proposed MS-CNN. After data preparation, the training segments pass through the multiscale feature extractor, which learns more representative and comprehensive features. The output of the multiscale feature extractor is sent to the regression layer to construct the relationship between features and RUL. The MSE loss function is applied in the network. A small number of training samples can lead to overfitting, in which case the network performs badly on the testing dataset. Dropout is therefore adopted in the fully connected layers: some hidden neurons are randomly set to zero during training, and dropout is turned off in the testing process. The Adam algorithm is applied to update the parameters of the network. Unlike plain stochastic gradient descent, Adam adapts the step size of each parameter during training, which reduces the need for manual learning-rate tuning. When the online testing data at a given moment are sent to the network, the RUL for that moment is predicted by the trained network.
To assess the predicted result quantitatively, two error indicators are applied in this paper, i.e., the mean absolute error (MAE) and the root mean squared error (RMSE) [28]:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\big|E_i - L_i\big|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(E_i - L_i\big)^2}$$
where $N$ denotes the number of samples, $E_i$ is the RUL of the $i$th sample predicted by the proposed MS-CNN, and $L_i$ is the corresponding actual RUL.
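Both error indicators are straightforward to compute. The sketch below uses three made-up predictions and labels purely to show the calculation.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and actual RUL."""
    return sum(abs(e - l) for e, l in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual RUL."""
    return math.sqrt(sum((e - l) ** 2 for e, l in zip(predicted, actual)) / len(actual))


predicted = [0.9, 0.6, 0.2]   # illustrative normalized RUL predictions
actual = [1.0, 0.5, 0.0]      # illustrative normalized actual RUL
print(mae(predicted, actual), rmse(predicted, actual))
```

RMSE penalizes large individual errors more strongly than MAE, so reporting both gives a fuller picture of the prediction quality.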

Step 4: Smoothing.
The RUL results predicted by the network are discrete and fluctuating. However, in real industrial applications, the actual RUL of a bearing is always continuous, and it decreases as time goes by. Thus, a smoothing operation is applied to the predicted RUL.
At time n, the predicted RUL is recorded as RUL(n). The predictions at the five preceding time points, n − 5, n − 4, ..., n − 1, together with the prediction at n, are passed through a moving average filter. The smoothing operation is implemented as in equation (12). The resulting RUL value at time n is regarded as the final predicted result. Smoothing brings the predicted result into accordance with the actual condition.
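A trailing moving average of this kind can be sketched as below. The window is assumed to contain the current prediction and up to five previous ones, with a shorter window at the start of the record; the exact form used in the paper is the one given by its smoothing equation.

```python
def smooth_rul(predictions, history=5):
    """Trailing moving average over the current point and up to `history` previous points."""
    smoothed = []
    for n in range(len(predictions)):
        window = predictions[max(0, n - history):n + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


raw = [1.0, 0.8, 1.0, 0.6, 0.8, 0.4, 0.6, 0.2]   # fluctuating raw predictions (illustrative)
smoothed = smooth_rul(raw)
print(smoothed)
```

Only past and present predictions enter the window, so the filter can also be applied online as new predictions arrive.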

Experiment
The PRONOSTIA platform (shown in Figure 7) performed accelerated degradation tests of rolling bearings. The platform is mainly divided into three parts: the rotating part, the degradation part, and the measurement part. Degradation of the bearings is accelerated by the degradation part. To measure the vibration signal of the bearings, two acceleration sensors were installed on the vertical and horizontal axes. However, the amplitude of the vertical vibration signals is lower than that of the horizontal ones, and the degradation trend was better captured by the sensor placed on the horizontal axis. Therefore, only the horizontal vibration signals are used in this paper. The sampling frequency in the experiment is 25.6 kHz [29]. A sample contains the data collected during 0.1 s, recorded every 10 s. Run-to-failure data from 17 bearings were acquired. The first two bearings of each working condition were used for training, and the rest of the bearings were used as the testing data. All the datasets are shown in Table 1.
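The sample length and the training/testing split follow directly from the figures above, as the short sketch below shows. The per-condition bearing counts (7, 7, and 3) and the `Bearing{condition}_{index}` naming are assumptions based on the public PHM 2012 dataset layout; the authoritative list is Table 1.

```python
SAMPLING_RATE_HZ = 25_600     # 25.6 kHz
RECORD_DURATION_S = 0.1       # each sample covers 0.1 s
RECORD_INTERVAL_S = 10        # one sample every 10 s

points_per_sample = round(SAMPLING_RATE_HZ * RECORD_DURATION_S)


def split_bearings(counts_per_condition, n_train=2):
    """First n_train bearings of each working condition train the model; the rest test it."""
    train, test = [], []
    for condition, count in counts_per_condition.items():
        names = [f"Bearing{condition}_{i}" for i in range(1, count + 1)]
        train.extend(names[:n_train])
        test.extend(names[n_train:])
    return train, test


# Assumed bearing counts per working condition (17 bearings in total).
train_set, test_set = split_bearings({1: 7, 2: 7, 3: 3})
print(points_per_sample, train_set, test_set)
```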

Proposed Method on Experimental Data.
The raw vibration signal of each run-to-failure bearing is divided into time-series segments. Each segment is a sample containing 2560 data points. After the determination of the FOT, each sample is normalized by the max-min normalization method described in Section 3.2.2. Training samples with the corresponding labels were utilized for training the MS-CNN. The parameters of the MS-CNN are displayed in Table 2, where KS represents the size of the convolution kernel, r represents the dilation rate, n is the number of filters, s represents the stride, and N represents the number of neurons in the fully connected layer. The dropout strategy is adopted to avoid overfitting, with a coefficient of 0.2. MSE is employed as the loss function. The Adam optimization algorithm is applied to update the parameters of the model in the back-propagation process. The training loss curve is shown in Figure 8. The training loss declined rapidly from the beginning to the 20th epoch, declined slowly between the 20th and 80th epochs, and was stable from the 80th epoch to the 100th epoch. Thus, the number of training epochs is set to 80.
The performance of the network is related to its depth, so the effect of the number of integrated dilated convolution blocks is discussed in this section. As shown in Figure 9, the performance of the network first improves as the number of integrated dilated convolution blocks increases. When the number of blocks exceeds three, the MAE of the result increases. Too many integrated dilated convolution blocks may lead to overfitting, in which the network performs well on the training data but badly on the testing data. In addition, the training time becomes longer as the depth of the network increases. Thus, in this study, the number of integrated dilated convolution blocks is set to 3.
The raw vibration signal of bearing1_3 is shown in Figure 10. Bearing1_3 shows a gradual degradation trend before failure. The remaining useful life prediction result of bearing1_3 is shown in Figure 11. The yellow line in the figure is the raw estimation result without the smoothing operation, which is discontinuous and fluctuates over a larger range. The red line is the estimation result with the smoothing operation. The smoothed estimation shows a steady and continuous RUL result for the testing data, which is consistent with the actual RUL. The raw vibration signal of bearing1_7 is shown in Figure 12. Bearing1_7 shows a sudden failure behavior. The RUL prediction result of bearing1_7 is shown in Figure 13. Although the estimation result is not completely in line with the actual RUL in the early stage, degradation is effectively reflected in the near-failure stage. Bearings with different failure behaviors were used to demonstrate the superiority of the proposed MS-CNN method.

Comparison Results.
Several commonly used deep learning models are used for comparison, including a deep neural network (DNN), a convolutional neural network (CNN), long short-term memory (LSTM), and the multiscale convolutional neural network that merges the final convolutional layer with the final pooling layer [15]. The parameters of all comparative methods were tuned to relatively optimal values.

MS-CNN.
The third convolution layer and the third pooling layer were concatenated to form a fused layer as proposed in Ref. [15]. The kernel size and the number of kernels are the same as in the CNN structure. After the fused layer, three fully connected layers were constructed in the same way as in the proposed MS-CNN. The prognostic results of the different methods for the testing data of bearing1_3 are shown in Figure 14. The degradation is reflected by the proposed method, and the RUL of bearing1_3 predicted by the proposed method is closer to the actual RUL than those of the other approaches. The DNN method shows the worst result among the compared methods. The prognostic results of the different methods for the testing data of bearing1_7 are shown in Figure 15. The RUL of bearing1_7 predicted by the proposed MS-CNN is not consistent with the actual RUL in the early stage because the signal in the early stage is stable. In the late stage, degradation is reflected by the proposed MS-CNN. Since accurate RUL prediction in the near-failure stage is more important in real industries, the proposed MS-CNN is promising for real industrial implementation. The numerical comparison results for all the testing data are shown in Table 3. It can be seen that the MAE and RMSE of the proposed MS-CNN are almost always the lowest among the compared methods. Although the proposed MS-CNN performs badly on bearing2_4, the method is still robust across many different tasks. The DNN method has the largest errors among the methods. The MS-CNN in Ref. [15] has smaller errors than the CNN because the combination of the final convolutional layer and the final pooling layer makes use of the local and global features learned by the high-level layers. However, the MS-CNN in Ref. [15] has larger errors than the proposed method, which suggests that the detailed information extracted by the low-level layers is useful for prognosis. The LSTM method performs worse than the CNN method; it is not well suited to extracting features from large amounts of raw data. Moreover, the LSTM method takes much more time to train than the other methods, which makes it less suitable for industrial applications. The results show that the proposed method can provide reliable remaining useful life estimations for different failure behaviors.

Conclusions
In this paper, an MS-CNN-based method for bearing prognosis is proposed to overcome the shortcomings of traditional CNN. The effectiveness of the proposed method was verified on a public dataset. The contributions are summarized as follows: (1) the integrated dilated convolution block can extract features in different receptive fields from the raw signal without increasing the number of network parameters; (2) the integrated dilated convolution blocks at different depths are concatenated, avoiding the loss of detailed information learned by the lower layers. The proposed architecture achieves higher accuracy than the other deep learning methods considered in this paper. However, the structure of the network is designed subjectively. Future work will concentrate on optimizing the structure of the network automatically.

Data Availability
The data used in this paper are available and can be downloaded from the GitHub repository wkzs111/phm-ieee-2012-datachallenge-dataset, which hosts the dataset used during the PHM IEEE 2012 Data Challenge, built by the FEMTO-ST Institute.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.