Developing Deep Survival Model for Remaining Useful Life Estimation Based on Convolutional and Long Short-Term Memory Neural Networks

The application of mechanical equipment in manufacturing is becoming more and more complicated with technology development and adoption. To maintain high reliability and stability of the production line, it is necessary to reduce both the repair downtime and the frequency of routine maintenance. Since the degradation of machines and their components is inevitable, accurately estimating their remaining useful life is crucial. We propose an integrated deep learning approach with convolutional neural networks and long short-term memory networks to learn the latent features and estimate the remaining useful life value with a deep survival model based on the discrete Weibull distribution. We use the turbofan engine degradation simulation dataset from the Commercial Modular Aero-Propulsion System Simulation provided by NASA to validate our approach. The results show that our proposed model can capture the degradation trend of a fault and has superior performance under complex conditions compared with existing state-of-the-art methods. Our study provides an efficient feature extraction scheme and offers a promising prediction approach to make better maintenance strategies.


Introduction
With the advance of Internet of Things (IoT) technology and its applications in industrial environments, data analytics methods can be applied to monitor the health and performance of mechanical equipment. Any machine breakdown may lead to a huge loss in production yield. However, sometimes even a professional and experienced engineer cannot locate the fault or identify the main cause of the malfunction. In this case, the company has no choice but to suspend the production line for a thorough examination, which is one of the most disastrous situations for the company's business. To maintain high reliability and stability of the production line, it is necessary to reduce both the downtime spent fixing malfunctions and the frequency of routine maintenance. The earliest maintenance technique, called breakdown maintenance or run-to-failure maintenance, takes place only when a breakdown happens. Later, companies adopted time-based preventive maintenance, in which engineers perform maintenance periodically regardless of the status of the machine, even when it is healthy. Preventive maintenance is costly and has become a major expense for many companies. To reduce this cost, another maintenance strategy, condition-based maintenance (CBM), was developed. CBM aims to reduce the number of unnecessary regular preventive maintenance operations and to improve machine reliability by performing maintenance only when there is evidence that an abnormality has occurred [1,2]. Because it effectively reduces costs, CBM has become more and more popular. Prognostics and health management (PHM) is one of the major tasks in CBM. The core of PHM is the estimation of the remaining useful life (RUL) of machines based on the collected information about historical and ongoing degradation trends [2][3][4].
The flowchart of PHM contains five major processes as shown in Figure 1, including data acquisition, signal processing, diagnostic, prognostic, and maintenance decision. Data acquisition is the first process of PHM which is composed of sensors, data transmission, and data storage devices.
Different kinds of sensors are used to collect different types of data, which are related to the health condition and able to reflect the degradation process of the monitored machine. Signal processing extracts useful information from the data acquired in the previous step. Diagnostic is the process of dividing the machine's whole lifetime into different health states. Prognostic aims at estimating the length of time from the current time to the point at which the machine requires repair or replacement, which is also the definition of RUL. Maintenance decision is the final process in PHM and is used for analyzing the outputs from diagnostics and prognostics. If we can predict the RUL, we are able to propose a strategy for scheduling maintenance, avoiding unplanned downtime, and optimizing operating efficiency and frequency to minimize cost. However, since the degradation of machines and components is inevitable, the challenge of proper scheduling grows with the complexity of the machines. One of the key problems in predictive maintenance is that equipment failures should be predicted early enough for proper maintenance to be scheduled before they happen. Therefore, predictive maintenance is based on the continual monitoring of the equipment in order to determine the right maintenance actions at the right times. The organization of this paper is as follows. Related works on RUL prediction are introduced in Section 2. Our proposed deep learning approach with a survival model is described in Section 3. In Section 4, the experimental results are compared with existing state-of-the-art methods to show the effectiveness of the proposed approach. We conclude with the contributions and limitations of our approach in Section 5.

Related Works
Generally, the methods for RUL estimation can be categorized into model-based, data-driven, and hybrid approaches. Model-based prediction applies a physical model of the system's degradation [5]. This approach can be further divided into microlevel models [6] and macrolevel models [7]. Microlevel models need to consider the assumptions and simplifications in uncertain environments. A macrolevel model is constructed under different operational conditions of the physical system and includes the relationships among input variables, state variables, and system outputs. However, model-based methods require a large amount of prior knowledge, and physical models are difficult to build for systems with many components, which limits the effectiveness of these methods. Data-driven approaches detect the state of the system through a large number of monitoring sensors; they are more suitable for complex systems and do not require a comprehensive understanding of the underlying physics [8]. Currently, the high-dimensional data collected in real-life PHM applications makes it difficult for a prognostic algorithm to discover trends directly. Systems of the same type operate under various operational conditions and health states, which may cause different degradation processes as well as unit-to-unit variability (UtUV) [9]. This situation makes RUL estimation difficult. Javed et al. contributed a data-driven prognostics approach based on the extreme learning machine (ELM), which is able to model degrading states without assuming a homogeneous pattern [10]. Liu and Chen combined an indirect health indicator (HI) and smooth monotonic signals from sensory data with multiple Gaussian process regression (GPR) models to achieve RUL prediction [11]. Previous works constructed a model based on the Box-Cox transformation (BCT) and Monte Carlo (MC) simulation to predict battery RUL [12]. Khelif et al. developed a procedure to estimate the RUL directly from sensor values using support vector regression, which models the direct relationship among sensor values or health indicators [13]. However, dramatic changes and variations in the indicators mean that even a well-trained prediction model may not be suitable for practical applications. Traditional feature extraction methods struggle to obtain high-level representations from measurements, which may lead to poor prognostic performance. Therefore, efficiently capturing hidden patterns from high-dimensional data is a necessary step in feature extraction [14]. Hybrid approaches aim to combine the strengths of model-based and data-driven methods [15]. However, it remains challenging to exploit the advantages and avoid the disadvantages of both approaches.
Recently, data-driven prediction methods have focused on flexible deep learning models to capture useful information from high-dimensional data efficiently. Zhang et al. employed a multiobjective evolutionary algorithm with traditional deep belief networks (DBN) for RUL estimation in prognostics [16]. Sequence learning methods such as the hidden Markov model (HMM) were applied to capture time series information from the sensory data [17][18][19]. In an HMM, each state can depend only on the immediately previous state, and the hidden states must be drawn from a discrete space. However, modeling long time dependencies may lead to high computational complexity as the set of hidden states grows larger. Recurrent neural networks (RNN) can model time sequence data as well, and some works applied them to estimate RUL [20]. However, RNNs are limited in capturing long-term time dependencies, because the gradients propagated over many hidden layers tend to either vanish or explode [21]. The long short-term memory (LSTM) network is a significant branch of RNN that can learn long-term dependencies and avoid vanishing and exploding gradients in long sequence training. Previous studies have shown that LSTM networks can expose hidden patterns in sequential sensor data with multiple operating conditions, faults, and degradation models [22,23]. Some new approaches based on LSTM, such as the bidirectional long short-term memory (BLSTM) network [24] and vanilla LSTM [25], were proposed. Wang et al. proposed a transfer learning algorithm based on BLSTM networks, in which the model is first trained on one dataset and then fine-tuned with a different but related dataset [26]. Recent works also enhanced LSTM networks with attention mechanisms and generative adversarial networks (GAN) to improve the interpretability and accuracy of the deep networks [27][28][29].
The convolutional neural network (CNN) architecture has been proven effective for extracting abstract information from multichannel sequential sensor data [30][31][32]. Although LSTM networks enable us to capture long-term time dependencies, their feature extraction capabilities are marginally lower than those of CNN [33]. CNN can extract spatial features while LSTM can learn temporal features. Therefore, it is better to learn temporal features from the inherently slow, long-term degradation process by combining those two structures. Recent papers proposed deep neural network structures in which LSTM and CNN are combined in a serial or parallel manner to improve the accuracy of the RUL prediction of the equipment [34,35].
Most of the prior works have focused on RUL prediction that outputs a single numeric RUL value only. However, it is nearly impossible to find an approach that predicts an RUL exactly equal to the real one, and if the variance is large, it is hard to have confidence in the predicted result. The RUL prediction problem is also similar to survival analysis, which is commonly used to model time-to-death events in the healthcare domain [34]. For example, a model predicting that the failure will happen in 8 days with 80% probability is much better than one that only predicts 10 days until the failure. Martinsson proposed the Weibull time-to-event recurrent neural network, which is a simple framework for time series prediction of the time to the next event [36]. Aggarwal et al. used the Weibull distribution assumption on the time-to-failure event with a linear hazard rate corresponding to the linear degradation model that most of the literature assumes [34].
Due to the complicated environments in real-life PHM applications, the monitored sequential sensor data are subject to the operating conditions and fault modes. Existing data-driven methods often train on the sensor measurements as a whole, which may reduce effectiveness and introduce bias. To cope with this issue, there is great potential to improve RUL estimation by extracting latent patterns from partial information, which is a necessary step for capturing useful information from high-dimensional sequential sensor data. It is also valuable to estimate how much time the equipment has left together with the probability of a failure. Therefore, we integrate CNN and LSTM with a deep survival model to enhance the ability of feature extraction and capture the degradation trend of a fault with a reasonable prediction horizon.

Materials and Methods
The overall workflow of our approach is shown in Figure 2. Different sensors may have different physical meanings and numerical ranges. In order to eliminate the influence of the value ranges, we first apply min-max normalization as feature scaling to adjust the range of sensor values to between 0 and 1. Second, the training data and test data are prepared with a sliding time window (TW) to generate the sequential samples. Third, we use 1D temporal convolutions to learn hidden patterns in those sequential samples without any interference from the other sensor values. Fourth, the temporal patterns extracted by the 1D convolution are fed into LSTM networks to learn the long short-term time dependencies. Fifth, we use both regression and survival analyses with the discrete Weibull distribution to estimate the RUL and failure probability in the training phase. Finally, in the testing phase, we predict the RUL and the probability of a failure from the test data with the trained model.
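The first step of the workflow, min-max normalization, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names `fit_min_max` and `apply_min_max`, the choice to fit the ranges on the training data only, and the mapping of constant sensors to 0 are all assumptions made for the example.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # rows are cycles, columns are sensors


def fit_min_max(train: Matrix) -> Tuple[List[float], List[float]]:
    """Return the per-sensor minima and maxima found in the training data."""
    columns = list(zip(*train))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(data: Matrix, lo: List[float], hi: List[float]) -> Matrix:
    """Scale every sensor column with the fitted minima and maxima."""
    scaled = []
    for row in data:
        scaled.append([
            # Assumption: a constant sensor (hi == lo) is mapped to 0.
            (x - l) / (h - l) if h > l else 0.0
            for x, l, h in zip(row, lo, hi)
        ])
    return scaled


# Hypothetical readings: three cycles of three sensors, the last one constant.
train = [[10.0, 0.5, 3.0], [20.0, 0.7, 3.0], [30.0, 0.9, 3.0]]
lo, hi = fit_min_max(train)
scaled_train = apply_min_max(train, lo, hi)
print(scaled_train)
```

Reusing the training minima and maxima for the test data keeps both sets on the same scale, although test values outside the training range will then fall slightly outside [0, 1].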

Data Preparation with Time Window.
The input sequential sensor data from an engine are assumed to be a matrix $X(n) = [x_1, x_2, \cdots, x_{L_s}]$ with $k$ sensors (measurements), where $n$ denotes the engine ID and $L_s$ denotes the last observed cycle or the cycle at which the fault occurs, as in Figure 3. A sliding time window strategy is adopted to generate temporal sequence data rather than samples at a single time step, which may enable more efficient feature extraction. Taking $X(n)$ as the input, the sequential samples $X_i(n)$ are extracted by sliding a fixed time window (TW) of length $m$, as presented in Figure 3:

$$X_i(n) = [x_i, x_{i+1}, \cdots, x_{i+m-1}] \tag{1}$$

where $x_t = [x_t^1, x_t^2, \cdots, x_t^k]$ represents a $k$-dimensional array of measurements at time (cycle) $t$ within a TW of size $m$. Therefore, the size of each array extracted by the TW is $m \times k$ (TW × number of sensors), and the number of arrays is $L_s - m$ (lifetime cycles minus TW).
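The sliding time window can be sketched in a few lines of Python. The helper `sliding_windows` is a hypothetical name, and the sketch uses every possible start position, which yields L_s - m + 1 windows. The count of L_s - m stated above would follow from omitting one boundary window, a detail the text does not specify.

```python
from typing import List

Matrix = List[List[float]]  # rows are cycles, columns are sensors


def sliding_windows(series: Matrix, m: int) -> List[Matrix]:
    """Cut a run of L_s cycles into overlapping windows of m consecutive cycles."""
    if m <= 0:
        raise ValueError("window length m must be positive")
    # Each window is an m x k slice; a run shorter than m yields no window.
    return [series[i:i + m] for i in range(len(series) - m + 1)]


# Hypothetical engine with L_s = 6 cycles and k = 2 sensors.
engine = [[float(t), float(10 * t)] for t in range(1, 7)]
windows = sliding_windows(engine, m=4)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

Each returned window has shape m × k, so it can be passed directly to a 1D convolution over the time axis.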

Temporal Convolutional Layer.
To deal with sequential information more effectively, a CNN layer can be used to extract abstract and high-level features before the LSTM layers. The temporal convolutional layer consists of three layers: 1D convolution, followed by 1D max-pooling and an activation function. In a 1D convolution, a filtering window of fixed length moves along the depth of the data. We consider an input data matrix $X_i(n)$ of size $m \times k$ and $d$ kinds of feature detectors with kernel size $w$. Each feature detector therefore moves $(m - w) + 1$ times and generates the set of feature detector regions $X_{dw}$. With the 1D-convolution weight kernel $w^{*}$ and bias $b$, the convolution operation is

$$C = f(X_{dw} \otimes w^{*} + b) \tag{2}$$

where $\otimes$ denotes the Hadamard product (element-wise product) and $f$ represents the nonlinear activation function, ReLU. Accordingly, the output feature maps of a 1D-convolution layer have size $((m - w) + 1) \times d$. Since we have multiple temporal convolution layers, we let $C^{(l-1)}$ and $C^{(l)}$ be the input and output of the $l$th layer, respectively. We denote the $j$th feature map of layer $l$ as $C^{(l)}_j$, which can be computed by

$$C^{(l)}_j = f(C^{(l-1)} \otimes w^{*}_j + b_j) \tag{3}$$

Long Short-Term Memory.
The LSTM cell state relies on three control gates: the input gate, the forget gate, and the output gate. The input gate controls the extent to which incoming data flows into the cell. The forget gate decides which data from the previous cell state are taken into consideration or ignored. The output gate decides whether the value in the cell is used to compute the output. The LSTM layer performs the following internal computations:

$$f_l = \sigma(W_f [h_{l-1}, C^{(l)}_j] + b_f) \tag{4}$$

where $f_l$ denotes the forget gate, whose main function is to discard data from the previous LSTM cell state, $\sigma(\cdot)$ is the sigmoid activation function, $W_f$ is the weight matrix of the forget gate, $h_{l-1}$ denotes the short-term state of the previous layer in the LSTM cell, the feature map $C^{(l)}_j$ is the input of the LSTM cell, and $b_f$ is the bias vector of the forget gate.

$$i_l = \sigma(W_i [h_{l-1}, C^{(l)}_j] + b_i) \tag{5}$$

$$\tilde{Z}_l = \tanh(W_c [h_{l-1}, C^{(l)}_j] + b_c) \tag{6}$$

$$Z_l = f_l \otimes Z_{l-1} + i_l \otimes \tilde{Z}_l \tag{7}$$

The input gate is composed of two parts: $i_l$ is a vector that determines which data in the short-term state $h_{l-1}$ is used to update the new cell state. After being selected by $i_l$, $\tilde{Z}_l$ is added to the long-term cell state, and $\tanh$ is an activation function. $W_i$ and $W_c$ denote the weight matrices, and $b_i$ and $b_c$ denote the bias vectors of the input gate. The forget gate and the input gate are then used to update the long-term state of the previous LSTM cell.

$$o_l = \sigma(W_o [h_{l-1}, C^{(l)}_j] + b_o) \tag{8}$$

$$h_l = o_l \otimes \tanh(Z_l) \tag{9}$$

The output gate $o_l$ is composed in the same way, where $W_o$ denotes its weight matrix and $b_o$ denotes its bias vector.
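The temporal convolution and the gate updates described above can be traced with a small pure-Python sketch. It is an illustration under assumptions, not the authors' implementation: each kernel is assumed to span all k sensors as in a standard 1D convolution layer, max-pooling is omitted for brevity, the gates follow the textbook LSTM formulation, and the names `conv1d_relu`, `affine`, and `lstm_step` as well as the parameter dictionary layout are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def conv1d_relu(window: Matrix, kernels: List[Matrix], biases: Vector) -> Matrix:
    """Valid 1D convolution over time followed by ReLU.

    window is m x k and each kernel is w x k; the result has
    (m - w) + 1 rows and d = len(kernels) columns.
    """
    w = len(kernels[0])
    feature_maps = []
    for start in range(len(window) - w + 1):
        region = window[start:start + w]
        row = []
        for kernel, bias in zip(kernels, biases):
            # Element-wise product of the region and the kernel, summed up.
            total = sum(x * c
                        for region_row, kernel_row in zip(region, kernel)
                        for x, c in zip(region_row, kernel_row))
            row.append(max(0.0, total + bias))  # ReLU
        feature_maps.append(row)
    return feature_maps


def affine(weights: Matrix, inputs: Vector, bias: Vector) -> Vector:
    """Compute weights @ inputs + bias for plain Python lists."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x: Vector, h_prev: Vector, z_prev: Vector,
              p: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One LSTM step; returns the new short-term state h and cell state z."""
    v = h_prev + x  # concatenation of the previous short-term state and the input
    f = [sigmoid(a) for a in affine(p["W_f"], v, p["b_f"])]        # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], v, p["b_i"])]        # input gate
    z_new = [math.tanh(a) for a in affine(p["W_c"], v, p["b_c"])]  # candidate state
    o = [sigmoid(a) for a in affine(p["W_o"], v, p["b_o"])]        # output gate
    z = [fj * zp + ij * zn for fj, zp, ij, zn in zip(f, z_prev, i, z_new)]
    h = [oj * math.tanh(zj) for oj, zj in zip(o, z)]
    return h, z


# Hypothetical window with m = 5 cycles and k = 2 sensors, one kernel with w = 3.
window = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
kernels = [[[1.0, 1.0]] * 3]
features = conv1d_relu(window, kernels, biases=[-7.0])

# All-zero parameters, used only to show the data flow (hidden size 2, input size 1).
hidden = 2
params: Dict[str, list] = {name: [[0.0] * (hidden + 1) for _ in range(hidden)]
                           for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: [0.0] * hidden for name in ("b_f", "b_i", "b_c", "b_o")})
h, z = [0.0] * hidden, [0.0] * hidden
for step in features:  # feed the feature-map rows to the LSTM in time order
    h, z = lstm_step(step, h, z, params)
print(len(features), h, z)
```

The example window produces (5 - 3) + 1 = 3 feature-map rows, matching the output size stated for the 1D-convolution layer.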
Finally, the LSTM layer connects to the fully connected layer for estimating the output target RUL value. Dropout technique is a regularization technique which randomly drops the hidden nodes with a given probability during training. It forms neural networks with different architectures in parallel and then takes an ensemble of them to prevent coadaptation. In order to alleviate the overfitting problems, the dropout is used between the final LSTM layer and the first fully connected layer [37].

Loss Function.
Survival analysis, also called time-to-event analysis, is a subfield of statistics for analyzing the expected time duration until one or more events happen [38]. This approach calculates the probability that the subject 'survives' a given number of days or cycles [39]. One of the most commonly used distributions in survival analysis is the discrete Weibull distribution, which can be presented as Equation (10). The time-to-failure is modeled with a random variable T, giving the probability that the failure time falls between t and t + 1. The probability mass function (PMF) of this discrete random variable is characterized by two parameters: alpha (α) is a scale parameter that indicates where in time the expected value and mode of the distribution are located, while beta (β) is a shape parameter that also reflects the variance of our prediction.
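For reference, the discrete Weibull distribution as it is commonly written in the time-to-event literature, for example in the Weibull time-to-event network of [36], has the survival function and PMF shown below. This is the standard form and is assumed, not confirmed, to agree with Equation (10); the symbol S and the support t = 0, 1, 2, ... are notation introduced for this sketch.

```latex
S(t) = \Pr(T \ge t) = \exp\!\left[-\left(\frac{t}{\alpha}\right)^{\beta}\right],
\qquad
\Pr(T = t) = S(t) - S(t+1)
           = \exp\!\left[-\left(\frac{t}{\alpha}\right)^{\beta}\right]
           - \exp\!\left[-\left(\frac{t+1}{\alpha}\right)^{\beta}\right],
\qquad t = 0, 1, 2, \ldots
```

In this form, a larger α moves the probability mass to later cycles, and a larger β concentrates it more tightly around the mode.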
We utilize a special log-likelihood as the loss function, called the discrete Weibull distribution log-likelihood [34]. For all samples, the discrete Weibull distribution log-likelihood penalizes the model for predicting high probabilities of failure during the observed failure-free lifetime. In addition, for samples where the failure time is known, it rewards distributions that assign a high probability to the event happening at that point in time. The discrete Weibull distribution log-likelihood can be defined as follows:

Wireless Communications and Mobile Computing
where y denotes the time-to-event value (cycle) and u denotes a machinery health event indicator that is either 0 or 1. Since it is an average value, we express it by u. In each training step, we apply two types of loss functions. In the regression analysis approach, we apply a linear activation function to the output value and use the mean squared error (MSE), which measures the average squared difference between the estimated and true RUL values, as the loss function. In the survival analysis approach, the activation layer of the discrete Weibull distribution is a customized function that uses an exponential function for alpha and a softplus function for beta [34]. We use the discrete Weibull distribution log-likelihood as the loss function for the failure probability, and the estimated Weibull parameters are the outputs of the layer, giving us a distribution fitted to the training data.
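The two loss functions described above can be sketched as follows. The log-likelihood uses the form commonly given for the discrete Weibull distribution in the time-to-event literature such as [36], which is assumed rather than confirmed to match the paper's equation. The function names are hypothetical, and the exponential and softplus output activations follow the description in the text.

```python
import math
from typing import Sequence, Tuple


def weibull_output_activation(raw_alpha: float, raw_beta: float) -> Tuple[float, float]:
    """Map two unconstrained network outputs to positive Weibull parameters."""
    alpha = math.exp(raw_alpha)             # exponential activation for alpha
    beta = math.log1p(math.exp(raw_beta))   # softplus activation for beta
    return alpha, beta


def weibull_log_likelihood(y: float, u: int, alpha: float, beta: float) -> float:
    """Log-likelihood of one sample with time-to-event y and event indicator u.

    u = 1: failure observed at cycle y, giving log P(T = y).
    u = 0: no failure observed up to cycle y, giving log P(T >= y + 1).
    """
    h0 = (y / alpha) ** beta
    h1 = ((y + 1.0) / alpha) ** beta
    # log P(T = y) = log(exp(h1 - h0) - 1) - h1 and log P(T >= y + 1) = -h1.
    return u * math.log(math.expm1(h1 - h0)) - h1


def weibull_loss(ys: Sequence[float], us: Sequence[int],
                 alphas: Sequence[float], betas: Sequence[float]) -> float:
    """Negative mean log-likelihood over a batch (to be minimized)."""
    total = sum(weibull_log_likelihood(y, u, a, b)
                for y, u, a, b in zip(ys, us, alphas, betas))
    return -total / len(ys)


def mse_loss(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean squared error used by the regression output."""
    return sum((p - t) ** 2 for p, t in zip(predicted, actual)) / len(actual)


# Hypothetical batch: one observed failure and one failure-free sample.
alpha, beta = weibull_output_activation(math.log(10.0), 1.5)
print(weibull_loss([3.0, 5.0], [1, 0], [alpha, alpha], [beta, beta]))
print(mse_loss([90.0, 60.0], [100.0, 50.0]))
```

The term `-h1` applies to every sample and penalizes probability mass placed before the end of the observed lifetime, while the `u`-weighted term rewards mass placed at the observed failure cycle.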

Performance Evaluation.
For the sake of comparability with other existing state-of-the-art methods, the same metrics are used to evaluate the performance. When the model predicts the RUL with the regression and survival analysis approaches, the error between the predicted RUL and the actual RUL is measured by the root mean square error (RMSE) as in Equation (13). A late prediction might delay the scheduling of proper maintenance operations, whereas a prediction that is too early might not be harmful but still wastes maintenance resources. Since the key aspect is to avoid the failure, early prediction is generally more desirable than late prediction. The scoring function in Equation (14) therefore penalizes late predictions more than early predictions. In addition, we calculate the mean absolute error (MAE) as in Equation (15) and the coefficient of determination $R^2$ (R squared), a statistical measure of how well the predictions approximate the real data points, for further comparison between the two analysis approaches. A higher $R^2$ value means that the model explains more of the variation in the data.
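The four evaluation metrics can be sketched in plain Python as below. RMSE, MAE, and R² follow their usual definitions. The scoring function uses the asymmetric exponential form from the PHM08 challenge, with constants 13 for early and 10 for late predictions; these constants are an assumption here, since the text does not restate them, and all function names are hypothetical.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean square error between predicted and actual RUL."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, actual)) / len(actual))


def mae(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute error between predicted and actual RUL."""
    return sum(abs(p - t) for p, t in zip(predicted, actual)) / len(actual)


def r_squared(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Coefficient of determination; assumes the actual values are not all equal."""
    mean = sum(actual) / len(actual)
    ss_res = sum((t - p) ** 2 for p, t in zip(predicted, actual))
    ss_tot = sum((t - mean) ** 2 for t in actual)
    return 1.0 - ss_res / ss_tot


def phm_score(predicted: Sequence[float], actual: Sequence[float],
              early: float = 13.0, late: float = 10.0) -> float:
    """Asymmetric score: late predictions (d > 0) cost more than early ones."""
    total = 0.0
    for p, t in zip(predicted, actual):
        d = p - t  # positive when the predicted RUL is too long (late prediction)
        total += math.exp(-d / early) - 1.0 if d < 0 else math.exp(d / late) - 1.0
    return total


# Hypothetical predictions for three engines.
actual = [100.0, 50.0, 20.0]
predicted = [90.0, 60.0, 20.0]
print(rmse(predicted, actual), mae(predicted, actual),
      r_squared(predicted, actual), phm_score(predicted, actual))
```

With these constants, overestimating the RUL by 10 cycles costs more than underestimating it by 10 cycles, which reflects the preference for early predictions described above.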

Performance of Regression and Survival Analyses.
We randomly select 80 percent of the samples from the training set to train the models, and the remaining 20 percent are used as the validation set to select the parameters in the training phase. We predict the RUL from the test data using the trained model and report the performance in comparison with state-of-the-art methods in Table 2. In Table 2, the italicized numbers denote the top 3 ranked results among those methods. The first three are regression-related algorithms, including the multilayer perceptron (MLP) [41], support vector regression (SVR) [42], and relevance vector regression (RVR) [43], and the others are based on deep neural networks. The deep learning methods show better performance than the traditional machine learning methods. Our proposed approach achieves the lowest RMSE and scoring function values for FD002, based either on regression analysis or on the discrete Weibull distribution, and has superior performance on the scoring function for FD004. We reduce the RMSE by 5.19 and 0.15 and improve the scoring function by $1.75 \times 10^3$ and $1.11 \times 10^3$ for FD002 and FD004, respectively. The samples in the FD002 and FD004 datasets operate under multimodal switching and are more challenging for obtaining accurate prediction results. Our proposed network structure is able to find hidden patterns, and the prediction capability of the proposed method is better than that of the existing RUL prediction methods under complex conditions. According to the definition of RMSE in Equation (13), the extrapolated RUL values in FD001 and FD003 lead to a slightly larger RMSE in the results. Although those subsets did not reach the best performance in RMSE, their score values are close to the best one. The experimental results also show that integrated deep learning methods such as DAG and our method achieve better performance than previous single CNN or LSTM methods.
Based on the comparison of the scoring function evaluation criteria, our proposed method appears in the top 3 ranked results on all four benchmark datasets, whereas the DAG method does so on three out of four. This shows that our proposed method can avoid failure with early prediction. Each engine unit has its own historical sensor data, and we apply the fixed sliding time window to scan the historical data and generate several sequential sensor data samples. We estimate the RUL and the probability of failure corresponding to each sequential sample and then construct the Kaplan-Meier survival curve. Figure 4 shows the predicted and actual degradation processes of four randomly selected engines in each subdataset from the test data with the Kaplan-Meier curve. In Figure 4(a), the engines have a 20% probability of still running after 100 cycles; that is, there is an 80% probability that an engine will have failed after operating for 100 cycles. The survival probability of these engines decreases by 10%~20% every 20 cycles.
For FD002 in Figure 4(b) and FD004 in Figure 4(d), the degradation trends decrease quickly at the beginning and become steady when about 10% survival probability remains, which is very close to failure. As for FD003 in Figure 4(c), there is a plateau in the right extrapolated tail of the Kaplan-Meier curve, which may cause the error between the predicted and true values to grow with the number of cycles. This is the reason that the RMSE of FD003 cannot surpass that of other approaches.
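A Kaplan-Meier curve of the kind discussed above can be computed with the standard product-limit estimator, sketched below. How the predicted RUL values and failure probabilities are turned into durations and event flags is not specified in the text, so the inputs, the function name `kaplan_meier`, and the example data are hypothetical.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(durations: Sequence[float],
                 observed: Sequence[bool]) -> List[Tuple[float, float]]:
    """Product-limit estimate of the survival probability.

    durations: cycle at which each unit failed or was last seen running.
    observed:  True if the failure was observed, False if the unit was censored.
    Returns (cycle, survival probability) pairs at each distinct failure cycle.
    """
    failure_times = sorted({d for d, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in failure_times:
        at_risk = sum(1 for d in durations if d >= t)
        failures = sum(1 for d, e in zip(durations, observed) if e and d == t)
        # Multiply by the fraction of at-risk units that survive cycle t.
        survival *= 1.0 - failures / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical fleet of five engines; the third one is still running at cycle 100.
durations = [80.0, 100.0, 100.0, 120.0, 150.0]
observed = [True, True, False, True, True]
curve = kaplan_meier(durations, observed)
for cycle, probability in curve:
    print(cycle, round(probability, 3))
```

Censored units count as at risk up to their last observed cycle but never trigger a drop in the curve, which is why the estimate steps down only at observed failure cycles.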
We describe the difference between our approach based on the discrete Weibull distribution and regression analysis with the linear model in predicting the failure cycle. In Figures 5-8, we randomly selected four engines in each subdataset as examples and show the Kaplan-Meier curve with confidence intervals and probabilities based on the discrete Weibull distribution, together with the RUL value based on linear regression analysis over time cycles. The results in Figures 5-8(b) have more variance and more error at the beginning of the estimation in the regression analysis approach, especially when the conditions become more complicated. The prediction errors in the regression analysis approach are clearly greater in the early stage of degradation than in the late stage. Because the late stage contains more sequential information than the early stage, the predicted results there are better. The traditional RUL methods based on the regression analysis might lead to inconsistent [...] shows that the discrete Weibull distribution is much more explainable and also denotes that the degradation trends of the turbofan engines are more similar to the Weibull distribution.

Conclusions
With the growth of smart manufacturing in industry, more and more data will be collected, and deep learning models will be widely applied to estimate the health state of a machine for the maintenance strategy. Predictive maintenance makes it possible to optimize the maintenance schedule with the goal of reducing unplanned downtime as well as needless preventive maintenance, thereby minimizing cost for the company. We propose an integrated deep learning approach with convolutional neural networks and long short-term memory networks to learn the latent features and estimate the remaining useful life value with a deep survival model based on the discrete Weibull distribution. Our work can not only estimate the RUL but also learn the failure probability, providing a reference for deciding when and how often replacement should be implemented. In particular, our approach does well on the harder task under complex conditions, with a subtle drop in the error and scoring function compared with other existing state-of-the-art methods. The results show that our proposed model can capture the degradation trend of a fault under complex conditions and avoid failure with early prediction. The limitation of our approach is that our model relies on specific probability distributions, corresponding to a mixture of two-parameter discrete Weibull distributions, that may not be suitable for every degradation process. Other distributions could also be used in the survival analysis approach. The data-driven deep learning approach also depends on the quality of the data and requires large labeled training datasets for supervised learning, but obtaining sufficient run-to-failure data for training is very difficult, especially for new systems.
As a further improvement, future research could use a generative adversarial network for data augmentation or generation. Since the data conditions and fault modes differ between subdatasets, further optimization via transfer learning is still necessary to improve the stability of the method and to apply it efficiently to other problems.

Abbreviations
$X(n)$: Matrix including cycles and sensor measurements of the $n$th engine
$k$: Types of sensor measurements (features)
$L_s$: Last observed cycle or the cycle that fault occurs
$m$: Time window (TW) length
$x_t$: $k$-dimensional array at cycle $t$ of an engine
$X_i(n)$: Sequential data extracted from $X(n)$ with size $m \times k$
$x_t^k$: The value at cycle $t$ and sensor measurement $k$ in $x_t$
$d$: Kinds of the sliding windows (feature detector) in CNN
$X_{dw}$: Set of feature detector regions with kernel size $w$ in CNN
$C^{(l)}$: Convolutional $l$th layer
$f_l$: Forget gate's activation vector of the layer $l$ in LSTM
$\tilde{Z}_l$: Cell input activation vector of the layer $l$ in LSTM
$Z_l$: Cell state vector of the layer $l$ in LSTM.

Data Availability
The dataset used in this paper is the Turbofan Engine Degradation Simulation Dataset, provided by NASA, retrieved from https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-datarepository/.