Integrating Feature Engineering with Deep Learning to Conduct Diagnostic and Predictive Analytics for Turbofan Engines

The prediction of remaining useful life (RUL) is a critical issue in many areas, such as aircraft, ships, automobiles, and facility equipment. Although numerous methods have been presented to address this issue, most of them do not consider the impacts of feature engineering. Typical techniques include the wrapper approach (using metaheuristics), the embedded approach (using machine learning), and the extraction approach (using component analysis). For simplicity, this research considers feature selection and feature extraction. In particular, principal component analysis (PCA) and sliced inverse regression (SIR) are adopted in feature extraction, while stepwise regression (SR), multivariate adaptive regression splines (MARS), random forest (RF), and extreme gradient boosting (XGB) are used in feature selection. In feature selection, the original 15 sensors can be reduced to only four sensors that accumulate more than 80% of the total degree of importance without seriously decreasing predictive performance. In feature extraction, the top three principal components alone account for more than 80% of the variance of the original 15 sensors. Further, PCA combined with RF is recommended over PCA combined with a convolutional neural network (CNN) because it achieves satisfactory performance without incurring tedious computation.


Introduction
Fast developments in information technologies (IT), including wireless sensors, big data, artificial intelligence, and cloud computing, have boosted as well as reshaped numerous areas in smart manufacturing, such as product-service systems (PSS), predictive maintenance, cyber-physical systems (CPS), and digital twins [1,2]. One of the most famous examples is Rolls-Royce: it has successfully shifted from an engine manufacturer to an engine service provider, and it now claims that its core business is engine health management (EHM). From the perspective of the product lifecycle, Rolls-Royce provides a total solution and full diagnosis from design, manufacturing, and operation to maintenance. In practice, EHM transfers sensor signals from an engine on an aircraft to an operational center on the ground, where they are used to record and monitor the performance of the engine to ensure its reliability [3]. Although EHM brings about a new era in PSS, it still has thousands of parameters to monitor and needs to respond to requests from an operational center by sending back hundreds of hours of information specifically tailored to each request. These procedures inspire the origin of this research: the impacts and benefits of feature engineering on predictive maintenance.
Clearly, IT-enabled PSS helps EHM build an integrated interface between "product" and "service" so that the two are treated together rather than separately. Typical benefits include high value-added business models, fast customer responses, and operational efficiencies. To achieve successful EHM, both preventive maintenance (PvM) and predictive maintenance (PdM) are designed to increase the reliability of the equipment. In reality, PvM is a form of "regular" or "scheduled" maintenance that does not consider equipment conditions. In contrast, PdM relies on IT to monitor equipment conditions and focuses on the critical parameters. PdM is performed only when "needed," and hence it can significantly reduce labor and material costs. As mentioned, the integration of IT tools, such as the collection of sensor signals, data analytics, diagnosis, computing, prediction, and prescription, determines the eventual performance of PdM [4,5]. The most important technique is the prediction of remaining useful life (RUL) because it assists practitioners in analyzing engine conditions and scheduling maintenance or replacement in advance [6,7].
Recently, the fourth industrial revolution, Industry 4.0, has brought about a new trend in smart manufacturing by extending computer automation to a cyber-physical environment composed of the Internet of things (IoT), big data analytics, cloud computing, cognitive reasoning, and wireless sensors. Industry 4.0 creates the paradigm of the "smart factory," in which IT-based cyber systems collect data in production or quality control and monitor physical processes to make data-driven and decentralized decisions [8,9]. Further, this paradigm shift significantly impacts the development of EHM and predictive maintenance. Although numerous studies have been presented, most of them did not address the impacts of feature engineering on the prediction of the RUL.
To the best of our knowledge, typical approaches to feature engineering include filtering, extraction, wrapper, and embedded techniques [10,11]. Specifically, filtering achieves dimension reduction without considering the interrelationships between input features. However, input features like sensors are usually interdependent. Thus, extraction focuses on component analysis of input features without considering their dependences on the response. In contrast, the wrapper and embedded approaches simultaneously conduct dimension reduction and predict the response (forecasting).
This research attempts to present an integrated framework to highlight the impacts of feature engineering. For instance, which sensors should be monitored first and why? Which algorithms should be adopted to balance the trade-offs between forecasting performance and computational time? Based on the recognized key performance indicators (KPIs) or extracted principal components (PCs), the prediction of the RUL can be thoroughly assessed to help practitioners answer the following questions: (i) Which sensors are most effective in predicting the RUL of turbofan engines (using feature selection)? (ii) Which components are most effective in predicting the RUL of turbofan engines (using feature extraction)? (iii) What are the best combinations of feature-engineering techniques with machine learning or deep learning algorithms to conduct RUL forecasting? The rest of the study is organized as follows. Section 2 overviews feature engineering and predictive maintenance. Section 3 details the proposed framework. Experimental results are presented in Section 4. Discussions and insights are in Section 5. Conclusions and future work are shown in Section 6.

Overview
Rapid advances in information technologies (IT), such as big data and artificial intelligence, have brought about numerous applications in smart manufacturing and predictive maintenance. In practice, IT needs to be integrated with domain knowledge to conduct data-driven decision-making. For example, the following questions are frequently asked: Which sensors (input features) should be selected? Which algorithm (clustering, association, anomaly detection, classification, regression, or dimension reduction) is suitable to solve a specific problem? What values or benefits can be obtained after implementation? Generally, clustering, association, and anomaly detection are unsupervised schemes (only related to input features), while classification and regression are supervised schemes (related to the response). The most interesting part is dimension reduction, or so-called feature engineering. It can be achieved in an unsupervised manner (component extraction) or a supervised way (filtering, wrapper, and embedded). All the abovementioned schemes have been well integrated to solve complex problems in smart manufacturing (SM) and predictive maintenance (PdM).
SM is a systematic approach that applies software to bring data together from a firm's manufacturing execution systems and to pass data between firm-level systems and shop-floor plants. Most manufacturing data are well structured and can be collected from multiple sensors to conduct data-driven decision-making. In simple words, huge amounts of manufacturing data can be transformed into real knowledge to achieve cost reduction, yield improvement, productivity enhancement, and fast delivery. In contrast, PdM is designed to help practitioners assess the health condition of in-service machinery and prevent fatal failures. Today, cheap wireless sensors are ubiquitous and collect process data and signals anytime and anywhere. Hence, SM and PdM have become more feasible, easier, and cheaper than before. In reality, the following issues are critical but rarely addressed together: (1) How to identify the causality between performance indicators and the response (yield, productivity, cycle time, delivery, and RUL)? (2) How to apply dimension reduction to select key performance indicators or extract representative components? (3) How to assess the degree of improvement considering the impacts of feature engineering? To highlight research contributions, Table 1 compares this research to past studies. As indicated, many publications did not consider feature selection or feature extraction. Li et al. [16], Zhang et al. [9], Li et al. [10], Xiang et al. [20], and Li et al. [21] slightly outperform our presented approaches in terms of forecasting errors (RMSE or MAE in equations (1) and (2)). However, most of them did not show the details of the hyperparameters for the various models. Some authors [9,10,21] adopted two-stage deep-learning methods to achieve better performance. We do not consider these hybrid architectures because they require tedious computation and are not appropriate for fast diagnosis.
In brief, the presented approach performs effectively while its computation remains relatively efficient. In particular, this research aims at the best combination of feature engineering (PCA) with machine learning.
This research particularly addresses the impacts of feature engineering on the prediction of the RUL for turbofan engines. As mentioned, feature engineering consists of filtering, wrapper, embedded, and extraction schemes. Among them, principal component analysis (PCA) extracts the most representative components that can sufficiently account for the variances of the original features. In contrast, sliced inverse regression (SIR) extends the concept of PCA by also considering the relationship between the predictors and the response. Filtering methods consider the relationship between each individual feature and the outcome, using measures such as the correlation coefficient, mutual information, and the F score. Mutual information originates from information entropy and uses conditional probabilities to assess the discriminant power of input features. The F score originates from analysis of variance (ANOVA) and is calculated as the between-group variance divided by the within-group variance. The filtering approach is excluded in this research because it does not consider the interrelationships between the predictors (sensor signals).
In feature engineering, both wrapper and embedded methods rely on classifiers or regressors to perform feature selection. The wrapper approach, such as metaheuristic algorithms, needs to search for nearly optimal subsets that can achieve satisfactory performance. Typical schemes include genetic algorithm (GA), particle swarm optimization (PSO), ant colony optimization (ACO), whale optimization algorithm (WOA), etc. In contrast, the embedded approach directly prioritizes the degrees of importance of the features according to an information criterion. Due to limited space, this research considers the embedded approach in feature selection. Specifically, stepwise regression (SR), multivariate adaptive regression splines (MARS), random forest (RF), and extreme gradient boosting (XGB) are adopted to prioritize the degrees of importance of input features.

Predictive Maintenance.
The prediction of remaining useful life (RUL) plays a key role in PdM because it can detect anomalies (internal failures) in advance to avoid huge losses (external failures). In reality, it can be viewed as a general regression problem. Guided by experienced domain experts, representative sensor signals are collected as input features to conduct anomaly detection, real-time monitoring, and predictive analytics. However, it is difficult to know the "real" RUL because it is infeasible to observe a piece of machinery from the start to the end of its life. Thus, distribution-based probability models (Wiener process and Weibull distribution) are simulated to estimate the true RUL [22]. A classical burn-in test requires equipment to operate in an extremely tough environment (high temperature, high pressure, etc.) to speed up the process of degradation and effectively shorten its lifespan. Based on the shortened life in a burn-in test, the actual life in a normal environment can be estimated accordingly. Heimes [23] used a threshold to separate the RUL into a level segment (before the threshold) and a decline segment (after the threshold). Recently, machine learning and deep learning have been applied to construct causal models between sensors and the RUL.
Typical machine learning includes multivariate adaptive regression splines (MARS), random forest (RF), extreme gradient boosting (XGB), and support vector machine (SVM). Among them, MARS, RF, and XGB can derive the degrees of importance for input features. Deep-learning regressors include deep neural network (DNN), convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory (LSTM), and gated recurrent unit (GRU). DNN is the most classical, but it suffers from problems such as vanishing or exploding gradients. CNN is very popular in pattern recognition (image processing), while RNN is powerful in speech recognition (temporal signals). LSTM can be viewed as an improved version of RNN, while GRU is a simplified version of LSTM. Although deep-learning regressors are powerful, they cannot identify the degrees of importance of input features. Besides, they require many samples to optimize network topologies and associated hyperparameters.

Methodologies
To combine feature engineering with the prediction of the RUL, Figure 1 details the presented techniques: (1) degradation data of turbofan engines are collected from the NASA dataset (FD001), in which sensors are treated as input features to predict the RUL; (2) feature extraction, including principal component analysis (PCA) and sliced inverse regression (SIR), and feature selection, including stepwise regression (SR), multivariate adaptive regression splines (MARS), random forest (RF), and extreme gradient boosting (XGB), are applied to conduct dimension reduction; (3) various regressors are tested with feature-engineering techniques to find the best combination; and (4) the impacts of feature engineering are thoroughly assessed to generate managerial insights.
Without loss of generality, machine learning includes multivariate adaptive regression splines (MARS), random forest (RF), extreme gradient boosting (XGB), support vector machine (SVM), and deep neural network (DNN), while deep learning contains recurrent neural network (RNN), gated recurrent unit (GRU), long short-term memory (LSTM), and convolutional neural network (CNN). In addition, three quantitative metrics, root mean square error (RMSE), mean absolute error (MAE), and mean absolute arctangent percentage error (MAAPE), are used to measure forecasting errors [24]:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^{2}}, \tag{1}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| e_i \right|, \tag{2}$$

$$\mathrm{MAAPE} = \frac{1}{n}\sum_{i=1}^{n} \arctan\left( \left| \frac{e_i}{y_i} \right| \right), \tag{3}$$

where $n$ denotes the number of observations and $e_i = F_i - y_i$ is the error between a predicted value ($F_i$) and the real data ($y_i$).
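As an illustration of how these three metrics can be computed, the following is a minimal sketch rather than the authors' code. The function name `forecasting_errors` is hypothetical, and the use of `arctan2` to handle observations with $y_i = 0$ is an assumption of this sketch (it returns $\pi/2$ when the true value is zero and the error is not).

```python
import numpy as np

def forecasting_errors(y_true, y_pred):
    """Return (RMSE, MAE, MAAPE) for predictions F_i against real data y_i."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    e = y_pred - y_true                      # e_i = F_i - y_i
    rmse = np.sqrt(np.mean(e ** 2))
    mae = np.mean(np.abs(e))
    # arctan2(|e|, |y|) equals arctan(|e / y|) when y != 0 and stays
    # finite (pi/2) when y == 0, which plain division would not.
    maape = np.mean(np.arctan2(np.abs(e), np.abs(y_true)))
    return rmse, mae, maape
```

Because the arctangent is bounded by $\pi/2$, MAAPE stays finite near the end of life, where the true RUL approaches zero and an ordinary percentage error would explode.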

Feature Extraction. Principal component analysis (PCA) is a dimension-reduction technique that reduces a large set of variables to a small set that still contains most of the information in the large set. PCA is a mathematical procedure that transforms a large number of possibly correlated variables into a small number of uncorrelated variables called principal components (PCs). The first principal component accounts for the most information (the largest proportion of variance in the original data), and the succeeding components sequentially account for as much of the remaining variability as possible.
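Suppose the original variables, $X = (X_1, X_2, \ldots, X_p)^T$, have covariance matrix $\Sigma$ with eigenvalue-eigenvector pairs $(\lambda_i, a_i)$, so that the $i$-th principal component is $Z_i = a_i^T X$ with $\mathrm{Var}(Z_i) = \lambda_i$. Therefore, the following property is found:

$$\sum_{i=1}^{p} \mathrm{Var}(X_i) = \lambda_1 + \lambda_2 + \cdots + \lambda_p = \sum_{i=1}^{p} \mathrm{Var}(Z_i),$$

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ represent the eigenvalues in descending order. Intuitively, we can extract the top $m$ principal components whose cumulative proportion, $\sum_{i=1}^{m} \lambda_i / (\lambda_1 + \lambda_2 + \cdots + \lambda_p)$, accounts for most of the original information to accomplish dimension reduction.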
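The selection of the top $m$ components by cumulative proportion can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `pca_top_components` is hypothetical, the inputs are standardized (an assumption, motivated by sensors having different units), and the 80% threshold is passed as a parameter. Constant columns must be removed beforehand because standardization divides by the standard deviation.

```python
import numpy as np

def pca_top_components(X, threshold=0.80):
    """Extract the fewest PCs whose cumulative variance proportion >= threshold."""
    X = np.asarray(X, dtype=float)
    # Standardize each column (assumption: no constant columns).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.cov(Z, rowvar=False)           # correlation matrix of X
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum = np.cumsum(eigvals) / eigvals.sum() # cumulative proportions
    # First index whose cumulative proportion reaches the threshold.
    m = min(int(np.searchsorted(cum, threshold)) + 1, len(eigvals))
    scores = Z @ eigvecs[:, :m]              # PC scores used as predictors
    return scores, eigvals, cum, m
```

The returned `scores` are mutually uncorrelated, so they can be passed to any regressor in place of the original sensors.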
Clearly, PCA performs feature extraction in an unsupervised manner without considering the relationships between the input variables and the response outcome. Thus, this research also considers a supervised counterpart of PCA, sliced inverse regression (SIR). Following similar definitions in PCA, we consider a univariate response variable $Y$ with the original variables $X \in R^p$:

$$Y = g(\beta_1^T X, \beta_2^T X, \ldots, \beta_K^T X),$$

where $\beta_k$ are unknown $p$-dimensional vectors assumed to be linearly independent [25]. Writing $\beta = (\beta_1, \ldots, \beta_K)$, the semiparametric regression model, $Y = g(\beta^T X)$, is an attractive dimension-reduction approach to model the effect of $X$ on $Y$ with an unknown arbitrary link function $g$, under which $Y$ is independent of $X$ given $\beta^T X$. Thus, we can replace $X \in R^p$ by $\beta^T X \in R^K$ without loss of information on the regression of $Y$ on $X$.
To estimate the subspace of effective dimension reduction (EDR), the basic concept of SIR is to reverse the roles of $Y$ and $X$: the covariate $X$ is regressed on the response variable $Y$. Under the linearity assumption on the distribution of $X$, $\hat{\beta}$ denotes an estimated basis of the EDR subspace, and it is used to estimate the mapping function $g$. Specifically, the centered inverse regression curve, $E(X \mid Y) - E(X)$, lies in the subspace spanned by the columns of the $p \times K$ matrix $\Sigma\beta$, where $\Sigma$ is the covariance matrix of $X$. The eigenvectors associated with the largest $K$ eigenvalues of the matrix $\Sigma^{-1} M$, with $M = \mathrm{Var}[E\{X \mid S(Y)\}]$, are the EDR directions ($S$ is a slicing transformation that discretizes the response $Y$).
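The slicing and eigen-decomposition steps of SIR can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the authors' code: the function name `sir_directions` is hypothetical, slices are formed by sorting the response into roughly equal-sized groups, and the eigenproblem for $\Sigma^{-1} M$ is solved in the equivalent whitened form (eigenvectors of the covariance of slice means of $\Sigma^{-1/2}(X - \mu)$, mapped back by $\Sigma^{-1/2}$). It requires $\Sigma$ to be nonsingular.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=1):
    """Estimate EDR directions by sliced inverse regression."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    # Sigma^{-1/2} via the eigen-decomposition of the covariance matrix.
    w, V = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Z = (X - mu) @ Sigma_inv_sqrt            # whitened predictors
    # Slice the response: sort y and cut it into roughly equal groups.
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m_h = Z[idx].mean(axis=0)            # slice mean of Z
        M += (len(idx) / n) * np.outer(m_h, m_h)
    eigvals, eigvecs = np.linalg.eigh(M)
    top = np.argsort(eigvals)[::-1][:n_dirs]
    B = Sigma_inv_sqrt @ eigvecs[:, top]     # back to the original scale
    B /= np.linalg.norm(B, axis=0)           # unit-length directions
    return B, eigvals[top]
```

The projected predictors `X @ B` play the same role for SIR that the PC scores play for PCA. Because slices are formed from the ordered response, a response concentrated on a narrow range of values leaves little between-slice variation for the method to exploit.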

Machine Learning. Without loss of generality, typical regressors include multivariate adaptive regression splines (MARS), random forest (RF), extreme gradient boosting (XGB), and support vector machine (SVM). Among them, MARS, RF, and XGB are used in feature selection because they can identify key performance indicators (KPIs) and conduct regression simultaneously. MARS, a nonparametric and nonlinear regression method, uses so-called "knots" to characterize basis functions (BFs) and approximates the nonlinear dependences between the predictors and the outcome. The general MARS model is defined as follows:

$$\hat{f}(x) = a_0 + \sum_{m=1}^{M} a_m \prod_{k=1}^{K_m} \left[ s_{k,m} \left( x_{k,m} - t_{k,m} \right) \right]_{+},$$

where $a_0$ and $a_m$ are coefficients, $M$ is the number of basis functions, $K_m$ is the number of factors in the $m$-th basis function, $s_{k,m} = \pm 1$ indicates the right or the left step function, $x_{k,m}$ are input variables, and $t_{k,m}$ are "knot" locations in each interval [26]. In MARS, generalized cross validation (GCV) can be used to prioritize the degrees of importance of the predictors. Based on quadratic programming, SVM uses "kernel" functions to map the regression problem from the original space into a high-dimensional feature space. Therefore, "nonlinear" fitting in the input space is transformed into "linear" fitting in the high-dimensional space, and the tolerance, epsilon, means that a limited number of "outliers" are allowed outside the cylinder to avoid overfitting (good training performance but poor test performance). The most common kernel is the radial basis function (Gaussian kernel).
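To make the role of knots concrete, the sketch below fits a single pair of mirrored hinge basis functions, the right and the left step function around one knot, by ordinary least squares. It is only an illustration under simplifying assumptions: the knot is fixed in advance, whereas a full MARS procedure searches over knots and prunes terms using GCV. The names `hinge` and `fit_hinge_pair` are hypothetical.

```python
import numpy as np

def hinge(x, knot, sign=+1):
    """Hinge basis [sign * (x - knot)]_+ ; sign=+1 is the right function."""
    return np.maximum(0.0, sign * (x - knot))

def fit_hinge_pair(x, y, knot):
    """Least-squares fit of y ~ a0 + a1 * [x - knot]_+ + a2 * [knot - x]_+."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    B = np.column_stack([
        np.ones_like(x),        # intercept a0
        hinge(x, knot, +1),     # active to the right of the knot
        hinge(x, knot, -1),     # active to the left of the knot
    ])
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return coef, B @ coef
```

Each hinge is zero on one side of the knot, so the fitted curve is piecewise linear with a change of slope at the knot.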
RF and XGB, which originate from decision trees such as the classification and regression tree (CART), consider the "Gini index" as a measure of impurity to recursively split tree nodes, grow a tree from the top root to the bottom leaves, and characterize the entire system. Combining many such trees is called ensemble learning, which has been theoretically justified to significantly outperform a single tree. In RF, "random" refers to the random selection of samples and of a subset of features (predictors) to fit each single tree. The objective is to keep all the trees sufficiently dissimilar to conduct aggregate decision-making. In contrast, "boosting" implies different probability distributions for different samples. Generally, data samples that are hard to predict have an increased probability of appearing, and vice versa. XGB is an enhanced version of gradient boosting. Based on so-called loss functions, XGB searches for the best moves (gradient descent) to minimize forecasting errors and achieve the goal of curve fitting (nonlinear approximation).
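The idea of boosting as gradient descent on a loss function can be illustrated with a deliberately small example. The sketch below is plain gradient boosting with one-split trees (stumps) on a single feature under squared-error loss; it is not XGBoost, which adds regularization and second-order information, and the names `fit_stump` and `boost` are hypothetical. It assumes the feature takes at least two distinct values.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single split on a 1-D feature, minimizing squared error."""
    best = None
    for t in np.unique(x)[:-1]:              # keep the right side non-empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]                          # (threshold, left value, right value)

def boost(x, y, n_rounds=50, lr=0.1):
    """Fit stumps sequentially to the residuals of the current prediction."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pred = np.full_like(y, y.mean())         # start from the mean response
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred                  # negative gradient of 0.5 * (y - pred)^2
        t, left_value, right_value = fit_stump(x, residual)
        pred += lr * np.where(x <= t, left_value, right_value)
        stumps.append((t, left_value, right_value))
    return pred, stumps
```

Each round moves the prediction a small step (`lr`) in the direction that reduces the squared error most, so the training error is non-increasing across rounds.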

Deep Learning.
Deep neural network (DNN) is one of the most classical artificial neural networks. It has a serious problem, vanishing gradients, when feeding back an error signal to adjust the links between the neurons in consecutive layers. DNN consists of the input layer, multiple hidden layers, and the output layer. An error signal, defined as the actual value minus the predicted value, needs to be fed back to adjust the links (weights) between the neurons. When the mean squared error converges, the updating process stops and the network has been constructed to conduct forecasting. In this formulation, $n$ is the number of training samples, $M$ is the dimension of the input variables, $L$ is the dimension of the output variables, $w_{ij}$ denotes the linking weights from layer $i$ to layer $j$, and $\theta_i$ is the intercept. The universal approximation function, $f$, needs to be fitted to conduct nonlinear regression (curve fitting).
RNN has been presented to accommodate the time lags between the predictors and the response, but it also suffers from exploding or vanishing gradients. Thus, the long short-term memory (LSTM) network and the gated recurrent unit (GRU) have been developed [27,28]. By replacing the basic cells in the RNN, LSTM has the input gate, the forget gate, and the output gate, while GRU merges them into a reset gate and an update gate. Generally, GRU is a simplified version of LSTM with comparable performance. In deep learning, specific parameters need to be optimized, such as the number of hidden layers and associated neurons, the drop-out rate (some paths are randomly selected for removal), the learning rate, the activation function (hyperbolic tangent, sigmoid, and ReLU), and the optimizer (stochastic gradient descent, gradient descent, AdaDelta, AdaGrad, etc.). For clarity, Table 2 gives a brief description of the associated hyperparameters in machine learning and deep learning [29]. In practice, there are no standard ways to optimize the abovementioned parameters [30,31]. Typical ways include random (grid) search and Bayesian search. The whole process resembles trial and error, and no "optimal" combinations can be guaranteed. Thus, past publications [18,23,32] considered metaheuristic algorithms to quickly search for "relatively good" solutions for these parameters.
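The trial-and-error nature of hyperparameter tuning can be illustrated with a minimal random search. This is a sketch only: the function name `random_search`, the search space, and the toy objective are all hypothetical, and in practice the objective would train a model and return a validation error such as RMSE.

```python
import random

def random_search(objective, space, n_trials=20, seed=0):
    """Sample configurations at random and keep the one with the lowest score."""
    rng = random.Random(seed)                # fixed seed for reproducibility
    best_cfg, best_score = None, float("inf")
    for _ in range(n_trials):
        # Draw one candidate value per hyperparameter.
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = objective(cfg)               # e.g. validation RMSE (lower is better)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical search space; the values are illustrative only.
space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "hidden_layers": [1, 2, 3],
    "dropout": [0.0, 0.2, 0.5],
}

# Toy objective standing in for an expensive training run.
def toy_objective(cfg):
    return abs(cfg["learning_rate"] - 0.01) + abs(cfg["hidden_layers"] - 2) + cfg["dropout"]

best_cfg, best_score = random_search(toy_objective, space, n_trials=50)
```

The search returns the best configuration it happened to sample, which is why no "optimal" combination can be guaranteed: only a finite number of candidates is ever evaluated.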

Illustrated Examples
The NASA "C-MAPSS (Commercial Modular Aero-Propulsion System Simulation)" dataset, FD001, composed of 33,727 simulated samples with 21 sensors, is used. It contains lifetime data of 100 turbofan engines assessed under the same operating condition. All these engines operated normally at the beginning, with different degrees of initial deterioration, and then gradually degraded over the operation cycles. Following the cited references, the RUL is calculated as the maximal cycle minus the elapsed time since the engine started running. Intuitively, the longer the usage, the shorter the remaining useful life (RUL). Table 3 illustrates the details of the dataset. Specifically, temperature is measured on the Rankine scale. Note that sensors 1, 5, 10, 16, 18, and 19 are removed because they do not vary across engines. A possible reason is that these sensors represent the same test condition.
In this research, the impacts of feature engineering are thoroughly explored: feature selection using stepwise regression and machine learning, and feature extraction using PCA and SIR. Meanwhile, various regressors consisting of machine learning (RF, XGB, SVM, DNN) and deep learning (RNN, LSTM, GRU, CNN) are combined with PCA and SIR to find the best-performing combination. The prediction of the RUL is actually a regression problem: sensors or reduced features are treated as input features while the RUL is the response. Since the original dataset does not define the true RUL, Heimes [23] and most of the cited references [9, 14-18, 20, 21, 33] used the threshold, 130, to separate the RUL into two segments: level (before cycle 130) and decline (after cycle 130). Mathematically, the RUL for engine $n$ at time $t$ is defined as $\max(\mathrm{Cycle}_n) - \mathrm{Cycle}_{nt}$, the maximal cycle minus the elapsed time since the engine started running. The maximal cycle is set to 130 in this research.
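The labeling rule can be sketched as follows for run-to-failure trajectories. This is an illustration under one explicit assumption: the threshold of 130 is interpreted as a cap on the linear RUL, so the label stays level at 130 and then declines linearly to zero, which is a common reading of the piecewise definition but is not spelled out in the text. The function name `piecewise_rul` is hypothetical.

```python
import numpy as np

def piecewise_rul(engine_ids, cycles, cap=130):
    """Label each record with min(max cycle of its engine - cycle, cap)."""
    engine_ids = np.asarray(engine_ids)
    cycles = np.asarray(cycles)
    rul = np.empty(len(cycles), dtype=float)
    for eid in np.unique(engine_ids):
        mask = engine_ids == eid
        # Linear RUL: maximal cycle of this engine minus the elapsed cycle.
        rul[mask] = cycles[mask].max() - cycles[mask]
    # Level segment: the label is held at the cap while the linear RUL exceeds it.
    return np.minimum(rul, cap)
```

For an engine that runs for 200 cycles, for example, the label under this assumption is 130 up to cycle 70 and then decreases by one per cycle until it reaches zero at cycle 200.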
For clarity, Table 4 details the hyperparameters that need to be optimized for machine learning and deep learning. In reality, it is a trial-and-error process heavily reliant on experienced users. Grid search and Bayesian search are common ways to shorten computational time. Prior to applying feature engineering, all sensors are used to predict the RUL, and the resulting performance is shown in Table 5. Specifically, 20,631 samples are used for training while the remaining 13,096 samples are used for testing. For clarity, the computer architecture is as follows: the CPU is an AMD Ryzen 9 3900X 12-Core Processor, the GPU is an NVIDIA GeForce GTX 1660, and the RAM is 32 GB. In terms of MAAPE, the best-performing regressors in the training set are RF and XGB, while the deep learning algorithms (DNN, RNN, LSTM, GRU, and CNN) are comparable in the test set. In terms of running time, deep-learning algorithms generally require much longer computation in the training set because they need to optimize network topologies and associated hyperparameters (batch size, learning rate, drop-out, hidden layers, hidden neurons, optimizer, activation function, etc.). However, the most surprising regressor is XGB because it also requires tedious computation to search for appropriate hyperparameters. Thus, it is not recommended in this study. The predictive performance of machine learning and deep learning does not differ much in the test set. Due to the so-called black-box property and tedious computation, bidirectional deep learning and two-stage architectures are not considered in this research.

Feature Selection Using Stepwise Regression and Machine Learning. To conduct supervised embedded feature selection, four methods are adopted: stepwise regression, MARS, RF, and XGB. These methods can simultaneously perform feature selection and conduct forecasting. As indicated in Table 6, sensor 11 (ratio of fuel flow to static pressure at HPC outlet), sensor 9 (engine pressure ratio), sensor 12 (corrected fan speed), and sensor 4 (pressure at fan inlet) are commonly identified as key performance indicators (KPIs). Here, SR uses AIC (Akaike information criterion) or BIC (Bayesian information criterion), MARS uses generalized cross validation [26], and RF and XGB use the Gini index defined in information theory to derive the degrees of importance of input sensors. Note that the top four sensors account for almost 85% of the cumulative degree of importance, and thus these sensors should be monitored in real time. To test the validity of the identified sensors, their performance in the prediction of the RUL is shown in Table 7. Machine learning (RF, XGB, SVM, and DNN) performs better in the training set, while deep learning (RNN, LSTM, GRU, and CNN) performs better in the test set. Compared to Table 5, the performance based on the identified four "KPIs" is on average worse than that obtained with all 15 sensors.
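Selecting sensors by cumulative degree of importance can be sketched as below. The function name `select_by_cumulative_importance` is hypothetical, and the importance scores in the usage note are made-up numbers for illustration; they are not the values reported in Table 6.

```python
def select_by_cumulative_importance(importance, threshold=0.80):
    """Keep the top-ranked features until their cumulative share reaches threshold."""
    total = sum(importance.values())
    # Rank features from most to least important.
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    selected, cum = [], 0.0
    for name, value in ranked:
        selected.append(name)
        cum += value / total                 # normalized cumulative importance
        if cum >= threshold:
            break
    return selected, cum
```

With made-up scores such as `{"s_a": 50, "s_b": 25, "s_c": 15, "s_d": 10}` and a threshold of 0.80, the function keeps `s_a`, `s_b`, and `s_c`, whose cumulative share is 0.90. The scores themselves would come from whichever embedded method is used (SR, MARS, RF, or XGB).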

Feature Extraction Using Component Analysis and Sliced Inverse Regression. To justify research validity, principal component analysis (PCA) and sliced inverse regression (SIR) are used in feature extraction. PCA is implemented in an unsupervised manner without considering the RUL, while SIR is a supervised version because it simultaneously conducts feature extraction and regression. To determine the appropriate number of extracted components, cumulative proportions are listed in Table 8. The eigenvalue of each principal component (PC) explains the variances of the original sensors. Since the top three PCs, PC1, PC2, and PC3, can account for more than 80% of the cumulative variance, they are extracted as the predictors.

Table 3: Details of the dataset.
Sensor      Description                                                   Unit
Sensor_2    Total temperature at high-pressure compressor (HPC) outlet   °R
Sensor_3    Total temperature at low-pressure turbine (LPT) outlet       °R
Sensor_4    Pressure at fan inlet                                        psia
Sensor_6    Total pressure at HPC outlet                                 psia
Sensor_7    Physical fan speed                                           rpm
Sensor_8    Physical core speed                                          rpm
Sensor_9    Engine pressure ratio (P50/P2)                               -
Sensor_11   Ratio of fuel flow to Ps30                                   pps/psi
Sensor_12   Corrected fan speed                                          rpm
Sensor_13   Corrected core speed                                         rpm
Sensor_14   Bypass ratio                                                 -
Sensor_15   Burner fuel-air ratio                                        -
Sensor_17   Total temperature at fan inlet                               °R
Sensor_20   High-pressure turbine (HPT) coolant bleed                    lbm/s
Sensor_21   LPT coolant bleed                                            lbm/s
Response    Remaining useful life (RUL)                                  Cycle

Further, the relationships between the extracted components and the original sensors are indicated in Table 9. The dependences of the first component (PC1), which accounts for more than 60% of the total variance, seem to be quite diverse: it is closely related to sensor 4, sensor 11, and sensor 12, which are three of the four KPIs shown in Table 6. In contrast, the second component (PC2) is highly related to sensor 9 (engine pressure ratio) and sensor 11 (ratio of fuel flow to static pressure at HPC outlet), and the third component (PC3) is strongly associated with sensor 6 (total pressure at HPC outlet).
Similarly, the top three directions (DIR1, DIR2, and DIR3) in the SIR can account for almost 99% of the variance, and thus they are treated as the predictors.
Finally, based on PCA and SIR, Tables 10 and 11 demonstrate the predictive performance for the RUL. Apparently, PCA outperforms SIR in both the training set and the test set. For PCA, deep learning performs better than machine learning, while machine learning and deep learning are comparable for SIR. Moreover, the performance based on PCA is almost equivalent to that obtained with all sensors (Table 5). Thus, in predicting the RUL, PCA is recommended for dimension reduction. To visualize the predictive results using PCA, Figure 2 (PCA and machine learning) and Figure 3 (PCA and deep learning) show all 100 engines together. For simplicity, all engines are evaluated at their last cycle. The horizontal axis represents the IDs of the engines. Similar to Table 5, Table 10 demonstrates that RF requires the least computation in the training set while its predictive performance is comparable to that of the other schemes in the test set.
Without loss of generality, representative engines, such as engine 62 (beginning phase), engine 49 (medium phase), and engine 81 (declining phase), are selected and tested at different phases for a single-sample test. Due to limited space, only engine 49 is visualized. Figures 4 and 5 demonstrate, respectively, the predictive results using machine learning (RF, XGB, SVM, and DNN) and using deep learning (RNN, LSTM, GRU, and CNN) for a single sample. Clearly, the true RUL presents a two-stage linear pattern (level + decline) because it is assumed to be constant (no decline) before cycle 130 and linearly declining after cycle 130. Here, the horizontal axis represents the elapsed time since running, while the vertical axis is the RUL. Surprisingly, machine learning seems to perform better than deep learning in the prediction of engine 49. The true RUL begins to decline at cycle 199, but all models predict engine deterioration at earlier cycles. This implies that warning alerts may be sent before engine deterioration really starts. In terms of quality control, type 1 errors (false alarms) can be tolerated because they do not incur huge losses in production lines. In contrast, type 2 errors (missed alarms), in which the models send alert signals after the engine has started to deteriorate, are not acceptable because they can result in fatal damage.

Discussions and Insights
Equipment maintenance is a classical and critical issue because it significantly affects production schedules, product quality, and on-time deliveries [36]. In the past, PvM (occurs in regular cycles) is designed to keep parts in good repair but it does not take the state of a component or equipment into account. Today, PdM (occurs as needed) conducts real-time data collection and analysis to identify machine problems at      the nascent stage before they can interrupt production processes. PdM simultaneously avoids excessive maintenance in PvM and prevents unexpected equipment breakdown. Clearly, the prediction of the RUL is the core to accomplish successful PdM. Following statistical probability distributions [22,[34][35][36][37], conventional burn-in tests derive the actual RUL at a normal condition given an object is operated in a rigorous condition like high temperature. However, it is difficult to justify its validity because we cannot wait for the equipment or devices to really fail. us, machine learning or deep learning algorithms attempt to predict the actual RUL by identifying critical sensors. ese sensors can alert warnings before machine failures happen.
In the prediction of the RUL, practitioners want to know which sensors should be monitored first and when to send alerts. In this research, feature selection is achieved using machine learning (SR, MARS, RF, and XGB), while feature extraction is achieved by unsupervised PCA and supervised SIR. To prioritize the degrees of importance of input features (sensors), our presented feature-selection methods follow the embedded approach. As expected, the performance with feature selection is slightly worse than with all sensors because only partial information is considered.
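The idea of keeping only the sensors that accumulate a given share of importance can be sketched as below. This is a hedged illustration rather than the procedure used in this research: the helper `select_by_cumulative_importance`, the sensor names, and the importance scores are hypothetical, and in practice the scores would come from a fitted embedded model such as RF or XGB.

```python
import numpy as np

def select_by_cumulative_importance(names, importances, threshold=0.8):
    """Return the smallest top-ranked set whose cumulative importance exceeds `threshold`."""
    imp = np.asarray(importances, dtype=float)
    imp = imp / imp.sum()                    # normalize so the scores sum to 1
    order = np.argsort(imp)[::-1]            # rank features from most to least important
    cumulative = np.cumsum(imp[order])
    # index of the first position whose cumulative share is strictly above the threshold
    k = int(np.searchsorted(cumulative, threshold, side="right")) + 1
    k = min(k, len(order))                   # guard against thresholds of 1.0 or more
    return [names[i] for i in order[:k]]

# Hypothetical sensors and importance scores, for illustration only.
names = ["s1", "s2", "s3", "s4", "s5"]
scores = [0.10, 0.40, 0.05, 0.30, 0.15]
print(select_by_cumulative_importance(names, scores, threshold=0.8))
```

Ranking first and cutting at a cumulative threshold keeps the selection interpretable, since each retained sensor can be reported together with its share of importance.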
Future extensions in feature selection can incorporate metaheuristics [18,32] to consider the wrapper approach. Typical metaheuristic algorithms include genetic algorithm, particle swarm optimization, ant colony optimization, bee colony optimization, grey wolf optimization, etc. Basically, they try to find "relatively" good combinations of input features to achieve satisfactory performances. In feature extraction, it is surprising to find that supervised SIR does not outperform unsupervised PCA. A possible reason is the skewed distribution of the RUL. As indicated by Figure 6, the histogram of the RUL is biased toward 120-130 cycles. SIR groups data samples into distinct slices according to the order of the RUL. However, it cannot predict the RUL very well when the response is not evenly distributed (small between-group differences).
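To make the slicing step concrete, the following is a minimal NumPy sketch of textbook SIR, written under several assumptions: the function name `sir_directions` and its parameters are hypothetical, the covariance matrix of the inputs is assumed to be full rank, and the synthetic data only stand in for sensor readings. It is not the implementation used in this research.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_directions=1):
    """Estimate SIR directions by slicing the sorted response (textbook version)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)           # assumes a full-rank covariance
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ inv_sqrt                            # whitened inputs
    order = np.argsort(y, kind="stable")         # samples ordered by the response
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):  # roughly equal-sized slices
        m = Z[idx].mean(axis=0)                  # slice mean of the whitened inputs
        M += (len(idx) / n) * np.outer(m, m)     # weighted between-slice covariance
    _, v = np.linalg.eigh(M)                     # eigenvalues in ascending order
    top = v[:, ::-1][:, :n_directions]           # leading eigenvectors first
    B = inv_sqrt @ top                           # map back to the original scale
    return B / np.linalg.norm(B, axis=0)

# Synthetic usage: the response depends on a single linear combination of the inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] + 0.1 * rng.normal(size=500)
print(sir_directions(X, y).ravel())
```

Because the slices are formed from the ordered response, a response concentrated around one value places many nearly identical responses in neighboring slices, which is one way to read the small between-group differences noted above.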
Not surprisingly, we observe that the performances using deep learning are on average better than those of other methods. In particular, some publications use CAE (convolutional autoencoder) features to achieve very small errors. However, these features are automatically synthesized in deep learning. They are hard to explain and to apply in real industries. An architecture composed of two-stage deep-learning algorithms has also become popular in recent years [3,21]. This concept is very similar to stacking in ensemble learning. However, a hybrid architecture incurs tedious computation and cannot achieve fast response. In practice, the trade-offs between predictive performance and computational complexity need to be carefully considered. Response speed and data privacy also motivate edge computing, which can mitigate the drawbacks of cloud computing.

Conclusions
Industry 4.0 has been recognized as a radical innovation, which creates numerous technology dynamics and market opportunities in smart manufacturing and predictive maintenance. In the past, the "high-volume and low-variety" paradigm emphasized mass production and preventive maintenance. Today, the "low-volume but high-variety" paradigm focuses on a firm's agility to respond and its capability to convert huge amounts of data into decision analytics. To achieve better predictive maintenance, this research particularly addresses the impacts of feature engineering on the prediction of the RUL for turbofan engines. Experimental results show that feature selection using machine learning can reduce the original 15 sensors to only 4 representative sensors without degrading the predictive performance too much (4% worse in terms of MAAPE). Besides, feature extraction using unsupervised PCA performs better than supervised SIR. In particular, PCA's performances are comparable (less than 20% MAAPE) for both machine learning and deep learning and better than those using all sensors. More importantly, the presented framework can be extended to other areas, such as productivity enhancement in smart manufacturing and yield improvement in predictive maintenance. In summary, research contributions are outlined as follows: (i) In feature extraction, PCA and SIR are used to generate sufficiently representative components or directions that can account for most variances of the original sensors. (ii) In feature selection, SR, MARS, RF, and XGB are used to identify significant sensors by prioritizing the degrees of importance of the original sensors. (iii) PCA + RF is more recommended than PCA + CNN because it can balance the trade-offs between predictive performances and computational time.
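For readers who wish to reproduce the error measure quoted above, a minimal sketch is given here. It assumes that MAAPE denotes the mean arctangent absolute percentage error in its commonly used form, the mean of arctan(|(y - ŷ) / y|); the function name `maape` and the sample values are hypothetical, and how the result is scaled to a percentage for reporting is not specified by this sketch.

```python
import numpy as np

def maape(y_true, y_pred):
    """Mean arctangent absolute percentage error, in radians (between 0 and pi/2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = np.abs(y_true - y_pred)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = err / np.abs(y_true)        # infinite when y_true is 0 and err > 0
    ratio = np.where(err == 0, 0.0, ratio)  # treat an exact hit as zero error (covers 0/0)
    return float(np.mean(np.arctan(ratio)))

# Hypothetical RUL values, for illustration only.
print(maape([100.0, 50.0, 0.0], [100.0, 100.0, 5.0]))
```

Unlike the plain percentage error, the arctangent keeps each term bounded even when the true RUL is zero near the end of life, which is one reason this measure suits RUL prediction.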
In future work, several directions are specifically indicated: (1) This research only considers feature extraction and feature selection. Feature transformation, such as the discrete wavelet transform, or metaheuristic-based wrapper approaches can be included. (2) The prediction of the RUL is basically a regression problem (numerical outcome): the dependent outcome is assumed to be initially a level constant and then to follow a linear decline (piecewise linear pattern). Statistical simulation-based models may be included to obtain precise estimates and to increase the generalization of this research. (3) In addition to predicting the RUL, diagnostic analytics for hidden states (healthy, risky, and sick) of turbine engines deserves to be modeled and explored in engine health management.

Data Availability
All data used in this research were extracted from https://www.kaggle.com/datasets/behrad3d/nasa-cmaps.

Conflicts of Interest
The authors declare that they have no conflicts of interest.