Fault Diagnosis Approach for Rotating Machinery Based on Feature Importance Ranking and Selection



Introduction
At present, the components of industrial rotating machinery are becoming increasingly complex and compact. As a key component of the transmission system, rotating machinery, including motors, engines, bearings, and gearboxes, is an important part of modern industrial equipment [1][2][3]. When rotating machinery operates under harsh or complex conditions, its key components are highly prone to failure, which may shut down the entire machine and even endanger the safety of nearby operators [4,5]. Therefore, it is important to construct a fault diagnosis scheme for rotating machinery under complex conditions that accurately detects and diagnoses its health or fault state.
Fault diagnosis based on vibration signal analysis is a major research hotspot at present, and its most critical step is feature extraction [6,7]. Existing methods mainly extract fault features from the time domain, frequency domain, and time-frequency domain [8,9]. Time-domain features include root mean square, mean, standard deviation, kurtosis, etc., which may be valid only for certain fault types [10,11]. Frequency-domain analysis is mostly based on the Fourier transform (FT) [12]. However, because the original vibration signal is nonlinear and nonstationary, these methods depend on prior knowledge and experience in practical applications, which makes it difficult for them to effectively mine the fault information hidden in the vibration signal [11,13].
As a measure of time-domain uncertainty, entropy-based analysis methods have attracted extensive attention from scholars and have been widely used in image processing, biological analysis, and other fields [14]. These entropies mainly include sample entropy (SE) [15], approximate entropy (AE) [16], permutation entropy (PE) [17], fuzzy entropy (FE) [18], and symbolic dynamic entropy (SDE) [19]. FE uses a Gaussian function to replace the Heaviside function in SE to measure the similarity between two vectors. In recent studies, these entropies are usually combined with other processing strategies, e.g., multiscale and improved multiscale techniques, and some signal decomposition methods in the time-frequency domain [19][20][21]. However, multiscale analysis may discard part of the frequency band components, which may lead to the loss of some fault information. In addition, traditional signal decomposition methods also have some disadvantages that cannot be ignored.
Time-frequency domain analysis has been studied for a long time. For example, empirical mode decomposition (EMD) and local mean decomposition (LMD) are both self-adaptive time-frequency decomposition methods [22,23]. LMD improves on EMD to better maintain the local characteristics of the original signal [24]. Variational mode decomposition (VMD), proposed by Dragomiretskiy et al., is a self-adaptive decomposition method that aims to overcome the undershoot, overshoot, and mode mixing of EMD [25][26][27]. However, VMD still has some defects, e.g., its heavy consumption of computing resources [27,28]. In addition, these methods are all based on "modes," so some frequency information is lost, which makes them poorly suited to frequency analysis.
Compared with the above methods, wavelet packet decomposition (WPD) passes the signal through a series of filters with different central frequencies but the same bandwidth. Therefore, the signal analysis performed by WPD is more refined, especially for the high-frequency components [29]. In [30], an automatic method combining WPD and EMD was proposed to detect weak defects of rolling bearings. In [31], WPD was combined with PE to extract fault features of rolling bearings. However, how to select the optimal wavelet basis function (WBF) was not analysed or discussed in these references. The WBF is a group of functions obtained by dilation and translation, including db wavelets, sym wavelets, and mexh wavelets [32]. Different WBFs are applicable to different analysis objects, and improper selection may affect the accuracy of fault pattern recognition [32,33]. In this paper, a two-step principle is proposed to select the most suitable WBF for the fault vibration signal. Then, the optimized WPD (OWPD) method is proposed and applied to decompose the vibration signal to obtain its frequency components. Because entropy measures can effectively extract the dynamic information of time series, FE is used to extract hidden fault features from the decomposed subsignals. FE also has the advantages of insensitivity to background noise and good robustness [18,20]. Therefore, a novel fault feature extraction method combining OWPD and FE is further proposed in this paper. After feature extraction using OWPD and FE, a classification algorithm with good performance and computational efficiency is needed to give the final diagnosis results. In addition, screening out redundant features before fault classification can effectively reduce the feature dimension and computational burden and further improve classification accuracy [34].
Traditional classification algorithms include support vector machine (SVM) [35], K-nearest neighbor (KNN) [36], artificial neural networks (ANNs) [37], and random forest (RF) [38]. Deep learning (DL) algorithms include convolutional neural network (CNN) [39], autoencoders (AEs) [40], and deep belief network (DBN) [11,41]. However, these classification algorithms still have some inevitable shortcomings. For example, SVM is not effective for large-scale training samples and is sensitive to the selection of parameters and kernel function. RF tends to overfit in noisy classification or regression problems. The structure and parameters of some DL algorithms, e.g., DBN, are largely determined by human experience, which not only affects the accuracy of diagnostic results but also incurs a large computing cost [11,42]. CatBoost is a new implementation of the gradient boosting decision tree (GBDT) framework [43]. It has the advantages of high efficiency, few parameters, and strong generalization ability and performs well in many machine learning tasks [43][44][45]. In addition, as an algorithm based on decision trees, it can obtain the importance of each feature from the tree model after gradient boosting, so that the valuable features can be effectively selected for model training. Therefore, it is introduced for feature selection to form a feature set that contains the main fault information. To the best of the authors' knowledge, the CatBoost algorithm has rarely been studied in the field of fault diagnosis of rotating machinery. In this paper, it is introduced not only for fault pattern recognition but also for selecting the optimal features.
Finally, the optimization of hyperparameters, which usually has a great impact on the performance of the model, is also an urgent problem to be solved when using the CatBoost algorithm. The goal of hyperparameter optimization is to find an acceptable solution for the optimization objective as efficiently as possible [46]. Due to the large amount of data and the large solution space, the application of traditional solution methods, e.g., grid search and greedy algorithms, has been limited, while intelligent algorithms such as differential evolution have been widely used due to their computing efficiency and their ability to obtain a global optimal solution [46,47]. In this paper, the Bayesian optimization (BO) algorithm [48] is used to find the optimal hyperparameters of the CatBoost classifier. It searches for the global optimal solution through a Gaussian process, has the advantages of high search efficiency and few iterations, and can be used to optimize any black-box function.
Based on the above analysis, a novel fault diagnosis approach based on feature importance ranking and selection is proposed. It aims to address the inability of traditional feature extraction methods to fully explore deep-level fault features and to improve the performance of fault pattern recognition. In summary, the advantages of WPD in signal decomposition, FE in feature extraction, and CatBoost in fault pattern recognition are fully exploited in the proposed approach. The main contributions can be summarized as follows:
(1) A two-step principle is proposed to select the optimal WBF adaptively according to the characteristics of the mechanical vibration signal.
(2) A fault feature extraction method combining OWPD and FE is proposed, where OWPD is utilized to decompose the vibration signal, and FE is further adopted to form the fault feature set.
(3) The CatBoost algorithm is introduced not only for fault pattern recognition but also for feature selection to filter redundant features, which helps to reduce model training time and improve the classification accuracy.
(4) The BO algorithm is adopted to solve the optimization problem of hyperparameters in CatBoost. On this basis, the BO-CatBoost algorithm is established and applied to the fault diagnosis of rotating machinery.
The remainder of this paper is organized as follows. Section 2 introduces the theoretical knowledge and methods of the proposed approach. The diagnosis process and the preliminary validation of the proposed fault diagnosis approach with a mechanical fault simulation (MFS) platform dataset are detailed in Section 3. Further experimental verification using another actual dataset of the one-stage reduction gearbox is shown in Section 4. Section 5 contains the conclusions and future research.
Shock and Vibration

Optimized Wavelet Packet Decomposition
2.1.1. Wavelet Packet Decomposition. Generally, FT has been widely used in traditional vibration signal analysis [49]. However, only the frequency-domain information is retained in FT, while the time-domain information is completely lost, which makes it unsuitable for the analysis of nonstationary time-varying signals. Wavelet transform (WT) can provide information in both frequency and time domains to overcome the deficiency of FT [50]. However, only the low-frequency coefficients are decomposed again in the WT method, which causes the problem of missing high-frequency information. WPD was proposed to address this deficiency, where the information in both the low-frequency band and the high-frequency band is completely preserved [51]. The schematic diagram of a three-layer WPD is shown in Figure 1, and the theory is described as follows.
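Let $j$ denote the decomposition layer, $n$ denote the frequency factor ($n = 0, 1, 2, \ldots, 2^j - 1$), and $\psi(n)$ and $\phi(n)$ represent the wavelet function and scale function, respectively. Given $\varphi_0(n) = \phi(n)$ and $\varphi_1(n) = \psi(n)$, the wavelet packet $\varphi_i(n)$ ($i = 0, 1, 2, \ldots$) can be defined as

$$\begin{cases} \varphi_{2i}(n) = \sqrt{2} \sum_{k \in Z} h_k \, \varphi_i(2n - k), \\ \varphi_{2i+1}(n) = \sqrt{2} \sum_{k \in Z} g_k \, \varphi_i(2n - k), \end{cases} \tag{1}$$

where $k$ is the shift factor, $Z$ is the integer set, $h_k$ denotes the low-pass filter, $g_k$ denotes the high-pass filter, and $h_k$, $g_k$ are a pair of quadrature mirror filters that satisfy $g_k = (-1)^k h_{1-k}$. For a given time series $x(n)$, let $x_j^p(n)$ ($p = 0, 1, 2, \ldots, 2^j - 1$) denote its subsignal, which can be represented as a linear combination of the corresponding wavelet functions of the wavelet packet:

$$x_j^p(n) = \sum_{k \in Z} d_j^p(k) \, \varphi_{2^{j-1}+p}(2n - k), \tag{2}$$

where $d_j^p(k)$ denotes the $p$-th wavelet packet coefficient of the $j$-th layer, and it can be obtained by the inner product between $x(n)$ and $\varphi_{2^{j-1}+p}(2n - k)$, namely,

$$d_j^p(k) = \left\langle x(n), \varphi_{2^{j-1}+p}(2n - k) \right\rangle. \tag{3}$$

An approximation $x_j(n)$ of $x(n)$ with layer $j$ equals the sum of all the subsignals:

$$x_j(n) = \sum_{p=0}^{2^j - 1} x_j^p(n). \tag{4}$$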
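The full-tree splitting that distinguishes WPD from the ordinary WT can be illustrated with a short sketch. It is not the implementation used in this paper: it assumes the Haar filter pair as a stand-in for the db, sym, and coif wavelets considered later, and the function names `haar_step` and `wavelet_packet` are hypothetical. Every node, low-pass and high-pass alike, is split again at each layer, so layer $j$ ends with $2^j$ coefficient nodes.

```python
import math


def haar_step(signal):
    """Split one node into a low-pass half and a high-pass half (Haar filters)."""
    s = 1.0 / math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    return low, high


def wavelet_packet(signal, level):
    """Return the 2**level coefficient nodes of the last layer, in filter-path order."""
    nodes = [list(signal)]
    for _ in range(level):
        next_nodes = []
        for node in nodes:
            low, high = haar_step(node)
            # Unlike the plain wavelet transform, the high-pass branch is kept
            # and split again in the next layer.
            next_nodes.extend([low, high])
        nodes = next_nodes
    return nodes


# Toy signal (an assumption for illustration): two sinusoids, 64 points.
x = [math.sin(2 * math.pi * 5 * n / 64) + 0.5 * math.sin(2 * math.pi * 20 * n / 64)
     for n in range(64)]
nodes = wavelet_packet(x, 3)
energy_in = sum(v * v for v in x)
energy_out = sum(v * v for node in nodes for v in node)
```

Because the Haar pair is orthonormal, the total energy of the last-layer coefficients matches that of the input for even-length nodes, which is a convenient sanity check. In practice a wavelet library such as PyWavelets would normally be used instead of hand-written filters.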

Optimized Wavelet Packet Decomposition.
The main idea of the OWPD method is to automatically select the most suitable WBF for vibration signal analysis of rotating machinery according to the proposed two-step principle on the basis of WPD. In general, the main characteristics to be considered in selecting the WBF include orthogonality, compactness, symmetry, and vanishing moment. Considering the characteristics of different wavelet families and vibration signals of rotating machinery, the coif wavelets, db wavelets, and sym wavelets are selected as the candidate WBFs. Figure 2 shows the proposed two-step principle for selecting the optimal WBF, detailed as follows.
Step 1: select the candidate WBFs preliminarily from the same wavelet family according to the principle of maximum energy-to-Shannon entropy ratio (METSE) [52].
(1) Calculate the energy value $E(n)$ of the $n$-th node:

$$E(n) = \sum_{i=1}^{m} \left| C_{n,i} \right|^2, \tag{5}$$

where $i$ and $m$ are the serial number and total number of discrete points in the $n$-th node, respectively, and $C_{n,i}$ is the coefficient corresponding to the discrete point.
(2) The Shannon entropy of the $n$-th node is defined as

$$S_{\mathrm{entropy}}(n) = -\sum_{i=1}^{m} p_i \log p_i, \tag{6}$$

where $p_i$ is the energy probability distribution of the wavelet coefficients, defined as

$$p_i = \frac{\left| C_{n,i} \right|^2}{E(n)}. \tag{7}$$

(3) The ratio of the total energy to the total Shannon entropy of the $j$-layer WPD is represented as $\zeta$, namely,

$$\zeta = \frac{\sum_{n=0}^{2^j - 1} E(n)}{\sum_{n=0}^{2^j - 1} S_{\mathrm{entropy}}(n)}. \tag{8}$$

According to equations (5)-(8), the candidate WBF with the largest $\zeta$ value can be selected from the same wavelet family.
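Step 1 can be made concrete with a small sketch of the energy-to-Shannon entropy ratio. It is an illustration only: the function names are hypothetical, the logarithm is taken to base 2 (an assumption, since the base is not stated in the text), and the two toy coefficient sets stand in for the last-layer nodes produced by two different candidate WBFs.

```python
import math


def node_energy(coeffs):
    """Energy of one node: sum of squared coefficients."""
    return sum(c * c for c in coeffs)


def node_shannon_entropy(coeffs):
    """Shannon entropy of the energy distribution within one node (base 2 assumed)."""
    energy = node_energy(coeffs)
    if energy == 0.0:
        return 0.0
    entropy = 0.0
    for c in coeffs:
        p = c * c / energy
        if p > 0.0:
            entropy -= p * math.log2(p)
    return entropy


def energy_to_entropy_ratio(nodes):
    """Total energy divided by total Shannon entropy over all last-layer nodes."""
    total_energy = sum(node_energy(n) for n in nodes)
    total_entropy = sum(node_shannon_entropy(n) for n in nodes)
    return total_energy / total_entropy if total_entropy > 0.0 else float("inf")


# Two hypothetical decompositions with the same total energy (6.0):
concentrated = [[2.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
spread_out = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
ratio_concentrated = energy_to_entropy_ratio(concentrated)
ratio_spread_out = energy_to_entropy_ratio(spread_out)
```

With equal total energy, the decomposition whose energy is concentrated in fewer coefficients has lower entropy and therefore a larger ratio, which is why the largest ratio is preferred within a wavelet family.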
Step 2: select the optimal WBF further from different wavelet families according to the principle of similarity measure.
Firstly, the candidate WBFs from the different wavelet families mentioned above are each applied to implement WPD. Then, the signal is reconstructed using the coefficients of the nodes of the last layer. Finally, a standardized Euclidean distance is used to measure the similarity between the original signal $x_i$ and the reconstructed signal $y_i$ ($i = 1, 2, \ldots, N$):

$$d = \sqrt{\sum_{i=1}^{N} \left( \frac{x_i - y_i}{s_i} \right)^2}, \tag{9}$$

where $s_i$ is the standard deviation between $x_i$ and $y_i$. The smaller the value of $d$, the closer the reconstructed signal is to the original signal, and the more suitable the corresponding WBF is for signal analysis.

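Fuzzy Entropy.
Given an $N$-dimensional time series $[\mu(1), \mu(2), \ldots, \mu(N)]$, the phase space dimension and the similarity tolerance are defined as $m$ ($m \le N - 2$) and $r$, respectively. Then, the phase space can be reconstructed as

$$X(i) = [\mu(i), \mu(i+1), \ldots, \mu(i+m-1)] - \mu_0(i), \quad i = 1, 2, \ldots, N - m + 1,$$

where

$$\mu_0(i) = \frac{1}{m} \sum_{j=0}^{m-1} \mu(i+j).$$

The fuzzy membership function $A(x)$ is introduced to measure the similarity between window vectors. It is evaluated at $x = d_{ij}^m$, where $d_{ij}^m$ is the maximum absolute distance between window vectors $X(i)$ and $X(j)$, that is,

$$d_{ij}^m = \max_{k = 0, 1, \ldots, m-1} \left| \left( \mu(i+k) - \mu_0(i) \right) - \left( \mu(j+k) - \mu_0(j) \right) \right|.$$

The function $\Phi^m$ is defined as

$$\Phi^m(r) = \frac{1}{N-m} \sum_{i=1}^{N-m} \left( \frac{1}{N-m-1} \sum_{j=1, j \ne i}^{N-m} A\left( d_{ij}^m \right) \right).$$

Therefore, the FE value of the original time series can be calculated as

$$\mathrm{FE}(m, r, N) = \ln \Phi^m(r) - \ln \Phi^{m+1}(r).$$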
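The FE computation can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the membership function is taken to be the exponential form $\exp(-d^n / r)$ that is common in the fuzzy entropy literature, and the defaults `m=2`, `r=0.2`, `n=2` are placeholders rather than the values of Table 4. The function name `fuzzy_entropy` is hypothetical.

```python
import math
import random


def fuzzy_entropy(series, m=2, r=0.2, n=2):
    """Fuzzy entropy with an assumed exponential membership exp(-(d**n) / r)."""
    N = len(series)
    count = N - m  # same number of window vectors for dimensions m and m + 1

    def phi(dim):
        # Build baseline-removed window vectors of length `dim`.
        windows = []
        for i in range(count):
            w = series[i:i + dim]
            mean = sum(w) / dim
            windows.append([v - mean for v in w])
        total = 0.0
        for i in range(count):
            for j in range(count):
                if i == j:
                    continue
                # Maximum absolute (Chebyshev) distance between two windows.
                d = max(abs(a - b) for a, b in zip(windows[i], windows[j]))
                total += math.exp(-(d ** n) / r)
        return total / (count * (count - 1))

    return math.log(phi(m)) - math.log(phi(m + 1))


random.seed(0)
sine = [math.sin(0.2 * i) for i in range(128)]
noise = [random.uniform(-1.0, 1.0) for _ in range(128)]
fe_sine = fuzzy_entropy(sine)
fe_noise = fuzzy_entropy(noise)
```

A regular signal such as the sine wave yields a lower FE value than uniform noise, which matches the interpretation of FE as a measure of irregularity. The double loop is quadratic in the series length, so a vectorized implementation would be preferable for long subsignals.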

CatBoost for Classification.
In the GBDT algorithm, many decision trees are combined to produce a model of high accuracy, and the process can be written as

$$F(x) = \sum_{t=1}^{T} f_t(x, \theta_t), \tag{17}$$

where $x$ is the feature vector, $T$ is the number of trees, $\theta_t$ ($t = 1, 2, \ldots, T$) is a learned parameter, and $f_t(x, \theta_t)$ represents the decision trees that are learned. Let the training samples be $D = \{(x_k, y_k)\}_{k=1}^{n}$, where $n$ is the number of samples, $x_k$ ($k = 1, 2, \ldots, n$) is the sample data, and $y_k$ is the true sample label. To learn the model introduced in equation (17), the following objective function needs to be minimized:

$$\mathrm{Obj} = \sum_{k=1}^{n} L(y_k, \hat{y}_k) + \sum_{t=1}^{T} \Omega(f_t), \tag{18}$$

where $\hat{y}_k$ is the predicted sample label, $L$ is the loss function that represents the difference between $y_k$ and $\hat{y}_k$, and $\Omega$ is the regularization function that penalizes the complexity of $f_t$, defined as

$$\Omega(f_t) = \alpha q + \frac{1}{2} \beta \left\| \omega \right\|^2, \tag{19}$$

where $\alpha$ is the penalty parameter that controls the number of leaf nodes, $q$ is the number of leaf nodes, $\beta$ is the regularization parameter, and $\omega$ is the weight coefficient. Let $g$ denote the negative gradient of the loss function, and the objective function is minimized in the direction of $g$, namely,

$$g_k = -\frac{\partial L(y_k, \hat{y}_k)}{\partial \hat{y}_k}. \tag{20}$$

Traditional GBDT algorithms generally suffer from prediction offset (prediction shift), which affects the generalization ability of the model. To overcome this defect, CatBoost was proposed with two notable improvements [43]: (1) the ordered boosting strategy was adopted to obtain an unbiased estimate of the gradient and reduce the prediction offset; (2) the oblivious tree was used as the base learner to increase the reliability of the model and speed up prediction.
In addition, to better deal with categorical features, the greedy target-based statistics strategy was improved by adding prior terms in the CatBoost algorithm, which can be summarized in three main steps: (1) all the samples in the dataset are randomly permuted; (2) samples with the same category are selected, and the average label of these samples is calculated; and (3) the features of each sample are converted to numerical values by adding the prior term and its corresponding weight coefficient. The improved greedy target-based statistics strategy can be expressed as

$$\hat{x}_k^i = \frac{\sum_{j=1}^{n} \left[ x_j^i = x_k^i \right] y_j + a P}{\sum_{j=1}^{n} \left[ x_j^i = x_k^i \right] + a}, \tag{21}$$

where $x_k^i$ represents the $i$-th category feature of the $k$-th sample, $\hat{x}_k^i$ represents the corresponding numerical feature, $[\cdot]$ equals 1 when the condition holds and 0 otherwise, $P$ represents the added prior value, and $a$ represents the weight coefficient ($a > 0$). The addition of prior values can effectively reduce the noise caused by low-frequency categories and avoid overfitting.
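The three encoding steps for categorical features can be illustrated with a short sketch. It assumes the ordered variant of the target statistic, in which each sample is encoded using only the samples that precede it in the random permutation, and it assumes the input lists are already permuted. The function name `ordered_target_statistic` and the toy data are hypothetical.

```python
def ordered_target_statistic(categories, labels, prior, weight):
    """Encode each categorical value from the labels of earlier samples only.

    `categories` and `labels` are assumed to be already randomly permuted.
    `prior` plays the role of P and `weight` the role of a (weight > 0).
    """
    label_sums = {}
    counts = {}
    encoded = []
    for cat, y in zip(categories, labels):
        s = label_sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Smoothed average label of earlier samples sharing this category.
        encoded.append((s + weight * prior) / (c + weight))
        # Only now does the current sample become visible to later ones.
        label_sums[cat] = s + y
        counts[cat] = c + 1
    return encoded


cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 1, 0, 1]
encoded = ordered_target_statistic(cats, ys, prior=0.5, weight=1.0)
```

The first occurrence of each category falls back to the prior (0.5 here), and later occurrences move toward the running label average, which is how the prior dampens the noise of rarely seen categories.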

CatBoost for Feature Selection.
The growth strategies of decision trees differ among GBDT algorithms. XGBoost uses a level-wise strategy, which has the disadvantage of inefficiency [53]. CatBoost uses the symmetric tree strategy to optimize the computation of the leaf values and to prevent the model from overfitting. Because the base learner of CatBoost is a tree model, the feature coefficient or importance can be obtained according to a certain evaluation index after training the model, e.g., the change of the loss function or of the prediction values. For a given feature in the feature set, the importance $\mathrm{FI}$ can be calculated as

$$\mathrm{FI} = \sum_{S} \left[ (\upsilon_1 - \mathrm{avr})^2 c_1 + (\upsilon_2 - \mathrm{avr})^2 c_2 \right], \tag{22}$$

$$\mathrm{avr} = \frac{\upsilon_1 c_1 + \upsilon_2 c_2}{c_1 + c_2}, \tag{23}$$

where $S$ denotes the different paths to the leaf nodes in the decision tree, $c_1$ and $c_2$ denote the total weight coefficient in the left and right leaves, respectively, and $\upsilon_1$ and $\upsilon_2$ denote the formula value in the left and right leaves, respectively.

Figure 2: The proposed two-step principle of selecting the optimal WBF.
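The importance calculation can be sketched as follows, assuming the prediction-values-change form in which each split contributes the weighted squared deviation of its two leaf values from their weighted average. The function names and the toy split list are hypothetical, and the final scaling to a sum of 100 mirrors the normalization used later when the importances are reported.

```python
def split_importance(v1, c1, v2, c2):
    """Contribution of one split: weighted squared deviation from the weighted mean."""
    avr = (v1 * c1 + v2 * c2) / (c1 + c2)
    return (v1 - avr) ** 2 * c1 + (v2 - avr) ** 2 * c2


def feature_importances(splits, n_features):
    """Sum split contributions per feature and normalize so they add up to 100."""
    raw = [0.0] * n_features
    for feature, v1, c1, v2, c2 in splits:
        raw[feature] += split_importance(v1, c1, v2, c2)
    total = sum(raw)
    return [100.0 * r / total for r in raw] if total > 0.0 else raw


# Hypothetical splits: (feature index, left value, left weight, right value, right weight)
splits = [
    (0, 1.0, 10, -1.0, 10),
    (1, 0.5, 5, 0.0, 15),
    (0, 0.2, 8, -0.2, 12),
]
importances = feature_importances(splits, n_features=2)
```

A feature whose splits separate the leaf values strongly, weighted by how many samples pass through them, receives a large share of the total.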

BO Algorithm for Optimizing Hyperparameters.
In machine learning algorithms, the hyperparameters usually have a strong influence on the performance of the model. Traditional methods of parameter tuning have some inevitable shortcomings. For example, a greedy algorithm can only obtain a locally optimal solution, and grid search tends to miss the global optimum when the objective is uncertain and nonconvex. Unlike these methods, the BO algorithm searches for the global optimal solution through a Gaussian process, and it is adopted here to find the optimal hyperparameters of the CatBoost model. The basic idea of the BO algorithm is that, for the given data and termination condition (usually the number of iterations or the expected value of the objective function), Bayesian theory is used to estimate the posterior distribution of the objective function. This distribution and the information from the previous sampling points are used to select the hyperparameters of the next sampling point until the objective function reaches its global maximum. Here, the objective is to maximize the classification accuracy:

$$\mu^{*} = \arg\max_{\mu} f(\mathrm{CB}, \mu, D_t, D_v), \tag{24}$$

where $\mathrm{CB}$ denotes the CatBoost classifier, $\mu = \{\mu_1, \mu_2, \ldots, \mu_n\}$ is the set of hyperparameters, $D_t$ and $D_v$ represent the training and validation sets divided by K-fold cross-validation, respectively, and $f(\mathrm{CB}, \mu, D_t, D_v)$ is the classification accuracy.
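The objective that the optimizer maximizes can be sketched as a K-fold cross-validated accuracy. Everything here is illustrative: `threshold_classifier` is a hypothetical stand-in for the CatBoost classifier, the fold assignment is a simple interleaving, and a fixed list of candidate hyperparameters stands in for the points that a Gaussian-process-based optimizer would propose one at a time.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists using interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


def cv_objective(fit_predict, params, X, y, k=5):
    """Mean validation accuracy over k folds for one hyperparameter setting."""
    scores = []
    for train, val in k_fold_indices(len(X), k):
        preds = fit_predict(params,
                            [X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in val])
        correct = sum(p == y[i] for p, i in zip(preds, val))
        scores.append(correct / len(val))
    return sum(scores) / k


def threshold_classifier(params, X_train, y_train, X_val):
    """Toy stand-in classifier: predicts 1 when the feature exceeds a threshold."""
    return [1 if x > params["threshold"] else 0 for x in X_val]


X = list(range(20))
y = [0] * 10 + [1] * 10
candidates = [{"threshold": 4.5}, {"threshold": 9.5}, {"threshold": 14.5}]
scores = [cv_objective(threshold_classifier, p, X, y) for p in candidates]
best = candidates[scores.index(max(scores))]
```

A Bayesian optimizer differs from this exhaustive loop only in how the next candidate is chosen: it fits a surrogate model to the scores observed so far and evaluates the objective where the surrogate predicts the most promising improvement.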

The Proposed Fault Diagnosis Approach.
An overview of the proposed fault diagnosis approach for rotating machinery based on feature importance ranking and selection is shown in Figure 3. The specific steps are as follows:
Step 1: the original vibration signals are acquired by accelerometers and the data acquisition system.
Step 2: the optimal WBF is selected according to the proposed two-step principle, and then the original vibration signals are decomposed by OWPD.
Step 3: the FE values of the decomposed subsignals are calculated to form the fault feature set F.
Step 4: CatBoost algorithm is utilized to obtain the importance of each feature in F by a certain strategy. According to the ranking result of feature importance, the candidate features are selected in sequence and combined with the corresponding labels to form dataset S.
Step 5: dataset S is divided into two parts according to a certain proportion: training set and test set.
Step 6: the training set is used to train the CatBoost classifier, and BO algorithm is adopted to optimize the main hyperparameters.
Step 7: the test set is fed into the trained CatBoost classifier to output the diagnostic results.
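The ranking-and-selection logic of Step 4 can be sketched independently of any particular classifier. In this illustration, features are added in descending order of importance and the subset with the best score is kept. The function `select_by_importance`, the scoring callback, and the toy importances are all hypothetical, and in the approach above the score would be the classification accuracy obtained with the selected features.

```python
def select_by_importance(importances, score_fn):
    """Add features in descending importance and keep the best-scoring prefix."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    best_score, best_subset = float("-inf"), []
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        score = score_fn(subset)
        if score > best_score:
            best_score, best_subset = score, list(subset)
    return best_subset, best_score


# Toy setting: features 1, 3, and 4 carry information; each extra feature costs a little.
informative = {1, 3, 4}


def toy_score(subset):
    """Hypothetical stand-in for classification accuracy."""
    return len(informative & set(subset)) / len(informative) - 0.01 * len(subset)


toy_importances = [5.0, 40.0, 0.0, 30.0, 25.0]
selected, selected_score = select_by_importance(toy_importances, toy_score)
```

Because only prefixes of the ranked list are evaluated, the search needs one model evaluation per feature count rather than one per subset.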

Experimental Setup and Data Description.
To prove the effectiveness of the proposed fault diagnosis approach, a hybrid dataset of bearing and rotor faults collected by the machinery fault simulator (MFS) platform was used for experimental verification [54]. The experimental setup is shown in Figure 4. The MFS is driven by an AC motor at a speed of 2100 rpm, and its power is transmitted to the rotating plate and the drive shaft through the coupling. The sampling frequency is 6 kHz. By replacing different components, ten different types of datasets are collected with the data acquisition box, including nine fault types and one normal type, with detailed information shown in Table 1. There are 160 samples for each type, and each sample contains 1000 nonoverlapping data points.
To observe the difference between different types of vibration signals, a sample of each fault type is randomly selected to draw the waveform in the time and frequency domains, respectively, and the results are shown in Figure 5. It can be seen that all fault types are time-varying and frequency-varying signals, which indicates that the original vibration signals are nonstationary. In addition, in the frequency domain, the fault information of most types is concentrated in the low-frequency band, while the high-frequency band contains less.
Each sample is standardized by the z-score method. In addition, missing values are detected in advance. If there is a missing value, it is filled using the Lagrange interpolation formula. The experimental algorithms are designed and implemented in Python 3.7.3 on a computer configured with an Intel Core i5-6000hq CPU and 12 GB of RAM.
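The per-sample standardization can be written in a few lines. This sketch assumes the population standard deviation (dividing by the number of points), which the text does not specify, and the function name `z_score` is hypothetical.

```python
import math


def z_score(sample):
    """Standardize one sample to zero mean and unit (population) standard deviation."""
    n = len(sample)
    mean = sum(sample) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in sample) / n)
    if std == 0.0:
        # A constant sample carries no variation; map it to zeros.
        return [0.0] * n
    return [(v - mean) / std for v in sample]


z = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Standardizing each sample separately removes differences in offset and amplitude between recordings before decomposition.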

Parameter Settings of OWPD.
Different decomposition layers in OWPD result in different frequency resolutions of the subsignals, which affects the accuracy and time consumption of fault diagnosis. If the decomposition layer is $l$, the frequency resolution of the signal is

$$d_f = \frac{f_s}{2^{l+1}}, \tag{25}$$

where $f_s$ is the sampling frequency, which is 6 kHz here. To ensure that $d_f$ is greater than 1 Hz, the value of $l$ needs to be less than 12. In addition, the number of features and the calculation time increase with the number of subbands. Therefore, $l$ is preliminarily selected as 5, so that each sample is evenly divided into 32 subbands in the frequency domain. The influence of the value of $l$ on the diagnostic results will be analysed in detail in the following experiments.
The optimal WBF is selected according to the two-step principle described in Section 2.1.2. To reduce the calculation time, 10 samples are randomly selected under each fault type to form a new dataset, and 5-level WPD is carried out for each sample with different WBFs. In the first step, the total energy-to-Shannon entropy ratio $\zeta$ of each sample is calculated according to equations (5)-(8), and then the average value over these 100 samples is taken. The results are detailed in Table 2. As can be seen, the WBFs with the largest average $\zeta$ value within each wavelet family are db7, sym7, and coif3, respectively.
In the second step, the above candidate WBFs are used to reconstruct the signal, and the average value of the similarity coefficient $d$ with the original signal is calculated according to equation (9). The results are detailed in Table 3. As can be seen, the original signal is most similar to the reconstructed signal when the coif3 WBF is selected. The coif WBF has orthogonality and compact support. In addition, compared with the db WBF, it has better symmetry. Therefore, it is more effective for extracting fault vibration signals with impulsive and nonstationary characteristics.
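The bound on the decomposition layer discussed in this section can be checked with a short calculation. The sketch assumes that the resolution equals the sampling frequency divided by $2^{l+1}$, i.e., the band from 0 to half the sampling frequency split into $2^l$ equal subbands, which is consistent with the stated requirement that the layer be less than 12 at 6 kHz. The function names are hypothetical.

```python
def frequency_resolution(fs_hz, level):
    """Width of one subband: the band [0, fs/2] split into 2**level parts."""
    return fs_hz / 2 ** (level + 1)


def max_level(fs_hz, min_resolution_hz=1.0):
    """Largest layer whose subband width stays above the minimum resolution."""
    level = 0
    while frequency_resolution(fs_hz, level + 1) > min_resolution_hz:
        level += 1
    return level


fs = 6000.0
deepest = max_level(fs)                    # deepest admissible layer
width_at_5 = frequency_resolution(fs, 5)   # subband width for the chosen layer
subbands_at_5 = 2 ** 5                     # number of subbands (features) per sample
```

At layer 5, each of the 32 subbands is 93.75 Hz wide, and layer 11 is the deepest one that keeps the width above 1 Hz.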

Parameter Settings of FE.
After determining the decomposition layer and optimal WBF, the original signals are decomposed by OWPD. Then, the FE values of these subsignals are calculated separately. That is to say, the dimension of the feature set is 32. The parameters of FE are set according to [14,18], as shown in Table 4, where STD represents the standard deviation of each sample. The time consumption of feature extraction is 1.29 s/sample. To visualize the calculation results, a sample from each fault type is randomly selected to carry out 5-layer OWPD using the coif3 WBF. Figure 6 shows the FE values of the subsignals of different fault types. It can be seen that the difference of FE values among different fault types is quite obvious, which indicates that the proposed OWPD-FE method can effectively extract fault features. In addition, the FE values of some subsignals for certain fault types are almost 0, which also indicates that there is some feature redundancy.

Feature Visualization.
t-SNE is a widely used algorithm for data visualization and dimensionality reduction [55]. Here, it is adopted to project the extracted 32-dimensional features into two-dimensional (2D) space for visualization. For each fault type, 20 groups of data after dimensionality reduction by t-SNE are selected to draw a scatter diagram, as shown in Figure 7. In the 2D plane, the features of the ten fault types overlap little, and the boundaries among different types can be clearly distinguished, which shows that the OWPD-FE method is effective in extracting fault information of bearings and rotors.

Feature Selection Using CatBoost.
An appropriate feature selection method can reduce the feature redundancy and improve the diagnostic performance of the model. In this paper, the CatBoost algorithm is considered for this step. According to Section 2.3.2, equations (22) and (23) are used to calculate the importance of each feature. By selecting features in order of importance and tracking the classification accuracy, the feature set that reaches the highest classification accuracy can be obtained. Figure 8 shows the normalized calculation results of the importance of the 32 features, whose sum is 100.

Diagnosis Results and Analysis.
First, the dataset is divided into a training set and a test set at a ratio of 3 : 2. That is to say, 96 samples of each fault type are used for model training and the remaining 64 samples for testing. In addition, ten-fold cross-validation is conducted on the training set. The main hyperparameters of CatBoost optimized by the BO algorithm are shown in Table 5. Then, to study the effect of the number of features on the classification results, the feature selection process is carried out according to the analysis of Section 3.3. The experimental results are detailed in Figure 9. It can be seen that the model training time is positively correlated with the number of features, which is consistent with practical experience. The average accuracy of ten-fold cross-validation on the training set reaches 100% when 7 features are selected. When the number of features is 22, the test set accuracy is the highest, reaching 99.17%. However, when all 32 features are used, it decreases by 0.21 percentage points (to 98.96%). Therefore, the classification accuracy obtained with 22 features is taken as the final diagnosis result. The time consumption of model training in this case is 10.13 s, which is 1.68 s less than that without feature selection. The experimental results show the reliability of the proposed feature selection method and the effectiveness of the classification algorithm.
Figure 10 presents the confusion matrix of the diagnosis result using 22 features. The diagnostic accuracy of all fault types is above 98%, and it reaches 100% for 6 fault types (the corresponding category labels are 2, 3, 6, 7, 9, and 10). The experimental results show that the proposed approach can effectively identify the hybrid fault states of the rotor and bearing.

Comparison of Different Decomposition Layers.
In this section, to further demonstrate that setting the decomposition layer l to 5 is effective and reasonable, the influence of the value of l on the diagnostic performance is investigated. Firstly, l needs to be less than 12 according to equation (25). Therefore, l is set to each value from 1 to 8 to perform OWPD. Then, the parameters of FE are set according to Table 4. The partition of the sample dataset is described in Section 3.4. CatBoost is still applied to feature selection and fault pattern recognition of the 8 datasets, and the BO algorithm is used to optimize hyperparameters. The experimental results are shown in Figure 11.
As can be seen from Figure 11(a), the classification accuracy generally presents an upward trend with the increase of l. In detail, when l is equal to 1, the average accuracy of the training set and validation set is very low, only 82.02% and 53.75%, respectively, while the test set accuracy only reaches 57.29%. When l is greater than 2, the training set accuracy reaches 100% in all cases. The validation set accuracy reaches its maximum when l is equal to 6 (99.29%). When l is equal to or greater than 5, the test set accuracy reaches 99.17% in all cases. As can be seen from Figure 11(b), the time of feature extraction and model training increases with the increase of l. The feature extraction time increases exponentially with the increase of l. The experimental results show that high-quality features containing effective fault information can be obtained by selecting an appropriate l to perform WPD. An inappropriate value of l will result in too low a classification accuracy (l less than 4) or too high a computational cost (l greater than 6). Therefore, it is reasonable to set l to 5 considering the classification accuracy and time consumption together.
In addition, to illustrate the necessity of feature selection, the diagnostic performance with and without feature selection is compared, and the experimental results are shown in Figure 12. As can be seen from Figure 12(a), there is no feature redundancy when l is less than 4. However, the number of redundant features gradually increases when l is greater than 4, which is caused by the increase of frequency bands containing little or no fault information. Accordingly, feature selection greatly reduces the model training time from the perspective of computational cost. Meanwhile, it can be seen from Figure 12(b) that feature selection can also effectively improve the classification accuracy, especially when l is 5, 7, and 8.

Comparison of Different Classifiers.
In this section, to justify the superiority of the proposed BO-CatBoost algorithm and the applicability of the OWPD-FE method combined with other classifiers, SVM, RF, GBDT, and XGBoost are adopted for comparison. The dataset consisting of the 22 high-quality fault features described in Section 3.4 is input into each of the above classifiers for model training and testing. To obtain the optimal diagnostic performance, the main hyperparameters of these classifiers are all optimized by the BO algorithm, as detailed in Table 6. The diagnostic results are shown in Figure 13.

Experimental Setup and Data Description.
Since the working condition of the MFS dataset is relatively simple, this section further verifies the effectiveness of the proposed approach in practical applications through an actual gearbox dataset with more complex working conditions. The experimental platform is composed of a one-stage reduction gearbox, a torque sensor, a servo motor, etc., as shown in Figure 14. Four fault types are formed by processing gears with different crack lengths (0, 5, 10, and 15 mm). The sampling frequency is 5 kHz. The details of the dataset can be found in [2]. Here, data collected under 20 different working conditions are used, as shown in Table 7. These data constitute 10 different datasets, as shown in Table 8.
The BO-CatBoost algorithm is still used to identify and diagnose fault types. Figure 15 shows the diagnostic results of the ten datasets after selecting the optimal feature subset. For all ten datasets, the training set accuracy reaches 100%. For D1 to D9, the test set accuracy is higher than 96.56% and even higher than 99% on D2, D6, D7, and D9. The test set accuracy of D10 under the most complicated conditions is 98.65%, which is 0.22 percentage points higher than using the default CatBoost.

Conclusions
In this paper, aiming at the fault diagnosis of rotating machinery under complex working conditions, a novel approach based on feature importance ranking and selection is proposed. Firstly, the OWPD method is proposed to decompose the vibration signal, where a two-step principle of selecting the optimal WBF is introduced. On this basis, it is combined with FE to extract hidden and high-quality fault features from the decomposed subsignals. Then, in order to filter out redundant fault features that are not conducive to the diagnosis result, the CatBoost model is constructed and preliminarily applied to calculate the importance of each feature for further feature selection. Moreover, the classification model based on the BO-CatBoost algorithm can effectively solve the optimization problem of hyperparameters, which can greatly reduce model training time and improve the diagnosis accuracy. Finally, experimental results on the MFS dataset and the one-stage gearbox dataset under complex working conditions demonstrate the practicability and the generalization performance of the proposed approach, and the classification accuracy reaches 99.17% and above 96.56%, respectively. In addition, the robustness of the proposed approach under different working conditions is also verified by the one-stage gearbox dataset.
In future work, the effect of combining OWPD with other information entropies or some dimensionless time-domain indexes still needs to be discussed. In addition, the vibration data collected by only one acceleration sensor are utilized in this paper, while the fusion of multisensor data may provide more real fault information, which is also worth investigating in the next step.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.