Optimal Statistical Feature Subset Selection for Bearing Fault Detection and Severity Estimation



Introduction
Vibration-based diagnosis and prognosis of bearing faults has been an active area of research for many years [1][2][3]. Machine learning algorithms have been used effectively in bearing fault detection systems. The performance of such algorithms, in terms of accuracy and efficiency, largely depends on the number of features used and their discriminating capability.
Several types of statistical features have been used so far. Zeng et al. [4] used 12 conventional time-domain and 10 frequency-domain features for their classification approach. The conventional statistical parameters, namely, mean value, maximum value, root mean square value, peak-to-peak value, root amplitude, standard deviation, variance, kurtosis, and skewness are the most commonly used dimensional parameters [5][6][7][8]. A wide set of dimensionless statistical parameters such as crest factor, shape factor, impulse factor, clearance factor, skewness factor, and kurtosis factor has also been used to enhance the performance of fault diagnosis systems [9][10][11][12]. These parameters are extracted from either raw or preprocessed vibration signals. For the latter, empirical mode decomposition (EMD) has been a popular preprocessing technique [13]. It is important to note that not all of the constituent signals generated by EMD, called intrinsic mode functions (IMFs), bear fault information. Different criteria, such as correlation coefficient [14] and kurtosis [4], have been used to select significant IMFs. This has been shown to be important for designing an efficient fault diagnosis system [15,16].
A major boost in bearing fault diagnosis research was seen when novel statistical features began to be applied in artificial intelligence techniques. Sreejith et al. [17] proposed a method for pattern recognition by extracting normal negative log likelihood and kurtosis values of time-domain signals and using them as input to an artificial neural network. Abbasion et al. [18] used the Weibull negative log likelihood function of denoised signals and a support vector machine (SVM) classifier for fault classification of rolling bearings, while in [19][20][21] zero crossings, mean absolute deviation, and histogram features were applied in the respective works. The application of a new group of statistical parameters, called "Hjorth parameters," was explored for diagnosing 4 types of faults in rolling element ball bearings [22]. Lastly, different kinds of entropies have also been used for intelligent fault diagnosis in various research studies [14,[23][24][25]. Some recent papers have developed symmetric cross entropy measures of single-valued neutrosophic sets for fault diagnosis in bearings [26,27].
Researchers were thus able to utilise the power of statistical features for capturing important fault information by applying popular supervised machine learning techniques. The potential of one such machine learning technique, rule-based classifiers, in the domain of bearing fault diagnosis has also been shown to be promising. An ensemble of rule-based classifiers for fault diagnosis of rotating machinery was proposed by Dou et al. [28] to predict potential faults and subsequent breakdown of rotating machinery by using 6 time-domain and 5 frequency-domain features. Application of rule-based classifiers has also been successfully extended to severity detection. Li et al. [29] proposed an association rule mining method based on equal probability discretization to perform a 10-class prediction on select time-domain and frequency-domain features for fault classification and severity prediction.
Although all of the studies cited above have applied various time-domain parameters, whether conventional or novel and dimensional or dimensionless, for bearing fault diagnosis, either exclusively or in addition to other frequency-domain and time-frequency features, none of them discusses or justifies the use of a particular set of statistical features. The majority of previous studies have focussed more on optimising fault diagnosis classifiers, and few have compared established features with novel ones. Since fault diagnostic accuracy depends on the choice of both features and classifier, we believe it is important to emphasise features too, by proposing novel statistical features and/or developing a strategy to conduct an extensive analysis of new and old statistical features, so as to derive an optimal set of fault-sensitive features.
In view of this, this paper applies the statistical time-domain features Hjorth parameters (activity, mobility, and complexity) and normal negative log likelihood for Gaussian mixture model (GMM) for the first time to rolling element bearing fault type and severity diagnosis. These statistical features were added to 26 already established statistical features to investigate a comprehensive set of 30 time-domain statistical features. The features were extracted from the intrinsic mode functions of bearing vibration signals obtained using EMD. An in-depth, two-step, model-agnostic approach, based on filter-based ranking with 3 metrics and feature subset selection with 11 search techniques, including both conventional search techniques and the newly introduced swarm search techniques, has been developed to select an optimal feature subset for bearing fault type detection and severity estimation. To validate the performance of the selected features, the rule-based classifiers PART and JRIP and their bagging and boosting ensembles were applied. The paper also examines the selected parameters and further analyzes the replacement of the highly ranked parameter normal negative log likelihood for single Gaussian with normal negative log likelihood for GMM and its effect on the diagnostic performance. The Case Western Reserve University (CWRU) Bearing Dataset [30] has been used for the experiments. The performance of the selected subset of features has also been tested on a dataset obtained from an operating power plant for fault type detection. The main contribution of this paper is twofold: (1) proposing Hjorth parameters and normal negative log likelihood for Gaussian mixture model (GMM) for rolling element ball bearing fault detection and severity estimation and (2) devising a strategy to probe a comprehensive set of statistical time-domain features with the objective of identifying an optimal feature subset for rolling element ball bearing fault detection and severity estimation.
The paper is organised as follows: Section 2 describes the statistical features and the filter-based feature ranking and subset selection methods. Section 3 explains the steps of our proposed methodology. Section 4 discusses in detail the data preparation steps and the implementation and results of feature selection, fault type detection, and fault severity detection.
This section also discusses the replacement of normal negative log likelihood for single Gaussian with its Gaussian mixture model variant, as well as the implementation of our methodology on a dataset obtained from an operating power plant. Finally, Section 5 presents the drawn conclusions and the future scope of our research.

Table 1 shows the list of statistical features extracted from the vibration signals. In the table, T1, T2, . . ., T30 are feature notations, x_i are individual data points of the sampled vibration signal, and N is the length of the vibration signal. The dimensional statistical parameters T1 to T11, T18, and T19 and the dimensionless statistical parameters T12 to T17 are self-explanatory from their formulae. T20 and T21 are the arithmetic means of the absolute deviations of each x_i from the mean and median, respectively. T22 is the rate of zero crossing (ZCR), defined as the ratio of the number of zero crossings to the total number of points in the signal. In other words, ZCR is the rate at which a signal changes sign from negative to positive or vice versa; it thus provides a good indication of the frequency content of the signal and is very useful for separating signals with different frequencies. T23, entropy, is a measure of the randomness of the values of a vibration signal. The two histogram features T24 and T25 are the histogram upper and lower bounds, respectively. The next three statistical parameters are the Hjorth parameters. T26, activity, is the variance of the vibration signal. T27, mobility, is defined as the ratio of the standard deviation of the first derivative of the vibration signal to the standard deviation of the vibration signal itself. T28, complexity, is defined as the ratio of the mobility of the first derivative of the vibration signal to the mobility of the vibration signal. The next statistic, T29, the Weibull negative log likelihood, is computed as follows:
T29 = −Σ_{i=1}^{N} [ln(b/a) + (b − 1) ln(x_i/a) − (x_i/a)^b], where a and b are the scale and shape parameters of the Weibull distribution fitted to the sample.
Extraction of Statistical Features.
Finally, T30, the normal negative log likelihood for a single Gaussian, can be computed as follows:
T30 = (N/2) ln(2πσ²) + Σ_{i=1}^{N} (x_i − μ)² / (2σ²), where μ and σ² are the mean and variance of the fitted Gaussian.
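As a concrete illustration of these definitions, the zero-crossing rate, the three Hjorth parameters, and the single-Gaussian negative log likelihood can be sketched in a few lines. This is a minimal Python sketch of the formulas above, not the MATLAB scripts used in our experiments:

```python
import math

def zero_crossing_rate(x):
    """T22: fraction of consecutive sample pairs where the signal changes sign."""
    crossings = sum(1 for a, b in zip(x, x[1:]) if a * b < 0)
    return crossings / len(x)

def _variance(x):
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

def hjorth_parameters(x):
    """T26 (activity), T27 (mobility), and T28 (complexity) of a signal."""
    dx = [b - a for a, b in zip(x, x[1:])]     # first difference
    ddx = [b - a for a, b in zip(dx, dx[1:])]  # second difference
    activity = _variance(x)
    mobility = math.sqrt(_variance(dx) / activity)
    complexity = math.sqrt(_variance(ddx) / _variance(dx)) / mobility
    return activity, mobility, complexity

def normal_neg_log_likelihood(x):
    """T30: negative log likelihood of a single Gaussian fitted to the sample
    (assumes a nonconstant signal, so the variance is nonzero)."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    return 0.5 * n * math.log(2 * math.pi * var) + sum((v - mu) ** 2 for v in x) / (2 * var)
```

For a pure sinusoid, mobility approximates the (discrete) angular frequency and complexity is close to 1, which is one way to sanity-check an implementation.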
Filter-Based Feature Selection Methods.
In this paper, we have used filter-based methods for identifying important features for bearing fault diagnosis. Filter-based techniques work well with any classification model, producing accurate results with shorter run times [31]. Moreover, these techniques perform well for the identification of dominant features for both fault type and severity detection [32].

Feature Ranking.
Three filter-based ranking metrics have been used to rank the features: chi-squared, information gain, and gain ratio. In chi-squared feature selection, the value of the chi-square statistic (χ²_c) is calculated between the feature and the response variable, and the features with high chi-squared values are selected. The chi-squared value is calculated as follows:

χ²_c = Σ_i (O_i − E_i)² / E_i,

where c = degrees of freedom, O_i = observed value, and E_i = expected value. The information gain test computes the information gain statistic for each feature (L) in the context of a response variable (R), ultimately selecting the features with higher information gain values. The higher the information gain score of an attribute, the more information the attribute can contribute to the response variable. Information gain can be understood as the reduction in entropy (H) and is calculated as follows:

IG(R, L) = H(R) − H(R | L),

where

H(R) = −Σ_{k=1}^{K} (m_k/m) log₂(m_k/m),

and m = the total number of instances and m_k = the number of instances belonging to class k, with k = 1, . . ., K. The third metric, gain ratio, is the ratio of information gain to the intrinsic (split) information:

gain ratio(S, A) = IG(S, A) / split entropy(S, A),

where

split entropy(S, A) = −Σ_{V} (|S_V|/|S|) log₂(|S_V|/|S|),

and A = candidate attribute, V = a possible value of A, S = the set of examples X, and S_V = the subset where X_A = V. In this method, the attribute with the highest gain ratio is selected as the splitting attribute.
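The entropy-based metrics above can be sketched as follows. This is a minimal Python illustration for discrete feature values; ranker implementations such as Weka's additionally discretize numeric attributes before scoring:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(R, L) = H(R) - H(R | L) for a discrete feature."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == v]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional

def gain_ratio(feature_values, labels):
    """Information gain normalised by the split (intrinsic) entropy of the feature."""
    split = entropy(feature_values)
    return information_gain(feature_values, labels) / split if split else 0.0
```

A feature that perfectly separates the classes attains IG equal to the class entropy (gain ratio 1), while a constant feature scores 0, matching the intuition behind the ranking.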

Feature Subset Selection.
The CfsSubsetEval evaluator returns a subset of features which show high correlation with the ground truth labels but are only weakly correlated with each other. In this research, we have applied both conventional search techniques and the newly introduced swarm search techniques [33]. Best first search is a conventional greedy search technique that finds subsets by backtracking and provides the flexibility of searching forward, backward, or in both directions. Greedy stepwise search is similar to best first search in that it also searches for subsets in a greedy manner, but instead of backtracking, it searches in a stepwise fashion and stops when a decrease in evaluation occurs after an addition or deletion of an attribute. Swarm search methods are metaheuristic, nature-inspired optimization algorithms. Ant search [34] is a stochastic combinatorial optimization technique modelled on how ant colonies work; it finds solutions in the early stages of the search process. Bat search [35] is based on how bats locate objects using sound waves. Bee search [36] is inspired by the foraging behaviour of honeybees; it solves optimization problems by combining exploitative neighbourhood search with random explorative search. Cuckoo search [37] models the parasitic behaviour of cuckoos and the Levy flight behaviour of fruit flies, while elephant search [38] models the behavioural characteristics of elephant herds. Firefly search [39] is inspired by the flashing patterns of fireflies and offers the advantages of automatic subdivision and the ability to deal with multimodality. Flower search [40] follows the survival-of-the-fittest rule based on the process of pollination of flowers. Wolf search [41] models how wolves hunt for prey and survive by avoiding their enemies; it is known to give good results in large search spaces.
Rhinoceros search [42] models the herding behaviour of rhinos and can be used for solving global continuous optimization problems.
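As an illustration of how a correlation-based subset evaluator can drive a search, the following sketch scores candidate subsets with the standard CFS merit, k·r̄_cf / sqrt(k + k(k−1)·r̄_ff), and grows the subset greedily forward. It illustrates the idea only and is not Weka's CfsSubsetEval implementation:

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def cfs_merit(subset, features, labels):
    """CFS merit of a subset of feature indices: high feature-class correlation,
    low feature-feature intercorrelation."""
    k = len(subset)
    r_cf = sum(abs(pearson(features[i], labels)) for i in subset) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for i in subset for j in subset if i < j]
    r_ff = sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def greedy_forward_cfs(features, labels):
    """Add the feature that most improves the merit until no addition helps."""
    chosen, best = [], 0.0
    while True:
        candidates = [i for i in range(len(features)) if i not in chosen]
        if not candidates:
            return chosen
        scored = [(cfs_merit(chosen + [i], features, labels), i) for i in candidates]
        merit, idx = max(scored)
        if merit <= best:
            return chosen
        chosen, best = chosen + [idx], merit
```

Notice how a feature that is redundant with an already chosen one is rejected even when it is individually well correlated with the labels, which mirrors the behaviour described for T6, T19, and T26 in Section 4.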

Rule-Based Classifiers.
Rule-based classifiers classify using "if... then... else" rules. If the set of feature values in a test sample conforms to a set of rules, then the sample is classified as belonging to the class associated with that set of rules.
The classification rules can be built either by extracting them directly from the data (direct method) or from other classification models (indirect method).
These classifiers are very expressive in their results, as the rules they learn can help to understand the underlying mechanics of the classification. Some of the rules generated by one of our classifiers are discussed in Section 4.3.
We have applied the rule-based classifiers PART and JRIP for bearing fault diagnosis. PART [43] is an efficient and accurate rule-based algorithm based on partial decision trees. It builds rules in a recursive fashion by using the "separate-and-conquer" strategy: it forms a rule, removes the instances that the rule covers, and then creates a rule for the remaining instances; these steps repeat until no samples remain. JRIP [44], an implementation of the RIPPER (Repeated Incremental Pruning to Produce Error Reduction) algorithm, builds association rules using reduced error pruning (REP). First, a rule is grown by adding conjuncts that improve the information gain. Then, the rules in the set are pruned using incremental REP. This is repeated until the description length criterion is met. The rule set is then optimized.

Methodology
The flowchart illustrating the steps used in our proposed methodology is shown in Figure 1. First, the bearing vibration signals are bifurcated into nonoverlapping signals of length 1024 each. Empirical mode decomposition is applied to these signals to obtain their intrinsic mode functions. Second, each of the parent signals is replaced, for further analysis, by its IMF having the highest Pearson correlation with the parent signal (Table 2); in this paper, we refer to these as representative IMFs. Third, 30 statistical time-domain features (Table 1) are extracted from these IMFs. Fourth, an average ranking of all the extracted features is obtained using filter-based ranking methods. Fifth, filter-based feature subset selection methods are used to find subsets of features having low intracorrelation and high correlation with the truth labels. The results obtained from the fourth and fifth steps are analysed together to obtain a set of significant features. Finally, the selected feature subset is used to train rule-based classifiers and their ensembles to perform classification for both fault type detection (Table 3) and fault severity detection (Table 4). The two classification models are also trained on a modified feature subset, formed by replacing one of the features, normal negative log likelihood of single Gaussian (T30), by its Gaussian mixture model counterpart (T30*). The T30* feature values have been computed for each vibration signal. The results obtained show the performance of the classifiers to be slightly improved (Section 4.7, Exp. 6). The trained models from our pipeline are also tested on a bearing dataset obtained from the water pump at an operational power plant (Section 4.8, Exp. 7).
In this paper, EMD, representative IMF selection, and feature extraction have been performed using scripts written in MATLAB R2017a. The ranking and subset selection of features have been conducted through the open-source tool Waikato Environment for Knowledge Analysis (Weka) [45]. Moreover, all the rule-based classifiers and their ensembles have been trained and tested in Weka. The MATLAB scripts and models have been executed on an 8 GB RAM machine with an Intel Core i5-7200U CPU running at 2.50 GHz.

Dataset Preparation.
In this paper, we have used the Case Western Reserve University Bearing Dataset. The setup (Figure 2) consists of a 2 hp electric motor, a transducer/encoder, a dynamometer, and accelerometers. Data for normal bearings were recorded using the deep groove ball bearing SKF6205-2RS JEM with dimensions 25 mm × 52 mm × 15 mm at the drive end for motor loads of 0, 1, 2, and 3 horsepower (motor speeds of 1797, 1772, 1750, and 1730 rpm, respectively). Electrodischarge machining was used to induce faults of diameters 0.007 inches, 0.014 inches, and 0.021 inches in the SKF6205 bearing and 0.028 inches in its NTN equivalent. After reinstalling these faulted bearings into the test motor, vibration data were recorded for the same motor loads. Vibration data were collected using a 16-channel DAT recorder for single-point drive-end defects at 12,000 samples/second. The data consist of signals corresponding to four cases: outer raceway defect, inner raceway defect, rolling element (i.e., ball) defect, and normal. The defect of the outer raceway fault was located at 6 o'clock (orthogonal to the load zone). Data samples of size 1024 points were extracted from these vibration signals in a linear nonoverlapping fashion, i.e., for each signal, the first 1024 points were extracted as its first data sample, the next 1024 points as its second data sample, and so on. A total of 17060 samples consisting of 3314 normal, 4735 inner race, 4272 outer race, and 4739 ball data samples were extracted. Each of the resulting 17060 signals of 1024 data points was decomposed by EMD, resulting in 12-17 IMFs along with a residue. The IMF having the highest Pearson correlation with the parent signal was taken as its representative signal, since it retains most of the information of the original signal (Figure 3). The correlation coefficient ρ_{X,Y} was computed as follows:

ρ_{X,Y} = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y),

where E is the expectation, σ_X and σ_Y are the standard deviations of X and Y, and μ_X and μ_Y are the means of X and Y, respectively.
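The segmentation and representative-IMF selection steps can be sketched as follows. This is a minimal Python illustration that assumes the IMFs have already been computed by EMD:

```python
def segment(signal, length=1024):
    """Split a signal into nonoverlapping windows of the given length,
    discarding any incomplete trailing window."""
    return [signal[i:i + length]
            for i in range(0, len(signal) - length + 1, length)]

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = (sum((a - mx) ** 2 for a in x)) ** 0.5
    sy = (sum((b - my) ** 2 for b in y)) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def representative_imf(parent, imfs):
    """Return the IMF with the highest Pearson correlation to the parent signal."""
    return max(imfs, key=lambda imf: pearson(parent, imf))
```

In practice the `imfs` argument would be the list produced by an EMD routine (our experiments used MATLAB's implementation); only the selection rule is shown here.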
The correlation coefficient values obtained for two sample vibration signals and their IMFs are shown in Table 2. The representative signals of all the parent signals were then used to extract the 30 statistical time-domain features listed in Table 1. Min-max normalization was applied to scale the features: the values of every feature were transformed into a decimal between 0 and 1 (both inclusive) by applying the following equation:

x′ = (x − x_min) / (x_max − x_min),

where x_min and x_max are the minimum and maximum values of the feature. Two kinds of datasets were prepared: dataset A for fault type detection and dataset B for fault severity detection. Dataset A has the data labelled into 4 classes (outer raceway defect, inner raceway defect, ball bearing defect, and normal), whereas dataset B has the data labelled into 12 classes defined by fault types and fault sizes (in inches). Both datasets were divided into train and test sets using an 80 : 20 split. The numbers of samples in the train and test sets for dataset A and dataset B are shown in Tables 3 and 4, respectively.

Three feature ranking techniques, based on the statistical metrics chi-squared, info gain, and gain ratio, were applied to the data. These techniques ranked the features according to their importance in classifying the samples into 4 classes for dataset A and likewise into 12 classes for dataset B. The threshold for the search method used by the feature ranking techniques was set so that none of the features was discarded. The ranking of the features is shown in Tables 5 and 6. The 3rd, 4th, and 5th columns list the ranks obtained by the ranker methods chi-squared, info gain, and gain ratio, respectively, for the feature in the corresponding row. The last column shows the average rank score calculated by averaging the ranks of the features over all three ranking metrics.
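The min-max scaling and rank-averaging steps described above can be sketched as:

```python
def min_max_scale(values):
    """Scale a feature column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def average_rank(rank_tables):
    """Average each feature's rank over several ranking metrics.
    rank_tables: list of dicts mapping feature name -> rank (1 = best)."""
    features = rank_tables[0].keys()
    return {f: sum(t[f] for t in rank_tables) / len(rank_tables)
            for f in features}
```

Sorting the `average_rank` output ascending reproduces the average rank score column of Tables 5 and 6 (the feature notations T1 to T30 would be the dict keys).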
The evaluator named CfsSubsetEval was used for further validation. This evaluator gives a subset of features which show high correlation with the ground truth labels but low correlation with each other. 11 search techniques were performed, including both conventional search techniques (best first search and greedy stepwise search) and the newly introduced swarm search techniques. Tables 7 and 8 show the results obtained with these techniques for the two datasets.
For implementing best first search, bidirectional search was used, while in greedy stepwise, forward search was used. The number of particles in the swarm, the chaotic coefficient, and the mutation probability were set to 20, 4.0, and 0.01, respectively, for all searches. In bee search, the discount factor for crossover was set to 0.8, and in cuckoo search, the sigma rate was kept at 0.69. For firefly search, the zero distance attractiveness for the firefly population was set to 0.33; the same value was used for wolf search. In flower search, a pollination rate of 0.33 was used.
From Table 7, it is observed that the subset of root mean square (T2), geometric mean (T18), rate of zero crossing (T22), Hjorth parameter-mobility (T27), and normal negative log likelihood for single Gaussian (T30) is selected by the majority of search techniques. It is also observed that these 5 features are among the top 10 average rank scorers highlighted in Table 5, hence validating the importance of these 5 features for fault type diagnosis. It can be noted that the features standard deviation (T6), RSSQ (T19), and Hjorth parameter-activity (T26) are ranked well by the ranking metrics but get selected by only a few subset search techniques. This could be due to the way subset selection works: since one of the criteria for a good subset is that its elements should have very little correlation with each other, features like T6, T19, and T26 were rejected by most of the search techniques, as they may show some correlation with an already chosen member of the subset. Since our aim is to select the best performing subset for fault diagnosis, we selected the set of only the 5 features stated above.
For fault severity detection, from Table 8, we observe that the subset of features root mean square (T2), RSSQ (T19), kurtosis factor (T17), geometric mean (T18), rate of zero crossing (T22), Hjorth parameter-mobility (T27), normal negative log likelihood for single Gaussian (T30), and 75th percentile (T9) is selected by the majority of search techniques; these 8 features are also among the top 10 average rank scorers in Table 6.

Exp. 2: Fault Type Detection with Feature Selection.
Rule-based classifiers PART and JRIP were used with the ensemble techniques boosting (multiboost [46] in our case) and bagging for performing fault type detection of the 4 classes: normal, inner race, outer race, and ball bearing. The ensemble techniques use a single learning algorithm to train multiple models on the dataset and then aggregate these models to get a final prediction. While bagging helps in getting an ensemble model with less variance by assigning equal weights to individual models for averaging, boosting works by reducing both variance and bias. Both of them help in increasing the stability of the models.
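The bagging idea, bootstrap resampling plus majority voting over base learners, can be sketched with a toy decision-stump base learner. This illustrates the principle only; it is not Weka's bagging or multiboost implementation, and the real base learners in our experiments were PART and JRIP:

```python
import random

def train_stump(samples, labels):
    """Fit a 1D decision stump: pick the (threshold, orientation) with fewest errors."""
    classes = sorted(set(labels))
    best = None
    for thr in sorted(set(samples)):
        for lo, hi in ((classes[0], classes[-1]), (classes[-1], classes[0])):
            preds = [lo if s <= thr else hi for s in samples]
            errs = sum(p != y for p, y in zip(preds, labels))
            if best is None or errs < best[0]:
                best = (errs, thr, lo, hi)
    _, thr, lo, hi = best
    return lambda s: lo if s <= thr else hi

def bagging(samples, labels, n_models=15, seed=0):
    """Train base learners on bootstrap resamples; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(samples)) for _ in samples]
        models.append(train_stump([samples[i] for i in idx],
                                  [labels[i] for i in idx]))
    def predict(s):
        votes = [m(s) for m in models]
        return max(set(votes), key=votes.count)
    return predict
```

Because each stump sees a different bootstrap resample, individual models disagree near the class boundary, and the equal-weight vote smooths this out, which is the variance reduction bagging is used for.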

(10)
From the above rules, it can be observed that a total of 1635 out of 2649 dataset instances were correctly classified as normal using just one rule, R4, generated from the features T2 and T22. Each of the three remaining rules also classifies a significant portion of the samples in the dataset, thus showing the good performance of the features forming these rules. The rules are also very simple in structure; for instance, R2, constructed from just 2 atomic formulae, correctly classifies 375 ball bearing instances. The ability to easily distribute the samples into the 4 classes using simple rules shows the efficiency of the selected time-domain features.
Since the number of samples belonging to the normal class is 2649, which is less than the number of samples in each of the faulty classes (around 3500 each), our dataset is slightly class-unbalanced. Hence, evaluating the performance of the rule-based classifiers on accuracy alone may not be comprehensive, and therefore the performance of the classifiers on the test dataset has been further studied by calculating 2 other metrics: F-measure and geometric mean (G-mean). F-measure per class is calculated as the harmonic mean of precision and sensitivity. G-mean is the geometric mean of specificity and sensitivity. Table 10 reports the performance of the classifiers on dataset A using F-measure and G-mean scores. The F-measure and G-mean scores for all the classifiers are significantly high; in particular, the PART + multiboost classifier outperforms the other classifiers. Hence, our rule-based classifiers are able to classify our slightly unbalanced dataset very well, in turn showing the good performance of the selected features.
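The two metrics can be computed per class in a one-vs-rest fashion, as sketched below:

```python
import math

def per_class_metrics(y_true, y_pred, cls):
    """F-measure and G-mean for one class in a one-vs-rest view."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_measure = (2 * precision * sensitivity / (precision + sensitivity)
                 if precision + sensitivity else 0.0)
    g_mean = math.sqrt(specificity * sensitivity)
    return f_measure, g_mean
```

Unlike overall accuracy, both metrics penalise a classifier that ignores an underrepresented class, which is why they are reported for our slightly unbalanced datasets.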
To further validate the good performance of the selected features in fault type detection, confusion matrices were built. Figure 4(a) shows the confusion matrix obtained on the test dataset with the T30 parameter for PART with multiboost. The normal instances were accurately classified (with only 2 misclassifications), further confirming that they are well classified by our 5 important features even though the normal class is slightly underrepresented. Another important observation is that most of the misclassifications involve the ball bearing class. 22 instances each of inner race and outer race were misclassified as ball bearing. Also, a total of 56 ball bearing instances were incorrectly classified. To validate these observations, 3 features, T22, T27, and T30, were plotted pairwise using scatter plots, as shown in Figure 5. It can be seen that for the normal class, the plots show mostly nonoverlapping clusters, thereby confirming that the selected features play a very good role in classifying fault vs. no fault. The plots also show considerable overlap between the ball bearing, inner race, and outer race clusters, confirming our observation about the ball bearing class. This indicates a lot of diversity in the ball bearing instances that is not handled by our features; hence, features which can accurately distinguish this class from the other classes are needed. A modification in one of the features resulting in a better classification of ball bearing instances is discussed in Section 4.7.

Exp. 3: Fault Type Detection without Feature Selection.

Table 11 reports the performance of the classifiers in fault type detection in terms of accuracy obtained on dataset A without feature selection. These classifiers perform better than their feature-selected counterparts in terms of the accuracies obtained. Again, in terms of overall performance, PART + multiboost outperforms all the 5 classifiers, giving the highest train and test accuracies of 100% and 99.13%, respectively.
Comparing these with the train and test accuracies of 100% and 96.51% obtained with feature selection, we observe that, on the train data, the selected features alone were sufficient for attaining perfect accuracy, while on the test data too, a major share of the credit for the high accuracy goes to the selected features.

Exp. 4: Fault Severity Detection with Feature Selection.
This section detects the severity of fault types by performing a 12-class classification using rule-based classifiers on the selected eight features obtained for dataset B. The 12 labels used for severity detection are shown in Table 4, where the severity of a fault is characterised by its size in inches. Table 12 shows the results obtained, wherein, once again, PART, when used with multiboost, outperforms all the 5 classifiers and gives the best accuracies of 100% on the train dataset and 97.92% on the test dataset. The results obtained with the parameter T30* are discussed later in Section 4.7.
Having few IR 028 and BB 028 samples and almost thrice that count for the other severity labels, the dataset is class-unbalanced, and hence we obtained the F-measure and G-mean values to further study the performance of the selected features. Table 13 reports the F-measure and G-mean values obtained with PART + multiboost for fault severity detection. We observe that both the underrepresented classes and the remaining classes are accurately classified. Hence, the selected features for fault severity detection perform well despite the data being class-unbalanced.

Exp. 5: Fault Severity Detection without Feature Selection.
In this experiment, we fed dataset B with all 30 statistical features to the rule-based classifiers to again study fault severity detection, the results for which are shown in Table 14. In comparison with the Table 12 data, the highest train accuracy of 100% remains unaltered, while the test accuracy sees a gain of 0.76%. The perfect train accuracy and the small gain in test accuracy show the superior performance of our selected eight features over the other features, confirming their large contribution.

The normal negative log likelihood for single Gaussian (T30) gives the negative of the likelihood that a single Gaussian fits the sample data, the sample data being the representative IMF of a bearing vibration signal. However, on selecting some random samples from the set of representative IMF signals and plotting their frequency distribution histograms with fitting curves, shown in Figure 6, we found that the idea of fitting a single Gaussian curve does not work well for some of the signals. Some signals, like samples 1 and 2 in Figures 6(a) and 6(b), respectively, have more than one distinct maximum, and a single Gaussian curve cannot accurately model these signals. This trend was mostly seen for samples belonging to the faulty classes. Hence, we replaced the T30 feature with its Gaussian mixture model variant, calling it T30*. The normal negative log likelihood for Gaussian mixture model (T30*) gives the negative of the log likelihood that a GMM of K means and K standard deviations fits the sample data. The mathematical formula for the normal negative log likelihood of a GMM can be defined as follows:

T30* = −Σ_{i=1}^{N} ln( Σ_{k=1}^{K} π_k N(x_i | μ_k, σ_k²) ),

where π_k is the mixing coefficient of the kth Gaussian. Here, we instantiated K as the number of local maxima in the probability distribution of the sample data. Table 15 reports the normal negative log likelihood for GMM values obtained for the 3 samples shown in Figure 6.
The normal negative log likelihood for GMM for Sample 1 is −466, which, in terms of model fit, is far better than its normal negative log likelihood for single Gaussian of −143.347. The same is the case with Sample 2. Sample 3 shows only a very slight change in value because it can be fitted well with a single Gaussian. Tables 9 and 11 show the results obtained for fault type detection when T30 was taken as one of the 5 selected features and when it was replaced by T30*. According to Table 9, PART with multiboost gives an accuracy of 100% on the train dataset for both T30 and T30* and shows a slight jump in test accuracy from 96.51% with T30 to 96.75% with T30*. For the rest of the classifiers, we see a slight improvement in train and test accuracy with T30* over T30. From the two confusion matrices in Figure 4, the number of true positives for the ball bearing class has increased from 931 with T30 to 939 with T30*, accompanied by a decrease in the ball bearing instances misclassified as inner race and outer race from 24 to 22 and from 28 to 22, respectively.
This further confirms the better performance of T30* as compared to T30. Similarly, the accuracy values in Table 11 are better with T30* than with T30 for all the classifiers, again showing the effectiveness of T30* over T30.
A similar trend is observed for fault severity detection. From Table 12, it is seen that, on comparing T30 with T30*, the train accuracy remains at its peak of 100% while the test accuracy increases by 0.64% to 98.56%. T30* also outperforms T30 without feature selection, as can be seen from Table 14. A clearer picture is provided by the confusion matrices in Figure 7, where the misclassifications with T30 exceed those with T30*. We believe that for bearing datasets in which a majority of the vibration signals have multiple peaks, this feature can prove to be a key performer because of its ability to represent such signals well. Hence, the normal negative log likelihood of GMMs has the potential to be a more accurate and reasonable parameter for feature selection and bearing fault diagnosis than the normal negative log likelihood of a single Gaussian.

The setup, shown in Figure 8, consists of a 7 hp vertical electric motor of a service water pump. Data for normal baseline bearings were recorded using the deep groove ball bearing SKF 6303-2Z/C3 with dimensions 17 mm × 47 mm × 14 mm at the drive end, with a motor speed of 3000 rpm. Vibration data were collected at 12,000 samples/second at 2 different timestamps, and for every timestamp, data were taken in both vertical and horizontal positions, leading to a total of 4 vibration signals. We performed the same data preprocessing on these samples as for the CWRU dataset: bifurcating the signals into subsignals of length 1024, EMD, and feature extraction.
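The first preprocessing step, bifurcating a raw vibration signal into length-1024 subsignals, can be sketched as follows (a minimal illustration; the recording duration is hypothetical, and the EMD and feature-extraction stages would follow on each row):

```python
import numpy as np

def segment(signal, length=1024):
    """Bifurcate a raw vibration signal into non-overlapping
    subsignals of a fixed length, dropping any incomplete tail."""
    n = len(signal) // length
    return signal[:n * length].reshape(n, length)

# e.g. 10 seconds of data at 12,000 samples/second
sig = np.random.default_rng(0).normal(size=12_000 * 10)
subs = segment(sig)
print(subs.shape)  # (117, 1024)
```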
Since the bearing configuration matches that used in the CWRU dataset, we mixed these test samples with the 20% test set separated from CWRU dataset A. The trained PART + multiboost model for fault type detection with feature selection, obtained in Exp. 2, was chosen for labelling instances as outer raceway defect, inner raceway defect, ball defect, or normal. The overall accuracy and average F-measure achieved were 96.75% and 0.972, respectively. Furthermore, per-instance classification results were obtained, and all the normal samples taken from the power plant were correctly classified. Hence, the results were reasonably good. Mixing the power plant samples with the CWRU test samples ensured that the good accuracy was not due to any bias of the classifier towards the normal class, which also led to a good F-measure value. Furthermore, when T30 was replaced by T30* for this dataset, we did not see any improvement in the accuracy and F-measure values. This might be due to the mostly single-Gaussian nature of the newly added normal vibration signals taken from the power plant.

The following points discuss and summarize the obtained results:
… the normal negative log likelihood for a single Gaussian (T30) is replaced by its GMM variant (T30*). This is because some signals have more than one distinct maximum, and a single Gaussian curve cannot model these signals accurately. Thus, instead of a single Gaussian curve, it is better to fit these signals with a Gaussian mixture model.
(4) The F-measure and G-Mean scores for all the rule-based classifiers are significantly high, confirming the potential of the chosen features to accurately classify the unbalanced dataset.
(5) On comparing the results with feature selection vs.
without feature selection, the latter performs better for both fault diagnosis and fault severity estimation, giving 99.63% and 99.12% test accuracies, respectively. Despite this observation, the good accuracies obtained with the selected features show that a major part of the high accuracy achieved without feature selection can be credited to the selected features. Hence, these selected features play a major role in the diagnostic ability of the classifiers.
(6) The efficacy of the selected time-domain features is further confirmed by their good performance in classifying samples taken from an operating power plant.
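For reference, the F-measure and G-Mean cited in point (4) can be computed from a confusion matrix as shown below (macro-averaged F-measure and the geometric mean of per-class recalls, a common G-Mean definition for unbalanced data; the matrix values are hypothetical):

```python
import numpy as np

def f_measure_and_g_mean(cm):
    """Macro-averaged F-measure and G-Mean (geometric mean of
    per-class recalls) from a confusion matrix cm[true, pred]."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    recall = tp / cm.sum(axis=1)       # per-class recall (rows = true)
    precision = tp / cm.sum(axis=0)    # per-class precision (cols = predicted)
    f1 = 2 * precision * recall / (precision + recall)
    return f1.mean(), recall.prod() ** (1.0 / len(recall))

# hypothetical unbalanced 3-class confusion matrix
cm = [[95, 3, 2],
      [4, 45, 1],
      [2, 1, 17]]
f1, gm = f_measure_and_g_mean(cm)
print(round(f1, 3), round(gm, 3))
```

Because the G-Mean multiplies per-class recalls, it drops sharply if any single class (even a rare one) is classified poorly, which is what makes it informative for unbalanced datasets.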

Conclusions
So far, research in bearing fault diagnosis and severity estimation has majorly focussed on improving the performance of classification algorithms. However, the applicability of these approaches mainly depends on the quality of the features extracted from the bearing signals. In light of this, this paper focussed instead on optimising the input features fed to these classification systems. We believe that if new input features are proposed and/or the input features are selected comprehensively through an in-depth analysis, they can contribute more to bearing fault diagnosis, covering all types of fault samples exhaustively. This, in turn, can make existing classification methods perform much better. To this end, this paper demonstrated the usefulness of Hjorth parameters and the normal negative log likelihood for a Gaussian mixture model (GMM) for rolling element ball bearing fault detection and severity estimation. An in-depth two-step approach, based on filter-based ranking with 3 metrics and feature subset selection with 11 search techniques, was developed to select an optimal feature subset for bearing fault type detection and severity estimation. Because filter-based selection techniques are model-agnostic, the selected features work efficiently with any machine learning classifier. In this paper, we used rule-based classifiers and their ensembles to validate our results in terms of diagnostic accuracy. The obtained results show that root mean square, geometric mean, rate of zero crossing, Hjorth parameter mobility, and normal negative log likelihood are important statistical features for bearing fault diagnosis. For fault severity detection, these five parameters together with three additional parameters, the 75th percentile, kurtosis factor, and RSSQ, collectively form an important and useful feature subset.
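The five features selected for fault type detection can be sketched under common definitions (this is an illustrative implementation, not the authors' exact code: the geometric mean is taken over absolute values, and the zero-crossing rate is the fraction of sign changes; the paper's conventions may differ):

```python
import numpy as np
from scipy.stats import norm

def selected_features(x):
    """RMS, geometric mean, rate of zero crossing, Hjorth mobility,
    and normal negative log likelihood for a single Gaussian."""
    rms = np.sqrt(np.mean(x ** 2))
    gmean = np.exp(np.mean(np.log(np.abs(x) + 1e-12)))          # of |x|
    zcr = np.mean(np.abs(np.diff(np.sign(x))) > 0)              # sign changes
    mobility = np.sqrt(np.var(np.diff(x)) / np.var(x))          # Hjorth mobility
    nll = -np.sum(norm.logpdf(x, loc=x.mean(), scale=x.std()))  # T30
    return rms, gmean, zcr, mobility, nll

x = np.random.default_rng(0).normal(size=1024)
print(selected_features(x))
```

Hjorth mobility, the ratio of the standard deviation of the first difference to that of the signal itself, acts as a proxy for the signal's dominant frequency, which is what makes it sensitive to fault-induced impacts.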
We also proposed a novel improvement to one of the selected high-ranked features, the normal negative log likelihood for a single Gaussian, by suggesting its replacement with its Gaussian mixture model (GMM) variant. This is because some vibration signals have more than one distinct maximum, and a single Gaussian curve cannot model these signals accurately. This also leads us to conclude that existing features can be improved to achieve the maximum possible coverage of the underlying data. The performance of the feature set was validated by running experiments under different conditions, with and without feature selection, and comparing their results. The class-unbalanced nature of the dataset showed that the good performance of the feature set does not depend on the class distribution of the dataset. Since in real-world scenarios one can rarely replicate the conditions of a class-balanced dataset, our results indicate that the selected features do not rely on class-balanced scenarios. In addition to the CWRU test rig dataset, the performance was further validated on a bearing dataset obtained from an operational power plant. The results show that the methodology is robust, aligns well with the real-world situation of class-unbalanced datasets, and achieves good performance for both fault diagnosis and fault severity estimation. We also believe that the results obtained from this research can encourage researchers to change their current method of developing diagnostic systems: first analysing the feature set comprehensively, and only then shifting focus to the classification step. For future studies, the selected high-performing statistical features obtained in this paper can be combined with features extracted from different domains to build a more robust multidomain feature set.
This work can also be extended by considering different types of bearing specifications, which will prevent the diagnostic system from becoming dependent on the vibration dataset of a specific bearing.

Data Availability
The Case Western Reserve University (CWRU) Bearing Dataset is available in [30]. The bearing data from the operational power plant were collected for academic research and are not publicly available.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.