The Fault Diagnosis of Rolling Bearing Based on Improved Deep Forest

Rolling bearing fault diagnosis is a meaningful and challenging task. Most methods first extract statistical features and then carry out fault diagnosis. At present, intelligent bearing fault identification mostly relies on deep neural networks, which place high demands on computing hardware and require great effort in hyperparameter tuning. To address these issues, a rolling bearing fault diagnosis method based on an improved deep forest algorithm is proposed. First, the fault feature information of the rolling bearing is extracted through multigrained scanning, and then fault diagnosis is carried out by the cascade forest. Considering the fitting quality and diversity of the classifiers, the classifiers and the cascade strategy are updated. To verify the effectiveness of the proposed method, it is compared with traditional machine learning methods. The results suggest that the proposed method can identify different types of faults more accurately and robustly. At the same time, it has very few hyperparameters and very low requirements on computer hardware.


Introduction
Rolling bearings are important basic components of mechanical equipment and are widely used in wind power generator sets, high-speed electric multiple units (EMUs), computerized numerical control (CNC) machine tools, and other equipment [1]. The rolling bearing is the core component of rotating machinery, and its failure can result in huge economic losses and threaten personal and property safety [2]. Therefore, it is necessary to accurately grasp the running status of rolling bearings, maintain damaged parts in time, and prevent faults from evolving into a greater threat. Accurately and effectively identifying the types of bearing fault and ensuring the normal operation of mechanical equipment are essential to improving the reliability of the system.
Vibration signals of rolling bearing faults are usually nonstationary and nonlinear [3,4]. The early bearing fault identification technologies are mainly based on time domain, frequency domain, and time-frequency signal analysis methods [5][6][7]. In general, fault features such as kurtosis, coefficient of variation, energy entropy, information entropy, and power spectrum entropy are extracted from the original signal, and fault identification is then carried out with a classification algorithm. Among traditional fault diagnosis methods, the more commonly used time-frequency analysis methods include the wavelet transform [8], empirical mode decomposition (EMD) [9], and variational mode decomposition (VMD) [10]. Zhao proposed a rolling bearing fault diagnosis method that combines wavelet packet decomposition (WPD) and multiscale permutation entropy (MPE). The vibration signals of the rolling bearing in different states are decomposed into a group of subfrequency signals by WPD, the average MPE value of each subfrequency signal is calculated as the input feature vector, and the fault modes of the rolling bearing are identified by a hidden Markov model (HMM) [11]. Zhang proposed an automatic fault diagnosis method for rolling bearings based on the lifting wavelet packet transform (LWPT), sample entropy, and classifier integration. In this method, the wavelet function is not constructed on the basis of the Fourier transform but is obtained in the time domain. At the same time, considering the unstable accuracy of a single classifier, an ensemble system that integrates a back propagation neural network (BPNN), a radial basis function neural network (RBFNN), and an Elman neural network (ElmanNN) is proposed to reduce the impact of initial parameters on the performance of the classifier [12].
Compared with the traditional wavelet transform, the lifting wavelet packet transform has the advantages of the flexibility of constructing the wavelet function, less computation, and less memory. Ensemble empirical mode decomposition (EEMD) [13] is an improved version of EMD. As a typical representative of the adaptive method for dealing with nonlinear and nonstationary data, EEMD has been widely applied in the field of fault diagnosis [14,15]. VMD is developed on the basis of EMD, which has a solid mathematical theoretical basis and has been proved to be superior to other adaptive data decomposition methods. It is widely used in fault diagnosis [16]. Chen et al. proposed the fault diagnosis of rolling bearing based on VMD and support vector machine (SVM), which significantly improved fault identification accuracy through multiscale fractal dimension and multiscale energy calculation features [17].
In addition, some shallow learning methods are closely combined with intelligent optimization algorithms for fault diagnosis. Dai proposed a fault identification method based on KICA-RBF [18]. In this model, kernel independent component analysis (KICA) is used to fuse multiple signals and eliminate noise, and a genetic algorithm is used to optimize the parameters of the radial basis function (RBF), which improves the accuracy. Zhao and Deng et al. improved a variety of optimization algorithms and proposed a data-driven feature extraction method, the fitting curve derivative method of maximum power spectrum density (FDMPD), which is combined with the kernel extreme learning machine (KELM) and weight application to failure times (WAFT) to effectively predict the remaining service life of rolling bearings [19][20][21][22]. Lv et al. proposed an improved particle swarm optimization (PSO) algorithm to optimize the parameters of a support vector machine for fault diagnosis of rolling bearings [23]. The PSO is improved by introducing a dynamic inertia weight, global neighborhood search, a population shrinkage factor, and a particle mutation probability. The method solves the problem of blind selection of the kernel function parameter and penalty factor of the SVM. Experimental results showed that the classification effect is more stable. Shallow learning algorithms require manual participation in feature engineering and have poor representation learning ability, while deep learning can effectively model high-level abstractions of data owing to its powerful nonlinear representation capability [24]. In recent years, with the development of artificial intelligence, deep learning has made breakthrough progress. The cross-domain application of deep learning in fault diagnosis has aroused great interest and achieved remarkable results [25][26][27][28]. Zhong et al. used EEMD to decompose intermittent fault signals into multiple intrinsic mode functions (IMFs), applied the Pearson correlation coefficient for feature optimization, and used a deep belief network (DBN) for fault diagnosis [29]. Guo et al. proposed a bearing fault diagnosis method based on a hierarchical learning rate adaptive deep convolutional neural network and achieved satisfactory results [30]. Xu et al. proposed a bearing fault diagnosis method based on a convolutional neural network (CNN) and random forest, which takes two-dimensional images of the continuous wavelet transform as input [31]. Multilevel features containing local and global information are used to diagnose bearing faults. The research indicated that this method is superior to the base deep learning method. Although fault diagnosis methods based on deep neural networks (DNNs) are powerful, their model complexity means that they need a large amount of training data and that their learning performance depends excessively on parameter optimization, which limits their applicability. In 2017, Zhou et al. proposed a method different from deep learning called multigrained cascade forest (gcForest), which generates a deep forest with a cascade structure for representation learning and is regarded as a decision tree ensemble approach [32]. gcForest is easier to analyze theoretically than a DNN. Some scholars have explored its application in the field of fault diagnosis. Hu et al. proposed a collaborative method combining a deep Boltzmann machine with multigrained scanning forest integration, which effectively solved the problem of industrial fault diagnosis under big data [33]. Liu et al. applied deep forest for the first time to the end-to-end intelligent diagnosis of hydraulic turbine faults [34]. Considering the diversity of the cascade forest classifiers and the classification performance of each classifier, this paper proposes an improved deep forest algorithm.
The mechanism changes the cascading mode based on the output of the multigrained scanning stage and replaces the classifiers in the cascade stage to increase the diversity and improve the performance of the classifiers. The main contributions of this paper are summarized as follows: (1) The classification is based on the original vibration signal data, unlike most of the existing literature, in which features are first extracted and then classified. The interference of human factors in the choice of feature window and feature type is avoided. (2) An improved deep forest algorithm is proposed: the idea of heterogeneous integration is introduced, and the cascade mode is changed to reduce the loss of sample information, which further improves the fault diagnosis accuracy of gcForest. The rest of this paper is organized as follows. Section 2 presents the basic principle of deep forest and its improved algorithm; Section 3 gives the results of the empirical analysis; Section 4 is the conclusion.

gcForest
gcForest, also known as deep forest (DF), is a supervised ensemble learning algorithm based on decision trees, which mainly consists of two parts: multigrained scanning (MGS) and cascade forest [32]. MGS solves the problem of high-dimensional input and enhances the difference of input features. Cascade forest can improve the classification ability of input features by simulating the structure of a DNN for representation learning.

Multigrained Scanning.
Multigrained scanning is mainly inspired by the CNN. The key point of the CNN is that different enhanced features can be obtained through convolutional kernels of different sizes. MGS draws on this idea to enhance the cascade forest [35]. Multigrained scanning locally samples the original data through a sliding window so as to obtain multiple feature instances of different dimensions. The process is described as follows. Let the input sample be S-dimensional, the sliding window size be K, and the sliding step size be L, and let M denote the number of generated feature vectors. Then we have the following equation:

$$M = \frac{S - K}{L} + 1.$$

After the multigrained scanning, each sample subset is input to a random forest and a completely random forest (CRF) for training, so that each forest produces an M × C-dimensional feature vector, where C is the number of categories. Finally, an M × C × 2-dimensional feature vector is obtained as the output of the multigrained scanning structure. Figure 1 illustrates the multigrained scanning process. Assuming that the raw data have 400-dimensional features and the sliding window is 100-dimensional, sliding scanning with a step of 1 produces 301 feature vectors. If this is a three-class problem, each forest will produce 301 three-dimensional class vectors. Finally, a 1806-dimensional transformed feature vector is obtained. Similarly, for 400-dimensional input data, sliding windows of 200 and 300 dimensions generate 1206-dimensional and 606-dimensional transformed feature vectors, respectively.
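The window arithmetic in this subsection can be checked with a short sketch. It assumes the relation M = (S - K)/L + 1 for the number of window positions, which is what the 400/100/301 example implies, and two forests (one RF and one CRF) per window size. The function names are illustrative and do not come from any gcForest implementation.

```python
def num_windows(s: int, k: int, step: int = 1) -> int:
    """Number of sliding-window positions, M = (S - K) // L + 1."""
    if k > s:
        raise ValueError("window size must not exceed the input size")
    return (s - k) // step + 1


def mgs_output_dim(s: int, k: int, n_classes: int,
                   n_forests: int = 2, step: int = 1) -> int:
    """Length of the transformed feature vector for one window size.

    Each of the n_forests forests emits one n_classes-dimensional class
    vector per window position, and all of them are concatenated.
    """
    return num_windows(s, k, step) * n_classes * n_forests


# Worked example from the text: 400 input features, three classes.
dims = {k: mgs_output_dim(400, k, n_classes=3) for k in (100, 200, 300)}
print(dims)
```

For the three window sizes used in the text, this reproduces the 1806-, 1206-, and 606-dimensional outputs.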

Cascade Forest.
The cascade forest stage embodies the process of deep learning through hierarchical representation learning of features. Each level in the cascade forest corresponds to a different scanning granularity. Each level receives the feature information from the previous level, processes it, and passes the result to the next level. Each level takes as its input a feature vector that concatenates the original input with the output of the previous level [32,35]. As shown in Figure 2, each level of the cascade contains two random forests and two completely random forests. Each forest is composed of multiple decision trees, and each tree randomly selects candidate features from the d input features. In a random forest, the candidate feature with the best Gini value is selected for splitting a node. In a completely random forest, the splitting feature at each node is chosen at random, and each tree keeps growing until every leaf node contains only samples of the same class. For each tree, the classification result is given by the class distribution of the leaf node into which the sample falls. The distributions of all the trees in a forest are then averaged to obtain an estimate of the class distribution. To reduce the risk of overfitting, cross-validation is used during training.
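As a toy illustration of the last two steps described above, the sketch below averages per-tree class distributions into one class vector per forest and then concatenates the class vectors of all forests with the input features. The numbers are made up, and the helper names are hypothetical; no tree is actually trained here.

```python
from typing import List, Sequence


def forest_class_vector(tree_distributions: Sequence[Sequence[float]]) -> List[float]:
    """Average the class distributions predicted by the trees of one forest."""
    n_trees = len(tree_distributions)
    n_classes = len(tree_distributions[0])
    return [sum(tree[c] for tree in tree_distributions) / n_trees
            for c in range(n_classes)]


def level_output(input_features: Sequence[float],
                 forests: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Concatenate the input features with every forest's class vector."""
    out = list(input_features)
    for tree_distributions in forests:
        out.extend(forest_class_vector(tree_distributions))
    return out


# Toy forest: two trees, three classes (values are illustrative only).
toy_forest = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
class_vector = forest_class_vector(toy_forest)

# A level with four forests adds 4 * 3 = 12 values to the input vector.
next_input = level_output([0.1, 0.2], [toy_forest] * 4)
print(class_vector, len(next_input))
```

In a real cascade the per-tree distributions would come from fitted trees evaluated under cross-validation, as the text notes.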

Overall Process of Deep Forest.
Combining multigrained scanning and the cascade forest gives the overall flow of the deep forest, as shown in Figure 3. Suppose that the original input has 400 features and that three sliding windows of length 100, 200, and 300 are used for multigrained scanning [36], with a sliding step of 1. As stated in Section 2.1, the feature vectors produced by MGS are input to an RF and a CRF for training, which yields 1806-dimensional, 1206-dimensional, and 606-dimensional feature vectors, respectively; these are used for training the cascade forest. Taking the 100-dimensional sliding window as an example, suppose there are three classes. The 1806-dimensional features are used to train the four forest classifiers, each of which outputs a three-dimensional class vector, giving 12 new dimensions. These are concatenated with the 1806-dimensional feature vector obtained from the scanning transformation, so the first level produces an 1818-dimensional feature vector, which is the input to the second level. Similarly, the second level produces a 1218-dimensional feature vector after concatenation, which is used as the input for training the third level. The third level generates a 618-dimensional vector, which is used as the input to the next level. This process is repeated until there is no significant performance gain, at which point training stops.
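The dimension bookkeeping above follows from adding one three-dimensional class vector per forest (four forests, so 12 values) to each scanned feature vector. The sketch below reproduces that arithmetic and adds a simple stopping rule. The rule (stop at the first level whose validation score does not exceed the best so far by more than `tol`) and the example scores are assumptions for illustration, since the text only says that growth stops when there is no significant gain.

```python
N_CLASSES = 3
N_FORESTS_PER_LEVEL = 4

# Feature lengths after multigrained scanning with windows 100, 200, 300.
mgs_dims = [1806, 1206, 606]

# Each level appends one class vector per forest to its scanned features.
augmented_dims = [d + N_FORESTS_PER_LEVEL * N_CLASSES for d in mgs_dims]
print(augmented_dims)


def levels_to_keep(level_scores, tol=0.0):
    """Count the levels kept before the validation score stops improving.

    A level is kept only if its score beats the best previous score by
    more than tol (hypothetical criterion, not taken from the paper).
    """
    best = float("-inf")
    kept = 0
    for score in level_scores:
        if score <= best + tol:
            break
        best = score
        kept += 1
    return kept


# Hypothetical validation accuracies per level.
print(levels_to_keep([0.90, 0.93, 0.93, 0.95]))
```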

Improved Deep Forest
Model. DF is an ensemble learning algorithm. To build an ensemble with strong generalization ability, the individual learners should be "good but different." Layer-by-layer training of the cascade forest can enhance the representation ability of feature information, and it is very important for DF ensemble learning to adopt different classifiers in each layer.
In this study, a cascade forest based on multiple heterogeneous classifiers is proposed, and the classifiers of each level are set to RF, ET, XGBoost, and LightGBM. The change is illustrated in Figure 4. The combination of multiple different types of forest classifiers can fully learn the feature information of the input feature vectors and improve the overall performance of the model. The most important characteristics of RF are sample randomness and feature randomness. Drawing different training sets and randomly selecting features for training increases the difference between classification models, which can effectively avoid overfitting. At the same time, the parallel computing mechanism of the RF algorithm greatly reduces the training time [37].
ET is the abbreviated form of the extremely randomized trees model. Each time, all the samples are used for training, and features are randomly selected. Since splitting is random, the results obtained by ET are better than those obtained by RF to some extent [38].
XGBoost (eXtreme gradient boosting) is a supervised learning algorithm based on decision tree, which is proposed on the basis of gradient boosting decision tree (GBDT) [39].
XGBoost generates a new tree at each iteration. Compared with GBDT, XGBoost adds a regularization term to the loss function to control the complexity of the model. XGBoost also supports linear classifiers and borrows column sampling from RF, which not only reduces overfitting but also reduces computation; this distinguishes XGBoost from GBDT.
XGBoost has been widely used because of its advantages of parallel processing, simple model structure, small computation amount, and high accuracy [40].
LightGBM is an improvement of GBDT, designed mainly to address the decline in training efficiency when the GBDT algorithm deals with large amounts of data. LightGBM improves the GBDT algorithm in two technical respects. (1) To handle large amounts of data, gradient-based one-side sampling (GOSS) is used. The samples are subsampled according to their gradients: the samples with large gradients are retained, and the samples with small gradients are randomly sampled, which reduces the amount of data used. (2) A leaf-wise splitting method with a depth limitation is used to replace the traditional level-wise splitting method. Each time, the leaf with the largest splitting gain is split, which generates a more complex decision tree, reduces the error, and improves the accuracy of the algorithm. These two techniques greatly reduce the time cost and accelerate the training process, and a large number of experimental studies have shown that LightGBM also performs better in terms of accuracy. Therefore, XGBoost and LightGBM are selected in this paper to replace the two original classifiers in the cascade forest [41].
On the other hand, in gcForest the probability features and the original input features are concatenated into the input vector, which effectively prevents overfitting. However, as the model depth increases, this sparse connection structure may cause a large amount of information to be discarded, which may hinder the diversity of the ensemble. Inspired by the dense cascade forest model [42], we improve gcForest as follows. As shown in Figure 4, a sublevel is added to each level of the cascade. In the first level of the original cascade, there are three sublevels, called Level1_A, Level1_B, and Level1_C. The additional sublevel is created by concatenating all the features together. Taking the example in Section 2.3, this concatenated feature is 3618-dimensional (1806 + 1206 + 606); it is input into the classifiers to obtain the probability class vector, which is concatenated with the 3618-dimensional features to form Level1_A. The original 1806-dimensional, 1206-dimensional, and 606-dimensional features are then used in Level1_B, Level1_C, and Level1_D, respectively. The process at the second level is similar to that at the first level; cascading all the features retains more information about the original sample. In the experiments, we find that this structure makes the training process more stable.
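The layout of the first level in the improved cascade can be sketched as follows, reusing the scanned feature lengths from the 400-feature example. The dictionary keys and the `Level1_A` to `Level1_D` labels follow the naming used above; the data structure itself is only an illustrative assumption about how the sublevels might be enumerated.

```python
# Scanned feature lengths for window sizes 100, 200, and 300 (three classes).
mgs_dims = {100: 1806, 200: 1206, 300: 606}

# The added dense sublevel sees all scanned features concatenated together.
dense_dim = sum(mgs_dims.values())

# The dense sublevel comes first, followed by one sublevel per granularity.
sublevels = [("Level1_A", dense_dim)] + [
    (name, dim)
    for name, dim in zip(("Level1_B", "Level1_C", "Level1_D"), mgs_dims.values())
]

for name, dim in sublevels:
    print(f"{name}: {dim}-dimensional scanned input")
```

Placing the dense sublevel first means every later sublevel can still receive class vectors computed from the full set of scanned features.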

Data Sources.
The data in this study come from the public bearing data set of Case Western Reserve University [43] in the United States. The data were collected from a test rig consisting of a motor, a torque sensor, and a power tester. The experimental platform is shown in Figure 5. The bearing faults are manufactured by electric spark technology. The motor load ranges from 0 to 3 horsepower, and the corresponding motor speed ranges from 1797 to 1730 rpm. The vibration signals include the normal data, the drive end acceleration data, the fan end acceleration data, and the base data. This paper uses only the fault data of the drive end for analysis, and the sampling frequency is 12 kHz. Four different fault diameters were introduced for the inner raceway, outer raceway, and rolling element. The fault diameter ranges from 0.007 inches to 0.028 inches. Because data with a fault diameter of 0.028 inches are missing for some fault types, only the data for 0.007 inches, 0.014 inches, and 0.021 inches are retained. The rolling bearing condition can be divided into four kinds: normal state (NOR), inner race fault (IRF), ball fault (BF), and outer race fault (ORF). Figure 6 shows the three types of bearing fault. Partial vibration signals of the four conditions are shown in Figure 7.
In this study, we design two experiments for fault diagnosis of rolling bearings. In Experiment 1, the collected data are divided into four categories: normal state, inner race fault, ball fault, and outer race fault. The data under each condition comprise 1,460,000 data points. In Experiment 2, the fault data are further divided according to the three fault diameters. There are three fault states for each fault diameter, plus the normal state. Therefore, the data are divided into 10 categories, and each category has 480,000 sampling points. Each type of data set is randomly divided into a training set and a test set in an 8 : 2 ratio.
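The random 8 : 2 split can be sketched as below. The segment length of 400 points is an assumption borrowed from the 400-feature example in Section 2, since this section does not state how the raw signal is cut into samples; the placeholder signal, the helper names, and the fixed seed are likewise illustrative.

```python
import random

SEGMENT_LENGTH = 400        # assumed sample length, not stated in this section
POINTS_PER_CLASS = 480_000  # sampling points per category in Experiment 2


def segment(signal, length):
    """Cut a 1-D signal into consecutive, non-overlapping segments."""
    n_segments = len(signal) // length
    return [signal[i * length:(i + 1) * length] for i in range(n_segments)]


def split_8_2(samples, seed=0):
    """Randomly split samples into a training set (80%) and a test set (20%)."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = len(indices) * 8 // 10
    train = [samples[i] for i in indices[:cut]]
    test = [samples[i] for i in indices[cut:]]
    return train, test


# Placeholder standing in for one category's vibration signal.
signal = list(range(POINTS_PER_CLASS))
samples = segment(signal, SEGMENT_LENGTH)
train, test = split_8_2(samples)
print(len(samples), len(train), len(test))
```

Under these assumptions each category yields 1200 samples, of which 960 go to training and 240 to testing.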

Performance Evaluation and Parameters Setting.
To evaluate the generalization ability of the model, 8-fold cross-validation is adopted. Accuracy and macroaverage are used to evaluate the performance of the algorithm. Accuracy is the proportion of correctly predicted samples among all samples. The parameters of each model are set as follows:
(1) RF: n_estimators = 10, max_depth = 10, criterion = "gini", min_samples_split = 2, and min_samples_leaf = 10
(2) ET: n_estimators = 10, min_samples_split = 2, max_depth = 10, and min_samples_leaf = 10
(3) XGB: max_depth = 10, learning_rate = 0.1, and n_estimators = 10
(4) LGBM: learning_rate = 0.1, n_estimators = 10, and max_depth = 10
(5) SVM: C = 1, kernel = "rbf", type = "Classification", gamma = "scale", and tol = 1e-3
(6) LSTM: activation = softmax, loss = categorical_crossentropy, optimizer = Adam, epochs = 500, and batch_size = 30
(7)
This fully shows that the recognition accuracy and robustness of the proposed method are greatly improved. The models based on deep forest are much more accurate than all other learning methods, which shows the effectiveness of the ensemble learning model based on tree ensembles. In this experiment, the accuracy of the LSTM model is higher than that of the shallow learning algorithms other than SVM, owing to its strong nonlinear representation ability, but it is close to that of the SVM. A possible reason is that the performance of deep learning models depends strongly on their parameters; the parameter settings for each model are described in Section 3.2. Figure 8 shows the confusion matrices of gcForest and the proposed method on the four-class test set. The diagonal elements of each matrix represent the recall rate for each fault mode. It can be seen from Figure 8 that the proposed method can fully identify the normal condition and the outer race fault, and that it misdiagnoses fewer inner race faults and ball faults than gcForest.
The F1-score is the harmonic mean of precision and recall, and it is a good comprehensive evaluation index. Figure 9 compares the F1-scores of the proposed method and the base models on the four-class test set. The F1-score of the proposed method equals that of gcForest for the normal condition and is higher than that of the other base models for the other three fault types; the F1-score of most base models is less than 80%. These results further confirm the performance of the proposed method. The macroaverage is the mean of an evaluation index calculated independently for each label. The three macroaverage indicators of the different methods are shown in Figure 10, which shows that our method achieves the highest value in macroaverage precision, macroaverage recall, and macroaverage F1-score. This shows that the proposed method has the best effect on the diagnosis of four types of faults.
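The evaluation indices used here (accuracy, per-class precision, recall and F1-score, and their macroaverages) can all be derived from a confusion matrix. The following is a minimal sketch using a small made-up two-class matrix rather than any result from the paper; rows are assumed to be true labels and columns predicted labels.

```python
def accuracy(cm):
    """Proportion of correctly predicted samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_metrics(cm):
    """Return (precision, recall, F1) for every class of a confusion matrix."""
    n = len(cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # true class c, predicted other
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, true class other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


def macro_average(metrics):
    """Unweighted mean of each index over all classes."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics) / n for i in range(3))


# Illustrative confusion matrix (not taken from the experiments).
cm = [[8, 2],
      [0, 10]]
macro_p, macro_r, macro_f1 = macro_average(per_class_metrics(cm))
print(accuracy(cm), macro_p, macro_r, macro_f1)
```

Because the macroaverage weights every class equally, a class with few samples influences it as much as a class with many.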
In Experiment 2, rolling bearing faults are divided according to three different diameters. Together with bearings in the normal state, the data are divided into 10 categories for the experiments. The experimental results compared with the base models are shown in Table 2. Table 2 shows that the accuracy of the deep forest-based learning algorithms is higher than that of the other algorithms. Compared with the other 7 models, the proposed method still achieves the highest accuracy. Compared with gcForest, which has the highest accuracy among the base models, the training accuracy and testing accuracy are improved by 2.74% and 2%, respectively. Figure 11 compares the confusion matrices of the gcForest model and the proposed method on the 10-class test set. As can be seen from the figure, the proposed method can fully identify six categories: NOR, IRF and ORF with a diameter of 0.007 inches, IRF and ORF with a diameter of 0.021 inches, and IRF with a diameter of 0.014 inches. Although the misjudgment rate for the ball fault with a diameter of 0.007 inches is slightly higher than that of gcForest, the recognition accuracy for the remaining three conditions is higher than that of gcForest; in particular, the identification accuracy for the ball fault with a diameter of 0.021 inches increases considerably. Figure 12 compares the F1-scores of the proposed method and the base models on the test set in the 10-class case. The F1-score of the proposed method is high for most types. The best diagnostic results are achieved for all inner race faults, the normal condition, and the outer race faults with diameters of 0.007 inches and 0.021 inches. For ball faults with diameters of 0.014 inches and 0.021 inches and outer race faults with a diameter of 0.014 inches, the F1-score is higher than that of the other base models; only for ball faults with a diameter of 0.007 inches is it slightly lower than that of gcForest. The F1-score of most base models is around 40%.
These results further demonstrate the superior performance of the proposed method. Figure 13 shows the three macroaverage index values of the different methods. According to Figure 13, our method achieves the highest value in macroaverage precision, macroaverage recall, and macroaverage F1-score. The macroaverage precision of the proposed method is 57% higher than that of the lowest method (ET) and 2% higher than that of the highest base model (gcForest). Similarly, the macroaverage recall and macroaverage F1-score of the proposed method are 56% and 57% higher than those of the lowest method and 1% higher than those of the highest method. This shows that the proposed method has the best diagnostic performance on 10 types of faults.
In addition, to further illustrate the performance of the proposed method, we designed Experiment 3. In Experiment 3, the two innovations of the proposed method are analyzed separately for the two classification situations and compared with existing methods in the literature [36]. The experimental results are shown in Tables 3 and 4, respectively. Table 3 compares the training accuracy and testing accuracy in the 4-class situation, and Table 4 compares the training accuracy and testing accuracy in the 10-class situation.
"Only improve the cascade" means that, as described in Section 2.4, a sublevel is added to each cascade level; it is composed of the full concatenation of the class vectors and is placed at the first sublevel of each level, while the learners remain the same as in the original deep forest. "Only replace the learner" means that the learners are replaced, while the cascade mode after multigrained scanning remains the same as in the original deep forest. Tables 3 and 4 show that the accuracy of the proposed method is the highest. In Table 4, the training and testing accuracies are 98.05% and 96.99%, respectively. These are 4.86% and 1.25% better than those of the method that only improves the cascade and 0.76% and 0.75% better than those of the method that only replaces the learner. The training accuracy and the testing accuracy are also higher than those of existing methods in the literature, which demonstrates the effectiveness of the proposed method for classification in rolling bearing fault diagnosis.

Conclusion
This paper proposed an improved method of rolling bearing fault diagnosis based on deep forest. CWRU bearing vibration signals are used to verify the effectiveness of this method through two different groups of experiments. Connecting multiple scan granularities to add a sublevel at each level of the cascade forest reduces the loss of information flow. In addition, the adoption of more efficient tree model learners not only increases the diversity of classifiers but also helps to improve the recognition rate of fault diagnosis. The analysis results suggest that the detection rates on the test sets under the two classifications are 98.54% and 96.99%, respectively, which are higher than those of the other base models. The results indicate that the improved deep forest model has high recognition ability and robustness for bearing faults.
This study has a few limitations. Although the proposed method obtains better fault detection performance, it is relatively time-consuming because multiple granularity connections are added for cascading. To address this problem, feature optimization in the cascade layer will be considered in future research so that faults can be detected efficiently and accurately.

Data Availability
The data of rolling bearing are from the website of the Case Western Reserve University bearing data center, and they are all available at http://csegroups.case.edu/bearingdatacenter/pages/downloaddata-file.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.