Fault Diagnosis for Rotating Machinery Based on Convolutional Neural Network and Empirical Mode Decomposition

The analysis of vibration signals is an important technique for fault diagnosis and health management of rotating machinery. Classic fault diagnosis methods are mainly based on traditional signal features such as mean value, standard deviation, and kurtosis, which leave much of the information in the signals unused. In this paper, a new approach is proposed for rotating machinery fault diagnosis, with a feature extraction algorithm based on empirical mode decomposition (EMD) and convolutional neural network (CNN) techniques. The fundamental purpose of the proposed approach is to extract distinguishing features. The frequency spectrum of the signal, obtained through the fast Fourier transform, is fed to a purpose-designed CNN to extract compressed features with spatial information. To handle the nonstationary character of the signals, we also apply EMD to the original vibration signals. The EMD energy entropy is calculated using the first few intrinsic mode functions (IMFs), which contain most of the energy. With the features extracted by both methods combined, classification models are trained for diagnosis. We carried out experiments with vibration data of 52 different categories under different machine conditions to test the validity of the approach, and the results indicate that it is more accurate and reliable than previous approaches.


Introduction
Rolling-element bearings (REBs) are among the most fundamental and important components of rotating machines in industrial manufacturing and agricultural production. Therefore, the analysis of REB vibration signals is considered an important approach in fault diagnosis and condition monitoring. A minor defect in a rolling bearing may lead to the breakdown of the entire system and cause severe financial losses.
Rolling-element bearings usually generate vibration signals, which contain rich information that may assist in condition monitoring, fault diagnosis, and machine health management. Research on bearing fault diagnosis has received extensive attention over the years and is becoming more important in modern industry because of the need for higher reliability and a lower risk of loss.
Essentially, fault diagnosis is a pattern recognition problem with two major steps: feature extraction and classification. Traditional features of vibration signals are generated by three main kinds of methods. Time domain analysis and frequency domain analysis are most commonly used in feature extraction, and their combination, known as time-frequency domain analysis, is another significant method. Time domain features have long been used in fault diagnosis for rotating machinery [1]. Most time domain features are statistical features such as mean value, root mean square, standard deviation, kurtosis, and skewness. They are generally easy to calculate and are therefore used to train different classifier models for fault diagnosis. Hu et al. [2] and Sreejith et al. [3] combined time domain features with an artificial neural network (ANN) for bearing fault diagnosis. Another machine learning technique, the support vector machine (SVM), is applied in [4]. Chang et al. [5] summarized other time domain features used in fault diagnosis.
The analysis of the frequency spectrum of the vibration signals is the basis of frequency domain analysis. The fundamental frequencies of the signals are calculated through the fast Fourier transform. Usually the significant frequencies and the corresponding amplitudes are chosen manually as fault diagnosis features. Frequency domain features are applied with different methods in [6][7][8]. Time domain features and frequency domain features reflect different characteristics of the vibration signals, so fault diagnosis methods generally use both as classification features. In [9], time domain features and frequency domain features were combined using information fusion, and an ANN model was trained for fault diagnosis. Cao et al. [10] trained an SVM model with features extracted using the PCA method. Other experiments that take advantage of both domains are reported in [11,12].
Time-frequency methods are usually effective in extracting features from the original rotating machinery signals. Because most vibration signals are nonstationary, other analysis methods have also been introduced. The wavelet transform is one of the most useful signal analysis methods. Effective results of applying the wavelet transform are shown in [18,19].
Although traditional analysis methods are mostly effective, some fundamental mathematical models usually need to be established before they can be applied to the original signals. For instance, the fundamental frequencies need to be selected manually, and the bandwidth of the filters used to preprocess the signals is chosen from expert experience. In real rolling-element bearing systems, signals are more complex, and parameters may be hard to extract or determine.
As a time-frequency analysis technique, empirical mode decomposition (EMD) is a powerful tool for signal analysis. The EMD analysis process is not based on predetermined parameters but takes the local time scales of the signals into consideration [20]. In an EMD procedure, the vibration signal of a rotating machine is decomposed into a set of intrinsic mode functions (IMFs). Each IMF may be considered a basis function of the signal. When the vibration signals are nonlinear and nonstationary, EMD may perform better than traditional techniques. EMD is also a self-adaptive processing method, which means less manual work.
Most of the feature extraction methods mentioned above focus on utilizing signal characteristics instead of modeling the signal itself. However, vibration signals still contain rich information. Recently, machine learning techniques, especially neural networks, have been widely used in feature engineering. Deep learning is a machine learning method proposed in 2006 [21]. The special structure of a deep neural network (DNN) makes it possible to extract features that represent the original signals [22]. The performance of DNNs has been state of the art in many applications, such as computer vision and natural language processing [23,24]. Researchers have applied DNNs to fault diagnosis as well [25][26][27][28]. Verma et al. [27] proposed a condition monitoring method using a sparse autoencoder. In [25], Tagawa et al. built a model based on a denoising autoencoder for car fault diagnosis.
The convolutional neural network (CNN) is an important machine learning technique. It is a deep neural network structure that mainly focuses on image processing. Like other neural network structures, a CNN is formed by a number of neurons, which are organized so that each responds to a different, overlapping part of the whole input field. CNNs have been used for image classification and segmentation with effective results [29,30].
In this paper, EMD and CNN are both applied as feature extraction methods, and a complete structure for fault diagnosis of rolling-element bearings is designed and trained. The rest of the paper is organized as follows. In Section 2, a literature review of CNN and EMD applications is given. Details of the CNN and EMD methods and the complete structure of our approach are also described and discussed. In Section 3, the validity of the proposed approach for REB fault diagnosis is verified by the experiments we carried out. In addition, the experimental results are compared with other analysis methods. Finally, the conclusion of this paper is drawn in Section 4.

Methodology
In this section, details of the CNN and EMD methods are introduced first, after which the complete structure of our approach is described and discussed. A number of CNN structures have been developed in recent years, such as LeNet, GoogLeNet, and AlexNet. Figure 1 shows a typical structure of the LeNet model. Applications in image recognition, video analysis, and natural language processing also show the effectiveness of the CNN model [31][32][33][34][35].
A CNN structure is made up of three types of layers: convolutional layers, subsampling layers, and a fully connected layer with a loss function such as an SVM or softmax classifier [36]. A typical CNN structure can therefore be divided into two parts. The convolutional layers and subsampling layers work as the feature extractor, while the last layer works as a classifier.
A convolutional layer is the most important and fundamental component of a CNN structure. Each neuron in a convolutional layer receives inputs from a restricted region of the whole signal. The weights and biases of the convolutional layer are considered as a group of convolution kernels (or filters). A kernel takes only a relatively small region of the signal into consideration and projects the whole signal to a new feature map; that is, the dot product between the signal and each kernel is calculated repeatedly. Since the replicated kernel shares the same parameters, the number of network parameters is relatively small.
An $m \times m \times r$ signal array is input to a convolutional layer in the extractor part of the CNN, where $m$ is the height and width of the input signal (in general the height and width are the same) and $r$ is the number of channels of the input. A convolutional layer has $k$ filters (kernels) of size $n \times n$, where $n$ is usually less than half of the input height $m$. Each filter takes a relatively small local region of the input signal into consideration and projects the whole signal to a new feature map; that is, the dot product between the signal and each kernel is calculated repeatedly. In total, $k$ feature maps of size $m - n + 1$ are generated. Each feature map is then generally subsampled over contiguous $p \times p$ areas. Subsampling techniques include average pooling and maximum pooling, depending on how a restricted area is summarized. The pooling areas may also overlap.
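The size relation in this paragraph can be illustrated with a small NumPy sketch. It assumes a single-channel input, stride 1, and no padding; the function name `conv2d_valid`, the random data, and the choice of four 5 × 5 kernels are hypothetical and are not taken from the paper's configuration.

```python
import numpy as np

def conv2d_valid(x, kernels):
    """Slide each n x n kernel over an m x m input (stride 1, no padding).

    x: array of shape (m, m); kernels: array of shape (k, n, n).
    Returns k feature maps, each of shape (m - n + 1, m - n + 1).
    """
    m = x.shape[0]
    k, n, _ = kernels.shape
    out = m - n + 1
    maps = np.zeros((k, out, out))
    for f in range(k):
        for i in range(out):
            for j in range(out):
                # Dot product between the kernel and one local region.
                maps[f, i, j] = np.sum(x[i:i + n, j:j + n] * kernels[f])
    return maps

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 50))         # 50 x 50 single-channel input
kernels = rng.standard_normal((4, 5, 5))  # assumed: k = 4 kernels, n = 5
maps = conv2d_valid(x, kernels)
print(maps.shape)  # (4, 46, 46), i.e. m - n + 1 = 46
```

The explicit loops are kept for readability; a practical implementation would rely on a vectorized or library convolution.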
The convolutional layer is used for extracting signal features, and the pooling layer reduces the computation cost. After feature extraction, the extracted features are usually passed to a classifier. In this paper, the CNN is used only as a feature extractor for fault diagnosis, and classification is performed after the CNN features are combined with the other time-frequency domain features.
Figure 2 presents the CNN structure used in this paper. Consider vibration signals $x$ as the input signals and $y$ as the labels of the signals. In the convolutional layer, a set of feature maps is obtained by applying different filters. Each output feature map is the result of convolving multiple input feature maps. The process is calculated as follows:

$$x_j^l = f\Big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\Big),$$

where $M_j$ represents the selection of input feature maps, $l$ is the $l$th layer of the network, $k_{ij}^l$ is a convolutional filter connecting the $(l-1)$th layer to the $l$th layer, $f$ is a nonlinear activation function, $x_j^l$ is the $j$th feature map of layer $l$, generated from the feature maps $x_i^{l-1}$ of the $(l-1)$th layer, and $b_j^l$ is the additive bias given to each output feature map.
The traditional nonlinear activation function $f$ used in neural networks is the sigmoid function $f(x) = 1/(1 + e^{-x})$, but because it suffers from vanishing gradients, the ReLU (rectified linear unit) function is generally used in deep learning methods. The expression of the ReLU function is $f(x) = \max(0, x)$. Besides alleviating the vanishing-gradient problem in the back propagation steps of training, the ReLU function requires much less computation. The outputs of some neurons are zero under ReLU, which makes the network sparse and helps reduce overfitting.
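To make the comparison of the two activation functions concrete, the sketch below evaluates both functions and their derivatives on a few sample inputs. The helper names are hypothetical, and the derivative of ReLU at zero is taken as 0 by convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Convention assumed here: the derivative at x = 0 is taken as 0.
    return (x > 0).astype(float)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid_grad(x))  # at most 0.25, and tiny for |x| = 10
print(relu_grad(x))     # exactly 0 or 1
print(relu(x))          # negative inputs map to zero (sparse outputs)
```

Because the sigmoid derivative never exceeds 0.25, repeated multiplication through many layers shrinks the gradient, whereas the ReLU derivative is exactly 1 for every active unit.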
A subsampling layer is calculated as follows:

$$x_j^l = f\big(\beta_j^l \, \mathrm{down}(x_j^{l-1}) + b_j^l\big),$$

where $\beta_j^l$ and $b_j^l$ are the multiplicative bias and additive bias, and $\mathrm{down}(\cdot)$ represents a subsampling function; common subsampling functions are max pooling and average pooling. In max pooling, the maximum of the restricted region is chosen as the new feature, while in average pooling the mean value of the same region is calculated as the new feature. Generally speaking, max pooling reflects the most significant characteristic, while average pooling smooths the region and passes the smoothed feature to the following layers.
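The difference between the two subsampling functions can be seen on a small worked example. The sketch below covers only the subsampling step over non-overlapping regions and omits the multiplicative bias, additive bias, and activation; the function name `pool2d` and the 4 × 4 input are hypothetical.

```python
import numpy as np

def pool2d(fmap, p, mode="max"):
    """Subsample one feature map over non-overlapping p x p regions."""
    h = fmap.shape[0] // p
    w = fmap.shape[1] // p
    # Group the map into (h, p, w, p) blocks, then reduce each block.
    blocks = fmap[:h * p, :w * p].reshape(h, p, w, p)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fmap = np.array([[1.0, 2.0, 0.0, 1.0],
                 [3.0, 4.0, 1.0, 0.0],
                 [0.0, 0.0, 5.0, 6.0],
                 [1.0, 1.0, 7.0, 8.0]])
print(pool2d(fmap, 2, "max"))      # [[4. 1.] [1. 8.]]
print(pool2d(fmap, 2, "average"))  # [[2.5 0.5] [0.5 6.5]]
```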
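The CNN method has the advantage of extracting features automatically through the back propagation (BP) steps. The BP algorithm calculates the gradient of the loss function with respect to all the weights in all the layers. The mean-squared error (MSE) of the output layer is expressed as follows:

$$E = \frac{1}{2} \sum_{n=1}^{N} \lVert t^n - y^n \rVert^2,$$

where $t^n$ and $y^n$ denote the target and the network output for the $n$th of $N$ training samples. The objective is to minimize the error by adjusting the network parameters. We calculate the derivative of the MSE to perform gradient descent on the weight $k_{ij}^l$ and bias $b_j^l$ of each neuron. The sensitivities of the error are as follows:

$$\delta_j^l = \frac{\partial E}{\partial b_j^l} = \frac{\partial E}{\partial u_j^l} \frac{\partial u_j^l}{\partial b_j^l} = \frac{\partial E}{\partial u_j^l},$$

where $u_j^l = \sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l$. The sensitivities of layer $l$ are calculated from those of the higher layer $l+1$ using the chain rule as

$$\delta^l = (W^{l+1})^{T} \delta^{l+1} \circ f'(u^l),$$

where $W^{l+1}$ collects the weights of layer $l+1$ and $\circ$ denotes element-wise multiplication. The update of the weights is then calculated as follows:

$$\Delta W^l = -\eta \frac{\partial E}{\partial W^l},$$

where $\eta$ is the learning rate. The calculations of the sensitivities for convolutional layers and subsampling layers differ, and we do not discuss the details in this paper.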
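As a minimal illustration of the gradient descent update, the sketch below trains a single fully connected sigmoid layer on one sample with a squared-error loss. This is a deliberately simplified stand-in for the CNN in the paper: the layer sizes, the learning rate of 0.1, and all function names are assumptions made for the example.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(W, b, x):
    u = W @ x + b            # pre-activation
    return u, sigmoid(u)     # and layer output

def loss(W, b, x, t):
    _, y = forward(W, b, x)
    return 0.5 * np.sum((t - y) ** 2)

def gradients(W, b, x, t):
    _, y = forward(W, b, x)
    delta = (y - t) * y * (1.0 - y)   # sensitivity dE/du, equal to dE/db
    return np.outer(delta, x), delta  # dE/dW and dE/db

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
b = np.zeros(3)
x = rng.standard_normal(4)
t = np.array([1.0, 0.0, 0.0])

eta = 0.1                             # assumed learning rate
before = loss(W, b, x, t)
for _ in range(100):
    dW, db = gradients(W, b, x, t)
    W -= eta * dW                     # Delta W = -eta * dE/dW
    b -= eta * db
after = loss(W, b, x, t)
print(before, after)                  # the loss decreases
```

In a multilayer network the same update is applied to every layer, with the sensitivities propagated backwards through the chain rule.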
In our proposed approach, the CNN structure consists of 4 convolutional layers and 2 subsampling layers; detailed parameters are shown in Table 6.

Empirical Mode Decomposition.
The empirical mode decomposition method was first developed by Huang et al. in 1998 [37]. Unlike other signal analysis methods, which transform a signal into a certain mode, the EMD method focuses on the natural scale and character of the original signal.
In the EMD process, the original vibration signal is decomposed into a number of different components that reflect different intrinsic characteristics of the signal. The energy entropy of the IMFs contains information about the signal and can be extracted as a measurement for fault diagnosis. EMD is superior to traditional signal analysis approaches when the signal to be analyzed is nonlinear or nonstationary. In addition, EMD is a self-adaptive processing method, which means that little manual operation is needed.
Since EMD was developed, it has been widely studied in various domains, such as process control [38], voice recognition [39], and system identification [40]. The decomposition result of a simple sample signal is shown in Figure 3.
The fundamental assumption of the EMD method is that a signal sequence is a combination of several different components. In EMD, these components are known as intrinsic mode functions. In each IMF, the number of extrema and the number of zero-crossings are the same or differ by at most one. Another premise of EMD is that between two contiguous zero-crossings there is only one extremum [41].
As shown in Figure 3 and mentioned above, the following conditions should be satisfied for IMFs:
(1) In each complete IMF, the difference between the number of extrema and the number of zero-crossings should be less than or equal to one.
(2) In the process of EMD, two envelopes are defined: the upper envelope is defined by the local maxima and the lower envelope by the local minima. At every point of an IMF, the mean value of the two envelopes should be zero.
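Condition (1) can be checked numerically by counting sign changes. The sketch below is one simple way to do so on sampled data; the function names and the 5 Hz test tone are hypothetical, and exact zeros and flat segments are simply skipped.

```python
import numpy as np

def count_zero_crossings(x):
    s = np.sign(x)
    s = s[s != 0]                        # skip samples that are exactly zero
    return int(np.sum(s[:-1] != s[1:]))

def count_extrema(x):
    d = np.sign(np.diff(x))
    d = d[d != 0]                        # skip flat segments
    return int(np.sum(d[:-1] != d[1:]))  # slope sign changes

def satisfies_condition_1(x):
    return abs(count_extrema(x) - count_zero_crossings(x)) <= 1

t = np.linspace(0.0, 1.0, 1000, endpoint=False)
tone = np.sin(2 * np.pi * 5 * t)
print(satisfies_condition_1(tone))        # True: a pure tone behaves like an IMF
print(satisfies_condition_1(tone + 2.0))  # False: the offset removes all crossings
```

Condition (2) additionally requires building the two envelopes, which is part of the sifting procedure itself.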
The decomposition process of a vibration signal $x(t)$ is described as below:
(1) For a vibration signal $x(t)$, the local extrema are first identified. An envelope is created by connecting the local maxima with a cubic spline. This envelope is called the upper envelope.
(2) Another envelope is created as in (1): all the local minima are connected using the same technique, and the new envelope is called the lower envelope. All the points of the signal must lie between the two envelopes.
(3) The mean value of the two envelopes is defined as $m_1$, and $h_1$ is obtained by subtracting $m_1$ from the original signal $x(t)$ as follows:

$$h_1 = x(t) - m_1.$$

We check whether $h_1$ satisfies both IMF conditions. If both conditions are satisfied, $h_1$ is defined as the first component of $x(t)$.
(4) If either of the conditions is not satisfied, we treat $h_1$ as the signal and repeat steps (1) to (3), which means that a new mean value $m_{11}$ is calculated and then we have

$$h_{11} = h_1 - m_{11}.$$

The process is repeated $k$ times, until $h_{1k}$ satisfies both conditions. We have

$$h_{1k} = h_{1(k-1)} - m_{1k},$$

and $h_{1k}$ is chosen as the first IMF component of the signal $x(t)$. $c_1$ is defined as the first IMF as

$$c_1 = h_{1k}.$$

Normally, $c_1$ carries the most significant feature of the original signal.
(5) Then the IMF is subtracted from the signal $x(t)$, and the residue is acquired as

$$r_1 = x(t) - c_1.$$

After that, we consider $r_1$ as the original signal and repeat steps (1) to (4) until we obtain a new IMF $c_2$ of $x(t)$.
(6) The whole procedure described above is repeated $n$ times until we stop the decomposition process. We have

$$r_2 = r_1 - c_2, \quad \ldots, \quad r_n = r_{n-1} - c_n.$$

A set of IMFs from $c_1$ to $c_n$ is acquired. When the residue $r_n$ becomes monotonic, it reflects the main trend of the original signal, and no more IMFs can be obtained.
In summary, the original signal can be presented as

$$x(t) = \sum_{i=1}^{n} c_i + r_n.$$

Through the EMD process, the signal is decomposed into $n$ empirical modes plus a residue term $r_n$. Each intrinsic mode function covers its own frequency band.
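The sifting steps above can be sketched with SciPy as follows. This is an illustrative simplification, not the implementation used in the paper: the end points are added as spline knots (an assumed boundary treatment), and sifting stops when the relative change between iterations falls below `tol` rather than by testing the two IMF conditions directly. The names `envelope_mean`, `emd`, `max_sift`, and `tol` are hypothetical.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def envelope_mean(t, h):
    """Mean of the upper and lower cubic-spline envelopes of h, or None
    when h has too few interior extrema to build them."""
    maxima = argrelextrema(h, np.greater)[0]
    minima = argrelextrema(h, np.less)[0]
    if len(maxima) < 2 or len(minima) < 2:
        return None
    last = len(h) - 1
    # Assumed boundary handling: the end points are added as spline knots.
    max_knots = np.concatenate(([0], maxima, [last]))
    min_knots = np.concatenate(([0], minima, [last]))
    upper = CubicSpline(t[max_knots], h[max_knots])(t)
    lower = CubicSpline(t[min_knots], h[min_knots])(t)
    return 0.5 * (upper + lower)

def emd(t, x, max_imfs=10, max_sift=50, tol=0.05):
    imfs = []
    residue = np.asarray(x, dtype=float).copy()
    for _ in range(max_imfs):
        h = residue.copy()
        m = envelope_mean(t, h)
        if m is None:
            break                      # too few extrema left: stop decomposing
        for _ in range(max_sift):
            h_new = h - m              # h_{1k} = h_{1(k-1)} - m_{1k}
            change = np.sum((h - h_new) ** 2) / np.sum(h ** 2)
            h = h_new
            m = envelope_mean(t, h)
            if change < tol or m is None:
                break                  # assumed stopping rule for sifting
        imfs.append(h)                 # c_i
        residue = residue - h          # r_i = r_{i-1} - c_i
    return np.array(imfs), residue

t = np.linspace(0.0, 1.0, 1000)
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 5 * t) + t
imfs, residue = emd(t, x)
print(len(imfs), np.allclose(imfs.sum(axis=0) + residue, x))
```

By construction, the IMFs and the final residue always add up to the input signal, which mirrors the summary equation of the decomposition.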
The energy entropy of EMD is calculated and used as a feature for fault diagnosis. After the rolling bearing signals are decomposed into IMFs, the energies of the $n$ IMFs are $E_1, E_2, \ldots, E_n$. The energy of one IMF is calculated as

$$E_i = \sum_{j=1}^{N} |c_{ij}|^2,$$

where $c_{ij}$ is the $j$th sample of the $i$th IMF and $N$ is the number of sample data points. The total energy of all IMFs is calculated as

$$E = \sum_{i=1}^{n} E_i.$$

The EMD energy entropy of the signal is calculated as

$$H_{EN} = -\sum_{i=1}^{n} p_i \log p_i,$$

where $p_i = E_i / E$ is the share of the energy of the $i$th IMF in the total energy.
In our approach, the energies of the first five IMFs $E_1, E_2, \ldots, E_5$ and the energy entropy $H_{EN}$ are chosen as fault features.
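A short sketch of how these six EMD features might be computed from an array of IMFs is given below. The function name `emd_energy_features` is hypothetical, the natural logarithm is an assumption (the base is not stated in the text), and the entropy is taken over all IMFs passed in.

```python
import numpy as np

def emd_energy_features(imfs, n_energies=5):
    """Energies of the first n_energies IMFs followed by the energy entropy.

    imfs: array of shape (n_imfs, n_samples), one IMF per row.
    """
    imfs = np.asarray(imfs, dtype=float)
    energies = np.sum(np.abs(imfs) ** 2, axis=1)  # energy of each IMF
    p = energies / energies.sum()                 # energy share of each IMF
    p_nz = p[p > 0]                               # treat 0 * log 0 as 0
    entropy = -np.sum(p_nz * np.log(p_nz))        # natural log assumed
    return np.concatenate([energies[:n_energies], [entropy]])

# Six IMFs of equal energy give the maximum entropy, log(6).
features = emd_energy_features(np.ones((6, 10)))
print(features.shape)  # (6,)
```

A signal whose energy is concentrated in one IMF gives an entropy close to zero, while energy spread evenly over the IMFs gives the maximum value.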

Fault Diagnosis Structure.
In this section, the implementation of our proposed fault diagnosis approach is introduced. Figure 4 shows the flowchart of the fault diagnosis process.
In the feature extraction process, five statistical time domain features are selected as fault features: mean value, standard deviation, skewness, kurtosis, and root mean square (RMS). The formulas of the five features are listed in Table 1.
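The five time domain features can be computed as in the sketch below. Because the formulas of Table 1 are not reproduced here, the exact conventions are assumptions: the standard deviation divides by the number of samples, and the kurtosis is the plain fourth standardized moment rather than the excess kurtosis. The function name is hypothetical.

```python
import numpy as np

def time_domain_features(x):
    """Mean, standard deviation, skewness, kurtosis, and RMS of a signal."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    std = x.std()                                   # divides by N (assumed)
    centered = x - mean
    skewness = np.mean(centered ** 3) / std ** 3
    kurtosis = np.mean(centered ** 4) / std ** 4    # not the excess kurtosis
    rms = np.sqrt(np.mean(x ** 2))
    return np.array([mean, std, skewness, kurtosis, rms])

feats = time_domain_features([1.0, 2.0, 3.0, 4.0, 5.0])
print(feats)  # mean 3, std sqrt(2), skewness 0, kurtosis 1.7, RMS sqrt(11)
```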
The Fourier transform is applied to the vibration signals of the rolling-element bearing to obtain the frequency spectrum. A CNN model is designed to extract the spatial information of the frequency spectrum. Eighty features are obtained from the CNN for the classification phase.
Empirical mode decomposition is also applied to the vibration signals. Vibration signals in a real rolling-element bearing system may be decomposed into more than 10 IMFs; however, the energy of the IMFs decreases swiftly. In this paper, we select only the first five IMFs. Their energies $E_1, E_2, \ldots, E_5$, as well as the energy entropy $H_{EN}$, are chosen as fault features.
In summary, the vibration signals of rotating machinery are analyzed and a total of 91 features are extracted: 5 time domain features, 80 CNN features, and 6 EMD features. In the following classification phase, two effective models, a support vector machine (SVM) and a softmax classifier, are trained for fault diagnosis of rolling-element bearings.
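The assembly of the 91-dimensional feature vector and the forward pass of a softmax classifier over 52 categories can be sketched as follows. All numerical values are random placeholders, the weights are untrained, and the variable names are hypothetical; the sketch shows only the shapes involved, not the trained classifier of the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
time_feats = rng.standard_normal(5)    # placeholder time domain features
cnn_feats = rng.standard_normal(80)    # placeholder CNN features
emd_feats = rng.standard_normal(6)     # placeholder EMD features
features = np.concatenate([time_feats, cnn_feats, emd_feats])

W = 0.01 * rng.standard_normal((52, 91))  # untrained weights, one row per class
b = np.zeros(52)
probs = softmax(W @ features + b)          # class probabilities
predicted = int(np.argmax(probs))
print(features.shape, probs.shape)         # (91,) (52,)
```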

Experiment Results and Analysis
To verify the effectiveness of our approach, experiments were performed on the bearing vibration signal database of Case Western Reserve University (CWRU). The CWRU database contains a large amount of data acquired from the experimental setup introduced below. The data set of the bearings used in this paper is arranged in Table 2.

Experimental Setup.
As shown in Table 2, 52 categories of vibration signals are chosen from the CWRU database. For every category, 1000 samples of 5000 points each are selected; 800 samples are randomly selected as training data, and the remaining 200 samples are used as test data. Two vibration signals and their frequency spectra are shown in Figures 6 and 7.

Figure 7: Vibration signal and its frequency spectrum under outer race fault at 6:00 with fault diameter of 0.014 inches and motor load of 3.

On the other hand, the frequency spectrum may have more notable features, which suggests that CNN-based analysis of the frequency spectrum is promising. The original vibration signal contains 5000 points, while the frequency spectrum of a signal is a data set of 2500 points. In our approach, the spectrum is reshaped into a 50 × 50 array as the input of the CNN model designed above for feature extraction.
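The preprocessing described here (5000 samples, a 2500-point spectrum, and a 50 × 50 input) can be sketched with NumPy. Using the magnitude of the one-sided FFT and dropping the Nyquist bin are assumptions, as is the function name `spectrum_image`; the 600 Hz test tone is synthetic and serves only to check the bin layout at the 12 kHz sampling frequency.

```python
import numpy as np

def spectrum_image(signal, side=50):
    """Magnitude spectrum of a real signal, reshaped to side x side."""
    n = len(signal)
    # rfft returns n // 2 + 1 bins; keep the first n // 2 (drop Nyquist).
    spectrum = np.abs(np.fft.rfft(signal))[: n // 2]
    return spectrum.reshape(side, side)

fs = 12000                                # sampling frequency in Hz
t = np.arange(5000) / fs
signal = np.sin(2 * np.pi * 600.0 * t)    # synthetic 600 Hz test tone
image = spectrum_image(signal)
print(image.shape)                        # (50, 50)
print(np.argmax(image))                   # 250, since 600 Hz / 2.4 Hz per bin
```

With 5000 points at 12 kHz the frequency resolution is 2.4 Hz, so each row of the 50 × 50 array spans 120 Hz.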
In this experiment, the mini-batch stochastic gradient descent algorithm was used as the optimization method. The batch size was fixed at 100, and the CNN learning rate varied from 0.01 to 0.001. The training process demonstrates the significant ability of the CNN to extract features from the original vibration signals of rotating machinery.
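The batching and learning rate settings can be sketched as below. Only the batch size of 100, the learning rate range from 0.01 to 0.001, and the 52 × 800 training samples come from the text; the geometric decay shape, the 15-epoch horizon used for the schedule, and the function names are assumptions.

```python
import numpy as np

def learning_rates(n_epochs, lr_start=0.01, lr_end=0.001):
    """Geometric decay from lr_start to lr_end (assumed schedule shape)."""
    return np.geomspace(lr_start, lr_end, n_epochs)

def minibatches(n_samples, batch_size, rng):
    """Yield index arrays for one shuffled pass over the training set."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
n_train = 52 * 800                       # 800 training samples per category
rates = learning_rates(15)
for epoch, lr in enumerate(rates):
    n_batches = 0
    for batch in minibatches(n_train, 100, rng):
        # A real loop would compute the gradient on this batch here and
        # update the parameters with step size lr.
        n_batches += 1
print(n_batches)  # 416 parameter updates per epoch
```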
As shown in Figure 8, the training error dropped to almost zero within three epochs, while the test error remained at 1.10% after 15 epochs.
The training and test accuracies of both methods are rather high, as shown in Tables 3 and 4; both classifiers trained on the 91 combined features achieve outstanding test accuracy. Of the 10400 test samples, 10374 (99.75%) are classified correctly using SVM, while 10346 (99.48%) are classified correctly using the softmax classifier. Both classification methods are competitive and effective, with SVM showing a slight superiority.
The results also demonstrate the powerful feature extraction ability of the CNN. Features from the CNN model alone reach a relatively high performance; however, they have limitations in fault classification. Efforts were made to alter the parameters and even the structure of the CNN model, but the extracted features reach a classification accuracy of only around 99%. Time domain features and EMD features are easier to obtain than CNN features, and they are also useful in many situations. By combining features from both methods, we achieve a better result than by using them separately. The results of our proposed approach are also compared with those of other papers. Table 5 shows the classification accuracy of some other works.
As shown in Table 5, a traditional ANN combined with the EMD method already has a high accuracy in [13]. CNNs have been applied to fault diagnosis in [14][15][16][17]. The CNN structures in [15,16] show great performance in classification. However, with a small number of categories, a CNN does not always give better results than traditional methods, as shown in [17]. Most works deal with only a small number of categories, which is not adequate in practical situations, while our approach deals with 52 fault categories. Our proposed approach with 91 features has the best performance in the table.

Parameter Selection for CNN.
In our proposed approach, the CNN structure consists of 4 convolutional layers and 2 subsampling layers; detailed parameters are shown in Table 6. In a CNN structure, a larger number of filters usually gives better representation ability. As there are 52 fault categories, the number of filters should be larger than 52. Convolutional layers show different kinds of characteristics, and later convolutional layers represent finer details than earlier layers. Therefore, in layer C3, we select 300 filters for better representation.
The number of features extracted from the CNN model is very important. Experiments were carried out with different numbers of features, and the results are shown in Figure 11. Different numbers of features give different accuracies: 80 features show the best representation ability, while more features may lead to overfitting.
The optimization of the parameters of a CNN is always important for obtaining an effective model.

Method                               Accuracy   Number of categories
ANN with EMD [13]                    96.24%     3
Wavelet-ANN [13]                     88.54%     3
CNN with 2 pipelines [14]            93.61%     8
CNN with statistical feature [15]    98.02%     12
CNN with statistical feature [15]    98.35%     8
Hierarchical ADCNN [16]              98.13%     3
SVRM [16]                            94.17%     3
1D-CNN [17]                          97.40%     2
WP-SVM [17]                          99.20%     2
FFT-SVM [17]                         84.20%     2

The selection of the learning rate of the mini-batch SGD algorithm is also considered. An appropriate learning rate is important to the final results. A higher learning rate leads to faster descent, while a lower rate may leave the optimization in a local rather than a global optimum. A series of experiments was carried out with different learning rates, and some of the results are shown in Figure 13. As shown in the figure, the training error drops to nearly zero in no more than four epochs, except for the run with a learning rate of 0.001. With such a small learning rate, the CNN model cannot reach a satisfactory result. The results in this paper and other CNN parameter-tuning studies indicate that a variable learning rate is the best choice here.

2.1. Convolutional Neural Network.
Deep learning methods have outstanding performance in image classification, computer vision, and natural language processing. The CNN is a type of deep neural network. The neurons forming the CNN structure have weights and biases that are learned through training.

Figure 3: Empirical mode decomposition of a sample signal.

Figure 4: Representation of proposed fault diagnosis structure.

3.2. Data Selection and Preprocess.
Three bearing components, the inner race (IR), the outer race (OR), and the ball of the rolling bearing (BA), are under study in the CWRU database. In order to verify the performance of our approach, a set of experiments was conducted. Fault categories of the experimental apparatus include IR faults, BA faults, and OR faults located at three o'clock, six o'clock, and twelve o'clock. In addition, vibration signals under different motor loads and fault diameters are collected for analysis. The sampling frequency of the platform is 12 kHz.

Figure 6: Vibration signal and its frequency spectrum under inner race fault with fault diameter of 0.007 inches and motor load of 0.

Figure 8: Training and test error of CNN feature extraction model.

Figure 9: Vibration signal and its first 9 IMFs under inner race fault with fault diameter of 0.007 inches and motor load of 0.

Figure 10: Vibration signal and its first 9 IMFs under outer race fault at 6:00 with fault diameter of 0.014 inches and motor load of 3.

Figure 12: Training time of CNN model.

Table 1: Time domain features.

Table 2: Bearing fault data arrangement.

Table 3: Training accuracy of both classifiers on different features.

Table 4: Test accuracy of both classifiers on different features.

Table 5: Classification accuracy of different methods.

Table 6: Parameters of the proposed CNN structure.

Figure 11: Error rate with different numbers of CNN features.

In general, the learning rate, the number of kernels, the number of weights in each layer, and the batch size are all parameters to be optimized. In our proposed CNN model, as shown in Table 6, a total of 654360 weight and bias parameters need to be updated in each step, which results in a relatively long training time. The training time of the CNN model in this paper is shown in Figure 12, and the average training time is about 240 seconds.