Early Fault Detection Method of Rolling Bearing Based on MCNN and GRU Network with an Attention Mechanism

To address early fault diagnosis of rolling bearings, this paper proposes an early fault detection method based on a multiscale convolutional neural network and a gated recurrent unit network with an attention mechanism (MCNN-AGRU). The method first applies multiscale data processing to the rolling bearing vibration signal and feeds the resulting signals at multiple time scales into convolutional neural networks to extract features. It then adds a gated recurrent unit network with an attention mechanism to give the model predictive ability. Finally, the reconstruction error between the actual value and the predicted value is used to detect early faults. The method is trained on normal data only, which addresses early fault detection in the operating condition monitoring and performance degradation assessment of rolling bearings. Multiscale data processing makes the features extracted by the CNN more robust, and the GRU network with an attention mechanism keeps the predictive ability of the method from being affected by the length of the data. Experimental results show that the proposed MCNN-AGRU method can effectively detect early faults of rolling bearings and identify the type of fault.


Introduction
As one of the key parts of rotating machinery, the rolling bearing mainly bears stress and transfers load in the system. Because it operates for long periods under high-speed, high-load working conditions, the rolling bearing has become the most easily damaged part of mechanical equipment [1]. Once a rolling bearing is damaged, the impact on the mechanical equipment can be very serious, so it is of great significance to study the failure mechanisms of rolling element bearings.
The typical life curve of rolling bearings is shown in Figure 1.
There are four stages: (1) running-in stage, (2) normal operation stage, (3) early weak fault occurrence and healing stage, and (4) severe fault stage. Early faults are too weak to detect easily, and once an early failure occurs, the rolling bearing deteriorates rapidly after a short "healing stage," which leads to serious consequences. If the fault can be detected and remedied at an early stage, bigger safety problems can be avoided and losses reduced.
Therefore, the early fault detection of rolling bearings is very important [2][3][4][5]. Two problems must be faced in detecting early faults: (1) early faults are weak, which makes their features difficult to extract; (2) little early fault data is available, which is not enough to train a model. Current methods for diagnosing rolling bearing faults can be roughly divided into two categories [6]. The first category is model-based fault diagnosis methods, which mainly use expert knowledge to analyse the fault frequency [7][8][9] or establish a degradation model [10,11] to isolate early faults. However, these methods rely on subjective human choices and on the accuracy of the model, and they demand considerable experience.
Such a model is designed only for specific fields, which limits its scope of application to a certain extent. The second category is data-based fault diagnosis methods [12][13][14]. As the field of fault diagnosis enters the era of "big data," a series of data-based methods have emerged.
The data-based method relies on a neural network to extract the features of the data on its own, eliminating the subjectivity of manual feature design and the dependence on human experience, which is better suited to the monitoring needs of today's large-scale industry.
In recent years, with its rapid development and wide application, deep learning has become the focus of fault diagnosis.
Typical networks include the deep neural network (DNN) [23], deep belief network (DBN) [24], autoencoder (AE) [25], convolutional neural network (CNN) [26], and recurrent neural network (RNN) [27]. Although DBN and DNN are more accurate than shallow artificial neural networks, they still require manual extraction of time series features and ignore the temporal characteristics of the data. AE is an unsupervised learning method that is mainly used for data dimensionality reduction or feature extraction. It usually needs to be combined with other models before it is applied to the field of load forecasting. CNN is a deep neural network built on convolution operations. Convolution and pooling are used to extract data features, which reduces the error caused by manual feature extraction. It is widely used in image, voice, and other fields. However, it is difficult for a single CNN to extract the weak features of early faults, so a multiscale convolutional neural network (MCNN) is introduced to extract more comprehensive features. An MCNN alone, however, has difficulty learning the temporal dynamics retained in the data. RNN introduces a cyclic structure into the network so that it can model dynamic time series data better than other neural networks [28]. The gated recurrent unit (GRU) is a special RNN. GRU and the long short-term memory network (LSTM) [29][30][31] address the vanishing gradient problem in RNNs. They can consider the long-term and short-term dependence in time series more completely. Compared with LSTM, GRU converges faster with no difference in accuracy. However, when the input time series is long, RNN-type networks such as LSTM and GRU are prone to losing sequence information and have difficulty modeling the structural information between data, which affects the accuracy of the model [32]. The attention mechanism is a resource allocation mechanism that assigns different weights to input features so that the features containing important information do not disappear as the step size increases. It highlights the influence of more important information and makes it easier for the model to learn long-distance interdependence in the sequence [33].
Although data-based early fault detection methods exist, early fault detection still faces the following challenges: (1) how to extract comprehensive and robust features from early fault signals; (2) how to take the timing characteristics of bearing vibration signals into account when detecting anomalies; and (3) how to avoid losing information when the input data is too long.
To address the above problems, this paper proposes an MCNN-AGRU early fault diagnosis method. The method uses MCNN to extract the features of the rolling bearing at different time scales and filters out some noise in the multiscale calculation to obtain more robust and comprehensive features of the bearing. The GRU network with an attention mechanism can learn the long-term dependence characteristics of the data, and the features containing important information do not disappear as the step size increases. This highlights the influence of more important information and makes it easier for the model to learn long-distance interdependence in the sequence [34], which solves the problem of information loss caused by overly long data. Finally, a large amount of normal operating data of rolling bearings is used to construct a predictive model of the normal operating state of rolling bearings. Through training, the model learns the distribution of normal data, and the reconstruction error between the predicted value and the true value is used to measure the operating state of the rolling bearing and raise an early alarm.
The main contributions of this paper are as follows: (1) an early fault diagnosis method that needs only normal bearing data to train the model, which addresses the scarcity of early bearing fault data; (2) multiscale data processing that makes the features extracted by the CNN more robust; and (3) a GRU network with an attention mechanism that makes the predictive ability of the model independent of the length of the data.
The rest of the paper is organized as follows: the second section introduces the basic theory, the third section proposes the MCNN-AGRU method, the fourth section verifies the performance of the scheme through simulation, and the fifth section gives the conclusion.

Fundamental Theory
Suppose the original signal $X = \{x_1, \ldots, x_n, \ldots, x_N\}$ is given, where $N$ is the length of the original input data, $s$ is the scale of the multiscale processing, and $x_n$ is the $n$th vibration value of the original signal. The multiscale output signal is denoted $y_{s,j}$. The data length after multiscale data processing is $N/s$. The range of $s$ selected in this paper is 1 to 4.
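The multiscale step can be illustrated with a short sketch. The defining equation is not reproduced in this text, so the code assumes the common coarse-graining form, in which $y_{s,j}$ is the mean of the $j$-th block of $s$ consecutive samples. This agrees with the stated output length $N/s$, but it is an assumption rather than a confirmed formula from the paper. The function names `coarse_grain` and `multiscale` are hypothetical, and plain Python is used to keep the example self-contained.

```python
def coarse_grain(x, s):
    """Average consecutive, non-overlapping blocks of length s.

    Assumed form: y[s][j] = (1/s) * sum of x over the j-th block of s samples.
    """
    if s < 1:
        raise ValueError("scale s must be a positive integer")
    n_out = len(x) // s  # samples that do not fill a whole block are dropped
    return [sum(x[j * s:(j + 1) * s]) / s for j in range(n_out)]


def multiscale(x, scales=(1, 2, 3, 4)):
    """Return one coarse-grained copy of x per scale (s = 1 to 4 as in the text)."""
    return {s: coarse_grain(x, s) for s in scales}


signal = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0]
views = multiscale(signal)
lengths = {s: len(v) for s, v in views.items()}  # each length is len(signal) // s
```

Averaging over a block also attenuates sample-to-sample noise, which is in line with the noise-filtering effect attributed to multiscale processing in the introduction.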

Convolutional Neural Network.
The convolutional neural network is a multistage neural network consisting of a filtering stage and a classification stage. The filtering stage extracts features from the input signal, the classification stage classifies the learned features, and the parameters of both stages are obtained through joint training [36]. The filtering stage includes a convolutional layer and a pooling layer and applies an activation function to perform a nonlinear operation on the result. The convolutional layer uses a convolution kernel to perform convolution operations on local regions of the input signal and generate the corresponding features. The most important property of the convolutional layer is weight sharing; that is, the same convolution kernel traverses the input once with a fixed stride. Weight sharing reduces the number of network parameters in the convolutional layer and avoids the overfitting caused by too many parameters. The main purpose of the pooling layer is to reduce the parameters of the neural network and to perform a second round of extraction on the features obtained by the convolutional layer. The one-dimensional convolution process is shown in Figure 3.
The convolution kernel moves along the input signal according to the stride to extract features, and the obtained features are then pooled to obtain higher-level features.
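The following sketch illustrates weight sharing and pooling on a one-dimensional signal: a single kernel slides over the input with a fixed stride, ReLU is applied, and max-pooling keeps the largest value in each block. The kernel values, signal, and function names are illustrative assumptions, not the trained filters or layer sizes of the paper.

```python
def relu(v):
    """Rectified linear unit."""
    return v if v > 0.0 else 0.0


def conv1d(signal, kernel, bias=0.0, stride=1):
    """Slide one shared kernel over the signal (no padding) and apply ReLU."""
    m = len(kernel)
    out = []
    for start in range(0, len(signal) - m + 1, stride):
        window = signal[start:start + m]
        out.append(relu(sum(w * x for w, x in zip(kernel, window)) + bias))
    return out


def max_pool1d(features, pool):
    """Keep the maximum of each non-overlapping block of length pool."""
    return [max(features[k:k + pool])
            for k in range(0, len(features) - pool + 1, pool)]


signal = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0, 0.0]
kernel = [1.0, 0.0, -1.0]  # hypothetical kernel that responds to falling segments
feature_map = conv1d(signal, kernel)       # 7 outputs from 3 shared weights
pooled = max_pool1d(feature_map, pool=2)   # 3 outputs after pooling
```

Only three weights are used no matter how long the input is, which is the parameter saving that weight sharing provides.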

Gated Recurrent Unit Network.
The gated recurrent unit (GRU) network is a variant of the recurrent neural network (RNN). An RNN is a type of neural network that takes sequence data as input and recurses along the direction in which the sequence evolves, with all recurrent units connected in a chain [37]. As shown in Figure 4, the GRU network consists of an update gate and a reset gate. The main function of the update gate is to control the extent to which the state information from the previous moment is brought into the current state. The larger the value of the update gate, the more state information from the previous moment is brought in [38]. The main function of the reset gate is to determine how much previous information is discarded. The smaller its value, the more information is ignored. In the GRU expression, $\sigma$ represents the sigmoid activation function, and the parameters are $W$, $W_z$, and $W_r$.
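A single GRU step can be sketched as follows. The GRU equations are not reproduced in this text, so the code assumes a widely used bias-free form built from the three parameter matrices named above. The final blend is written so that a larger update gate keeps more of the previous state, which matches the description here; some references define the update gate with the two roles swapped. The helper names and the toy weights are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step on the concatenated vector [h_prev, x_t] (assumed form)."""
    concat = h_prev + x_t
    z = [sigmoid(a) for a in matvec(W_z, concat)]       # update gate
    r = [sigmoid(a) for a in matvec(W_r, concat)]       # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x_t
    h_cand = [math.tanh(a) for a in matvec(W, gated)]   # candidate state
    # A larger z keeps more of the previous state, as described in the text.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


# Toy run: hidden size 2, input size 1, arbitrary illustrative weights.
W_z = [[0.1, 0.0, 0.5], [0.0, 0.1, -0.5]]
W_r = [[0.2, 0.0, 0.3], [0.0, 0.2, 0.3]]
W = [[0.5, 0.0, 1.0], [0.0, 0.5, -1.0]]
hidden = [0.0, 0.0]
for x in ([0.2], [0.4], [-0.1]):
    hidden = gru_step(x, hidden, W_z, W_r, W)
```

With all weights set to zero both gates equal 0.5 and the candidate state is zero, so the step simply halves the previous state, which gives an easy sanity check of the blend.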

Attention Mechanism.
The attention mechanism is a resource allocation mechanism that simulates the attention of the human brain. At a given moment, the human brain focuses its attention on the areas that need it and reduces or even ignores its attention to other areas. It attends to the details of the needed information and suppresses other useless information. The core idea is to redistribute attention reasonably, ignoring irrelevant information and amplifying the required information. The attention mechanism allocates sufficient attention to key information through probability allocation, highlights the impact of important information, and improves the accuracy of the model. The structure of the attention mechanism is shown in Figure 5, where $x_t$ ($t \in [1, n]$) represents the input of the GRU network, $h_t$ ($t \in [1, n]$) is the hidden layer output of the GRU for each input, $\alpha_t$ ($t \in [1, n]$) is the attention probability distribution value that the attention mechanism assigns to the GRU hidden layer output, and $y$ is the output of the GRU with the attention mechanism.
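The probability allocation can be sketched as a softmax over per-step scores followed by a weighted sum of the hidden states. The scoring function used below, $e_t = u \cdot \tanh(w h_t + b)$, is an assumption: it is a common additive form that uses the symbols $u$, $w$, and $b$ named in the description of the prediction layer, but the exact formula is not reproduced in this text. The function name `attention_pool` and the toy values are hypothetical.

```python
import math


def attention_pool(hidden_states, w, b, u):
    """Weight the GRU hidden states h_1..h_n and return (alphas, y)."""
    # Assumed score for each step: e_t = u . tanh(w h_t + b)
    scores = []
    for h in hidden_states:
        proj = [math.tanh(sum(wi * hi for wi, hi in zip(row, h)) + bi)
                for row, bi in zip(w, b)]
        scores.append(sum(ui * pi for ui, pi in zip(u, proj)))
    # Softmax turns the scores into the probability distribution alpha_t.
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    alphas = [v / total for v in exps]
    # Output: y = sum over t of alpha_t * h_t
    dim = len(hidden_states[0])
    y = [sum(a * h[d] for a, h in zip(alphas, hidden_states)) for d in range(dim)]
    return alphas, y


states = [[0.1, 0.0], [0.9, 0.2], [0.1, 0.0]]  # toy hidden states h_1..h_3
w = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
u = [1.0, 1.0]
alphas, y = attention_pool(states, w, b, u)
```

In this toy run the middle step receives the largest weight, so its hidden state dominates the output regardless of its position in the sequence.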
Support Vector Data Description.
Support vector data description (SVDD) is a one-class classification algorithm that can distinguish target samples from nontarget samples. At present, the SVDD algorithm is mainly used for abnormal state detection and fault identification, where only the normal working state space is defined in order to judge whether the working state is normal. Given a set of training samples, the goal of SVDD is to determine a hypersphere of minimum volume that encloses all training samples. Let $a$ and $R$ be the center and radius of the hypersphere, respectively. In the SVDD optimization problem, $C$ is the constant that controls the degree of penalty for misclassified samples, $\xi_i$ is the slack variable, and $\varphi(x_i)$ is the mapping from sample space to feature space.
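The optimization problem itself is not reproduced in this text. For reference, the standard soft-margin SVDD formulation, written with the symbols defined above, is shown below; the training set size $n$ is an assumed symbol, and this is the textbook form rather than a confirmed copy of the paper's equation.

```latex
\min_{R,\,a,\,\xi}\; R^{2} + C\sum_{i=1}^{n}\xi_{i}
\quad \text{s.t.} \quad
\lVert \varphi(x_{i}) - a \rVert^{2} \le R^{2} + \xi_{i},
\qquad \xi_{i} \ge 0,\; i = 1,\ldots,n
```

A larger $C$ penalizes samples outside the sphere more heavily, while a smaller $C$ allows a tighter sphere that excludes more training samples.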

Shock and Vibration
The Lagrange multiplier method is used to solve the above optimization problem, which yields a dual form in which $\alpha_i$ is the Lagrange multiplier, with $0 \le \alpha_i \le C$. To improve the adaptability of the algorithm, the Gaussian kernel function $K(x_i, x_j)$ is introduced to replace the inner product of $\varphi(x_i)$ and $\varphi(x_j)$, which improves the generalization ability of SVDD. In the Gaussian kernel function, $\sigma$ is the kernel parameter, which has a great impact on the detection performance of SVDD. Solving the dual maximization problem gives the solution set $\{\alpha_i\}$, from which the center and minimum radius of the sphere are obtained, where $x_k$ is an arbitrary support vector. For a test sample $z$, the decision rule is as follows: when $f(z) \ge R^2$, the sample is normal; otherwise, it is abnormal.
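The kernel-based distance test can be sketched as follows, assuming the multipliers $\alpha_i$ have already been obtained from the dual problem. The code assumes that $f(z)$ is the squared feature-space distance from $z$ to the center $a = \sum_i \alpha_i \varphi(x_i)$ and that the kernel is $K(x, y) = \exp(-\lVert x - y \rVert^2 / (2\sigma^2))$, one common parameterization. Under this definition a sample inside the sphere satisfies $f(z) \le R^2$; if $f$ is defined with the opposite sign, the inequality flips accordingly. The function names are hypothetical, and the two-point training set is a toy case in which both multipliers equal 0.5.

```python
import math


def gaussian_kernel(x, y, sigma):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), an assumed parameterization."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def sq_distance_to_center(z, train, alphas, sigma):
    """||phi(z) - a||^2 with a = sum_i alpha_i phi(x_i), expanded via the kernel."""
    k_zz = gaussian_kernel(z, z, sigma)
    cross = sum(a * gaussian_kernel(x, z, sigma) for a, x in zip(alphas, train))
    const = sum(ai * aj * gaussian_kernel(xi, xj, sigma)
                for ai, xi in zip(alphas, train)
                for aj, xj in zip(alphas, train))
    return k_zz - 2.0 * cross + const


def is_inside(z, train, alphas, sigma, r2, tol=1e-12):
    """True when z lies inside or on the sphere of squared radius r2."""
    return sq_distance_to_center(z, train, alphas, sigma) <= r2 + tol


train = [[0.0], [2.0]]   # toy training set
alphas = [0.5, 0.5]      # for two points, both are support vectors
sigma = 1.0
# The squared radius is the distance from any support vector to the center.
r2 = sq_distance_to_center(train[0], train, alphas, sigma)
```

For example, `is_inside([1.0], train, alphas, sigma, r2)` holds for the midpoint of the two training points, while a distant sample such as `[5.0]` falls outside the sphere.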

MCNN-AGRU Early Fault Detection Method
Most deep learning-based fault diagnosis methods learn and classify severe faults, and only a few address the early faults of bearings. The MCNN-AGRU method proposed in this paper targets this gap. MCNN extracts data features at different scales, which increases the amount of data, and filters out part of the noise in the process so that more robust features are extracted. The GRU network with the attention mechanism addresses the information loss, and the difficulty of accounting for relationships within the data, that arise when a single GRU network receives an overly long input sequence. Therefore, the proposed MCNN-AGRU early fault detection method improves on previous methods in feature extraction and time series processing. The experiments show that early bearing faults can be detected accurately and quickly.

MCNN-AGRU Fundamental.
The structure of the MCNN-AGRU model proposed in this paper is shown in Figure 6. It is mainly divided into three parts: the multiscale input layer, the multiscale feature extraction layer, and the prediction layer. First, the original vibration data is transformed into data at four time scales by multiscale preprocessing, as shown in Figure 7. Then, the data at these four scales is input into the CNN to extract features. Finally, the features extracted at the four scales are concatenated to obtain the comprehensive feature, which is passed to the prediction layer.
Figure 5: Attention mechanism structure.
Each layer in the model is described as follows:
(1) Multiscale input layer: The input layer processes the original data at multiple scales to obtain four inputs of different scales and feeds them into four different convolutional neural networks. The original data is $X = \{x_1, \ldots, x_n, \ldots, x_N\}$.
(2) Multiscale feature extraction layer: In this layer, two convolution-pooling layer pairs are used to extract features from the data at each scale, and the extracted features are concatenated to form a comprehensive feature. The input of the first convolutional layer is a signal of length $L = N/s$, and a convolution kernel of length $m$ slides over the data to extract features. The output $z_i$ of the $i$-th node in the feature map is (i) $z_i = \sigma(w^{T} y_{i:i+m-1} + b)$, where $w^{T}$ represents the weight matrix, $b$ is the bias, $y_{i:i+m-1}$ represents the subsignal of length $m$ starting from the $i$-th point of the input data $y$, and $\sigma$ represents the activation function. ReLU is used as the activation function here because it can prevent the gradient from vanishing and speeds up convergence. Sliding the convolution kernel from the beginning to the end of the signal gives the $j$-th feature map (ii). After that, the pooling layer further extracts the features obtained by the convolutional layer: max-pooling with a pooling length of $p$ computes the local maximum over the input feature map, and the $k$ features are combined to obtain (iii). The features after the pooling layer are expressed as $h'_k$, and these features are then concatenated to get (iv). Finally, the features obtained from the four scales are concatenated to obtain the comprehensive feature $q = [q^{(1)}, q^{(2)}, q^{(3)}, q^{(4)}]$.
(3) Prediction layer: The prediction layer is composed of the GRU layer, the attention mechanism layer, and the output layer. The GRU layer learns the feature vectors extracted by the multiscale feature extraction layer. A single-layer GRU structure is built to fully learn the extracted features and capture their internal patterns of change. The output of the GRU layer is denoted $H$, and its output at step $t$ is $h_t$. The input of the attention mechanism layer is the output vector $H$ produced by the GRU layer.
The probabilities corresponding to the different feature vectors are calculated according to the weight distribution principle, and the weight parameter matrix is updated iteratively toward a better solution. In the calculation of the weight coefficients of the attention mechanism layer, $e_t$ represents the attention probability distribution value determined by the output vector $h_t$ of the GRU layer at time $t$; $u$ and $w$ are weight coefficients; $b$ is the bias coefficient; and the output of the attention layer at time $t$ is represented by $s_t$. Finally, the output of the attention mechanism layer is the input of the output layer. The output layer computes the output with a prediction step length of $m$ through a fully connected layer. In the prediction formula, $y_t$ represents the predicted output value at time $t$, $w_o$ is the weight matrix, and $b_o$ is the bias vector.
The activation function $\sigma$ is the sigmoid function. The reconstruction error is calculated between the predicted value and the actual value. Only normal data is used to train MCNN-AGRU, so that the model learns to predict the normal behaviour of the system along the time axis when it is used to detect early bearing faults. When online data is input into the model, the model predicts the value of the data at the next time step, and the reconstruction error is calculated against the actual value at that time step. The reconstruction error of the bearing differs when different faults occur. For example, when the system is normal, the reconstruction error is very small. When the system is abnormal, the reconstruction error increases noticeably. More importantly, the reconstruction errors of different types of early faults also differ. Therefore, the running state of the system can be judged from the reconstruction error. Vibration signals of the various fault types and normal vibration signals are input into MCNN-AGRU to obtain their reconstruction errors, and the abnormal and normal reconstruction errors are used to train SVDD to indicate the running state of the system.
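The detection logic can be sketched in a few lines. The reconstruction error formula is not reproduced in this text, so the code assumes the mean squared difference between the actual and predicted values. A fixed threshold is used here only as a simple stand-in for the SVDD decision described above, and the threshold value, the `hold` parameter, and the function names are hypothetical.

```python
def reconstruction_error(actual, predicted):
    """Mean squared difference between actual and predicted values (assumed metric)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def first_alarm(errors, threshold, hold=1):
    """Index where the first run of `hold` consecutive errors above threshold starts."""
    run = 0
    for i, e in enumerate(errors):
        run = run + 1 if e > threshold else 0
        if run >= hold:
            return i - hold + 1
    return None  # no alarm raised


# Toy error sequence: a brief excursion, a return to normal, then a sustained rise.
errors = [0.1, 0.2, 3.0, 0.2, 3.1, 3.4, 0.1]
single = first_alarm(errors, threshold=2.5)             # reacts to the first excursion
sustained = first_alarm(errors, threshold=2.5, hold=2)  # waits for two in a row
```

Requiring several consecutive exceedances trades detection delay for fewer false alarms, which matters when early faults appear and then temporarily fade.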

Experimental Results and Analysis
This section verifies the accuracy and feasibility of the proposed MCNN-AGRU method through two sets of experiments: one on a self-built comprehensive mechanical failure simulation platform and one on a full life cycle data set from the intelligent maintenance system (IMS) of the University of Cincinnati [39,40].

MCNN-AGRU Fault Classification Experiment.
This part uses experiments to verify the classification accuracy of the model. The data set was acquired from the self-built comprehensive mechanical failure simulation platform. The test stand consists of a motor, a rotor, a main shaft, a vibration sensor, and different kinds of rolling bearings (shown in Figure 9). The data set consists of four categories: normal state (N), inner ring failure (IRF), outer ring failure (ORF), and rolling element failure (REF). The fault size is 0.2 mm for every fault type, and the motor speed is 1800 RPM. Data was collected at 12,000 samples per second.
This data set is used to evaluate the fault diagnosis performance of the algorithm. It contains four operating states common to rolling bearings (N, IRF, ORF, REF). Each state has 120,000 points, of which 80,000 are used for training, 20,000 for validation, and 20,000 for testing. The test data of the four operating states is then fed into the trained model, and the state of the system is judged from the reconstruction error. The detailed information is listed in Table 1. All structural hyperparameters of the proposed method are shown in Table 2.
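The split of each 120,000-point record can be written out explicitly. The sketch below assumes a contiguous split in time order; the text does not state whether the points were taken contiguously or shuffled, so this is only one plausible reading. The function name `contiguous_split` is hypothetical.

```python
def contiguous_split(record, n_train, n_val, n_test):
    """Cut one record into consecutive training, validation, and test segments."""
    if n_train + n_val + n_test > len(record):
        raise ValueError("requested segments exceed the record length")
    train = record[:n_train]
    val = record[n_train:n_train + n_val]
    test = record[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


record = list(range(120_000))  # stand-in for one operating state's samples
train, val, test = contiguous_split(record, 80_000, 20_000, 20_000)
```

A contiguous split keeps the temporal order within each segment, which is what a sequence model such as the GRU needs.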
Figure 10 shows the test results of the model. The black dots indicate the normal state, the green triangles indicate rolling element failure, the blue squares indicate inner ring failure, and the pink crosses indicate outer ring failure.
(1) The reconstruction error of the normal state stays below 2.5, and the reconstruction error of the abnormal states (rolling element failure, outer ring failure, and inner ring failure) ranges from 2.5 to 20. There is a clear difference between the reconstruction error of the normal state and that of the abnormal states, which means that the model can distinguish the normal state from the abnormal states very well and has good anomaly detection ability.
(2) The reconstruction error of rolling element failure ranges from 2.5 to 5, that of inner ring failure fluctuates around 7.5, and that of outer ring failure ranges from 12 to 20. The model can therefore distinguish the three fault types well, indicating good fault classification capability.

MCNN-AGRU Fault Prediction Experiment.
This experiment verifies the fault prediction ability of the model. To verify how the model performs when extended to early state recognition of rolling bearings, it is first necessary to analyse the operating characteristics of the rolling bearing throughout its life cycle.
This article uses the full life cycle bearing data from the Intelligent Maintenance Center of the University of Cincinnati for analysis. As shown in Figure 11, the bearing test bench carries four bearings on a shaft, which is driven by an AC motor.
The speed is maintained at 2000 r/min. A radial load of 6000 lbs is applied to the shaft and bearings through a spring mechanism to accelerate bearing aging. The oil circulation system can measure the flow and temperature of the lubricating oil. In addition, an electromagnet installed in the oil return pipe collects debris from the oil as evidence of the performance degradation of the bearing system. When the debris accumulated on the electromagnet exceeds a certain level, the system stops running. A vibration acceleration sensor is installed on each bearing housing.
The data sampling rate is 20 kHz, and one sample of 20,480 points is recorded every ten minutes.
This paper uses the data of experiment C in the IMS full life cycle experiment to train and test the model. The experiment started on April 8th and ended on April 18th. After the accelerated aging test with applied load, an outer ring failure occurred on the No. 3 bearing. The data contains the vibration acceleration signals of the No. 3 bearing from normal operation to the occurrence of the outer ring failure, 1399 samples in total. The data sampling rate is 20 kHz, and each vibration signal snapshot contains 20,480 points. The first 800 samples are healthy running data of the No. 3 bearing. The first 500 of these samples are used as training data for the model, and the remaining 300 are used as validation data. The last 599 samples, which contain the degradation process of the No. 3 bearing, are used to test the performance of the model. To give the model input a physical meaning, the number of sampling points in one revolution is calculated from the sampling frequency and the motor speed, which gives roughly 600 points per revolution. The data is therefore rearranged and input into the model cycle by cycle for training and testing.
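The figure of roughly 600 points per revolution follows directly from the sampling rate and the shaft speed: 20,000 samples per second divided by 2000/60 revolutions per second. The sketch below reproduces this arithmetic and cuts one 20,480-point snapshot into revolution-length cycles. Dropping the samples that do not fill a complete cycle is an assumption, since the text does not say how the remainder was handled, and the function names are hypothetical.

```python
def samples_per_revolution(sampling_rate_hz, speed_rpm):
    """Number of samples recorded during one shaft revolution."""
    return sampling_rate_hz * 60.0 / speed_rpm


def split_into_cycles(snapshot, cycle_len):
    """Cut a snapshot into consecutive cycles; an incomplete tail is dropped."""
    n_cycles = len(snapshot) // cycle_len
    return [snapshot[i * cycle_len:(i + 1) * cycle_len] for i in range(n_cycles)]


per_rev = samples_per_revolution(20_000, 2000)  # 20 kHz at 2000 r/min
snapshot = list(range(20_480))                  # stand-in for one recorded snapshot
cycles = split_into_cycles(snapshot, int(per_rev))
leftover = len(snapshot) - len(cycles) * int(per_rev)
```

Under these assumptions each snapshot yields 34 complete cycles of 600 points, with 80 points left over.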
Figure 12 shows the results for the No. 3 bearing data in experiment C. Normal data is used to train the model so that it learns how the data of a healthy rolling bearing changes, and the reconstruction error between the actual value and the predicted value is used to measure the running state of the bearing. The partial enlargement in Figure 13 shows that the model first indicated an abnormal condition in the 8250th cycle. Then, the vibration signal returned to normal. This is consistent with the failure process of rolling bearings. When an early fault occurs on the outer ring while the rolling bearing is running, the weak defects in the outer ring are smoothed by the continuous motion of the rolling elements. The abnormality gradually diminishes, so for a short time the data resembles normal conditions. The rolling bearing with the early fault continues to run, and these two states alternate. However, the duration of each state becomes shorter and shorter, and the amplitude of each abnormal signal gradually increases.
To verify the stability and advantages of the MCNN-AGRU method proposed in this paper, it is compared with several other fault detection methods.
As shown in Figures 14 and 15, MCNN-AGRU can describe the development of the rolling bearing's damage, and it is more sensitive to initial anomalies than the other methods. Kurtosis is not sensitive enough to abnormal changes in the signal and detects the fault about 6650 revolutions later than the method proposed in this paper. When kurtosis is combined with MCNN, its detection ability is enhanced, but its ability to predict the next running state of the bearing is reduced. RMS responds to the early fault to some degree, but the response is not obvious, and it cannot accurately predict the next running state of the bearing. When RMS is combined with MCNN, both its detection ability and its ability to predict the next running state are reduced. Compared with RMS, the response of the proposed MCNN-AGRU is clearly larger in amplitude. This means that when both methods detect early faults, the response of MCNN-AGRU is more sensitive and obvious, while that of RMS is easily masked by noise. In conclusion, the features extracted by MCNN-AGRU are more stable and more sensitive to early faults.
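For reference, the two baseline indicators can be computed per cycle as shown below. The definitions are the standard ones (root mean square, and kurtosis as the fourth standardized moment); the exact variants used in the comparison are not stated in the text, so this is a sketch under that assumption. The toy signals are illustrative only.

```python
import math


def rms(x):
    """Root mean square of a signal segment."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def kurtosis(x):
    """Fourth standardized moment; equals 3 for Gaussian data."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / (m2 ** 2)


smooth = [1.0, -1.0] * 50   # toy cycle without impacts
impulsive = list(smooth)
impulsive[10] = 10.0        # the same cycle with a single impact
```

A single impact raises kurtosis sharply while RMS changes only moderately, which illustrates why the two indicators respond differently to impulsive fault signatures.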

Figure 10: Diagram of model diagnosis results in four states.

Table 1: Rolling bearing failure test data.