Bearing Faulty Prognostic Approach Based on Multiscale Feature Extraction and Attention Learning Mechanism

Research on data-driven fault identification has recently attracted increasing attention owing to the rapid development of modern condition monitoring technology and the availability of massive historical data. However, most industrial equipment works under variable operating conditions, which challenges the generalization ability of a conventional data-driven model trained on historical operating data whose distribution may differ from that of the current operating data. Moreover, a traditional data-driven fault prognostic model trained on massive historical data can hardly meet the real-time requirements of industrial practice. Since hierarchical feature extraction can enhance model generalization and the attention learning mechanism can improve prediction efficiency, this paper proposes a novel bearing fault prognostic approach combining a U-net-based multiscale feature extraction network and a CBAM (convolutional block attention module)-based attention learning network. First, time-domain condition monitoring signals are converted into two-dimensional gray-scale images suitable as the input of a CNN. Second, a CNN model based on the U-net structure is adopted as the feature extractor to hierarchically extract multilevel features that are sensitive to the fault information contained in the converted image. Finally, the extracted multilevel features, which contain different representations of the raw signals, are sent to the designed CBAM-based attention learning network, whose discriminative emphasis enables efficient fault classification. The effectiveness of the proposed approach is validated by two case studies using datasets from CWRU (Case Western Reserve University) and Paderborn University. The experimental results indicate that the proposed fault prognostic approach outperforms the comparison models in terms of generalization ability and speed-up properties.


Introduction
With the advent of large-scale manufacturing in modern industry, the prognostics and health management (PHM) of manufacturing equipment has become increasingly important. Bearings, a key component of industrial machines, play a significant role in the health status of the whole equipment, and their failure may directly result in total collapse. Therefore, accurate and effective prediction of bearing faults can not only save periodic maintenance costs but also improve the reliability of the whole equipment. Traditional fault prognostic approaches can be mainly categorized into three schemes: signal-based approaches, physical analysis-based approaches, and pattern recognition-based approaches. The signal-based approach, especially the vibration signal-based approach, is the most commonly used one in fault prediction for industrial mechanical components. By using time-domain, frequency-domain, and time-frequency analysis, the vibration-based fault prognostic approach can be very sensitive to machine fault symptoms. Hong and Dhupia [1] proposed a vibration-based fault prognostic model by analyzing the kurtosis of the strong impact circle of the vibration spectrum. Borghesani et al. [2] established a vibration-based fault prognostic model by analyzing the relationship between the kurtosis, the squared envelope spectrum, and cepstrum prewhitening. A novel band demodulation approach has also been proposed for the fault prognosis of rolling bearings. Apart from the vibration-based approach, temperature-based and oil analysis-based fault prognostic approaches can also be very effective [3][4][5]. The signal-based fault prognostic approach relies entirely on the understanding of the target monitoring signal, so its prediction accuracy is limited by prior domain expertise. Moreover, manual feature extraction and alarm threshold setting for different target signals are labor-intensive [6].
In addition to the signal-based approach, the physical analysis-based approach has also been studied in recent literature. It aims at establishing physical equations based on material characterization. Xu et al. [7] analyzed the degradation of aluminium-steel joints by studying the effects of the tool pin profile in underwater friction stir welding on the properties of the joint. Xu et al. [8] established a fatigue evaluation for composite materials based on the analysis of the dispersion wave characteristics of laminated composite nanoplates.
To overcome the above issues in the signal-based and physical analysis-based approaches, the pattern recognition-based approach, usually realized by a deep learning model, has been proposed for fault prognostic tasks. Deep learning models can replace manual feature extraction with their powerful ability to automatically learn representative features, and their deep nonlinear network structure can capture the nonlinear input-output mapping of complex systems [9][10][11]. As one of the most effective deep learning models, the convolutional neural network (CNN) has shown promising ability in hierarchical feature learning and intelligent fault prognosis [12][13][14][15][16][17][18]. CNN-based fault prognostic approaches have achieved comparatively higher accuracy than signal-based approaches; however, some points still need to be considered.
(1) It is assumed that the training datasets and the testing datasets are collected under the same operating conditions; however, in a real industrial environment, operating conditions such as the bearing rotating speed and the equipment load can vary across time segments. The performance of a traditional CNN-based fault prediction approach can degrade when the load condition varies. How to boost the generalization ability of the model remains a challenge.
(2) In the traditional CNN-based fault prediction approach, only the last feature layer, which is highly related to the specific task or dataset, is used for the fault prognostic task. However, some generalized characteristics contained in the low-level hidden layers are not well preserved in the high-level features. How to jointly use these multilevel features remains a problem.
Since the low-level features retained in the hidden layers are universal and similar across different but related datasets or tasks, multiscale hierarchical feature learning has been studied in recent literature [19][20][21][22][23]. Ding and He [20] combined the second max-pooling layer with the last convolution layer as the categorical feature image for spindle bearing fault diagnosis. Sun et al. [21] connected both the third and the fourth convolution layers to the last hidden layer of the CNN so that the generalization ability of the model could be enhanced. Lee and Nam [22] incorporated several low-level features with the extracted high-level feature and fed the concatenated feature vector into an SVM detector for prediction. In order to fully utilize the hierarchical features learned by the CNN model, Xu et al. [23] extracted the feature images of two pooling layers and one fully connected layer from the CNN model and fed them to an ensemble of three random forests for the final prediction.
Since these studies directly extract multiple feature layers from a traditional CNN and send them to the classifier for fault prognosis, it is questionable whether the traditional CNN has sufficient hierarchical feature learning ability and whether it is appropriate to use the multilevel features directly for practical fault classification problems. The following two points need to be further considered.
(1) In current literature, the multilevel and multiscale features are extracted from a traditional CNN such as the commonly used LeNet-5, but the network itself has limited hierarchical feature learning ability, which hinders the generalization ability of the model.
(2) In current literature, the extracted multilevel feature images are directly used for the fault classification task. Nevertheless, these extracted feature images contain redundant features that have little relationship to the prognostic task. These redundant features greatly increase the computation cost, and the highly related features may be concealed by them, thus reducing both the prognostic efficiency and the prognostic accuracy.
To deal with the above two issues, this paper takes full advantage of the powerful hierarchical feature learning ability of the U-net CNN and the discriminative feature selection ability of the attention learning network. The major contributions of this research are as follows. Considering the first issue listed above, an improved CNN based on the U-net structure is designed as the hierarchical feature extraction network; the powerful hierarchical feature learning ability of this structure has already been demonstrated in the medical imaging area. Considering the second issue listed above, an attention learning network built from several CBAM (convolutional block attention module)-based attention learning blocks is used for fault classification, with its discriminative feature selection mechanism eliminating the redundant features. The rest of this paper is organized as follows: Section 2 briefly reviews the related theory and the methodology used in this paper; Section 3 presents the overall flowchart and the technical details of the proposed fault prognostic method; Section 4 presents the experimental results, including the ablation study and the comparison experiment with other traditional prognostic approaches; finally, the conclusion and future work of this paper are presented in Section 5.

Related Theory and Methodology
2.1. Multiscale Feature Extraction and U-Net. As a typical deep learning model, the convolutional neural network can automatically learn structured and representative features from raw data through a layer-to-layer propagation scheme. Since the convolutional neural network can learn multiscale hierarchical features of raw data, research on making full use of the features in multiple layers of the CNN, which has been shown to yield better generalization ability [21], has attracted considerable attention. There are some famous CNN models such as LeNet-5 [24], AlexNet [25], VGGNet [26], GoogLeNet [27], and U-Net [28], among which the CNN model based on the U-Net structure has shown great advantages in hierarchical feature learning. U-Net has already been frequently applied to image classification, segmentation, detection, and tracking in the medical imaging and biochemical areas due to its powerful hierarchical feature learning ability [29]. Gao et al. [30] proposed an improved U-net-based image segmentation method for blood vessel segmentation. In order to combine complementary magnetic resonance image protocols to reconstruct high-quality images, Lei et al. [31] proposed a Dense-UNet to reconstruct the T2-weighted image (T2WI) using both the T1-weighted image (T1WI) and the undersampled T2WI. Nazem et al. [32] proposed an improved 3D version of the U-net model based on the dice loss function to accurately predict the binding sites of new proteins. Dogan et al. [33] proposed a two-phase hybrid approach combining Mask R-CNN and the 3D U-net for high-accuracy automatic segmentation of the pancreas in CT imaging. Chae et al. [34] proposed a residual U-Net combined with an attention learning module for the image segmentation of the pressure ulcer (PU) region.
To the best of our knowledge, this is the first time that the U-net is used as a feature extractor in the area of equipment fault prognosis. Normally, a U-net-based CNN consists of two parts, the max-pooling path on the left and the upconvolution path on the right, which jointly form the "U" structure shown in Figure 1. It usually consists of four kinds of operations, namely, convolution, max-pooling, transpose convolution, and skip connection.
2.1.1. Convolution Operation. The convolution layer consists of a series of feature maps obtained through the convolution operation between the convolution kernels and the input, as shown in Equation (1):
$$X_\alpha^{j} = f\Big(\sum_{\beta} X_\beta^{j-1} * W_{\alpha,\beta}^{j}\Big) \tag{1}$$
where $X_\alpha^{j}$ denotes the $\alpha$th output feature map of the $j$th layer; $X_\beta^{j-1}$ denotes the $\beta$th input feature map of the $(j-1)$th layer; $W_{\alpha,\beta}^{j}$ denotes the convolution kernel between the feature map $X_\alpha^{j}$ and the feature map $X_\beta^{j-1}$; and $f(\cdot)$ denotes the activation function. In order to increase the nonlinearity of the CNN, the rectified linear unit (ReLU) is adopted in this paper due to its excellent performance. The ReLU function can be expressed as shown in Equation (2):
$$f(x) = \max(0, x) \tag{2}$$
2.1.2. Max-Pooling Operation. In order to reduce the number of model parameters and alleviate the overfitting problem, the pooling operation is executed along with the convolution operation. Since the convolution kernels for the same feature map share the same weight and bias, a max-pooling layer is added to each convolution layer, producing lower-resolution feature maps through subsampling. The max-pooling function is defined in Equation (3), where $X_a^{S_1 \times S_2}$ and $X_a^{S_1' \times S_2'}$ denote the $S_1 \times S_2$ pixel in the $a$th feature map before and after the max-pooling operation. The parameter $\lambda$ denotes the stride of the pooling window, whose value should be larger than 1. The max-pooling operation decreases the size of the feature maps by retaining only the maximum response within each pooling window, which greatly reduces the number of parameters of the CNN model.
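As a minimal illustration of the convolution, ReLU, and max-pooling operations described above, the following NumPy sketch applies them to a single small feature map. The function names, the 3 × 3 kernel of ones, the explicit bias argument, and the 4 × 4 input are assumptions chosen for readability and are not taken from the proposed network; as in most CNN libraries, the "convolution" is computed as a cross-correlation.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x), element-wise."""
    return np.maximum(0.0, x)

def conv2d_valid(x, kernels, bias):
    """Unpadded cross-correlation followed by ReLU.

    x:       input feature maps, shape (C_in, H, W)
    kernels: shape (C_out, C_in, k, k)
    bias:    shape (C_out,)  (hypothetical argument, set to zero below)
    """
    c_out, _, k, _ = kernels.shape
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((c_out, h_out, w_out))
    for a in range(c_out):                      # output feature map index (alpha)
        for i in range(h_out):
            for j in range(w_out):
                patch = x[:, i:i + k, j:j + k]  # covers all input maps (beta)
                out[a, i, j] = np.sum(patch * kernels[a]) + bias[a]
    return relu(out)

def max_pool(x, size=2, stride=2):
    """Max-pooling applied independently to each map of a (C, H, W) array."""
    c, h, w = x.shape
    h_out, w_out = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((c, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            window = x[:, i * stride:i * stride + size, j * stride:j * stride + size]
            out[:, i, j] = window.max(axis=(1, 2))
    return out

x = np.arange(16, dtype=float).reshape(1, 4, 4)  # one 4x4 input feature map
kernels = np.ones((1, 1, 3, 3))                  # a single 3x3 kernel of ones
feature = conv2d_valid(x, kernels, bias=np.zeros(1))
pooled = max_pool(feature, size=2, stride=2)
```

Here the 4 × 4 input shrinks to a 2 × 2 feature map after the unpadded 3 × 3 convolution and to a single value after 2 × 2 pooling, which is the resolution reduction that the max-pooling path of the U-net relies on.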

2.1.3. Transpose Convolution. In order to obtain a feature image with the same size as the input image, the transpose convolution operation is applied following the max-pooling process. During the transpose convolution process, domain interpolation is the most commonly used technique, as shown in Equation (4), where $X_a^{i \times i}$ denotes the $i \times i$ pixel value in the $a$th feature map before the transpose convolution operation and $X_a^{I \times I}$ denotes the $I \times I$ pixel value in the $a$th feature image after the transpose convolution operation; the parameter $S$ denotes the stride of the transpose convolution, and the parameter $p$ denotes the zero padding. The kernel size of the transpose convolution is $k \times k$.
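The relation between the sizes $i$ and $I$ can be checked with a short sketch. It assumes the standard output-size relation for a transposed convolution without output padding or dilation, $I = S(i - 1) + k - 2p$; this relation and the helper names are assumptions for illustration rather than a quotation of Equation (4), and the 64 × 64 input size is hypothetical.

```python
def max_pool_output_size(i, window, stride):
    """Side length of a feature map after max-pooling without padding."""
    return (i - window) // stride + 1

def transposed_conv_output_size(i, k, s, p):
    """Side length after a transposed convolution (no output padding, no dilation)."""
    return s * (i - 1) + k - 2 * p

size = 64                                                # hypothetical input side length
down = max_pool_output_size(size, window=2, stride=2)   # encoder step halves the size
up = transposed_conv_output_size(down, k=2, s=2, p=0)   # decoder step restores it
```

With a 2 × 2 kernel and stride 2, one transposed convolution exactly undoes the size reduction of one 2 × 2 max-pooling step, which is what allows symmetric encoder and decoder feature maps to be fused.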
2.1.4. Skip Connection. The U-net is a typical encoding-decoding structure. The encoding process is realized by the max-pooling operation, while the decoding process is realized by the transpose convolution operation. In order to compensate for the information loss during the max-pooling process, the U-net uses a concatenation layer to fuse the two symmetrical feature images located in the max-pooling and transpose convolution paths, respectively, which is called a skip connection. The skip connection enhances the hierarchical feature learning ability of the U-net by restoring the resolution information lost during pooling.

2.2. Attention Learning and CBAM. Attention learning is inspired by cognitive neuroscience. When dealing with a certain task, people pay more attention to the important issues and less attention to the unimportant ones. Based on this notion, the attention mechanism was first proposed by Treisman and Gelade in the 1980s [35]. The attention mechanism aims at assigning different weights to different parts of the input according to their contribution to the output. It has already been successfully applied to natural language processing, machine translation, pattern recognition, and large equipment maintenance due to its powerful ability to extract discriminative features [36].
Chen et al. [37] proposed an attention-based deep learning framework for machine RUL prediction. In their paper, the proposed approach first exploits an LSTM network to learn representative sequential features from raw sensory data; the attention learning network is then used to learn the importance of the sequential features and assign larger weights to the more important ones. Chen et al. [38] applied a spatial-temporal convolutional neural network with a convolutional block attention module to microexpression recognition. First, image sequences were input to a medium-sized convolutional neural network (CNN) to extract visual features. Afterwards, the network learned to allocate the feature weights adaptively with the help of a convolutional block attention module. Since microexpressions only occur in parts of the human face, the attention mechanism helps to focus on specific facial regions and to acquire the important features. Xiong et al. [39] proposed an attention augmented multiscale network (AAMN) for single-image super-resolution (SISR), employing an attention-driven strategy to guide feature selection and aggregation among multiple branches. Leng et al. [40] proposed a context-aware attention network combining a context learning module and an attention transfer module. The context learning module is first used to capture the global contexts. Then, the attention transfer module generates attention maps that contain different attention regions, which benefits the extraction of discriminative features.
Currently, the two most commonly used attention learning mechanisms are SENet (squeeze-and-excitation network) and CBAM (convolutional block attention module) [37]. SENet applies the attention module to the channel dimension only, while CBAM applies the attention module to both the channel dimension and the spatial dimension of the image.
The idea of the CBAM attention mechanism was first proposed by Woo et al. [36]. The CBAM consists of a channel attention process and a spatial attention process, as shown in Figure 2. The overall channel-spatial process of the CBAM is given in Equation (5):
$$M' = M_c(M) \otimes M, \qquad M'' = M_s(M') \otimes M' \tag{5}$$
where $M \in \mathbb{R}^{C \times H \times W}$ represents the input image of the CBAM module with channel number $C$, height $H$, and width $W$; $M_c$ and $M_s$ denote the channel attention map and the spatial attention map, respectively; the mark $\otimes$ represents element-wise multiplication; $M'$ represents the feature image multiplied by the channel attention map; and $M''$ represents the result of multiplying $M'$ by the spatial attention map, which is regarded as the output of the CBAM module.

2.2.1. Channel Attention Process. Usually, the input image is transformed into a feature matrix by the convolutional layer. The channel number of the obtained feature matrix is the same as the kernel number of the convolutional layer, with a common value of 256 or 512. Since some channels contribute little to the information transfer, it is necessary to apply channel attention to these channels. The attention weighting process is illustrated in Equation (6):
$$M_c(M) = \sigma\big(W_1(W_0(F_{\mathrm{avg}}^{C})) + W_1(W_0(F_{\mathrm{max}}^{C}))\big) \tag{6}$$
where $M_c(M)$ denotes the channel attention map of the input $M$; $F_{\mathrm{avg}}^{C}$ and $F_{\mathrm{max}}^{C}$ denote the results of the average pooling operation and the max pooling operation applied to each channel of the feature matrix; $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ denote the weights of the shared multilayer perceptron, whose hidden layer has the size $\mathbb{R}^{C/r \times 1 \times 1}$ and uses the rectified linear unit (ReLU) as the activation function; $r$ denotes the compression ratio; and $\sigma$ denotes the sigmoid activation.
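The channel attention weighting can be sketched in a few lines of NumPy, following the formulation of the original CBAM paper: each channel is summarized by its average and its maximum, both descriptors pass through the same two-layer perceptron, and the summed outputs are squashed by a sigmoid. The tensor sizes, the random weights, and the function name are assumptions for illustration only; in the actual network $W_0$ and $W_1$ would be learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(m, w0, w1):
    """Channel attention weights for a (C, H, W) feature array.

    w0 has shape (C // r, C) and w1 has shape (C, C // r); both are shared
    between the average-pooled and the max-pooled channel descriptors.
    """
    f_avg = m.mean(axis=(1, 2))            # (C,) average over spatial positions
    f_max = m.max(axis=(1, 2))             # (C,) maximum over spatial positions

    def shared_mlp(v):
        hidden = np.maximum(0.0, w0 @ v)   # ReLU on the C / r hidden units
        return w1 @ hidden

    return sigmoid(shared_mlp(f_avg) + shared_mlp(f_max))   # (C,) weights

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                    # illustrative sizes, not from the paper
m = rng.normal(size=(C, H, W))
w0 = rng.normal(size=(C // r, C))          # random stand-ins for learned weights
w1 = rng.normal(size=(C, C // r))

weights = channel_attention(m, w0, w1)
m_prime = weights[:, None, None] * m       # scale every channel by its weight
```

Each channel of the input is multiplied by one scalar weight between 0 and 1, so uninformative channels can be suppressed without changing the shape of the feature array.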

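2.2.2. Spatial Attention Process. Similar to the channel attention process, the spatial attention process aims at applying importance weighting to the spatial dimension of the feature matrix, as shown in Equation (7):
$$M_s(M') = \sigma\big(f^{R \times R}([\mathrm{AvgPool}(M'); \mathrm{MaxPool}(M')])\big) \tag{7}$$
where $M_s(M')$ denotes the spatial attention map of the channel-refined feature $M'$, and the average pooling and the max pooling are again applied for the information evaluation, here across the channel dimension. The parameter $f^{R \times R}$ denotes the convolutional layer with the kernel size of $R \times R$, and the spatial attention weighting is finally normalized by the sigmoid activation $\sigma$.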
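A corresponding NumPy sketch of the spatial attention step is given here, again following the original CBAM formulation: the feature array is pooled across channels, the two pooled maps are stacked, convolved with an $R \times R$ kernel under zero padding, and passed through a sigmoid. The sizes, the random kernel, and the function name are illustrative assumptions; $R = 7$ is the kernel size used in the original CBAM paper and is not necessarily the value used in this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(m, kernel):
    """Spatial attention weights for a (C, H, W) array.

    kernel has shape (2, R, R) with R odd; one slice per pooled map.
    """
    pooled = np.stack([m.mean(axis=0), m.max(axis=0)])  # (2, H, W), pooled over channels
    r = kernel.shape[-1]
    pad = r // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding keeps H x W
    _, h, w = pooled.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + r, j:j + r] * kernel)
    return sigmoid(out)                                 # (H, W) weights

rng = np.random.default_rng(0)
m_prime = rng.normal(size=(8, 6, 6))        # stands in for the channel-refined feature
kernel = rng.normal(size=(2, 7, 7)) * 0.1   # random stand-in for the learned kernel
attn = spatial_attention(m_prime, kernel)
m_out = attn[None, :, :] * m_prime          # every channel shares the same spatial map
```

The spatial map has one weight per pixel position and is broadcast over all channels, complementing the per-channel weights of the channel attention step.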

2.3. Proposed Combination Model Based on U-Net and CBAM Mechanism. Although the hierarchical feature extraction network can provide multilevel characteristics of the input image, it also considerably expands the amount of feature data to be processed. Therefore, it is necessary to use the attention learning network to capture the sensitive parts of these feature images and eliminate the redundant parts. This paper proposes a hybrid model based on the U-net and CBAM-based attention learning blocks, comprising the hierarchical feature extraction of the U-net, the attention learning of the CBAM blocks, and the combination of the two. The overall framework is illustrated in Figure 3. Firstly, the one-dimensional time series signal is converted into a two-dimensional gray-scale image, which is then decomposed hierarchically by the U-net into several multilevel feature images representing the hierarchical characteristics of the input signal.
Secondly, multiple CBAM attention learning blocks are used to optimize the decomposed features, selecting the fault-sensitive features from the redundant ones. The complexity of the hierarchical feature images is greatly reduced, thus improving the prediction efficiency.

Journal of Sensors
Finally, the outputs of the CBAM attention learning blocks are aggregated, and a second CBAM block is applied to the categorical feature image. The categorical feature optimized by the CBAM attention learning is sent to the Softmax layer for the final fault prognosis, as shown in Equation (8), where $C^{(i)}$ denotes the optimized categorical feature image used for fault prognosis; $i = 1, 2, \ldots, n$ indexes the training samples; $j = 1, 2, \ldots, k$ indexes the dimensions of the output layer, whose number $k$ is equal to the number of fault types; and $\theta$ denotes the parameters of the Softmax layer.
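The Softmax prediction step can be sketched as follows. The sketch assumes the usual softmax regression form, in which each fault type has one weight vector and the class probabilities are the normalized exponentials of the resulting scores; the function names, the two-dimensional feature, and the three weight rows are hypothetical values chosen only to make the example checkable by hand.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(feature, theta):
    """Return (predicted class index, class probabilities).

    feature: flattened categorical feature, length d
    theta:   k rows of d weights, one row per fault type
    """
    scores = [sum(w * x for w, x in zip(row, feature)) for row in theta]
    probs = softmax(scores)
    return probs.index(max(probs)), probs

feature = [1.0, 2.0]                               # hypothetical flattened feature
theta = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]       # hypothetical weights, k = 3
label, probs = predict(feature, theta)
```

The class scores here are 1, 2, and 0, so the second fault type receives the largest probability and is returned as the prediction.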

Proposed Faulty Prognostic Procedure
3.1. Data Preprocessing. Generally speaking, the condition monitoring data collected from front-end industrial equipment includes one-dimensional time series data and two-dimensional image data. The 2D image data can be used directly for the fault prognostic task by using pattern recognition techniques. In this paper, we use the "N × N" signal-to-image conversion technique proposed in [41] to convert the 1D time series vibration signal into a 2D image; the converted image is used as the input of the U-net convolutional neural network. The schematic diagram of the N × N signal-to-image conversion process is illustrated in Figure 4. First, we randomly choose N signal segments from the raw signal, each containing N sampling points. Since the pixel value of a gray-scale image cannot exceed 255, the selected $N^2$ sampling points are normalized into the range 0~255 by using Equation (9), and the N × N signal matrix is constructed. Finally, the normalized values of the signal matrix are used as the pixel values of the gray-scale image.
$$\mathrm{Pixel}(i, j) = \mathrm{round}\left(\frac{\mathrm{value}(i, j) - \min(\mathrm{value})}{\max(\mathrm{value}) - \min(\mathrm{value})} \times 255\right) \tag{9}$$
In Equation (9), the function round(·) transforms the normalized sampling value into an integer gray-scale pixel value. Pixel(i, j) denotes the converted pixel value of the corresponding signal value value(i, j), min(value) denotes the minimum of the selected $N^2$ sampling points, and max(value) denotes their maximum. The signal-to-image conversion method used in this paper is simple, and it has been shown to be effective in [41] because it requires little domain expertise or signal processing knowledge. The converted gray-scale image is a 2D representation of the raw signal that effectively retains the details and characteristics of the raw signal.
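The conversion can be sketched in plain Python as follows: N segments of N points are drawn at random positions, and all $N^2$ values are rescaled to integer gray levels using the minimum and maximum of the selected points. The function name, the `seed` argument, the guard for a constant signal, and the synthetic two-tone test signal are assumptions for illustration; real vibration records would replace the synthetic signal.

```python
import math
import random

def signal_to_image(signal, n, seed=None):
    """Build an n x n gray-scale image from n randomly located segments of n points."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        start = rng.randrange(0, len(signal) - n + 1)   # random segment start
        rows.append(signal[start:start + n])
    lo = min(min(row) for row in rows)                  # minimum of the n*n points
    hi = max(max(row) for row in rows)                  # maximum of the n*n points
    span = (hi - lo) or 1.0                             # guard against a constant signal
    # Python's round() sends exact halves to the nearest even integer.
    return [[round((v - lo) / span * 255) for v in row] for row in rows]

# Synthetic stand-in for a vibration record (hypothetical, for illustration only).
signal = [math.sin(0.05 * t) + 0.3 * math.sin(0.9 * t) for t in range(5000)]
image = signal_to_image(signal, n=64, seed=0)
```

With n = 64 each image consumes 64 × 64 = 4096 sampling points, and the smallest and largest selected values always map to the gray levels 0 and 255.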

3.2. Proposed Feature Extraction Network and Attention Learning Block
3.2.1. Proposed U-net-Based Feature Extraction Network. In this paper, a U-net-based convolutional neural network is designed as the hierarchical feature extraction network. The whole feature extraction network consists of 10 layers, namely, X1~X10, among which the feature images of X1~X4 belong to the max-pooling process of the U-net, while the feature images of X5~X10 belong to the upconvolution process of the U-net, as shown in Figure 5.
Since the feature layers of the transpose convolution process of the U-net contain less outside noise and can better represent the hierarchical characteristics of the input data, the feature layers X6, X8, and X10 from the low, middle, and high levels, respectively, are used as the extracted hierarchical features. They represent the global and specific characteristics of the different health conditions and thus contribute different knowledge to the feature extraction task.

3.2.2. Proposed ResNet-CBAM Attention Learning Block. In this paper, the designed CBAM attention learning network is combined with a three-layer ResNet CNN, as shown in Figure 6. First, the ResNet-based CNN is used to extract the spatial and channel features of the input feature images. Then, the CBAM attention learning block adaptively weights the channel dimensions and the spatial dimensions of the input images. The advantage of the proposed ResNet-CBAM attention learning block is that feature loss and vanishing gradients are avoided before the input images are processed by the CBAM module.

3.3. Proposed Prognostic Procedure. The proposed prognostic procedure is illustrated in Figure 7. First, the one-dimensional time series data is converted into a two-dimensional gray-scale image by using the "N × N" image conversion approach. Second, the U-net-based hierarchical feature extraction network is applied, and the multilevel feature images of X6, X8, and X10 are extracted as the multiple inputs of the attention learning network. Third, the three designed ResNet-CBAM-based attention learning blocks are applied to the three extracted multilevel feature images, which are then fused by reshaping them to the same size and concatenating them along the channel dimension. Finally, the concatenated categorical feature image is optimized by the second ResNet-CBAM attention learning block, and the final fault prognostic result is calculated through the Softmax prediction. The novel Pareto-optimal strategy based on spatial game theory proposed by Wong [42][43] is used as the parameter optimization strategy of the proposed hybrid model.

Algorithm 1: The general procedure of the proposed methodology.
Step 5: Evaluate the proposed methodology. Evaluate the performance of the proposed methodology on the testing datasets Xt and output the testing accuracy of the proposed approach.

3.4. Performance Metrics. In order to evaluate the prediction accuracy as well as the prediction efficiency of the proposed approach, the "accuracy" metric, the "accuracy gain" metric, and the "average accuracy gain" metric are used in this paper.
Equation (10) defines the "accuracy" function, which has been widely used in the accuracy evaluation of classification problems, including the fault classification task considered in this paper,
where $m$ denotes the number of training or testing samples per epoch; $f(\hat{x}_i)$ denotes the prognostic value obtained by the model; and $y_i$ denotes the true label.
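A minimal sketch of the accuracy computation follows, assuming the common definition in which accuracy is the fraction of the $m$ samples whose predicted label equals the true label. The function name and the five example labels are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of samples whose predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    m = len(actual)
    return sum(1 for p, y in zip(predicted, actual) if p == y) / m

# Five hypothetical samples, one per health condition; the fourth is misclassified.
acc = accuracy([0, 1, 2, 2, 4], [0, 1, 2, 3, 4])
```

Four of the five predictions match the true labels, so the example evaluates to an accuracy of 0.8.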
Equation (11) defines the accuracy gain (AG) and the average accuracy gain (AAG), which have been frequently used to evaluate the speed-up properties of a prediction model [44].
where the $\mathrm{ACC}_{\mathrm{Model1}}$ [...]
The loss function is defined as shown in Equation (12), where $I(\cdot)$ denotes the indicator function and $N$ denotes the number of the training samples.

Methodology Evaluation
In order to evaluate the effectiveness of the proposed approach, two case studies are conducted with two bearing datasets, collected from a Reliance Electric motor and an electromechanical drive system, respectively. The experimental environment of this paper is Intel Xeon 5238 CPU@2.  [45]. The vibration signal data is collected from the drive end of a 2-hp Reliance Electric motor as shown in Figure 8.
The accelerometers are installed on the inner race, the ball, and the outer race, respectively. In this case study, only the data collected from the inner race are analyzed. The vibration data is sampled at a frequency of 12 kHz under rotating speeds of 1730 rpm, 1750 rpm, 1772 rpm, and 1797 rpm. The inner race has five statuses in total: one normal status and four fault severity statuses with fault diameters of 0.007, 0.014, 0.021, and 0.028 inches, respectively. Therefore, five operating statuses are included in the datasets.
In this experiment, two datasets are generated, each consisting of a training set and a testing set. In dataset I, for each health condition, 100 samples with 4096 data points each are randomly selected under each load condition for the training set. That is, there are 400 samples of a single health condition across the load conditions 0, 1, 2, and 3, and therefore 2000 samples of the five health conditions altogether. Meanwhile, 2000 samples are randomly selected in the same way for the testing set. In dataset II, the training and testing samples are selected under different loads: 1500 samples of the five operating statuses are randomly selected under the load conditions 0, 1, and 2 as the training set, while the testing set consists of 500 samples of the five operating statuses under the load condition 3. More details of the two datasets, namely, dataset I and dataset II, are listed in Table 1. The converted gray-scale images are illustrated in Figure 9. It can be seen that there are differences distinguishable by the naked eye among these converted gray-scale images, which makes them suitable as the input of the U-net.
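The composition of the two datasets can be reproduced with a short bookkeeping sketch. It assumes 100 samples per health condition and load in every split, which is consistent with the totals stated above; the names and the placeholder record are hypothetical, and no real CWRU data is loaded.

```python
import random

CONDITIONS = ["normal", "0.007", "0.014", "0.021", "0.028"]  # five health conditions
SAMPLE_LENGTH = 4096                                         # data points per sample
SAMPLES_PER_CONDITION_AND_LOAD = 100

def build_split(loads):
    """List the (condition, load) label of every sample in a split."""
    return [(condition, load)
            for condition in CONDITIONS
            for load in loads
            for _ in range(SAMPLES_PER_CONDITION_AND_LOAD)]

def draw_sample(record, rng):
    """Cut one randomly located sample of SAMPLE_LENGTH points from a record."""
    start = rng.randrange(0, len(record) - SAMPLE_LENGTH + 1)
    return record[start:start + SAMPLE_LENGTH]

dataset_1_train = build_split(loads=[0, 1, 2, 3])   # same loads for training and testing
dataset_1_test = build_split(loads=[0, 1, 2, 3])
dataset_2_train = build_split(loads=[0, 1, 2])      # training on loads 0 to 2
dataset_2_test = build_split(loads=[3])             # testing on the unseen load 3

record = list(range(120000))                        # placeholder for one vibration record
sample = draw_sample(record, random.Random(0))
```

The split sizes come out as 2000, 2000, 1500, and 500 samples, and each sample of 4096 points corresponds to one 64 × 64 gray-scale image.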
In order to demonstrate the generalization ability and the fault sensitivity of the proposed U-net hierarchical feature extractor, t-distributed stochastic neighbor embedding (t-SNE), a technique that visualizes high-dimensional data by giving each data point a location in a two- or three-dimensional map [46], is used here for the visual evaluation of the U-net hierarchical feature extractor. As shown in Figures 10(a)-10(f), the two-dimensional visualizations of the feature images X6, X8, and X10 are illustrated for the test set of dataset I (loads 0~3) and the test set of dataset II (load 3), where different colors represent different health conditions. Firstly, it can be seen that the vast majority of the samples belonging to the same health condition are well clustered, while samples of different health conditions are well separated. Therefore, it can be concluded that the multilevel features extracted by the U-net feature extractor are very sensitive to the fault information contained in the gray-scale image. Comparing Figures 10(a)-10(f), it is worth mentioning that the majority of samples belonging to the same health condition are well clustered in the test sets of both datasets, and there is no obvious difference in terms of the classification result. Since the operating conditions of the training and testing sets are the same in dataset I but different in dataset II, this further indicates that the U-net-based CNN has a powerful generalized feature extraction ability that is little influenced by load variation.
In addition, the two-dimensional visualizations of the extracted multilevel features X6, X8, and X10 differ from each other, indicating that the different feature levels contribute different knowledge to the fault prognostic task. Therefore, it can be concluded that the U-net-based CNN has a powerful hierarchical feature learning ability that represents the information of the different health conditions from multiple aspects.
The representative feature images of X6, X8, and X10 are visualized in Figure 11. It can be seen that the three extracted multilevel feature images are well distinguished from each other under the five health statuses of the testing set of dataset I, indicating that the proposed U-net hierarchical feature extractor is sensitive to the fault information contained in the gray-scale feature image.
The hierarchical features extracted in layers X6, X8, and X10 are sent separately to the designed ResNet-CBAM attention learning block. The ResNet-CBAM attention learning network is thus applied twice: once on the multilevel feature images and once on the (8 * 8) concatenated categorical feature image. The attention learning result for the health condition with fault diameter 0.007 under load 0 is visualized in Figures 12(a)-12(d). There is an obvious discriminative concentration on the extracted multilevel feature images and on the concatenated categorical feature image; the attention block assigns larger weights to the important features, which promotes both prognostic efficiency and prognostic accuracy. It is therefore necessary to apply the CBAM attention learning block not only on the extracted multilevel features X6, X8, and X10 but also on the concatenated categorical feature used for fault prognosis.
The optimized categorical feature image is sent to the Softmax layer for the final fault prognosis. The maximum number of epochs is set to 60, and the average accuracy of the last 10 epochs, from the 50th to the 60th epoch, is defined as the final convergence accuracy (FCA) in this paper; the optimizer is Adam with a learning rate of 0.005. The training and validation accuracy curves for the two datasets are shown in Figure 13. In dataset I, both the training and testing accuracy reach almost 100% by the 60th epoch. In dataset II, the final convergence accuracy on the training set also reaches nearly 100%, and the testing accuracy reaches nearly 93%, which is also comparatively high. Since the training and testing sets are collected under the same loads in dataset I but under different loads in dataset II, this indicates that the proposed fault prediction approach achieves high prognostic accuracy as well as good generalization ability.
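The FCA definition above can be sketched as a small helper. This is only an illustration: the function name, the choice to read "the last 10 epochs" as epochs 51 to 60, and the accuracy values are assumptions, not taken from the paper's code or results.

```python
from statistics import mean


def final_convergence_accuracy(acc_per_epoch, window=10):
    """Mean accuracy over the last `window` epochs of a training run.

    Assumption: "the last 10 epochs" of a 60-epoch run is read as
    epochs 51..60, i.e. the final 10 entries of the accuracy curve.
    """
    if len(acc_per_epoch) < window:
        raise ValueError("accuracy curve is shorter than the averaging window")
    return mean(acc_per_epoch[-window:])


# Hypothetical 60-epoch testing-accuracy curve (made-up numbers):
# a linear ramp for 50 epochs followed by a plateau at 0.93.
curve = [0.50 + 0.008 * e for e in range(50)] + [0.93] * 10
fca = final_convergence_accuracy(curve)
```

Averaging over a window rather than reading the accuracy at the 60th epoch alone makes the reported value less sensitive to epoch-to-epoch fluctuation.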

Ablation Experiment.
To evaluate how much the attention mechanism improves the speed-up property of the proposed fault prognostic framework, an ablation experiment with different combinations of the U-net and the attention learning mechanism is carried out on the two datasets of the case study. Specifically, we implement the proposed approach together with three ablation models: U-net+Softmax (US), U-net+categorical attention+Softmax (UCAS), and U-net+multi-scale attention+Softmax (UMAS). The US model, which has no attention learning process, is used as the benchmark, and the accuracy gain and the average accuracy gain are adopted as metrics for evaluating the speed-up properties promoted by the attention learning network. As shown in Figure 14, the proposed model, which applies attention learning twice, significantly outperforms the US model in terms of the testing accuracy gain on both datasets, especially in the first 30 epochs, which is very important for the real-time requirement of practical industry during the infant stage. Moreover, the UCAS and UMAS ablation models, which apply attention learning only once, on the categorical feature and on the multiscale features, respectively, also show a certain accuracy gain over the US model, indicating that introducing the attention learning mechanism is effective in promoting prediction efficiency. The ablation experiment is executed 10 times, and the mean values of the final convergence accuracy (FCA) and the average accuracy gain (AAG) are listed in Table 3, where the proposed approach outperforms the other three ablation models in both metrics.
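As a rough illustration of the two speed-up metrics, the sketch below assumes that the accuracy gain is the per-epoch difference between a model's testing accuracy and that of the US benchmark, and that the AAG is the mean of these differences over the epochs. Both definitions, the function names, and the accuracy values are assumptions made for illustration only.

```python
from statistics import mean


def accuracy_gain(model_acc, benchmark_acc):
    """Per-epoch accuracy gain of a model over the benchmark (assumed definition)."""
    if len(model_acc) != len(benchmark_acc):
        raise ValueError("both curves must cover the same epochs")
    return [m - b for m, b in zip(model_acc, benchmark_acc)]


def average_accuracy_gain(model_acc, benchmark_acc):
    """Mean of the per-epoch accuracy gains (assumed definition of AAG)."""
    return mean(accuracy_gain(model_acc, benchmark_acc))


# Hypothetical 4-epoch testing-accuracy curves (made-up numbers).
us_curve = [0.40, 0.60, 0.80, 0.90]        # benchmark without attention
proposed_curve = [0.60, 0.75, 0.85, 0.90]  # model with attention

gains = accuracy_gain(proposed_curve, us_curve)
aag = average_accuracy_gain(proposed_curve, us_curve)
```

Under this reading, a model that converges faster than the benchmark shows large positive gains in the early epochs that shrink toward zero once both models have converged.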

Comparison Experiment.
To further evaluate the speed-up properties of the attention learning network and the generalization ability of the U-net CNN-based hierarchical feature extractor, the comparison analysis includes the proposed approach, the three ablation models, and several hybrid prediction models based on the hierarchical feature extractor of the classical LeNet-5 CNN, namely, LeNet-5+Random forest (L-RF), LeNet-5+SVM (L-SVM), and LeNet-5+Softmax (LS). As in the ablation experiment, the U-net+Softmax (US) model is set as the benchmark. The accuracy gain curves of the hybrid prediction approaches are shown in Figure 15: the approaches with the attention learning mechanism have a higher accuracy gain than the US model, while the models without it have a lower accuracy gain, indicating that attention learning promotes prognostic efficiency.
The comparison experiments are conducted 10 times on both datasets, in the same way as the ablation experiment. Table 4 shows that the proposed approach achieves the highest final convergence accuracy and the highest average accuracy gain on the testing results of both datasets. Moreover, the models with the U-net feature extractor network significantly outperform the traditional LeNet-5 CNN-based models, especially in the final convergence accuracy on dataset II when compared with the performance on dataset I. Therefore, it can be concluded that the models with the [...]
[...] [47]. The testing rig, illustrated in Figure 16, consists of an electric motor (1), a torque-measurement shaft (2), a rolling bearing test module (3), a flywheel (4), and a load motor (5). The experiment uses the motor current signal of the electromechanical drive system for bearing diagnostics, collected under four operating conditions with different operating parameter settings, as shown in Table 5. The electromechanical drive system has four different statuses in total, namely, inner-ring damage, outer-ring damage, combined damage, and the healthy status. All samples, each with 4096 data sampling points, are randomly selected from the condition monitoring data. Different from the [...] Figure 17. It can be seen that the images corresponding to different health conditions also differ from each other and should be easy to classify, which further supports the effectiveness of the "N * N" signal-to-image conversion method.
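A minimal sketch of an "N * N" signal-to-image conversion for a 4096-point sample is given below, so that N = 64. The min-max scaling to integer gray levels 0 to 255 and the row-by-row filling order are assumptions about the conversion, and the sine wave is only a synthetic stand-in for a measured motor current segment.

```python
import math


def signal_to_image(segment):
    """Reshape a 1-D segment of N*N points into an N x N gray-scale image.

    Assumptions: values are min-max scaled to integer gray levels 0..255
    and pixels are filled row by row.
    """
    n = math.isqrt(len(segment))
    if n * n != len(segment):
        raise ValueError("segment length must be a perfect square")
    lo, hi = min(segment), max(segment)
    span = hi - lo
    if span == 0:
        pixels = [0] * len(segment)  # a flat signal maps to a black image
    else:
        pixels = [round((x - lo) / span * 255) for x in segment]
    return [pixels[r * n:(r + 1) * n] for r in range(n)]


# Hypothetical 4096-point segment (synthetic sine wave, not real data).
segment = [math.sin(2 * math.pi * 5 * i / 4096) for i in range(4096)]
image = signal_to_image(segment)  # 64 rows of 64 gray levels
```

Because the conversion only rescales and reshapes the raw samples, it needs no hand-crafted signal-processing features, which is the property the text attributes to the method.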
As in case study one, the converted gray-scale images are used as the input of the U-net feature extractor, and the multilayer features extracted by the U-net are used as the input of the attention learning network for fault classification, with a maximum of 60 epochs. The prediction result is shown in Figure 18. The training and testing accuracy at the 60th epoch reach nearly 100% on both datasets, which is higher than in case study one. A likely reason is that the classification task in case study one is confined to faults of a single component (the inner race), whereas case study two covers different components, whose fault symptoms are more distinguishable. Figure 19 shows the effectiveness of the attention learning network: the proposed approach, UCAS, and UMAS have an obvious accuracy advantage over U-net+Softmax within the same epoch range during the infant stage, indicating that the attention learning mechanism is also effective in case study two. The mean values of the average accuracy gain and the final convergence accuracy are listed in Table 7, where the proposed approach outperforms the other three ablation models in both metrics. Figure 20 and Table 8 show the accuracy gain curves, the mean final convergence accuracy, and the mean average accuracy gain of the proposed approach, the three ablation models, and the traditional hybrid prediction models based on the LeNet-5 hierarchical feature extractor network. The models with the U-net feature extractor have better generalization ability, and the models with the attention learning mechanism have better speed-up properties, especially during the infant stage, showing the great potential of the U-net, the attention learning network, and their proposed combination.

Conclusion and Future Work
5.1. Main Contribution of the Proposed Paper. In this paper, a novel bearing fault prediction approach based on the U-net-based hierarchical feature extractor network and the ResNet-CBAM-based attention learning network is proposed. The main contributions of this paper can be summarized as follows:
(1) Introducing the N * N "signal to image" conversion approach, which is simple but effective and relaxes the dependence on domain expertise in signal processing
(2) Proposing a U-net CNN-based multilevel feature extractor network with powerful generalized and hierarchical feature extraction ability. The extracted multilevel features can distinguish the different health conditions under complex operational conditions and represent them from multiple aspects, contributing different knowledge to the prognostic tasks
Figure 19: The accuracy gain of the three ablation models with attention learning mechanism: (a) testing accuracy gain of dataset I; (b) testing accuracy gain of dataset II.
The proposed approach is validated on two case studies, using the data offered by CWRU (Case Western Reserve University) and Paderborn University. Both case studies demonstrate the generalization ability of the U-net and the speed-up properties of the attention learning network. Moreover, the ablation experiment and the comparison experiment further support the effectiveness of the proposed combination of the U-net and the attention learning network.

5.2. Future Work of the Proposed Paper. Although the proposed approach has made some achievements, two items still need to be considered. First, the complexity of the U-net-based hierarchical feature learning network and of the attention learning network should be taken into account; in the future, the parameter scale of the proposed approach should be reduced so that the model can be deployed on edge-computing devices. Second, the proposed bearing fault classification approach is expected to be extended to fault classification in other similar prognostic scenarios, such as gearboxes, milling equipment, and gas pump systems.
Figure 20: The accuracy gain of the three ablation models with attention learning mechanism and the LeNet-5 based traditional hybrid models used for comparison: (a) testing accuracy gain of dataset I; (b) testing accuracy gain of dataset II.

Data Availability
The datasets used to support the findings of this paper are available from CWRU (Case Western Reserve University) at "https://csegroups.case.edu/bearingdatacenter/pages/12k-drive-end-bearing-fault-data" and from Paderborn University at "http://groups.uni-paderborn.de/kat/BearingDataCenter/".

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.