Bearing Fault Diagnosis Based on Multiscale Convolutional Neural Network Using Data Augmentation

Bearings are among the most important parts of a rotating machine. Bearing failure can lead to mechanical failure, financial loss, and even personal injury. In recent years, various deep learning techniques have been used to diagnose bearing faults in rotating machines. However, deep learning requires large amounts of data, and insufficient or imbalanced data degrades its performance. To address this problem, we used data augmentation techniques. In addition, the Convolutional Neural Network (CNN), one of the deep learning models, can perform feature learning without prior knowledge. However, since conventional CNN-based fault diagnosis can only extract single-scale features, useful information may be lost and domain shift problems may occur. In this paper, we propose a Multiscale Convolutional Neural Network (MSCNN) to extract more powerful and discriminative features from raw signals. Through multiscale convolution operations, MSCNN can learn more powerful feature representations than a conventional CNN while reducing the number of parameters and the training time. The proposed model achieved better results than 2D-CNN and 1D-CNN, which validates its effectiveness.


Introduction
The development of the IoT and industrial applications is rapidly improving the intelligence of equipment in modern industry. As a result, mechanical equipment is becoming increasingly sophisticated and complex. Machinery failure can cause significant financial loss as well as human casualties. The rotating machine is one of the most widely used machines in industry [1], and rolling bearings are essential components of rotating machines [2,3]. Therefore, diagnosing bearing failure is very important.
Recently, data-based fault diagnosis [4][5][6] has drawn considerable attention from researchers owing to advances in computers and GPUs. Traditional model-based diagnostic methods [7][8][9][10][11] are not efficient for learning nonlinear data. In addition, in the feature extraction step [12][13][14][15], the results vary greatly with the skill of the expert. Machine learning methods such as Support Vector Machine (SVM) [16,17], Principal Component Analysis (PCA) [18,19], and artificial neural network (ANN) [20] have been used frequently. However, traditional machine learning [21][22][23] also has difficulty handling complex data.
On the other hand, data-driven diagnostics can effectively and accurately express the characteristics of big data or complex input data. With the advent of deep learning, it has become possible to train neural networks with many successive layers. Deep learning [24,25] is widely applied in various fields such as image processing and image generation. A deep neural network (DNN) [26,27] is composed of many layers and can automatically extract deep features. Jia et al. [28] performed rolling bearing fault diagnosis with a DNN and reported that features could be extracted from raw signals. Xu et al. [29] used PCA to reduce the dimensionality of these features. Eren et al. [30] built a fault detection system using a one-dimensional convolutional neural network (1D-CNN). Deng et al. [31] proposed the MSIQDE algorithm, which makes use of the merits of the Mexh wavelet function. Shao et al. [32] proposed the convolutional deep belief network (CDBN) for rolling bearings. Thus, deep learning technologies can overcome the shortcomings inherent in traditional machine learning methods.
Among deep learning models, CNNs are one of the best ways to perform feature learning without prior knowledge. CNNs are suitable for feature learning because they can pass signals periodically. However, CNNs have some drawbacks. Unlike methods such as SVM or PCA, they require many training samples. Also, the filter size of each convolutional layer is fixed, so diverse information cannot be captured. In addition, since a CNN with a general structure can only extract single-scale features, useful information may be lost and domain shift problems may occur. Therefore, we propose an improved CNN called MSCNN, which uses different filter sizes in each convolution. This allows us to extract useful information from frequency domains with different resolutions. In addition, more powerful feature representations can be learned than with conventional CNNs, and the number of parameters and the training time can be reduced [33,34]. Training deep learning models also requires a lot of data; without enough training data, it is difficult to expect good results, and data on rolling bearings are not always sufficient under real conditions. In this paper, we increased the amount of data by applying permutation and time-warping techniques.
This paper is organized as follows: Section 2 describes neural networks and the background of the proposed model. Section 3 presents the experiments and results. Finally, Section 4 concludes the paper.

Background
2.1. Artificial Neural Network (ANN). The structure used in artificial neural networks [35] today was proposed by Frank Rosenblatt in 1958. Rosenblatt proposed a linear classifier called the perceptron, which sums the products of the inputs and weights, applies an activation function, and outputs 1 if the value is greater than 0 and -1 if it is less than 0. A neuron has multiple inputs and one output. Each input is multiplied by its weight, and the larger the weight, the more information is conveyed. A bias is added to the weighted sum of the inputs, and this bias represents the sensitivity of the neuron. Figure 1 describes the architecture of the perceptron, where w represents the weight, x represents the input of the neuron, and b represents the bias.
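As an illustrative sketch (not taken from the original work), the perceptron rule described above can be written in a few lines of Python. The weights and biases for the AND and OR gates are hand-picked assumptions used only to show how the sign of the weighted sum plus bias determines the output.

```python
def perceptron(x, w, b):
    """Return +1 if the weighted sum of the inputs plus the bias is > 0, else -1."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s > 0 else -1


# Hypothetical, hand-picked parameters (assumptions for illustration only).
AND_W, AND_B = (1.0, 1.0), -1.5
OR_W, OR_B = (1.0, 1.0), -0.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_out = [perceptron(x, AND_W, AND_B) for x in inputs]
or_out = [perceptron(x, OR_W, OR_B) for x in inputs]
```

Both gates are linearly separable, so a single line (set by `w` and `b`) is enough to separate the two output classes.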
It was expected that the perceptron could create artificial intelligence like a real human, but in 1969, Minsky and Papert [36] proved the mathematical limitations of the perceptron, and people's expectations dwindled. According to them, the perceptron is only a simple linear classifier and cannot perform XOR classification. In other words, simple problems can be solved, but complex problems cannot be solved. Figure 2 describes the problem of OR, AND, and XOR.
In 1986, Rumelhart et al. [37] proposed a multilayer perceptron that overcomes the limitation of linear classifiers by adding a hidden layer. They showed that the XOR problem can be solved with a multilayer perceptron.
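The effect of a hidden layer on the XOR problem can be illustrated with a minimal sketch. The weights below are set by hand as an assumption for illustration; they are not learned by backpropagation as in [37]. The hidden layer computes OR and NAND, and the output unit combines them with AND.

```python
def step(s):
    """Threshold activation: 1 if s > 0, else 0."""
    return 1 if s > 0 else 0


def neuron(x, w, b):
    """Single unit: threshold of the weighted sum of the inputs plus the bias."""
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)


def xor_mlp(x):
    """Two-layer network with hand-set (hypothetical) weights that computes XOR."""
    h_or = neuron(x, (1.0, 1.0), -0.5)      # hidden unit 1: OR
    h_nand = neuron(x, (-1.0, -1.0), 1.5)   # hidden unit 2: NAND
    return neuron((h_or, h_nand), (1.0, 1.0), -1.5)  # output unit: AND


xor_out = [xor_mlp(x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

No single unit can produce this output, because the two XOR classes cannot be separated by one line; the hidden layer remaps the inputs so that the output unit can.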
The multilayer perceptron (MLP) has a structure similar to that of a single perceptron, but by making the input/output characteristics of the intermediate layer and of each unit nonlinear, it improves the capability of the network and overcomes the shortcomings of the single perceptron. In an MLP, as the number of layers increases, the decision regions formed by the perceptrons become more complex. Figure 3 visualizes the architecture of the MLP.

Convolutional Neural Network (CNN)
CNN was developed by LeCun and Bengio [38] in the 1990s as a neural network structure that classifies handwritten digits, and it received great attention. It is one of the most popular deep learning algorithms. CNN reduces the number of parameters by using convolution, which exploits spatial relations. To extract hidden features from the data, it learns several feature filters and then performs operations between the feature filters and the input data. Since the vibration signal is a time series, 1D-CNN [39] was used. A CNN mainly consists of an input, convolution layers, pooling layers, fully connected layers, and an output. The basic structure of 1D-CNN is shown in Figure 4.
The convolutional layer learns the feature values of the input data and consists of multiple feature maps. Neurons in each feature map are connected to a local area of the previous feature map through a set of weights, called the convolution kernel. The result of convolving the input feature map with the convolution kernel is passed to the activation function to form the next feature map. Feature maps are computed through weight sharing, which reduces model complexity and makes network training easier. The forward propagation of the convolutional layer is as follows:

$$x_j^l = f\left(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\right),$$

where $x_j^l$ is the output of layer $l$, $f$ is the activation function, $M_j$ is the selected set of input feature maps, $x_i^{l-1}$ is the output of layer $l-1$, $k_{ij}^l$ is the weight of layer $l$, and $b_j^l$ is the bias of layer $l$.

The pooling layer is usually placed between successive convolutional layers. It reduces the dimensionality of the convolutional layer output while retaining the extracted features. The pooling layer subsamples the feature values in each feature map; the most commonly used pooling methods are max pooling and average pooling. In this paper, we used max pooling, which performs better on one-dimensional time series. The structure of the max pooling layer is shown in Figure 5, and it is computed as follows:

$$P_i^{l+1}(j) = \max_{(j-1)W + 1 \le t \le jW} q_i^l(t),$$

where $q_i^l(t)$ is the output of the $t$th neuron in the $i$th feature map of layer $l$, $t \in [(j-1)W + 1, jW]$, $W$ is the width of the pooled area, and $P_i^{l+1}(j)$ is the pooled value of the corresponding neuron in layer $l+1$.

[Figure panels: Outer0.007, Inner0.007, Inner0.014, and Inner0.021 signals with their time-warped counterparts]
[Figure panels: Inner0.007, Inner0.014, and Inner0.021 signals with their permuted counterparts]
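The forward pass of one convolutional feature map followed by max pooling can be sketched as below. This is a minimal illustration, not the model used in the experiments: the ReLU activation, the kernel values, and the test signal are assumptions, and the convolution is implemented as a cross-correlation, as is common in deep learning libraries.

```python
def conv1d(x, kernel, bias=0.0):
    """Valid 1D cross-correlation of x with kernel, plus bias, followed by ReLU."""
    k = len(kernel)
    out = []
    for start in range(len(x) - k + 1):
        s = sum(x[start + i] * kernel[i] for i in range(k)) + bias
        out.append(max(0.0, s))  # ReLU is an assumed activation function
    return out


def max_pool1d(q, width):
    """Non-overlapping max pooling: each output is the max of one window of `width` values."""
    return [max(q[j:j + width]) for j in range(0, len(q) - width + 1, width)]


# Hypothetical input signal and kernel, chosen only for illustration.
signal = [0.0, 1.0, 3.0, 2.0, -1.0, 0.0, 4.0, 1.0, 0.0]
feature_map = conv1d(signal, kernel=[1.0, -1.0], bias=0.0)
pooled = max_pool1d(feature_map, width=2)
```

With a pooling width of 2, the feature map of length 8 is reduced to length 4, which shows how pooling halves the dimensionality while keeping the strongest responses.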

Multiscale Convolutional Neural Network (MSCNN).
Since the input of a CNN is usually a raw signal, poor results can be obtained regardless of hyperparameter changes if the input contains insufficient useful information. Convolution is the most important operation for analyzing the signal, and the size of the convolution filter in 1D-CNN has a great influence on performance. In 1D-CNN, the size of the convolution filter is a hyperparameter. Since a convolution layer uses a convolution filter of a fixed size, choosing that size is a very difficult problem.
There are also some issues with classification. First, a large convolution filter has good resolution because it focuses on the low-frequency region, but it tends to ignore high-frequency information. Conversely, a small convolution filter focuses on the high-frequency band but has lower resolution. Second, if all convolution filters have the same size, other discriminative features cannot be properly extracted.
To address this problem, researchers have proposed multiscale convolutional neural networks. Multiscale convolution extracts features from a vibration signal using several convolution filters of various scales. The framework of the proposed model is shown in Figure 6.
We used three convolution filters with different widths in the convolution layer. Each extracts features from the original data through two convolution layers and one max pooling layer, which yields three different feature maps that are then concatenated. The 1 × 1 convolution reduces the depth and width of the networks without increasing computation resources. This structure makes it possible to extract different distinct features from the original signal. After concatenation, the classification part consists of two fully connected layers and a softmax layer. The output size of the softmax layer is set to the number of labels in each of the two datasets.
The proposed multiscale feature extraction utilizes three convolution filters with different widths to extract features from the original data through two convolution layers and one max pooling layer, obtains three different feature maps, and then connects them. This structure allows both low- and high-frequency information to be obtained from the original signal. In the convolution layer of MSCNN [40], $C^t_d$ represents the output feature map of the $t$th convolution layer of MSCNN with depth $d = 3$, and $c^t_1$, $c^t_2$, $c^t_3$ represent the feature maps after the convolutions of MSCNN. The three convolution filters $f^t_d$ are convolved with the feature map $C^{[t-1]}$, and $b^t_d$ is the bias added to the feature map of the $t$th convolution layer. The outputs of the convolution operations are then combined in a concatenation layer.
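The idea of applying several filter widths to the same input and combining the results can be sketched as follows. This is a simplified illustration under stated assumptions: the filter widths (3, 5, 7), the averaging kernel values, the zero padding, and the choice to stack the three outputs as channels are all hypothetical, and the sketch omits the second convolution layer, pooling, and activation of the proposed model.

```python
def conv1d_same(x, kernel, bias=0.0):
    """1D cross-correlation with zero padding so the output length equals len(x)."""
    k = len(kernel)
    left = (k - 1) // 2
    right = k - 1 - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [sum(padded[t + i] * kernel[i] for i in range(k)) + bias
            for t in range(len(x))]


def multiscale_block(x, kernels, biases):
    """Apply every kernel to the same input and stack the results as channels."""
    return [conv1d_same(x, k, b) for k, b in zip(kernels, biases)]


# Hypothetical averaging kernels of widths 3, 5, and 7 (assumed values).
kernels = [[1.0 / w] * w for w in (3, 5, 7)]
biases = [0.0, 0.0, 0.0]
signal = [float(i % 4) for i in range(16)]
features = multiscale_block(signal, kernels, biases)
```

Because every branch keeps the input length, the three feature maps can be combined directly; the wider kernels smooth the signal more strongly, while the narrow kernel preserves faster variations.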

Data Augmentation.
Deep learning increases the expressiveness of a model by stacking many hidden layers, which increases the number of parameters. A huge amount of training data is necessary to train so many parameters properly. However, it is not easy to collect a lot of data under real working conditions. In addition, the data should be of high quality and varied enough to reflect reality. If a deep learning model is trained without enough training data, overfitting usually occurs. Therefore, we used data augmentation techniques [41,42], which apply artificial changes to existing data to create new data, to increase the absolute amount of data even for a small dataset. Data augmentation exposes the model to unexplored inputs and improves the generalization of deep learning models. The important point is to use domain knowledge so that newly generated data keep their existing labels; minor changes must not change the data label. Data augmentation is often used for images, but here it is applied to time series data.
In this paper, we used two data augmentation techniques. Both rely on the fact that a slight change in the position of events in time preserves the label. First, time-warping changes the position of time samples by smoothly distorting the time intervals between samples. Figures 7 and 8 visualize the original signal and the signal generated by time-warping as subplots and scatter plots.
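One possible way to implement time-warping is sketched below. It is an assumption-laden illustration rather than the exact procedure used in the experiments: random local speeds are drawn at a few knots, interpolated linearly (a cubic spline is often used instead), accumulated into a distorted time axis, and the signal is resampled on the regular grid. The function name, the number of knots, and `sigma` are hypothetical.

```python
import bisect
import math
import random


def interp(xq, xs, ys):
    """Piecewise-linear interpolation of (xs, ys) at xq; xs must be strictly increasing."""
    if xq <= xs[0]:
        return ys[0]
    if xq >= xs[-1]:
        return ys[-1]
    hi = bisect.bisect_right(xs, xq)
    lo = hi - 1
    frac = (xq - xs[lo]) / (xs[hi] - xs[lo])
    return ys[lo] + frac * (ys[hi] - ys[lo])


def time_warp(signal, n_knots=4, sigma=0.2, rng=None):
    """Return a copy of `signal` whose time axis is smoothly and randomly distorted."""
    if rng is None:
        rng = random.Random()
    n = len(signal)
    # Random local speeds around 1 at evenly spaced knots, kept strictly positive.
    knot_pos = [i * (n - 1) / (n_knots + 1) for i in range(n_knots + 2)]
    knot_speed = [max(0.1, rng.gauss(1.0, sigma)) for _ in knot_pos]
    speed = [interp(t, knot_pos, knot_speed) for t in range(n)]
    # Accumulating positive speeds gives a strictly increasing warped time axis.
    warped_t = [0.0]
    for s in speed[1:]:
        warped_t.append(warped_t[-1] + s)
    scale = (n - 1) / warped_t[-1]
    warped_t = [t * scale for t in warped_t]
    # The sample originally at index i now sits at time warped_t[i]; resample on the grid.
    return [interp(t, warped_t, signal) for t in range(n)]


rng = random.Random(0)
original = [math.sin(2 * math.pi * 5 * i / 400) for i in range(400)]
augmented = time_warp(original, rng=rng)
```

The warped signal has the same length, end points, and amplitude range as the original, so only the timing of its oscillations changes, which is why the label is expected to be preserved.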
Second, permutation randomly changes the location of events. To perturb the location of the input data, it creates a new window by splitting the data into segments of the same length and then randomly rearranging the segments. Figures 9 and 10 visualize the original data and the data generated by the permutation technique. Both methods change the data slightly but do not significantly change the labels.
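The permutation technique can be sketched in a few lines. This is a minimal illustration, assuming four equal-length segments per window; the function name and the number of segments are hypothetical, since the segment count used in the experiments is not stated.

```python
import random


def permute_segments(signal, n_segments=4, rng=None):
    """Split `signal` into equal-length segments and return them in random order."""
    if rng is None:
        rng = random.Random()
    seg_len = len(signal) // n_segments
    segments = [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]
    tail = signal[n_segments * seg_len:]  # leftover samples stay at the end
    rng.shuffle(segments)
    return [v for seg in segments for v in seg] + list(tail)


rng = random.Random(0)
original = list(range(400))
augmented = permute_segments(original, n_segments=4, rng=rng)
```

Each segment is kept intact, so the local waveform inside a segment is unchanged and only the order of the segments within the window differs.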

Experimental Configuration.
The Keras library was used with the TensorFlow backend. We compared our proposed model with two CNN models. The system specification for the experiment is shown in Table 1.

Simulation Case 1: CWRU Bearing Dataset.
To evaluate the performance of the proposed MSCNN, a bearing dataset from the Case Western Reserve University (CWRU) Bearing Data Center was used; the fault test bench is shown in Figure 11. The CWRU bearing dataset provides vibration signals generated by the simulator in normal and faulty conditions. The main components are a 2 hp electric motor, a torque converter, and a dynamometer, and the vibration signals were collected at the drive end, at the fan end, and by an accelerometer mounted on the housing. We used motor drive end bearing data recorded under motor loads of 0 hp, 1 hp, 2 hp, and 3 hp. Each defect type has fault diameters of 0.007, 0.014, and 0.021 inches, so, including the normal condition, there are 10 states in the dataset. We divided each signal into segments of 400 points. Each state has 1600 samples, and there are 10 states, so there are 16000 samples in total. We used 80% of the total data as training data and 20% as test data, and 20% of the training data was set aside as validation data. The detailed description of the data is given in Table 2.
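The segmentation and split described above can be checked with a short sketch. The helper names are hypothetical, and non-overlapping windows are an assumption, since the text does not state whether the 400-point segments overlap.

```python
def segment(signal, length=400):
    """Split a 1D signal into non-overlapping windows of `length` points (assumed scheme)."""
    n = len(signal) // length
    return [signal[i * length:(i + 1) * length] for i in range(n)]


def split_counts(total, test_frac=0.2, val_frac=0.2):
    """Return (train, validation, test) sizes: test from the total, validation from the rest."""
    n_test = round(total * test_frac)
    n_trainval = total - n_test
    n_val = round(n_trainval * val_frac)
    return n_trainval - n_val, n_val, n_test


windows = segment(list(range(2000)), length=400)
n_train, n_val, n_test = split_counts(1600 * 10)
```

With 16000 samples, this gives 3200 test samples, 2560 validation samples, and 10240 samples that remain for training.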
Two types of CNN models were compared to confirm the effectiveness of the proposed model. Both use commonly adopted architectures. Table 3 compares the accuracy of the proposed MSCNN with that of 1D-CNN, 2D-CNN, and the proposed model without data augmentation.
Figure 12 visualizes the loss curve and accuracy curve for each model. The loss curves show that the difference is small, but the proposed model settles more quickly. In addition, the accuracy curves show that the accuracy of the proposed model increases rapidly.
According to the results, the proposed model showed better accuracy than the others. In the tests, its accuracy was higher by up to 1.4% and by at least 0.53%. The difference may seem slight, but it is significant because gaining 1% becomes harder as the accuracy approaches its upper limit. In addition, there is a significant difference in accuracy even when data augmentation techniques are not used. This indicates that splitting the convolution kernel into multiple scales works. In addition, we created a confusion matrix for each model to support the reliability of the experiment.
As a second indicator, we used the confusion matrix. Although accuracy is the most intuitive metric, it can be skewed on unbalanced datasets. The confusion matrix is built on four concepts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The formulas derived from the confusion matrix are given below.
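First, precision is the proportion of samples the model classifies as positive that are actually positive. Precision is expressed as

$$\text{Precision} = \frac{TP}{TP + FP}.$$

Recall is the proportion of actually positive samples that the model classifies as positive. Recall is expressed as

$$\text{Recall} = \frac{TP}{TP + FN}.$$

Accuracy is the most intuitive metric and is expressed as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives.

Figure 14: Bearing simulator of IMS.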

Journal of Sensors
The F1 score is the harmonic mean of precision and recall, which allows the performance of the model to be evaluated accurately when the data labels are unbalanced. The F1 score can be expressed as

$$\text{F1 score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{9}$$

Figure 13 shows the results of the confusion matrix for each model. As in the accuracy analysis, the proposed model showed the best performance, followed by 1D-CNN and 2D-CNN. Since the input is time series data, 1D-CNN performs better than 2D-CNN.
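For a multiclass problem such as bearing fault classification, these metrics are usually computed per class from the confusion matrix by treating one class as positive and all others as negative. The sketch below illustrates this under stated assumptions: the helper name is hypothetical, and the 3 × 3 confusion matrix is made-up example data, not a result from the experiments.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples with true class i that were predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class differs
        fn = sum(cm[c]) - tp                       # true c, predicted class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return accuracy, metrics


# Made-up confusion matrix for three classes (illustrative data only).
cm = [[8, 2, 0],
      [1, 9, 0],
      [0, 0, 10]]
accuracy, metrics = per_class_metrics(cm)
```

For the first class, 8 of the 9 samples predicted as that class are correct and 8 of its 10 true samples are found, so its F1 score lies between its precision and recall.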

Simulation Case 2: IMS Bearing Dataset.
To add reliability to the evaluation of the proposed model, we used the bearing dataset provided by the Center for Intelligent Maintenance Systems (IMS); the fault test bench is shown in Figure 14.
Four rolling bearings were installed and operated at a rotational speed of 2000 rpm under a radial load of 6000 lbs. The raw signal was recorded by two accelerometers arranged vertically and horizontally. There are four classes in total: three fault states and the normal state. To compare the results, we made the number of samples the same as in the CWRU experiment presented above. The detailed description of the data is given in Table 4.
We compared the two types of CNN models and the proposed model as in the previous experiment. Table 5 compares the accuracy of the proposed model and the typical CNN models on the IMS data. According to the results, the proposed model showed better accuracy than the other models. In the test, its accuracy was higher by up to 1.37% and by at least 0.26%.
Figure 15 visualizes the loss curve and accuracy curve for each model. As in the CWRU experiment, the loss curves show that the difference is small, but the proposed model settles more quickly. In addition, the accuracy curves show that the accuracy of the proposed model increases rapidly. The results of the confusion matrix for each model are shown in Figure 16.

Conclusion
Diagnosis of bearing failures is very important in industry. Detecting bearing failure in advance can reduce downtime and prevent financial losses. Raw vibration signals were taken from the CWRU and IMS bearing datasets. However, in the real world, there is not always enough data, and without enough data, a deep learning model performs very poorly. Therefore, in this paper, data was generated using two data augmentation techniques suited to time series data. Experiments showed that the data augmentation techniques significantly improve accuracy. In addition, to overcome the shortcomings of CNN, we proposed a model that configures the convolution layer in multiple scales, which minimizes the information lost by each individual convolution filter and reduces the number of parameters and the training time. The proposed model not only extracts useful information from frequency domains with different resolutions better than the conventional CNN but also enables more powerful feature representation learning. The proposed model showed better performance than the existing 1D-CNN and 2D-CNN. In future research, we will apply the model not only to bearing data but also to data from other fields widely used in industry. In addition, we plan to make the structure of the model more efficient.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request (jpjeong@skku.edu).