Remaining Useful Life Prediction of Milling Tool Based on Pyramid CNN

Remaining useful life (RUL) prediction of a milling tool is one of the determinants in making scientific maintenance decisions for the CNC machine tool. Predicting the RUL accurately can improve machining efficiency and product quality. Deep learning methods have strong learning capability and are extensively used in RUL prediction. The multiscale CNN (MSCNN), a typical deep learning model in RUL prediction, has a large number of parameters because of its parallel convolutional pathways, resulting in high computing cost. Besides, the MSCNN ignores the different influences that degradation features of different scales have on RUL prediction accuracy. To address these issues, a pyramid CNN (PCNN) is proposed for RUL prediction of the milling tool in this paper. Group convolution is used to replace the parallel convolutional pathways so that multiscale features are extracted without a large number of additional parameters. Channel attention with soft assignment is used to select the key degradation features across different sensors and scales. The milling tool wear experiments show that the proposed method achieved a score value of 51.248 ± 1.712 and an RMSE of 19.051 ± 0.804, confirming better performance than the traditional MSCNN and other deep learning methods. Besides, the number of parameters of the proposed method is reduced by 62.6% and 54.8% compared with the MSCNN with self-attention and the MSCNN, respectively, confirming its lower computing cost.


Introduction
As a basic tool of industry, the computer numerical control (CNC) machine tool plays an important role in industrial manufacturing. With the increasing demand for product quality, the stability of the machining process becomes more and more important. Tool wear commonly degrades machining quality during high-speed machining [1]. It not only affects the quality of the machined surface and the machining precision but also increases the machining cost. Moreover, unnecessary tool replacement that aims at preventing a decrease in surface quality increases the downtime and machining cost in high-speed milling [2]. The factors that affect tool degradation mainly include the cutting parameters, the work material, and the cutting tool. However, the internal law governing how these factors affect tool degradation is hard to determine because of their various combinations. Since tool wear cannot be directly detected during the process, it is hard to make scientific maintenance decisions without interrupting the machining process. Therefore, accurately predicting the remaining useful life (RUL) of the milling tool is a significant task.
With the wide usage of the industrial internet of things in condition monitoring of machinery, a mass of monitoring data of the CNC machine tool is acquired by various sensors. The explosive growth of monitoring data brings new opportunities to RUL prediction of the milling tool. Compared with model-driven RUL prediction methods, data-driven RUL prediction methods are able to learn the degradation characteristics of a tool from massive monitoring data. They can also build the corresponding RUL prediction models automatically, which means that neither a deep understanding of system-failure physics nor complete knowledge of the dynamics is required. Therefore, data-driven methods have recently been gaining more and more attention in the field of RUL prediction [3].
Traditional data-driven prognostic approaches usually contain three steps: hand-crafted feature extraction, degradation behavior learning, and RUL prediction [4,5]. Hand-crafted feature extraction uses signal processing methods to extract sensitive degradation features from the monitoring data. Then, these features are fed into machine learning models, such as ridge regression, support vector machine (SVM), and so on, to learn the degradation behavior and predict the RUL. For example, Park et al. [6] extracted time, frequency, and time-frequency domain features and input them into a ridge regression model after dimension reduction using PCA. Zhao et al. [7] extracted high-dimensional features using time-frequency representation (TFR), which were fed into a simple multiple linear regression model to predict the RUL after supervised dimensionality reduction using PCA and LDA. Liu et al. [8] integrated empirical mode decomposition (EMD) and the Wigner-Ville distribution (WVD) to extract degradation features from gearbox vibration signals, and then used a particle filter (PF) with a state space model based on the Wiener process to predict the RUL of the gearbox from these degradation features. Even though these methods perform well in RUL prediction, they still require much effort on hand-crafted feature design [9,10]. To avoid this, it is desirable to find a method that automatically extracts degradation features from monitoring data. Therefore, deep learning-based RUL prediction methods have gained more and more attention in the field of data-driven RUL prediction [11][12][13][14][15][16][17][18][19][20].
Deep learning, structured as a stack of multiple layers of nonlinear processing units [21], can extract high-level features without human intervention. Thus, deep learning shows a more powerful feature extraction ability and achieves state-of-the-art accuracy in many tasks, such as image classification, natural language processing (NLP), target detection, and so on. The deep belief network (DBN), auto-encoder network (AEN), recurrent neural network (RNN), and convolutional neural network (CNN) are mainstream architectures in deep learning [22]. Wang et al. [23] proposed a deep separable convolution network (DSCN) for RUL prediction of bearings, which extracted the degradation features from monitoring data using deep separable convolution and predicted the RUL using fully connected layers. Hinchi and Tkiouat [24] used CNNs to extract features from vibration signals and then employed an LSTM to predict the RUL of rolling element bearings. Zhang et al. [25] proposed a multiobjective DBN ensemble method for RUL prediction of turbofan engines. Wang et al. [26] used DCAE and SOM to obtain the health index of a rolling bearing and then used this health index as a label to train a CNN-based RUL prediction model. Ding et al. [27][28][29] proposed three meta deep learning methods to predict the RUL of machines under different conditions and with limited and variable-length data. Zhang et al. [30] proposed a deep representation regularization-based transfer learning method for RUL prediction under different machinery operating conditions and without target-domain run-to-failure training data.
Because of their remarkable ability to extract degradation features from monitoring data, CNN-based RUL prediction methods have become a research hotspot, especially the multiscale CNN (MSCNN) [31][32][33][34][35][36][37][38][39]. The architecture of the traditional MSCNN with self-attention is shown in Figure 1. Parallel convolutional pathways, each with a different convolution kernel size, are used to extract degradation features of different scales. Self-attention is embedded to avoid the interference caused by redundant and uncorrelated information from some of the sensors, improving the performance of the network. The parallel learning strategy, however, greatly increases the number of model parameters, leading to a higher computing cost during model training. In addition, the self-attention can only consider the contribution of different sensors to RUL prediction. In other words, the contribution of degradation features of different scales is not taken into account.
To deal with the mentioned problems, a pyramid CNN (PCNN) is proposed in this paper. The architecture of the proposed PCNN is shown in Figure 2. The monitoring data acquired from different sensors can be directly fed into the proposed network without any preprocessing, which means complex signal processing techniques are not required. The network contains two parts: a multiscale feature learning subnetwork and an RUL predicting subnetwork. The multiscale feature learning subnetwork is built by stacking one-dimensional (1D) convolution layers and pyramid convolution layers. Low-level features are extracted by the 1D convolution layers and fed into the pyramid convolution layers. In the pyramid convolution layers, group convolution is used to extract multiscale high-level degradation features. Then, the channel attention model is used to generate an attention weight for each channel. A soft assignment is used to recalibrate the attention weights of different scales so that the key degradation features can be selected not only from different sensors but also from different scales. The RUL predicting subnetwork contains global pooling and fully connected layers (FCLs), which establish the mapping between degradation features and the RUL. The tool wear experiment is used to verify the proposed method. Compared with the traditional MSCNN, the proposed method has a higher RUL prediction accuracy and a smaller number of parameters.
The rest of this article is structured as follows. The basic theory of the proposed method is expounded in detail in Section 2. Experiments and comparison analyses are presented in Section 3. Conclusions are drawn in Section 4.

One-Dimensional (1D) Convolution Layer and Shortcut Connection. One-dimensional convolution is used to extract degradation features from raw data in this paper. The 1-D convolutional operation can be described as $y_0 = f(X_0 * k_{0,i} + b_{0,i})$, where $X_0$ is the raw data, $y_0$ is the output of the process, $k_{0,i}$ is the learnable convolutional kernel, $b_{0,i}$ is the bias term, $*$ represents the convolutional operation, and $f(\cdot)$ is the nonlinear activation function. In this paper, the rectified linear unit (ReLU) is used as the nonlinear activation function of the 1-D convolution operation. By repeating this process twice, low-level degradation features, denoted as $F_0$, can be obtained. Gradient vanishing/exploding and weight matrix degradation are considerable problems in deep learning. To address them, a shortcut connection is introduced in this network.
The raw data acquired from the sensor is fed into the shortcut connection pathway, which contains a convolution layer and a max pooling layer. The size of the convolutional kernel in the shortcut connection is 1 × 1, which aims to increase the dimension of $X_0$. The max pooling layer is used to downsample the output of the convolution layer. The output of the shortcut connection model, denoted as $S_{out}$, is defined in terms of the following quantities: $Y$ is the output of the pyramid convolution layer, pool(·) is the pooling function, $k_c$ is the convolution kernel with the size of 1 × 1, and $*$ is the convolution operation.
In this paper, group convolution is used to replace parallel convolutional pathways so that multiscale features can be extracted without a large number of additional parameters. The architecture of this model is shown in Figure 3.
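The 1-D convolution with ReLU and the shortcut pathway (1 × 1 convolution followed by max pooling) described in this subsection can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the single-channel input, the bias-free 1 × 1 convolution, the non-overlapping pooling windows, and all numeric values are hypothetical. How the shortcut output is combined with $Y$ is not shown, because that expression is not recoverable from the text.

```python
def conv1d_relu(x, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation form) followed by ReLU."""
    k = len(kernel)
    out = []
    for start in range(len(x) - k + 1):
        acc = bias
        for offset in range(k):
            acc += x[start + offset] * kernel[offset]
        out.append(max(0.0, acc))  # ReLU activation f(.)
    return out


def max_pool1d(x, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def shortcut_branch(x, weights, pool_size):
    """1x1 convolution lifting one input channel to len(weights) channels,
    followed by max pooling on every output channel (assumed layout)."""
    return [max_pool1d([w * v for v in x], pool_size) for w in weights]


# Hypothetical toy signal and weights, for illustration only.
signal = [0.0, 1.0, -2.0, 3.0, 1.0, 0.0]
features = conv1d_relu(signal, kernel=[1.0, 0.0, -1.0])
branch = shortcut_branch(signal, weights=[1.0, -1.0], pool_size=2)
```

The 1 × 1 kernel does not mix neighbouring time steps; it only changes the number of channels, which is why the shortcut can match the channel dimension of the main pathway while the pooling matches its length.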

Pyramid Convolution
The input low-level feature $F_0 \in \mathbb{R}^{L \times C}$ is split into $s$ groups along the channel direction, denoted as $X_i \in \mathbb{R}^{L \times C/s}$, with $i = 1, 2, \ldots, s$, where $C$ is the number of channels and $L$ is the length of $F_0$. A set of learnable kernels is used to convolve $X_i$. The output of the convolution, denoted as $F_i$, is obtained by convolving $X_i$ with each kernel $k_{i,c}$ and adding the corresponding bias $b_{i,c}$, where $C/s$ is both the number of learnable kernels and the number of input channels, $*$ denotes the convolution operator, $k_{i,c} \in \mathbb{R}^{F \times 1 \times (C/s) \times (C/s)}$ is the $c$-th convolution kernel of the $i$-th group, and $b_{i,c}$ is the bias term. The convolution kernels $k_{i,c}$ of different groups have different sizes, so they extract degradation features of different scales. Finally, the whole multiscale feature $F$ is obtained by concatenating all the $F_i$.
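To make the parameter saving concrete, the following sketch counts the learnable parameters of parallel convolutional pathways and of a group convolution with the same kernel sizes. It is an illustration under assumptions that are not taken from the paper: each parallel pathway is assumed to map all C input channels to C output channels, one bias per output channel is counted, and the channel count and kernel sizes below are hypothetical.

```python
def parallel_pathway_params(channels, kernel_sizes):
    """Each pathway is a full convolution: C inputs -> C outputs, one bias per output."""
    return sum(k * channels * channels + channels for k in kernel_sizes)


def group_conv_params(channels, kernel_sizes):
    """Each of the s groups convolves C/s inputs -> C/s outputs with its own kernel size."""
    groups = len(kernel_sizes)
    if channels % groups != 0:
        raise ValueError("channels must be divisible by the number of groups")
    per_group = channels // groups
    return sum(k * per_group * per_group + per_group for k in kernel_sizes)


# Hypothetical configuration: 64 channels, four scales.
C = 64
sizes = [3, 5, 7, 9]
parallel = parallel_pathway_params(C, sizes)
grouped = group_conv_params(C, sizes)
```

Under these assumptions the weight count of each group shrinks by a factor of s squared relative to a full pathway, because both the input and the output channel counts are divided by s.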

Channel Attention Model and Soft Assignment.
The data from different sensors contain different degrees of degradation information. In other words, some important degradation information exists only in some of the sensors. Furthermore, features of different scales also contain different degrees of degradation information. Therefore, it is important to select the key degradation information from the multiscale feature $F$. In this paper, a channel attention model is used to obtain the attention weight from the input feature $F$. Then, the soft assignment is used to recalibrate the attention weight of the corresponding scale. The structure of this model is shown in Figure 4. Attention weights of the features of different scales are obtained by parallel processing pathways. Each processing pathway includes global information encoding and channel-wise relationship recalibration. The global information encoding is done by global average pooling and global max pooling, and the channel-wise relationship recalibration is done by fully connected networks with one hidden layer.
The global average pooling (GAP) and the global max pooling (GMP) aggregate the global information of each channel, generating two vectors: $V_a$ and $V_m$. Both $V_a$ and $V_m$ contain $J = C/s$ channel-wise statistics. The channel-wise statistics of the $j$-th channel are obtained by $V_{a,j} = \frac{1}{L}\sum_{l=1}^{L} F_i(l, j)$ and $V_{m,j} = \max_{1 \le l \le L} F_i(l, j)$, where $F_i(l, j)$ is the value of the $j$-th channel of $F_i$ at position $l$. Then, $V_a$ and $V_m$ are each fed into a fully connected network (FCN) with one hidden layer. The number of neurons in the hidden layer of the FCN is $J/r$, where $r$ is the ratio of dimensionality reduction. After that, the attention weight of $F_i$, denoted as $Z_i$, is calculated by summing the outputs of the two FCNs element-wise and applying the sigmoid function, where the weight matrices of the FCNs (among them $W_{v2} \in \mathbb{R}^{J \times J/r}$) are learnable, ⊕ denotes the element-wise summation, and σ(·) is the sigmoid activation function.
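The channel attention pathway for one scale can be sketched as below. This is a hedged illustration rather than the paper's implementation: the weight names `w_a1`, `w_a2`, `w_m1`, and `w_m2` are hypothetical (the original symbols are only partly legible), the ReLU in the hidden layer is an assumption, biases are omitted, and the numbers are toy values.

```python
import math


def global_pools(feature):
    """feature: list of L time steps, each a list of J channel values.
    Returns the GAP vector V_a and the GMP vector V_m."""
    length = len(feature)
    channels = len(feature[0])
    v_avg = [sum(row[j] for row in feature) / length for j in range(channels)]
    v_max = [max(row[j] for row in feature) for j in range(channels)]
    return v_avg, v_max


def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def fcn(vector, w1, w2):
    """One hidden layer: J -> J/r (ReLU assumed) -> J."""
    hidden = [max(0.0, h) for h in matvec(w1, vector)]
    return matvec(w2, hidden)


def channel_attention(feature, w_a1, w_a2, w_m1, w_m2):
    """Z_i = sigmoid(FCN(V_a) + FCN(V_m)), summed element-wise."""
    v_avg, v_max = global_pools(feature)
    summed = [a + m for a, m in zip(fcn(v_avg, w_a1, w_a2), fcn(v_max, w_m1, w_m2))]
    return [1.0 / (1.0 + math.exp(-s)) for s in summed]


# Toy example: L = 2 time steps, J = 2 channels, r = 2 (one hidden neuron).
feature = [[1.0, 0.0], [3.0, 2.0]]
w1 = [[0.5, 0.5]]        # shape (J/r) x J
w2 = [[1.0], [-1.0]]     # shape J x (J/r)
z = channel_attention(feature, w1, w2, w1, w2)
```

Here the same toy matrices are reused for both FCNs only to keep the example short; whether the two pathways share weights is not stated in the text.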
By doing this, the network can fuse degradation information from different sensors and produce a better attention for the high-level degradation features. Furthermore, in order to enhance the key degradation features of some scales and suppress the irrelevant ones without destroying the original channel attention vector, a soft assignment is used to adaptively recalibrate the attention weight of the corresponding scale. After doing this, the key degradation features are selected not only from different sensors but also from different scales.

Shock and Vibration
Then, the multiscale high-level degradation feature with multiscale channel-wise attention weight, denoted as $Y_i$, is obtained by multiplying $F_i$ with its recalibrated attention weight, where ⊙ denotes the channel-wise multiplication used for this product.

Finally, the output of the pyramid convolution layer, denoted as $Y$, can be obtained by the concatenation of all the $Y_i$.
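The recalibration, reweighting, and concatenation steps can be sketched as follows. The exact soft-assignment expression is not recoverable from the text, so this sketch assumes a softmax taken across the s scales for each channel index, which is one common way to realize a soft assignment; the function names and toy values are likewise hypothetical.

```python
import math


def soft_assignment(weights_per_scale):
    """Assumed form: softmax across the s scales, independently per channel index.
    weights_per_scale: list of s attention vectors Z_i, each of length J."""
    scales = len(weights_per_scale)
    channels = len(weights_per_scale[0])
    recalibrated = [[0.0] * channels for _ in range(scales)]
    for j in range(channels):
        column = [weights_per_scale[i][j] for i in range(scales)]
        peak = max(column)  # subtract the maximum for numerical stability
        exps = [math.exp(value - peak) for value in column]
        total = sum(exps)
        for i in range(scales):
            recalibrated[i][j] = exps[i] / total
    return recalibrated


def reweight_and_concat(features_per_scale, attention_per_scale):
    """Scale each channel of F_i by its weight (channel-wise product),
    then concatenate the resulting Y_i along the channel axis."""
    length = len(features_per_scale[0])
    output = []
    for t in range(length):
        row = []
        for feature, attention in zip(features_per_scale, attention_per_scale):
            row.extend(value * a for value, a in zip(feature[t], attention))
        output.append(row)
    return output


# Toy example: s = 2 scales, J = 2 channels, L = 1 time step.
z = [[0.0, 1.0], [0.0, 1.0]]
attention = soft_assignment(z)
y = reweight_and_concat([[[2.0, 4.0]], [[6.0, 8.0]]], attention)
```

With a softmax, the weights of one channel index sum to one over the scales, so raising the weight of one scale necessarily lowers the others, which matches the stated goal of enhancing some scales while suppressing irrelevant ones.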

Experimental Verification
3.1. Data Description. As shown in Figure 5, the life testing of the milling tool is conducted on a computer numerical control (CNC) milling machine.
The material of the workpiece is 316L stainless steel, and the milling tool is a cemented carbide insert with a TiAlN coating. During the milling process, the table feeds the workpiece from front to back along the Y-axis. As tabulated in Table 1, a total of 4 milling tools are tested, and all tests are carried out without the application of a cutting fluid. As shown in Figure 6, two types of sensors are installed in the milling machine: an accelerometer (Kistler Z292A600) and a rotary dynamometer (Pro-Micro). For the accelerometer, the sampling frequency is set to 10 kHz. For the rotary dynamometer, the sampling frequency is set to 2.5 kHz.
As shown in Figure 6, a metallographic microscope is used to measure the width of the flank wear. When the width of the flank wear is greater than 0.2 mm, the tested tool has reached its wear limit [1]. The monitoring data of C1 acquired during the whole operating life are shown in Figure 7.
As shown in Figure 7, some of these monitoring data have obvious degradation trends as the cutting time increases, while others do not.

Experimental Study.
In this case, all of the monitoring data are used as the input of the network to verify the effectiveness of the proposed method. The size of an input sample is 10000 × 1 × 5.
One of the main hyperparameters that may affect the prediction performance of the proposed model is the number of groups, which directly affects the dimension of the features extracted in the pyramid convolution layer. To investigate this influence, different numbers of groups are applied in the proposed PCNN to predict the RUL. The number of groups is set to 2, 4, and 8. Figure 8 shows the score values and RMSE of C4, and the corresponding training time and model parameters are given in Table 2.
It can be observed that the score value is the lowest and the RMSE is the highest when the number of groups is set to 2, which indicates that the prediction performance is relatively poor. The accuracy of the RUL prediction results is close for the other two settings. As the number of groups increases, the model becomes more computationally intensive. Accordingly, Table 2 shows that the model training time and the number of parameters increase with the number of groups. Though a larger number of groups can extract more features of different scales, resulting in better prediction performance, the calculation burden is aggravated and the performance improvement is limited once the number of groups increases beyond a certain extent. As a tradeoff between accuracy and efficiency, the number of groups is finally selected as 4.
The final architecture of the network is shown in Figure 9, and the hyperparameters of the pyramid convolution layer of the PCNN are listed in Table 3.
Mean square error is used as the loss function of the network, and the Adam optimizer with a mini-batch size of 128 is used to update its weights and biases. The trained network is used to predict the RUL values of the testing dataset after training for 150 epochs. If the predicted value is larger than the actual value, it may cause low process quality or even scrapped products due to overwear of the tool. Taking this situation into account, in addition to the root mean square error (RMSE), a score function is used to evaluate the performance of the network. The score value is computed from the following quantities: $S$ is the number of samples in the testing dataset, $y$ is the actual value, and $\hat{y}$ is the predicted value. The higher the score value, the more accurate the RUL prediction is.
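The mean square error loss and the RMSE metric mentioned above can be computed as in the sketch below. The score function itself is not reproduced, because its expression is not recoverable from the text; instead, a simple over-prediction rate is included as a hypothetical helper to illustrate why late predictions (predicted RUL larger than actual RUL) deserve separate attention. All values are toy numbers, not experimental results.

```python
import math


def mse(actual, predicted):
    """Mean square error over S samples (the training loss named in the text)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y_hat - y) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error, in the same unit as the RUL."""
    return math.sqrt(mse(actual, predicted))


def overestimation_rate(actual, predicted):
    """Hypothetical helper: fraction of samples whose predicted RUL exceeds the actual RUL."""
    late = sum(1 for y, y_hat in zip(actual, predicted) if y_hat > y)
    return late / len(actual)


# Toy values for illustration only.
actual_rul = [10.0, 8.0, 6.0, 4.0]
predicted_rul = [12.0, 8.0, 5.0, 4.0]
loss = mse(actual_rul, predicted_rul)
error = rmse(actual_rul, predicted_rul)
late_fraction = overestimation_rate(actual_rul, predicted_rul)
```

RMSE treats early and late predictions symmetrically, which is the reason an additional, asymmetric score is reported alongside it.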
Figure 10 shows the RUL prediction result of C4 using the proposed method. As shown in Figure 10, the predicted RUL value fluctuates slightly around the actual RUL, and the fluctuation becomes smaller and smaller as the cutting time increases. Furthermore, cross validation is used to verify the stability of the proposed method. Each test is repeated ten times, and the mean and standard deviation for the four testing datasets are listed in Table 4.
As shown in Table 4, on the one hand, both the score and the RMSE of each testing dataset have a small standard deviation, which indicates that the proposed model has good stability for the same task. On the other hand, the mean values of both the score and the RMSE fluctuate little across the four testing datasets, which indicates that the proposed network has good stability for different tasks. In conclusion, the proposed network achieves good prediction results and good stability both within the same task and across different tasks, which means the predicted result of the proposed method is credible.
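The evaluation protocol can be sketched as below, assuming a leave-one-tool-out cross validation over the four tools and a mean ± standard deviation summary over repeated runs. The tool names C2 and C3 are inferred from the mention of C1, C4, and four tools; the use of the sample standard deviation and the run values are assumptions for illustration, not results from the paper.

```python
from statistics import mean, stdev


def leave_one_out(tools):
    """Return (training tools, held-out test tool) pairs, one fold per tool."""
    return [([t for t in tools if t != held_out], held_out) for held_out in tools]


def summarize(values):
    """Mean and sample standard deviation over repeated runs."""
    return mean(values), stdev(values)


folds = leave_one_out(["C1", "C2", "C3", "C4"])

# Hypothetical metric values from repeated runs on one fold.
runs = [1.0, 2.0, 3.0, 4.0, 5.0]
run_mean, run_std = summarize(runs)
```

A small standard deviation within a fold reflects repeatability for the same task, while similar means across folds reflect stability across different tools.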

Ablation Experiments.
In order to illustrate the advantage of the proposed PCNN, ablation experiments are carried out in this part. Three other prognostic networks are employed to predict the RUL, and they are denoted as Network-1, Network-2, and Network-3. The architectures of these three networks are similar to that of the PCNN, and the differences are that (1) Network-1 uses neither group convolution nor channel attention with soft assignment, (2) Network-2 only uses group convolution, and (3) Network-3 only uses channel attention with soft assignment. In addition, the hyperparameter settings of these three networks are the same as those of the PCNN, and the cross validation used in Section 3.2 is used in this part too. The performance estimation results of these four different networks are listed in Table 5 and drawn in Figure 11. It can be observed that, compared with the classic multiscale convolutional network without attention mechanisms (i.e., Network-1 [37]), the use of group convolution or channel attention with soft assignment effectively improves the prediction performance and stability of the network, resulting in a higher score value and a lower RMSE. For Network-2, the performance improvement is attributed to the use of group convolution, which reduces the risk of overfitting by reducing the number of learnable parameters. For Network-3, the channel attention with soft assignment makes the network enhance the key degradation features of some sensors and scales. Besides, by systematically integrating group convolution and channel attention with soft assignment, the proposed PCNN obtains the highest score value and the lowest RMSE for each testing dataset among the four prognostic networks, which again verifies the performance of the proposed method.

Comparison with the State-of-the-Art Models.
In this part, eight state-of-the-art models are utilized to estimate the RUL for the comparison analysis: two machine learning models, random forests (RF) and support vector regression (SVR) [34], and six deep learning models, namely the deep convolution neural network (DCNN) [35], residual dense network (RDN) [36], multiscale convolutional neural network (MSCNN) [37], convolutional long-short-term memory network (CLSTM) [24], deep belief network (DBN) [38], and multiscale convolutional attention network (MSCAN) [39]. For the RF and SVR, the features listed in [34] are extracted from all the monitoring data. Then, these features are fed into the corresponding model to predict the RUL. The score value and RMSE of these methods are listed in Table 6. Both the score value and the RMSE are calculated from the half of the life too.
From Table 6, it can be found that the proposed method has the highest score value and the lowest RMSE, which confirms that the proposed method can predict the RUL. Besides, in order to illustrate the efficiency of the PCNN, the number of parameters and the training and testing time of three multiscale learning models are listed in Table 7. All experiments in this paper are performed on a server configured with two Intel(R) Xeon(R) Gold 6242R CPU @ 3.10 GHz processors, eight NVIDIA GeForce RTX 3090 graphics cards, and a total of 512 GB of memory (RAM).

As shown in Table 7, the total number of model parameters of the proposed method is reduced by 62.6% and 54.8% compared with the MSCNN with self-attention and the MSCNN, respectively. Both the training time and the testing time of the proposed method are greatly reduced, which means the computing cost is reduced and the efficiency is improved.

Conclusion
Because of its strong learning capability, the CNN is widely used in degradation feature extraction, especially the multiscale CNN, which has a stronger representation learning ability. Because of its parallel convolutional pathways, however, the traditional MSCNN has a large number of parameters, which means a higher computing cost. In addition, failing to consider the contribution of degradation features of different scales degrades the RUL prediction performance. To address these issues, a pyramid CNN (PCNN) is proposed for RUL prediction of the milling tool in this paper. In this network, group convolution is used to replace parallel convolutional pathways so that multiscale features are extracted without a large number of additional parameters. The channel attention with soft assignment selects the key degradation features not only from different sensors but also from different scales. The proposed method was experimentally validated by the milling tool wear experiment. Some related methods and state-of-the-art models, including machine learning methods and deep learning methods, were analyzed for comparison with the proposed method. The results indicate that the proposed method is able to predict the RUL accurately. Although the proposed method achieves a good RUL prediction result, there are still a few shortcomings in its application. The proposed method assumes that the working condition of the testing data is the same as that of the training data, which limits its application in practical engineering because the working condition of the machining process is dynamic. Moreover, the limited number of labeled training samples prevents training a model for every working condition. To address this issue, a promising direction is to introduce transfer learning or meta learning into the model, which can make the model achieve good performance with small samples. Furthermore, the model can be combined with adaptive optimization algorithms to determine its hyperparameters automatically, which may further improve its performance.

Figure 4: Architecture of the channel attention model.

Figure 3: Architecture of the group convolution.

Figure 7: Monitoring data of C1 during the whole operating life. (a) Force data in the Z-axis. (b) Bending moment data in the X-axis. (c) Torque data. (d) Bending moment data in the Y-axis. (e) Vibration data in the Y-axis.

Figure 8: RUL prediction result of C4 based on different numbers of groups. (a) Score values. (b) RMSE.

Figure 10: RUL prediction result of C4 using the proposed method.
Pyramid Convolution Layer. In this layer, multiscale high-level degradation information from different sensors is extracted and fused. First, a group convolution operation is used to extract high-level degradation features of different scales. After that, the channel attention model is used to generate the attention weights of the multiscale features. Finally, the soft assignment is used to recalibrate the attention weight of the corresponding scale.
2.2.1. Group Convolution. The monitoring data acquired from the sensors are nonlinear signals containing a lot of noise. While the degradation features can be extracted by the convolution operation, the receptive field of the convolution kernels has a great influence on the extracted degradation features. Large-scale degradation features can be extracted with a larger receptive field, while detailed degradation features can be extracted with a smaller receptive field. Therefore, it is necessary to use convolution kernels of different sizes to extract multiscale degradation features. The traditional multiscale convolution uses parallel pathways to extract multiscale features, with a different convolution kernel size in each pathway. Although the performance of such networks has been demonstrated, the large number of parameters increases the computing cost. Therefore, it is desirable to find an efficient multiscale feature extraction method.

Table 1: Cutting condition of milling tool.

Table 3: Hyperparameters of the pyramid convolution layer.

Table 4: Performance estimation result of four testing datasets.

Table 5: Performance estimation result of four different networks. The bold values indicate that the PCNN has the best performance.

Table 2: Comparison of model parameters and training time with different numbers of groups.

Table 6: Performance estimation result of the testing dataset for eight state-of-the-art models. The bold values indicate that the proposed method has the best performance.

Table 7: The number of parameters of different models.