Multiworking Conditions Anomaly Detection of Mechanical System Based on Conditional Variational Auto-Encoder

Introduction
Anomaly detection, an essential branch of machine learning, identifies abnormal and illogical data. Especially in the era of big data, the speed of manual data processing has fallen far behind that of computers, so faster detection of abnormal data is a valuable task [1,2]. In industry, the anomaly detection of mechanical devices is crucial: more immediate and more accurate anomaly detection helps prevent accidents and improves reliability and production efficiency [3,4].
Machine-learning-based anomaly detection algorithms mainly include One-Class Support Vector Machines (OC-SVM), Principal Component Analysis (PCA), and the Local Outlier Factor (LOF). For OC-SVM, normal data are used to train the model to obtain a hyperplane that encloses the positive data; OC-SVM takes the hyperplane as a criterion and considers the samples inside the enclosed region positive. Since the computation of the kernel function is time-consuming, OC-SVM is not widely used on massive data [5-7]. PCA is a statistical algorithm that converts a set of potentially correlated variables into a set of linearly uncorrelated variables, called principal components [8]. LOF measures the density deviation of a given sample with respect to its neighbours and determines whether a point is an outlier by comparing the density of the sample with that of its neighbours. This algorithm is suitable for data with noticeable density differences, but its complexity is high; therefore, it does not scale to big data [9].
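As an illustration of the PCA-based approach just described, the following sketch (illustrative NumPy code, not from the paper; all names and constants are invented for the example) scores samples by their reconstruction error after projection onto the leading principal components:

```python
import numpy as np

def pca_anomaly_scores(train, test, n_components=2):
    """Score test samples by reconstruction error after projecting
    onto the top principal components of the training data."""
    mean = train.mean(axis=0)
    centered = train - mean
    # Eigendecomposition of the sample covariance matrix
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the components with the largest eigenvalues (variances)
    components = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    # Reconstruction error grows for samples far from the principal subspace
    proj = (test - mean) @ components
    recon = proj @ components.T + mean
    return np.mean((test - recon) ** 2, axis=1)

# Synthetic demo: the last five samples are shifted and should score higher
rng = np.random.default_rng(0)
normal = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8)) * 0.1
scores = pca_anomaly_scores(normal, np.vstack([normal[:5], normal[:5] + 5.0]))
```

Samples off the learned subspace reconstruct poorly, so a threshold on this score separates anomalies from normal data.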
In recent years, anomaly detection based on deep learning has gradually become an effective method and an academic focus [10-22]. The Auto-Encoder (AE) is widely used for anomaly detection due to its excellent deep representation. AE-based anomaly detection minimizes the reconstruction error to establish a representation model of normal samples and identifies samples whose reconstruction error exceeds a threshold as anomalies [19]. Chen et al. proposed a novel quadratic-function-based deep convolutional auto-encoder (DCAE) for predicting the remaining useful life (RUL) of bearings [23]. The bearing vibration signals are first preprocessed by low-pass filtering and then fed into the quadratic-function-based DCAE network, which can generate a bearing Health Indicator (HI) from raw vibration signals and is better suited to RUL prediction than other existing HIs. However, AE has the disadvantage that it only gives a low-dimensional hidden-space representation and cannot learn the probability distribution of the samples. By employing a variational auto-encoder (VAE), a specific sample can instead be expressed as a distribution of possible samples, and the latent space becomes a continuous distribution space. VAE, therefore, is competitive as a generative model in the field of anomaly detection. In 2015, An and Cho [17] proved the feasibility of VAE in unsupervised anomaly detection and applied it to network intrusion detection. Literature [11] proposed Donut, an unsupervised anomaly detection model based on VAE: the encoder extracts representative features from Key Performance Indicator (KPI) sequences, the decoder reconstructs the sequence from the features, and the anomaly is calculated from the deviation between the reconstructed sequence and the original sequence. This model fully exploits the deep representation ability of deep generative models for KPI series.
References [12,13] discuss how the condition c affects the VAE. According to the different conditions, separate latent distributions can be generated and used for detection, and with different anomaly thresholds the model can detect both local and global anomalies. References [24,25] discussed the unsupervised construction of HIs. Reference [24] presented a new unsupervised HI construction approach that innovatively constructs a distribution contact ratio metric health indicator (DCRHI) to represent the degradation process well and obtain a uniform failure threshold. Qin et al. proposed a novel degradation-trend-constrained VAE (DTC-VAE) to construct an HI vector with a distinct degradation trend [22]. Compared with other typical unsupervised HI construction methods, this method can more easily determine the uniform failure threshold.
Mechanical devices work under different conditions, such as changes of load, rotating speed, and input power, which leads to two challenges in the anomaly detection of a mechanical system. (i) A model learned from a single working condition is not appropriate for a new working condition and may even identify normal samples under another condition as anomalies. (ii) Concentrated learning of samples under multiple conditions may lead to low detection accuracy. Research on anomaly detection under multiple working conditions has been relatively inadequate [4,26].
The Multiworking Conditions Conditional Variational Auto-Encoder (MW-CVAE) is presented in this paper to address the low detection accuracy under multiple working conditions. MW-CVAE takes the working condition as a conditional input of the VAE, obtains the distribution of samples by concentrated learning on normal data, and determines the threshold of anomaly detection, increasing the detection accuracy under multiple working conditions.

Anomaly Detection of Multiworking Conditions
2.1. Problem Analysis. Figure 1 demonstrates the vibration responses of normal and abnormal cases, with data taken from the JNU dataset (more details on the JNU dataset are given in Section 5.1). Figures 1(a)-1(c) show the vibration of normal cases, and the others show abnormal cases.
Figure 2 illustrates the results of VAE anomaly detection (introduced in Section 3.1). A typical VAE anomaly detector uses the reconstruction error (L_MSE) as the anomaly score to evaluate the abnormal condition. Threshold 1 in Figure 2 represents the best threshold according to the maximum-F1-score principle (introduced in Section 5.2.3), and normal and abnormal are expressed as 0 (False) and 1 (True), respectively. Anomaly scores that exceed the threshold are considered abnormal. Table 1 compares the detection performance on the JNU and PU datasets using VAE anomaly detection and shows both low Area Under Curve (AUC) and low F1-score, i.e., poor anomaly detection performance. The scores of the three normal cases divide into three layers in Figure 2, representing three typical working conditions, respectively. If threshold 2 is used as the standard, some abnormal samples (in region 1) are misclassified as normal; if threshold 3 is used, all the samples in regions 1 and 2 are misclassified as normal. This phenomenon is called Misclassification Caused by Working Condition Interference, and it is the main reason for low anomaly detection performance.

Idea of the Paper.
Figure 3 shows the idea of the paper. Ellipses 1 and 2 represent the distributions of the normal samples of two working conditions obtained by generative-distribution model learning, and the ellipse edges represent the classification boundaries. Points 1 and 2 stand for an abnormal and a normal sample of working condition 1, respectively; point 3 stands for an abnormal sample of working condition 2. If the two working conditions were not distinguished, two kinds of errors would occur: (1) point 1 is an abnormal sample of working condition 1 but would be considered normal according to the distribution of working condition 2; (2) point 2 is a normal sample of working condition 1 but would be considered abnormal according to the distribution of working condition 2. An anomaly detection method is established to avoid these two kinds of errors, as shown in Figure 3(b), in which the two working conditions are separated. In low-dimensional space, the normal samples of each independent working condition are learned to form a Gaussian distribution. After the two working conditions are isolated, points 1 and 2 are compared with the distribution of working condition 1: according to the probability density, point 1 is judged abnormal, and point 2 is judged normal. Similarly, point 3 can also be correctly identified.
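The separation idea of Figure 3 can be sketched as follows: fit one Gaussian per working condition and judge each sample against the density of its own condition. The code below is a minimal illustration with synthetic data (all names and numbers are invented for the example, not taken from the paper):

```python
import numpy as np

def fit_gaussian(samples):
    # Fit a diagonal Gaussian to the normal samples of one working condition
    return samples.mean(axis=0), samples.std(axis=0) + 1e-8

def log_density(x, mu, sigma):
    # Log probability density of a diagonal Gaussian
    return -0.5 * np.sum(((x - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2))

rng = np.random.default_rng(1)
cond1 = rng.normal(0.0, 1.0, size=(1000, 2))   # normal samples, condition 1
cond2 = rng.normal(5.0, 1.0, size=(1000, 2))   # normal samples, condition 2
mu1, s1 = fit_gaussian(cond1)
mu2, s2 = fit_gaussian(cond2)

# A point that is abnormal for condition 1 but would look normal
# if it were wrongly judged against condition 2's distribution:
point1 = np.array([4.0, 4.0])
```

Judged against its own condition's density, `point1` is a clear outlier; judged against the other condition's density, it would be misclassified as normal, which is exactly the error the separation avoids.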
By utilizing the ability of a variational auto-encoder (VAE) to learn the data distribution and of a conditional variational auto-encoder (CVAE) to separate the working conditions, a Multiworking Conditions Anomaly Detection method, MW-CVAE, is proposed in this paper. Based on the centralized learning of various working conditions, the proposed method establishes the respective distribution characteristics (latent space) of the samples of different working conditions and calculates an exclusive anomaly metric for each working condition. The new method can overcome the misclassification caused by working-condition interference under multiple working conditions.

Definition of Multiworking Conditions Anomaly Detection. Multiworking conditions anomaly detection is defined as follows: a mechanical system has C kinds of working conditions, C = {0, . . ., c − 1}, and l data samples are collected for each working condition, so that for the C working conditions the total number of samples is n = C · l. Each sample is denoted x_i (i = 1, . . ., n), where the subscript i indicates the i-th sample, and the length of each data sequence is J. A model, described by the score function P(x | c), is established to determine the anomaly algorithm and the corresponding threshold ϵ. If P(x | c) < ϵ, the sample is normal; otherwise, it is abnormal.

3.1. Variational Auto-Encoder (VAE). VAE, shown in Figure 4, is a deep Bayesian network that establishes a relationship between the visible variable x and the latent variable z, whose prior is usually a multivariate unit Gaussian distribution. VAE simulates the data distribution P_θ(x) through a neural network P_θ(x|z) with parameters θ and draws samples from the hidden layer z. Data conforming to the distribution P_θ(x) can be generated via ∫ P_θ(x|z)p(z)dz. Since the true posterior P_θ(z|x) is intractable by analytic methods, similar

to an auto-encoder, the parameter-estimation approach of a VAE is to approximate the distribution P_θ(z|x) by a simple distribution q_ϕ(z|x). Referring to [10], the calculation proceeds as follows. The training loss of a VAE is

L_VAE = E_{q_ϕ(z|x)}[ −log P_θ(x|z) ] + KL( q_ϕ(z|x) ‖ P(z) ).

To calculate the Kullback-Leibler (KL) divergence, the model assumes q_ϕ(z | x) and P(z) follow normal distributions, P(z) ∼ N(0, I) and q_ϕ(z | x) ∼ N(μ_x, σ_x² I). Here μ_x and σ_x² are the mean and variance generated by the encoder network q_ϕ(z | x), and the reparameterization trick is used to sample z.
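Under the Gaussian assumptions above, the KL term has the usual closed form, and the loss and the reparameterization step can be written out as a small sketch (illustrative NumPy code; function names are not from the paper):

```python
import numpy as np

def kl_divergence(mu, log_var):
    """Analytic KL( N(mu, sigma^2 I) || N(0, I) ) used in the VAE loss."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def reparameterize(mu, log_var, rng):
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
    # so gradients can flow through the sampling step
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def vae_loss(x, x_hat, mu, log_var):
    # Reconstruction error (MSE) plus the KL regularizer
    return np.mean((x - x_hat) ** 2) + kl_divergence(mu, log_var)
```

The KL term vanishes exactly when the approximate posterior equals the unit-Gaussian prior (μ = 0, σ² = 1) and is positive otherwise.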

3.2. Conditional Variational Auto-Encoder (CVAE).
CVAE, shown in Figure 5, is a conditionally directed graphical model in which input observations modulate the prior on the latent variables that generate the outputs, in order to model the distribution of a high-dimensional output space as a generative model conditioned on the input observation [13].
There are three types of variables: the observable x, the latent variable z (unknown, unobserved), and the condition c (known, observed), where z and c are independent. The conditional probability P_θ(x | z, c) is formed by a nonlinear transformation with parameters θ; ϕ parameterizes another nonlinear function that approximates the inference posterior q_ϕ(z | x, c) = N(μ, σI). The latent variable z allows multiple modes in the conditional distribution of x given c, making the model expressive enough for one-to-many mappings. To approximate ϕ and θ, the ELBO in (3) is given as

ELBO = E_{q_ϕ(z|x,c)}[ log P_θ(x|z, c) ] − KL( q_ϕ(z|x, c) ‖ P(z|c) ),      (3)

where z is sampled from the Gaussian latent variable. Referring to [7], with the assumptions q_ϕ(z | x, c) ∼ N(μ_x, σ_x² I) and P(z | c) ∼ N(0, I), the second term in (3) has an analytical solution.
The first term in (4), called the reconstruction error, represents the difference between the input x and the reconstruction x̂; the loss function L_CVAE is then expressed as

L_CVAE = (1/J) Σ_{j=1}^{J} (x_j − x̂_j)² + KL( q_ϕ(z|x, c) ‖ P(z|c) ),      (5)

where J is the dimension of x and x̂.
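The conditioning mechanism itself can be sketched by concatenating the one-hot working condition to the inputs of both the encoder and the decoder, a common CVAE construction (function names here are illustrative, not from the paper):

```python
import numpy as np

def one_hot(c, n_conditions):
    # Encode working condition c as a one-hot vector
    v = np.zeros(n_conditions)
    v[c] = 1.0
    return v

def cvae_encoder_input(x, c, n_conditions):
    # The encoder q_phi(z | x, c) sees the sample together with its condition
    return np.concatenate([x, one_hot(c, n_conditions)])

def cvae_decoder_input(z, c, n_conditions):
    # The decoder p_theta(x | z, c) is conditioned in the same way
    return np.concatenate([z, one_hot(c, n_conditions)])
```

Feeding the same condition vector to both networks is what lets a single model hold a separate latent distribution per working condition.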

MW-CVAE Anomaly Detection
4.1. Process of Anomaly Detection. The MW-CVAE model is based on CVAE, and its process is shown in Figure 6. The details of detection are as follows: (a) Data preprocessing. In the data preprocessing for anomaly detection, the most critical part is normalization. References [8,16] adopt mean-variance normalization, namely x_scaled = (x − μ)/σ, where μ and σ are the mean and standard deviation of the training data. However, this method is only applicable when normal and anomalous samples share the same range. Usually, the range of anomalous samples is larger than that of normal ones, and in the training step only normal samples, which have a relatively small range, are used.
When we reach the testing phase, the mean and variance of the testing data may change. To avoid the error caused by normalizing the testing data, this article suggests adopting linear global normalization:

x_scaled = (x − min_x)/(max_x − min_x),

where min_x and max_x represent static minimum and maximum ranges over all samples, respectively, rather than the statistical functions min(x) and max(x). In practical application, we choose proper values of min_x and max_x to ensure all training and testing samples lie in the range (min_x, max_x). (d) Training process. In the training process, Stochastic Gradient Variational Bayes (SGVB) is used [7].
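A minimal sketch of the linear global normalization of step (a), with min_x and max_x as fixed constants chosen in advance rather than statistics computed from the data (illustrative code, not from the paper):

```python
import numpy as np

def global_normalize(x, min_x, max_x):
    """Linear global normalization with fixed range constants, so that
    training and testing data share exactly one scaling."""
    return (x - min_x) / (max_x - min_x)
```

Because min_x and max_x are constants, anomalous test samples with a wider amplitude range cannot distort the scaling the model was trained with.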
Since the CVAE model is adopted, each sample and its working-condition information serve as the input, and the normal samples of all working conditions are used for learning. (e) Anomaly score. In this paper, we adopt the reconstruction-based approach to evaluate the degree of abnormality [27]. The anomaly score is the reconstruction error,

S(x) = (1/J) Σ_{j=1}^{J} (x_j − x̂_j)²,      (7)

which grows when the samples are abnormal; here J is the dimension of the input samples. (f) Determination of the anomaly threshold. The paper's method for determining the abnormal threshold differs significantly from the traditional CVAE method. An anomaly detection method based on an anomaly score must determine a threshold to distinguish normal from abnormal samples, and the selection of this threshold is key to detection performance [27]. Referring to [28], we first obtain a set of test samples with known abnormal labels and compute all anomaly scores. By drawing the curve relating the F1-score (including precision and recall) to the threshold, the threshold corresponding to the maximum F1-score is taken as the best threshold ϵ. However, in the multicondition scenario, due to Misclassification Caused by Working Condition Interference, the optimal threshold obtained by this method still cannot guarantee good anomaly detection performance (the problem analyzed in Section 2.1). To prevent this problem, the proposed method uses the idea of working-condition separation to learn the best threshold. The method given in Algorithm 1 can greatly improve the accuracy of multiworking-condition anomaly detection; its core idea is to group the labeled test samples by working condition, that is, to learn the optimal threshold for each working condition. (g) MW-CVAE anomaly detection. After the detection threshold is determined in step (f), Algorithm 2 can be used to determine the abnormal state under a given working condition.
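Algorithm 1's threshold search can be sketched as follows (illustrative Python; the step size and helper names are assumptions, not taken from the paper): the labeled test scores are grouped by working condition, and a normalized threshold is swept from 0 to 1 for each group, keeping the value that maximizes the F1-score.

```python
import numpy as np

def f1_score(labels, preds):
    # F1 from binary labels (1 = abnormal) and predictions
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold_per_condition(scores, labels, conditions, step=0.01):
    """For each working condition, sweep a normalized threshold over that
    condition's own anomaly scores and keep the one maximizing F1."""
    thresholds = {}
    for c in np.unique(conditions):
        ind = conditions == c                 # wi_ind: samples of condition c
        s, a = scores[ind], labels[ind]       # Si, Ai in the paper's notation
        lo, hi = s.min(), s.max()
        best_f1, best_thr = -1.0, lo
        for t in np.arange(0.0, 1.0 + step, step):
            thr = lo + t * (hi - lo)          # map normalized t to score scale
            f1 = f1_score(a, (s > thr).astype(int))
            if f1 > best_f1:
                best_f1, best_thr = f1, thr
        thresholds[c] = best_thr
    return thresholds
```

Because the sweep runs separately per condition, a condition whose normal scores sit at a higher level cannot drag the threshold of another condition up or down.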

Testing Algorithm. In Step 1, the testing data set X_test, the working conditions C_test, the abnormal labels A_test, and the trained CVAE model are collected; the number of samples is N_test. In Step 2, the optimal threshold is identified for each working condition. We take the set of test samples with known abnormal labels and compute the anomaly scores by equation (7). The indices of the data under a given working condition are recorded as wi_ind; corresponding to these indices, the sample anomaly scores and the true anomaly labels are recorded as Si and Ai, respectively. The normalized threshold t (range: 0 to 1) starts at 0 and increases in fixed steps.

Input: trained CVAE model, X_test, C_test, A_test;
testing data set X_test = {x^(1), x^(2), . . ., x^(N_test)},
working conditions C_test = {c^(1), c^(2), . . ., c^(N_test)},
abnormal labels A_test = {a^(1), a^(2), . . ., a^(N_test)}

5.1. Introduction of Datasets. The CWRU dataset, provided by the Case Western Reserve University Bearing Data Center [29], is one of the most famous open-source datasets in fault-diagnosis research. The CWRU data were collected by accelerometers attached to the housing with magnetic bases. This paper uses the drive-end bearing fault data, whose sampling frequency is 12 kHz; the four working conditions are listed in Table 2.
The JNU dataset, provided by Jiangnan University [23], concerns rolling bearings and contains four health statuses: normal, inner-ring fault, outer-ring fault, and roller fault. Accelerometers collect the vibration signals at three rotating speeds, 600 rpm, 800 rpm, and 1000 rpm, at a sampling frequency of 50 kHz. The three typical working conditions are listed in Table 3.
The PU dataset, provided by Universität Paderborn [24], also concerns rolling bearings; the bearing type is 6203. The faults are divided into real faults and artificial faults, and the latter are discussed in this paper. Among the artificial faults, cracks, spalling, and pitting are machined by electrical discharge machining, drilling, and electrical engraving, respectively. The vibration responses are acceleration signals with a sampling frequency of 48 kHz. The working conditions are divided according to torque, radial force, and rotating speed; the four typical working conditions of the PU dataset are listed in Table 4.

Baseline Methods.
In the experiment, five methods are used for comparison. The former two are machine-learning methods, and the latter three are deep-learning methods.
(a) Principal Component Analysis (PCA). After the eigendecomposition of the sample covariance matrix, the eigenvalues are the variances of the samples projected onto the corresponding axes. A smaller eigenvalue indicates that the samples are concentrated along that axis, where an anomaly produces a more noticeable shift, so it can be used as an indicator to distinguish anomalies [8].
(b) Local Outlier Factor (LOF). The LOF method compares the density of given data points to that of their neighbors. Since outliers come from areas of lower density, their density ratio is higher; LOF detects whether data are normal by comparing the densities of given data points with those of nearby points [9]. (c) Variational Auto-Encoder (VAE). For a VAE, the reconstruction error is used as the score.
According to the maximum of the F1-score on the testing set, the best threshold is determined after linear normalization and is taken as the criterion for anomaly detection [25,27]. (d) Conditional Variational Auto-Encoder (CVAE).
References [12,13] introduce anomaly detection by CVAE. For learning over multiple working conditions, best_thr is the threshold corresponding to the maximum of the F1-score. (e) Multiworking Conditions Anomaly Detection (MW-CVAE). When there are C kinds of working conditions, the data are divided into C independent working conditions, and threshold determination and anomaly detection are carried out for each independent working condition.
To introduce the performance indexes F1-score and Accuracy, the confusion matrix in Table 5 is used.
TP is the number of positive (abnormal) samples predicted to be positive; FN is the number of positive samples predicted to be negative; FP is the number of negative (normal) samples predicted to be positive; and TN is the number of negative samples predicted to be negative. The formulas of Accuracy, Precision, Recall, and F1-score, in equations (8)-(11), respectively, are

Accuracy = (TP + TN)/(TP + TN + FP + FN),      (8)
Precision = TP/(TP + FP),      (9)
Recall = TP/(TP + FN),      (10)
F1-score = 2 · Precision · Recall/(Precision + Recall).      (11)

Table 6 compares the results of MW-CVAE and CVAE. The method proposed in this paper shows a significant improvement: Accuracy is improved by 0.16-0.1645, AUC by 0.1139-0.1154, and F1-score by 0.1815-0.1918. The determination of the abnormal threshold presented in this paper differs from the CVAE method: MW-CVAE learns the best threshold per working condition, following the idea of working-condition separation, and thus avoids the interference between working conditions.
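Equations (8)-(11) can be checked with a few lines of code (illustrative helper, not from the paper):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, Precision, Recall, and F1 computed from the confusion
    matrix entries, matching equations (8)-(11)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For a balanced example with 8 true positives, 2 false positives, 2 false negatives, and 8 true negatives, all four metrics equal 0.8.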

Results of Anomaly Detection
Figure 10 compares the receiver operating characteristic (ROC) curves of VAE and CVAE on the JNU dataset. The AUC of CVAE is higher than that of VAE, and the AUC of MW-CVAE is higher than that of MW-VAE; the reason is the interference between different working conditions when using VAE. It can also be seen that, apart from the AUC of one method on the PU dataset, the MW-CVAE method achieves the best F1-score and AUC values on all datasets compared with the other methods.

Effects of Latent Space.
Figure 11 shows the distribution of the learned latent samples at different epochs when the hidden-layer size is 2. The red, blue, and green points represent learning samples of three different working conditions. As the epochs progress, the network gradually converges; meanwhile, the final distribution of each working condition forms the normal distribution corresponding to that working condition. However, the mean and variance differ between working conditions, precisely in line with the assumption of CVAE.

Conclusion
Aiming at the problem that the accuracy of mechanical-system anomaly detection is significantly reduced under multiple working conditions, an MW-CVAE is proposed in this paper. The working condition is encoded as a conditional input to establish the anomaly detection model. Compared with the typical CVAE model, each working condition has its own best anomaly detection threshold in the MW-CVAE model, which improves detection: for instance, the AUC increases by 11-12%, and the F1-score increases by 18-19%. The method proposed in the paper can be applied to the anomaly detection of discrete working conditions in industry. According to the distribution of the multiple working conditions in the latent space of the MW-CVAE model, each working condition tends to a normal distribution of its own after the convergence of learning, providing a basis for classification by working condition and for anomaly detection.

Figure 2: Anomaly detection of multiworking conditions based on VAE (JNU dataset with three working conditions).
(b) Network structure. The network structure is shown in Figure 7. The input layer adopts the normalized vibration signal x, and the number of nodes is 8192. The working condition c adopts a 30-node one-hot coding mode, detailed in part (c). The encoder is composed of fully connected layers of 400 and 200 nodes, respectively, with ReLU activation; the hidden-layer size is 2. The decoder, conversely, uses fully connected layers of 200 and 400 nodes, and the number of nodes in the output layer is 8192. (c) Encoding of working conditions. The encoder and decoder receive the same additional input representing a specific working condition c. The working condition c is a known variable in the training and testing process and may be one or more of the following: (i) the devices run at different speeds; (ii) the devices operate under different loads; (iii) the devices produce workpieces with different specifications. The working condition c is encoded in one-hot form. Since the speed of a device is continuous, for example between 200 and 950 rpm, the number of possible values is infinite; in practice, for the sake of safety, energy saving, and efficiency, only a few working speeds are used, such as low speed (0), medium speed (1), and high speed (2). The load and specification can be encoded similarly. The one-hot encoding of working conditions is shown in Figure 8.
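The layer sizes of step (b) can be sketched as an untrained forward pass (NumPy, shapes only; the random weights are purely illustrative, and the variance head of the encoder is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(0.0, a)

def dense(n_in, n_out):
    # Random weights: this only illustrates the layer shapes, not a trained model
    return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)

# Sizes from the paper: 8192-point signal, 30-node one-hot condition,
# 400/200-node encoder layers, latent size 2, mirrored decoder.
X_SIZE, C_SIZE, Z_SIZE = 8192, 30, 2
w1, b1 = dense(X_SIZE + C_SIZE, 400)
w2, b2 = dense(400, 200)
wm, bm = dense(200, Z_SIZE)            # mean head of the encoder
w3, b3 = dense(Z_SIZE + C_SIZE, 200)
w4, b4 = dense(200, 400)
w5, b5 = dense(400, X_SIZE)

x = rng.normal(size=X_SIZE)            # normalized vibration signal
c = np.zeros(C_SIZE); c[0] = 1.0       # one-hot working condition

# Encoder: [x; c] -> 400 -> 200 -> latent mean
h = relu(relu(np.concatenate([x, c]) @ w1 + b1) @ w2 + b2)
mu = h @ wm + bm
# Decoder: [z; c] -> 200 -> 400 -> reconstruction
d = relu(relu(np.concatenate([mu, c]) @ w3 + b3) @ w4 + b4)
x_hat = d @ w5 + b5
```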

Figure 3: The idea of the paper.

5.3.1. Performance. The anomaly detection results of MW-CVAE and CVAE are compared on the JNU dataset, as shown in Figure 9. Figures 9(a)-9(c) demonstrate the visual display of anomaly detection by MW-CVAE under wc = 0, wc = 1, and wc = 2, and Figure 9(d) shows that of CVAE under wc = [0, 1, 2]. The blue dashed line stands for the position of the best threshold. In Figure 9(d), there are stratifications between the three normal working conditions, indicating that the three working conditions are at different anomaly levels (green "+" points in Figure 9(d)). Figures 9(a)-9(c) show the distributions of the best thresholds and the samples; due to the separation of working conditions, the interference between working conditions is avoided.

Figure 12 shows the F1-score on the JNU and PU datasets under different sizes K of the hidden layer. When K increases from 2 to 200, the F1-score on the JNU dataset fluctuates slightly for K between 2 and 10; in the other cases, the F1-score remains almost the same. This indicates that the size of the hidden layer has little effect on the accuracy of MW-CVAE.

Table 1: Results of VAE-based anomaly detection.

Table 2: Four working conditions.

Table 3: Three working conditions.
(a) VAE. For the encoder, the input-layer size for x is 8192; the nodes in the middle layers are 400 and 200, respectively; the hidden-layer size for z is 2. For the decoder, the input size is 2; the nodes in the middle layers are 200 and 400, respectively; and the size of the output layer is the same as that of the input layer, 8192. (b) CVAE. For the encoder, the input-layer size for x is 8192; the nodes in the middle layers are 400 and 200, respectively; the hidden-layer size for z is 2.
Table 7 compares the F1-score and AUC of the methods on the CWRU, JNU, and PU datasets.

Table 4: Four working conditions.