Fault Diagnosis Method Based on Gap Metric Data Preprocessing and Principal Component Analysis

Principal component analysis (PCA) is widely used in fault diagnosis. Because traditional data preprocessing ignores the correlation between different system variables, the extracted features are inaccurate. To address this, this paper proposes a data preprocessing method based on the Gap metric to improve the performance of PCA in fault diagnosis. For different types of faults, transforming the original dataset through the Gap metric reflects the correlation of the system variables in a high-dimensional space, allowing more accurate modeling. Finally, the feasibility and effectiveness of the proposed method are verified through simulation.


Introduction
As the complexity of industrial manufacturing systems increases, the correlations between system variables become more complex, and these variables contain important information about the status of the system. Fault detection and diagnosis based on the information carried by these variables is therefore an important issue.
However, because system variables have different dimensions, the data usually must be preprocessed and standardized. Traditional preprocessing methods ignore the influence of dimension on the correlation between system variables, so the preprocessed variables lose their correlation, which makes it difficult to extract representative principal components. Maintaining the correlation between system variables is therefore the key to data preprocessing.
Many studies have addressed this problem. Wen et al. proposed Relative Principal Component Analysis (RPCA) [1], which uses prior information about the system to analyze and determine the importance of each component, assigns a corresponding weight to each component, and establishes a relative principal component model. Literature [2] proposed a fault diagnosis method based on the information incremental matrix. Building on this, Yuan et al. proposed a fault diagnosis method based on a relative transformation of the information incremental matrix [3], which can effectively detect variables that play an important role in the system. Although their absolute values and absolute changes are small, small changes in these important variables usually play a crucial role in the system. Xu and Wen proposed a fault diagnosis method based on information entropy and relative principal component analysis [4]; in high-dimensional systems, the high correlation among system variables prevents the model from selecting representative principal components. Their approach uses information entropy to measure the uncertainty of variables, calculates the information gain of each variable, and relatively transforms the data according to the importance of each variable to obtain a more accurate data model. Jiao et al. proposed a method for simulation model validation based on Theil's inequality coefficient (TIC) and principal component analysis [5]: given a model of the differences in position and trend between the simulated output and the reference output, and the correlation between these two differences, PCA is used to obtain the verification results. Kangling et al. proposed a fault diagnosis study based on adaptive-partition PCA [6]; to solve the problem of inaccurate modeling, the diagnosis model is automatically updated and adjusted, improving model matching and the accuracy of diagnosis results.
The Gap metric has been proved more suitable than norm-based metrics for measuring the distance between two linear systems [7, 8], and the effect of dimension on each variable can be reflected in the Riemannian space during data preprocessing. The gap metric is widely used in studies of the uncertainty and robustness of feedback systems. Georgiou proposed a method that can easily compute the gap metric [9], which improves its practicality. Literature [10] proposed the ν-gap metric in the frequency domain, which was later extended to nonlinear systems [11]. Ebadollahi and Saki used the gap metric in multimodel predictive control to divide the entire operating region of a partial-load system into corresponding linear models, tracking the maximum power point without losing control performance; this method ensures the stability of the original closed-loop system [12]. Konghuayrob and Kaitwanidvilai used the ν-gap to measure the distance between two linear systems [13]; they replaced traditional high-order, complex controllers with low-order controllers having similar dynamic characteristics and robustness. For multilinear-model control of nonlinear systems, Du proposed a weighting method based on the gap metric [14], which computes the weighting functions of the local controller combination; the validity of the method was verified on a CSTR system.
In PCA based on traditional data preprocessing, the dimension of the original data is removed using preprocessing based on the Euclidean metric, which ignores some important information in the data; often, this information belongs to important variables that carry slowly changing fault information. In the method proposed in this article, the Gap metric projects the data onto a Riemann sphere in the Riemann space, which can highlight information easily ignored in the Euclidean space. After the Gap metric preprocessing is applied, eigenvalue and eigenvector decomposition is performed on the processed data matrix, and the principal component space is constructed according to the cumulative percent variance criterion. Since the Gap metric highlights variables with small absolute changes but relatively large variations, representative principal components and principal component vectors can be extracted when constructing the principal component space. By calculating the T² statistical limit and the SPE statistical limit of the normal-system dataset, we detect whether the T² and SPE statistics of the test dataset exceed these limits to judge whether the system is faulty; the fault variables are then separated by the contributions of the system variables to the fault samples.
The rest of this article is organized as follows. In the second section, we briefly review the PCA approach. In the third section, we propose an improved PCA data preprocessing method based on the gap metric. In the fourth section, we set up a system model and test the feasibility and effectiveness of the proposed method on different types of faults. In the fifth section, we give a summary and future research directions.

PCA Based on Traditional Data Preprocessing
The basic idea of PCA is to decompose multivariable sample space into lower dimensional principal component subspaces composed of principal components variables and a residual subspace according to the historical data of process variables.
Statistics that reflect changes in each subspace are constructed, the sample vectors are projected into the two subspaces, and the distance from each sample to each subspace is computed. Process monitoring and fault detection are performed by comparing these distances with the corresponding statistical limits. First, the variable space is modeled by PCA. Select a set of samples under normal conditions as the original data: X_n ∈ R^{n×m} is a data matrix containing m variables, each with n independent samples,

X_n = [x_1, x_2, ..., x_m],

where each column of X_n represents a variable and each row represents a sample. Because the dimensions of the measured variables differ, each column of the data matrix is normalized; let the normalized measurement data matrix be X*. The covariance matrix

S = X*^T X* / (n − 1)

is processed by eigenvalue decomposition, with the eigenvalues arranged in descending order. The PCA model decomposes X* as

X* = T P^T + E,

where T P^T is the projection in the principal component space, E is the projection in the residual space, P ∈ R^{m×k} is the load matrix consisting of the first k eigenvectors of S, and T = X* P ∈ R^{n×k} is the score matrix, whose elements are called the principal variables; k is the number of principal components.
The principal component space is the part of the modeling, and the residual space is not the part of the modeling, which represents the noise and fault information in the data.
The number of principal components k is selected using the Cumulative Percent Variance (CPV) criterion, which determines the number of principal components from the cumulative percentage of variance they explain. The CPV is the ratio of the variation explained by the first k principal components to the total variation:

CPV(k) = (Σ_{i=1}^{k} λ_i / Σ_{i=1}^{m} λ_i) × 100%,

where λ_i is the i-th eigenvalue of the covariance matrix S. In general, when the cumulative contribution rate reaches 85% or more, the first k principal components are considered to contain enough information about the original data.
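The covariance eigendecomposition and CPV-based selection described above can be sketched as follows (a minimal illustration; the function name and 0.85 default are chosen to match the text, not taken from the paper's code):

```python
import numpy as np

def pca_model(X, cpv_threshold=0.85):
    """Build a PCA model from normalized data X (n samples x m variables).

    Returns the load matrix P (first k eigenvectors), all eigenvalues in
    descending order, and k chosen by the Cumulative Percent Variance rule."""
    n = X.shape[0]
    S = (X.T @ X) / (n - 1)               # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]     # rearrange in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cpv = np.cumsum(eigvals) / np.sum(eigvals)
    k = int(np.searchsorted(cpv, cpv_threshold) + 1)  # smallest k with CPV >= threshold
    P = eigvecs[:, :k]                    # load matrix, m x k
    return P, eigvals, k
```

The scores are then T = X @ P, and the residual part E = X − T @ P.T completes the decomposition X = T P^T + E.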

Improved Data Preprocessing Method
In this section, we propose a kind of data preprocessing method based on Gap metric.

Gap Metric.
In the Riemann space, P(z_1) and P(z_2) denote the stereographic projections of the complex numbers z_1 and z_2 onto a three-dimensional Riemann sphere of diameter 1. The length of the chord connecting P(z_1) and P(z_2) defines the gap metric

δ(z_1, z_2) = |z_1 − z_2| / (√(1 + |z_1|²) · √(1 + |z_2|²)),

while the spherical distance between z_1 and z_2 is the length of the arc connecting P(z_1) and P(z_2) on the Riemann sphere. As can be seen from Figure 1, the shortest arc is obtained by cutting the Riemann sphere with the plane determined by three points: the center of the sphere, P(z_1), and P(z_2).
The Gap metric has similar properties in the data space as in control systems. Its nature in the data space is as follows: (1) the Gap metric can be regarded as a characterization of the distance between data in the Riemann space, an extension of traditional norm-based metrics.
(2) The value of the Gap metric lies in the range 0 to 1. The smaller the value, the closer the characteristics of the two datasets; the larger the value, the greater the difference between them. If the Gap metric of two datasets is 0, they have exactly the same characteristics.
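Both properties can be checked directly with the pointwise chordal-distance form of the gap metric (a sketch; the scalar formula below is the standard chordal metric on a diameter-1 Riemann sphere, assumed here to be the form used in the preprocessing steps):

```python
import math

def gap(z1, z2):
    """Pointwise gap metric: chordal distance between the stereographic
    projections of z1 and z2 on a Riemann sphere of diameter 1.
    The value always lies in [0, 1]; it is 0 only when z1 == z2."""
    return abs(z1 - z2) / (math.sqrt(1 + abs(z1) ** 2) * math.sqrt(1 + abs(z2) ** 2))
```

For example, gap(0.01, 0.02) is several orders of magnitude larger than gap(100.0, 101.0), even though the second pair differs by a hundred times more in absolute terms; this is exactly the emphasis on relative change in small-valued variables that the paper exploits.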
Step 1. Project the original data onto the Riemann sphere and compute the gap metric for each entry, producing the matrix X*:

x*_{ij} = δ(x_{ij}, x̄_j) = |x_{ij} − x̄_j| / (√(1 + x_{ij}²) · √(1 + x̄_j²)),

where x̄_j is the mean of the j-th column of X_n.

Step 2. Perform eigenvalue and eigenvector decomposition on X*, select the principal components based on the Cumulative Percent Variance (CPV) criterion, and construct the principal component space and the residual space.
Step 3. Calculate the T² statistical limit from the principal component space of the normal data and the SPE statistical limit from the residual space.
Step 4. Let the test data matrix be Y_n ∈ R^{n×m}. Preprocess Y_n with the Gap metric to obtain Y*:

y*_{ij} = δ(y_{ij}, b_j) = |y_{ij} − b_j| / (√(1 + y_{ij}²) · √(1 + b_j²)),

where b = (b_1, ..., b_m) is the mean vector of X_n.
Step 5. Calculate the T² statistic and the SPE statistic separately and detect whether each statistic exceeds the corresponding limit of the normal data. If a statistic exceeds its limit, the system is judged faulty; otherwise, it is judged normal.
Step 6. The fault variables can be separated by the contributions of the system variables to the fault in the residual space.
The physical meaning of the Gap metric is the chord length of the data projected onto the Riemann sphere, which highlights the impact of the relative changes of the data. The transformed data does not ignore variables that have small absolute changes but relatively large relative changes; frequently, these variables also contain important information.
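Steps 3, 5, and 6 can be sketched as follows (a minimal illustration: the statistic definitions are the standard PCA forms, and the percentile-based limits are an empirical stand-in for the exact distributional limits):

```python
import numpy as np

def monitoring_stats(X, P, eigvals):
    """Hotelling's T^2 and SPE (Q) statistics for each row of the
    preprocessed data X, given the load matrix P (m x k) and the
    descending eigenvalues of the covariance matrix."""
    k = P.shape[1]
    T = X @ P                                # scores in the principal space
    t2 = np.sum(T ** 2 / eigvals[:k], axis=1)  # Hotelling's T^2
    E = X - T @ P.T                          # residual-space projection
    spe = np.sum(E ** 2, axis=1)             # squared prediction error (SPE)
    return t2, spe

def control_limits(X_normal, P, eigvals, q=0.99):
    """Empirical control limits from the normal dataset (Step 3):
    a percentile stand-in for the usual F / chi-square approximations."""
    t2, spe = monitoring_stats(X_normal, P, eigvals)
    return np.quantile(t2, q), np.quantile(spe, q)

def variable_contributions(x, P):
    """Residual-space contribution of each variable for one sample x:
    large entries point to the faulty variable (Step 6)."""
    e = x - (x @ P) @ P.T
    return e ** 2
```

A test sample is flagged as faulty (Step 5) when its T² or SPE value exceeds the corresponding limit returned by `control_limits`.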

Simulation
To verify the effectiveness of the proposed method, six system variables are constructed from random variables and their linear combinations, where randn(1, n) denotes a random sequence of 1 row and n columns generated by MATLAB. First, 1000 normal samples are chosen to establish the PCA model; then 1000 samples are selected as test data, a constant bias fault of magnitude 3 is introduced into the last 200 samples of variable x6 in the test data, and the test data are detected with the model. The SPE statistic and Hotelling's T² statistic are used as indicators to measure the numbers of false alarms and missed detections of PCA.
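This setup can be sketched as follows (the paper's exact variable equations are not reproduced above, so the random drivers and linear combinations below are hypothetical stand-ins; only the sample counts and the magnitude-3 bias on x6 follow the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical 6-variable system: three independent random drivers
    plus three linear combinations (stand-ins for the paper's equations)."""
    x1, x2, x3 = rng.normal(size=(3, n))
    x4 = 0.8 * x1 + 0.2 * x2
    x5 = 0.5 * x2 + 0.5 * x3
    x6 = 0.3 * x1 + 0.7 * x3
    return np.column_stack([x1, x2, x3, x4, x5, x6])

X_train = make_data(1000)       # 1000 normal samples for the PCA model
X_test = make_data(1000)        # 1000 test samples
X_test[800:, 5] += 3.0          # constant bias fault of magnitude 3 on x6,
                                # injected into the last 200 test samples
```

The PCA (or GAP-PCA) model is then fitted on `X_train` and the monitoring statistics of `X_test` are compared against the control limits.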
As shown in Table 1, the average numbers of false alarms and missed detections of the PCA, InEnPCA, and GAP-PCA methods are obtained over 10 simulation runs.
In the test, samples 1∼800 that exceed the control limit are counted as false alarms, and samples 801∼1000 that fall below the control limit are counted as missed detections. As can be seen from Figures 2-4, the false alarm rates of PCA, InEnPCA, and GAPPCA are all no higher than 2%. However, the T² statistics of traditional PCA and InEnPCA show a large number of missed detections: the missed detection rates of T² in PCA and InEnPCA are 27% and 33%, respectively, while that of GAPPCA is only 4%. This shows that after the data are preprocessed with the gap metric, the principal component model contains most of the key information of the system, improving the accuracy of fault detection. In fault diagnosis, the contribution graph shows that the contribution rate of the fault variable x6 in GAPPCA is more easily distinguished from the contributions of the other variables than in PCA and InEnPCA, because projecting the system variables onto the Riemann sphere via the gap metric better reflects the correlation between system variables.
The application of the traditional PCA method in fault diagnosis uses the absolute distance between samples as the criterion for fault detection and diagnosis. However, in real systems, microfaults often produce very small changes, so detecting minor faults is essential. Literature [15] combines the PCA technique with univariate exponentially weighted moving averages for correlated variables in chemical processes. In view of the small deviation between normal operation and microfaults, [16] combines a probability distribution metric with the Kullback-Leibler measure to quantify residuals between latent scores and reference scores and proposes PCA control limits for small faults. Literature [17] builds an analysis model on literature [16]: the authors apply the Kullback-Leibler divergence (KLD) to the principal component variables obtained after dimensionality reduction with PCA and use this method to diagnose microfaults.
To verify the detection accuracy in the case of a microfault, a slight fault with a slowly increasing rate of 0.1% is introduced into the last 200 sample points of variable x6 in the test data, and this model is used to test the fault diagnosis performance of the three methods.
As shown in Table 2, the average numbers of false alarms and missed detections of the PCA, InEnPCA, and GAP-PCA methods are obtained over 10 simulation runs.
As can be seen from Figures 5-7, in the detection of microfaults the changes in the system variables are small, so the traditional PCA method cannot extract representative principal components; as Figure 6 shows, PCA cannot reflect the fault status of the system during fault detection and diagnosis. Because the relative change of a microfault is larger than its absolute change, GAPPCA can extract representative principal component variables from small changes in the system variables, and its detection results are better than those of traditional PCA. In fault diagnosis, InEnPCA reflects the information gain of the system variables and thus performs well. Moreover, since the gap metric reflects the correlation of system variables better in the Riemannian space than in the Euclidean space, when the contribution graph is used to separate the fault variables, the contribution rate of the fault variable x6 in the GAPPCA contribution graph stands out clearly.
To sum up, compared with the traditional PCA preprocessing method in Euclidean space, the preprocessing method based on the gap metric better reflects the correlation information between the variables, and faults occurring in small but important variables can be diagnosed accurately.

Conclusions
In this paper, we propose a fault diagnosis method based on Gap metric data preprocessing and PCA. When some variables play an important role in the system but their absolute values are small, traditional methods cannot detect the smaller faults in them. The proposed method detects the source of a fault more accurately and reduces both the false alarm rate and the missed detection rate.
However, some problems remain to be studied. In the PCA fault diagnosis method based on Gap metric data preprocessing, projections may overlap: the fault information carried by fault samples may be projected into the normal region after Gap metric processing. How to separate fault information from normal information in high-dimensional space is the focus of further research.

Figure 1 :
Figure 1: The geometric meaning of the gap metric.