A New Support Vector Regression Model for Equipment Health Diagnosis with Small Sample Data Missing and Its Application

In practice, it is difficult to obtain a large number of sample data because of equipment failure, and the small sample data that are collected may themselves contain missing values. This paper proposes a novel method for filling missing values in small sample data, based on support vector regression (SVR) and a genetic algorithm (GA), to improve equipment health diagnosis. First, the genetic algorithm is used to optimize support vector regression, which yields the GA-SVR model. The GA-SVR model is trained on the other data of the variable to which the missing data belong, giving the single-variable prediction method. Correlation analysis is then used to reconstruct the training set, and GA-SVR is trained on the data of the variables related to the missing data, giving the multiple-variable prediction method. Next, a dynamic weight is introduced to combine the single-variable and multiple-variable prediction methods, and the missing data are filled with the combined prediction. The filled data are used as the input of GA-SVM to diagnose equipment failure. Finally, a case study is given to verify the applicability and effectiveness of the proposed method.


Introduction
For equipment health diagnosis, complete monitoring data is the premise and foundation of an accurate diagnosis. However, in actual engineering applications, many monitoring sample data are incomplete: the samples may be small, unbalanced, or contain missing values. During data collection, equipment may not operate normally because of a fault, or it may be affected by the environment, so that little effective monitoring data is collected and few failure samples are available. The sample data may also be missing because of abnormal data transmission, sensor repair and replacement, or human factors. This paper focuses on the case of missing values in small sample data.
Recently, with the rapid development of technology, equipment health diagnosis has attracted wide attention from experts and scholars. The intelligent diagnosis methods applied to equipment health diagnosis mainly include the expert system (ES), neural networks (NNs), and the support vector machine (SVM). For the expert system, Husain [1] extended the fault diagnosis of the power transformer, proposed a fuzzy logic expert system for early fault diagnosis of the transformer, and addressed the shortcomings of traditional transformer fault diagnosis methods. Berredjem and Benidir [2] proposed a fuzzy expert system based on an improved range overlap method and similarity division method to solve the problem of high noise in bearing fault data. The system was used to realize accurate bearing fault diagnosis, and the feasibility of the model was verified by an example analysis. Cheriet et al. [3] proposed an expert system based on fuzzy logic, which used the stator current signal for fault diagnosis, and verified the feasibility of the expert system for fault diagnosis of doubly fed wind turbines through simulation experiments. Xu et al. [4] carried out a series of studies on the fault diagnosis of marine diesel engines, proposed a diagnosis expert system based on belief rules, and applied the proposed method to the abnormal wear detection of marine diesel engines, indicating that the method had good accuracy and stability. An equipment health diagnosis method based on the expert system can acquire knowledge from diagnosis examples, but it cannot automatically acquire new knowledge, and its fault tolerance is relatively poor. Thus, the fault diagnosis method based on the expert system has great limitations in practical application.
For neural networks, Xing et al. [5] constructed an automatic fault diagnosis method for reciprocating compressors based on information entropy and radial basis function neural networks. The test results showed that the fault diagnosis method can effectively improve the accuracy of automatic fault diagnosis and the practicability of the condition monitoring system. Yang et al. [6] analyzed the fault diagnosis of rotating machinery, proposed an intelligent diagnosis method based on a long short-term memory recurrent neural network, and detected and classified faults with the help of the correlation between time and space. Gunerkar et al. [7] established a rolling bearing fault diagnosis model based on an artificial neural network (ANN) and applied the wavelet transform to preprocess the original signal and extract fault features. ANN and the k-nearest neighbor method were used for fault classification of rolling bearings, and the validity of the model was verified by test. To solve the problem of end-to-end fault diagnosis of rotating machinery, Wu et al. [8] constructed a one-dimensional CNN model that can learn features directly from the original signal, applied it to the fault diagnosis of the fixed gearbox and planetary gearbox, and showed that the model had high diagnostic accuracy. Han et al. [9] proposed a method for fault diagnosis of the planetary gearbox by using an expanded neural network, which doubled the receptive field so as to enhance the learning of fault features and improve the diagnosis accuracy. The fault diagnosis method based on an artificial neural network often needs a large number of fault samples to train the network, but it is difficult to obtain enough fault data in practical engineering applications. In addition, the neural network suffers from slow convergence, overfitting, and a tendency to fall into local optima, which have a negative impact on the diagnostic accuracy of the equipment.
For the support vector machine, Huang and Fei et al. [10,11] used the SVM model for equipment fault diagnosis and verified that the model has high accuracy and good generalization ability. Yang et al. [12] established an SVM fault classification model using an ant colony algorithm and verified the effectiveness of the model. Zhang et al. [13] combined SVM with an improved imperialist competitive algorithm and applied it to fault diagnosis of the oil-immersed transformer. The results showed that the method was feasible and effective. Yan and Jia [14] proposed a fault recognition algorithm based on optimized multidomain feature SVM. The feature vectors of fault samples were extracted from the time domain, frequency domain, and time-frequency domain, and the Laplace fractional algorithm was introduced to filter fault features. Zhong et al. [15] established a diagnosis model based on convolutional neural network transfer learning and SVM and verified the effectiveness of the model through an example. To improve the accuracy of transformer fault diagnosis, Huang et al. [16] proposed a diagnosis method based on an improved gray wolf algorithm and SVM. The differential evolution mechanism was introduced into the gray wolf optimization algorithm to improve its performance, and then the SVM optimized by the improved gray wolf algorithm was used for fault diagnosis of the transformer.
Equipment fault diagnosis under incomplete data has also received some research attention. Zhang and Dong [17] proposed an online nonimputation reasoning method based on mixed Gaussian output for fault detection and identification and showed that the method can accurately identify the fault. Mao et al. [18] studied bearing fault diagnosis with unbalanced data and constructed an online fault prediction method based on an extreme learning machine. The simulation experiment showed that the method can obtain high fault diagnosis accuracy. Liu et al. [19] proposed a Bayesian network parameter learning method based on BPNN and maximum likelihood estimation to solve the problem of solar-assisted heat pump fault diagnosis when small sample data are missing and expert knowledge is lacking. The BP neural network was used to predict and fill in the missing sample data, and the effectiveness of the method was verified by simulation. Chen et al. [20] constructed a fault diagnosis model for missing data based on transfer learning, aimed at fault diagnosis problems in which the complete sample size is too small; an appropriate transfer learning mechanism was established to improve the accuracy of fault diagnosis, and the effectiveness of this method was verified by data. Zhao et al. [21] constructed a rolling bearing fault diagnosis model based on normalized CNN under unbalanced data and eliminated the difference in feature distribution by batch normalization. The experimental results showed that the model has a good diagnosis effect and robustness for rolling bearing fault diagnosis under unbalanced data. Qian and Li [22] established an imbalance-robust network for bearing fault diagnosis, which was used to solve the class imbalance problem in the feature extraction stage and classification stage, and the method was verified by simulation analysis. Zhang et al. [23] proposed using deep learning to solve the problem of fault diagnosis when the data are unbalanced and established a deep generative adversarial network to generate synthetic samples that balance the sample data. Simulation experiments showed that the proposed method has a better effect on fault diagnosis under unbalanced data.
When sample data are collected in the field of fault diagnosis, a large number of fault samples cannot be obtained because the equipment may not operate normally once a fault exists. At present, most research on equipment fault diagnosis is based on complete data sets; research on equipment fault diagnosis under incomplete data is scarce, and it suffers from problems such as a complex diagnosis process, long diagnosis time, and unsatisfactory accuracy. Missing values in small sample data not only increase the difficulty of data analysis but also greatly affect the accuracy of equipment failure diagnosis. Most methods for equipment failure diagnosis under missing data need a large number of failure samples to obtain accurate diagnosis results. In practice, because of equipment aging or human error, a large number of samples cannot be collected, and some of the collected data are missing. Thus, the objective of this paper is to propose a novel method for filling missing values in small sample data based on GA-SVR to improve the equipment failure diagnosis effect.
For equipment fault diagnosis, ANN needs a large number of failure samples to train the neural network, but it is difficult to obtain enough failure data in practical application. Additionally, the neural network suffers from slow convergence, overfitting, and a tendency to fall into local optima. These have an adverse impact on the diagnostic accuracy of equipment. In practice, equipment may not operate normally because of failure, so a large number of failure samples cannot be obtained. SVR needs fewer training samples and has high model accuracy. Thus, it is suitable for equipment fault diagnosis in the case of small samples. The advantages of GA lie in its fast optimization speed, good effect, and strong global search ability, and it does not easily fall into a local optimal solution. Thus, it is used to optimize the key parameters of SVR. In this paper, first, the GA-SVR model is trained on the other data of the variable to which the missing data belong, giving the single-variable prediction method. Correlation analysis is used to reconstruct the training set, and GA-SVR is trained on the data of the variables related to the missing data, giving the multiple-variable prediction method. Then, a dynamic weight is introduced to combine the single-variable prediction method with the multiple-variable prediction method, and the missing data are filled with the combined prediction. The filled data are used as the input of GA-SVM to diagnose equipment failure. Finally, a case study is given to verify the applicability and effectiveness of the proposed method.
This paper aims to develop a new method for equipment health diagnosis. The paper is organized as follows. In Section 2, the basic theories of SVR and GA are introduced. Section 3 develops a novel GA-SVR. In Section 4, a case study for equipment health diagnosis with small sample data missing is analyzed and discussed. Finally, conclusions are drawn in Section 5.

Support Vector Regression.
Support vector regression (SVR) uses the given sample data to fit a continuous function that reflects the relationship between input and output. When the samples cannot be fitted well by a linear function, SVR uses a nonlinear transformation to map the data set to a high-dimensional space and carries out regression fitting in this space to establish the continuous function with the minimum loss.
The key parameters of SVR include the insensitive loss parameter ε, the radial basis function parameter σ, and the penalty factor C. ε represents the width of the insensitive region and plays a decisive role in the number of support vectors and the generalization ability of the model. σ determines the complexity of the sample mapping space. A larger σ makes it difficult to obtain high regression accuracy, whereas a smaller σ gives high regression accuracy but poor generalization ability. C represents the degree of penalty for samples with an error greater than ε. A larger C means a heavier penalty for such samples: the training accuracy can be improved, but the generalization ability of the model is poor. A smaller C means a very light penalty, which causes a large training error. These three key parameters determine the performance of SVR; thus, it is necessary to optimize them to improve the prediction effect of SVR.
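To make the roles of ε and σ concrete, the following minimal Python sketch evaluates an ε-insensitive loss and a Gaussian RBF kernel. The function names are illustrative, and the kernel scaling exp(-‖x - z‖² / (2σ²)) is an assumed convention, since the text does not state which parametrization of σ is used.

```python
import math


def eps_insensitive_loss(y_true, y_pred, eps):
    """Loss is zero inside the tube of half-width eps and linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)


def rbf_kernel(x, z, sigma):
    """Gaussian RBF kernel; the 2 * sigma**2 scaling is an assumed convention."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

Under this convention, a small σ makes the kernel value drop quickly with distance, so each training point influences only its close neighbors, which matches the remark that a small σ fits the training data closely but generalizes poorly.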

Genetic Algorithm.
Genetic algorithm (GA) is a kind of heuristic optimization technology. GA searches from the initial population generated randomly, and the individuals in the population evolve through selection, crossover, and mutation based on the fitness function until the iteration termination condition is met, and the optimal solution is output.
The advantages of GA include fast optimization speed and strong global search ability, and it does not easily fall into a local optimal solution. It is widely used in various optimization problems such as parameter optimization and path optimization. The basic procedure of GA is as follows:
Step 1. The chromosome is coded to determine the initial population.
Step 2. The fitness function is defined to evaluate the fitness value of individuals.
Step 3. A new population is generated by selection, crossover, and mutation.
Step 4. The individuals that satisfy the iteration termination condition are retained.
Step 5. Decoding outputs the global optimal solution.
In this paper, for the problem of equipment health diagnosis, SVR is used to predict and fill the missing data. The values of the kernel function parameter σ, the penalty factor C, and the insensitive loss parameter ε in SVR are particularly important. Thus, each set of key parameters (C, σ, ε) of SVR can be regarded as an individual in a population, and the key parameters of SVR can be optimized by GA to improve the prediction performance of SVR.
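The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a real-valued encoding (so no explicit coding and decoding step), tournament selection, arithmetic crossover, uniform mutation, and elitism, none of which are specified in the text. The search ranges, population size, and iteration count are the ones reported later in the case study; `run_ga` and the other names are hypothetical.

```python
import random

# Search ranges for (C, sigma, eps), taken from the case study settings.
BOUNDS = [(0.1, 1000.0), (0.01, 100.0), (0.01, 1.0)]


def random_individual(rng):
    return [rng.uniform(lo, hi) for lo, hi in BOUNDS]


def tournament(pop, fits, rng, k=3):
    """Pick k random individuals and return the one with the lowest fitness."""
    idx = rng.sample(range(len(pop)), k)
    return pop[min(idx, key=lambda i: fits[i])]


def crossover(a, b, rng):
    """Arithmetic crossover: per-gene convex combination, clamped to the bounds."""
    child = []
    for x, y, (lo, hi) in zip(a, b, BOUNDS):
        t = rng.random()
        child.append(min(max(t * x + (1.0 - t) * y, lo), hi))
    return child


def mutate(ind, rng, rate=0.1):
    """With probability `rate`, resample a gene uniformly within its bounds."""
    return [rng.uniform(lo, hi) if rng.random() < rate else g
            for g, (lo, hi) in zip(ind, BOUNDS)]


def run_ga(fitness, pop_size=20, generations=100, seed=0):
    """Minimize `fitness`; returns (best individual, best fitness, history)."""
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        fits = [fitness(ind) for ind in pop]
        best = pop[min(range(pop_size), key=lambda i: fits[i])]
        history.append(min(fits))
        children = [best]  # elitism: the current best always survives
        while len(children) < pop_size:
            p1 = tournament(pop, fits, rng)
            p2 = tournament(pop, fits, rng)
            children.append(mutate(crossover(p1, p2, rng), rng))
        pop = children
    fits = [fitness(ind) for ind in pop]
    i = min(range(pop_size), key=lambda j: fits[j])
    return pop[i], fits[i], history
```

In the setting of this paper, `fitness` would train an SVR with the candidate (C, σ, ε) and return its cross-validated error; any callable that maps a three-element list to a number can be plugged in for experimentation.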

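Support Vector Regression Optimized by Genetic Algorithm.
SVR is obtained by introducing the insensitive loss function into SVM. It is usually used to solve regression fitting problems and to seek a regression function representing the relationship between input and output. Consider the given data set $\{(x_i, y_i)\}$, $i = 1, 2, \ldots, N$, where $x_i \in \mathbb{R}^n$ is the input sample and $y_i \in \mathbb{R}$ is the expected output value. Assume that SVR maps samples to a high-dimensional space by the nonlinear transformation $\phi(\cdot)$ to establish the regression function, which is as follows:

$$f(x) = w^{T}\phi(x) + b, \tag{1}$$

where $w$ and $b$ are the regression function coefficients. The insensitive loss function with parameter $\varepsilon$ is introduced and defined as

$$L_{\varepsilon}(f(x), y) = \begin{cases} 0, & |y - f(x)| \le \varepsilon, \\ |y - f(x)| - \varepsilon, & |y - f(x)| > \varepsilon. \end{cases} \tag{2}$$

Thus, the objective function can be defined as $\min \frac{1}{2}\|w\|^{2}$, and the constraints are

$$y_i - w^{T}\phi(x_i) - b \le \varepsilon, \qquad w^{T}\phi(x_i) + b - y_i \le \varepsilon, \qquad i = 1, 2, \ldots, N. \tag{3}$$

The relaxation factors $\xi_i$ and $\xi_i^{*}$ are introduced to allow fitting errors; then, the objective function is

$$\min \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{N}(\xi_i + \xi_i^{*}) \quad \text{s.t.} \quad \begin{cases} y_i - w^{T}\phi(x_i) - b \le \varepsilon + \xi_i, \\ w^{T}\phi(x_i) + b - y_i \le \varepsilon + \xi_i^{*}, \\ \xi_i \ge 0, \ \xi_i^{*} \ge 0, \end{cases} \tag{4}$$

where $C > 0$ is the penalty factor, which controls the punishment for errors exceeding $\varepsilon$. By introducing the Lagrange multipliers $\alpha_i$ and $\alpha_i^{*}$, the above problem is transformed into its dual problem:

$$\max_{\alpha, \alpha^{*}} \ -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i - \alpha_i^{*})(\alpha_j - \alpha_j^{*})K(x_i, x_j) - \varepsilon\sum_{i=1}^{N}(\alpha_i + \alpha_i^{*}) + \sum_{i=1}^{N} y_i(\alpha_i - \alpha_i^{*}) \quad \text{s.t.} \quad \sum_{i=1}^{N}(\alpha_i - \alpha_i^{*}) = 0, \ 0 \le \alpha_i, \alpha_i^{*} \le C, \tag{5}$$

where $K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j)$ is the kernel function. By solving equation (5), the regression fitting function can be obtained as follows:

$$f(x) = \sum_{i=1}^{N}(\alpha_i - \alpha_i^{*})K(x_i, x) + b. \tag{6}$$

For the selection of the SVR kernel function, the RBF kernel function is used in this paper, and its parameter $\sigma > 0$ is the kernel function width factor. It has an important influence on the regression prediction effect of SVR. Missing values in small sample data have a great influence on the equipment diagnosis results; thus, this paper uses SVR to execute regression fitting for the missing data. However, the key parameters C, σ, and ε have a great influence on the regression prediction accuracy of SVR. GA is used to optimize C, σ, and ε to improve the prediction performance of SVR for missing data. The optimization process of C, σ, and ε by GA is shown in Figure 1, and the specific operation steps are as follows:
Step 1. Parameter initialization: initialize the GA parameters and C, σ, and ε; any group (C, σ, ε) represents an individual in GA.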
Step 2. Fitness value calculation: in order to evaluate the quality of the SVR parameters selected by GA, the K-fold cross-validation method is used, and the mean value of the K root mean square errors is taken as the fitness value of an individual. The fitness value is calculated as follows:

$$\text{fitness} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{RMSE}_k. \tag{7}$$

Step 3. Terminating iteration: if the condition for terminating the iteration has not been reached, selection, crossover, and mutation are carried out to generate a new group; then, go back to Step 2 to continue the iteration.
Step 4. Output optimal values: the optimal values of C, σ, and ε are output after the iteration is completed, and the GA-SVR model is obtained.

In the monitoring data, $t$ ($t = 1, 2, \ldots, n$) represents the t-th time point, $q$ ($q = 1, 2, \ldots, m$) denotes the q-th sensor, and $X_t^q$ is the monitoring data value corresponding to the q-th sensor at the t-th time point.
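The fitness evaluation in Step 2 can be sketched as follows. This is a minimal illustration, assuming contiguous folds and a generic `fit` callable that stands in for SVR training with one candidate (C, σ, ε); `kfold_fitness` and `fit_mean` are hypothetical names, and `fit_mean` is only a placeholder model that predicts the training mean.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def kfold_fitness(xs, ys, fit, k=5):
    """Mean validation RMSE over k contiguous folds (requires len(xs) >= k).

    `fit(train_x, train_y)` must return a `predict(x)` callable.
    """
    n = len(xs)
    scores = []
    for fold in range(k):
        lo, hi = fold * n // k, (fold + 1) * n // k
        train_x = xs[:lo] + xs[hi:]
        train_y = ys[:lo] + ys[hi:]
        predict = fit(train_x, train_y)
        preds = [predict(x) for x in xs[lo:hi]]
        scores.append(rmse(ys[lo:hi], preds))
    return sum(scores) / k


def fit_mean(train_x, train_y):
    """Placeholder model (an assumption): always predicts the training mean."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean
```

A lower value means a better parameter set, so a GA that uses this quantity as its fitness should minimize it.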

Combination Prediction
In single-variable prediction, GA-SVR is trained on the other data of the variable that contains the missing data and is then used to predict the missing values.
First, let the length of the missing data segment be l, and determine the variable q to which the missing data belong. Of the data values in the q-th variable dimension, n-l-1 are selected as the input of GA-SVR, and the remaining data value is used as the output to train GA-SVR. Then, the trained GA-SVR model is used to predict the missing data, and the single-variable prediction results are obtained.

Multiple-Variable Prediction Filling Based on GA-SVR.
In multiple-variable prediction, the data of the variables related to the variable dimension containing the missing data are used as input to train the GA-SVR model and predict the missing values.
First, correlation analysis is used to find the other variables related to the variable q to form the training set $X_1, \cdots, X_i, \cdots, X_k$, where $X_t$ represents the monitoring value at the t-th time point. The correlation coefficient R is used to evaluate the correlation among the variables. If the correlation coefficient R ≥ 0.8, the two variables are regarded as strongly correlated. The correlation coefficient R is calculated by equation (8). The monitoring data from the 1st to the k-th time point can be used to execute the correlation analysis. GA-SVR is then trained with the monitoring data values at the remaining n-k time points as the input and the data values at the time point to which the missing data belong as the output. Then, the trained GA-SVR model is used to predict the missing data and obtain the multiple-variable prediction results.
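The selection of strongly correlated variables can be sketched as below. The sketch assumes that R is the Pearson correlation coefficient, which is a common choice but is not confirmed by the text; `pearson` and `related_sensors` are illustrative names, and the threshold 0.8 is the one stated above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (assumed form of R); undefined for constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def related_sensors(data, target, threshold=0.8):
    """Return ids of sensors whose correlation with `target` is at least `threshold`.

    `data` maps a sensor id to its list of readings over the same time points.
    """
    return [q for q, series in data.items()
            if q != target and pearson(data[target], series) >= threshold]
```

For example, with `data = {1: [1, 2, 3, 4], 2: [2, 4, 6, 8.5], 3: [4, 3, 2, 1]}`, calling `related_sensors(data, 1)` keeps sensor 2 and drops sensor 3, whose readings move in the opposite direction.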

Dynamic Weight Combination Prediction Filling Based on GA-SVR.
In order to improve the accuracy of missing data prediction, reduce the deviation between the predicted value and the actual value, and improve the effectiveness of equipment fault diagnosis, a dynamic weight combination prediction method based on GA-SVR is established to fill the missing data. GA-SVR is used to make a single-variable prediction and a multiple-variable prediction, respectively, and then the dynamic weight combination of the single-variable and multiple-variable prediction results is obtained. The combined prediction results are used to fill in the missing data to obtain a complete data set.
Root mean square error (RMSE) describes the deviation between the predicted value and the actual value. Thus, RMSE is used to evaluate the quality of the prediction results. A smaller RMSE represents a better prediction of the missing data. The root mean square error is expressed as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^{2}},$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of predictions. The weights of the single-variable and multiple-variable prediction results in the combined forecast depend on their root mean square errors: the smaller the root mean square error, the greater the weight. Based on equation (10), the prediction result of the missing data is obtained as equation (11), where $y_{1i}$ and $y_{2i}$ denote the single-variable and multiple-variable prediction results, respectively, $R_1$ and $R_2$ are the RMSE values corresponding to single-variable prediction and multiple-variable prediction, respectively, and $y_i^{*}$ is the final filling value of the missing data. The chart of combination prediction based on GA-SVR can be seen in Figure 2. The fault diagnosis process is shown in Figure 3, and the specific fault diagnosis scheme is as follows:
Step 1. The other data of the variable to which the missing data belong are used to train the GA-SVR model to obtain the single-variable prediction filling result.
Step 2. Find the variables related to the variable with missing data by correlation analysis, and use the data of these variables to train the GA-SVR model. The multiple-variable prediction filling results can then be obtained.
Step 3. Based on equation (11), the single-variable prediction results and the multiple-variable prediction results are combined to obtain the combined prediction results, and the missing data are filled to obtain the complete data.
Step 4. The complete data are divided into a training sample data set and a test sample data set, and SVM is trained and tested, respectively, to obtain the fault diagnosis results of the equipment.
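The combination in Step 3 can be illustrated with a short sketch. The exact form of equations (10) and (11) is not reproduced here, so the code makes an explicit assumption: each prediction is weighted by the other prediction's share of the total RMSE, which is one simple rule that gives the smaller-error model the larger weight. The names `dynamic_weights` and `combine` are hypothetical.

```python
def dynamic_weights(rmse_single, rmse_multi):
    """Assumed weighting rule: weight of one model = other model's RMSE / total RMSE.

    The weights sum to 1, and the model with the smaller RMSE gets the larger weight.
    Raises ZeroDivisionError if both RMSE values are zero.
    """
    total = rmse_single + rmse_multi
    return rmse_multi / total, rmse_single / total


def combine(single_preds, multi_preds, rmse_single, rmse_multi):
    """Weighted combination of the two prediction series, value by value."""
    w_single, w_multi = dynamic_weights(rmse_single, rmse_multi)
    return [w_single * a + w_multi * b for a, b in zip(single_preds, multi_preds)]
```

For instance, with RMSE values of 1.0 (single-variable) and 3.0 (multiple-variable), the weights under this assumed rule are 0.75 and 0.25, so predictions of 2.0 and 6.0 combine to 3.0.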

Experimental Setup and Data Acquisition.
To validate the proposed methods, a real-world case is studied. In this case study, long-term wear test experiments were conducted at a research laboratory facility. In the test experiments, three pumps (A, B, and C) were worn by running them using oil containing dust. Each pump experienced four states: Baseline state, Degradation state, Degradation state, and Failure state. The degradation stages in this hydraulic pump wear test case study correspond to different stages of flow loss in the pumps. As the flow rate of a pump clearly indicates the pump's health state, the degradation stages corresponding to different degrees of flow loss in a pump were defined as the health states of the pump in the test [24,25]. The vibration signals were collected from pump accelerometers that were positioned parallel to the axis of the swash plate swivel axis, and data were continuously sampled. Figure 4 shows the schematic diagram of the experimental setup. The pump used for testing in the experiments was a Back Hoe Loader: a 74 cc/rev variable displacement pump. The data were collected at a sample rate of 60 kHz with antialiasing filters from accelerometers designed to have a usable range of 10 kHz. In many cases, the most distinguishing information is hidden in the frequency content of signals, so a time-frequency representation of the signals is needed. In this case study, the signals were processed using a wavelet packet with the Daubechies wavelet 10 (db10) and five decomposition levels, as the db10 wavelet provides the most effective way to capture the fault information in the pump vibration data. The coefficients obtained by the wavelet packet decomposition were used as the inputs.
There are 80 groups of experimental data for each of Pumps A, B, and C. Each group of data contains 32 variables (32 sensors). In this paper, the monitoring data of the 3rd sensor are taken as the experimental object, and the monitoring data from the 75th to the 80th time point are deleted to simulate missing values in small sample data. The single-variable prediction, multiple-variable prediction, and dynamic weight combination prediction based on GA-SVR are used to fill the missing data, and the filling effect and the diagnosis effect after filling are compared.

Reconstruction Training Set.
The multiple-variable prediction model selects monitoring data from sensors having a strong correlation with Sensor 3 as the training set to predict the missing data value. Based on equation (8), the correlation coefficients between Sensor 3 and the other sensors are calculated for Pumps A, B, and C, respectively. If the correlation coefficient R ≥ 0.8, then the sensor and Sensor 3 have a strong correlation; thus, the training set can be reconstructed, as shown in Tables 1-3. The reconstructed training sample is only 6-dimensional. It can reflect the characteristics of the original data, reduce the amount of calculation, and shorten the prediction time.

Result Analysis of Missing Data Filling.
In order to evaluate the filling effect of the proposed dynamic weight combination prediction method based on GA-SVR, the missing values in Pumps A, B, and C are predicted by single-variable prediction, multiple-variable prediction, and dynamic weight combination prediction using GA-SVR, respectively, and the filling effects are compared. The parameters of GA are set as follows: the population size is 20, and the maximum iteration number is 100. The search ranges of the key parameters of SVR are 0.1 ≤ C ≤ 1000, 0.01 ≤ σ ≤ 100, and 0.01 ≤ ε ≤ 1. The root mean square error (RMSE) and mean absolute percentage error (MAPE) are used as the evaluation indexes for the filling effect of missing data. MAPE is as follows:

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\%,$$

where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. Tables 4-6 show the predicted filling values of missing data of Pumps A, B, and C based on GA-SVR, respectively. Figures 5-7 show the missing data fitting curves of the three prediction methods based on GA-SVR for Pumps A, B, and C, respectively.
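As a small worked illustration of the MAPE index, the following sketch computes it for a pair of actual and predicted series. The function name is illustrative, and the actual values are assumed to be nonzero, since the index is undefined otherwise.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be nonzero."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)
```

For actual values 100 and 200 with predictions 110 and 180, each relative error is 10%, so the MAPE is 10%.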
From Figures 5-7, it can be intuitively seen that the simulation results of the three data sets are basically consistent.
The fitting curve of dynamic weight combination prediction is more consistent with the actual value curve than that of single-variable prediction and multiple-variable prediction. It indicates that the effect of the dynamic weight combination prediction method is better than that of single-variable prediction and multiple-variable prediction.
In order to evaluate the effect of equipment fault diagnosis under the small sample data missing based on the proposed GA-SVR, the proposed GA-SVR prediction model is compared with the standard SVR prediction model and BP neural network prediction model (BPNN).
The key parameters of SVR are selected by the grid search cross-validation method, with 0.1 ≤ C ≤ 1000, 0.01 ≤ σ ≤ 100, and 0.01 ≤ ε ≤ 0 1. For the single-variable prediction of missing data, the input layer of BPNN is 1, the output layer is 1, and the number of hidden layers is 3. For the multiple-variable prediction of missing data, the input layer of BPNN is 6, the output layer is 1, and the number of hidden layers is 5. The maximum number of iterations is set to 100, the error accuracy is 0.002, the learning rate is 0.1, and the activation function is a sigmoid function. Tables 7-9 show the filling effect of missing data of Pumps A, B, and C for the three prediction models, respectively. It can be seen from Tables 7-9 that, within the same prediction model, the RMSE and MAPE values of dynamic weight combination prediction are smaller than those of single-variable prediction and multiple-variable prediction. For the same prediction mode across different prediction models, the RMSE and MAPE values of the proposed GA-SVR model are the smallest. Thus, the proposed dynamic weight combination prediction of missing data based on GA-SVR has the best filling effect on missing data.

Result Analysis of Equipment Failure Diagnosis.
In order to compare the effects of different missing data prediction models and prediction modes on equipment fault diagnosis, the complete data filled with missing data is used for equipment fault diagnosis. 50 groups of Pumps A, B, and C data sets are randomly selected as training samples, and the remaining 30 groups are used as test samples.
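The random division into 50 training groups and 30 test groups can be sketched as follows. The helper name `split_groups` and the fixed seed are assumptions for reproducibility of the illustration; the text does not state how the random selection was implemented.

```python
import random


def split_groups(groups, n_train=50, seed=0):
    """Shuffle group indices and split them into training and test subsets."""
    rng = random.Random(seed)
    indices = list(range(len(groups)))
    rng.shuffle(indices)
    train = [groups[i] for i in indices[:n_train]]
    test = [groups[i] for i in indices[n_train:]]
    return train, test
```

With 80 groups per pump, this yields 50 training groups and 30 test groups with no overlap between them.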
Tables 10-12 show the influence of the three missing data filling models (GA-SVR, SVR, and BPNN) and the three prediction filling modes on the fault diagnosis effect for Pumps A, B, and C, respectively. It can be seen from Tables 10-12 that, under the same prediction model, the dynamic weight combination prediction filling mode has the highest diagnosis accuracy and a shorter diagnosis time than the single-variable and multiple-variable prediction filling modes. For the same prediction mode, the fault diagnosis accuracy based on GA-SVR is the highest compared with SVR and BPNN, and its diagnosis time is shorter than that of BPNN. Its diagnosis time is longer than that of SVR, but the difference is not significant.
Generally, the missing data filling method of dynamic weight combination prediction based on GA-SVR can obtain the best fault diagnosis effect. It can be concluded that the proposed failure diagnosis method based on GA-SVR under the condition of small sample missing data is effective for Pumps A, B, and C and has certain universality.

Conclusion
In this paper, to address the problem that missing values in small sample data affect equipment failure diagnosis, a novel missing data filling method based on GA-SVR is proposed to improve the effect of equipment failure diagnosis. First, single-variable prediction is carried out for the missing data. The training set is then reconstructed by correlation analysis, and multiple-variable prediction is carried out based on GA-SVR. Then, a dynamic weight is introduced to combine the single-variable prediction results and the multiple-variable prediction results to fill in the missing data. Finally, the complete data obtained by filling the missing data are used as input, and GA-SVM is used to diagnose the equipment failure. In the case study, the proposed GA-SVR model is compared with SVR and BPNN in terms of the filling effect on the missing data of Pumps A, B, and C, respectively, and the failure diagnosis effect based on the complete data after filling is also compared. The results show that the proposed dynamic weight combination prediction method based on GA-SVR has the best missing data filling effect and failure diagnosis effect, which verifies the effectiveness and universality of the proposed method under the condition of small sample data missing.

Data Availability
The underlying data supporting the results of our study can be obtained, including, where applicable, hyperlinks to publicly archived datasets analyzed or generated during the study, upon request to the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.