Fault Prediction of Centrifugal Pump Based on Improved KNN

To predict centrifugal pump faults effectively, the k-nearest neighbor (KNN) machine learning algorithm is introduced into traditional Mahalanobis distance fault discrimination, and an improved KNN fault prediction model for centrifugal pumps based on the Mahalanobis distance is proposed. In this method, the Mahalanobis distance replaces the distance function of the conventional KNN algorithm, and grid search with cross-validation is used to determine the optimal K value of the prediction model. A centrifugal pump test rig was built to reproduce three common faults of centrifugal pumps (cavitation, impeller damage, and machine seal damage), and the method was verified on it. The results show that this method can effectively distinguish the specific fault types of centrifugal pumps based on vibration signals, and the fault prediction accuracy of the off-balance condition is up to 82%. This study provides a novel idea and method for centrifugal pump fault prediction and diagnosis and avoids the interaction between parameters when monitoring multiple parameters.


Introduction
The centrifugal pump is an important piece of fluid-conveying equipment; it is widely used in various fields and plays a significant role in the development of the national economy. Therefore, it is particularly necessary to ensure its normal and stable operation.
During centrifugal pump operation, the earliest indication of failure is usually an abnormal vibration signal. Minor equipment damage or defects, such as seal damage, cavitation, and impeller damage, cause abnormal vibration of the pump. Although the centrifugal pump also conveys information to the outside world through acoustic emission [1], temperature, oil pressure, pressure pulsation, and other signals, every parameter monitoring method has its limitations. Vibration signals, by contrast, convey the operating state of the centrifugal pump promptly and accurately and can be monitored stably over the long term. Therefore, vibration signal monitoring has become the most commonly used method in centrifugal pump fault diagnosis. Approaches to extracting features of mechanical systems from time-domain data were proposed in [2,3]; with them, the vibration signal of the centrifugal pump can be analyzed and its faults diagnosed. In centrifugal pump monitoring, frequency-domain parameters are used in addition to time-domain parameters. Xue et al. [4] proposed a fault prediction system in which the vibration signal of a centrifugal pump is analyzed in the amplitude domain and time domain, and the characteristic structure of the signal in the frequency domain is analyzed by the fast Fourier transform (FFT). The failure of the centrifugal pump was then analyzed based on a database of typical vibration failure cases.
With the continuous development of computer algorithms and artificial intelligence, clustering algorithms, neural networks, and other technologies have had a positive impact on traditional fault monitoring methods, making fault monitoring more accurate and intelligent [5][6][7][8][9][10]. Jun Du et al. [11] proposed a clustering diagnosis method based on the statistical average relative power difference (ARPD) for faults of the aircraft hydraulic pump, such as swashplate eccentricity and an increased gap between piston and slider. By effectively enhancing the fault characteristics of these two kinds of faults, the ARPD calculated from vibration signals is used to complete the hypothesis testing. To extract the weak-signal fault characteristics of the aeroengine intermediate shaft bearing effectively, Jing et al. [12] introduced a tolerance idea into the traditional adaptive genetic algorithm and proposed the TAGA-VMD method, a variational mode decomposition (VMD) method built on it. Machine learning and neural network methods have also become a research hotspot recently [13][14][15][16][17][18]. Tai-Ming Tsai and Wei-Hui Wang [19] processed these signals to establish a database of input-output relations by training several neural network models with learning algorithms. To set up an online diagnosis network, the learning speed and accuracy of three kinds of networks, namely the backpropagation (BPN), radial basis function (RBF), and adaptive linear (ADALINE) neural networks, were compared and assessed; the BPN method is recommended for online diagnosis. Yang et al. [20] proposed a fault diagnosis scheme for rotating machinery using hierarchical symbolic analysis and a convolutional neural network. The method achieves superior diagnosis capacity with a simple network architecture. Kong and Chen [21] proposed a new combined method based on wavelet transformation, fuzzy logic, and neural networks for fault diagnosis of a triplex pump. The failure characteristics of the fluid and dynamics can be separated by the wavelet transform at different scales at the same time. The characteristic variables can then be constructed from the coefficients of the Edgeworth asymptotic spectrum expansion formula and fuzzified to train the neural network, which identifies the faults of the fluid and dynamics of the triplex pump in the fuzzy domain.
In conclusion, many parameters are used to monitor the operation of the centrifugal pump, which makes it difficult to establish a unified judgment standard for its fault diagnosis. This paper proposes a centrifugal pump fault monitoring system built on an improved KNN algorithm [22] that uses the Mahalanobis distance. First, the ReliefF algorithm [23] is used to carry out weight analysis on the time-domain and frequency-domain features [24] and parameters commonly used in centrifugal pump monitoring. Then, the parameters with the greatest influence on centrifugal pump fault monitoring are selected. The feature space is built from the characteristic parameters under different fault conditions and normal conditions. The KNN algorithm is improved by using the Mahalanobis distance [25,26] instead of the Euclidean distance, and the improved algorithm is used to predict the centrifugal pump working conditions. Finally, a proper K value is selected by grid search [27][28][29] to establish the centrifugal pump fault prediction system, so that the operation of the centrifugal pump can be evaluated.

Materials and Methods
The improved KNN prediction model based on the Mahalanobis distance is shown in Figure 1; it is divided into two main parts: feature engineering and the prediction model.

Feature Engineering Based on ReliefF Algorithm
Several parameter indexes are available to evaluate the operating condition of the centrifugal pump. If all parameters are used at the same time, the mutual influence between them makes it difficult to judge the operating condition, and operating faults cannot be predicted accurately. The ReliefF algorithm is therefore used to choose the four parameter indexes with the highest contribution. The commonly used time-domain and frequency-domain indexes are divided into absolute indexes and relative indexes. Absolute indexes are dimensional eigenvalues that carry units, such as peak value, mean value, and root mean square. Relative indexes are dimensionless eigenvalues without units, such as kurtosis and margin.

Relief Algorithm.
The Relief algorithm was first proposed by Kira and was initially restricted to the classification of two classes of data. It is a feature weighting algorithm. To address its limitations, Kononenko extended it in 1994 and obtained the ReliefF algorithm, which assigns different weights to features according to the correlation between each feature and the class. The ReliefF algorithm randomly selects a sample R from the training set D, then finds the k nearest neighbors of R among the samples of the same class, called Near Hits, and the k nearest neighbors of R in each different class, called Near Misses. The weight of each feature A is then updated according to the following rule:

$$W(A) = W(A) - \sum_{j=1}^{k} \frac{\mathrm{diff}(A, R, M_j)}{mk} + \sum_{C \neq \mathrm{class}(R)} \left[ \frac{p(C)}{1 - p(\mathrm{class}(R))} \sum_{j=1}^{k} \frac{\mathrm{diff}(A, R, P_j(C))}{mk} \right]$$

In the above formula, M_j is the jth nearest sample of the same class, P_j(C) is the jth nearest sample in the different class C, p(C) is the proportion of class C in the training set, m is the sampling number, and diff(A, R_1, R_2) represents the difference between samples R_1 and R_2 on feature A. For a continuous feature, it is calculated as follows:

$$\mathrm{diff}(A, R_1, R_2) = \frac{|R_1[A] - R_2[A]|}{\max(A) - \min(A)}$$

The ReliefF function is called repeatedly n times. Finally, a more reliable weight for each feature index is obtained by averaging the n results.
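The weight update described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the code used in the paper: the function name `relieff`, its parameters (`n_iter` plays the role of the sampling number m), the use of a Manhattan distance over range-scaled features to find neighbors, and the synthetic two-feature dataset are all assumptions made for the example.

```python
import numpy as np

def relieff(X, y, n_iter=100, k=5, seed=0):
    """Return one ReliefF weight per column of X (continuous features only)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    # diff() for continuous features: absolute difference scaled by the feature range
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n_samples))
    w = np.zeros(n_features)
    for _ in range(n_iter):
        i = rng.integers(n_samples)          # random sample R
        c_r = y[i]
        diffs = np.abs(X - X[i]) / span      # diff(A, R, every sample)
        dist = diffs.sum(axis=1)             # assumed neighbor metric (Manhattan)
        dist[i] = np.inf                     # never select R as its own neighbor
        for c in classes:
            idx = np.flatnonzero(y == c)
            nearest = idx[np.argsort(dist[idx])[:k]]
            nearest = nearest[np.isfinite(dist[nearest])]
            contrib = diffs[nearest].sum(axis=0) / (n_iter * k)
            if c == c_r:
                w -= contrib                 # near hits lower the weight
            else:
                w += prior[c] / (1.0 - prior[c_r]) * contrib  # near misses raise it
    return w

# Synthetic check: feature 0 follows the class label, feature 1 is pure noise.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 50)
informative = labels + 0.1 * rng.standard_normal(100)
noise = rng.standard_normal(100)
weights = relieff(np.column_stack([informative, noise]), labels)
print(weights)
```

On data like this, the informative feature should receive a clearly larger weight than the noise feature, which is the property used to rank the candidate vibration indexes.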

Centrifugal Pump Fault Working Characteristic Index.
The flow rate of the test centrifugal pump under standard working conditions is Q_d = 50 m³/h. Axial vibration signals under standard working conditions are collected. Weight analysis of the following 10 characteristic parameters [9,10] in the time domain and frequency domain is carried out according to the ReliefF algorithm, as shown in Table 1.
The ReliefF algorithm was then used to screen the 10 characteristic parameters, and the weight of each parameter was calculated in Python. The larger the weight, the greater the contribution of the parameter to the correct classification of the data, and the more reliable the parameter is for centrifugal pump monitoring. The weights of the 10 characteristic indexes are listed in Table 2, and a histogram in descending order is shown in Figure 2.
According to the analysis results of the ReliefF algorithm, the four parameters that contribute the most to centrifugal pump fault classification are root mean square, peak factor, skewness coefficient, and kurtosis. To reduce the dimension and redundancy of the features and to facilitate the training of KNN, this paper takes these four feature indexes with the largest weights as the features of the dataset. The weights of the other six parameters are very small, so omitting them does not affect the result of fault prediction.
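As an illustration of how the four selected indexes could be computed from one vibration record, the sketch below uses NumPy. It is not the authors' code. The peak factor follows the definition C = X_peak / X_rms; taking X_peak as the maximum absolute amplitude, and using the normalized third and fourth central moments for skewness and kurtosis, are assumptions of this example, as are the function name `vibration_features` and the sine test signal.

```python
import numpy as np

def vibration_features(signal):
    """Compute RMS, peak factor, skewness, and kurtosis of a 1-D signal."""
    x = np.asarray(signal, dtype=float)
    mean = x.mean()
    sigma = x.std()                      # population standard deviation
    rms = np.sqrt(np.mean(x ** 2))
    peak = np.max(np.abs(x))             # assumed definition of X_peak
    return {
        "rms": rms,
        "peak_factor": peak / rms,
        "skewness": np.mean((x - mean) ** 3) / sigma ** 3,
        "kurtosis": np.mean((x - mean) ** 4) / sigma ** 4,  # non-excess form
    }

# Test signal: ten full cycles of a unit-amplitude sine wave.
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
features = vibration_features(np.sin(2 * np.pi * 10 * t))
print(features)
```

For a pure sine wave the RMS is 1/√2, the peak factor is √2, the skewness is 0, and the kurtosis is 1.5, which gives a quick sanity check before the function is applied to measured signals.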

Improved Centrifugal Pump Fault Prediction Model of KNN Based on the Mahalanobis Distance
Owing to its simplicity and effectiveness, the KNN algorithm is often the first choice for classification problems. In essence, fault monitoring can be regarded as classifying the collected operating status signals. The keys to the KNN algorithm are the calculation of the distance between samples and the selection of the K value, and both can degrade KNN performance. First, KNN determines the similarity between two samples using a distance function. Second, the accuracy is sensitive to the neighborhood size K.

Mahalanobis Distance.
In general, the KNN algorithm uses the Euclidean distance to calculate the distance between samples, but the Euclidean distance has an obvious disadvantage: it treats the differences between different attributes of a sample as equivalent. Because the individual monitoring parameters of centrifugal pumps are correlated, the Euclidean distance cannot meet the practical requirements of centrifugal pump fault prediction. The Mahalanobis distance was proposed by P. C. Mahalanobis, an Indian statistician, and represents the distance between a point and a distribution. It is an effective method to calculate the similarity between two unknown sample sets. Unlike the Euclidean distance, it takes the correlation between attributes into account and is dimensionless, that is, independent of the measurement scale. For a multivariate sample x = (x_1, x_2, x_3, ..., x_p)^T with mean μ = (μ_1, μ_2, μ_3, ..., μ_p)^T and covariance matrix S, the Mahalanobis distance is

$$D_M(x) = \sqrt{(x - \mu)^{T} S^{-1} (x - \mu)}$$

To sum up, the Mahalanobis distance conveniently measures the distance between an observed sample and a known sample set, so it is very suitable for fault diagnosis. The improved KNN algorithm is obtained by replacing the original Euclidean distance function of the KNN algorithm with the Mahalanobis distance.
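The distance between a point and a reference distribution can be sketched with NumPy as follows. The function name `mahalanobis`, the synthetic correlated data, and the two probe points are illustrative assumptions, not data from the test rig.

```python
import numpy as np

def mahalanobis(x, reference):
    """Distance from point x to the distribution formed by the rows of `reference`."""
    reference = np.asarray(reference, dtype=float)
    mu = reference.mean(axis=0)
    S = np.cov(reference, rowvar=False)
    delta = np.asarray(x, dtype=float) - mu
    # solve(S, delta) gives S^{-1} delta without forming the inverse explicitly
    return float(np.sqrt(delta @ np.linalg.solve(S, delta)))

# Two strongly correlated features standing in for "normal operation" samples.
rng = np.random.default_rng(0)
a = rng.standard_normal(2000)
normal = np.column_stack([a, a + 0.3 * rng.standard_normal(2000)])

along = mahalanobis([2.0, 2.0], normal)    # consistent with the correlation
across = mahalanobis([2.0, -2.0], normal)  # contradicts the correlation
print(along, across)
```

Both probe points are almost equally far from the mean in the Euclidean sense, yet the point that contradicts the correlation between the two features is much farther away in the Mahalanobis sense. This is the behavior that motivates replacing the Euclidean distance for correlated monitoring parameters.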

Data Analysis.
Table 1: Time-domain and frequency-domain characteristic parameters.
Parameter | Expression | Description
[...] | [...] | Represents the average signal amplitude per unit of time
[...] | [...] | Expresses the difference between the signal and its mean value
Peak factor | C = X_peak / X_rms | Dynamic expression of the damage to the surface of the machine seal
Pulse factor | I = X_peak / x̄ | Ratio of the peak value to the mean value of the signal; the criterion for judging the presence of an impact in the signal
Waveform factor | S = X_rms / x̄ | Ratio of the root mean square value to the mean value of the time-domain signal
Kurtosis | [...] | Describes the amplitude of the seal acoustic emission signal; if the kurtosis is too large, there is dry friction
Margin coefficient | [...] | Ratio of the signal root mean square value to its mean value; used to evaluate the degree of component damage
Skew coefficient | E[(x − x̄)^3] / σ^3 | Describes the degree of asymmetry of the acoustic emission signal
[...] | [...] | Denotes the change in the position of the power spectrum barycenter and describes the change of the energy ratio of each frequency component
[...] | [...] | Represents the dispersion of the power spectrum energy distribution and describes the composition of signal frequency components

According to the weight analysis results of the ReliefF algorithm, four parameters, that is, root mean square, peak factor, skewness coefficient, and kurtosis, were statistically analyzed. Vibration signal data were collected for the centrifugal pump with impeller damage, with cavitation, with seal damage, and under normal operation. The root mean square, peak factor, skewness coefficient, and kurtosis were then calculated under each operating condition. Because these four characteristic parameters differ in dimension and value range, applying the algorithm directly to the raw data often gives poor results, so the data should be scaled proportionally to fall within a common interval. Standard deviation standardization, otherwise known as zero-mean standardization or z-score standardization, is the most widely used data standardization method at present. After standardization, the data have a mean of 0 and a standard deviation of 1, matching the scale of the standard normal distribution. The conversion formula is

$$X^{*} = \frac{X - \bar{X}}{\sigma}$$

where \bar{X} is the mean of the original data and σ is the standard deviation of the original data. Standard deviation standardization is applied to the root mean square, peak factor, skewness coefficient, and kurtosis before the Mahalanobis distance is calculated.
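The standardization step can be sketched as below. The helper names `fit_standardizer` and `standardize` and the small feature matrix are hypothetical; the numbers are made up for illustration and are not measurements from the paper.

```python
import numpy as np

def fit_standardizer(X):
    """Return the per-column mean and standard deviation of X."""
    X = np.asarray(X, dtype=float)
    return X.mean(axis=0), X.std(axis=0)

def standardize(X, mean, sigma):
    """Apply X* = (X - mean) / sigma column by column."""
    return (np.asarray(X, dtype=float) - mean) / sigma

# Hypothetical rows of [RMS, peak factor, skewness, kurtosis] on different scales.
raw = np.array([[0.8, 3.1, 0.02, 2.9],
                [1.2, 3.6, -0.10, 3.4],
                [2.5, 5.0, 0.40, 6.1],
                [0.9, 3.0, 0.05, 3.0]])
mean, sigma = fit_standardizer(raw)
scaled = standardize(raw, mean, sigma)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Keeping the fitted mean and standard deviation separate from the transform makes it possible to reuse the training statistics when new samples are scaled, which is one reasonable design choice among several.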
After data preparation, the Mahalanobis distances between the samples under normal operation of the centrifugal pump and the samples with cavitation, with a damaged impeller, and with a damaged seal were calculated and denoted MD_1, MD_2, and MD_3, respectively. Figure 3 and Table 3 show the Mahalanobis distances under the different centrifugal pump fault conditions after the four characteristic parameters are combined according to the ReliefF weights.
By Mahalanobis distance diagnosis, it was found that MD_2 was the largest throughout the test, followed by MD_1 and MD_3 in decreasing order. When the centrifugal pump operates at a low flow rate or beyond the design point, the Mahalanobis distance between the fault data and the normal operating data grows as the flow increases. The Mahalanobis distance reached 2.244143, indicating that the impeller damage is very serious and that the operating condition of the centrifugal pump is very poor. This analysis shows that the Mahalanobis distance can effectively identify operating faults of the centrifugal pump and is an appropriate and effective way of improving the KNN algorithm.

Improved KNN Algorithm.
The KNN (K-nearest neighbor) method, originally proposed by Cover and Hart in 1968, is a machine learning algorithm. The idea of this method is very intuitive: if most of the K most similar samples of a sample (that is, its nearest neighbors in the feature space) belong to a certain category, then the sample also belongs to this category. In the classification decision, the method determines the category of the sample to be classified only from the categories of the nearest one or several samples. The selection of the distance function and the determination of the K value are the main factors that affect the performance and results of the KNN algorithm. The data analysis has shown that the Mahalanobis distance can effectively replace the Euclidean distance. As for the parameter K, choosing a smaller value is equivalent to predicting with training instances in a smaller neighborhood, which reduces the approximation error of learning: only training instances close to the input instance affect the prediction. The downside is that the estimation error of learning increases, and the result becomes sensitive to the neighboring instance points. If the neighboring instance points happen to be noisy, the prediction will be wrong. This means that as K decreases, the model becomes more complex and more prone to overfitting. Choosing a larger value of K is equivalent to predicting with training instances in a larger neighborhood. The advantage is that the estimation error of learning is reduced, but the approximation error increases. In other words, as K increases, the prediction accuracy for input instances decreases, and the overall model becomes simpler. The approximation error can be understood as the training error on the existing training set.
A small approximation error gives a good prediction on the existing training set, but performance on the unknown test sample set may be poor. The estimation error can be seen as the test error on the test set. A small estimation error indicates a satisfactory ability to predict unknown data, and the model is then closer to the optimal model. Taking the data under normal working conditions as the reference and the Mahalanobis distance as the radius, the sample points for cavitation, impeller damage, and machine seal damage are plotted. As there are too many sample points, only 100 of them are drawn, as shown in Figure 4. Because the sample points are too concentrated, the normal working condition points can be seen only when the central area is enlarged, as shown in Figure 5. Taking the yellow asterisk point in Figure 4 as an example, the schematic diagrams for K = 1, K = 10, and K = 100 are drawn. It can be seen that if the K value is too small, there are too few sample points near the prediction point, and it is easy to misjudge the running status of the centrifugal pump. When the K value is too large, there are too many sample points in the range, which also increases the difficulty of predicting the centrifugal pump operating condition.
To determine the appropriate K value, an automatic parameter tuning method is adopted in this paper. Grid search is the most widely used automated parameter optimization method at present. A grid search evaluates all candidate combinations of parameter values to find the best combination, using cross-validation to evaluate the merits of each K value. Starting with a small value of K, increasing it, and calculating the variance of the result at each step, a suitable value of K can finally be determined. In this paper, K values were selected from 400, 800, 1600, 1800, 2000, 2200, 3000, and 5000. To evaluate the model performance more reliably, that is, to measure the accuracy and its standard deviation at the same time while avoiding overfitting, K-fold cross-validation is used; in this paper, 10-fold cross-validation was used. As shown in Table 4, the dataset is divided randomly into 10 different subsets, each called a fold; the model is then trained and evaluated 10 times, with one fold selected for evaluation each time and the other nine folds used for training. The output is an array containing 10 evaluation scores, and the average of this array is the estimate of the model performance. Meanwhile, the robustness of the model can be verified by observing the evaluation score of each run. As K increases, the accuracy of fault prediction first rises, because more surrounding samples are available for reference. The accuracy reaches a maximum of 0.8884 at K = 2000, which is therefore the optimal K value; when K increases further, the prediction error rate gradually rises, so continuing to increase K is pointless.
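The combination of a Mahalanobis-distance KNN classifier, grid search over K, and 10-fold cross-validation can be sketched in plain NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names, the choice to estimate one covariance matrix from the pooled training fold, the synthetic two-class dataset, and the small candidate list (the paper's candidates range from 400 to 5000 on a much larger dataset) are all assumptions.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Majority vote among the k nearest training samples (Mahalanobis metric).

    Labels must be non-negative integers.
    """
    VI = np.linalg.inv(np.cov(X_train, rowvar=False))    # inverse covariance
    delta = X_test[:, None, :] - X_train[None, :, :]
    d2 = np.einsum("ijk,kl,ijl->ij", delta, VI, delta)   # squared distances
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

def cv_accuracy(X, y, k, n_folds=10, seed=0):
    """Return one accuracy score per fold for a given k."""
    order = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(order, n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = knn_predict(X[train_idx], y[train_idx], X[test_idx], k)
        scores.append(np.mean(pred == y[test_idx]))
    return np.array(scores)

def grid_search_k(X, y, candidates, n_folds=10):
    """Pick the candidate k with the highest mean cross-validated accuracy."""
    results = {k: cv_accuracy(X, y, k, n_folds) for k in candidates}
    best_k = max(results, key=lambda k: results[k].mean())
    return best_k, results

# Synthetic stand-in for standardized features of two operating conditions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (150, 4)), rng.normal(2.0, 1.0, (150, 4))])
y = np.repeat([0, 1], 150)
best_k, results = grid_search_k(X, y, candidates=[1, 5, 15, 45])
for k, scores in results.items():
    print(k, round(scores.mean(), 3), round(scores.std(), 3))
print("best k:", best_k)
```

Printing both the mean and the standard deviation of the ten fold scores mirrors the evaluation described above: the mean ranks the candidate K values, while the spread indicates how robust each choice is.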

Experimental Verification
To check the validity of the prediction model in fault prediction, a centrifugal pump testbed was built. Flow rates of 0.8 Q_d, 1.0 Q_d, and 1.2 Q_d were selected as the test conditions, and the tests included the external characteristic test under normal working conditions, the cavitation test, the machine seal failure test, and the impeller damage test. The vibration signals of the centrifugal pump under the various fault conditions were collected and processed by the signal acquisition system, and the four characteristic parameters selected by the ReliefF weight analysis, namely, root mean square, peak factor, skewness coefficient, and kurtosis, were calculated. The improved KNN fault prediction model based on the Mahalanobis distance was used to predict the centrifugal pump working conditions, and the predictions were compared with the test results.

Construction and Design of Centrifugal Pump Testbed.
The schematic diagram and a photograph of the centrifugal pump test rig are shown in Figure 6. The IS-65-50-160 centrifugal pump, outlet solenoid valve, driving motor, inlet manual ball valve, inlet and outlet stainless steel water pipes, bellows, storage tank, and so on together constitute the water circulation system. The overall design parameters of the IS-65-50-160 centrifugal pump are given in Table 5.

Design of Signal Acquisition System.
The signal acquisition system consists of the equipment responsible for collecting the various parameters of the centrifugal pump. It is mainly composed of the inlet and outlet pressure sensors, an electronic flow meter, an NI signal acquisition card, transient speed, torque tester, and resistance. The detailed parameters of the instruments and sensors are shown in Table 6.

Centrifugal Pump Characteristic Test Procedure.
Offset conditions and cavitation are the most common unstable conditions in pump operation, and tests were carried out to diagnose both cavitation and deviation conditions. The experimental steps of the centrifugal pump external characteristic test are as follows:
(1) Open the inlet pipeline ball valve to the maximum and open the storage tank vent valve so that the internal pressure of the storage tank is close to atmospheric pressure.
(2) Start the centrifugal pump unit and check the water circulation system so that leakage does not affect the subsequent tests; check the stability of the data acquisition card, and check each sensor one by one.
(3) Set the flow of the centrifugal pump under standard working conditions to Q_d, and take the actual pump flow as the independent variable. The abscissa of the external characteristic curve is divided into 14 working conditions, and the working conditions from 0 to 1.3 Q_d are measured in turn.
(4) Adjust the flow rate of the centrifugal pump to the specified value by controlling the solenoid valve, so that the pump runs at the specified flow. Once the pump runs stably in this working condition, synchronously collect and record the flow rate, input current, inlet and outlet pressure, speed, and torque of the centrifugal pump.
The experimental steps of the centrifugal pump cavitation test are as follows:
(1) Open the valve connecting the water storage tank with the atmosphere so that the internal pressure of the tank is close to atmospheric pressure, and then close the water storage tank vent valve.
(2) Check the running state of the water circulation system, sensor equipment, and data acquisition card, start the centrifugal pump unit, and operate it following the test steps of the full-flow deviation condition.
(3) Adjust the analog voltage of the data acquisition card so that the centrifugal pump runs at the specified flow rate.
(4) Start the vacuum pump to reduce the inlet pressure.
(5) When the head drops by 0.05 m, stop the vacuum pump. When the measured data become stable, start the vacuum pump again for the next round of data collection; the collection is complete when the head has dropped by 10%.
(6) Adjust the flow rate so that the centrifugal pump works at each flow rate in turn, and repeat the above steps to obtain the cavitation characteristic data of the centrifugal pump at each flow rate.
The above tests are carried out with the normal impeller, the damaged impeller, and the damaged machine seal, respectively. According to the experimental results, the external characteristic curves of the centrifugal pump under the fault conditions and normal conditions are drawn, as shown in Figure 7.
Figure 7(a) compares the external characteristics of the pump under the two conditions. When cavitation of the centrifugal pump is serious, a large number of cavities form at the blade inlet, and the flow becomes disordered as the fluid passes through, which reduces the head and efficiency. In Figure 7(b), the head and efficiency of the centrifugal pump decrease significantly after the impeller is damaged, which is caused by the unstable flow in the pump. In Figure 7(c), machine seal damage leads to serious system leakage, which also makes the flow in the pump more unstable, so the head and efficiency of the centrifugal pump decline.

Predictive Model Validation Steps
Step 1. The data of the above tests were collated to compute the root mean square, peak factor, skewness coefficient, and kurtosis required for the prediction.
Step 2. Carry out data cleaning and standard deviation standardization for the collated data, and combine the four characteristic parameters according to the weight analysis results of the ReliefF algorithm. Then, input them into the fault prediction model for fault judgment.
Step 3. Compare the output of the prediction model with the actual state of the test centrifugal pump. At each of the flow rates 0.8 Q_d, 1.0 Q_d, and 1.2 Q_d, 100,000 data points were collected and divided into 10 groups; the fault prediction accuracy is shown in Figure 8.
The test results demonstrate that the improved KNN centrifugal pump fault prediction model can monitor four kinds of centrifugal pump operating conditions (normal operation and the three faults). The prediction model has good prediction speed and accuracy for unknown samples. At 0.8 Q_d, the average prediction accuracy is 0.859, and the highest is 0.871. At 1.0 Q_d, the average prediction accuracy is 0.821, and the highest is 0.844. At 1.2 Q_d, the average prediction accuracy is 0.855, and the highest is 0.881. The results show that the centrifugal pump runs stably under the design condition: its vibration amplitude is smaller than under off-design conditions, and the prediction accuracy is relatively low. Under off-design conditions, the vibration of the centrifugal pump is intensified, and the prediction accuracy of the model is higher. Because water pumps mostly operate under off-design conditions in actual use, the fault diagnosis method has good accuracy and practicability. Some verification results are presented in Table 7.

Results and Discussion
(1) The ReliefF weight analysis algorithm not only makes it possible to observe intuitively the weight of each vibration monitoring parameter in centrifugal pump fault prediction but also screens out redundant parameters. This addresses the problem that too many mutually influencing parameters make the fault prediction model overly complex and its results inaccurate.
(2) The distance function in the KNN algorithm is replaced by the Mahalanobis distance, which avoids the disadvantage that the Euclidean distance cannot describe the correlation between variables and eliminates the influence of differing dimensions among the vibration parameters. It is also shown that the Mahalanobis distance performs well in fault diagnosis applications.

Data Availability
There are no publicly archived datasets.

Conflicts of Interest
The authors declare that they have no conflicts of interest.