Processing Method of Missing Data in Dam Safety Monitoring

The large volume of data obtained from dam safety monitoring provides the basis for evaluating the operating state of a dam. Because of equipment failure and human error, the loss of measurement data is common and even inevitable. Most traditional data processing methods for dam monitoring ignore the actual correlation between different measurement points, which hinders the objective diagnosis of dam safety and can even lead to misdiagnosis. It is therefore necessary to study further how to process missing data in dam safety monitoring. In this study, a data processing method that combines partial distance fuzzy C-means with long short-term memory (PDS-FCM-LSTM) is proposed to deal with missing dam monitoring data. Fuzzy clustering is performed on the measurement points of the same category deployed on the dam: the membership degree of each measurement point to each cluster center is described by the fuzzy C-means clustering algorithm based on partial distance (PDS-FCM), which determines the clustering results and preprocesses the missing data of the corresponding measurement points. Then, a bidirectional long short-term memory (LSTM) network is applied to explore the pattern of changes in measurement values under identical clustering conditions, so that the missing monitoring data are processed effectively.


Introduction
As an indicator of the safety and working performance of a concrete dam, the original measurement data always attract much attention for their integrity and accuracy [1][2][3]. However, owing to factors such as complicated and ever-changing operating conditions and human error, it is often difficult to prevent the actual measurement data from suffering losses to varying extents [4]. The missing data affect the thorough assessment of the serviceability of the dam and can even lead to misjudgment in some cases. Traditional means of processing missing data include mean value processing, regression processing, expectation maximization processing, and so on [5,6]. Nevertheless, these methods commonly suffer from an inability to reflect the correlation between samples or attributes, low processing accuracy, and limited applicability [7].
In recent years, as research on machine learning and other emerging disciplines has deepened [8][9][10][11], some new approaches to data processing have been proposed [12,13]. By improving the k-nearest neighbors imputation method, Shichao Zhang proposed a gray processing method (GKNN) that replaces the traditional Euclidean distance with the gray distance [14]. Similarly, Tutz G put forward a distance-based weighted nearest neighbors imputation method [15]. However, finding the adjacent samples for each incomplete sample requires a global traversal, which makes the computation overly complex for large amounts of data. Chen applied LS-SVM [16] to fill the data, which yields a globally optimal solution and is applicable to small-scale data. However, most SVM-based methods involve manual judgment, which makes them heavily reliant on the personal experience of the operator. Proposed by Takagi and Sugeno, the TS nonlinear regression model [17][18][19] describes nonlinear problems linearly by constructing local linear models and associating them through membership functions. However, this model ignores the correlation between input and output, which leads to low accuracy in processing sequence information. By combining autoencoders with the genetic algorithm, Abdella et al. put forward a processing method that takes the missing data as the independent variables of a cost function and applies the genetic algorithm to optimize that cost function [20,21]. Building on this method, Nelwamondo et al. introduced dynamic programming theory to build multiple autoencoders and then selected the optimal model for each incomplete sample to process the missing data [22].
However, such methods ignore the correlation between different sample attributes, which lowers the accuracy of missing data processing. The recently proposed autoregressive integrated moving average (ARIMA) hybrid method combines the advantages of the autoregressive integrated moving average model and the artificial neural network, which improves its generality. However, because it needs a large data sample, its practicability is lower than required [23].
To address the problems mentioned above, this study proposes a missing data processing method based on the partial distance fuzzy C-means (PDS-FCM) model and the long short-term memory (LSTM) network. Considering the correlation between the measurement points, it relies on the membership degree of each measurement point to each cluster center to carry out cluster analysis and preprocess the missing data. On this basis, the bidirectional LSTM network is introduced to further process the long-sequence missing data, thus ensuring the accurate processing of missing data. The basic principle is detailed as follows.

Processing Method of Dam Missing Data
First, the core of the missing data processing method proposed in this study is to use the PDS [24] instead of the Euclidean distance to represent the correlation between the measurement points of a certain type of monitoring quantity. Then, the FCM clustering algorithm is adopted to construct the membership relationship between each measurement point and the cluster centers, thus achieving fuzzy clustering of the measurement points and the preprocessing of missing data. Finally, the bidirectional LSTM network is introduced to construct the training model for missing data and to process the long-sequence missing measurement values of the measurement points under identical clustering conditions, thus achieving the effective processing of missing data.

Characterization of Correlation between Measurement Points.
When the measurement points of a certain type of monitoring quantity have missing measurement values, it is difficult to measure the correlation between measurement points using traditional distance indicators. To solve this problem, this study applies the PDS to characterize the correlation between different measurement points.
Suppose $X = \{x_i \mid x_i \in R^{s}, i = 1, 2, \ldots, n\}$ is an incomplete dataset with $n$ measurement points and measurement sequences of length $s$, where $x_i = [x_{i1}, x_{i2}, \ldots, x_{is}]^T$, $i = 1, 2, \ldots, n$, indicates the measurement values of the $i$th measurement point. $I = [I_{il}] \in R^{n \times s}$ represents a function describing the missing data of the measurement points, which is expressed as in the following equation:
$$I_{il} = \begin{cases} 1, & l = t, \\ 0, & l = t^{*}, \end{cases} \quad (1)$$
where $I_{il}$ indicates that the measurement value of the $i$th measurement point is complete at time $t$ and missing at time $t^{*}$. Based on equation (1), the correlation between $x_i$ and $x_k$ in the incomplete dataset $X$ is expressed by equation (2), where $d_{Part}$ refers to the partial distance.
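The partial distance can be illustrated with a short sketch. The display form of equation (2) is not reproduced in this text, so the code assumes the commonly used partial distance strategy: squared differences are summed over the time steps observed in both series and rescaled by $s$ divided by the number of such steps. The function name `partial_distance` and the use of NaN to mark missing values are illustrative choices, not taken from the paper.

```python
import numpy as np

def partial_distance(x_i, x_k):
    """Partial distance between two series that may contain NaN (missing) values.

    Assumed form (standard partial distance strategy, not copied from the paper):
        d^2 = s / sum_l(I_il * I_kl) * sum_l (x_il - x_kl)^2 * I_il * I_kl
    where I = 1 marks an observed value and I = 0 a missing one.
    """
    x_i = np.asarray(x_i, dtype=float)
    x_k = np.asarray(x_k, dtype=float)
    both = ~np.isnan(x_i) & ~np.isnan(x_k)   # I_il * I_kl as a boolean mask
    n_common = int(both.sum())
    if n_common == 0:
        raise ValueError("the two series share no observed time step")
    diff = x_i[both] - x_k[both]
    s = x_i.size                              # full sequence length
    return float(np.sqrt(s / n_common * np.sum(diff ** 2)))
```

When no values are missing, the scale factor equals 1 and the result reduces to the ordinary Euclidean distance.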

Fuzzy Clustering of Measurement Points and Preprocessing of Measurement Values.
Based on equation (2) and the FCM clustering algorithm, fuzzy clustering is performed in this section for the measurement points of a certain type of monitoring quantity, and the measurement values of the measurement points are preprocessed. Similar to most clustering algorithms, the FCM algorithm splits the dataset into a number of subsets according to the similarity between samples. Each subset represents a cluster, and the center of the sample distribution in the cluster is taken as the cluster center. The difference is that this method relaxes the membership from the binary values 0 or 1 to the interval [0, 1] by means of fuzzy processing. Equation (2) is applied to construct the PDS-FCM clustering model of equation (3), where $u_{ik}$ represents the membership degree, which indicates the degree to which the measurement point $x_i$ falls into the $k$th cluster, denoted as $U = [u_{ik}] \in R^{n \times K}$. $v_k$ denotes the cluster center of the $k$th cluster, and the corresponding center is a fuzzy parameter, which indicates the degree to which these measurement points belong to each cluster. Considering the equality constraint on the membership degree in equation (3), the Lagrange multiplier method is used in this section to solve equation (3). The augmented Lagrange function applied to this process is given in equation (4), where $\lambda = [\lambda_1, \lambda_2, \ldots, \lambda_n]^T$ represents the Lagrange multipliers. Based on equation (2) and the equality constraint on $u_{ik}$ in equation (3), the necessary conditions for the clustering objective function $J_\lambda$ in equation (4) to reach its minimum are expressed in equations (5) and (6). With the application of equations (5) and (6), the membership matrix $U$ and center matrix $V$ that meet the accuracy requirements of dam safety monitoring are obtained through iterative solution. The maximum membership of each measurement point over the cluster centers is treated as the final clustering result.
According to the membership relationship between the measurement points with missing data and the cluster centers, the missing positions of the measurement values are preprocessed according to equation (7), where $X^{*}_{il}$ denotes the measurement value of the $i$th measurement point at time $l$ after preprocessing, and $V_{kl}$ indicates the measurement value of the $k$th center at time $l$.
The ultimate clustering result is determined by equation (6), and the missing data are preprocessed by equation (7), on the basis of which the data missing from dam safety monitoring can be effectively processed.
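The iteration between the center update and the membership update can be sketched as follows. Equations (3) to (7) are not reproduced in this text, so the code assumes the standard partial distance FCM updates: centers are membership-weighted means over observed entries only, and memberships are proportional to the inverse squared partial distance raised to $1/(z-1)$. The preprocessing step is assumed to replace each missing value with the membership-weighted combination of the cluster centers. The name `pds_fcm`, the NaN convention, and the default parameters are illustrative, and every measurement point is assumed to have at least one observed value.

```python
import numpy as np

def pds_fcm(X, K, z=2.0, eps=1e-5, max_iter=300, seed=0):
    """Sketch of PDS-FCM. X: (n, s) array, NaN marks a missing value.

    Returns U (n, K), V (K, s), hard labels (n,), and the preprocessed data (n, s).
    """
    X = np.asarray(X, dtype=float)
    n, s = X.shape
    I = ~np.isnan(X)                      # True where a value is observed
    X0 = np.where(I, X, 0.0)              # placeholder zeros, always masked by I below
    rng = np.random.default_rng(seed)
    U = rng.random((n, K))
    U /= U.sum(axis=1, keepdims=True)     # each row of U sums to 1
    for _ in range(max_iter):
        Uz = U ** z
        # Center update (assumed eq. (5)): weighted mean over observed entries.
        V = (Uz.T @ X0) / np.maximum(Uz.T @ I.astype(float), 1e-12)
        # Squared partial distance between every point and every center.
        diff2 = (X0[:, None, :] - V[None, :, :]) ** 2 * I[:, None, :]
        D2 = s / I.sum(axis=1)[:, None] * diff2.sum(axis=2)
        D2 = np.maximum(D2, 1e-12)        # guard against division by zero
        # Membership update (assumed eq. (6)).
        inv = D2 ** (-1.0 / (z - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.max(np.abs(U_new - U)) < eps
        U = U_new
        if converged:
            break
    labels = U.argmax(axis=1)             # maximum membership as the final cluster
    X_filled = np.where(I, X, U @ V)      # assumed eq. (7): fill only missing entries
    return U, V, labels, X_filled
```

Observed values are left untouched; only the positions flagged as missing receive a preprocessed value.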

Implementation Technology of the PDS-FCM-LSTM Processing Method for the Data Missing from Dam Safety Monitoring.
On the basis of the abovementioned research, the bidirectional LSTM network is introduced in this section to establish the mapping relationship between the measurement values of measurement points under identical clustering conditions, based on which the PDS-FCM-LSTM processing method is proposed.

Construction of Mapping Relationship between Measurement Values of the Measurement Points under Identical Clustering Conditions.
The advantage of the LSTM network is that it can control the information transfer of time series data by introducing an input gate, a forget gate, and an output gate into the unit [9][25][26][27][28]. It is assumed that the data of dam safety measurement point $i$ are missing from time $j$ to time $j + p$ and require reprocessing. For the unit at time $l$, the data $x^{*}_{il}$ ($l < j$) of measurement point $i$ are defined as output variables, while the data $x^{*}_{ml}$ ($m \neq i$, $l < j$) of the other measurement points in the same cluster are treated as input variables. On this basis, the mapping relationship between the measurement values of the forward LSTM network is established, as shown in Figure 1.
(1) Forget Gate. The output variable of the hidden layer $h_{l-1}$ at time $l - 1$ and the input variable $x^{*}_{ml}$ at time $l$ are taken. Equation (8) is applied to construct the output variable of the forget gate $f_l$ at time $l$, which determines whether to retain the state of the hidden unit $c_{l-1}$ from the previous time step,
where $f_l$ represents the output variable after processing by the forget gate at time $l$, $h_{l-1}$ denotes the output variable of the hidden layer at time $l - 1$, $x^{*}_{ml}$ indicates the input variable of each measurement point at time $l$ after preprocessing, $b_f$ refers to the bias term of the forget gate, $W_f$ stands for the weight matrix of the forget gate, and $\sigma$ denotes the sigmoid activation function, $\sigma(a) = 1/(1 + e^{-a})$.
(2) Input Gate. The information collected by the input gate is the same as that collected by the forget gate. Based on equations (9) and (10), the output variable of the input gate $R_l$ at time $l$ and the unit state $c_l$ at time $l$ are determined,

Mathematical Problems in Engineering
where $R_l$ represents the output value after processing by the input gate at time $l$, $W_R$ indicates the input gate weight matrix, $W_c$ denotes the unit state weight matrix, $c_l$ means the unit state at time $l$, $b_R$ refers to the bias term of the input gate, and $b_c$ stands for the bias term of the unit state.
(3) Output Gate. The output gate is jointly determined by the updated unit state $c_l$, the output variable of the hidden layer $h_{l-1}$ at time $l - 1$, and the input variable $x^{*}_{ml}$ at time $l$. The output variable of the output gate $o_l$ at time $l$ and the output variable of the hidden layer $h_l$ at time $l$ are determined by equations (11) and (12), where $o_l$ represents the output variable processed by the output gate at time $l$, $W_o$ denotes the output gate weight matrix, and $b_o$ indicates the bias term of the output gate.
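The three gates can be combined into one unit update, sketched below. Equations (8) to (12) are not reproduced in this text, so the code assumes the standard LSTM formulation, in which each weight matrix acts on the concatenation of $h_{l-1}$ and the input at time $l$. The symbol names follow the text ($W_f$, $W_R$, $W_c$, $W_o$ and the matching biases); the function `lstm_step` and the dictionary layout of the parameters are illustrative.

```python
import numpy as np

def sigmoid(a):
    """Sigmoid activation, sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_l, h_prev, c_prev, params):
    """One LSTM unit update at time l (standard formulation, assumed here).

    x_l:    inputs at time l, shape (n_inputs,)
    h_prev: hidden output at time l-1, shape (n_hidden,)
    c_prev: unit state at time l-1, shape (n_hidden,)
    params: dict with W_f, W_R, W_c, W_o of shape (n_hidden, n_hidden + n_inputs)
            and b_f, b_R, b_c, b_o of shape (n_hidden,)
    """
    v = np.concatenate([h_prev, x_l])
    f = sigmoid(params["W_f"] @ v + params["b_f"])          # forget gate
    R = sigmoid(params["W_R"] @ v + params["b_R"])          # input gate
    c_tilde = np.tanh(params["W_c"] @ v + params["b_c"])    # candidate unit state
    c = f * c_prev + R * c_tilde                            # updated unit state
    o = sigmoid(params["W_o"] @ v + params["b_o"])          # output gate
    h = o * np.tanh(c)                                      # hidden output at time l
    return h, c
```

In this setting `x_l` would hold the preprocessed values of the other measurement points in the same cluster, and a final linear layer on `h` (not shown) would map to the value of the target point.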
To process the missing measurement values from time $j$ to time $j + p$ more reasonably, a new bidirectional LSTM network method is proposed to process the data from time $j$ to time $j + p$. As shown in Figure 2, model training and validation are carried out on the measurement point data before time $j$ and after time $j + p$, respectively, according to the aforementioned principles, from which the forward and reverse LSTM model parameters are obtained. Then, the data of the remaining measurement points with complete information from time $j$ to time $j + p$ are input into the models. Finally, the calculated values of the measurement point with missing data for the period from time $j$ to time $j + p$ are obtained, so that the missing data are processed bidirectionally.
Figure 1: Schematic diagram of the LSTM unit.

Implementation Method of the PDS-FCM-BILSTM Processing Method for Dam Missing Data.
According to the PDS-FCM-LSTM method for dam missing data processing as proposed above, the implementation process is shown in Figure 3.
Step 1: input the missing dataset of the dam measurement points
Step 2: enter the PDS-FCM layer to preprocess the missing data, set the fuzzy parameter $z$, cluster number $K$, and threshold $\varepsilon$ ($\varepsilon > 0$), and then randomly initialize the partition matrix $U^{(0)}$
Step 3: at iteration $n$ ($n \geq 1$), use equation (5) to update the prototype matrix $V^{(n)}$ based on the partition matrix $U^{(n-1)}$
Step 4: update the partition matrix $U^{(n)}$ based on equation (6) and the prototype matrix $V^{(n)}$
Step 5: if the condition $\max_{i,k} |u^{(n)}_{ik} - u^{(n-1)}_{ik}| < \varepsilon$ is met, where $k = 1, 2, \ldots, K$ and $i = 1, 2, \ldots, n$, the algorithm stops and outputs the partition matrix $U$ and prototype matrix $V$. Otherwise, set $n \leftarrow n + 1$ and return to Step 3.
Step 6: based on the membership relationship between the measurement points with missing data and each cluster center ($U$, $V$), the existing values in the cluster centers are weighted and averaged to obtain the preprocessed data $X^{*}$
Step 7: establish the bidirectional LSTM processing model for the missing data of the dam. The forward LSTM network model and reverse LSTM network model are constructed according to the time required to process the measurement values.
Step 8: input the measurement values of each measurement point in the same cluster before and after the time slot $T$ into the forward and reverse LSTM models, so as to obtain the weight $W$ and bias $b$ of each parameter after training and learning
Step 9: input the preprocessed measurement point data as samples into the forward and backward LSTM models to obtain the forward and backward output values at time slot $T$, with their mean values taken as the ultimate result of missing data processing
According to the above implementation steps, the implementation concept of PDS-FCM-LSTM is obtained.
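The data flow of Steps 7 to 9 can be sketched without committing to a particular deep learning library. In the sketch below a least-squares linear map stands in for each trained LSTM; this substitution is an assumption made only to keep the example self-contained and is not the model used in this study. What the code does show is the split into data before and after the gap, the time reversal of the data used for the backward model, and the averaging of the two outputs. The names `fit_linear` and `fill_gap_bidirectional` are hypothetical.

```python
import numpy as np

def fit_linear(inputs, target):
    """Least-squares stand-in for a trained sequence model (NOT an LSTM)."""
    A = np.column_stack([inputs, np.ones(len(inputs))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)

    def predict(new_inputs):
        B = np.column_stack([new_inputs, np.ones(len(new_inputs))])
        return B @ coef

    return predict

def fill_gap_bidirectional(inputs, target, j, p, fit=fit_linear):
    """Fill target[j : j+p+1] from the other points in the same cluster.

    inputs: (s, m) complete series of the other measurement points
    target: (s,) series of the point whose values are missing from j to j+p
    """
    gap = slice(j, j + p + 1)
    # Forward model: trained on the data before the gap.
    forward = fit(inputs[:j], target[:j])
    # Backward model: trained on the time-reversed data after the gap.
    backward = fit(inputs[j + p + 1:][::-1], target[j + p + 1:][::-1])
    filled = np.array(target, dtype=float)
    # Mean of both outputs (Step 9). A real backward LSTM would predict in
    # reversed time order, so its output would be flipped before averaging.
    filled[gap] = 0.5 * (forward(inputs[gap]) + backward(inputs[gap]))
    return filled
```

Replacing `fit` with a routine that trains an actual LSTM on the same slices would follow the procedure described in the steps above more closely.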

Evaluation of the Missing Data Processing Effect.
In this study, the mean absolute error (MAE) and the mean absolute percentage error (MAPE) are taken as the evaluation indicators of the data processing effect for the measurement points. They are defined in equations (13) and (14), where $X_m$ represents the set of all processing values, $x_{ij}$ denotes the processing value, $r_{ij}$ indicates the measurement value of the processing value, and $x_{ij}$ corresponds to the measurement value of the measurement point with complete information.
Using equations (13) and (14) and the allowable error values listed in the technical specifications for dam safety monitoring, the effect of missing data processing is thoroughly evaluated, and the processed data meeting the accuracy requirements are taken as the effective data for dam safety monitoring.
Figure 2: Bidirectional LSTM processing method and idea (the gray box represents the preprocessed value of discrete data and the red box represents the long-sequence missing data).
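The two indicators can be computed as in the sketch below, which assumes the usual definitions: MAE is the mean of the absolute differences between processed and measured values, and MAPE is the mean of the absolute relative differences expressed in percent. The function names and argument order are illustrative, and MAPE is undefined when a measured value equals zero.

```python
import numpy as np

def mae(processed, measured):
    """Mean absolute error between processed and measured values."""
    processed = np.asarray(processed, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return float(np.mean(np.abs(processed - measured)))

def mape(processed, measured):
    """Mean absolute percentage error, in percent (measured values must be nonzero)."""
    processed = np.asarray(processed, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return float(np.mean(np.abs((processed - measured) / measured)) * 100.0)
```

Both functions would be applied only to the time steps that were deliberately deleted, since those are the positions where a reference measurement exists for comparison.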

Project Summary.
The dam is a concrete parabolic double curvature arch dam [29][30][31][32] ... measurement data (31 data in total) from October 27, 2016, to March 29, 2017, are purposely deleted to construct the incomplete dataset (the data of the other measurement points are complete). The missing data are shown in Figure 5. The PDS-FCM-LSTM method proposed in this study is adopted for data processing. First, the PDS-FCM data processing model is constructed for the dam deformation measurement points. Using the measurement values obtained from the dam deformation measurement points between January 1, 2014, and December 31, 2019, the partial distances between the measurement points and the membership matrix $U$ at the minimum of the objective function $J_\lambda$ are calculated. The results can be seen in Figure 6.
Then, according to the clustering principle proposed in the second section, all measurement points are clustered. The clustering results are given in Tables 1 and 2. It can be seen that the measurement points are divided into multiple clusters; TCN11 belongs to the second cluster, which also contains TCN04, TCN05, and TCN16. According to equation (7), the preprocessing result of the measurement point TCN11 is shown in Figure 7.
Based on the above processing results, the data of measurement points TCN04, 05, 11, and 16 can be taken as the basic data to be used for secondary processing.
Finally, based on the integrity of the measured values before October 27, 2016, and after March 29, 2017, after preprocessing, the bidirectional LSTM network is used to process the long-sequence missing data of the measurement point TCN11. To validate the missing data processing method proposed in this study, the data missing from measurement point TCN11 between October 27, 2016, and March 29, 2017, are taken as an example. They are processed by the PDS-FCM-LSTM processing method, the LSTM network of a single measurement point, and the PDS-FCM processing method, respectively. The results are shown in Figure 8 and Tables 3 and 4.
It can be seen from Figure 8 and Tables 3 and 4 that the PDS-FCM-LSTM processing method proposed in this study achieves the highest processing accuracy because it gives full consideration to both the correlation between measurement points and the pattern of changes in measurement values. PDS-FCM takes into account only the correlation between the measurement points, which makes its results less satisfactory. The single-point LSTM performs the worst because it considers only the pattern of changes in measurement values while ignoring the correlation between the measurement points. In addition, it can be seen from Table 3 that the absolute errors of the PDS-FCM-LSTM results are all smaller than that specified for the water project (29), which demonstrates the effectiveness of the missing data processing method proposed in this study.
Figure 4: Layout of measurement points of a concrete arch dam.
Figure 6: Measurement points of each center cluster (membership degree of each measurement point to each cluster; panel (e): membership degree to cluster 5).
Cluster      Measurement points
Cluster 1    TCN01, TCN03, TCN15, TCN18, TCN19, TCN20
Cluster 2    TCN04, TCN05, TCN11, TCN16
Cluster 3    TCN08, TCN12, TCN13, TCN14, TCN17
Cluster 4    TCN09, TCN10
Cluster 5    TCN02, TCN06, TCN07

Conclusion
(1) To solve the problem that traditional distance indicators are ineffective in measuring the correlation between measurement points when the dam monitoring information is incomplete, a method for characterizing the correlation between measurement points is proposed on the basis of the partial distance. Then, the construction of the PDS-FCM clustering model is studied, which achieves the preprocessing of data missing from dam safety monitoring and lays a foundation for the effective processing of missing data.
(2) Considering the correlation and trend of the changes in the monitored effect quantities, a PDS-FCM-LSTM processing method intended for the data missing from dam safety monitoring is proposed. It combines the advantages of PDS-FCM and LSTM. The effectiveness of the proposed method is verified in a practical engineering application. As the proposed method is generally applicable, it is suited for missing data processing in similar projects.

Data Availability
The data used to support the findings of this study are from a large water conservancy project and are not suitable for uploading to the network. The data are included within the Supplementary Information files.

Conflicts of Interest
The authors declare that there are no conflicts of interest.