A Novel Outlier Detection Method of Long-Term Dam Monitoring Data Based on SSA-NAR

Outliers commonly exist in dam monitoring data and may seriously affect the accuracy of dam safety evaluation results. To detect outliers in dam monitoring data, a novel dynamic detection method based on SSA-NAR is proposed. The combined method does not depend on the relationship between effect quantity and influence quantity used in traditional dam safety theory; it uses only the time series of the effect quantity to mine its variation, which avoids the impact of missing or abnormal influence quantities. Nonlinear Autoregression (NAR) is a classical time series neural network widely used in engineering. However, the prediction accuracy of NAR is greatly affected by the selection of model parameters. The Sparrow Search Algorithm (SSA), a novel parameter optimization method, is therefore combined with NAR to derive the optimal parameters of the NAR prediction model. Outliers are identified by analyzing the distribution of the residuals between the predicted data and the measured data. The case study shows that when the original data contain no outliers, the prediction accuracy of the model is high. When outliers are included, the proposed model is robust: the outliers have little influence on the prediction. The method can effectively detect outliers in the original dam monitoring data and provide a reliable data basis for dam safety evaluation.


Introduction
The safety of dam projects is of great importance to society and people's lives [1,2]. Dam monitoring data, obtained from monitoring instruments, can objectively and comprehensively reflect the safety status of a dam [3][4][5]. Outliers inevitably occur in dam monitoring data because of instrumentation and manual monitoring problems [6]. Detecting outliers is a prerequisite for dam monitoring data analysis.
Outlier detection methods for dam monitoring data generally include manual judgment and statistical probability detection [7]. Manual judgment is based on comparing adjacent monitoring data; it is inefficient and depends heavily on expert experience. The statistical probability method is based on statistical hypothesis testing; when the data samples are insufficient or the assumed probability distribution deviates from reality, its detection accuracy is greatly affected [8].
With the development of artificial intelligence technology [9], deep learning models have been successfully applied to the diagnosis of outliers in dam deformation monitoring data. Scholars have constructed prediction models from the dynamic relationship between deformation data and impact factors and detected outliers through the distribution of the residuals between predicted and measured values [10][11][12]. These intelligent methods mostly require the model inputs to be specified, such as water pressure, temperature, and aging factors. When the input data are partly missing or abnormal, such methods may not work properly.
Therefore, an artificial intelligence algorithm that does not depend on the input-output relationship is well suited to detecting outliers in dam data. The nonlinear autoregressive (NAR) neural network, which is widely used in data prediction, uses only deformation data as input to make predictions [13,14]. The prediction accuracy of the NAR model largely depends on the network parameters, such as the delay parameter and the number of hidden layer elements. The Sparrow Search Algorithm (SSA) is a novel intelligent optimization algorithm based on the foraging behavior of sparrows [15][16][17]. It offers high robustness and fast convergence and can effectively solve the parameter optimization problem of the NAR model.
In this study, the NAR model and SSA optimization algorithm are integrated to construct a detection method of outliers in concrete dam monitoring data. SSA is introduced to obtain the optimal NAR neural network parameters, and the optimal parameters are used to derive an optimal NAR dynamic model to predict the dam monitoring data. Then, the dynamic detection steps of outliers by SSA-NAR are constructed. Finally, an actual dam project is given to prove the effectiveness of the outlier detection method.

SSA-NAR Detection Model
2.1. SSA Algorithm. The SSA algorithm was proposed in 2020 and is mainly based on the foraging behavior of sparrows [18,19]. During foraging, the sparrows are divided into discoverers and followers. The discoverers are in charge of finding food and providing foraging locations, while the followers use the information from the discoverers to obtain food. Because the discoverers have priority in obtaining food information, they can search a larger foraging range than the followers. During each foraging, the location of the discoverer is updated as follows:
$$X_{i,j}^{t+1} = \begin{cases} X_{i,j}^{t} \cdot \exp\left(\dfrac{-i}{\alpha \cdot iter_{\max}}\right), & R_2 < ST \\ X_{i,j}^{t} + Q \cdot L, & R_2 \geq ST \end{cases} \tag{1}$$
where $t$ indicates the current iteration, and $X_{i,j}$ is the location of the $i$-th sparrow in the $j$-th dimension. $iter_{\max}$ is a constant equal to the maximum number of iterations. $\alpha$ is a random number, $\alpha \in (0, 1]$. $R_2$ and $ST$ are the warning value and the safety threshold, respectively ($R_2 \in [0, 1]$ and $ST \in [0.5, 1.0]$). $Q$ is a random variable that follows a normal distribution. $L$ is a matrix whose elements are all 1.
The location update of the follower can be expressed as
$$X_{i,j}^{t+1} = \begin{cases} Q \cdot \exp\left(\dfrac{X_{worst}^{t} - X_{i,j}^{t}}{i^{2}}\right), & i > n/2 \\ X_{P}^{t+1} + \left|X_{i,j}^{t} - X_{P}^{t+1}\right| \cdot A^{+} \cdot L, & \text{otherwise} \end{cases} \tag{2}$$
where $X_P$ is the optimal location of the discoverers, $X_{worst}$ is the current worst location, $n$ is the population size, $A$ is a matrix whose elements are randomly assigned 1 or −1, and $A^{+} = A^{T}(AA^{T})^{-1}$. When a sparrow spots danger, it exhibits antipredation behavior, which is expressed as follows:
$$X_{i,j}^{t+1} = \begin{cases} X_{best}^{t} + \beta \cdot \left|X_{i,j}^{t} - X_{best}^{t}\right|, & f_i > f_g \\ X_{i,j}^{t} + K \cdot \left(\dfrac{\left|X_{i,j}^{t} - X_{worst}^{t}\right|}{(f_i - f_w) + \varepsilon}\right), & f_i = f_g \end{cases} \tag{3}$$
where $X_{best}$ is the current best location. $\beta$ is a random parameter that follows a normal distribution with a mean of 0 and a variance of 1. $K$ is a random number, $K \in (-1, 1)$. $f_i$ is the fitness value of the present sparrow, $f_g$ is the current best fitness value, and $f_w$ is the current worst fitness value. $\varepsilon$ is a small constant that prevents the denominator from being zero.
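The three update rules can be combined into a single optimization loop. The following Python sketch is a minimal illustration under stated assumptions, not the implementation used in this study: the function name `ssa_minimize`, the numbers of discoverers and scouts, the safety threshold `st=0.8`, the clipping of locations to the search bounds, and the use of the fitness ranking from the start of each iteration in the antipredation step are all choices made here for illustration.

```python
import numpy as np

def ssa_minimize(fitness, lower, upper, n_sparrows=10, max_iter=100,
                 n_discoverers=2, n_scouts=2, st=0.8, seed=0):
    """Minimise `fitness` over a box with a basic Sparrow Search Algorithm."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim, eps = lower.size, 1e-12
    ones = np.ones(dim)                              # the all-ones matrix L

    X = rng.uniform(lower, upper, size=(n_sparrows, dim))
    fit = np.array([fitness(x) for x in X])
    best_x, best_f = X[fit.argmin()].copy(), float(fit.min())
    history = [best_f]

    for _ in range(max_iter):
        order = np.argsort(fit)                      # rank sparrows, best first
        X, fit = X[order], fit[order]
        x_worst, f_g, f_w = X[-1].copy(), fit[0], fit[-1]

        # Discoverer update, Equation (1)
        r2 = rng.random()                            # warning value R2
        for i in range(n_discoverers):
            if r2 < st:
                alpha = 1.0 - rng.random()           # uniform on (0, 1]
                X[i] = X[i] * np.exp(-(i + 1) / (alpha * max_iter))
            else:
                X[i] = X[i] + rng.normal() * ones
        X[:n_discoverers] = np.clip(X[:n_discoverers], lower, upper)
        disc_fit = [fitness(x) for x in X[:n_discoverers]]
        x_p = X[int(np.argmin(disc_fit))].copy()     # best discoverer location

        # Follower update, Equation (2)
        for i in range(n_discoverers, n_sparrows):
            if i + 1 > n_sparrows / 2:
                X[i] = rng.normal() * np.exp((x_worst - X[i]) / (i + 1) ** 2)
            else:
                # For a 1 x dim row A of +1/-1 entries, A^T (A A^T)^-1 = A^T / dim
                a_plus = rng.choice([-1.0, 1.0], size=dim) / dim
                X[i] = x_p + (np.abs(X[i] - x_p) @ a_plus) * ones

        # Antipredation update, Equation (3), for randomly chosen scouts
        for i in rng.choice(n_sparrows, size=n_scouts, replace=False):
            if fit[i] > f_g:
                X[i] = best_x + rng.normal() * np.abs(X[i] - best_x)
            else:
                k = rng.uniform(-1.0, 1.0)
                X[i] = X[i] + k * np.abs(X[i] - x_worst) / ((fit[i] - f_w) + eps)

        X = np.clip(X, lower, upper)                 # keep locations in bounds
        fit = np.array([fitness(x) for x in X])
        if fit.min() < best_f:
            best_x, best_f = X[fit.argmin()].copy(), float(fit.min())
        history.append(best_f)
    return best_x, best_f, history

# Usage on a toy objective (sum of squares), chosen only for illustration
sphere = lambda x: float(np.sum(np.asarray(x) ** 2))
best_x, best_f, history = ssa_minimize(sphere, [-5.0, -5.0], [5.0, 5.0])
```

Because the best location found so far is stored separately from the population, the recorded best fitness never increases from one iteration to the next.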

2.2. NAR Dynamic Neural Network. Neural networks are divided into two categories: static neural networks and dynamic neural networks [20,21]. Static neural networks have no feedback or memory capabilities. The output of a static network depends only on the current input and has no relationship with previous inputs and outputs. Dynamic neural networks are divided into two types: feedback networks and nonfeedback networks. The output of a network without feedback depends not only on the current input but also on previous inputs. The output of a network with feedback depends not only on the current and previous inputs but also on previous outputs. Owing to their memory function, dynamic neural networks are better suited to time series prediction and offer short training time and high prediction accuracy.
The NAR neural network is a widely used dynamic neural network; its model can be expressed as
$$y(t) = f\left[y(t-1), y(t-2), \cdots, y(t-d)\right] \tag{4}$$
where $y(t)$ is the monitoring value at time $t$; $y(t-1), y(t-2), \cdots, y(t-d)$ are the monitoring values from $t-1$ to $t-d$; $d$ is the delay parameter; and $f[\cdot]$ is a nonlinear function obtained through learning and training. The NAR dynamic neural network is composed of an input layer, a hidden layer, an output layer, and the delay parameter. It has two network modes. In closed-loop mode, the output of the neural network is fed back to the input layer and learned again together with the other inputs. In open-loop mode, the expected output of the neural network is fed back to the input layer instead. To improve prediction accuracy, the commonly used open-loop mode is adopted in this study; its structure is shown in Figure 1. The $y(t)$ on the left represents the network input, $d$ is the delay parameter, $p$ is the number of hidden layer elements, $\omega$ is the weight, $b$ is the threshold, and the $y(t)$ on the right represents the network output. The delay parameter and the number of hidden layer elements must be determined, because they directly affect the training and prediction capabilities of the NAR dynamic neural network.
Wireless Communications and Mobile Computing
2.3. Detection Method of Outliers by SSA-NAR. Outliers in dam monitoring are generally caused by monitoring system failures and manual observation errors. An outlier is characterized by an isolated measurement at time $t_i$ that is significantly larger or smaller than the measurements at the previous time $t_{i-1}$ and the subsequent time $t_{i+1}$. Outliers are random and discrete in nature. Figure 2 is a schematic diagram of a typical outlier pattern. This paper proposes a method for dynamic detection of outliers in dam monitoring data based on SSA-NAR. The method uses the SSA to optimize the delay parameter and the number of hidden layer elements and introduces the optimal parameters into the NAR dynamic neural network for prediction. The distribution of the residuals between predicted and measured values is used to identify outliers, so the latest monitoring data can be inspected promptly, providing a technical basis for the project management department to check and correct the information in time.
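The NAR formulation of Section 2.2 and its two network modes can be illustrated with a short Python sketch. It is a simplified illustration, not the network used in this study: the trained nonlinear function f[·] is replaced by a linear least-squares stand-in so that the example stays self-contained, and all function names are hypothetical. Open-loop prediction always feeds measured values into the delay line, whereas closed-loop prediction feeds its own outputs back.

```python
import numpy as np

def make_lagged(y, d):
    """Build NAR inputs: the row for time t holds y(t-1), ..., y(t-d); target is y(t)."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([y[d - k:len(y) - k] for k in range(1, d + 1)])
    return X, y[d:]

def fit_linear_stand_in(X, target):
    """Least-squares stand-in for the learned nonlinear map f[.] (an assumption)."""
    A = np.column_stack([X, np.ones(len(X))])        # lagged values plus a bias term
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return lambda lags: float(np.append(lags, 1.0) @ coef)

def predict_open_loop(f, y, d):
    """One-step-ahead prediction: every input is a measured value."""
    y = np.asarray(y, dtype=float)
    return np.array([f(y[t - d:t][::-1]) for t in range(d, len(y))])

def predict_closed_loop(f, history, d, steps):
    """Multi-step prediction: each output is fed back as the newest input."""
    buf = list(np.asarray(history, dtype=float)[-d:])   # oldest to newest
    out = []
    for _ in range(steps):
        nxt = f(np.array(buf[::-1]))                     # newest value first
        out.append(nxt)
        buf = buf[1:] + [nxt]
    return np.array(out)

# Toy series used only for illustration
t = np.arange(60)
series = np.sin(0.3 * t)
d = 2
X, target = make_lagged(series[:50], d)
f = fit_linear_stand_in(X, target)
one_step = predict_open_loop(f, series[:50], d)
multi_step = predict_closed_loop(f, series[:50], d, steps=10)
```

A sine wave satisfies an exact two-term linear recurrence, so the linear stand-in reproduces it here; real monitoring series would call for the nonlinear network described in the text.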
The model flow chart is shown in Figure 3, and the specific steps are as follows:
(1) Data set acquisition: Obtain dam monitoring data through the safety monitoring system.
(2) Dam data prediction: Use SSA optimization to determine the delay parameter and the number of hidden layer elements, and establish the NAR dynamic neural network for prediction.
(3) Outlier detection: According to the definition of an outlier, when the residual between the expected value and the measured value exceeds a certain threshold, the measured value is called an outlier. Hence, there are two key problems in outlier detection: determining the expected value and determining the threshold. The expected value can be determined by the prediction of the SSA-NAR model. The "3σ criterion" is commonly used in outlier detection to determine the threshold. Therefore, the formula for outlier detection is as follows:
$$\left|y_t - \hat{y}_t\right| > 3\sigma \tag{5}$$
where $y_t$ is the measured value at time $t$, $\hat{y}_t$ is the predicted value of the SSA-NAR model at time $t$, and $\sigma$ is the standard deviation of the sample.
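The detection rule in step (3) can be sketched in a few lines of Python. This is an illustrative sketch under one explicit assumption: when no value of σ is supplied, it is estimated as the sample standard deviation of the residuals, which is only one possible reading of "the standard deviation of the sample". The function name and the toy data are hypothetical.

```python
import numpy as np

def detect_outliers_3sigma(measured, predicted, sigma=None, k=3.0):
    """Return indices t where |y_t - y_hat_t| exceeds k * sigma."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residual = measured - predicted
    if sigma is None:
        # Assumption: estimate sigma from the residuals themselves
        sigma = residual.std(ddof=1)
    return np.flatnonzero(np.abs(residual) > k * sigma)

# Toy example: a small oscillation with one isolated jump at index 20
predicted = np.zeros(50)
measured = 0.1 * np.sin(np.arange(50))
measured[20] += 5.0
flagged = detect_outliers_3sigma(measured, predicted)
```

Estimating σ from residuals that already contain outliers inflates the threshold, so in practice σ might instead be taken from a clean reference period and passed in through the `sigma` argument.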
To test the proposed method, five monitoring data (data numbers 9, 15, 71, 156, and 166) are randomly selected, and outliers are constructed by adding or subtracting a constant ε. According to the definition of an outlier, ε = 3σ is generally selected. By the principle of the SSA-NAR model, the larger the outlier, the more it affects the accuracy of the model. To show that outliers have little influence on the accuracy of the SSA-NAR model, ε = 5σ and ε = 6σ are also selected for comparative analysis. The test samples and the three groups of test samples with outliers are shown in Figures 5-8.
3.2. Parameter Optimization by SSA. Before the optimal NAR model parameters are obtained, the parameters of the SSA model must be set. They are selected on the basis of the literature [22][23][24].
(1) Fitness function $f_i$. The root mean squared error of the training data is selected as the fitness function:
$$f_i = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$
where $N$ is the number of training data, and $y_i$ and $\hat{y}_i$ are the measured value and predicted value of the training data, respectively.

(2) Population size. In the SSA algorithm, the population size is generally set between 10 and 30. A larger population does not noticeably improve the prediction but slows convergence. Considering both convergence accuracy and speed, 10 is selected as the population size.
(3) Maximum number of iterations. When the number of iterations is too large, overfitting is prone to occur. Therefore, the maximum number of iterations in this study is set to 100.
After optimization by the SSA, the optimal delay parameter is 6, and the optimal number of hidden layer elements is 10. The fitness curve is shown in Figure 9.
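To make the role of the fitness function concrete, the Python sketch below scores one candidate pair of delay parameter and hidden layer size by the training RMSE. It is a hedged illustration, not the training procedure of this study: the NAR network is replaced by a stand-in with a fixed random tanh hidden layer and least-squares output weights, the continuous search position is rounded to integers, and the names `rmse` and `nar_fitness` are hypothetical.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error between measured and predicted values."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def nar_fitness(position, y_train, seed=0):
    """Score a candidate (d, p): training RMSE of a small stand-in network."""
    d = max(1, int(round(position[0])))      # delay parameter
    p = max(1, int(round(position[1])))      # number of hidden layer elements
    y = np.asarray(y_train, dtype=float)

    # Lagged inputs y(t-1), ..., y(t-d) and targets y(t)
    X = np.column_stack([y[d - k:len(y) - k] for k in range(1, d + 1)])
    target = y[d:]

    # Stand-in for NAR training (assumption): random hidden weights,
    # output weights solved by least squares
    rng = np.random.default_rng(seed)
    W, b = rng.normal(size=(d, p)), rng.normal(size=p)
    H = np.column_stack([np.tanh(X @ W + b), np.ones(len(X))])
    beta, *_ = np.linalg.lstsq(H, target, rcond=None)
    return rmse(target, H @ beta)

# Toy training series used only for illustration
t = np.arange(120)
y_train = np.sin(0.2 * t) + 0.01 * t
score = nar_fitness([6, 10], y_train)
```

A function of this shape is what an optimizer such as the SSA would minimize over a bounded range of d and p; the rounding step is one simple way to map continuous sparrow locations to integer network parameters.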

3.3. Result Analysis. The result analysis is divided into two parts. The first part verifies the prediction accuracy of the model based on the training data without outliers, and the second part verifies the outlier detection ability of the model based on the training data with outliers.
To verify the prediction performance of the proposed model based on the training data without outliers, the BP model and the LSTM model are used for comparison. The BP neural network is a classical artificial intelligence model and is widely used in dam deformation

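The mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE) are used as the predictive performance evaluation indexes:
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - y'_i\right|, \quad RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - y'_i\right)^2}, \quad MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - y'_i}{y_i}\right| \times 100\%$$
where $n$ is the number of test data, and $y_i$ and $y'_i$ are the measured value and predicted value of the test data, respectively. The sequences of measured values and model predicted values are shown in Figure 10. The predictive performance evaluation indexes are shown in Table 1. The figure and table show that the BP model predicts well, but its accuracy is the lowest of the three models. Compared with the LSTM model, the MAE, RMSE, and MAPE values of the SSA-NAR model are reduced by 6.42%, 5.78%, and 10.05%, respectively. The residual distribution of the prediction results of the SSA-NAR model is shown in Figure 11, which indicates that the SSA-NAR model has high prediction accuracy.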
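The evaluation indexes and the relative reductions quoted above can be computed as in the following Python sketch. The standard definitions of MAE, RMSE, and MAPE are assumed; the function names and the three-point toy data are hypothetical and are not taken from Table 1.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean(np.abs(y - y_hat)))

def rmse(y, y_hat):
    """Root mean squared error."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mape(y, y_hat):
    """Mean absolute percentage error in percent (measured values must be nonzero)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean(np.abs((y - y_hat) / y)) * 100.0)

def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return (baseline - proposed) / baseline * 100.0

# Toy values used only to exercise the functions
measured = [2.0, 4.0, 5.0]
predicted = [1.0, 4.0, 7.0]
indexes = (mae(measured, predicted), rmse(measured, predicted), mape(measured, predicted))
```

MAPE divides by the measured value, so it becomes unstable when deformation readings are close to zero; MAE and RMSE do not have this limitation.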
This part verifies the outlier detection ability of the proposed model. After the SSA-NAR model is trained, the prediction results for the three groups of test sets with outliers are shown in Figures 12-14. The figures indicate that as the outlier magnitude increases (from 3σ to 6σ), the accuracy of the model does not decrease significantly, indicating that the SSA-NAR prediction model is strongly resistant to outliers. After the predicted values are obtained, the outliers in the dam deformation data can be identified according to Equation (5). The first step is to calculate the residuals between the test data and the predicted data; the second step is to apply the "3σ criterion" to the residuals to detect outliers. The residual calculation results of the three groups are shown in Figures 15-17. The detection accuracy (number of detected outliers/number of actual outliers) is shown in Table 2. A total of 5 outliers were added artificially. When the outliers in the test data were 3σ and 5σ, all of them were detected. When the outliers were 6σ, 4 were detected, and the detection accuracy was 80%.
According to the detection results, the detection accuracy for deformation outliers is 100% in the test data with 3σ and 5σ outliers; all outliers are detected by the SSA-NAR model. For the test data with 6σ outliers, four of the five outliers were detected. The reason is that a large outlier affects the prediction performance of the model. The predicted value at the second outlier is close to the outlier itself, so the difference between the predicted value and the measured value is small, and this outlier is not detected. In practical dam engineering, the outliers in dam deformation monitoring data are mostly near 3σ and rarely exceed 5σ. Even for large outliers, the detection accuracy of the SSA-NAR method is relatively high. Therefore, the proposed model
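The construction of the test sets and the detection accuracy reported in Table 2 can be sketched in Python as follows. This is an illustration under stated assumptions: the data numbers are treated as 1-based positions, the sign pattern and the value of σ are arbitrary, the function names are hypothetical, and the list of detected positions is supplied by hand to mirror the 6σ case in which the second outlier is missed.

```python
import numpy as np

def inject_outliers(y, data_numbers, eps, signs):
    """Add or subtract a constant eps at the given 1-based data numbers."""
    out = np.asarray(y, dtype=float).copy()
    idx = np.asarray(data_numbers) - 1               # assumption: numbering starts at 1
    out[idx] += np.asarray(signs, dtype=float) * eps
    return out

def detection_accuracy(detected, actual):
    """Number of detected outliers divided by number of actual outliers."""
    actual = set(actual)
    return len(actual & set(detected)) / len(actual)

actual = [9, 15, 71, 156, 166]                       # data numbers from the case study
clean = np.zeros(200)                                # placeholder series, not real data
sigma = 1.0                                          # illustrative value only
with_outliers = inject_outliers(clean, actual, eps=3 * sigma,
                                signs=[1, -1, 1, 1, -1])

# Four of five outliers found, as in the 6-sigma test group
accuracy = detection_accuracy([9, 71, 156, 166], actual)
```

Counting only detections that coincide with injected positions keeps false alarms from inflating the accuracy; a fuller evaluation would report false alarms separately.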

Conclusions
Outliers in monitoring data may greatly affect the results of dam safety monitoring. To improve the accuracy of outlier detection, a new technique that combines the SSA optimization algorithm with the NAR dynamic neural network is applied to the outlier diagnosis of dam monitoring data. Combining the NAR model with the SSA algorithm solves the problem that the prediction accuracy of the NAR model is greatly affected by parameter selection. Based on the definition of an outlier and the prediction model, the outlier detection method is constructed, and the following conclusions are drawn from a dam engineering example:
(1) At present, most dam deformation prediction methods rely on the input-output relationship between effect quantity and influence quantity. When the influence quantity data are abnormal or missing, dam deformation cannot be predicted. The proposed method does not depend on the relationship between effect quantity and influence quantity; the effect quantity is predicted by deeply mining the internal relationship of its own time series. Comparison with the BP and LSTM methods verifies that the SSA-NAR prediction model has high accuracy.
(2) SSA is introduced to optimize the parameters of the NAR neural network, which reduces the influence of manual, arbitrary parameter selection. When the monitoring data contain outliers, the model can still predict the data effectively without being significantly affected by them.
(3) When the outlier does not exceed 5σ, the model can effectively identify the outliers in the monitoring data, and the accuracy is 100%. When the outlier is large, the prediction performance of the model may be disturbed: the model may mistake the outlier for the real value, causing the predicted value to deviate from the real value. Therefore, further study is needed to reduce the interference of large outliers with the model.
(4) The proposed method can only identify the location of outliers, not their cause. Outlier identification methods tailored to the causes and characteristics of outliers need to be built, and analysis methods need to be established for abnormal data other than outliers.

Data Availability
The data used to support the findings of this study are included within the article.