Water Quality Prediction Based on SSA-MIC-SMBO-ESN

Water pollution threatens the safety of human production and life. To respond quickly to water pollution, it is important for water management staff to predict water quality changes in advance. Drawing on the temporality of water quality data, the leaky integrator echo state network (ESN) was introduced to construct water quality prediction models for dissolved oxygen (DO), permanganate index (CODMn), and total phosphorus (TP), respectively. First, the missing values were filled by the linear trend method of adjacent points, and the outliers were detected and corrected by the Z-score method and the linear trend method. Second, singular spectrum analysis (SSA) was performed to denoise the original monitoring data, so that the predicted data do not lag behind the real data and the model accuracy is not degraded by hidden noise in the data. Third, the correlation between water quality indices was measured by the maximum information coefficient (MIC), and the strongly correlated indices were imported to the prediction model. Finally, according to these strongly correlated indicators, water quality prediction models based on multiple features were constructed, respectively, using the offline and online learning algorithms of the ESN. The hyperparameters of the models were optimized through sequential model-based optimization (SMBO). Experimental results show that the proposed water quality prediction models, namely, SSA-MIC-SMBO-Offline ESN and SSA-MIC-SMBO-Online ESN, predicted DO, CODMn, and TP accurately, providing suitable tools for practical applications.


Introduction
Water is an important resource for human survival and social development. Unfortunately, water shortage has become an increasingly severe global problem. One of the major causes of water shortage is water pollution. With limited per-capita water resources, China is deeply troubled by water pollution. According to the 2020 Report on the State of the Ecology and Environment in China (https://www.mee.gov.cn/hjzl/sthjzk/zghjzkgb), 16.6% of the monitored surface water areas in China belong to Classes IV and V in terms of water quality. As the country stepped up the protection of water resources, the focus of water pollution control has shifted from post-treatment to prevention. To effectively reduce water pollution, it is important to predict the future trend of water quality accurately. The early warning would promote the scientific management of water resources, maintain the sustainability of the ecosystem, and protect human health.
In recent years, many water quality prediction models have been developed based on a dazzling array of technologies, namely, multiple linear regression (MLR) [1], regression tree and support vector regression (SVR) [2], nonlinear least squares (NLS) neural network [3,4], radial basis function (RBF) neural network [5-7], long short-term memory (LSTM) neural network [8-11], and graph neural network (GNN) [12,13]. These models greatly advance the prediction of water quality. However, the model performance is affected by various factors in the complex and dynamic system of the water environment, including data quality, input features, prediction algorithm, and selected parameters. Therefore, the existing models should be further improved to forecast water quality more accurately.
This paper constructs the prediction models of water quality indices through comprehensive use of multiple techniques: singular spectrum analysis (SSA), maximum information coefficient (MIC), offline and online learning algorithms of the leaky integrator echo state network (ESN) [14], and sequential model-based optimization (SMBO). First, the original data were smoothed and denoised through the SSA. Next, the correlation between water quality indices was measured by the MIC. Finally, the data on relevant indices were collected from river monitoring stations and imported to the ESN, the hyperparameters of the network were optimized through the SMBO, and the prediction models were established for dissolved oxygen (DO), permanganate index (CODMn), and total phosphorus (TP), respectively. Experimental results show that the proposed water quality prediction models, namely, SSA-MIC-SMBO-Offline ESN and SSA-MIC-SMBO-Online ESN, forecasted each water quality index accurately and practically.

Leaky Integrator ESN
The ESN, a novel recurrent neural network (RNN) [14], centers on a large, randomly generated, and sparsely connected reservoir. The only component that needs to be trained in the network is the output connection weight. Therefore, the ESN boasts the advantages of simple structure and fast training. During training, the network is not prone to falling into the local optimum trap, a common defect of traditional RNNs. In this way, the network weights are always optimized globally.

Network Structure. As shown in Figure 1, the leaky integrator ESN consists of an input layer; a hidden layer, i.e., the reservoir; and an output layer. There are K nodes in the input layer, L nodes in the output layer, and N nodes in the reservoir. At time t, the input, the reservoir state, and the output are denoted as u(t), x(t), and y(t), respectively. The N × K connection weight matrix from the input layer to the reservoir can be expressed as W_in. The N × N connection weight matrix of the reservoir from the current moment to the next moment can be expressed as W. The N × L weight matrix of the feedback connection from the output layer to the reservoir can be expressed as W_fb, and this connection is optional. The L × (N + K) connection weight matrix from the reservoir to the output layer can be expressed as W_out. The reservoir state x(t) can be updated based on the current input u(t) and the reservoir state x(t − 1) at the previous time, using leaky integrator neurons:

x(t) = (1 − α)x(t − 1) + αf(W_in u(t) + Wx(t − 1) + W_fb y(t − 1)), (1)

where W_in, W, and W_fb are randomly initialized and fixed before network training; f is the activation function; α is the leakage rate of the neuron; and y(t − 1) is the output at the previous time. In this study, the activation function (tanh) and output state equation of the network can be, respectively, expressed as:

f(x) = tanh(x) = (e^x − e^(−x))/(e^x + e^(−x)), (2)

y(t) = W_out [x(t); u(t)], (3)

where [x(t); u(t)] denotes the concatenation of the reservoir state and the input.
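As a concrete illustration, the leaky-integrator state update in formula (1) can be sketched in a few lines of NumPy (the function name `update_state` and the toy dimensions are ours, not from the paper):

```python
import numpy as np

def update_state(x_prev, u, y_prev, W_in, W, W_fb, alpha):
    """One leaky-integrator ESN state update, following formula (1)."""
    pre = W_in @ u + W @ x_prev + W_fb @ y_prev  # pre-activation drive
    return (1.0 - alpha) * x_prev + alpha * np.tanh(pre)
```

With alpha = 1 the neuron reduces to an ordinary tanh unit; smaller alpha makes the state change more slowly, which suits slowly varying daily water quality series.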

Learning Algorithms.
In the ESN, the input connection weight matrix W_in and reservoir connection weight matrix W remain unchanged after initialization. Thus, the output connection weight matrix W_out is the only model parameter that needs to be trained. In essence, the learning process of the ESN is the process of solving W_out. The ESN has two kinds of learning algorithms.

Offline Learning.
In the training process of offline learning, to prevent overfitting, W_out is solved through ridge regression with a regularization coefficient [15,16]. This approach adds a regularization term to the objective function. The output weight matrix W_out can be computed by:

W_out = YX^T(XX^T + λI_N)^(−1), (4)

where X is the state matrix of the reservoir; λ is the ridge parameter; I_N is an N-dimensional identity matrix; and Y is the target output matrix.
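The ridge-regression solution can be sketched directly from the closed form above (a minimal NumPy illustration; `ridge_wout` is our name, and in practice X would stack the collected reservoir states column by column):

```python
import numpy as np

def ridge_wout(X, Y, lam):
    """Solve W_out = Y X^T (X X^T + lam I)^(-1), as in formula (4).
    X: (N, T) state matrix, Y: (L, T) target outputs, lam: ridge parameter."""
    N = X.shape[0]
    return Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(N))
```

The ridge term lam * I keeps the inversion well conditioned even when reservoir states are strongly correlated, which is the usual situation in a large sparsely connected reservoir.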

Online Learning.
Online learning updates the current model with continuously generated new data to make better predictions of future data. The common online learning algorithm is recursive least squares (RLS) [17], which updates W_out at each time step to minimize the prediction error. By the RLS, the output weight matrix W_out can be updated by:

P(t) = P(t − 1) − (P(t − 1)x(t)x^T(t)P(t − 1))/(1 + x^T(t)P(t − 1)x(t)), (5)

W_out(t) = W_out(t − 1) + (d(t) − y(t))(P(t)x(t))^T, (6)

where x(t) is the reservoir state; x^T(t) is the transpose of x(t); d(t) is the expected output; y(t) is the actual output; and P(t) is the error covariance matrix at time t.
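A minimal RLS step consistent with the description above can be sketched as follows (our naming; the forgetting factor `lam` is an assumption commonly used in RLS formulations and defaults to 1, i.e., no forgetting):

```python
import numpy as np

def rls_step(W_out, P, x, d, lam=1.0):
    """One RLS update of the output weights (formulas (5)-(6), sketched).
    W_out: (L, N) weights, P: (N, N) error covariance,
    x: (N,) reservoir state, d: (L,) expected output."""
    y = W_out @ x                         # actual output before the update
    k = P @ x / (lam + x @ P @ x)         # gain vector
    P = (P - np.outer(k, x @ P)) / lam    # covariance update
    W_out = W_out + np.outer(d - y, k)    # correct weights toward d
    return W_out, P
```

P is typically initialized to a large multiple of the identity so that early samples dominate; each step then costs O(N^2), which is what makes online learning cheap enough to run at every monitoring step.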
In this paper, the ESN using offline learning and that using online learning are denoted as Offline ESN and Online ESN, respectively.

Network Training.
According to the principle of the ESN and the input data u, the training steps of Offline ESN and Online ESN can be summarized as follows:

Training Steps of Offline ESN
Step 1: Initialize the reservoir. Configure the relevant parameters; initialize W in , W, W fb , and W out ; and set the reservoir state to the zero vector.
Step 2: Update and collect the internal state of the reservoir. Drive the reservoir with the input data u and update the internal state of the reservoir continuously by formula (1). Note that, to overcome the influence of the initial transient, the state of the previous period is generally abandoned, and the internal state is collected from time T.
Step 3: Calculate W out of the ESN by formula (4).
Step 4: Calculate the output y of the network by formula (3).

Training Steps of Online ESN
Step 1: Initialize the reservoir. Configure the relevant parameters, e.g., reservoir size; initialize W_in, W, W_fb, and W_out; and set the initial reservoir state to the zero vector.
Step 2: Input the sample data u into the network, calculate the internal state x(t) of the ESN by formula (1), update W_out by formulas (5) and (6), and solve the corresponding output y(t) by formula (3).
Step 3: Repeat Step 2 until the input of u is completed.
Each connection weight matrix of the ESN can be initialized as follows: each element of the input connection weight matrix W_in is initialized to a random number uniformly distributed in [−0.5, 0.5] and adjusted by the input scaling factor η. The reservoir connection weight matrix W is initialized to random numbers uniformly distributed in [−0.5, 0.5], sparsified according to the sparsity s, and rescaled to the target spectral radius ρ. The output connection weight matrix W_out is initialized as a zero matrix.
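The initialization procedure above can be sketched as follows (a NumPy illustration; the function name and the default values for η, s, and ρ are ours, chosen only for demonstration):

```python
import numpy as np

def init_weights(N, K, L, eta=0.5, s=0.1, rho=0.9, rng=None):
    """Initialize ESN weight matrices as described in the text.
    eta: input scaling, s: sparsity (fraction of nonzero reservoir
    entries), rho: target spectral radius."""
    rng = np.random.default_rng(rng)
    W_in = eta * rng.uniform(-0.5, 0.5, (N, K))
    W = rng.uniform(-0.5, 0.5, (N, N))
    W[rng.random((N, N)) > s] = 0.0            # enforce sparsity
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    if radius > 0:
        W *= rho / radius                       # rescale spectral radius
    W_fb = rng.uniform(-0.5, 0.5, (N, L))
    W_out = np.zeros((L, N + K))                # output weights start at zero
    return W_in, W, W_fb, W_out
```

Rescaling W so that its spectral radius is below 1 is what gives the reservoir the echo state property, i.e., the influence of old inputs fades over time.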
Considering the principle of the ESN and the need to initialize the connection weight matrices, several parameters need to be set in advance: reservoir size N, spectral radius ρ, leakage rate α, ridge parameter λ, input scaling factor η, and sparsity s. By optimizing the values of these ESN parameters, the accuracy of the prediction model can be significantly improved.

Data Preprocessing
The original data were collected from the records of the automatic monitoring station of Dongzhen Reservoir in the middle reaches of the Mulan river. This reservoir is the drinking water source of over 1.5 million people in Chengxiang District, Licheng District, Xiuyu District, Bei'an Development Zone, and Meizhou Island. Therefore, it is of great significance to precisely predict the water quality at this station.
The water quality at the station is monitored daily. A total of 1,095 records from 2018 to 2020 were collected, covering water temperature (WT), pH, DO, conductivity (Con), turbidity (Tur), ammonia nitrogen (NH3-N), CODMn, TP, total nitrogen (TN), and chlorophyll (aChl). The blue-green algae metrics were discarded because of their many missing values.
According to the 2020 Report on the State of the Ecology and Environment in China, DO, TP, and CODMn are the main pollution-monitoring indices of surface water in China. Thus, these three water quality indices were selected as prediction objects. Based on the monitoring data of the station, several prediction models were established for the selected indices. The models can apply to other monitoring stations in the Mulan river basin, facilitating the water quality prediction of the entire basin. The original data contain missing values, outliers, and noise, which may affect the prediction accuracy. To eliminate this effect, some preprocessing operations were implemented on the original data. Specifically, the missing values were filled by the linear trend method of adjacent points, and the outliers were detected and corrected by the Z-score method and the linear trend method.
In addition, the SSA [18] was employed to extract signals representing different components of the time series and thus reduce the noise in the time series [19,20]. The noise reduction solves two common problems: the predicted data lag behind the real data when the time series has frequent random fluctuations, and the prediction accuracy is suppressed by the implicit noise in the data.
During the SSA, a quarter, i.e., 90 days, was taken as the window length. On this basis, a principal component analysis (PCA) was conducted on the decomposed time series data of water quality. The series of the components whose cumulative explained variance ratio (EVR) reaches 90% was taken as the real data, and the remaining components were discarded as noise.
For time series reconstruction, the first 35 components were selected for Con; the first 36 components were selected for NH 3 -N and aChl; the first 37 components were selected for DO, Tur, and pH; the first 38 components were selected for TP and CODMn; and the first 39 components were selected for WT and TN. Figures 2-4 display the time series of DO, CODMn, and TP before and after reconstruction.
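A minimal SSA denoising pass of the kind described above can be sketched as follows (our implementation outline: Hankel embedding with the chosen window length, SVD, truncation at the cumulative explained-variance threshold, and diagonal averaging back to a series):

```python
import numpy as np

def ssa_denoise(series, window=90, evr=0.90):
    """SSA denoising sketch: embed, decompose, keep leading components
    up to the cumulative EVR threshold, reconstruct by diagonal averaging."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    k = n - window + 1
    # trajectory (Hankel) matrix, shape (window, k)
    X = np.column_stack([x[i:i + window] for i in range(k)])
    U, sv, Vt = np.linalg.svd(X, full_matrices=False)
    ratio = np.cumsum(sv ** 2) / np.sum(sv ** 2)
    r = np.searchsorted(ratio, evr) + 1       # components kept
    Xr = (U[:, :r] * sv[:r]) @ Vt[:r]         # rank-r approximation
    # diagonal averaging (Hankelization) back to a 1-D series
    out = np.zeros(n)
    cnt = np.zeros(n)
    for j in range(k):
        out[j:j + window] += Xr[:, j]
        cnt[j:j + window] += 1
    return out / cnt
```

For a clean periodic signal the trajectory matrix has very low rank, so only a handful of components are kept; the 35-39 components reported above reflect the richer structure of the real monitoring series.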

Computational Intelligence and Neuroscience
Compared with the original time series, the reconstructed time series contain no extreme values and show obvious cyclic and trend patterns. Therefore, the reconstructed data were adopted as the real data for the water quality indices.
The denoised water quality data were further normalized through max-min normalization, which maps sample values to the interval [0, 1]:

X′ = (X − X_min)/(X_max − X_min),

where X_min and X_max are the minimum and maximum values in the sample, respectively, and X is the original value of the sample.
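The max-min normalization can be sketched as:

```python
import numpy as np

def minmax(x):
    """Map sample values onto [0, 1] by max-min normalization."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())
```

In practice the minimum and maximum are taken from the training set only and reused on the test set, so that test information does not leak into the scaling.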

MIC-Based Correlation Analysis of Water Quality Indices
The water environment is a complex dynamic system affected by many factors. The system faces both material changes and energy exchanges. Besides, there may be correlations between water quality indices, so the indices related to the prediction index could be used together as input features. Drawing on mutual information theory, this section analyzes the correlation between water quality indices. The mutual information of a random variable pair (X, Y) was adopted to describe the amount of information about variable Y contained in variable X:

I(X; Y) = Σ_x Σ_y p(x, y) log (p(x, y)/(p(x)p(y))),

where p(x, y) is the joint distribution of X and Y; p(x) and p(y) are the marginal distributions of X and Y, respectively. The MIC [21] measures the correlation between variables. The coefficient falls between 0 and 1; the greater the value, the stronger the correlation [22]. The MIC can be calculated by:

MIC(X; Y) = max_{ij<B} I(X; Y)/log min{i, j},

where ij < B is the constraint on the total number of grid cells; B is set to the θ-th power of the size of dataset D. Relevant studies show that the optimal θ is 0.6 when the dataset size is between 1,000 and 2,500 [23]. Unlike mutual information, the MIC can handle continuous variables at high accuracy. With sufficient samples, the MIC can detect a wide range of linear and nonlinear relationships between variables, making it possible to mirror the degree of correlation between water quality indices [24]. The specific value of the MIC can be determined as follows:

Step 1: For the two given random variables (X, Y), rearrange the data elements of each variable in a certain order, producing an ordered pair set D(X, Y).
Step 2: Mesh the scatterplot of the ordered pair set D(X, Y) into i columns and j rows (the values of i and j are given) and then calculate I(X; Y).
Step 3: Normalize the obtained mutual information.
Step 4: Select different combinations of i and j to divide the random variables, and then find out the maximum mutual information (i.e., MIC) for each division method.
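The grid search described in the steps above can be sketched as follows. Note this is a simplified illustration: it uses equal-width bins for every i-by-j grid, whereas the full MIC estimator also optimizes the grid boundaries; the function names are ours:

```python
import numpy as np

def grid_mi(x, y, i, j):
    """Mutual information (in bits) of the i-by-j equal-width grid."""
    counts, _, _ = np.histogram2d(x, y, bins=(i, j))
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal of X
    py = p.sum(axis=0, keepdims=True)   # marginal of Y
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz]))

def mic(x, y, theta=0.6):
    """Simplified MIC: maximize normalized grid MI over all i-by-j
    partitions subject to i*j < B = n**theta."""
    B = len(x) ** theta
    best = 0.0
    for i in range(2, int(B // 2) + 1):
        for j in range(2, int(B // i) + 1):
            if i * j >= B:
                break
            best = max(best, grid_mi(x, y, i, j) / np.log2(min(i, j)))
    return best
```

Dividing by log min{i, j} bounds each grid's score by 1, so a perfect functional relationship approaches 1 while independent variables stay near 0.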
Here, the MIC analysis algorithm is adopted to analyze the correlation between the three target indices of Dongzhen Reservoir (DO, CODMn, and TP) and the other water quality indices (Table 1). The results in Table 2 show that three other indices are strongly correlated with the three target indices. Thus, these three indices were taken as the common input features of the target indices.

SSA-MIC-SMBO-ESN Water Quality Prediction Model

The transmission and diffusion of pollutants upstream affect the concentration of pollutants downstream. Considering this effect, the data on water quality indices were collected from four inflow river-monitoring stations (Changtaixi Tukeng Reservoir, Dongtaixi Dongtai Village, Dulixi Xiashan Village, and Juxi Guoxi Village) and added to the ESN as input features. For each target index, a total of eight features were imported to the ESN.
To ensure the timeliness of the forecast, a 3-day forecast model was constructed for each index. The inputs are the feature values from time t − w + 1 to t, and the outputs (predicted values) correspond to times t + 1, t + 2, and t + 3. Then, the ESN has K = 8w input nodes and L = 3 output nodes. Note that w is a dynamically optimized parameter.
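The windowing scheme described above can be sketched as follows (our helper; it assumes the data arrive as a (T, 8) array with the target index in the first column, which is an illustrative convention rather than the paper's layout):

```python
import numpy as np

def make_samples(data, w, horizon=3):
    """Build training pairs for the 3-day forecast: each input stacks
    days t-w+1..t of the eight features (K = 8w values); each target
    holds the index value at t+1..t+3 (L = 3)."""
    T = data.shape[0]
    U, Y = [], []
    for t in range(w - 1, T - horizon):
        U.append(data[t - w + 1:t + 1].ravel())   # K = 8w inputs
        Y.append(data[t + 1:t + 1 + horizon, 0])  # 3-day-ahead targets
    return np.array(U), np.array(Y)
```

Because w is itself a tuned hyperparameter, the input dimension K = 8w changes with every candidate configuration, which is why the window length is optimized jointly with the reservoir parameters.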
To fully mine the information of each time series, the monitoring data of the first 1,045 days were organized into the training set, and those of the last 50 days into the test set. In addition, the first 45 states were discarded to eliminate the influence of the initial state.

SMBO-Based Hyperparameter Optimization.

According to the network training process, the following hyperparameters of the ESN need to be optimized: reservoir size N, spectral radius ρ, leakage rate α, ridge parameter λ, input scaling factor η, sparsity s, and window length w. These hyperparameters were optimized to build our water quality prediction model. The optimization of hyperparameters essentially aims to minimize the loss function over the configuration space. Supposing the optimization problem pursues minimization, hyperparameter optimization can be expressed as:

x* = argmin_{x∈X} f(x),

where f(x) is the objective function to be minimized; x is the set of hyperparameters (N, ρ, α, λ, η, s, and w); X is the hyperparameter domain; and x* is the hyperparameter set that minimizes the objective.
x can take any value in the domain X. The evaluation of the objective function is generally costly, and manual or grid-search parameter tuning is time-consuming. By contrast, Bayesian optimization greatly improves the search efficiency, as it approximates the objective function with a low-cost adaptive surrogate model. Bayesian optimization aims to model the objective function according to the existing N groups of experimental results H = {(x_n, y_n)} (n ∈ [1, N], where y_n is the observed value of f(x_n)) and to calculate the surrogate model p(y|x, H). The surrogate model uses the tree-structured Parzen estimator (TPE). The value of p(y|x) can be obtained by Bayes' rule:

p(y|x) = p(x|y)p(y)/p(x).

Sufficient sampling is critical to make the surrogate model approach the objective function. To reduce the cost of sample generation, it is favorable to control the sample size being used. Hence, the expected improvement (EI) was adopted to evaluate the profit brought by a sample to the surrogate model; the larger the profit, the closer the updated surrogate model will be to the objective function:

EI_{y*}(x) = ∫_{−∞}^{y*} (y* − y) p(y|x) dy,

where y* = min{y_n, 1 ≤ n ≤ N} is the optimal value among the current samples. The EI is the expected amount by which the value y of a sample x under the current surrogate model p(y|x, H) improves on the best result y*. The SMBO is a specific algorithm to implement Bayesian optimization [25]; the procedure is summarized in Algorithm 1. For hyperparameter optimization by the SMBO, the search space and step size of each parameter were configured as shown in Table 3, and the number of iterations LT was set to 5,000.
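For intuition, the EI integral has a closed form when the surrogate p(y|x) is Gaussian. Note this Gaussian form is an illustrative assumption on our part: the paper's TPE surrogate is a Parzen (kernel-density) model, not a Gaussian, but the closed form shows what the criterion rewards:

```python
import math

def expected_improvement(mu, sigma, y_best):
    """Closed-form EI for minimization under p(y|x) = N(mu, sigma^2):
    EI = (y* - mu) * Phi(z) + sigma * phi(z), with z = (y* - mu) / sigma."""
    if sigma <= 0:
        return max(y_best - mu, 0.0)
    z = (y_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))     # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)  # standard normal PDF
    return (y_best - mu) * Phi + sigma * phi
```

The two terms show the exploitation/exploration trade-off: a low predicted mean mu raises the first term, while a large predictive uncertainty sigma raises the second, so SMBO samples both promising and poorly explored configurations.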

Performance Metrics.
The predictive power of our water quality prediction models was evaluated by root mean square error (RMSE), mean absolute percentage error (MAPE), and Nash-Sutcliffe efficiency (NSE) [26,27]. Among them, the NSE specifically verifies the simulation results of hydrological models. The closer the NSE is to 1, the stronger the prediction ability of the model. The RMSE, MAPE, and NSE can be, respectively, calculated by:

RMSE = sqrt((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2),

MAPE = (100%/n) Σ_{i=1}^{n} |(y_i − ŷ_i)/y_i|,

NSE = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)^2 / Σ_{i=1}^{n} (y_i − ȳ)^2,

where y_i is the target output; ŷ_i is the network output; and ȳ is the mean of the target output.
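The three metrics can be sketched directly from their definitions (our helper names; MAPE is reported in percent):

```python
import numpy as np

def rmse(y, yhat):
    """Root mean square error."""
    return np.sqrt(np.mean((np.asarray(y, float) - np.asarray(yhat, float)) ** 2))

def mape(y, yhat):
    """Mean absolute percentage error, in percent (targets must be nonzero)."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return 100.0 * np.mean(np.abs((y - yhat) / y))

def nse(y, yhat):
    """Nash-Sutcliffe efficiency: 1 for a perfect model, 0 for
    a model no better than predicting the mean."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
```

A model that simply outputs the mean of the observations scores NSE = 0, which makes NSE a stricter benchmark than RMSE alone for slowly varying series.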

Construction of the SSA-MIC-SMBO-Offline ESN Model.
Offline learning, also known as batch learning, inputs all sample data at once to train the model. During training, the data on the eight input features for each target index were imported to the Offline ESN, and the hyperparameters of the Offline ESN were optimized by the SMBO. Table 5 shows the index values obtained by verifying each prediction model on the test set. Figures 5-7 compare the predicted value with the actual value of each prediction model. As shown in Table 5 and Figures 5-7, the SSA-MIC-SMBO-Offline ESN prediction models of DO, CODMn, and TP generally performed well. In particular, the predicted and actual values basically overlapped in the first 2 days. Thus, the SSA-MIC-SMBO-Offline ESN 3-day prediction model constructed for each water quality index can meet the requirements of practical application. However, the predicted values of the three indices on the second and third days gradually deviated from the actual values, with the deterioration of each evaluation metric. This means the prediction effect weakens over time. Specifically, the metrics of all prediction models were good on the first day, where the NSE values were all above 0.98, indicating that the prediction models have a high accuracy for the next day. The NSE values were all around 0.9 on the second day (only the NSE of CODMn was slightly lower), suggesting that the accuracy on the second day also remains high. The NSE values of each prediction model reached about 0.8 on the third day, and the values of the other evaluation metrics were also within the acceptable range. Hence, the prediction effect on the third day can also meet the application requirements.

Construction of the SSA-MIC-SMBO-Online ESN Model.
The data on the eight input features for each target index were imported to the Online ESN. The hyperparameters of the Online ESN were optimized by the SMBO. Table 6 displays the optimization results. On this basis, the authors established the SSA-MIC-SMBO-Online ESN prediction model for each water quality index. Table 7 shows the index values obtained by verifying each prediction model on the test set.

On the first day, the NSEs were above 0.985, indicating that the prediction models have a very high prediction accuracy for the next day.
On the second day, the RMSEs of the prediction indices were all around 0.04, and the NSE values were around 0.9 (only the NSE value of CODMn was slightly lower), indicating that the accuracy also reaches a high level on the second day. On the third day, the RMSEs of the prediction models reached about 0.06, the NSEs stood at about 0.8, and the MAPEs were within the acceptable range. Overall, the prediction effect on the third day is significantly lower than that on the previous two days, but it can still meet the application requirements. It can also be seen that, on the first day, the prediction curves of the three water quality indices largely overlapped the actual curves. On the second and third days, the prediction curves gradually deviated from the real curves and exhibited some oscillations; however, the two sets of curves followed consistent trends. In general, the SSA-MIC-SMBO-Online ESN 3-day prediction models for DO, CODMn, and TP achieved desirable performance, indicating that the model is feasible for predicting each water quality index. Judging by the performance of the offline and online prediction models of the three indices, the prediction performance is closely related to the data features of the target indices. The prediction performance is good when the index data are smooth and do not change frequently, but the accuracy tends to be low when the index data (e.g., CODMn) fluctuate greatly and frequently, making it difficult for any prediction model to capture the laws and features of the data. This is also the limitation of this prediction method. Therefore, the performance of the prediction models can be further enhanced by inputting more relevant features and expanding the training set.

Conclusions
Based on the important water quality indices related to water pollution, this paper establishes ESN-based offline and online water quality prediction models using singular spectrum analysis, the maximum information coefficient, the echo state network, and sequential model-based optimization. These models achieve good prediction results, and the above experimental results show that the SSA-MIC-SMBO-Online ESN prediction models are more accurate than the SSA-MIC-SMBO-Offline ESN prediction models and more suitable for projecting the dynamic changes in water quality data.

The water environment is a complex system affected by many factors. The change of each water quality index is related to various factors rather than the water environment alone. For instance, the DO in the water body depends on climate factors such as water temperature, air pressure, and light. Therefore, the prediction of water quality indices must consider multiple influencing factors, as well as the correlation between water quality indices and meteorological factors/pollution sources. In the future, our research team will improve the proposed prediction models by supplementing various relevant data and will further optimize the ESN algorithm from multiple perspectives: improving the reservoir generation approach, improving the initialization of the reservoir connection weights, optimizing these weights, streamlining the reservoir topology, rationalizing neuron selection, and so on. In addition, LSTM and its improved algorithms will be considered for future water quality prediction work.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.